Ingredients in the making of a Data Scientist

How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, the Executive Architect at IBM Watson, offers a roadmap. Chandrasekaran's suggested curriculum is compelling, and his metro-map analogy is a useful one. He presents ten metro lines:

Fundamentals
Statistics
Programming
Machine Learning
Text Mining / Natural Language Processing
Data Visualization
Big Data
Data Ingestion
Data Munging
Toolbox

Fundamentals

Statistics

Programming

Machine Learning

Text Mining / Natural Language Processing

Corpus
Named Entity Recognition
Text Analysis
UIMA
Term Document Matrix
Term Frequency & Weight
Support Vector Machines
Association Rules
Market Based Analysis (Market Basket Analysis?)
Feature Extraction
Using Mahout
Using Weka
Using Natural Language Toolkit (NLTK)
Classify Text (Document Classification?)
Vocabulary Mapping

Data Visualization

Data Exploration in R (Hist, Boxplot, etc.)
Uni, Bi & Multivariate Viz
ggplot2
Histogram & Pie (Uni)
Tree & Tree Map
Scatter Plot (Bi)
Line Charts (Bi)
Spatial Charts
Survey Plot
Timeline
Decision Tree
D3.js
InfoVis
IBM ManyEyes
Tableau

Big Data

Map Reduce Framework
Hadoop Components
HDFS
Data Replication Principles
Set Up Hadoop (IBM / Cloudera / Hortonworks)
Name & Data Nodes
Job & Task Tracker
M/R Programming
Sqoop: Loading Data in HDFS
Flume, Scribe: For Unstructured Data
SQL with Pig
DWH with Hive
Scribe, Chukwa for Weblogs
Using Mahout
Zookeeper, Avro
Storm: Hadoop Realtime
RHadoop, RHIPE
rmr
Cassandra
MongoDB, Neo4j

Data Ingestion

Summary of Data Formats
Data Discovery
Data Sources & Acquisition
Data Integration
Data Fusion
Transformation, Enrichment
Data Survey
Google OpenRefine
How much Data?
Using ETL

Data Munging

Toolbox

The only thing we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.