How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, the Executive Architect at IBM Watson, offers a roadmap. Chandrasekaran's suggested curriculum is compelling, and his metro map analogy is a useful one. He presents us with ten metro lines:
- Fundamentals
- Statistics
- Programming
- Machine Learning
- Text Mining / Natural Language Processing
- Data Visualization
- Big Data
- Data Ingestion
- Data Munging
- Toolbox
If you have trouble reading the map, here is the full list in text form.
Fundamentals
- Matrices & Linear Algebra Fundamentals
- Hash Functions, Binary Tree, O(n)
- Relational Algebra, DB Basics
- Inner, Outer, Cross, Theta Join
- CAP Theorem
- Tabular Data
- Entropy
- Data Frames & Series
- Sharding
- OLAP
- Multidimensional Data Model
- Extract/Transform/Load (ETL)
- Reporting vs BI vs Analytics
- JSON & XML
- NoSQL
- Regex
- Vendor Landscape
- Env Setup
Statistics
- Pick a Dataset (UCI Repo)
- Descriptive Statistics (mean, median, range, SD, Var)
- Exploratory Data Analysis
- Histograms
- Percentiles & Outliers
- Probability Theory
- Bayes' Theorem
- Random Variables
- Cumulative Distribution Function (CDF)
- Continuous Distributions (Normal, Poisson, Gaussian)
- Skewness
- Analysis of Variance (ANOVA)
- Probability Density Function (PDF)
- Central Limit Theorem
- Monte Carlo Method
- Hypothesis Testing
- p-Value
- Chi-square Test
- Estimation
- Confidence Interval (CI)
- Maximum Likelihood Estimation (MLE)
- Kernel Density Estimate
- Regression
- Covariance
- Correlation
- Pearson Coeff
- Causation
- Least Squares Fit
- Euclidean Distance
Programming
- Python Basics
- Working in Excel
- R Setup, R Studio
- R Basics
- Expressions
- Variables
- IBM SPSS, Rapid Miner
- Vectors
- Matrices
- Arrays
- Factors
- Lists
- Data Frames
- Reading CSV Data
- Reading RAW Data
- Subsetting Data
- Manipulate Data Frames
- Functions
- Factor Analysis
- Install Pkgs
Machine Learning
- What is ML?
- Numerical Var
- Categorical Variable
- Supervised Learning
- Unsupervised Learning
- Concepts, Inputs & Attributes
- Training & Test Data
- Classifier
- Prediction
- Lift
- Overfitting
- Bias & Variance
- Trees & Classification
- Classification, Classification Rate
- Decision Trees
- Boosting
- Naïve Bayes Classifiers
- K-Nearest Neighbor
- Logistic Regression
- Regression, Ranking
- Linear Regression
- Perceptron
- Clustering, Hierarchical Clustering
- K-means Clustering
- Neural Networks
- Sentiment Analysis
- Collaborative Filtering
- Tagging
Text Mining / Natural Language Processing
- Corpus
- Named Entity Recognition
- Text Analysis
- UIMA
- Term Document Matrix
- Term Frequency & Weight
- Support Vector Machines
- Association Rules
- Market Based Analysis (Market Basket Analysis?)
- Feature Extraction
- Using Mahout
- Using Weka
- Using Natural Language Toolkit (NLTK)
- Classify Text (Document Classification?)
- Vocabulary Mapping
Data Visualization
- Data Exploration in R (Hist, Boxplot, etc.)
- Uni, Bi & Multivariate Viz
- ggplot2
- Histogram & Pie (Uni)
- Tree & Tree Map
- Scatter Plot (Bi)
- Line Charts (Bi)
- Spatial Charts
- Survey Plot
- Timeline
- Decision Tree
- D3.js
- InfoVis
- IBM ManyEyes
- Tableau
Big Data
- Map Reduce Framework
- Hadoop Components
- HDFS
- Data Replication Principles
- Setup Hadoop (IBM / Cloudera / Hortonworks)
- Name & Data Nodes
- Job & Task Tracker
- M/R Programming
- Sqoop: Loading Data in HDFS
- Flume, Scribe: For Unstructured Data
- SQL with Pig
- DWH with Hive
- Scribe, Chukwa for Weblog
- Using Mahout
- Zookeeper, Avro
- Storm: Hadoop Realtime
- RHadoop, RHIPE
- rmr
- Cassandra
- MongoDB, Neo4j
Data Ingestion
- Summary of Data Formats
- Data Discovery
- Data Sources & Acquisition
- Data Integration
- Data Fusion
- Transformation, Enrichment
- Data Survey
- Google OpenRefine
- How much Data?
- Using ETL
Data Munging
- Dimensionality & Numerosity Reduction
- Normalization
- Data Scrubbing
- Handling Missing Values
- Unbiased Estimators
- Binning Sparse Values
- Feature Extraction
- Denoising
- Sampling
- Stratified Sampling
- Principal Component Analysis
Toolbox
- MS Excel w/ Analysis ToolPak
- Java, Python
- R, R-Studio, Rattle
- Weka, Knime, RapidMiner
- Hadoop Dist of Choice
- Spark, Storm
- Flume, Scribe, Chukwa
- Nutch, Talend, Scraperwiki
- Webscraper, Flume, Sqoop (Flume Dup?)
- tm, RWeka, NLTK
- RHIPE
- D3.js, ggplot2, Shiny
- IBM LanguageWare
- Cassandra, MongoDB
The only thing that we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.
Header image credit: Biocomicals.com