Be The Chosen Data Scientist: Interview Questions You Should Prepare For


Interviews can be nerve-racking, especially when it comes to Big Data. But as someone who has spent decades in this industry, I think they don’t have to be. If you are prepared, an interview can turn into a good dialogue in which you explain your value to prospective employers.

As they say, being well prepared is half the battle!

So to make things easier, I’m going to share a few questions that I’ve been asked over the years and some that I used while interviewing candidates when building a data science team. And by no means are they exhaustive.

First things first, depending on the role you’re interviewing for, the emphasis will be different. For instance:

If you’re a data scientist, there will be more questions on machine learning and statistics.
If you’re interviewing for an analyst role, there will be more SQL, data visualisation and business analysis questions.
If it is a data engineering role, there will be heavy programming questions about ETL, data pipelines, general computer science and systems programming.

I’m going to go category by category and suggest some great resources, so you can quickly drill down into what you need to prepare for.

Statistics

The foundation of machine learning and analytics is statistics, so having a basic understanding of it is very important. While you’re not expected to be a statistician, knowing the fundamentals will assure the interviewers that you’re not just plugging data into models and that you actually understand what you are doing.

Question 1: What is linear regression?

Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. When there is only one explanatory variable in the equation, it is called simple linear regression.
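To make this concrete, here is a minimal sketch of simple linear regression fitted by ordinary least squares using only the Python standard library. The function name and the experience/salary numbers are made up for illustration.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: years of experience vs. salary in thousands
years = [1, 2, 3, 4, 5]
salary = [45, 50, 55, 60, 65]
slope, intercept = fit_simple_linear_regression(years, salary)
```

With this perfectly linear toy data the fit recovers a slope of 5 and an intercept of 40. Real data will be noisy, and in practice you would likely reach for a library routine instead.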

LASSO and Ridge Regression by Experfy

Question 2: What is interpolation and extrapolation?

Extrapolation is the estimation of a value by extending a known sequence of values beyond the range for which they are known. Interpolation is the estimation of a value that lies between two known values in a sequence.
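A small sketch can show the difference. It assumes the simplest possible model, a straight line through two known points; the function name and the readings are hypothetical.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical readings: value 10 at hour 2 and value 20 at hour 4
inside = linear_estimate(2, 10, 4, 20, 3)   # hour 3 lies between the known points: interpolation
outside = linear_estimate(2, 10, 4, 20, 6)  # hour 6 lies beyond the known points: extrapolation
```

The arithmetic is identical in both cases. What differs is the risk: the extrapolated value relies on the trend continuing outside the range where we have evidence.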

What is interpolation and extrapolation by Mario’s Math Tutoring

Question 3: What is the difference between univariate, bivariate, and multivariate analysis?

Univariate, bivariate and multivariate are terms that describe how many variables are being analysed at once. Univariate means that only one variable is under examination, bivariate refers to two, and multivariate means more than two.

Classification Models by Experfy
The difference between bivariate and multivariate analysis by Sciencing

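Question 4: What does a p-value signify about the statistical data?

The p-value is the probability of seeing data at least as extreme as what was observed, assuming the null hypothesis is true. A small p-value (by convention, less than or equal to 0.05) is evidence against the null hypothesis, so you reject it. A large p-value (greater than 0.05) means the evidence is too weak to reject the null hypothesis. It does not prove that the null hypothesis is true.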
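As a worked example, suppose a coin lands heads 16 times in 20 flips and the null hypothesis is that the coin is fair. The sketch below computes an exact one-sided p-value from the binomial distribution. The scenario and function name are invented for illustration.

```python
from math import comb

def one_sided_binomial_p_value(n_flips, n_heads, p_heads=0.5):
    """P(at least n_heads heads in n_flips flips), assuming the null hypothesis is true."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )

p_value = one_sided_binomial_p_value(n_flips=20, n_heads=16)
reject_null = p_value <= 0.05  # compare against the conventional 0.05 threshold
```

Here the p-value is roughly 0.006, so at the 0.05 level we would reject the hypothesis that the coin is fair.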

What a p-value tells you about statistical data by dummies
Probability and statistics for data science with R by Experfy

Question 5: What is the difference between Type I and Type II error?

A type I error is also known as false positive finding and refers to the rejection of a true null hypothesis. The type II error is also called a false negative finding, retaining a false null hypothesis during an analysis.

The difference between Type I and Type II errors by ThoughtCo

Question 6: How do you deal with outliers?

An outlier refers to any data point that is distinctly different from the rest of your data points. In case you have outlier records in your data set, you can choose to either cap your data, assign a new value to it, try a transformation or completely remove it from your data set.
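One way to sketch the capping option is the common 1.5 × IQR rule, which is an assumption here (other rules are equally valid). Values outside the fences are pulled back to the nearest fence.

```python
from statistics import quantiles

def cap_outliers(values, k=1.5):
    """Cap values that fall more than k * IQR outside the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

# Hypothetical measurements with one obvious outlier
data = [10, 12, 11, 13, 12, 11, 95]
capped = cap_outliers(data)
```

In this toy data only the value 95 is changed. Whether to cap, transform or remove depends on whether the outlier is a data error or a genuine extreme observation.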

Data on the edge: Handling outliers by Veera
Data pre-processing by Experfy

Question 7: How do you handle missing data?

There are several ways to handle missing data. The easiest way (not necessarily the best way) is to simply drop all records with missing values. Another way is to replace each missing value with the mean (average) of the series. A more sophisticated way is to impute the missing values using various statistical and machine learning techniques.
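The first two options can be sketched in a few lines. This assumes missing values are represented as None; the helper names are made up for the example.

```python
from statistics import mean

def drop_missing(values):
    """Option 1: discard the missing entries."""
    return [v for v in values if v is not None]

def impute_mean(values):
    """Option 2: replace each missing entry with the mean of the observed ones."""
    fill = mean(drop_missing(values))
    return [fill if v is None else v for v in values]

series = [4.0, None, 6.0, 8.0, None]
dropped = drop_missing(series)
imputed = impute_mean(series)
```

Dropping shrinks the data set, while mean imputation keeps its size but understates the variability of the series.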

Working with missing data in machine learning
Data pre-processing course by Experfy

Question 8: What is nonparametric testing?

Nonparametric tests are sometimes called distribution-free tests because they are based on fewer assumptions. In particular, they don’t assume the underlying distribution is normal. You use nonparametric tests when you cannot assume that your data is normally distributed.

Question 9: Describe the central limit theorem.

The central limit theorem says that with a large sample size, sample means are approximately normally distributed. A sample mean is the average of a random subset of a larger group. So if you randomly picked 10 people out of 100 and recorded their weights, the average of those 10 weights would be the sample mean. You could do this many times and, since it is a random selection, the sample mean would be different each time.

The CLT makes very few assumptions about the distribution of your underlying data (essentially, independent observations with a finite variance). The distribution of people’s weights does not need to be normal in order to know that the sample means of the weights are approximately normally distributed.
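A quick simulation illustrates the idea. This sketch draws from a deliberately skewed population (an exponential distribution with mean 1, chosen only for illustration) and looks at how the sample means behave.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, n_samples):
    """Draw n_samples samples from a skewed population and return the mean of each."""
    return [
        mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means = sample_means(sample_size=50, n_samples=2000)
center = mean(means)   # should sit close to the population mean of 1
spread = stdev(means)  # should be close to 1 / sqrt(50), about 0.14
```

Even though individual draws are far from normal, a histogram of `means` would look roughly bell-shaped and centred on the population mean.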

Ingredients in the making of a data scientist

Question 10: Alice has 2 kids and one of them is a girl. What is the probability that the other child is also a girl?

You can assume that there is an equal number of males and females in the world.

The outcomes for two kids can be: {BB, BG, GB, GG}

Since we are told that at least one of them is a girl, we can remove the BB option from the sample space. That leaves 3 equally likely options, of which only one (GG) has two girls. Therefore the probability that the other child is also a girl is 1/3.
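The same reasoning can be checked by brute-force enumeration. This sketch assumes, as above, that all four birth-order outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes for two children, in birth order
outcomes = list(product("BG", repeat=2))

# Condition: at least one child is a girl
at_least_one_girl = [o for o in outcomes if "G" in o]

# Event of interest: both children are girls
both_girls = [o for o in at_least_one_girl if o == ("G", "G")]

probability = Fraction(len(both_girls), len(at_least_one_girl))
```

Note that the answer depends on how the information was obtained. If you were told that a specific child (say, the elder) is a girl, the sample space would be {GB, GG} and the answer would be 1/2.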

Machine Learning & Theory

In this category, employers want to make sure you can explain the basic concepts behind popular machine learning algorithms and models.

Nowadays, machine learning algorithms are just library calls from scikit-learn (if you are using Python) or various packages (if you are using R). So using a machine learning algorithm takes only a few short lines of code.

However, do you understand the library functions you are calling?

These are just a sample of the various questions that might be asked.

Question 1: What’s the difference between Supervised and Unsupervised Learning?

In supervised learning, the machine learning algorithm learns a function that maps an input to an output based on examples of input-output pairs. Examples of supervised learning include regression, neural networks, random forest, deep learning, etc.

In unsupervised learning, we give the machine learning algorithm unlabeled data and it infers structure from the data. Examples of unsupervised learning are the various clustering algorithms, which find groups in unlabeled data.

Machine learning foundations: Supervised Learning by Experfy

Question 2: Describe PCA(Principal Component Analysis).

PCA is a dimensionality reduction technique. Let’s say we have a data set with a large number of dimensions (n dimensions). PCA produces k new features (called principal components), with k much smaller than n, that retain as much of the variance in the data as possible.

Note that PCA does not pick a subset of the original features. It transforms the initial features into new ones that are linear combinations of the original features. The k components it produces are the best k-dimensional linear summary of the data, in the sense that they minimize the squared error when the data is reconstructed from them.

Unsupervised learning: Dimensionality reduction and representation by Experfy

Question 3: Why is naive Bayes so ‘naive’?

Naive Bayes is called ‘naive’ because it assumes that all of the features in a dataset are statistically independent of one another, given the class. This is rarely the case in real life since there’s usually some kind of correlation between the features. Having said that, naive Bayes is a surprisingly effective machine learning algorithm for a number of use cases.

Question 4: Discuss bias and variance tradeoffs. Give examples of ML algorithms that have low/high bias and low/high variance.

The bias-variance tradeoff is a central problem in supervised machine learning. We want to choose a model that both accurately captures the regularities in the training data and generalizes to unseen data. High-bias models underfit the data by missing relevant relationships between features and the target outputs. High-variance models overfit the training data by being too sensitive to all the minutiae and fluctuations in it.

The best model has low bias and low variance. But there’s usually a tradeoff.

Parametric or linear machine learning algorithms often have a high bias but a low variance.
Non-parametric or non-linear machine learning algorithms often have a low bias but a high variance.
Examples of low-variance machine learning algorithms include Linear Regression, Linear Discriminant Analysis, and Logistic Regression.
Examples of high-variance machine learning algorithms include Decision Trees, k-Nearest Neighbors and Support Vector Machines.

Question 5: How do you handle an imbalanced dataset?

Certain data science problems have highly imbalanced data. For example, in the case of fraud detection, the fraud cases might be significantly less than 1% of the sample. Many machine learning models will then overlearn the non-fraudulent cases and fail to pick up the fraudulent ones. So, how do we fix this problem?

There are several approaches:

Oversample: Oversample the minority class (fraud cases) by increasing the quantity of the rare class so that it becomes more representative. There are several statistical methods for this, such as bootstrapping or SMOTE (Synthetic Minority Oversampling Technique).
Undersample: Undersample the majority class so that it can be balanced with the minority class.
Both oversample/undersample: In some cases, we do both. Oversample the minority class and undersample the majority class.
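The oversampling idea can be sketched as plain resampling with replacement (the bootstrapping flavour, not SMOTE). The function name, the tuple layout and the toy rows are assumptions made for this example.

```python
import random

def random_oversample(rows, label_of, seed=0):
    """Resample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the extra rows at random, with replacement, from the same class
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical transactions: 8 legitimate, 2 fraudulent
rows = [("legit", i) for i in range(8)] + [("fraud", i) for i in range(2)]
balanced = random_oversample(rows, label_of=lambda row: row[0])
fraud_count = sum(1 for row in balanced if row[0] == "fraud")
```

Resampling should normally be applied to the training data only, so that the test set still reflects the real class balance.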

Question 6: Name several clustering algorithms.

K-means clustering
Agglomerative Hierarchical Clustering
Gaussian Mixture Models using Expectation Maximization
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Question 7: Describe one method to select the optimal number of clusters in k-means.

One method to pick the optimal number of clusters is the “elbow” method. The basic idea is to run k-means clustering on the dataset for different values of k (e.g. 1 to 10). For each value of k, calculate the sum of squared errors (SSE). We then plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the “elbow” on the arm is the value of k that is optimal. We want a small SSE, but the SSE tends to decrease toward 0 as we increase k. The objective is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k.
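The elbow method can be sketched with a toy one-dimensional k-means. This is a simplified illustration with a deterministic starting point, not a production implementation, and the data points are invented so that three clusters are obvious.

```python
def kmeans_sse_1d(values, k, iterations=20):
    """Run a basic 1-D k-means and return the sum of squared errors (SSE)."""
    data = sorted(values)
    n = len(data)
    # Deterministic start: initial centers spread evenly across the sorted data
    centers = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in data)

# Hypothetical data with three well-separated groups
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
sse_by_k = {k: kmeans_sse_1d(points, k) for k in range(1, 6)}
```

Plotting `sse_by_k` would show a steep drop up to k = 3 and very little improvement afterwards, which is the elbow.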

Question 8: What are the advantages and disadvantages of neural networks?

Advantages: Neural networks have led to performance breakthroughs for unstructured data sets such as video, audio, and images(especially deep learning neural networks). They have been able to do things that no other ML algorithms have achieved.

Disadvantages: However, they require a large amount of training data. It’s also difficult to pick the right deep learning architecture, and the internal “hidden” layers are hard to interpret, so the model is effectively a black box that is very difficult to explain.

Question 9: What is the ROC Curve and what is AUC (a.k.a. AUROC)?

The ROC (receiver operating characteristic) curve plots the performance of a binary classifier as its decision threshold varies: True Positive Rate (y-axis) vs. False Positive Rate (x-axis).

AUC is the area under the ROC curve, and it’s a common performance metric for evaluating binary classification models.

The higher the AUC the better the classifier.
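One way to build intuition is that AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes AUC that way by comparing every positive/negative pair; the labels and scores are made up.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and classifier scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
result = auc(labels, scores)
```

Here three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.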

Question 10: What is cross validation?

Cross-validation is a technique to evaluate predictive models and machine learning algorithms by partitioning the data set into training and test sets. A common way to do this is k-fold cross-validation. The original sample is randomly partitioned into k equal-size subsets. Of the k subsets, a single subset is kept as the validation set for testing the model, and the remaining k-1 subsets are used as training data.

The cross-validation process is then repeated k times (k folds), with each of the k subsets used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.

The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
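The partitioning described above can be sketched as a small generator of index splits. The function name is hypothetical, and model fitting is left out so that only the splitting logic is shown.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, repeatable via the seed
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

splits = list(k_fold_indices(n_samples=10, k=5))
```

For each split you would train on the training indices, score on the validation indices, and then average the k scores.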

Data Munging, SQL, Visualization, Programming

This is a really broad topic. Depending on the role, the questions can be quite technical. Whereas the other topics are more conceptual, questions in this category are more tactical, and you are often asked to write code on the spot.

Question 1: Describe all the joins in SQL.

Here are the different types of JOINs in SQL:

(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from the right table
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records from the left table
FULL (OUTER) JOIN: Returns all records from both tables, matched where possible, including records with no match in the other table
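The difference between an inner and a left join is easy to see on a tiny example. This sketch uses Python’s built-in sqlite3 module with two made-up tables. RIGHT and FULL joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, dept_id INTEGER);
    CREATE TABLE department (id INTEGER, dept_name TEXT);
    INSERT INTO employee VALUES ('Ann', 1), ('Bob', 2), ('Cy', NULL);
    INSERT INTO department VALUES (1, 'Data'), (2, 'Sales'), (3, 'Legal');
""")

# INNER JOIN: only employees that have a matching department
inner_rows = conn.execute("""
    SELECT e.name, d.dept_name FROM employee e
    JOIN department d ON e.dept_id = d.id ORDER BY e.name
""").fetchall()

# LEFT JOIN: every employee, with NULL where no department matches
left_rows = conn.execute("""
    SELECT e.name, d.dept_name FROM employee e
    LEFT JOIN department d ON e.dept_id = d.id ORDER BY e.name
""").fetchall()
```

Cy has no department, so he is missing from the inner join but appears in the left join with an empty department name.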

Question 2: What is the SQL query to find the second highest salary employee? Assume we have a table called employee with salary as a field.

SELECT MAX(salary) FROM employee
WHERE salary NOT IN (SELECT MAX(salary) FROM employee);
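To check the query, here is a small sketch that runs it with Python’s built-in sqlite3 module. The rows are invented, and two employees deliberately share the top salary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ann", 120), ("Bob", 100), ("Cy", 120), ("Di", 90)],
)

# Highest salary among the rows that are not at the overall maximum
second_highest = conn.execute("""
    SELECT MAX(salary) FROM employee
    WHERE salary NOT IN (SELECT MAX(salary) FROM employee)
""").fetchone()[0]
```

Because the subquery removes every row at the top salary, ties for first place do not affect the result. Note that the query returns the salary value, so you would need one more step to fetch the matching employee.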

Question 3: What is the difference between Union and Union All?

UNION command is used to combine the result set of two or more select statements. However, union eliminates the duplicates. UNION ALL includes duplicates.
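A short sketch with Python’s built-in sqlite3 module shows the difference. The two tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE online_orders (city TEXT);
    CREATE TABLE store_orders (city TEXT);
    INSERT INTO online_orders VALUES ('Boston'), ('Austin');
    INSERT INTO store_orders VALUES ('Boston'), ('Denver');
""")

# UNION removes duplicate rows from the combined result
union_rows = conn.execute("""
    SELECT city FROM online_orders UNION SELECT city FROM store_orders ORDER BY city
""").fetchall()

# UNION ALL keeps every row, including duplicates
union_all_rows = conn.execute("""
    SELECT city FROM online_orders UNION ALL SELECT city FROM store_orders ORDER BY city
""").fetchall()
```

Boston appears once in the UNION result and twice in the UNION ALL result. Since UNION ALL skips the de-duplication step, it is usually the cheaper choice when duplicates are acceptable.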

Question 4: How would you visualize a dataset with height, weight, and eye color in a 2-D graph? (Or you may be asked to visualize any dataset with more than 2 dimensions on the whiteboard.)

The basic idea is to encode dimensions beyond 2 as shapes, colors or symbols. For example, plot height against weight and color each point by eye color. Anyone who has used Tableau will know this right away. I’ve personally interviewed doctoral candidates (people in a Ph.D. program) in a quantitative field who struggled with this question.

So, it’s not really a measure of intelligence. It’s just a matter of understanding how to visualize higher dimensional data using different types of encoding.

Question 5: What is feature engineering? Describe some feature engineering you have done in the past to improve the results of your data science project.

Feature engineering is the process of using knowledge about the problem space to create features (factors/variables) that make machine learning algorithms work better.

As the famous Andrew Ng said, “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”

Candidates should explain how they transformed existing features into new features that were meaningful to their data science projects.

Question 6: Explain MapReduce conceptually.

Map is a function that “transforms” each item in a list into another kind of item and puts the results back into the same kind of list.

Reduce is a function that “collects” the items in a list and performs some computation on all of them, thus reducing them to a single value.
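The classic illustration is a word count. This single-machine sketch uses Python’s built-in map and functools.reduce to mimic the two phases. A real MapReduce framework would also distribute the work and group the intermediate results by key, which is not shown here.

```python
from collections import Counter
from functools import reduce

documents = ["big data big", "data science", "big science"]

def map_phase(document):
    """Transform one document into its own word counts."""
    return Counter(document.split())

def reduce_phase(left, right):
    """Combine two partial word counts into one."""
    return left + right

# Map every document, then fold the partial counts into a single result
word_counts = reduce(reduce_phase, map(map_phase, documents), Counter())
```

Each document is mapped independently, which is what makes the map phase easy to run in parallel across machines.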

Question 7: Write a regular expression to parse dates of the YYYY-MM-DD format.

/^\d{4}[/-](0?[1-9]|1[012])[/-](0?[1-9]|[12][0-9]|3[01])$/
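For comparison, here is a stricter Python sketch that accepts only the hyphenated, zero-padded YYYY-MM-DD form and returns the parts as integers. The pattern and the function name are my own for this example and differ slightly from the expression above.

```python
import re

# Four-digit year, two-digit month 01-12, two-digit day 01-31, separated by hyphens
DATE_PATTERN = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def parse_date(text):
    """Return (year, month, day) as integers, or None if text is not YYYY-MM-DD."""
    match = DATE_PATTERN.fullmatch(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return year, month, day

parsed = parse_date("2019-03-07")
rejected = parse_date("2019-13-07")  # month 13 does not exist
```

A regular expression only checks the shape of the string. It would still accept an impossible date such as 2019-02-31, so a follow-up check with datetime.date is a sensible extra step.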

Question 8: Describe the data cleaning process.

How can we cleanse data? Common steps include:

Standardization – making data follow the same rules, notations, codes
Enrichment – filling in missing data based on some reference value (e.g. city name)
De-duplication – finding and removing records that look different but are actually duplicates
Validation – commonly used for making sure data follows business rules
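The four steps above can be sketched on a handful of records. Everything here is hypothetical: the field names, the ZIP-to-city reference table and the very crude email rule exist only to show where each step fits.

```python
# Hypothetical reference table used for enrichment
ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago"}

def clean(records):
    seen_emails = set()
    cleaned = []
    for record in records:
        # Standardization: same casing and no stray whitespace
        email = record["email"].strip().lower()
        # Validation: a deliberately crude business rule for this sketch
        if "@" not in email:
            continue
        # De-duplication: records that differ only in formatting are the same person
        if email in seen_emails:
            continue
        seen_emails.add(email)
        # Enrichment: fill a missing city from the reference table
        city = record.get("city") or ZIP_TO_CITY.get(record["zip"])
        cleaned.append({"email": email, "zip": record["zip"], "city": city})
    return cleaned

raw = [
    {"email": " Ann@Example.com ", "zip": "10001", "city": None},
    {"email": "ann@example.com", "zip": "10001", "city": "New York"},
    {"email": "not-an-email", "zip": "60601", "city": None},
    {"email": "bob@example.com", "zip": "60601", "city": None},
]
cleaned = clean(raw)
```

Of the four raw records, one is dropped as a duplicate and one fails validation, leaving two clean records with their cities filled in.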

Question 9: Why is Apache Spark faster than Hadoop?

The basic idea is that Spark does its computation in memory where possible, instead of writing every intermediate result to disk like Hadoop MapReduce.


Question 10: What is normalization/denormalization? Tradeoffs of each. What is the use case for denormalization?

Most relational databases are normalized. This means the data is organized so that it contains no redundant data and related data is separated into different tables. Normalizing reduces disk storage. The more normalized the database, the more complex the queries are, because a query has to join many tables.

The data in a data warehouse, on the other hand, is organized to be read-only and for analytics purposes. Therefore, it does not need to be organized in a normalized fashion. A denormalized data warehouse uses fewer tables and includes many redundancies, which make reporting and analytical queries simpler and faster.

Projects & Fit Questions

These are “soft” open-ended questions to understand how you tackle a data science project. How well do you work in teams? What does your prior role look like?

Sometimes, certain behavioral questions are asked during the interview, and you need to be prepared for them.

Question 1: Describe a recent project you’ve worked on.
Question 2: What are some of your favorite machine learning algorithms or statistical models and why?
Question 3: How big is your current team, and how is it structured? What is your role? Data scientist, data engineer, analyst, architect?
Question 4: Give an example of a team project you worked on and your contribution to the team.
Question 5: You are assigned a new data analytics/data science project. How will you begin, and what steps will you follow?
Question 6: What types of management styles do you work best under?
Question 7: What are your best technical skills? What areas are you weakest in?

Conclusion

The final advice I can offer is to be confident in your answers without bluffing. And if you don’t know something then admit that.

For every question asked, try to talk through your thinking out loud so the interviewers get a good idea of your problem-solving abilities. Jumping right to the solution or giving up right away does not give them a good idea of your abilities.

Also, remember interviewing is a two-way street. It’s also for you to get to know them. Make sure the culture is a good fit for your working style. But also be open to learning new things. If you are in love with the Python data stack but the company is an R shop, then you should be open to learning R.

Technologies are constantly changing. Picking up more skills and tools along your career path will make you more valuable and marketable. So, don’t be stubborn or shy to learn new and different tools and technologies.
