I work at a data science mentorship startup, and I’ve found there’s a single piece of advice that I catch myself giving over and over again to aspiring mentees. And it’s really not what I would have expected it to be.
Rather than suggesting a new library or tool, or some resume hack, I find myself recommending that they first think about what kind of data scientist they want to be.
The reason this is crucial is that data science isn’t a single, well-defined field, and companies don’t hire generic, jack-of-all-trades “data scientists”, but rather individuals with very specialized skill sets.
To see why, just imagine that you’re a company trying to hire a data scientist. You almost certainly have a fairly well-defined problem in mind that you need help with, and that problem is going to require some fairly specific technical know-how and subject matter expertise. For example, some companies apply simple models to large datasets, some apply complex models to small ones, some need to train their models on the fly, and some don’t use (conventional) models at all.
Each of these calls for a completely different skill set, so it’s especially odd that the advice that aspiring data scientists receive tends to be so generic: “learn how to use Python, build some classification/regression/clustering projects, and start applying for jobs.”
Those of us who work in the industry bear a lot of the blame for this. We tend to lump an excessive number of things into the “data science” bucket in casual conversations, blog posts and presentations. Building a robust data pipeline for production? That’s a “data science problem.” Inventing a new kind of neural network? That’s a “data science problem.”
That’s a problem, because it encourages aspiring data scientists to lose focus on specific problem classes and become jacks of all trades instead, which can make it harder to get noticed or break through in a market that’s already saturated with generalists.
But it’s hard to avoid becoming a generalist if you don’t know which common problem classes you could specialize in in the first place. That’s why I put together a list of the five problem classes that are often lumped together under the “data science” heading:
1. Data engineer
Job description: You’ll be managing data pipelines for companies that deal with large volumes of data. That means making sure that data is efficiently collected, retrieved from its source when needed, cleaned and preprocessed.
Why it’s important: If you’ve only ever worked with relatively small (<5 GB) datasets stored in .csv or .txt files, it might be hard to understand why there are people whose full-time job is to build and maintain data pipelines. Here are a couple of reasons: 1) a 50 GB dataset won’t fit in a typical computer’s RAM, so you generally need other ways to feed it into your model, and 2) that much data can take a ridiculous amount of time to process, and often has to be stored redundantly. Managing that storage takes specialized technical know-how.
Requirements: The technologies you’ll be working with include Apache Spark, Hadoop and/or Hive, as well as Kafka. You’ll most likely need to have a solid foundation in SQL.
The questions you’ll be dealing with sound like:
→ “How do I build a pipeline that can handle 10,000 requests per minute?”
→ “How can I clean this dataset without loading it all in RAM?”
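To make that second question a bit more concrete, here is a minimal sketch of one way to clean a CSV file without ever holding the whole thing in memory: read it one row at a time and write the cleaned rows straight back out. It uses only Python’s standard library. The column names, the sample data and the “drop any row with an empty field” rule are all made up for illustration, and the sketch assumes every row has the same number of fields as the header. A real pipeline would more likely use tools like the ones listed above.

```python
import csv
import io


def clean_rows(rows):
    """Yield cleaned rows one at a time; only the current row is held in memory."""
    for row in rows:
        # Strip stray whitespace from every field (missing fields become "").
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        # Hypothetical cleaning rule: drop rows that have any empty field.
        if all(cleaned.values()):
            yield cleaned


def clean_csv(source, sink):
    """Stream rows from `source` to `sink` and return the number of rows kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in clean_rows(reader):
        writer.writerow(row)
        kept += 1
    return kept


# Tiny in-memory stand-in for a file that would be too large to load at once.
raw = io.StringIO("user_id,country\n 1 , CA \n2,\n3,US\n")
out = io.StringIO()
kept = clean_csv(raw, out)
print(kept)
print(out.getvalue())
```

With real data, `source` and `sink` would be file objects opened with `open(..., newline="")` rather than `io.StringIO` buffers. The point is that memory use depends on the size of a single row, not on the size of the file.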
2. Data analyst
Job description: Your job will be to translate data into actionable business insights. You’ll often be the go-between for technical teams and business strategy, sales or marketing teams. Data visualization is going to be a big part of your day-to-day.
Why it’s important: Highly technical people often have a hard time understanding why data analysts are so important, but they really are. Someone needs to convert a trained and tested model and mounds of user data into a digestible format so that business strategies can be designed around them. Data analysts help to make sure that data science teams don’t waste their time solving problems that don’t deliver business value.
Requirements: The technologies you’ll be working with include Python, SQL, Tableau and Excel. You’ll also need to be a good communicator.
The questions you’ll be dealing with sound like:
→ “What’s driving our user growth numbers?”
→ “How can we explain to management that the recent increase in user fees is turning people away?”
3. Data scientist
Job description: Your job will be to clean and explore datasets, and make predictions that deliver business value. Your day-to-day will involve training and optimizing models, and often deploying them to production.
Why it’s important: When you have a pile of data that’s too big for a human to parse, and too valuable to be ignored, you need some way of pulling digestible insights from it. That’s the basic job of a data scientist: to convert datasets into digestible conclusions.
Requirements: The technologies you’ll be working with include Python, scikit-learn, Pandas, SQL, and possibly Flask, Spark and/or TensorFlow/PyTorch. Some data science positions are purely technical, but the majority will require you to have some business sense, so that you don’t end up solving problems that no one has.
The questions you’ll be dealing with sound like:
→ “How many different user types do we really have?”
→ “Can we build a model to predict which products will sell to which users?”
4. Machine learning engineer
Job description: Your job will be to build, optimize and deploy machine learning models to production. You’ll generally be treating machine learning models as APIs or components, which you’ll be plugging into a full-stack app or hardware of some kind, but you may also be called upon to design models yourself.
Requirements: The technologies you’ll be working with include Python, JavaScript, scikit-learn, TensorFlow/PyTorch (and/or enterprise deep learning frameworks), and SQL or MongoDB (typically used for app DBs).
The questions you’ll be dealing with sound like:
→ “How do I integrate this Keras model into our JavaScript app?”
→ “How can I reduce the prediction time and prediction cost of our recommender system?”
5. Machine learning researcher
Job description: Your job will be to find new ways to solve challenging problems in data science and deep learning. You won’t be working with out-of-the-box solutions, but rather will be making your own.
Requirements: The technologies you’ll be working with include Python, TensorFlow/PyTorch (and/or enterprise deep learning frameworks), and SQL.
The questions you’ll be dealing with sound like:
→ “How do I improve the accuracy of our model to something closer to the state of the art?”
→ “Would a custom optimizer help decrease training time?”
The five job descriptions I’ve laid out here definitely don’t stand alone in all cases. At an early-stage startup, for instance, a data scientist might have to be a data engineer and/or a data analyst, too. But most jobs will fall more neatly into one of these categories than the others — and the larger the company, the more these categories will tend to apply.
Overall, the thing to remember is that in order to get hired, you’ll usually be better off building a more focused skill set: don’t learn TensorFlow if you want to become a data analyst, and don’t prioritize learning PySpark if you want to become a machine learning researcher.
Think instead about the kind of value you want to help companies build, and get good at delivering that value. That, more than anything else, is the best way to get in the door.