I recently took part in an interesting Reddit discussion about the untold truths of being a machine learning engineer, and a few of my answers got highly upvoted. As one of the more active participants, I am sharing the key takeaways in a curated form.
1. Using Deep Learning
Many Machine Learning enthusiasts think that they will play with fancy Deep Learning models, tune Neural Network architectures and hyperparameters. Don’t get me wrong, some do, but not many.
The truth is that ML engineers spend most of their time working out how to properly extract a training set that resembles the real-world problem distribution. Once you have that, in most cases you can train a classical Machine Learning model and it will work well enough.
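To make that concrete, here is a minimal sketch of one common way to do it, assuming a tabular dataset with a hypothetical file name and columns: split by time instead of randomly, so the held-out data resembles what the model will actually see in production.

```python
# A minimal sketch, not a recipe: split by time so the validation set
# resembles production data. File name and columns are hypothetical.
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["created_at"])
df = df.sort_values("created_at")

# Train on older data, validate on the most recent 20%,
# mimicking how the model will be used after deployment.
cutoff = df["created_at"].quantile(0.8)
train = df[df["created_at"] <= cutoff]
valid = df[df["created_at"] > cutoff]

X_train, y_train = train.drop(columns=["label", "created_at"]), train["label"]
X_valid, y_valid = valid.drop(columns=["label", "created_at"]), valid["label"]
```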
Are Deep Learning models harder to explain than classic ML models?
OP said it nicely:
Can’t see how explaining a Convolutional Neural Net would be any harder than explaining a whole classification framework based on SVMs, Random Forests or Gradient Boosting.
I feel like this statement has become less and less true over the years as NNs have seen more research into interpretability.
It clearly still holds when comparing NNs to good old traditional statistics like GLMs or Naive Bayes. But as soon as you move to CART-based methods or anything using the kernel trick, this fabled interpretability goes out the window.
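To see why, here is a small sketch (scikit-learn on a toy dataset) of what "explaining" a random forest tends to look like in practice: you reach for post-hoc tools such as permutation importance, much like you would for a neural network.

```python
# A sketch of post-hoc interpretation for a tree ensemble:
# permutation importance on held-out data, top five features printed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X.columns, result.importances_mean), key=lambda p: -p[1])
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```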
2. Learning Machine Learning
When learning, you tend to go through a lot of papers on arxiv-sanity with some really cool algorithms. Then you enter the industry and all you see is relatively basic stuff like logistic regression, feedforward NNs, random forests (decision trees), bag-of-words instead of embeddings, and you feel like these models could be implemented by the average undergrad or even a smart high schooler. Maybe if you’re lucky you’ll see an SVM.
Infrastructure and data pipelines are where all the real engineering work happens.
I felt similar to the OP above at the beginning of my career. But why use a more complicated tool when there's no need for it? Many real-world problems don't require a state-of-the-art NN architecture. Sometimes a simple logistic regression gets the job done.
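For illustration, here is a minimal sketch (toy data, scikit-learn assumed) of exactly that kind of baseline: bag-of-words features feeding a logistic regression. Unglamorous, and often what ends up in production.

```python
# A toy baseline: bag-of-words (TF-IDF) features + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "great product, works as advertised",
    "broke after two days, very disappointed",
    "fast shipping and solid build quality",
    "refund took weeks, support never replied",
]
labels = [1, 0, 1, 0]  # 1 = positive review, 0 = negative review

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["terrible, would not buy again"]))
```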
The second part of the comment is true for smaller startups, where you usually have to take care of data pipelines yourself. In bigger companies, there are designated departments that deal with infrastructure. But there are no shortcuts: Data Scientists still need to be well informed about how data infrastructure works.
3. Learning Theory
Learn as much fancy theory as you want, but at the end of the day, your job is going to be 99% data cleaning and infrastructure work.
99% is a bit of an exaggeration. To rephrase the OP: Machine Learning Engineers don't just play with fancy models; sometimes they need to get their hands dirty cleaning and labeling the data.
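For a flavour of what that hands-dirty work looks like, here is a small sketch of routine cleaning with pandas; the file and column names are hypothetical.

```python
# A sketch of routine data cleaning before any model sees the data.
# "raw_data.csv" and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.csv")

df = df.drop_duplicates()                 # drop exact duplicate rows
df = df.dropna(subset=["text", "label"])  # drop rows we cannot train on

# Normalize the text column: lowercase, collapse whitespace, trim.
df["text"] = (
    df["text"]
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Keep only labels we actually know how to handle.
known_labels = {"positive", "negative", "neutral"}
df = df[df["label"].isin(known_labels)]

df.to_csv("clean_data.csv", index=False)
```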
Why don’t you use software and services to label data?
This is very true. So much so that I thought I was alone. I work mostly in NLP and 99% of my job is labelling data and making some infrastructure in Java.
Data labeling services are usually too expensive for the big datasets used in practice, and some datasets are not trivial to label. I once worked on invoice classification, where you would need professional accountants to label the data.
What does Machine Learning look like in the real world?
I increasingly notice a gap in understanding of what Data Scientists actually do. Many aspiring Data Scientists are disappointed when reality doesn't meet their expectations. Data Science is not just about tweaking the parameters of your favorite model and climbing the Kaggle leaderboard. What if I told you there is no leaderboard in the real world?
That is the reason I wrote the Your First Machine Learning Model in the Cloud ebook: to show what working on an actual Data Science project looks like from start to finish. The ebook is aimed at Data Science enthusiasts and Software Engineers who are thinking about pursuing a career in Data Science.