Myths, dreams, and reality of this beautiful job
What you think it is
What it really is
Expectation for results
Explaining
Business understanding
Data often comes from complex, heterogeneous systems, which usually means lines and lines of log files that you need to understand. Data isn’t everything; information is everything. Never forget this. Information is buried inside data, and you’ll need somebody to tell you where to dig.
The larger the company, the harder it is to find the right people to interview. When you finally reach them, their answers will raise more questions, and they may not have enough time for you and your “nerdy stuff”.
Data visualization
You’ll find yourself using data visualization more often than you would ever have imagined. Charts, slides and other graphical tools will be your silver bullets. Maybe you have magic formulas in your mind, graphs and so on. Forget about them. Data Science is told through graphical representations, and it’s often difficult to find the visualization technique that suits your audience.
Deadlines
There they are. We are slaves in a world of deadlines and expectations. When you were a software engineer you had milestones in your plan and you weren’t allowed to delay a second. In Data Science, things aren’t easier.
Data Science has deadlines and milestones too, and they carry a particular difficulty: the work is very close to academic research, so it doesn’t fit well into the classical waterfall style of ICT project management. An Agile framework (e.g. Scrum or Kanban) should work better, thanks to its inherent ability to adapt quickly to change. But Agile is difficult to teach to managers. It can give them the false impression that there’s no clear delivery date, and companies find that very hard to accept.
Algorithms and programming
And finally, the fun part: Python, R, KNIME, reading scientific papers, optimization algorithms, cross-validation and so on. The technical, nerdy, real fun is a very small part of the work and takes very little time over the whole project lifetime. You may have already lost your enthusiasm in the previous phases, before writing your first line of code, and things may no longer seem as fun as you thought at the beginning.
What’s the best way to do Data Science?
In my experience, I can answer with a single word: Agile. There’s no need to complete all the business understanding before writing your first line of Python code. Start with a basic business understanding of a small piece of data, explore it, visualize it and build a simple model. Produce quantifiable results week by week, keeping your customers constantly engaged in the process. Deliver small results at a constant rate and, please, don’t fall into the waterfall trap.
Simplicity is the key. Never forget it. Start with the simplest things possible and add a small piece of complexity only if needed.
There’s a psychological sense of relief in constant, small results, and this is another weapon you have to use if you want to survive in the jungle of company deadlines and business processes. This way, every colleague who is committed to your project will feel your difficulties and start to understand how hard Data Science is.
Remember, companies still think of Data Science as a branch of ICT; they are not completely wrong, but they shouldn’t expect you to follow the waterfall approach. So you have to endure the struggle of guiding your company toward an Agile way of thinking.
As for the explanation part of the job, I prefer to start with the simplest machine learning model possible: k-nearest neighbors (KNN). It’s very easy to understand. You only need paper, a pencil and a Cartesian plane with some points drawn on it. That’s it. If it produces very nice results, everybody will finally see you as the great business partner you think you are.
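To show how little there is to explain, here is a minimal sketch of KNN in plain Python. The points, labels and the function name `knn_predict` are made up for illustration; in a real project you would more likely rely on a library implementation.

```python
from collections import Counter
from math import dist


def knn_predict(points, labels, query, k=3):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Rank the training points by Euclidean distance to the query and keep the k closest.
    nearest = sorted(range(len(points)), key=lambda i: dist(points[i], query))[:k]
    # The most frequent label among those neighbors is the prediction.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two groups of points on a Cartesian plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0)))  # falls among the "A" points
print(knn_predict(points, labels, (6.0, 7.0)))  # falls among the "B" points
```

This is exactly the paper-and-pencil exercise: draw the new point, circle its three closest neighbors, and count the labels.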
If KNN doesn’t work, you can move on to regressions and decision trees, which are very easy to explain, to tree ensembles (random forests, gradient-boosted trees and so on), or to Bayesian networks, which have a very useful graphical representation.
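Part of the appeal of decision trees is that a fitted tree reads like a business rule. As a rough sketch, the snippet below fits the smallest possible tree, a single split on one numeric feature (a "decision stump"). The feature `support_tickets`, the `churned` labels and the function `fit_stump` are hypothetical, and a real tree learner would use an impurity measure and many features rather than a plain error count.

```python
def fit_stump(values, labels):
    """Find the threshold on one numeric feature that misclassifies the fewest rows.

    The stump predicts True when value > threshold and False otherwise.
    """
    ordered = sorted(set(values))
    # Candidate thresholds are the midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best_threshold, best_errors = None, len(values) + 1
    for threshold in candidates:
        # Count the rows where the rule's prediction disagrees with the label.
        errors = sum((v > threshold) != y for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


# Hypothetical data: support tickets opened by each customer, and whether they churned.
support_tickets = [0, 1, 1, 2, 4, 5, 6, 7]
churned = [False, False, False, False, True, True, True, True]

threshold, errors = fit_stump(support_tickets, churned)
print(f"Rule: churn if support_tickets > {threshold} ({errors} misclassified)")
```

A rule in this form can be put on a slide as a sentence, which is what makes it easy to discuss with a non-technical audience.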
Finally, visualize. Visualize everything. Ask your boss to buy you a course in data visualization, learn as much as possible about the best visualization techniques and, please, remember to avoid pie charts. They are pretty useless and misleading. If you provide a simple scatter plot or bar chart, people will grasp all the relevant information.
Simple results are the best ones. A few days ago, my team and I presented the results of a time series analysis using only three slides: high-level KPIs describing the business phenomenon, a confusion matrix and some performance metrics. Our audience was enthusiastic from the first slide, simply because we started with clear numbers explaining the business in a simple way. In many situations, a small building block can really save your life.
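For readers who haven’t built one, a confusion matrix and the usual metrics derived from it take only a few lines. This is a sketch for a binary problem with invented labels; it is not the data from the presentation mentioned above, and the function names are my own.

```python
def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels encoded as 1 (positive) and 0 (negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # correctly flagged
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed cases
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correctly ignored
    return tp, fp, fn, tn


def performance_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall, returning 0.0 when a denominator is zero."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels for ten observations.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
metrics = performance_metrics(tp, fp, fn, tn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(metrics)
```

Four counts and three ratios fit comfortably on a single slide, which is the point of keeping the results simple.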
Conclusions
Data Science is an exciting job, but it can be very difficult to do when you speak to a non-technical audience. Data and business are intimately related, and you must remember this when you work with business-oriented people. The only way to survive is to find a middle ground between a data-driven, bottom-up approach and a business-driven, top-down approach.
Finally, as Data Science is hard and time-consuming, delivering small results with a constant delivery rate is the only way you can keep your customers engaged.