With the increasing amount of data being generated and the evolution of analytics, data science has become a necessity for companies that want to stay in the game. This has driven up the demand for data scientists across organizations that need to make sense of their vast amounts of data. Plenty of resources available online describe the skills you can learn to raise your chances of breaking into the field of data science.
Let’s say you put in the time to learn the skills, went through several interviews, and have now landed a job in data science. But now that you are a data scientist, how do you excel at your work? How can you prove your value and make a real impact on the organization?
This article aims to provide you with five principles for running effective data science projects:
Effective Writing
Data science involves complicated concepts that can take weeks for even an experienced professional to understand. A typical data science project has a time horizon of several months and involves multiple steps, as shown in the following image.
Although the data science process has a general theoretical sequence, on a real project this sequence often gets altered to fit the project’s needs, which makes it harder to track.
If we don’t reflect on what we have learned, our brains tend to forget things over time, and the retention curve could look something like this:
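A retention curve of this shape is often modelled with the Ebbinghaus forgetting curve. As a rough sketch (the exponential form is a common simplification, not a claim from this article):

```latex
% Ebbinghaus forgetting curve (simplified exponential form)
% R = fraction of material retained
% t = time elapsed since learning
% S = relative strength of the memory (larger S = slower forgetting)
R(t) = e^{-t/S}
```

Reflection and review, such as the weekly journal suggested below, effectively increase S, flattening the curve.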
Given the long time horizon and complexity of the project your team is involved in, the team is likely to lose track of the project’s objectives, learnings, and insights over its life. This can lead to redundant discussions in which you find yourself explaining the same learnings and insights over and over, consuming meeting time that could be spent planning the project’s future direction.
As a data scientist, it becomes essential that you write effectively, so that everyone involved in the project is aligned with its progress. Here is a list of things you can do to write effectively about your data science project:
- Project Explanation Guide – Explain your project’s objectives and its impact on the business, and build a case for why it is important and needs to be pursued. Write this document so that a person with no knowledge of the project can understand it easily, restricting the use of data science jargon to make it easier for everyone to comprehend.
- Weekly update journal – Make a habit of creating a weekly journal in reverse chronological order about your project learnings, and seek feedback on it from your team. This will enable everyone to be in sync with the project’s progress and help them to reflect on learnings from time to time.
- Company blogs – The beauty of working in data science is that you get to work on exciting novel problems. You can share so much about how you overcame the challenges in your project, which could be valuable for others. As a data scientist, by contributing to your company’s blogs, you can help your company increase the overall credibility of its product in the market.
Explain/teach your learning to a wider audience
The everyday life of a data scientist can involve learning some complex concepts. Today, many courses teach those concepts, enabling anyone to break into this field without a formal statistical or mathematical background. Those who succeed in getting in sometimes face imposter syndrome, and the only way to get over it is to develop a strong fundamental understanding of data science topics.
It can be really difficult to get your head around these topics on your own, and even if you understand them to an extent, you are likely to have gaps in your understanding that you aren’t aware of at the time. If you proceed straight into your data science project in that state, these knowledge gaps can cause problems later, wasting budget and time.
How your confidence about a topic varies with how much you truly know about it can be seen in the above image. When a topic is new, your confidence is low because it is unfamiliar. After a short while, you start getting it and reach a certain comfort zone; unless you are pushed, you might stay there for a very long time. Then, as you get more actively involved in learning, you’ll most likely fall into the “valley of despair” as you realize there’s a lot more you haven’t yet mastered. From there, your confidence generally grows as your experience increases. This is known as the Dunning-Kruger effect.
So, how do you find these knowledge gaps, learning what you know and what you don’t, so that you can build a true understanding of a concept?
If you can’t explain it simply, you don’t understand it well enough
Albert Einstein
To supercharge your learning, the Feynman Technique might just be the best way to learn and fill those knowledge gaps. Devised by the Nobel Prize-winning physicist Richard Feynman, it leverages the power of teaching to enhance learning.
Here is a five-step process based on the Feynman Technique, which we have adapted slightly after reflecting on our own learning experiences. The steps are as follows:
- Narrow down the topic and learn about it on your own.
- Explain the topic as though you are teaching it to someone in very simple terms by using examples, and identify gaps in your knowledge.
- Go back to the source material and fill those knowledge gaps.
- Assess your knowledge about the topic. If some concepts still remain unclear, learn more and repeat steps 2 through 4.
- When you feel more confident about the topic, explain it to your colleagues, take their feedback, and then repeat step 4 to fill any remaining gaps.
The questions and the feedback you receive in step 5 are invaluable for deepening your understanding of the topic. Hearing what intrigued them will increase your curiosity to learn further. After all, the more you learn about it, the more you realize you don’t know. And, only by looking at a topic from different perspectives can we understand its true meaning.
Top Google results are not necessarily the best
Working on data science projects in industry is very different from the problems you solve while preparing for the field. Most of the techniques we learn on toy problems do not transfer well to real data science projects because of practical constraints.
Industry-level data problems are complex, and building solutions for them is even more so, as the notion of one best solution doesn’t exist. This article describes some of the challenges data scientists face in practising data science at work. You need to understand and be aware of the challenges concerning your problem.
If you are starting your first real data science project, avoid the tendency to implement the solution you find in the top results of a quick Google search. Most such blog posts are introductory and can only give you a basic gist of the problem; it is highly unlikely that you will find a solution that perfectly aligns with your use case.
To reach a suitable solution that works for your problem in your scenario, it is important to do a literature survey, studying the research published around your problem. This is the most important part, as it gives direction to your work. By studying the methodology and conclusions of other researchers, you can set a goal for your analysis and thereby define your problem statement.
To find research on your topic of interest, use tools like Google Scholar, arXiv Sanity, Papers with Code, and Semantic Scholar. A quick tip: don’t go into full depth on every paper you find; skim them first, and set aside the few that align well with your problem to read in full depth.
By performing a proper literature survey, you would be able to look at the problem from a lot of different angles and develop a solution that suits your problem well.
Seek simplicity
Let’s face it: data science is complicated, getting value from data is hard, and most data projects fail. Delivering value from data is hard because data is messy, statistical assumptions don’t always hold in practice, and sometimes we don’t even have the right data to answer the question. To cope, we often build complex models, hoping that by throwing in enough compute power and data, these models will take care of all the shortcomings by finding the relevant patterns and solving our use case. In the process, you end up with a black-box model where you aren’t aware of the situations in which it could fail and produce an unreasonable output. That is a bad place to be: you would not even know an output is wrong, and everyone would eventually lose trust in your model.
It is important to understand that data science is not about building black-box models. Building a data science solution is not much different from building a software solution that can be validated by anyone using it. By making your model interpretable and simple, you can make it:
- Reliable
- Easier to Debug
- Useful in directing future data collection
- Useful in human decision-making
- Trustworthy for its intended users
With the increasing adoption of data science in industry, the need for interpretable machine learning solutions has been growing rapidly. Christoph Molnar has written a great guide on making black-box models interpretable.
Data science might have given us better tools to understand the world, but at the end of the day, it provides nothing more than a bunch of numbers. Those numbers might tell us valuable things, such as whether customers will buy a certain product, which customers are likely to churn, or how much revenue you’ll make in the coming quarters. But they are just numbers unless we use them to gather insights and take informed action to get the most value from data. A simple solution deployed and used is far more valuable than a complex solution that never sees the light of day. By treating ML algorithms as tools that use data to quantify your beliefs about your problem domain, you can keep them in check and keep them interpretable.
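As a minimal sketch of what a simple, interpretable model looks like in practice, the snippet below fits an ordinary least-squares model with NumPy. The scenario, field names, and numbers are invented for illustration, not taken from this article:

```python
import numpy as np

# Hypothetical data: weekly sales as a function of ad spend and discount.
# Columns: bias term, ad_spend (k$), discount (%). Numbers are made up.
X = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 3.0, 5.0],
    [1.0, 5.0, 0.0],
    [1.0, 4.0, 10.0],
    [1.0, 6.0, 5.0],
])
y = np.array([20.0, 28.0, 32.0, 33.0, 40.0])

# Ordinary least squares: each coefficient has a direct business meaning,
# unlike the millions of weights inside a black-box model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, per_ad_k, per_discount_pct = coef
print(f"baseline weekly sales: {intercept:.2f}")
print(f"extra sales per $1k of ads: {per_ad_k:.2f}")
print(f"extra sales per 1% discount: {per_discount_pct:.2f}")
```

Because the coefficients map one-to-one onto domain quantities, a stakeholder can sanity-check the model directly (for example, a negative ad-spend coefficient would immediately flag a problem), which is exactly the kind of validation a black box does not allow.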
Understand nuances specific to the business domain
There has been exploding interest in data science, triggered by its ability to capture, process, and analyze large amounts of data. Data scientists are hired for their diverse skills in mathematics and programming, which they use to test different algorithms and tune hyperparameters so that those algorithms can identify patterns in data with minimal human intervention.
While it may be true that a typical data science pipeline involves a lot of processes, one should be careful not to get lost in the technique and lose track of the business problem at hand. To be an effective data scientist, your focus should be on developing acumen to understand what strategies will work and what will not within a particular domain, given the dataset for that business problem.
Simply by incorporating knowledge of the industry domain, you can significantly improve your model’s accuracy through more efficient use of the available features, with the added benefit of a model that generalizes better to real-world conditions. Sophisticated feature engineering is a difficult but valuable skill that only comes with experience.
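As a hedged illustration of domain-informed feature engineering, the sketch below derives features from a raw transaction date for a hypothetical retail problem. The scenario, function name, and feature choices are invented for this example:

```python
from datetime import date

def sales_features(d: date) -> dict:
    """Derive domain-informed features from a raw transaction date.

    A raw timestamp carries little signal on its own; retail domain
    knowledge suggests weekends and the holiday season behave differently,
    so we encode that knowledge explicitly as features.
    """
    return {
        "day_of_week": d.weekday(),          # 0 = Monday ... 6 = Sunday
        "is_weekend": d.weekday() >= 5,      # weekend shopping patterns differ
        "is_holiday_season": d.month == 12,  # December demand spike
    }

features = sales_features(date(2021, 12, 25))  # a Saturday in December
print(features)
```

A model given these features can learn a weekend or holiday effect from far less data than one that must infer the calendar structure from raw timestamps; that is the practical payoff of encoding domain knowledge.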
To get a better understanding of the domain, talk to the users for whom you are building the product. By understanding their pain points and usage patterns, you can build a solution that resonates with their behaviour.