Here are 4 reasons why…
Table of Contents
- Introduction
- Exploratory Data Analysis
- Stakeholder Collaboration
- Feature Creation
- Mastered Visualizations
- Summary
- References
Introduction
While it may seem obvious that knowing Data Analytics before learning Data Science is key, it might surprise you how many people jump right into Data Science without the right foundation in analyzing and presenting data. There are certain benefits to having an internship, an entry-level position, or really any position in Data Analytics beforehand. This form of experience can also be acquired by completing online courses and specializations in Data Analytics. That being said, if you already have a formal education in Data Science, you most likely covered the foundation of Data Analytics in only one course, which is why it is essential to add a few Data Analytics-focused learnings to your portfolio. However, the best way is to practice Data Analytics with other people, as you will see below when I discuss the top four benefits of mastering Data Analytics before learning Data Science.
Exploratory Data Analysis
As you specialize in Data Analytics, it is no surprise that you become efficient at exploring data. Exploration is usually the first step of the Data Science process, so if you skip practicing it, your model could produce errors, confusion, and misleading results. You must keep in mind that garbage in creates garbage out. Just because you throw a dataset at a Machine Learning algorithm does not mean it will answer the business question at hand.
You will have to find anomalies and missing values in the data, and work through aggregations, transformations, preprocessing, and much more. Understanding the data first is of course important, so being a master at Data Analysis is crucial. There are a few Python (and R) libraries that help do this automatically. However, I often find that with large datasets they take way too long and can crash your kernel, forcing you to restart. That is why it is important to keep a manual eye on the data too. That being said, the library that I will present below has a large-dataset mode that skips some of the more expensive, longer-running computations. The parameter for this situation is set on the profile report of the Pandas Profiling library: minimal=True.
Here is one particular library that is plenty easy to use:
from pandas_profiling import ProfileReport
# df is assumed to be an existing pandas DataFrame
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_widgets()
# or you can do the following
df.profile_report()
Pandas Profiling [3] can be viewed in your Jupyter Notebook. Some of the unique features of this library include, but are not limited to, type inference, unique values, missing values, descriptive statistics, frequent values, histograms, text analysis, and file as well as image analysis.
Other than this library, overall, there are countless ways to practice exploratory data analysis, so if you have not already, find a course and master analyzing data.
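To make the "manual eye" idea concrete, here is a minimal sketch of a few checks you might run by hand with pandas. The toy DataFrame, its column names, and the 1.5 × IQR outlier rule are my own assumptions for illustration, not something prescribed by Pandas Profiling.

```python
import pandas as pd

# Hypothetical toy data standing in for a real table
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "clicks": [12, 15, None, 14, 13, 400],
    "country": ["US", "US", "DE", None, "DE", "US"],
})

# Count missing values per column
missing = df.isna().sum()

# Descriptive statistics for a numeric column
summary = df["clicks"].describe()

# Flag possible anomalies with the common 1.5 * IQR rule (an assumed choice)
q1, q3 = df["clicks"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["clicks"] < q1 - 1.5 * iqr) | (df["clicks"] > q3 + 1.5 * iqr)]

print(missing)
print(summary)
print(outliers)
```

Checks like these stay cheap on large tables because each one touches only the columns you ask about.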
Stakeholder Collaboration
Data Scientists often learn complex Machine Learning algorithms pretty quickly in their education, skipping the important part of communicating with stakeholders to achieve a goal and articulate the Data Science process. If you have not noticed already, you will have to become a master at translating a business use case into a Data Science model. A Product Manager or other stakeholder will not come up to you and ask you to create a supervised Machine Learning algorithm with 80% accuracy. What they will do is tell you about some data and the problem they keep seeing. You will get little guidance on the Data Science itself, which of course is expected, because that is your job. You will have to come up with the idea of regression, classification, clustering, boosting, bagging, etc. You will also have to work with them to set up success criteria (for example, what does an RMSE of 100 mean?) and work out how to translate that into meaningful business terms for stakeholders.
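As a rough illustration of turning an RMSE into something a stakeholder can react to, the sketch below computes RMSE by hand and expresses it relative to the average actual value. The weekly sales numbers and the choice of "share of the average" as the business framing are assumptions made up for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical weekly unit sales and model predictions
actual = [1000, 1200, 900, 1100]
predicted = [1100, 1100, 1000, 1000]

error = rmse(actual, predicted)
mean_actual = sum(actual) / len(actual)
relative = error / mean_actual  # RMSE as a share of typical weekly sales

print(f"RMSE: {error:.0f} units, about {relative:.0%} of average weekly sales")
```

Whether that share is acceptable is exactly the kind of success criterion to agree on with the stakeholder up front.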
So, how can you learn collaboration? Working as a Data Analyst beforehand often requires more collaboration than working as a Data Scientist does. As a Data Analyst, you will create metrics, make visualizations, and develop analytical insights while working with others almost daily, or at least weekly. This practice is vital to becoming a better Data Scientist, as we have seen above.
Benefits of stakeholder collaboration practice through Data Analytics roles:
- business understanding
- problem defining
- success criteria creation
As you can see, collaborating with stakeholders is an important part of both the Data Analyst and Data Scientist positions.
Feature Creation
As a Data Scientist, you will have to perform feature engineering, where you will isolate key features that contribute to the prediction of your model. In school or wherever you learned Data Science, you may have had a perfect dataset already made for you, but in the real world, you will have to use SQL to query your database to start finding the necessary data. In addition to the columns that you already have in your tables, you will have to make new ones. Usually, these are new features that can be aggregated metrics like clicks per user, for example. As a Data Analyst, you will practice SQL the most. As a Data Scientist, it can be frustrating if all you know is Python or R: you cannot rely on Pandas all the time, and as a result, you cannot even start the model building process without knowing how to efficiently query your database. Similarly, the focus on analytics lets you practice creating subqueries and metrics like the one stated above, so that you can add anywhere from a few to, say, 100 new features that you created entirely yourself and that could be more important than the base data you have now.
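Here is a small sketch of what a clicks-per-user feature built with a subquery could look like. It uses an in-memory SQLite database so it runs anywhere; the clicks table, its columns, and the sample rows are hypothetical, and your production database's SQL dialect may differ slightly.

```python
import sqlite3

# In-memory database with a hypothetical click log
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clicks (user_id INTEGER, session_id INTEGER, clicked_at TEXT);
INSERT INTO clicks VALUES
    (1, 10, '2021-01-01'),
    (1, 10, '2021-01-01'),
    (1, 11, '2021-01-02'),
    (2, 20, '2021-01-01');
""")

# The subquery counts clicks per session; the outer query rolls
# those counts up into per-user features.
query = """
SELECT user_id,
       SUM(session_clicks) AS clicks_per_user,
       AVG(session_clicks) AS avg_clicks_per_session
FROM (
    SELECT user_id, session_id, COUNT(*) AS session_clicks
    FROM clicks
    GROUP BY user_id, session_id
) AS per_session
GROUP BY user_id
ORDER BY user_id
"""
features = conn.execute(query).fetchall()

for user_id, clicks_per_user, avg_clicks_per_session in features:
    print(user_id, clicks_per_user, avg_clicks_per_session)
```

Each output row is one user with two new columns that did not exist in the base table, ready to be joined onto a training set.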
Benefits of feature creation:
- ability to perform any SQL query
- improving model accuracy and reducing error
- finding new insights about your data
Mastered Visualizations
A Data Analyst usually will master visualizations because they have to present findings in a way that is easily digestible for others in the company. Having a complex table full of values can be confusing and frustrating to read, so having the ability to highlight important metrics, insights, and results is extremely beneficial to know as a Data Scientist, too. Similarly, when you are finished with your complex Machine Learning algorithm that you have utilized to build your final model, you will be excited to share your results; however, stakeholders will need to know only the highlights and key takeaways.
The best way to do this is through visualization, and here are some of the key tools for creating those visualizations:
- Tableau
- Google Data Studio
- Looker
- Seaborn library
- Matplotlib
Of course, there are more, but these are the ones I see used most often. By articulating insights and results through visualizations, you also help yourself learn the process and takeaways better.
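As one possible way to highlight a key takeaway rather than show a full table, the Matplotlib sketch below grays out every bar except the best one and puts the conclusion in the title. The channel names and conversion rates are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical results to present to stakeholders
segments = ["Email", "Search", "Social"]
conversion_rate = [0.042, 0.067, 0.031]

# Highlight only the best-performing segment; mute the rest
best = conversion_rate.index(max(conversion_rate))
colors = ["lightgray"] * len(segments)
colors[best] = "tab:blue"

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(segments, conversion_rate, color=colors)

# State the takeaway in the title instead of a generic label
ax.set_title(f"{segments[best]} converts best")
ax.set_ylabel("Conversion rate")

# Label each bar with its value so no one has to read the axis
for x, rate in enumerate(conversion_rate):
    ax.text(x, rate, f"{rate:.1%}", ha="center", va="bottom")

fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show()
# or fig.savefig(...) with a path of your choosing.
```

The same idea (one highlighted element, conclusion as the title) carries over to Seaborn or a dashboard tool.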
Summary
So the question is, should you become a Data Analyst first before becoming a Data Scientist? I say yes — or at least some form of it, whether that be an internship, job, a similar job like that of a Business Analyst, or becoming certified in a Data Analytics course. In addition to the four benefits that I have discussed above, another one to highlight is that it could certainly help you to land a job as a Data Scientist if you have the title or experience of Data Analytics on your resume.
To summarize, here are some of the important benefits to becoming a master in Data Analytics first before becoming a Data Scientist:
- Exploratory Data Analysis
- Stakeholder Collaboration
- Feature Creation
- Mastered Visualizations
I hope you found my article both interesting and useful. Please feel free to comment down below if you have become a Data Analyst first in some way before becoming a Data Scientist. Has it helped you in your Data Science career now? Do you agree or disagree, and why?
References
[1] Photo by NEW DATA SERVICES on Unsplash, (2018)
[2] Photo by Lukas Blazek on Unsplash, (2017)
[3] Pandas, Pandas Profiling, (2021)
[4] Photo by DocuSign on Unsplash, (2021)
[5] Photo by Myriam Jessier on Unsplash, (2020)
[6] Photo by William Iven on Unsplash, (2015)