Fourteen Golden Nuggets to Demystify Data Science for Aspiring Data Scientists

Two weeks ago, I spent 7 hours watching the live stream of Demystifying Data Science, a free conference for aspiring data scientists and data-curious business leaders. Designed to provide insight into the training, tools, and career paths of data scientists, the conference was fully interactive, featuring real-time chat, worldwide Q&A, and polling.

On the 1st day, 14 speakers each presented live for 18 minutes before taking questions submitted via the real-time conference chat feature. The talks covered a wide range of topics: from showcasing your work to connecting with data leaders, from telling a persuasive data story to debunking myths in data science. I took detailed notes on all the talks and decided to use this post to share the main takeaways.

1 – LILLIAN PIERSON (FOUNDER OF DATA-MANIA)

In her keynote speech “A Badass’s Guide to Breaking Into Data,” Lillian came up with a 5-step process under the acronym SPECS (Skills, Programming, Exposure, Connections, Showcasing):

To cultivate your skills, you should focus on what you want to do and take inventory of your own skills and experience. By reviewing job descriptions at the companies you're interested in working for, you can take stock of the skills you lack. Then, you can train and practice to fill these skill gaps.
To get programming experience, you can practice at your current job, participate in competition platforms like Kaggle, or freelance on the side for startups.
To increase your visibility (exposure), you must have a personal website and stay active on social media.
Closely related to the point above, you can get connected with data science experts on social media such as Twitter, LinkedIn, and Instagram.
Lastly, to showcase your work, remember to use your website, data blog, and 3rd-party platforms (GitHub, RPubs, Kaggle).

2 – DAVID ROBINSON (CHIEF DATA SCIENTIST AT DATACAMP)

David’s main advice for aspiring data scientists is to start a blog. Why?

You can use it to create a public portfolio.
You can practice your data science skills: data cleaning, machine learning, statistics, data visualization, communication.

What to blog about?

You can analyze a dataset for research, work, public, or personal projects.
You can teach a concept by doing a technical walkthrough on how it works.
You can announce something you built (an application, a software package, a research paper).

Use the #datablog hashtag on Twitter to share your posts with the community!

3 – KATE STRACHNYL (ADVISORY PROGRAM MANAGER AT DELOITTE)

A data visualization expert, Kate outlined the 5 common challenges to overcome in this domain in her talk:

The 1st challenge is selecting the wrong chart. The solution is to define in detail the story you want to tell and the questions you want to answer.
The 2nd challenge is using improper color. The solution is to be aware of the different uses of color (sequential, diverging, categorical, highlight, alert), limit yourself to fewer than 5 colors, and use them to tell a convincing story.
The 3rd challenge is information overload. The solution is to reduce the number of charts to 3 or 4 and avoid adding too much text.
The 4th challenge is clutter and insight detractors. The solution is to remove unnecessary borders, legends, and rulers, and to avoid backgrounds with diverging colors.
The 5th challenge is not speaking the same language. The solution is to learn the company lingo and establish a common glossary with stakeholders.

4 – MARK MELOON (DATA SCIENTIST AT SERVICENOW)

In his talk “Navigating the Maze of the Data Science Job Hunt,” Mark argued that being active on LinkedIn is key. He recommends doing three things there: posting 3rd-party content, creating original content, and writing insightful comments in reply to others.

To get a referral for data science roles, he recommends looking through your 2nd-level LinkedIn connections and reaching out to them with an introduction, a compliment, or a request for advice. Always give value before asking for a favor.

5 – BOB HAYES (PRESIDENT OF BIZ OVER BROADWAY)

Bob ran through a detailed analysis of the practice of data science, covering the people, the skill domains, the processes, and the tools involved.

4 types of people in data science: the researcher, the business manager, the creative, and the developer.
3 skill domains in data science: domain knowledge, tech/programming, and math/statistics.
Communication is the number one skill required for data science.
2 popular data science processes: the scientific one (questions -> hypotheses -> data gathering -> data analysis -> action / communication) and the iterative / agile one (data <=> idea).
Major tools you should know: R, Python, SQL, Excel, TensorFlow, Spark.

Lastly, knowledge of data mining and visualization tools is the key to success in data science.

6 – RENEE TEATE (DATA SCIENTIST AT HELIOCAMPUS)

Renee is the host of the popular podcast “Becoming a Data Scientist.” In her talk “Exploring the Unknown,” she dug deep into exploratory data analysis techniques. A couple of main takeaways from her talk:

Always save a copy of the original data on your local environment.
To explore the data: summarize it statistically, visualize it, ask questions, and dig into the details.
Here’s the common data analysis process: Business Questions -> Data Questions -> Data Answers -> Business Answers.
A few chart types to know: histogram, box plot, horizontal bar chart, row-level details, scatterplot.
Don’t be a perfectionist.
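To make the first two points concrete, here is a minimal sketch of that workflow using only the Python standard library. It is not taken from Renee's talk: the sample numbers, the `summarize` and `text_histogram` helpers, and the bin width are all assumptions made for illustration, and in practice a library such as pandas would likely do this work.

```python
import copy
import statistics
from collections import Counter

# Hypothetical sample data: daily page views (made up for illustration).
raw_views = [120, 135, 128, 400, 131, 127, 119, 133, 125, 130]

# Keep the original untouched and explore a working copy instead.
views = copy.deepcopy(raw_views)


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def text_histogram(values, bin_width):
    """Count values per bin and return {bin_start: count}, sorted by bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


summary = summarize(views)
histogram = text_histogram(views, bin_width=50)

# Summarize statistically.
for key, value in summary.items():
    print(f"{key:>6}: {value}")

# Visualize: a crude text histogram is enough to spot the outlier.
for bin_start, count in histogram.items():
    print(f"{bin_start:>4}-{bin_start + 49:<4} {'#' * count}")
```

The gap between the mean and the median, together with the lone bar far from the rest, is the kind of detail that prompts the next question: is the 400 a real spike or a data error?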

7 – BRANDON ROHRER (DATA SCIENTIST AT FACEBOOK)

I personally enjoyed Brandon’s talk – “How to Get a Foothold in the Field of Data Science” – the most. He stated that data science comprises 4 tasks:

Data Analysis: Domain knowledge, research, variable interpretation.
Data Modeling: Supervised learning, unsupervised learning, custom algorithm development.
Data Engineering: Data management, production, software engineering.
Data Mechanics: Data formatting, value interpretation, data handling.

There are 7 different data archetypes that vary in terms of their skills regarding those tasks above:

Beginner: Someone who is relatively novice in all 4 skills.
Generalist: Someone who is equally proficient in all 4 skills.
Diva: Someone who is strong at 1 skill (no matter what) but hates doing data mechanics.
Detective: Someone who is especially strong in data analysis.
Oracle: Someone who is great at data modeling.
Maker: Someone who loves data engineering.
Unicorn: A non-existent individual who excels at everything.

Brandon’s advice is to try to be a generalist, detective, oracle, or maker, depending on your strengths. When applying for roles, ignore the job titles and look at the skill descriptions instead.

8 – JERRY OVERTON (HEAD OF ADVANCED ANALYTICS RESEARCH AT DXC TECHNOLOGY)

Jerry provided a diverging perspective in his talk “Things They Don’t Tell You In a Data Science Blog Post.” Here are those things:

What the scientific method in data science really looks like.
How an algorithm is developed.
What the truly useful data science skills are.
How the peer review process looks.
Ambiguous predictions about Artificial Intelligence apocalypse.

9 – VIN VASHISHTA (FOUNDER & CHIEF DATA SCIENTIST OF V-SQUARED DATA CONSULTING)

An accomplished consultant in the field, Vin gave some high-level thoughts in his catchy title talk – “You’re a long way from Kaggle Dorothy.” Main takeaways are:

You must learn to say no to perfection and multi-tasking.
Failure is possible – not all business problems can be solved with data science.
Team dynamics can be complicated.
Deadlines that don’t make sense need to be challenged.
Most importantly, take small wins and get business executives’ buy-in.

10 – MICO YUK (CEO & FOUNDER OF BI BRAINZ GROUP)

Mico’s talk was about the art and science of creating an actionable story – aka, data visualization:

Visualization is always better than reports.
Data storytelling requires right brain thinking.
Learn to avoid open-ended questions.

She then showcased a helpful tool called BI Dashboard Formula 6.0 Method, which details a data science process: plan -> shape -> design -> develop -> test -> launch. After the launch step, the method provides a 4-part data narrative including goals, KPIs, trends, and actions.

11 – RANDY LAO (DATA TECHNICIAN AT Q ANALYST LLC)

Randy’s talk “Data Science For All – How to get started” targets mainly newcomers to the field. Here are a few bullet points I gathered:

Learn mathematics: linear algebra, inferential statistics, probability theory, basic calculus.
Learn machine learning: feature engineering, algorithms, data cleaning.
Learn programming: Python for production and R for research.
Familiarize yourself with the OSEMN data pipeline: Obtain, Scrub, Explore, Model, Interpret.
Get skilled up in R, Python, SQL, CoreML + curiosity & communication.
Start participating in Kaggle and join data science communities on Quora, Reddit, LinkedIn.
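The OSEMN pipeline mentioned above can be sketched as five small functions, one per stage. This is only an illustration under stated assumptions, not something shown in Randy's talk: the CSV content, the column names, and the choice of a simple least-squares line for the Model stage are all hypothetical.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for a real data source.
RAW_CSV = """hours_studied,exam_score
1,52
2,58
3,
4,71
5,77
"""


def obtain(text):
    """Obtain: read rows from the CSV text as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop rows with missing values and convert strings to floats."""
    clean = []
    for row in rows:
        if all(value.strip() for value in row.values()):
            clean.append({key: float(value) for key, value in row.items()})
    return clean


def explore(rows):
    """Explore: compute the mean of each column."""
    return {key: statistics.mean([r[key] for r in rows]) for key in rows[0]}


def model(rows):
    """Model: fit exam_score = slope * hours_studied + intercept by least squares."""
    xs = [r["hours_studied"] for r in rows]
    ys = [r["exam_score"] for r in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpret(slope, intercept):
    """Interpret: turn the fitted numbers into a plain-language statement."""
    return (f"Each extra hour studied is associated with about {slope:.1f} "
            f"more points (baseline {intercept:.1f}).")


rows = obtain(RAW_CSV)
clean = scrub(rows)
column_means = explore(clean)
slope, intercept = model(clean)
print(interpret(slope, intercept))
```

Keeping each stage as its own function makes it easy to swap one out, for example replacing the toy model with a real machine learning library, without touching the rest.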

12 – ERIC WEBER (PRINCIPAL DATA SCIENTIST AT CORELOGIC)

Eric argued that people are the key to a successful career in data science. Treating yourself as a product, he came up with a people development strategy:

Develop relationships early and do not stay in isolation.
Focus on connecting with your team, thought leaders, and colleagues.
Attach measurable outcomes to this strategy (people connected, value-added, events participated).

13 – SARAH NOORAVI (MARKETING ANALYST AT MOBILITYWARE)

I also thoroughly enjoyed Sarah’s talk on the 5 pillars of data science. She began by outlining 7 misconceptions associated with the field:

No.1: you need to have a CS or Math degree.
- The truth is, you only need to be a problem solver with diverse experience and domain knowledge.
No.2: a Ph.D. degree is required.
- In fact, data science is a broad field that spans both research and analytics. A Ph.D. is needed only if you want to go into research.
No.3: you need to know big data tools like MapReduce, Hadoop, and Spark.
- You only need to have exposure to them, as companies allow you to learn such tools on the job. The more important skills are Python/R and SQL.
No.4: you must be a master of all facets of data science (aka, the unicorn archetype).
- Instead, focus on a few core skills to get specialized, and learn others on the job.
No.5: complicated results equal more value.
- Nothing could be further from the truth. Complexity does not guarantee better results or better business value.
No.6: data science is all about data modeling.
- The reality is that data is messy and 80% of your job is dedicated to data gathering and processing.
No.7: all positions in a data science team are created equal.
- Everything is subjective.

Lastly, the 5 pillars of data science you need to know are coding, analytics, communication, creativity, and domain knowledge.

14 – JAKE VANDERPLAS (DIRECTOR OF OPEN SOFTWARE AT UNIVERSITY OF WASHINGTON)

Jake decided to skip the big-picture topics and instead dug deep into declarative data visualization with Altair (an open-source software tool he developed). According to him, the major building blocks of visualization are data, transformation, marks, encoding, scale, and guides. The implementation of these blocks is what drives data conceptualization.

You probably know that Matplotlib is one of the most popular visualization tools in the Python ecosystem. However, as an imperative tool, it mixes the “what” with the “how.” Thus, Jake built Altair, a declarative visualization tool, to separate those tasks. Altair has a grammar-based declarative API and is entirely browser-based.
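To give a feel for what "declarative" means here, the sketch below writes a chart specification by hand as a plain Python dictionary in the Vega-Lite JSON format, which is the format Altair generates under the hood. It is not code from Jake's talk: the data, the field names, and the filter threshold are made up for illustration, and the spec is built manually so that it runs without Altair installed. With Altair itself, a comparable chart would be written roughly as `alt.Chart(data).mark_point().encode(x=..., y=..., color=...)`.

```python
import json

# Hypothetical data, made up for illustration.
records = [
    {"hours": 1, "score": 52, "group": "A"},
    {"hours": 2, "score": 58, "group": "A"},
    {"hours": 4, "score": 71, "group": "B"},
    {"hours": 5, "score": 77, "group": "B"},
]

# The spec states WHAT to draw; it contains no drawing loops or axis calls.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    # Data: the records to plot.
    "data": {"values": records},
    # Transformation: keep only rows above a (hypothetical) threshold.
    "transform": [{"filter": "datum.score > 50"}],
    # Marks: one point per record.
    "mark": "point",
    # Encoding: map data fields to visual channels.
    "encoding": {
        "x": {"field": "hours", "type": "quantitative"},
        # Scale: do not force the y axis to start at zero.
        "y": {"field": "score", "type": "quantitative",
              "scale": {"zero": False}},
        # Guides: a titled legend for the color channel.
        "color": {"field": "group", "type": "nominal",
                  "legend": {"title": "Group"}},
    },
}

# The spec is plain JSON, so a browser-side renderer can consume it.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Each of the six building blocks appears as its own key, and the "how" of rendering is left entirely to whatever consumes the JSON.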