How To Showcase Your Apache Spark Skills In An Interview

Figure 0. Spark UI

How do you prove your capability as an Apache Spark data scientist when you don’t have much to show? Perhaps this is because you don’t have much experience. Perhaps it is because the work you do does not lend itself to being shown to others. You may have done impressive work, but have nothing to show for it because most developers work for companies that don’t publish their code as open source.

In this article I will show you how to showcase your skills, and how doing so teaches you new, relevant skills along the way. I’ll show you how publishing your work using self-explanatory visualizations can separate you from the pack. I’ll also show how to select a coding project that demonstrates skills that are immediately applicable in a production setting. The skills you acquire by setting this up are valuable for the coding challenge portion of the interview and transfer directly to the job itself.

A candidate with well-organized, nontrivial open source code will be prioritised over similarly experienced candidates because their competencies are easier to evaluate. By providing visibility you simplify the job of the recruiters and interviewers while preparing yourself for the interview and learning skills that will be useful on the job.

Publish your work

Publishing a code project is a great way to introduce yourself to an interviewer. It makes the interviewer’s job easier by giving them context on what you already know, and it provides talking points for many styles of interview questions, including behavioral questions, situational questions, coding questions, and system design questions.

As an example, take a Python repo I published on GitHub during my role as a mentor for a robotics team. Use a readme to explain your project at a high level in a way that also allows a developer to quickly test your code. Use the repo wiki to elaborate further.

The wiki is a great place to provide visuals that allow readers to quickly get a sense of what your project does. Figure 1 and Figure 2 are two of the most visually appealing visualizations that give the reader an idea at a glance of what is being done and how, and draw the reader in to learn more.

You can use the readme page or the wiki home page as a starting point to link to other pages that delve further into background knowledge you needed to provide context for what you were trying to accomplish and why your approach is sensible.

For example, the Background page of this repo gives a brief primer in the section on Alternative Approaches. The next sections explain the why, what, and how of the project. This gives an interviewer ample opportunity to explore your breadth of understanding of a field of interest.

The Related Approaches page of the repo gives a brief primer on the specific techniques that were considered, which one was selected, and why. This gives an interviewer an opportunity to explore your depth of understanding of specific techniques.

The Planning Stages page of the repo gives an overview of what the code actually does. It links to Python notebooks containing visualizations, such as this sample trajectories notebook. This allows an interviewer to review your code and see how it behaves on actual data. Note the use of visualizations.

The test page of the repo showcases data from tests designed to exercise the code under realistic conditions and demonstrate its use of compute resources. This type of presentation demonstrates a data-driven mindset.

You probably won’t need all of these pages. Pick and choose the ones most suitable for showcasing your competencies and that you can discuss intelligently.

With these, an interviewer can explore your skills as a developer, such as your ability to:

  • Decompose a novel problem
  • Design a system
  • Set up a new code repository
  • Organize a code repository
  • Write performant code
  • Visualize key results
  • Communicate complex ideas
  • Write effective documentation

Providing these assets gives the interviewer an opportunity to derive behavioral questions, and gives you ample content to answer them. You can describe a time you encountered a certain type of situation and how you approached it, and refer to the published solution. Use the STAR technique: describe the situation (S), distill it down to a single task (T), explain the key action you used to solve it (A), and conclude with the result (R).

This approach also provides code screeners example code that can be used as a starting point for more in depth exploration of your skills. Instead of selecting a random problem, they may choose a coding challenge that is within your wheelhouse.

To use this approach effectively, you must know every line of content in the repo. The best way to achieve this is to select a code project of your own choice, solve it by your own design, and develop it step by step, from start to finish, for yourself. Once you get started this may not take as long as you expect, and seeing some good examples makes getting going much easier. Review code projects and learn how to distinguish effective ones from half-baked ones. Pick one or two to serve as a role model for your own design.

Ideas for a coding project

Many new developers make the mistake of picking code projects similar to those they encountered in an academic environment. These are frequently not relevant to a real-world job as a commercial software developer. Interviewers want to see that you are familiar with the heterogeneous aspects of a development task, can handle complexity, and pay attention to the kind of detail that is not taught in a university environment, such as how to handle the messiness of real-world data.

Your mission is to showcase your ability to apply Apache Spark to a nontrivial dataset. As a general rule of thumb: if you didn’t need to perform any data cleansing or preprocessing on the dataset, the data wasn’t messy enough to be realistic. If your solution ran in a few seconds, it was not challenging enough to test your algorithms. If it didn’t exhaust at least half of the memory available to your system and run the fan to cool off the CPU, then you could probably increase the complexity of the task, either by tackling a larger dataset or by taking on a more challenging problem based on the dataset.

Here are some general ideas for code projects.

Migrate an existing solution to Spark

In this case, a solution to the problem already exists. Your mission is to migrate that solution to Apache Spark and compare the results. Many such datasets and associated solutions exist online. You are welcome to try your hand at one of mine, the FIRST robotics motion planner problem.

Use Spark to analyze a nontrivial data set

By nontrivial here I mean something that takes more than a few seconds to process. If the data set does not push the compute resources of your development environment, you may never notice that you are using an inefficient solution instead of a performant one. With a nontrivial dataset you can showcase the difference between an inefficient algorithm and a performant algorithm. It also makes the results more interesting, and gives you more to talk about.
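As a rough illustration of that difference, the sketch below times two ways of counting words in plain Python and checks that they agree. The function names and the synthetic word list are made up for this example, and the input is deliberately small; on a real corpus the gap would be much wider, which is the point of choosing a nontrivial dataset.

```python
import time
from collections import Counter


def count_words_slow(words):
    """Quadratic approach: rescans the whole list once per distinct word."""
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = words.count(word)  # full scan of the list each time
    return counts


def count_words_fast(words):
    """Linear approach: a single pass with a hash-based counter."""
    return dict(Counter(words))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Synthetic stand-in for a real corpus: 20,000 tokens, 500 distinct words.
words = [f"w{i % 500}" for i in range(20_000)]

slow_counts, slow_seconds = timed(count_words_slow, words)
fast_counts, fast_seconds = timed(count_words_fast, words)

print(f"slow: {slow_seconds:.4f}s  fast: {fast_seconds:.4f}s")
print("same result:", slow_counts == fast_counts)
```

Reporting both the timings and the equality check in your wiki shows that the faster version is a drop-in replacement and not a different computation.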

My Apache Spark SQL course showcases a code project based around a 6.5 MiB dataset containing 1,095,695 words, 128,467 lines, and 41,762 distinct words. The analyses it uses are customized for this dataset. The analyses you choose for your own dataset also indicate your ability to pose interesting questions, create queries for answering those questions, and implement those queries efficiently on a large dataset.
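Summary figures like these are a good first thing to compute for any text dataset you pick. Here is a minimal plain-Python sketch of how line, word, and distinct-word counts might be gathered. The `corpus_stats` helper and its tokenization rule (lowercase runs of letters and apostrophes) are assumptions for illustration, not the method used in the course, so counts from a different tokenizer will differ.

```python
import re


def corpus_stats(lines):
    """Count lines, words, and distinct words in an iterable of text lines."""
    n_lines = 0
    n_words = 0
    distinct = set()
    for line in lines:
        n_lines += 1
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", line.lower())
        n_words += len(tokens)
        distinct.update(tokens)
    return {"lines": n_lines, "words": n_words, "distinct_words": len(distinct)}


# Tiny made-up sample; in practice you would pass an open file object.
sample = ["the quick brown fox", "jumps over the lazy dog", "The end"]
stats = corpus_stats(sample)
print(stats)
```

Because the function accepts any iterable of lines, the same call works on a file opened with `open(path, encoding="utf-8")` without loading the whole corpus into memory.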

Compare Spark with an alternative computing platform

In this approach, you select a data set and perform the same analysis of it using two different programming languages or computing platforms. For example, you might first use scikit-learn, numpy, or pandas, and then repeat the analysis using Apache Spark. Or you might compare and contrast Hive and Spark.

I performed a similar exercise myself. Some time later, Databricks published the results of a similar study on their blog, Benchmarking Apache Spark on a Single Node Machine.

Use Spark along with a cloud API

Cloud APIs provide powerful means of handling large datasets for certain applications; however, preparing the data for upload to the cloud API may require substantial preprocessing. Cloud APIs can also generate a lot of log data. For example, I evaluated the IBM Watson and Google Cloud speech-to-text cloud APIs, and then compared the results (cf. on BLEU scores and transcription rates, and on transcription rate for noisy recordings) by analyzing the log data. In this case, I used SQLite to run the queries. When I redo this I plan to use Spark SQL instead.
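To make the log-analysis step concrete, here is a minimal sketch of that kind of query using Python's built-in `sqlite3` module. The table name, columns, service labels, and rows are all invented for illustration, and "rate" is assumed here to mean seconds of audio transcribed per second of processing; the actual logs and metrics from my evaluation looked different.

```python
import sqlite3

# In-memory database with a hypothetical log schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transcription_log ("
    "service TEXT, audio_seconds REAL, processing_seconds REAL)"
)

# Made-up log rows: (service, length of audio, time taken to transcribe).
rows = [
    ("service_a", 60.0, 30.0),
    ("service_a", 120.0, 40.0),
    ("service_b", 60.0, 20.0),
    ("service_b", 90.0, 45.0),
]
conn.executemany("INSERT INTO transcription_log VALUES (?, ?, ?)", rows)

# Aggregate per service: request count and overall transcription rate.
query = """
    SELECT service,
           COUNT(*) AS requests,
           SUM(audio_seconds) / SUM(processing_seconds) AS rate
    FROM transcription_log
    GROUP BY service
    ORDER BY service
"""
summary = conn.execute(query).fetchall()
for service, requests, rate in summary:
    print(f"{service}: {requests} requests, {rate:.2f}x real time")

conn.close()
```

The query uses only standard aggregates and `GROUP BY`, so a statement of this shape should carry over to Spark SQL with little change once the log data is registered as a table or view.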

This type of project gives you even more to talk about: how to integrate Spark with a cloud API, which operations are suitable for Spark and which are better done within the cloud API, and which post-processing analytics steps Spark is especially suited for. Provide visuals where meaningful, e.g. Figure 5. Your wiki can also show resource consumption with screenshots of the Spark UI, as shown in Figure 0.


Use Spark to extract training features from a data set

Many data science jobs require the ability to train statistical models based on feature data gleaned from raw data. There is an abundance of data that you could use to demonstrate your ability to perform this.

My Apache Spark SQL course shows how to extract moving-window n-tuples from a text corpus. Figure 4 illustrates this for 4-tuples, though it is easily generalizable to arbitrary length n-tuples. This would provide a good starting point for a feature extractor that vectorizes this into a form that can be provided as input to a neural network model.
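The moving-window idea itself is small enough to sketch in plain Python. The `sliding_ntuples` helper below is a hypothetical illustration of the concept on a single list of tokens, not the Spark SQL implementation from the course.

```python
def sliding_ntuples(tokens, n):
    """Return every run of n consecutive tokens, in order, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields no windows (empty list).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox jumps over the lazy dog".split()
four_tuples = sliding_ntuples(tokens, 4)

print(four_tuples[0])    # first window of four tokens
print(len(four_tuples))  # number of windows for 9 tokens and n = 4
```

Changing `n` generalizes this to tuples of any length; turning each tuple into a numeric vector for a model is a separate step that depends on the vocabulary and encoding you choose.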

The following are examples of modeling tasks and modeling approaches that are relevant to many data science or machine learning developer roles.

Examples of modeling tasks:

  • Statistically improbable phrases
  • Anomaly detection
  • Topic modeling
  • Recommender
  • Trend analysis

Examples of relevant models:

  • Approximate K Nearest Neighbors
  • Alternating Least Squares
  • K-means clustering
  • Streaming

If you can generate features for one of these types of models or tasks end-to-end on your own, you can credibly claim to be able to handle other modeling tasks.


There are many ways to learn Apache Spark and apply it realistically to nontrivial datasets; however, when it comes to interviewing it still comes down to communicating your understanding effectively. You can accomplish much of this in advance, while sharpening your skills, by tackling a code project, publishing it, and writing up key results in a visually appealing way.

Learn more: for additional tips and project ideas, see Apache Spark SQL on Experfy.
