Spark is to SQL what calculus is to algebra.
Old-timer Just Keeps on Tickin’
SQL has been around in commercial form for almost 40 years, since 1979. That's when Relational Software, Inc. (which later became Oracle) released Oracle version 2, a marketing renaming of what was really version 1.
Think about that for a second. A hotshot, fresh-out-of-undergrad, SQL-skilled new hire from back then would be retiring in just a couple of years. Some people still build their careers on SQL skills. In other words, SQL has had incredible staying power. It still does.
Enter the Young Gun
So what does this have to do with AI? Well, I'm going to go out on a limb and say that we're three years into a similar journey with another landscape-shifting technology: Spark. Spark 1.0 was released as an Apache project in May 2014. I happened to be a fresh hire (albeit a PhD, not an undergrad) and Spark was HOT. I mean, it was exactly what our company (and many others) needed, and every release just got better.
Here are a few reasons I believe Spark will stay meaningful through the years.
1. Daddy Warbucks Got Yer Back
SQL was backed by a strong company (read: it had commercial support) while also benefiting from the open-source efforts of outside contributors, and it was eventually standardized. Spark has the commercial support of Databricks, which is valued at nearly $1B only a few years into its existence. And as an Apache project, it is developed at an extremely rapid clip by a vibrant open-source community.
What's probably even more important is that it is used at so many large companies. In other words, it has weaseled its way into the core toolset of much of the world's GDP. And that's just the beginning, because according to Databricks, they're still working on reaching the other 99%.
2. One Stop ML Shop
One of the things that made Spark really great was that it could pass HQL through, and soon it had its own SQL-like interface, Spark SQL. For the first time, data wrangling and machine learning could be executed on big data in one place, in well-known languages, at extraordinary speed. It was the holy grail of big data science.
Spark meets two primary needs:
- Easy data wrangling (in a familiar approach: SQL)
- Many of your favorite machine learning algorithms at scale.
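Here is a minimal sketch of what "wrangling and ML in one place" looks like in PySpark. The file name `events.parquet` and its columns (`user_id`, `amount`, `n_visits`, and a 0/1 `churned` flag) are hypothetical, made up for illustration.

```python
# A minimal sketch: SQL-style wrangling and MLlib modeling on the same data,
# assuming a hypothetical events.parquet with user_id, amount, n_visits, churned (0/1).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("wrangle-and-model").getOrCreate()

# Data wrangling in plain SQL: register the raw data as a view and query it.
spark.read.parquet("events.parquet").createOrReplaceTempView("events")
features = spark.sql("""
    SELECT user_id,
           SUM(amount)   AS total_spend,
           AVG(n_visits) AS avg_visits,
           MAX(churned)  AS label
    FROM events
    GROUP BY user_id
""")

# Machine learning at scale on the same DataFrame, with no export step in between.
assembler = VectorAssembler(inputCols=["total_spend", "avg_visits"],
                            outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label") \
            .fit(assembler.transform(features))
```

The point is not the particular model; it's that the wrangling and the fitting happen in one session, on one cluster, against the same DataFrame.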
In other words, SQL's staying power, and its natural way of thinking about data, is what will help Spark have staying power too. Yes, many data stores today are NoSQL, but the fact that you can use non-relational databases and still think about the data in them as if they were relational is what makes Spark so powerful. Notice that SQL is still the reference point here: whole categories of databases are named by their relation to SQL, and that's saying something.
Spark can handle pretty much any data store you throw at it, and data scientists can use a common way of thinking about data (SQL) to work with it. You don't have to use the SQL-like interface, but it's there, and many take advantage of it. Don't care for the SQL/HQL approach? That's fine; you can use Spark the way many use bash for data wrangling. Spark spans many skill levels.
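To make that concrete, here is a rough PySpark sketch of the same wrangling done two ways, relationally with SQL and as a chained pipeline. The files `users.json` and `orders.parquet`, and their `user_id`/`country` columns, are hypothetical.

```python
# A sketch of "any data store, SQL optional": two sources, two styles of wrangling.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("polyglot-wrangling").getOrCreate()

users = spark.read.json("users.json")          # semi-structured source
orders = spark.read.parquet("orders.parquet")  # columnar source

# Option 1: think about the data relationally, in SQL.
users.createOrReplaceTempView("users")
orders.createOrReplaceTempView("orders")
top_sql = spark.sql("""
    SELECT u.country, COUNT(*) AS n_orders
    FROM orders o JOIN users u ON o.user_id = u.user_id
    GROUP BY u.country
    ORDER BY n_orders DESC
""")

# Option 2: skip SQL entirely and chain transformations, pipeline-style.
top_api = (orders.join(users, "user_id")
                 .groupBy("country")
                 .agg(F.count("*").alias("n_orders"))
                 .orderBy(F.desc("n_orders")))

top_sql.show()
top_api.show()
```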
3. It Feels Familiar
Because Spark has a machine learning library, you can use it much as you would familiar data science languages like R and Python. The usefulness here goes beyond syntax; it's the process that makes it so user-friendly.
Interactively playing with and exploring data is one of the most powerful parts of R and Python. You can very quickly start to peel back the layers and find the stories within the data. Before Spark, that process was painful and slow (sorry, MapReduce).
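For a sense of what that interactive loop feels like in Spark, here is a small PySpark sketch you might type into the shell or a notebook. The `flights.csv` file and its `carrier`, `dep_delay`, and `distance` columns are made up for illustration.

```python
# A sketch of interactive exploration, R/pandas-style, in the PySpark shell.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("explore").getOrCreate()
df = spark.read.csv("flights.csv", header=True, inferSchema=True)

df.printSchema()                             # what columns and types do we have?
df.describe("dep_delay", "distance").show()  # quick summary statistics
(df.groupBy("carrier")                       # peel back a layer: delays by carrier
   .agg(F.avg("dep_delay").alias("avg_delay"))
   .orderBy(F.desc("avg_delay"))
   .show(10))
```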