In the 19th century, doctors might have prescribed mercury for mood swings and arsenic for asthma. It might not have occurred to them to wash their hands before your surgery. They weren’t trying to kill you, of course—they just didn’t know any better.
These early doctors had valuable data scribbled in their notebooks, but each held only one piece of a grand jigsaw puzzle. Without modern tools for sharing and analyzing information, and without a science for making sense of that data, there wasn't much to stop superstition from filling in the gaps around the keyhole of facts each doctor could observe.
Humans have come a long way with technology since then, but today’s boom in machine learning (ML) and artificial intelligence (AI) isn’t really a break with the past. It’s the continuation of the basic human instinct to make sense of the world around us so that we can make smarter decisions. We simply have dramatically better technology than we’ve ever had before.
One way to think of this pattern through the ages is as a revolution of data sets, not data points. The difference isn't trivial. Data sets helped shape the modern world. Consider the scribes of Sumer (modern-day Iraq), who pressed their styluses to tablets of clay more than 5,000 years ago. When they did so, they invented not just the first system of writing, but the first data storage and sharing technology.
If you’re inspired by the promise of AI’s better-than-human abilities, consider that stationery gives us superhuman memory. Though it’s easy to take writing for granted today, the ability to store data sets reliably represents a ground-breaking first step on the path to higher intelligence.
Unfortunately, retrieving information from clay tablets and their pre-electronic cousins is a pain. You can’t snap your fingers at a book to get its word count. Instead, you’d have to upload every word into your brain to process it. This made early data analysis time-consuming, so initial forays into it stuck to the essentials. While a kingdom might analyze how much gold it raised in taxes, only an intrepid soul would try the same line of effortful reasoning on an application like, say, medicine, where millennia of tradition encouraged just winging it.
Luckily, our species produced some incredible pioneers. For example, John Snow's map of deaths during the 1854 cholera outbreak in London inspired the medical profession to reconsider the superstition that the disease was caused by miasma (toxic air) and to start taking a closer look at the drinking water.
If you know “The Lady With The Lamp,” Florence Nightingale, for her heroic compassion as a nurse, you might be surprised to learn that she was also an analytics pioneer. Her inventive infographics during the Crimean War saved many lives by identifying poor hygiene as a leading cause of hospital deaths and inspiring her government to take sanitation seriously.
The one-data set era took off as the value of information began to assert itself in a growing number of fields, leading to the invention of the computer. No, not the electronic buddy you’re used to today. “Computer” started out as a human profession, with its practitioners performing computations and processing data manually to extract its value.
The beauty of data is that it allows you to form an opinion out of something better than thin air. By taking a look at information, you’re inspired to ask new questions, following in the footsteps of Florence Nightingale and John Snow. That’s what the discipline of analytics is all about: inspiring models and hypotheses through exploration.
From Data Sets To Data Splitting
In the early 20th century, a desire to make better decisions under uncertainty led to the birth of a parallel profession: statistics. Statisticians help you test whether it’s sensible to behave as though the phenomenon an analyst found in the current data set also applies beyond it.
A famous example comes from Ronald A. Fisher, who wrote one of the founding textbooks of modern statistics. Fisher describes performing a hypothesis test in response to a colleague's claim that she could taste whether the milk or the tea had been poured into the cup first. Hoping to prove her wrong, he was instead forced by the data to conclude that she could.
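For the curious, here is a minimal sketch of the arithmetic behind that test, assuming the classic design of eight cups (four with milk poured first, four with tea poured first) and the scipy library; nothing about the code is specific to Fisher's original write-up beyond those well-known details.

```python
# A minimal sketch of the "lady tasting tea" test, assuming the classic
# eight-cup design: four cups with milk poured first, four with tea first,
# and a taster who labels all eight correctly.
from scipy.stats import fisher_exact

#               guessed milk-first   guessed tea-first
table = [[4, 0],   # cups that truly had milk poured first
         [0, 4]]   # cups that truly had tea poured first

# One-sided Fisher's exact test: how likely is a result this good by luck alone?
_, p_value = fisher_exact(table, alternative="greater")
print(f"p = {p_value:.3f}")  # roughly 0.014, about 1 chance in 70
```

A result that unlikely under pure guessing is exactly the kind of evidence that forces a skeptic to update his opinion.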
Analytics and statistics have a major Achilles’ heel: If you use the same data point for hypothesis generation and for hypothesis testing, you’re cheating. Statistical rigor requires you to call your shots before you take them; analytics is more a game of advanced hindsight. They were almost tragicomically incompatible, until the next major revolution—data splitting—changed everything.
Data splitting is a simple idea, but to a data scientist like myself, it’s one of the most profound. If you have only one data set, you must choose between analytics (untestable inspiration) and statistics (rigorous conclusions). The hack? Split your data set into two pieces, then have your cake and eat it too!
The two-data set era replaces the analytics-statistics tension with coordinated teamwork between two different breeds of data specialist. Analysts use one data set to help you frame your questions, then statisticians use the other data set to bring you rigorous answers.
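To make the idea concrete, here is a minimal sketch of a two-way split, assuming pandas and a hypothetical DataFrame named df; the 50/50 ratio is illustrative, not a rule.

```python
# A minimal sketch of two-way data splitting, assuming pandas and a
# hypothetical DataFrame `df`. One half is for open-ended exploration
# (analytics); the other is locked away for a pre-registered test (statistics).
import pandas as pd

def two_way_split(df: pd.DataFrame, seed: int = 42):
    """Shuffle the rows and deal them into exploration and confirmation halves."""
    exploration = df.sample(frac=0.5, random_state=seed)   # analysts play here
    confirmation = df.drop(exploration.index)              # statisticians test here
    return exploration, confirmation
```

The point of the shuffle and the fixed seed is simply that neither half gets cherry-picked: the confirmation half stays untouched until you have called your shot.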
Such luxury comes with a hefty price tag: quantity. Splitting is easier said than done if you’ve struggled to scrape together enough information for even one respectable data set. The two-data set era is a fairly new development that goes hand-in-hand with better processing hardware, lower storage costs and the ability to share collected information over the internet.
In fact, the technological innovations that led to the two-data set era rapidly ushered in the next phase, a three-data set era of automated inspiration. There’s a more familiar word for it: machine learning.
Using a data set destroys its purity as a source of statistical rigor. You only get one shot, so how do you know which “insight” from analytics is most worthy of testing? Well, if you had a third data set, you could use it to take your inspiration for a test drive. This screening process is called validation; it’s at the heart of what makes machine learning tick.
Once you’re free to throw everything at the validation wall and see what sticks, you can safely let everyone have a go at coming up with a solution: seasoned analyst, intern, tea leaves and even algorithms with no context about your business problem. Whichever solution works best in validation becomes a candidate for the proper statistical test. You’ve just empowered yourself to automate inspiration!
Automated Inspiration
This is why machine learning is a revolution of data sets, not just data. It depends on the luxury of having enough data for a three-way split.
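Here is a minimal sketch of that three-way split and validation screening, assuming scikit-learn; the synthetic data and the two candidate models are stand-ins, not recommendations.

```python
# A minimal sketch of validation-based screening, assuming scikit-learn.
# The data and both candidate models are placeholders; in practice candidates
# can come from analysts, interns, tea leaves or context-free algorithms.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Three-way split: train / validation / test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Throw everything at the validation wall and see what sticks.
val_scores = {name: model.fit(X_train, y_train).score(X_val, y_val)
              for name, model in candidates.items()}
winner = max(val_scores, key=val_scores.get)

# Only the winner earns a single evaluation on the untouched test set.
print(winner, candidates[winner].score(X_test, y_test))
```

The test set is touched exactly once, by exactly one model, which is what keeps the final verdict statistically honest.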
Where does AI fit into the picture? Machine learning with deep neural networks is technically called deep learning, but it got another nickname that stuck: AI. Although AI once had a different meaning, today you’re most likely to find it used as a synonym for deep learning.
Deep neural networks earned their hype by outclassing less sophisticated ML algorithms on many complex tasks. But they require far more data to train and more processing power than a typical laptop can offer. That's why the rise of modern AI is a cloud story; the cloud allows you to rent someone else's data center instead of committing to building your own deep learning rig, making AI a try-before-you-buy proposition.
With this puzzle piece in place, we have the full complement of professions: ML/AI, analytics and statistics. The umbrella term for all of them is data science, the discipline of making data useful.
Modern data science is the product of our three-data set era, but many industries routinely generate more than enough data. So is there a case for four data sets?
Well, what's your next move if the model you just trained gets a low validation score? If you're like most people, you'll immediately demand to know why! Unfortunately, there's no data set you can ask. You might be tempted to go sleuthing in your validation data set, but debugging there breaks its ability to screen your models effectively.
By subjecting your validation data set to analytics, you’re effectively turning your three data sets back into two. Instead of finding help, you’ve unwittingly gone back an era!
The solution lies outside the three data sets you’re already using. To unlock smarter training iteration and hyperparameter tuning, you’ll want to join the cutting edge: an era of four data sets.
If you think of the other three data sets as giving you inspiration, iteration and rigorous testing, then the fourth fuels acceleration, shortening your AI development cycle with advanced analytics techniques geared toward clues about which approaches to try on each round. By embracing four-way data splitting, you'll be in the best position to take advantage of data abundance! Welcome to the future.
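For concreteness, here is a minimal sketch of such a four-way split, assuming numpy; the names and proportions are hypothetical, chosen only to show the idea of reserving a separate slice for debugging and iteration.

```python
# A minimal sketch of four-way data splitting, assuming numpy. Names and
# proportions are illustrative only:
#   train      -- fit candidate models
#   debug      -- explore errors freely to guide the next iteration (acceleration)
#   validation -- screen candidates without contaminating the screen
#   test       -- one final, rigorous evaluation of the chosen model
import numpy as np

def four_way_split(n_rows: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)       # shuffled row indices
    train, debug, validation, test = np.split(
        shuffled, [int(0.6 * n_rows), int(0.75 * n_rows), int(0.9 * n_rows)]
    )
    return train, debug, validation, test

# Example: train_idx, debug_idx, val_idx, test_idx = four_way_split(10_000)
```

The exact proportions matter far less than keeping the four roles cleanly separated.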