It was in the first chapter of your book, in the first lecture you attended, or in the first tutorial you watched. It seems simple: you do not measure your predictor’s performance on the data you used for training. However, as management keeps piling on pressure, or your stack takes longer and longer to run, you start to sacrifice validation. This whole story is about that one very basic thing: validation.
What is Validation?
You can skip this section if you already know the answer. If you are not sure, let me explain it roughly. Let us say you want a predictor that performs a specific prediction task. After you get your predictor, you will want to know how well it performs. Does it perform so poorly that it is useless, or so well that you can declare the problem solved? In order to measure the performance, you need validation.
In order to perform validation, you need data. More specifically, you need data with the information that you want to predict. We call this information the ground truth. According to Wikipedia, ground truth is “information provided by direct observation”. Ground truth is usually provided by humans. In our context, the ground truth is the actual value that we want to predict for each data point.
The adventure of validation begins once you have both your predictor and the data with ground truth. If the data with ground truth was not part of your development process, validation is easy: you just compare the predictions with the ground truth. However, this situation is rare. You rarely lock up your whole dataset and never peek into it until the very end. Very often you use this data while developing your predictor: you use some of it for training, some of it for tuning thresholds, or you peek into it to get ideas about how to build a good predictor. In this case, you need to think about how to establish a solid and valid validation mechanism.
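To make this concrete, here is a minimal holdout-validation sketch in Python. It assumes scikit-learn is available, and the dataset and model are placeholders rather than anything specific to your own task:

```python
# Minimal holdout-validation sketch (assumes scikit-learn is installed).
# The dataset and model are placeholders for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # y is the ground truth

# Keep a held-out set that the training process never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Performance is measured on data that was NOT used for training.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The key point is the split itself: the held-out portion plays the role of the unseen, ground-truth-labeled data that you compare your predictions against.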
Why Are We Doing Validation?
It seems simple: we are doing validation because we want to measure the performance of our predictor. But let us not be fooled by this off-the-shelf definition. We do not just want to measure the performance. We want to know if our predictor will do a good job in a real-life setting.
Let us establish a prediction task example. In this example, we want to know if an image has a cat in it. In this task, you feed an image to the predictor, and it tells you whether there is a cat in the image or not. So by definition, it is a binary classification problem.
For this task, you can find a dataset or prepare your own. After that, you can go ahead and start developing your predictor. But you should not forget why you are developing a predictor. You do not want to detect cats in your dataset. You want a predictor that detects cats when you pick up your camera and take a picture of a cat in your local park. The dataset on your hard disk is there to help you with validation, so you do not have to chase cats in your local park all the time. But with this comfort, we often forget about the real goal and shoot for the grand prize: a higher KPI (key performance indicator) value!
We are doing validation so that it gives us a good idea of how our predictor will work in a real-life setting. Your KPI is there to help you; it is not your actual goal. We need to keep this in mind when developing our predictors.
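As a hypothetical illustration of what such a KPI might look like for the cat task, here is a tiny scoring sketch; the labels and predictions are invented, and scikit-learn's metric functions are just one convenient way to compute them:

```python
# Hypothetical illustration: scoring a binary "cat / no cat" predictor
# against held-out ground truth. The labels below are invented.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth: 1 = cat present
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predictor output on the same images

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```

Whichever metric you pick as your KPI, remember that it only approximates how the predictor will behave on pictures it has never seen.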
When You Forget About Validation
What happens in a scenario where you forget or fail to establish a valid validation mechanism? Let me paint you a possible outcome of this scenario.
You work on your project. You focus on getting your KPI better and better. You work hard on it. You get more data, you do your research on different methods, you run a lot of experiments. In the end, you get a performance that looks great. Mission accomplished! You are now ready to present your work and deploy it.
Everybody is happy at this point. You present it to your team, to management, or to whoever you work with. If you are lucky, at this point somebody will catch the mistake you made. If you are lucky, somebody will tell you that you failed to establish a valid validation mechanism. This is not the worst-case scenario: you have not deployed your predictor yet, so nobody is affected by the mistake. You just need to go back to the drawing board and introduce a valid validation mechanism. But if you are not lucky, you will march into a bad situation.
In the worst case, you deploy your predictor and it goes live. After a while, people start to realize that the predictor you developed is not performing well. Users start to complain. Everybody becomes confused. The KPI suggests that the predictor performs very well, so why are users unhappy about the performance? After a while, you are convinced that there is something wrong. You go back to the drawing board and try to debug your model. Finally, you realize that what your KPI shows is not right. Your validation mechanism is broken. You were so focused on increasing the KPI that you sacrificed validation. Now your reputation is damaged, and you need to come up with a new model quickly before users start to churn.
Why Do We Give Up on or Forget Validation?
There are a lot of reasons why we sacrifice validation or forget to establish it altogether. I believe the list is very long. My list below is not complete, but these are the reasons I come across most often.
Management Puts Pressure
This issue is not specific to data science or validation. Management or the business side always wants results fast. They put pressure on the team to come up with a predictor quickly. As pressure builds up, you start to give up on certain aspects of your method, and one aspect you choose to give up may be validation. You know that validation is not fun: you need a separate workflow just for it, and it takes extra time. Because of this, it may seem like a good place to prune.
You Think You Do Not Need Validation
I believe this is caused by forgetting why we are doing validation. Whatever the reason is, you may think you do not need validation. Let me list a couple of reasons I have come across:
Because you do not use machine learning.
The need for validation is not specific to machine learning. The complexity of your model does not determine the need for validation. Whether your method is a single if statement or a complicated machine learning model, it does not matter: you need validation.
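As a toy sketch of this point, here is an invented single-threshold rule being validated against ground truth exactly as a machine learning model would be; the rule, threshold, and data are all made up for illustration:

```python
# Toy sketch: even a single if-statement "model" is validated against ground truth.
# The rule, threshold, and data below are invented for illustration.
from sklearn.metrics import accuracy_score

def rule_based_predictor(brightness):
    # Hypothetical hand-crafted rule: "dark images contain cats".
    return 1 if brightness < 0.4 else 0

brightness_values = [0.10, 0.30, 0.50, 0.70, 0.20, 0.90, 0.35, 0.80]
ground_truth      = [1,    1,    0,    0,    1,    0,    0,    0]

predictions = [rule_based_predictor(b) for b in brightness_values]
print("rule accuracy on held-out data:", accuracy_score(ground_truth, predictions))
```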
Because you think your method is not overfitting.
Even if you are using sophisticated regularization methods, that does not mean they work perfectly. Overfitting prevention is not perfect, which means you cannot rely on it and measure performance on the training set.
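A quick way to see this in practice is to compare the training score with a held-out score; if the gap is large, regularization alone did not save you. A sketch with placeholder synthetic data and a plain L2-regularized model:

```python
# Sketch: a regularized model can still overfit, so compare training score
# against a held-out score. Data and model are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Few samples, many features: a setting where overfitting is easy.
X, y = make_classification(n_samples=120, n_features=200, n_informative=10,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

# L2 regularization is on by default, yet the gap below is typically large.
model = LogisticRegression(max_iter=2000).fit(X_train, y_train)

print("training score:", model.score(X_train, y_train))  # often close to 1.0
print("held-out score:", model.score(X_val, y_val))      # usually noticeably lower
```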
Because your method has nothing to do with validation.
Maybe you think you do not need validation because you have not used any data in training. However, are you sure that you have not introduced any bias into your method yourself? This is where the rabbit hole goes deeper. Even if you are not using any data in the training process, you, as a person, may have developed a method that is biased toward the data you use. Let me use the cat-detection example to clarify this: if your dataset only consists of dark-colored cats and you use this information while developing your algorithm, it will perform better on your dataset than it will in a real-life setting where cats come in different colors.
Your Runs/Experiments Are Taking a Long Time
It is no secret that machine learning operations usually take a long time. Even if a single run does not, you want a lot of runs for various reasons, like trying different hyper-parameters, methods, and so on. When your patience starts to wear thin, cloud server bills start to pile up, and the deadline approaches, you try to cut some operations. At this point, you tend to cut from validation. You may start to use fewer folds, switch to subsampling, or worst of all: drop validation entirely.
No Excuses. You Need Validation
Whatever the reason is, whether listed above or not, you cannot afford not to have a valid validation mechanism. Validation is fundamental to prediction work. You may choose from different methods (e.g. cross-validation, subsampling, etc.), but you must have one. It is best to have one from the very early stages. It may evolve as the project goes on, but even in your first runs, you must employ validation. Failing to have a valid validation mechanism will cause bigger problems in the later stages of the project.
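For reference, here is a minimal K-fold cross-validation sketch; the synthetic data and model are placeholders, and scikit-learn's `cross_val_score` is just one convenient way to set it up:

```python
# Minimal K-fold cross-validation sketch; data and model are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = LogisticRegression(max_iter=2000)

# Each fold is held out exactly once; the mean of the fold scores is a far
# better estimate of real-world performance than any score on training data.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```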
Rabbit Hole Goes Way Deeper: Valid Validation
When we think again about why we want validation, the situation goes deeper than just employing off-the-shelf validation methods like K-fold cross-validation. If we want to make sure that we get a similar performance in real-life settings, we need to think deeper about validation.
This topic exceeds the boundaries of this story, but here is some food for thought:
- Is your data leaking the label of the instance in an implicit way? (A sketch of one way to guard against this follows the list.)
- Does the dataset contain enough variation?
- Is the data collected in a synthetic (in-lab) environment, or under realistic conditions?
- Did you introduce bias into your method yourself?
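On the first point, one common guard against implicit leakage is to split by a grouping key (for example, the photo session or the subject) instead of by individual rows, so that near-duplicate instances cannot end up on both sides of the split. A hedged sketch with invented data and groups, using scikit-learn's GroupKFold:

```python
# Hedged sketch: splitting by a grouping key (e.g. photo session) so that
# related rows never appear in both training and validation folds.
# Data, groups, and model are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 2, size=100)
groups = np.repeat(np.arange(20), 5)  # 20 sessions, 5 images per session

model = LogisticRegression(max_iter=1000)
# All images from the same session stay on the same side of every split.
scores = cross_val_score(model, X, y, cv=GroupKFold(n_splits=5), groups=groups)
print("group-aware CV accuracy per fold:", scores)
```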