The bootstrap. The Swiss army knife of any data scientist

Gianluca Malato
February 12, 2020

Every measurement must come with an error estimate; there's no way around it. If I tell you "I'm 1.93 metres tall", I'm not giving you any information about the precision of this measurement. You might assume the precision is on the second decimal digit, but you can't be sure.

So, what we really need is a way to assess the precision of our measurement starting from the data sample we have.

If our observable is the mean value calculated over a sample, a simple precision estimate is given by the standard error. But what can we do if we are measuring something that is not the mean value? That's where the bootstrap comes to the rescue.
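As a reminder, the standard error of the mean is simply the sample standard deviation divided by the square root of the sample size; in R it's a one-liner:

x <- rnorm(100)          # any sample of size N = 100
sd(x) / sqrt(length(x))  # standard error of the mean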


Bootstrap in a nutshell

Bootstrap is a technique designed to estimate the confidence intervals and/or the standard error of any observable that can be calculated on a sample.

It relies on the concept of resampling: starting from a data sample of size N, we build a new sample of the same size by drawing values from the original sample with replacement. Each original value is drawn with the same probability as the others (i.e. 1/N).

So, for example, if our sample is made up of the four values {1,2,3,4}, some possible resamples are {1,4,1,3}, {1,2,4,4}, {1,4,3,2} and so on. As you can see, the values come from the original sample, they can be repeated, and each resample has the same size as the original one.
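In R, a single resample is just one call to sample() with replacement; for the toy example above:

x <- c(1, 2, 3, 4)
sample(x, size = length(x), replace = TRUE)  # e.g. 1 4 1 3 (changes at every run)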

Now, let's say we have a sample of N values and want to calculate some observable O, that is, a function that takes the N values as input and returns one real number. Examples are the average value, the standard deviation, or even more complex functions such as quantiles, the Sharpe ratio and so on.

Our goal is to calculate the expected value of this observable and, for example, its standard error.

To accomplish this goal, we run the resampling procedure M times and calculate our observable on each resample. At the end of the process, we have M different values of O, which can be used to estimate the expected value, the standard error, the confidence intervals and so on.
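Here is a minimal sketch of the whole procedure in base R, before turning to any dedicated package. The mean as the observable and M = 1000 are arbitrary choices, not from the original article:

set.seed(1)      # for reproducibility (arbitrary seed)
x <- rnorm(100)  # our sample of N values
M <- 1000        # number of resamples (arbitrary)

# Calculate the observable (here: the mean) on M resamples
boot_values <- replicate(M, mean(sample(x, replace = TRUE)))

mean(boot_values)                     # expected value of the observable
sd(boot_values)                       # its standard error
quantile(boot_values, c(0.05, 0.95))  # a 90% confidence interval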

This simple procedure is the bootstrap process.

An example in R

R has a wonderful library called bootstrap that performs all the calculations for us.

Let's say we want to measure the skewness of the sepal length in the famous iris dataset, and let's say we want to know its expected value, its standard error and the 5th and 95th percentiles.

Here follows the R code that makes this possible:
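The original snippet isn't reproduced here, so the following is a minimal reconstruction, assuming the bootstrap package with 1,000 resamples and a hand-rolled skewness function; the exact numbers may differ slightly from those below.

library(bootstrap)  # Efron & Tibshirani's bootstrap package

# Population skewness: third standardized moment
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)  # arbitrary seed, for reproducibility
b <- bootstrap(iris$Sepal.Length, nboot = 1000, theta = skew)

mean(b$thetastar)                     # expected value of the skewness
sd(b$thetastar)                       # its standard error
quantile(b$thetastar, c(0.05, 0.95))  # 5th and 95th percentiles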

The mean value is 0.29, the standard deviation is 0.13, the 5th percentile is 0.10 and the 95th percentile is 0.49.

These numbers allow us to say that the real skewness value of the dataset is 0.29 +/- 0.13, and that it lies between 0.10 and 0.49 with 90% confidence.

Surprising, isn't it?

This simple procedure is universal and can be used with any kind of observable.

The larger the dataset, the higher the precision

One might expect that the larger our dataset, the smaller the standard error we get, due to the law of large numbers.

This is indeed true, and we can verify it with a simulation.

Let's simulate N uniformly distributed random numbers and calculate their skewness. We expect the standard deviation of the bootstrapped skewness to decrease as N increases.

The following R code performs this calculation and, (not so) surprisingly, the bootstrap standard deviation decreases as a power law with exponent equal to -1/2.
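Again, the original snippet isn't reproduced here; this is a sketch under the same assumptions as above, with an arbitrary set of sample sizes:

library(bootstrap)

# Population skewness, as before
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)                                     # arbitrary seed
sizes <- c(50, 100, 200, 500, 1000, 2000, 5000)  # arbitrary sample sizes

boot_sd <- sapply(sizes, function(n) {
  x <- runif(n)                                  # N uniform random numbers
  b <- bootstrap(x, nboot = 1000, theta = skew)
  sd(b$thetastar)                                # bootstrap SD of the skewness
})

# Power-law check: fit log(sd) ~ log(N); the slope is the exponent
fit <- lm(log(boot_sd) ~ log(sizes))
coef(fit)[2]  # should come out close to -0.5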

The slope of the log-log linear regression is the exponent of the power law, and it comes out close to -0.5.

Conclusions

Bootstrap is a method to extract as much information as possible from a finite-size dataset, and it lets us calculate the expected value of any observable together with its precision (i.e. its standard error or confidence intervals).

It's really useful when we need to calculate an error estimate for some scientific measurement, and it can easily be generalized to multivariate observables.

No data scientist should forget this powerful tool, which has several useful applications even in machine learning (e.g., in Random Forest classification/regression models, where bootstrap resampling is the basis of bagging).
