While running an experiment, waiting for data is often the hardest part, because impatience sets in. All you want is for the A/B test to end as quickly as possible so that you can move to full-scale execution. The anxiety only grows when you don’t know how long the test will take to reach statistical significance.
The impatience is entirely understandable: you do not want to lose conversions on suboptimal variations. Little can be done about the wait itself, as the statistical test will end when it ends. But an estimate of how long the A/B test will run can ease that anxiety to some extent.
Let me explain how to estimate the duration of an A/B test:
Visitor Sample Size Calculator
In statistics, you can never say with 100% confidence that an A/B test will end after X days. Instead, you say that after X days there is an 80% (or 95%, whichever you choose) probability of getting a statistically significant result, provided a real difference exists.
It may also be that the variations do not differ in performance at all, in which case you will never get a statistically significant result no matter how long you wait. It is therefore essential to estimate the number of visitors an A/B test needs to reach statistical significance before you run it.
You need three pieces of information to determine the number of visitors for the A/B test:
- Base conversion rate (CR): the minimum conversion rate you expect the campaign to achieve.
- Expected uplift: the percentage difference in CR you want to detect relative to the base CR (the smaller the uplift you wish to detect, the more visitors you need).
- Number of variations to test (the more variations you test, the more traffic you need).
The following image was taken from Statistical Rules of Thumb, by Gerald van Belle.
The formula described above is known as Lehr’s equation, which is obtained by using frequentist statistics.
- Type I Error (α) is the probability of rejecting the null hypothesis when it is true (if α=0.05, then out of 100 independent tests where the variations are the same, about 5 tests on average will say the variations are statistically different)
- Type II Error (β) is the probability of not rejecting the null hypothesis when it is false (if β=0.20, then out of 100 independent tests where the variations are different, about 20 tests on average will say the variations are the same)
- z is the Z-score value obtained from the Z-table. Visit this to know more about the Z-test.
- σ is the standard deviation of the visitor’s Bernoulli distribution (each visitor either converts or does not). Hence, σ² = CR × (1 − CR).
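With these symbols, and writing δ for the absolute difference in conversion rate you want to detect (δ = CR × Uplift, as used later in this article), Lehr’s equation is commonly written as shown here. Treat this as a sketch of the standard form, assuming a two-sided test and counting n per variation. The constant 16 in the shortcut comes from 2 × (1.96 + 0.84)² ≈ 15.7, rounded up.

```latex
n = \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\sigma^{2}}{\delta^{2}}
\qquad\text{and, for } \alpha = 0.05,\ \beta = 0.20:\qquad
n \approx \frac{16\,\sigma^{2}}{\delta^{2}}
```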
By putting the values in Lehr’s equation, you’ll get the number of visitors (n) needed to get statistically significant results between two variations.
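This calculation can be sketched in a few lines of Python. The function name `visitors_per_variation` and its parameters are illustrative, not part of any calculator's API. The sketch assumes a two-sided test, an equal traffic split, an uplift expressed relative to the base CR, and σ² taken from the base CR alone.

```python
import math
from statistics import NormalDist


def visitors_per_variation(base_cr, relative_uplift, alpha=0.05, beta=0.20):
    """Approximate visitors needed per variation (Lehr-style formula).

    Assumptions (illustrative): two-sided test, equal traffic split,
    uplift relative to the base CR, and sigma^2 = CR * (1 - CR).
    """
    if not 0 < base_cr < 1:
        raise ValueError("base_cr must be strictly between 0 and 1")
    if relative_uplift <= 0:
        raise ValueError("relative_uplift must be positive")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(1 - beta)        # quantile for power 1 - beta
    delta = base_cr * relative_uplift              # absolute difference to detect
    variance = base_cr * (1 - base_cr)             # Bernoulli variance

    n = 2 * (z_alpha + z_beta) ** 2 * variance / delta ** 2
    return math.ceil(n)  # round up: a fraction of a visitor is not possible


if __name__ == "__main__":
    # Example: 10% base CR, 10% relative uplift (i.e. 10% -> 11%).
    print(visitors_per_variation(base_cr=0.10, relative_uplift=0.10))
```

With a 10% base CR and a 10% relative uplift, this gives roughly 14,000 visitors per variation. Halving the uplift roughly quadruples the requirement, because δ appears squared in the denominator.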
If there are multiple variations, multiplying n by the number of variations (V) gives the overall number of visitors needed (n × V).
Divide that total by the average number of daily visitors to get the number of days the A/B test is likely to take to find the best variation. You can use the calculator at ab-test-duration-calculator, which is built on the same formula.
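As a rough illustration of this last step, the arithmetic looks like the following. The function name `estimated_test_days` is hypothetical, and the sketch assumes that V counts every arm of the test (control included) and that daily traffic is roughly constant and fully allocated to the test.

```python
import math


def estimated_test_days(n_per_variation, num_variations, daily_visitors):
    """Days needed for the test to collect n visitors for each of V variations."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_visitors = n_per_variation * num_variations  # n * V
    # Round up: a partially completed day still has to be run in full.
    return math.ceil(total_visitors / daily_visitors)


if __name__ == "__main__":
    # Example: 14,128 visitors per variation, 2 variations, 2,000 visitors a day.
    print(estimated_test_days(14128, 2, 2000))
```

Here 14,128 × 2 = 28,256 visitors at 2,000 visitors a day works out to 14.13 days, so the estimate rounds up to 15 days.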
Mathematical Intuition Behind Lehr’s Equation
There are two key ingredients to sample size calculations: the difference between the two variations’ conversion rates, and the variability in their measurements.
Each distribution in the above image is a model that represents the differences of conversion rates between two variations where the x-axis is the absolute difference scale of conversion rates (Δ=y0−y1).
One distribution’s center is at 0, and the other’s center is at δ (δ = CR × Uplift). The null hypothesis that there is no difference between the two variations is represented by the distribution on the left (Δ=0). The alternative hypothesis that there is some difference between the two variations is represented by the right curve (Δ=δ). Each distribution also has a variance (σ²), which is usually assumed to be the same for both.
The relationship between the standard error (SE), the absolute difference of conversion rates of the two variations, and the standard deviation of the distribution allows us to set up calculations for the sample size, n.
Multiplying SE by the appropriate z-score scales it to the confidence level we want in our estimate.
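In symbols, and assuming each variation receives the same number of visitors n and shares the same variance σ², the standard error of the difference between the two conversion rates and the scaled distance can be written as:

```latex
SE = \sqrt{\frac{\sigma^{2}}{n} + \frac{\sigma^{2}}{n}} = \sigma\sqrt{\frac{2}{n}},
\qquad
\text{distance at the chosen level} = z \cdot SE
```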
The critical value is where the α region of the null curve and the β region of the alternative curve meet. This point is:
- z_{1−α/2} × SE away from the mean of the null curve, and
- z_{1−β} × SE away from the mean of the alternative curve.
Since the two distances sum to δ, rearranging the resulting equation gives the number of visitors needed to detect a statistically significant difference between the two variations. The error rates of the test are α and β: the smaller they are, the larger the visitor estimate. Thus, the equation we get is:
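Written out with the symbols used above, and assuming a two-sided test with the same number of visitors n in each variation, the rearrangement goes as follows:

```latex
\begin{aligned}
z_{1-\alpha/2}\,SE + z_{1-\beta}\,SE &= \delta, \qquad SE = \sigma\sqrt{\tfrac{2}{n}} \\
\left(z_{1-\alpha/2} + z_{1-\beta}\right)\sigma\sqrt{\tfrac{2}{n}} &= \delta \\
n &= \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\sigma^{2}}{\delta^{2}}
\end{aligned}
```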
You can read this sample size chapter to get a more in-depth understanding of how this equation is derived.
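As a numerical sanity check of this intuition, the short script below plugs illustrative values into the equation and confirms that the two distances from the critical value add up to δ. The input values and variable names are assumptions chosen for the example. It also assumes a two-sided test and the normal approximation, and it keeps n unrounded so that the identity holds exactly.

```python
import math
from statistics import NormalDist

alpha, beta = 0.05, 0.20
base_cr, relative_uplift = 0.10, 0.10  # illustrative values only

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(1 - beta)
delta = base_cr * relative_uplift            # center of the alternative curve
sigma = math.sqrt(base_cr * (1 - base_cr))   # Bernoulli standard deviation

# Real-valued n from the rearranged equation (left unrounded on purpose).
n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
se = sigma * math.sqrt(2 / n)

from_null = z_alpha * se         # distance of the critical value from 0
from_alternative = z_beta * se   # distance of the critical value from delta

# Area of the alternative curve to the left of the critical value (Type II error).
type_ii = NormalDist(delta, se).cdf(from_null)

print(f"sum of distances = {from_null + from_alternative:.6f}, delta = {delta:.6f}")
print(f"type II error at the critical value = {type_ii:.4f}")
```

The area of the alternative curve that falls below the critical value comes out equal to β, which is what the picture of the two overlapping curves describes.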
Conclusion
By using the method described above, you can estimate how long a frequentist A/B test needs to run to check for statistical significance. However, if you perform a test using Bayesian statistics, you can read the maths behind the Bayesian duration calculator in order to understand its implementation.
It is common practice to perform sample size calculations before starting an experiment to avoid bias in the results. If we include too few subjects in an experiment, the results cannot be generalized, as the sample will not represent the target population. On the other hand, if we study more subjects than required, we waste resources. Adequate sample size calculation is thus crucial in any statistical experiment to arrive at scientifically valid results.