If you use experiments to evaluate product features, and I hope you do, you have probably been asked what minimum sample size is required to get statistically significant results. In this article, we explain how we apply mathematical statistics and power analysis to calculate the sample size for an AB test.
Before launching an experiment, it is essential to calculate ROI and estimate the time required to reach statistical significance. An AB test cannot last forever. However, if we don't collect enough data, the experiment has low statistical power, which doesn't allow us to determine the winner and make the right decision.
Let's start with terminology. Statistical power is the probability that a statistical test correctly rejects the null hypothesis H0 when the alternative hypothesis H1 is true. The higher the power of the test, the less likely you are to make a type II error.
Type II error is tightly related to
Power analysis allows you to determine the sample size required to detect a given effect size at a chosen significance level and power. It also works the other way around: for a given sample size, it estimates the probability of detecting an effect of a given size.
The most important factor is the number of observations: the larger the sample size, the higher the statistical power. With "sufficiently" large samples, even small differences turn out to be statistically significant, and vice versa, with small samples, even large differences are difficult to detect. Knowing these patterns, we can determine in advance the minimum sample size required to get a statistically significant result. In practice, a test power of 80% or more is usually considered acceptable (which corresponds to a β-risk of 20%). This level follows from the so-called "one-to-four trade-off" convention between the levels of α-risk and β-risk: if we accept the significance level α = 0.05, then β = 0.05 × 4 = 0.20 and the power of the test is P = 1 - 0.20 = 0.80.
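As a rough guide to how these quantities interact, the textbook normal-approximation formula for a two-sided, two-sample comparison of means can be written as follows. It is added here for illustration only and is not necessarily the exact computation a given statistics package performs; d denotes the standardized effect size and n the sample size per group.

```latex
n \approx \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}}{d^{2}}
\qquad\text{e.g. for } \alpha = 0.05,\ \beta = 0.20:\quad
n \approx \frac{2\,(1.96 + 0.84)^{2}}{d^{2}} \approx \frac{15.7}{d^{2}}
```

The formula makes the pattern explicit: the required sample grows with the square of 1/d, so halving the effect size roughly quadruples the sample.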
Now let's look at the effect size. There are two approaches to calculating the required sample
Let's assume we test a hypothesis aimed at improving the “item to wishlist” conversion rate. The delta that covers the costs of the experiment within six months is a gain of >= 5% in this conversion rate: such a gain results in additional profit, which pays back all the resources invested in the experiment. In addition, you want to be 90% sure that you will find the difference if it exists (power = 0.9), and 95% sure that you do not mistake random fluctuations for a real difference (significance level = 0.05).
library(pwr)  # provides pwr.t.test
pwr.t.test(d=.05, sig.level=.05, power=.9, alternative="two.sided")
##
##      Two-sample t test power calculation
##
##               n = 8406.896
##               d = 0.05
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
##
## NOTE: n is number in *each* group
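One caveat when reading this result: the d argument of pwr.t.test is Cohen's d, a difference in means expressed in standard deviations, so d = 0.05 is not automatically the same thing as a 5% gain in a conversion rate. When the metric is a proportion, a cross-check with base R's power.prop.test (swapped in here for pwr.t.test) may be worthwhile. The sketch below assumes a hypothetical baseline conversion rate of 10% and reads the 5% gain as a relative lift (10% to 10.5%); both numbers are illustrative assumptions, not figures from the experiment described above.

```r
# Hypothetical inputs: the baseline rate is an assumption for illustration only.
baseline      <- 0.10                          # assumed "item to wishlist" conversion rate
relative_gain <- 0.05                          # the >= 5% gain, read as a relative lift
target        <- baseline * (1 + relative_gain)

# Base R (stats) power calculation for comparing two proportions.
res <- power.prop.test(p1 = baseline, p2 = target,
                       sig.level = 0.05, power = 0.9,
                       alternative = "two.sided")

n_per_group <- ceiling(res$n)                  # required visitors in each group
print(n_per_group)
```

Under these assumptions the required sample per group comes out much larger than 8,407, because half a percentage point on a 10% baseline is a far smaller standardized effect than d = 0.05. It is worth settling which definition of the delta applies before committing to a test duration.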
Let's take a look at another case, when stakeholders want to get results in a couple of weeks. In this case, we have an approximate sample size of 4000 visitors in total (2000 per group) and a delta of >= 5%. We want to know the probability of getting statistically significant results under these circumstances.
pwr.t.test(d=.05, n=2000, sig.level=.05, alternative="two.sided")
##
##      Two-sample t test power calculation
##
##               n = 2000
##               d = 0.05
##       sig.level = 0.05
##           power = 0.3524674
##     alternative = two.sided
##
## NOTE: n is number in *each* group
The probability of detecting the difference, if any, is about 35%, which is too low, and the probability of missing the desired effect is about 65%, which is too high.
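To see what a test of this size could realistically detect, the same calculation can be turned around: fix the sample size and the desired power, and solve for the effect size. This is a minimal sketch using base R's power.t.test, swapped in for pwr.t.test so that it runs without extra packages. With sd = 1, its delta argument is on the same scale as d.

```r
# Solve for the smallest standardized effect detectable with 2000 users per group
# at 90% power and a 5% two-sided significance level.
res <- power.t.test(n = 2000, sd = 1, sig.level = 0.05, power = 0.9,
                    type = "two.sample", alternative = "two.sided")

mde <- res$delta   # with sd = 1, delta is comparable to Cohen's d
print(round(mde, 3))
```

The result is roughly 0.1, about twice the targeted 0.05. A two-week test with this traffic could therefore reliably detect only an effect about twice as large as the one we are looking for.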
Let’s look at the chart below. It clearly shows that the larger the effect size, the smaller the sample required for a significant result.
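The same relationship is easy to tabulate. The sketch below again uses base R's power.t.test with sd = 1 as a stand-in for pwr.t.test. The grid of effect sizes is an arbitrary illustrative choice, not the data behind the chart.

```r
# Illustrative grid of standardized effect sizes (arbitrary values).
effect_sizes <- c(0.02, 0.05, 0.1, 0.2, 0.5)

# Required sample size per group for 90% power at a 5% significance level.
n_required <- sapply(effect_sizes, function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05,
                       power = 0.9, type = "two.sample",
                       alternative = "two.sided")$n)
})

print(data.frame(effect_size = effect_sizes, n_per_group = n_required))
```

Halving the effect size roughly quadruples the required sample, which is why small expected gains make experiments expensive.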
When planning an experiment, it is crucial to calculate the required amount of data, because any experiment costs money and time. Therefore, to estimate the potential ROI of the experiment, it is important to estimate all the unknown variables in advance.