In most cases, A/B test analysis aims to compare the control group with an experimental group and determine whether the experimental group affects our metric, for example, Conversion Rate (CR). The thing is that many internal and external factors may affect our metrics. We don't have control over external factors, but we certainly do over internal ones.
The experiment variant itself may not have a significant influence on our metric. Instead of comparing two groups as a whole, imagine you could take an in-depth look and reveal whether a statistically significant difference exists across various user segments.
This approach lets you see the diversity inside your product and find meaningful insights. Some of you may wonder how all of this relates to UX. At Awsmd, a data-driven UX and design agency, UX means a deep understanding of users, their problems, and their needs. If we can determine how different user segments react to a specific product change, we get a step closer to a better understanding of our audiences.
To determine whether differences exist across three or more groups, we will use multi-factor ANalysis Of VAriance (ANOVA).
Analysis of variance lets us measure the influence of several factors on a dependent variable. In A/B testing where the tracked metric is Conversion, Conversion is the dependent variable, while the experiment variant, traffic channel, device, user type, and so on are the factors.
Note that the ANOVA test requires groups that are mutually independent and that satisfy both the normality and equal-variance assumptions. It's crucial to run normality and variance-homogeneity tests before using ANOVA.
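These two pre-checks can be sketched as follows. The article's analysis itself is in R; this is an illustrative Python equivalent using scipy, and the data below are synthetic, generated just for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-group measurements for three experiment variants.
groups = [rng.normal(loc=10, scale=2, size=50) for _ in range(3)]

# Shapiro-Wilk test: H0 = the sample comes from a normal distribution.
for i, g in enumerate(groups):
    w, sw_p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk p = {sw_p:.3f}")  # large p -> no evidence against normality

# Levene test: H0 = all groups have equal variances.
lev_stat, lev_p = stats.levene(*groups)
print(f"Levene p = {lev_p:.3f}")  # large p -> no evidence against equal variances
```

If either test rejects its null hypothesis, a non-parametric alternative (such as Kruskal-Wallis) or a variance-stabilizing transformation is usually preferable to plain ANOVA.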
Our null hypothesis H0 is that the population means of all groups are the same:
H0: µ1 = µ2 = µ3 ...
Our alternative hypothesis H1 is that at least one of the population means is different:
H1: µ1 ≠ µ2 or µ1 ≠ µ3 or µ2 ≠ µ3 ...
Below is a sample that we are going to analyze using R.
test_aov <- aov(goal1Completions ~ experimentVariant * devicecategory * usertype, data = AB_test1)
summary(test_aov)
The Residuals line characterizes the common variance, which is called the variance within samples, or residual variance. The Sum Sq column contains SSB (sum of squares between groups) and SSW (sum of squares within groups). The Mean Sq column contains MSB (mean square between groups) and MSW (mean square within groups). The F value column shows the calculated F-statistic, which is the ratio of the between-group mean sum of squares to the within-group mean sum of squares. Finally, the Pr(>F) column shows the significance probability value (computed assuming that the null hypothesis is correct).
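These table columns can be reproduced by hand. The sketch below computes SSB, SSW, MSB, MSW, and F for a simple one-way layout in Python (the article's model is multi-factor, in R), and cross-checks the result against scipy; the data are made up:

```python
import numpy as np
from scipy import stats

groups = [np.array([12., 14., 11., 13.]),
          np.array([16., 18., 15., 17.]),
          np.array([13., 12., 14., 13.])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# SSB: variation of group means around the grand mean (between groups).
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# SSW: variation of observations around their own group mean (the Residuals line).
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1          # degrees of freedom between groups
df_within = len(all_obs) - len(groups)  # degrees of freedom within groups
msb = ssb / df_between   # Mean Sq between groups
msw = ssw / df_within    # Mean Sq within groups (residual variance)
f_value = msb / msw      # the F value column

# Cross-check against scipy's one-way ANOVA.
f_ref, p_ref = stats.f_oneway(*groups)
print(f"F by hand = {f_value:.3f}, F from scipy = {f_ref:.3f}")
```

The two F values agree, which is a useful sanity check when reading an unfamiliar ANOVA table.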
Our goal here is to see how the experiment variant, in combination with other factors, describes the A/B test. As seen in the image, Conversion in combination with the experiment variant is best described by userType (Sum Sq = 10002), while our metric is most influenced by the combination of the experiment variant and device (Sum Sq = 1). Based on the p-value, we can reject the null hypothesis and conclude that there is a statistically significant difference between some of the groups.
In the ANOVA test, a significant p-value indicates that some of the group means are different. To determine exactly which pairs of groups differ, we need to apply a post hoc test, which compares all possible pairs of means.
We will use the Tukey post hoc criterion to test the null hypothesis H0: μB = μA against the alternative hypothesis H1: μB ≠ μA, where A and B are any two compared groups.
With m groups in total, it is possible to perform m(m − 1)/2 pairwise comparisons.
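For example, a control plus two experiment variants (m = 3) gives 3 × 2 / 2 = 3 pairs. A quick sketch of this count, with hypothetical group names:

```python
from itertools import combinations

variants = ["control", "variant_A", "variant_B"]  # hypothetical group names
pairs = list(combinations(variants, 2))  # every unordered pair of groups
m = len(variants)
print(pairs)
print(len(pairs), "==", m * (m - 1) // 2)
```

This count grows quickly with m, which is exactly why a multiple-comparison correction such as Tukey's is needed.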
The first step is to rank all the available group means in ascending order (from 1 to m). Then pairwise comparisons of these means are performed so that first the highest mean is compared with the lowest, i.e., the m-th with the 1st, then the m-th with the 2nd, and so on up to the (m − 1)-th. Then the penultimate mean, the (m − 1)-th, is compared in the same way with the 1st, the 2nd, and so on up to the (m − 2)-th. These comparisons continue until all pairs have been compared with each other.
The test allows repeated comparisons of sample means while keeping the family-wise type I error rate under control.
Tukey's test is a post hoc test for analysis of variance, so we will reuse the fitted ANOVA model's variables.
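In R this is done with TukeyHSD on the aov object; for illustration, here is an equivalent sketch in Python using scipy.stats.tukey_hsd (available in SciPy 1.8+), with made-up data:

```python
import numpy as np
from scipy.stats import tukey_hsd

# Hypothetical measurements for a control and two experiment variants.
control   = np.array([12., 14., 11., 13., 15., 12.])
variant_a = np.array([15., 17., 14., 16., 18., 15.])
variant_b = np.array([12., 13., 12., 14., 13., 12.])

# Tukey HSD: all pairwise comparisons of the group means.
res = tukey_hsd(control, variant_a, variant_b)
print(res)  # table of pairwise mean differences with p-values

# 95% family-wise confidence intervals for each pairwise difference.
ci = res.confidence_interval(confidence_level=0.95)
print(ci.low)
print(ci.high)
```

Each entry in `res.pvalue` answers the pairwise H0: μB = μA from above, with the significance level adjusted for the number of comparisons.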
To visualize the results, we will use plot().
The chart above shows the differences between the group means and their confidence intervals, calculated at the 95% family-wise confidence level. If a confidence interval crosses zero, this indicates that there is no significant difference between the respective groups.
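The "does the interval cross zero?" reading of the plot can also be done programmatically. A sketch in Python using the confidence intervals returned by scipy.stats.tukey_hsd, on synthetic data generated for the example:

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
a = rng.normal(10.0, 1.0, 30)
b = rng.normal(10.1, 1.0, 30)   # close to a: its interval vs. a likely crosses zero
c = rng.normal(13.0, 1.0, 30)   # far from a: its interval vs. a should not

res = tukey_hsd(a, b, c)
ci = res.confidence_interval(confidence_level=0.95)
for i in range(3):
    for j in range(i + 1, 3):
        # An interval straddling zero means no significant difference for that pair.
        crosses_zero = ci.low[i, j] < 0 < ci.high[i, j]
        print(f"groups {i} vs {j}: interval crosses zero = {crosses_zero}")
```

This mirrors the visual check on the plot: pairs whose intervals stay entirely on one side of zero are the ones driving the significant ANOVA result.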
Analyzing A/B test results while taking into account different user segments and different internal factors is key to getting a better picture of our experiment. This approach unveils new product unknowns, brings us closer to our customers, and lets us find desirable growth opportunities.