Once enough data has been collected and the A/B test is complete, some of you might think it’s time for data analysis. But before we start analyzing, we need to ETL the data.
ETL stands for Extract, Transform, Load. In this article, we will talk about the transform part, which in our case means cleaning. Collected data often contains outliers, and if we want to apply a parametric method to compare means (a t-test, for example), we need to deal with them first, because the mean that such methods rely on is sensitive to extreme values.
Let’s talk about an experiment with an “average check” metric. In the average check dataset, we want to find the mean value, and outliers can have a significant influence on it. Say the average check of a regular customer is $30, but a few customers typically spend $300: those few will pull the mean well above what a regular customer actually spends. One option is to find and remove the outliers from the dataset manually. Sounds easy, right? The challenge comes when the dataset contains millions of observations and manual inspection is no longer feasible.
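To get a feel for how much a handful of large checks can move the mean, here is a minimal sketch. The numbers are made up for illustration (95 regular customers at $30 and 5 customers at $300, not real experiment data), and the `$300` cutoff used to drop the big spenders is simply the manual approach described above, not a recommended rule.

```python
import statistics

# Hypothetical data: 95 regular checks of $30 and 5 large checks of $300.
checks = [30.0] * 95 + [300.0] * 5

# Mean over everything: the 5 large checks pull it upward.
mean_all = statistics.mean(checks)

# The median barely notices the large checks.
median_all = statistics.median(checks)

# "Manual" cleaning: drop the checks we already know are outliers.
regular_checks = [c for c in checks if c < 300.0]
mean_regular = statistics.mean(regular_checks)

print(f"mean with outliers:    {mean_all:.2f}")      # 43.50
print(f"median with outliers:  {median_all:.2f}")    # 30.00
print(f"mean without outliers: {mean_regular:.2f}")  # 30.00
```

In this toy case, just 5% of the observations raise the mean from $30 to $43.50, while the median stays at $30. With only 100 rows we can spot the outliers by eye; with millions of rows we cannot, which is why we need an automated rule.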