An easy and affordable data collection and storage architecture

Data can bring huge value to product development and growth. For example, with data you can run experiments to determine which product features will resonate with your customers before you implement them, or build classification models that, based on a user's first session, predict whether that user will stay loyal or churn. Sounds pretty cool and powerful, right? However, before you can use the data, you need to collect and store it properly. Most people think data scientists spend most of their time building algorithms and machine learning models, but the truth is that a vast amount of their time is spent just getting, preparing, and cleaning the data for analysis.

Building a reliable collection process and a robust storage architecture is a crucial component of product success. The easier it is to get the data out of the data warehouse, the faster you get results.

If you don’t have data collected and stored properly, there is no way you can get any insights and value from it

As your product grows, you get more and more data from different sources. All of that data needs to be combined and stored in one place, and at that point Universal Analytics becomes insufficient. Building your own data warehouse from scratch is an expensive, lengthy, and complicated process, so the question of how to quickly build an affordable yet high-quality data aggregation and storage solution comes up often.

Snowplow + Amazon Redshift

Snowplow is a JavaScript-based tracker that sends events to a cloud database via an API and data transfer protocols. Among Snowplow's advantages are its free license and easy implementation. Its event logic and syntax are similar to Universal Analytics, so migrating from UA is often a matter of changing a few parameters. Like Universal Analytics, Snowplow can collect e-commerce data and advanced parameters, as well as a wide range of other metrics.

After the data is collected and stored in Amazon Redshift, one option is to query it through Amazon's own SQL interface, but that is not the most convenient workflow, so we recommend the following:

BI analytics tools

Amazon Redshift integrates with various cloud BI tools that can pull data using SQL, R, or Python. Two tools we want to highlight are Mode Analytics and Redash. Both are very affordable yet offer plenty of useful features for visualizing data and building clear, meaningful dashboards.

Raw data

To work with raw data, build models, and run statistical analysis, you can also connect to Amazon Redshift directly from an IDE using R or Python. R packages that connect to Amazon Redshift let you pull the data and process it according to your requirements.
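As a minimal sketch of that direct connection in Python: Redshift speaks the PostgreSQL wire protocol, so the standard `psycopg2` driver works. The cluster endpoint, credentials, and the Snowplow-style `atomic.events` table and column names below are placeholders, not values from this article.

```python
# Sketch: querying raw event data in Amazon Redshift from Python.
# Redshift is PostgreSQL-compatible, so psycopg2 can connect to it.
# Endpoint, credentials, and table/column names are placeholders.

def events_per_user_query() -> str:
    """Event counts per user over a date range; the two date
    parameters are bound at execute time via %s placeholders."""
    return (
        "SELECT domain_userid, COUNT(*) AS events "
        "FROM atomic.events "
        "WHERE collector_tstamp BETWEEN %s AND %s "
        "GROUP BY domain_userid"
    )

def fetch_events_per_user(start: str, end: str):
    import psycopg2  # pip install psycopg2-binary
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439,  # Redshift's default port
        dbname="analytics", user="analyst", password="secret",
    )
    with conn, conn.cursor() as cur:
        cur.execute(events_per_user_query(), (start, end))
        return cur.fetchall()
```

Using `%s` placeholders instead of string formatting keeps the query safe from injection and lets the driver handle type conversion; on the R side, Postgres-compatible drivers connect to the same endpoint in the same way.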

Data architecture is complicated, and it must continually evolve along with the product. The solution described in this article is a great starting point that lets you focus resources on the product instead of diving into long, complicated development. In conclusion, I want to say something obvious and very logical, but also the most important thing every digital product company must follow from a very early stage: what matters most in data analysis is data quality. Not algorithms, not magic data science stuff. If you don’t have data collected and stored properly, there is no way you can get any insights and value from it.