The Modern Data Stack: An Introduction

The modern data stack consists of a collection of tools that ingest, store, transform, and visualize you data.  The primary components include:

  • A data warehouse, the central store for your data (Snowflake, Big Query)
  • ETL/ELT tools that ingest data into the data warehouse (Fivetran, Stitch)
  • Data transformation tools for cleaning and transforming your data (DBT)
  • BI / reporting tools (Looker, Tableau, Mode)
  • Reverse ETL tools (Census, Hightouch)
 
With the right tools in place, you will be able to rapidly report on your data and efficiently manage and scale your data ecosystem.

Data Warehouse

At the core of your data stack is the data warehouse. Your data warehouse is most likely going to be cloud-based columnar data storage solution, which allows you to efficiently query the data that you need. Some of the more common data warehouses you may have come across are Snowflake, Google Big Query, and Amazon Redshift.

When selecting a data warehouse, price and performance are key considerations, other important factors include administration, scaling, and potentially data sharing, depending on your use case. Your team’s current cloud infrastructure can also influence your decision, for example, while Snowflake can be hosted on AWS, Azure, and the Google Cloud Platform, if you are already heavily integrated with Google Cloud, it may make more sense to go with Big Query. On the administration side, while Big Query is fully managed and scaling is handled automatically, you may actually want the flexibility to manage resource allocation which you can easily do with Snowflake. I also find access control management easier with Snowflake versus Big Query.

ETL/ELT

When it comes to ingesting data into your data warehouse, it’s usually a question of build versus buy. You can either build a custom ETL (Extract, Transform, Load) pipeline using tools like Airflow, Luigi, or more recently Dagster. Or you can go with a managed, ready to use data connector from companies like Fivetran and Stitch. There are times when an out of the box solution is not a viable option, however when it is, it’s highly recommended. Managed ETL connectors save you time and in some cases money i.e. the monthly cost of using these connectors can be cheaper than having an engineer on staff to build, maintain and update custom ETL pipelines. Additionally, most of these companies offer a trail period for you to test out a connector, which allows you to compare and contrast the right solution.

Data Transformation

The data that gets ingested into the data warehouse from your data sources, be it CRM data, payment data, customer support data, is usually in a raw state. Whether it’s cleaning up missing and bad data or joining data from different sources to build a customer journey, raw data needs to be transformed for reporting and analyses. This transformation can sometimes be done within your reporting tools, however utilizing a transformation tool like DBT that transforms the data within the warehouse and enables you to implement engineering best practices such as code reviews, CI/CD, and documentation can be beneficial. For example, with the DBT transformation layer, migrating between BI tools is easier because the core data models and transformation logic sits outside the BI tool.

BI / Reporting Tools

Most data consumers will interact with a BI tool at some point in their data exploration journey. Whether it is an enterprise platform such as Looker and Tableau or a more run of the mill tool like Excel, these BI tools are equipped to easily connect to your data warehouse and ingest the data for reporting. If a tool like Looker and Tableau is too expensive, alternatives such as Metabase and Mode offer plans that can suit smaller teams. Choosing a BI tool depends on specific team needs, size, and budget, and there is no shortage of options.

Reverse ETL

With your data centralized in your data warehouse you are able to aggregate and join together all the data from your customers’ journey across your ecosystem. Beyond reporting, analyzing, and experimenting on this data, it can actually be used to further enrich datasets and platforms that exist outside the warehouse and reporting layer in order to drive acquisition, engagement, and retention. For example, if you want to attach an engagement score to a prospect account in CRM with that data exists in your data warehouse; reverse ETL tools such as Census and Hightouch will sync data into your SaaS platforms with no coding needed.