If you are considering adding dbt to your data stack, it’s important to understand the fundamentals of how it works and what value it will bring to your data team.
What is dbt?
dbt is a data transformation tool. Data transformation covers very common use cases such as creating a view or table from a query that joins data from multiple source tables. In the case of a transformed table, you're most likely also creating a Snowflake task, a BigQuery scheduled query, or a Postgres stored procedure on a cron schedule to ensure the table is rebuilt at a regular interval and the data stays fresh. Most of what dbt does can be done directly within your warehouse or database, or with other common data pipeline and orchestration tools such as Airflow.
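To illustrate the traditional approach, here's a rough sketch of such a scheduled rebuild as a Snowflake task (all object, column, and warehouse names here are hypothetical):

-- Rebuild a reporting table every day at 06:00 UTC
CREATE OR REPLACE TASK refresh_orders_enriched
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
AS
  CREATE OR REPLACE TABLE analytics_db.reporting.orders_enriched AS
    SELECT o.order_id, o.order_date, c.customer_name
    FROM raw_db.shop.orders o
    JOIN raw_db.shop.customers c ON c.customer_id = o.customer_id;

-- Tasks are created suspended; resume the task to start the schedule
ALTER TASK refresh_orders_enriched RESUME;

dbt replaces this pattern with plain SELECT statements plus an orchestrated run, as the rest of this section shows.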
So why use dbt at all, and why do so many analysts love it? Apart from offering features that streamline the data transformation workflow, it's a great tool for data teams primarily composed of analysts who know SQL but don't necessarily have an engineering background or proficiency in programming languages such as Python or R, because it lets them adopt engineering best practices in their workflows and in the data pipelines they develop. With that said, let's review the fundamentals of dbt.
Data Models
A data model in dbt represents a view or table that will be created or updated. Data models are defined in .sql files that contain the SQL definition for your table or view. These .sql model files are the basic building blocks of dbt because they define the views and tables one would want to create when transforming data. The code in these SQL model files mirrors what an analyst would compose in a Snowflake worksheet or BigQuery console; however, there are some differences. The biggest difference is that instead of directly specifying the fully qualified table/view name (e.g. <database_name>.<schema_name>.<table_name>), a reference is used.
For example, suppose an analyst wants to create a view on top of a table named usa_high_schools_ranked, which ranks all USA high schools, in order to get just the top 50 ranked schools.
-- The SQL definition would be written as:
CREATE OR REPLACE VIEW usa_high_schools_top_50 AS (
    SELECT
        school_name,
        school_city,
        school_state,
        school_ranking,
        school_avg_sat_scores
    FROM analytics_db.schools.usa_high_schools_ranked
    WHERE school_ranking <= 50
)
-- When defining a model in dbt instead, the SQL definition lives in the
-- dbt model file (dbt/models/usa_high_schools_top_50.sql), where the
-- file name determines the name of the view that will be created.
-- It would instead be composed as:
{{ config(materialized='view') }}
SELECT
school_name,
school_city,
school_state,
school_ranking,
school_avg_sat_scores
FROM {{ ref('usa_high_schools_ranked') }}
WHERE school_ranking <= 50
The statement {{ ref('usa_high_schools_ranked') }} tells dbt to reference the table or view defined by the data model usa_high_schools_ranked. One can then expect to find a file named dbt/models/usa_high_schools_ranked.sql in the models directory or a subdirectory. Assuming it is materialized as a table, here's what the definition could look like.
{{ config(materialized='table') }}
SELECT
school_name,
school_city,
school_state,
school_avg_sat_scores,
ROW_NUMBER() OVER (ORDER BY school_avg_sat_scores DESC, school_size DESC) AS school_ranking
FROM {{ source('usa_school_statistics','high_school_metrics') }}
In the above example, instead of a reference to a model, there is a pointer to a source table. This is the second way a table or view can be referred to in dbt without using its fully qualified name. Source tables generally hold the raw data initially loaded into a database from external systems such as your CRM or Google Analytics, which is eventually transformed: queried, cleaned, and even joined with other data to create new datasets. At its core, this is what dbt helps you do: build downstream tables and views that transform raw data into clean reporting data models that can be used by other analysts, data scientists, or end users if connected to a BI tool.
Data Sources
How are these sources defined? Raw data sources are declared in .yml files that outline the database, schema, and table names.
dbt/models/source.yml
version: 2

sources:
  - name: usa_school_statistics
    database: education_raw_db
    schema: usa_school_statistics
    tables:
      - name: high_school_metrics
A source pointer such as {{ source('usa_school_statistics', 'high_school_metrics') }} tells dbt to look at the source definition for the source named usa_school_statistics and reference the table named high_school_metrics in the schema and database defined for that source. In practice, a source generally represents a schema (or dataset if using BigQuery), because when raw data is ingested into a database, tables loaded from the same external system usually land in the same schema. Multiple sources can be defined in a dbt project.
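For instance, a project could declare a second source alongside the school statistics (the CRM source below is hypothetical):

version: 2

sources:
  - name: usa_school_statistics
    database: education_raw_db
    schema: usa_school_statistics
    tables:
      - name: high_school_metrics
  - name: crm
    database: education_raw_db
    schema: salesforce
    tables:
      - name: contacts
      - name: accounts

A model could then refer to {{ source('crm', 'contacts') }} without hardcoding which database or schema the CRM loader writes to.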
dbt Projects
dbt is a Python package that orchestrates the transformation of data. For it to know what to create and where, it needs a dbt project: a directory composed of configuration files in addition to the data model .sql and source .yml files.
The key configuration files include:
- dbt_project.yml: This file must exist at the root of your dbt project directory and lets you configure settings for your project.
- profiles.yml: This file is created in a hidden folder (~/.dbt/ by default) outside of the dbt project directory because it contains the sensitive configuration information needed to authenticate to your database. If you are using environment variables for credentials, it can be placed in the project root alongside dbt_project.yml.
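For reference, a minimal dbt_project.yml might look like the sketch below (the project and profile names are hypothetical, and model-paths is the key used by recent dbt versions):

name: 'education_stats_dbt_project'
version: '1.0.0'
profile: 'education_stats_dbt_project'  # must match a profile name in profiles.yml
model-paths: ["models"]                 # where dbt looks for model .sql files
models:
  education_stats_dbt_project:
    +materialized: view                 # project-wide default, overridable per model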
dbt Development
With dbt installed, which can be done with a basic pip install (e.g. pip install dbt-core dbt-snowflake), and data models defined, the corresponding tables and views are created by simply running dbt run in a command line interface.
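As a quick sketch, the end-to-end loop from a terminal might look like this (the Snowflake adapter and the model selected are just examples):

pip install dbt-core dbt-snowflake          # install dbt and a warehouse adapter
dbt debug                                   # verify the connection in profiles.yml
dbt run                                     # build every model in the project
dbt run --select usa_high_schools_top_50    # build just one model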
When dbt runs, it refers to the profiles.yml file to know which database to connect to and which schema to write the data to. As mentioned before, dbt is great for data teams because it allows them to easily implement engineering best practices, and one of the core principles is the distinction between development and production: users need an environment to test and develop code without impacting what is in production. For example, one does not want to replace a view directly used in a dashboard with a new SQL definition that adds new columns until it has been fully tested.
In profiles.yml, multiple targets can be defined for a project. Typically a dev (development) target is defined as the default, and an additional prod (production) target can also be defined. With these targets, one can segregate development from production by writing the tables and views created during development to a different schema than the one designated for production.
profiles.yml example:
education_stats_dbt_project:
  target: dev  # the target used when dbt run is called without --target
  outputs:
    dev:
      type: postgres  # adapter type; snowflake, bigquery, etc. use different keys
      host: ..
      user: ..
      password: ..
      dbname: analytics_development
      schema: dev_user
    prod:
      type: postgres
      host: ..
      user: ..
      password: ..
      dbname: analytics_production
      schema: reporting
When the dbt run command is triggered, the dev target is used by default; this can be overridden with a flag, i.e. dbt run --target prod. This is why using references rather than fully qualified table/view names is key when composing data models in dbt: the database and schema where the model tables and views are created and referenced at run time are relative to what is defined in the dev or prod target. Similarly, when pointing to source tables, it's better not to specify the table's qualified name directly; with a source defined in a .yml file, you can change the database or schema of the source table in that one file, if ever needed, without having to update all the models that reference it.
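To make that concrete, with the hypothetical Postgres-style profile above, the same reference compiles to a different fully qualified name depending on the target:

-- In a model file:
FROM {{ ref('usa_high_schools_ranked') }}
-- Compiled against the dev target (dbname: analytics_development, schema: dev_user):
FROM analytics_development.dev_user.usa_high_schools_ranked
-- Compiled against the prod target (dbname: analytics_production, schema: reporting):
FROM analytics_production.reporting.usa_high_schools_ranked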
The other key aspect of engineering workflows that integrates seamlessly with dbt is version control. The dbt project directory (with the exception of the profiles.yml file, which lives in a hidden directory) can be pushed to an upstream code repository on platforms such as GitHub. This gives analysts the benefits of source control, including pull request workflows that ensure updates to production are peer reviewed and approved first. It also allows multiple individuals to develop on the same code base in separate branches without impacting production. If your team chooses to orchestrate production dbt runs using the dbt Cloud platform, it can pull the code directly from your dbt project repository, which ensures that dbt runs and builds tables and views using the most up-to-date model definitions.
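As a common convention (not a requirement), dbt's locally generated artifacts are kept out of the repository with a .gitignore along these lines:

target/        # compiled SQL and run artifacts
dbt_packages/  # packages installed by dbt deps
logs/          # run logs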