Dbt Project Setup: Structure, Sources, and Staging
Goal: master the anatomy of a dbt project, understand the
staginglayer, and learn how to define and use Sources.
1. Anatomy of a dbt Project
When you run dbt init, you get a standard directory structure. Here are the key components:
dbt_project.yml
The heart of your project. If this file is missing, dbt won’t run. It configures:
- Project Name: Used for package imports.
- Profile: Which connection details to use from
profiles.yml. - Model Configurations: Set default materializations (e.g., all files in
marts/should be tables).
Key Directories
models/: Where your SQL magic happens.seeds/: CSV files (lookup tables) to be loaded into the DB.tests/: Custom SQL queries that fail if they return rows (data quality checks).macros/: Reusable code blocks (think “functions” in Python).snapshots/: Records changes to mutable data over time (SCD Type 2).
2. Organizing Models: The 3 Layers
dbt recommends a specific structure inside the models/ directory:
1. Staging (models/staging)
- Goal: Create a clean, 1:1 copy of your source data.
- Actions:
- Rename columns to be descriptive (e.g.,
rate_code->rate_code_id). - Cast data types (e.g., string date ->
TIMESTAMP). - NO Joins (mostly).
- Rename columns to be descriptive (e.g.,
- Materialization: usually
view.
2. Intermediate (models/intermediate)
- Goal: Handle complex logic internal to the data team.
- Actions: Joins, heavy transformations, grouping.
- Visibility: Hidden from end users.
3. Marts (models/marts)
- Goal: Business-ready data.
- Actions: Final selection of columns, star schemas (Facts and Dimensions).
- Materialization: usually
table(for performance).
3. Defining Sources
Hardcoding table names (e.g., FROM raw_data.trips) is brittle. dbt uses Sources to abstract this.
sources.yml
Defined in models/staging/:
|
|
The {{ source() }} Function
In your SQL models, refer to sources dynamically:
|
|
Benefits:
- Dependency Graph: dbt knows this model depends on that source.
- Freshness: You can define
freshnessblocks in YAML to alert you if source data is stale.
4. Building Your First Staging Model
A typical staging model (stg_green_tripdata.sql) looks like this:
|
|
Key Takeaway: The Staging layer is about cleaning and standardizing. It prepares the raw ingredients for the cooking (transformation) that happens next.
In the next post, we will tackle Models & Materializations—the core logic of Analytics Engineering.