Dbt Project Setup: Structure, Sources, and Staging

Goal: master the anatomy of a dbt project, understand the staging layer, and learn how to define and use Sources.


1. Anatomy of a dbt Project

When you run dbt init, you get a standard directory structure. Here are the key components:

dbt_project.yml

The heart of your project. If this file is missing, dbt won’t run. It configures:

  1. Project Name: Used for package imports.
  2. Profile: Which connection details to use from profiles.yml.
  3. Model Configurations: Set default materializations (e.g., all files in marts/ should be tables).

Key Directories

  • models/: Where your SQL magic happens.
  • seeds/: CSV files (lookup tables) to be loaded into the DB.
  • tests/: Custom SQL queries that fail if they return rows (data quality checks).
  • macros/: Reusable code blocks (think “functions” in Python).
  • snapshots/: Records changes to mutable data over time (SCD Type 2).

2. Organizing Models: The 3 Layers

dbt recommends a specific structure inside the models/ directory:

1. Staging (models/staging)

  • Goal: Create a clean, 1:1 copy of your source data.
  • Actions:
    • Rename columns to be descriptive (e.g., rate_code -> rate_code_id).
    • Cast data types (e.g., string date -> TIMESTAMP).
    • NO Joins (mostly).
  • Materialization: usually view.

2. Intermediate (models/intermediate)

  • Goal: Handle complex logic internal to the data team.
  • Actions: Joins, heavy transformations, grouping.
  • Visibility: Hidden from end users.

3. Marts (models/marts)

  • Goal: Business-ready data.
  • Actions: Final selection of columns, star schemas (Facts and Dimensions).
  • Materialization: usually table (for performance).

3. Defining Sources

Hardcoding table names (e.g., FROM raw_data.trips) is brittle. dbt uses Sources to abstract this.

sources.yml

Defined in models/staging/:

1
2
3
4
5
6
7
8
9
version: 2

sources:
  - name: staging
    database: taxi_rides_ny
    schema: trip_data_all
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

The {{ source() }} Function

In your SQL models, refer to sources dynamically:

1
2
3
4
5
-- BAD
FROM raw_data.trips

-- GOOD
FROM {{ source('staging', 'green_tripdata') }}

Benefits:

  1. Dependency Graph: dbt knows this model depends on that source.
  2. Freshness: You can define freshness blocks in YAML to alert you if source data is stale.

4. Building Your First Staging Model

A typical staging model (stg_green_tripdata.sql) looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
{{ config(materialized='view') }}

with source as (

    select * from {{ source('staging', 'green_tripdata') }}

),

renamed as (

    select
        -- identifiers
        cast(vendorid as integer) as vendor_id,
        cast(ratecodeid as integer) as rate_code_id,
        
        -- timestamps
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
        
        -- trip info
        store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        
        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(total_amount as numeric) as total_amount

    from source

)

select * from renamed

Key Takeaway: The Staging layer is about cleaning and standardizing. It prepares the raw ingredients for the cooking (transformation) that happens next.


In the next post, we will tackle Models & Materializations—the core logic of Analytics Engineering.