Dbt Project Setup: Structure, Sources, and Staging

Anonymous included in Data Engineering

2026-02-11 About 500 words 3 minutes

Contents

Goal: master the anatomy of a dbt project, understand the staging layer, and learn how to define and use Sources.

1. Anatomy of a dbt Project

When you run dbt init, you get a standard directory structure. Here are the key components:

`dbt_project.yml`

The heart of your project. If this file is missing, dbt won’t run. It configures:

Project Name: Used for package imports.
Profile: Which connection details to use from profiles.yml.
Model Configurations: Set default materializations (e.g., all files in marts/ should be tables).

Key Directories

models/: Where your SQL magic happens.
seeds/: CSV files (lookup tables) to be loaded into the DB.
tests/: Custom SQL queries that fail if they return rows (data quality checks).
macros/: Reusable code blocks (think “functions” in Python).
snapshots/: Records changes to mutable data over time (SCD Type 2).

2. Organizing Models: The 3 Layers

dbt recommends a specific structure inside the models/ directory:

1. Staging (`models/staging`)

Goal: Create a clean, 1:1 copy of your source data.
Actions:
- Rename columns to be descriptive (e.g., rate_code -> rate_code_id).
- Cast data types (e.g., string date -> TIMESTAMP).
- NO Joins (mostly).
Materialization: usually view.

2. Intermediate (`models/intermediate`)

Goal: Handle complex logic internal to the data team.
Actions: Joins, heavy transformations, grouping.
Visibility: Hidden from end users.

3. Marts (`models/marts`)

Goal: Business-ready data.
Actions: Final selection of columns, star schemas (Facts and Dimensions).
Materialization: usually table (for performance).

3. Defining Sources

Hardcoding table names (e.g., FROM raw_data.trips) is brittle. dbt uses Sources to abstract this.

`sources.yml`

Defined in models/staging/:

1
2
3
4
5
6
7
8
9


version: 2

sources:
  - name: staging
    database: taxi_rides_ny
    schema: trip_data_all
    tables:
      - name: green_tripdata
      - name: yellow_tripdata

The `{{ source() }}` Function

In your SQL models, refer to sources dynamically:

1
2
3
4
5


-- BAD
FROM raw_data.trips

-- GOOD
FROM {{ source('staging', 'green_tripdata') }}

Benefits:

Dependency Graph: dbt knows this model depends on that source.
Freshness: You can define freshness blocks in YAML to alert you if source data is stale.

4. Building Your First Staging Model

A typical staging model (stg_green_tripdata.sql) looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33


{{ config(materialized='view') }}

with source as (

    select * from {{ source('staging', 'green_tripdata') }}

),

renamed as (

    select
        -- identifiers
        cast(vendorid as integer) as vendor_id,
        cast(ratecodeid as integer) as rate_code_id,
        
        -- timestamps
        cast(lpep_pickup_datetime as timestamp) as pickup_datetime,
        cast(lpep_dropoff_datetime as timestamp) as dropoff_datetime,
        
        -- trip info
        store_and_fwd_flag,
        cast(passenger_count as integer) as passenger_count,
        cast(trip_distance as numeric) as trip_distance,
        
        -- payment info
        cast(fare_amount as numeric) as fare_amount,
        cast(total_amount as numeric) as total_amount

    from source

)

select * from renamed

Key Takeaway: The Staging layer is about cleaning and standardizing. It prepares the raw ingredients for the cooking (transformation) that happens next.

In the next post, we will tackle Models & Materializations—the core logic of Analytics Engineering.

Dbt Project Setup: Structure, Sources, and Staging

1. Anatomy of a dbt Project

dbt_project.yml

Key Directories

2. Organizing Models: The 3 Layers

1. Staging (models/staging)

2. Intermediate (models/intermediate)

3. Marts (models/marts)