Analytics Engineering Basics & Introduction to dbt

Goal: Understand the emergence of Analytics Engineering, the shift from ETL to ELT, and how dbt brings software engineering best practices to data transformation.


1. The Rise of Analytics Engineering

The data landscape has shifted dramatically in recent years, creating a gap that traditional roles (Data Engineer, Data Analyst, Data Scientist) struggled to fill.

What Changed?

  1. Cloud Data Warehouses: Modern warehouses like Snowflake, BigQuery, and Redshift made storage and compute cheap and scalable. You don’t need to be surgical about what you load anymore.
  2. Modern Data Stack: Tools like Fivetran and Stitch solved the “Extract & Load” (EL) problem, automating data ingestion.
  3. SQL-First BI: Tools like Looker introduced version control and modeling layers directly into BI.

The Problem: The Gap

  • Data Engineers were great at building infrastructure but often disconnected from business logic.
  • Data Analysts knew the business logic but often lacked software engineering discipline (version control, testing, CI/CD).
  • Result: “Spaghetti SQL,” broken dashboards, and lack of trust in data.

The Solution: The Analytics Engineer

The Analytics Engineer sits at the intersection of these roles. They apply Software Engineering best practices to Data Analytics.

  • Focus: Writing clean, modular, tested, and version-controlled code to model data.
  • Tooling: dbt, SQL, git, CI/CD.

2. ETL vs. ELT: A Paradigm Shift

ETL (Extract -> Transform -> Load)

  • Old School: Transform data before loading it into the warehouse.
  • Pros: Storage efficient (important when storage was expensive).
  • Cons: Rigid, slow to adapt. Because only the transformed output reaches the warehouse, adding a new field usually means changing the pipeline and re-running it against the source.

ELT (Extract -> Load -> Transform)

  • New Standard: Load raw data immediately into the warehouse, then transform it there.
  • Pros:
    • Agility: Raw data is always available. You can change transformations retroactively without reloading data.
    • Power: Leverages the massively parallel processing (MPP) compute of modern cloud data warehouses.
  • Role: dbt lives entirely in the T (Transform) of ELT.
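
To make the "Agility" point concrete, here is a minimal sketch of the ELT pattern in Python. It assumes an in-memory SQLite database as a stand-in for a cloud warehouse; the table raw_orders, its columns, and the sample rows are invented for illustration. The raw data is loaded once, and the transformation is a view that can be redefined later without reloading anything.

```python
import sqlite3

# An in-memory SQLite database stands in for the cloud warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows untouched (amounts still in cents, as text).
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "1250", "completed"), (2, "300", "returned"), (3, "4999", "completed")],
)

# Transform: the cleanup happens inside the warehouse, as a view over the raw data.
conn.execute("""
    CREATE VIEW orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd
    FROM raw_orders
    WHERE status = 'completed'
""")
first_version = conn.execute(
    "SELECT id, amount_usd FROM orders ORDER BY id"
).fetchall()

# New requirement: keep returned orders and expose the status column.
# Only the transformation changes; nothing is re-extracted or reloaded.
conn.execute("DROP VIEW orders")
conn.execute("""
    CREATE VIEW orders AS
    SELECT id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd, status
    FROM raw_orders
""")
second_version = conn.execute(
    "SELECT id, amount_usd, status FROM orders ORDER BY id"
).fetchall()

print(first_version)
print(second_version)
```

In an ETL pipeline the returned order would have been filtered out before loading, so the second version would require going back to the source system.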

3. Data Modeling: Kimball’s Legacy

We follow Kimball’s Dimensional Modeling approach to structure data in the warehouse. The goal is data that is understandable to business users and performant for queries.

The Star Schema

  1. Fact Tables (Verbs): Measurements and metrics. Things that happened.
    • Examples: sales, orders, page_views, trips.
    • High volume, highly dynamic.
  2. Dimension Tables (Nouns): Context and descriptive attributes.
    • Examples: customers, products, locations, dates.
    • Low volume, static or slowly changing.
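
A small sketch may help show how facts and dimensions fit together. It again assumes SQLite as a stand-in warehouse; the names fct_orders and dim_customers and the sample rows are hypothetical. The fact table holds the measure (amount) plus a key, the dimension holds the descriptive context (region), and a typical business question is answered by joining the two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dimension table (noun): one row per customer, descriptive attributes only.
conn.execute(
    "CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO dim_customers VALUES (?, ?, ?)",
    [(1, "Ada", "EMEA"), (2, "Grace", "AMER")],
)

# Fact table (verb): one row per order, with a key into the dimension and a measure.
conn.execute(
    "CREATE TABLE fct_orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO fct_orders VALUES (?, ?, ?)",
    [(100, 1, 20.0), (101, 1, 5.0), (102, 2, 12.5)],
)

# "Revenue by region": aggregate the fact, sliced by a dimension attribute.
revenue_by_region = conn.execute("""
    SELECT d.region, SUM(f.amount) AS revenue
    FROM fct_orders AS f
    JOIN dim_customers AS d ON d.customer_id = f.customer_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()

print(revenue_by_region)
```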

The Kitchen Analogy

  • Staging (The Pantry): Raw ingredients (data). Access restricted to cooks (Analytics Engineers).
  • Processing (The Kitchen): Where ingredients are chopped, combined, and cooked (Transformed).
  • Presentation (The Dining Room): The final plated dish. Clean, organized, and ready for the customer (Business Users/BI Tools) to consume.

4. Enter dbt (data build tool)

dbt is the tool that enables Analytics Engineering. It focuses purely on the Transformation layer.

What dbt Does

  • Compiles & Runs: You write SELECT statements in SQL (augmented with Jinja). dbt compiles them, wraps them in CREATE VIEW ... AS or CREATE TABLE ... AS statements, and runs them in your warehouse.
  • Software Engineering Practices:
    • Modularity: Use ref() to reference other models. Build complex logic from simple, reusable blocks.
    • Testing: Define tests in YAML, such as unique, not_null, accepted_values, and relationships (referential integrity).
    • Documentation: Auto-generate documentation and lineage graphs from your code.
    • Version Control: Everything is code (SQL + YAML), managed in Git.
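
The following toy sketch illustrates the idea behind ref() and tests. It is not dbt's implementation: the regex-based compile_model helper, the model names, and the sample data are all assumptions made for illustration, and real dbt uses a full Jinja renderer and also resolves databases and schemas. What it does show is the general shape: a model is just a SELECT, ref() is replaced with a relation name, and a test is a query that passes when it returns no rows.

```python
import re
import sqlite3

# Hypothetical model files: model name -> SELECT statement with dbt-style ref() calls.
models = {
    "stg_orders": "SELECT id AS order_id, customer_id, amount FROM raw_orders",
    "customer_totals": (
        "SELECT customer_id, SUM(amount) AS total "
        "FROM {{ ref('stg_orders') }} GROUP BY customer_id"
    ),
}

# Simplified stand-in for Jinja rendering: matches {{ ref('model_name') }} only.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'(\w+)'\s*\)\s*\}\}")


def compile_model(name):
    """Replace each ref() with the referenced relation name and wrap in CREATE VIEW."""
    select_sql = REF_PATTERN.sub(lambda m: m.group(1), models[name])
    return f"CREATE VIEW {name} AS {select_sql}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10, 20.0), (2, 10, 5.0), (3, 11, 12.5)],
)

# Build in dependency order (hard-coded here to keep the sketch short).
for name in ["stg_orders", "customer_totals"]:
    conn.execute(compile_model(name))

# Tests expressed as queries that select failing rows; an empty result means "pass".
unique_failures = conn.execute(
    "SELECT customer_id FROM customer_totals GROUP BY customer_id HAVING COUNT(*) > 1"
).fetchall()
not_null_failures = conn.execute(
    "SELECT * FROM customer_totals WHERE customer_id IS NULL"
).fetchall()

compiled = compile_model("customer_totals")
print(compiled)
print(unique_failures, not_null_failures)
```

Because customer_totals only names stg_orders through ref(), the dependency between the two models is visible in the code itself, which is what makes lineage graphs and ordered builds possible.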

How it Works

When you run dbt run:

  1. dbt compiles your code (resolves dependencies and references).
  2. It constructs a DAG (Directed Acyclic Graph) of execution order.
  3. It pushes the compiled SQL to your warehouse (e.g., Snowflake or BigQuery), which executes it.
  4. Each model is materialized as a view or a table, depending on its configuration.
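
The ordering step can be sketched with Python's standard-library graphlib module. This is only an illustration of how a DAG yields an execution order, not how dbt is implemented internally; the model names and the dependencies mapping are hypothetical, standing in for what would be derived from ref() calls.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical project: model name -> set of models it references via ref().
dependencies = {
    "stg_customers": set(),
    "stg_orders": set(),
    "dim_customers": {"stg_customers"},
    "fct_orders": {"stg_orders", "dim_customers"},
}

# A topological sort yields every model after all of the models it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)

# "Acyclic" matters: if two models reference each other, no valid order exists.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
print(cycle_detected)
```

The exact position of independent models (here, the two staging models) is not significant; what the sort guarantees is that upstream models are built before anything that selects from them.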

Cloud vs. Core

  • dbt Core: Open-source, CLI-based. You manage orchestration (Airflow, etc.).
  • dbt Cloud: SaaS platform. Includes IDE, scheduler, CI/CD, and hosted documentation. Recommended for most teams.

In the next post, we will dive into the dbt Project Structure and how to set up your first project.