Introduction to Modern Data Platforms & Bruin Basics
Goal: Understand the need for unified data platforms in the modern data stack and learn the fundamental concepts of Bruin: Projects, Pipelines, Assets, Variables, and Commands.
1. The Modern Data Stack & The Need for Data Platforms
A typical modern data stack involves several distinct components to move data from source to destination and make it useful for the business:
- Ingestion (Extract/Load): Moving data from third-party sources or operational databases into a data warehouse or data lake.
- Transformation: Cleaning raw data, joining tables, creating reports, and pushing the final results to BI tools.
- Orchestration: Scheduling scripts and services, managing dependencies, and ensuring tasks run in the correct order.
- Data Quality & Governance: Validating data accuracy, completeness, and consistency before delivering it to consumers.
The Problem: Tool Sprawl
Historically, teams had to use a different specialized tool for each of these layers (e.g., Fivetran for ingestion, dbt for transformation, Airflow for orchestration, Great Expectations for data quality). This fragmentation meant data teams had to spend an enormous amount of time configuring, integrating, and maintaining these completely separate tools.
The Solution: A Unified Data Platform
A Data Platform brings all of these capabilities under one roof. Bruin is an end-to-end data platform that combines ingestion, transformations, orchestration, data quality checks, metadata, and lineage into a single unified tool.
Instead of writing custom integrations across five different tools, Bruin lets you keep your code, configurations, dependencies, and quality checks all in the same place. You don’t need a massive DevOps or infrastructure team to build a working pipeline.
2. Bruin Core Concepts
To work effectively with Bruin, you only need to understand five fundamental concepts.
1. Projects
A Project is the root directory that contains your entire data pipeline. It is initialized using the `bruin init` command.
The heart of the project is the .bruin.yml file, which defines your environments (e.g., default, production) and your database connections (e.g., DuckDB, BigQuery, Snowflake).
- Best Practice: The `.bruin.yml` file should stay local and be added to `.gitignore` to protect your secrets.
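As a rough sketch, a minimal `.bruin.yml` for a DuckDB-backed project might look like the following. The environment name, connection name, and file path are illustrative assumptions; check the Bruin documentation for the exact schema your version expects.

```yaml
# .bruin.yml (sketch) -- keep this file out of version control
environments:
  default:
    connections:
      duckdb:
        - name: duckdb-default      # hypothetical connection name
          path: ./data/warehouse.db # local DuckDB database file
```

Assets and pipelines then refer to the connection by its name (`duckdb-default` here), so swapping environments does not require touching any asset code.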
2. Pipelines
A Pipeline is a logical grouping of assets based on their execution schedule.
- Every pipeline has a single schedule (e.g., `daily`, `monthly`, or a cron expression).
- Each pipeline has its own folder containing a `pipeline.yml` file. This is where you configure the specific connections and custom variables required for that schedule.
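To make this concrete, here is a sketch of a `pipeline.yml`. The pipeline name, connection name, and variable are hypothetical; the exact keys may differ between Bruin versions, so treat this as a shape rather than a copy-paste template.

```yaml
# pipeline.yml (sketch)
name: taxi_pipeline        # hypothetical pipeline name
schedule: daily            # or "monthly", or a cron expression

default_connections:
  duckdb: duckdb-default   # must match a connection in .bruin.yml

variables:                 # custom variables (see the Variables section)
  experiment_group:
    type: string
    default: control
```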
3. Assets
An Asset is a single file that performs a specific task, almost always related to creating or updating a table/view in the destination database.
- Types of Assets:
- Python: Used for ingestion from APIs or complex data processing.
- SQL: Used for transformations and aggregations in the warehouse.
- YAML/Seed: Used for loading static reference data (like CSV files).
- Lineage & Dependencies: Assets automatically define dependencies based on what they read. If a SQL asset queries a table created by a Python asset, Bruin infers the dependency and orchestrates them in the correct order.
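The lineage behavior above is easiest to see in a SQL asset. The sketch below assumes Bruin's `/* @bruin ... @bruin */` header comment for asset metadata; the asset and table names are hypothetical. Because the query reads from `raw.trips`, Bruin can infer that this asset depends on whichever asset produces that table and schedule it afterwards.

```sql
/* @bruin
name: reports.daily_trips
type: duckdb.sql
materialization:
  type: table
@bruin */

-- Aggregates a table produced by an upstream (e.g., Python ingestion) asset.
select
    pickup_date,
    count(*) as trip_count
from raw.trips
group by pickup_date
```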
4. Variables
Variables allow you to parameterize your pipelines with dynamic values at runtime.
- Built-in Variables: Provided automatically by Bruin based on the schedule, such as `start_date` and `end_date`. These can be directly injected into SQL using Jinja templating (e.g., `{{ start_date }}`).
- Custom Variables: Defined in the `pipeline.yml` file and passed as environment variables (`BRUIN_VAR_...`) or injected into scripts. These are perfect for A/B testing or date-based partitioning.
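Inside a Python asset, reading a custom variable is a one-liner. The sketch below assumes a hypothetical `experiment_group` variable defined in `pipeline.yml`, exposed to the script as the environment variable `BRUIN_VAR_experiment_group` per the prefix described above; the discount values are illustrative.

```python
import os

# Bruin exposes custom variables as BRUIN_VAR_-prefixed environment
# variables; fall back to the default when running the script standalone.
group = os.environ.get("BRUIN_VAR_experiment_group", "control")

def discount_for(group: str) -> float:
    """Map an A/B test variant to a discount rate (illustrative values)."""
    rates = {"control": 0.0, "treatment": 0.10}
    return rates.get(group, 0.0)

print(f"variant={group} discount={discount_for(group)}")
```

The same pattern works for date-based partitioning: read the variable once at the top of the script, then use it to filter or route the data being processed.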
5. Commands
The Bruin CLI provides all the commands needed to interact with your project:
- `bruin init [template] [name]`: Initialize a new project.
- `bruin validate [path]`: Check for configuration issues or circular dependencies before running.
- `bruin run [path]`: Execute a pipeline or a specific asset. You can use flags like `--downstream` or `--start-date` to customize the run.
- `bruin lineage [path]`: Visualize the dependency graph between assets.
- `bruin query`: Run ad-hoc queries against your configured connections.
In the next post, we will put these concepts into practice by building a complete end-to-end data pipeline with Bruin and DuckDB to process NYC Taxi data.