From API to Warehouse in Minutes: Building a dlt Pipeline With AI

Goal: Build a complete, working data pipeline — from the Open Library API to a local DuckDB database — using dlt, an agentic IDE (Cursor), and the dlt MCP server. You write one prompt; the AI writes the code.


1. What You’ll Build

By the end of this workshop, you’ll have:

  1. A working dlt pipeline that fetches book data from the Open Library API (no API key required).
  2. Normalized relational tables stored in a local DuckDB database.
  3. An interactive dashboard to query and inspect your pipeline’s state and data.
  4. Optionally, a marimo notebook with visualizations powered by ibis.

The key insight of this workshop: you won’t write the pipeline code by hand. Instead, you’ll use an AI agent — given access to dlt documentation via an MCP server — to generate, run, and debug the pipeline from a natural language prompt.


2. Prerequisites

Before starting, make sure you have:

  • Agentic IDE: Cursor (recommended), Windsurf, or VS Code + GitHub Copilot.
  • Python 3.11+: python --version should return 3.11 or higher.
  • uv: a fast Python package manager (recommended over pip).
  • dlt: installed in the steps below.

Install uv if you don’t have it:

curl -LsSf https://astral.sh/uv/install.sh | sh

3. Setting Up the Project

Step 1: Create a New Project Folder

Open your terminal and create a fresh directory for the pipeline:

mkdir my-dlt-pipeline
cd my-dlt-pipeline

Then open this folder in Cursor (or your preferred agentic IDE).

Step 2: Configure the dlt MCP Server

The dlt MCP server gives your AI agent access to dlt documentation, code examples, and your pipeline metadata — turning it into a true data engineering assistant.

In Cursor: Go to Settings → Tools & MCP → New MCP Server and add:
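
A configuration mirroring the Claude Code command in this section would look roughly like the following sketch. The package name dlt-mcp and the module dlt_mcp are taken from that command and may differ across dlt versions, so check the dlt MCP docs if the server fails to start:

```json
{
  "mcpServers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with", "dlt[duckdb]",
        "--with", "dlt-mcp[search]",
        "python", "-m", "dlt_mcp"
      ]
    }
  }
}
```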

In VS Code (Copilot): Create .vscode/mcp.json in your project folder with the same configuration structure (replacing mcpServers with servers).

In Claude Code: Run in your terminal:

claude mcp add dlt -- uv run --with "dlt[duckdb]" --with "dlt-mcp[search]" python -m dlt_mcp

Once configured, the AI agent can answer questions like “What tables are in my pipeline?” or “Show me the schema for the books table” — directly from your editor chat.

Step 3: Install dlt

pip install "dlt[workspace]"

Step 4: Initialize the dlt Project

dlt init dlthub:open_library duckdb

This scaffolds your project with:

  • A pre-configured pipeline template for the Open Library API.
  • A duckdb destination configuration.
  • Sample prompts you can use with your AI agent.

You now have everything ready to start prompting.


4. Building the Pipeline with AI

Step 5: Prompt the Agent to Build and Run the Pipeline

This is where the magic happens. Open the AI chat in your IDE and use a prompt like:

Please generate a REST API Source for Open Library API, as specified in @open_library-docs.yaml
Start with endpoint(s) books and skip incremental loading for now.
Place the code in open_library_pipeline.py and name the pipeline open_library_pipeline.
If the file exists, use it as a starting point.
Do not add or modify any other files.
Use @dlt rest api as a tutorial.
After adding the endpoints, allow the user to run the pipeline with python open_library_pipeline.py and await further instructions.

The agent will:

  1. Read the OpenAPI spec and dlt documentation via the MCP server.
  2. Generate the pipeline code in open_library_pipeline.py.
  3. Run the pipeline.
  4. Load data into your local DuckDB database.

All from a single prompt — no manual coding required.
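In practice, the generated file typically centers on dlt's declarative REST API source: a config dict describing the client and its resources, passed to rest_api_source and run through a pipeline. A minimal sketch of what the agent might produce, where the endpoint path and query parameters are illustrative assumptions rather than values from the actual OpenAPI spec:

```python
# Sketch of an agent-generated pipeline using dlt's declarative REST API source.
# The endpoint path and params below are assumptions for illustration; the real
# code is driven by the spec in open_library-docs.yaml.

# Declarative source configuration: base URL plus one resource per endpoint.
source_config = {
    "client": {"base_url": "https://openlibrary.org/"},
    "resources": [
        {
            "name": "books",
            "endpoint": {
                "path": "search.json",
                "params": {"q": "python"},
            },
        },
    ],
}


def run_pipeline(config=source_config):
    # Imported lazily so the config above can be inspected without dlt installed.
    import dlt
    from dlt.sources.rest_api import rest_api_source

    pipeline = dlt.pipeline(
        pipeline_name="open_library_pipeline",
        destination="duckdb",
        dataset_name="open_library",
    )
    # run() extracts, normalizes, and loads in one call.
    return pipeline.run(rest_api_source(config))
```

The declarative dict is what makes this style AI-friendly: the agent only has to fill in configuration, not write extraction logic.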

Step 6: Debug with the Agent

If errors occur during the pipeline run, simply paste the error message into the chat. The AI agent will diagnose the issue and apply a fix. This tight feedback loop — prompt → run → error → fix — is the core productivity advantage of AI-assisted data engineering.

Common issues and how the agent handles them:

  • API pagination errors: Agent adjusts the pagination config.
  • Schema type conflicts: Agent adds explicit type hints to the resource.
  • Missing dependencies: Agent updates requirements.txt and re-runs.
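
A pagination fix, for example, usually amounts to a small change in the resource's endpoint configuration. A hedged sketch, where the offset paginator and its parameters are assumptions about the API (dlt also ships cursor- and link-based paginators, and the agent picks one from the actual response shape):

```python
# Sketch: explicitly configuring pagination for a rest_api resource.
# The "offset" paginator and its limit/offset values are assumptions
# for illustration, not taken from the Open Library spec.
books_resource = {
    "name": "books",
    "endpoint": {
        "path": "search.json",
        "paginator": {
            "type": "offset",   # one of dlt's built-in paginator types
            "limit": 100,       # page size requested per call
            "offset": 0,        # starting offset
        },
    },
}
```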

5. Inspecting Your Data

Step 7: Launch the dlt Dashboard

Once the pipeline runs successfully, launch the built-in dashboard:

dlt pipeline open_library_pipeline show

This opens a local web application where you can:

  • View the pipeline run history and state.
  • Explore the inferred schema: tables, columns, and data types.
  • Query the loaded data directly in the browser.
  • Debug issues with run-level metadata.

The dashboard is especially useful for understanding how dlt normalized nested JSON from the API into flat relational tables.
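
The naming scheme is easy to see with a toy record: a list-valued field becomes a child table named parent__field, with each child row linked back to its parent. A simplified illustration of the idea (dlt's real normalizer also adds _dlt_id/_dlt_parent_id linkage columns and infers types):

```python
# Toy illustration of dlt-style normalization: a nested JSON record is split
# into a flat parent row and a child table named "<parent>__<field>".
record = {"title": "Fluent Python", "authors": [{"name": "Luciano Ramalho"}]}

# Parent table "books": scalar fields only.
books_row = {k: v for k, v in record.items() if not isinstance(v, list)}

# Child table "books__authors": one row per list element.
# (Real dlt links rows via _dlt_parent_id, not the parent's title.)
books__authors_rows = [
    {"name": a["name"], "_parent_title": record["title"]}
    for a in record["authors"]
]
```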

Step 8: Query Pipeline Metadata via Chat

With the dlt MCP server active, you can ask the AI questions about your running pipeline directly from the IDE chat:

“What tables were created in the pipeline?”
“Show me the schema for the books table.”
“How many rows were loaded in the last run?”

The agent has access to your pipeline’s live metadata and will return accurate, up-to-date answers — no SQL required.


6. Bonus: Visualizations with marimo + ibis

Step 9: Build an Interactive Report

Want to take the analysis further? Prompt the agent to create an interactive visualization notebook:

“Create a marimo notebook that visualizes the top 10 authors by book count. Use ibis for data access. Reference: https://dlthub.com/docs/general-usage/dataset-access/marimo”

By including the documentation link, the agent uses the correct API patterns. Once generated, run the notebook:

# Development mode (editable)
marimo edit your_notebook.py

# Report mode (read-only viewer)
marimo run your_notebook.py

marimo is a reactive Python notebook framework, and ibis is a portable dataframe API that works across DuckDB, BigQuery, and Snowflake — a powerful pair for data exploration.
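
A sketch of the kind of query such a notebook might contain. The DuckDB file name and the books__authors/author_name identifiers are assumptions about the normalized schema, so check the dashboard for the real names before running it:

```python
# Sketch: top 10 authors by book count via ibis over the pipeline's DuckDB file.
# Table and column names are assumptions about the normalized schema.

def top_authors(db_path="open_library_pipeline.duckdb", limit=10):
    import ibis  # requires: pip install "ibis-framework[duckdb]"

    con = ibis.duckdb.connect(db_path)
    authors = con.table("books__authors")
    return (
        authors.group_by("author_name")
        .agg(books=authors.count())   # row count per author group
        .order_by(ibis.desc("books"))
        .limit(limit)
        .to_pandas()
    )
```

Because ibis compiles this expression to SQL, the same query would run unchanged against BigQuery or Snowflake by swapping the connection.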


7. Key Takeaways

This workshop demonstrates a shift in how data pipelines are built:

  • You are the architect: You define what data you want, from which source, and where it should go.
  • The AI is the implementer: It writes the boilerplate extraction code, handles API specifics, and iterates quickly on errors.
  • dlt is the foundation: It provides reliable, well-documented patterns that AI agents can follow consistently — schema management, incremental loading, and destination adapters all come for free.

The result is a dramatically shorter feedback loop from “I need this data in my warehouse” to “the pipeline is running in production”.


This post is part of the Data Engineering Zoomcamp 2026 workshop series. For the full course, visit DataTalksClub. For dlt documentation, visit dlthub.com/docs.