From API to Warehouse in Minutes: Building a dlt Pipeline With AI
Goal: Build a complete, working data pipeline — from the Open Library API to a local DuckDB database — using
dlt, an agentic IDE (Cursor), and the dlt MCP server. You write one prompt; the AI writes the code.
1. What You’ll Build
By the end of this workshop, you’ll have:
- A working dlt pipeline that fetches book data from the Open Library API (no API key required).
- Normalized relational tables stored in a local DuckDB database.
- An interactive dashboard to query and inspect your pipeline’s state and data.
- Optionally, a marimo notebook with visualizations powered by ibis.
The key insight of this workshop: you won’t write the pipeline code by hand. Instead, you’ll use an AI agent — given access to dlt documentation via an MCP server — to generate, run, and debug the pipeline from a natural language prompt.
2. Prerequisites
Before starting, make sure you have:
| Requirement | Details |
|---|---|
| Agentic IDE | Cursor (recommended), Windsurf, or VS Code + GitHub Copilot |
| Python 3.11+ | python --version should return 3.11 or higher |
| uv | Fast Python package manager (recommended over pip) |
| dlt | Will be installed in the steps below |
Install uv if you don’t have it:
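A common way to install it (both commands are from the uv documentation; pick the one that fits your setup):

```shell
# Standalone installer (macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or via pip
pip install uv
```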
3. Setting Up the Project
Step 1: Create a New Project Folder
Open your terminal and create a fresh directory for the pipeline:
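For example (the folder name here is just a suggestion):

```shell
# Create and enter a fresh project directory
mkdir open-library-pipeline
cd open-library-pipeline
```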
Then open this folder in Cursor (or your preferred agentic IDE).
Step 2: Configure the dlt MCP Server
The dlt MCP server gives your AI agent access to dlt documentation, code examples, and your pipeline metadata — turning it into a true data engineering assistant.
In Cursor: Go to Settings → Tools & MCP → New MCP Server and add:
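The configuration has this shape — note that the command and arguments below are placeholders, not the real invocation; copy the actual values from the dlt MCP documentation:

```json
{
  "mcpServers": {
    "dlt": {
      "command": "<command-from-dlt-mcp-docs>",
      "args": ["<args-from-dlt-mcp-docs>"]
    }
  }
}
```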
In VS Code (Copilot): Create .vscode/mcp.json in your project folder with the same configuration structure (replacing mcpServers with servers).
In Claude Code: Run in your terminal:
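Assuming Claude Code's standard claude mcp add syntax (the server command itself is a placeholder here — take it from the dlt MCP docs):

```shell
claude mcp add dlt -- <command-from-dlt-mcp-docs>
```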
Once configured, the AI agent can answer questions like “What tables are in my pipeline?” or “Show me the schema for the books table” — directly from your editor chat.
Step 3: Install dlt
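With uv, one way to do this is to create a virtual environment and install dlt with the duckdb extra (which this workshop's destination needs):

```shell
# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install dlt with DuckDB support
uv pip install "dlt[duckdb]"
```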
Step 4: Initialize the dlt Project
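A typical init command uses dlt's rest_api source template with DuckDB as the destination (the exact template used in this workshop may differ — check the workshop materials):

```shell
dlt init rest_api duckdb
```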
This scaffolds your project with:
- A pre-configured pipeline template for the Open Library API.
- A duckdb destination configuration.
- Sample prompts you can use with your AI agent.
You now have everything ready to start prompting.
4. Building the Pipeline with AI
Step 5: Prompt the Agent to Build and Run the Pipeline
This is where the magic happens. Open the AI chat in your IDE and use a prompt like:
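An example prompt (the exact wording is up to you; the Open Library search endpoint shown is a real, keyless API):

“Build a dlt pipeline in open_library_pipeline.py that fetches books from https://openlibrary.org/search.json for the query ‘data engineering’, loads them into a local DuckDB database, and then run it. Use the dlt docs available through the MCP server.”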
The agent will:
- Read the OpenAPI spec and dlt documentation via the MCP server.
- Generate the pipeline code in open_library_pipeline.py.
- Run the pipeline.
- Load data into your local DuckDB database.
All from a single prompt — no manual coding required.
Step 6: Debug with the Agent
If errors occur during the pipeline run, simply paste the error message into the chat. The AI agent will diagnose the issue and apply a fix. This tight feedback loop — prompt → run → error → fix — is the core productivity advantage of AI-assisted data engineering.
Common issues and how the agent handles them:
- API pagination errors: Agent adjusts the pagination config.
- Schema type conflicts: Agent adds explicit type hints to the resource.
- Missing dependencies: Agent updates requirements.txt and re-runs.
5. Inspecting Your Data
Step 7: Launch the dlt Dashboard
Once the pipeline runs successfully, launch the built-in dashboard:
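The command is dlt pipeline &lt;pipeline_name&gt; show; assuming your pipeline is named open_library:

```shell
dlt pipeline open_library show
```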
This opens a local web application where you can:
- View the pipeline run history and state.
- Explore the inferred schema: tables, columns, and data types.
- Query the loaded data directly in the browser.
- Debug issues with run-level metadata.
The dashboard is especially useful for understanding how dlt normalized nested JSON from the API into flat relational tables.
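To build intuition for that normalization, here is a highly simplified, stdlib-only sketch — this is not dlt's actual implementation (the real normalizer also generates _dlt_id/_dlt_parent_id linking keys, handles nested dicts, and infers types), but it shows the core idea of splitting lists into child tables:

```python
def normalize(record: dict, table: str) -> dict[str, list[dict]]:
    """Split one nested JSON record into a parent row plus child tables,
    roughly mimicking how dlt flattens lists into separate tables."""
    tables: dict[str, list[dict]] = {table: []}
    row = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Lists become a child table named <parent>__<field>
            tables[f"{table}__{key}"] = [{"value": item} for item in value]
        else:
            row[key] = value
    tables[table].append(row)
    return tables


tables = normalize(
    {"title": "The Pragmatic Programmer", "author_name": ["Hunt", "Thomas"]},
    table="books",
)
print(sorted(tables))  # → ['books', 'books__author_name']
```

A book with two authors thus yields one row in books and two rows in books__author_name, which is exactly the parent/child table pattern you will see in the dashboard.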
Step 8: Query Pipeline Metadata via Chat
With the dlt MCP server active, you can ask the AI questions about your running pipeline directly from the IDE chat:
“What tables were created in the pipeline?”
“Show me the schema for the books table.”
“How many rows were loaded in the last run?”
The agent has access to your pipeline’s live metadata and will return accurate, up-to-date answers — no SQL required.
6. Bonus: Visualizations with marimo + ibis
Step 9: Build an Interactive Report
Want to take the analysis further? Prompt the agent to create an interactive visualization notebook:
“Create a marimo notebook that visualizes the top 10 authors by book count. Use ibis for data access. Reference: https://dlthub.com/docs/general-usage/dataset-access/marimo”
By including the documentation link, the agent uses the correct API patterns. Once generated, run the notebook:
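Assuming the agent saved the notebook as top_authors.py (the filename is illustrative), open it with marimo:

```shell
marimo edit top_authors.py
```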
marimo is a reactive Python notebook framework, and ibis is a portable dataframe API that works across DuckDB, BigQuery, and Snowflake — a powerful pair for data exploration.
7. Key Takeaways
This workshop demonstrates a shift in how data pipelines are built:
- You are the architect: You define what data you want, from which source, and where it should go.
- The AI is the implementer: It writes the boilerplate extraction code, handles API specifics, and iterates quickly on errors.
- dlt is the foundation: It provides reliable, well-documented patterns that AI agents can follow consistently — schema management, incremental loading, and destination adapters all come for free.
The result is a dramatically shorter feedback loop from “I need this data in my warehouse” to “the pipeline is running in production”.
This post is part of the Data Engineering Zoomcamp 2026 workshop series. For the full course, visit DataTalksClub. For dlt documentation, visit dlthub.com/docs.