Lambda, Kappa, and the Architecture You Actually Need
Note
ARCHITECTURE REVIEW BOARD: Two competing system designs detected. Lambda: dual-pathway parallel replication. Kappa: single continuous stream. Initiating comparative analysis.
This is the final post in the From Batch to Streaming series. We have covered why batch breaks down, how the three-component streaming stack works, how state and checkpoints provide fault tolerance, how watermarks handle out-of-order events, and how delivery guarantees determine the risk profile of your pipeline.
Now we step back and look at the whole picture: the architectural patterns that govern how batch and streaming coexist (or don’t) in a real data platform.
Two patterns dominate the literature. One was born of necessity; the other was formulated as a direct critique of the first.
Λ The Lambda Architecture
The Lambda architecture was described by Nathan Marz in 2011 and became widely adopted in the years when streaming infrastructure was immature. It solves the problem of how to serve both accurate historical data and low-latency near-real-time data simultaneously.
The core insight: run two parallel systems.
[Diagram: incoming events fan out to a batch layer (periodic full reprocessing) and a speed layer (low-latency incremental processing); a serving layer merges both views for queries.]
A query for “last 90 days revenue by zone” is served by the batch layer. A query for “revenue in the last 5 minutes by zone” is served by the speed layer. The serving layer merges them—batch results for the bulk of history, speed layer results for the recent window not yet processed by batch.
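The merge step is conceptually simple. A minimal sketch (all names and numbers hypothetical): batch results cover history up to the last batch run, the speed view covers only the window since then, so per-zone totals simply add.

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Serving-layer merge: batch covers history up to the last batch run;
    the speed layer covers only the window since then, so per-zone
    revenue totals are summed."""
    merged = dict(batch_view)
    for zone, revenue in speed_view.items():
        merged[zone] = merged.get(zone, 0.0) + revenue
    return merged

# Batch view: complete up to last night's run.
batch_view = {"eu-west": 120_000.0, "us-east": 250_000.0}
# Speed view: events since that run, not yet reprocessed by batch.
speed_view = {"eu-west": 450.0, "ap-south": 90.0}

print(merge_views(batch_view, speed_view))
# {'eu-west': 120450.0, 'us-east': 250000.0, 'ap-south': 90.0}
```

The hard part, as the next section shows, is not the merge itself but keeping the two inputs consistent.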
Why Lambda Made Sense
In 2011–2015, streaming systems were unreliable and difficult to operate. Kafka existed but was young. Flink did not yet exist. Storm was the dominant streaming engine—powerful but operationally complex.
The batch layer provided a reliable safety net: every night, reprocess everything from scratch. Any streaming errors would be overwritten by the next batch run. The system was self-healing by design.
The Lambda Problem
Lambda requires you to implement your business logic twice: once in a batch framework (Spark, dbt) and once in a streaming framework (Storm, Flink). Two codebases. Two engineering teams. Two debugging paths. Two failure modes to manage.
When the batch and speed layers produce slightly different numbers—due to different data access patterns, different timestamp handling, different rounding—the serving layer must arbitrate. The query results are inconsistent depending on which layer is consulted.
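The divergence is easy to reproduce. A toy sketch (hypothetical logic, not tied to any specific framework): the same revenue aggregation written twice, differing only in where rounding happens, yields two different "correct" answers.

```python
def batch_revenue(events):
    """Batch-style: sum raw amounts, round once at the end."""
    totals = {}
    for e in events:
        totals[e["zone"]] = totals.get(e["zone"], 0.0) + e["amount"]
    return {zone: round(total, 2) for zone, total in totals.items()}

def speed_revenue(events):
    """Streaming-style: round each amount on ingest, then accumulate."""
    totals = {}
    for e in events:
        rounded = round(e["amount"], 2)
        totals[e["zone"]] = round(totals.get(e["zone"], 0.0) + rounded, 2)
    return totals

events = [{"zone": "eu-west", "amount": 0.333}] * 10
print(batch_revenue(events))  # {'eu-west': 3.33}
print(speed_revenue(events))  # {'eu-west': 3.3}
```

Ten identical events, two pipelines, a 0.03 disagreement — and a serving layer left to explain it.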
The operational cost is real. Jay Kreps wrote the canonical critique in 2014: “Questioning the Lambda Architecture”. His question: if your streaming layer were reliable enough, why would you need the batch layer at all?
κ The Kappa Architecture
Kreps (co-creator of Kafka) proposed the Kappa architecture in the same article, as the answer to his own question.
The central thesis: use only one system, processing everything as a stream.
[Diagram: events flow from the durable log through a single stream processor to the serving store; there is no batch layer.]
There is no separate batch layer. For reprocessing—when you need to fix a bug or apply a new transformation to historical data—you:
- Start a new Flink job with the corrected logic
- Point it at the beginning of the Kafka topic (rewind to offset 0)
- Let it reprocess historical events at stream speed
- Once it catches up to the current event position, swap it in as the live pipeline
[Diagram: the replacement job replays the log from offset 0 alongside the live job, then takes over once it catches up.]
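Under the hood, the swap is an offset game. A minimal pure-Python simulation of the steps above (the list stands in for the Kafka topic; all names and the "corrected logic" are hypothetical):

```python
# The durable log: offsets are list indices; the live end moves forward
# as producers append.
log = [{"zone": "eu-west", "amount": a} for a in (10.0, -3.0, 20.0, 5.0)]

def run_job(log, transform, start_offset=0):
    """Replay the log from start_offset, folding each event into state."""
    state, offset = {}, start_offset
    while offset < len(log):
        zone, value = transform(log[offset])
        state[zone] = state.get(zone, 0.0) + value
        offset += 1
    return state, offset

# Corrected logic: e.g. a bug fix that drops negative test amounts.
def corrected(event):
    return event["zone"], max(event["amount"], 0.0)

# New job starts at offset 0 and replays all history at stream speed.
state, offset = run_job(log, corrected, start_offset=0)
caught_up = offset == len(log)   # once True, swap this job in as live
print(state, caught_up)          # {'eu-west': 35.0} True
```

A real Flink job tracks its Kafka offsets in checkpoints rather than a local variable, but the shape of the operation is the same: rewind, replay, catch up, swap.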
Why Kappa Only Became Practical Later
Kappa requires:
- A durable, replayable log with sufficient retention to hold all historical data (or a bootstrapping mechanism from cold storage)
- A stream processing engine reliable and expressive enough to replace a mature batch system
- Sufficient compute to process historical data at acceptable speed when reprocessing
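The first requirement is largely a configuration decision. With Kafka's stock CLI, for example, a topic intended for Kappa-style replay can be created with time-based deletion disabled via `retention.ms=-1` (the topic name and sizing below are illustrative only):

```shell
# Create a replayable topic that never expires its history.
# retention.ms=-1 disables time-based deletion; monitor disk usage,
# or bootstrap from cold storage (e.g. S3) instead of retaining forever.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic events.orders \
  --partitions 6 --replication-factor 3 \
  --config retention.ms=-1
```

Unbounded retention trades storage cost for replay convenience; the cold-storage bootstrap mentioned above is the usual alternative when the log would grow too large.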
By 2015–2016, Kafka had long retention support, Flink had become production-grade, and compute costs had fallen enough to make continuous streaming economically viable. The conditions that made Lambda necessary had been largely resolved.
🏥 Comparing the Two
| Dimension | Lambda | Kappa |
|---|---|---|
| Code duplication | High (implement logic twice) | Low (single codebase) |
| Operational complexity | High (two systems) | Moderate (one system, well-configured) |
| Data correctness | High (batch overwrites streaming errors) | Requires reliable streaming |
| Reprocessing | Run the batch job again | Rewind Kafka offset, new job |
| Historical query | Served by batch layer | Served by streaming replay or materialized view |
| Infrastructure cost | High (two processing stacks) | Moderate |
| Adoption maturity | Wide (many existing deployments) | Growing (modern standard) |
🔬 What the Industry Actually Runs
Lambda: Remains Common in Legacy Deployments
Many large organisations built Lambda architectures in 2013–2018 and are still running them. The cost of migrating away from a functioning system is high, and Lambda works. It is just expensive to maintain.
Netflix has operated a Lambda-inspired architecture for parts of its analytics stack. Twitter (pre-acquisition) ran significant Lambda infrastructure. LinkedIn, despite inventing Kafka, had Lambda deployments for years before consolidating.
Kappa: The Modern Default
New data platform builds in 2020+ almost universally adopt Kappa or Kappa-adjacent patterns. The infrastructure has matured enough to make the reliability argument for Lambda obsolete in most contexts.
Uber’s real-time data platform, Cloudflare’s analytics pipeline, and the streaming layers at most modern fintech companies use single-stream architectures without a parallel batch safety net.
The Hybrid Reality: “Lambda Lite”
In practice, many organisations run neither textbook Lambda nor textbook Kappa. They run something in between:
- A streaming pipeline for real-time decisions (fraud, routing, recommendations)
- A batch pipeline for historical analysis and reporting (dbt + warehouse)
- No serving layer merge: the two pipelines serve different use cases and never need to agree on the same number
This is not Lambda (no attempt to merge streaming and batch for the same query) and not Kappa (batch is still there). It is simply two systems doing two different jobs. Most data engineering teams actually operate this way, without needing to give it an architecture name.
🧬 Making the Choice
If you are designing a new data platform today:
Adopt Kappa (streaming-first) if:
- Your primary use case requires sub-minute data freshness
- You are building from scratch without legacy batch infrastructure to preserve
- Your team has sufficient streaming expertise to operate Flink reliably
- Your source events are replayable (Kafka with adequate retention, or S3 event archive for bootstrap)
Adopt Lambda (batch + streaming) if:
- You already have a mature, tested batch pipeline that serves historical analytics
- Your streaming use case is additive (you want real-time on top of existing batch, not instead of it)
- Your streaming infrastructure is not yet reliable enough to be the sole source of truth
Adopt the Hybrid (separate systems, separate use cases) if:
- Batch analytics and real-time operational decisions are genuinely distinct workloads
- You want to avoid the serving-layer complexity of merging two systems
- Your team is more comfortable operating one mature batch system and one streaming system, separately
🔚 Closing the Series
This series started with a simple question: why does batch processing have an expiry date?
The answer, after six posts, is nuanced. Batch does not expire for everyone. For many organisations, nightly ETL jobs and daily dbt runs serve the business perfectly. The midnight job will run for another decade.
But for the use cases where data latency is measured in seconds, not hours—fraud decisions, real-time recommendations, live dashboards, IoT monitoring—batch cannot do the job. A continuous stream, processed by a reliable engine with proper state management, watermarks, and delivery guarantees, is the architecture that fills the gap.
The tools are now mature enough to build it. The patterns are well-understood. The engineering challenge has shifted from “can we do this?” to “do we need to, and what will it cost to operate?”
That is the right question to be asking.
← Previous: Exactly Once is a Lie (And What to Do About It)
Series start: The Midnight Job: Why Batch Processing Has an Expiry Date →