Lambda, Kappa, and the Architecture You Actually Need
Note
ARCHITECTURE REVIEW BOARD: Two competing system designs detected. Lambda: dual-pathway parallel replication. Kappa: single continuous stream. Initiating comparative analysis.
This is the final post in the From Batch to Streaming series. We have covered why batch breaks down, how the three-component streaming stack works, how state and checkpoints provide fault tolerance, how watermarks handle out-of-order events, and how delivery guarantees determine the risk profile of your pipeline.
Now we step back and look at the whole picture: the architectural patterns that govern how batch and streaming coexist (or don’t) in a real data platform.
Two patterns dominate the literature. One was born of necessity; the other was formulated as a direct critique of the first.
Λ The Lambda Architecture
The Lambda architecture was described by Nathan Marz in 2011 and became widely adopted in the years when streaming infrastructure was immature. It solves the problem of how to serve both accurate historical data and low-latency near-real-time data simultaneously.
The core insight: run two parallel systems.
[Diagram: incoming events fan out to a batch layer (periodic full reprocessing) and a speed layer (low-latency incremental processing); a serving layer merges both views for queries.]
A query for “last 90 days revenue by zone” is served by the batch layer. A query for “revenue in the last 5 minutes by zone” is served by the speed layer. The serving layer merges them—batch results for the bulk of history, speed layer results for the recent window not yet processed by batch.
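The merge step is conceptually simple. A minimal sketch (all names and numbers hypothetical): batch results cover history up to the last batch run, the speed view covers only the window since then, so per-zone totals simply add.

```python
def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Serving-layer merge: batch covers history up to the last batch run;
    the speed layer covers only the window since then, so per-zone
    revenue totals are summed."""
    merged = dict(batch_view)
    for zone, revenue in speed_view.items():
        merged[zone] = merged.get(zone, 0.0) + revenue
    return merged

# Batch view: complete up to last night's run.
batch_view = {"eu-west": 120_000.0, "us-east": 250_000.0}
# Speed view: events since that run, not yet reprocessed by batch.
speed_view = {"eu-west": 450.0, "ap-south": 90.0}

print(merge_views(batch_view, speed_view))
# {'eu-west': 120450.0, 'us-east': 250000.0, 'ap-south': 90.0}
```

The hard part, as the next section shows, is not the merge itself but keeping the two inputs consistent.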
Why Lambda Made Sense
In 2011–2015, streaming systems were unreliable and difficult to operate. Kafka existed but was young. Flink did not yet exist. Storm was the dominant streaming engine—powerful but operationally complex.
The batch layer provided a reliable safety net: every night, reprocess everything from scratch. Any streaming errors would be overwritten by the next batch run. The system was self-healing by design.
The Lambda Problem
Lambda requires you to implement your business logic twice: once in a batch framework (Spark, dbt) and once in a streaming framework (Storm, Flink). Two codebases. Two engineering teams. Two debugging paths. Two failure modes to manage.
When the batch and speed layers produce slightly different numbers—due to different data access patterns, different timestamp handling, different rounding—the serving layer must arbitrate. The query results are inconsistent depending on which layer is consulted.
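The divergence is easy to reproduce. A toy sketch (hypothetical logic, not tied to any specific framework): the same revenue aggregation written twice, differing only in where rounding happens, yields two different "correct" answers.

```python
def batch_revenue(events):
    """Batch-style: sum raw amounts, round once at the end."""
    totals = {}
    for e in events:
        totals[e["zone"]] = totals.get(e["zone"], 0.0) + e["amount"]
    return {zone: round(total, 2) for zone, total in totals.items()}

def speed_revenue(events):
    """Streaming-style: round each amount on ingest, then accumulate."""
    totals = {}
    for e in events:
        rounded = round(e["amount"], 2)
        totals[e["zone"]] = round(totals.get(e["zone"], 0.0) + rounded, 2)
    return totals

events = [{"zone": "eu-west", "amount": 0.333}] * 10
print(batch_revenue(events))  # {'eu-west': 3.33}
print(speed_revenue(events))  # {'eu-west': 3.3}
```

Ten identical events, two pipelines, a 0.03 disagreement — and a serving layer left to explain it.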
The operational cost is real. Jay Kreps wrote the canonical critique in 2014: “Questioning the Lambda Architecture”. His question: if your streaming layer were reliable enough, why would you need the batch layer at all?
κ The Kappa Architecture
Kreps (co-creator of Kafka) proposed the Kappa architecture in the same article, as the answer to his own question.
The central thesis: use only one system, processing everything as a stream.
[Diagram: events flow from the durable log through a single stream processor to the serving store; there is no batch layer.]
There is no separate batch layer. For reprocessing—when you need to fix a bug or apply a new transformation to historical data—you:
- Start a new Flink job with the corrected logic
- Point it at the beginning of the Kafka topic (rewind to offset 0)
- Let it reprocess historical events at stream speed
- Once it catches up to the current event position, swap it in as the live pipeline
[Diagram: the replacement job replays the log from offset 0 alongside the live job, then takes over once it catches up.]
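Under the hood, the swap is an offset game. A minimal pure-Python simulation of the steps above (the list stands in for the Kafka topic; all names and the "corrected logic" are hypothetical):

```python
# The durable log: offsets are list indices; the live end moves forward
# as producers append.
log = [{"zone": "eu-west", "amount": a} for a in (10.0, -3.0, 20.0, 5.0)]

def run_job(log, transform, start_offset=0):
    """Replay the log from start_offset, folding each event into state."""
    state, offset = {}, start_offset
    while offset < len(log):
        zone, value = transform(log[offset])
        state[zone] = state.get(zone, 0.0) + value
        offset += 1
    return state, offset

# Corrected logic: e.g. a bug fix that drops negative test amounts.
def corrected(event):
    return event["zone"], max(event["amount"], 0.0)

# New job starts at offset 0 and replays all history at stream speed.
state, offset = run_job(log, corrected, start_offset=0)
caught_up = offset == len(log)   # once True, swap this job in as live
print(state, caught_up)          # {'eu-west': 35.0} True
```

A real Flink job tracks its Kafka offsets in checkpoints rather than a local variable, but the shape of the operation is the same: rewind, replay, catch up, swap.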
Why Kappa Only Became Practical Later
Kappa requires:
- A durable, replayable log with sufficient retention to hold all historical data (or a bootstrapping mechanism from cold storage)
- A stream processing engine reliable and expressive enough to replace a mature batch system
- Sufficient compute to process historical data at acceptable speed when reprocessing
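The first requirement is largely a configuration decision. With Kafka's stock CLI, for example, a topic intended for Kappa-style replay can be created with time-based deletion disabled via `retention.ms=-1` (the topic name and sizing below are illustrative only):

```shell
# Create a replayable topic that never expires its history.
# retention.ms=-1 disables time-based deletion; monitor disk usage,
# or bootstrap from cold storage (e.g. S3) instead of retaining forever.
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic events.orders \
  --partitions 6 --replication-factor 3 \
  --config retention.ms=-1
```

Unbounded retention trades storage cost for replay convenience; the cold-storage bootstrap mentioned above is the usual alternative when the log would grow too large.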
By 2015–2016, Kafka had long retention support, Flink had become production-grade, and compute costs had fallen enough to make continuous streaming economically viable. The conditions that made Lambda necessary had been largely resolved.
🏥 Comparing the Two
| Dimension | Lambda | Kappa |
|---|---|---|
| Code duplication | High (implement logic twice) | Low (single codebase) |
| Operational complexity | High (two systems) | Moderate (one system, well-configured) |
| Data correctness | High (batch overwrites streaming errors) | Requires reliable streaming |
| Reprocessing | Run the batch job again | Rewind Kafka offset, new job |
| Historical query | Served by batch layer | Served by streaming replay or materialized view |
| Infrastructure cost | High (two processing stacks) | Moderate |
| Adoption maturity | Wide (many existing deployments) | Growing (modern standard) |
🔬 What the Industry Actually Runs
Lambda: Remains Common in Legacy Deployments
Many large organisations built Lambda architectures in 2013–2018 and are still running them. The cost of migrating away from a functioning system is high, and Lambda works. It is just expensive to maintain.
Netflix has operated a Lambda-inspired architecture for parts of its analytics stack. Twitter (pre-acquisition) ran significant Lambda infrastructure. LinkedIn, despite inventing Kafka, had Lambda deployments for years before consolidating.
Kappa: The Modern Default
New data platform builds in 2020+ almost universally adopt Kappa or Kappa-adjacent patterns. The infrastructure has matured enough to make the reliability argument for Lambda obsolete in most contexts.
Uber’s real-time data platform, Cloudflare’s analytics pipeline, and the streaming layers at most modern fintech companies use single-stream architectures without a parallel batch safety net.
The Hybrid Reality: “Lambda Lite”
In practice, many organisations run neither textbook Lambda nor textbook Kappa. They run something in between:
- A streaming pipeline for real-time decisions (fraud, routing, recommendations)
- A batch pipeline for historical analysis and reporting (dbt + warehouse)
- No serving layer merge: the two pipelines serve different use cases and never need to agree on the same number
This is not Lambda (no attempt to merge streaming and batch for the same query) and not Kappa (batch is still there). It is simply two systems doing two different jobs. Most data engineering teams actually operate this way, without needing to give it an architecture name.
🧬 Making the Choice
If you are designing a new data platform today:
Adopt Kappa (streaming-first) if:
- Your primary use case requires sub-minute data freshness
- You are building from scratch without legacy batch infrastructure to preserve
- Your team has sufficient streaming expertise to operate Flink reliably
- Your source events are replayable (Kafka with adequate retention, or S3 event archive for bootstrap)
Adopt Lambda (batch + streaming) if:
- You already have a mature, tested batch pipeline that serves historical analytics
- Your streaming use case is additive (you want real-time on top of existing batch, not instead of it)
- Your streaming infrastructure is not yet reliable enough to be the sole source of truth
Adopt the Hybrid (separate systems, separate use cases) if:
- Batch analytics and real-time operational decisions are genuinely distinct workloads
- You want to avoid the serving-layer complexity of merging two systems
- Your team is more comfortable operating one mature batch system and one streaming system, separately
🔚 Closing the Series
This series started with a simple question: why does batch processing have an expiry date?
The answer, after six posts, is nuanced. Batch does not expire for everyone. For many organisations, nightly ETL jobs and daily dbt runs serve the business perfectly. The midnight job will run for another decade.
But for the use cases where data latency is measured in seconds, not hours—fraud decisions, real-time recommendations, live dashboards, IoT monitoring—batch cannot do the job. A continuous stream, processed by a reliable engine with proper state management, watermarks, and delivery guarantees, is the architecture that fills the gap.
The tools are now mature enough to build it. The patterns are well-understood. The engineering challenge has shifted from “can we do this?” to “do we need to, and what will it cost to operate?”
That is the right question to be asking.
← Previous: Exactly Once is a Lie (And What to Do About It)
Series start: The Midnight Job: Why Batch Processing Has an Expiry Date →