Exactly Once Is a Lie (And What to Do About It)
Caution
SEMANTIC AUDIT IN PROGRESS: “Exactly once” guarantee detected in vendor documentation. Cross-referencing with distributed systems literature. Discrepancy identified. Initiating accuracy protocol.
Every messaging system makes a promise about how many times each message will be delivered. The options are:
- At most once: the message may be delivered zero or one time. It could be lost. It will never be duplicated.
- At least once: the message will definitely be delivered. It may be delivered more than once. It will never be lost.
- Exactly once: the message will be delivered precisely one time. No loss. No duplication.
Exactly once sounds like the obvious choice, and that is exactly why delivery guarantees are one of the most important topics in streaming systems: exactly once requires significant infrastructure to implement, carries real performance costs, and is still not perfectly achievable across all system boundaries.
🔁 Why Duplicates Happen
Before looking at solutions, understand the failure mode.
A Kafka consumer reads a message and processes it. Before it can commit (acknowledge) the offset back to Kafka, it crashes.
1. The consumer reads the event from the topic.
2. It writes the event to the database. The write succeeds.
3. It crashes before the offset commit reaches Kafka.
4. On restart, the consumer resumes from the last committed offset and receives the same event again.
5. It processes the event and writes it to the database a second time.
The event has been written to the database twice. From the consumer’s perspective, it has no way to know the first write succeeded—it crashed before it could confirm.
This is not a Kafka bug. It is a fundamental reality of distributed systems: networks can be interrupted at any moment, and there is no way to atomically link the acknowledgement to an external side effect.
🎛️ At Most Once: Accept the Loss
The simplest config: commit offsets before processing.
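A minimal sketch of the pattern with the confluent_kafka Python client; the broker address, topic, group id, and `process` function are placeholders, not vendor code:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "at-most-once-demo",
    "enable.auto.commit": False,            # we commit manually
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])              # placeholder topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Commit first: if we crash during processing, the event is gone.
    consumer.commit(message=msg, asynchronous=False)
    process(msg.value())                    # hypothetical processing step
```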
If the consumer crashes between the commit and the write, the event is lost. This is acceptable for use cases where losing occasional events is tolerable: log collection, non-critical metrics, analytics where approximate counts are sufficient.
🔂 At Least Once: The Default Streaming Mode
The safer default: commit offsets only after successful processing.
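The same sketch with the two steps swapped (same placeholder names as above):

```python
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())  # the side effect happens first
    # Commit only after the write succeeds. A crash between these two
    # lines means the event is redelivered and processed again.
    consumer.commit(message=msg, asynchronous=False)
```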
If the consumer crashes after writing but before committing, the event is reprocessed. We get duplicates. But we never lose data.
At least once is the default posture for most streaming pipelines. The implication is that your downstream sink must be able to handle duplicate writes gracefully.
🎯 Exactly Once: The Full Machinery
True exactly once—no loss, no duplicates—requires three coordinated components (the first two are sketched in code below):
- Idempotent producer: Kafka ensures that if the same message is sent twice due to a retry, the broker deduplicates it.
- Transactional producer: Multiple messages can be written atomically to Kafka—all succeed or all fail.
- Flink checkpointing aligned with Kafka offset commits: the processor’s state and the consumer offset are snapshotted atomically.
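The first two components live in producer configuration plus a small transactional API. A sketch with confluent_kafka (broker, topic, and transactional id are placeholders):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "enable.idempotence": True,             # broker deduplicates producer retries
    "transactional.id": "pipeline-1",       # enables atomic multi-message writes
})
producer.init_transactions()

producer.begin_transaction()
producer.produce("output-topic", b"event-a")
producer.produce("output-topic", b"event-b")
producer.commit_transaction()  # both records become visible, or neither does
```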
In Flink: Two-Phase Commit
Flink implements exactly once end-to-end using a two-phase commit protocol between its checkpointing mechanism and the sink connector.
When a checkpoint begins:
1. Flink’s Kafka source records the current Kafka offsets in the checkpoint.
2. The Kafka sink connector begins a transaction for all records produced since the last checkpoint.

When the checkpoint completes:
3. The Kafka sink commits the transaction, and the records become visible to downstream consumers.

If the checkpoint fails:
4. The transaction is aborted. The records are never visible. The job restarts from the previous checkpoint.
This ensures that records appear in the output topic if and only if the corresponding Kafka input offsets have been checkpointed.
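In PyFlink, the switch is a checkpointing mode. A minimal configuration sketch (the 60-second interval is illustrative, and the Kafka sink must separately be configured for exactly-once delivery):

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
# Snapshot operator state and Kafka offsets every 60 seconds. Transactional
# output only becomes visible when one of these checkpoints commits.
env.enable_checkpointing(60_000)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
```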
The Cost of Exactly Once
All of this machinery has a cost, and the performance profile matters in practice:
| Guarantee | Latency | Throughput | Complexity |
|---|---|---|---|
| At most once | Lowest | Highest | Trivial |
| At least once | Low | High | Low |
| Exactly once | Higher | Lower (~20-30% overhead typical) | High |
Exactly once also ties end-to-end latency to the checkpoint interval: transactional output becomes visible only when a checkpoint commits, and checkpoints are kept relatively infrequent to amortize the cost of transaction coordination. For many use cases, that latency penalty is unacceptable.
🏥 The Practical Alternative: Idempotent Sinks
In most production streaming systems, the engineering answer is not exactly-once semantics—it is at least once delivery with an idempotent sink.
An idempotent operation produces the same result whether applied once or a hundred times. If your sink is designed to handle duplicate writes without producing duplicate data, you get the effective behaviour of exactly once without the overhead of distributed transactions.
UPSERT as Idempotency
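A runnable sketch using Python’s built-in sqlite3 (table and column names are illustrative); the UPSERT is the `INSERT ... ON CONFLICT DO UPDATE` statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_revenue (
        day        TEXT PRIMARY KEY,
        total_usd  REAL NOT NULL
    )
""")

# Replacement semantics: write the final value, keyed on the day.
upsert = """
    INSERT INTO daily_revenue (day, total_usd)
    VALUES (?, ?)
    ON CONFLICT (day) DO UPDATE SET total_usd = excluded.total_usd
"""
for _ in range(10):  # the same event delivered ten times
    conn.execute(upsert, ("2024-06-01", 1234.56))

print(conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone())  # (1,)
```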
Running this statement ten times produces one row—the final values. This works because we are overwriting with the same values, not accumulating.
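By contrast, an accumulating write is not idempotent. A hypothetical counter-example in the same setup:

```python
# NOT idempotent: every duplicate delivery inflates the total.
conn.execute(
    "UPDATE daily_revenue SET total_usd = total_usd + ? WHERE day = ?",
    (1234.56, "2024-06-01"),
)
```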
The key is to use replacement semantics (overwrite with final value) rather than accumulation semantics (add to running total) when designing for idempotency.
Deduplication with a Unique Constraint
For event-level sinks where each event should appear exactly once:
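A sketch in the same sqlite3 setup (the event fields are illustrative):

```python
conn.execute("""
    CREATE TABLE events (
        event_id  TEXT PRIMARY KEY,  -- deduplication key
        payload   TEXT NOT NULL
    )
""")

insert = """
    INSERT INTO events (event_id, payload)
    VALUES (?, ?)
    ON CONFLICT (event_id) DO NOTHING
"""
for _ in range(3):  # the same event delivered three times
    conn.execute(insert, ("evt-42", '{"amount_usd": 10}'))

print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (1,)
```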
The event ID becomes your deduplication key. The database enforces uniqueness. Duplicate deliveries from your Kafka consumer are absorbed without side effects.
🧬 Choosing Your Guarantee
Use this decision framework for each pipeline you build:
- Can you tolerate occasional loss (logs, sampled metrics, approximate analytics)? Use at most once.
- Can your sink absorb duplicates with an UPSERT or a unique key? Use at least once with an idempotent sink.
- Is the side effect irreversible, so a duplicate has real cost (payments, notifications)? Pay for exactly once.
For the vast majority of data engineering use cases—aggregation pipelines, analytical dashboards, feature stores—at least once with UPSERT sinks is the correct and practical choice. Save exactly once for payment processing, notification dispatch, and other irreversible side effects.
🔬 What the Industry Actually Does
At Uber, the real-time analytics platform (Apache Pinot + Kafka) uses at least once delivery with time-series based deduplication. At LinkedIn (where Kafka was born), the standard recommendation for data pipelines is at least once with idempotent consumers. Netflix’s Flink deployments use exactly once selectively—for financial settlement pipelines and billing—and at least once for their analytics workloads.
The pattern is consistent: match the delivery guarantee to the cost of the duplication. If a duplicate record in your dashboard causes a minor inaccuracy, at least once is fine. If a duplicate triggers a double charge, invest in exactly once.