The Late Data Problem: Dead Zones, Watermarks, and Event Time
Warning
TEMPORAL ANOMALY DETECTED: Events arriving 47 seconds past window boundary. Initiating watermark assessment. Late event triage protocol engaged.
Of all the challenges that distinguish streaming from batch, the late data problem is the most philosophically interesting—and the most practically consequential.
In batch processing, the question does not arise. When the midnight job runs, all data from the previous day is present. The boundary is clear. The calculation is complete.
In a streaming system, data has no such guarantee. An event generated at 13:58:47 might not arrive at your Kafka consumer until 14:04:32. It was generated before the top of the hour. It arrived after. The 13:00–14:00 window has already closed and its results have been written to the database.
What do you do with it?
⏱️ Two Clocks, One Problem
Every streaming system must manage two separate notions of time:
Event time: when the event actually happened. This is recorded by the source—the taxi meter, the payment terminal, the IoT sensor. It is embedded in the event data.
Processing time: when the event arrives at and is processed by your streaming engine. This is the wall clock time at the moment Flink reads the message from Kafka.
|
|
In an ideal world with instant network transmission, event time and processing time are identical. In the real world, they diverge—and that divergence is the late data problem.
Why does this matter? If you aggregate by processing time, results are distorted: events that happened at 13:58 are counted in the 14:00 window because that is when they arrived. Your hourly revenue figure for 13:00–14:00 is understated, and your 14:00–15:00 figure is inflated by rides that occurred before the hour.
If you aggregate by event time, you get accurate results—but you need a mechanism to decide when the window is “complete enough” to emit without waiting forever.
💧 Watermarks: The Progress Clock
A watermark is Flink’s model of event-time progress. It is a timestamp that asserts: all events with an event time earlier than this watermark have been received.
When the watermark passes the end boundary of a window, Flink closes that window and emits its results.
|
|
The watermark is derived from the maximum event time seen in the stream, minus a configured tolerance interval.
Declaring Watermarks in Flink SQL
|
|
The INTERVAL '5' SECOND tolerance means: before closing any window, Flink will wait until it has seen events with timestamps at least 5 seconds past the window boundary. Events arriving up to 5 seconds late are included in the correct window.
The Latency-Accuracy Tradeoff
The watermark tolerance directly controls a tradeoff:
| Tolerance | Latency | Accuracy |
|---|---|---|
| 0 seconds | Lowest (results as soon as window boundary passes) | Lowest (any late events are missed) |
| 5 seconds | Low | Good for moderate network delays |
| 5 minutes | High | Good for mobile networks, intermittent connectivity |
| Unbounded | Infinite (never closes) | Perfect (theoretically) |
There is no universally correct answer. The right tolerance depends on the characteristics of your data source and the acceptable delay in your downstream systems.
For a live dashboard, 5-second latency may be acceptable. For a financial settlement batch that runs once per day, you might tolerate 30 minutes of latency to capture stragglers from poor connectivity zones.
🚑 What Happens to Events That Miss the Watermark
Even with a tolerance, some events will arrive after the window has closed. These are late events.
Flink offers three strategies:
Strategy 1: Discard (Default)
Late events are silently dropped. If your business logic can tolerate minor inaccuracy—and many real-time use cases can—this is the simplest approach.
|
|
Strategy 2: Allowed Lateness
Flink keeps the window state open for an additional allowed lateness period. Late events that arrive within this window update the emitted result—Flink sends a corrected output.
|
|
The downstream sink must handle receiving multiple results for the same window key—typically via an UPSERT.
Strategy 3: Side Output for Late Events
Route late events to a separate stream for separate handling—analysis, alerting, or manual reconciliation.
|
|
🌐 Network Dead Zones: A Real Example
The quintessential streaming late-data scenario: a delivery driver enters a basement car park at 13:58:50. Their device loses connectivity. They complete a transaction at 13:59:22.
At 14:07:15—over seven minutes later—the device reconnects. The event, timestamped 13:59:22, arrives at the Kafka broker.
What your watermark decision determines:
- With 5s tolerance: event is discarded. The 13:00–14:00 window closed at 14:00:05. Miss.
- With 10m tolerance: event is included. The window remains open until 14:10:00. Hit.
- With allowed lateness (5 min): window closed at 14:05, but state held until 14:10. Late event updates the emitted result. The dashboard receives a correction.
The same architecture; different outcomes based on the watermark configuration you chose when you built the pipeline.
🔬 Diagnosing Your Lateness Profile
Before setting a watermark tolerance, measure the actual distribution of event-time vs. processing-time delays in your source data.
A useful analysis to run on historical data or a sample of incoming events:
|
|
The 99th percentile delay in your data stream is the minimum watermark tolerance needed to include 99% of events in the correct window. Setting it to the 95th percentile accepts that 5% of events will land in the wrong window or be discarded.
This is an engineering decision, not a technical constraint. Make it deliberately.
← Previous: Stateful Stream Processing: What Your Pipeline Actually Remembers