YC Medical
ENTER

The Late Data Problem: Dead Zones, Watermarks, and Event Time

Warning

TEMPORAL ANOMALY DETECTED: Events arriving 47 seconds past window boundary. Initiating watermark assessment. Late event triage protocol engaged.

Of all the challenges that distinguish streaming from batch, the late data problem is the most philosophically interesting—and the most practically consequential.

In batch processing, the question does not arise. When the midnight job runs, all data from the previous day is present. The boundary is clear. The calculation is complete.

In a streaming system, data has no such guarantee. An event generated at 13:58:47 might not arrive at your Kafka consumer until 14:04:32. It was generated before the top of the hour. It arrived after. The 13:00–14:00 window has already closed and its results have been written to the database.

What do you do with it?


⏱️ Two Clocks, One Problem

Every streaming system must manage two separate notions of time:

Event time: when the event actually happened. This is recorded by the source—the taxi meter, the payment terminal, the IoT sensor. It is embedded in the event data.

Processing time: when the event arrives at and is processed by your streaming engine. This is the wall clock time at the moment Flink reads the message from Kafka.

1
2
3
Event generated:         13:58:47  (event time)
Network delay:           +00:05:45
Kafka consumer receives: 14:04:32  (processing time)

In an ideal world with instant network transmission, event time and processing time are identical. In the real world, they diverge—and that divergence is the late data problem.

Why does this matter? If you aggregate by processing time, results are distorted: events that happened at 13:58 are counted in the 14:00 window because that is when they arrived. Your hourly revenue figure for 13:00–14:00 is understated, and your 14:00–15:00 figure is inflated by rides that occurred before the hour.

If you aggregate by event time, you get accurate results—but you need a mechanism to decide when the window is “complete enough” to emit without waiting forever.


💧 Watermarks: The Progress Clock

A watermark is Flink’s model of event-time progress. It is a timestamp that asserts: all events with an event time earlier than this watermark have been received.

When the watermark passes the end boundary of a window, Flink closes that window and emits its results.

1
2
3
4
5
6
7
Stream events (by event time):
13:45:12 → 13:52:33 → 13:58:47 → 14:01:19 → 14:06:45 → ...

Watermark (BoundedOutOfOrderness, 5min tolerance):
→ W(13:40:12) → W(13:47:33) → W(13:53:47) → W(13:56:19) → W(14:01:45) → ...

Window [13:00, 14:00) closes when watermark reaches 14:00:00.

The watermark is derived from the maximum event time seen in the stream, minus a configured tolerance interval.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
CREATE TABLE taxi_rides (
    PULocationID    INTEGER,
    DOLocationID    INTEGER,
    trip_distance   DOUBLE,
    total_amount    DOUBLE,
    -- Raw epoch milliseconds from the source
    tpep_pickup_datetime BIGINT,
    -- Computed event time column
    event_time AS TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3),
    -- Watermark: tolerate up to 5 seconds of out-of-order arrival
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = 'redpanda:29092',
    'topic' = 'rides',
    'format' = 'json'
);

The INTERVAL '5' SECOND tolerance means: before closing any window, Flink will wait until it has seen events with timestamps at least 5 seconds past the window boundary. Events arriving up to 5 seconds late are included in the correct window.

The Latency-Accuracy Tradeoff

The watermark tolerance directly controls a tradeoff:

Tolerance Latency Accuracy
0 seconds Lowest (results as soon as window boundary passes) Lowest (any late events are missed)
5 seconds Low Good for moderate network delays
5 minutes High Good for mobile networks, intermittent connectivity
Unbounded Infinite (never closes) Perfect (theoretically)

There is no universally correct answer. The right tolerance depends on the characteristics of your data source and the acceptable delay in your downstream systems.

For a live dashboard, 5-second latency may be acceptable. For a financial settlement batch that runs once per day, you might tolerate 30 minutes of latency to capture stragglers from poor connectivity zones.


🚑 What Happens to Events That Miss the Watermark

Even with a tolerance, some events will arrive after the window has closed. These are late events.

Flink offers three strategies:

Strategy 1: Discard (Default)

Late events are silently dropped. If your business logic can tolerate minor inaccuracy—and many real-time use cases can—this is the simplest approach.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
-- Tumbling window with no late event handling
-- Events arriving after watermark closes the window: discarded
SELECT
    PULocationID,
    window_start,
    window_end,
    COUNT(*) AS num_trips
FROM TABLE(
    TUMBLE(TABLE taxi_rides, DESCRIPTOR(event_time), INTERVAL '1' HOUR)
)
GROUP BY PULocationID, window_start, window_end;

Strategy 2: Allowed Lateness

Flink keeps the window state open for an additional allowed lateness period. Late events that arrive within this window update the emitted result—Flink sends a corrected output.

1
2
3
4
5
# In the DataStream API (Java/Python):
windowed_stream \
    .window(TumblingEventTimeWindows.of(Time.hours(1))) \
    .allowed_lateness(Time.minutes(10)) \  # Keep state open 10 min past window end
    .aggregate(RevenueAggregator())

The downstream sink must handle receiving multiple results for the same window key—typically via an UPSERT.

Strategy 3: Side Output for Late Events

Route late events to a separate stream for separate handling—analysis, alerting, or manual reconciliation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
from pyflink.datastream import OutputTag

late_events_tag = OutputTag("late-events")

main_stream = windowed_stream \
    .window(TumblingEventTimeWindows.of(Time.hours(1))) \
    .side_output_late_data(late_events_tag) \
    .aggregate(RevenueAggregator())

# Handle late events separately
late_stream = main_stream.get_side_output(late_events_tag)
late_stream.add_sink(late_event_alert_sink)

🌐 Network Dead Zones: A Real Example

The quintessential streaming late-data scenario: a delivery driver enters a basement car park at 13:58:50. Their device loses connectivity. They complete a transaction at 13:59:22.

At 14:07:15—over seven minutes later—the device reconnects. The event, timestamped 13:59:22, arrives at the Kafka broker.

What your watermark decision determines:

  • With 5s tolerance: event is discarded. The 13:00–14:00 window closed at 14:00:05. Miss.
  • With 10m tolerance: event is included. The window remains open until 14:10:00. Hit.
  • With allowed lateness (5 min): window closed at 14:05, but state held until 14:10. Late event updates the emitted result. The dashboard receives a correction.

The same architecture; different outcomes based on the watermark configuration you chose when you built the pipeline.


🔬 Diagnosing Your Lateness Profile

Before setting a watermark tolerance, measure the actual distribution of event-time vs. processing-time delays in your source data.

A useful analysis to run on historical data or a sample of incoming events:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import pandas as pd

# Load a sample of events
df = pd.read_parquet('sample_events.parquet')

# Calculate delay in seconds
df['processing_time'] = pd.Timestamp.now()
df['event_time'] = pd.to_datetime(df['tpep_pickup_datetime'], unit='ms')
df['delay_seconds'] = (df['processing_time'] - df['event_time']).dt.total_seconds()

# Percentile distribution of delays
print(df['delay_seconds'].describe(percentiles=[0.5, 0.9, 0.95, 0.99]))
# p99 delay tells you what watermark tolerance covers ~99% of events

The 99th percentile delay in your data stream is the minimum watermark tolerance needed to include 99% of events in the correct window. Setting it to the 95th percentile accepts that 5% of events will land in the wrong window or be discarded.

This is an engineering decision, not a technical constraint. Make it deliberately.


← Previous: Stateful Stream Processing: What Your Pipeline Actually Remembers

Next: Exactly Once is a Lie (And What to Do About It) →