Run It Again: The Idempotency Test for Data Pipelines
Warning
DUPLICATE DOSAGE DETECTED: Pipeline retry completed successfully. Revenue increased by 37% without a corresponding transaction increase. Suspected non-idempotent write path.
At 02:14, the daily orders pipeline fails while loading the final partition.
At 02:19, the orchestrator retries it.
At 02:27, the retry succeeds. The incident appears resolved — until the finance dashboard opens in the morning and reports revenue that is dramatically higher than the payment processor.
The retry did not merely repeat computation. It repeated side effects.
This is the second post in the Reliable Data Systems series. The subject is idempotency: the property that allows a pipeline to run multiple times and still produce the same final state.
The Retry Paradox
Retries are essential in distributed systems. Networks time out. APIs return temporary errors. Workers crash. Warehouses briefly reject connections.
But every retry asks a dangerous question:
Which parts of the previous attempt actually completed?
Suppose a job extracts 100,000 orders, transforms them, and inserts them into a reporting table. The database commits the rows, but the worker crashes before reporting success to the orchestrator.
|
|
The orchestrator sees failure. The warehouse sees success. Both are correct from their local perspective.
Exactly-once execution across this boundary is not something we should assume. Instead, we design the write so that repeating it is harmless.
Idempotency in One Equation
An operation f is idempotent when:
|
|
Applying it twice has the same effect as applying it once.
For a data pipeline:
|
|
This does not mean every internal step executes only once. It means repeated executions converge on one correct result.
Failure Pattern 1: Blind Append
The simplest load is also the easiest to duplicate:
|
|
Run it twice and every row appears twice.
Blind append is safe only when each input is guaranteed to be processed once, a guarantee that disappears as soon as retries, manual reruns, or backfills exist.
Repair: Delete and Replace the Partition
For partitioned batch data, replace the target interval atomically:
|
|
Whether this transaction runs once or five times, the final partition contains one copy of the source rows.
In warehouses that support partition replacement or insert overwrite, prefer the native operation:
|
|
Failure Pattern 2: Duplicate Business Events
Sometimes duplicates already exist in the source. A payment provider retries a webhook because your endpoint timed out after processing it.
The event arrives twice:
The correct defence is a stable idempotency key.
|
|
The event ID represents identity. The merge operation ensures that one logical event maps to one target row.
Do not generate the key after ingestion using a random UUID. A random identifier makes every retry look like a new event.
If the source provides no event ID, derive a deterministic key from stable business fields:
|
|
This is less robust than a producer-issued ID, but it is repeatable.
Failure Pattern 3: Non-Deterministic Transformations
A pipeline can avoid duplicate rows and still produce different output on every run.
|
|
Two problems exist:
current_timestampchanges on every run.- If two orders share the same
created_at, the row number ordering is unstable.
Make the transformation deterministic:
|
|
Given the same input, a deterministic transformation produces the same output.
This matters for reconciliation. If rerunning a model changes thousands of rows for reasons unrelated to source data, you cannot distinguish legitimate corrections from computational noise.
Checkpoints Are Not Idempotency
Checkpoints record progress:
|
|
They help a pipeline resume without reprocessing everything. They do not make the write safe if the checkpoint and the target commit can disagree.
Consider this sequence:
- Write partition to warehouse
- Crash
- Update checkpoint never happens
- Restart reads old checkpoint
- Write partition again
The checkpoint is functioning exactly as designed. The target operation still needs idempotency.
Use checkpoints to improve efficiency. Use idempotent writes to guarantee correctness.
Build a Rerun Test
Idempotency should be tested, not assumed.
A simple integration test:
|
|
A stronger test compares row-level hashes:
|
|
Run the pipeline twice. Both metrics should remain unchanged.
The Idempotency Checklist
Before enabling automatic retries, verify:
- Every record has a stable business or event key
- Writes use
merge, upsert, or partition replacement instead of blind append - Multi-step target changes are transactional where possible
- Transformations are deterministic
- Checkpoints can lag behind target commits without corrupting data
- The same interval can be rerun manually
- A repeated-run integration test exists
Retries are not an edge case. They are normal operation.
A pipeline that cannot be safely rerun is a pipeline waiting for an ordinary infrastructure failure to become a data incident.
Next in the Reliable Data Systems series: moving beyond job status to observability that measures freshness, volume, distribution, and lineage.