YC Medical
ENTER

Run It Again: The Idempotency Test for Data Pipelines

Warning

DUPLICATE DOSAGE DETECTED: Pipeline retry completed successfully. Revenue increased by 37% without a corresponding transaction increase. Suspected non-idempotent write path.

At 02:14, the daily orders pipeline fails while loading the final partition.

At 02:19, the orchestrator retries it.

At 02:27, the retry succeeds. The incident appears resolved — until the finance dashboard opens in the morning and reports revenue that is dramatically higher than the payment processor.

The retry did not merely repeat computation. It repeated side effects.

This is the second post in the Reliable Data Systems series. The subject is idempotency: the property that allows a pipeline to run multiple times and still produce the same final state.


The Retry Paradox

Retries are essential in distributed systems. Networks time out. APIs return temporary errors. Workers crash. Warehouses briefly reject connections.

But every retry asks a dangerous question:

Which parts of the previous attempt actually completed?

Suppose a job extracts 100,000 orders, transforms them, and inserts them into a reporting table. The database commits the rows, but the worker crashes before reporting success to the orchestrator.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
Orchestrator          Worker             Warehouse
     │                   │                    │
     │── run job ───────▶│                    │
     │                   │── insert rows ────▶│
     │                   │                    │ ✓ committed
     │                   X crash              │
     │                                        │
     │── retry job ─────▶│                    │
     │                   │── insert rows ────▶│
     │                   │                    │ ✓ duplicated

The orchestrator sees failure. The warehouse sees success. Both are correct from their local perspective.

Exactly-once execution across this boundary is not something we should assume. Instead, we design the write so that repeating it is harmless.


Idempotency in One Equation

An operation f is idempotent when:

1
f(f(x)) = f(x)

Applying it twice has the same effect as applying it once.

For a data pipeline:

1
2
3
4
5
run(partition="2026-05-17")
run(partition="2026-05-17")
run(partition="2026-05-17")

final table state == one successful run

This does not mean every internal step executes only once. It means repeated executions converge on one correct result.


Failure Pattern 1: Blind Append

The simplest load is also the easiest to duplicate:

1
2
3
insert into analytics.orders
select *
from staging.orders_2026_05_17;

Run it twice and every row appears twice.

Blind append is safe only when each input is guaranteed to be processed once, a guarantee that disappears as soon as retries, manual reruns, or backfills exist.

Repair: Delete and Replace the Partition

For partitioned batch data, replace the target interval atomically:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
begin transaction;

delete from analytics.orders
where order_date = date '2026-05-17';

insert into analytics.orders
select *
from staging.orders_2026_05_17;

commit;

Whether this transaction runs once or five times, the final partition contains one copy of the source rows.

In warehouses that support partition replacement or insert overwrite, prefer the native operation:

1
2
3
4
insert overwrite analytics.orders
partition (order_date = date '2026-05-17')
select *
from staging.orders_2026_05_17;

Failure Pattern 2: Duplicate Business Events

Sometimes duplicates already exist in the source. A payment provider retries a webhook because your endpoint timed out after processing it.

The event arrives twice:

The correct defence is a stable idempotency key.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
merge into analytics.payment_events as target
using staging.payment_events as source
on target.event_id = source.event_id

when matched then update set
  status = source.status,
  received_at = source.received_at

when not matched then insert (
  event_id,
  payment_id,
  status,
  received_at
) values (
  source.event_id,
  source.payment_id,
  source.status,
  source.received_at
);

The event ID represents identity. The merge operation ensures that one logical event maps to one target row.

Do not generate the key after ingestion using a random UUID. A random identifier makes every retry look like a new event.

If the source provides no event ID, derive a deterministic key from stable business fields:

1
2
3
4
5
6
7
md5(concat_ws(
  '||',
  source_system,
  order_id,
  event_type,
  cast(event_timestamp as string)
)) as event_key

This is less robust than a producer-issued ID, but it is repeatable.


Failure Pattern 3: Non-Deterministic Transformations

A pipeline can avoid duplicate rows and still produce different output on every run.

1
2
3
4
5
6
7
8
select
  order_id,
  current_timestamp as processed_at,
  row_number() over (
    partition by customer_id
    order by created_at
  ) as customer_order_number
from raw.orders;

Two problems exist:

  1. current_timestamp changes on every run.
  2. If two orders share the same created_at, the row number ordering is unstable.

Make the transformation deterministic:

1
2
3
4
5
6
7
8
select
  order_id,
  source_loaded_at as processed_at,
  row_number() over (
    partition by customer_id
    order by created_at, order_id
  ) as customer_order_number
from raw.orders;

Given the same input, a deterministic transformation produces the same output.

This matters for reconciliation. If rerunning a model changes thousands of rows for reasons unrelated to source data, you cannot distinguish legitimate corrections from computational noise.


Checkpoints Are Not Idempotency

Checkpoints record progress:

1
last_successful_partition = 2026-05-17

They help a pipeline resume without reprocessing everything. They do not make the write safe if the checkpoint and the target commit can disagree.

Consider this sequence:

  1. Write partition to warehouse
  2. Crash
  3. Update checkpoint never happens
  4. Restart reads old checkpoint
  5. Write partition again

The checkpoint is functioning exactly as designed. The target operation still needs idempotency.

Use checkpoints to improve efficiency. Use idempotent writes to guarantee correctness.


Build a Rerun Test

Idempotency should be tested, not assumed.

A simple integration test:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
def test_daily_orders_job_is_idempotent(warehouse):
    run_orders_job("2026-05-17")
    first = warehouse.query("""
        select count(*) as rows, sum(revenue) as revenue
        from analytics.orders
        where order_date = '2026-05-17'
    """)

    run_orders_job("2026-05-17")
    second = warehouse.query("""
        select count(*) as rows, sum(revenue) as revenue
        from analytics.orders
        where order_date = '2026-05-17'
    """)

    assert second == first

A stronger test compares row-level hashes:

1
2
3
4
5
select
  count(*) as row_count,
  sum(hash(order_id, customer_id, revenue, status)) as content_hash
from analytics.orders
where order_date = date '2026-05-17';

Run the pipeline twice. Both metrics should remain unchanged.


The Idempotency Checklist

Before enabling automatic retries, verify:

  • Every record has a stable business or event key
  • Writes use merge, upsert, or partition replacement instead of blind append
  • Multi-step target changes are transactional where possible
  • Transformations are deterministic
  • Checkpoints can lag behind target commits without corrupting data
  • The same interval can be rerun manually
  • A repeated-run integration test exists

Retries are not an edge case. They are normal operation.

A pipeline that cannot be safely rerun is a pipeline waiting for an ordinary infrastructure failure to become a data incident.


Next in the Reliable Data Systems series: moving beyond job status to observability that measures freshness, volume, distribution, and lineage.