YC Medical
ENTER

Backfills Without Regret: Reprocessing Production Data Safely

Caution

HISTORICAL REPROCESSING REQUESTED: 18 months of partitions selected. Estimated scan: 74 TB. Production warehouse active. Initiating controlled backfill protocol.

A tax calculation was wrong for eighteen months.

The transformation has been fixed. The pull request passed. Today’s data is correct.

Now comes the dangerous part: applying the correction to history.

A backfill is a pipeline run over data that should already have been processed. It sounds like ordinary execution with an earlier date range. In production, it competes with live workloads, reopens previously settled metrics, and can duplicate or partially overwrite trusted tables.

This is the final post in the Reliable Data Systems series. We will build a backfill process that is bounded, idempotent, observable, and reversible.


Why Backfills Become Incidents

Backfills concentrate risk in four areas.

1. Scale

The daily job processes one partition. An eighteen-month backfill processes roughly 550 partitions.

A query that is cheap once can be expensive 550 times. It may exhaust warehouse slots, overload an operational source, or delay the live pipeline that creates tomorrow’s data.

2. Correctness

Historical records may not match today’s schema. Columns were added. Business rules changed. Reference tables were updated.

Running current code against old data can create a version of history that never existed.

3. Partial Publication

If consumers read the target while partitions are being rewritten, they see a mixed world:

1
2
3
4
January–April: corrected logic
May–October: old logic
November: currently rebuilding
December–June: old logic

The table is available but internally inconsistent.

4. Irreversibility

An in-place overwrite can destroy the previous trusted result. If reconciliation fails after 400 partitions, rollback becomes another large data operation.


Step 1: Write a Backfill Specification

Do not begin with a command. Begin with a bounded specification.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
backfill_id: revenue-tax-fix-2026-06
owner: finance-data

source_range:
  start: 2024-12-01
  end: 2026-05-31

target:
  table: finance.daily_revenue
  partition_key: revenue_date

reason:
  issue: FIN-2841
  change: Correct tax-inclusive revenue calculation

execution:
  partitions_per_batch: 7
  max_concurrency: 2
  pause_during_live_pipeline: true

validation:
  row_count_tolerance: 0
  revenue_delta_expected: "-1.5% to -0.8%"

rollback:
  snapshot: finance.daily_revenue_pre_FIN_2841

The specification creates an audit trail and prevents accidental scope expansion. “Backfill everything” is not an acceptable production plan.


Step 2: Prove Idempotency on One Partition

Before touching eighteen months, rerun one representative day twice.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
begin transaction;

delete from finance.daily_revenue_backfill
where revenue_date = date '2025-06-15';

insert into finance.daily_revenue_backfill
select *
from transformed_revenue
where revenue_date = date '2025-06-15';

commit;

Validate that:

  • Row count is unchanged after the second run
  • Primary or business keys remain unique
  • Aggregate values remain identical
  • No rows outside the selected partition change
1
2
3
4
5
6
7
select
  count(*) as rows,
  count(distinct merchant_id) as merchants,
  sum(gross_revenue) as gross_revenue,
  sum(net_revenue) as net_revenue
from finance.daily_revenue_backfill
where revenue_date = date '2025-06-15';

If one partition cannot be rerun safely, the full backfill is not ready.


Step 3: Write to a Shadow Table

Avoid rewriting the production table immediately.

1
2
create table finance.daily_revenue_v2
like finance.daily_revenue;

Build corrected history in the shadow table:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
insert overwrite finance.daily_revenue_v2
partition (revenue_date = date '2025-06-15')
select
  revenue_date,
  merchant_id,
  gross_revenue,
  gross_revenue - tax_amount as net_revenue,
  current_timestamp as recalculated_at,
  'FIN-2841' as backfill_id
from staging.revenue_events
where revenue_date = date '2025-06-15';

The shadow table provides:

  • Isolation from current consumers
  • A complete rollback path
  • Side-by-side reconciliation
  • Freedom to restart failed partitions

Storage is cheaper than reconstructing a destroyed trusted dataset.


Step 4: Throttle the Work

A backfill should behave like a polite tenant.

Process small batches:

1
2
3
4
5
for week in date_ranges(start, end, days=7):
    wait_for_live_pipeline()
    run_backfill(week, max_workers=2)
    validate_batch(week)
    record_checkpoint(week)

Controls should include:

  • Maximum concurrent partitions
  • Query timeout
  • Daily cost or bytes-scanned budget
  • Pause window around live jobs
  • Source API rate limit
  • Automatic stop on validation failure

Do not optimise for the shortest possible completion time. Optimise for predictable completion without degrading production.

Separate Compute Where Possible

Use an isolated warehouse, cluster, or workload queue:

1
2
Live pipelines → production warehouse / high-priority queue
Backfill jobs  → backfill warehouse / low-priority queue

This prevents historical correction from starving current data.


Step 5: Reconcile Every Batch

Technical success is not proof of correctness.

Compare old and new results:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
select
  old.revenue_date,
  old.row_count as old_rows,
  new.row_count as new_rows,
  new.net_revenue - old.net_revenue as revenue_delta,
  100.0 * (new.net_revenue - old.net_revenue)
    / nullif(old.net_revenue, 0) as delta_percent
from old_daily_summary old
join new_daily_summary new using (revenue_date)
order by revenue_date;

Use three levels of reconciliation.

Structural

  • Same expected partitions
  • Same row counts where grain is unchanged
  • Unique keys remain unique
  • Required fields remain non-null

Aggregate

  • Revenue delta falls within the expected range
  • Record totals reconcile with source systems
  • No unexplained discontinuities appear at month boundaries

Row-Level

Sample changed records and explain the difference:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
select
  old.merchant_id,
  old.revenue_date,
  old.net_revenue as old_value,
  new.net_revenue as new_value,
  new.net_revenue - old.net_revenue as delta
from finance.daily_revenue old
join finance.daily_revenue_v2 new
  using (merchant_id, revenue_date)
where old.net_revenue != new.net_revenue
order by abs(new.net_revenue - old.net_revenue) desc
limit 100;

Every large delta should be attributable to the intended logic change.


Step 6: Publish Atomically

Consumers should not observe a half-complete backfill.

For a full-table replacement, switch a view:

1
2
create or replace view finance.daily_revenue_current as
select * from finance.daily_revenue_v2;

For partition-level publication, validate a complete bounded interval before swapping or copying those partitions in one controlled operation.

Record publication metadata:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
insert into governance.data_releases (
  dataset,
  release_id,
  published_at,
  source_range,
  code_version,
  approved_by
) values (
  'finance.daily_revenue',
  'FIN-2841',
  current_timestamp,
  '2024-12-01..2026-05-31',
  'git:8f31c2a',
  'finance-controller'
);

Historical changes should be as traceable as application deployments.


Step 7: Keep the Rollback Window Open

Do not immediately delete the old table.

Monitor downstream systems after publication:

  • Dashboard totals
  • Finance exports
  • Reverse-ETL syncs
  • ML feature distributions
  • Query errors and latency

If an issue appears, switch the view back:

1
2
create or replace view finance.daily_revenue_current as
select * from finance.daily_revenue;

Retain the old version for an agreed safety period. Only clean it up after consumers and data owners sign off.


The Backfill Runbook

Before execution:

  • Bound the date range and business objective
  • Estimate rows, bytes scanned, runtime, and cost
  • Snapshot or preserve the current trusted result
  • Test idempotency on one partition
  • Define expected aggregate changes

During execution:

  • Use isolated or low-priority compute
  • Process small checkpointed batches
  • Reconcile after every batch
  • Stop automatically when thresholds fail
  • Keep live pipelines ahead of backfill work

Before publication:

  • Complete structural, aggregate, and sampled row checks
  • Obtain approval from the data owner
  • Publish atomically
  • Announce changed historical metrics
  • Preserve a fast rollback path

The Reliability Series, Completed

Reliable data systems are not created by one tool.

They are created by a set of engineering properties:

  1. Data contracts preserve meaning across producer changes.
  2. Idempotency makes retries and reruns safe.
  3. Observability detects bad data even when jobs succeed.
  4. Schema evolution lets interfaces change without coordinated downtime.
  5. Controlled backfills correct history without destabilising the present.

The common principle is simple: assume failure, change, and reprocessing are normal.

A mature data platform is not one that never encounters these conditions. It is one that can encounter them without losing trust.