Backfills Without Regret: Reprocessing Production Data Safely
Caution
HISTORICAL REPROCESSING REQUESTED: 18 months of partitions selected. Estimated scan: 74 TB. Production warehouse active. Initiating controlled backfill protocol.
A tax calculation was wrong for eighteen months.
The transformation has been fixed. The pull request passed. Today’s data is correct.
Now comes the dangerous part: applying the correction to history.
A backfill is a pipeline run over data that should already have been processed. It sounds like ordinary execution with an earlier date range. In production, it competes with live workloads, reopens previously settled metrics, and can duplicate or partially overwrite trusted tables.
This is the final post in the Reliable Data Systems series. We will build a backfill process that is bounded, idempotent, observable, and reversible.
Why Backfills Become Incidents
Backfills concentrate risk in four areas.
1. Scale
The daily job processes one partition. An eighteen-month backfill processes roughly 550 partitions.
A query that is cheap once can be expensive 550 times. It may exhaust warehouse slots, overload an operational source, or delay the live pipeline that creates tomorrow’s data.
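A rough estimate makes the scale concrete. The sketch below uses the 74 TB and 550 partitions quoted above; the minutes per partition and the concurrency limit are assumptions for illustration, not measured values.

```python
TOTAL_SCAN_TB = 74   # from the backfill request above
PARTITIONS = 550     # roughly eighteen months of daily partitions

# Assumed values for illustration only; replace with measurements.
MINUTES_PER_PARTITION = 6
MAX_CONCURRENT = 4

tb_per_partition = TOTAL_SCAN_TB / PARTITIONS
total_hours = PARTITIONS * MINUTES_PER_PARTITION / MAX_CONCURRENT / 60

print(f"{tb_per_partition * 1000:.0f} GB scanned per partition")  # 135 GB
print(f"{total_hours:.2f} hours of wall-clock time")              # 13.75 hours
```

Even under these optimistic assumptions, the backfill occupies shared compute for more than half a day.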
2. Correctness
Historical records may not match today’s schema. Columns were added. Business rules changed. Reference tables were updated.
Running current code against old data can create a version of history that never existed.
3. Partial Publication
If consumers read the target while partitions are being rewritten, they see a mixed world: some partitions already hold corrected values while others still hold the old ones.
The table is available but internally inconsistent.
4. Irreversibility
An in-place overwrite can destroy the previous trusted result. If reconciliation fails after 400 partitions, rollback becomes another large data operation.
Step 1: Write a Backfill Specification
Do not begin with a command. Begin with a bounded specification.
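One way to make the specification concrete is a small validated record. This is a sketch only: `BackfillSpec`, its field names, the table name, the owner, and the date range are all hypothetical, and the 74 TB budget is taken from the request at the top of the post.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BackfillSpec:
    """Hypothetical specification; every field name is an assumption."""
    objective: str          # the business reason for the backfill
    target_table: str
    start: date             # first daily partition, inclusive
    end: date               # last daily partition, inclusive
    max_bytes_scanned: int  # hard budget for the whole run
    owner: str              # who approves publication

    def __post_init__(self) -> None:
        # Reject unbounded or inverted ranges before any work starts.
        if self.start > self.end:
            raise ValueError("start must not be after end")
        if self.max_bytes_scanned <= 0:
            raise ValueError("a positive scan budget is required")

    @property
    def partition_count(self) -> int:
        return (self.end - self.start).days + 1


spec = BackfillSpec(
    objective="Recompute tax after the calculation fix",
    target_table="analytics.orders_daily",  # hypothetical table
    start=date(2023, 1, 1),                 # assumed range
    end=date(2024, 6, 30),
    max_bytes_scanned=74 * 10**12,
    owner="finance-data-owner",             # hypothetical owner
)
print(spec.partition_count)  # 547
```

Because the record is frozen and validated on construction, the scope cannot quietly grow once the run has been approved.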
The specification creates an audit trail and prevents accidental scope expansion. “Backfill everything” is not an acceptable production plan.
Step 2: Prove Idempotency on One Partition
Before touching eighteen months, rerun one representative day twice.
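A minimal sketch of that test follows, using an in-memory SQLite database as a stand-in for the warehouse. The table names, columns, and the 0.25 rate are assumptions for illustration. The partition is replaced with a delete followed by an insert inside one transaction, which is one common way to make a rerun idempotent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT, amount REAL);
    CREATE TABLE orders_daily (order_id INTEGER, order_date TEXT, tax REAL);
    INSERT INTO raw_orders VALUES
        (1, '2024-03-01', 100.0),
        (2, '2024-03-01', 50.0),
        (3, '2024-03-02', 80.0);
""")

TAX_RATE = 0.25  # assumed corrected rate, for illustration only


def run_partition(day: str) -> None:
    """Replace one day's rows: delete, then insert, in one transaction."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("DELETE FROM orders_daily WHERE order_date = ?", (day,))
        conn.execute(
            "INSERT INTO orders_daily "
            "SELECT order_id, order_date, amount * ? FROM raw_orders "
            "WHERE order_date = ?",
            (TAX_RATE, day),
        )


def snapshot(day: str) -> list:
    return conn.execute(
        "SELECT order_id, tax FROM orders_daily "
        "WHERE order_date = ? ORDER BY order_id",
        (day,),
    ).fetchall()


run_partition("2024-03-01")
first = snapshot("2024-03-01")
run_partition("2024-03-01")  # the rerun
second = snapshot("2024-03-01")
print(first == second)  # True
```

An append-only insert without the delete would double the rows on the second run, which is exactly the failure this test is meant to catch.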
Validate that:
- Row count is unchanged after the second run
- Primary or business keys remain unique
- Aggregate values remain identical
- No rows outside the selected partition change
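The four checks above can be sketched as a comparison of fingerprints taken before and after the rerun. The `fingerprint` and `check_rerun` helpers and the `orders_daily` layout are hypothetical. Comparing counts and totals outside the partition is only a coarse proxy for "no rows changed", and the demo deliberately uses a non-idempotent rerun so the checks have something to catch.

```python
import sqlite3


def fingerprint(conn, table, where, params):
    """Return (row count, distinct keys, rounded tax total) for a slice.
    `table` and `where` must come from trusted code, never user input."""
    return conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT order_id), "
        f"ROUND(COALESCE(SUM(tax), 0), 2) FROM {table} WHERE {where}",
        params,
    ).fetchone()


def check_rerun(conn, table, day, rerun):
    """Call `rerun` and report which of the four checks held."""
    inside_before = fingerprint(conn, table, "order_date = ?", (day,))
    outside_before = fingerprint(conn, table, "order_date <> ?", (day,))
    rerun()
    inside_after = fingerprint(conn, table, "order_date = ?", (day,))
    outside_after = fingerprint(conn, table, "order_date <> ?", (day,))
    return {
        "row_count_unchanged": inside_before[0] == inside_after[0],
        "keys_unique": inside_after[0] == inside_after[1],
        "aggregates_identical": inside_before[2] == inside_after[2],
        "other_partitions_untouched": outside_before == outside_after,
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_daily (order_id INTEGER, order_date TEXT, tax REAL)"
)
conn.executemany(
    "INSERT INTO orders_daily VALUES (?, ?, ?)",
    [(1, "2024-03-01", 20.0), (2, "2024-03-01", 10.0), (3, "2024-03-02", 16.0)],
)


def appending_rerun():
    # Deliberately broken: inserts the day again without deleting it first.
    conn.execute(
        "INSERT INTO orders_daily "
        "SELECT * FROM orders_daily WHERE order_date = '2024-03-01'"
    )


results = check_rerun(conn, "orders_daily", "2024-03-01", appending_rerun)
print(results)
```

Here the first three checks fail and only `other_partitions_untouched` holds, so the job is not ready for a full backfill.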
If one partition cannot be rerun safely, the full backfill is not ready.
Step 3: Write to a Shadow Table
Avoid rewriting the production table immediately.
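As a sketch, the shadow table can start as an empty copy of the production column layout. The example uses SQLite and hypothetical table names. `CREATE TABLE ... AS SELECT` copies columns but not constraints or indexes, and many warehouses offer a dedicated statement for cloning a table definition that would be a better fit in practice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders_daily (order_id INTEGER, order_date TEXT, tax REAL)"
)
conn.execute("INSERT INTO orders_daily VALUES (1, '2024-03-01', 20.0)")

# Empty copy of the column layout: `WHERE 0` selects no rows.
conn.execute(
    "CREATE TABLE orders_daily_shadow AS SELECT * FROM orders_daily WHERE 0"
)


def columns(table: str) -> list:
    # PRAGMA table_info returns one row per column; index 1 is the name.
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


shadow_rows = conn.execute(
    "SELECT COUNT(*) FROM orders_daily_shadow"
).fetchone()[0]
prod_rows = conn.execute("SELECT COUNT(*) FROM orders_daily").fetchone()[0]
print(columns("orders_daily_shadow"), shadow_rows, prod_rows)
```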
Build the corrected history in the shadow table instead.
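The sketch below fills a shadow table one partition at a time while the production table stays untouched. It assumes SQLite, hypothetical table names, and an illustrative corrected rate of 0.25. Each partition is written with delete-then-insert so that a failed partition can simply be run again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, order_date TEXT, amount REAL);
    CREATE TABLE orders_daily (order_id INTEGER, order_date TEXT, tax REAL);
    CREATE TABLE orders_daily_shadow (order_id INTEGER, order_date TEXT, tax REAL);
    INSERT INTO raw_orders VALUES
        (1, '2024-03-01', 100.0), (2, '2024-03-02', 50.0);
    -- Production still holds the old, incorrect tax values.
    INSERT INTO orders_daily VALUES
        (1, '2024-03-01', 20.0), (2, '2024-03-02', 10.0);
""")

CORRECTED_RATE = 0.25  # assumed value, for illustration only


def build_shadow_partition(day: str) -> None:
    """Rebuild one day in the shadow table; safe to call repeatedly."""
    with conn:
        conn.execute(
            "DELETE FROM orders_daily_shadow WHERE order_date = ?", (day,)
        )
        conn.execute(
            "INSERT INTO orders_daily_shadow "
            "SELECT order_id, order_date, amount * ? FROM raw_orders "
            "WHERE order_date = ?",
            (CORRECTED_RATE, day),
        )


for day in ("2024-03-01", "2024-03-02"):
    build_shadow_partition(day)

prod = conn.execute(
    "SELECT order_id, tax FROM orders_daily ORDER BY order_id"
).fetchall()
shadow = conn.execute(
    "SELECT order_id, tax FROM orders_daily_shadow ORDER BY order_id"
).fetchall()
print(prod)    # [(1, 20.0), (2, 10.0)]
print(shadow)  # [(1, 25.0), (2, 12.5)]
```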
The shadow table provides:
- Isolation from current consumers
- A complete rollback path
- Side-by-side reconciliation
- Freedom to restart failed partitions
Storage is cheaper than reconstructing a destroyed trusted dataset.
Step 4: Throttle the Work
A backfill should behave like a polite tenant.
Process small batches rather than the whole date range at once.
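A minimal sketch of batching, assuming daily partitions. The `partition_batches` helper and the batch size are hypothetical; the right size depends on how much work one partition represents.

```python
from datetime import date, timedelta
from typing import Iterator, List


def partition_batches(
    start: date, end: date, batch_size: int
) -> Iterator[List[date]]:
    """Yield consecutive daily partitions in lists of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[date] = []
    day = start
    while day <= end:
        batch.append(day)
        if len(batch) == batch_size:
            yield batch
            batch = []
        day += timedelta(days=1)
    if batch:  # final, possibly shorter batch
        yield batch


batches = list(
    partition_batches(date(2024, 1, 1), date(2024, 1, 10), batch_size=4)
)
print([len(b) for b in batches])  # [4, 4, 2]
```

Each yielded batch is a natural checkpoint: record it as done only after it has been processed and reconciled.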
Controls should include:
- Maximum concurrent partitions
- Query timeout
- Daily cost or bytes-scanned budget
- Pause window around live jobs
- Source API rate limit
- Automatic stop on validation failure
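Three of these controls (the scan budget, the pause window, and the automatic stop) can be sketched as a small batch runner. `Controls`, `run_batches`, and the callables passed in are hypothetical names, the pause window is assumed not to cross midnight, and concurrency limits, query timeouts, and source rate limits are left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, Iterable


@dataclass(frozen=True)
class Controls:
    max_bytes: int     # bytes-scanned budget for this run
    pause_start: time  # start of the window reserved for live jobs
    pause_end: time    # end of that window (same day)


def run_batches(
    batches: Iterable[str],
    process: Callable[[str], int],    # returns bytes scanned by the batch
    validate: Callable[[str], bool],  # True when reconciliation passes
    clock: Callable[[], time],
    controls: Controls,
) -> str:
    """Run batches until done, paused, over budget, or failed."""
    scanned = 0
    for batch in batches:
        if controls.pause_start <= clock() < controls.pause_end:
            return "paused"
        if scanned >= controls.max_bytes:
            return "stopped: budget exhausted"
        scanned += process(batch)
        if not validate(batch):
            return "stopped: validation failed"
    return "completed"


processed = []


def fake_process(batch: str) -> int:
    processed.append(batch)
    return 40  # pretend every batch scans 40 bytes


status = run_batches(
    ["b1", "b2", "b3"],
    fake_process,
    validate=lambda batch: True,
    clock=lambda: time(12, 0),
    controls=Controls(max_bytes=80, pause_start=time(2, 0), pause_end=time(4, 0)),
)
print(status, processed)  # stopped: budget exhausted ['b1', 'b2']
```

Returning a status instead of raising keeps the runner resumable: the caller can record the last completed batch and pick up again once the budget or the window allows.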
Do not optimise for the shortest possible completion time. Optimise for predictable completion without degrading production.
Separate Compute Where Possible
Use an isolated warehouse, cluster, or workload queue for the backfill.
This prevents historical correction from starving current data.
Step 5: Reconcile Every Batch
Technical success is not proof of correctness.
Compare old and new results for each batch before moving on to the next.
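A minimal comparison might join per-partition counts and totals from the old table and the shadow table. The sketch assumes SQLite, hypothetical table names, and an assumed acceptable delta of between 0 and 10 per day; real thresholds should come from the expected effect of the logic change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_daily (order_id INTEGER, order_date TEXT, tax REAL);
    CREATE TABLE orders_daily_shadow (order_id INTEGER, order_date TEXT, tax REAL);
    INSERT INTO orders_daily VALUES
        (1, '2024-03-01', 20.0), (2, '2024-03-01', 10.0), (3, '2024-03-02', 16.0);
    INSERT INTO orders_daily_shadow VALUES
        (1, '2024-03-01', 25.0), (2, '2024-03-01', 12.5), (3, '2024-03-02', 20.0);
""")

# One row per old partition: date, old row count, new row count, tax delta.
# The LEFT JOIN leaves NULLs where the shadow table is missing a partition.
RECONCILE_SQL = """
    SELECT o.order_date, o.n, s.n, s.total - o.total
    FROM (SELECT order_date, COUNT(*) AS n, SUM(tax) AS total
          FROM orders_daily GROUP BY order_date) AS o
    LEFT JOIN (SELECT order_date, COUNT(*) AS n, SUM(tax) AS total
               FROM orders_daily_shadow GROUP BY order_date) AS s
      ON s.order_date = o.order_date
    ORDER BY o.order_date
"""
report = conn.execute(RECONCILE_SQL).fetchall()

MIN_DELTA, MAX_DELTA = 0.0, 10.0  # assumed expected range per day

failures = [
    row for row in report
    if row[2] != row[1]                        # row count changed or missing
    or row[3] is None
    or not (MIN_DELTA <= row[3] <= MAX_DELTA)  # delta outside expectation
]
print(report)    # [('2024-03-01', 2, 2, 7.5), ('2024-03-02', 1, 1, 4.0)]
print(failures)  # []
```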
Use three levels of reconciliation.
Structural
- Same expected partitions
- Same row counts where grain is unchanged
- Unique keys remain unique
- Required fields remain non-null
Aggregate
- Revenue delta falls within the expected range
- Record totals reconcile with source systems
- No unexplained discontinuities appear at month boundaries
Row-Level
Sample changed records and explain the difference for each one.
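One way to do this is to join the old and new tables on the business key, sample the rows whose value changed, and test each sampled row against the intended rule. The sketch assumes SQLite, hypothetical tables, and an illustrative corrected rate of 0.25. Order 4 is planted as a change that the new rule does not explain.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_daily (order_id INTEGER PRIMARY KEY, amount REAL, tax REAL);
    CREATE TABLE orders_daily_shadow (order_id INTEGER PRIMARY KEY, amount REAL, tax REAL);
    INSERT INTO orders_daily VALUES
        (1, 100.0, 20.0), (2, 50.0, 10.0), (3, 80.0, 20.0), (4, 40.0, 8.0);
    INSERT INTO orders_daily_shadow VALUES
        (1, 100.0, 25.0), (2, 50.0, 12.5), (3, 80.0, 20.0), (4, 40.0, 11.0);
""")

CORRECTED_RATE = 0.25  # assumed intended logic: tax = amount * 0.25
SAMPLE_SIZE = 100      # assumed cap on rows reviewed per batch

changed = conn.execute("""
    SELECT o.order_id, o.amount, o.tax, s.tax
    FROM orders_daily AS o
    JOIN orders_daily_shadow AS s ON s.order_id = o.order_id
    WHERE o.tax <> s.tax
    ORDER BY o.order_id
""").fetchall()

rng = random.Random(7)  # fixed seed so the sample is reproducible
sample = rng.sample(changed, k=min(SAMPLE_SIZE, len(changed)))

# A change is "explained" when the new value matches the intended rule.
unexplained = sorted(
    order_id
    for order_id, amount, old_tax, new_tax in sample
    if abs(new_tax - amount * CORRECTED_RATE) > 0.005
)
print(unexplained)  # [4]
```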
Every large delta should be attributable to the intended logic change.
Step 6: Publish Atomically
Consumers should not observe a half-complete backfill.
For a full-table replacement, switch a view so that consumers move from the old table to the corrected one in a single step.
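A sketch of the view switch, assuming SQLite and hypothetical versioned tables `orders_daily_v1` and `orders_daily_v2`. SQLite has no `CREATE OR REPLACE VIEW`, so the drop and the create are wrapped in one explicit transaction; warehouses that support replacing a view in a single statement make this simpler.

```python
import sqlite3

# isolation_level=None turns off implicit transactions so that BEGIN and
# COMMIT can be issued explicitly around the DDL statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE orders_daily_v1 (order_id INTEGER, tax REAL);
    CREATE TABLE orders_daily_v2 (order_id INTEGER, tax REAL);
    INSERT INTO orders_daily_v1 VALUES (1, 20.0);
    INSERT INTO orders_daily_v2 VALUES (1, 25.0);
    CREATE VIEW orders_daily AS SELECT * FROM orders_daily_v1;
""")


def point_view_at(view: str, table: str) -> None:
    """Repoint `view` at `table` inside a single transaction.
    Both names must come from trusted code, never user input."""
    conn.execute("BEGIN")
    try:
        conn.execute(f"DROP VIEW IF EXISTS {view}")
        conn.execute(f"CREATE VIEW {view} AS SELECT * FROM {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def read_tax() -> float:
    # Consumers only ever query the view name.
    return conn.execute(
        "SELECT tax FROM orders_daily WHERE order_id = 1"
    ).fetchone()[0]


before = read_tax()
point_view_at("orders_daily", "orders_daily_v2")
after = read_tax()
print(before, after)  # 20.0 25.0
```

Because consumers query only the view, pointing it at `orders_daily_v1` again with the same helper undoes the publication without moving any data.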
For partition-level publication, validate a complete bounded interval before swapping or copying those partitions in one controlled operation.
Record publication metadata for every switch.
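What that metadata contains is a local decision. As a sketch, a record like the following could be appended to an audit log; `PublicationRecord`, its fields, and every value shown are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PublicationRecord:
    """Hypothetical audit entry; every field name is an assumption."""
    view: str
    previous_table: str   # kept for rollback
    published_table: str
    first_partition: str
    last_partition: str
    code_version: str     # identifier of the corrected transformation
    approved_by: str
    published_at: str     # UTC timestamp in ISO 8601 format


record = PublicationRecord(
    view="orders_daily",
    previous_table="orders_daily_v1",
    published_table="orders_daily_v2",
    first_partition="2023-01-01",
    last_partition="2024-06-30",
    code_version="<commit of the tax fix>",  # placeholder, not a real value
    approved_by="finance-data-owner",
    published_at=datetime.now(timezone.utc).isoformat(),
)

# One JSON object per line is easy to append to a log table or file.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```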
Historical changes should be as traceable as application deployments.
Step 7: Keep the Rollback Window Open
Do not immediately delete the old table.
Monitor downstream systems after publication:
- Dashboard totals
- Finance exports
- Reverse-ETL syncs
- ML feature distributions
- Query errors and latency
If an issue appears, switch the view back to the old table.
Retain the old version for an agreed safety period. Only clean it up after consumers and data owners sign off.
The Backfill Runbook
Before execution:
- Bound the date range and business objective
- Estimate rows, bytes scanned, runtime, and cost
- Snapshot or preserve the current trusted result
- Test idempotency on one partition
- Define expected aggregate changes
During execution:
- Use isolated or low-priority compute
- Process small checkpointed batches
- Reconcile after every batch
- Stop automatically when thresholds fail
- Keep live pipelines ahead of backfill work
Before publication:
- Complete structural, aggregate, and sampled row checks
- Obtain approval from the data owner
- Publish atomically
- Announce changed historical metrics
- Preserve a fast rollback path
The Reliability Series, Completed
Reliable data systems are not created by one tool.
They are created by a set of engineering properties:
- Data contracts preserve meaning across producer changes.
- Idempotency makes retries and reruns safe.
- Observability detects bad data even when jobs succeed.
- Schema evolution lets interfaces change without coordinated downtime.
- Controlled backfills correct history without destabilising the present.
The common principle is simple: assume failure, change, and reprocessing are normal.
A mature data platform is not one that never encounters these conditions. It is one that can encounter them without losing trust.