Schema Evolution Without the Emergency Migration
Important
TRANSPLANT WINDOW OPEN: Existing field scheduled for replacement. Twelve downstream consumers detected. Zero-downtime protocol requires expansion before contraction.
Renaming a database column sounds trivial.
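As a minimal sketch of the risk, assume a hypothetical `customer` table in an in-memory SQLite database and one consumer query that has not been redeployed. The table layout, the `full_name` target, and the query are illustrative assumptions rather than details of a real system. The rename succeeds instantly, and the untouched consumer query is what fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada Lovelace')")

# The rename itself is one fast statement (requires SQLite 3.25 or newer).
conn.execute("ALTER TABLE customer RENAME COLUMN name TO full_name")

# A consumer that was not redeployed still asks for the old column.
legacy_query = "SELECT name FROM customer"
try:
    conn.execute(legacy_query)
    legacy_query_failed = False
except sqlite3.OperationalError as exc:
    legacy_query_failed = True
    print(f"legacy consumer broke: {exc}")
```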
The statement executes in milliseconds. The migration can still break twelve dashboards, three dbt models, one reverse-ETL sync, and a machine-learning feature job.
The difficult part of schema evolution is not changing the schema. It is coordinating independent systems that read and write it on different deployment schedules.
This is the fourth post in the Reliable Data Systems series. We will use the expand-and-contract pattern to evolve data interfaces without requiring a perfectly synchronised release.
Why Atomic Migrations Fail Organisationally
Inside one transaction, a database migration may be atomic. Across an organisation, adoption is not.
If the producer removes the old field at 10:00, every lagging consumer fails.
Distributed systems rarely support “everyone deploy at once” as a dependable operating model. The safe strategy is to make old and new versions coexist temporarily.
The Expand-and-Contract Pattern
The migration has three phases:
- Expand: add the new interface while preserving the old one
- Migrate: move producers, historical data, and consumers
- Contract: remove the deprecated interface after usage reaches zero
Suppose we want to replace customer.name with first_name and last_name.
Phase 1: Expand
Add new nullable columns without removing the old one.
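A minimal sketch of the expansion step, assuming a hypothetical `customer` table in an in-memory SQLite database (the table layout and sample row are illustrative, not from a real system). The new columns are added as nullable, so existing rows and existing readers of `name` are unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada Lovelace')")

# Expand: add the new columns as nullable; the old column is left untouched.
conn.execute("ALTER TABLE customer ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE customer ADD COLUMN last_name TEXT")

# Existing readers of `name` keep working; the new columns start out NULL.
row = conn.execute(
    "SELECT name, first_name, last_name FROM customer WHERE id = 1"
).fetchone()
```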
Update the producer to dual-write.
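One way the dual-write could look, as a sketch: the hypothetical `insert_customer` helper below writes the legacy `name` and the new columns in a single transaction. The function name and the rule for composing the legacy value are assumptions made for illustration.

```python
import sqlite3

def insert_customer(conn, customer_id, first_name, last_name):
    """Dual-write: populate the legacy `name` column and the new columns."""
    legacy_name = f"{first_name} {last_name}".strip()
    with conn:  # one transaction, so both representations commit together
        conn.execute(
            "INSERT INTO customer (id, name, first_name, last_name) "
            "VALUES (?, ?, ?, ?)",
            (customer_id, legacy_name, first_name, last_name),
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name TEXT, first_name TEXT, last_name TEXT)"
)
insert_customer(conn, 1, "Grace", "Hopper")
dual_written = conn.execute(
    "SELECT name, first_name, last_name FROM customer WHERE id = 1"
).fetchone()
```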
The old contract still works. New consumers can begin migrating.
Phase 2: Backfill and Migrate
Populate the new fields for historical records.
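A deliberately naive backfill sketch, assuming the same hypothetical `customer` table. The two-token splitting rule and the `needs_review` list standing in for an exception queue are illustrative assumptions. The function only touches rows whose `first_name` is still NULL, so it can be re-run safely.

```python
import sqlite3

def backfill_names(conn):
    """Fill first_name/last_name from `name`; return ids needing manual review."""
    needs_review = []
    rows = conn.execute(
        "SELECT id, name FROM customer WHERE first_name IS NULL ORDER BY id"
    ).fetchall()
    with conn:
        for customer_id, name in rows:
            parts = (name or "").split()
            if len(parts) != 2:
                # Anything that is not exactly two tokens is queued for a
                # human instead of being guessed.
                needs_review.append(customer_id)
                continue
            conn.execute(
                "UPDATE customer SET first_name = ?, last_name = ? WHERE id = ?",
                (parts[0], parts[1], customer_id),
            )
    return needs_review

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name TEXT, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO customer (id, name) VALUES (?, ?)",
    [(1, "Ada Lovelace"), (2, "Gabriel García Márquez"), (3, "Cher")],
)
review_queue = backfill_names(conn)
```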
Real names are more complex than this example, so the production migration would need explicit parsing rules and an exception queue. The key point is sequencing: backfill happens before new fields become required.
Consumers switch to the new columns.
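A sketch of a migrated consumer, assuming a hypothetical `customer_display_names` function and the same illustrative table. The query no longer mentions `name` at all, which is what eventually makes the legacy column's read count fall to zero.

```python
import sqlite3

def customer_display_names(conn):
    """Migrated consumer: reads first_name/last_name, never the legacy `name`."""
    rows = conn.execute(
        "SELECT id, first_name, last_name FROM customer ORDER BY id"
    ).fetchall()
    return {customer_id: f"{first} {last}" for customer_id, first, last in rows}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name TEXT, first_name TEXT, last_name TEXT)"
)
conn.execute(
    "INSERT INTO customer VALUES (1, 'Ada Lovelace', 'Ada', 'Lovelace')"
)
display_names = customer_display_names(conn)
```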
During this period, monitor reads of the deprecated field.
Phase 3: Contract
Only when all consumers have migrated, remove the old column.
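A sketch of gating the contraction on evidence, assuming a hypothetical `contraction_statement` helper that receives a count of legacy reads from whatever monitoring is in place. The helper and its parameter are illustrative. The DDL it returns is standard `ALTER TABLE ... DROP COLUMN` (in SQLite this needs version 3.35 or newer).

```python
def contraction_statement(legacy_reads_in_window: int) -> str:
    """Return the DROP COLUMN DDL only if no legacy reads were observed."""
    if legacy_reads_in_window > 0:
        raise RuntimeError(
            f"{legacy_reads_in_window} legacy reads observed; contraction blocked"
        )
    return "ALTER TABLE customer DROP COLUMN name"

# With zero observed reads, the destructive statement is released.
ddl = contraction_statement(0)
```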
The destructive change happens last, not first.
Backward and Forward Compatibility
Compatibility language can be confusing because it depends on which side is old.
Backward Compatible
New consumers can read old data.
For example, a new consumer treats a newly introduced field as optional.
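A minimal sketch, assuming a hypothetical payment event and borrowing the `card_network` field named later in this post. The `payment_id` and `amount` keys and the `"unknown"` fallback are illustrative assumptions.

```python
def describe_payment(event: dict) -> str:
    """New consumer: `card_network` is optional, so old events still parse."""
    network = event.get("card_network", "unknown")
    return f"{event['payment_id']}: {event['amount']} via {network}"

old_event = {"payment_id": "p-1", "amount": 1200}
new_event = {"payment_id": "p-2", "amount": 800, "card_network": "visa"}

old_summary = describe_payment(old_event)
new_summary = describe_payment(new_event)
```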
It can process both old events without the field and new events with it.
Forward Compatible
Old consumers can read new data.
An old JSON consumer that ignores unknown fields remains functional when the producer adds card_network.
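As a sketch of that tolerant-reader behaviour, assume a hypothetical old consumer that only knows `payment_id` and `amount`. It picks out the fields it was built for and ignores the rest, so a message carrying `card_network` still parses. A consumer that rejected unknown keys would not have this property.

```python
import json

KNOWN_FIELDS = ("payment_id", "amount")  # everything the old consumer knows

def old_consumer_parse(raw: str) -> dict:
    """Old consumer: keep known fields and ignore anything newer."""
    payload = json.loads(raw)
    return {field: payload[field] for field in KNOWN_FIELDS}

new_message = json.dumps(
    {"payment_id": "p-2", "amount": 800, "card_network": "visa"}
)
parsed = old_consumer_parse(new_message)
```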
Full Compatibility
Old and new consumers can read old and new data.
This is the ideal during a rolling migration, though not every serialisation format or change type supports it.
| Change | Usually Safe? | Why |
|---|---|---|
| Add optional field | Yes | Old readers ignore it |
| Add required field | No | Old records do not contain it |
| Remove field | No | Existing consumers may read it |
| Rename field | No | Equivalent to remove + add |
| Widen integer type | Often | Depends on reader format |
| Change field meaning | No | Schema may pass while semantics fail |
Version Events Deliberately
For event streams, coexistence is sometimes clearer with explicit versions: either a separate topic per version, or a version field carried inside each message.
Topic-per-version provides strong isolation but duplicates infrastructure and may require producers to publish to both topics.
In-message versioning keeps one topic but shifts complexity into every consumer.
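Both options can be sketched briefly. The `schema_version` field, the `customer.updated` topic prefix, and the choice to treat unversioned events as version 1 are hypothetical, chosen only to illustrate where the complexity lands in each approach.

```python
def full_name_v1(event: dict) -> str:
    return event["name"]

def full_name_v2(event: dict) -> str:
    return f"{event['first_name']} {event['last_name']}"

HANDLERS = {1: full_name_v1, 2: full_name_v2}

def handle(event: dict) -> str:
    """In-message versioning: every consumer dispatches on the version field."""
    # Assumption: events without a version field predate versioning (v1).
    version = event.get("schema_version", 1)
    try:
        handler = HANDLERS[version]
    except KeyError:
        raise ValueError(f"unsupported schema_version: {version}") from None
    return handler(event)

def topic_for(version: int) -> str:
    """Topic-per-version: the version lives in the topic name instead."""
    return f"customer.updated.v{version}"
```

With in-message versioning, each consumer carries a dispatch table like `HANDLERS`. With topic-per-version, the consumer subscribes to one topic and the producer decides which topics to publish to.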
Choose based on the size of the change:
- Additive changes: evolve the existing version
- Structural redesign: publish a new major version
- Semantic reinterpretation: use a new field or major version
Do not reuse a version number for a different meaning. Historical replay depends on being able to interpret old events exactly as they were originally defined.
Schema Registry Is a Gate, Not a Migration Plan
A schema registry can enforce compatibility before events reach Kafka.
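A real registry performs this check on the server according to the compatibility mode configured for each subject. The sketch below is only a simplified local stand-in for a backward-compatibility check, not the registry's algorithm: it ignores type promotions, nested records, and unions, and the field-spec dictionaries are a hypothetical representation.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> list:
    """Return reasons the new schema cannot read data written with the old one.

    Each mapping is field name -> {"type": ..., "default": ...}; the
    "default" key is present only when the field declares a default.
    """
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the new reader needs a default.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif spec["type"] != old_fields[name]["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields dropped by the new schema are not flagged: a new reader can
    # simply ignore them in old data.
    return problems

old = {"payment_id": {"type": "string"}, "amount": {"type": "long"}}
accepted = {**old, "card_network": {"type": "string", "default": "unknown"}}
rejected = {**old, "currency": {"type": "string"}}
```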
This is valuable, but it does not answer:
- Which dashboards still use the old field?
- Who owns the migration?
- How will historical rows be backfilled?
- When is it safe to remove the old version?
- What happens to long-retained events during replay?
The registry prevents certain incompatible writes. The migration plan coordinates people, code, and history.
Measure Deprecation Instead of Guessing
The most dangerous sentence in a schema migration is:
“I don’t think anyone uses this field anymore.”
Replace intuition with evidence.
For warehouse columns, inspect query history.
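Warehouses expose query history through their own system views, and the details differ by vendor. As a vendor-neutral sketch, assume the history has been exported into rows with hypothetical `user` and `query_text` keys. The matching is a textual heuristic: it will miss `SELECT *` and reads that go through views, so treat its output as a lower bound.

```python
import re
from collections import Counter

def legacy_column_readers(query_log, column="name", table="customer"):
    """Count queries per user that mention the deprecated column."""
    # \b keeps `name` from matching inside `first_name` or `last_name`.
    pattern = re.compile(rf"\b{re.escape(column)}\b", re.IGNORECASE)
    readers = Counter()
    for entry in query_log:
        text = entry["query_text"]
        if table.lower() in text.lower() and pattern.search(text):
            readers[entry["user"]] += 1
    return readers

query_log = [
    {"user": "dash_revenue", "query_text": "SELECT id, name FROM customer"},
    {"user": "crm_sync", "query_text": "SELECT first_name, last_name FROM customer"},
    {"user": "dash_revenue", "query_text": "SELECT c.name FROM customer c"},
]
readers = legacy_column_readers(query_log)
```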
For APIs, log field usage or endpoint version. For event streams, inventory consumer groups and declared schemas.
Create a deprecation scorecard:
| Consumer | Owner | Migrated | Last Legacy Read |
|---|---|---|---|
| customer_360 | Analytics | Yes | 2026-05-22 |
| CRM sync | Growth Eng | Yes | 2026-05-28 |
| Churn features | ML Platform | No | 2026-06-01 |
Contraction is blocked until the remaining consumer migrates or explicitly accepts breakage.
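The scorecard can double as an automated gate. The sketch below encodes the three rows from the table. The `quiet_days` threshold and the `today` value used in the example are hypothetical parameters, added to show how a quiet period after the last legacy read might be enforced alongside the migrated flag.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ScorecardRow:
    consumer: str
    owner: str
    migrated: bool
    last_legacy_read: date

SCORECARD = [
    ScorecardRow("customer_360", "Analytics", True, date(2026, 5, 22)),
    ScorecardRow("CRM sync", "Growth Eng", True, date(2026, 5, 28)),
    ScorecardRow("Churn features", "ML Platform", False, date(2026, 6, 1)),
]

def contraction_blockers(rows, today, quiet_days=14):
    """Consumers that are unmigrated or read the legacy field too recently."""
    return [
        row.consumer
        for row in rows
        if not row.migrated or (today - row.last_legacy_read).days < quiet_days
    ]

# Hypothetical evaluation date, two weeks after the last recorded legacy read.
blockers = contraction_blockers(SCORECARD, today=date(2026, 6, 15))
```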
Make Rollback Possible
Every migration should define rollback before deployment.
During expansion, rollback is easy because the old interface remains.
During contraction, rollback is harder because data may stop being written to the old field. Delay irreversible cleanup:
- Stop consumers from reading the old field
- Continue dual-writing for a safety period
- Disable old writes but retain the column
- Observe
- Remove the column in a separate release
Separating these steps gives monitoring time to reveal forgotten dependencies.
A migration that adds and drops fields in one release eliminates the rollback path precisely when uncertainty is highest.
The Evolution Rule
Safe schema evolution assumes:
- Producers and consumers deploy independently
- Historical data must remain interpretable
- Some dependencies are undocumented
- Rollback will eventually be needed
Expand first. Migrate with measurement. Contract only after evidence.
The SQL statement may take milliseconds. The interface migration should take as long as downstream safety requires.
Next in the Reliable Data Systems series: how to backfill corrected logic without overwhelming production, duplicating records, or publishing half-finished results.