YC Medical

Schema Evolution Without the Emergency Migration

Important

TRANSPLANT WINDOW OPEN: Existing field scheduled for replacement. Twelve downstream consumers detected. Zero-downtime protocol requires expansion before contraction.

Renaming a database column sounds trivial.

alter table customers rename column name to full_name;

The statement executes in milliseconds. The migration can still break twelve dashboards, three dbt models, one reverse-ETL sync, and a machine-learning feature job.

The difficult part of schema evolution is not changing the schema. It is coordinating independent systems that read and write it on different deployment schedules.

This is the fourth post in the Reliable Data Systems series. We will use the expand-and-contract pattern to evolve data interfaces without requiring a perfectly synchronised release.


Why Atomic Migrations Fail Organisationally

Inside one transaction, a database migration may be atomic. Across an organisation, adoption is not.

Producer deploys v2 at 10:00
Consumer A deploys v2 at 10:05
Consumer B deploys v2 next Tuesday
Dashboard C is owned by a team nobody remembered

If the producer removes the old field at 10:00, every lagging consumer fails.

Distributed systems rarely support “everyone deploy at once” as a dependable operating model. The safe strategy is to make old and new versions coexist temporarily.
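The failure mode can be sketched in a few lines of Python. The consumer functions and row shapes here are hypothetical and reuse the name to full_name rename from the opening example; the point is only that a lagging reader survives when both shapes coexist.

```python
def consumer_v1(row: dict) -> str:
    """Lagging consumer: still reads the old field."""
    return row["name"]


def consumer_v2(row: dict) -> str:
    """Migrated consumer: reads the new field."""
    return row["full_name"]


def survives(consumer, row: dict) -> bool:
    """Return True if the consumer can process the row without a KeyError."""
    try:
        consumer(row)
    except KeyError:
        return False
    return True


# Producer renamed the field in place: the old field is gone.
renamed_in_place = {"customer_id": 1, "full_name": "Ada Lovelace"}

# Producer keeps both fields alive during the migration window.
coexisting = {"customer_id": 1, "name": "Ada Lovelace", "full_name": "Ada Lovelace"}
```

With `renamed_in_place`, only `consumer_v2` keeps working. With `coexisting`, both do, which is what buys the lagging teams time.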


The Expand-and-Contract Pattern

The migration has three phases:

  1. Expand: add the new interface while preserving the old one
  2. Migrate: move producers, historical data, and consumers
  3. Contract: remove the deprecated interface after usage reaches zero

Suppose we want to replace customers.name with first_name and last_name.

Phase 1: Expand

Add new nullable columns without removing the old one:

alter table customers add column first_name varchar;
alter table customers add column last_name varchar;

Update the producer to dual-write:

customer = {
    "name": f"{first_name} {last_name}",  # Legacy consumers
    "first_name": first_name,             # New consumers
    "last_name": last_name,
}

The old contract still works. New consumers can begin migrating.

Phase 2: Backfill and Migrate

Populate the new fields for historical records:

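update customers
set
  first_name = split_part(name, ' ', 1),
  -- Single-word names contain no space; nullif leaves last_name null for review
  -- instead of copying the whole name into it.
  last_name = nullif(substring(name from position(' ' in name) + 1), name)
where first_name is null
  and name is not null;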

Real names are more complex than this example, so the production migration would need explicit parsing rules and an exception queue. The key point is sequencing: backfill happens before new fields become required.
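One way to picture explicit parsing rules with an exception queue is the sketch below. The rule it applies (split automatically only when the name has exactly two tokens) is an assumption chosen for illustration, and `split_name` and `backfill` are hypothetical helpers, not part of any library.

```python
from __future__ import annotations


def split_name(full_name: str | None) -> tuple[str, str] | None:
    """Return (first_name, last_name), or None when the name needs human review."""
    if full_name is None:
        return None
    parts = full_name.split()
    # Assumed rule: only unambiguous two-token names are split automatically.
    if len(parts) != 2:
        return None
    return parts[0], parts[1]


def backfill(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition rows into (updated, exceptions) without mutating the input."""
    updated: list[dict] = []
    exceptions: list[dict] = []
    for row in rows:
        parsed = split_name(row.get("name"))
        if parsed is None:
            exceptions.append(row)  # Goes to the exception queue.
        else:
            first, last = parsed
            updated.append({**row, "first_name": first, "last_name": last})
    return updated, exceptions
```

Rows such as a single-word name, a four-token name, or a missing name land in `exceptions` rather than being guessed at, so the backfill never silently writes a wrong split.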

Consumers switch to the new columns:

- select customer_id, name from customers
+ select customer_id, first_name, last_name from customers

During this period, monitor reads of the deprecated field.
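For application code that handles rows as dictionaries, one possible way to observe legacy reads is a thin wrapper that counts access to deprecated keys. `DeprecationTrackingRecord` is a hypothetical helper, and the shared in-process counter stands in for whatever metrics system is actually in use.

```python
from collections import Counter


class DeprecationTrackingRecord(dict):
    """A dict that counts reads of deprecated keys (illustrative helper)."""

    deprecated = frozenset({"name"})
    legacy_reads: Counter = Counter()  # Shared across all instances.

    def __getitem__(self, key):
        if key in self.deprecated:
            type(self).legacy_reads[key] += 1
        return super().__getitem__(key)

    def get(self, key, default=None):
        # dict.get does not call __getitem__, so it is tracked separately.
        if key in self.deprecated:
            type(self).legacy_reads[key] += 1
        return super().get(key, default)
```

Reading `row["name"]` or `row.get("name")` increments `legacy_reads["name"]`, while reads of `first_name` and `last_name` leave it untouched. Other access paths, such as iterating over `items()`, are not tracked by this sketch.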

Phase 3: Contract

Only when all consumers have migrated:

alter table customers drop column name;
alter table customers alter column first_name set not null;
alter table customers alter column last_name set not null;

The destructive change happens last, not first.
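Before running the contract statements, it can help to verify the preconditions mechanically. The sketch below uses an in-memory SQLite database purely as a stand-in for the real warehouse; `demo_connection`, `unbackfilled_rows`, and `safe_to_contract` are hypothetical names, and the legacy read count is assumed to come from separate monitoring.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory customers table for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table customers ("
        "customer_id integer primary key, name text, "
        "first_name text, last_name text)"
    )
    conn.executemany(
        "insert into customers values (?, ?, ?, ?)",
        [(1, "Ada Lovelace", "Ada", "Lovelace"), (2, "Cher", None, None)],
    )
    return conn


def unbackfilled_rows(conn: sqlite3.Connection) -> int:
    """Count rows that would violate the planned NOT NULL constraints."""
    (count,) = conn.execute(
        "select count(*) from customers "
        "where first_name is null or last_name is null"
    ).fetchone()
    return count


def safe_to_contract(conn: sqlite3.Connection, recent_legacy_reads: int) -> bool:
    """Contract only when the backfill is complete and nobody reads the old field."""
    return unbackfilled_rows(conn) == 0 and recent_legacy_reads == 0
```

In the demo data, the unresolved row for customer 2 blocks contraction until it is handled, regardless of how quiet the legacy field has been.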


Backward and Forward Compatibility

Compatibility language can be confusing because it depends on which side is old.

Backward Compatible

New consumers can read old data.

For example, a new consumer treats a newly introduced field as optional:

card_network = event.get("card_network", "unknown")

It can process both old events without the field and new events with it.

Forward Compatible

Old consumers can read new data.

An old JSON consumer that ignores unknown fields remains functional when the producer adds card_network.

Full Compatibility

Old and new consumers can read old and new data.

This is the ideal during a rolling migration, though not every serialization format or change type supports it.

Change               | Usually Safe? | Why
---------------------|---------------|-------------------------------------
Add optional field   | Yes           | Old readers ignore it
Add required field   | No            | Old records do not contain it
Remove field         | No            | Existing consumers may read it
Rename field         | No            | Equivalent to remove + add
Widen integer type   | Often         | Depends on reader format
Change field meaning | No            | Schema may pass while semantics fail
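The first four rows of the table can be checked with a toy model. The sketch below assumes a schema is just a mapping from field name to whether the field is required, that readers ignore unknown fields, and that optional fields have defaults. It is not how any particular schema registry implements its checks, and it says nothing about type widening or semantic changes.

```python
Schema = dict[str, bool]  # field name -> is the field required?


def can_read(reader: Schema, writer: Schema) -> bool:
    """True if every field the reader requires is always present in written data."""
    return all(
        writer.get(field, False)
        for field, required in reader.items()
        if required
    )


def backward_compatible(old: Schema, new: Schema) -> bool:
    """New consumers can read old data."""
    return can_read(reader=new, writer=old)


def forward_compatible(old: Schema, new: Schema) -> bool:
    """Old consumers can read new data."""
    return can_read(reader=old, writer=new)


def fully_compatible(old: Schema, new: Schema) -> bool:
    return backward_compatible(old, new) and forward_compatible(old, new)


v1: Schema = {"customer_id": True, "name": True}
add_optional: Schema = {**v1, "card_network": False}
add_required: Schema = {**v1, "card_network": True}
remove_field: Schema = {"customer_id": True}
rename_field: Schema = {"customer_id": True, "full_name": True}
```

Under these assumptions, only `add_optional` is fully compatible with `v1`. Adding a required field breaks backward compatibility, removing a field breaks forward compatibility, and a rename breaks both.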

Version Events Deliberately

For event streams, coexistence is sometimes clearer with explicit versions, such as one topic per version:

payments.v1
payments.v2

Or with a version field:
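A minimal sketch of in-message versioning follows. The `schema_version` field name, the payment shapes, and the assumption that version 1 amounts are US dollars are all hypothetical, chosen only to show a structural change between versions.

```python
def parse_payment(event: dict) -> dict:
    """Normalise any supported event version into one internal shape."""
    # Assumption: events written before versioning existed are treated as v1.
    version = event.get("schema_version", 1)

    if version == 1:
        # v1 (assumed): amount is a decimal number of US dollars.
        return {"minor_units": round(event["amount"] * 100), "currency": "USD"}

    if version == 2:
        # v2 (assumed): amount is a structured object with an explicit currency.
        amount = event["amount"]
        return {"minor_units": amount["minor_units"], "currency": amount["currency"]}

    # Unknown versions fail loudly instead of being misread.
    raise ValueError(f"unsupported schema_version: {version}")


v1_event = {"schema_version": 1, "amount": 12.5}
v2_event = {"schema_version": 2, "amount": {"minor_units": 1250, "currency": "USD"}}
```

Both example events normalise to the same internal record, while an event with an unrecognised version raises an error.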

Topic-per-version provides strong isolation but duplicates infrastructure and may require producers to publish to both topics.

In-message versioning keeps one topic but shifts complexity into every consumer.

Choose based on the size of the change:

  • Additive changes: evolve the existing version
  • Structural redesign: publish a new major version
  • Semantic reinterpretation: use a new field or major version

Do not reuse a version number for a different meaning. Historical replay depends on being able to interpret old events exactly as they were originally defined.


Schema Registry Is a Gate, Not a Migration Plan

A schema registry can enforce compatibility before events reach Kafka:

Producer proposes schema v7
Registry compares v7 with v6
  Compatible? -> publish allowed
  Breaking?   -> deployment rejected

This is valuable, but it does not answer:

  • Which dashboards still use the old field?
  • Who owns the migration?
  • How will historical rows be backfilled?
  • When is it safe to remove the old version?
  • What happens to long-retained events during replay?

The registry prevents certain incompatible writes. The migration plan coordinates people, code, and history.


Measure Deprecation Instead of Guessing

The most dangerous sentence in a schema migration is:

“I don’t think anyone uses this field anymore.”

Replace intuition with evidence.

For warehouse columns, inspect query history:

-- Heuristic only: this text match misses aliased (c.name), unqualified, and select * reads.
select
  user_name,
  query_text,
  start_time
from warehouse_query_history
where lower(query_text) like '%customers.name%'
  and start_time >= current_date - interval '30' day;

For APIs, log field usage or endpoint version. For event streams, inventory consumer groups and declared schemas.
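If query text can be exported from the warehouse, a slightly broader heuristic is possible in application code. The sketch below is an assumption-laden illustration: it flags any query that mentions the customers table and uses name as a whole word. It will still miss select * and may flag string literals, so treat hits as leads to investigate, not proof.

```python
import re

# Whole-word match: hits `name`, `c.name`, `customers.name`,
# but not `first_name` or `full_name` (underscore is a word character).
LEGACY_NAME = re.compile(r"\bname\b", re.IGNORECASE)
CUSTOMERS_TABLE = re.compile(r"\bcustomers\b", re.IGNORECASE)


def reads_legacy_name(query_text: str) -> bool:
    """Heuristically detect queries that still read customers.name."""
    return bool(
        CUSTOMERS_TABLE.search(query_text) and LEGACY_NAME.search(query_text)
    )
```

For example, `select c.name from customers c` is flagged, while `select first_name, last_name from customers` is not.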

Create a deprecation scorecard:

Consumer       | Owner       | Migrated | Last Legacy Read
---------------|-------------|----------|-----------------
customer_360   | Analytics   | Yes      | 2026-05-22
CRM sync       | Growth Eng  | Yes      | 2026-05-28
Churn features | ML Platform | No       | 2026-06-01

Contraction is blocked until the remaining consumer migrates or explicitly accepts breakage.
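The scorecard can double as an automated gate. The sketch below encodes the three rows above; the 30-day quiet period and the `blockers` function are assumptions for illustration, and explicit acceptance of breakage by an owner is not modelled.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ConsumerStatus:
    consumer: str
    owner: str
    migrated: bool
    last_legacy_read: date


def blockers(
    scorecard: list[ConsumerStatus], today: date, quiet_days: int = 30
) -> list[str]:
    """Return consumers that still block contraction.

    A consumer blocks if it has not migrated, or if it read the legacy
    field within the assumed quiet period.
    """
    cutoff = today - timedelta(days=quiet_days)
    return [
        c.consumer
        for c in scorecard
        if not c.migrated or c.last_legacy_read > cutoff
    ]


scorecard = [
    ConsumerStatus("customer_360", "Analytics", True, date(2026, 5, 22)),
    ConsumerStatus("CRM sync", "Growth Eng", True, date(2026, 5, 28)),
    ConsumerStatus("Churn features", "ML Platform", False, date(2026, 6, 1)),
]
```

Evaluated on 2026-06-02, all three consumers block because every legacy read is too recent. Evaluated on 2026-07-01, only the unmigrated consumer remains.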


Make Rollback Possible

Every migration should define rollback before deployment.

During expansion, rollback is easy because the old interface remains.

During contraction, rollback is harder because data may stop being written to the old field. Delay irreversible cleanup:

  1. Stop consumers from reading old field
  2. Continue dual-writing for a safety period
  3. Disable old writes but retain column
  4. Observe
  5. Remove column in a separate release

Separating these steps gives monitoring time to reveal forgotten dependencies.
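One way to keep the steps separate in producer code is an explicit stage setting, so that disabling old writes is a configuration change and not a schema change. The `Stage` enum and helper functions below are hypothetical and only sketch the idea.

```python
from enum import Enum


class Stage(Enum):
    DUAL_WRITE = "dual_write"                    # Old and new fields both written.
    OLD_WRITES_DISABLED = "old_writes_disabled"  # Column retained, no longer written.
    COLUMN_DROPPED = "column_dropped"            # Irreversible without a restore.


def build_customer(first_name: str, last_name: str, stage: Stage) -> dict:
    """Build the record the producer writes at a given migration stage."""
    customer = {"first_name": first_name, "last_name": last_name}
    if stage is Stage.DUAL_WRITE:
        customer["name"] = f"{first_name} {last_name}"  # Legacy consumers.
    return customer


def can_roll_back_by_config(stage: Stage) -> bool:
    """While the column exists, rollback means re-enabling writes to it."""
    return stage is not Stage.COLUMN_DROPPED
```

Rolling back from `OLD_WRITES_DISABLED` still requires regenerating the legacy value for rows written in the meantime, but the column is there to receive it.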

A migration that adds and drops fields in one release eliminates the rollback path precisely when uncertainty is highest.


The Evolution Rule

Safe schema evolution assumes:

  • Producers and consumers deploy independently
  • Historical data must remain interpretable
  • Some dependencies are undocumented
  • Rollback will eventually be needed

Expand first. Migrate with measurement. Contract only after evidence.

The SQL statement may take milliseconds. The interface migration should take as long as downstream safety requires.


Next in the Reliable Data Systems series: how to backfill corrected logic without overwhelming production, duplicating records, or publishing half-finished results.