The Green Pipeline That Produced the Wrong Answer
Caution
FALSE NEGATIVE DETECTED: Orchestrator reports success. Warehouse tables are populated. Dashboard is rendering. Business totals are wrong. Initiating semantic integrity scan.
The most dangerous data pipeline is not the one that fails.
A failed pipeline is visible. Airflow turns red. PagerDuty fires. Someone opens the logs and starts debugging.
The dangerous pipeline finishes successfully, writes rows to the warehouse, and updates every dashboard on schedule — while quietly changing the meaning of the data.
This is the first post in a series on reliable data systems. We begin with the failure mode that ordinary monitoring misses: syntactically valid, semantically wrong data.
The Incident: A Column That Never Broke
Imagine a payment service publishing an event whose amount field carries the integer 1299.
The data team interprets amount as cents. The transformation divides it by 100 and reports $12.99.
Six months later, the payments team migrates providers. The new provider emits major currency units, so the same payment now arrives with amount set to 12.99.
The field still exists. It is still numeric. The JSON parser succeeds. The ingestion job succeeds. The warehouse accepts the value. The dbt model divides by 100 and reports $0.13.
Every technical check is green because no technical interface was broken. The semantic contract was broken.
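The incident can be reproduced in a few lines. This is a minimal Python sketch, not the original pipeline: the payment_id values and the helper names to_dollars and passes_technical_checks are assumptions made for illustration, and only the two amount values come from the story above.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical payloads; only the amount values are implied by the text.
old_event = {"payment_id": "p_001", "amount": 1299}              # cents
new_event = {"payment_id": "p_001", "amount": Decimal("12.99")}  # major units


def to_dollars(event: dict) -> Decimal:
    """The unchanged transformation: it assumes amount is in cents."""
    dollars = Decimal(event["amount"]) / Decimal(100)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def passes_technical_checks(event: dict) -> bool:
    """Structure and type only: the field exists and is numeric."""
    return isinstance(event.get("amount"), (int, Decimal))


for event in (old_event, new_event):
    print(passes_technical_checks(event), to_dollars(event))
```

Both events pass the structural check, yet the second one prints 0.13 instead of 12.99. Nothing in the code path can tell the two apart.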
This is why pipeline uptime is not the same as data reliability.
What a Data Contract Actually Is
A data contract is an explicit agreement between a data producer and its consumers. It defines more than column names and types.
A useful contract covers five dimensions:
| Dimension | Contract Question | Example |
|---|---|---|
| Structure | Which fields exist? | payment_id is required |
| Type | What representation is valid? | amount is decimal(18,2) |
| Semantics | What does the value mean? | amount is in major currency units |
| Constraints | Which values are allowed? | status is pending, captured, or refunded |
| Operations | How may it change? | Breaking changes require 14 days’ notice |
A JSON Schema or protobuf definition handles structure and type well. It does not automatically communicate whether a timestamp is UTC, whether revenue includes tax, or whether deleted users remain in the feed.
Those details are not documentation trivia. They are part of the interface.
Contract Example
The contract is valuable only if it is executable. A YAML file nobody validates is merely optimistic documentation.
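One way to make a contract executable is to hold it as data and check records against it. The Python sketch below is an illustration under stated assumptions: the CONTRACT layout, its key names, and the violations helper are hypothetical, and the field rules are borrowed from the examples in the table above. The meaning entry is recorded but not checked here, since semantics usually need a separate test.

```python
from decimal import Decimal

# Hypothetical contract for a payments event, written as plain data.
CONTRACT = {
    "fields": {
        "payment_id": {"type": str, "required": True},
        "amount": {"type": Decimal, "required": True,
                   "meaning": "major currency units"},
        "status": {"type": str, "required": True,
                   "allowed": {"pending", "captured", "refunded"}},
    },
    "change_policy": {"breaking_change_notice_days": 14},
}


def violations(record: dict, contract: dict = CONTRACT) -> list[str]:
    """Return human-readable contract violations for one record."""
    problems = []
    for name, rule in contract["fields"].items():
        if name not in record:
            if rule["required"]:
                problems.append(f"{name}: missing required field")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
        elif "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems
```

A record with status set to settled would come back with a single violation naming that field, while a conforming record returns an empty list.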
Put Checks at Both Ends
Reliability requires validation near the producer and near the consumer.
Producer-Side Validation
The producer should reject events that violate the declared schema before publishing them.
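A producer-side gate might look roughly like the following Python sketch. The schema, the ContractViolation exception, and the publish wrapper are all assumptions for illustration; send stands in for whatever transport the service actually uses.

```python
from decimal import Decimal


class ContractViolation(ValueError):
    """Raised when an outgoing event does not match the declared schema."""


# Hypothetical declared schema: field name -> expected Python type.
PAYMENT_SCHEMA = {"payment_id": str, "amount": Decimal, "currency": str}


def validate_outgoing(event: dict) -> dict:
    """Return the event unchanged, or raise if it violates the schema."""
    for field, expected in PAYMENT_SCHEMA.items():
        if field not in event:
            raise ContractViolation(f"missing field: {field}")
        if not isinstance(event[field], expected):
            raise ContractViolation(
                f"{field}: expected {expected.__name__}, "
                f"got {type(event[field]).__name__}"
            )
    return event


def publish(event: dict, send) -> None:
    """Validate first; nothing is sent if validation raises."""
    send(validate_outgoing(event))
```

Because validation runs before send is called, an event carrying an integer amount never leaves the service.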
This catches malformed data at the point where the owning team has the most context. The payments service knows why amount changed. The warehouse ingestion job does not.
Consumer-Side Validation
Consumers still need their own checks because transport, historical replay, and undocumented upstream changes can introduce surprises.
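A consumer-side check can target meaning rather than shape. The Python sketch below is one possible heuristic, not a standard test: it compares the median of an incoming batch with a trailing baseline and flags a shift of more than an order of magnitude, which is what a cents-to-dollars switch looks like. The function name and the max_ratio threshold are assumptions.

```python
from statistics import median


def unit_shift_suspected(baseline: list[float], batch: list[float],
                         max_ratio: float = 10.0) -> bool:
    """Flag a batch whose median is off from the baseline median
    by more than max_ratio in either direction."""
    base, new = median(baseline), median(batch)
    if base <= 0 or new <= 0:
        return True  # non-positive medians are suspicious for payment amounts
    ratio = new / base
    return ratio > max_ratio or ratio < 1 / max_ratio
```

With a baseline of [1299, 2500, 999] in cents, a batch of [12.99, 25.00, 9.99] is flagged, while [1350, 2400, 1100] is not. The threshold would need tuning against real seasonality before anyone paged on it.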
The producer contract protects the platform. The consumer test protects the analytical use case. They overlap intentionally.
Compatibility: Not Every Change Is Equal
Schema changes fall into three broad categories.
Additive Changes
Adding an optional field is usually backward compatible.
Old consumers ignore the new field. New consumers can adopt it when ready.
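A short Python sketch of the additive case, assuming a hypothetical optional field named payment_method and two illustrative consumer functions:

```python
def old_consumer(event: dict) -> tuple:
    """Reads only the fields it knew about when it was written."""
    return event["payment_id"], event["amount"]


def new_consumer(event: dict) -> tuple:
    """Uses the optional field when present, with a fallback otherwise."""
    return (event["payment_id"], event["amount"],
            event.get("payment_method", "unknown"))


v1 = {"payment_id": "p_001", "amount": "12.99"}
v2 = {**v1, "payment_method": "card"}  # hypothetical optional field added
```

The old consumer returns the same result for v1 and v2, and the new consumer handles events published before the field existed by falling back to a default.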
Breaking Structural Changes
Renaming or removing a field breaks consumers immediately, because every reader that still looks up the old name fails.
A safer migration publishes both the old and the new field during a transition period.
Consumers migrate first. The deprecated field is removed only after usage reaches zero.
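The difference between a hard rename and a transition period can be sketched in Python as follows. The field names capture_time and captured_at, and all four helpers, are hypothetical; how usage of the old field is measured is left open.

```python
OLD_NAME, NEW_NAME = "capture_time", "captured_at"  # hypothetical rename


def hard_rename(event: dict) -> dict:
    """Breaking: the old key disappears in a single release."""
    out = dict(event)
    out[NEW_NAME] = out.pop(OLD_NAME)
    return out


def dual_publish(event: dict) -> dict:
    """Transitional: emit the same value under both names."""
    return {**event, NEW_NAME: event[OLD_NAME]}


def safe_to_remove(reads_of_old_field: int) -> bool:
    """Drop the deprecated field only once nothing reads it."""
    return reads_of_old_field == 0


def legacy_consumer(event: dict) -> str:
    """Still reads the old name; raises KeyError after a hard rename."""
    return event[OLD_NAME]
```

A legacy consumer keeps working against dual_publish output and fails with KeyError against hard_rename output, which is the whole argument for the transition period.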
Breaking Semantic Changes
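Semantic changes are harder because the schema can remain identical. In the opening incident, amount stayed numeric while its unit changed from cents to major currency units.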
This should be treated as a new field or a new contract version, not an invisible reinterpretation.
When meaning changes, naming should change with it.
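As a sketch of that naming rule in Python, a consumer can resolve the amount by explicit field name instead of inferring units. The names amount_cents and amount_major are assumptions chosen for illustration, not fields from any real provider.

```python
from decimal import Decimal


def to_major_units(event: dict) -> Decimal:
    """Resolve the amount by explicit field name, never by guessing units."""
    if "amount_major" in event:   # hypothetical name for the new meaning
        return Decimal(str(event["amount_major"]))
    if "amount_cents" in event:   # hypothetical name for the old meaning
        return Decimal(event["amount_cents"]) / Decimal(100)
    raise KeyError("event carries no recognised amount field")
```

An event that only carries an ambiguous amount key is rejected loudly rather than silently divided by 100.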
Contract Enforcement in CI
The best time to discover a breaking change is before deployment.
A contract-aware CI workflow can compare the proposed schema against the version currently in production.
If a pull request removes captured_at, changes amount from decimal to string, or makes an optional field required, CI should fail with a list of affected consumers.
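The three rules named above can be checked with a small comparison function. This Python sketch assumes a simplified schema shape ({field: {"type": ..., "required": ...}}) and a hypothetical consumers_by_field registry; a real setup would more likely lean on a schema registry's own compatibility check.

```python
def breaking_changes(current: dict, proposed: dict,
                     consumers_by_field: dict) -> list[tuple]:
    """Return (field, reason, affected_consumers) for each breaking change."""
    findings = []
    for name, spec in current.items():
        new = proposed.get(name)
        if new is None:
            reason = "removed"
        elif new["type"] != spec["type"]:
            reason = f"type changed from {spec['type']} to {new['type']}"
        elif new["required"] and not spec["required"]:
            reason = "optional field made required"
        else:
            continue  # unchanged or compatible
        findings.append((name, reason, consumers_by_field.get(name, [])))
    return findings
```

A CI job would exit non-zero whenever the returned list is non-empty and print the affected consumers in the failure message. Fields that exist only in the proposed schema are ignored here, which matches the additive case described earlier.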
This shifts data reliability left: from a dashboard incident on Tuesday morning to a code review comment on Monday afternoon.
Ownership Is the Missing Field
Contracts fail when nobody owns them.
Every production dataset needs:
- A named producing team
- A discoverable communication channel
- An on-call or escalation path
- A deprecation policy
- A list of critical downstream consumers
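The checklist above can also be enforced mechanically, for instance by refusing to register a dataset whose ownership metadata is incomplete. The key names in this Python sketch are assumptions that mirror the five bullets; they are not taken from any particular catalogue tool.

```python
# Hypothetical metadata keys, one per item in the checklist.
REQUIRED_OWNERSHIP_KEYS = (
    "producing_team",
    "channel",
    "escalation_path",
    "deprecation_policy",
    "critical_consumers",
)


def missing_ownership(metadata: dict) -> list[str]:
    """List ownership entries that are absent or empty."""
    return [key for key in REQUIRED_OWNERSHIP_KEYS if not metadata.get(key)]
```

A dataset registered with only a producing team and a channel would come back with the other three keys listed as missing.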
Without ownership, a contract violation becomes a warehouse problem by default. The data team then reverse-engineers an upstream service it does not control.
The contract should make responsibility explicit.
Data reliability is partly a technical problem and partly an organisational interface problem. A schema registry cannot negotiate a migration deadline. People must do that.
The Reliability Rule
Do not ask only:
Did the pipeline run?
Ask:
Did the data preserve the meaning its consumers depend on?
A reliable system validates structure, type, semantics, and change policy. It detects incompatibility before deployment and assigns ownership before an incident.
The green pipeline was not healthy. It was merely alive.
Next in the Reliable Data Systems series: why retries create duplicates, and how idempotent pipelines make repeated execution safe.