The Green Pipeline That Produced the Wrong Answer

Anonymous included in Data Engineering

2026-05-11

Caution

FALSE NEGATIVE DETECTED: Orchestrator reports success. Warehouse tables are populated. Dashboard is rendering. Business totals are wrong. Initiating semantic integrity scan.

The most dangerous data pipeline is not the one that fails.

A failed pipeline is visible. Airflow turns red. PagerDuty fires. Someone opens the logs and starts debugging.

The dangerous pipeline finishes successfully, writes rows to the warehouse, and updates every dashboard on schedule — while quietly changing the meaning of the data.

This is the first post in a series on reliable data systems. We begin with the failure mode that ordinary monitoring misses: syntactically valid, semantically wrong data.

The Incident: A Column That Never Broke

Imagine a payment service publishing an event whose amount field carries the value 1299.

The data team interprets amount as cents. The transformation divides it by 100 and reports $12.99.

Six months later, the payments team migrates providers. The new provider emits major currency units, so the same payment now arrives with an amount of 12.99.

The field still exists. It is still numeric. The JSON parser succeeds. The ingestion job succeeds. The warehouse accepts the value. The dbt model divides by 100 and reports $0.13.

Every technical check is green because no technical interface was broken. The semantic contract was broken.
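
The arithmetic of the incident is easy to reproduce. The sketch below is illustrative only: the `to_dollars` helper and both sample payloads are assumptions, since the real transformation is not shown here. The only facts taken from the text are the divide-by-100 step and the two reported results.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_dollars(amount) -> Decimal:
    """Hypothetical stand-in for the dbt model: assumes `amount` is in cents."""
    dollars = Decimal(str(amount)) / Decimal(100)
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Assumed payloads: only the `amount` values are implied by the text.
old_event = {"payment_id": "p-1", "amount": 1299}    # old provider: cents
new_event = {"payment_id": "p-1", "amount": 12.99}   # new provider: major units

before = to_dollars(old_event["amount"])  # Decimal("12.99")
after = to_dollars(new_event["amount"])   # Decimal("0.13")
```

Nothing in this code raises, logs a warning, or returns null. The function does exactly what it was written to do with a value that no longer means what it used to.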

Source API     Ingestion     Warehouse     dbt     Dashboard
   ✓              ✓             ✓          ✓          WRONG

This is why pipeline uptime is not the same as data reliability.

What a Data Contract Actually Is

A data contract is an explicit agreement between a data producer and its consumers. It defines more than column names and types.

A useful contract covers five dimensions:

Dimension	Contract Question	Example
Structure	Which fields exist?	`payment_id` is required
Type	What representation is valid?	`amount` is `decimal(18,2)`
Semantics	What does the value mean?	Amount is in major currency units
Constraints	Which values are allowed?	Status is `pending`, `captured`, or `refunded`
Operations	How may it change?	Breaking changes require 14 days’ notice

A JSON Schema or protobuf definition handles structure and type well. It does not automatically communicate whether a timestamp is UTC, whether revenue includes tax, or whether deleted users remain in the feed.

Those details are not documentation trivia. They are part of the interface.

Contract Example

dataset: payments
owner: payments-platform
version: 2

fields:
  payment_id:
    type: string
    required: true
    unique: true

  amount:
    type: decimal(18,2)
    required: true
    description: Captured amount in major currency units
    constraints:
      min: 0

  currency:
    type: string
    required: true
    constraints:
      allowed_values: [USD, NZD, AUD, EUR]

  captured_at:
    type: timestamp
    timezone: UTC

change_policy:
  breaking_change_notice_days: 14
  compatibility: backward

The contract is valuable only if it is executable. A YAML file nobody validates is merely optimistic documentation.
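
One way to make the contract executable is to turn its rules into a function that returns every violation for a given event. The sketch below is a minimal illustration, not a reference implementation: `PAYMENT_CONTRACT` is a hand-translated subset of the YAML above, and `violations` is a hypothetical helper name. A real setup would load the YAML file and would probably lean on an existing validation library.

```python
from datetime import datetime, timezone, timedelta
from decimal import Decimal

# Hand-translated subset of the YAML contract above (an assumption: a real
# setup would load and interpret the YAML file itself).
PAYMENT_CONTRACT = {
    "payment_id": {"type": str, "required": True},
    "amount": {"type": Decimal, "required": True, "min": Decimal("0")},
    "currency": {"type": str, "required": True,
                 "allowed_values": {"USD", "NZD", "AUD", "EUR"}},
    "captured_at": {"type": datetime, "required": False, "utc": True},
}

def violations(event: dict, contract: dict) -> list[str]:
    """Return human-readable contract violations; an empty list means valid."""
    problems = []
    for name, rule in contract.items():
        value = event.get(name)
        if value is None:
            if rule["required"]:
                problems.append(f"{name}: missing required field")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
        if "allowed_values" in rule and value not in rule["allowed_values"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        # A naive datetime has no offset, so it is also reported as non-UTC.
        if rule.get("utc") and value.utcoffset() != timedelta(0):
            problems.append(f"{name}: timestamp is not UTC")
    return problems

good = {"payment_id": "p-1", "amount": Decimal("12.99"), "currency": "USD",
        "captured_at": datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc)}
bad = {"payment_id": "p-2", "amount": Decimal("-1.00"), "currency": "GBP"}
```

Calling `violations(good, PAYMENT_CONTRACT)` returns an empty list, while the `bad` event yields one problem for `amount` and one for `currency`. The sketch skips the `unique` rule, which needs state across events, and it cannot see the unit of `amount`: a value of 1299 and a value of 12.99 both pass.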

Put Checks at Both Ends

Reliability requires validation near the producer and near the consumer.

Producer-Side Validation

The producer should reject events that violate the declared schema before publishing them.

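# validate, payment_contract, and kafka stand in for the service's own
# validator, loaded contract, and producer client.
def publish_payment(event: dict) -> None:
    # Raise before publishing so an invalid event never reaches the topic.
    validate(event, payment_contract)
    kafka.send("payments.v2", event)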

This catches malformed data at the point where the owning team has the most context. The payments service knows why amount changed. The warehouse ingestion job does not.
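
To see the validate-then-publish pattern run end to end without a broker, the sketch below injects both the check and the send function. Every name in it (`ContractViolation`, `make_publisher`, `check_amount`, the in-memory `sent` list) is hypothetical and chosen for illustration; the only detail taken from the snippet above is the `payments.v2` topic.

```python
from typing import Callable

class ContractViolation(ValueError):
    """Raised when an event does not satisfy the contract."""

def make_publisher(check: Callable[[dict], list],
                   send: Callable[[str, dict], None],
                   topic: str = "payments.v2") -> Callable[[dict], None]:
    """Build a publish function that validates before sending.

    `check` returns a list of problems (empty when valid); `send` is whatever
    the real producer client exposes. Both are injected so the sketch runs
    without a broker.
    """
    def publish(event: dict) -> None:
        problems = check(event)
        if problems:
            # Fail loudly in the producer, where the owning team sees it.
            raise ContractViolation("; ".join(problems))
        send(topic, event)
    return publish

# Toy wiring: a one-rule check and an in-memory "topic".
def check_amount(event: dict) -> list:
    amount = event.get("amount")
    if amount is None:
        return ["amount: missing required field"]
    if amount < 0:
        return ["amount: below minimum 0"]
    return []

sent = []
publish_payment = make_publisher(check_amount,
                                 lambda topic, event: sent.append((topic, event)))

publish_payment({"payment_id": "p-1", "amount": 12.99})
try:
    publish_payment({"payment_id": "p-2", "amount": -5})
    rejected = False
except ContractViolation:
    rejected = True
```

After this runs, `sent` holds only the first event and `rejected` is true. Raising an exception is one design choice among several; a producer that cannot afford to drop events might route violations to a dead-letter topic instead.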

Consumer-Side Validation

Consumers still need their own checks because transport, historical replay, and undocumented upstream changes can introduce surprises.

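-- dbt singular test: impossible payment values
-- Any row returned counts as a failure. The explicit null checks matter
-- because comparisons and not in evaluate to null, not true, on null input.
select payment_id
from {{ ref('stg_payments') }}
where amount is null
   or amount < 0
   or currency is null
   or currency not in ('USD', 'NZD', 'AUD', 'EUR')
   or captured_at > current_timestamp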

The producer contract protects the platform. The consumer test protects the analytical use case. They overlap intentionally.
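
A range test like the one above would not have flagged the unit change from the incident, because 12.99 is a perfectly valid amount. One possible complement, offered here as an assumption rather than a prescribed technique, is to compare the scale of today's values with a recent baseline. The `scale_shift` function, the sample batches, and the 10x threshold below are all illustrative.

```python
from statistics import median

def scale_shift(baseline: list[float], current: list[float],
                max_ratio: float = 10.0) -> bool:
    """Return True when the median moved by more than `max_ratio` either way.

    The 10x threshold is an arbitrary illustration; tune it per dataset.
    """
    base, now = median(baseline), median(current)
    if base <= 0 or now <= 0:
        return True  # a zero or negative median is itself worth a look
    ratio = now / base
    return ratio > max_ratio or ratio < 1 / max_ratio

last_week = [1299, 4500, 899, 2500, 10000]           # cents
today_ok = [1499, 3900, 950, 2700, 8800]             # still cents
today_shifted = [14.99, 39.00, 9.50, 27.00, 88.00]   # major units
```

Here `scale_shift(last_week, today_ok)` is false and `scale_shift(last_week, today_shifted)` is true. A check like this is a heuristic: it can miss small semantic changes and can fire on genuine business shifts, so it works best as an alert for human review rather than a hard gate.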

Compatibility: Not Every Change Is Equal

Schema changes fall into three broad categories.

Additive Changes

Adding an optional field is usually backward compatible:

  payment_id: string
  amount: decimal
+ card_network: string | null

Old consumers ignore the new field. New consumers can adopt it when ready.
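
The word "usually" depends on how the consumer parses events. The sketch below shows the tolerant case, using a hypothetical `PaymentV1` record and `parse_v1` function that read only the fields they know about.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PaymentV1:
    payment_id: str
    amount: str  # kept as a string here to avoid float rounding

def parse_v1(event: dict) -> PaymentV1:
    """Old consumer: reads only the fields it knows and ignores the rest."""
    return PaymentV1(payment_id=event["payment_id"], amount=event["amount"])

old_event = {"payment_id": "p-1", "amount": "12.99"}
new_event = {"payment_id": "p-1", "amount": "12.99", "card_network": "visa"}

same_result = parse_v1(old_event) == parse_v1(new_event)
```

Both events parse to the same record, so `same_result` is true. A consumer written as `PaymentV1(**event)`, or one that validates against a schema forbidding unknown fields, would fail on `new_event`, which is why even additive changes deserve a compatibility check.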

Breaking Structural Changes

Renaming or removing a field breaks consumers immediately:

- captured_at
+ completed_at

A safer migration publishes both captured_at and completed_at during a transition period.

Consumers migrate first. The deprecated field is removed only after usage reaches zero.
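
A transition period like this might look as follows on each side of the interface. The function names and sample timestamps are hypothetical; the field names come from the rename above.

```python
def to_transition_event(payment_id: str, completed_at: str) -> dict:
    """Producer side: emit the new field and keep the deprecated one in sync."""
    return {
        "payment_id": payment_id,
        "completed_at": completed_at,
        "captured_at": completed_at,  # deprecated alias, removed after migration
    }

def read_completion_time(event: dict) -> str:
    """Consumer side: prefer the new field, fall back to the old one."""
    if "completed_at" in event:
        return event["completed_at"]
    return event["captured_at"]

legacy_event = {"payment_id": "p-0", "captured_at": "2026-05-01T08:00:00Z"}
transition_event = to_transition_event("p-1", "2026-05-11T09:00:00Z")
```

The fallback in `read_completion_time` lets a migrated consumer replay historical events that predate the new field. Once no consumer reads `captured_at`, the producer can drop the alias.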

Breaking Semantic Changes

Semantic changes are harder because the schema can remain identical:

- amount means subtotal before tax
+ amount means total after tax

This should be treated as a new field or a new contract version, not an invisible reinterpretation.

When meaning changes, naming should change with it.

Contract Enforcement in CI

The best time to discover a breaking change is before deployment.

A contract-aware CI workflow can compare the proposed schema against the version currently in production:

steps:
  - name: Validate contract syntax
    run: data-contract lint contracts/payments.yml

  - name: Check backward compatibility
    # A block scalar keeps the line continuations intact for the shell.
    run: |
      data-contract diff \
        --base origin/main \
        --candidate contracts/payments.yml \
        --fail-on-breaking

  - name: Run consumer contract tests
    run: pytest tests/contracts/payments

If a pull request removes captured_at, changes amount from decimal to string, or makes an optional field required, CI should fail with a list of affected consumers.
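
The three rules in that sentence are simple enough to sketch directly. The `breaking_changes` function and the dictionary schema format below are assumptions made for illustration; they are not the behaviour of the `data-contract` command used in the workflow.

```python
def breaking_changes(base: dict, candidate: dict) -> list[str]:
    """List changes that would break consumers of `base`.

    Each schema maps field name -> {"type": str, "required": bool}.
    """
    problems = []
    for name, old in base.items():
        new = candidate.get(name)
        if new is None:
            problems.append(f"{name}: field removed")
            continue
        if new["type"] != old["type"]:
            problems.append(f"{name}: type changed {old['type']} -> {new['type']}")
        if new["required"] and not old["required"]:
            problems.append(f"{name}: optional field became required")
    return problems

base = {
    "payment_id": {"type": "string", "required": True},
    "amount": {"type": "decimal", "required": True},
    "captured_at": {"type": "timestamp", "required": False},
    "card_network": {"type": "string", "required": False},
}

# Additive: one new optional field (a made-up name), nothing else touched.
additive = dict(base, refund_reason={"type": "string", "required": False})

# Breaking: the three cases named in the text.
breaking = {
    "payment_id": {"type": "string", "required": True},
    "amount": {"type": "string", "required": True},
    "card_network": {"type": "string", "required": True},
}
```

`breaking_changes(base, additive)` returns an empty list, and `breaking_changes(base, breaking)` reports the type change on `amount`, the removal of `captured_at`, and the tightened `card_network`. Mapping each problem to affected consumers needs a lineage or usage catalogue, which this sketch leaves out. It also cannot see semantic changes, since those leave the schema identical.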

This shifts data reliability left: from a dashboard incident on Tuesday morning to a code review comment on Monday afternoon.

Ownership Is the Missing Field

Contracts fail when nobody owns them.

Every production dataset needs:

A named producing team
A discoverable communication channel
An on-call or escalation path
A deprecation policy
A list of critical downstream consumers

Without ownership, a contract violation becomes a warehouse problem by default. The data team then reverse-engineers an upstream service it does not control.

The contract should make responsibility explicit:

owner:
  team: payments-platform
  slack: "#team-payments"
  repository: "github.com/company/payments-service"
  escalation: "payments-oncall"

Data reliability is partly a technical problem and partly an organisational interface problem. A schema registry cannot negotiate a migration deadline. People must do that.
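
What tooling can do is refuse a contract that leaves the owner block incomplete. The sketch below assumes the contract has already been parsed into a dictionary; `REQUIRED_OWNER_KEYS` and `missing_owner_fields` are hypothetical names, and the required keys simply mirror the example above.

```python
REQUIRED_OWNER_KEYS = ("team", "slack", "repository", "escalation")

def missing_owner_fields(contract: dict) -> list[str]:
    """Return owner keys that are absent or blank in a parsed contract."""
    owner = contract.get("owner")
    if not isinstance(owner, dict):
        # A missing owner, or a bare string, is treated as fully
        # unspecified in this sketch.
        return list(REQUIRED_OWNER_KEYS)
    return [key for key in REQUIRED_OWNER_KEYS
            if not str(owner.get(key) or "").strip()]

complete = {"owner": {
    "team": "payments-platform",
    "slack": "#team-payments",
    "repository": "github.com/company/payments-service",
    "escalation": "payments-oncall",
}}
partial = {"owner": {"team": "payments-platform", "slack": ""}}
```

`missing_owner_fields(complete)` is empty, while `partial` is missing `slack`, `repository`, and `escalation`. Run as a lint step, a check like this makes an unowned dataset a failed build instead of a surprise during an incident.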

The Reliability Rule

Do not ask only:

Did the pipeline run?

Ask:

Did the data preserve the meaning its consumers depend on?

A reliable system validates structure, type, semantics, constraints, and change policy. It detects incompatibility before deployment and assigns ownership before an incident.

The green pipeline was not healthy. It was merely alive.

Next in the Reliable Data Systems series: why retries create duplicates, and how idempotent pipelines make repeated execution safe.