Your Pipeline Is Up. Is Your Data Healthy?

Anonymous included in Data Engineering

2026-05-25

Note

VITAL SIGNS MONITOR ACTIVE: Job status normal. Row count declining. Freshness threshold exceeded. Distribution drift detected in country_code. Patient requires more than a heartbeat check.

Traditional pipeline monitoring answers one question:

Did the job succeed?

Data observability answers a harder question:

Is the data arriving on time, in the expected quantity, with the expected shape and meaning?

A green orchestration graph tells you that code executed without raising an exception. It does not tell you that an upstream API returned an empty page, that 40% of customer IDs became null, or that a timestamp parser silently shifted every event by twelve hours.

This is the third post in the Reliable Data Systems series. We will define the vital signs of a dataset and turn them into measurable service levels.

Monitoring the Machine vs. Monitoring the Product

Infrastructure monitoring watches the machinery:

CPU utilisation
Memory pressure
Worker availability
Query duration
Task success or failure

Data monitoring watches the product:

Freshness
Volume
Schema
Distribution
Quality
Lineage

Both are necessary. They diagnose different classes of failure.

Pipeline health
├── Infrastructure
│   ├── Did the task run?
│   ├── Did it exceed memory?
│   └── How long did it take?
└── Data
    ├── Did new records arrive?
    ├── Are values plausible?
    └── Which consumers are affected?

If an API responds 200 OK with an empty array, infrastructure monitoring reports success. Data monitoring detects a volume collapse.
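
The difference can be sketched in a few lines of Python. This is a minimal illustration, not a real client: `assess_extract`, `HealthReport`, and the `min_expected` floor are all hypothetical names, and a production check would derive the floor from history rather than a constant.

```python
from dataclasses import dataclass


@dataclass
class HealthReport:
    transport_ok: bool  # what infrastructure monitoring sees
    data_ok: bool       # what data monitoring sees
    detail: str


def assess_extract(status_code: int, records: list, min_expected: int = 1) -> HealthReport:
    """Judge one API extract on transport terms and on data terms.

    min_expected is an assumed floor for illustration only.
    """
    transport_ok = 200 <= status_code < 300
    data_ok = transport_ok and len(records) >= min_expected
    if not transport_ok:
        detail = f"request failed with HTTP {status_code}"
    elif not data_ok:
        detail = f"volume collapse: got {len(records)} records, expected at least {min_expected}"
    else:
        detail = "ok"
    return HealthReport(transport_ok, data_ok, detail)


# A 200 OK with an empty array: the task "succeeds", the data does not.
report = assess_extract(200, [])
print(report)
```

The task-level signal and the data-level signal are kept as separate fields on purpose, so a dashboard can show a green job next to a red dataset.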

Vital Sign 1: Freshness

Freshness measures how recently the dataset was updated.

For an hourly orders table:

select
  current_timestamp - max(loaded_at) as freshness_lag
from analytics.orders;

The raw number matters less than the expectation. A four-hour lag is catastrophic for a fraud dashboard and irrelevant for a monthly finance report.

Define a freshness objective per data product:

dataset: analytics.orders
service_level:
  freshness:
    target: 30 minutes
    warning: 45 minutes
    critical: 90 minutes
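
One way to apply those thresholds to a measured lag is sketched below. The function name, the `missed_target` label, and the choice to treat each threshold as inclusive are assumptions made for the example, not part of any particular tool.

```python
from datetime import timedelta

# Thresholds copied from the analytics.orders definition above.
TARGET = timedelta(minutes=30)
WARNING = timedelta(minutes=45)
CRITICAL = timedelta(minutes=90)


def classify_freshness(lag: timedelta) -> str:
    """Map a measured freshness lag onto a service-level state.

    Boundaries are treated as inclusive (a lag of exactly 45 minutes
    is a warning); that is an assumption, so pick one rule and document it.
    """
    if lag >= CRITICAL:
        return "critical"
    if lag >= WARNING:
        return "warning"
    if lag > TARGET:
        return "missed_target"  # hypothetical label: late, but not yet alert-worthy
    return "ok"


for minutes in (20, 35, 45, 112):
    print(minutes, classify_freshness(timedelta(minutes=minutes)))
```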

Freshness should be measured against event or source availability when possible, not only warehouse load time. A job may load at 10:00 using source data that stopped updating at 06:00.

select
  current_timestamp - max(order_created_at) as source_freshness_lag,
  current_timestamp - max(loaded_at) as warehouse_freshness_lag
from analytics.orders;

The two metrics separate upstream delay from pipeline delay.
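
A rough way to read the pair, assuming a single shared threshold and a hypothetical `locate_delay` helper: if the warehouse lag is small but the source lag is large, loads are running and the source has gone quiet. If the warehouse lag itself is large, the table cannot tell you anything about the source, so the pipeline is the first suspect.

```python
from datetime import timedelta


def locate_delay(
    source_lag: timedelta,
    warehouse_lag: timedelta,
    threshold: timedelta = timedelta(minutes=30),  # assumed, matches the example target
) -> str:
    """Guess where a freshness problem sits from the two lags."""
    if warehouse_lag > threshold:
        # Nothing has been loaded recently, so the newest event in the
        # table is necessarily old too. Start with the pipeline.
        return "pipeline"
    if source_lag > threshold:
        # Loads are recent, but they carry no new events.
        return "upstream"
    return "none"


# Loaded five minutes ago, newest order four hours old: the source stopped.
print(locate_delay(timedelta(hours=4), timedelta(minutes=5)))
```

For low-traffic datasets a quiet source can be legitimate, so the "upstream" verdict is a prompt to investigate, not proof of a fault.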

Vital Sign 2: Volume

Volume detects missing or unexpectedly duplicated data.

A fixed threshold works for stable datasets:

select count(*) as hourly_orders
from analytics.orders
where loaded_at >= current_timestamp - interval '1' hour;

But most businesses have seasonality. Monday traffic differs from Saturday traffic. A simple “row count > 1,000” rule either misses real incidents or generates constant noise.

A better baseline compares the current interval with equivalent historical intervals:

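with hourly as (
  -- Orders per complete hour over the trailing 28 days.
  -- Hours with zero orders produce no row here, so they are
  -- absent from the averages below.
  select
    date_trunc('hour', order_created_at) as order_hour,
    count(*) as order_count
  from analytics.orders
  where order_created_at >= date_trunc('hour', current_timestamp) - interval '28' day
    -- Exclude the current, still-filling hour so it cannot drag the baseline down
    and order_created_at < date_trunc('hour', current_timestamp)
  group by 1
),
baseline as (
  -- One expectation per (weekday, hour of day) slot
  select
    extract(dow from order_hour) as weekday,
    extract(hour from order_hour) as hour_of_day,
    avg(order_count) as expected,
    stddev(order_count) as deviation
  from hourly
  group by 1, 2
)
select *
from baseline;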

Alert when the current count falls outside an expected range, such as three standard deviations from the seasonal baseline.
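
The comparison itself is small. The sketch below assumes the historical counts for one weekday-and-hour slot have already been fetched; `volume_anomaly` and the sample numbers are made up for illustration. Note that a 28-day window gives only four points per slot, so the standard deviation is a rough estimate at best.

```python
from statistics import mean, stdev


def volume_anomaly(current: int, history: list[int], k: float = 3.0) -> bool:
    """True when current is more than k sample standard deviations
    away from the mean of history (same weekday, same hour)."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    expected = mean(history)
    deviation = stdev(history)  # sample standard deviation
    if deviation == 0:
        # Perfectly flat history: any change at all is outside the range.
        return current != expected
    return abs(current - expected) > k * deviation


# Four previous Mondays at 09:00 (illustrative numbers, not real data).
history = [1180, 1225, 1202, 1193]

print(volume_anomaly(1210, history))  # within the usual range
print(volume_anomaly(0, history))     # empty hour
```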

Statistical detection is not magic. Product launches, holidays, and outages all look anomalous. The goal is to find changes worth investigation, not to prove every anomaly is a defect.

Vital Sign 3: Distribution

Row counts can remain stable while values become wrong.

Suppose payment_method normally looks like:

Value	Expected Share
card	72%
bank_transfer	18%
wallet	10%

After an upstream deployment:

Value	Current Share
null	72%
bank_transfer	18%
wallet	10%

The row count is unchanged. The distribution is not.

Track:

Null rate
Distinct count
Min and max
Mean and percentiles
Categorical frequency
String length

1
2
3
4
5
6
7
8


select
  count(*) as rows,
  avg(case when payment_method is null then 1.0 else 0.0 end) as null_rate,
  count(distinct payment_method) as distinct_methods,
  percentile_cont(0.5) within group (order by amount) as median_amount,
  percentile_cont(0.99) within group (order by amount) as p99_amount
from analytics.payments
where payment_date = current_date;

Distribution checks catch semantic failures that schema tests cannot.
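
For categorical columns, one simple drift score is the total variation distance between the expected and observed shares: half the sum of absolute differences, ranging from 0 (identical) to 1 (no overlap). It is one option among several, and the `share_drift` name and the alert threshold are assumptions for this sketch.

```python
def share_drift(expected: dict, observed: dict) -> float:
    """Total variation distance between two categorical distributions.

    Both arguments map a category to its share of rows (0.0 to 1.0).
    A category missing from one side counts as a share of zero there.
    """
    categories = set(expected) | set(observed)
    return 0.5 * sum(
        abs(expected.get(c, 0.0) - observed.get(c, 0.0)) for c in categories
    )


# The payment_method example: card rows became null after a deployment.
expected = {"card": 0.72, "bank_transfer": 0.18, "wallet": 0.10}
observed = {None: 0.72, "bank_transfer": 0.18, "wallet": 0.10}

drift = share_drift(expected, observed)
DRIFT_THRESHOLD = 0.10  # assumed value; tune per column

print(f"drift={drift:.2f} alert={drift > DRIFT_THRESHOLD}")
```

Here 72% of the rows moved from one category to another, so the score is 0.72 even though the row count never changed.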

Vital Sign 4: Quality Rules

Some expectations come from the business, not statistics.

-- A captured payment must have a positive amount
select payment_id
from analytics.payments
where status = 'captured'
  and amount <= 0;

-- An order cannot be delivered before it was created
select order_id
from analytics.orders
where delivered_at < order_created_at;

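These are invariant checks: conditions that should never be false. Each query selects the rows that violate the rule, so an empty result is a pass and any returned row is a failure.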

Organise them by severity:

Severity	Meaning	Response
Critical	Data is unsafe for consumption	Stop publication and page owner
Warning	Data may be degraded	Alert owner and annotate dataset
Informational	Trend deserves review	Add to daily report

Not every failed test should wake someone at 03:00. Alert urgency must match business impact.
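
The severity table can be encoded directly so that routing is data, not tribal knowledge. The sketch below is one possible shape; the action names are placeholders and `plan_response` is hypothetical.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = 1
    WARNING = 2
    INFORMATIONAL = 3


# Responses taken from the severity table; the action names are placeholders.
RESPONSES = {
    Severity.CRITICAL: ["stop_publication", "page_owner"],
    Severity.WARNING: ["alert_owner", "annotate_dataset"],
    Severity.INFORMATIONAL: ["add_to_daily_report"],
}


def plan_response(failed_checks):
    """Return the actions for a batch of failed checks, most severe first.

    failed_checks is a list of (check_name, Severity) pairs.
    """
    present = {severity for _, severity in failed_checks}
    actions = []
    for severity in Severity:  # Enum iterates in definition order
        if severity in present:
            actions.extend(RESPONSES[severity])
    return actions


failed = [
    ("median_amount_trend", Severity.INFORMATIONAL),
    ("captured_payment_positive_amount", Severity.CRITICAL),
]
print(plan_response(failed))
```

Only a critical failure produces a page. A batch of informational failures alone ends up in the daily report and nowhere else.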

Vital Sign 5: Lineage and Blast Radius

Detection without context produces slow incident response.

When raw.payments.currency begins returning nulls, responders need to know:

Which staging model consumes it?
Which marts depend on those models?
Which dashboards and machine-learning features are downstream?
Who owns each affected asset?

raw.payments
    ↓
stg_payments
    ↓
fct_revenue ──────→ executive_revenue_dashboard
    ↓
customer_ltv ─────→ churn_model_features

Lineage turns an anomaly into an impact statement:

currency null rate increased from 0.1% to 71.8%. One executive dashboard and one model feature table are affected.

That is actionable. “Test 184 failed” is not.
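
Computing the blast radius is a graph traversal. The sketch below hard-codes the edges from the diagram above as a dictionary; in practice they would come from a lineage store or catalogue, and `blast_radius` is a hypothetical helper.

```python
from collections import deque

# Edges transcribed from the lineage diagram: asset -> direct consumers.
LINEAGE = {
    "raw.payments": ["stg_payments"],
    "stg_payments": ["fct_revenue"],
    "fct_revenue": ["executive_revenue_dashboard", "customer_ltv"],
    "customer_ltv": ["churn_model_features"],
}


def blast_radius(asset: str, lineage: dict) -> list:
    """Return every downstream asset, nearest first (breadth-first)."""
    seen = set()
    affected = []
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for consumer in lineage.get(node, []):
            if consumer not in seen:  # guards against diamonds and cycles
                seen.add(consumer)
                affected.append(consumer)
                queue.append(consumer)
    return affected


affected = blast_radius("raw.payments", LINEAGE)
print(affected)
```

Joining the result to an ownership table gives the "who owns each affected asset" part of the impact statement.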

Define Data SLIs and SLOs

An SLI (service level indicator) is a measured value. An SLO (service level objective) is the target that value is expected to meet.

Example indicators:

Freshness SLI = percentage of hourly checks where lag < 30 minutes
Completeness SLI = percentage of source records present in warehouse
Validity SLI = percentage of rows passing critical business rules
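
Each indicator is a ratio of good events to total events. A minimal sketch, assuming the check results have already been collected; the function names and sample values are illustrative.

```python
from datetime import timedelta


def freshness_sli(lags: list, limit: timedelta = timedelta(minutes=30)) -> float:
    """Percentage of checks whose lag was strictly under the limit."""
    if not lags:
        raise ValueError("no checks recorded")
    good = sum(1 for lag in lags if lag < limit)
    return 100.0 * good / len(lags)


def ratio_sli(good: int, total: int) -> float:
    """Generic good/total indicator, usable for completeness or validity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * good / total


# One day of hourly checks: 23 healthy, one at 112 minutes (made-up data).
lags = [timedelta(minutes=10)] * 23 + [timedelta(minutes=112)]

print(round(freshness_sli(lags), 2))   # share of hourly checks under 30 minutes
print(ratio_sli(9_999, 10_000))        # e.g. source records present in warehouse
```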

Example objectives:

dataset: finance.daily_revenue
slo:
  freshness:
    target: "99.5% of days published by 07:00"
  completeness:
    target: "99.99% of captured payments represented"
  validity:
    target: "100% of rows use a recognised currency"

The SLO forces a conversation about what reliability means. “High quality” is vague. “99.99% of captured payments represented by 07:00” can be measured and reviewed.
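
Reviewing an objective means comparing the measured indicator with the target and counting how many misses the target allows. The sketch below does that for the "published by 07:00" objective. It uses `Decimal` so that 99.5% does not pick up floating-point error; `evaluate_slo` and its return shape are assumptions for the example.

```python
from datetime import time
from decimal import Decimal

DEADLINE = time(7, 0)
TARGET = Decimal("99.5")  # percent of days, from the freshness objective above


def evaluate_slo(publish_times: list, deadline: time = DEADLINE,
                 target: Decimal = TARGET) -> dict:
    """Evaluate a 'published by deadline' objective over a review window.

    publish_times holds one datetime.time per day, or None for a day
    on which nothing was published.
    """
    total = len(publish_times)
    if total == 0:
        raise ValueError("empty review window")
    on_time = sum(1 for t in publish_times if t is not None and t <= deadline)
    sli = Decimal(on_time) * 100 / Decimal(total)
    # Whole days the target tolerates missing within this window.
    allowed = int(Decimal(total) * (100 - target) / 100)
    return {
        "sli_percent": sli,
        "met": sli >= target,
        "misses": total - on_time,
        "misses_allowed": allowed,
    }


# 200-day window with one late publication (illustrative data).
window = [time(6, 40)] * 199 + [time(7, 25)]
result = evaluate_slo(window)
print(result)
```

Over 200 days a 99.5% target tolerates exactly one late day. That number is often easier to discuss with stakeholders than the percentage.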

Design Alerts for Humans

An alert should contain:

What changed
Expected vs. observed value
Earliest affected interval
Likely upstream cause
Downstream impact
Owner and runbook

CRITICAL: analytics.orders freshness breach

Expected: < 30 minutes
Observed: 112 minutes
Started: 2026-05-25 06:00 NZST
Upstream: raw.orders last event at 05:43
Impact: Operations dashboard, fulfilment SLA report
Owner: #data-platform
Runbook: go/runbooks/orders-freshness

Avoid sending five alerts for one root cause. Group correlated failures using lineage and time. Alert fatigue is itself a reliability defect because it trains responders to ignore the system.
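
One possible grouping strategy, sketched under simplifying assumptions: alerts that fire within a fixed window are batched, and within a batch each alert is attributed to its furthest upstream ancestor that is also alerting. The `PARENTS` map, the 30-minute window, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Direct upstream parents per asset (the lineage graph, reversed).
PARENTS = {
    "stg_payments": ["raw.payments"],
    "fct_revenue": ["stg_payments"],
    "customer_ltv": ["fct_revenue"],
}


def root_of(asset, alerting, parents):
    """Walk upstream while a parent is also alerting."""
    current, seen = asset, {asset}
    while True:
        candidates = [p for p in parents.get(current, [])
                      if p in alerting and p not in seen]
        if not candidates:
            return current
        current = candidates[0]
        seen.add(current)


def _group_batch(batch, parents):
    alerting = {asset for asset, _ in batch}
    by_root = defaultdict(list)
    for asset, _ in batch:
        by_root[root_of(asset, alerting, parents)].append(asset)
    return list(by_root.items())


def group_alerts(alerts, parents, window=timedelta(minutes=30)):
    """alerts is a list of (asset, fired_at); returns [(root, [assets])]."""
    groups, batch = [], []
    for asset, fired_at in sorted(alerts, key=lambda a: a[1]):
        # Start a new batch once we are past the window of the first alert.
        if batch and fired_at - batch[0][1] > window:
            groups.extend(_group_batch(batch, parents))
            batch = []
        batch.append((asset, fired_at))
    if batch:
        groups.extend(_group_batch(batch, parents))
    return groups


t0 = datetime(2026, 5, 25, 6, 0)
alerts = [
    ("fct_revenue", t0 + timedelta(minutes=5)),
    ("raw.payments", t0),
    ("customer_ltv", t0 + timedelta(minutes=9)),
    ("stg_payments", t0 + timedelta(minutes=2)),
    ("analytics.orders", t0 + timedelta(minutes=3)),
]
groups = group_alerts(alerts, PARENTS)
print(groups)
```

Five raw alerts become two notifications: one rooted at raw.payments that lists its affected descendants, and one for the unrelated analytics.orders problem.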

The Observability Rule

Pipeline status is one vital sign, not a diagnosis.

A healthy data product must be:

Fresh enough for its consumers
Complete enough for its decisions
Valid according to business rules
Stable in shape and distribution
Traceable through lineage

The goal is not a dashboard filled with metrics. The goal is to reduce the time between data becoming wrong and the right owner understanding why.

Next in the Reliable Data Systems series: how to evolve schemas without forcing every producer and consumer to deploy at the same moment.