Your Pipeline Is Up. Is Your Data Healthy?
Note
VITAL SIGNS MONITOR ACTIVE: Job status normal. Row count declining. Freshness threshold exceeded. Distribution drift detected in country_code. Patient requires more than a heartbeat check.
Traditional pipeline monitoring answers one question:
Did the job succeed?
Data observability answers a harder question:
Is the data arriving on time, in the expected quantity, with the expected shape and meaning?
A green orchestration graph tells you that code executed without raising an exception. It does not tell you that an upstream API returned an empty page, that 40% of customer IDs became null, or that a timestamp parser silently shifted every event by twelve hours.
This is the third post in the Reliable Data Systems series. We will define the vital signs of a dataset and turn them into measurable service levels.
Monitoring the Machine vs. Monitoring the Product
Infrastructure monitoring watches the machinery:
- CPU utilisation
- Memory pressure
- Worker availability
- Query duration
- Task success or failure
Data monitoring watches the product:
- Freshness
- Volume
- Schema
- Distribution
- Quality
- Lineage
Both are necessary. They diagnose different classes of failure.
If an API responds 200 OK with an empty array, infrastructure monitoring reports success. Data monitoring detects a volume collapse.
Vital Sign 1: Freshness
Freshness measures how recently the dataset was updated.
For an hourly orders table, freshness is the time elapsed since the most recent successful update.
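A minimal sketch of that measurement, assuming the last update time is available as a timezone-aware timestamp. The function name `freshness_lag` and the sample timestamps are illustrative, not part of any particular tool:

```python
from datetime import datetime, timedelta, timezone


def freshness_lag(last_updated: datetime, now: datetime) -> timedelta:
    """Return how long ago the dataset was last updated."""
    # Naive timestamps are rejected so a timezone mix-up cannot hide a lag.
    if last_updated.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware timestamps")
    return now - last_updated


# Hypothetical values: in practice last_updated might come from the newest
# load timestamp recorded on the orders table.
last_updated = datetime(2024, 5, 6, 6, 0, tzinfo=timezone.utc)
now = datetime(2024, 5, 6, 10, 0, tzinfo=timezone.utc)

print(freshness_lag(last_updated, now))  # 4:00:00
```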
The raw number matters less than the expectation. A four-hour lag is catastrophic for a fraud dashboard and irrelevant for a monthly finance report.
Define a freshness objective per data product rather than a single global threshold.
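One way to express per-product objectives is a simple mapping from product to maximum tolerated lag. The product names and targets below are hypothetical placeholders; real values should come from the consumers of each dataset:

```python
from datetime import timedelta

# Hypothetical products and targets, chosen only to illustrate the idea.
FRESHNESS_OBJECTIVES = {
    "fraud_dashboard": timedelta(minutes=15),
    "hourly_orders": timedelta(hours=2),
    "monthly_finance_report": timedelta(days=2),
}


def is_fresh(product: str, lag: timedelta) -> bool:
    """Return True when the observed lag is within the product's objective."""
    return lag <= FRESHNESS_OBJECTIVES[product]


# The same four-hour lag is a breach for one product and fine for another.
lag = timedelta(hours=4)
for product in FRESHNESS_OBJECTIVES:
    print(product, "ok" if is_fresh(product, lag) else "stale")
```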
Freshness should be measured against event or source availability when possible, not only warehouse load time. A job may load at 10:00 using source data that stopped updating at 06:00.
Tracking source freshness and load freshness as two separate metrics distinguishes upstream delay from pipeline delay.
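The 10:00 load and 06:00 source example can be sketched as two lags measured from the same clock. The function name and the 10:30 observation time are assumptions for illustration:

```python
from datetime import datetime, timezone


def freshness_breakdown(latest_source_event, last_load, now):
    """Return (source_lag, load_lag) as timedeltas measured from `now`."""
    return now - latest_source_event, now - last_load


# Hypothetical observation taken at 10:30 UTC.
now = datetime(2024, 5, 6, 10, 30, tzinfo=timezone.utc)
source_lag, load_lag = freshness_breakdown(
    latest_source_event=datetime(2024, 5, 6, 6, 0, tzinfo=timezone.utc),
    last_load=datetime(2024, 5, 6, 10, 0, tzinfo=timezone.utc),
    now=now,
)

print(source_lag, load_lag)  # 4:30:00 0:30:00
```

A small load lag with a large source lag points upstream. A large load lag with a small source lag points at the pipeline itself.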
Vital Sign 2: Volume
Volume detects missing or unexpectedly duplicated data.
A fixed threshold works for stable datasets.
But most businesses have seasonality. Monday traffic differs from Saturday traffic. A simple “row count > 1,000” rule either misses real incidents or generates constant noise.
A better baseline compares the current interval with equivalent historical intervals, such as the same weekday and hour in previous weeks.
Alert when the current count falls outside an expected range, such as three standard deviations from the seasonal baseline.
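A rough sketch of the three-standard-deviation rule, using only the standard library. It assumes `history` already contains counts from comparable intervals, and the sample counts are invented:

```python
from statistics import mean, stdev


def volume_anomaly(current: int, history: list[int], max_sigma: float = 3.0) -> bool:
    """Flag `current` when it is more than `max_sigma` sample standard
    deviations away from the mean of the historical counts."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is outside the range.
        return current != baseline
    return abs(current - baseline) > max_sigma * spread


# Hypothetical row counts for the same weekday and hour over four weeks.
same_slot_history = [10_200, 9_800, 10_050, 9_950]

print(volume_anomaly(6_100, same_slot_history))   # True
print(volume_anomaly(10_300, same_slot_history))  # False
```

With only a handful of historical points the standard deviation is a noisy estimate, so a real baseline would usually draw on more weeks than this.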
Statistical detection is not magic. Product launches, holidays, and outages all look anomalous. The goal is to find changes worth investigating, not to prove every anomaly is a defect.
Vital Sign 3: Distribution
Row counts can remain stable while values become wrong.
Suppose payment_method normally looks like:
| Value | Expected Share |
|---|---|
| card | 72% |
| bank_transfer | 18% |
| wallet | 10% |
After an upstream deployment:
| Value | Current Share |
|---|---|
| null | 72% |
| bank_transfer | 18% |
| wallet | 10% |
The row count is unchanged. The distribution is not.
Track:
- Null rate
- Distinct count
- Min and max
- Mean and percentiles
- Categorical frequency
- String length
Distribution checks catch semantic failures that schema tests cannot.
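The payment_method example above can be checked with two of the tracked metrics, null rate and categorical frequency. This is a sketch with hypothetical helper names; the sample of 100 rows mirrors the shares in the second table:

```python
from collections import Counter


def null_rate(values: list) -> float:
    """Fraction of values that are None."""
    if not values:
        raise ValueError("no values to profile")
    return sum(v is None for v in values) / len(values)


def shares(values: list) -> dict:
    """Observed share of each distinct value, including None."""
    if not values:
        raise ValueError("no values to profile")
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def max_share_shift(expected: dict, observed: dict) -> float:
    """Largest absolute change in share across every category seen."""
    categories = set(expected) | set(observed)
    return max(abs(expected.get(c, 0.0) - observed.get(c, 0.0)) for c in categories)


expected = {"card": 0.72, "bank_transfer": 0.18, "wallet": 0.10}
current = [None] * 72 + ["bank_transfer"] * 18 + ["wallet"] * 10

print(len(current))                                # row count unchanged: 100
print(round(null_rate(current), 2))                # 0.72
print(round(max_share_shift(expected, shares(current)), 2))  # 0.72
```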
Vital Sign 4: Quality Rules
Some expectations come from the business, not statistics.
These are invariant checks: conditions that should never be false.
Organise them by severity:
| Severity | Meaning | Response |
|---|---|---|
| Critical | Data is unsafe for consumption | Stop publication and page owner |
| Warning | Data may be degraded | Alert owner and annotate dataset |
| Informational | Trend deserves review | Add to daily report |
Not every failed test should wake someone at 03:00. Alert urgency must match business impact.
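As a sketch of how invariants and severities might fit together, each rule below carries its own severity and the most severe failure decides the response. The rules, column names, and the "publish" outcome are hypothetical; the responses are copied from the severity table above:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    severity: str  # "critical", "warning", or "informational"
    holds: Callable[[dict], bool]  # True when the row satisfies the invariant


# Hypothetical business rules for an orders table.
RULES = [
    Rule("amount_non_negative", "critical", lambda r: r["amount"] >= 0),
    Rule("currency_present", "critical", lambda r: r["currency"] is not None),
    Rule("shipped_on_or_after_ordered", "warning",
         lambda r: r["shipped_on"] is None or r["shipped_on"] >= r["ordered_on"]),
]

SEVERITY_ORDER = ["critical", "warning", "informational"]
RESPONSES = {
    "critical": "stop publication and page owner",
    "warning": "alert owner and annotate dataset",
    "informational": "add to daily report",
}


def failed_rules(rows: list) -> list:
    """Return every rule violated by at least one row."""
    return [rule for rule in RULES if any(not rule.holds(row) for row in rows)]


def response_for(failed: list) -> str:
    """Pick the response for the most severe failed rule."""
    if not failed:
        return "publish"
    worst = min(failed, key=lambda rule: SEVERITY_ORDER.index(rule.severity))
    return RESPONSES[worst.severity]


rows = [
    {"amount": 25.0, "currency": "EUR",
     "ordered_on": date(2024, 5, 1), "shipped_on": date(2024, 5, 2)},
    {"amount": 40.0, "currency": "USD",
     "ordered_on": date(2024, 5, 3), "shipped_on": date(2024, 5, 1)},
]

failed = failed_rules(rows)
print([rule.name for rule in failed])  # ['shipped_on_or_after_ordered']
print(response_for(failed))            # alert owner and annotate dataset
```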
Vital Sign 5: Lineage and Blast Radius
Detection without context produces slow incident response.
When raw.payments.currency begins returning nulls, responders need to know:
- Which staging model consumes it?
- Which marts depend on those models?
- Which dashboards and machine-learning features are downstream?
- Who owns each affected asset?
Lineage turns an anomaly into an impact statement:
currency null rate increased from 0.1% to 71.8%. Two critical dashboards and one model feature table are affected.
That is actionable. “Test 184 failed” is not.
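Computing the blast radius is a graph traversal over lineage. The sketch below assumes lineage is available as a mapping from each asset to its direct consumers; every asset name and owner here is invented for illustration:

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "raw.payments": ["staging.stg_payments"],
    "staging.stg_payments": ["marts.fct_revenue", "ml.payment_features"],
    "marts.fct_revenue": ["dashboard.finance_daily", "dashboard.exec_kpis"],
}

# Hypothetical ownership registry.
OWNERS = {
    "staging.stg_payments": "data-platform",
    "marts.fct_revenue": "analytics-engineering",
    "ml.payment_features": "ml-platform",
    "dashboard.finance_daily": "finance-analytics",
    "dashboard.exec_kpis": "finance-analytics",
}


def blast_radius(start: str) -> list:
    """Breadth-first walk returning every asset downstream of `start`."""
    visited = {start}
    affected = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected


for asset in blast_radius("raw.payments"):
    print(asset, "owned by", OWNERS.get(asset, "unknown"))
```

The visited set keeps an asset that is reachable by two paths from being reported twice.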
Define Data SLIs and SLOs
An SLI is a measured indicator. An SLO is the target.
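An indicator might be the share of captured payments represented in the warehouse by 07:00. An objective sets the target for that indicator, such as 99.99%.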
The SLO forces a conversation about what reliability means. “High quality” is vague. “99.99% of captured payments represented by 07:00” can be measured and reviewed.
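To make the distinction concrete, here is a small sketch in which the SLI is a measured ratio and the SLO is the threshold it is compared against. The payment counts are invented, and the target reuses the 99.99% figure quoted above:

```python
def completeness_sli(captured: int, represented: int) -> float:
    """Share of captured payments that were represented by the deadline."""
    if captured <= 0:
        raise ValueError("captured must be positive")
    return represented / captured


def meets_slo(sli: float, target: float) -> bool:
    """An objective is met when the measured indicator reaches the target."""
    return sli >= target


SLO_TARGET = 0.9999  # 99.99% of captured payments represented by 07:00

# Hypothetical daily measurements.
good_day = completeness_sli(captured=250_000, represented=249_990)
bad_day = completeness_sli(captured=250_000, represented=249_900)

print(meets_slo(good_day, SLO_TARGET))  # True
print(meets_slo(bad_day, SLO_TARGET))   # False
```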
Design Alerts for Humans
An alert should contain:
- What changed
- Expected vs. observed value
- Earliest affected interval
- Likely upstream cause
- Downstream impact
- Owner and runbook
Avoid sending five alerts for one root cause. Group correlated failures using lineage and time. Alert fatigue is itself a reliability defect because it trains responders to ignore the system.
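Grouping by lineage can be sketched as attributing each failing asset to its furthest-upstream failing ancestor. This version ignores the time dimension, assumes the lineage graph is acyclic, and uses invented asset names:

```python
# Hypothetical lineage: each asset maps to its direct upstream parents.
UPSTREAM = {
    "staging.stg_payments": ["raw.payments"],
    "marts.fct_revenue": ["staging.stg_payments"],
    "dashboard.finance_daily": ["marts.fct_revenue"],
    "marts.dim_customers": ["raw.customers"],
}


def failing_roots(asset: str, failing: set) -> set:
    """Return the failing ancestors of `asset` that have no failing parent.
    An asset with no failing parent is its own root."""
    failing_parents = [p for p in UPSTREAM.get(asset, []) if p in failing]
    if not failing_parents:
        return {asset}
    roots = set()
    for parent in failing_parents:
        roots |= failing_roots(parent, failing)
    return roots


def group_alerts(failing: set) -> dict:
    """Map each root cause to the failing assets attributed to it."""
    groups = {}
    for asset in sorted(failing):
        for root in sorted(failing_roots(asset, failing)):
            groups.setdefault(root, []).append(asset)
    return groups


failing = {
    "raw.payments",
    "staging.stg_payments",
    "marts.fct_revenue",
    "dashboard.finance_daily",
    "marts.dim_customers",
}

# Five failing checks collapse into two alerts.
for root, assets in group_alerts(failing).items():
    print(root, "->", assets)
```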
The Observability Rule
Pipeline status is one vital sign, not a diagnosis.
A healthy data product must be:
- Fresh enough for its consumers
- Complete enough for its decisions
- Valid according to business rules
- Stable in shape and distribution
- Traceable through lineage
The goal is not a dashboard filled with metrics. The goal is to reduce the time between data becoming wrong and the right owner understanding why.
Next in the Reliable Data Systems series: how to evolve schemas without forcing every producer and consumer to deploy at the same moment.