The Midnight Job: Why Batch Processing Has an Expiry Date

Note — DIAGNOSTIC BRIEF: Scheduled batch jobs are the drip-feed IV of the data world: reliable, predictable, and completely blind to what is happening right now.

There is a cron job. It runs at midnight. It aggregates the previous day’s data, loads it into the warehouse, and updates the dashboard by 2 AM.

For years, this arrangement worked. It still works for many organisations. But as the gap between when data is generated and when decisions are made becomes increasingly costly, the midnight job is running out of time.

This is the first post in a series on migrating from batch to streaming architectures. Before we write a single line of Flink SQL or configure a Kafka consumer, we need to understand precisely what problem we are solving and why the industry has spent a decade building the infrastructure to solve it.


🕛 The Anatomy of a Batch Pipeline

A batch pipeline is built around a simple idea: collect events, wait, then process them together.

```
[Events Generated] → [Stored in Files/DB] → [Scheduled Job Runs] → [Results Written]
     (continuous)          (accumulate)         (periodic: hourly,         (stale)
                                                 daily, weekly)
```

In practice, this looks like:

  • A Python script triggered by Airflow at midnight
  • A dbt run that transforms raw tables into analytics models every hour
  • A SQL query reading from yesterday’s partition in BigQuery
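As a concrete sketch, the whole paradigm fits in a few lines. The names below (`run_daily_batch`, the event shape) are hypothetical, not from any real pipeline — the point is that the job only ever sees yesterday's partition:

```python
from datetime import date, timedelta

def run_daily_batch(events, run_date):
    """Aggregate the previous day's events into one summary row.

    `events` stands in for a warehouse table partitioned by day;
    the job reads exactly one partition: yesterday's.
    """
    yesterday = run_date - timedelta(days=1)
    partition = [e for e in events if e["day"] == yesterday]
    return {
        "day": yesterday.isoformat(),
        "order_count": len(partition),
        "revenue": sum(e["amount"] for e in partition),
    }

events = [
    {"day": date(2026, 3, 28), "amount": 120.0},
    {"day": date(2026, 3, 28), "amount": 80.0},
    {"day": date(2026, 3, 29), "amount": 50.0},  # "today" — invisible to the job
]
summary = run_daily_batch(events, run_date=date(2026, 3, 29))
```

Note what the job cannot see: the 50.0 transaction from "today" simply does not exist until tomorrow's run.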

The batch paradigm has genuine strengths. It is simple to reason about, easy to debug, and maps naturally to how humans think about time (days, weeks, months). The data warehouse ecosystem was built for it.

But it has a fundamental property that becomes a liability as data volumes grow and business decisions accelerate: it processes the past.


⏱️ The Four Failure Modes of Batch

1. The Stale Dashboard Problem

Your analyst opens the sales dashboard at 9 AM. The numbers shown are from yesterday. A large transaction completed at 8:55 AM is not visible. A decision is made based on incomplete information.

In retail, logistics, or financial services, this delay is not a minor inconvenience. It is a structural blind spot. The batch pipeline makes data available after the decision window has closed.

2. The Midnight Job That Fails

At 2:17 AM, the nightly ETL job fails. Perhaps a source table schema changed. Perhaps a third-party API timed out. Perhaps the cluster ran out of memory processing an unexpectedly large file.

```python
# This is fine 364 nights per year
# On night 365, it fails at row 4.2 million
for row in read_csv('events_2026_03_29.csv'):
    transform_and_load(row)
```

The on-call engineer wakes up to alerts. The job reruns at 4 AM. The dashboard is empty until morning. By the time stakeholders arrive at 9 AM, the pipeline has recovered — but several hours of downstream reports are delayed, and the root cause analysis has not yet started.

The batch pipeline turns every failure into a multi-hour incident. A streaming pipeline, processing events continuously, typically fails fast on individual bad records and continues processing the rest of the stream.
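One way to see the contrast is the dead-letter pattern: a streaming consumer can quarantine a single malformed record and keep processing. A minimal in-memory sketch (plain lists standing in for Kafka topics; `process` is a made-up transform):

```python
def process(record):
    # Stand-in transform: raises on a malformed record,
    # instead of aborting the whole run.
    return {"user": record["user"], "amount": float(record["amount"])}

def consume(stream):
    """Process records one at a time; bad ones go to a dead-letter list."""
    results, dead_letter = [], []
    for record in stream:
        try:
            results.append(process(record))
        except (KeyError, TypeError, ValueError) as err:
            dead_letter.append({"record": record, "error": str(err)})
    return results, dead_letter

stream = [
    {"user": "a", "amount": "10.5"},
    {"user": "b", "amount": "oops"},  # the "row 4.2 million" problem
    {"user": "c", "amount": "3.0"},
]
ok, dlq = consume(stream)
```

The bad record lands in `dlq` for later inspection; records before and after it are processed normally. The batch equivalent of this failure is an empty dashboard at 9 AM.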

3. The All-or-Nothing Window

Batch windows are binary. The midnight job does not know about a spike in fraudulent transactions at 11:30 PM unless it is running at that exact moment. The alert triggers at 2 AM — 2.5 hours after the damage began.

Real-time fraud detection catches the spike in seconds, not hours. The difference is measured in dollars.
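The timing gap is easy to make concrete. A toy simulation (the threshold and timestamps are invented for illustration): a streaming detector holding a one-minute sliding window fires on the transaction that crosses the threshold, while the batch job cannot surface the same spike before its 2 AM run.

```python
from datetime import datetime, timedelta

THRESHOLD = 3  # flagged transactions per minute that count as a spike (made up)

def streaming_detect(events):
    """Return the timestamp at which a sliding one-minute window
    first reaches THRESHOLD events, or None."""
    window = []
    for ts in events:
        window = [t for t in window if ts - t <= timedelta(minutes=1)]
        window.append(ts)
        if len(window) >= THRESHOLD:
            return ts  # alert fires as the spike happens
    return None

# Fraudulent transactions clustered around 11:30 PM
spike = [datetime(2026, 3, 28, 23, 30, s) for s in (0, 20, 40)]
streamed_alert = streaming_detect(spike)
batch_alert = datetime(2026, 3, 29, 2, 0)  # next nightly job surfaces it at ~2 AM

lag = batch_alert - streamed_alert
```

Here `lag` comes out to roughly two and a half hours — the window between damage and detection that the streaming detector closes to seconds.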

4. The Backfill Spiral

An upstream data source changes its schema. You fix the pipeline. But now you need to reprocess the last 30 days of data to backfill the corrected values.

For a batch pipeline processing daily partitions, this means running the job 30 times — with careful manual coordination to avoid double-processing, check partition boundaries, and reconcile overwritten data.

At scale, backfill operations become multi-day engineering projects. A well-designed streaming pipeline with an append-only log (like Kafka) can replay historical events from any point in time by rewinding to the stored offset.
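The replay idea can be sketched without a broker at all. In the snippet below, a plain list of `(offset, event)` pairs stands in for a Kafka partition (real consumers rewind with their seek APIs; this is only the shape of the operation):

```python
log = [  # append-only log: (offset, event); offsets never change
    (0, {"day": "2026-03-01", "value": 10}),
    (1, {"day": "2026-03-02", "value": 12}),
    (2, {"day": "2026-03-03", "value": 7}),
]

def replay_from(log, start_offset):
    """Reprocess every event at or after start_offset, in original order."""
    return [event for offset, event in log if offset >= start_offset]

# Backfill: rewind to the offset where the fix applies and reprocess.
# No partition bookkeeping, no risk of double-processing a day twice.
replayed = replay_from(log, start_offset=1)
```

Because the log is immutable and ordered, "reprocess the last 30 days" collapses into "pick a start offset and consume forward" — one operation instead of 30 coordinated job runs.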


📡 What Streaming Changes

A streaming architecture reconceptualises data as a continuous flow of events rather than a bounded set of records.

```
[Event Generated] → [Broker (Kafka/Redpanda)] → [Consumer Processes Immediately] → [Result Written]
    (milliseconds)      (persisted log)              (continuous)                      (fresh)
```

The dashboard updates in seconds, not hours. The fraud alert fires when the transaction occurs, not when the next job runs. The bad record fails gracefully while the rest of the stream continues.

The tradeoff is complexity. A streaming pipeline introduces components and failure modes that do not exist in the batch world:

| Dimension | Batch | Streaming |
| --- | --- | --- |
| Data freshness | Minutes to hours | Milliseconds to seconds |
| Operational complexity | Low | High |
| Debugging | Simple (fixed input) | Harder (infinite, unbounded input) |
| Late data handling | N/A (all data present) | Watermarks, UPSERTs required |
| Infrastructure | Simple cron + warehouse | Broker + stream processor + sink |
| Cost model | Compute spikes at job time | Steady continuous compute |
Neither is universally better. The choice depends on how stale data can be before it stops being useful.


🧬 When to Stay Batch, When to Stream

Not every pipeline needs to stream. Here is a simple framework:

Stay batch if:

  • The decision it informs is made once per day (e.g., monthly revenue reports, compliance exports)
  • The latency between generation and use exceeds an hour by design
  • The data volume requires heavy transformation that benefits from bulk processing (e.g., ML training runs)

Consider streaming if:

  • The data informs real-time decisions (fraud detection, live routing, recommendation engines)
  • Downstream systems need data within minutes of generation
  • You are building a user-facing feature where staleness is visible (e.g., live order tracking, real-time dashboards)
  • Event volumes are so high that storing everything before processing is impractical
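The framework above can be compressed into a rule of thumb. The function and its thresholds below are purely illustrative — real decisions weigh cost, team experience, and infrastructure, not two numbers:

```python
def suggest_architecture(max_staleness_s, decisions_per_day):
    """Toy heuristic for the framework above; thresholds are made up."""
    if max_staleness_s <= 60:
        return "streaming"  # real-time decisions need fresh data
    if decisions_per_day <= 1 and max_staleness_s >= 3600:
        return "batch"      # daily reports tolerate daily jobs
    return "depends"        # the gray zone: weigh cost against freshness

fraud = suggest_architecture(max_staleness_s=5, decisions_per_day=100_000)
monthly_report = suggest_architecture(max_staleness_s=86_400, decisions_per_day=1)
```

Fraud detection lands firmly on `"streaming"`, the monthly report on `"batch"`; most pipelines sit somewhere in the `"depends"` zone, which is where the cost column of the table above starts to matter.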

Most modern data stacks use both. Analytics and reporting run on batch. Operational systems run on streams. The architecture that serves both is a broader discussion — but it starts with understanding where batch ends.


🔬 The DEZoomcamp Context

If you completed the Data Engineering Zoomcamp, you built a batch pipeline: ingestion → dbt transformation → BigQuery → dashboard. This is an excellent foundation.

The streaming track — Kafka, Flink, PyFlink — sits alongside it as the second half of the picture. Not a replacement for the batch work, but a parallel capability that handles the cases batch cannot.

In the next post, we will look at the architecture of a streaming pipeline in detail: what each component does, how they connect, and how the data flows from source to sink without ever stopping.


Next: The Streaming Stack: Anatomy of a Pipeline That Never Sleeps →