YC Medical

The Streaming Stack: Anatomy of a Pipeline That Never Sleeps

Note

SYSTEM ARCHITECTURE SCAN: Three primary organs identified. Broker: message-persistent. Processor: stateful. Sink: queryable. All systems continuous.

In the previous post, we established why batch has an expiry date. Now we look at what replaces it—not a single tool, but a three-component system that must be understood as a whole.

The streaming stack has three organs: the broker, the stream processor, and the sink. Each has a specific, non-negotiable role. Understanding what each one does—and more importantly, what it does not do—is the prerequisite for building a production-grade pipeline.


🫀 Organ 1: The Broker (The Heart)

The broker is the heartbeat of the streaming stack. It receives events from producers and makes them available to consumers—continuously, at high throughput, with durability.

Kafka (or its lighter sibling Redpanda, which implements the Kafka API without the JVM overhead) is the dominant choice in the industry.

What the Broker Actually Does

The broker stores events in an append-only, ordered log. Think of it as a commit log with a retention policy rather than a database table. Every incoming event is written to the end of a topic partition.

Topic: "taxi-rides"
Partition 0: [event_0] [event_1] [event_2] [event_3] → (new events appended here)
Partition 1: [event_4] [event_5] [event_6] [event_7] →
Partition 2: [event_8] [event_9] [event_10] [event_11] →

Consumers read from a specific offset within a partition. Kafka tracks where each consumer group left off. If a consumer crashes and restarts, it resumes from the last committed offset—not from the beginning of the stream.
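The offset mechanics can be sketched in plain Python—the partition as an append-only list, committed offsets as a dict (real Kafka persists both durably, and the client library handles commits, but the resume logic is the same):

```python
# Minimal sketch of offset tracking: one partition is an append-only list,
# and the broker remembers one committed offset per consumer group.
log = ["event_0", "event_1", "event_2", "event_3"]  # one partition
committed = {"analytics-group": 0}                   # group -> next offset to read

def poll(group, max_records=2):
    """Read from the group's committed offset onward."""
    start = committed[group]
    return log[start:start + max_records]

def commit(group, new_offset):
    committed[group] = new_offset

batch = poll("analytics-group")          # ["event_0", "event_1"]
commit("analytics-group", 2)

# Simulate a crash and restart: the group resumes from offset 2,
# not from the beginning of the stream.
after_restart = poll("analytics-group")  # ["event_2", "event_3"]
```

Rewinding `committed["analytics-group"]` back to 0 is exactly the replay trick: as long as the events are still within the retention window, they can be reprocessed.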

The Crucial Property: Retention

Kafka retains events for a configurable window (e.g., 7 days, or until a storage threshold is reached). This means:

  1. Multiple consumers can read the same events independently—a Flink job and a separate analytics consumer can both read the same taxi-rides topic without interfering with each other.
  2. Replay is possible—if a bug is introduced in your stream processor, you can fix the code and reprocess historical events by rewinding to an earlier offset.

This is fundamentally different from a message queue, where events are destroyed once consumed. Kafka behaves more like a distributed, time-limited filesystem for event streams.

Partitioning and Throughput

Throughput scales horizontally by adding partitions. A topic with 12 partitions can be consumed by up to 12 parallel consumer instances within a single consumer group, each handling a fraction of the total event volume.

# Producer: control which partition events land in via message key
producer.send(
    topic='taxi-rides',
    key=str(pickup_location_id).encode(),  # Events with same key → same partition
    value=event_bytes
)

Keying events ensures that all events from the same pickup location are processed by the same consumer—preserving ordering guarantees that matter for certain aggregations.
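The key-to-partition mapping is just a deterministic hash modulo the partition count. A sketch (Kafka's default partitioner uses murmur2; `crc32` here is a stand-in to illustrate the idea, and the partition count is arbitrary):

```python
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes) -> int:
    # Deterministic hash of the key, modulo partition count.
    # Kafka's default partitioner uses murmur2; crc32 is just
    # an illustrative substitute with the same property.
    return zlib.crc32(key) % NUM_PARTITIONS

# Every event keyed by pickup location 132 lands in the same partition,
# so a single consumer sees them in order.
assert partition_for(b"132") == partition_for(b"132")
```

The property that matters is determinism: same key, same partition, every time—which is what makes per-key ordering possible at all.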


🧠 Organ 2: The Stream Processor (The Brain)

The broker makes events available. The stream processor makes them useful.

Apache Flink (or PySpark Streaming, or Kafka Streams) reads from the broker, applies transformations, aggregations, filters, and enrichments, then writes results to a downstream sink.

Stateless vs. Stateful Processing

The simplest case is stateless processing: for every incoming event, apply a transformation and emit an output. No memory of previous events is required.

-- Stateless: transform and pass through
INSERT INTO enriched_rides
SELECT
    PULocationID,
    DOLocationID,
    trip_distance,
    total_amount * 1.15 AS total_with_tax,  -- Simple transformation
    TO_TIMESTAMP_LTZ(tpep_pickup_datetime, 3) AS pickup_time
FROM raw_rides;

Stateful processing requires the processor to remember information across events. This is where streaming becomes genuinely powerful—and genuinely complex.

-- Stateful: aggregate over a time window
SELECT
    PULocationID,
    window_start,
    window_end,
    COUNT(*) AS num_trips,
    SUM(total_amount) AS total_revenue
FROM TABLE(
    TUMBLE(TABLE taxi_rides, DESCRIPTOR(pickup_time), INTERVAL '1' HOUR)
)
GROUP BY PULocationID, window_start, window_end;

To answer “how many rides departed from zone 132 in the last hour”, Flink must maintain a state accumulator per zone per window—and know when that window is complete.
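The state behind that query can be sketched in plain Python: a dict keyed by (zone, window start), updated per event (the one-hour tumbling window and field names mirror the SQL above; Flink manages this state for you, with spill-to-disk backends for large keyspaces):

```python
from collections import defaultdict

WINDOW = 3600  # tumbling window size, in seconds

# (zone, window_start) -> [trip_count, revenue_sum]
state = defaultdict(lambda: [0, 0.0])

def on_event(zone, pickup_ts, total_amount):
    # Assign the event to its tumbling window, then update the accumulator.
    window_start = (pickup_ts // WINDOW) * WINDOW
    acc = state[(zone, window_start)]
    acc[0] += 1
    acc[1] += total_amount

on_event(132, 7200, 10.0)   # two rides in the 02:00 hour...
on_event(132, 7260, 12.5)
on_event(132, 10800, 8.0)   # ...and one in the 03:00 hour

# state[(132, 7200)]  == [2, 22.5]
# state[(132, 10800)] == [1, 8.0]
```

The hard part Flink adds on top is knowing *when* a window is complete so the accumulator can be emitted and freed—that is the watermark problem teased at the end of this post.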

Checkpointing: The Processor’s Memory

If the stream processor crashes mid-stream, what happens to accumulated state? Without protection, it is lost.

Flink solves this with checkpointing: at configurable intervals, it takes a consistent snapshot of all in-progress state and persists it to a durable store (typically S3 or HDFS). On restart, the job resumes from the last successful checkpoint—not from scratch.

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # Snapshot every 10 seconds

This is what makes Flink suitable for production: it offers fault tolerance by design, not as an afterthought.
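The mechanism can be illustrated without Flink: periodically snapshot the state to durable storage, and on restart load the latest snapshot instead of starting empty. A toy sketch (Flink's real snapshots are distributed and coordinated with source offsets, which this deliberately ignores):

```python
import os
import pickle
import tempfile

checkpoint_path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")

def checkpoint(state):
    # Persist a consistent snapshot of all in-progress state.
    with open(checkpoint_path, "wb") as f:
        pickle.dump(state, f)

def restore():
    # On restart, resume from the last successful checkpoint.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            return pickle.load(f)
    return {}  # no checkpoint yet: start from scratch

state = {"zone_132_count": 41}
checkpoint(state)

state = None        # "crash": in-memory state is gone
state = restore()   # recovered from durable storage, not rebuilt from event 0
```

The interval is a latency/overhead trade-off: checkpoint too often and you pay snapshot cost continuously; too rarely and a crash replays a long tail of events.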


💾 Organ 3: The Sink (The Memory)

The sink is where processed results land and become queryable by downstream systems: dashboards, APIs, other pipelines.

Common sinks in the data engineering context:

Sink                     | Use Case
PostgreSQL / Cloud SQL   | Operational data, low-latency queries, OLTP workloads
BigQuery / Snowflake     | Analytics, reporting, aggregated warehouse tables
Elasticsearch            | Full-text search, log analytics
Another Kafka topic      | Event-driven microservices, multi-stage pipelines
Cloud Storage (S3/GCS)   | Long-term event archival, downstream batch jobs

The choice of sink determines the access pattern. If a dashboard requires sub-second query response on per-minute aggregates, PostgreSQL is a reasonable target. If analysts need to run complex ad-hoc queries over months of data, a columnar analytics warehouse is more appropriate.

In many production architectures, the same stream writes to multiple sinks simultaneously: operational data to PostgreSQL, analytical aggregates to BigQuery, raw events to S3 for archival.


🔌 How the Three Organs Connect

The full data flow looks like this:

[Source / Producer]
      │  (writes events as byte streams)
      ▼
[Kafka / Redpanda Broker]
      │  (consumer reads from offset)
      ▼
[Apache Flink Stream Processor]
      │  - Filters, transforms, aggregates
      │  - Maintains state across events
      │  - Checkpoints state to durable store
      ▼
[Sink: PostgreSQL / BigQuery / S3 / ...]
      │  (queried by)
      ▼
[Dashboard / API / Downstream System]

Each arrow hides significant engineering depth. The producer must decide how to serialise events (JSON, Avro, Protobuf). The broker must be configured for the right retention, replication factor, and partition count. The stream processor must handle out-of-order events. The sink must handle upserts if aggregated results are corrected by late-arriving data.
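The upsert requirement in that last sentence can be sketched with SQLite's `INSERT ... ON CONFLICT` clause, which PostgreSQL also supports (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hourly_revenue (
        zone          INTEGER,
        window_start  INTEGER,
        total_revenue REAL,
        PRIMARY KEY (zone, window_start)
    )
""")

def write_aggregate(zone, window_start, revenue):
    # Upsert: a late correction overwrites the earlier result for the
    # same (zone, window) instead of inserting a duplicate row.
    conn.execute(
        """INSERT INTO hourly_revenue VALUES (?, ?, ?)
           ON CONFLICT(zone, window_start)
           DO UPDATE SET total_revenue = excluded.total_revenue""",
        (zone, window_start, revenue),
    )

write_aggregate(132, 7200, 22.5)   # first emission for the window
write_aggregate(132, 7200, 30.5)   # corrected after late events arrive

row = conn.execute(
    "SELECT total_revenue FROM hourly_revenue "
    "WHERE zone = 132 AND window_start = 7200"
).fetchone()
# one row, carrying the corrected value
```

Writes keyed on (zone, window_start) are idempotent this way: re-emitting a window—whether from a correction or a replay after a checkpoint restore—converges on the same row rather than inflating the totals.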


🏥 How This Compares to the Batch Equivalent

Let’s map this to the batch pipeline you know from the Zoomcamp:

Streaming Component           | Batch Equivalent
Kafka topic                   | CSV / Parquet file in GCS/S3
Kafka offset / consumer group | Partition date in a table, Airflow task state
Flink stream processor        | dbt model run, Spark batch job
Flink checkpoint              | Airflow task retry state
Sink (PostgreSQL)             | BigQuery table

The key difference is not the tools—it is the temporal model. Batch systems process bounded datasets at scheduled intervals. Streaming systems process unbounded event sequences continuously.

The engineering challenge of streaming is handling everything that boundedness used to simplify: when is a window complete? How do you handle events that arrive late? How do you guarantee exactly-once results when any component can crash mid-stream?


🔬 What Comes Next

Now that the anatomy is clear, the next posts will go deeper into the engineering challenges that make streaming hard:

  • Stateful aggregations: how Flink maintains and manages state across millions of in-flight events
  • Late data and watermarks: when to close a window and what to do with events that missed the cutoff
  • Delivery guarantees: what “exactly once” actually means in a distributed system, and why it costs something

Each of these is a design space in its own right. Understanding the anatomy first makes each one easier to reason about.


← Previous: The Midnight Job: Why Batch Processing Has an Expiry Date

Next: Stateful Stream Processing: What Your Pipeline Actually Remembers →