
Streaming Foundations I: Redpanda and Python Producers

Goal: Understand the fundamentals of message brokers, use Redpanda to simplify Kafka architecture, and write a Python producer to stream NYC yellow taxi events into a topic.


1. Why Do We Need a Message Broker?

When building real-time data streaming applications, the most fundamental component is the Message Broker. Think of it as a high-throughput, low-latency central post office.

Imagine you have a service generating data in real-time (like a taxi meter) and multiple downstream services that want to consume this data (e.g., a billing system, a real-time monitoring dashboard, and a long-term database storage service). Without a message broker, you would have to tightly couple your meter code with all these downstream services.

By introducing a message broker, the architecture transforms into a classic Publish-Subscribe (Pub-Sub) model. Producers only need to send data to the broker; Consumers only need to read from the broker. Both sides are completely decoupled and don’t even need to know the other exists.
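The shape of this pattern can be sketched with a toy in-memory hub (this is just an illustration of the decoupling, not how a real broker works — no persistence, no partitions, no network):

```python
from collections import defaultdict
from queue import Queue

class ToyBroker:
    """A minimal in-memory pub-sub hub: producers and consumers
    only ever talk to the broker, never to each other."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of queues

    def subscribe(self, topic):
        q = Queue()
        self._subscribers[topic].append(q)
        return q

    def publish(self, topic, message):
        # Fan out: every subscriber gets its own copy of the message.
        for q in self._subscribers[topic]:
            q.put(message)

broker = ToyBroker()
billing = broker.subscribe("rides")     # one downstream service
dashboard = broker.subscribe("rides")   # another, added independently

broker.publish("rides", {"trip_distance": 2.5})

print(billing.get())    # both consumers receive the same event,
print(dashboard.get())  # yet the publisher knows nothing about them
```

Note that adding the dashboard required no change to the publisher — that independence is exactly what the broker buys us.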


2. Redpanda: A Lighter, Faster “Kafka”

When discussing stream processing, Apache Kafka is the industry standard. However, Kafka traditionally relies on the JVM and requires an external ZooKeeper cluster to manage node coordination. This can feel bloated for local development or lightweight deployments.

Enter a modern alternative: Redpanda.

Redpanda is a message broker written in C++ that is fully compatible with the Kafka API.

  • No JVM: It boots in seconds and has an extremely low memory footprint.
  • Built-in Consensus (Raft): It eliminates the need for an external ZooKeeper component.
  • Zero Code Changes: Because it implements the Kafka protocol, to our Python code, it is Kafka. We can directly use standard libraries like kafka-python.

Booting up Redpanda

We can quickly spin up a single-node Redpanda instance using docker-compose.yml:

services:
  redpanda:
    image: redpandadata/redpanda:v25.3.9
    command:
      - redpanda
      - start
      - --smp
      - '1'
      - --reserve-memory
      - 0M
      - --overprovisioned
      - --node-id
      - '1'
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://localhost:9092
    ports:
      - 9092:9092
      - 29092:29092

Note on Network Configuration (advertise-kafka-addr): Kafka clients connect in two steps. The client first contacts a bootstrap address to fetch cluster metadata, and that metadata contains the advertised addresses the client uses for all subsequent connections. If an advertised address isn’t reachable from the client’s network, the connection fails right after the metadata step.

We configure two addresses:

  • OUTSIDE://localhost:9092: For Python scripts running on our host laptop.
  • PLAINTEXT://redpanda:29092: For internal Docker communications between containers (like the Flink nodes we’ll build later).

Run docker compose up -d redpanda to start our lightweight “Kafka”.


3. Writing a Python Producer

With the core infrastructure running, let’s write a Python script acting as a Producer. We will use the NYC Yellow Taxi dataset as our data source.

First, install the dependencies:

uv init -p 3.12
uv add kafka-python pandas pyarrow

Defining the Data Schema

In stream processing, having a solid data structure is critical. We use Python’s @dataclass to define the structure of every taxi ride:

import json
from dataclasses import dataclass
import dataclasses
import pandas as pd
from kafka import KafkaProducer
import time

@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds

def ride_from_row(row):
    return Ride(
        PULocationID=int(row['PULocationID']),
        DOLocationID=int(row['DOLocationID']),
        trip_distance=float(row['trip_distance']),
        total_amount=float(row['total_amount']),
        tpep_pickup_datetime=int(row['tpep_pickup_datetime'].timestamp() * 1000),
    )

Tip: Why convert time to epoch milliseconds? It’s a universal, unambiguous numerical format that downstream streaming frameworks (like Flink) can easily understand and parse.
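
As a quick sanity check of the conversion (standard library only; the pickup time here is made up for illustration):

```python
from datetime import datetime, timezone

# A pickup time, roughly as pandas would parse it from the Parquet file
pickup = datetime(2025, 11, 3, 8, 30, 0, tzinfo=timezone.utc)

# datetime -> epoch milliseconds (what we put on the wire)
epoch_ms = int(pickup.timestamp() * 1000)

# epoch milliseconds -> datetime (what a consumer would do)
restored = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)

print(epoch_ms)
assert restored == pickup  # the round trip is lossless at ms precision
```

The round trip loses nothing, and the integer carries no timezone or format ambiguity.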

Configuring the Serializer

Kafka transports raw bytes. Brokers don’t care whether you are sending images, JSON, or plain text. Therefore, the Python client needs to know how to convert our Ride object into a byte stream.

We write a custom serializer that converts the dataclass to a dict, serializes it to a JSON string, and encodes that string as UTF-8 bytes:

def ride_serializer(ride):
    # Convert dataclass to dict
    ride_dict = dataclasses.asdict(ride)
    # Convert dict to JSON string
    json_str = json.dumps(ride_dict)
    # Encode as UTF-8 bytes
    return json_str.encode('utf-8')
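
Because the serializer is just dict → JSON → bytes, we can verify it round-trips without any broker involved. A self-contained check (the Ride fields mirror the dataclass above; the values are made up):

```python
import json
import dataclasses
from dataclasses import dataclass

@dataclass
class Ride:
    PULocationID: int
    DOLocationID: int
    trip_distance: float
    total_amount: float
    tpep_pickup_datetime: int  # epoch milliseconds

def ride_serializer(ride):
    # dict -> JSON string -> UTF-8 bytes
    return json.dumps(dataclasses.asdict(ride)).encode('utf-8')

ride = Ride(161, 237, 2.4, 18.5, 1700000000000)
payload = ride_serializer(ride)

print(payload)  # raw bytes, exactly what the broker will store
restored = Ride(**json.loads(payload))  # what a consumer's deserializer does
assert restored == ride
```

A consumer on the other side simply reverses the steps: decode UTF-8, parse JSON, rebuild the object.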

Sending the Data Stream

Finally, we instantiate the KafkaProducer. We point it to the host’s localhost:9092 port, loop through a Pandas DataFrame, and push simulated streaming data to the rides topic.

If the topic doesn’t exist, the broker will automatically create it by default.

# Initialize Producer
server = 'localhost:9092'
producer = KafkaProducer(
    bootstrap_servers=[server],
    value_serializer=ride_serializer
)

# Read a sample of the NYC Yellow Taxi Parquet dataset
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet"
columns = ['PULocationID', 'DOLocationID', 'trip_distance', 'total_amount', 'tpep_pickup_datetime']
df = pd.read_parquet(url, columns=columns).head(1000)

topic_name = 'rides'
t0 = time.time()

# Simulate real-time event generation
for _, row in df.iterrows():
    ride = ride_from_row(row)
    producer.send(topic_name, value=ride)
    print(f"Sent: {ride}")
    time.sleep(0.01) # Brief sleep to simulate event intervals

producer.flush() # Ensure all queued messages are sent

t1 = time.time()
print(f'took {(t1 - t0):.2f} seconds')

Run this script, and your terminal will rapidly scroll through sent logs. In roughly ten seconds (dominated by the 10 ms sleep between events), 1000 JSON-encoded ride events have landed in a Redpanda topic.


Next up: Now that the data is in, how do we get it out? In the next post, we will write a Python Consumer and explore the challenges of persisting these raw event streams directly into a PostgreSQL database.