Streaming Foundations I: Redpanda and Python Producers
Goal: Understand the fundamentals of message brokers, use Redpanda to simplify Kafka architecture, and write a Python producer to stream NYC yellow taxi events into a topic.
1. Why Do We Need a Message Broker?
When building real-time data streaming applications, the most fundamental component is the Message Broker. Think of it as a high-throughput, low-latency central post office.
Imagine you have a service generating data in real-time (like a taxi meter) and multiple downstream services that want to consume this data (e.g., a billing system, a real-time monitoring dashboard, and a long-term database storage service). Without a message broker, you would have to tightly couple your meter code with all these downstream services.
By introducing a message broker, the architecture transforms into a classic Publish-Subscribe (Pub-Sub) model. Producers only need to send data to the broker; Consumers only need to read from the broker. Both sides are completely decoupled and don’t even need to know the other exists.
2. Redpanda: A Lighter, Faster “Kafka”
When discussing stream processing, Apache Kafka is the industry standard. However, Kafka traditionally relies on the JVM and requires an external ZooKeeper cluster to manage node coordination. This can feel bloated for local development or lightweight deployments.
Enter a modern alternative: Redpanda.
Redpanda is a message broker written in C++ that is fully compatible with the Kafka API.
- No JVM: It boots in seconds and has an extremely low memory footprint.
- Built-in Consensus (Raft): It eliminates the need for an external ZooKeeper component.
- Zero Code Changes: Because it implements the Kafka protocol, to our Python code, it is Kafka. We can directly use standard libraries like kafka-python.
Booting up Redpanda
We can quickly spin up a single-node Redpanda instance using docker-compose.yml:
Note on Network Configuration (advertise-kafka-addr): Kafka connections happen in two steps: the client first requests cluster metadata, then connects to the address the broker advertises for the actual data node. We therefore configure two addresses:
- OUTSIDE://localhost:9092: for Python scripts running on our host laptop.
- PLAINTEXT://redpanda:29092: for internal Docker communication between containers (like the Flink nodes we’ll build later).
Run docker compose up redpanda -d to start our lightweight “Kafka”.
3. Writing a Python Producer
With the core infrastructure running, let’s write a Python script acting as a Producer. We will use the NYC Yellow Taxi dataset as our data source.
First, install the dependencies:
Defining the Data Schema
In stream processing, having a solid data structure is critical. We use Python’s @dataclass to define the structure of every taxi ride:
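The original schema isn't reproduced here; a sketch using a few representative yellow-taxi columns (the field names are assumptions, matched loosely to the public dataset) might look like:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class Ride:
    vendor_id: int
    passenger_count: int
    trip_distance: float
    total_amount: float
    pickup_datetime: int   # epoch milliseconds
    dropoff_datetime: int  # epoch milliseconds


def to_epoch_ms(ts: datetime) -> int:
    # Collapse a timestamp into a single unambiguous integer for downstream frameworks
    return int(ts.timestamp() * 1000)
```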
Tip: Why convert time to epoch milliseconds? It’s a universal, unambiguous numerical format that downstream streaming frameworks (like Flink) can easily understand and parse.
Configuring the Serializer
Kafka handles Raw Bytes. Brokers don’t care if you are sending images, JSON, or plain text. Therefore, the Python client needs to know how to convert our Ride object into a byte stream.
We write a custom serializer to convert the dataclass to a JSON dictionary, and then encode it into UTF-8 bytes:
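The serializer itself is short; a sketch, with a trimmed stand-in Ride dataclass so the snippet is self-contained:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Ride:  # trimmed stand-in for the full ride schema
    vendor_id: int
    trip_distance: float


def ride_serializer(ride: Ride) -> bytes:
    # dataclass -> dict -> JSON string -> UTF-8 bytes, ready for the Kafka client
    return json.dumps(asdict(ride)).encode("utf-8")
```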
Sending the Data Stream
Finally, we instantiate the KafkaProducer. We point it to the host’s localhost:9092 port, loop through a Pandas DataFrame, and push simulated streaming data to the rides topic.
If the topic doesn’t exist, the broker will automatically create it by default.
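A sketch of the producer loop, assuming kafka-python and a DataFrame already loaded from the dataset (the helper names and the Parquet filename are illustrative, not from the original):

```python
import json

import pandas as pd


def df_to_payloads(df: pd.DataFrame) -> list[bytes]:
    # One UTF-8 JSON payload per DataFrame row; default=str handles Timestamp columns
    return [json.dumps(row, default=str).encode("utf-8")
            for row in df.to_dict(orient="records")]


def stream_rides(df: pd.DataFrame, topic: str = "rides") -> None:
    # Imported here so the row-encoding helper above is usable without the client
    from kafka import KafkaProducer

    # Connect via the OUTSIDE listener advertised for the host
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for payload in df_to_payloads(df):
        producer.send(topic, value=payload)
    producer.flush()  # block until every buffered message is delivered


# Usage (assuming a local copy of the dataset has been downloaded):
#   rides = pd.read_parquet("yellow_tripdata.parquet").head(1000)
#   stream_rides(rides)
```

Note that producer.send is asynchronous; the final flush is what guarantees all buffered messages actually reach the broker before the script exits.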
Run this script, and your terminal will rapidly scroll with send logs. In just ten seconds, 1,000 JSON payloads carrying ride metadata have been safely stored in Redpanda.
Next up: Now that the data is in, how do we get it out? In the next post, we will write a Python Consumer and explore the challenges of persisting these raw event streams directly into a PostgreSQL database.