PyFlink I: Architecture, Checkpoints, and Pass-Through Jobs
Goal: Understand Flink’s internal architecture (JobManager and TaskManager). Build a custom Docker environment to run PyFlink. Write a Flink SQL “Pass-Through” job to experience how Flink automatically handles Kafka reading, Checkpointing, and PostgreSQL writing.
1. Why Flink?
In the previous two posts, we manually wired up a streaming pipeline: Python -> Kafka -> Python -> PostgreSQL. However, we also saw that hand-rolling offset management, fault recovery, multi-threading, and window aggregation in plain Python is an engineering disaster.
Apache Flink exists to solve exactly these foundational engineering problems. Flink takes over the heaviest distributed computing tasks for us:
- Windowing — Built-in tumbling, sliding, and session windows (no need to hand-write complex timing logic).
- Checkpointing — Automatic fault recovery! Say goodbye to tracking manual offsets.
- Connectors Ecosystem — Out-of-the-box interfaces for Kafka, JDBC, filesystems, and more (no more writing boilerplate `psycopg2` or complex serialization logic).
- Parallelism — Automatically scale data processing from a single worker out to an entire server cluster.
- Flink SQL — An elegant, highly expressive way to manage and define data pipelines using native SQL queries!
However, powerful engines require proper infrastructure: a Flink cluster is a long-running service that must be kept up 24/7, much like any other server.
2. Flink Architecture: Brains and Brawn (JobManager & TaskManager)
A Flink service is driven by two main process types:
- JobManager (The Brain): The master node, acting as the coordinator and manager. It accepts all Flink Job Submissions and oversees state recovery (Checkpoints) during execution. It also exposes port 8081 for the Web UI, visualizing all running details.
- TaskManager (The Brawn): The worker node. It holds a finite number of “Task Slots”. This is the actual physical vessel that performs computations, merges data, and handles I/O.
Task Slots:
Think of a slot as a dedicated highway lane (an isolated slice of the TaskManager's memory; CPU is shared between slots). If you set Parallelism = 3, your job claims three lanes simultaneously, splitting the workload into 3 independent parallel pipes for maximum throughput. If a cluster provides 15 available task slots, you could simultaneously run 5 of these degree-3 micro-pipelines.
Customizing the PyFlink Environment
The official Flink Docker image supports only Java out of the box. We need to extend it by installing Python, the PyFlink package, and the third-party connector JARs (the Kafka connector and the PostgreSQL JDBC driver).
In our project, we can build a customized Dockerfile.flink containing these modules, generating a pyflink-workshop image.
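A minimal sketch of such a `Dockerfile.flink` might look like the following. The base-image tag, connector versions, and JAR URLs are illustrative assumptions; pin them to the Flink version you actually run:

```dockerfile
# Sketch — versions and JAR coordinates are assumptions, align them with your Flink release
FROM flink:1.18.1-scala_2.12-java11

# Inject Python support and the PyFlink package (the base image ships with Java only)
RUN apt-get update -y && \
    apt-get install -y python3 python3-pip && \
    ln -s /usr/bin/python3 /usr/bin/python && \
    pip3 install apache-flink==1.18.1

# Drop the connector JARs onto Flink's classpath
RUN wget -P /opt/flink/lib/ \
      https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka/3.1.0-1.18/flink-sql-connector-kafka-3.1.0-1.18.jar && \
    wget -P /opt/flink/lib/ \
      https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-jdbc/3.1.2-1.18/flink-connector-jdbc-3.1.2-1.18.jar && \
    wget -P /opt/flink/lib/ \
      https://repo.maven.apache.org/maven2/org/postgresql/postgresql/42.7.3/postgresql-42.7.3.jar
```

Everything placed in `/opt/flink/lib/` is picked up automatically at startup, so the SQL connectors become available without extra configuration.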
Then, we deploy it in our docker-compose.yml:
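A sketch of the relevant `docker-compose.yml` services, assuming the image built above is tagged `pyflink-workshop` (service names, volumes, and slot count are assumptions matching this workshop's setup):

```yaml
# Sketch — service names and mounted paths are assumptions
services:
  jobmanager:
    build:
      context: .
      dockerfile: Dockerfile.flink
    image: pyflink-workshop
    command: jobmanager
    ports:
      - "8081:8081"          # Flink Web UI
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - ./jobs:/opt/flink/jobs   # our PyFlink scripts

  taskmanager:
    image: pyflink-workshop
    command: taskmanager
    depends_on:
      - jobmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.numberOfTaskSlots: 15
```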
Run docker compose up --build -d, navigate to localhost:8081, and you’ll see a healthy TaskManager node with 15 fully-loaded slots ready for action!
3. Writing the First “Pass-Through” Streaming Job
Our first Flink streaming job will mimic our old native Python consumer: fetch from Kafka -> save directly to PostgreSQL. But this time, we aren’t writing low-level APIs. Flink will handle all fault tolerance and wiring.
We need two Flink Dynamic Tables.
Step 1: Create the Source Virtual Table (Kafka)
Here, we use Flink DDL to map the data directly (no separate deserializer needed):
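A sketch of the source DDL. The topic name, broker address, and column schema are assumptions; substitute whatever your producer from the earlier posts actually emits:

```sql
-- Sketch: topic, broker host, and schema are illustrative assumptions
CREATE TABLE kafka_source (
    user_id STRING,
    amount  DOUBLE,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'events',
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'pyflink-passthrough',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);
```

Note that `'format' = 'json'` replaces the hand-written deserializer entirely: Flink maps each JSON field onto the declared columns.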
`latest-offset` vs `earliest-offset`:
- `'latest-offset'` is ideal for general deployment (process only the records arriving after the job boots).
- If you want to replay historical data or perform backfilling, use `'earliest-offset'` so Flink sweeps the Kafka topic from the beginning, like a batch database.
- To seamlessly resume after a crash, Flink relies on checkpoint recovery (or the explicit `'timestamp'` startup mode).
Step 2: Create the Sink Virtual Table (PostgreSQL)
We use the official JDBC sink connector to pipe the stream into a PostgreSQL table.
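A sketch of the sink DDL, mirroring the source schema. The database URL, table name, and credentials are assumptions matching a typical docker-compose PostgreSQL service:

```sql
-- Sketch: URL, table name, and credentials are illustrative assumptions
CREATE TABLE pg_sink (
    user_id STRING,
    amount  DOUBLE,
    ts      TIMESTAMP(3)
) WITH (
    'connector' = 'jdbc',
    'url' = 'jdbc:postgresql://postgres:5432/workshop',
    'table-name' = 'events_passthrough',
    'driver' = 'org.postgresql.Driver',
    'username' = 'postgres',
    'password' = 'postgres'
);
```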
Step 3: Wire the Pipes and Enable Checkpoints
Now we write the execution body and enable checkpointing. With it, if a task crashes and restarts, Flink won't blindly re-read from the latest Kafka offset: it resumes precisely from the last consistent snapshot it stored internally.
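A minimal sketch of the job script, assuming the two DDL statements from Steps 1 and 2 are stored in the (hypothetical) string variables `KAFKA_SOURCE_DDL` and `PG_SINK_DDL`:

```python
# passthrough_job.py — a minimal sketch; DDL strings assumed defined as in Steps 1 and 2
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Take a consistent snapshot every 10 seconds; on restart, Flink resumes
# from the last completed checkpoint instead of re-reading raw offsets.
t_env.get_config().set("execution.checkpointing.interval", "10s")
t_env.get_config().set("execution.checkpointing.mode", "EXACTLY_ONCE")

t_env.execute_sql(KAFKA_SOURCE_DDL)  # Step 1: CREATE TABLE kafka_source ...
t_env.execute_sql(PG_SINK_DDL)       # Step 2: CREATE TABLE pg_sink ...

# The entire pipeline is one declarative statement: source -> sink
t_env.execute_sql(
    "INSERT INTO pg_sink SELECT user_id, amount, ts FROM kafka_source"
)
```

That single `INSERT INTO ... SELECT` is the whole pass-through job; Flink compiles it into a streaming DAG, with the checkpoint coordinator wired in automatically.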
Submitting the Job
Submit the task to the JobManager:
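A sketch of the submission command, assuming the compose service is named `jobmanager` and the script is mounted at `/opt/flink/jobs/` (both assumptions from the compose setup above):

```shell
# Sketch — service name and script path are assumptions
docker compose exec jobmanager \
  ./bin/flink run -d \
  --python /opt/flink/jobs/passthrough_job.py
```

The `-d` flag submits in detached mode, so the CLI returns immediately while the job keeps running on the cluster.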
Once submitted, the job detaches and keeps running in the background. The localhost:8081 dashboard reveals not only the DAG of the computation but also the precise details of each 10-second checkpoint.
Through declarative configurations, we achieved enterprise-grade disaster recovery without writing hundreds of lines of fragile communication code.
Next up: Stream processing is never just a straight line. Now that we have a distributed engine, let’s put it under pressure. In the next chapter, we unpack the very soul of Flink: Windows and using Watermarks to handle delayed data.