Spark Internals: Clusters, Shuffles, Joins, and RDDs
Goal: Understand how Spark executes jobs across a cluster, how operations like GROUP BY and JOIN are distributed, and what Resilient Distributed Datasets (RDDs) are.
1. Spark Cluster Architecture
When running beyond a single machine, Spark uses a master-worker architecture:
- Master — the entry point that receives jobs and distributes work.
- Driver — the program that submits a Spark job (an Airflow DAG, a Python script, etc.).
- Executors — worker nodes that actually process the data.
```mermaid
flowchart LR
    A["Driver (Spark Job)"] -->|"spark-submit<br/>port 4040"| Master
    subgraph Cluster["Spark Cluster"]
        Master([Master])
        Master --> E1{{Executor 1}}
        Master --> E2{{Executor 2}}
        Master --> E3{{Executor 3}}
    end
    subgraph Data["DataFrame Partitions (in GCS/S3)"]
        P1[Partition 1]
        P2[Partition 2]
        P3[Partition 3]
    end
    E1 <--> P1
    E2 <--> P2
    E3 <--> P3
```
Each executor fetches a partition from the Data Lake, processes it, and writes the output. If there are more partitions than executors, executors keep pulling partitions until all are processed.
Spark vs Hadoop: Unlike Hadoop, where data is stored locally on executor nodes, Spark separates storage from compute. This is possible because modern cloud storage (S3, GCS) is cheap and fast enough to make data locality unnecessary.
2. How GROUP BY Works (Shuffling)
Consider this query that calculates hourly revenue per zone:
Since data is split across partitions, records with the same key (hour, zone) may live in different partitions. Spark solves this in two stages:
Stage 1: Local Aggregation
Each executor groups the data within its own partition and produces intermediate results.
| Executor | Intermediate Result |
|---|---|
| Executor 1 | (hour 1, zone 1, $100, 5 trips), (hour 1, zone 2, $200, 10 trips) |
| Executor 2 | (hour 1, zone 1, $50, 2 trips), (hour 1, zone 2, $250, 11 trips) |
| Executor 3 | (hour 1, zone 1, $200, 10 trips), (hour 2, zone 1, $75, 3 trips) |
Stage 2: Shuffle & Reduce
Spark shuffles (redistributes) the data so that all records with the same key end up in the same partition. Then it performs a final aggregation.
| After Shuffle | Final Result |
|---|---|
| Partition A: all (hour 1, zone 1) records | (hour 1, zone 1, $350, 17 trips) |
| Partition B: all (hour 1, zone 2) and (hour 2, zone 1) records | (hour 1, zone 2, $450, 21 trips), (hour 2, zone 1, $75, 3 trips) |
The shuffle sorts records by key using an external merge sort: sorted runs are spilled to disk when memory fills and merged afterwards, which is how Spark can shuffle datasets larger than memory.
Performance note: Shuffling is expensive because it moves data across the network. Minimizing the data that needs to be shuffled is key to optimizing Spark jobs. By default, Spark creates 200 partitions after a shuffle.
3. How JOINs Work
Joining Two Large Tables (Sort-Merge Join)
When joining two large tables (e.g., green taxi revenue and yellow taxi revenue), Spark uses the same shuffle + merge strategy as GROUP BY:
- Shuffle both tables so that matching keys end up in the same partition.
- Merge (reduce) the records with matching keys into a single output.
- `on=` — the join key(s), forming a composite key.
- `how='outer'` — includes records even if only one side has data (fills the other side with nulls).
Joining a Large Table with a Small Table (Broadcast Join)
When one table is small (e.g., a zones lookup table), shuffling is wasteful. Instead, Spark uses broadcasting:
```mermaid
graph LR
    Z[Zones Table] -.->|Broadcast| E1 & E2 & E3
    subgraph Executors
        E1{{Executor 1}} -.-> Z1["zones (local copy)"]
        E2{{Executor 2}} -.-> Z2["zones (local copy)"]
        E3{{Executor 3}} -.-> Z3["zones (local copy)"]
    end
    B1[Big Table P1] --> E1 --> R1[Result 1]
    B2[Big Table P2] --> E2 --> R2[Result 2]
    B3[Big Table P3] --> E3 --> R3[Result 3]
```
Spark sends a copy of the entire small table to every executor. Each executor then performs a local lookup — no shuffle needed, making the join orders of magnitude faster.
4. Resilient Distributed Datasets (RDDs)
RDDs are the low-level abstraction that Spark DataFrames are built on top of. While DataFrames have a schema, RDDs are simply distributed collections of objects.
From DataFrame to RDD
Map, Filter, and ReduceByKey
You can re-implement SQL-like logic with RDD operations:
Converting Back to a DataFrame
When to use RDDs? Almost never directly. DataFrames and Spark SQL are higher-level, more optimized, and easier to work with. But understanding RDDs helps you reason about how Spark works under the hood.
5. mapPartitions: Per-Partition Processing
The mapPartitions() method transforms entire partitions instead of individual elements — useful for batch operations like ML inference:
- Each partition is converted to a Pandas DataFrame and passed to the model.
- `yield` returns a generator, which Spark uses to build the output RDD incrementally.
- Use case: when your model expects a batch of records (e.g., a NumPy array) rather than one row at a time.
In the next post, we’ll move from local development to the cloud — connecting Spark to Google Cloud Storage, setting up standalone clusters, and running jobs on Dataproc.