Spark SQL: Running SQL Queries on DataFrames
Goal: Learn how to combine multiple datasets in Spark, register DataFrames as temporary tables, and run SQL queries — bridging the gap between DataFrame operations and SQL.
1. Why Spark SQL?
There are dedicated tools for running SQL on data lakes (Hive, Presto, Athena). But if you already have a Spark cluster, Spark SQL lets you run SQL queries without setting up additional infrastructure.
Spark SQL is especially useful when:
- You’re mixing SQL and Python logic in the same pipeline.
- You want to reuse existing SQL queries (e.g., from a dbt project) inside Spark.
- You need to prototype quickly before moving to a dedicated SQL engine.
2. Combining Two Datasets
Let’s work with the green and yellow NYC taxi datasets. Since their pickup/dropoff column names differ, we first need to align them:
Finding Common Columns
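One way to compute the shared columns is a plain Python loop that preserves the green dataset's column order (a raw `set` intersection would scramble it). The column lists below are illustrative stand-ins for `df_green.columns` and `df_yellow.columns`:

```python
def common_columns(cols_a, cols_b):
    """Order-preserving intersection: columns of cols_a that also appear in cols_b."""
    b = set(cols_b)
    return [c for c in cols_a if c in b]

# Illustrative column lists; with real DataFrames you would pass
# df_green.columns and df_yellow.columns instead.
green_cols = ["VendorID", "pickup_datetime", "ehail_fee", "total_amount"]
yellow_cols = ["VendorID", "pickup_datetime", "total_amount", "airport_fee"]

common = common_columns(green_cols, yellow_cols)
print(common)  # ['VendorID', 'pickup_datetime', 'total_amount']
```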
Adding a Service Type Identifier
Before merging, we add a column to track which dataset each record came from:
`F.lit()` adds a literal (constant) value to every row.
Merging with unionAll
3. Querying with Temporary Tables
Spark SQL needs a table to query. We can register a DataFrame as a temporary table:
Note: `registerTempTable()` is deprecated since Spark 2.0. Use `createOrReplaceTempView()` instead.
Now we can run standard SQL:
This produces the same result as `df_trips_data.groupBy('service_type').count().show()`, but expressed in SQL.
4. A Real-World Revenue Query
Let’s replicate a revenue aggregation model in Spark SQL:
- SQL queries in Spark are transformations (lazy). You need an action like `show()` or `write()` to trigger execution.
- We use positional references (`1, 2, 3`) in `GROUP BY` for cleaner syntax.
5. Writing Results and Controlling Partitions
Writing the result directly:
With our dataset, this could create 200+ tiny Parquet files (Spark defaults to 200 shuffle partitions), which is not ideal. Use `coalesce()` to reduce the partition count:
| Method | Purpose |
|---|---|
| `repartition(n)` | Increase partitions (involves a shuffle). |
| `coalesce(n)` | Decrease partitions (no shuffle, more efficient). |
Rule of thumb: Use `coalesce()` when reducing partitions and `repartition()` when increasing them. `coalesce()` avoids the expensive shuffle operation.
In the next post, we’ll look under the hood at Spark internals — how clusters work, how GROUP BY and JOINs are executed across distributed partitions, and what RDDs are.