# Batch Processing Fundamentals & Introduction to Apache Spark

**Goal:** Understand what batch processing is, how it compares to streaming, and why Apache Spark is the go-to engine for large-scale batch data transformations.
## 1. Batch vs Streaming
There are two fundamental ways to process data:
| Approach | Description | Example |
|---|---|---|
| Batch Processing | Process chunks of data at regular intervals. | Aggregate all taxi trips at the end of each month. |
| Streaming | Process data on the fly, as it arrives. | Process a taxi trip event the moment a ride starts. |
Most companies that deal with data use batch processing for the majority of their workloads (roughly 90%). Streaming is reserved for use cases where low latency is critical.
This series of posts covers batch processing. We’ll cover streaming in a later module.
## 2. Types of Batch Jobs
A batch job is a unit of work that processes data in batches. Jobs can be scheduled at different intervals:
- Weekly — end-of-week reports
- Daily — the most common cadence
- Hourly — near-real-time analytics
- Every N minutes — frequent micro-batches
Batch jobs can be implemented with a variety of technologies, including:
- Python scripts — flexible, can run anywhere (Kubernetes, AWS Batch, etc.)
- SQL — declarative transformations (e.g., dbt models)
- Spark — distributed processing for large datasets (the focus of this module)
- Flink — another distributed engine (more commonly used for streaming)
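To make the first option concrete, here is a toy daily batch job in plain Python. The record fields (`pickup_date`, `zone`, `amount`) and the function name are hypothetical, chosen to echo the taxi-trips example above; a real job would read from files or a database rather than an in-memory list.

```python
from collections import defaultdict

def run_daily_job(rows, day):
    """Toy daily batch job: total trip amounts per zone for one day."""
    totals = defaultdict(float)
    for row in rows:
        # Keep only the records that belong to the day this run covers.
        if row["pickup_date"] == day:
            totals[row["zone"]] += float(row["amount"])
    return dict(totals)

# Hypothetical sample data standing in for a day's worth of trip records.
rows = [
    {"pickup_date": "2024-01-01", "zone": "A", "amount": "10.5"},
    {"pickup_date": "2024-01-01", "zone": "A", "amount": "4.5"},
    {"pickup_date": "2024-01-02", "zone": "B", "amount": "7.0"},
]
print(run_daily_job(rows, "2024-01-01"))  # {'A': 15.0}
```

A scheduler (cron, Airflow, etc.) would invoke a script like this once per day, passing in the date of the partition to process.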
## 3. Orchestrating Batch Jobs
Batch jobs rarely run in isolation. A typical workflow chains multiple technologies together:
```mermaid
graph LR;
  A(Data Lake / CSV) --> B(Python)
  B --> C[(SQL / dbt)]
  C --> D(Spark)
  D --> E(Python)
```
Tools like Airflow, Prefect, or Kestra orchestrate these multi-step pipelines — handling scheduling, retries, and dependency management.
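To see what an orchestrator does under the hood, here is a minimal pure-Python sketch of its core loop: run dependent steps in order, retry each one on failure. This is an illustration of the concept, not the API of Airflow, Prefect, or Kestra.

```python
import time

def run_pipeline(steps, max_retries=3, delay=0.0):
    """Run (name, fn) steps in order; each fn sees earlier results.

    A step is retried up to max_retries times before the run fails.
    """
    results = {}
    for name, fn in steps:
        for attempt in range(1, max_retries + 1):
            try:
                results[name] = fn(results)
                break  # step succeeded, move to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: fail the whole run
                time.sleep(delay)
    return results

# A toy extract -> transform -> load chain.
steps = [
    ("extract", lambda r: [1, 2, 3]),
    ("transform", lambda r: [x * 2 for x in r["extract"]]),
    ("load", lambda r: sum(r["transform"])),
]
print(run_pipeline(steps))  # {'extract': [1, 2, 3], 'transform': [2, 4, 6], 'load': 12}
```

Real orchestrators add the parts that matter in production: persistent state, backfills, alerting, and a scheduler that triggers runs on a cadence.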
## 4. Pros and Cons of Batch Jobs

### Advantages
- Easy to manage: Many mature tools exist to build, schedule, and monitor batch pipelines.
- Re-executable: If a job fails, you can simply rerun it — provided the job is idempotent, i.e., rerunning it produces the same result rather than duplicating data.
- Scalable: Scripts can run on larger machines; Spark jobs can scale out to bigger clusters.
### Disadvantages
- Delay: Each step in the pipeline takes time. If the full workflow runs for 20 minutes, the output data is always at least 20 minutes behind reality.
In practice, the advantages usually outweigh the delay. Batch processing is the workhorse of modern data platforms.
## 5. What is Apache Spark?
Apache Spark is an open-source, multi-language, unified analytics engine for large-scale data processing.
- Engine — it pulls data in, processes it, and outputs it.
- Multi-language — native support for Java and Scala; wrappers for Python (PySpark), R, and others.
- Unified — handles both batch and streaming workloads.
```mermaid
graph LR;
  A[(Data Lake)] -->|Pull data| B(Spark)
  B -->|Transform| B
  B -->|Output data| A
```
Spark runs on clusters with multiple nodes, where each node pulls and transforms data in parallel.
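The same idea can be shown in miniature without Spark: split the data into partitions and have independent workers transform each partition. This is plain Python with threads standing in for Spark executors, purely to illustrate the partition-parallel model.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """Each worker transforms its own partition independently."""
    return [x * x for x in partition]

data = list(range(10))
# Split the dataset into 4 partitions (Spark does this automatically).
partitions = [data[i::4] for i in range(4)]

# Run the same transformation on every partition in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_partition, partitions))

# Collect and combine the per-partition results.
squares = sorted(x for part in results for x in part)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

In a real cluster the partitions live on different machines and the "workers" are executor processes, but the mental model — one transformation applied to many partitions at once — is the same.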
## 6. Why Do We Need Spark?
Spark is used for transforming data in a Data Lake — especially when the transformation logic is too complex for SQL.
Tools like Hive, Presto, and Athena are excellent for SQL-based transformations. But when you need to apply complex manipulation — such as machine learning models, custom business logic, or advanced aggregations — Spark gives you the full power of a general-purpose programming language.
```mermaid
graph LR;
  A[(Data Lake)] --> B{Can the job be expressed with SQL?}
  B -->|Yes| C(Hive / Presto / Athena)
  B -->|No| D(Spark)
  C & D --> E[(Data Lake)]
```
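For the "yes" branch of that decision, a simple aggregation really is just SQL. Here `sqlite3` stands in for an engine like Hive, Presto, or Athena; the table and column names are made up for the example.

```python
import sqlite3

# In-memory database standing in for a data-lake query engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (zone TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", 10.5), ("A", 4.5), ("B", 7.0)],
)

# A job like this is naturally declarative -- no Spark needed.
rows = conn.execute(
    "SELECT zone, SUM(amount) FROM trips GROUP BY zone ORDER BY zone"
).fetchall()
print(rows)  # [('A', 15.0), ('B', 7.0)]
```

The moment the transformation stops fitting in a `SELECT` — a custom scoring function, an ML model, multi-step business logic — you cross to the "no" branch and reach for Spark.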
### A Typical ML Workflow
In practice, many pipelines combine SQL and Spark:
```mermaid
graph LR;
  A((Raw Data)) --> B[(Data Lake)]
  B --> C(SQL / Athena)
  C --> D(Spark Job)
  D -->|Train a model| E(Python ML Training)
  D -->|Apply a model| F("Spark: Apply Model")
  E --> G([Trained Model])
  G --> F
  F -->|Save output| B
```
The rule of thumb: use SQL when you can, use Spark when you must. For everything that can be expressed declaratively, SQL is simpler. For everything else, there’s Spark.
In the next post, we’ll get hands-on with PySpark — creating Spark sessions, reading data, working with DataFrames, and writing our first transformations.