Running Spark in the Cloud: GCS, Standalone Clusters, and Dataproc
Goal: Move from local Spark development to cloud execution — connecting to Google Cloud Storage, running standalone Spark clusters, and submitting jobs on Google Cloud Dataproc.
1. Connecting Spark to Google Cloud Storage
So far we’ve worked with local files. To use data stored in a Cloud Storage bucket, Spark needs the GCS Connector.
Uploading Data with gsutil
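A sketch of the upload command; the local folder name `data/pq` and the bucket name are placeholders to adapt to your setup:

```shell
# Upload the local Parquet folder to the bucket, in parallel and recursively
gsutil -m cp -r data/pq gs://<your-bucket>/pq
```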
- `-m` — multithreaded upload for speed.
- `-r` — recursive (upload folder contents).
Configuring the GCS Connector
- Download the connector JAR:

```shell
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/
```
- Configure Spark with the connector and credentials:

```python
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

credentials_location = '~/.google/credentials/google_credentials.json'

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set("spark.jars", "./lib/gcs-connector-hadoop3-2.2.5.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)
```
- Create and configure the Spark context:

```python
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
```
- Create the session and read remote data:

```python
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

df_green = spark.read.parquet('gs://<your-bucket>/pq/green/*/*')
```
From here, everything works exactly as it did with local files.
2. Creating a Standalone Spark Cluster
When creating a Spark session in a notebook with master("local[*]"), the cluster lives only as long as the notebook kernel. For persistent clusters, use Spark Standalone Mode.
Starting the Master
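A sketch, assuming Spark is installed at `$SPARK_HOME`; the master is started with the bundled script:

```shell
cd $SPARK_HOME
./sbin/start-master.sh
```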
The Spark dashboard is now available at localhost:8080. Copy the Spark master URL from the dashboard (format: spark://<hostname>:7077).
Starting a Worker
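Again assuming `$SPARK_HOME`, start a worker and point it at the master URL copied from the dashboard:

```shell
cd $SPARK_HOME
# The spark:// URL comes from the master's dashboard
./sbin/start-worker.sh spark://<hostname>:7077
```

(On Spark versions before 3.1 the script is named `start-slave.sh`.)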
The worker will appear in the dashboard. Connect your code to the cluster:
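A minimal session builder pointing at the standalone master; the URL is the one copied from the dashboard, and the app name is arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://<hostname>:7077") \
    .appName('test') \
    .getOrCreate()
```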
Note: Port 8080 is for the web UI. Port 7077 is the Spark protocol port for job submission.
Shutting Down
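Both daemons are stopped with the matching scripts (again assuming `$SPARK_HOME`):

```shell
cd $SPARK_HOME
./sbin/stop-worker.sh
./sbin/stop-master.sh
```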
3. Parametrizing Scripts
Hard-coded paths and dates make scripts inflexible. Use argparse to accept parameters:
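A minimal sketch of the pattern. The parameter names `--input_green` and `--output` mirror the job arguments used later in this post; the helper name and the commented Spark calls are illustrative:

```python
import argparse

def parse_args(argv=None):
    # argv=None falls back to sys.argv, so the real script can
    # simply call parse_args() with no arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_green', required=True)
    parser.add_argument('--output', required=True)
    return parser.parse_args(argv)

# Shown here with an explicit argv for clarity:
args = parse_args(['--input_green=data/pq/green/2021/*',
                   '--output=data/report-2021'])

# Use the parsed values instead of hard-coded paths, e.g.:
# df_green = spark.read.parquet(args.input_green)
# df_result.write.parquet(args.output)
```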
Run it:
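An example invocation; the script name and the paths are placeholders:

```shell
python my_script.py \
    --input_green=data/pq/green/2021/* \
    --output=data/report-2021
```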
4. Submitting Jobs with spark-submit
Instead of specifying the master URL inside your script, use spark-submit to configure cluster settings externally:
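A sketch of the call, assuming the standalone master from section 2; the script name and paths are placeholders:

```shell
spark-submit \
    --master="spark://<hostname>:7077" \
    my_script.py \
        --input_green=data/pq/green/2021/* \
        --output=data/report-2021
```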
This separates infrastructure config (which cluster, how many resources) from application logic (what data to process). Your script stays clean:
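A minimal sketch of the session builder: it no longer hard-codes a master, so whatever `spark-submit` supplies takes effect:

```python
from pyspark.sql import SparkSession

# No .master() here — spark-submit provides the cluster settings
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()
```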
5. Running on Google Cloud Dataproc
Dataproc is Google’s managed Spark service. It handles cluster provisioning, scaling, and teardown — so you don’t have to.
Creating a Cluster
- Go to Dataproc in the GCP Console and enable the API.
- Click Create Cluster:
- Choose a name and region (same region as your GCS bucket).
- Select Standard for production or Single Node for experimentation.
- Click Create. The cluster will take a minute or two to provision.
Submitting a Job via the Web UI
- Go to your cluster’s Job tab → Submit Job.
- Set Job type to PySpark.
- Set Main Python file to the GCS path of your script (e.g., gs://<bucket>/scripts/my_script.py).
- Add Arguments (e.g., --input_green=gs://..., --output=gs://...).
- Click Submit.
Important: Your script must not set master(). Dataproc manages the connection automatically.
Submitting a Job via gcloud CLI
First, grant your service account the Dataproc Administrator role in IAM. Then:
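A sketch of the submission; the cluster name, region, bucket, and paths are placeholders. Arguments after the bare `--` are passed through to the script itself:

```shell
gcloud dataproc jobs submit pyspark \
    --cluster=<your-cluster> \
    --region=<your-region> \
    gs://<bucket>/scripts/my_script.py \
    -- \
        --input_green=gs://<bucket>/pq/green/2021/* \
        --output=gs://<bucket>/report-2021
```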
| Method | Best For |
|---|---|
| Web UI | Quick ad-hoc testing and debugging. |
| gcloud CLI | Automation, CI/CD pipelines, and orchestration tools. |
Summary: The Full Journey
Over these five posts, we’ve covered the complete batch processing workflow:
- Fundamentals — batch vs streaming, job types, orchestration.
- PySpark Basics — sessions, DataFrames, partitions, transformations.
- Spark SQL — combining datasets, temporary tables, SQL queries.
- Spark Internals — clusters, shuffles, joins, RDDs.
- Cloud Deployment — GCS, standalone clusters, Dataproc.
From here, you have the foundation to build production-grade Spark pipelines — whether running locally for development or on managed cloud infrastructure for production workloads.