Running Spark in the Cloud: GCS, Standalone Clusters, and Dataproc

Goal: Move from local Spark development to cloud execution — connecting to Google Cloud Storage, running standalone Spark clusters, and submitting jobs on Google Cloud Dataproc.


1. Connecting Spark to Google Cloud Storage

So far we’ve worked with local files. To use data stored in a Cloud Storage bucket, Spark needs the GCS Connector.

Uploading Data with gsutil

gsutil -m cp -r <local_folder> gs://<bucket_name>/<destination_folder>
  • -m — multithreaded upload for speed.
  • -r — recursive (upload folder contents).
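Under `-r`, gsutil reproduces the local directory tree under the destination prefix, keeping the source folder name. A minimal sketch of that path mapping (the function name and bucket names are hypothetical, not part of gsutil):

```python
from pathlib import Path

def gcs_targets(local_folder: str, bucket: str, dest_folder: str):
    """Sketch of how `gsutil cp -r` maps local files to object paths:
    the source folder name is preserved under the destination prefix."""
    root = Path(local_folder)
    for f in sorted(root.rglob('*')):
        if f.is_file():
            rel = f.relative_to(root.parent).as_posix()
            yield f'gs://{bucket}/{dest_folder}/{rel}'
```

For example, `gcs_targets('pq', 'my-bucket', 'data')` maps `pq/green/part-0.parquet` to `gs://my-bucket/data/pq/green/part-0.parquet`.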

Configuring the GCS Connector

  1. Download the connector JAR:

    gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/
  2. Configure Spark with the connector and credentials:

    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext
    
    credentials_location = '~/.google/credentials/google_credentials.json'
    
    conf = SparkConf() \
        .setMaster('local[*]') \
        .setAppName('test') \
        .set("spark.jars", "./lib/gcs-connector-hadoop3-2.2.5.jar") \
        .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
        .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)
  3. Create and configure the Spark context:

    sc = SparkContext(conf=conf)
    
    hadoop_conf = sc._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
    hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
  4. Create the session and read remote data:

    spark = SparkSession.builder \
        .config(conf=sc.getConf()) \
        .getOrCreate()
    
    df_green = spark.read.parquet('gs://<your-bucket>/pq/green/*/*')

From here, everything works exactly as it did with local files.
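One gotcha with the credentials path above: `~` is expanded by the shell, not by the JVM-side connector, so a literal `~/...` string may fail to resolve when the connector opens the keyfile. Expanding it in Python first is safer:

```python
import os

# Resolve '~' in Python; the Java-side GCS connector reads the path literally.
credentials_location = os.path.expanduser('~/.google/credentials/google_credentials.json')
```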


2. Creating a Standalone Spark Cluster

When creating a Spark session in a notebook with master("local[*]"), the cluster lives only as long as the notebook kernel. For persistent clusters, use Spark Standalone Mode.

Starting the Master

# From your Spark install directory
./sbin/start-master.sh

The Spark dashboard is now available at localhost:8080. Copy the Spark master URL from the dashboard (format: spark://<hostname>:7077).

Starting a Worker

./sbin/start-worker.sh spark://<hostname>:7077

The worker will appear in the dashboard. Connect your code to the cluster:

spark = SparkSession.builder \
    .master("spark://<hostname>:7077") \
    .appName('test') \
    .getOrCreate()

Note: Port 8080 is for the web UI. Port 7077 is the Spark protocol port for job submission.
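Before pointing a session at `spark://<hostname>:7077`, it can help to confirm the master is actually listening on that port. A small stdlib check (the helper name is mine, not part of Spark):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. check port_open('localhost', 7077) before building the session
```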

Shutting Down

./sbin/stop-worker.sh
./sbin/stop-master.sh

3. Parametrizing Scripts

Hard-coded paths and dates make scripts inflexible. Use argparse to accept parameters:

import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

df_green = spark.read.parquet(args.input_green)
df_yellow = spark.read.parquet(args.input_yellow)
# ... transformations ...
df_result.write.parquet(args.output, mode='overwrite')

Run it:

python my_script.py \
    --input_green=data/pq/green/2020/*/ \
    --input_yellow=data/pq/yellow/2020/*/ \
    --output=data/report-2020
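The parser above can be exercised without touching the command line by passing an explicit argv list to `parse_args` — handy for checking that the wildcard paths arrive as plain strings (Spark expands them at read time):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)

# Same flags as on the command line; globs stay literal strings
args = parser.parse_args([
    '--input_green=data/pq/green/2020/*/',
    '--input_yellow=data/pq/yellow/2020/*/',
    '--output=data/report-2020',
])
```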

4. Submitting Jobs with spark-submit

Instead of specifying the master URL inside your script, use spark-submit to configure cluster settings externally:

spark-submit \
    --master="spark://<hostname>:7077" \
    my_script.py \
        --input_green=data/pq/green/2020/*/ \
        --input_yellow=data/pq/yellow/2020/*/ \
        --output=data/report-2020

This separates infrastructure config (which cluster, how many resources) from application logic (what data to process). Your script stays clean:

spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()
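The same split shows up naturally in orchestration code: cluster flags go before the script, application flags after it. A hypothetical helper that assembles the command for `subprocess` (names and paths are illustrative):

```python
def spark_submit_cmd(master: str, script: str, **app_args) -> list:
    """Build a spark-submit invocation: cluster config first,
    then the script, then the script's own arguments."""
    cmd = ['spark-submit', f'--master={master}', script]
    cmd += [f'--{name}={value}' for name, value in app_args.items()]
    return cmd

cmd = spark_submit_cmd(
    'spark://localhost:7077',   # hypothetical master URL
    'my_script.py',
    input_green='data/pq/green/2020/*/',
    input_yellow='data/pq/yellow/2020/*/',
    output='data/report-2020',
)
# then: subprocess.run(cmd, check=True)
```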

5. Running on Google Cloud Dataproc

Dataproc is Google’s managed Spark service. It handles cluster provisioning, scaling, and teardown — so you don’t have to.

Creating a Cluster

  1. Go to Dataproc in the GCP Console and enable the API.
  2. Click Create Cluster:
    • Choose a name and region (same region as your GCS bucket).
    • Select Standard for production or Single Node for experimentation.
  3. Click Create. The cluster is usually ready within a minute or two.

Submitting a Job via the Web UI

  1. Go to your cluster’s Job tab → Submit Job.
  2. Set Job type to PySpark.
  3. Set Main Python file to the GCS path of your script (e.g., gs://<bucket>/scripts/my_script.py).
  4. Add Arguments (e.g., --input_green=gs://..., --output=gs://...).
  5. Click Submit.

Important: Your script must not set master(). Dataproc manages the connection automatically.

Submitting a Job via gcloud CLI

First, grant your service account the Dataproc Administrator role in IAM. Then:

gcloud dataproc jobs submit pyspark \
    --cluster=<your-cluster-name> \
    --region=<your-region> \
    gs://<bucket>/scripts/my_script.py \
    -- \
        --input_green=gs://<bucket>/pq/green/2020/*/ \
        --input_yellow=gs://<bucket>/pq/yellow/2020/*/ \
        --output=gs://<bucket>/report-2020
When to use which:

  • Web UI: quick ad-hoc testing and debugging.
  • gcloud CLI: automation, CI/CD pipelines, and orchestration tools.
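For automation and CI/CD, the gcloud invocation can be assembled programmatically as well. The key detail is the bare `--` separator, which marks where gcloud's own flags end and the script's arguments begin (cluster, region, and bucket names below are hypothetical):

```python
def dataproc_submit_cmd(cluster: str, region: str, script: str, script_args: list) -> list:
    """Build a `gcloud dataproc jobs submit pyspark` invocation.
    Everything after the bare '--' is passed to the script itself."""
    return [
        'gcloud', 'dataproc', 'jobs', 'submit', 'pyspark',
        f'--cluster={cluster}',
        f'--region={region}',
        script,
        '--',                     # separator: gcloud's flags end here
        *script_args,
    ]

cmd = dataproc_submit_cmd(
    'my-cluster', 'europe-west1',              # hypothetical cluster and region
    'gs://my-bucket/scripts/my_script.py',
    ['--input_green=gs://my-bucket/pq/green/2020/*/',
     '--output=gs://my-bucket/report-2020'],
)
```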

Summary: The Full Journey

Over these five posts, we’ve covered the complete batch processing workflow:

  1. Fundamentals — batch vs streaming, job types, orchestration.
  2. PySpark Basics — sessions, DataFrames, partitions, transformations.
  3. Spark SQL — combining datasets, temporary tables, SQL queries.
  4. Spark Internals — clusters, shuffles, joins, RDDs.
  5. Cloud Deployment — GCS, standalone clusters, Dataproc.

From here, you have the foundation to build production-grade Spark pipelines — whether running locally for development or on managed cloud infrastructure for production workloads.