Cloud Transplants: The GCP Integration Protocol

Important

SYSTEM ALERT: Local infrastructure limits detected. Processing capacity constrained. Storage nodes fragmented. Initiate cloud migration protocols for horizontal scalability.

We have all been there. Your local Postgres database hums along nicely with a few million rows. But then, the data grows. And grows. Queries slow to a crawl. Disk space runs out.

Here is the hard truth: Local databases are not designed for petabyte-scale analytics. To build for the future, we need to migrate to the cloud, where storage is effectively unlimited and compute is elastic.

At the Yellow Capsule, we treat cloud migration like a neural network upgrade: carefully planned, incrementally deployed, and centrally managed. Enter the GCP Integration Protocol.


🧬 System Scan: The Cloud Architecture

Files 06 through 09 guide us through a complete cloud migration:

| File | Purpose |
| --- | --- |
| 06_gcp_kv.yaml | Configure cloud credentials and project settings. |
| 07_gcp_setup.yaml | Provision cloud resources (GCS bucket, BigQuery dataset). |
| 08_gcp_taxi.yaml | The core ETL pipeline, now targeting GCP. |
| 09_gcp_taxi_scheduled.yaml | Automated, scheduled version for production. |

🛠️ Phase 1: Configuration Management (The KV Store)

Before connecting to any cloud resource, we need to securely store credentials and project configuration. Kestra’s Key-Value (KV) Store is the solution.

06_gcp_kv.yaml: Setting the Parameters

tasks:
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID
    kvType: STRING
    value: kestra-156156

  - id: gcp_location
    type: io.kestra.plugin.core.kv.Set
    key: GCP_LOCATION
    kvType: STRING
    value: australia-southeast1

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME
    kvType: STRING
    value: kestra-156156-zoomcamp-demo

  - id: gcp_dataset
    type: io.kestra.plugin.core.kv.Set
    key: GCP_DATASET
    kvType: STRING
    value: zoomcamp

  • io.kestra.plugin.core.kv.Set: Stores a key-value pair in Kestra’s internal store.
  • These values are retrieved later using {{ kv('KEY_NAME') }}.
  • Security Note: Sensitive credentials like GCP_CREDS (your service account JSON) should be set via the UI or a secure secrets manager, not hardcoded.

System Note: Run this workflow once to initialize your environment. It’s the “boot sector” for your cloud configuration.
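
If you want to verify that the store is populated, here is a minimal sketch of a throwaway flow that simply renders the stored values back using the kv() function. The flow id, namespace, and log text are placeholders of our own, not part of the course files; if you keep the service account JSON in Kestra secrets instead of the KV store, the secret() function plays the same role as kv() here.

id: kv_smoke_test        # hypothetical flow, for illustration only
namespace: zoomcamp      # same namespace as 06_gcp_kv.yaml, since KV entries are scoped per namespace

tasks:
  - id: show_config
    type: io.kestra.plugin.core.log.Log
    message: |
      Project:  {{ kv('GCP_PROJECT_ID') }}
      Location: {{ kv('GCP_LOCATION') }}
      Bucket:   {{ kv('GCP_BUCKET_NAME') }}
      Dataset:  {{ kv('GCP_DATASET') }}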


🛠️ Phase 2: Infrastructure Provisioning (The Setup)

07_gcp_setup.yaml: Building the Organs

tasks:
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    storageClass: REGIONAL
    name: "{{kv('GCP_BUCKET_NAME')}}"

  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    name: "{{kv('GCP_DATASET')}}"
    ifExists: SKIP

pluginDefaults:
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{kv('GCP_CREDS')}}"
      projectId: "{{kv('GCP_PROJECT_ID')}}"
      location: "{{kv('GCP_LOCATION')}}"

  • CreateBucket: Creates a Google Cloud Storage bucket.
  • CreateDataset: Creates a BigQuery dataset.
  • ifExists: SKIP: Makes the task idempotent. Safe to re-run without errors.
  • pluginDefaults: Defines default credentials and settings for all GCP plugin tasks in this workflow. DRY principle applied.
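
To see what pluginDefaults saves, here is a sketch of the same bucket task written without it (credential values assumed to come from the KV entries set in 06_gcp_kv.yaml, with the location default repeated the same way where a task accepts it); the connection fields would have to appear on every GCP task:

tasks:
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    storageClass: REGIONAL
    name: "{{kv('GCP_BUCKET_NAME')}}"
    # Without pluginDefaults, these fields repeat on every GCP task in the flow
    serviceAccount: "{{kv('GCP_CREDS')}}"
    projectId: "{{kv('GCP_PROJECT_ID')}}"

Scoping the defaults to the io.kestra.plugin.gcp type prefix keeps those values in one place and applies them to both tasks above.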

🔪 Phase 3: The Cloud ETL Pipeline (08_gcp_taxi.yaml)

This is the transplanted version of our Postgres pipeline, now operating on GCP infrastructure.

Step 1: Extract & Upload to GCS

- id: extract
  type: io.kestra.plugin.scripts.shell.Commands
  commands:
    - wget -qO- .../{{render(vars.file)}}.gz | gunzip > {{render(vars.file)}}

- id: upload_to_gcs
  type: io.kestra.plugin.gcp.gcs.Upload
  from: "{{render(vars.data)}}"
  to: "{{render(vars.gcs_file)}}"

  • Data is downloaded locally, then immediately uploaded to the GCS bucket.
  • GCS acts as the “staging area” before loading into BigQuery.
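
For context, here is a sketch of the kind of variables block these expressions assume; the exact expressions in the course file may differ slightly, but the shape is: build the file name from the flow inputs, point gcs_file at the bucket stored in the KV store, and point data at the file produced by the extract task.

variables:
  # e.g. yellow_tripdata_2019-01.csv
  file: "{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv"
  # Staging path inside the bucket created by 07_gcp_setup.yaml
  gcs_file: "gs://{{kv('GCP_BUCKET_NAME')}}/{{vars.file}}"
  # Target table: <dataset>.<taxi>_tripdata_<year>_<month>
  table: "{{kv('GCP_DATASET')}}.{{inputs.taxi}}_tripdata_{{inputs.year}}_{{inputs.month}}"
  # Local CSV emitted by the extract task, used as the Upload source
  data: "{{outputs.extract.outputFiles[inputs.taxi ~ '_tripdata_' ~ inputs.year ~ '-' ~ inputs.month ~ '.csv']}}"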

Step 2: Create External Table (The Portal)

- id: bq_yellow_table_ext
  type: io.kestra.plugin.gcp.bigquery.Query
  sql: |
    CREATE OR REPLACE EXTERNAL TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`
    (
        VendorID STRING,
        tpep_pickup_datetime TIMESTAMP,
        ...
    )
    OPTIONS (
        format = 'CSV',
        uris = ['{{render(vars.gcs_file)}}'],
        skip_leading_rows = 1
    );

System Note: An External Table in BigQuery reads data directly from GCS files without copying it. This is powerful for initial exploration and schema definition.
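
Because the external table is just a schema over the CSV sitting in GCS, you can sanity-check a load before materializing anything. A small illustrative task (the row-count query is ours, not from the course files):

- id: bq_check_ext
  type: io.kestra.plugin.gcp.bigquery.Query
  # Scans the CSV in GCS directly; nothing has been copied into BigQuery storage yet
  sql: |
    SELECT COUNT(*) AS row_count
    FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;
  fetchOne: true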

Step 3: Materialize to Native Table

- id: bq_yellow_table_tmp
  type: io.kestra.plugin.gcp.bigquery.Query
  sql: |
    CREATE OR REPLACE TABLE `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}`
    AS
    SELECT
      MD5(CONCAT(...)) AS unique_row_id,
      "{{render(vars.file)}}" AS filename,
      *
    FROM `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}_ext`;

  • Converts the external table to a native BigQuery table for optimized query performance.
  • Adds the unique_row_id hash and filename for lineage tracking.
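
The CONCAT(...) is elided in the snippet above; purely as an illustration (the column choice here is ours), the hash concatenates the fields that identify a trip, so the same row always yields the same ID:

-- Illustration only: hash the columns that identify a trip
MD5(CONCAT(
    COALESCE(CAST(VendorID AS STRING), ''),
    COALESCE(CAST(tpep_pickup_datetime AS STRING), ''),
    COALESCE(CAST(tpep_dropoff_datetime AS STRING), ''),
    COALESCE(CAST(PULocationID AS STRING), ''),
    COALESCE(CAST(DOLocationID AS STRING), '')
)) AS unique_row_id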

Step 4: The Merge (Same Pattern, New Platform)

- id: bq_yellow_merge
  type: io.kestra.plugin.gcp.bigquery.Query
  sql: |
    MERGE INTO `...yellow_tripdata` T
    USING `...yellow_tripdata_2019_01` S
    ON T.unique_row_id = S.unique_row_id
    WHEN NOT MATCHED THEN
      INSERT (...) VALUES (...);

The same idempotent merge logic we used in Postgres, now executed in BigQuery at cloud scale.
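
Spelled out with only a few columns for illustration (the real statement lists every column, and the table names here simply follow the pattern of the earlier steps), the idea is: insert only the source rows whose unique_row_id does not yet exist in the target.

MERGE INTO `{{kv('GCP_PROJECT_ID')}}.{{kv('GCP_DATASET')}}.yellow_tripdata` T
USING `{{kv('GCP_PROJECT_ID')}}.{{render(vars.table)}}` S
ON T.unique_row_id = S.unique_row_id
WHEN NOT MATCHED THEN
  -- Illustration: the real column list covers every field, not just these four
  INSERT (unique_row_id, filename, VendorID, tpep_pickup_datetime)
  VALUES (S.unique_row_id, S.filename, S.VendorID, S.tpep_pickup_datetime);

Re-running the flow for the same month finds every unique_row_id already matched and inserts nothing, which is exactly what makes the load safe to repeat.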


⏰ Phase 4: Scheduled Production (09_gcp_taxi_scheduled.yaml)

The Dual-Schedule Trigger

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow

Identical to the Postgres version. The power of the Kestra abstraction: the scheduling layer is decoupled from the execution layer. Migrate from Postgres to BigQuery without touching your trigger configuration.

Dynamic Variables

variables:
  file: "{{inputs.taxi}}_tripdata_{{trigger.date | date('yyyy-MM')}}.csv"

Uses trigger.date to automatically process the correct monthly file. The pipeline runs itself, indefinitely.
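
For example, if the yellow schedule fires on 1 February 2025, trigger.date carries that date and the variable renders as:

file: "yellow_tripdata_2025-02.csv"   # assuming trigger.date = 2025-02-01T10:00:00

Backfills work the same way: replaying the schedule for past dates makes trigger.date take each historical month in turn, so older files (the 2019 data, say) flow through the identical logic.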


🧪 Post-Op Analysis

What did we learn from the cloud migration?

| Component | Local (Postgres) | Cloud (GCP) |
| --- | --- | --- |
| Storage | Local disk | Google Cloud Storage (effectively unlimited) |
| Database | PostgreSQL | BigQuery (serverless, petabyte-scale) |
| Credentials | Docker network | KV Store + Service Account |
| Staging | _staging table | External Table on GCS |
| Performance | Vertical scaling | Horizontal scaling |

Key Takeaways

  1. KV Store is Essential: Centralize your configuration. Never hardcode credentials.
  2. pluginDefaults Reduce Boilerplate: Define connection settings once, apply everywhere.
  3. External Tables are Powerful: Load data lazily, then materialize for performance.
  4. The Core Logic is Portable: Extract -> Transform -> Merge. The pattern survives the platform change.

When you can migrate a pipeline from local to cloud by swapping plugins and connection strings, you’ve achieved infrastructure abstraction.

This is the leap from “running on my laptop” to “operating in the cloud.”

Close the local Docker instance. Open the GCP Console. Watch your pipeline scale.

Happy building, initiate.