Cloud Transplants: The GCP Integration Protocol
SYSTEM ALERT: Local infrastructure limits detected. Processing capacity constrained. Storage nodes fragmented. Initiate cloud migration protocols for horizontal scalability.
We have all been there. Your local Postgres database hums along nicely with a few million rows. But then, the data grows. And grows. Queries slow to a crawl. Disk space runs out.
Here is the hard truth: Local databases are not designed for petabyte-scale analytics. To build for the future, we need to migrate to the cloud—where storage is infinite and compute is elastic.
At the Yellow Capsule, we treat cloud migration like a neural network upgrade: carefully planned, incrementally deployed, and centrally managed. Enter the GCP Integration Protocol.
🧬 System Scan: The Cloud Architecture
Files 06 through 09 guide us through a complete cloud migration:
| File | Purpose |
|---|---|
| `06_gcp_kv.yaml` | Configure cloud credentials and project settings. |
| `07_gcp_setup.yaml` | Provision cloud resources (GCS bucket, BigQuery dataset). |
| `08_gcp_taxi.yaml` | The core ETL pipeline, now targeting GCP. |
| `09_gcp_taxi_scheduled.yaml` | Automated, scheduled version for production. |
🛠️ Phase 1: Configuration Management (The KV Store)
Before connecting to any cloud resource, we need to securely store credentials and project configuration. Kestra’s Key-Value (KV) Store is the solution.
06_gcp_kv.yaml: Setting the Parameters
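The full file isn't reproduced here, so the snippet below is a minimal sketch of a KV-setting flow, assuming a recent Kestra version; the flow id, namespace, and key names (everything except `GCP_CREDS`, which the notes below mention) are illustrative placeholders.

```yaml
id: 06_gcp_kv          # illustrative flow id
namespace: zoomcamp    # illustrative namespace

tasks:
  # Store non-sensitive project configuration in Kestra's KV Store.
  - id: gcp_project_id
    type: io.kestra.plugin.core.kv.Set
    key: GCP_PROJECT_ID          # illustrative key name
    value: my-gcp-project-id     # replace with your project ID

  - id: gcp_bucket_name
    type: io.kestra.plugin.core.kv.Set
    key: GCP_BUCKET_NAME         # illustrative key name
    value: my-unique-bucket-name # GCS bucket names must be globally unique
```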
- `io.kestra.plugin.core.kv.Set`: Stores a key-value pair in Kestra’s internal store.
- These values are retrieved later using `{{ kv('KEY_NAME') }}`.
- Security Note: Sensitive credentials like `GCP_CREDS` (your service account JSON) should be set via the UI or a secure secrets manager, not hardcoded.
System Note: Run this workflow once to initialize your environment. It’s the “boot sector” for your cloud configuration.
🛠️ Phase 2: Infrastructure Provisioning (The Setup)
07_gcp_setup.yaml: Building the Organs
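Again, a hedged sketch rather than the exact file: task ids and KV key names are assumptions, but the plugin types, `ifExists: SKIP`, and `pluginDefaults` are the pieces the notes below describe.

```yaml
id: 07_gcp_setup       # illustrative flow id
namespace: zoomcamp    # illustrative namespace

tasks:
  # Create the GCS bucket; SKIP makes re-runs harmless.
  - id: create_gcs_bucket
    type: io.kestra.plugin.gcp.gcs.CreateBucket
    ifExists: SKIP
    name: "{{ kv('GCP_BUCKET_NAME') }}"

  # Create the BigQuery dataset, also idempotently.
  - id: create_bq_dataset
    type: io.kestra.plugin.gcp.bigquery.CreateDataset
    ifExists: SKIP
    name: "{{ kv('GCP_DATASET') }}"

pluginDefaults:
  # Applied to every io.kestra.plugin.gcp.* task in this flow.
  - type: io.kestra.plugin.gcp
    values:
      serviceAccount: "{{ kv('GCP_CREDS') }}"
      projectId: "{{ kv('GCP_PROJECT_ID') }}"
      location: "{{ kv('GCP_LOCATION') }}"
```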
- `CreateBucket`: Creates a Google Cloud Storage bucket.
- `CreateDataset`: Creates a BigQuery dataset.
- `ifExists: SKIP`: Makes the tasks idempotent. Safe to re-run without errors.
- `pluginDefaults`: Defines default credentials and settings for all GCP plugin tasks in this workflow. DRY principle applied.
🔪 Phase 3: The Cloud ETL Pipeline (08_gcp_taxi.yaml)
This is the transplanted version of our Postgres pipeline, now operating on GCP infrastructure.
Step 1: Extract & Upload to GCS
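A sketch of the pattern, not the original tasks: the URL is a stand-in, the file name is hardcoded for illustration, and the download task type assumes a recent Kestra release.

```yaml
  # (inside the flow's tasks: list)
  # Download one monthly file to Kestra's internal storage...
  - id: extract
    type: io.kestra.plugin.core.http.Download
    uri: https://example.com/nyc-taxi/yellow_tripdata_2019-01.csv   # placeholder URL

  # ...then push it straight to the GCS bucket, which acts as the staging area.
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.uri }}"
    to: "gs://{{ kv('GCP_BUCKET_NAME') }}/yellow_tripdata_2019-01.csv"
```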
- Data is downloaded locally, then immediately uploaded to the GCS bucket.
- GCS acts as the “staging area” before loading into BigQuery.
Step 2: Create External Table (The Portal)
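Something along these lines, with illustrative table and dataset names:

```yaml
  # Define an external table over the staged CSV in GCS; no data is copied.
  - id: bq_external_table
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      CREATE OR REPLACE EXTERNAL TABLE
        `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.yellow_tripdata_ext`
      OPTIONS (
        format = 'CSV',
        uris = ['gs://{{ kv('GCP_BUCKET_NAME') }}/yellow_tripdata_2019-01.csv'],
        skip_leading_rows = 1
      );
```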
System Note: An External Table in BigQuery reads data directly from GCS files without copying it. This is powerful for initial exploration and schema definition.
Step 3: Materialize to Native Table
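A sketch of the materialization step; the columns fed into the hash are illustrative yellow-taxi fields, not necessarily the ones the original file uses.

```yaml
  # Copy the external table into a native BigQuery table,
  # adding a deterministic row id and the source filename.
  - id: bq_materialize
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      CREATE OR REPLACE TABLE
        `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.yellow_tripdata_staging`
      AS
      SELECT
        MD5(CONCAT(
          COALESCE(CAST(VendorID AS STRING), ''),
          COALESCE(CAST(tpep_pickup_datetime AS STRING), ''),
          COALESCE(CAST(tpep_dropoff_datetime AS STRING), '')
        )) AS unique_row_id,
        'yellow_tripdata_2019-01.csv' AS filename,
        *
      FROM `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.yellow_tripdata_ext`;
```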
- Converts the external table to a native BigQuery table for optimized query performance.
- Adds the `unique_row_id` hash and `filename` columns for lineage tracking.
Step 4: The Merge (Same Pattern, New Platform)
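Sketched with the same illustrative names; it assumes the target table already exists (created once, up front).

```yaml
  # Insert only rows whose unique_row_id is not already in the target table,
  # so re-running a month never produces duplicates.
  - id: bq_merge
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      MERGE INTO
        `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.yellow_tripdata` T
      USING
        `{{ kv('GCP_PROJECT_ID') }}.{{ kv('GCP_DATASET') }}.yellow_tripdata_staging` S
      ON T.unique_row_id = S.unique_row_id
      WHEN NOT MATCHED THEN
        INSERT ROW;
```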
The same idempotent merge logic we used in Postgres, now executed in BigQuery at cloud scale.
⏰ Phase 4: Scheduled Production (09_gcp_taxi_scheduled.yaml)
The Dual-Schedule Trigger
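Roughly like this; the green/yellow split, cron times, and trigger ids are assumptions about how the two schedules are laid out.

```yaml
triggers:
  # One dataset on the 1st of each month at 09:00...
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"
    inputs:
      taxi: green

  # ...and the other an hour later, so the runs never overlap.
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"
    inputs:
      taxi: yellow
```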
Identical to the Postgres version. This is the power of the Kestra abstraction: the scheduling layer is decoupled from the execution layer, so you can migrate from Postgres to BigQuery without touching your trigger configuration.
Dynamic Variables
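For example, a flow-level variable built from the trigger date; the variable name and date format here are assumptions.

```yaml
variables:
  # trigger.date is the scheduled execution date, so each run
  # resolves to that month's file, e.g. yellow_tripdata_2019-01.csv.
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"
```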
Uses `trigger.date` to automatically process the correct monthly file. The pipeline runs itself, indefinitely.
🧪 Post-Op Analysis
What did we learn from the cloud migration?
| Component | Local (Postgres) | Cloud (GCP) |
|---|---|---|
| Storage | Local disk | Google Cloud Storage (infinite) |
| Database | PostgreSQL | BigQuery (serverless, petabyte-scale) |
| Credentials | Docker network | KV Store + Service Account |
| Staging | `_staging` table | External Table on GCS |
| Performance | Vertical scaling | Horizontal scaling |
Key Takeaways
- KV Store is Essential: Centralize your configuration. Never hardcode credentials.
- `pluginDefaults` Reduce Boilerplate: Define connection settings once, apply everywhere.
- External Tables are Powerful: Load data lazily, then materialize for performance.
- The Core Logic is Portable: Extract -> Transform -> Merge. The pattern survives the platform change.
When you can migrate a pipeline from local to cloud by swapping plugins and connection strings, you’ve achieved infrastructure abstraction.
This is the leap from “running on my laptop” to “operating in the cloud.”
Close the local Docker instance. Open the GCP Console. Watch your pipeline scale.
Happy building, initiate.