Optimizing BigQuery: Partitioning, Clustering, and ML

Goal: Optimize costs and performance using Partitioning and Clustering, and build ML models directly in BigQuery.


1. Partitioning vs Clustering (Crucial for Optimization)

These are the two main ways to improve query speed and reduce cost.

Partitioning

Divides a table into segments based on the values of one column (usually a date or timestamp).

  • Benefit: When you filter by that column, BQ only reads the matching partitions and ignores the rest, as the sketch below shows.
  • Common Use: Partition by creation_date or pickup_datetime.
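
A minimal sketch of partition pruning, querying the partitioned table created in the SQL Examples further down:

SELECT COUNT(*) AS trips
FROM `my_project.dataset.partitioned_table`
WHERE DATE(tpep_pickup_datetime) = '2019-01-01';  -- only this day's partition is scanned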

Clustering

Sorts and organizes data within the table (and inside each partition, if the table is also partitioned) based on up to four columns.

  • Benefit: Keeps similar data together, which speeds up filters and aggregations; see the sketch below.
  • Common Use: Cluster by Tag, VendorID, or Region.
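
A sketch of a query that benefits from both techniques: the date filter prunes partitions, and the VendorID filter skips blocks inside each partition (total_amount is an assumed column):

SELECT VendorID, SUM(total_amount) AS revenue
FROM `my_project.dataset.clustered_table`
WHERE DATE(tpep_pickup_datetime) = '2019-01-01'  -- partition pruning
  AND VendorID = 1                               -- block skipping via clustering
GROUP BY VendorID;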

Comparison Table

| Feature | Partitioning | Clustering |
|---|---|---|
| Granularity | Coarse (day/month/year) | High (strings, IDs) |
| Cost Estimation | Known upfront ("This query will process 100 MB") | Unknown until the query finishes |
| Limit | Max 4,000 partitions per table | Max 4 clustering columns; no limit on distinct values |
| Main Use | Filter by time | Filter by high-cardinality columns (e.g., Email, ID) |
| Maintenance | Set up manually at table creation | Automatic re-clustering (free) |

SQL Examples

1. Create External Table (the data stays in Google Cloud Storage; BigQuery just reads it)

CREATE OR REPLACE EXTERNAL TABLE `my_project.dataset.external_table`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/trip_data_2019-*.csv']
);

2. Create Partitioned Table

CREATE OR REPLACE TABLE `my_project.dataset.partitioned_table`
PARTITION BY DATE(tpep_pickup_datetime) AS
SELECT * FROM `my_project.dataset.external_table`;

3. Create Partitioned AND Clustered Table

CREATE OR REPLACE TABLE `my_project.dataset.clustered_table`
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `my_project.dataset.external_table`;

2. Best Practices

To save money and time, follow these rules:

Cost Reduction

  • 🛑 Avoid SELECT *: BigQuery is columnar, so every extra column you touch costs extra money. Select only the columns you need.
  • 🔍 Filter on Partitioned Columns: Always include the partition column (e.g., the date) in your WHERE clause; both rules are combined in the sketch below.
  • ⚠️ Caution with Streaming: Streaming inserts are billed per GB, whereas batch load jobs are free.
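
Putting the first two rules together (column names are illustrative):

-- Costly: reads every column of every partition
-- SELECT * FROM `my_project.dataset.partitioned_table`;

-- Cheaper: named columns plus a partition filter
SELECT VendorID, tip_amount
FROM `my_project.dataset.partitioned_table`
WHERE DATE(tpep_pickup_datetime)
      BETWEEN '2019-01-01' AND '2019-01-31';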

Query Performance

  • Denormalize Data: Use nested and repeated fields (ARRAYs and STRUCTs) to avoid expensive JOINs; see the first sketch after this list.
  • Filter before Joining: Reduce the dataset size (subquery or CTE) before joining it with another table.
  • Largest Table First: Place the largest table first in a JOIN, then the smaller ones (a standard BigQuery optimization: the small tables can be broadcast).
  • Approximate Aggregation: For huge datasets, use APPROX_COUNT_DISTINCT() instead of COUNT(DISTINCT ...); the second sketch below combines the last three rules.
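
Two sketches of the practices above; any table or column name not seen in the earlier examples is illustrative:

-- 1. Denormalization: surcharges live inside the trip row as a repeated STRUCT,
--    so no JOIN against a separate surcharges table is needed
CREATE OR REPLACE TABLE `my_project.dataset.trips_denorm` AS
SELECT
  VendorID,
  tpep_pickup_datetime,
  [STRUCT('mta_tax' AS name, 0.5 AS amount),
   STRUCT('improvement' AS name, 0.3 AS amount)] AS surcharges
FROM `my_project.dataset.external_table`;

-- 2. Filter early, put the big table first, approximate the distinct count
WITH jan_trips AS (
  SELECT VendorID, PULocationID
  FROM `my_project.dataset.partitioned_table`
  WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-01-01' AND '2019-01-31'
)
SELECT z.Borough, APPROX_COUNT_DISTINCT(t.VendorID) AS distinct_vendors
FROM jan_trips AS t                   -- large table first
JOIN `my_project.dataset.zones` AS z  -- small (hypothetical) lookup table
  ON t.PULocationID = z.LocationID
GROUP BY z.Borough;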

3. Machine Learning in BigQuery (BQML)

You can build ML models with standard SQL inside BigQuery, without exporting the data to a separate Python environment.

The ML Workflow in SQL

  1. Create/Train Model:

    CREATE OR REPLACE MODEL `dataset.tip_model`
    OPTIONS (
      model_type='linear_reg',            -- Algorithm (linear regression)
      input_label_cols=['tip_amount'],    -- Target variable
      data_split_method='AUTO_SPLIT'      -- Automatic train/eval split
    ) AS
    SELECT *
    FROM `dataset.training_data`
    WHERE tip_amount IS NOT NULL;
  2. Evaluate Model:

    SELECT * FROM ML.EVALUATE(MODEL `dataset.tip_model`, (
      SELECT * FROM `dataset.test_data`
    ));
  3. Predict:

    SELECT * FROM ML.PREDICT(MODEL `dataset.tip_model`, (
      SELECT * FROM `dataset.new_data`
    ));
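
ML.PREDICT returns the input rows plus a predicted_<label> column, so for this model the prediction lands in predicted_tip_amount (reusing dataset.test_data from step 2 to compare against actuals):

SELECT predicted_tip_amount, tip_amount
FROM ML.PREDICT(MODEL `dataset.tip_model`, (
  SELECT * FROM `dataset.test_data`
));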

Deployment

You can export a BQML model to Cloud Storage as a TensorFlow SavedModel and serve it with the TensorFlow Serving Docker image, so it runs anywhere (Kubernetes, etc.).

Steps: authenticate (gcloud auth login) -> export the model to GCS (bq extract -m) -> download it locally (gsutil cp) -> serve it with the TensorFlow Serving Docker container.
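
The GCS export step can also be done in SQL with EXPORT MODEL (the bucket path is a placeholder):

EXPORT MODEL `dataset.tip_model`
OPTIONS (URI = 'gs://my-bucket/tip_model');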