Optimizing BigQuery: Partitioning, Clustering, and ML
Goal: Optimize costs and performance using Partitioning and Clustering, and build ML models directly in BigQuery.
1. Partitioning vs Clustering (Crucial for Optimization)
These are the two main ways to improve query speed and reduce cost.
Partitioning
Divides a table into segments based on a specific column (usually Date).
- Benefit: When you filter by date, BQ only reads that specific partition. It ignores the rest.
- Common Use: Partition by `creation_date` or `pickup_datetime`.
Clustering
Sorts and organizes data inside each partition based on one or more columns.
- Benefit: Keeps similar data together. Great for filters or aggregation.
- Common Use: Cluster by `Tag`, `VendorID`, or `Region`.
Comparison Table
| Feature | Partitioning | Clustering |
|---|---|---|
| Granularity | Coarse (Day/Month/Year). | High granularity (Strings, IDs). |
| Cost Estimation | Known upfront. (You see “This query will process 100MB”). | Unknown until query finishes. |
| Limit | Max 4,000 partitions per table. | Up to 4 clustering columns. |
| Main Use | Filter by Time. | Filter by high cardinality columns (e.g., Email, ID). |
| Maintenance | Partition scheme is fixed at table creation. | Automatic reclustering (free). |
SQL Examples
1. Create External Table (from GCS): the data stays in Google Cloud Storage; BigQuery just reads it.
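A minimal sketch (the dataset name, bucket path, and file format are illustrative):

```sql
-- Data stays in GCS; BigQuery only stores the table definition.
CREATE OR REPLACE EXTERNAL TABLE `dataset.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/yellow_tripdata_2021-*.csv']
);
```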
2. Create Partitioned Table
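A sketch that materializes the external table into a table partitioned by day (names are illustrative):

```sql
-- Partition by the date part of pickup_datetime, one partition per day.
CREATE OR REPLACE TABLE `dataset.yellow_tripdata_partitioned`
PARTITION BY DATE(pickup_datetime) AS
SELECT * FROM `dataset.external_yellow_tripdata`;
```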
3. Create Partitioned AND Clustered Table
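The same idea with clustering added, assuming the `VendorID` column mentioned above (names are illustrative):

```sql
-- Partition by day, then sort rows within each partition by VendorID.
CREATE OR REPLACE TABLE `dataset.yellow_tripdata_partitioned_clustered`
PARTITION BY DATE(pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM `dataset.external_yellow_tripdata`;
```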
2. Best Practices
To save money and time, follow these rules:
Cost Reduction
- 🛑 Avoid `SELECT *`: BigQuery is columnar, so every extra column processed costs extra money. Select only the columns you need.
- 🔍 Filter on Partitioned Columns: Always include the partition column (e.g., the date) in your `WHERE` clause.
- ⚠️ Caution with Streaming: Streaming inserts cost more than batch loads.
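The first two rules combined, using the illustrative partitioned table from above:

```sql
-- Explicit columns + a filter on the partition column:
-- BigQuery reads only two columns from a single day's partition.
SELECT VendorID, tip_amount
FROM `dataset.yellow_tripdata_partitioned`
WHERE DATE(pickup_datetime) = '2021-06-15';
```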
Query Performance
- Denormalize Data: Use Nested and Repeated fields (Arrays and Structs) to avoid expensive JOINs.
- Filter before Joining: Reduce the dataset size (subquery or CTE) before joining with another table.
- Largest Table First: Place the largest table first in a JOIN, followed by the smaller ones, so BigQuery's optimizer can broadcast the smaller tables.
- Approximate Aggregation: For huge datasets, use `APPROX_COUNT_DISTINCT()` instead of `COUNT(DISTINCT ...)`.
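Side by side (table name illustrative), the approximate version trades a small statistical error for far less memory and cost:

```sql
-- Exact, but expensive on billions of rows:
SELECT COUNT(DISTINCT VendorID) FROM `dataset.yellow_tripdata_partitioned`;

-- Approximate (HyperLogLog++), with a small estimation error:
SELECT APPROX_COUNT_DISTINCT(VendorID) FROM `dataset.yellow_tripdata_partitioned`;
```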
3. Machine Learning in BigQuery (BQ ML)
You can build ML models using standard SQL inside BigQuery. No need to export data to Python.
The ML Workflow in SQL
- Create/Train Model:

```sql
CREATE OR REPLACE MODEL `dataset.tip_model`
OPTIONS (
  model_type = 'linear_reg',          -- Algorithm (Linear Regression)
  input_label_cols = ['tip_amount'],  -- Target variable
  data_split_method = 'AUTO_SPLIT'    -- Automatic train/test split
) AS
SELECT * FROM `dataset.training_data`
WHERE tip_amount IS NOT NULL;
```
- Evaluate Model:

```sql
SELECT *
FROM ML.EVALUATE(MODEL `dataset.tip_model`, (
  SELECT * FROM `dataset.test_data`
));
```
- Predict:

```sql
SELECT *
FROM ML.PREDICT(MODEL `dataset.tip_model`, (
  SELECT * FROM `dataset.new_data`
));
```
Deployment
You can export BQML models to Docker containers (TensorFlow Serving) to run them anywhere (Kubernetes, etc.).
Steps:
`gcloud auth` -> Export to GCS (`bq extract`) -> Download locally -> Run via Docker.
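A sketch of those steps (the model name, bucket, and local paths are placeholders):

```shell
# 1. Authenticate
gcloud auth login
# 2. Export the trained BQML model to GCS
bq extract -m dataset.tip_model gs://my-bucket/tip_model
# 3. Download it locally
mkdir -p /tmp/model
gsutil cp -r gs://my-bucket/tip_model /tmp/model
# 4. Serve it with TensorFlow Serving (it expects a numeric version subdirectory)
docker run -p 8501:8501 \
  -v /tmp/model/tip_model:/models/tip_model/1 \
  -e MODEL_NAME=tip_model -t tensorflow/serving
```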