BigQuery Internals: The Serverless Data Warehouse
Contents
Goal: Understand Data Warehouses (OLAP) and the internal architecture of BigQuery.
1. OLTP vs OLAP
Data processing systems are divided into two main types.
| Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|---|---|
| Purpose | Run day-to-day business operations (Real-time). | Analyze data, discover trends, support decisions. |
| User | Front-end apps, customers, clerks. | Data Analysts, Data Scientists, Managers. |
| Updates | Fast, short, frequent updates. | Periodic, long-running batch jobs. |
| Data Size | Small (Gigabytes). | Large (Terabytes/Petabytes). |
| Design | Normalized (to concise redundancy). | Denormalized (Star/Snowflake schema). |
| Example | PostgreSQL, MySQL (Backend DBs). | BigQuery, Snowflake, Redshift. |
2. What is BigQuery?
BigQuery is a Serverless, highly scalable Data Warehouse by Google Cloud.
- Serverless: No servers to manage. Google handles the infrastructure.
- Separation of Compute & Storage: You pay for storage and compute (queries) separately. This saves money because storage is very cheap.
BigQuery Internals (Architecture)
BigQuery is fast because it uses 4 key technologies working together:
flowchart TD
subgraph Compute
D[Dremel] -->|Executes SQL| S[Slots & Mixers]
end
subgraph Network
J[Jupiter Network] -->|1TB/sec speed| Compute
J -->|Connects| Storage
end
subgraph Storage
C[Colossus] -->|Stores Data| ColumnarFormat
end
subgraph Management
B[Borg] -->|Orchestrates Resources| D
B -->|Manages| C
end
flowchart TD
subgraph Compute
D[Dremel] -->|Executes SQL| S[Slots & Mixers]
end
subgraph Network
J[Jupiter Network] -->|1TB/sec speed| Compute
J -->|Connects| Storage
end
subgraph Storage
C[Colossus] -->|Stores Data| ColumnarFormat
end
subgraph Management
B[Borg] -->|Orchestrates Resources| D
B -->|Manages| C
end
flowchart TD
subgraph Compute
D[Dremel] -->|Executes SQL| S[Slots & Mixers]
end
subgraph Network
J[Jupiter Network] -->|1TB/sec speed| Compute
J -->|Connects| Storage
end
subgraph Storage
C[Colossus] -->|Stores Data| ColumnarFormat
end
subgraph Management
B[Borg] -->|Orchestrates Resources| D
B -->|Manages| C
endflowchart TD
subgraph Compute
D[Dremel] -->|Executes SQL| S[Slots & Mixers]
end
subgraph Network
J[Jupiter Network] -->|1TB/sec speed| Compute
J -->|Connects| Storage
end
subgraph Storage
C[Colossus] -->|Stores Data| ColumnarFormat
end
subgraph Management
B[Borg] -->|Orchestrates Resources| D
B -->|Manages| C
end
- Colossus (Storage): Google’s global storage system. Cheap. Stores data in Columnar Format.
- Columnar vs Record-oriented: CSV is Record-oriented (good for writing). BigQuery is Columnar (good for reading specific columns).
- Dremel (Compute): The engine that executes SQL. It splits queries into a tree structure to run in parallel.
- Jupiter (Network): Extremely fast network moving data between Compute and Storage.
- Borg: Manages all the hardware resources (Precursor to Kubernetes).