Snowflake Architecture: Solving Pain Points With 'Separation of Storage and Compute'
Goal: Understand Snowflake’s Cloud-Native architecture, specifically the “Separation of Storage and Compute,” and how it solves traditional data warehousing pain points like concurrency and scaling.
1. The Architecture: A Hybrid Approach
Snowflake is not just a “Shared-Disk” or “Shared-Nothing” architecture; it is a Multi-Cluster, Shared Data architecture. It consists of three distinct layers:
```mermaid
flowchart TB
    L1["`**Layer 1: Cloud Services**`"]
    Auth[Authentication] & Opt[Optimizer] & Meta[Metadata] & Sec[Security]
    L2["`**Layer 2: Virtual Warehouses**`"]
    VW1[Marketing] & VW2[Finance] & VW3[ETL Jobs]
    L3["`**Layer 3: Database Storage**`"]
    S3[Object Storage]
    L1 --- Auth & Opt & Meta & Sec
    L2 --- VW1 & VW2 & VW3
    L3 --- S3
    L1 -->|Manages| L2
    L1 -->|Manages| L3
    L2 <-->|Reads/Writes| L3
    style L1 fill:#4a90d9,stroke:#2d5a8a,color:#fff
    style L2 fill:#7c3aed,stroke:#5b21b6,color:#fff
    style L3 fill:#059669,stroke:#047857,color:#fff
```
- Database Storage: Underlying storage (S3/Blob/GCS). Cheap and infinite.
- Query Processing: Virtual Warehouses (Compute). Massively Parallel Processing (MPP) clusters.
- Cloud Services: The “Brain”. Handles metadata, security, optimization, and transactions.
2. Pain Point: Scalability & Cost
Traditional Problem: In legacy architectures (Shared-Nothing), storage and compute are coupled. If you run out of disk space, you must add more nodes (which adds expensive CPU/RAM you might not need). If you need more processing power for a complex report, you also have to pay for the attached storage.
Snowflake Solution: Separation of Storage and Compute.
- Independent Scaling: You can resize your Compute (resize a Virtual Warehouse from X-Small to 4X-Large) instantly without moving data.
- Cost Efficiency: You pay for storage at S3 prices (very cheap) and compute only for the seconds you use it. When no queries are running, you can auto-suspend compute to pay $0.
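The resize and auto-suspend behavior described above can be sketched in Snowflake SQL. This is a minimal illustration; the warehouse name `reporting_wh` is made up for the example:

```sql
-- Create a small warehouse that suspends after 60 idle seconds
-- and resumes automatically when the next query arrives.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60      -- seconds of inactivity before suspending
  AUTO_RESUME    = TRUE;

-- Scale compute up for a heavy report. No data is moved: the data
-- stays in remote object storage, independent of the compute nodes.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Scale back down when finished; storage costs are unaffected either way.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';
```

Because resizing only swaps the compute cluster, it takes effect in seconds rather than requiring a data-redistribution step as in coupled architectures.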
3. Pain Point: Concurrency (The “Monday Morning” Problem)
Traditional Problem: When the Finance team, Marketing team, and ETL jobs all try to run queries at 9 AM Monday, the system slows down. Queries queue up because resources are finite and shared.
Snowflake Solution: Multi-Cluster Warehouses.
- Isolation: You can create separate Virtual Warehouses for different teams. The “Marketing VW” does not compete with the “Finance VW”.
- Auto-Scaling: If the “Finance VW” is overwhelmed by a spike in concurrent queries, Snowflake can automatically spin up additional same-size clusters (e.g., Finance Cluster 1, Finance Cluster 2) and distribute queries across them, then spin them down when demand drops.
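The multi-cluster setup above maps directly to warehouse DDL. A hedged sketch (the warehouse name and sizing are illustrative):

```sql
-- A multi-cluster warehouse for Finance: one cluster in quiet periods,
-- up to three during the Monday-morning rush.
CREATE WAREHOUSE finance_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY    = 'STANDARD';  -- favor starting clusters over queuing queries
```

With `SCALING_POLICY = 'STANDARD'`, Snowflake starts an additional cluster as soon as queries begin to queue; `'ECONOMY'` instead waits, trading some latency for lower credit consumption.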
4. Design: Hot vs. Cold Data Handling
Snowflake optimizes performance through a multi-tier caching strategy, managing “Hot” and “Cold” data automatically.
- Cold Data (Remote Storage): All data lives permanently in the Remote Object Storage (Layer 3). This is high latency but low cost.
- Hot Data (Local Disk/Cache): When a Virtual Warehouse executes a query, it caches the necessary Micro-partitions on the local SSDs of the compute nodes.
- First Run: Slower (fetches from Remote Storage).
- Subsequent Runs: Fast (reads from Local SSD Cache).
- Result Cache: If the exact same query is run again (and data hasn’t changed), the result is returned instantly from the Cloud Services layer (no compute used).
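The three tiers can be observed with an ordinary query. A sketch, assuming a hypothetical `sales` table:

```sql
-- First run: the warehouse fetches the needed micro-partitions from
-- remote object storage and caches them on its local SSDs.
SELECT COUNT(*) FROM sales WHERE sale_date >= '2024-01-01';

-- Identical re-run (data unchanged): answered from the Result Cache in
-- the Cloud Services layer -- no warehouse compute is consumed.
SELECT COUNT(*) FROM sales WHERE sale_date >= '2024-01-01';

-- To benchmark the warehouse-level SSD cache in isolation,
-- disable the result cache for the session:
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
```

Note that the local SSD cache belongs to a specific warehouse and is lost when that warehouse suspends, whereas the Result Cache is shared globally and survives warehouse suspension.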
Micro-partitions
Snowflake automatically divides all tables into small, immutable files called Micro-partitions, each holding roughly 50–500 MB of uncompressed data.
- Pruning: Metadata allows Snowflake to skip huge portions of data that don’t match the query filter (similar to Partitioning in BigQuery but automatic).
- No Indexes: There are no indexes to manage or rebuild.
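Pruning can be seen in the query profile of any selective query. A sketch, again using the hypothetical `sales` table:

```sql
-- The date filter lets Snowflake skip every micro-partition whose
-- stored min/max metadata for sale_date falls outside the range.
-- Compare "Partitions scanned" vs. "Partitions total" in the query
-- profile to see how much data was pruned.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date BETWEEN '2024-06-01' AND '2024-06-07';
```

Because the min/max metadata is maintained automatically as data is loaded, this pruning needs no index creation or maintenance from the user.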