Dev.to

Study Notes 3.2.2: BigQuery Internal Architecture

Feb 11, 2025 - 22:42

Core Components

1. Storage (Colossus)

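- Google's distributed file system; BigQuery stores table data on it in a columnar format
- Separated from compute resources
- Cost-effective for storing large volumes of data
- Billed separately from compute: storage charges depend on the volume of data stored, not on query activity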

2. Network Infrastructure (Jupiter)

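- High-speed internal network within Google's data centers
- Bandwidth: Google cites roughly 1 petabit per second of total bisection bandwidth (commonly quoted in course notes as ~1 TB/s)
- Enables efficient communication between separated compute and storage
- Critical for maintaining low query latency, since every query has to pull its data across the network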
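
To see why network bandwidth matters once storage and compute are separated, here is a small back-of-the-envelope sketch. The scan size and both throughput values are illustrative assumptions chosen for round numbers, not measurements of Jupiter.

```python
def transfer_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes over a link with the given throughput."""
    return num_bytes / bytes_per_second

TB = 10 ** 12
GB = 10 ** 9

scan_size = 1 * TB  # hypothetical amount of data a query must read

slow = transfer_seconds(scan_size, 10 * GB)  # hypothetical slower fabric
fast = transfer_seconds(scan_size, 1 * TB)   # hypothetical faster fabric

print(slow, fast)  # 100.0 1.0
```

With compute and storage on different machines, the time spent moving data is a floor on query latency, so a faster fabric lowers that floor directly.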

3. Query Engine (Dremel)

- Handles query execution and processing
- Uses tree-based architecture for query distribution
- Breaks down complex queries into smaller subqueries
Components:
- Root server: Initial query reception and planning
- Mixers: Query subdivision and result aggregation
- Leaf nodes: Direct data access and basic operations

Storage Architecture

Column-Oriented vs Record-Oriented Storage

Record-Oriented (Traditional)
- Similar to CSV structure
- Data stored row by row
- Better for full record retrieval
Column-Oriented (BigQuery's Approach)
- Data stored by columns
- Advantages:
  - Faster column-based aggregations
  - Efficient for queries that access only a subset of columns, because unused columns are never read
  - Better compression, since values within a column share a type and often repeat
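
The difference between the two layouts can be sketched with a toy table. This is only an illustration in plain Python: the table contents, the helper names, and the use of run-length encoding as the compression example are all assumptions, and BigQuery's actual storage format is far more sophisticated.

```python
# Toy data: the same three-field table, first as row records.
rows = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "US", "amount": 20},
    {"user": "c", "country": "DE", "amount": 5},
    {"user": "d", "country": "DE", "amount": 15},
]

def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

def sum_row_oriented(records, field):
    """Sum one field; every full record is read to get at it."""
    total, values_touched = 0, 0
    for record in records:
        values_touched += len(record)  # the whole row is read
        total += record[field]
    return total, values_touched

def sum_column_oriented(columns, field):
    """Sum one field; only that column is read."""
    column = columns[field]
    return sum(column), len(column)

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

columns = to_columns(rows)
row_total, row_touched = sum_row_oriented(rows, "amount")
col_total, col_touched = sum_column_oriented(columns, "amount")

print(row_total, row_touched)  # 50 12
print(col_total, col_touched)  # 50 4
print(run_length_encode(columns["country"]))  # [['US', 2], ['DE', 2]]
```

Both layouts give the same sum, but the columnar version touches 4 values instead of 12, and the repetitive `country` column collapses to two runs.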

Query Processing Workflow

Query Submission
- Root server receives query
- Initial query analysis and planning
Query Distribution
- Root server breaks the query down into subqueries
- Mixers divide these further into smaller operations
- Leaf nodes receive specific tasks
Data Processing
- Leaf nodes read the required data from Colossus
- Execute assigned operations
- Return partial results to mixers
Result Aggregation
- Mixers combine results from leaf nodes
- Root server performs final aggregation
- Returns complete result set
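
The workflow above can be sketched as a tiny in-process simulation of a `COUNT(*) ... GROUP BY` query. This is a minimal sketch, not how Dremel is implemented: the shard data, the function names `leaf`, `mixer`, and `root`, and the `fanout` parameter are all hypothetical, and the real system runs these stages on separate machines in parallel.

```python
from collections import Counter

# Hypothetical shards standing in for data the leaf nodes read from storage.
SHARDS = [
    ["US", "DE", "US"],
    ["FR", "US"],
    ["DE", "DE"],
    ["FR"],
]

def leaf(shard):
    """Leaf node: scan one shard and return a partial count per key."""
    return Counter(shard)

def mixer(shards):
    """Mixer: fan out to leaves, then merge their partial results."""
    merged = Counter()
    for shard in shards:
        merged.update(leaf(shard))  # Counter.update adds counts
    return merged

def root(shards, fanout=2):
    """Root: split the work among mixers and do the final aggregation."""
    groups = [shards[i:i + fanout] for i in range(0, len(shards), fanout)]
    final = Counter()
    for group in groups:
        final.update(mixer(group))
    return dict(final)

result = root(SHARDS)
print(result)  # {'US': 3, 'DE': 3, 'FR': 2}
```

Counting works here because partial counts can be merged in any grouping and still give the same total, which is why the result does not depend on how many shards each mixer handles.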

Key Benefits

Performance
- Distributed query processing
- High-speed network infrastructure
- Efficient columnar storage
Cost Efficiency
- Separated storage and compute
- Pay primarily for query processing (on-demand queries are billed by the amount of data they process)
- Economical data storage
Scalability
- Distributed architecture
- Efficient handling of large datasets
- Automatic resource management
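
Columnar storage and pay-per-query billing interact: a query that reads fewer columns processes fewer bytes. The sketch below illustrates the arithmetic under stated assumptions. The column sizes are made up, and `price_per_tib` is a placeholder value rather than a quoted BigQuery price, so check the current pricing page for real numbers.

```python
TIB = 2 ** 40  # bytes in one tebibyte

def bytes_scanned(column_sizes, selected):
    """Total bytes read when only the selected columns are scanned."""
    return sum(column_sizes[name] for name in selected)

def query_cost(column_sizes, selected, price_per_tib):
    """Cost of a scan, billed per TiB processed (simplified model)."""
    return bytes_scanned(column_sizes, selected) / TIB * price_per_tib

# Hypothetical table: per-column sizes in bytes.
column_sizes = {
    "user_id": 2 * TIB,
    "country": 1 * TIB,
    "payload": 7 * TIB,
}
price_per_tib = 5.0  # placeholder, not an actual price

narrow = query_cost(column_sizes, ["country"], price_per_tib)
wide = query_cost(column_sizes, list(column_sizes), price_per_tib)

print(narrow, wide)  # 5.0 50.0
```

In this model, selecting one column instead of all three cuts the cost tenfold, which is the practical reason to avoid `SELECT *` on wide tables.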

Best Practices Note

While understanding internals isn't mandatory for basic usage, it can be valuable for:

- Building optimized data products
- Making informed architectural decisions
- Understanding performance characteristics
- Implementing cost-effective solutions

This architecture enables BigQuery to handle massive datasets efficiently while maintaining quick query response times through its distributed processing approach.