Study Notes 3.2.2: BigQuery Internal Architecture

Core Components
1. Storage (Colossus)
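- Google's distributed file system; BigQuery stores table data on it in a columnar format
- Separated from compute resources, so storage and compute scale independently
- Highly cost-effective for data storage
- Cost optimization: storage and compute are billed separately, so data at rest incurs only storage charges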
2. Network Infrastructure (Jupiter)
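- High-speed internal network within Google's data centers
- Bandwidth: often quoted as ~1 terabyte per second; Google has described Jupiter as providing on the order of 1 petabit per second of total bisection bandwidth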
- Enables efficient communication between separated compute and storage
- Critical for maintaining low query latency
3. Query Engine (Dremel)
- Handles query execution and processing
- Uses tree-based architecture for query distribution
- Breaks down complex queries into smaller subqueries
- Components:
  - Root server: Initial query reception and planning
  - Mixers: Query subdivision and result aggregation
  - Leaf nodes: Direct data access and basic operations
Storage Architecture
Column-Oriented vs Record-Oriented Storage
- Record-Oriented (Traditional)
  - Similar to CSV structure
  - Data stored row by row
  - Better for full record retrieval
- Column-Oriented (BigQuery's Approach)
  - Data stored by columns
  - Advantages:
    - Faster column-based aggregations
    - Efficient for queries accessing a subset of columns, since unreferenced columns are not read
    - Better compression, because values of the same type and similar content sit together
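The difference between the two layouts can be illustrated with a small Python sketch. This is a toy model only, not how BigQuery physically stores data; the table contents and all function names are made up for illustration.

```python
# Toy table: three records with three fields each (hypothetical data).
rows = [
    {"user_id": 1, "country": "DE", "amount": 10.0},
    {"user_id": 2, "country": "DE", "amount": 25.5},
    {"user_id": 3, "country": "US", "amount": 4.5},
]


def sum_amount_row_store(records):
    # Record-oriented: every full record is visited even though only
    # the "amount" field is needed.
    return sum(record["amount"] for record in records)


def to_columns(records):
    # Pivot the records so that all values of one column sit together.
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_amount_column_store(columns):
    # Column-oriented: only the "amount" column is read.
    return sum(columns["amount"])


def run_length_encode(values):
    # Adjacent equal values collapse into [value, count] pairs, one simple
    # reason a column of similar values tends to compress well.
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
```

Both sum functions return the same total, but the column-oriented version never touches `user_id` or `country`. Run-length encoding is used here only as one simple example of column-friendly compression.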
Query Processing Workflow
1. Query Submission
  - Root server receives query
  - Initial query analysis and planning
2. Query Distribution
  - Root server breaks down the query into subqueries
  - Mixers further divide these into smaller operations
  - Leaf nodes receive specific tasks
3. Data Processing
  - Leaf nodes read the required data from Colossus
  - Execute assigned operations
  - Return partial results to mixers
4. Result Aggregation
  - Mixers combine results from leaf nodes
  - Root server performs final aggregation
  - Returns complete result set
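The workflow above can be sketched as a small tree aggregation in Python. This is a simplified, single-process illustration of the root / mixer / leaf roles, not Dremel's actual implementation; `SHARDS`, `fan_out`, and all function names are hypothetical.

```python
# Hypothetical stand-in for one column's data, split into storage shards.
SHARDS = [[3, 1, 4], [1, 5], [9, 2, 6], [5, 3]]


def leaf(shard):
    # Leaf node: reads its shard and returns a partial (sum, count).
    return sum(shard), len(shard)


def combine(partials):
    # Merge several partial (sum, count) pairs into a single pair.
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total, count


def mixer(shards):
    # Mixer: hands each shard to a leaf, then combines the partial results.
    return combine([leaf(shard) for shard in shards])


def root(shards, fan_out=2):
    # Root: splits the work among mixers, then performs the final
    # aggregation to answer "average of all values".
    groups = [shards[i:i + fan_out] for i in range(0, len(shards), fan_out)]
    total, count = combine([mixer(group) for group in groups])
    return total / count


average = root(SHARDS)
```

Leaves return a sum and a count rather than a local average because averages of unequal-sized shards cannot simply be averaged again; carrying partial state up the tree is what lets each level aggregate independently.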
Key Benefits
- Performance
  - Distributed query processing
  - High-speed network infrastructure
  - Efficient columnar storage
- Cost Efficiency
  - Separated storage and compute
  - Pay primarily for query processing
  - Economical data storage
- Scalability
  - Distributed architecture
  - Efficient handling of large datasets
  - Automatic resource management
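To connect the cost and columnar points, here is a rough Python sketch of how the amount of data a query reads depends on which columns it references. It assumes a billing model based on bytes scanned; the column sizes are invented, and `price_per_tb` is a placeholder input rather than a real price.

```python
# Hypothetical per-column sizes in bytes for one table (not measurements).
COLUMN_BYTES = {
    "user_id": 8 * 10**9,
    "country": 2 * 10**9,
    "amount": 8 * 10**9,
    "payload": 82 * 10**9,
}


def bytes_scanned(selected_columns, column_bytes=COLUMN_BYTES):
    # A columnar engine reads only the columns a query references.
    return sum(column_bytes[name] for name in selected_columns)


def estimated_cost(selected_columns, price_per_tb):
    # price_per_tb is a placeholder; check current pricing before relying
    # on any number produced here.
    return bytes_scanned(selected_columns) / 10**12 * price_per_tb


narrow = bytes_scanned(["country", "amount"])  # two small columns
full = bytes_scanned(COLUMN_BYTES)             # equivalent of SELECT *
```

In this made-up table, selecting two columns reads a tenth of the bytes that selecting every column does, which is why avoiding `SELECT *` is a common cost-saving habit.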
Best Practices Note
While understanding internals isn't mandatory for basic usage, it can be valuable for:
- Building optimized data products
- Making informed architectural decisions
- Understanding performance characteristics
- Implementing cost-effective solutions
This architecture enables BigQuery to handle massive datasets efficiently while maintaining quick query response times through its distributed processing approach.