Data Partitioning
Quick Reference
| Partitioning Type | Method | Use Case | Pros | Cons |
|---|---|---|---|---|
| Horizontal | Split rows | Large tables | Scale reads/writes | Complex queries |
| Vertical | Split columns | Different access patterns | Optimize access | Joins required |
| Range | By value range | Time-series, ordered data | Simple, efficient | Hotspots possible |
| Hash | Hash function | Even distribution | Balanced load | Range queries hard |
| Directory | Lookup table | Flexible mapping | Flexible | Single point of failure |
Clear Definition
Data Partitioning is the technique of dividing a database into smaller, more manageable pieces called partitions. Each partition can be stored and accessed independently, improving performance, manageability, and scalability. Partitioning can be done horizontally (splitting rows) or vertically (splitting columns).
💡 Key Insight: Partitioning is essential for scaling large databases. It allows parallel processing, reduces query scope, and enables distributed storage.
Core Concepts
Horizontal Partitioning (Sharding)
Definition: Splitting a table by rows, so that each partition contains a subset of rows.
Example:
Users Table (Original)
├── Partition 1: user_id 1-1000
├── Partition 2: user_id 1001-2000
└── Partition 3: user_id 2001-3000
Characteristics:
- Each partition has same schema
- Rows distributed across partitions
- Queries may need to access multiple partitions
- Enables parallel processing
Vertical Partitioning
Definition: Splitting a table by columns, so that each partition contains a subset of columns.
Example:
Users Table (Original)
├── Partition 1: user_id, name, email (frequently accessed)
└── Partition 2: user_id, bio, preferences (rarely accessed)
Characteristics:
- Different partitions have different columns
- Joins required to reconstruct full row
- Optimizes for access patterns
- Reduces I/O for frequent queries
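The split-and-join behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a database implementation: the two dictionaries stand in for the two partitions, and the names `hot_store`, `cold_store`, `insert_user`, `get_contact`, and `get_full_user` are hypothetical.

```python
# Column groups from the example above (assumed split: hot = frequently read).
HOT_COLUMNS = ("name", "email")
COLD_COLUMNS = ("bio", "preferences")

# In-memory stand-ins for the two vertical partitions, both keyed by user_id.
hot_store = {}
cold_store = {}

def insert_user(row):
    """Write one logical row as two physical rows that share the same key."""
    user_id = row["user_id"]
    hot_store[user_id] = {col: row[col] for col in HOT_COLUMNS}
    cold_store[user_id] = {col: row[col] for col in COLD_COLUMNS}

def get_contact(user_id):
    """Frequent query: reads only the hot partition."""
    return hot_store[user_id]

def get_full_user(user_id):
    """Rare query: joins both partitions on user_id to rebuild the full row."""
    return {"user_id": user_id, **hot_store[user_id], **cold_store[user_id]}

insert_user({
    "user_id": 1,
    "name": "Ada",
    "email": "ada@example.com",
    "bio": "Mathematician",
    "preferences": {"theme": "dark"},
})
```

The frequent path never touches `cold_store`, which is where the I/O saving comes from; the price is the extra lookup in `get_full_user`.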
Partitioning Strategies
1. Range Partitioning
How it Works: Partition based on value ranges.
Example:
-- Partition by date ranges
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    ...
) PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023)
);
Use Cases: Time-series data, ordered data, date-based queries
Pros: Simple, efficient for range queries, easy to add new partitions
Cons: Can create hotspots (all recent data in one partition)
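To see why range partitioning suits range queries, here is a rough Python sketch of the routing a database performs internally. It assumes the same three yearly partitions as the SQL above; `partition_for` and `partitions_for_range` are illustrative names, not part of any database API.

```python
from bisect import bisect_right
from datetime import date

# Exclusive upper bounds, mirroring the VALUES LESS THAN clauses above.
UPPER_BOUNDS = [2021, 2022, 2023]
PARTITIONS = ["p2020", "p2021", "p2022"]

def partition_for(order_date: date) -> str:
    """Return the first partition whose upper bound exceeds the order year."""
    index = bisect_right(UPPER_BOUNDS, order_date.year)
    if index == len(PARTITIONS):
        raise ValueError(f"no partition covers year {order_date.year}")
    return PARTITIONS[index]

def partitions_for_range(start: date, end: date) -> list[str]:
    """Partition pruning: only partitions overlapping [start, end] are scanned."""
    low = bisect_right(UPPER_BOUNDS, start.year)
    high = bisect_right(UPPER_BOUNDS, end.year)
    return PARTITIONS[low:high + 1]
```

A query for mid-2021 to early 2022 touches two partitions and skips the rest. The hotspot risk is visible too: every order dated in the current year routes to the same, newest partition.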
2. Hash Partitioning
How it Works: Apply a hash function to the partition key and use the result to pick the partition.
Example:
-- Partition by hash of user_id
CREATE TABLE users (
    user_id INT,
    ...
) PARTITION BY HASH(user_id) PARTITIONS 4;
Use Cases: Even distribution, avoiding hotspots
Pros: Balanced load distribution, avoids hotspots
Cons: Range queries require scanning all partitions; adding or removing partitions is hard because, with simple modulo hashing, changing the partition count remaps most keys
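Both the balanced load and the resizing problem can be demonstrated with a short Python sketch. It assumes a hash-then-modulo scheme built on SHA-256, chosen here only because it is stable across runs; it is not the function any particular database uses. `partition_for` and `moved_fraction` are hypothetical helper names.

```python
import hashlib

def partition_for(key, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash followed by modulo."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys that land in a different partition after resizing."""
    keys = list(keys)
    moved = sum(
        1 for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    )
    return moved / len(keys)

# Sequential user IDs spread roughly evenly across 4 partitions.
counts = [0, 0, 0, 0]
for user_id in range(10_000):
    counts[partition_for(user_id, 4)] += 1

# Growing from 4 to 5 partitions relocates most keys.
fraction = moved_fraction(range(10_000), 4, 5)
```

With a uniform hash, a key keeps its partition when going from 4 to 5 partitions only about one time in five, so roughly four fifths of the data has to move. Consistent hashing (see Related Topics) exists to reduce that cost.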
3. List Partitioning
How it Works: Partition based on specific values.
Example:
-- Partition by region
-- (In MySQL, partitioning on a string column needs LIST COLUMNS;
--  plain LIST accepts only integer expressions.)
CREATE TABLE customers (
    customer_id INT,
    region VARCHAR(50),
    ...
) PARTITION BY LIST COLUMNS (region) (
    PARTITION p_north VALUES IN ('NY', 'MA', 'CT'),
    PARTITION p_south VALUES IN ('TX', 'FL', 'GA'),
    PARTITION p_west VALUES IN ('CA', 'OR', 'WA')
);
Use Cases: Geographic data, categorical data
Pros: Natural grouping, efficient for category-based queries
Cons: Requires manual partition management
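The "manual partition management" cost is easy to see in a small Python sketch of list-based routing. It assumes the same region lists as the SQL above; `REGION_LISTS` and `partition_for` are illustrative names only.

```python
# Explicit value lists per partition, as in the VALUES IN clauses above.
REGION_LISTS = {
    "p_north": ("NY", "MA", "CT"),
    "p_south": ("TX", "FL", "GA"),
    "p_west": ("CA", "OR", "WA"),
}

# Inverted index: region value -> partition name.
PARTITION_BY_REGION = {
    region: partition
    for partition, regions in REGION_LISTS.items()
    for region in regions
}

def partition_for(region: str) -> str:
    """Route a row by exact match; unlisted values have nowhere to go."""
    try:
        return PARTITION_BY_REGION[region]
    except KeyError:
        raise ValueError(f"no partition lists region {region!r}") from None
```

A value such as `'IL'` is rejected until someone adds it to a list, which is the maintenance burden noted in the Cons line.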
4. Directory-Based Partitioning
How it Works: Use a lookup table (the directory) to map keys to partitions.
Example:
Partition Directory:
user_id 1-1000 → Partition 1
user_id 1001-2000 → Partition 2
user_id 2001-3000 → Partition 3
Use Cases: Flexible partitioning, dynamic rebalancing
Pros: Flexible, easy to rebalance, can handle uneven distribution
Cons: Lookup overhead, single point of failure for directory
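A directory can be sketched as a small Python class. This is a simplified, single-process illustration under the assumption that the directory stores inclusive `user_id` ranges like the ones above; `PartitionDirectory`, `lookup`, and `reassign` are hypothetical names, and the data copy that a real rebalance needs is not shown.

```python
class PartitionDirectory:
    """Maps inclusive user_id ranges to partition names."""

    def __init__(self, ranges):
        # Each entry is (first_id, last_id, partition_name).
        self._ranges = sorted(ranges)

    def lookup(self, user_id):
        """Every read and write pays this extra lookup before reaching data."""
        for first, last, partition in self._ranges:
            if first <= user_id <= last:
                return partition
        raise KeyError(f"user_id {user_id} is not in the directory")

    def reassign(self, first, last, new_partition):
        """Rebalance by pointing an existing range at another partition."""
        for index, (start, end, _) in enumerate(self._ranges):
            if (start, end) == (first, last):
                self._ranges[index] = (start, end, new_partition)
                return
        raise KeyError(f"range {first}-{last} is not in the directory")

directory = PartitionDirectory([
    (1, 1000, "partition_1"),
    (1001, 2000, "partition_2"),
    (2001, 3000, "partition_3"),
])
```

Rebalancing is a metadata update (`reassign`) rather than a change to a hash function or range boundaries, which is where the flexibility comes from. The same object is also the single point of failure: if the directory is unavailable, no key can be routed.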
Use Cases
Horizontal Partitioning Use Cases
- Large Tables
  - Tables with millions/billions of rows
  - Improve query performance
  - Enable parallel processing
- Geographic Distribution
  - Partition by region/country
  - Place partitions closer to users
  - Comply with data residency
- Time-Series Data
  - Partition by date/time
  - Archive old partitions
  - Optimize recent data access
Vertical Partitioning Use Cases
- Wide Tables
  - Tables with many columns
  - Split frequently vs rarely accessed columns
  - Reduce I/O for common queries
- Different Access Patterns
  - Some columns accessed together
  - Others accessed separately
  - Optimize for specific queries
Advantages & Disadvantages
Advantages
✅ Performance: Queries that target a single partition scan less data
✅ Scalability: Partitions can be scaled independently
✅ Manageability: Easier backup, maintenance, and archiving
✅ Parallelism: Partitions can be processed in parallel
✅ Storage: Different partitions can use different storage
Disadvantages
❌ Complexity: More complex queries and joins
❌ Cross-Partition Queries: May need to query multiple partitions
❌ Rebalancing: Difficult to rebalance data
❌ Hotspots: Uneven distribution can create hotspots
❌ Application Changes: May require application-level changes
Best Practices
1. Choose the Right Partitioning Key
Criteria:
- Even distribution
- Aligns with query patterns
- Avoids hotspots
- Supports common access patterns
Example: For user data, partition by user_id (not by name or email)
2. Monitor Partition Sizes
- Keep partitions roughly equal size
- Monitor for hotspots
- Plan for rebalancing
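One simple way to monitor balance is to compare each partition's row count with the mean. The Python sketch below assumes row counts have already been collected from the database; the function name `skew_report` and the 1.5x threshold are arbitrary illustrative choices, not a standard.

```python
def skew_report(row_counts, threshold=1.5):
    """Return partitions whose row count exceeds threshold x the mean count.

    row_counts maps partition name -> number of rows. The result maps each
    oversized partition to its ratio against the mean.
    """
    mean = sum(row_counts.values()) / len(row_counts)
    if mean == 0:
        return {}  # all partitions are empty, so nothing is skewed
    return {
        name: count / mean
        for name, count in row_counts.items()
        if count / mean > threshold
    }

# Hypothetical counts: p3 holds twice the average and gets flagged.
oversized = skew_report({"p1": 100, "p2": 100, "p3": 400})
```

Row count is only a proxy. A partition can be average in size and still be a hotspot if it receives most of the traffic, so request rates per partition are worth tracking the same way.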
3. Plan for Growth
- Design for future scale
- Make it easy to add partitions
- Consider automatic partitioning
4. Handle Cross-Partition Queries
Strategies:
- Minimize cross-partition queries
- Use aggregation layer
- Cache results
- Denormalize data
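The "aggregation layer" strategy is often called scatter-gather: send the same sub-query to every partition, then combine the partial results. Below is a minimal Python sketch, assuming in-memory lists stand in for partitions; `PARTITIONS`, `partial_total`, and `scatter_gather_total` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for three partitions of an orders table.
PARTITIONS = {
    "p1": [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 20}],
    "p2": [{"user_id": 3, "amount": 50}],
    "p3": [{"user_id": 4, "amount": 10}, {"user_id": 5, "amount": 40}],
}

def partial_total(rows, min_amount):
    """Work done inside one partition: filter locally, return a small result."""
    return sum(row["amount"] for row in rows if row["amount"] >= min_amount)

def scatter_gather_total(partitions, min_amount=0):
    """Scatter the sub-query to all partitions in parallel, then gather."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(
            pool.map(lambda rows: partial_total(rows, min_amount),
                     partitions.values())
        )
    return sum(partials)
```

Only one number per partition crosses the network here, not the rows themselves. That works for aggregates that combine cleanly, such as sums and counts; the overall latency is still bounded by the slowest partition.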
5. Backup and Maintenance
- Backup individual partitions
- Maintain partition metadata
- Archive old partitions
- Monitor partition health
Common Pitfalls
⚠️ Common Mistake: Choosing the wrong partitioning key, which creates hotspots.
Solution: Analyze data distribution and query patterns. Use hash partitioning for even distribution.
⚠️ Common Mistake: Too many small partitions.
Solution: Balance the partition count: too many partitions increase overhead, while too few limit parallelism.
⚠️ Common Mistake: Not planning for cross-partition queries.
Solution: Design queries to minimize cross-partition access. Use aggregation or caching.
⚠️ Common Mistake: Ignoring partition maintenance.
Solution: Regularly monitor partition sizes, rebalance as needed, archive old partitions.
Interview Tips
🎯 Interview Focus: Interviewers often ask how to scale databases. Be ready to cover:
- Partitioning Types: Know horizontal vs vertical, range vs hash
- Partition Key Selection: Explain how to choose partitioning key
- Hotspots: Discuss how to avoid and handle hotspots
- Cross-Partition Queries: Explain strategies for handling them
- Rebalancing: Discuss how to rebalance partitions
Common Questions
- "How would you partition a table with 1 billion rows?"
- "What's the difference between horizontal and vertical partitioning?"
- "How do you avoid hotspots in partitioned tables?"
- "How would you handle queries that span multiple partitions?"
- "Explain hash partitioning vs range partitioning."
Related Topics
- Sharding: Distributed partitioning across servers
- Data Replication: Replication strategies
- Step 6: Consistent Hashing: Hash-based distribution
- Step 3: Consistency: Consistency in partitioned systems
Visual Aids
Horizontal Partitioning
Original Table:
┌─────────────────────────┐
│    Users (10M rows)     │
└─────────────────────────┘
Partitioned:
┌────────────┐  ┌────────────┐  ┌────────────┐
│ Partition1 │  │ Partition2 │  │ Partition3 │
│(3.3M rows) │  │(3.3M rows) │  │(3.3M rows) │
└────────────┘  └────────────┘  └────────────┘
Vertical Partitioning
Original Table:
┌────────────────────────────────────┐
│ Users: id, name, email, bio, prefs │
└────────────────────────────────────┘
Partitioned:
┌─────────────────┐  ┌────────────────┐
│ Partition 1     │  │ Partition 2    │
│ id, name, email │  │ id, bio, prefs │
└─────────────────┘  └────────────────┘
Range Partitioning Example
Orders Table:
├── p2020: orders from 2020
├── p2021: orders from 2021
├── p2022: orders from 2022
└── p2023: orders from 2023
Quick Reference Summary
Data Partitioning: Divide database into smaller pieces for performance and scalability. Horizontal partitioning splits rows, vertical partitioning splits columns.
Strategies: Range (time-series), Hash (even distribution), List (categories), Directory (flexible).
Key Consideration: Choose partitioning key carefully to avoid hotspots. Plan for cross-partition queries and rebalancing.