Sharding
Quick Reference
| Sharding Strategy | Method | Pros | Cons |
|---|---|---|---|
| Range-Based | Partition by value ranges | Simple, efficient range queries | Hotspots, uneven distribution |
| Hash-Based | Hash function on shard key | Even distribution | Range queries difficult |
| Directory-Based | Lookup table | Flexible, easy rebalancing | Lookup overhead, SPOF |
| Geo-Based | Geographic location | Low latency, compliance | Cross-region queries |
Clear Definition
Sharding is a database architecture pattern where data is horizontally partitioned across multiple independent database servers (shards). Each shard is a separate database instance that holds a subset of the total data. Sharding enables horizontal scaling by distributing data and load across multiple machines.
💡 Key Insight: Sharding is partitioning taken to the distributed level: each partition lives on a separate server. It's essential for scaling beyond single-server limits.
Core Concepts
How Sharding Works
- Shard Key Selection: Choose a field that determines shard assignment
- Sharding Function: Apply a function (hash, range, etc.) to map the key to a shard
- Data Distribution: Data is distributed across shards based on the shard key
- Query Routing: The application or a router determines which shard(s) to query (see the sketch below)
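A minimal hash-based router ties these steps together. The sketch below is illustrative, not a specific product's API: the shard registry, connection strings, and key format are all assumed.

```python
import zlib

# Hypothetical shard registry: shard_id -> connection string.
SHARDS = {
    0: "db-shard-0.internal:5432",
    1: "db-shard-1.internal:5432",
    2: "db-shard-2.internal:5432",
    3: "db-shard-3.internal:5432",
}

def shard_for(shard_key: str) -> str:
    # Stable hash: zlib.crc32 gives the same result across processes,
    # unlike Python's built-in hash(), which is randomized for strings.
    shard_id = zlib.crc32(shard_key.encode()) % len(SHARDS)
    return SHARDS[shard_id]

# The router sends the query only to the shard that owns the key.
print(shard_for("user:12345"))  # e.g. "db-shard-2.internal:5432"
```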
Sharding Strategies
1. Range-Based Sharding
How it Works: Partition data by value ranges.
Example:
Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
Pros:
- Simple to implement
- Efficient range queries
- Easy to add new shards
Cons:
- Can create hotspots (uneven distribution)
- Difficult to rebalance
- Sequential keys problematic
Use Cases: Time-series data, ordered data, when range queries are common
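Range lookup is typically a search over sorted bounds. A minimal sketch, assuming the three-shard layout above (shard names are illustrative):

```python
import bisect

# Upper bound of each shard's user_id range, matching the example above.
UPPER_BOUNDS = [1000, 2000, 3000]
SHARD_NAMES = ["shard-1", "shard-2", "shard-3"]

def shard_for(user_id: int) -> str:
    # bisect_left finds the first shard whose upper bound covers the id.
    idx = bisect.bisect_left(UPPER_BOUNDS, user_id)
    if idx == len(UPPER_BOUNDS):
        raise KeyError(f"user_id {user_id} is outside all ranges")
    return SHARD_NAMES[idx]

assert shard_for(1500) == "shard-2"
# A range query touches only the shards whose ranges overlap it.
```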
2. Hash-Based Sharding
How it Works: Apply a hash function to the shard key.
Example:
shard_id = hash(user_id) % num_shards
# user_id 12345 → hash(12345) % 4 → Shard 1
Pros:
- Even distribution
- Avoids hotspots
- Simple to implement
Cons:
- Range queries require scanning all shards
- Difficult to add/remove shards (requires rehashing)
- Cross-shard queries expensive
Use Cases: Even distribution needed, simple key-based lookups
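The rehashing cost is easy to demonstrate. With modulo placement, growing from 4 to 5 shards remaps roughly 4 out of 5 keys; the sketch below measures it (the exact fraction depends on the hash):

```python
import zlib

def shard(key: int, num_shards: int) -> int:
    # Stable hash so results are reproducible across processes.
    return zlib.crc32(str(key).encode()) % num_shards

keys = range(100_000)
moved = sum(1 for k in keys if shard(k, 4) != shard(k, 5))
print(f"{moved / len(keys):.0%} of keys change shards")  # ~80%
```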
3. Directory-Based Sharding
How it Works: Use lookup table (directory) to map keys to shards.
Example:
Shard Directory:
user_id 1-1000 → Shard 1 (Server A)
user_id 1001-2000 → Shard 2 (Server B)
user_id 2001-3000 → Shard 3 (Server C)
Pros:
- Flexible mapping
- Easy to rebalance (update directory)
- Can handle uneven distribution
Cons:
- Lookup overhead (extra query)
- Directory becomes single point of failure
- Directory can become bottleneck
Use Cases: Flexible sharding, frequent rebalancing needed
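A directory is just a mutable mapping consulted before each query. A minimal sketch, assuming the ranges above; in production the directory would live in a replicated store (e.g. ZooKeeper or etcd, as noted under Best Practices) precisely to avoid the SPOF listed above:

```python
# Hypothetical directory: (low, high) user_id range -> server.
DIRECTORY = [
    ((1, 1000), "server-a"),
    ((1001, 2000), "server-b"),
    ((2001, 3000), "server-c"),
]

def shard_for(user_id: int) -> str:
    for (low, high), server in DIRECTORY:
        if low <= user_id <= high:
            return server
    raise KeyError(f"no shard owns user_id {user_id}")

# Rebalancing is a directory update, not a rehash of the data:
DIRECTORY[2] = ((2001, 2500), "server-c")      # shrink a hot range
DIRECTORY.append(((2501, 3000), "server-d"))   # move the rest
```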
4. Consistent Hashing
How it Works: Use consistent hash ring to map data to shards.
Example: See Consistent Hashing
Pros:
- Minimal data movement when adding/removing shards
- Even distribution
- Handles shard failures gracefully
Cons:
- More complex implementation
- Virtual nodes needed for even distribution
Use Cases: Dynamic sharding, frequently adding/removing shards
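For concreteness, a compact ring sketch with virtual nodes (node names and the vnode count are illustrative; the full treatment is in the Consistent Hashing topic):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        # Each physical node gets `vnodes` points on the ring so that
        # load stays even and failures spread across survivors.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.hashes)
        return self.ring[idx][1]

ring = HashRing(["shard-1", "shard-2", "shard-3"])
print(ring.node_for("user:12345"))
# Adding a fourth shard moves only ~1/4 of the keys,
# versus ~80% with plain modulo hashing.
```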
Use Cases
When to Shard
- Single Server Limits
  - Database too large for a single server
  - Query performance degrading
  - Storage capacity limits
- High Write Throughput
  - Writes bottlenecking on a single server
  - Need to distribute write load
  - Horizontal write scaling needed
- Geographic Distribution
  - Users in different regions
  - Low-latency requirements
  - Data residency compliance
- Cost Optimization
  - Cheaper to use multiple smaller servers
  - Better resource utilization
  - Scale incrementally
Real-World Examples
- Instagram: Sharded by user_id, each shard on separate server
- Uber: Sharded by city/region for location data
- Amazon: Sharded by product category and region
- Twitter: Sharded user data by user_id ranges
Advantages & Disadvantages
Advantages
✅ Horizontal Scaling: Scale beyond single-server limits
✅ Performance: Distribute load, improve query performance
✅ Availability: Failure of one shard doesn't affect others
✅ Cost: Use cheaper hardware, scale incrementally
✅ Geographic Distribution: Place shards closer to users
Disadvantages
❌ Complexity: More complex architecture and operations
❌ Cross-Shard Queries: Expensive, may require application logic
❌ Rebalancing: Difficult to rebalance data across shards
❌ Joins: Cross-shard joins not natively supported
❌ Transactions: Cross-shard transactions are complex
Best Practices
1. Choose Right Shard Key
Criteria:
- Even distribution (avoid hotspots)
- Aligns with query patterns
- Supports common access patterns
- Stable (doesn't change frequently)
Examples:
- ✅ Good: user_id (stable, even distribution)
- ❌ Bad: email (can change, uneven distribution)
- ❌ Bad: created_at (creates hotspots for recent data)
2. Avoid Hotspots
Problem: Uneven distribution creates overloaded shards
Solutions:
- Use hash-based sharding
- Use composite shard keys (see the sketch below)
- Monitor shard sizes
- Rebalance proactively
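One common fix for a hot key, sketched below, is a composite (salted) shard key: append a small salt so one hot entity spreads across a few shards instead of overloading one. All names and the salt width are illustrative.

```python
import zlib

NUM_SHARDS = 8
SPREAD = 4  # one hot tenant is spread across at most 4 shards

def shard_for(tenant_id: str, user_id: int) -> int:
    # Salting the key: a huge tenant no longer maps to a single shard.
    composite = f"{tenant_id}:{user_id % SPREAD}"
    return zlib.crc32(composite.encode()) % NUM_SHARDS

# Trade-off: reads for one tenant now fan out to up to SPREAD shards,
# the price paid for eliminating the single-shard hotspot.
```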
3. Handle Cross-Shard Queries
Strategies:
- Minimize: Design queries to be shard-local
- Aggregation Layer: Use a separate service for aggregations (see the scatter-gather sketch below)
- Denormalization: Duplicate data to avoid cross-shard queries
- Caching: Cache cross-shard query results
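When a query genuinely must span shards, the usual fallback is scatter-gather: fan the query out in parallel, then merge. A minimal sketch; `query_shard` stands in for a real per-shard client:

```python
from concurrent.futures import ThreadPoolExecutor

SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]

def query_shard(shard: str, sql: str) -> list[dict]:
    # Stand-in for a real per-shard database client (illustrative).
    return [{"shard": shard, "query": sql}]

def scatter_gather(sql: str) -> list[dict]:
    # Fan out to every shard in parallel, then flatten the results.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = pool.map(lambda s: query_shard(s, sql), SHARDS)
    return [row for rows in results for row in rows]

rows = scatter_gather("SELECT COUNT(*) FROM users")
# Latency is set by the slowest shard; one more reason to keep
# queries shard-local whenever the access pattern allows it.
```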
4. Plan for Rebalancing
When to Rebalance:
- Shard size imbalance (>20% difference)
- Performance degradation
- Adding/removing shards
Strategies:
- Use consistent hashing (minimal movement)
- Dual-write during migration (sketched below)
- Gradual migration
- Monitor rebalancing progress
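In outline, dual-write keeps the old and new shards consistent while data is backfilled. A minimal sketch with dict-backed stand-ins for the real shard clients:

```python
# Dict-backed stand-ins for the real shard clients (illustrative).
old_shard: dict[str, dict] = {}
new_shard: dict[str, dict] = {}
MIGRATING = True

def write(key: str, value: dict) -> None:
    old_shard[key] = value      # old shard stays the source of truth
    if MIGRATING:
        new_shard[key] = value  # keep the target in sync during backfill

def read(key: str) -> dict:
    # Reads flip to new_shard only after the backfill catches up
    # and dual writes have converged; until then, read the old shard.
    return old_shard[key]

write("user:42", {"name": "Ada"})
assert read("user:42") == new_shard["user:42"]
```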
5. Shard Metadata Management
Store:
- Shard key ranges
- Shard locations
- Shard health status
- Routing rules
Options:
- Configuration service (Zookeeper, etcd)
- Database table
- Application configuration
- Service discovery
Common Pitfalls
⚠️ Common Mistake: Choosing the wrong shard key, creating hotspots.
Solution: Analyze data distribution. Use hash-based sharding or composite keys. Monitor shard sizes.
⚠️ Common Mistake: Not planning for cross-shard queries.
Solution: Design queries to be shard-local. Use an aggregation layer or denormalization.
⚠️ Common Mistake: Sharding too early (premature optimization).
Solution: Shard only when necessary (single-server limits reached). Start with replication, then partition, then shard.
⚠️ Common Mistake: Difficult rebalancing due to a poor sharding strategy.
Solution: Use consistent hashing or directory-based sharding for easier rebalancing.
⚠️ Common Mistake: Not handling shard failures.
Solution: Implement replication per shard. Have failover procedures. Monitor shard health.
Interview Tips
🎯 Interview Focus: Interviewers often ask about scaling databases:
- Sharding Strategies: Know range, hash, directory, consistent hashing
- Shard Key Selection: Explain how to choose shard key
- Hotspots: Discuss how to avoid and handle hotspots
- Cross-Shard Queries: Explain strategies for handling them
- Rebalancing: Discuss how to add/remove shards
Common Questions
- "How would you shard a database with 1 billion users?"
- "What's the difference between partitioning and sharding?"
- "How do you avoid hotspots when sharding?"
- "How would you handle queries that span multiple shards?"
- "Explain consistent hashing for sharding."
- "How do you add a new shard without downtime?"
Related Topics
- Data Partitioning: Single-server partitioning
- Step 6: Consistent Hashing: Hash-based distribution
- Data Replication: Replication per shard
- Step 3: Consistency: Consistency in sharded systems
Visual Aids
Sharded Architecture
```
                ┌──────────┐
                │  Client  │
                └─────┬────┘
                      │
                      ▼
               ┌──────────────┐
               │ Shard Router │
               │ (Application)│
               └──────┬───────┘
                      │
      ┌──────────┬────┴─────┬──────────┐
      │          │          │          │
  ┌───▼───┐  ┌───▼───┐  ┌───▼───┐  ┌───▼───┐
  │ Shard │  │ Shard │  │ Shard │  │ Shard │
  │   1   │  │   2   │  │   3   │  │   4   │
  └───────┘  └───────┘  └───────┘  └───────┘
```
Hash-Based Sharding
user_id → hash(user_id) % 4 → Shard Assignment

user_id 12345 → hash(12345) % 4 → Shard 1
user_id 67890 → hash(67890) % 4 → Shard 3
user_id 11111 → hash(11111) % 4 → Shard 2
Range-Based Sharding
Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
Shard 4: user_id 3001-4000
Quick Reference Summary
Sharding: Distribute a database across multiple servers (shards) for horizontal scaling. Each shard holds a subset of the data.
Strategies: Range (simple, hotspots), Hash (even distribution), Directory (flexible), Consistent Hashing (minimal rebalancing).
Key Consideration: Choose shard key carefully to avoid hotspots. Plan for cross-shard queries and rebalancing. Shard only when necessary.