Sharding



Quick Reference

| Sharding Strategy | Method | Pros | Cons |
|---|---|---|---|
| Range-Based | Partition by value ranges | Simple, efficient range queries | Hotspots, uneven distribution |
| Hash-Based | Hash function on shard key | Even distribution | Range queries difficult |
| Directory-Based | Lookup table | Flexible, easy rebalancing | Lookup overhead, SPOF |
| Geo-Based | Geographic location | Low latency, compliance | Cross-region queries |

Clear Definition

Sharding is a database architecture pattern where data is horizontally partitioned across multiple independent database servers (shards). Each shard is a separate database instance that holds a subset of the total data. Sharding enables horizontal scaling by distributing data and load across multiple machines.

💡 Key Insight: Sharding is partitioning taken to the distributed level: each partition lives on a separate server. It's essential for scaling beyond single-server limits.


Core Concepts

How Sharding Works

  1. Shard Key Selection: Choose a field to determine shard assignment
  2. Sharding Function: Use function (hash, range, etc.) to map data to shard
  3. Data Distribution: Data distributed across shards based on shard key
  4. Query Routing: Application/router determines which shard(s) to query
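
In practice, steps 2-4 amount to a small routing function in the application or in a proxy layer. A minimal sketch in Python (illustrative only; the shard host names and the MD5-based hash are assumptions, not any specific product's API):

import hashlib

SHARDS = ["db-shard-0:5432", "db-shard-1:5432", "db-shard-2:5432", "db-shard-3:5432"]

def shard_for(user_id: int) -> str:
    """Steps 2-3: a stable hash of the shard key selects the owning shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Step 4: route the query to the shard that owns this key.
print(shard_for(12345))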

Sharding Strategies

1. Range-Based Sharding

How it Works: Partition data by value ranges.

Example:

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000

Pros:

  • Simple to implement
  • Efficient range queries
  • Easy to add new shards

Cons:

  • Can create hotspots (uneven distribution)
  • Difficult to rebalance
  • Sequential keys problematic

Use Cases: Time-series data, ordered data, when range queries are common
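
Routing under this strategy is a search over sorted range boundaries. A minimal sketch (illustrative; the boundaries and shard names below are assumptions):

import bisect

UPPER_BOUNDS = [1000, 2000, 3000]            # inclusive upper bound of each range
SHARDS = ["Shard 1", "Shard 2", "Shard 3"]

def shard_for(user_id: int) -> str:
    i = bisect.bisect_left(UPPER_BOUNDS, user_id)   # first range that covers user_id
    if i == len(SHARDS):
        raise KeyError(f"user_id {user_id} is outside all configured ranges")
    return SHARDS[i]

print(shard_for(1500))   # Shard 2
print(shard_for(999))    # Shard 1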

2. Hash-Based Sharding

How it Works: Use hash function on shard key.

Example:

shard_id = hash(user_id) % num_shards
# user_id 12345 → hash(12345) % 4 → Shard 2

Pros:

  • Even distribution
  • Avoids hotspots
  • Simple to implement

Cons:

  • Range queries require scanning all shards
  • Difficult to add/remove shards (requires rehashing)
  • Cross-shard queries expensive

Use Cases: Even distribution needed, simple key-based lookups
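
The main operational cost is that the shard count is baked into the modulo formula, so adding a shard remaps most keys. A quick illustrative check of how much data a naive modulo scheme moves when going from 4 to 5 shards:

# Count how many of 10,000 keys land on a different shard after resharding.
moved = sum(1 for key in range(10_000) if key % 4 != key % 5)
print(f"{moved / 10_000:.0%} of keys change shards")   # 80%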

3. Directory-Based Sharding

How it Works: Use lookup table (directory) to map keys to shards.

Example:

Shard Directory:
user_id 1-1000    → Shard 1 (Server A)
user_id 1001-2000 → Shard 2 (Server B)
user_id 2001-3000 → Shard 3 (Server C)

Pros:

  • Flexible mapping
  • Easy to rebalance (update directory)
  • Can handle uneven distribution

Cons:

  • Lookup overhead (extra query)
  • Directory becomes single point of failure
  • Directory can become bottleneck

Use Cases: Flexible sharding, frequent rebalancing needed
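
A minimal sketch of directory-based routing (illustrative; in practice the directory lives in a shared store such as a database table or a coordination service, not in application memory):

SHARD_DIRECTORY = [
    # (low, high, location) -- values are assumptions for illustration
    (1, 1000, "Server A"),
    (1001, 2000, "Server B"),
    (2001, 3000, "Server C"),
]

def shard_for(user_id: int) -> str:
    for low, high, location in SHARD_DIRECTORY:
        if low <= user_id <= high:
            return location
    raise KeyError(f"no shard mapped for user_id {user_id}")

# Rebalancing is a directory update rather than a rehash of every key.
SHARD_DIRECTORY.append((3001, 4000, "Server D"))
print(shard_for(3500))   # Server D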

4. Consistent Hashing

How it Works: Use consistent hash ring to map data to shards.

Example: See Consistent Hashing

Pros:

  • Minimal data movement when adding/removing shards
  • Even distribution
  • Handles shard failures gracefully

Cons:

  • More complex implementation
  • Virtual nodes needed for even distribution

Use Cases: Dynamic sharding, frequently adding/removing shards
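
A compact sketch of a consistent-hash ring with virtual nodes (illustrative only; production systems add replication and better-distributed hash functions):

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Place each shard on the ring many times (virtual nodes) to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{shard}#{v}"), shard) for shard in shards for v in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next virtual node on the ring.
        index = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
print(ring.shard_for("user:12345"))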


Use Cases

When to Shard

  1. Single Server Limits

    • Database too large for single server
    • Query performance degrading
    • Storage capacity limits
  2. High Write Throughput

    • Writes bottlenecking on single server
    • Need to distribute write load
    • Horizontal write scaling needed
  3. Geographic Distribution

    • Users in different regions
    • Low latency requirements
    • Data residency compliance
  4. Cost Optimization

    • Cheaper to use multiple smaller servers
    • Better resource utilization
    • Scale incrementally

Real-World Examples

  1. Instagram: Sharded by user_id, each shard on separate server
  2. Uber: Sharded by city/region for location data
  3. Amazon: Sharded by product category and region
  4. Twitter: Sharded user data by user_id ranges

Advantages & Disadvantages

Advantages

✅ Horizontal Scaling: Scale beyond single server limits
✅ Performance: Distribute load, improve query performance
✅ Availability: Failure of one shard doesn't affect others
✅ Cost: Use cheaper hardware, scale incrementally
✅ Geographic Distribution: Place shards closer to users

Disadvantages

❌ Complexity: More complex architecture and operations
❌ Cross-Shard Queries: Expensive, may require application logic
❌ Rebalancing: Difficult to rebalance data across shards
❌ Joins: Cross-shard joins usually not supported natively; must be handled in application code
❌ Transactions: Cross-shard transactions complex


Best Practices

1. Choose Right Shard Key

Criteria:

  • Even distribution (avoid hotspots)
  • Aligns with query patterns
  • Supports common access patterns
  • Stable (doesn't change frequently)

Examples:

  • βœ… Good: user_id (stable, even distribution)
  • ❌ Bad: email (can change, uneven distribution)
  • ❌ Bad: created_at (creates hotspots for recent data)

2. Avoid Hotspots

Problem: Uneven distribution creates overloaded shards

Solutions:

  • Use hash-based sharding
  • Use composite shard keys
  • Monitor shard sizes
  • Rebalance proactively

3. Handle Cross-Shard Queries

Strategies:

  • Minimize: Design queries to be shard-local
  • Aggregation Layer: Use separate service for aggregations
  • Denormalization: Duplicate data to avoid cross-shard queries
  • Caching: Cache cross-shard query results

4. Plan for Rebalancing

When to Rebalance:

  • Shard size imbalance (>20% difference)
  • Performance degradation
  • Adding/removing shards

Strategies:

  • Use consistent hashing (minimal movement)
  • Dual-write during migration
  • Gradual migration
  • Monitor rebalancing progress

5. Shard Metadata Management

Store:

  • Shard key ranges
  • Shard locations
  • Shard health status
  • Routing rules

Options:

  • Configuration service (Zookeeper, etcd)
  • Database table
  • Application configuration
  • Service discovery

Common Pitfalls

⚠️ Common Mistake: Choosing wrong shard key, creating hotspots.

Solution: Analyze data distribution. Use hash-based sharding or composite keys. Monitor shard sizes.

⚠️ Common Mistake: Not planning for cross-shard queries.

Solution: Design queries to be shard-local. Use aggregation layer or denormalization.

⚠️ Common Mistake: Sharding too early (premature optimization).

Solution: Shard only when necessary (single server limits reached). Start with replication, then partition, then shard.

⚠️ Common Mistake: Difficult rebalancing due to poor sharding strategy.

Solution: Use consistent hashing or directory-based sharding for easier rebalancing.

⚠️ Common Mistake: Not handling shard failures.

Solution: Implement replication per shard. Have failover procedures. Monitor shard health.


Interview Tips

🎯 Interview Focus: Interviewers often ask about scaling databases:

  1. Sharding Strategies: Know range, hash, directory, consistent hashing
  2. Shard Key Selection: Explain how to choose shard key
  3. Hotspots: Discuss how to avoid and handle hotspots
  4. Cross-Shard Queries: Explain strategies for handling them
  5. Rebalancing: Discuss how to add/remove shards

Common Questions

  • "How would you shard a database with 1 billion users?"
  • "What's the difference between partitioning and sharding?"
  • "How do you avoid hotspots when sharding?"
  • "How would you handle queries that span multiple shards?"
  • "Explain consistent hashing for sharding."
  • "How do you add a new shard without downtime?"


Visual Aids

Sharded Architecture

┌──────────┐
│  Client  │
└────┬─────┘
     │
     ▼
┌──────────────┐
│ Shard Router │
│ (Application)│
└──────┬───────┘
       │
   ┌───┴────┬────────┬────────┐
   │        │        │        │
┌──▼──┐ ┌───▼──┐ ┌───▼──┐ ┌───▼──┐
│Shard│ │Shard │ │Shard │ │Shard │
│  1  │ │  2   │ │  3   │ │  4   │
└─────┘ └──────┘ └──────┘ └──────┘

Hash-Based Sharding

user_id → hash(user_id) % 4 → Shard Assignment

user_id 12345 → hash(12345) % 4 → Shard 2
user_id 67890 → hash(67890) % 4 → Shard 3
user_id 11111 → hash(11111) % 4 → Shard 1

Range-Based Sharding

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
Shard 4: user_id 3001-4000

Quick Reference Summary

Sharding: Distribute database across multiple servers (shards) for horizontal scaling. Each shard holds a subset of data.

Strategies: Range (simple, hotspots), Hash (even distribution), Directory (flexible), Consistent Hashing (minimal rebalancing).

Key Consideration: Choose shard key carefully to avoid hotspots. Plan for cross-shard queries and rebalancing. Shard only when necessary.


Previous Topic: Data Partitioning ←

Back to: Step 2 Overview | Main Index