Sharding



Quick Reference

| Sharding Strategy | Method | Pros | Cons |
|---|---|---|---|
| Range-Based | Partition by value ranges | Simple, efficient range queries | Hotspots, uneven distribution |
| Hash-Based | Hash function on shard key | Even distribution | Range queries difficult |
| Directory-Based | Lookup table | Flexible, easy rebalancing | Lookup overhead, SPOF |
| Geo-Based | Geographic location | Low latency, compliance | Cross-region queries |

Clear Definition

Sharding is a database architecture pattern where data is horizontally partitioned across multiple independent database servers (shards). Each shard is a separate database instance that holds a subset of the total data. Sharding enables horizontal scaling by distributing data and load across multiple machines.

💡 Key Insight: Sharding is partitioning taken to the distributed level: each partition lives on a separate server. It's essential for scaling beyond single-server limits.


Core Concepts

How Sharding Works

  1. Shard Key Selection: Choose a field to determine shard assignment
  2. Sharding Function: Use function (hash, range, etc.) to map data to shard
  3. Data Distribution: Data distributed across shards based on shard key
  4. Query Routing: Application/router determines which shard(s) to query
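
In practice, steps 2-4 amount to a small routing function in the application or in a proxy layer. A minimal sketch in Python (illustrative only; the shard host names and the MD5-based hash are assumptions, not any specific product's API):

import hashlib

SHARDS = ["db-shard-0:5432", "db-shard-1:5432", "db-shard-2:5432", "db-shard-3:5432"]

def shard_for(user_id: int) -> str:
    """Steps 2-3: a stable hash of the shard key selects the owning shard."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Step 4: route the query to the shard that owns this key.
print(shard_for(12345))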

Sharding Strategies

1. Range-Based Sharding

How it Works: Partition data by value ranges.

Example:

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000

Pros:

  • Simple to implement
  • Efficient range queries
  • Easy to add new shards

Cons:

  • Can create hotspots (uneven distribution)
  • Difficult to rebalance
  • Sequential keys problematic

Use Cases: Time-series data, ordered data, when range queries are common
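
Routing under this strategy is a search over sorted range boundaries. A minimal sketch (illustrative; the boundaries and shard names below are assumptions):

import bisect

UPPER_BOUNDS = [1000, 2000, 3000]            # inclusive upper bound of each range
SHARDS = ["Shard 1", "Shard 2", "Shard 3"]

def shard_for(user_id: int) -> str:
    i = bisect.bisect_left(UPPER_BOUNDS, user_id)   # first range that covers user_id
    if i == len(SHARDS):
        raise KeyError(f"user_id {user_id} is outside all configured ranges")
    return SHARDS[i]

print(shard_for(1500))   # Shard 2
print(shard_for(999))    # Shard 1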

2. Hash-Based Sharding

How it Works: Use hash function on shard key.

Example:

shard_id = hash(user_id) % num_shards
# user_id 12345 → hash(12345) % 4 → Shard 2

Pros:

  • Even distribution
  • Avoids hotspots
  • Simple to implement

Cons:

  • Range queries require scanning all shards
  • Difficult to add/remove shards (requires rehashing)
  • Cross-shard queries expensive

Use Cases: Even distribution needed, simple key-based lookups
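
The main operational cost is that the shard count is baked into the modulo formula, so adding a shard remaps most keys. A quick illustrative check of how much data a naive modulo scheme moves when going from 4 to 5 shards:

# Count how many of 10,000 keys land on a different shard after resharding.
moved = sum(1 for key in range(10_000) if key % 4 != key % 5)
print(f"{moved / 10_000:.0%} of keys change shards")   # 80%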

3. Directory-Based Sharding

How it Works: Use lookup table (directory) to map keys to shards.

Example:

Shard Directory:
user_id 1-1000    → Shard 1 (Server A)
user_id 1001-2000 → Shard 2 (Server B)
user_id 2001-3000 → Shard 3 (Server C)

Pros:

  • Flexible mapping
  • Easy to rebalance (update directory)
  • Can handle uneven distribution

Cons:

  • Lookup overhead (extra query)
  • Directory becomes single point of failure
  • Directory can become bottleneck

Use Cases: Flexible sharding, frequent rebalancing needed
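
A minimal sketch of directory-based routing (illustrative; in practice the directory lives in a shared store such as a database table or a coordination service, not in application memory):

SHARD_DIRECTORY = [
    # (low, high, location) -- values are assumptions for illustration
    (1, 1000, "Server A"),
    (1001, 2000, "Server B"),
    (2001, 3000, "Server C"),
]

def shard_for(user_id: int) -> str:
    for low, high, location in SHARD_DIRECTORY:
        if low <= user_id <= high:
            return location
    raise KeyError(f"no shard mapped for user_id {user_id}")

# Rebalancing is a directory update rather than a rehash of every key.
SHARD_DIRECTORY.append((3001, 4000, "Server D"))
print(shard_for(3500))   # Server D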

4. Consistent Hashing

How it Works: Use consistent hash ring to map data to shards.

Example: See Consistent Hashing

Pros:

  • Minimal data movement when adding/removing shards
  • Even distribution
  • Handles shard failures gracefully

Cons:

  • More complex implementation
  • Virtual nodes needed for even distribution

Use Cases: Dynamic sharding, frequently adding/removing shards
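
A compact sketch of a consistent-hash ring with virtual nodes (illustrative only; production systems add replication and better-distributed hash functions):

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shards, vnodes=100):
        # Place each shard on the ring many times (virtual nodes) to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{shard}#{v}"), shard) for shard in shards for v in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise from the key's position to the next virtual node on the ring.
        index = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["shard-1", "shard-2", "shard-3", "shard-4"])
print(ring.shard_for("user:12345"))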


Use Cases

When to Shard

  1. Single Server Limits

    • Database too large for single server
    • Query performance degrading
    • Storage capacity limits
  2. High Write Throughput

    • Writes bottlenecking on single server
    • Need to distribute write load
    • Horizontal write scaling needed
  3. Geographic Distribution

    • Users in different regions
    • Low latency requirements
    • Data residency compliance
  4. Cost Optimization

    • Cheaper to use multiple smaller servers
    • Better resource utilization
    • Scale incrementally

Real-World Examples

  1. Instagram: Sharded by user_id, each shard on separate server
  2. Uber: Sharded by city/region for location data
  3. Amazon: Sharded by product category and region
  4. Twitter: Sharded user data by user_id ranges

Advantages & Disadvantages

Advantages

✅ Horizontal Scaling: Scale beyond single server limits
✅ Performance: Distribute load, improve query performance
✅ Availability: Failure of one shard doesn't affect others
✅ Cost: Use cheaper hardware, scale incrementally
✅ Geographic Distribution: Place shards closer to users

Disadvantages

❌ Complexity: More complex architecture and operations
❌ Cross-Shard Queries: Expensive, may require application logic
❌ Rebalancing: Difficult to rebalance data across shards
❌ Joins: Cross-shard joins usually not supported natively; must be handled in application code
❌ Transactions: Cross-shard transactions complex


Best Practices

1. Choose Right Shard Key

Criteria:

  • Even distribution (avoid hotspots)
  • Aligns with query patterns
  • Supports common access patterns
  • Stable (doesn't change frequently)

Examples:

  • βœ… Good: user_id (stable, even distribution)
  • ❌ Bad: email (can change, uneven distribution)
  • ❌ Bad: created_at (creates hotspots for recent data)

2. Avoid Hotspots

Problem: Uneven distribution creates overloaded shards

Solutions:

  • Use hash-based sharding
  • Use composite shard keys
  • Monitor shard sizes
  • Rebalance proactively

3. Handle Cross-Shard Queries

Strategies:

  • Minimize: Design queries to be shard-local
  • Aggregation Layer: Use separate service for aggregations
  • Denormalization: Duplicate data to avoid cross-shard queries
  • Caching: Cache cross-shard query results

4. Plan for Rebalancing

When to Rebalance:

  • Shard size imbalance (>20% difference)
  • Performance degradation
  • Adding/removing shards

Strategies:

  • Use consistent hashing (minimal movement)
  • Dual-write during migration
  • Gradual migration
  • Monitor rebalancing progress

5. Shard Metadata Management

Store:

  • Shard key ranges
  • Shard locations
  • Shard health status
  • Routing rules

Options:

  • Configuration service (Zookeeper, etcd)
  • Database table
  • Application configuration
  • Service discovery

Common Pitfalls

⚠️ Common Mistake: Choosing wrong shard key, creating hotspots.

Solution: Analyze data distribution. Use hash-based sharding or composite keys. Monitor shard sizes.

⚠️ Common Mistake: Not planning for cross-shard queries.

Solution: Design queries to be shard-local. Use aggregation layer or denormalization.

⚠️ Common Mistake: Sharding too early (premature optimization).

Solution: Shard only when necessary (single server limits reached). Start with replication, then partition, then shard.

⚠️ Common Mistake: Difficult rebalancing due to poor sharding strategy.

Solution: Use consistent hashing or directory-based sharding for easier rebalancing.

⚠️ Common Mistake: Not handling shard failures.

Solution: Implement replication per shard. Have failover procedures. Monitor shard health.


Interview Tips

🎯 Interview Focus: Interviewers often ask about scaling databases:

  1. Sharding Strategies: Know range, hash, directory, consistent hashing
  2. Shard Key Selection: Explain how to choose shard key
  3. Hotspots: Discuss how to avoid and handle hotspots
  4. Cross-Shard Queries: Explain strategies for handling them
  5. Rebalancing: Discuss how to add/remove shards

Common Questions

  • "How would you shard a database with 1 billion users?"
  • "What's the difference between partitioning and sharding?"
  • "How do you avoid hotspots when sharding?"
  • "How would you handle queries that span multiple shards?"
  • "Explain consistent hashing for sharding."
  • "How do you add a new shard without downtime?"


Visual Aids

Sharded Architecture

┌──────────┐
│  Client  │
└────┬─────┘
     │
     ▼
┌──────────────┐
│ Shard Router │
│ (Application)│
└──────┬───────┘
       │
   ┌───┴────┬────────┬────────┐
   │        │        │        │
┌──▼──┐ ┌───▼──┐ ┌───▼──┐ ┌───▼──┐
│Shard│ │Shard │ │Shard │ │Shard │
│  1  │ │  2   │ │  3   │ │  4   │
└─────┘ └──────┘ └──────┘ └──────┘

Hash-Based Sharding

user_id → hash(user_id) % 4 → Shard Assignment

user_id 12345 → hash(12345) % 4 → Shard 2
user_id 67890 → hash(67890) % 4 → Shard 3
user_id 11111 → hash(11111) % 4 → Shard 1

Range-Based Sharding

Shard 1: user_id 1-1000
Shard 2: user_id 1001-2000
Shard 3: user_id 2001-3000
Shard 4: user_id 3001-4000

Quick Reference Summary

Sharding: Distribute database across multiple servers (shards) for horizontal scaling. Each shard holds a subset of data.

Strategies: Range (simple, hotspots), Hash (even distribution), Directory (flexible), Consistent Hashing (minimal rebalancing).

Key Consideration: Choose shard key carefully to avoid hotspots. Plan for cross-shard queries and rebalancing. Shard only when necessary.


Previous Topic: Data Partitioning ←

Back to: Step 2 Overview | Main Index