Data Partitioning


Quick Reference

Partitioning Type  Method          Use Case                   Pros                Cons
Horizontal         Split rows      Large tables               Scale reads/writes  Complex queries
Vertical           Split columns   Different access patterns  Optimize access     Joins required
Range              By value range  Time-series, ordered data  Simple, efficient   Hotspots possible
Hash               Hash function   Even distribution          Balanced load       Range queries hard
Directory          Lookup table    Flexible mapping           Flexible            Single point of failure

Clear Definition

Data Partitioning is the technique of dividing a database into smaller, more manageable pieces called partitions. Each partition can be stored and accessed independently, improving performance, manageability, and scalability. Partitioning can be done horizontally (splitting rows) or vertically (splitting columns).

💡 Key Insight: Partitioning is essential for scaling large databases. It allows parallel processing, reduces query scope, and enables distributed storage.


Core Concepts

Horizontal Partitioning (Sharding)

Definition: Splitting a table by rows, so that each partition contains a subset of the rows. When the partitions are placed on separate servers, this is usually called sharding.

Example:

Users Table (Original)
├── Partition 1: user_id 1-1000
├── Partition 2: user_id 1001-2000
└── Partition 3: user_id 2001-3000

Characteristics:

  • Each partition has the same schema
  • Rows distributed across partitions
  • Queries may need to access multiple partitions
  • Enables parallel processing
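
Sketch (illustrative only): the split above can be expressed as a small routing function. The fixed size of 1000 ids per partition and the helper names partition_for, split_rows, and partitions_touched are assumptions made for this example; they do not come from any particular database.

```python
from collections import defaultdict

PARTITION_SIZE = 1000  # assumed: 1000 consecutive user_ids per partition


def partition_for(user_id):
    """Map user_id 1-1000 to partition 1, 1001-2000 to partition 2, and so on."""
    return (user_id - 1) // PARTITION_SIZE + 1


def split_rows(rows):
    """Distribute rows (dicts with a user_id) across partitions."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(row["user_id"])].append(row)
    return dict(partitions)


def partitions_touched(low_id, high_id):
    """List the partitions a query over [low_id, high_id] has to visit."""
    return list(range(partition_for(low_id), partition_for(high_id) + 1))


rows = [{"user_id": uid, "name": f"user{uid}"} for uid in (1, 1000, 1001, 2500)]
partitions = split_rows(rows)
```

A lookup for a single user_id touches one partition, while partitions_touched(900, 1100) returns two, which is the "queries may need to access multiple partitions" case from the list above.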

Vertical Partitioning

Definition: Splitting a table by columns, so that each partition contains a subset of the columns.

Example:

Users Table (Original)
├── Partition 1: user_id, name, email (frequently accessed)
└── Partition 2: user_id, bio, preferences (rarely accessed)

Characteristics:

  • Different partitions have different columns
  • A join on the shared key (user_id in the example) is required to reconstruct a full row
  • Optimizes for access patterns
  • Reduces I/O for frequent queries
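
Sketch (illustrative only): splitting one row into the two column groups from the example and joining them back on user_id. The names HOT_COLUMNS, COLD_COLUMNS, split_row, and join_row, and the sample row, are hypothetical.

```python
# Column groups mirroring the example above (names are assumptions).
HOT_COLUMNS = ("user_id", "name", "email")        # frequently accessed
COLD_COLUMNS = ("user_id", "bio", "preferences")  # rarely accessed


def split_row(row):
    """Split one full row into its hot and cold vertical partitions."""
    hot = {col: row[col] for col in HOT_COLUMNS}
    cold = {col: row[col] for col in COLD_COLUMNS}
    return hot, cold


def join_row(hot, cold):
    """Rebuild the full row by joining both partitions on user_id."""
    if hot["user_id"] != cold["user_id"]:
        raise ValueError("partitions belong to different users")
    return {**hot, **cold}


full = {"user_id": 7, "name": "Ada", "email": "ada@example.com",
        "bio": "Engineer", "preferences": "dark-mode"}
hot, cold = split_row(full)
```

A query that only needs name and email reads the hot partition alone; only a request for the whole profile pays for join_row.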

Partitioning Strategies

1. Range Partitioning

How it Works: Partition based on value ranges.

Example:

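-- Partition by date ranges (MySQL syntax)
CREATE TABLE orders (
    order_id INT,
    order_date DATE,
    ...
) PARTITION BY RANGE (YEAR(order_date)) (
    -- Each bound is exclusive: p2020 holds every row with YEAR(order_date) < 2021
    PARTITION p2020 VALUES LESS THAN (2021),
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023)
    -- Rows dated 2023 or later are rejected until another partition is added
);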

Use Cases: Time-series data, ordered data, date-based queries

Pros: Simple, efficient for range queries, easy to add new partitions

Cons: Can create hotspots (for example, with date ranges all new writes land in the most recent partition)
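
Sketch (illustrative only): how a router could pick the partition for a given year using the same exclusive upper bounds as the VALUES LESS THAN clauses in the DDL above. The function partition_for_year and the two lists are assumptions for this example, not part of MySQL.

```python
from bisect import bisect_right

# Exclusive upper bounds and partition names, mirroring the DDL above.
UPPER_BOUNDS = [2021, 2022, 2023]
PARTITION_NAMES = ["p2020", "p2021", "p2022"]


def partition_for_year(year):
    """Return the first partition whose upper bound is greater than year."""
    index = bisect_right(UPPER_BOUNDS, year)
    if index == len(PARTITION_NAMES):
        raise ValueError(f"no partition for year {year}")
    return PARTITION_NAMES[index]
```

Because the bounds are sorted, the lookup is a binary search, and a range query only needs the partitions between the results for its two endpoints.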

2. Hash Partitioning

How it Works: Apply a hash function to the partitioning key and use the result to choose the partition.

Example:

-- Partition by hash of user_id
CREATE TABLE users (
    user_id INT,
    ...
) PARTITION BY HASH(user_id) PARTITIONS 4;

Use Cases: Even distribution, avoiding hotspots

Pros: Balanced load distribution, avoids hotspots

Cons: Range queries require scanning all partitions; with plain modulo hashing, changing the number of partitions remaps most keys, so adding or removing partitions is expensive
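
Sketch (illustrative only): hash-then-modulo routing, plus a helper that measures how many keys change partition when the partition count changes. Using SHA-256 as the hash is an assumption made for a stable, well-mixed result; it is not how any specific database implements PARTITION BY HASH. The helper names are hypothetical.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash, then take the modulo."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def moved_fraction(keys, old_count, new_count):
    """Fraction of keys whose partition changes when the count changes."""
    moved = sum(
        1 for key in keys
        if partition_for(key, old_count) != partition_for(key, new_count)
    )
    return moved / len(keys)


keys = range(1, 10_001)
fraction = moved_fraction(keys, 4, 5)
```

For a uniform hash, a key stays put when going from 4 to 5 partitions only if both modulo results agree, which is expected for about one key in five; the rest have to be moved. That cost is what schemes such as consistent hashing are designed to reduce.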

3. List Partitioning

How it Works: Partition based on specific values.

Example:

-- Partition by region
CREATE TABLE customers (
    customer_id INT,
    region VARCHAR(50),
    ...
) PARTITION BY LIST (region) (
    PARTITION p_north VALUES IN ('NY', 'MA', 'CT'),
    PARTITION p_south VALUES IN ('TX', 'FL', 'GA'),
    PARTITION p_west VALUES IN ('CA', 'OR', 'WA')
);

Use Cases: Geographic data, categorical data

Pros: Natural grouping, efficient for category-based queries

Cons: Requires manual partition management; a value that no partition lists is rejected until a partition is added for it
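
Sketch (illustrative only): list partitioning as an explicit value-to-partition mapping, using the value lists from the DDL above. The dictionary and function names are assumptions for this example.

```python
# Value lists copied from the DDL above.
PARTITION_VALUES = {
    "p_north": {"NY", "MA", "CT"},
    "p_south": {"TX", "FL", "GA"},
    "p_west": {"CA", "OR", "WA"},
}

# Invert to value -> partition so each lookup is a single dict access.
VALUE_TO_PARTITION = {
    value: name
    for name, values in PARTITION_VALUES.items()
    for value in values
}


def partition_for_region(region):
    """Return the partition that lists this value, or fail if none does."""
    try:
        return VALUE_TO_PARTITION[region]
    except KeyError:
        raise ValueError(f"no partition lists {region!r}") from None
```

The ValueError branch is the manual-management cost in practice: supporting a new value means editing the lists first.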

4. Directory-Based Partitioning

How it Works: Use a lookup table (the directory) to map each key, or key range, to a partition.

Example:

Partition Directory:
user_id 1-1000    → Partition 1
user_id 1001-2000 → Partition 2
user_id 2001-3000 → Partition 3

Use Cases: Flexible partitioning, dynamic rebalancing

Pros: Flexible, easy to rebalance, can handle uneven distribution

Cons: Every request pays a directory lookup, and the directory is a single point of failure unless it is replicated
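
Sketch (illustrative only): a per-key directory held in memory. PartitionDirectory and its methods are hypothetical; a real directory would be a replicated service or table, and reassigning a key would also require copying that key's rows, which is not shown here.

```python
class PartitionDirectory:
    """In-memory key -> partition lookup table (hypothetical helper)."""

    def __init__(self):
        self._assignments = {}

    def assign(self, user_id, partition):
        """Record, or overwrite, which partition owns user_id."""
        self._assignments[user_id] = partition

    def lookup(self, user_id):
        """Return the owning partition; every read or write pays this step."""
        try:
            return self._assignments[user_id]
        except KeyError:
            raise KeyError(f"user_id {user_id} is not in the directory") from None


directory = PartitionDirectory()
for uid in range(1, 3001):
    directory.assign(uid, (uid - 1) // 1000 + 1)

# Rebalance: move one busy user without touching any other mapping.
directory.assign(42, 3)
```

The last line shows the flexibility that range and hash schemes lack: a single key can be relocated on its own, at the price of keeping the directory available and consistent.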


Use Cases

Horizontal Partitioning Use Cases

  1. Large Tables

    • Tables with millions/billions of rows
    • Improve query performance
    • Enable parallel processing
  2. Geographic Distribution

    • Partition by region/country
    • Place partitions closer to users
    • Comply with data residency
  3. Time-Series Data

    • Partition by date/time
    • Archive old partitions
    • Optimize recent data access

Vertical Partitioning Use Cases

  1. Wide Tables

    • Tables with many columns
    • Split frequently vs rarely accessed columns
    • Reduce I/O for common queries
  2. Different Access Patterns

    • Some columns accessed together
    • Others accessed separately
    • Optimize for specific queries

Advantages & Disadvantages

Advantages

✅ Performance: Queries that filter on the partitioning key scan only the matching partitions instead of the whole table
✅ Scalability: Can scale partitions independently
✅ Manageability: Easier backup, maintenance, archiving
✅ Parallelism: Process partitions in parallel
✅ Storage: Can use different storage for different partitions

Disadvantages

❌ Complexity: More complex queries and joins
❌ Cross-Partition Queries: May need to query multiple partitions
❌ Rebalancing: Moving rows between partitions after the fact is expensive
❌ Hotspots: Uneven distribution can create hotspots
❌ Application Changes: May require application-level changes


Best Practices

1. Choose the Right Partitioning Key

Criteria:

  • Even distribution
  • Aligns with query patterns
  • Avoids hotspots
  • Supports common access patterns

Example: For user data, partition by user_id (not by name or email), because it is unique, stable, and typically the key that lookups use.

2. Monitor Partition Sizes

  • Keep partitions roughly equal in size
  • Monitor for hotspots
  • Plan for rebalancing
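
Sketch (illustrative only): one simple way to turn "roughly equal" into a number is to compare each partition's row count with the mean. The 1.5x threshold, the function names, and the sample counts are assumptions for this example; in practice the counts would come from the database's own partition metadata.

```python
def skew_ratio(row_counts):
    """Largest partition size divided by the mean partition size."""
    mean = sum(row_counts.values()) / len(row_counts)
    return max(row_counts.values()) / mean


def oversized(row_counts, threshold=1.5):
    """Names of partitions larger than threshold times the mean."""
    mean = sum(row_counts.values()) / len(row_counts)
    return sorted(name for name, n in row_counts.items() if n > threshold * mean)


# Hypothetical row counts per partition.
counts = {"p1": 1000, "p2": 1100, "p3": 3900}
```

With these sample counts the mean is 2000 rows, so only p3 exceeds the threshold and would be the first candidate for splitting or rebalancing.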

3. Plan for Growth

  • Design for future scale
  • Make it easy to add partitions
  • Consider automatic partitioning

4. Handle Cross-Partition Queries

Strategies:

  • Minimize cross-partition queries
  • Use an aggregation layer
  • Cache results
  • Denormalize data

5. Backup and Maintenance

  • Backup individual partitions
  • Maintain partition metadata
  • Archive old partitions
  • Monitor partition health

Common Pitfalls

⚠️ Common Mistake: Choosing the wrong partitioning key, which creates hotspots.

Solution: Analyze data distribution and query patterns. Use hash partitioning for even distribution.

⚠️ Common Mistake: Too many small partitions.

Solution: Balance the partition count: too many partitions increase overhead, too few limit parallelism.

⚠️ Common Mistake: Not planning for cross-partition queries.

Solution: Design queries to minimize cross-partition access. Use aggregation or caching.

⚠️ Common Mistake: Ignoring partition maintenance.

Solution: Regularly monitor partition sizes, rebalance as needed, archive old partitions.


Interview Tips

🎯 Interview Focus: Interviewers often ask how you would scale a database. Be ready to cover:

  1. Partitioning Types: Know horizontal vs vertical, range vs hash
  2. Partition Key Selection: Explain how to choose a partitioning key
  3. Hotspots: Discuss how to avoid and handle hotspots
  4. Cross-Partition Queries: Explain strategies for handling them
  5. Rebalancing: Discuss how to rebalance partitions

Common Questions

  • "How would you partition a table with 1 billion rows?"
  • "What's the difference between horizontal and vertical partitioning?"
  • "How do you avoid hotspots in partitioned tables?"
  • "How would you handle queries that span multiple partitions?"
  • "Explain hash partitioning vs range partitioning."


Visual Aids

Horizontal Partitioning

Original Table:
┌─────────────────────────┐
│  Users (10M rows)       │
└─────────────────────────┘

Partitioned:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Partition 1 │ │ Partition 2 │ │ Partition 3 │
│ (3.3M rows) │ │ (3.3M rows) │ │ (3.3M rows) │
└─────────────┘ └─────────────┘ └─────────────┘

Vertical Partitioning

Original Table:
┌────────────────────────────────────┐
│ Users: id, name, email, bio, prefs │
└────────────────────────────────────┘

Partitioned:
┌─────────────────┐  ┌─────────────────┐
│ Partition 1     │  │ Partition 2     │
│ id, name, email │  │ id, bio, prefs  │
└─────────────────┘  └─────────────────┘

Range Partitioning Example

Orders Table:
├── p2020: orders from 2020
├── p2021: orders from 2021
├── p2022: orders from 2022
└── p2023: orders from 2023

Quick Reference Summary

Data Partitioning: Divide a database into smaller pieces for performance and scalability. Horizontal partitioning splits rows, vertical partitioning splits columns.

Strategies: Range (time-series), Hash (even distribution), List (categories), Directory (flexible).

Key Consideration: Choose the partitioning key carefully to avoid hotspots. Plan for cross-partition queries and rebalancing.

