09 — Sharding

Technical Overview

Sharding (horizontal partitioning) divides a dataset into smaller subsets (shards) distributed across multiple nodes. Where replication creates copies of data for availability, sharding divides data for scalability — each shard holds a portion of the total dataset. When a single node can no longer hold or process all data, sharding is the solution. The challenge is choosing a partitioning strategy that distributes load evenly, enables efficient queries, and handles the inevitable resharding as data grows.

Prerequisites

Replication concepts (08-replication.md)
Distributed systems fundamentals (01-distributed-systems-fundamentals.md)
Basic hashing and consistent hashing concepts
Understanding of database indexes and query patterns

Core Content

Why Sharding?

A single database node has finite capacity: disk, RAM, CPU, and network. When a dataset exceeds that capacity, or when write throughput exceeds what one machine can handle, horizontal scaling via sharding is the standard solution.

Scaling Dimensions:

  Vertical scaling (scale up):
    Add more RAM, faster disk, more CPUs to one machine.
    Limited by hardware maximums. Expensive. Downtime required.
    Suitable up to ~10TB, ~100K QPS on high-end hardware.

  Horizontal scaling (scale out via sharding):
    Add more machines; each holds a subset of data.
    Scales linearly (theoretically). 
    Complex: distributed transactions, cross-shard queries.
    Suitable for unlimited scale with the right access patterns.

Range-Based Sharding

In range-based sharding, the key space is divided into contiguous ranges. Each shard owns one or more ranges.

Range-Based Sharding Example (User IDs):

  Shard 1: user_id  1 –  999,999
  Shard 2: user_id  1,000,000 – 1,999,999
  Shard 3: user_id  2,000,000 – 2,999,999

  Write user_id=1,234,567 → goes to Shard 2
  Read  user_id=999,999   → goes to Shard 1

  Range scan "all users between 500,000 and 1,500,000":
    Query Shard 1 (500,000–999,999) + Shard 2 (1,000,000–1,500,000)
    → Efficient range scans!

Advantages: Efficient range scans. Locality of sequential keys (useful for time-series data).

Disadvantages: Hot spots. If a specific range receives disproportionate traffic (e.g., all new users get high IDs and all writes go to the last shard), you have a hot shard. Time-series data with range sharding on timestamp always writes to the "current" shard — a perpetual hot spot.

HBase (Google Bigtable's open-source implementation) uses range-based sharding (called "regions"). Each region covers a contiguous key range. The HBase Master assigns regions to RegionServers. Regions automatically split when they exceed the configured size (default: 256MB–10GB).

CockroachDB and TiKV also use range-based sharding. Each range is ~64MB. The cluster metadata (stored in a system range) maps key ranges to node/store assignments. Ranges are automatically split and merged.

Hash-Based Sharding

In hash-based sharding, each key is hashed, and the hash value determines the shard.

Hash Sharding:

  N = 4 (number of shards)
  shard(key) = hash(key) % N

  hash("user:alice") % 4 = 2 → Shard 2
  hash("user:bob")   % 4 = 0 → Shard 0
  hash("user:carol") % 4 = 3 → Shard 3

  Uniform distribution (with good hash function)
  No hot spots from sequential keys

  Problem: Adding a shard requires rehashing EVERYTHING

  N=4: shard(key) = hash(key) % 4
  N=5: shard(key) = hash(key) % 5

  Changing N moves nearly all keys to different shards.
  A 4→5 shard migration requires moving ~80% of all data!

Consistent Hashing

Consistent hashing (Karger et al., 1997) solves the rehashing problem. Instead of hash(key) % N, keys and nodes are placed on a virtual ring (hash ring) of values from 0 to 2^32 (or 2^64).

Consistent Hash Ring:

  Ring: 0 ──────────────────────────── 2^32

        ┌──────────────────────────────────┐
        │                                  │
    0   │    Node A           Node C       │
    ┌───┤◄──(270°)         (90°)───►       │
    │   │                                  │
    │   │                Node B            │
    │   │               (180°)             │
    └───┤                                  │
        └──────────────────────────────────┘

  Place each node at hash(node_id) on the ring.
  Place each key at hash(key) on the ring.
  A key is assigned to the first node clockwise from it.

  hash("user:alice") = 45° → Node C (next clockwise from 45°)
  hash("user:bob")   = 200° → Node A (going clockwise: 200° → 270° = Node A)
  hash("user:carol") = 350° → Node A (going clockwise: 350° → 0° → 270° = Node A)

  Adding a node (Node D at 135°):
  ─────────────────────────────────
  Before: keys between 90° and 180° → Node B
  After:  keys between 90° and 135° → Node D (new node)
           keys between 135° and 180° → Node B (unchanged)

  Only ~1/N fraction of keys need to move. No global rehash!

Virtual nodes (vnodes): A single physical node has multiple positions on the ring (virtual nodes). This: 1. Distributes data more evenly (no "lumpy" distribution) 2. Makes adding/removing nodes gradual (affects many small ranges rather than one large range)

Without vnodes (3 physical nodes):
  Node A: 0°–120°  (33% of data)
  Node B: 120°–240°  (33% of data)
  Node C: 240°–360° (33% of data)

With vnodes (each physical node has 3 virtual positions):
  Virtual positions:
  A1=10°, A2=130°, A3=250°  → Physical Node A
  B1=50°, B2=170°, B3=290°  → Physical Node B
  C1=90°, C2=210°, C3=330°  → Physical Node C

  Better load balance; adding Node D splits multiple small ranges
  instead of one large range.

Cassandra uses consistent hashing with vnodes (default: 256 vnodes per node). The partition key is hashed using Murmur3 to determine ring position.

DynamoDB uses consistent hashing internally for partition placement. The partition key is used as input.

MongoDB Sharding (Chunk-Based)

MongoDB sharding uses a hybrid approach: range-based chunks on a consistent hash ring.

MongoDB Sharding Architecture:

  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │   Shard 1   │  │   Shard 2   │  │   Shard 3   │
  │  (RS 1)     │  │  (RS 2)     │  │  (RS 3)     │
  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
         │                │                │
         └────────────────┴────────────────┘
                          │
                    ┌─────┴─────┐
                    │  Config   │
                    │  Servers  │ (metadata: which chunk is where)
                    └─────┬─────┘
                          │
                    ┌─────┴─────┐
                    │  Mongos   │ (query router)
                    └───────────┘
                          ↑
                     client queries

Chunk: a contiguous range of shard key values.
Default chunk size: 128MB.
Balancer: moves chunks between shards to keep balance.
  - If Shard 1 has 10 chunks and Shard 2 has 5, balancer moves chunks.

Shard key selection is critical in MongoDB: - Cardinality: Key must have many possible values (not a boolean). - Distribution: Key values must be uniformly distributed. - Query targeting: If most queries include the shard key, they go to one shard (targeted). Without shard key, queries scatter to all shards.

DynamoDB Partition Key Design

DynamoDB automatically shards data based on the partition key. The partition key determines the physical partition; the sort key (if any) determines order within the partition.

DynamoDB Partitioning:

  Table: Orders
  Partition Key: user_id
  Sort Key: order_date

  Partition for user_id="alice": [order_2023-01, order_2023-02, ...]
  Partition for user_id="bob":   [order_2024-01, ...]

  These are on different physical partitions → different nodes.
  Queries for a single user (single partition) are efficient.
  Queries across all users require scanning all partitions.

  Hot partition problem:
  If user "celebrity" has 1M orders, their partition is hot.
  → All reads/writes for that partition are rate-limited to that one partition.

  Fix: Add a random suffix to partition key
  Old: partition_key = user_id
  New: partition_key = user_id + "-" + random(1..N)
       Distribute reads/writes across N partitions.
       But now queries must hit all N partitions → fan-out reads.

Cross-Shard Queries and Joins

Cross-shard queries are the primary operational headache with sharding:

Cross-Shard Query Patterns:

Scatter-Gather:
  Query: SELECT * FROM orders WHERE status='pending' AND amount > 100
  (No shard key in filter)

  Mongos/coordinator → Shard 1 (subquery) ─→ results
                      → Shard 2 (subquery) ─→ results  → merge → client
                      → Shard 3 (subquery) ─→ results

  Cost: N parallel queries, wait for slowest, merge results in memory.
  Latency = max(shard latencies). Memory = total result set size.

Cross-Shard Joins:
  Table A (sharded by user_id) JOIN Table B (sharded by product_id)

  Options:
  1. Collocate data: Re-shard Table B by user_id too.
     (Only works if join is always on the same key)
  2. Broadcast join: Send Table B to every shard, join locally.
     (Only works if Table B is small)
  3. Bring data to coordinator: Coordinator fetches all matching rows.
     (Works, but expensive and memory-intensive)

Distributed Transactions Across Shards

Transactions that span multiple shards require distributed coordination:

Cross-Shard Transaction:

  "Transfer $100 from Alice (shard A) to Bob (shard B)"

  Requires:
  1. Debit Alice on Shard A
  2. Credit Bob on Shard B
  Both atomically (either both happen, or neither)

  Mechanism: Two-Phase Commit (2PC) — see 10-distributed-transactions.md

  Cost: At least 2 round-trips (prepare + commit) with fsync at each shard.
  Failure: If coordinator fails during commit, shards are "in doubt" and 
           may block indefinitely (2PC blocking protocol).

Resharding Challenges

As data grows, you need more shards. Resharding (adding shards) is operationally complex:

Online Resharding Process (Cassandra virtual node addition):

  1. Add new node to the ring.
  2. New node takes responsibility for 1/N of each existing node's ranges.
  3. Streaming: existing nodes stream their affected data to the new node.
  4. Once streaming completes, the new node serves traffic for its ranges.
  5. Old nodes delete the transferred data during compaction.

  Challenge: During streaming, the new node's data may be stale.
  Solution: New node may receive both streamed data AND new writes.
            After streaming, repair to reconcile.

  CockroachDB/TiKV automatic resharding:
  Ranges split when they exceed ~64MB.
  New ranges are assigned to nodes with available capacity.
  Rebalancing moves Raft groups to distribute load.
  All transparent to the application.

Vitess (MySQL sharding): Google's Vitess shards MySQL horizontally. It includes VTAdmin (metadata), VTGate (query router), and VTTablet (per-shard MySQL instance). Resharding in Vitess uses a "MoveTables" workflow that migrates data online with verification.

Hot Spot Handling

A hot spot is a shard that receives disproportionately more traffic than others.

Hot Spot Patterns:

  1. Sequential key hot spot (time-series):
     All new records go to the last shard.
     Fix: Use hash-based sharding for write path, 
          or pre-split ranges to spread writes.

  2. Famous entity hot spot:
     @PresidentObama's tweet → millions of reads/writes
     Fix: Application-level caching (CDN, Redis) for hot entities.
          Do NOT route all traffic to one shard.

  3. Workload hot spot:
     "Product launch" puts all writes on product catalog shard.
     Fix: Scale out that shard (add replicas); 
          use Cassandra's ability to have different RF per keyspace.

  DynamoDB adaptive capacity (2019):
    Automatically detects and redistributes hot partitions.
    A partition getting more than its allocation gets additional capacity
    from neighboring idle partitions.
    Transparent to the application.

Shard Rebalancing Algorithms

Manual Rebalancing:
  Admin manually assigns key ranges or vnodes to specific nodes.
  Predictable, but requires operator expertise and intervention.

Automatic Rebalancing (CockroachDB/TiKV style):
  Placement Driver (PD) / Cluster Manager monitors:
  - Storage usage per node
  - Read/write QPS per node
  - Number of ranges (leaders, replicas) per node

  Rebalancing triggers:
  - Node storage > threshold → move ranges off
  - Leader imbalance > threshold → transfer Raft leadership
  - New node joins → assign ranges from heavy nodes

  Rebalancing is range movement:
  1. Add a new replica of range R to target node (Raft adds learner).
  2. Learner catches up with Raft log.
  3. Learner promoted to voter.
  4. Remove old replica from source node.

  This is live migration with zero downtime.

Historical Context

1970s: Horizontal partitioning proposed in early distributed database literature. Charlie Bachman's network database model allows data distribution.

1997: Consistent hashing introduced by Karger, Lehman, Leighton, Panigrahy, Lewin, and Sitaraman at MIT in "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web" (STOC 1997). Originally designed for web caches; became foundational for distributed databases.

2003: Google publishes the Bigtable paper, describing range-based sharding as "tablets."

2006: Amazon Dynamo's consistent hashing with virtual nodes becomes the standard model for leaderless distributed stores.

2007: MongoDB is founded; its chunk-based sharding (released publicly in 2009) brings sharding to a mainstream audience.

2010: Cassandra 0.6 introduces virtual nodes (vnodes), improving the flexibility of consistent hashing for operational management.

2013: Google Vitess (originally built at YouTube for MySQL sharding) is open-sourced.

Production Examples

Discord (Cassandra): Discord shards all data by channel_id as the partition key. Messages in a channel are sorted by message_id (time-ordered). This collocates all messages in a channel, making per-channel queries efficient. Cross-channel queries are rare in Discord's access pattern.

Uber (Schemaless/MySQL sharding): Uber built "Schemaless" on top of MySQL using application-level sharding. Each entity type is sharded by its UUID. A consistent hashing ring maps UUIDs to MySQL shard nodes. The mapping metadata is stored in a separate MySQL cluster.

Airbnb (Vitess): Airbnb migrated to Vitess for MySQL sharding. Their primary shard key is listing_id for listing data, user_id for user data. Cross-shard queries (e.g., "listings visited by user X") are handled by denormalization — storing a list of listing IDs per user.

TiDB (TiKV ranges): PingCAP's TiDB uses TiKV ranges of ~96MB. The Placement Driver monitors range sizes and splits ranges that exceed the threshold. Automatic rebalancing moves ranges based on store capacity and leader count.

Debugging Notes

Diagnosing shard imbalance:

# MongoDB: check chunk distribution
sh.status()
db.adminCommand({listShards: 1})

# Cassandra: check data distribution across nodes
nodetool status  # Shows data size per node
nodetool ring    # Shows token ring assignment

# CockroachDB: check range distribution
SHOW RANGES FROM TABLE orders;
SELECT node_id, count(*) FROM crdb_internal.ranges GROUP BY node_id;

Common sharding bugs:

Non-uniform key distribution: If most keys hash to the same shard (low cardinality), one shard is overloaded. Always check the distribution of your shard key's cardinality before deploying.
Hot key during resharding: During a shard migration, reads may hit both the old and new shard. If the migration stalls, you may serve stale data from the old shard while the new shard is being populated.
Scatter-gather timeouts: A query that scans all shards times out when even one shard is slow. Set aggressive timeouts per shard + retry logic. Consider circuit-breakers per shard.

Security Implications

Data residency: Sharding allows different shards to be in different regions or jurisdictions. Ensure your shard key design is compatible with GDPR/CCPA data residency requirements.
Cross-shard data exfiltration: If an attacker compromises one shard, they get a fraction of the data. If the shard key is predictable (e.g., sequential user IDs), the attacker can infer which IDs are on which shard.
Shard-level access control: Different shards may need different access controls (e.g., European users' data must only be accessed from EU nodes). Your routing layer must enforce this.
Rebalancing data leaks: When data moves between shards during rebalancing, it may pass through nodes that shouldn't see it. Encrypt data at rest and in transit.

Performance Implications

Write throughput scales linearly with the number of shards (assuming no cross-shard coordination and balanced load).
Read throughput scales linearly if queries can be targeted to a single shard.
Scatter-gather queries have latency = max(shard latency) and degrade as N (number of shards) grows.
Cross-shard transactions add 2PC overhead: typically 2× the single-shard latency.
Consistent hashing overhead: The routing layer (mongos, vtgate, application) must hash the key and look up the shard. This adds 0.1–1ms per request for the routing lookup.

Failure Modes and Real Incidents

2013, MongoDB Balancer Bug: MongoDB's chunk balancer ran continuously, moving chunks between shards. A bug caused the balancer to move a chunk that was actively being written to, causing a brief window where writes went to the old location. Some writes were lost. Fix: MongoDB added explicit locking and transaction semantics around chunk migrations.

2016, DynamoDB Hot Partition Throttling: A gaming company's DynamoDB table used game_id as the partition key. A major game launch caused 95% of all reads to hit one game's partition. DynamoDB throttled that partition (partition-level throughput limits). Reads failed. The fix required either adaptive capacity (which Amazon added in 2019) or adding a random suffix to the partition key.

2018, Cassandra Rebalancing Storm: An operator added 10 new nodes to a 50-node Cassandra cluster. The vnodes algorithm moved 10/60 = 16.7% of data. Because all 10 nodes started streaming simultaneously, the cluster was saturated with rebalancing traffic for 6 hours. During this time, write latency was 10× normal. Fix: add nodes one at a time with adequate time between additions.

Modern Usage

PlanetScale (Vitess): MySQL-compatible database as a service built on Vitess. Automatic sharding, resharding, and read-replica management.
Amazon Aurora Serverless v2: Automatically scales storage and compute; sharding is handled internally.
CockroachDB Serverless: Automatic range splitting and rebalancing. The application never needs to think about shards.
Neon (Serverless PostgreSQL): Separates storage from compute; storage layer handles sharding of the WAL.
MongoDB Atlas: Managed MongoDB with automatic sharding, chunk migration, and index management.

Future Directions

Automatic shard key selection: ML-based systems that analyze query patterns and suggest optimal shard keys before deployment.
Elastic sharding: Cloud-native databases that add/remove shards based on traffic, billed by usage. Aurora Serverless, Firestore, and Cosmos DB are early examples.
Disaggregated storage: When storage is shared (NVMe over Fabrics, CXL), "sharding" becomes a routing/compute problem rather than a data placement problem. The storage layer handles distribution.

Exercises

Consistent Hashing Implementation: Implement a consistent hash ring in Python with N physical nodes, each with V virtual nodes. Test: (a) add 3 nodes and check distribution of 10,000 keys, (b) add a 4th node and measure what fraction of keys moved, (c) compare to modulo hashing (100% of keys moved when N changes).
Hot Spot Analysis: For a social media application with users, posts, and likes, design three sharding strategies: (a) shard posts by post_id hash, (b) shard posts by user_id, (c) shard posts by post creation time. For each strategy, identify which access patterns are efficient and which create hot spots. Consider: a celebrity with 10M followers posts a video.
MongoDB Shard Key Selection: Given a collection with fields: user_id, product_id, order_date, order_status. Evaluate each field as a shard key: cardinality, distribution, and impact on the following queries: (a) all orders for user X, (b) all orders for product Y, (c) all pending orders in the last 7 days, (d) all orders over $100. Which shard key would you choose and why?
Cross-Shard Transaction Benchmark: Implement a simple bank transfer simulation with 4 shards. Measure: (a) throughput of single-shard transfers (both accounts on same shard), (b) throughput of cross-shard transfers (2PC), (c) failure rate during cross-shard transfers when a shard becomes unavailable mid-transaction. Compare (a) and (b) latency.
Resharding Simulation: Simulate a 3-shard cluster (with consistent hashing) receiving 10,000 writes per second. After 1 minute, add a 4th shard. Simulate the data migration (each key in the migrating range must be "moved"). Measure: (a) what % of traffic is disrupted during migration, (b) how long migration takes, (c) what would need to change to make the migration zero-downtime.

References

Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Lewin, M., & Sitaraman, R. (1997). "Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web." STOC 1997.
DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store." SOSP 2007.
Chang, F., et al. (2006). "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006.
Lakshman, A., & Malik, P. (2010). "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS, 44(2), 35–40.
Rao, J., Zhang, C., Megiddo, N., & Lohman, G. (2002). "Automating Physical Database Design in a Parallel Database." SIGMOD 2002.
Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.
Bai, B., et al. (2021). "Auto-Partitioning for Cloud Databases: An Experimental Study." VLDB 2021.