Distributed Cache Design

Overview

A distributed cache is one of the highest-leverage components in a production system: it can reduce database load by 90%+, cut read latency from 10ms to sub-millisecond, and enable read throughput that would be physically impossible with a relational database. This document walks through the full design of a distributed cache system — requirements, capacity estimation, consistency hashing, Redis Cluster internals, eviction policies, and the hard problems (hot keys, thundering herd, cache poisoning).

This is framed as a system design exercise (10M requests/sec, 1TB data, 1ms p99 read latency, 99.99% availability) and then grounds each decision in production reality.

Prerequisites

Understanding of consistent hashing (virtual nodes, hash rings)
Redis fundamentals: data structures, persistence, replication
Distributed systems: CAP theorem, eventual consistency, gossip protocols
Network fundamentals: TCP, latency considerations
Understanding of cache eviction and TTL semantics

Historical Context

Early web caching was simple: reverse proxies (Squid, Varnish) caching HTTP responses. Application-level caching began with memcached, developed by Brad Fitzpatrick at LiveJournal in 2003 to cope with MySQL load. Memcached's design — simple hash table, LRU eviction, text protocol — made it easy to understand and extremely fast.

Redis (Remote Dictionary Server) was created by Salvatore Sanfilippo (antirez) in 2009. His innovation over memcached: rich data structures (lists, sets, sorted sets, hashes), persistence (RDB/AOF), and pub/sub. Redis enabled new use cases: session storage, leaderboards, rate limiters, and message queues.

Redis Cluster was released in 2015 after years of development, providing native horizontal sharding without external proxies. Before Redis Cluster, teams used Twemproxy (Twitter's proxy) or client-side consistent hashing.

Memcached remains relevant for pure key-value caching where multi-threaded performance and simplicity matter. Meta (Facebook) runs the world's largest memcached deployment, managing petabytes across hundreds of thousands of servers, described in their famous "Scaling Memcache at Facebook" paper (2013).

Requirements and Capacity Estimation

  Functional requirements:
  - get(key) → value or null
  - set(key, value, ttl)
  - delete(key)
  - High read:write ratio (10:1 typical for web caches)

  Non-functional requirements:
  - Throughput: 10M requests/second
  - Data size: 1TB total
  - Read latency: p99 < 1ms
  - Availability: 99.99% (52 minutes downtime/year)
  - Consistency: eventual (stale reads acceptable for short window)

  Capacity estimation:

  Data sizing:
    1TB total ÷ 100GB per node (reasonable Redis node size) = 10 nodes
    With 3× replication for HA: 30 nodes total
    (10 primary + 20 replicas, 2 replicas per primary)

  Throughput sizing:
    Single Redis node: ~100,000 ops/sec (single-threaded, typical)
    For 10M ops/sec: 10M / 100K = 100 nodes
    With 3× replication: 300 nodes (but replicas can serve reads!)

    Shard model:
    Primary handles: writes + some reads
    Replicas handle: reads (scale reads horizontally)

    Effective read capacity per shard: 3 × 100K = 300K ops/sec
    Shards needed for reads: 10M / 300K = ~34 primary shards
    Nodes needed: 34 primaries × 3 (replication) = 102 nodes

    Final design: ~34 primary shards, 2 replicas each = 102 nodes
    (data per shard: 1TB / 34 ≈ 30GB/primary — comfortable for 100GB nodes)

  Network sizing:
    Average key-value: 1KB
    10M req/sec × 1KB = 10GB/sec network bandwidth
    10GbE NICs per server: 1.25GB/sec
    Servers needed for network: 10GB / 1.25GB = 8 servers minimum
    → Network is not the bottleneck; compute is
    For p99 1ms: must minimize network hops (1 LAN hop, co-located nodes)

Consistent Hashing

Consistent hashing determines which node stores a given key, with the property that when nodes are added/removed, only K/N keys need to be remapped (K = keys, N = nodes):

  Basic hash ring:

  Hash space: 0 to 2^32 - 1 (or 2^64)

  Place nodes on the ring by hashing their node ID:

       0
       │
  NodeC│(hash("node-C") = 10)
       │
       │
  NodeA│(hash("node-A") = 90)
       │
       │
  NodeB│(hash("node-B") = 200)
       │
       ...
  max (2^32 - 1)

  To find which node owns key K:
  1. Compute hash(K)
  2. Walk the ring clockwise until you hit a node
  3. That node owns the key

  hash("user:123") = 150 → walk clockwise → NodeB (200)
  hash("session:abc") = 50 → walk clockwise → NodeA (90)

  Adding a node (NodeD at position 120):
  Only keys with hash in (90, 120] need to move: NodeB → NodeD
  All other keys unaffected. ≈ K/N remapped.

  Problem: without virtual nodes, distribution is uneven.
  Node positions are random → some nodes get 3× the keys of others.

  Virtual nodes (vnodes) solution:

  Each physical node gets V virtual positions on the ring.
  Typical: V = 150-200 vnodes per physical node.

  NodeA creates: NodeA-0, NodeA-1, ..., NodeA-149
  hash("NodeA-0") = 15
  hash("NodeA-1") = 88
  hash("NodeA-2") = 234
  ...

  150 positions × N nodes = 150N points on ring
  Each physical node gets ~150/150N = 1/N fraction of the ring (even distribution)

  Adding a node still only moves K/N keys, but rebalancing is more granular.

Redis Cluster Internals

Redis Cluster uses a different approach: 16,384 fixed hash slots instead of a continuous ring:

  Redis Cluster Hash Slot Assignment:

  key → CRC16(key) % 16384 → slot number (0-16383)

  Slot-to-node mapping stored in cluster state (gossip):

  Node 1 (primary):   slots 0 - 5460
  Node 2 (primary):   slots 5461 - 10922
  Node 3 (primary):   slots 10923 - 16383

  (Each primary has 1-2 replicas for HA)

  Why 16384? (not 2^16 = 65536?)
  "16k slots is enough for 1000 nodes; gossip heartbeat with 16k slots
  fits in 2KB message. 65536 slots would need 8KB heartbeat messages."
  — Antirez (Redis author)

  Hash tags for multi-key operations:

  {user:123}:profile and {user:123}:preferences both hash to slot CRC16("user:123") % 16384
  → guaranteed to be on same node
  → allows MGET/MSET and Lua scripts across related keys

  Without hash tags, related keys may land on different nodes:
  MGET user:123:profile user:123:preferences
  → CROSSSLOT error if on different nodes

Client Interaction: MOVED and ASK

  MOVED redirect (permanent):

  Client sends: GET user:123 to Node 1
  Node 1: slot 9000 is on Node 2 now
  Node 1 responds: -MOVED 9000 10.0.0.2:6379
  Client: updates slot map cache, resends to Node 2

  This happens during/after resharding.
  Smart clients (redis-py, jedis) cache the slot map and go direct.

  ASK redirect (temporary, during resharding):

  During resharding (slot migrating from Node 2 to Node 3):
  - Key may be on Node 2 (not yet migrated) or Node 3 (migrated)

  Client sends: GET user:456 to Node 2
  Key already migrated → Node 2 responds: -ASK 5461 10.0.0.3:6379
  Client: sends ASKING command first, then GET to Node 3
  (ASKING is one-time hint; don't update slot cache permanently)

  Cluster topology gossip:

  Each node sends PING to a random subset of other nodes every second.
  PONG response includes cluster state (slot assignments, node status).
  Failure detection: if node doesn't PONG within cluster-node-timeout (default 15s):
    Mark as PFAIL (probable failure)
    If majority of masters agree → FAIL
    Replica promotion (failover) begins

Cache Eviction Policies

When Redis memory reaches maxmemory, it must evict keys to accept new writes. Policy choice significantly impacts hit rate and application behavior:

  Redis eviction policies:

  noeviction:
    Reject writes when full. Returns OOM error.
    Use: when cache miss is acceptable and data must not be lost.
    NOT suitable for a cache (defeats the purpose).

  allkeys-lru:
    Evict any key using LRU (Least Recently Used).
    Use: general-purpose cache where all keys should be eligible for eviction.
    Best for: uniform access patterns.

  volatile-lru:
    Evict only keys with TTL set, using LRU.
    Keys without TTL are never evicted.
    Use: mixed cache + persistent storage in one Redis instance (anti-pattern in practice).

  allkeys-lfu:
    Evict any key using LFU (Least Frequently Used).
    LFU counts access frequency, not recency.
    Use: Zipfian access patterns (most keys accessed rarely, a few accessed constantly).
    Better than LRU for: "Star Wars Episode VII trailer" effect —
      a viral key is accessed millions of times, then never again.
      LRU keeps it; LFU would evict it after the spike.

  volatile-ttl:
    Evict keys with shortest remaining TTL first.
    Use: when TTL represents the "value" of data.

  allkeys-random:
    Evict random keys.
    Use: uniform access with truly random replacement needed.

  LRU approximation in Redis:
    Redis doesn't maintain a true LRU list (O(N) memory overhead).
    Instead: sample 5 random keys, evict the one with oldest access time.
    maxmemory-samples 10 → better accuracy, more CPU.
    maxmemory-samples 5  → default, good enough for most cases.

Caching Patterns

  Cache-Aside (Lazy Loading):

  Read path:
  1. Check cache: GET user:123
  2. Hit: return value
  3. Miss: query DB, write to cache (SET user:123 <data> EX 3600), return value

  Write path:
  1. Update DB
  2. Invalidate cache: DEL user:123 (NOT write-through, to avoid race conditions)

  Race condition (write-around):
  Thread A: read user:123 → cache miss
  Thread B: update user:123 in DB, DEL user:123 in cache
  Thread A: reads old data from DB, SET user:123 (old data) in cache ← stale!

  Fix: use short TTL + accept brief staleness, OR
       use version-tagged keys (user:123:v5), OR
       use a write lock before read-fill.

  Cache-aside properties:
  + Application controls caching logic
  + Only frequently-read data cached (lazy)
  - Cache miss = 2 I/Os (cache + DB)
  - Stale data window between write and invalidation
  - Cold start: empty cache → all requests hit DB until warmed

  ---

  Read-Through:
  Cache is in front of DB; cache fills itself on miss.
  Application always reads from cache.
  Cache driver handles DB queries.

  + Simpler application code
  + Cache warms on first access
  - Cache layer needs DB access credentials
  - Less flexible than cache-aside

  ---

  Write-Through:
  Write to cache and DB synchronously in same operation.

  Advantages: cache always consistent with DB (no stale window)
  Disadvantages: write latency = cache + DB latency (additive)
                 many written keys never read (wasted cache space)

  ---

  Write-Behind (Write-Back):
  Write to cache only; async flush to DB later.

  Advantages: write latency = cache only (fast writes)
  Risks: data loss if cache crashes before flush (NOT for critical data)
  Use: analytics counters, session data, metrics aggregation

Hot Key Problem (Celebrity Problem)

In any large cache, a small number of keys can receive disproportionate traffic:

  The problem:

  100 cache nodes, evenly distributed keys.
  Normal distribution: 100K req/s / 100 = 1K req/s per node.

  A celebrity (e.g., Taylor Swift) goes trending:
  Key "artist:taylor-swift:profile" → 500K req/s to ONE node.
  → That node CPU 100%, network saturated, latency spikes for all its keys.

  Detection:
  # Redis hot key analysis:
  redis-cli --hotkeys                              # requires maxmemory-policy LFU
  redis-cli monitor | grep "artist:taylor" | wc -l  # real-time (high overhead)

  Solution 1: Client-side local cache

  Each app server has a local in-process cache (HashMap, Guava Cache).
  Store hot keys for 1-5 seconds locally.

  100 app servers × 1 local cache = 100 copies of hot key
  → celebrity traffic distributed across 100 app servers
  → Redis sees only 1 write/5s per app server for refresh

  Tradeoff: up to 5s stale data per app server (usually acceptable).

  Solution 2: Key sharding with random suffix

  Instead of: artist:taylor-swift:profile
  Use:        artist:taylor-swift:profile:{random 0-9}

  Write: replicate to all 10 shards
  Read: pick random suffix → distributed across 10 shards

  Tradeoff: write 10× copies, consistency update complexity.

  Solution 3: CDN for public data

  Cache public read-heavy data at CDN edge (Cloudflare, Fastly).
  → Celebrity profile page served from CDN, never hits Redis.

Thundering Herd on Cold Start

When a cached key expires simultaneously for many clients:

  The problem:

  10,000 web servers all cache "featured_products" with TTL=60s.
  At t=60s: ALL 10,000 servers get cache miss simultaneously.
  All 10,000 send queries to the database: "SELECT * FROM products WHERE featured=true"
  Database overwhelmed, falls over.

  This is the "thundering herd" or "cache stampede" problem.

  Solution 1: Mutex / Single-flight

  When cache miss detected:
    if distributed_lock.acquire("populate:featured_products"):
      # This server populates the cache
      data = db.query(...)
      cache.set("featured_products", data, ttl=60)
      distributed_lock.release(...)
    else:
      # Other servers wait a bit, then re-check cache
      time.sleep(0.1)
      return cache.get("featured_products")  # now populated

  Tradeoff: adds latency for all servers except the one populating.

  Solution 2: Probabilistic Early Expiration (PER)

  Key expires at t=60s, but randomly start refreshing before expiry:

  def should_refresh(ttl_remaining, beta=1):
      # Returns True probabilistically as TTL approaches 0
      return random.random() < beta * math.log(access_count) / ttl_remaining

  In practice: if TTL < 10% of original (6 seconds remaining):
    with probability 1/N_servers: start background refresh

  Result: cache is refreshed before expiry by one server;
          others continue serving valid (slightly stale) data.

  Solution 3: Staggered TTLs

  Add random jitter to TTL:
  ttl = base_ttl + random.randint(-10, 10)  # ± 10 seconds

  Not all 10,000 expirations happen at the same second.
  Simple, effective for cases where staggering is acceptable.

  Solution 4: Background refresh

  A background job refreshes popular keys before they expire.
  Application never gets a miss for hot keys.
  Tradeoff: requires identifying "popular keys" (LFU metrics help).

Monitoring Cache Effectiveness

  Key metrics:

  1. Hit ratio = hits / (hits + misses)
     Target: >95% for most caches
     <80%: investigate cache sizing or TTL settings

  2. Eviction rate = keys_evicted / second
     High eviction rate = cache too small or key cardinality too high

  3. Memory utilization = used_memory / maxmemory
     >90%: approaching eviction; consider scaling

  4. Connected clients: watch for connection pool exhaustion

  5. Replication lag: replica behind primary = stale reads possible

  Redis INFO command:
  redis-cli info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"
  redis-cli info memory | grep -E "used_memory_human|maxmemory_human"
  redis-cli info replication | grep -E "role|connected_slaves|master_replid"

  Prometheus metrics (via redis_exporter):
  redis_keyspace_hits_total
  redis_keyspace_misses_total
  redis_evicted_keys_total
  redis_memory_used_bytes
  redis_connected_clients

  Cache hit ratio alert:
  sum(rate(redis_keyspace_hits_total[5m])) /
  (sum(rate(redis_keyspace_hits_total[5m])) + sum(rate(redis_keyspace_misses_total[5m])))
  < 0.90  →  alert: cache hit rate below 90%

Redis Cluster Topology Diagram

  Redis Cluster with 3 primary shards, 2 replicas each:

  ┌─────────────────────────────────────────────────────────────┐
  │                    Redis Cluster                             │
  │                                                              │
  │  Shard 1: slots 0-5460                                       │
  │  ┌─────────────┐    repl    ┌──────────────┐                │
  │  │  Primary 1  │ ─────────> │  Replica 1a  │                │
  │  │  10.0.0.1   │ ─────────> │  Replica 1b  │                │
  │  └─────────────┘            └──────────────┘                │
  │         ▲                                                    │
  │         │ gossip (PING/PONG)                                 │
  │         │                                                    │
  │  Shard 2: slots 5461-10922                                   │
  │  ┌─────────────┐    repl    ┌──────────────┐                │
  │  │  Primary 2  │ ─────────> │  Replica 2a  │                │
  │  │  10.0.0.2   │ ─────────> │  Replica 2b  │                │
  │  └─────────────┘            └──────────────┘                │
  │         ▲                                                    │
  │         │ gossip                                             │
  │         │                                                    │
  │  Shard 3: slots 10923-16383                                  │
  │  ┌─────────────┐    repl    ┌──────────────┐                │
  │  │  Primary 3  │ ─────────> │  Replica 3a  │                │
  │  │  10.0.0.3   │ ─────────> │  Replica 3b  │                │
  │  └─────────────┘            └──────────────┘                │
  │                                                              │
  │  Failover: if Primary 1 fails:                               │
  │    Replica 1a elected new primary (by replica vote)         │
  │    Replica 1b now replicates from new Primary 1a            │
  │    New slot assignment gossiped to all nodes                 │
  └─────────────────────────────────────────────────────────────┘

  Client with smart routing:
  ┌────────────────────────────────────┐
  │  redis-py / jedis / ioredis        │
  │  Maintains slot map: {0-5460: P1,  │
  │                       5461-10922: P2,│
  │                       10923-16383: P3}│
  │  On MOVED: update slot map + retry │
  │  On PFAIL: retry on replica        │
  └────────────────────────────────────┘

Debugging Notes

# Check cluster health:
redis-cli -c cluster info | grep -E "cluster_state|cluster_slots|cluster_known_nodes"
# cluster_state: ok (or fail if any slot uncovered)
# cluster_slots_ok: should equal 16384

# Check slot assignment:
redis-cli cluster nodes
# Shows: nodeID host:port flags master-id ping-sent pong-recv epoch link-state slots

# Check which node a key is on:
redis-cli cluster keyslot "user:123"  # → slot number
redis-cli cluster slots               # → slot ranges with node IPs

# Debug a MOVED redirect:
redis-cli -c -h 10.0.0.1 -p 6379 get "user:999"
# -c flag: cluster mode (auto-follow MOVED redirects)

# Check memory:
redis-cli info memory
# Look for: used_memory_rss >> used_memory (memory fragmentation)
# mem_fragmentation_ratio > 1.5 → consider MEMORY PURGE or restart

# Find large keys:
redis-cli --bigkeys
# Or for more detail:
redis-cli --memkeys

# Check replication lag:
redis-cli info replication | grep master_repl_offset
# Compare primary offset vs replica offset — difference = bytes behind

Security Implications

Redis has no authentication by default. Always set requirepass or use Redis ACLs (user <name> on ><password> ~* +@all).
Redis should NEVER be exposed to the public internet. The "Redis Crackit" attack (2015) exploited unauthenticated Redis to write SSH keys to /root/.ssh/authorized_keys. Countless Redis instances were compromised.
Use bind 127.0.0.1 or a private VPC subnet; never bind 0.0.0.0 on a public interface.
Cache poisoning: if an attacker can control cache keys or values (via application logic), they can serve malicious content to all users sharing that cache. Validate and sanitize data at application layer before caching.
Separate caches for different sensitivity levels: don't store payment data and public product catalog in the same Redis instance.

Performance Implications

Redis is single-threaded for command execution (I/O threads added in 6.0 for network, but command processing is still serial). One slow operation (KEYS *, SORT on large set, SMEMBERS on huge set) blocks all other commands.
Pipeline commands to reduce round trips: instead of 1000 individual GETs, use MGET (multi-get) or Redis pipeline.
Memory fragmentation (high mem_fragmentation_ratio) wastes RAM. Periodic MEMORY PURGE defragments (Redis 4.0+, enable with activedefrag yes).
Cluster reads from replicas: READONLY command on replica connection. Reduces primary load but accepts slight replication lag.

Failure Modes

Symptom	Cause	Fix
cluster_state: fail	Uncovered slot (node down, no replicas available)	Restore node or add replica
CLUSTERDOWN error	Too many nodes failed — quorum lost	Manual recovery with cluster fix
Hot node CPU spike	Hot key concentrated on one shard	Local cache + key sharding
Memory OOM despite eviction	Fragmentation or large volatile keys	Adjust maxmemory-policy, MEMORY PURGE
Hit rate drops suddenly	Expired TTL mass-eviction or cache flush	Check TTL settings, load spike
Replication lag > 1s	Replica overwhelmed or network congestion	Add replica bandwidth, reduce write rate

Modern Usage

Redis Stack: Extension of Redis with JSON, Search (full-text), Time Series, and probabilistic data structures (Bloom filter, HyperLogLog, Top-K). Enables Redis as a primary data store for certain use cases.
Redis Sentinel vs Cluster: Sentinel provides HA for single-shard Redis (automatic failover, no sharding). Use Sentinel when data fits on one node but HA is needed. Use Cluster when data or throughput requires horizontal sharding.
Dragonfly and KeyDB: Redis-compatible servers with multi-threaded architecture. Dragonfly claims 25× throughput of Redis on same hardware. Early production adoption but not yet as battle-tested.

Future Directions

Redis 8.0+: Continued work on multi-threading, better memory efficiency, and native vector search for AI/ML embedding similarity use cases.
Valkey: Linux Foundation fork of Redis (2024), created after Redis changed its license from BSD to SSPL in 2024. Major cloud providers (AWS, GCP, Azure) have committed to Valkey.
eBPF-accelerated cache bypass: Research into eBPF XDP programs that serve cache responses directly from NIC without kernel networking stack, targeting sub-100 microsecond latency.

Exercises

Set up a 3-node Redis Cluster locally (Docker Compose with 6 containers: 3 primary + 3 replicas). Verify slot assignment, insert 1000 keys, and observe their distribution across shards using redis-cli cluster keyslot.
Implement the thundering herd protection with a distributed mutex (Redis SET NX + EX pattern). Simulate 100 concurrent goroutines/threads hitting the same expired key. Verify only one thread populates the cache.
Reproduce the hot key problem: create one key that receives 95% of traffic in a simulation. Implement local cache (process-level HashMap) to shield Redis from the hot key. Measure Redis QPS before and after.
Configure Redis with maxmemory 100mb and maxmemory-policy allkeys-lfu. Load 200MB of data (using a script). Monitor redis_evicted_keys_total and verify LFU is retaining frequently accessed keys over rarely accessed ones.
Implement consistent hashing from scratch in the language of your choice: build a hash ring, add 5 nodes with 150 virtual nodes each, and verify that adding a 6th node moves approximately 1/6 of keys (K/N remapping property).

References

"Scaling Memcache at Facebook" — Rajesh Nishtala et al., NSDI 2013 (landmark paper on large-scale caching)
Redis Cluster specification: redis.io/docs/reference/cluster-spec/
Antirez (Salvatore Sanfilippo) blog: antirez.com — original design decisions for Redis
"An Analysis of Hash Map Implementations in Popular Languages" — various (understanding hash table internals)
Consistent hashing original paper: "Consistent Hashing and Random Trees" — Karger et al., STOC 1997
Redis documentation: redis.io/docs/
"Redis in Action" — Josiah Carlson, Manning (2013)
"Thundering Herds & Promises" — Instagram Engineering Blog (cache stampede solutions)
"How We Scaled our Cache and Got a Good Night's Sleep" — Shopify Engineering Blog
Valkey project: valkey.io