Redis Internals

Overview

Redis appears deceptively simple from the outside: a key-value store accessible over TCP. Internally, it is a masterwork of performance-conscious systems engineering. Every data structure is implemented with two representations — a compact encoding for small sizes, a full encoding for large sizes — controlled by automatic threshold upgrades. The single-threaded event loop eliminates lock contention entirely. Persistence is handled via fork-and-copy-on-write for near-zero-overhead snapshots. This document dissects Redis from the inside out: memory layout, data structure internals, event loop, persistence, clustering, and eviction.

Prerequisites

C programming language fundamentals
Understanding of memory layout: heap allocation, pointers, struct packing
Linux fork() and copy-on-write semantics
Basic networking: TCP, epoll/kqueue event loops
Understanding of skip lists (probabilistic data structure)

Historical Context

Redis was created by Salvatore Sanfilippo (antirez) in 2009 while building a real-time analytics system for a startup. He needed a data structure server that could handle sorted sets with fast rank queries — something memcached could not do. He wrote Redis in C over a weekend as a prototype and open-sourced it.

Key milestones: - 2009: Redis 1.0. Core data structures: string, list, set, hash, sorted set. - 2010: Twitter adopted Redis for rate limiting. First large-scale production user. - 2012: Redis 2.6. Lua scripting, object encoding improvements. - 2013: Redis 2.8. PSYNC replication, Sentinel HA. - 2015: Redis 3.0. Redis Cluster (native sharding). Biggest release. - 2017: Redis 4.0. RDB+AOF hybrid persistence, LOLWUT, lazy freeing. - 2018: Redis 5.0. Streams data type (Kafka-like log structure). - 2020: Redis 6.0. ACLs, multi-threaded I/O, SSL/TLS support. - 2022: Redis 7.0. Redis Functions, Lua improvements, listpack replaces ziplist. - 2024: Redis changed license from BSD to RSALv2+SSPL. Valkey fork created.

Antirez stepped back from active development in 2020. Redis Ltd. took over maintenance.

Architecture: Single-Threaded Event Loop

Redis processes commands on a single thread. This design eliminates all mutex/lock overhead that plagues multi-threaded servers:

  Redis Main Thread Event Loop:

  ┌─────────────────────────────────────────────────────────────┐
  │                   Redis Main Thread                          │
  │                                                              │
  │  epoll/kqueue event multiplexer                              │
  │  ┌──────────────────────────────────────────────────────┐   │
  │  │  fd 5 (client A): READ ready                         │   │
  │  │  fd 6 (client B): WRITE ready                        │   │
  │  │  fd 7 (AOF file): WRITE ready                        │   │
  │  └──────────────────────────────────────────────────────┘   │
  │         │                                                    │
  │         ▼                                                    │
  │  Process events one at a time:                               │
  │                                                              │
  │  1. Read command bytes from client A                         │
  │  2. Parse RESP protocol (simple, inline, or multibulk)      │
  │  3. Look up command handler in command table                 │
  │  4. Execute command (e.g., GET key):                         │
  │     a. Hash key → dict bucket                               │
  │     b. Follow pointer chain                                  │
  │     c. Check encoding → return value                        │
  │  5. Write response to client output buffer                   │
  │  6. Register output buffer for WRITE event                  │
  │  7. Handle next event                                        │
  │                                                              │
  │  No mutexes. No shared state between threads.               │
  │  All state owned by single thread = zero lock contention.   │
  └─────────────────────────────────────────────────────────────┘

  Threading in Redis 6.0+:

  I/O threads (for network reads/writes only):
  - Read from multiple sockets in parallel
  - Parse RESP protocol in parallel
  - But: command EXECUTION still on main thread

  Background threads (always existed):
  - BIO_CLOSE_FILE: close file descriptors (blocking)
  - BIO_AOF_FSYNC: fsync AOF file (blocking)
  - BIO_LAZY_FREE: free memory asynchronously (UNLINK command)

Simple Dynamic String (SDS)

Every string value in Redis (and all internal byte arrays) uses SDS instead of C strings:

  C string:  pointer to null-terminated char array
  Problem:   O(N) strlen(), no binary safety (null = end), no preallocation

  SDS layout (sds.h):

  struct sdshdr64 {      // for strings > 1GB (rare)
    uint64_t len;        // current string length (not including null)
    uint64_t alloc;      // allocated size (not including header + null)
    unsigned char flags; // SDS type (3 bits)
    char buf[];          // actual string data (flexible array member)
  };

  Smaller types (for typical strings):
  sdshdr5:  len stored in flags (for strings < 32 bytes)
  sdshdr8:  uint8_t len/alloc   (for strings < 256 bytes)
  sdshdr16: uint16_t len/alloc  (for strings < 65535 bytes)
  sdshdr32: uint32_t len/alloc

  Memory layout in memory:

  [header][len][alloc][flags][b][u][f][f][e][r][0x00]
                              ^
                              SDS pointer points here (to buf[0])

  The SDS pointer IS a valid C string pointer (null-terminated).
  SDS header is accessed by going backwards: ptr - sizeof(header).

  Advantages:
  - O(1) length: read len field
  - Binary-safe: len field, not null terminator, defines end
  - Preallocation: alloc > len avoids realloc on append
  - Interoperable with C string functions (pointer is null-terminated)

  SDS growth strategy:
  if (new_len < 1MB):  alloc = new_len * 2    (double)
  if (new_len >= 1MB): alloc = new_len + 1MB  (add 1MB)

  Reduces reallocation amortized cost to O(1) for append patterns.

redisObject: The Universal Value Type

Every value stored in Redis is wrapped in a redisObject:

  redisObject (object.h):

  typedef struct redisObject {
    unsigned type:4;      // 4 bits: OBJ_STRING/OBJ_LIST/OBJ_SET/OBJ_ZSET/OBJ_HASH
    unsigned encoding:4;  // 4 bits: how value is internally encoded
    unsigned lru:24;      // 24 bits: LRU clock (seconds precision) OR LFU counter
    int refcount;         // 32 bits: reference count
    void *ptr;            // 64 bits: pointer to actual value OR inline small int
  } robj;

  Total: 16 bytes per redisObject

  Small integer optimization (OBJ_ENCODING_INT):
  if value is integer in range [-2^63, 2^63-1]:
    ptr = (void*)(long)integer_value  (pointer stores int directly!)
    No heap allocation for small integers.
    Further: integers 0-9999 are pre-allocated (shared objects):
    INCR counter → returns shared object for 0, 1, 2, ..., 9999
    No allocation/deallocation for common counters.

  Encoding progression (automatic, based on element count/size):

  OBJ_STRING:  OBJ_ENCODING_INT  (integer)
               OBJ_ENCODING_EMBSTR  (string <= 44 bytes, embedded in robj)
               OBJ_ENCODING_RAW  (SDS, for large strings)

  OBJ_LIST:    OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
               OBJ_ENCODING_QUICKLIST (larger)

  OBJ_HASH:    OBJ_ENCODING_LISTPACK (≤128 fields, each ≤64 bytes)
               OBJ_ENCODING_HT (hashtable, for larger)

  OBJ_SET:     OBJ_ENCODING_INTSET (small all-integer sets)
               OBJ_ENCODING_LISTPACK (≤128 elements)
               OBJ_ENCODING_HT (larger)

  OBJ_ZSET:    OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
               OBJ_ENCODING_SKIPLIST + HT (larger)

Data Structure Internals

Listpack (formerly Ziplist)

Listpack is a compact sequential encoding for small collections:

  Listpack layout (continuous memory block):

  [total_bytes:4][num_elements:2][entry][entry]...[entry][end:1]

  Each entry:
  [encoding+data][backlen]

  encoding types:
    7-bit uint: 0xxxxxxx  (0-127, 1 byte)
    6-bit string: 10xxxxxx + N bytes  (string up to 63 bytes)
    13-bit int, 16-bit int, 24-bit int, 32-bit int, 64-bit int variants

  backlen: variable-length encoding of previous entry's total size
           (enables backward traversal: LRANGE from end, ZREVRANGE)

  Memory efficiency example:
  Hash with 3 fields {name: Alice, age: 30, city: NYC}

  As listpack: ~30 bytes total
  As hashtable: ~300 bytes (dict_entry × 3, sds × 6, malloc overhead)

  10× memory savings for small hashes.

  Listpack O(N) access: must iterate from start/end to find element.
  Acceptable for small sizes (≤128 elements).
  Upgrade to hashtable/skiplist when threshold exceeded.

Quicklist

List implementation for larger lists:

  Quicklist = doubly-linked list of listpack nodes

  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ listpack │ ↔  │ listpack │ ↔  │ listpack │
  │ [1,2,3,  │    │ [11,12,  │    │ [21,22,  │
  │  4,5,6,  │    │  13,14,  │    │  23,24,  │
  │  7,8,9,  │    │  15,16,  │    │  25,26,  │
  │  10]     │    │  17,18]  │    │  27,28]  │
  └──────────┘    └──────────┘    └──────────┘

  Each listpack node has max fill (list-max-listpack-size).
  default: 128 bytes per listpack or 8192 bytes per node.

  LPUSH/RPUSH: add to head/tail listpack node (O(1))
  LINDEX i: traverse to correct listpack, then linear scan (O(N))
  LRANGE start end: return elements in range (efficient for small slices)

Skiplist + Hashtable for Sorted Set

The sorted set (ZSET) uses two data structures simultaneously:

  ZADD leaderboard 1000 "Alice" 950 "Bob" 900 "Charlie"

  Skiplist (for range queries by score):

  Level 4: [head] ─────────────────────────────> [tail]
  Level 3: [head] ──────────────> [Bob:950] ───> [tail]
  Level 2: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]
  Level 1: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]

  Level chosen randomly at insertion (geometric distribution, p=0.25).
  Expected level: log_4(N) levels for N elements.

  Skiplist supports:
  - ZRANK: O(log N) rank query (count nodes at level 1 up to target)
  - ZRANGEBYSCORE min max: O(log N + K) for K results
  - ZADD / ZREM: O(log N)

  Hashtable (for O(1) member-to-score lookup):

  key = "Alice" → value = 1000.0 (score)
  key = "Bob"   → value = 950.0

  Supports:
  - ZSCORE member: O(1)
  - ZRANK requires skiplist traversal: O(log N)

  Both structures point to the same SDS strings (shared, not copied).
  Memory overhead: ~64 bytes per element (skiplist node + dict entry + pointers)

Stream (Radix Tree of Listpacks)

Redis Streams (5.0+) use a radix tree as the primary index:

  Stream entries identified by ID: <milliseconds>-<sequence>
  e.g., 1700000000000-0, 1700000000000-1, 1700000001000-0

  Radix tree (trie) with listpack leaves:

  IDs with common prefix share tree nodes.
  Entries within a radix tree leaf stored as listpack (compact).

  XADD: append to listpack leaf; split when full.
  XREAD COUNT 100: traverse radix tree, decode listpacks.
  XRANGE id1 id2: range query using radix tree ordering.

  Consumer groups (XREADGROUP):
  - Per-group last-delivered-id tracking
  - PEL (Pending Entries List): messages delivered but not acked
  - XACK removes from PEL

Persistence

RDB (Redis Database) Snapshot

  RDB Process (fork + copy-on-write):

  t=0:  Redis main thread: fork()
        Child process created (copy-on-write sharing all memory)

  t=0+: Main thread: continues serving requests, mutating data
        Child thread: serializes in-memory state to disk

  Copy-on-write (CoW):
        When main thread modifies a page, OS creates a copy for main thread.
        Child still sees original page (snapshot at fork time).

  ┌──────────────┐     fork()     ┌──────────────┐
  │ Main process  │ ─────────────> │ Child process │
  │ (serving req) │               │ (writing RDB) │
  │  modifies pg1 │ CoW: gets copy│  reads pg1    │
  │  modifies pg2 │ CoW: gets copy│  reads pg2    │
  └──────────────┘               └──────────────┘
                                       │
                                       ▼
                                  /var/redis/dump.rdb

  RDB file format:
  [MAGIC:REDIS][VERSION][AUX fields][DB section: key-value pairs][EOF][CRC64 checksum]

  Each key-value: [expiry?][type byte][key][value-encoding]

  Advantages:
  + Fast startup (single-file binary load)
  + Compact (binary encoding, no command log)
  + Point-in-time snapshot

  Disadvantages:
  - Data loss window: everything since last RDB (minutes to hours)
  - fork() overhead on large datasets (CoW page faults can spike latency)

  Configuration:
  save 3600 1      # save if 1 key changed in 3600s
  save 300 100     # save if 100 keys changed in 300s
  save 60 10000    # save if 10000 keys changed in 60s

AOF (Append-Only File)

  AOF records every write command as it executes:

  SET user:123 "Alice"  →  *3\r\n$3\r\nSET\r\n$8\r\nuser:123\r\n$5\r\nAlice\r\n
  INCR counter          →  *2\r\n$4\r\nINCR\r\n$7\r\ncounter\r\n

  AOF fsync policies:

  appendfsync always:     fsync() after every command
    Durability: complete (no data loss)
    Performance: ~1000 writes/sec (disk-limited)

  appendfsync everysec:   background thread fsyncs every second
    Durability: lose up to 1 second of writes
    Performance: ~100,000+ writes/sec (buffered)
    DEFAULT and recommended for most use cases.

  appendfsync no:         OS decides when to flush (usually 30s)
    Durability: lose up to 30 seconds
    Performance: maximum
    Use only for caches, not for persistent data.

  AOF rewrite:
  Over time, AOF grows large (every INCR recorded individually).
  Compaction: BGREWRITEAOF forks a child to write minimal AOF
  (only current state, not full history):

  1000 INCR counter records → SET counter 1000
  Reduces AOF size by 100-1000×.

  RDB + AOF Hybrid (Redis 4.0+, default in 7.0):
  On AOF rewrite: write full RDB snapshot + new commands since snapshot.
  Fast load (RDB) + minimal data loss (AOF since last RDB).
  Recommended for production.

Cluster Gossip and Failover

  Gossip Protocol:

  Each node maintains a clusterNode list for all known nodes.
  Every second: select random subset of nodes, send PING.
  PONG response: includes node's view of cluster state.

  PING payload (gossip section): includes info on 3 random other nodes
  → Cluster state propagates in O(log N) rounds for N nodes.

  Failure detection:

  Node A sends PING to Node B.
  If no PONG within cluster-node-timeout/2:
    A marks B as PFAIL (probable failure).

  A gossips B's PFAIL status to other nodes.
  When majority of primaries have PFAIL for B:
    B transitions to FAIL.
  B's replicas detect FAIL, start election:
    Replica with largest replication offset starts election.
    Requests votes from other primaries (Raft-like).
    Majority vote → replica becomes new primary.
    Gossip new primary announcement to all nodes.

  Manual failover:
  CLUSTER FAILOVER (on replica): gracefully promote replica to primary
  CLUSTER FAILOVER FORCE: skip data consistency check (use if primary unreachable)
  CLUSTER FAILOVER TAKEOVER: skip election (use for disaster recovery)

Eviction Policies

  When maxmemory is reached:

  LRU approximation (default for allkeys-lru):

  lru field in redisObject (24 bits):
    LRU_CLOCK_RESOLUTION = 1000ms
    lru = server.unixtime (current second, truncated to 24 bits)
    → wraps around every 194 days (2^24 seconds / 86400)

  eviction sampling:
    maxmemory-samples keys randomly selected from dict
    evict key with smallest (server.lruclock - obj.lru) % LRU_WRAP
    → approximates true LRU with O(maxmemory-samples) work

  LFU approximation:

  lru field repurposed for LFU:
    upper 16 bits: last_decrement_time (minutes resolution)
    lower 8 bits:  access counter (0-255)

  Counter increment: probabilistic on each access
    p = 1 / (counter * lfu-log-factor + 1)
    → higher counter = lower probability of increment (logarithmic)
    Counter saturates near 255 for very hot keys.

  Counter decay: every lfu-decay-time minutes without access,
    counter -= 1 (accessed_before = minutes_since_last_decrement)

  LFU evicts key with lowest counter (least frequently accessed).
  Handles "Netflix new release" pattern: watched once, never again
  → counter decays → eligible for eviction.

Redis Data Structure Internals ASCII

  Memory layout of HSET myhash field1 value1 field2 value2 (small hash):

  redisObject (16 bytes):
  ┌────────┬──────────┬────────┬──────────┬──────────────────┐
  │ type   │ encoding │  lru   │ refcount │      ptr         │
  │OBJ_HASH│LISTPACK  │ 123456 │    1     │  ──────────────> │
  │  4b    │   4b     │  24b   │  32b     │       64b        │
  └────────┴──────────┴────────┴──────────┴──────────────────┘
                                                    │
                                                    ▼
  Listpack (continuous block in heap):
  ┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───┐
  │total_bytes│num_elem │ f1 enc  │ field1  │ v1 enc  │ value1  │...│
  │   4 bytes │ 2 bytes │ 1 byte  │ 6 bytes │ 1 byte  │ 6 bytes │   │
  └──────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───┘

  Memory for HSET myhash field1 value1:
  16 bytes (redisObject) + ~25 bytes (listpack) = ~41 bytes

  vs. naive dict approach:
  128 bytes (dict struct) + 64×2 (dict entries) + 40×4 (SDS strings) = ~416 bytes

  10× memory efficiency for small hashes.

Debugging Notes

# Memory analysis:
redis-cli info memory
# used_memory: logical memory used by Redis
# used_memory_rss: physical memory (RSS) from OS perspective
# mem_fragmentation_ratio = used_memory_rss / used_memory
# >1.5: significant fragmentation — consider MEMORY PURGE

# Check encoding of a key:
redis-cli object encoding myhash
# Output: listpack or hashtable

# Check object refcount and idle time:
redis-cli object refcount mykey
redis-cli object idletime mykey   # seconds since last access

# Check memory usage of a specific key:
redis-cli memory usage myhash    # includes metadata and pointers
redis-cli memory usage myhash SAMPLES 5  # sample nested structures

# Debug internals (DANGEROUS on production):
redis-cli debug object myhash
# Output: Value at:0x7f4bc0 refcount:1 encoding:listpack serializedlength:42 lru:...

# Check keyspace (warning: KEYS is O(N), blocks main thread):
redis-cli dbsize                 # total key count (safe)
redis-cli scan 0 count 100       # iterate keys safely (non-blocking)

# Check slow queries:
redis-cli slowlog get 10         # last 10 slow commands
redis-cli slowlog len            # total slow log entries
# Default threshold: slowlog-log-slower-than 10000 (10ms)

Security Implications

Redis's KEYS * command is O(N) and blocks the entire server for the duration. Attackers who can issue commands can trivially cause denial of service. Never expose Redis to untrusted callers.
Redis ACLs (6.0+) allow fine-grained access control per user: limit to specific commands and key patterns. Use ACL SETUSER appuser on >password ~app:* +get +set +del to restrict app access.
The CONFIG SET command allows changing Redis configuration at runtime — including dir and dbfilename, which enables writing arbitrary files if an attacker can issue commands. Disable CONFIG SET in ACLs for non-admin users.
Lua scripts execute on the main thread. A malicious Lua script can spin the CPU indefinitely. Set lua-time-limit (default 5000ms) to kill runaway scripts.

Performance Implications

Single-threaded execution means one slow command stalls everything. Profile with SLOWLOG to find them. Common offenders: KEYS *, SORT on large lists, SMEMBERS on large sets.
Pipelining: send multiple commands without waiting for individual responses. Eliminates N-1 round trips for N operations. Redis can process 1M pipelined commands/sec vs 100K individual commands/sec.
Connection pools: each new Redis connection costs a TCP handshake (~1ms). Pool connections at the application layer (redis-py's ConnectionPool, jedis connection pool).
Memory overhead: every key costs ~64-128 bytes overhead (redisObject + dict_entry + SDS metadata), regardless of value size. Millions of small keys can cost GBs in overhead alone — consider hashing small objects into larger hash fields.

Failure Modes

Symptom	Root Cause	Fix
Latency spikes every N seconds	BGSAVE fork causing CoW page faults	Reduce RDB save frequency; use smaller instances
Memory > maxmemory despite eviction	Fragmentation	MEMORY PURGE; restart with activedefrag yes
Replica lag growing	Replication buffer full	Increase client-output-buffer-limit replica
Cluster marked node FAIL too quickly	cluster-node-timeout too low	Increase to 15-30s for busy/slow nodes
AOF grows without rewrite	auto-aof-rewrite-min-size not met	Trigger BGREWRITEAOF manually
Fork fails with ENOMEM	Not enough virtual memory for fork	Enable/increase swap; reduce dataset size

Modern Usage

Redis Stack (RediSearch, RedisJSON, RedisTimeSeries): Extends Redis with secondary indexing, JSON path queries, and time series compression. Used as a primary database for certain workloads.
Redis as a job queue (Sidekiq, Bull, Celery): Millions of background jobs per day. Sorted sets as priority queues; lists as FIFO queues. Producer-consumer via BLPOP (blocking list pop).
Redis Pub/Sub: Simple message broadcast; NOT persisted — messages lost if no subscriber connected. Use Streams for reliable message delivery.
Valkey (2024 fork): Linux Foundation project, compatible drop-in replacement. AWS ElastiCache and GCP Memorystore migrating to Valkey.

Future Directions

Multi-threading for command execution: Valkey is exploring multi-threaded command processing (not just I/O), which would break the single-threaded guarantee but enable linear scaling with core count.
Vector search: Redis Stack's vector similarity search enables KNN queries on embedding vectors — used for AI-powered recommendations, semantic search, and RAG (Retrieval Augmented Generation) applications.
Cluster-wide Lua/Functions: Extending Redis Functions to execute across multiple shards atomically (currently limited to single-slot operations).

Exercises

Instrument Redis memory encoding thresholds: create a hash with 1 field (HSET test f1 v1). Check encoding with OBJECT ENCODING. Add fields until encoding changes from listpack to hashtable. Determine the exact threshold from the server's hash-max-listpack-entries config.
Measure copy-on-write overhead: start a Redis instance with 1GB dataset. Trigger BGSAVE and simultaneously run a write benchmark. Measure RSS memory increase during the fork (CoW page faults). Compare to no-write baseline.
Implement a rate limiter using Redis sorted sets (ZADD + ZREMRANGEBYSCORE + ZCARD) with a sliding window. Test with concurrent requests to verify correct limiting behavior.
Explore the skiplist structure: add 1000 elements to a sorted set (ZADD leaderboard ). Use DEBUG OBJECT leaderboard to see the serialized length. Compare to a hash with same 1000 fields.
Test AOF durability: configure appendfsync always, write 1000 keys, then kill -9 the Redis process. Restart and verify all 1000 keys are present. Repeat with appendfsync everysec and kill during a write burst — verify data loss is bounded to ~1 second.

References

Redis source code: github.com/redis/redis (particularly: src/object.c, src/t_*.c, src/aof.c, src/rdb.c, src/cluster.c, src/ae.c)
antirez blog: antirez.com (design rationale in author's own words)
"Redis internals: SDS" — antirez, 2009
Redis documentation: redis.io/docs/reference/internals/
"Inside Redis: What every developer should know" — Jan-Erik Rediger, 2023
Redis memory optimization guide: redis.io/docs/management/optimization/memory-optimization/
Skiplist original paper: "Skip Lists: A Probabilistic Alternative to Balanced Trees" — William Pugh, 1990
"Design and Implementation of a Key-Value Store" — Salvatore Sanfilippo, 2009 (original Redis thesis)
Valkey: valkey.io
"Scaling Memcache at Facebook" — Nishtala et al., NSDI 2013 (for comparison/contrast with Redis scaling)