Skip to content

Redis Internals

Overview

Redis appears deceptively simple from the outside: a key-value store accessible over TCP. Internally, it is a masterwork of performance-conscious systems engineering. Every data structure is implemented with two representations — a compact encoding for small sizes, a full encoding for large sizes — controlled by automatic threshold upgrades. The single-threaded event loop eliminates lock contention entirely. Persistence is handled via fork-and-copy-on-write for near-zero-overhead snapshots. This document dissects Redis from the inside out: memory layout, data structure internals, event loop, persistence, clustering, and eviction.

Prerequisites

  • C programming language fundamentals
  • Understanding of memory layout: heap allocation, pointers, struct packing
  • Linux fork() and copy-on-write semantics
  • Basic networking: TCP, epoll/kqueue event loops
  • Understanding of skip lists (probabilistic data structure)

Historical Context

Redis was created by Salvatore Sanfilippo (antirez) in 2009 while building a real-time analytics system for a startup. He needed a data structure server that could handle sorted sets with fast rank queries — something memcached could not do. He wrote Redis in C over a weekend as a prototype and open-sourced it.

Key milestones: - 2009: Redis 1.0. Core data structures: string, list, set, hash, sorted set. - 2010: Twitter adopted Redis for rate limiting. First large-scale production user. - 2012: Redis 2.6. Lua scripting, object encoding improvements. - 2013: Redis 2.8. PSYNC replication, Sentinel HA. - 2015: Redis 3.0. Redis Cluster (native sharding). Biggest release. - 2017: Redis 4.0. RDB+AOF hybrid persistence, LOLWUT, lazy freeing. - 2018: Redis 5.0. Streams data type (Kafka-like log structure). - 2020: Redis 6.0. ACLs, multi-threaded I/O, SSL/TLS support. - 2022: Redis 7.0. Redis Functions, Lua improvements, listpack replaces ziplist. - 2024: Redis changed license from BSD to RSALv2+SSPL. Valkey fork created.

Antirez stepped back from active development in 2020. Redis Ltd. took over maintenance.

Architecture: Single-Threaded Event Loop

Redis processes commands on a single thread. This design eliminates all mutex/lock overhead that plagues multi-threaded servers:

  Redis Main Thread Event Loop:

  ┌─────────────────────────────────────────────────────────────┐
  │                   Redis Main Thread                          │
  │                                                              │
  │  epoll/kqueue event multiplexer                              │
  │  ┌──────────────────────────────────────────────────────┐   │
  │  │  fd 5 (client A): READ ready                         │   │
  │  │  fd 6 (client B): WRITE ready                        │   │
  │  │  fd 7 (AOF file): WRITE ready                        │   │
  │  └──────────────────────────────────────────────────────┘   │
  │         │                                                    │
  │         ▼                                                    │
  │  Process events one at a time:                               │
  │                                                              │
  │  1. Read command bytes from client A                         │
  │  2. Parse RESP protocol (simple, inline, or multibulk)      │
  │  3. Look up command handler in command table                 │
  │  4. Execute command (e.g., GET key):                         │
  │     a. Hash key → dict bucket                               │
  │     b. Follow pointer chain                                  │
  │     c. Check encoding → return value                        │
  │  5. Write response to client output buffer                   │
  │  6. Register output buffer for WRITE event                  │
  │  7. Handle next event                                        │
  │                                                              │
  │  No mutexes. No shared state between threads.               │
  │  All state owned by single thread = zero lock contention.   │
  └─────────────────────────────────────────────────────────────┘

  Threading in Redis 6.0+:

  I/O threads (for network reads/writes only):
  - Read from multiple sockets in parallel
  - Parse RESP protocol in parallel
  - But: command EXECUTION still on main thread

  Background threads (always existed):
  - BIO_CLOSE_FILE: close file descriptors (blocking)
  - BIO_AOF_FSYNC: fsync AOF file (blocking)
  - BIO_LAZY_FREE: free memory asynchronously (UNLINK command)

Simple Dynamic String (SDS)

Every string value in Redis (and all internal byte arrays) uses SDS instead of C strings:

  C string:  pointer to null-terminated char array
  Problem:   O(N) strlen(), no binary safety (null = end), no preallocation

  SDS layout (sds.h):

  struct sdshdr64 {      // for strings > 1GB (rare)
    uint64_t len;        // current string length (not including null)
    uint64_t alloc;      // allocated size (not including header + null)
    unsigned char flags; // SDS type (3 bits)
    char buf[];          // actual string data (flexible array member)
  };

  Smaller types (for typical strings):
  sdshdr5:  len stored in flags (for strings < 32 bytes)
  sdshdr8:  uint8_t len/alloc   (for strings < 256 bytes)
  sdshdr16: uint16_t len/alloc  (for strings < 65535 bytes)
  sdshdr32: uint32_t len/alloc

  Memory layout in memory:

  [header][len][alloc][flags][b][u][f][f][e][r][0x00]
                              ^
                              SDS pointer points here (to buf[0])

  The SDS pointer IS a valid C string pointer (null-terminated).
  SDS header is accessed by going backwards: ptr - sizeof(header).

  Advantages:
  - O(1) length: read len field
  - Binary-safe: len field, not null terminator, defines end
  - Preallocation: alloc > len avoids realloc on append
  - Interoperable with C string functions (pointer is null-terminated)

  SDS growth strategy:
  if (new_len < 1MB):  alloc = new_len * 2    (double)
  if (new_len >= 1MB): alloc = new_len + 1MB  (add 1MB)

  Reduces reallocation amortized cost to O(1) for append patterns.

redisObject: The Universal Value Type

Every value stored in Redis is wrapped in a redisObject:

  redisObject (object.h):

  typedef struct redisObject {
    unsigned type:4;      // 4 bits: OBJ_STRING/OBJ_LIST/OBJ_SET/OBJ_ZSET/OBJ_HASH
    unsigned encoding:4;  // 4 bits: how value is internally encoded
    unsigned lru:24;      // 24 bits: LRU clock (seconds precision) OR LFU counter
    int refcount;         // 32 bits: reference count
    void *ptr;            // 64 bits: pointer to actual value OR inline small int
  } robj;

  Total: 16 bytes per redisObject

  Small integer optimization (OBJ_ENCODING_INT):
  if value is integer in range [-2^63, 2^63-1]:
    ptr = (void*)(long)integer_value  (pointer stores int directly!)
    No heap allocation for small integers.
    Further: integers 0-9999 are pre-allocated (shared objects):
    INCR counter → returns shared object for 0, 1, 2, ..., 9999
    No allocation/deallocation for common counters.

  Encoding progression (automatic, based on element count/size):

  OBJ_STRING:  OBJ_ENCODING_INT  (integer)
               OBJ_ENCODING_EMBSTR  (string <= 44 bytes, embedded in robj)
               OBJ_ENCODING_RAW  (SDS, for large strings)

  OBJ_LIST:    OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
               OBJ_ENCODING_QUICKLIST (larger)

  OBJ_HASH:    OBJ_ENCODING_LISTPACK (≤128 fields, each ≤64 bytes)
               OBJ_ENCODING_HT (hashtable, for larger)

  OBJ_SET:     OBJ_ENCODING_INTSET (small all-integer sets)
               OBJ_ENCODING_LISTPACK (≤128 elements)
               OBJ_ENCODING_HT (larger)

  OBJ_ZSET:    OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
               OBJ_ENCODING_SKIPLIST + HT (larger)

Data Structure Internals

Listpack (formerly Ziplist)

Listpack is a compact sequential encoding for small collections:

  Listpack layout (continuous memory block):

  [total_bytes:4][num_elements:2][entry][entry]...[entry][end:1]

  Each entry:
  [encoding+data][backlen]

  encoding types:
    7-bit uint: 0xxxxxxx  (0-127, 1 byte)
    6-bit string: 10xxxxxx + N bytes  (string up to 63 bytes)
    13-bit int, 16-bit int, 24-bit int, 32-bit int, 64-bit int variants

  backlen: variable-length encoding of previous entry's total size
           (enables backward traversal: LRANGE from end, ZREVRANGE)

  Memory efficiency example:
  Hash with 3 fields {name: Alice, age: 30, city: NYC}

  As listpack: ~30 bytes total
  As hashtable: ~300 bytes (dict_entry × 3, sds × 6, malloc overhead)

  10× memory savings for small hashes.

  Listpack O(N) access: must iterate from start/end to find element.
  Acceptable for small sizes (≤128 elements).
  Upgrade to hashtable/skiplist when threshold exceeded.

Quicklist

List implementation for larger lists:

  Quicklist = doubly-linked list of listpack nodes

  ┌──────────┐    ┌──────────┐    ┌──────────┐
  │ listpack │ ↔  │ listpack │ ↔  │ listpack │
  │ [1,2,3,  │    │ [11,12,  │    │ [21,22,  │
  │  4,5,6,  │    │  13,14,  │    │  23,24,  │
  │  7,8,9,  │    │  15,16,  │    │  25,26,  │
  │  10]     │    │  17,18]  │    │  27,28]  │
  └──────────┘    └──────────┘    └──────────┘

  Each listpack node has max fill (list-max-listpack-size).
  default: 128 bytes per listpack or 8192 bytes per node.

  LPUSH/RPUSH: add to head/tail listpack node (O(1))
  LINDEX i: traverse to correct listpack, then linear scan (O(N))
  LRANGE start end: return elements in range (efficient for small slices)

Skiplist + Hashtable for Sorted Set

The sorted set (ZSET) uses two data structures simultaneously:

  ZADD leaderboard 1000 "Alice" 950 "Bob" 900 "Charlie"

  Skiplist (for range queries by score):

  Level 4: [head] ─────────────────────────────> [tail]
  Level 3: [head] ──────────────> [Bob:950] ───> [tail]
  Level 2: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]
  Level 1: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]

  Level chosen randomly at insertion (geometric distribution, p=0.25).
  Expected level: log_4(N) levels for N elements.

  Skiplist supports:
  - ZRANK: O(log N) rank query (count nodes at level 1 up to target)
  - ZRANGEBYSCORE min max: O(log N + K) for K results
  - ZADD / ZREM: O(log N)

  Hashtable (for O(1) member-to-score lookup):

  key = "Alice" → value = 1000.0 (score)
  key = "Bob"   → value = 950.0

  Supports:
  - ZSCORE member: O(1)
  - ZRANK requires skiplist traversal: O(log N)

  Both structures point to the same SDS strings (shared, not copied).
  Memory overhead: ~64 bytes per element (skiplist node + dict entry + pointers)

Stream (Radix Tree of Listpacks)

Redis Streams (5.0+) use a radix tree as the primary index:

  Stream entries identified by ID: <milliseconds>-<sequence>
  e.g., 1700000000000-0, 1700000000000-1, 1700000001000-0

  Radix tree (trie) with listpack leaves:

  IDs with common prefix share tree nodes.
  Entries within a radix tree leaf stored as listpack (compact).

  XADD: append to listpack leaf; split when full.
  XREAD COUNT 100: traverse radix tree, decode listpacks.
  XRANGE id1 id2: range query using radix tree ordering.

  Consumer groups (XREADGROUP):
  - Per-group last-delivered-id tracking
  - PEL (Pending Entries List): messages delivered but not acked
  - XACK removes from PEL

Persistence

RDB (Redis Database) Snapshot

  RDB Process (fork + copy-on-write):

  t=0:  Redis main thread: fork()
        Child process created (copy-on-write sharing all memory)

  t=0+: Main thread: continues serving requests, mutating data
        Child thread: serializes in-memory state to disk

  Copy-on-write (CoW):
        When main thread modifies a page, OS creates a copy for main thread.
        Child still sees original page (snapshot at fork time).

  ┌──────────────┐     fork()     ┌──────────────┐
  │ Main process  │ ─────────────> │ Child process │
  │ (serving req) │               │ (writing RDB) │
  │  modifies pg1 │ CoW: gets copy│  reads pg1    │
  │  modifies pg2 │ CoW: gets copy│  reads pg2    │
  └──────────────┘               └──────────────┘
                                       │
                                       ▼
                                  /var/redis/dump.rdb

  RDB file format:
  [MAGIC:REDIS][VERSION][AUX fields][DB section: key-value pairs][EOF][CRC64 checksum]

  Each key-value: [expiry?][type byte][key][value-encoding]

  Advantages:
  + Fast startup (single-file binary load)
  + Compact (binary encoding, no command log)
  + Point-in-time snapshot

  Disadvantages:
  - Data loss window: everything since last RDB (minutes to hours)
  - fork() overhead on large datasets (CoW page faults can spike latency)

  Configuration:
  save 3600 1      # save if 1 key changed in 3600s
  save 300 100     # save if 100 keys changed in 300s
  save 60 10000    # save if 10000 keys changed in 60s

AOF (Append-Only File)

  AOF records every write command as it executes:

  SET user:123 "Alice"  →  *3\r\n$3\r\nSET\r\n$8\r\nuser:123\r\n$5\r\nAlice\r\n
  INCR counter          →  *2\r\n$4\r\nINCR\r\n$7\r\ncounter\r\n

  AOF fsync policies:

  appendfsync always:     fsync() after every command
    Durability: complete (no data loss)
    Performance: ~1000 writes/sec (disk-limited)

  appendfsync everysec:   background thread fsyncs every second
    Durability: lose up to 1 second of writes
    Performance: ~100,000+ writes/sec (buffered)
    DEFAULT and recommended for most use cases.

  appendfsync no:         OS decides when to flush (usually 30s)
    Durability: lose up to 30 seconds
    Performance: maximum
    Use only for caches, not for persistent data.

  AOF rewrite:
  Over time, AOF grows large (every INCR recorded individually).
  Compaction: BGREWRITEAOF forks a child to write minimal AOF
  (only current state, not full history):

  1000 INCR counter records → SET counter 1000
  Reduces AOF size by 100-1000×.

  RDB + AOF Hybrid (Redis 4.0+, default in 7.0):
  On AOF rewrite: write full RDB snapshot + new commands since snapshot.
  Fast load (RDB) + minimal data loss (AOF since last RDB).
  Recommended for production.

Cluster Gossip and Failover

  Gossip Protocol:

  Each node maintains a clusterNode list for all known nodes.
  Every second: select random subset of nodes, send PING.
  PONG response: includes node's view of cluster state.

  PING payload (gossip section): includes info on 3 random other nodes
  → Cluster state propagates in O(log N) rounds for N nodes.

  Failure detection:

  Node A sends PING to Node B.
  If no PONG within cluster-node-timeout/2:
    A marks B as PFAIL (probable failure).

  A gossips B's PFAIL status to other nodes.
  When majority of primaries have PFAIL for B:
    B transitions to FAIL.
  B's replicas detect FAIL, start election:
    Replica with largest replication offset starts election.
    Requests votes from other primaries (Raft-like).
    Majority vote → replica becomes new primary.
    Gossip new primary announcement to all nodes.

  Manual failover:
  CLUSTER FAILOVER (on replica): gracefully promote replica to primary
  CLUSTER FAILOVER FORCE: skip data consistency check (use if primary unreachable)
  CLUSTER FAILOVER TAKEOVER: skip election (use for disaster recovery)

Eviction Policies

  When maxmemory is reached:

  LRU approximation (default for allkeys-lru):

  lru field in redisObject (24 bits):
    LRU_CLOCK_RESOLUTION = 1000ms
    lru = server.unixtime (current second, truncated to 24 bits)
    → wraps around every 194 days (2^24 seconds / 86400)

  eviction sampling:
    maxmemory-samples keys randomly selected from dict
    evict key with smallest (server.lruclock - obj.lru) % LRU_WRAP
    → approximates true LRU with O(maxmemory-samples) work

  LFU approximation:

  lru field repurposed for LFU:
    upper 16 bits: last_decrement_time (minutes resolution)
    lower 8 bits:  access counter (0-255)

  Counter increment: probabilistic on each access
    p = 1 / (counter * lfu-log-factor + 1)
    → higher counter = lower probability of increment (logarithmic)
    Counter saturates near 255 for very hot keys.

  Counter decay: every lfu-decay-time minutes without access,
    counter -= 1 (accessed_before = minutes_since_last_decrement)

  LFU evicts key with lowest counter (least frequently accessed).
  Handles "Netflix new release" pattern: watched once, never again
  → counter decays → eligible for eviction.

Redis Data Structure Internals ASCII

  Memory layout of HSET myhash field1 value1 field2 value2 (small hash):

  redisObject (16 bytes):
  ┌────────┬──────────┬────────┬──────────┬──────────────────┐
  │ type   │ encoding │  lru   │ refcount │      ptr         │
  │OBJ_HASH│LISTPACK  │ 123456 │    1     │  ──────────────> │
  │  4b    │   4b     │  24b   │  32b     │       64b        │
  └────────┴──────────┴────────┴──────────┴──────────────────┘
                                                    │
                                                    ▼
  Listpack (continuous block in heap):
  ┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───┐
  │total_bytes│num_elem │ f1 enc  │ field1  │ v1 enc  │ value1  │...│
  │   4 bytes │ 2 bytes │ 1 byte  │ 6 bytes │ 1 byte  │ 6 bytes │   │
  └──────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───┘

  Memory for HSET myhash field1 value1:
  16 bytes (redisObject) + ~25 bytes (listpack) = ~41 bytes

  vs. naive dict approach:
  128 bytes (dict struct) + 64×2 (dict entries) + 40×4 (SDS strings) = ~416 bytes

  10× memory efficiency for small hashes.

Debugging Notes

# Memory analysis:
redis-cli info memory
# used_memory: logical memory used by Redis
# used_memory_rss: physical memory (RSS) from OS perspective
# mem_fragmentation_ratio = used_memory_rss / used_memory
# >1.5: significant fragmentation — consider MEMORY PURGE

# Check encoding of a key:
redis-cli object encoding myhash
# Output: listpack or hashtable

# Check object refcount and idle time:
redis-cli object refcount mykey
redis-cli object idletime mykey   # seconds since last access

# Check memory usage of a specific key:
redis-cli memory usage myhash    # includes metadata and pointers
redis-cli memory usage myhash SAMPLES 5  # sample nested structures

# Debug internals (DANGEROUS on production):
redis-cli debug object myhash
# Output: Value at:0x7f4bc0 refcount:1 encoding:listpack serializedlength:42 lru:...

# Check keyspace (warning: KEYS is O(N), blocks main thread):
redis-cli dbsize                 # total key count (safe)
redis-cli scan 0 count 100       # iterate keys safely (non-blocking)

# Check slow queries:
redis-cli slowlog get 10         # last 10 slow commands
redis-cli slowlog len            # total slow log entries
# Default threshold: slowlog-log-slower-than 10000 (10ms)

Security Implications

  • Redis's KEYS * command is O(N) and blocks the entire server for the duration. Attackers who can issue commands can trivially cause denial of service. Never expose Redis to untrusted callers.
  • Redis ACLs (6.0+) allow fine-grained access control per user: limit to specific commands and key patterns. Use ACL SETUSER appuser on >password ~app:* +get +set +del to restrict app access.
  • The CONFIG SET command allows changing Redis configuration at runtime — including dir and dbfilename, which enables writing arbitrary files if an attacker can issue commands. Disable CONFIG SET in ACLs for non-admin users.
  • Lua scripts execute on the main thread. A malicious Lua script can spin the CPU indefinitely. Set lua-time-limit (default 5000ms) to kill runaway scripts.

Performance Implications

  • Single-threaded execution means one slow command stalls everything. Profile with SLOWLOG to find them. Common offenders: KEYS *, SORT on large lists, SMEMBERS on large sets.
  • Pipelining: send multiple commands without waiting for individual responses. Eliminates N-1 round trips for N operations. Redis can process 1M pipelined commands/sec vs 100K individual commands/sec.
  • Connection pools: each new Redis connection costs a TCP handshake (~1ms). Pool connections at the application layer (redis-py's ConnectionPool, jedis connection pool).
  • Memory overhead: every key costs ~64-128 bytes overhead (redisObject + dict_entry + SDS metadata), regardless of value size. Millions of small keys can cost GBs in overhead alone — consider hashing small objects into larger hash fields.

Failure Modes

Symptom Root Cause Fix
Latency spikes every N seconds BGSAVE fork causing CoW page faults Reduce RDB save frequency; use smaller instances
Memory > maxmemory despite eviction Fragmentation MEMORY PURGE; restart with activedefrag yes
Replica lag growing Replication buffer full Increase client-output-buffer-limit replica
Cluster marked node FAIL too quickly cluster-node-timeout too low Increase to 15-30s for busy/slow nodes
AOF grows without rewrite auto-aof-rewrite-min-size not met Trigger BGREWRITEAOF manually
Fork fails with ENOMEM Not enough virtual memory for fork Enable/increase swap; reduce dataset size

Modern Usage

  • Redis Stack (RediSearch, RedisJSON, RedisTimeSeries): Extends Redis with secondary indexing, JSON path queries, and time series compression. Used as a primary database for certain workloads.
  • Redis as a job queue (Sidekiq, Bull, Celery): Millions of background jobs per day. Sorted sets as priority queues; lists as FIFO queues. Producer-consumer via BLPOP (blocking list pop).
  • Redis Pub/Sub: Simple message broadcast; NOT persisted — messages lost if no subscriber connected. Use Streams for reliable message delivery.
  • Valkey (2024 fork): Linux Foundation project, compatible drop-in replacement. AWS ElastiCache and GCP Memorystore migrating to Valkey.

Future Directions

  • Multi-threading for command execution: Valkey is exploring multi-threaded command processing (not just I/O), which would break the single-threaded guarantee but enable linear scaling with core count.
  • Vector search: Redis Stack's vector similarity search enables KNN queries on embedding vectors — used for AI-powered recommendations, semantic search, and RAG (Retrieval Augmented Generation) applications.
  • Cluster-wide Lua/Functions: Extending Redis Functions to execute across multiple shards atomically (currently limited to single-slot operations).

Exercises

  1. Instrument Redis memory encoding thresholds: create a hash with 1 field (HSET test f1 v1). Check encoding with OBJECT ENCODING. Add fields until encoding changes from listpack to hashtable. Determine the exact threshold from the server's hash-max-listpack-entries config.

  2. Measure copy-on-write overhead: start a Redis instance with 1GB dataset. Trigger BGSAVE and simultaneously run a write benchmark. Measure RSS memory increase during the fork (CoW page faults). Compare to no-write baseline.

  3. Implement a rate limiter using Redis sorted sets (ZADD + ZREMRANGEBYSCORE + ZCARD) with a sliding window. Test with concurrent requests to verify correct limiting behavior.

  4. Explore the skiplist structure: add 1000 elements to a sorted set (ZADD leaderboard ). Use DEBUG OBJECT leaderboard to see the serialized length. Compare to a hash with same 1000 fields.

  5. Test AOF durability: configure appendfsync always, write 1000 keys, then kill -9 the Redis process. Restart and verify all 1000 keys are present. Repeat with appendfsync everysec and kill during a write burst — verify data loss is bounded to ~1 second.

References

  • Redis source code: github.com/redis/redis (particularly: src/object.c, src/t_*.c, src/aof.c, src/rdb.c, src/cluster.c, src/ae.c)
  • antirez blog: antirez.com (design rationale in author's own words)
  • "Redis internals: SDS" — antirez, 2009
  • Redis documentation: redis.io/docs/reference/internals/
  • "Inside Redis: What every developer should know" — Jan-Erik Rediger, 2023
  • Redis memory optimization guide: redis.io/docs/management/optimization/memory-optimization/
  • Skiplist original paper: "Skip Lists: A Probabilistic Alternative to Balanced Trees" — William Pugh, 1990
  • "Design and Implementation of a Key-Value Store" — Salvatore Sanfilippo, 2009 (original Redis thesis)
  • Valkey: valkey.io
  • "Scaling Memcache at Facebook" — Nishtala et al., NSDI 2013 (for comparison/contrast with Redis scaling)