Redis Internals
Overview
Redis appears deceptively simple from the outside: a key-value store accessible over TCP. Internally, it is a masterwork of performance-conscious systems engineering. Every data structure is implemented with two representations — a compact encoding for small sizes, a full encoding for large sizes — controlled by automatic threshold upgrades. The single-threaded event loop eliminates lock contention entirely. Persistence is handled via fork-and-copy-on-write for near-zero-overhead snapshots. This document dissects Redis from the inside out: memory layout, data structure internals, event loop, persistence, clustering, and eviction.
Prerequisites
- C programming language fundamentals
- Understanding of memory layout: heap allocation, pointers, struct packing
- Linux fork() and copy-on-write semantics
- Basic networking: TCP, epoll/kqueue event loops
- Understanding of skip lists (probabilistic data structure)
Historical Context
Redis was created by Salvatore Sanfilippo (antirez) in 2009 while building a real-time analytics system for a startup. He needed a data structure server that could handle sorted sets with fast rank queries — something memcached could not do. He wrote Redis in C over a weekend as a prototype and open-sourced it.
Key milestones: - 2009: Redis 1.0. Core data structures: string, list, set, hash, sorted set. - 2010: Twitter adopted Redis for rate limiting. First large-scale production user. - 2012: Redis 2.6. Lua scripting, object encoding improvements. - 2013: Redis 2.8. PSYNC replication, Sentinel HA. - 2015: Redis 3.0. Redis Cluster (native sharding). Biggest release. - 2017: Redis 4.0. RDB+AOF hybrid persistence, LOLWUT, lazy freeing. - 2018: Redis 5.0. Streams data type (Kafka-like log structure). - 2020: Redis 6.0. ACLs, multi-threaded I/O, SSL/TLS support. - 2022: Redis 7.0. Redis Functions, Lua improvements, listpack replaces ziplist. - 2024: Redis changed license from BSD to RSALv2+SSPL. Valkey fork created.
Antirez stepped back from active development in 2020. Redis Ltd. took over maintenance.
Architecture: Single-Threaded Event Loop
Redis processes commands on a single thread. This design eliminates all mutex/lock overhead that plagues multi-threaded servers:
Redis Main Thread Event Loop:
┌─────────────────────────────────────────────────────────────┐
│ Redis Main Thread │
│ │
│ epoll/kqueue event multiplexer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ fd 5 (client A): READ ready │ │
│ │ fd 6 (client B): WRITE ready │ │
│ │ fd 7 (AOF file): WRITE ready │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Process events one at a time: │
│ │
│ 1. Read command bytes from client A │
│ 2. Parse RESP protocol (simple, inline, or multibulk) │
│ 3. Look up command handler in command table │
│ 4. Execute command (e.g., GET key): │
│ a. Hash key → dict bucket │
│ b. Follow pointer chain │
│ c. Check encoding → return value │
│ 5. Write response to client output buffer │
│ 6. Register output buffer for WRITE event │
│ 7. Handle next event │
│ │
│ No mutexes. No shared state between threads. │
│ All state owned by single thread = zero lock contention. │
└─────────────────────────────────────────────────────────────┘
Threading in Redis 6.0+:
I/O threads (for network reads/writes only):
- Read from multiple sockets in parallel
- Parse RESP protocol in parallel
- But: command EXECUTION still on main thread
Background threads (always existed):
- BIO_CLOSE_FILE: close file descriptors (blocking)
- BIO_AOF_FSYNC: fsync AOF file (blocking)
- BIO_LAZY_FREE: free memory asynchronously (UNLINK command)
Simple Dynamic String (SDS)
Every string value in Redis (and all internal byte arrays) uses SDS instead of C strings:
C string: pointer to null-terminated char array
Problem: O(N) strlen(), no binary safety (null = end), no preallocation
SDS layout (sds.h):
struct sdshdr64 { // for strings > 1GB (rare)
uint64_t len; // current string length (not including null)
uint64_t alloc; // allocated size (not including header + null)
unsigned char flags; // SDS type (3 bits)
char buf[]; // actual string data (flexible array member)
};
Smaller types (for typical strings):
sdshdr5: len stored in flags (for strings < 32 bytes)
sdshdr8: uint8_t len/alloc (for strings < 256 bytes)
sdshdr16: uint16_t len/alloc (for strings < 65535 bytes)
sdshdr32: uint32_t len/alloc
Memory layout in memory:
[header][len][alloc][flags][b][u][f][f][e][r][0x00]
^
SDS pointer points here (to buf[0])
The SDS pointer IS a valid C string pointer (null-terminated).
SDS header is accessed by going backwards: ptr - sizeof(header).
Advantages:
- O(1) length: read len field
- Binary-safe: len field, not null terminator, defines end
- Preallocation: alloc > len avoids realloc on append
- Interoperable with C string functions (pointer is null-terminated)
SDS growth strategy:
if (new_len < 1MB): alloc = new_len * 2 (double)
if (new_len >= 1MB): alloc = new_len + 1MB (add 1MB)
Reduces reallocation amortized cost to O(1) for append patterns.
redisObject: The Universal Value Type
Every value stored in Redis is wrapped in a redisObject:
redisObject (object.h):
typedef struct redisObject {
unsigned type:4; // 4 bits: OBJ_STRING/OBJ_LIST/OBJ_SET/OBJ_ZSET/OBJ_HASH
unsigned encoding:4; // 4 bits: how value is internally encoded
unsigned lru:24; // 24 bits: LRU clock (seconds precision) OR LFU counter
int refcount; // 32 bits: reference count
void *ptr; // 64 bits: pointer to actual value OR inline small int
} robj;
Total: 16 bytes per redisObject
Small integer optimization (OBJ_ENCODING_INT):
if value is integer in range [-2^63, 2^63-1]:
ptr = (void*)(long)integer_value (pointer stores int directly!)
No heap allocation for small integers.
Further: integers 0-9999 are pre-allocated (shared objects):
INCR counter → returns shared object for 0, 1, 2, ..., 9999
No allocation/deallocation for common counters.
Encoding progression (automatic, based on element count/size):
OBJ_STRING: OBJ_ENCODING_INT (integer)
OBJ_ENCODING_EMBSTR (string <= 44 bytes, embedded in robj)
OBJ_ENCODING_RAW (SDS, for large strings)
OBJ_LIST: OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
OBJ_ENCODING_QUICKLIST (larger)
OBJ_HASH: OBJ_ENCODING_LISTPACK (≤128 fields, each ≤64 bytes)
OBJ_ENCODING_HT (hashtable, for larger)
OBJ_SET: OBJ_ENCODING_INTSET (small all-integer sets)
OBJ_ENCODING_LISTPACK (≤128 elements)
OBJ_ENCODING_HT (larger)
OBJ_ZSET: OBJ_ENCODING_LISTPACK (≤128 elements, each ≤64 bytes)
OBJ_ENCODING_SKIPLIST + HT (larger)
Data Structure Internals
Listpack (formerly Ziplist)
Listpack is a compact sequential encoding for small collections:
Listpack layout (continuous memory block):
[total_bytes:4][num_elements:2][entry][entry]...[entry][end:1]
Each entry:
[encoding+data][backlen]
encoding types:
7-bit uint: 0xxxxxxx (0-127, 1 byte)
6-bit string: 10xxxxxx + N bytes (string up to 63 bytes)
13-bit int, 16-bit int, 24-bit int, 32-bit int, 64-bit int variants
backlen: variable-length encoding of previous entry's total size
(enables backward traversal: LRANGE from end, ZREVRANGE)
Memory efficiency example:
Hash with 3 fields {name: Alice, age: 30, city: NYC}
As listpack: ~30 bytes total
As hashtable: ~300 bytes (dict_entry × 3, sds × 6, malloc overhead)
10× memory savings for small hashes.
Listpack O(N) access: must iterate from start/end to find element.
Acceptable for small sizes (≤128 elements).
Upgrade to hashtable/skiplist when threshold exceeded.
Quicklist
List implementation for larger lists:
Quicklist = doubly-linked list of listpack nodes
┌──────────┐ ┌──────────┐ ┌──────────┐
│ listpack │ ↔ │ listpack │ ↔ │ listpack │
│ [1,2,3, │ │ [11,12, │ │ [21,22, │
│ 4,5,6, │ │ 13,14, │ │ 23,24, │
│ 7,8,9, │ │ 15,16, │ │ 25,26, │
│ 10] │ │ 17,18] │ │ 27,28] │
└──────────┘ └──────────┘ └──────────┘
Each listpack node has max fill (list-max-listpack-size).
default: 128 bytes per listpack or 8192 bytes per node.
LPUSH/RPUSH: add to head/tail listpack node (O(1))
LINDEX i: traverse to correct listpack, then linear scan (O(N))
LRANGE start end: return elements in range (efficient for small slices)
Skiplist + Hashtable for Sorted Set
The sorted set (ZSET) uses two data structures simultaneously:
ZADD leaderboard 1000 "Alice" 950 "Bob" 900 "Charlie"
Skiplist (for range queries by score):
Level 4: [head] ─────────────────────────────> [tail]
Level 3: [head] ──────────────> [Bob:950] ───> [tail]
Level 2: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]
Level 1: [head] ──> [Charlie:900] ──> [Bob:950] ──> [Alice:1000] ──> [tail]
Level chosen randomly at insertion (geometric distribution, p=0.25).
Expected level: log_4(N) levels for N elements.
Skiplist supports:
- ZRANK: O(log N) rank query (count nodes at level 1 up to target)
- ZRANGEBYSCORE min max: O(log N + K) for K results
- ZADD / ZREM: O(log N)
Hashtable (for O(1) member-to-score lookup):
key = "Alice" → value = 1000.0 (score)
key = "Bob" → value = 950.0
Supports:
- ZSCORE member: O(1)
- ZRANK requires skiplist traversal: O(log N)
Both structures point to the same SDS strings (shared, not copied).
Memory overhead: ~64 bytes per element (skiplist node + dict entry + pointers)
Stream (Radix Tree of Listpacks)
Redis Streams (5.0+) use a radix tree as the primary index:
Stream entries identified by ID: <milliseconds>-<sequence>
e.g., 1700000000000-0, 1700000000000-1, 1700000001000-0
Radix tree (trie) with listpack leaves:
IDs with common prefix share tree nodes.
Entries within a radix tree leaf stored as listpack (compact).
XADD: append to listpack leaf; split when full.
XREAD COUNT 100: traverse radix tree, decode listpacks.
XRANGE id1 id2: range query using radix tree ordering.
Consumer groups (XREADGROUP):
- Per-group last-delivered-id tracking
- PEL (Pending Entries List): messages delivered but not acked
- XACK removes from PEL
Persistence
RDB (Redis Database) Snapshot
RDB Process (fork + copy-on-write):
t=0: Redis main thread: fork()
Child process created (copy-on-write sharing all memory)
t=0+: Main thread: continues serving requests, mutating data
Child thread: serializes in-memory state to disk
Copy-on-write (CoW):
When main thread modifies a page, OS creates a copy for main thread.
Child still sees original page (snapshot at fork time).
┌──────────────┐ fork() ┌──────────────┐
│ Main process │ ─────────────> │ Child process │
│ (serving req) │ │ (writing RDB) │
│ modifies pg1 │ CoW: gets copy│ reads pg1 │
│ modifies pg2 │ CoW: gets copy│ reads pg2 │
└──────────────┘ └──────────────┘
│
▼
/var/redis/dump.rdb
RDB file format:
[MAGIC:REDIS][VERSION][AUX fields][DB section: key-value pairs][EOF][CRC64 checksum]
Each key-value: [expiry?][type byte][key][value-encoding]
Advantages:
+ Fast startup (single-file binary load)
+ Compact (binary encoding, no command log)
+ Point-in-time snapshot
Disadvantages:
- Data loss window: everything since last RDB (minutes to hours)
- fork() overhead on large datasets (CoW page faults can spike latency)
Configuration:
save 3600 1 # save if 1 key changed in 3600s
save 300 100 # save if 100 keys changed in 300s
save 60 10000 # save if 10000 keys changed in 60s
AOF (Append-Only File)
AOF records every write command as it executes:
SET user:123 "Alice" → *3\r\n$3\r\nSET\r\n$8\r\nuser:123\r\n$5\r\nAlice\r\n
INCR counter → *2\r\n$4\r\nINCR\r\n$7\r\ncounter\r\n
AOF fsync policies:
appendfsync always: fsync() after every command
Durability: complete (no data loss)
Performance: ~1000 writes/sec (disk-limited)
appendfsync everysec: background thread fsyncs every second
Durability: lose up to 1 second of writes
Performance: ~100,000+ writes/sec (buffered)
DEFAULT and recommended for most use cases.
appendfsync no: OS decides when to flush (usually 30s)
Durability: lose up to 30 seconds
Performance: maximum
Use only for caches, not for persistent data.
AOF rewrite:
Over time, AOF grows large (every INCR recorded individually).
Compaction: BGREWRITEAOF forks a child to write minimal AOF
(only current state, not full history):
1000 INCR counter records → SET counter 1000
Reduces AOF size by 100-1000×.
RDB + AOF Hybrid (Redis 4.0+, default in 7.0):
On AOF rewrite: write full RDB snapshot + new commands since snapshot.
Fast load (RDB) + minimal data loss (AOF since last RDB).
Recommended for production.
Cluster Gossip and Failover
Gossip Protocol:
Each node maintains a clusterNode list for all known nodes.
Every second: select random subset of nodes, send PING.
PONG response: includes node's view of cluster state.
PING payload (gossip section): includes info on 3 random other nodes
→ Cluster state propagates in O(log N) rounds for N nodes.
Failure detection:
Node A sends PING to Node B.
If no PONG within cluster-node-timeout/2:
A marks B as PFAIL (probable failure).
A gossips B's PFAIL status to other nodes.
When majority of primaries have PFAIL for B:
B transitions to FAIL.
B's replicas detect FAIL, start election:
Replica with largest replication offset starts election.
Requests votes from other primaries (Raft-like).
Majority vote → replica becomes new primary.
Gossip new primary announcement to all nodes.
Manual failover:
CLUSTER FAILOVER (on replica): gracefully promote replica to primary
CLUSTER FAILOVER FORCE: skip data consistency check (use if primary unreachable)
CLUSTER FAILOVER TAKEOVER: skip election (use for disaster recovery)
Eviction Policies
When maxmemory is reached:
LRU approximation (default for allkeys-lru):
lru field in redisObject (24 bits):
LRU_CLOCK_RESOLUTION = 1000ms
lru = server.unixtime (current second, truncated to 24 bits)
→ wraps around every 194 days (2^24 seconds / 86400)
eviction sampling:
maxmemory-samples keys randomly selected from dict
evict key with smallest (server.lruclock - obj.lru) % LRU_WRAP
→ approximates true LRU with O(maxmemory-samples) work
LFU approximation:
lru field repurposed for LFU:
upper 16 bits: last_decrement_time (minutes resolution)
lower 8 bits: access counter (0-255)
Counter increment: probabilistic on each access
p = 1 / (counter * lfu-log-factor + 1)
→ higher counter = lower probability of increment (logarithmic)
Counter saturates near 255 for very hot keys.
Counter decay: every lfu-decay-time minutes without access,
counter -= 1 (accessed_before = minutes_since_last_decrement)
LFU evicts key with lowest counter (least frequently accessed).
Handles "Netflix new release" pattern: watched once, never again
→ counter decays → eligible for eviction.
Redis Data Structure Internals ASCII
Memory layout of HSET myhash field1 value1 field2 value2 (small hash):
redisObject (16 bytes):
┌────────┬──────────┬────────┬──────────┬──────────────────┐
│ type │ encoding │ lru │ refcount │ ptr │
│OBJ_HASH│LISTPACK │ 123456 │ 1 │ ──────────────> │
│ 4b │ 4b │ 24b │ 32b │ 64b │
└────────┴──────────┴────────┴──────────┴──────────────────┘
│
▼
Listpack (continuous block in heap):
┌──────────┬─────────┬─────────┬─────────┬─────────┬─────────┬───┐
│total_bytes│num_elem │ f1 enc │ field1 │ v1 enc │ value1 │...│
│ 4 bytes │ 2 bytes │ 1 byte │ 6 bytes │ 1 byte │ 6 bytes │ │
└──────────┴─────────┴─────────┴─────────┴─────────┴─────────┴───┘
Memory for HSET myhash field1 value1:
16 bytes (redisObject) + ~25 bytes (listpack) = ~41 bytes
vs. naive dict approach:
128 bytes (dict struct) + 64×2 (dict entries) + 40×4 (SDS strings) = ~416 bytes
10× memory efficiency for small hashes.
Debugging Notes
# Memory analysis:
redis-cli info memory
# used_memory: logical memory used by Redis
# used_memory_rss: physical memory (RSS) from OS perspective
# mem_fragmentation_ratio = used_memory_rss / used_memory
# >1.5: significant fragmentation — consider MEMORY PURGE
# Check encoding of a key:
redis-cli object encoding myhash
# Output: listpack or hashtable
# Check object refcount and idle time:
redis-cli object refcount mykey
redis-cli object idletime mykey # seconds since last access
# Check memory usage of a specific key:
redis-cli memory usage myhash # includes metadata and pointers
redis-cli memory usage myhash SAMPLES 5 # sample nested structures
# Debug internals (DANGEROUS on production):
redis-cli debug object myhash
# Output: Value at:0x7f4bc0 refcount:1 encoding:listpack serializedlength:42 lru:...
# Check keyspace (warning: KEYS is O(N), blocks main thread):
redis-cli dbsize # total key count (safe)
redis-cli scan 0 count 100 # iterate keys safely (non-blocking)
# Check slow queries:
redis-cli slowlog get 10 # last 10 slow commands
redis-cli slowlog len # total slow log entries
# Default threshold: slowlog-log-slower-than 10000 (10ms)
Security Implications
- Redis's
KEYS *command is O(N) and blocks the entire server for the duration. Attackers who can issue commands can trivially cause denial of service. Never expose Redis to untrusted callers. - Redis ACLs (6.0+) allow fine-grained access control per user: limit to specific commands and key patterns. Use
ACL SETUSER appuser on >password ~app:* +get +set +delto restrict app access. - The
CONFIG SETcommand allows changing Redis configuration at runtime — includingdiranddbfilename, which enables writing arbitrary files if an attacker can issue commands. DisableCONFIG SETin ACLs for non-admin users. - Lua scripts execute on the main thread. A malicious Lua script can spin the CPU indefinitely. Set
lua-time-limit(default 5000ms) to kill runaway scripts.
Performance Implications
- Single-threaded execution means one slow command stalls everything. Profile with
SLOWLOGto find them. Common offenders:KEYS *,SORTon large lists,SMEMBERSon large sets. - Pipelining: send multiple commands without waiting for individual responses. Eliminates N-1 round trips for N operations. Redis can process 1M pipelined commands/sec vs 100K individual commands/sec.
- Connection pools: each new Redis connection costs a TCP handshake (~1ms). Pool connections at the application layer (redis-py's ConnectionPool, jedis connection pool).
- Memory overhead: every key costs ~64-128 bytes overhead (redisObject + dict_entry + SDS metadata), regardless of value size. Millions of small keys can cost GBs in overhead alone — consider hashing small objects into larger hash fields.
Failure Modes
| Symptom | Root Cause | Fix |
|---|---|---|
| Latency spikes every N seconds | BGSAVE fork causing CoW page faults | Reduce RDB save frequency; use smaller instances |
| Memory > maxmemory despite eviction | Fragmentation | MEMORY PURGE; restart with activedefrag yes |
| Replica lag growing | Replication buffer full | Increase client-output-buffer-limit replica |
| Cluster marked node FAIL too quickly | cluster-node-timeout too low | Increase to 15-30s for busy/slow nodes |
| AOF grows without rewrite | auto-aof-rewrite-min-size not met | Trigger BGREWRITEAOF manually |
| Fork fails with ENOMEM | Not enough virtual memory for fork | Enable/increase swap; reduce dataset size |
Modern Usage
- Redis Stack (RediSearch, RedisJSON, RedisTimeSeries): Extends Redis with secondary indexing, JSON path queries, and time series compression. Used as a primary database for certain workloads.
- Redis as a job queue (Sidekiq, Bull, Celery): Millions of background jobs per day. Sorted sets as priority queues; lists as FIFO queues. Producer-consumer via BLPOP (blocking list pop).
- Redis Pub/Sub: Simple message broadcast; NOT persisted — messages lost if no subscriber connected. Use Streams for reliable message delivery.
- Valkey (2024 fork): Linux Foundation project, compatible drop-in replacement. AWS ElastiCache and GCP Memorystore migrating to Valkey.
Future Directions
- Multi-threading for command execution: Valkey is exploring multi-threaded command processing (not just I/O), which would break the single-threaded guarantee but enable linear scaling with core count.
- Vector search: Redis Stack's vector similarity search enables KNN queries on embedding vectors — used for AI-powered recommendations, semantic search, and RAG (Retrieval Augmented Generation) applications.
- Cluster-wide Lua/Functions: Extending Redis Functions to execute across multiple shards atomically (currently limited to single-slot operations).
Exercises
-
Instrument Redis memory encoding thresholds: create a hash with 1 field (HSET test f1 v1). Check encoding with
OBJECT ENCODING. Add fields until encoding changes fromlistpacktohashtable. Determine the exact threshold from the server'shash-max-listpack-entriesconfig. -
Measure copy-on-write overhead: start a Redis instance with 1GB dataset. Trigger BGSAVE and simultaneously run a write benchmark. Measure RSS memory increase during the fork (CoW page faults). Compare to no-write baseline.
-
Implement a rate limiter using Redis sorted sets (ZADD + ZREMRANGEBYSCORE + ZCARD) with a sliding window. Test with concurrent requests to verify correct limiting behavior.
-
Explore the skiplist structure: add 1000 elements to a sorted set (ZADD leaderboard
). Use DEBUG OBJECT leaderboardto see the serialized length. Compare to a hash with same 1000 fields. -
Test AOF durability: configure
appendfsync always, write 1000 keys, thenkill -9the Redis process. Restart and verify all 1000 keys are present. Repeat withappendfsync everysecand kill during a write burst — verify data loss is bounded to ~1 second.
References
- Redis source code: github.com/redis/redis (particularly: src/object.c, src/t_*.c, src/aof.c, src/rdb.c, src/cluster.c, src/ae.c)
- antirez blog: antirez.com (design rationale in author's own words)
- "Redis internals: SDS" — antirez, 2009
- Redis documentation: redis.io/docs/reference/internals/
- "Inside Redis: What every developer should know" — Jan-Erik Rediger, 2023
- Redis memory optimization guide: redis.io/docs/management/optimization/memory-optimization/
- Skiplist original paper: "Skip Lists: A Probabilistic Alternative to Balanced Trees" — William Pugh, 1990
- "Design and Implementation of a Key-Value Store" — Salvatore Sanfilippo, 2009 (original Redis thesis)
- Valkey: valkey.io
- "Scaling Memcache at Facebook" — Nishtala et al., NSDI 2013 (for comparison/contrast with Redis scaling)