Kernel Data Structures

Technical Overview

The Linux kernel maintains hundreds of data structures to track every running process, every mapped memory region, every open file, every network connection, every device, and every timer. But rather than a collection of ad-hoc arrays and linked lists, the kernel uses a carefully chosen set of generic, reusable data structures — implemented in a way unique to kernel programming, where memory must be managed precisely, concurrent access is the norm, and cache efficiency matters enormously.

The most distinctive aspect of Linux kernel data structures is the "embedded" pattern: rather than a generic list of void * pointers, the Linux kernel embeds the list link (struct list_head) directly into the data structure being managed. This means a single struct task_struct can appear simultaneously in the scheduler's run queue, a process group list, and the thread group list — using different embedded list_head members for each. This design is elegant, type-safe (via container_of), and cache-efficient.

Understanding these data structures is prerequisite to reading any Linux kernel subsystem code. They appear in every file, at every level.

Prerequisites

Basic C programming (structs, pointers, typedefs)
Concept of linked lists and trees
01-what-is-a-kernel.md: kernel context
03-kernel-memory-model.md: how kernel memory is allocated (adjacent file)

Core Content

Linked Lists: `list_head`

Source: include/linux/list.h

The Linux linked list is a circular doubly-linked list where the list node (struct list_head) is embedded directly into the data structure:

// The list node — contains only pointers, no data
struct list_head {
    struct list_head *next, *prev;
};

// A data structure that can be on a list
struct my_object {
    int data;
    struct list_head node;    // embedded list link
    char name[32];
};

ASCII Diagram: circular doubly-linked list

        list_head (sentinel/head)
           ┌──────────────┐
     ┌─────►  next ──────────────────────────────────┐
     │     │  prev  ◄───────────────────────────────┐│
     │     └──────────────┘                         ││
     │                                              ││
     │  struct my_object A        struct my_object B ││
     │  ┌──────────────────┐      ┌──────────────────┐│
     │  │  data: 42        │      │  data: 99        ││
     │  │  node:           │      │  node:           ││
     │  │   next ──────────►      │   next ──────────┘│
     │  │   prev ◄───────────────────── prev          │
     └──────── prev         │      │         next ────►head
        │                  │      └──────────────────┘
        └──────────────────┘

The head is a sentinel node (LIST_HEAD(my_list) or INIT_LIST_HEAD(&head)) that is never part of any my_object — it just anchors the list.

Key API:

LIST_HEAD(my_list);                     // declare and init a list head (static)
INIT_LIST_HEAD(&list);                  // init a list_head at runtime

list_add(&obj->node, &my_list);         // add after head (stack order)
list_add_tail(&obj->node, &my_list);    // add before head (queue order)
list_del(&obj->node);                   // remove from list
list_empty(&my_list);                   // test if list is empty

// Iterate over all objects:
struct my_object *pos;
list_for_each_entry(pos, &my_list, node) {
    printk("data=%d\n", pos->data);
}

// Safe iteration (allows list_del during loop):
struct my_object *pos, *tmp;
list_for_each_entry_safe(pos, tmp, &my_list, node) {
    if (pos->data == 42)
        list_del(&pos->node);
}

container_of macro (include/linux/container_of.h): The magic that enables the embedded pattern. Given a pointer to an embedded list_head, recover a pointer to the containing my_object:

// container_of(ptr, type, member):
//   ptr    = pointer to the embedded member
//   type   = type of the outer struct
//   member = name of the embedded member within outer struct

struct list_head *lh = get_some_list_node();
struct my_object *obj = container_of(lh, struct my_object, node);
// Equivalent to: obj = (struct my_object *)((char*)lh - offsetof(struct my_object, node))

Real kernel usage: struct task_struct has struct list_head tasks (all processes list), struct list_head thread_group (threads in the same process), struct list_head children (child processes), struct list_head sibling (sibling processes among parent's children). Each is a separate embedded list_head, allowing the same task to be simultaneously on all four lists.

Hash Tables: `hlist_head` and `hlist_node`

Source: include/linux/list.h

For hash table buckets, a full list_head (with two pointers in the head) wastes space when there are thousands of buckets. Linux uses hlist_head (one pointer) and hlist_node (two pointers: next and pprev — a pointer to the previous node's next pointer):

struct hlist_head {
    struct hlist_node *first;    // pointer to first element
};

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;   // pointer to previous node's 'next' field
                                 // (allows O(1) deletion without the head)
};

hash_table[0]: hlist_head         hash_table[1]: hlist_head
  ┌──────────┐                      ┌──────────┐
  │ first ───►  hlist_node A         │ first ───► hlist_node B
  └──────────┘  ┌─────────────┐      └──────────┘ ┌────────────┐
                │ next ────────►      hlist_node C │ next: NULL │
                │ pprev ──►[0].first │ ┌────────────┤ pprev ────►[1].first
                └─────────────┘      │ │ next: NULL │
                                     │ │ pprev ─────► A.next
                                     │ └────────────┘

Real kernel usage: pid_hash[] (process ID to task struct lookup), dentry_hashtable[] (filename lookup), inet_bind_hashbucket[] (socket lookup by port).

Red-Black Trees: `rb_root` and `rb_node`

Source: include/linux/rbtree.h, lib/rbtree.c

Red-black trees provide O(log n) insert, delete, and lookup with guaranteed balance. Linux's rb_node is embedded directly into data structures:

struct rb_node {
    unsigned long __rb_parent_color;   // parent ptr + color (uses low bits)
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

struct rb_root {
    struct rb_node *rb_node;           // root of the tree
};

// Usage:
struct my_interval {
    unsigned long start, end;
    struct rb_node rb;                 // embedded rb_node
};

ASCII Diagram: rb_tree

         rb_root
           │
           ▼
      [50, B=black]
       /          \
  [30, R=red]   [70, B=black]
   /      \
[20, B]  [40, B]

B=black, R=red
Properties:
  1. Every node is red or black
  2. Root is black
  3. Red nodes have black children
  4. All paths from root to NULL have equal black count
  → guarantees max height = 2*log2(n+1)

Real kernel usage:

CFS scheduler runqueue (kernel/sched/fair.c): struct cfs_rq has struct rb_root_cached tasks_timeline. Each runnable task is a node keyed by its vruntime (virtual runtime). The scheduler picks the leftmost node (minimum vruntime) in O(log n).
Virtual memory areas (mm/mmap.c): struct mm_struct has struct maple_tree mm_mt (Linux 6.1+ uses maple tree, previously rb_tree via mm_rb). Each struct vm_area_struct (VMA) is a node keyed by vm_start. Finding which VMA contains a given address is O(log n).
File descriptors (fs/file.c): The fdtable uses an array for small fd numbers and can use rbtree for large fd numbers.

API:

struct rb_root root = RB_ROOT;

// Insert (must provide comparison — rb_tree doesn't know your key type):
void my_insert(struct rb_root *root, struct my_interval *new)
{
    struct rb_node **link = &root->rb_node;
    struct rb_node *parent = NULL;
    while (*link) {
        struct my_interval *entry = rb_entry(*link, struct my_interval, rb);
        parent = *link;
        if (new->start < entry->start)
            link = &(*link)->rb_left;
        else
            link = &(*link)->rb_right;
    }
    rb_link_node(&new->rb, parent, link);
    rb_insert_color(&new->rb, root);   // rebalance
}

// Iterate in order (smallest to largest):
struct rb_node *n;
for (n = rb_first(&root); n; n = rb_next(n)) {
    struct my_interval *entry = rb_entry(n, struct my_interval, rb);
    printk("[%lu, %lu]\n", entry->start, entry->end);
}

Radix Trees / XArray: Page Cache

Source: include/linux/xarray.h, lib/xarray.c (Linux 4.20+ replaces radix tree with XArray)

The page cache (mm/filemap.c) needs to look up the page at offset N in a file rapidly. XArray is a sparse array abstraction built on a radix-tree-like structure. Each node has XA_CHUNK_SIZE (64) slots, 6-bit per level. For a page cache on a 64-bit system, 4 levels can index 2^42 pages (16 TiB of file data).

struct address_space {     // represents a file's pages in memory
    ...
    struct xarray i_pages;  // XArray: file_offset → struct page *
    ...
};

// Looking up page at offset 42:
struct page *page = xa_load(&mapping->i_pages, 42);

// Inserting:
xa_store(&mapping->i_pages, offset, page, GFP_KERNEL);

// Iterating over all present pages:
unsigned long index;
struct page *page;
xa_for_each(&mapping->i_pages, index, page) {
    // process page
}

XArray also supports marks (e.g., PAGECACHE_TAG_DIRTY, PAGECACHE_TAG_WRITEBACK) on entries, enabling efficient lookup of only dirty pages without scanning the entire cache.

Priority Heaps: Timer Subsystem

The Linux hrtimer (high-resolution timer) subsystem uses a timerqueue_head (include/linux/timerqueue.h), which wraps a red-black tree ordered by expiry time. The minimum expiry (the next timer to fire) is always the leftmost node.

struct timerqueue_head {
    struct rb_root_cached rb_root;   // rb tree with cached leftmost
};

struct timerqueue_node {
    struct rb_node node;
    ktime_t expires;                 // expiry time (nanoseconds)
};

timerqueue_getnext() returns the next-to-expire timer in O(1). Insertion is O(log n).

The per-CPU struct hrtimer_cpu_base contains six timerqueue_head (one per clock base: CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.).

For the legacy timer_wheel (jiffy-based timers), Linux uses a hierarchical timing wheel (kernel/time/timer.c): an array of 256 buckets at 1-tick granularity, plus 64-bucket arrays at 8-tick, 64-tick, and 512-tick granularities. This gives O(1) insert and O(1) expiry (amortized) for the common case where timers fire at their expected time.

Per-CPU Variables

Source: include/linux/percpu.h, include/linux/percpu-defs.h

Per-CPU variables are a fundamental kernel performance technique. Instead of one variable shared across all CPUs (requiring atomic operations or locks for updates), each CPU has its own copy. This eliminates cache line contention for read-modify-write operations:

// Static per-CPU variable:
DEFINE_PER_CPU(long, my_counter);

// Access (must disable preemption to prevent CPU migration):
get_cpu_var(my_counter)++;   // disables preemption, gets this CPU's copy
put_cpu_var(my_counter);     // re-enables preemption

// Or in interrupt context (already non-preemptable):
__this_cpu_inc(my_counter);  // no preemption needed if in IRQ context

// Read another CPU's variable:
long val = per_cpu(my_counter, cpu_id);

// Sum across all CPUs:
long total = 0;
for_each_possible_cpu(cpu)
    total += per_cpu(my_counter, cpu);

Real kernel usage: - vmstat counters (page faults, allocations): per-CPU vm_stat[] - Network statistics (packets sent/received): per-CPU netdev_stats - Scheduler's runqueue (struct rq): per-CPU (kernel/sched/core.c) - softirq pending bitmask: per-CPU __softirq_pending - slab allocator's per-CPU caches: eliminates locking for common allocations

Memory layout: All per-CPU variables for CPU 0 are placed in a contiguous section. For CPU 1, a copy is placed at section_start + __per_cpu_offset[1]. The linker places all DEFINE_PER_CPU variables in the .data..percpu section.

RCU-Protected Data Structures

Source: include/linux/rcupdate.h, kernel/rcu/

RCU (Read-Copy-Update) is covered in detail in 06-kernel-locking-overview.md, but its interaction with data structures deserves mention here.

RCU allows lock-free reads with explicit update synchronization. The pattern:

// Reader (no lock needed):
rcu_read_lock();
struct my_object *obj = rcu_dereference(global_ptr);  // safe pointer dereference
if (obj)
    use(obj->data);   // obj cannot be freed while we hold rcu_read_lock
rcu_read_unlock();

// Writer (updates pointer):
struct my_object *new_obj = kmalloc(sizeof(*new_obj), GFP_KERNEL);
new_obj->data = new_value;
old_obj = rcu_dereference_protected(global_ptr, lockdep_is_held(&my_lock));
rcu_assign_pointer(global_ptr, new_obj);  // atomic pointer store with barrier
// Now wait for all existing readers to finish before freeing old_obj:
synchronize_rcu();     // blocks until all pre-existing RCU readers complete
kfree(old_obj);

RCU-protected lists: hlist_head has RCU variants: hlist_add_head_rcu, hlist_del_rcu, hlist_for_each_entry_rcu. The pid_hash table (process ID lookup) uses RCU-protected hlists — reading a PID requires no lock because PID lookup is extremely frequent (signal delivery, /proc, etc.).

Historical Context

Linux's list_head design (circular embedded doubly-linked list) appeared in Linux 1.3.x (1995) and is credited to Linus Torvalds. It replaced an earlier approach of separate node types that carried void * data pointers. The embedded approach was inspired by the DEC SRC's Modula-3 collection libraries and is related to Knuth's idea of "dancing links."

Red-black trees were invented by Rudolf Bayer (1972, "symmetric binary B-trees") and named by Guibas and Sedgewick (1978). Linux adopted them for the completely fair scheduler's runqueue in 2007 (replacing an O(n) sorted list in the O(1) scheduler and the earlier CFS implementation that used a plain prio_array).

XArray was introduced by Matthew Wilcox in Linux 4.17 (2018), replacing the radix_tree_root. XArray is a cleaner API with better concurrency properties (uses RCU internally) and support for multi-order entries (a single entry spanning multiple indices — used for huge page cache entries).

The per-CPU variable mechanism in its current form dates to Linux 2.6.0 (2003). Earlier Linux used less systematic per-CPU storage. The DEFINE_PER_CPU macro and the __per_cpu_offset[] mechanism were designed specifically for the 2.6 SMP improvements.

Production Examples

CFS scheduler performance: The rb_tree-based CFS runqueue on a 256-core server with thousands of runnable threads provides O(log n) scheduling. The minimum vruntime node is always the leftmost node and is cached in rb_root_cached.rb_leftmost, making the hot path (pick next task) effectively O(1) — just read the cached leftmost node.

/proc fd listing performance: /proc/PID/fd/ lists a process's open file descriptors. The fd table (struct fdtable) stores fds in a radix-tree-backed array. Iterating all fds for a process with 100,000 open file descriptors is O(n) over the XArray, but lookups are O(log n) where n is related to the size of the fd number space.

Networking connection lookup: struct inet_hashinfo (net/ipv4/inet_hashtables.c) stores all TCP connections. Lookup by (local IP, local port, remote IP, remote port) uses a hash into ehash (established hash), an hlist_head[] array. With RCU protection on the hlist, millions of packet-per-second lookups proceed without any lock acquisition.

Debugging Notes

# Inspect list_head usage via kernel debugger (crash utility):
# crash> list task_struct.tasks -h [task_struct address] -o task_struct.pid

# Check for list corruption (common kernel bug: list_del on non-listed node)
# Symptoms: kernel oops in list_del with "list_del corruption" message
# CONFIG_DEBUG_LIST=y enables runtime list integrity checks

# Inspect rb_tree (e.g., VMAs of a process):
# crash> foreach vmas [pid]    (in crash utility)

# Count objects on a list (useful for debugging leaks):
# bpftrace: count task_structs via the tasks list
bpftrace -e '
BEGIN {
    $task = (struct task_struct *)curtask;
    $count = 0;
    $curr = $task->tasks.next;
    while ($curr != &$task->tasks) {
        $count++;
        $curr = $curr->next;
    }
    printf("tasks: %d\n", $count);
    exit();
}'

# Detect RCU stalls (a CPU not completing its RCU read-side critical section):
dmesg | grep "RCU stall"
# Indicates a CPU is stuck inside rcu_read_lock() for too long
# Usually indicates a CPU stuck in an infinite loop with preemption disabled

Security Implications

List corruption as exploit primitive: Overwriting a list_head's next or prev pointer is a classic kernel heap exploit primitive. An attacker who can write two 8-byte values to an arbitrary kernel address can corrupt a list, and when list_del() executes, it will write attacker-controlled values to attacker-controlled addresses: prev->next = next; next->prev = prev. This is a write-what-where primitive, often sufficient for privilege escalation.

Mitigations: - CONFIG_DEBUG_LIST=y: adds list integrity checks (validates next->prev == entry before unlinking), catches corruption before it propagates - Heap hardening (CONFIG_INIT_ON_ALLOC_DEFAULT_ON): zeroes memory on allocation, making uninitialized pointer corruption less exploitable - CONFIG_SLAB_FREELIST_RANDOM: randomizes slab freelist order, making predictable heap grooming harder

rb_tree and use-after-free: rb_erase() does not zero the node's pointers. An object with a stale embedded rb_node that is re-inserted into an rb_tree (use-after-free) can corrupt the tree. Linux's RB_CLEAR_NODE() (sets rb_parent_color to 1) marks a node as not in any tree; RB_EMPTY_NODE() checks this.

Performance Implications

Cache line layout: The design of embedded data structures affects cache efficiency. struct task_struct is ~9KB — far too large to fit in a cache line. The scheduler accesses only a few hot fields (sched_entity, state, on_cpu). These are placed at the beginning of task_struct to share a cache line. The embedded sched_entity at the start means the CFS rb_node is cache-hot during scheduling.

Per-CPU vs. atomic: Incrementing a per-CPU counter (__this_cpu_inc(counter)) is a single INC [memory] instruction on x86 (~5 cycles). Incrementing a shared atomic counter (atomic_inc(&shared_counter)) requires a LOCK XADD or LOCK INC (~50–100 cycles due to cache coherence traffic). For hot paths (page fault counters, allocation counters), per-CPU is essential.

XArray vs. radix tree: XArray was benchmarked at roughly the same speed as the old radix tree for lookup but with lower lock contention due to better RCU integration. Page cache workloads (web server serving files from cache) involve millions of XArray lookups per second; any overhead here directly impacts throughput.

Failure Modes and Real Incidents

CVE-2016-4558 (BPF/list_head corruption): A bug in the BPF subsystem could cause a list_head to be modified without proper synchronization, leading to use-after-free. Exploited to gain root by corrupting kernel data structures.

CFS rb_tree corruption under vfork() race (2019, CVE-2019-11190): A race condition during vfork() could corrupt the CFS scheduler's rb_tree, causing __pick_next_entity() to dereference freed memory. Symptom: sporadic kernel panics in pick_next_task_fair() on systems with high vfork() rates (container spawning).

List_head poisoning and debugging: When CONFIG_DEBUG_LIST=y is enabled and list_del() is called, the deleted node's next and prev are set to LIST_POISON1 and LIST_POISON2 (poisoned values that will fault on any dereference). This immediately catches use-after-free bugs on lists — any subsequent dereference of the poisoned list_head faults with an oops rather than silently corrupting memory.

Modern Usage

Maple tree (Linux 6.1+): The mm_rb rb_tree in struct mm_struct was replaced by a maple tree (struct maple_tree mm_mt). The maple tree is a B-tree variant optimized for range-based lookups and updates, with native RCU support and better cache efficiency for VMA operations. It stores ranges (start, end) as keys natively, making find_vma() faster than the rb_tree approach.

XArray for multi-order entries: Huge page support in the page cache uses XArray multi-order entries: a single XArray entry at offset k can span 512 consecutive indices (one 2MiB huge page = 512 4KiB pages). xa_store_order() inserts a multi-order entry; XArray handles marking all covered indices as pointing to the same entry.

BPF maps: eBPF programs use kernel data structures exposed as "maps" (BPF_MAP_TYPE_HASH, BPF_MAP_TYPE_ARRAY, BPF_MAP_TYPE_RBTREE, BPF_MAP_TYPE_RINGBUF). BPF rbtree maps (Linux 6.3) allow BPF programs to maintain O(log n) sorted collections entirely in the kernel, with safe concurrency via BPF spin locks.

Future Directions

Maple tree expansion: The maple tree is being adopted for more subsystems beyond VMAs. Its range-query capabilities and RCU-friendly design make it a candidate to replace specialized rb_trees in other parts of MM.
Wait-free data structures: Research into wait-free (as opposed to merely lock-free) alternatives to rbtrees and lists for the most performance-critical paths (scheduler runqueue, NIC RX queue descriptors).
Rust generics for kernel data structures: The Rust-for-Linux project is implementing Rust wrappers and type-safe generic versions of list_head, rb_tree, and XArray in rust/kernel/. This allows Rust kernel code to use these structures with full type safety (no container_of casts).

Exercises

Find the definition of struct task_struct in include/linux/sched.h on the Linux 6.x source (elixir.bootlin.com). Count how many struct list_head members it contains. For each one, determine which list it links the task onto (read surrounding comments).
Write a minimal Linux kernel module that creates a linked list of 5 objects using list_head, inserts them, and iterates over them in init_module(), printing each object's value. Use list_add_tail() to maintain insertion order. Use kfree and list_del for cleanup in cleanup_module().
Read kernel/sched/fair.c function __enqueue_entity(). Trace the rb_tree insertion logic. What is the key being compared? How does Linux find the correct insertion point? What does rb_insert_color() do and why is it needed after rb_link_node()?
Use bpftrace to measure the average number of tasks in the CFS rb_tree on your system: attach to kprobe:pick_next_task_fair and count calls. Then read /proc/sched_debug and compare the nr_running field for each CPU's CFS runqueue. What does this tell you about the typical size of the scheduler's rb_tree?
Read include/linux/percpu-defs.h. Understand how DEFINE_PER_CPU(type, name) works by reading the expanded form. On an x86-64 system, what is __per_cpu_offset[1]? (Hint: read arch/x86/kernel/setup_percpu.c.) What does this value represent physically in memory?

References

Linux kernel source: include/linux/list.h, include/linux/rbtree.h, include/linux/xarray.h, include/linux/percpu.h, include/linux/timerqueue.h
Linux kernel documentation: Documentation/core-api/kernel-api.rst, Documentation/core-api/xarray.rst
Robert Love, Linux Kernel Development, 3rd ed., Chapter 6 (Kernel Data Structures)
Matthew Wilcox, "XArray" design document: https://lwn.net/Articles/745021/
Paul McKenney, "What is RCU, Fundamentally?": https://lwn.net/Articles/262464/
Maple tree design: Liam Howlett, "The maple tree, a modern data structure for a complex problem": https://lwn.net/Articles/845507/
Andrea Arcangeli, "Understanding the Linux Kernel Virtual Memory Manager", 2002
Knuth, The Art of Computer Programming, Vol. 3, Section 6.2.3 (Balanced Trees)