Slab Allocator

Technical Overview

The slab allocator sits between the buddy allocator (which deals in whole pages) and the rest of the kernel (which needs objects of arbitrary sizes like inodes, dentries, TCP sockets, and VMAs). It was invented by Jeff Bonwick at Sun Microsystems and described in his seminal 1994 USENIX paper "The Slab Allocator: An Object-Caching Kernel Memory Allocator."

The fundamental insight: kernel objects are frequently allocated and freed. Instead of calling the buddy allocator for every allocation (expensive) and zeroing/reinitializing every object (also expensive), the slab allocator maintains caches of pre-constructed objects. When an object is freed, the slab allocator keeps it in a free list — retaining the object's initialized state — ready for the next allocation. This object caching amortizes constructor overhead and dramatically reduces allocation latency.
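The object-caching idea can be illustrated in user space. The following is a minimal sketch, not kernel code: struct conn, conn_ctor(), and the obj_cache type are hypothetical names chosen for illustration, and malloc() stands in for the page allocator. It shows the constructor running once per object, with freed objects keeping their constructed state on a free list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical object type used only for this sketch. */
struct conn {
    int lock_ready;          /* "expensive" state, built once by the constructor */
    int fd;                  /* per-use state, managed by the caller */
    struct conn *next_free;  /* link used only while the object is free */
};

struct obj_cache {
    struct conn *free_list;    /* freed objects, still in constructed state */
    unsigned long ctor_calls;  /* number of constructor invocations */
};

static void conn_ctor(struct conn *c)
{
    c->lock_ready = 1;
    c->fd = -1;
    c->next_free = NULL;
}

/* Reuse a cached object if one exists; otherwise build a new one. */
static struct conn *cache_alloc(struct obj_cache *cache)
{
    struct conn *c = cache->free_list;

    if (c) {
        cache->free_list = c->next_free;
        return c;              /* no constructor call on reuse */
    }
    c = malloc(sizeof(*c));    /* stand-in for getting memory from the buddy allocator */
    if (!c)
        return NULL;
    conn_ctor(c);
    cache->ctor_calls++;
    return c;
}

/* The caller must hand the object back in its constructed state. */
static void cache_free(struct obj_cache *cache, struct conn *c)
{
    assert(c->lock_ready == 1);
    c->next_free = cache->free_list;
    cache->free_list = c;
}
```

Allocating, freeing, and allocating again returns the same object while ctor_calls stays at 1, which is the amortization described above.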

Linux's first slab allocator, SLAB, followed Bonwick's design. SLUB (Christoph Lameter, 2007) later replaced it as the default, while SLOB (Simple List Of Blocks) served small embedded systems. SLOB was removed in Linux 6.4 and SLAB in 6.8, so on current kernels SLUB is the only slab allocator.

Prerequisites

Buddy allocator and GFP flags (06-buddy-allocator.md)
struct page and physical memory layout
Per-CPU data structures and get_cpu_var()
SMP locking fundamentals

Core Content

Slab Allocator Concepts (Bonwick 1994)

Slab Allocator Architecture
=============================

kmem_cache: "inode_cache"
  ┌─────────────────────────────────────────────────────────┐
  │ name: "inode_cache"                                     │
  │ object_size: sizeof(struct inode) = 728 bytes           │
  │ ctor: inode_init_once()   (called once per object)      │
  │ align: ARCH_MIN_OBJ_ALIGN                                │
  │                                                         │
  │  Per-CPU slabs (SLUB: one slab per CPU, hot objects):   │
  │  CPU0: [obj][obj][obj][FREE]...[obj][FREE]              │
  │  CPU1: [obj][obj][FREE][obj]...[FREE][obj]              │
  │                                                         │
  │  Per-node partial slabs (partially full):               │
  │  Node0: slab0[.XX..X.] → slab1[XX..XXX] → slab2[..XX]  │
  │                                                         │
  │  Full slabs: (not tracked in SLUB — all objects in use) │
  └─────────────────────────────────────────────────────────┘

Each slab = one or more contiguous pages (order = kmem_cache->oo.order)
A slab page contains N objects of size obj_size.
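How many objects fit in a slab follows from the slab order and the aligned object size. The sketch below assumes 4 KiB pages and ignores any per-object debug metadata; the function names are illustrative and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Bytes in a slab of the given buddy order (2^order contiguous pages). */
static size_t slab_bytes(unsigned int order)
{
    return (size_t)SKETCH_PAGE_SIZE << order;
}

/* Round size up to a multiple of align; align must be a power of two. */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Number of objects that fit in one slab. */
static size_t objects_per_slab(size_t object_size, size_t align, unsigned int order)
{
    return slab_bytes(order) / align_up(object_size, align);
}

/* Bytes left unused at the end of the slab. */
static size_t slab_tail_waste(size_t object_size, size_t align, unsigned int order)
{
    return slab_bytes(order) % align_up(object_size, align);
}
```

For the 728-byte object in the diagram with 8-byte alignment, an order-0 slab holds 5 objects and leaves 456 bytes unused, while an order-3 slab holds 45 objects and leaves only 8 bytes. This is one reason an allocator may prefer higher orders for awkward object sizes.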

SLUB Design (Christoph Lameter, 2007)

SLUB simplifies the original SLAB by:
1. Eliminating the separate slab descriptor (metadata lives in struct page / struct slab)
2. Using a per-CPU freelist pointer for lock-free fast path
3. Reducing metadata overhead by removing coloring (shown to have negligible benefit in practice)
4. Making debugging (poisoning, tracking) opt-in via CONFIG_SLUB_DEBUG

SLUB Per-CPU Structure
========================

struct kmem_cache_cpu {
    void         **freelist;   /* pointer to first free object in current slab */
    unsigned long  tid;        /* transaction ID (ABA prevention) */
    struct slab   *slab;       /* currently active slab page for this CPU */
    struct slab   *partial;    /* partially full slabs (per-CPU partial list) */
};

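/* One per kmem_cache per CPU: allocated per-CPU and reached through the
 * cache as this_cpu_ptr(cache->cpu_slab), not through a global variable. */
struct kmem_cache_cpu __percpu *cpu_slab;   /* member of struct kmem_cache */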

Per-slab (struct slab, which overlays the struct page of the slab's head page):
  slab->freelist   → linked list of free objects within the slab
  slab->inuse      → count of in-use objects
  slab->frozen     → 1 if slab is claimed by a CPU (in per-CPU slot)
  slab->objects    → total objects in slab
  slab_nid(slab)   → NUMA node (read from the underlying page, not stored in struct slab)

Object free list (embedded in free object):
  [free_obj] → points to next free object (next_pointer offset within obj)
  No separate freelist array; pointers embedded in the objects themselves.
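A user-space sketch of such an embedded free list follows. It assumes the next pointer sits at offset 0 of each free object (real SLUB chooses the offset per cache) and uses memcpy() to avoid alignment assumptions; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Thread a free list through a raw slab buffer. Each free object stores the
 * address of the next free object in its own first bytes; the last stores NULL.
 * Returns the head of the list (the first object).
 */
static void *slab_init_freelist(void *slab, size_t object_size, size_t nr_objects)
{
    unsigned char *base = slab;

    assert(object_size >= sizeof(void *));
    assert(nr_objects > 0);
    for (size_t i = 0; i < nr_objects; i++) {
        void *next = (i + 1 < nr_objects) ? base + (i + 1) * object_size : NULL;
        memcpy(base + i * object_size, &next, sizeof(next));
    }
    return base;
}

/* Take the first free object and advance the head to its embedded next pointer. */
static void *freelist_pop(void **head)
{
    void *obj = *head;

    if (obj)
        memcpy(head, obj, sizeof(*head));   /* *head = obj->next */
    return obj;
}

/* Return an object to the front of the list. */
static void freelist_push(void **head, void *obj)
{
    memcpy(obj, head, sizeof(*head));       /* obj->next = *head */
    *head = obj;
}
```

Because the link lives inside the free object, the list costs no memory beyond the objects themselves, and the link is overwritten as soon as the object is handed out.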

SLUB Fast Path (Lock-Free Allocation)

kmem_cache_alloc(cache, GFP_KERNEL):
  [FAST PATH — no locks]
  │
  ├── this_cpu_ptr(cache->cpu_slab) → cpu_slab
  ├── object = cpu_slab->freelist
  │     Is freelist non-NULL? (slab has free objects)
  │     Use cmpxchg to atomically update freelist to next object
  │     (uses TID to prevent ABA problem)
  │     Return object   ← ~5 ns
  │
  [SLOW PATH — if freelist empty]
  │
  ├── __slab_alloc(cache, gfp, node, ...)
  │     │
  │     ├── Try per-CPU partial slab (cpu_slab->partial)
  │     │     Install as new active slab, refill freelist
  │     │
  │     ├── Try per-node partial slab (kmem_cache_node->partial list)
  │     │     Lock node, take partial slab, install on CPU
  │     │
  │     └── Allocate new slab from buddy allocator
  │           alloc_pages(GFP_KERNEL, cache->oo.order)
  │           Initialize all objects in new slab
  │           Install as CPU active slab

The fast path is lock-free: it commits the new freelist head and the next tid together in a single per-CPU compare-and-exchange. Because the operation is CPU-local, it is cheap when the cache line is already in the local cache. The full fast path is ~10–20 cycles, or roughly 3–7 ns at 3 GHz.
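Why the tid matters can be shown with a single-threaded model. This is only a sketch of the logic: slot_commit() is a plain, non-atomic stand-in for the paired compare-and-exchange, the tid here increases by one per operation, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
    struct obj *next;
};

/* Model of the per-CPU slot: freelist head plus transaction ID. */
struct cpu_slot {
    struct obj *freelist;
    unsigned long tid;
};

/* Every completed operation on the slot advances the tid. */
static struct obj *slot_pop(struct cpu_slot *s)
{
    struct obj *o = s->freelist;

    if (o) {
        s->freelist = o->next;
        s->tid++;
    }
    return o;
}

static void slot_push(struct cpu_slot *s, struct obj *o)
{
    o->next = s->freelist;
    s->freelist = o;
    s->tid++;
}

/*
 * Non-atomic model of the paired compare-and-exchange: install new_head only
 * if BOTH the freelist head and the tid still match the caller's snapshot.
 */
static bool slot_commit(struct cpu_slot *s, struct obj *seen_head,
                        unsigned long seen_tid, struct obj *new_head)
{
    if (s->freelist != seen_head || s->tid != seen_tid)
        return false;
    s->freelist = new_head;
    s->tid = seen_tid + 1;
    return true;
}
```

Suppose an allocation snapshots head A and plans to install A's successor B, but is interrupted while A and B are popped and A is pushed back. The head is A again, yet B is now in use. Comparing the head alone would wrongly succeed; the changed tid makes the commit fail, and the allocation retries with a fresh snapshot.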

kmalloc vs kmem_cache_alloc

kmalloc(size, gfp) is the general-purpose kernel allocator. It uses a set of fixed-size caches:

kmalloc size classes (SLUB):
  8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192 bytes
  (powers of two plus 96 and 192; the smallest class depends on
  KMALLOC_MIN_SIZE and the architecture, the largest on the page size)

kmalloc(17, GFP_KERNEL)  → allocates from "kmalloc-32" cache (next size class up)
kmalloc(4097, GFP_KERNEL) → allocates from "kmalloc-8192" or buddy direct

For allocations larger than KMALLOC_MAX_CACHE_SIZE (usually 8KB), kmalloc falls back to alloc_pages() directly.
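The size-class lookup can be sketched as a table search. The table below is an assumption for a typical build with 4 KiB pages (powers of two plus 96 and 192, up to 8192 bytes); real kernels compute the index differently and the function names are illustrative. Zero-sized requests are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size classes for a typical 4 KiB-page configuration. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

#define NR_KMALLOC_CLASSES (sizeof(kmalloc_classes) / sizeof(kmalloc_classes[0]))

/*
 * Return the size of the smallest class that can hold 'size' bytes, or 0 if
 * the request is larger than every class (page allocator territory).
 */
static size_t kmalloc_class_for(size_t size)
{
    for (size_t i = 0; i < NR_KMALLOC_CLASSES; i++) {
        if (size <= kmalloc_classes[i])
            return kmalloc_classes[i];
    }
    return 0;
}

/* Bytes of internal fragmentation for one allocation served from a class. */
static size_t kmalloc_slack(size_t size)
{
    size_t cls = kmalloc_class_for(size);

    assert(cls != 0);   /* only meaningful for requests that fit a class */
    return cls - size;
}
```

With this table, a 17-byte request lands in the 32-byte class, a 4097-byte request in the 8192-byte class, and an 8193-byte request returns 0, meaning it bypasses the size classes.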

kmem_cache_alloc(cache, gfp) allocates from a specific named cache. Used for well-defined objects with known sizes and initialization requirements (inodes, dentries, net sockets, etc.).

kfree Internals

kfree(ptr):
  │
  ├── virt_to_head_page(ptr)  → struct page/slab for this pointer
  ├── slab->slab_cache        → kmem_cache *cache
  │
  [FAST PATH — return to per-CPU freelist]
  ├── Check if slab == cpu_slab->slab (page still active on this CPU)
  │     YES: prepend ptr to cpu_slab->freelist (atomic cmpxchg)
  │     Return  ← ~5 ns
  │
  [SLOW PATH]
  ├── __slab_free(cache, slab, ptr, ...)
  │     │
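  │     ├── slab was full → now partial: add to a partial list
  │     │     (per-CPU partial list if enabled, else per-node)
  │     ├── slab now empty → return page(s) to buddy allocator
  │     │     (only if the node already holds at least min_partial partial slabs)
  │     └── slab still partial → stays on its partial list; only freelist and inuse change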
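The slow-path outcomes can be modelled as a small decision function. This is a simplified sketch that ignores frozen slabs and per-CPU partial lists; the enum and function names are hypothetical, and min_partial stands for the per-cache threshold below which empty slabs are kept in reserve.

```c
#include <assert.h>

enum free_action {
    FREE_MOVE_TO_PARTIAL,  /* slab was full; it now has one free object */
    FREE_KEEP_PARTIAL,     /* slab was partial and remains on the partial list */
    FREE_DISCARD_SLAB,     /* slab is empty; its pages go back to the buddy allocator */
    FREE_KEEP_EMPTY        /* slab is empty but retained as a reserve */
};

/*
 * Decide what happens to a slab when one of its objects is freed.
 * inuse_before: objects in use before this free (1..objects)
 * objects:      total objects in the slab
 * nr_partial:   partial slabs currently held by the node
 * min_partial:  number of partial slabs the node should keep around
 */
static enum free_action classify_free(unsigned int inuse_before, unsigned int objects,
                                      unsigned long nr_partial, unsigned long min_partial)
{
    assert(inuse_before >= 1 && inuse_before <= objects);

    if (inuse_before - 1 == 0)
        return nr_partial >= min_partial ? FREE_DISCARD_SLAB : FREE_KEEP_EMPTY;
    if (inuse_before == objects)
        return FREE_MOVE_TO_PARTIAL;
    return FREE_KEEP_PARTIAL;
}
```

Keeping a few empty slabs around avoids bouncing pages between the slab and buddy allocators when a cache oscillates around a slab boundary.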

Slab Merging

SLUB merges compatible caches to reduce memory overhead. Two caches can be merged if they have:
- Same object size (after alignment)
- Same alignment requirements
- Same GFP flags
- Neither has a constructor (since constructors are not re-called after merge)
- Neither has debugging enabled

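sysfs_slab_alias() creates the aliases, so a merged cache remains visible under each of its names in /sys/kernel/slab/. For example, two constructor-less caches whose objects both round up to 96 bytes may end up sharing one set of slabs. A cache such as anon_vma is not a candidate, because it has a constructor. Merging reduces the total number of slabs in the system.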
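The merge conditions listed above can be written as a predicate. This sketch mirrors the list literally; the kernel's own check is somewhat more permissive about size and alignment, and struct cache_desc and its fields are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a cache, reduced to what merging looks at. */
struct cache_desc {
    size_t object_size;
    size_t align;          /* power of two */
    unsigned long flags;   /* flags that affect how memory is obtained */
    bool has_ctor;
    bool debug;
};

/* Object size rounded up to the cache's alignment. */
static size_t aligned_size(const struct cache_desc *c)
{
    assert(c->align != 0 && (c->align & (c->align - 1)) == 0);
    return (c->object_size + c->align - 1) & ~(c->align - 1);
}

/* True if the two caches satisfy every condition in the list. */
static bool caches_mergeable(const struct cache_desc *a, const struct cache_desc *b)
{
    if (a->has_ctor || b->has_ctor)
        return false;
    if (a->debug || b->debug)
        return false;
    if (a->flags != b->flags)
        return false;
    if (a->align != b->align)
        return false;
    return aligned_size(a) == aligned_size(b);
}
```

A 90-byte and a 96-byte cache with 8-byte alignment both occupy 96-byte slots and pass the check; adding a constructor or enabling debugging on either one blocks the merge.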

/sys/kernel/slab/ Statistics

SLUB exposes per-cache attributes through sysfs. Most object counters require CONFIG_SLUB_DEBUG, and the alloc_fastpath and alloc_slowpath counters additionally require CONFIG_SLUB_STATS:

# List all slab caches
ls /sys/kernel/slab/

# Key files per cache:
cat /sys/kernel/slab/inode_cache/objects          # objects currently in cache
cat /sys/kernel/slab/inode_cache/slabs            # number of slabs
cat /sys/kernel/slab/inode_cache/partial          # number of partial slabs
cat /sys/kernel/slab/inode_cache/cpu_slabs        # per-CPU active slabs
cat /sys/kernel/slab/inode_cache/object_size      # bytes per object
cat /sys/kernel/slab/inode_cache/slab_size        # bytes per object slot (incl. metadata, padding)
cat /sys/kernel/slab/inode_cache/cache_dma        # using ZONE_DMA?
cat /sys/kernel/slab/inode_cache/alloc_fastpath   # alloc fast path count
cat /sys/kernel/slab/inode_cache/alloc_slowpath   # alloc slow path count

# Quick summary of all caches
slabtop -o -s c  # sort by cache size
# or
cat /proc/slabinfo

slabtop is invaluable for finding memory leaks in kernel code: a cache that grows without bound while the system isn't doing more work is leaking objects.

SLUB Debugging (CONFIG_SLUB_DEBUG)

SLUB debugging adds:
- Object poisoning: Free objects filled with 0x6b (POISON_FREE). On allocation, kernel verifies the poison pattern; any modification means use-after-free.
- Red zones: Guard bytes before and after each object. Overflow detected on free.
- Tracking: Last allocation and free stack trace stored per object. Invaluable for use-after-free diagnosis.

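# Enable sanity checks for a specific cache at runtime (older kernels only;
# on recent kernels this attribute is read-only, so boot with
# slub_debug=F,inode_cache instead)
echo 1 > /sys/kernel/slab/inode_cache/sanity_checks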

# Enable at boot (all caches, expensive)
# Boot param: slub_debug=FZPU (F=sanity checks, Z=red zone, P=poison, U=user tracking)

# Check for SLUB errors in dmesg
dmesg | grep -E "SLUB|slab|kmalloc|Redzone|Poison"

Typical SLUB error message for use-after-free:

BUG inode_cache: Poison overwritten
INFO: 0xffff88801234abcd-0xffff88801234abff. First byte 0xcc instead of 0x6b
INFO: Slab 0xffffea0000123400 objects=21 used=20 fp=0xffff88801234abcd flags=0x200
INFO: Object 0xffff88801234abcd @offset=123 fp=0x0000000000000000
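The check behind a "Poison overwritten" report can be sketched in user space. The sketch assumes the kernel's poison constants (0x6b fill, 0xa5 as the final byte) and uses illustrative function names; it is a model of the idea, not the SLUB code path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_FREE 0x6b   /* fill pattern for freed objects */
#define POISON_END  0xa5   /* marker written to the last byte of a freed object */

/* Fill a freed object with the poison pattern. */
static void poison_object(unsigned char *obj, size_t size)
{
    assert(size > 0);
    memset(obj, POISON_FREE, size - 1);
    obj[size - 1] = POISON_END;
}

/*
 * Verify the poison pattern before the object is reused.
 * Returns the offset of the first corrupted byte, or -1 if the object is intact.
 */
static long first_poison_fault(const unsigned char *obj, size_t size)
{
    assert(size > 0);
    for (size_t i = 0; i + 1 < size; i++) {
        if (obj[i] != POISON_FREE)
            return (long)i;
    }
    return obj[size - 1] == POISON_END ? -1 : (long)(size - 1);
}
```

Any write through a dangling pointer between free and the next allocation changes at least one poisoned byte, so the check pinpoints the offset of the stale write. A stale read is not detected this way.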

Historical Context

Jeff Bonwick described the slab allocator in his USENIX Summer 1994 paper, based on his work implementing it in Solaris 2.4. Linux's SLAB implementation, written by Mark Hemment and later reworked by Manfred Spraul and others, appeared during the 2.1 development series and shipped in Linux 2.2. The original SLAB kept three lists per cache (full, partial, and free slabs), was complex, and carried significant metadata overhead.

Christoph Lameter developed SLUB and proposed it as a replacement in 2007; it was merged in Linux 2.6.22 and became the default in 2.6.23. SLUB's key innovation was treating each slab page independently and using the struct page fields for slab metadata, eliminating the separate slab management structure. This reduced per-object overhead from ~40 bytes to ~8 bytes and simplified the code significantly.

SLOB (Simple List Of Blocks) is an alternative for systems with less than ~16 MB of RAM (embedded, tiny routers). It uses a single linked list of free blocks with first-fit allocation. Much simpler and lower memory overhead, but O(n) allocation time.

Production Examples

Dentry cache growth during find: Running find / -name foo on a system with a large filesystem creates millions of dentries (directory entries) in the dentry slab cache. Each dentry is ~200 bytes. After find exits, the dentries remain cached for future use. On a system with a 1TB filesystem and millions of files, the dcache can consume tens of GB of RAM — which is correct behavior but can look alarming.

Socket slab leak: A bug in a kernel module that allocates TCP sockets but never releases them causes the TCP slab cache to grow without bound. slabtop shows TCP as the fastest-growing cache. The module leaks one socket per connection.

kmalloc internal fragmentation with many 33-byte allocations: A driver that allocates many 33-byte network headers via kmalloc(33, GFP_KERNEL) is served from kmalloc-64, the next size class. Each allocation wastes 31 bytes, or 48% of the 64-byte slot. At 1M allocations/s, 31 MB of slack is handed out every second, and it stays unusable for as long as the objects remain allocated.

Debugging Notes

# Top slab consumers
slabtop -o -s c | head -20

# All slabs with object counts
cat /proc/slabinfo | sort -k3 -rn | head -20
# Fields: name, active_objs, num_objs, objsize, objperslab, pagesperslab

# Monitor slab growth in real-time
watch -n2 "slabtop -o -s c | head -15"

# Find a slab leak (cache growing without a corresponding workload increase)
# Take snapshot 1
cat /proc/slabinfo > /tmp/slab_before.txt
# Run workload, wait
cat /proc/slabinfo > /tmp/slab_after.txt
# Diff
diff /tmp/slab_before.txt /tmp/slab_after.txt | grep "^>" | sort -k3 -rn

# Check if dentry/inode cache is too large
grep -E "^(Slab|SReclaimable|SUnreclaim)" /proc/meminfo
# SUnreclaim = unreclaimable slab; steady growth without added workload suggests a leak

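# Enable SLUB sanity checks for a specific cache
# (writable on older kernels only; newer kernels need slub_debug=F,kmalloc-64 at boot)
echo 1 > /sys/kernel/slab/kmalloc-64/sanity_checks

# Object tracking (requires CONFIG_SLUB_DEBUG compiled in;
# on newer kernels boot with slub_debug=U,TCP instead of writing store_user)
echo 1 > /sys/kernel/slab/TCP/store_user
# Then trigger allocations, then read alloc/free stacks
# (newer kernels expose these under /sys/kernel/debug/slab/TCP/ instead)
cat /sys/kernel/slab/TCP/alloc_calls
cat /sys/kernel/slab/TCP/free_calls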
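When the shell pipelines above are not enough, /proc/slabinfo lines are simple to parse programmatically. The sketch below reads the six leading fields named in the comment above and estimates a cache's footprint as num_objs × objsize, which slightly understates real usage because it ignores per-slab padding. The struct and function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Leading columns of one /proc/slabinfo data line. */
struct slabinfo_row {
    char name[64];
    unsigned long active_objs;
    unsigned long num_objs;
    unsigned long objsize;
    unsigned long objperslab;
    unsigned long pagesperslab;
};

/*
 * Parse one line of /proc/slabinfo.
 * Returns 1 on success, 0 for the two header lines or malformed input.
 */
static int parse_slabinfo_line(const char *line, struct slabinfo_row *row)
{
    if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
        return 0;
    return sscanf(line, "%63s %lu %lu %lu %lu %lu",
                  row->name, &row->active_objs, &row->num_objs,
                  &row->objsize, &row->objperslab, &row->pagesperslab) == 6;
}

/* Approximate bytes held by the cache: allocated object slots times object size. */
static unsigned long long row_bytes(const struct slabinfo_row *row)
{
    return (unsigned long long)row->num_objs * row->objsize;
}
```

Feeding two snapshots through this parser and comparing row_bytes() per cache name gives the same signal as the diff above, but in a form that can be sorted by growth in bytes.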

Security Implications

Heap overflow into adjacent object: If object A overflows into object B (both in the same slab), the attacker can corrupt B's fields. In SLUB without debugging, there is no red zone between objects. The overflow may corrupt function pointers (vtables), capability fields, or uid/gid.

Slab exploitation techniques: Classic kernel heap exploitation targets:
1. Heap spray: Allocate many objects of a specific type to fill slabs, increasing the probability of adjacent allocation.
2. Dangling pointer: Free an object but keep a reference. Wait for the slab to be partially filled with new allocations of a different type, then use the dangling pointer to read/write the new object.
3. Cross-cache exploitation: After freeing an object and the slab returning to buddy, allocate a different cache that gets the same physical page. This is harder with SLUB due to slab isolation.

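CVE examples:
- CVE-2022-27666 (IPsec ESP heap overflow): Heap buffer overflow in the ESP transformation code (esp6_output_head() and its IPv4 counterpart), where a message larger than the buffer allocated for it overwrites adjacent kernel heap objects. Exploitable for privilege escalation.
- CVE-2021-22555 (netfilter x_tables heap out-of-bounds write): A few bytes are written past the end of a heap allocation while converting 32-bit compat structures in net/netfilter/x_tables.c. Exploitable for privilege escalation. CVSS 7.8.
- CVE-2016-6187 (AppArmor heap off-by-one): apparmor_setprocattr() does not validate the buffer size, allowing a one-byte overflow of a kmalloc object.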

Modern mitigations:
- CONFIG_SLAB_FREELIST_RANDOMIZE: Randomizes the order of objects on the slab freelist, making heap spray less predictable.
- CONFIG_SLAB_FREELIST_HARDENED: XORs each freelist pointer with a secret value and the object address, making freelist pointer corruption detectable.
- CONFIG_INIT_ON_ALLOC_DEFAULT_ON: Zero-initializes slab objects on allocation (prevents info leak from stale data).
- CONFIG_INIT_ON_FREE_DEFAULT_ON: Zeroes objects on free (prevents use-after-free info leaks).
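The freelist hardening scheme can be modelled with plain 64-bit integers. This sketch assumes the stored value is the next pointer XORed with a per-cache secret and with the byte-swapped address of the slot holding it, which is how the author of this sketch understands the mainline code; the function names and sample addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 64-bit value (portable byte swap). */
static uint64_t swab64(uint64_t x)
{
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* Value actually written into the free object in place of the raw next pointer. */
static uint64_t freelist_encode(uint64_t next_ptr, uint64_t slot_addr, uint64_t secret)
{
    return next_ptr ^ secret ^ swab64(slot_addr);
}

/* Recover the next pointer from the stored value; XOR is its own inverse. */
static uint64_t freelist_decode(uint64_t stored, uint64_t slot_addr, uint64_t secret)
{
    return stored ^ secret ^ swab64(slot_addr);
}
```

An attacker who overwrites the stored value with a raw target address, without knowing the secret, gets a decoded pointer that differs from the target and is very likely invalid, so the corruption tends to surface as a fault instead of a controlled allocation.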

Performance Implications

Fast path allocation: ~5–10 ns (per-CPU freelist, no lock)
Slow path with partial slab: ~100–500 ns (acquire node lock)
New slab allocation: ~1–10 µs (buddy allocator + initialization)
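High-frequency kmalloc hot spots: slabtop -s a sorts by active object count, not by allocation rate. For rates, compare successive readings of alloc_fastpath and alloc_slowpath (CONFIG_SLUB_STATS) or trace the kmem:kmalloc tracepoint.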
NUMA and slab: kmem_cache_alloc_node(cache, gfp, nid) allocates from the specified NUMA node's slab. Avoids remote NUMA slab allocation latency.
Per-CPU partial list: SLUB keeps a per-CPU partial list (cpu_slab->partial) as a second-level cache, reducing contention on the per-node partial list.

Failure Modes and Real Incidents

SLUB corruption detection: A kernel driver calls kfree() on an address that is not the start of an allocated object (a common bug is freeing a pointer to an embedded field instead of the containing struct). With consistency checks enabled (slub_debug=F), SLUB validates the pointer with check_valid_pointer() and reports the bad free, typically as an "Invalid object pointer" BUG for that cache. Without those checks, the bogus address is linked into the freelist and corrupts it silently.

slab_out_of_memory: Under severe memory pressure, kmalloc(GFP_KERNEL) may fail because no buddy pages are available and direct reclaim can't free enough. This causes cascading failures throughout the kernel — any code that doesn't check kmalloc return values will dereference NULL.

Dentry cache storm: A burst of stat() calls to many unique paths creates millions of negative dentries (dentries for files that don't exist). These fill the dentry slab cache. On a system with 4 GB RAM and a dentry of ~200 bytes, 20 million negative dentries consume 4 GB. The OOM killer activates. Production fix: sysctl vm.vfs_cache_pressure=200 aggressively reclaims dentry/inode caches.

Modern Usage

RCU-freed slabs: Objects used with RCU (Read-Copy-Update) are freed via kfree_rcu(), which defers the free until after all current RCU read-side critical sections complete. This avoids use-after-free for RCU-protected objects.
Per-memcg slab accounting: cgroup v2 charges slab allocations made with __GFP_ACCOUNT, or from caches created with SLAB_ACCOUNT, to the memory cgroup of the allocating task, enabling container-level memory accounting for kernel data structures.
Typed memory: Recent work (LWC, 2023) proposes per-type slab isolation: each struct type in its own cache page, so cross-type UAF exploitation requires breaking the buddy allocator rather than just the slab freelist.
kmalloc_large_node: For NUMA-aware large allocations that bypass the slab size class limits.

Future Directions

Slab isolation for security: Each object type in an isolated slab, preventing cross-type heap sprays. Partially done via separate caches for sensitive objects (credential structures, inode).
Hardware-enforced heap safety (ARM MTE): 4-bit tags on heap memory catch buffer overflows and use-after-free at hardware speed. The allocator tags each object on allocation and retags it on free, so a stale pointer's tag no longer matches and the access faults. On arm64 this is exposed through KASAN's hardware tag-based mode (CONFIG_KASAN_HW_TAGS).
Linear Address Masking (LAM) for heap: Intel LAM allows the upper bits of a pointer to carry metadata. A future SLUB could embed size/type tags in pointer upper bits for hardware-checked bounds.

Exercises

Write a kernel module that creates a custom kmem_cache for a 200-byte struct, allocates 10,000 objects, and measures allocation time vs using kmalloc(200, GFP_KERNEL) in a loop.
Enable CONFIG_SLUB_DEBUG on a test kernel. Deliberately write beyond an allocated object (use kmalloc and write past the end). Observe the SLUB corruption detection on free.
Use slabtop to identify the top 5 slab consumers on a system running a production workload. For each, explain why that many objects are cached.
Reproduce the "dentry storm": write a script that stat()s millions of non-existent files. Monitor SReclaimable in /proc/meminfo and observe the dentry cache growth.
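Trace kmalloc calls in the kernel using the kmem:kmalloc tracepoint (perf record -e kmem:kmalloc -a) or a probe such as perf probe -k vmlinux -a '__kmalloc size'; kmalloc() itself is an inline wrapper, so it cannot be probed by name. Run a workload and histogram the allocation sizes.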
Read the SLUB freelist hardening code in mm/slub.c:freelist_ptr(). Understand the XOR encoding. Write a test that detects a corrupted freelist pointer by reimplementing the check in user space.

References

mm/slub.c — SLUB allocator implementation
mm/slab_common.c — shared slab infrastructure
include/linux/slab.h — kmalloc(), kfree(), kmem_cache_alloc() prototypes
include/linux/slub_def.h — struct kmem_cache, struct kmem_cache_cpu
mm/slab.c — legacy SLAB implementation (removed in Linux 6.8)
Jeff Bonwick, "The Slab Allocator: An Object-Caching Kernel Memory Allocator", USENIX Summer 1994
Christoph Lameter, "SLUB: The Unqueued Slab Allocator", LWN 2007
/proc/slabinfo — slab statistics
slabtop(1) man page
LWN: "The SLUB allocator" — https://lwn.net/Articles/229984/
CVE-2021-22555 analysis: https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html
"Exploiting the Linux Kernel via Socket Buffers", Phrack 64