Section 11: Memory Management
Purpose and Scope
Memory management is the substrate on which every other OS subsystem rests. This section dissects the complete memory hierarchy from physical DRAM through the OS virtual memory subsystem to user-space allocators. It covers address translation mechanics (paging, segmentation, multi-level page tables), TLB behavior, and every major kernel memory allocator (buddy, slab/SLUB, kmalloc, vmalloc). It extends to the interfaces user space uses (mmap, brk), the lazy mechanisms (demand paging, copy-on-write), the pressure-relief valves (swapping, OOM killer), and the hardware extensions that have reshaped the field (huge pages, NUMA, IOMMU, persistent memory).
Correctness and performance are inseparable here: a misunderstood TLB shootdown or an ill-placed huge page can silently dominate a workload's latency profile.
Prerequisites
- Section 02 (CPU Architecture): cache hierarchy, TLB, memory bus
- Section 03 (OS Fundamentals): kernel/user split, syscalls, process model
- Section 04 (Processes and Threads): address space layout, fork/exec
- Basic familiarity with x86-64 paging structures (CR3, PML4)
Learning Objectives
Upon completing this section you will be able to:
- Walk a 4-level (and 5-level) x86-64 page table translation from virtual address to physical frame.
- Explain TLB shootdown mechanics and their cost in multi-core systems.
- Describe the Linux buddy allocator and why it exists alongside the slab/SLUB allocator.
- Explain copy-on-write and demand paging, including their interaction with fork().
- Trace what happens from the moment a process calls malloc() to when DRAM is physically allocated.
- Reason about NUMA topology effects on memory allocation performance.
- Explain how the OOM killer selects a victim and how to influence it.
- Describe the persistent memory (PMEM/NVDIMM) programming model: DAX mappings and the cache-flush instructions (CLWB, CLFLUSHOPT) that make stores durable.
- Articulate the trade-offs of transparent huge pages vs explicit huge pages.
Architecture Overview
Process Virtual Address Space (user half: 0 → 2^47-1 with 4-level paging on x86-64)
┌───────────────────────────────────────────────────────┐
│ [text] [data] [heap→]     [mmap region]     [←stack]  │
└──────────────────┬────────────────────────────────────┘
                   │ virtual address
                   ▼
┌──────────────────────────────────────────────────────────────────┐
│                     MMU / Page Table Walker                      │
│  CR3 → PML4 (L4) → PDPT (L3) → PD (L2) → PT (L1) → PFN           │
│  Each level: 512 entries × 8 bytes = one 4 KB table              │
│  Huge pages: PD entry maps 2 MB; PDPT entry maps 1 GB            │
└──────────────────────┬───────────────────────────────────────────┘
                       │ TLB hit: skip walk
                       ▼
┌──────────────────────────────────────────────────────────────────┐
│                         Physical Memory                          │
│  ┌───────────────────────────────────────────────────────────┐   │
│  │ Buddy Allocator (page granularity, order 0–10)            │   │
│  │ ├── Zone DMA (0–16 MB)                                    │   │
│  │ ├── Zone DMA32 (16 MB – 4 GB)                             │   │
│  │ ├── Zone Normal (above 4 GB on x86-64)                    │   │
│  │ └── Zone HighMem (32-bit only) / Movable                  │   │
│  └──────────────────────┬────────────────────────────────────┘   │
│                         │ slabs carved from buddy pages          │
│  ┌──────────────────────▼────────────────────────────────────┐   │
│  │ SLUB Allocator (sub-page objects, per-CPU caches)         │   │
│  │ kmem_cache: task_struct, inode, dentry, sk_buff …         │   │
│  └───────────────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────────────┘
                       │
       NUMA Node 0           NUMA Node 1
       ┌──────────┐          ┌──────────┐
       │ CPU 0-15 │          │ CPU16-31 │
       │ 64 GB    │◄─QPI/UPI►│ 64 GB    │
       └──────────┘          └──────────┘
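The walk in the diagram (CR3 → PML4 → PDPT → PD → PT → PFN) consumes a 48-bit virtual address nine bits at a time, leaving a 12-bit offset into the 4 KB page. The following is a minimal sketch of that bit slicing only; the function name `split_virtual_address` and the constants are illustrative choices, and it does not model page table entries, permissions, or huge pages.

```python
PAGE_SHIFT = 12   # 4 KB pages: the low 12 bits are the byte offset
INDEX_BITS = 9    # 512 entries per table, so 9 index bits per level
LEVELS = ("PML4", "PDPT", "PD", "PT")  # top level first


def split_virtual_address(vaddr):
    """Split a 48-bit virtual address into per-level table indices and offset.

    Illustrative helper (not a kernel API): it only slices bits and does
    not consult any real page table.
    """
    if not 0 <= vaddr < (1 << 48):
        raise ValueError("expected a 48-bit virtual address")
    index_mask = (1 << INDEX_BITS) - 1
    parts = {}
    for depth, name in enumerate(LEVELS):
        # PML4 uses bits 47..39, PDPT 38..30, PD 29..21, PT 20..12.
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        parts[name] = (vaddr >> shift) & index_mask
    parts["offset"] = vaddr & ((1 << PAGE_SHIFT) - 1)
    return parts


if __name__ == "__main__":
    for name, value in split_virtual_address(0x7FFFFFFFF123).items():
        print(f"{name:>6}: {value}")
```

With 5-level paging, one more 9-bit index sits above PML4, which extends the same scheme to 57-bit addresses.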
Key Concepts
- Virtual Memory: The abstraction that gives each process its own isolated address space, decoupled from physical layout.
- Paging: Divides virtual and physical memory into fixed-size pages (typically 4 KB); the MMU translates via page tables.
- Page Table: A hierarchical data structure mapping virtual page numbers to physical frame numbers.
- TLB (Translation Lookaside Buffer): A hardware cache for recent page table entries; a miss triggers a page walk.
- TLB Shootdown: When a page mapping changes, the kernel must invalidate TLB entries on all CPUs that may have cached it — an expensive IPI-based operation.
- Huge Pages: 2 MB (x86 large) or 1 GB (x86 huge) pages that reduce TLB pressure for large working sets.
- Transparent Huge Pages (THP): Kernel mechanism that automatically promotes/demotes 4 KB pages to/from 2 MB hugepages.
- Buddy Allocator: Linux's primary physical page allocator; manages free pages in power-of-two order lists to minimize fragmentation.
- Slab/SLUB Allocator: Object-oriented allocator layered on top of buddy; caches frequently-allocated kernel objects.
- Copy-on-Write (CoW): After fork(), parent and child share pages read-only; a write triggers a private copy.
- Demand Paging: Physical frames are not allocated until a virtual page is first accessed; the page fault handler does the work.
- mmap: Maps files or anonymous memory into the address space; the foundation of shared libraries, file I/O, and IPC.
- OOM Killer: Linux's last resort when physical memory is exhausted — scores and kills a process based on memory usage and oom_score_adj.
- NUMA (Non-Uniform Memory Access): Multi-socket systems where memory access latency depends on which node holds the physical frame.
- IOMMU: Input/Output Memory Management Unit; translates device DMA addresses, enables isolation and large contiguous device mappings.
- Persistent Memory (PMEM): Byte-addressable non-volatile memory (Optane, NVDIMM) exposed via DAX (direct access), bypassing the page cache.
- KSM (Kernel Samepage Merging): Scans anonymous pages for identical content and merges them using CoW.
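The buddy allocator entry above describes power-of-two free lists; the splitting and coalescing behind that idea can be sketched with a toy model. This is a teaching sketch under simplifying assumptions (a single zone, one maximal block at start, Python sets instead of linked lists), and the class `BuddyAllocator` is hypothetical rather than a rendering of the kernel's code.

```python
class BuddyAllocator:
    """Toy buddy allocator over page frame numbers (PFNs)."""

    def __init__(self, max_order=10):
        self.max_order = max_order
        # free_lists[k] holds the start PFN of every free block of 2**k pages.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)  # begin with one maximal free block
        self.allocated = {}                # start PFN -> order, used by free()

    def alloc(self, order):
        """Return the start PFN of a block of 2**order pages."""
        if not 0 <= order <= self.max_order:
            raise ValueError("order out of range")
        # Find the smallest non-empty free list that can satisfy the request.
        k = order
        while k <= self.max_order and not self.free_lists[k]:
            k += 1
        if k > self.max_order:
            raise MemoryError(f"no free block of order {order}")
        pfn = min(self.free_lists[k])
        self.free_lists[k].remove(pfn)
        # Split down to the requested order; the upper half of each split
        # goes back on the free list one order lower.
        while k > order:
            k -= 1
            self.free_lists[k].add(pfn + (1 << k))
        self.allocated[pfn] = order
        return pfn

    def free(self, pfn):
        """Release a block and merge it with its buddy while possible."""
        order = self.allocated.pop(pfn)
        while order < self.max_order:
            buddy = pfn ^ (1 << order)  # buddies differ only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            pfn = min(pfn, buddy)       # merged block starts at the lower PFN
            order += 1
        self.free_lists[order].add(pfn)
```

Freeing every block in this model merges the pieces back into the single maximal block, which is the property that keeps external fragmentation in check. Sub-page objects are the slab/SLUB layer's job and are out of scope here.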
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1961 | Atlas computer introduces one-level store (virtual memory concept) |
| 1968 | Denning formalizes the working set model for page replacement |
| 1972 | IBM adds virtual storage to the System/370 line, bringing virtual memory to mainstream commercial systems |
| 1985 | Intel 80386 introduces 32-bit paging; two-level page tables |
| 1991 | Linux 0.01 ships with basic paging on x86 |
| 1994 | Linux adopts three-level page table abstraction (PGD/PMD/PTE) |
| 2002 | Linux 2.5 development series: reverse mapping (rmap) for efficient page reclaim |
| 2003 | Linux 2.6: NUMA awareness, per-zone allocator, slab rework |
| 2006 | Linux 2.6.16: SLOB allocator added for embedded systems |
| 2007 | Linux 2.6.22: SLUB allocator merged; it replaces SLAB as the default in 2.6.23 |
| 2009 | Intel Nehalem: IOMMU (VT-d) in mainstream server silicon |
| 2011 | Transparent Huge Pages merged into Linux 2.6.38 |
| 2015 | Intel and Micron announce 3D XPoint, the memory technology behind Optane NVDIMMs |
| 2017 | Linux 4.14: 5-level page table support (57-bit virtual addresses) merged |
| 2018 | Linux DAX (Direct Access) support for persistent memory stabilizes |
| 2019 | Linux 5.5 development cycle: 5-level paging (LA57) support enabled by default in the x86-64 kernel configuration |
| 2022 | Folio infrastructure merged to rationalize compound page handling |
Modern Relevance and Production Use Cases
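In-memory databases (Redis, Memcached, VoltDB) can benefit substantially from huge pages: mapping a 100 GB dataset with 4 KB pages takes roughly 26 million translations (26,214,400 for 100 GiB), far more than any TLB can hold, whereas 2 MB pages need only 51,200. Every TLB miss adds a page walk to the access, so the miss rate feeds directly into latency.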
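The page counts quoted above follow from simple division, and the same arithmetic gives the "TLB reach" (how much memory a fully populated TLB can cover). The sketch below reproduces the numbers; the 1536-entry TLB size is a hypothetical figure chosen for illustration, not a claim about any particular CPU.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def translations_needed(working_set_bytes, page_bytes):
    """Number of pages (hence distinct translations) covering a working set."""
    return -(-working_set_bytes // page_bytes)  # ceiling division


def tlb_reach(entries, page_bytes):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_bytes


if __name__ == "__main__":
    dataset = 100 * GIB
    hypothetical_tlb_entries = 1536  # assumption for illustration only
    for label, page in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
        pages = translations_needed(dataset, page)
        reach = tlb_reach(hypothetical_tlb_entries, page)
        print(f"{label} pages: {pages:>10,} translations, "
              f"TLB reach {reach / MIB:,.0f} MiB")
```

The ratio between working set size and TLB reach, rather than the raw page count, is what indicates how often a random access will miss.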
JVM workloads (Kafka, Elasticsearch, Cassandra) use -XX:+UseLargePages or -XX:+UseTransparentHugePages; GC pause times can correlate with TLB shootdown storms during heap remapping.
ML training (PyTorch, JAX) stages data through large pinned (page-locked) host buffers, sometimes backed by huge pages via mmap with MAP_HUGETLB; these need careful NUMA placement, since each remote DRAM access can cost an extra 40–100 ns.
Container platforms (Kubernetes) rely on cgroup memory limits, which hook into the OOM killer and page reclaim; misconfigured oom_score_adj is a common cause of unexpected pod terminations.
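To see why oom_score_adj matters so much, it helps to model victim selection. The sketch below is loosely modeled on the badness heuristic in recent Linux kernels (resident, swapped, and page-table pages, plus an adjustment scaled by total memory); the exact formula varies by kernel version, so treat both the arithmetic and the function names `oom_badness` and `pick_victim` as assumptions made for illustration.

```python
OOM_SCORE_ADJ_MIN = -1000  # tasks with this value are exempt from selection


def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Approximate badness score in pages; None means 'never kill'.

    Simplified model, not the kernel's code: the baseline is the task's
    memory footprint, and each oom_score_adj point shifts the score by
    0.1% of total memory.
    """
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(tasks, total_pages):
    """Return the name of the task with the highest score, or None."""
    scored = []
    for name, (rss, swap, pgtables, adj) in tasks.items():
        score = oom_badness(rss, swap, pgtables, adj, total_pages)
        if score is not None:
            scored.append((score, name))
    return max(scored)[1] if scored else None


if __name__ == "__main__":
    # Hypothetical tasks: (rss_pages, swap_pages, pgtable_pages, oom_score_adj)
    tasks = {
        "database": (2_000_000, 0, 4_000, -1000),
        "batch-job": (1_000_000, 0, 2_000, 0),
        "sidecar": (500_000, 0, 1_000, 500),
    }
    print(pick_victim(tasks, total_pages=4_000_000))
```

In this model a positive adjustment can outweigh a much larger footprint, which matches the failure mode described above: a small process with a high oom_score_adj is killed ahead of the process that actually consumed the memory.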
Persistent memory deployments (SAP HANA, Oracle TimesTen) use PMEM in App Direct mode with DAX-enabled filesystems (ext4-dax, NOVA) to achieve sub-microsecond durable writes, fundamentally changing durability trade-offs.
File Map
| File | Description |
|---|---|
| 01-virtual-memory-fundamentals.md | Address spaces, MMU, protection bits, hardware vs software TLB |
| 02-paging-and-segmentation.md | x86 segmentation legacy, paging mechanics, protection rings |
| 03-page-tables.md | 4-level and 5-level x86-64 page tables, ARM64 page tables |
| 04-tlb-management.md | TLB structure, shootdowns, ASID/PCID, INVLPG |
| 05-huge-pages.md | 2 MB / 1 GB pages, HugeTLB, THP, libhugetlbfs |
| 06-buddy-allocator.md | Free list design, coalescing, fragmentation, compaction |
| 07-slab-slub-allocator.md | kmem_cache, per-CPU magazines, SLUB design |
| 08-kmalloc-vmalloc.md | kmalloc size classes, vmalloc for non-contiguous mappings |
| 09-mmap-brk.md | mmap internals, VMA merging, brk for heap growth |
| 10-copy-on-write.md | CoW in fork(), mmap MAP_PRIVATE, page fault handler |
| 11-demand-paging.md | Minor vs major faults, page fault handler flow |
| 12-swapping-and-reclaim.md | LRU lists, kswapd, direct reclaim, swap space |
| 13-oom-killer.md | OOM score calculation, oom_score_adj, cgroup OOM |
| 14-memory-fragmentation.md | Internal/external fragmentation, compaction, CMA |
| 15-dma-and-iommu.md | DMA API, coherent vs streaming DMA, IOMMU/VT-d |
| 16-persistent-memory.md | NVDIMM, DAX, CLWB/CLFLUSHOPT, PMDK, NOVA filesystem |
| 17-numa-memory.md | NUMA topology, numactl, first-touch policy, memory migration |
| 18-hugeTLB-KSM.md | HugeTLB reservations, KSM merging, ksm scan rate tuning |
Cross-References
- Section 02 (CPU Architecture): cache hierarchy, TLB hardware, memory bus bandwidth
- Section 04 (Processes): address space layout, fork/exec, VMA structure
- Section 10 (Synchronization): memory barriers, NUMA-aware locking, page table spinlocks
- Section 13 (Filesystems): page cache, buffer cache, DAX bypass
- Section 14 (Device Drivers): DMA API, IOMMU, pinned memory
- Section 15 (Networking): zero-copy via sendfile/splice, sk_buff and page pinning
- Section 19 (Virtualization): EPT/NPT (nested paging), balloon drivers, memory overcommit