Section 11: Memory Management

Purpose and Scope

Memory management is the substrate on which every other OS subsystem rests. This section dissects the complete memory hierarchy from physical DRAM through the OS virtual memory subsystem to user-space allocators. It covers address translation mechanics (paging, segmentation, multi-level page tables), TLB behavior, and every major kernel memory allocator (buddy, slab/SLUB, kmalloc, vmalloc). It extends to the interfaces user space uses (mmap, brk), the lazy mechanisms (demand paging, copy-on-write), the pressure-relief valves (swapping, OOM killer), and the hardware extensions that have reshaped the field (huge pages, NUMA, IOMMU, persistent memory).

Correctness and performance are inseparable here: a misunderstood TLB shootdown or an ill-placed huge page can silently dominate a workload's latency profile.

Prerequisites

Section 02 (CPU Architecture): cache hierarchy, TLB, memory bus
Section 03 (OS Fundamentals): kernel/user split, syscalls, process model
Section 04 (Processes and Threads): address space layout, fork/exec
Basic familiarity with x86-64 paging structures (CR3, PML4)

Learning Objectives

Upon completing this section you will be able to:

Walk a 4-level (and 5-level) x86-64 page table translation from virtual address to physical frame.
Explain TLB shootdown mechanics and their cost in multi-core systems.
Describe the Linux buddy allocator and why it exists alongside the slab/SLUB allocator.
Explain copy-on-write and demand paging, including their interaction with fork().
Trace what happens from the moment a process calls malloc() to when DRAM is physically allocated.
Reason about NUMA topology effects on memory allocation performance.
Explain how the OOM killer selects a victim and how to influence it.
Describe persistent memory (PMEM/NVDIMM) programming models (DAX, CLWB, CLFLUSHOPT).
Articulate the trade-offs of transparent huge pages vs explicit huge pages.

Architecture Overview

  Process Virtual Address Space (user half: 0 → 2^47-1 on x86-64 with 4-level paging)
 ┌───────────────────────────────────────────────────────┐
 │  [text] [data] [heap→]    [←stack]  [mmap region]    │
 └──────────────────┬────────────────────────────────────┘
                    │ virtual address
                    ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                    MMU / Page Table Walker                        │
 │  CR3 → PML4 (L4) → PDPT (L3) → PD (L2) → PT (L1) → PFN        │
 │  Each level: 512 entries × 8 bytes = 4 KB page                   │
 │  Huge pages: PD entry maps 2 MB; PDPT entry maps 1 GB            │
 └──────────────────────┬───────────────────────────────────────────┘
                        │  TLB hit: skip walk
                        ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                     Physical Memory                               │
 │  ┌───────────────────────────────────────────────────────────┐   │
 │  │  Buddy Allocator (page granularity, order 0–10)           │   │
 │  │  ├── Zone DMA  (0–16 MB)                                  │   │
 │  │  ├── Zone Normal (16 MB – highmem boundary)               │   │
 │  │  └── Zone Highmem / Movable                               │   │
 │  └──────────────────────┬────────────────────────────────────┘   │
 │                         │ slabs carved from buddy pages           │
 │  ┌──────────────────────▼────────────────────────────────────┐   │
 │  │  SLUB Allocator (sub-page objects, per-CPU caches)        │   │
 │  │  kmem_cache: task_struct, inode, dentry, sk_buff …        │   │
 │  └───────────────────────────────────────────────────────────┘   │
 └──────────────────────────────────────────────────────────────────┘
                        │
              NUMA Node 0          NUMA Node 1
           ┌──────────┐          ┌──────────┐
           │ CPU 0-15 │          │ CPU16-31 │
           │ 64 GB    │◄─QPI/UPI►│ 64 GB    │
           └──────────┘          └──────────┘
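
The walk shown in the MMU box can be made concrete by splitting a virtual address into its per-level indices. The following is a minimal sketch, assuming 4-level paging with 4 KB pages; the function name `split_virtual_address` is illustrative and not taken from any kernel source, and huge-page mappings and 5-level paging are not handled.

```python
PAGE_SHIFT = 12   # 4 KB pages: the low 12 bits are the byte offset
INDEX_BITS = 9    # 512 entries per table, so 9 index bits per level
LEVELS = ("PML4", "PDPT", "PD", "PT")  # L4 down to L1, as in the diagram


def split_virtual_address(vaddr: int) -> dict:
    """Split a 48-bit x86-64 virtual address into table indices and offset."""
    index_mask = (1 << INDEX_BITS) - 1
    fields = {}
    for depth, name in enumerate(LEVELS):
        # PML4 uses bits 47..39, PDPT 38..30, PD 29..21, PT 20..12.
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        fields[name] = (vaddr >> shift) & index_mask
    fields["offset"] = vaddr & ((1 << PAGE_SHIFT) - 1)
    return fields


if __name__ == "__main__":
    for name, value in split_virtual_address(0x00007F1234567ABC).items():
        print(f"{name:>6}: {value}")
```

Each index selects one of the 512 entries in the table at that level; the entry found there supplies the physical address of the next table, and the final PT entry supplies the frame that the offset is added to.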

Key Concepts

Virtual Memory: The abstraction that gives each process its own isolated address space, decoupled from physical layout.
Paging: Divides virtual and physical memory into fixed-size pages (typically 4 KB); the MMU translates via page tables.
Page Table: A hierarchical data structure mapping virtual page numbers to physical frame numbers.
TLB (Translation Lookaside Buffer): A hardware cache for recent page table entries; a miss triggers a page walk.
TLB Shootdown: When a page mapping changes, the kernel must invalidate TLB entries on all CPUs that may have cached it — an expensive IPI-based operation.
Huge Pages: 2 MB (x86 large) or 1 GB (x86 huge) pages that reduce TLB pressure for large working sets.
Transparent Huge Pages (THP): Kernel mechanism that automatically promotes/demotes 4 KB pages to/from 2 MB hugepages.
Buddy Allocator: Linux's primary physical page allocator; manages free pages in power-of-two order lists to minimize fragmentation.
Slab/SLUB Allocator: Object-oriented allocator layered on top of buddy; caches frequently-allocated kernel objects.
Copy-on-Write (CoW): After fork(), parent and child share pages read-only; a write triggers a private copy.
Demand Paging: Physical frames are not allocated until a virtual page is first accessed; the page fault handler does the work.
mmap: Maps files or anonymous memory into the address space; the foundation of shared libraries, file I/O, and IPC.
OOM Killer: Linux's last resort when physical memory is exhausted — scores and kills a process based on memory usage and oom_score_adj.
NUMA (Non-Uniform Memory Access): Multi-socket systems where memory access latency depends on which node holds the physical frame.
IOMMU: Input/Output Memory Management Unit; translates device DMA addresses, enables isolation and large contiguous device mappings.
Persistent Memory (PMEM): Byte-addressable non-volatile memory (Optane, NVDIMM) exposed via DAX (direct access), bypassing the page cache.
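KSM (Kernel Samepage Merging): Scans anonymous pages in regions that applications have marked with madvise(MADV_MERGEABLE), merges pages with identical content into one write-protected page, and relies on CoW to split them again on write.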
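
As a rough illustration of the Buddy Allocator entry above, the power-of-two structure means a block's buddy can be found with a single XOR on its page frame number. This is a sketch of the arithmetic only, not of the Linux implementation; the function names are hypothetical, and `MAX_ORDER = 10` simply mirrors the "order 0–10" range in the architecture diagram.

```python
MAX_ORDER = 10  # largest block is 2**10 pages (assumed, per the diagram)


def order_for_pages(n_pages: int) -> int:
    """Smallest order whose block of 2**order pages can hold n_pages."""
    if n_pages < 1:
        raise ValueError("need at least one page")
    order = (n_pages - 1).bit_length()  # ceil(log2(n_pages))
    if order > MAX_ORDER:
        raise ValueError("request exceeds the largest buddy block")
    return order


def buddy_pfn(pfn: int, order: int) -> int:
    """PFN of the buddy of the 2**order-page block that starts at pfn."""
    if pfn % (1 << order):
        raise ValueError("block start must be aligned to the block size")
    return pfn ^ (1 << order)  # flip the bit that distinguishes the pair


def merged_pfn(pfn: int, order: int) -> int:
    """Start of the order+1 block formed when a block and its buddy merge."""
    return pfn & ~(1 << order)


if __name__ == "__main__":
    print(order_for_pages(3))   # a 3-page request is served from order 2
    print(buddy_pfn(12, 2))     # buddy of the 4-page block at PFN 12
    print(merged_pfn(12, 2))    # the merged 8-page block starts here
```

Rounding every request up to a power of two wastes some memory inside a block, but it makes coalescing cheap: when a block is freed, the allocator only has to check whether the one block at `buddy_pfn` is also free.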

Major Historical Milestones

Year	Milestone
1962	Atlas computer enters service with its one-level store (virtual memory concept)
1968	Denning formalizes the working set model for page replacement
1972	IBM announces virtual storage for System/370, bringing virtual memory to its mainstream commercial line
1985	Intel 80386 introduces 32-bit paging; two-level page tables
1991	Linux 0.01 ships with basic paging on x86
1994	Linux adopts three-level page table abstraction (PGD/PMD/PTE)
2002	Linux 2.5 development series: reverse mapping (rmap) merged for efficient page reclaim
2003	Linux 2.6: NUMA awareness, per-zone allocator, slab rework
2006	Linux 2.6.16: SLOB allocator added for embedded systems
2007	Linux 2.6.22 merges SLUB; it replaces SLAB as the default allocator in 2.6.23
2009	Intel Nehalem: IOMMU (VT-d) in mainstream server silicon
2011	Transparent Huge Pages merged into Linux 2.6.38
2015	Intel Optane (3D XPoint) NVDIMM announced
2017	Linux 4.14: 5-level page table support (57-bit virtual addresses) merged
2018	Linux DAX (Direct Access) support for persistent memory stabilizes
2019	Linux 5-level paging (LA57) enabled by default for large-memory systems
2022	Folio infrastructure merged to rationalize compound page handling

Modern Relevance and Production Use Cases

In-memory databases (Redis, Memcached, VoltDB) can benefit substantially from huge pages: mapping a 100 GB dataset with 4 KB pages takes about 26.2 million translations, far more than any TLB can hold; with 2 MB pages it takes only 51,200. Every TLB miss adds a page walk, so the miss rate feeds directly into access latency.
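
The page counts above follow from simple division, and the same arithmetic gives the TLB reach (how much memory the TLB can cover at once). This is a small sketch of that calculation; the figure of 1536 TLB entries is an assumed, illustrative value rather than a property of any particular CPU.

```python
KB, MB, GB = 2**10, 2**20, 2**30


def pages_needed(dataset_bytes: int, page_bytes: int) -> int:
    """Number of pages (and distinct translations) needed to map a dataset."""
    return -(-dataset_bytes // page_bytes)  # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_bytes


if __name__ == "__main__":
    dataset = 100 * GB
    assumed_tlb_entries = 1536  # hypothetical second-level TLB size
    for label, size in (("4 KB", 4 * KB), ("2 MB", 2 * MB)):
        pages = pages_needed(dataset, size)
        reach_mb = tlb_reach(assumed_tlb_entries, size) // MB
        print(f"{label} pages: {pages:,} translations, reach {reach_mb} MB")
```

Under that assumption the TLB covers 6 MB of a 100 GB dataset with 4 KB pages and 3 GB with 2 MB pages, which is why random access over a large working set misses far less often with huge pages.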

JVM workloads (Kafka, Elasticsearch, Cassandra) use -XX:+UseLargePages or -XX:+UseTransparentHugePages; GC pause times often correlate with TLB shootdown storms during heap remapping.

ML training (PyTorch, JAX) stages data through large pinned (page-locked) host buffers, in some setups backed by huge pages (for example mmap with MAP_HUGETLB), and needs careful NUMA placement to avoid remote DRAM access penalties of 40–100 ns per access.

Container platforms (Kubernetes) rely on cgroup memory limits, which hook into the OOM killer and page reclaim; misconfigured oom_score_adj is a common cause of unexpected pod terminations.
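
To show why oom_score_adj matters so much, the following is a simplified model of victim selection, loosely following the badness heuristic in the kernel's mm/oom_kill.c: memory footprint in pages plus an adjustment scaled to total memory. It is a sketch under stated assumptions, not the kernel's full logic (which also handles cgroup scope, unkillable tasks, and other cases); the task names and page counts are made up for illustration.

```python
OOM_SCORE_ADJ_MIN = -1000  # tasks with this adjustment are never selected


def badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Simplified badness score in pages; None means 'never kill'."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # Each unit of oom_score_adj shifts the score by 0.1% of total memory.
    return points + oom_score_adj * (total_pages // 1000)


def pick_victim(tasks, total_pages):
    """Return the name of the highest-scoring killable task, or None."""
    scores = {}
    for name, (rss, swap, pgtables, adj) in tasks.items():
        score = badness(rss, swap, pgtables, adj, total_pages)
        if score is not None:
            scores[name] = score
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    total = 1_000_000  # hypothetical machine size in pages
    tasks = {          # name: (rss, swap, page tables, oom_score_adj)
        "db":    (400_000, 0, 1_000, -500),
        "batch": (100_000, 0,   500,  500),
        "sshd":  (  2_000, 0,    50, -1000),
    }
    print(pick_victim(tasks, total))
```

In this example the batch job is chosen even though the database uses four times as much memory, because the adjustment outweighs the footprint. A single misconfigured value can therefore redirect the kill to an unexpected process.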

Persistent memory deployments (SAP HANA, Oracle TimesTen) use PMEM in App Direct mode with DAX-capable filesystems (ext4 or XFS mounted with -o dax, or the research filesystem NOVA) to achieve sub-microsecond durable writes, which shifts the usual trade-off between durability and latency.

File Map

File	Description
`01-virtual-memory-fundamentals.md`	Address spaces, MMU, protection bits, hardware vs software TLB
`02-paging-and-segmentation.md`	x86 segmentation legacy, paging mechanics, protection rings
`03-page-tables.md`	4-level and 5-level x86-64 page tables, ARM64 page tables
`04-tlb-management.md`	TLB structure, shootdowns, ASID/PCID, INVLPG
`05-huge-pages.md`	2 MB / 1 GB pages, HugeTLB, THP, libhugetlbfs
`06-buddy-allocator.md`	Free list design, coalescing, fragmentation, compaction
`07-slab-slub-allocator.md`	kmem_cache, per-CPU magazines, SLUB design
`08-kmalloc-vmalloc.md`	kmalloc size classes, vmalloc for non-contiguous mappings
`09-mmap-brk.md`	mmap internals, VMA merging, brk for heap growth
`10-copy-on-write.md`	CoW in fork(), mmap MAP_PRIVATE, page fault handler
`11-demand-paging.md`	Minor vs major faults, page fault handler flow
`12-swapping-and-reclaim.md`	LRU lists, kswapd, direct reclaim, swap space
`13-oom-killer.md`	OOM score calculation, oom_score_adj, cgroup OOM
`14-memory-fragmentation.md`	Internal/external fragmentation, compaction, CMA
`15-dma-and-iommu.md`	DMA API, coherent vs streaming DMA, IOMMU/VT-d
`16-persistent-memory.md`	NVDIMM, DAX, CLWB/CLFLUSHOPT, PMDK, NOVA filesystem
`17-numa-memory.md`	NUMA topology, numactl, first-touch policy, memory migration
`18-hugeTLB-KSM.md`	HugeTLB reservations, KSM merging, ksm scan rate tuning

Cross-References

Section 02 (CPU Architecture): cache hierarchy, TLB hardware, memory bus bandwidth
Section 04 (Processes): address space layout, fork/exec, VMA structure
Section 10 (Synchronization): memory barriers, NUMA-aware locking, page table spinlocks
Section 13 (Filesystems): page cache, buffer cache, DAX bypass
Section 14 (Device Drivers): DMA API, IOMMU, pinned memory
Section 15 (Networking): zero-copy via sendfile/splice, sk_buff and page pinning
Section 19 (Virtualization): EPT/NPT (nested paging), balloon drivers, memory overcommit