Section 11: Memory Management

Purpose and Scope

Memory management is the substrate on which every other OS subsystem rests. This section dissects the complete memory hierarchy from physical DRAM through the OS virtual memory subsystem to user-space allocators. It covers address translation mechanics (paging, segmentation, multi-level page tables), TLB behavior, and every major kernel memory allocator (buddy, slab/SLUB, kmalloc, vmalloc). It extends to the interfaces user space uses (mmap, brk), the lazy mechanisms (demand paging, copy-on-write), the pressure-relief valves (swapping, OOM killer), and the hardware extensions that have reshaped the field (huge pages, NUMA, IOMMU, persistent memory).

Correctness and performance are inseparable here: a misunderstood TLB shootdown or an ill-placed huge page can silently dominate a workload's latency profile.


Prerequisites

  • Section 02 (CPU Architecture): cache hierarchy, TLB, memory bus
  • Section 03 (OS Fundamentals): kernel/user split, syscalls, process model
  • Section 04 (Processes and Threads): address space layout, fork/exec
  • Basic familiarity with x86-64 paging structures (CR3, PML4)

Learning Objectives

Upon completing this section you will be able to:

  1. Walk a 4-level (and 5-level) x86-64 page table translation from virtual address to physical frame.
  2. Explain TLB shootdown mechanics and their cost in multi-core systems.
  3. Describe the Linux buddy allocator and why it exists alongside the slab/SLUB allocator.
  4. Explain copy-on-write and demand paging, including their interaction with fork().
  5. Trace what happens from the moment a process calls malloc() to when DRAM is physically allocated.
  6. Reason about NUMA topology effects on memory allocation performance.
  7. Explain how the OOM killer selects a victim and how to influence it.
  8. Describe the persistent memory (PMEM/NVDIMM) programming model (DAX) and the cache write-back instructions it depends on (CLWB, CLFLUSHOPT).
  9. Articulate the trade-offs of transparent huge pages vs explicit huge pages.

Architecture Overview

  Process Virtual Address Space (user half: 0 → 2^47-1 with 4-level paging on x86-64)
 ┌───────────────────────────────────────────────────────┐
 │  [text] [data] [heap→]    [←stack]  [mmap region]    │
 └──────────────────┬────────────────────────────────────┘
                    │ virtual address
                    ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                    MMU / Page Table Walker                        │
 │  CR3 → PML4 (L4) → PDPT (L3) → PD (L2) → PT (L1) → PFN        │
 │  Each level: 512 entries × 8 bytes = 4 KB page                   │
 │  Huge pages: PD entry maps 2 MB; PDPT entry maps 1 GB            │
 └──────────────────────┬───────────────────────────────────────────┘
                        │  TLB hit: skip walk
                        ▼
 ┌──────────────────────────────────────────────────────────────────┐
 │                     Physical Memory                               │
 │  ┌───────────────────────────────────────────────────────────┐   │
 │  │  Buddy Allocator (page granularity, order 0–10)           │   │
 │  │  ├── Zone DMA  (0–16 MB)                                  │   │
 │  │  ├── Zone Normal (16 MB – highmem boundary)               │   │
 │  │  └── Zone Highmem / Movable                               │   │
 │  └──────────────────────┬────────────────────────────────────┘   │
 │                         │ slabs carved from buddy pages           │
 │  ┌──────────────────────▼────────────────────────────────────┐   │
 │  │  SLUB Allocator (sub-page objects, per-CPU caches)        │   │
 │  │  kmem_cache: task_struct, inode, dentry, sk_buff …        │   │
 │  └───────────────────────────────────────────────────────────┘   │
 └──────────────────────────────────────────────────────────────────┘
                        │
              NUMA Node 0          NUMA Node 1
           ┌──────────┐          ┌──────────┐
           │ CPU 0-15 │          │ CPU16-31 │
           │ 64 GB    │◄─QPI/UPI►│ 64 GB    │
           └──────────┘          └──────────┘
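
The walk shown in the diagram can be sketched as plain bit arithmetic. The snippet below is a minimal illustration, assuming 4-level paging with 4 KB pages: it splits a 48-bit virtual address into the four 9-bit table indices and the 12-bit page offset. The function name and the example address are hypothetical, chosen only for illustration.

```python
PAGE_SHIFT = 12                      # 4 KB pages: the low 12 bits are the byte offset
INDEX_BITS = 9                       # 512 entries per table, so 9 index bits per level
LEVELS = ("PML4", "PDPT", "PD", "PT")


def split_virtual_address(vaddr: int) -> dict:
    """Return the table index used at each level, plus the page offset."""
    if not 0 <= vaddr < (1 << 48):
        raise ValueError("expects the low 48 bits of a canonical address")
    index_mask = (1 << INDEX_BITS) - 1
    parts = {"offset": vaddr & ((1 << PAGE_SHIFT) - 1)}
    for depth, name in enumerate(LEVELS):
        # PML4 uses bits 47..39, PDPT 38..30, PD 29..21, PT 20..12.
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        parts[name] = (vaddr >> shift) & index_mask
    return parts


example = split_virtual_address(0x00007F1234567ABC)
print(example)
# {'offset': 2748, 'PML4': 254, 'PDPT': 72, 'PD': 418, 'PT': 359}
```

A 2 MB huge page stops the walk at the PD level, so the PT index and the offset together form a 21-bit offset; a 1 GB page stops at the PDPT level and leaves a 30-bit offset.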

Key Concepts

  • Virtual Memory: The abstraction that gives each process its own isolated address space, decoupled from physical layout.
  • Paging: Divides virtual and physical memory into fixed-size pages (typically 4 KB); the MMU translates via page tables.
  • Page Table: A hierarchical data structure mapping virtual page numbers to physical frame numbers.
  • TLB (Translation Lookaside Buffer): A hardware cache for recent page table entries; a miss triggers a page walk.
  • TLB Shootdown: When a page mapping changes, the kernel must invalidate TLB entries on all CPUs that may have cached it — an expensive IPI-based operation.
  • Huge Pages: 2 MB (x86 large) or 1 GB (x86 huge) pages that reduce TLB pressure for large working sets.
  • Transparent Huge Pages (THP): Kernel mechanism that automatically promotes/demotes 4 KB pages to/from 2 MB hugepages.
  • Buddy Allocator: Linux's primary physical page allocator; manages free pages in power-of-two order lists to minimize fragmentation.
  • Slab/SLUB Allocator: Object-oriented allocator layered on top of buddy; caches frequently-allocated kernel objects.
  • Copy-on-Write (CoW): After fork(), parent and child share pages read-only; a write triggers a private copy.
  • Demand Paging: Physical frames are not allocated until a virtual page is first accessed; the page fault handler does the work.
  • mmap: Maps files or anonymous memory into the address space; the foundation of shared libraries, file I/O, and IPC.
  • OOM Killer: Linux's last resort when physical memory is exhausted — scores and kills a process based on memory usage and oom_score_adj.
  • NUMA (Non-Uniform Memory Access): Multi-socket systems where memory access latency depends on which node holds the physical frame.
  • IOMMU: Input/Output Memory Management Unit; translates device DMA addresses, enables isolation and large contiguous device mappings.
  • Persistent Memory (PMEM): Byte-addressable non-volatile memory (Optane, NVDIMM) exposed via DAX (direct access), bypassing the page cache.
  • KSM (Kernel Samepage Merging): Scans anonymous pages in regions that opted in via madvise(MADV_MERGEABLE), merges pages with identical content, and relies on CoW to split them again on write.
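
The buddy allocator's split-and-coalesce behavior is easier to follow with a toy model. The sketch below is not the kernel's implementation (which keeps per-zone, per-migrate-type free lists); it is a simplified illustration over a hypothetical 16-page region, and the class and method names are invented for this example. The key property it demonstrates is that a block's buddy is found by flipping a single bit of its page frame number.

```python
class BuddyAllocator:
    """Toy buddy allocator over page frame numbers 0 .. 2**max_order - 1."""

    def __init__(self, max_order: int = 4):
        self.max_order = max_order
        # free_lists[k] holds the first PFN of every free block of 2**k pages.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order: int) -> int:
        """Return the first PFN of a free block of 2**order pages."""
        # Find the smallest free block that is large enough.
        for k in range(order, self.max_order + 1):
            if self.free_lists[k]:
                break
        else:
            raise MemoryError("no free block of the requested order")
        pfn = min(self.free_lists[k])
        self.free_lists[k].remove(pfn)
        # Split repeatedly, returning the upper half to a free list each time.
        while k > order:
            k -= 1
            self.free_lists[k].add(pfn + (1 << k))
        return pfn

    def free(self, pfn: int, order: int) -> None:
        """Free a block, merging with its buddy for as long as the buddy is free."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)   # buddies differ only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free_lists[order].add(pfn)


buddy = BuddyAllocator(max_order=4)   # 16 pages in one order-4 block
a = buddy.alloc(0)                    # one page: splits 16 -> 8 -> 4 -> 2 -> 1
b = buddy.alloc(1)                    # two pages: taken from the free order-1 block
buddy.free(a, 0)                      # merges with PFN 1, then stops: PFN 2 is in use
buddy.free(b, 1)                      # merges all the way back to one order-4 block
```

After the last call the region is again a single free order-4 block, which is the sense in which power-of-two free lists limit external fragmentation.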

Major Historical Milestones

Year Milestone
1961 Atlas computer introduces one-level store (virtual memory concept)
1968 Denning formalizes the working set model for page replacement
1972 IBM announces virtual storage for System/370, bringing virtual memory to mainstream commercial mainframes
1985 Intel 80386 introduces 32-bit paging; two-level page tables
1991 Linux 0.01 ships with basic paging on x86
1994 Linux adopts three-level page table abstraction (PGD/PMD/PTE)
2002 Linux 2.5 development series: reverse mapping (rmap) for efficient page reclaim
2003 Linux 2.6: NUMA awareness, per-zone allocator, slab rework
2006 Linux 2.6.16: SLOB allocator added for embedded systems
2007 Linux 2.6.22: SLUB allocator merged; it becomes the default in 2.6.23
2009 Intel Nehalem: IOMMU (VT-d) in mainstream server silicon
2011 Transparent Huge Pages merged into Linux 2.6.38
2015 Intel and Micron announce 3D XPoint, the media behind Optane persistent memory
2017 Linux 4.14: 5-level page table support (57-bit virtual addresses) merged
2018 Linux DAX (Direct Access) support for persistent memory stabilizes
2019 Linux 5-level paging (LA57) enabled by default for large-memory systems
2022 Folio infrastructure merged to rationalize compound page handling

Modern Relevance and Production Use Cases

In-memory databases (Redis, Memcached, VoltDB) can benefit dramatically from huge pages: a 100 GB dataset mapped with 4 KB pages spans roughly 26.2 million page mappings, orders of magnitude more than any TLB can hold; with 2 MB pages it spans only 51,200. The TLB miss rate translates directly into latency.
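
The arithmetic behind these figures is worth making explicit. The sketch below treats "100 GB" as 100 GiB and counts the pages needed at each x86-64 page size; it also computes TLB reach for an assumed 1,536-entry TLB. That entry count is a placeholder for illustration, not a figure for any particular CPU.

```python
GIB = 1 << 30
PAGE_SIZES = {"4 KB": 4 << 10, "2 MB": 2 << 20, "1 GB": 1 << 30}
ASSUMED_TLB_ENTRIES = 1536   # hypothetical TLB size, for illustration only


def mappings_needed(dataset_bytes: int, page_bytes: int) -> int:
    """Number of pages (and therefore translations) needed to cover the dataset."""
    return -(-dataset_bytes // page_bytes)   # ceiling division


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space a TLB with `entries` entries can cover at once."""
    return entries * page_bytes


dataset = 100 * GIB
for label, size in PAGE_SIZES.items():
    needed = mappings_needed(dataset, size)
    reach_mib = tlb_reach(ASSUMED_TLB_ENTRIES, size) // (1 << 20)
    print(f"{label}: {needed:,} mappings, TLB reach {reach_mib:,} MiB")
# 4 KB: 26,214,400 mappings, TLB reach 6 MiB
# 2 MB: 51,200 mappings, TLB reach 3,072 MiB
# 1 GB: 100 mappings, TLB reach 1,572,864 MiB
```

Under these assumptions a 4 KB-page TLB covers only a few megabytes of a 100 GiB working set, which is why random access patterns over large datasets are dominated by page walks.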

JVM workloads (Kafka, Elasticsearch, Cassandra) use -XX:+UseLargePages (explicit huge pages) or -XX:+UseTransparentHugePages; GC pause times often correlate with TLB shootdown storms during heap remapping.

ML training (PyTorch, JAX) stages data through large page-locked (pinned) host buffers for GPU transfers; these can be backed by huge pages (for example, mmap with MAP_HUGETLB) and need careful NUMA placement to avoid remote DRAM access penalties of 40–100 ns per access.

Container platforms (Kubernetes) rely on cgroup memory limits, which hook into the OOM killer and page reclaim; misconfigured oom_score_adj is a common cause of unexpected pod terminations.
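
To see why a misconfigured oom_score_adj matters, it helps to model the scoring. The sketch below is loosely based on the badness heuristic in the kernel's mm/oom_kill.c as commonly described: a task's footprint in pages (RSS, swap entries, page tables) plus oom_score_adj scaled by one thousandth of total memory, with -1000 exempting the task entirely. Details vary across kernel versions, and the task list and page counts here are invented for illustration.

```python
OOM_SCORE_ADJ_MIN = -1000   # a task with this value is never selected


def oom_badness(task: dict, total_pages: int):
    """Approximate badness in pages, or None if the task is exempt."""
    adj = task["oom_score_adj"]
    if adj == OOM_SCORE_ADJ_MIN:
        return None
    footprint = task["rss_pages"] + task["swap_pages"] + task["pgtable_pages"]
    # Each unit of oom_score_adj shifts the score by 0.1% of total memory.
    return footprint + adj * (total_pages // 1000)


def pick_victim(tasks: list, total_pages: int):
    """Return the name of the task with the highest badness, if any is eligible."""
    scored = [(oom_badness(t, total_pages), t["name"]) for t in tasks]
    eligible = [(score, name) for score, name in scored if score is not None]
    return max(eligible)[1] if eligible else None


TOTAL_PAGES = 4_000_000   # hypothetical machine: roughly 16 GB of 4 KB pages
tasks = [
    {"name": "database", "rss_pages": 2_000_000, "swap_pages": 0,
     "pgtable_pages": 4_000, "oom_score_adj": -500},
    {"name": "batch", "rss_pages": 500_000, "swap_pages": 0,
     "pgtable_pages": 1_000, "oom_score_adj": 0},
    {"name": "sshd", "rss_pages": 2_000, "swap_pages": 0,
     "pgtable_pages": 50, "oom_score_adj": -1000},
    {"name": "cache", "rss_pages": 300_000, "swap_pages": 0,
     "pgtable_pages": 600, "oom_score_adj": 200},
]
print(pick_victim(tasks, TOTAL_PAGES))   # cache
```

In this model the cache process is chosen even though the database uses far more memory: a positive adjustment of 200 adds the equivalent of 20% of RAM to its score, while -500 discounts the database by 50%.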

Persistent memory deployments (SAP HANA, Oracle TimesTen) use PMEM in App Direct mode with DAX-capable filesystems (ext4 or XFS mounted with -o dax, NOVA) to achieve sub-microsecond durable writes, fundamentally changing durability trade-offs.


File Map

File Description
01-virtual-memory-fundamentals.md Address spaces, MMU, protection bits, hardware-walked vs software-managed TLBs
02-paging-and-segmentation.md x86 segmentation legacy, paging mechanics, protection rings
03-page-tables.md 4-level and 5-level x86-64 page tables, ARM64 page tables
04-tlb-management.md TLB structure, shootdowns, ASID/PCID, INVLPG
05-huge-pages.md 2 MB / 1 GB pages, HugeTLB, THP, libhugetlbfs
06-buddy-allocator.md Free list design, coalescing, fragmentation, compaction
07-slab-slub-allocator.md kmem_cache, per-CPU slabs and partial lists, SLUB design
08-kmalloc-vmalloc.md kmalloc size classes, vmalloc for non-contiguous mappings
09-mmap-brk.md mmap internals, VMA merging, brk for heap growth
10-copy-on-write.md CoW in fork(), mmap MAP_PRIVATE, page fault handler
11-demand-paging.md Minor vs major faults, page fault handler flow
12-swapping-and-reclaim.md LRU lists, kswapd, direct reclaim, swap space
13-oom-killer.md OOM score calculation, oom_score_adj, cgroup OOM
14-memory-fragmentation.md Internal/external fragmentation, compaction, CMA
15-dma-and-iommu.md DMA API, coherent vs streaming DMA, IOMMU/VT-d
16-persistent-memory.md NVDIMM, DAX, CLWB/CLFLUSHOPT, PMDK, NOVA filesystem
17-numa-memory.md NUMA topology, numactl, first-touch policy, memory migration
18-hugeTLB-KSM.md HugeTLB reservations, KSM merging, ksm scan rate tuning

Cross-References

  • Section 02 (CPU Architecture): cache hierarchy, TLB hardware, memory bus bandwidth
  • Section 04 (Processes): address space layout, fork/exec, VMA structure
  • Section 10 (Synchronization): memory barriers, NUMA-aware locking, page table spinlocks
  • Section 13 (Filesystems): page cache, buffer cache, DAX bypass
  • Section 14 (Device Drivers): DMA API, IOMMU, pinned memory
  • Section 15 (Networking): zero-copy via sendfile/splice, sk_buff and page pinning
  • Section 19 (Virtualization): EPT/NPT (nested paging), balloon drivers, memory overcommit