Paging
Technical Overview
Paging is the hardware-software mechanism that implements virtual memory. Physical RAM is divided into fixed-size units called frames (typically 4 KiB on x86-64). A process's virtual address space is likewise divided into same-sized units called pages. The CPU's Memory Management Unit (MMU) translates every virtual address to a physical address by walking a hierarchy of page tables stored in kernel memory. When a translation is missing or invalid, the hardware raises a page fault exception, transferring control to the kernel's fault handler which resolves the fault — allocating a physical frame, reading from disk, or delivering a signal.
The base x86-64 page size is 4 KiB (12-bit page offset). The MMU also supports 2 MiB "large pages" and 1 GiB "huge pages", which are mapped by PMD-level and PUD-level entries that have the Page Size (PS) bit set.
Prerequisites
- Virtual memory concepts and VMA layout (see 01-virtual-memory.md)
- CPU privilege levels and exception handling
- Understanding of the CR3 register on x86-64
- Basic data structure knowledge (trees, arrays)
Core Content
Page Table Walk (x86-64 4-Level)
On x86-64, a virtual address is decomposed into five fields (four table indices and a page offset) below the sign-extended upper bits:
Virtual Address Decomposition (48-bit canonical, 4KB pages)
=============================================================
  63       48   47   39   38   30   29   21   20   12   11    0
+-------------+---------+---------+---------+---------+---------+
| sign extend | PGD idx | PUD idx | PMD idx | PTE idx | offset  |
| (= bit 47)  | 9 bits  | 9 bits  | 9 bits  | 9 bits  | 12 bits |
+-------------+---------+---------+---------+---------+---------+
                level 4   level 3   level 2   level 1  page offset
PGD = Page Global Directory (level 4)
PUD = Page Upper Directory (level 3)
PMD = Page Middle Directory (level 2)
PTE = Page Table Entry (level 1)
Each table has 512 entries (9 bits = 2^9).
Each entry is 8 bytes.
Each table fits in exactly one 4KB page.
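The decomposition above can be sketched in a few lines of Python. This is an illustration only: the names `VAddrFields`, `is_canonical`, and `decompose` are hypothetical helpers, not kernel APIs, and the sketch assumes 48-bit canonical addresses with 4 KiB pages.

```python
from typing import NamedTuple

PAGE_SHIFT = 12                      # 4 KiB pages -> 12 offset bits
INDEX_BITS = 9                       # 512 entries per table
INDEX_MASK = (1 << INDEX_BITS) - 1


class VAddrFields(NamedTuple):
    """Hypothetical container for the five fields of a virtual address."""
    pgd: int
    pud: int
    pmd: int
    pte: int
    offset: int


def is_canonical(vaddr: int) -> bool:
    """True if bits 63..48 are all copies of bit 47."""
    top = vaddr >> 47                # bit 47 plus the 16 bits above it
    return top == 0 or top == (1 << 17) - 1


def decompose(vaddr: int) -> VAddrFields:
    """Split a 64-bit virtual address into four table indices and an offset."""
    if not 0 <= vaddr < (1 << 64) or not is_canonical(vaddr):
        raise ValueError(f"non-canonical address: {vaddr:#x}")
    return VAddrFields(
        pgd=(vaddr >> 39) & INDEX_MASK,
        pud=(vaddr >> 30) & INDEX_MASK,
        pmd=(vaddr >> 21) & INDEX_MASK,
        pte=(vaddr >> 12) & INDEX_MASK,
        offset=vaddr & ((1 << PAGE_SHIFT) - 1),
    )


if __name__ == "__main__":
    print(decompose(0x00007FFFFFFFE123))
```

Each 9-bit index selects one of the 512 entries in its table, which is why every shift differs from the next by exactly 9.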
Walk diagram:
CR3 (physical addr of PGD)
|
v
+-------+-------+-------+-------+---...---+
| PGD[0]| PGD[1]| ... |PGD[n] | | <-- 4KB page in kernel memory
+-------+-------+-------+-------+---------+
|
| (physical address of PUD table + flags)
v
+-------+-------+---------+
| PUD[0]| PUD[1]| ... |
+-------+-------+---------+
|
| (physical address of PMD table + flags)
v
+-------+-------+---------+
| PMD[0]| PMD[1]| ... | If PS=1: 2MB huge page entry
+-------+-------+---------+
|
| (physical address of PTE table + flags)
v
+-------+-------+---------+
| PTE[0]| PTE[1]| ... |
+-------+-------+---------+
|
| (physical frame address + flags)
v
+----------------------------+
| Physical Frame (4KB) |
+----------------------------+
|
+ offset (bits 0-11)
v
Physical Byte
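The walk in the diagram can be modeled as a small simulation. This is a minimal sketch, not how the MMU or the kernel is implemented: physical memory is a hypothetical dict from table base address to a list of 512 integer entries, only the Present and PS bits are interpreted, and 1 GiB PUD-level pages are left out for brevity.

```python
PRESENT = 1 << 0
PAGE_SIZE_BIT = 1 << 7                 # PS: entry maps a large page
ADDR_MASK = 0x000F_FFFF_FFFF_F000      # bits 12-51 of an entry
LARGE_PAGE_MASK = (1 << 21) - 1        # offset bits within a 2 MiB page


class PageFault(Exception):
    """Raised when the simulated walk reaches a non-present entry."""


def walk(phys_mem: dict, cr3: int, vaddr: int) -> int:
    """Translate vaddr by walking four levels of simulated tables."""
    table_base = cr3 & ADDR_MASK
    for level, shift in ((4, 39), (3, 30), (2, 21), (1, 12)):
        index = (vaddr >> shift) & 0x1FF
        entry = phys_mem[table_base][index]
        if not entry & PRESENT:
            raise PageFault(f"level {level} entry not present for {vaddr:#x}")
        if level == 2 and entry & PAGE_SIZE_BIT:
            # PMD entry maps a 2 MiB page directly: stop one level early.
            base = entry & ADDR_MASK & ~LARGE_PAGE_MASK
            return base | (vaddr & LARGE_PAGE_MASK)
        table_base = entry & ADDR_MASK   # next table, or the final frame
    return table_base | (vaddr & 0xFFF)


def build_demo():
    """Build four hypothetical tables that map exactly one 4 KiB page."""
    mem = {base: [0] * 512 for base in (0x1000, 0x2000, 0x3000, 0x4000)}
    vaddr = 0x7F12_3456_7ABC
    pgd_i, pud_i, pmd_i, pte_i = [(vaddr >> s) & 0x1FF for s in (39, 30, 21, 12)]
    mem[0x1000][pgd_i] = 0x2000 | PRESENT
    mem[0x2000][pud_i] = 0x3000 | PRESENT
    mem[0x3000][pmd_i] = 0x4000 | PRESENT
    mem[0x4000][pte_i] = 0xABCDE000 | PRESENT
    return mem, vaddr


if __name__ == "__main__":
    mem, vaddr = build_demo()
    print(hex(walk(mem, 0x1000, vaddr)))
```

Translating the neighboring page (`vaddr + 0x1000`) in the demo raises `PageFault`, because its PTE slot is still zero.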
Page Table Entry (PTE) Structure
Each PTE on x86-64 is 64 bits wide. The layout:
PTE Bit Layout (x86-64, 4KB page)
===================================
Bit(s)  Name     Description
------  -------  --------------------------------------------------
0       P        Present: 1 = page in RAM; 0 = not present (fault)
1       R/W      Read/Write: 0 = read-only; 1 = writable
2       U/S      User/Supervisor: 0 = kernel only; 1 = user accessible
3       PWT      Page Write-Through (cache policy)
4       PCD      Page Cache Disable
5       A        Accessed: set by hardware on any access
6       D        Dirty: set by hardware on write
7       PAT/PS   In a 4KB PTE: Page Attribute Table index (together with PWT, PCD)
                 In a PMD/PUD entry: Page Size, 1 = 2MB (PMD) or 1GB (PUD) huge page
8       G        Global: TLB entry survives CR3 reload (kernel pages)
9-11    (avail)  Software use (Linux uses these for markers such as special and soft-dirty)
12-51   PFN      Physical Frame Number (up to 40 bits = 4 PiB physical; real CPUs implement fewer bits)
52-62   (avail)  Software use (bits 59-62 hold the protection key when protection keys are enabled)
63      NX/XD    No-Execute: 1 = instruction fetch raises #PF
Linux reads and writes PTEs through inline functions defined in arch/x86/include/asm/pgtable.h:
- pte_present(), pte_write(), pte_dirty(), pte_young()
- pte_mkwrite(), pte_mkdirty(), pte_mkyoung()
- set_pte_at() — writes a PTE with the necessary memory barriers
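To make the bit layout concrete, the following sketch decodes a raw 64-bit PTE value into flag names and a frame number. It is loosely analogous to what helpers such as pte_present() and pte_write() test, but `decode_pte` and `PTE_FLAGS` are hypothetical names for illustration, and bit 7 is deliberately left out because its meaning depends on the table level.

```python
# Flag bits whose meaning is the same at every level (hypothetical table).
PTE_FLAGS = {
    0: "P", 1: "R/W", 2: "U/S", 3: "PWT", 4: "PCD",
    5: "A", 6: "D", 8: "G", 63: "NX",
}
PFN_MASK = ((1 << 52) - 1) & ~0xFFF    # bits 12-51


def decode_pte(pte: int) -> dict:
    """Return the set flags and the physical frame number of a raw PTE."""
    flags = [name for bit, name in sorted(PTE_FLAGS.items()) if (pte >> bit) & 1]
    return {
        "present": bool(pte & 1),
        "writable": bool(pte & 2),
        "flags": flags,
        "pfn": (pte & PFN_MASK) >> 12,
    }


if __name__ == "__main__":
    # Example value: present, writable, user, accessed, dirty, no-execute.
    print(decode_pte(0x8000_0000_1234_5067))
```

Multiplying the decoded `pfn` by 4096 gives the physical base address of the frame.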
Page Fault Handling Flow
When an access hits a translation with no valid entry (P=0) or violates the entry's permissions (for example, a write to a read-only page), the CPU raises exception vector 14 (#PF). The faulting address is saved in CR2. The fault handler chain:
Hardware raises #PF
|
v
arch/x86/mm/fault.c: exc_page_fault()
|
+-- Is faulting address in kernel space?
| Yes --> do_kern_addr_fault() --> spurious-fault check / exception-table fixup / oops (vmalloc fixup on 32-bit)
|
+-- Is faulting address in user space VMA?
No --> send SIGSEGV (bad_area)
|
v
handle_mm_fault() [mm/memory.c]
|
+-- huge page path: hugetlb_fault() / transparent_huge_page
|
+-- normal path:
pmd_none? --> __pmd_alloc()
pte_none? --> do_anonymous_page() / do_fault()
pte write-protected? --> do_wp_page() (Copy-on-Write)
|
v
pte installed, fault returns
|
v
iret → restart faulting instruction
Detailed fault classification (mm/memory.c):
| Condition | Handler | Description |
|---|---|---|
| PTE absent, anonymous VMA | do_anonymous_page() | Allocate zero page or new frame |
| PTE absent, file-backed VMA | do_fault() → do_read_fault() | Read from page cache / disk |
| PTE present, write-protected | do_wp_page() | CoW: copy page, install writable PTE |
| PTE present, not accessed | none (hardware sets the Accessed bit) | No fault; TLB refill only |
| PTE swapped out | do_swap_page() | Read from swap, reinstall PTE |
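The classification in the table can be read as a short decision procedure. The sketch below is a simplified model of that dispatch, not the kernel's code: `classify_fault` and its boolean parameters are hypothetical, and cases the table does not list (NUMA hinting, huge pages) are ignored.

```python
def classify_fault(*, pte_none: bool, pte_present: bool,
                   vma_anonymous: bool, is_write: bool,
                   pte_writable: bool) -> str:
    """Return the name of the handler the table above assigns to a fault."""
    if pte_none:
        # Nothing was ever installed for this page.
        return "do_anonymous_page" if vma_anonymous else "do_fault"
    if not pte_present:
        # The entry holds something, but not a resident page: a swap entry.
        return "do_swap_page"
    if is_write and not pte_writable:
        # Resident but write-protected: copy-on-write path.
        return "do_wp_page"
    # Resident and permitted: nothing for a handler to do.
    return "none"


if __name__ == "__main__":
    print(classify_fault(pte_none=True, pte_present=False,
                         vma_anonymous=True, is_write=True,
                         pte_writable=False))
```

The order of the checks matters: an empty entry is tested before the present bit, because a swapped-out page also has P=0 but is not empty.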
Demand Paging
When a binary is exec'd, the kernel does not load all of its code into RAM. It creates file-backed VMAs pointing to the ELF segments, with no PTEs installed. As the CPU fetches instructions, each new page triggers a page fault. The fault handler finds the file-backed VMA, reads the page from disk (or page cache), installs the PTE, and resumes execution. This is demand paging: the physical cost is paid only for pages actually touched.
The benefit is fast exec (no upfront I/O) and minimal RAM consumption (never-executed code paths stay on disk).
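Demand paging is easy to observe from user space with an anonymous mapping, which behaves the same way as the file-backed case described above except that pages are zero-filled instead of read from a file. The sketch below is illustrative: `touch_pages` is a hypothetical helper, it relies on Python's `mmap` and `resource` modules on a Unix-like system, and the fault delta it reports also includes faults taken by the interpreter itself, so treat it as approximate.

```python
import mmap
import resource


def touch_pages(num_pages: int, stride: int = 1) -> tuple:
    """Map num_pages anonymous pages, write to every stride-th one.

    Returns (pages_touched, minor_fault_delta).
    """
    page = mmap.PAGESIZE
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region = mmap.mmap(-1, num_pages * page)   # address space only, no frames yet
    touched = 0
    try:
        for offset in range(0, num_pages * page, page * stride):
            region[offset] = 1                 # first write faults the page in
            touched += 1
    finally:
        region.close()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return touched, after - before


if __name__ == "__main__":
    for stride in (1, 16):
        touched, faults = touch_pages(4096, stride)
        print(f"stride={stride}: touched {touched} pages, ~{faults} minor faults")
```

On a typical Linux system the fault count tracks the number of pages touched rather than the size of the mapping, which is the point of demand paging.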
Minor vs Major Faults
- Minor fault (soft fault): The page was already in the page cache (or is zero-fill). The kernel installs a PTE without any I/O. Cost: ~1–10 µs.
- Major fault (hard fault): The page must be read from disk (swap or file). Cost: roughly 0.1–1 ms on an SSD and 5–15 ms on an HDD, depending on the device and its queue depth.
Tracked per-process in task_struct.min_flt and maj_flt, visible in /proc/PID/stat fields 10 and 12.
# Show minor and major faults for a process (assumes the command name contains no spaces)
awk '{print "minor:", $10, "major:", $12}' "/proc/$(pidof -s postgres)/stat"
# Fault totals for one run, reported when the program exits (GNU time)
/usr/bin/time -v ./program 2>&1 | grep -E "Major|Minor"
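Counting whitespace-separated fields in /proc/PID/stat goes wrong when the command name (field 2, in parentheses) contains spaces. A more robust approach, sketched here with a hypothetical helper `parse_fault_counts`, is to split after the last closing parenthesis and then index relative to field 3.

```python
def parse_fault_counts(stat_line: str) -> tuple:
    """Return (minflt, majflt) from one line of /proc/PID/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so everything up to the LAST ')' is skipped.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return int(rest[10 - 3]), int(rest[12 - 3])


if __name__ == "__main__":
    with open("/proc/self/stat") as f:
        minflt, majflt = parse_fault_counts(f.read())
    print(f"minor: {minflt} major: {majflt}")
```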
/proc/vmstat Page Fault Counters
grep -E "pgfault|pgmajfault|pgalloc|pgscan|pgsteal" /proc/vmstat
Key counters:
- pgfault — total page faults (minor + major) since boot
- pgmajfault — major faults (required I/O)
- pgalloc_normal — pages allocated from ZONE_NORMAL
- pgscan_kswapd — pages scanned by kswapd (memory pressure)
- pgsteal_kswapd — pages reclaimed by kswapd
- nr_page_table_pages — number of pages used for page tables themselves
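Because these counters are cumulative since boot, they are most useful as deltas between two snapshots. The sketch below assumes the plain "name value" line format of /proc/vmstat; `parse_vmstat` and `fault_delta` are hypothetical helper names.

```python
def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value)
    return counters


def fault_delta(before: dict, after: dict) -> dict:
    """Split the faults between two snapshots into minor and major."""
    total = after["pgfault"] - before["pgfault"]
    major = after["pgmajfault"] - before["pgmajfault"]
    return {
        "minor": total - major,          # pgfault counts minor + major
        "major": major,
        "major_fraction": major / total if total else 0.0,
    }


if __name__ == "__main__":
    import time
    with open("/proc/vmstat") as f:
        first = parse_vmstat(f.read())
    time.sleep(1)
    with open("/proc/vmstat") as f:
        second = parse_vmstat(f.read())
    print(fault_delta(first, second))
```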
Historical Context
Paging was conceived in the late 1950s and first implemented practically in the Ferranti Atlas (1962), which used one-level page tables with drum-backed swapping. The theoretical foundation was formalized by Fotheringham (1961). Early Unix systems (V6, V7) on the PDP-11 swapped whole processes in and out rather than paging. 3BSD (1979) introduced demand paging on the VAX. Linux inherited BSD-derived VM concepts and rewrote them incrementally; the current Linux VM design descends largely from the work of Rik van Riel and Linus Torvalds in the late 1990s and early 2000s.
The shift from 32-bit (2-level page tables) to 64-bit (4-level, then 5-level) was driven by the need for larger address spaces. 32-bit x86 Linux without PAE used only two levels (PGD + PTE) because two 10-bit indices cover the whole 4 GiB address space: 1024 × 1024 = 1M pages of 4 KiB each.
Production Examples
Database hot-path faults: A freshly started PostgreSQL instance with a cold page cache will generate thousands of major faults as it reads index pages. A warmed-up instance running the same query generates only minor faults (pages already in buffer pool or page cache). The ratio of major to minor faults is a direct measure of cache effectiveness.
JVM GC and page faults: A Java application that has not touched the top of its heap in a while may have those pages swapped out or reclaimed. A full GC that sweeps the entire heap then generates a burst of major faults, causing GC pause times 10–100x longer than expected. This is why JVM operators pre-fault the heap at startup with -XX:+AlwaysPreTouch, lock it in RAM with mlockall(), or both; pre-touching alone does not prevent later swap-out.
Kubernetes cold-start latency: A container image layer mapped with mmap causes major faults on first access. Kubernetes image pull + container start latency is often dominated by page fault I/O on first exec.
Debugging Notes
# Watch page fault rate system-wide in real time
watch -n1 "grep pgfault /proc/vmstat"
# Per-process fault count (fields 10=minflt, 12=majflt in /proc/PID/stat)
awk '{print "pid:", $1, "minflt:", $10, "majflt:", $12}' "/proc/$(pidof -s nginx)/stat"
# perf stat for fault counts
perf stat -e page-faults,major-faults ./myprogram
# Trace individual page faults with perf (expensive, sampling)
perf record -e page-faults -g ./myprogram
perf report
# Identify which file/offset is generating major faults
perf trace --pf=maj ./myprogram 2>&1 | head -50
# Check if swap is being used (major fault driver)
vmstat 1 5
When a process shows an unexpected spike in pgmajfault, first check:
1. Is swap in use (the si/so columns of vmstat, or SwapTotal minus SwapFree in /proc/meminfo)?
2. Is the working set larger than available RAM?
3. Has madvise(MADV_DONTNEED) or madvise(MADV_FREE) been called incorrectly?
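The first check can be scripted. The sketch below computes swap in use from the SwapTotal and SwapFree fields of /proc/meminfo, which are reported in kB; `swap_used_kib` is a hypothetical helper name.

```python
def swap_used_kib(meminfo_text: str) -> int:
    """Return swap in use (KiB), computed as SwapTotal - SwapFree."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        print(f"swap in use: {swap_used_kib(f.read())} KiB")
```

A nonzero result only shows that pages were swapped out at some point; the si/so rates from vmstat show whether swapping is happening now.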
Security Implications
Page fault as side channel: Whether an access faults or not can be observed from a co-located process through timing. Rowhammer attacks exploit the physical proximity of DRAM rows; an attacker who can influence which frame is allocated on a fault can get a target page placed in a DRAM row adjacent to rows it can hammer. Prefetch side-channel attacks (published in 2016) bypassed KASLR by timing accesses to unmapped vs mapped kernel addresses to infer the kernel base address.
Spectre/Meltdown: Meltdown (CVE-2017-5754) exploited the gap between a page fault being raised and the CPU's out-of-order execution continuing to use the faulted value. The CPU could transiently read kernel memory even though the #PF would later be delivered. KPTI removes kernel mappings from user-space page tables, eliminating the Meltdown surface.
Dirty page exploitation: The Dirty COW race (CVE-2016-5195) exploited a race in the CoW fault path to achieve a write to a read-only memory-mapped file. See 08-copy-on-write.md for full details.
NULL page mapping: mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) would map the null page if the kernel allowed it. A kernel dereference of a NULL function pointer could then execute attacker code on systems without SMEP/SMAP. vm.mmap_min_addr prevents the mapping (commonly 65536 on modern distributions).
Performance Implications
- Page table walk cost: 4 memory accesses in the worst case (one per level). With TLB hit rates above 99%, this is rarely on the critical path. At TLB miss rates of 1–5%, it becomes significant.
- Working set vs RAM: Any working set exceeding physical RAM results in major faults on the reclaimed pages. The performance cliff is dramatic: major faults are typically two to four orders of magnitude slower than minor faults.
- madvise optimization: Applications can hint the kernel about future access patterns:
  - MADV_SEQUENTIAL: kernel prefetches pages ahead
  - MADV_RANDOM: kernel disables read-ahead
  - MADV_WILLNEED: start read-ahead now so the pages are likely cached before first access (a hint only; unlike mlock, it pins nothing)
  - MADV_DONTNEED: discard pages (anonymous pages become zero on next access)
  - MADV_FREE: pages may be reclaimed lazily (faster than DONTNEED for reuse cases)
- Huge pages: Reduce page table depth and TLB pressure. A 2 MiB huge page requires only 3 levels of walk vs 4, and uses one TLB entry instead of 512. See 05-huge-pages.md.
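The walk-cost and huge-page points above can be put into rough numbers. The model below is a deliberate simplification: it assumes every level of a walk costs one full memory access (real CPUs cache upper-level entries, so actual cost is lower), and the 100 ns latency and 1536-entry TLB are illustrative figures, not measurements.

```python
def effective_access_ns(mem_ns: float, miss_rate: float, levels: int = 4) -> float:
    """Average access time when each TLB miss adds `levels` memory accesses."""
    return mem_ns * (1 + miss_rate * levels)


def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Amount of memory the TLB can cover without a miss."""
    return entries * page_bytes


if __name__ == "__main__":
    for miss_rate in (0.001, 0.01, 0.05):
        four = effective_access_ns(100, miss_rate, levels=4)
        three = effective_access_ns(100, miss_rate, levels=3)
        print(f"miss rate {miss_rate:.1%}: 4-level {four:.1f} ns, 3-level {three:.1f} ns")
    print("reach with 4 KiB pages:", tlb_reach_bytes(1536, 4096) // 2**20, "MiB")
    print("reach with 2 MiB pages:", tlb_reach_bytes(1536, 2 * 2**20) // 2**30, "GiB")
```

Under these assumptions the larger effect of huge pages is the 512x increase in TLB reach, which lowers the miss rate itself; the shorter walk is a smaller, secondary gain.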
Failure Modes and Real Incidents
Swap storm: A system under memory pressure activates swap. Random access patterns generate O(n) major faults where n is the number of working-set pages. CPU becomes 100% I/O wait. The system appears hung. Resolution: add RAM, reduce working set, or tune vm.swappiness.
Page fault in atomic context: Kernel code holding a spinlock must not take a fault that needs to sleep (spinlocks disable preemption; sleeping to wait for I/O is impossible). CONFIG_DEBUG_ATOMIC_SLEEP catches this. The same rule applies to allocations: real bugs have caused kernel panics when a driver attempted a kmalloc with GFP_KERNEL (which may sleep while memory is reclaimed) while holding a spinlock.
AWS EC2 "noisy neighbor" memory pressure (2014): Hypervisor balloon drivers reclaimed guest RAM, causing guest kernels to swap. The resulting major fault storm caused application latency spikes of 10–100x. The fix was to use dedicated instances and pin the working set with mlock.
Modern Usage
- UFFD (userfaultfd): User-space page fault handling. Used by QEMU for post-copy live migration, CRIU for checkpoint/restore, and user-space GC implementations.
- io_uring fixed buffers: Pre-register buffers so the kernel can fault them in and pin the pages once, avoiding repeated pinning and fault overhead on each I/O.
- Lazy TLB mode: While running kernel threads, which do not use a user address space, the kernel avoids switching CR3, saving TLB flushes.
- THP (Transparent Huge Pages): The kernel's khugepaged daemon scans for adjacent 4KB pages that can be collapsed into 2MB pages, reducing TLB pressure and page table overhead. See 05-huge-pages.md.
Future Directions
- Hardware page table walkers: x86-64 already performs the walk in hardware. Future designs may add more dedicated walker state machines and walk caches, reducing the latency and memory bandwidth cost of each walk.
- Inverted page tables: IBM POWER uses an inverted page table (one entry per physical frame, not per virtual page). Constant memory cost regardless of address space size, but requires hashing and is complex for sparse address spaces.
- Page coloring: Assign physical frames to virtual pages to minimize L1/L2 cache set conflicts. Currently not done in mainstream Linux but studied in real-time OS research.
- Persistent memory (PMEM/NVDIMM): Page fault semantics for byte-addressable persistent memory require new models (DAX — Direct Access mode bypasses page cache and maps persistent memory directly).
Exercises
- Write a program that allocates a 100 MiB anonymous mapping but only touches every 256th page. Compare the RSS (from /proc/self/status) with VMA size. Observe demand paging at work.
- Force a fresh fault by allocating memory, calling madvise(MADV_DONTNEED), then touching it again. For anonymous memory this is a minor (zero-fill) fault; to provoke a major fault, map a file and evict it from the page cache before touching it. Measure the latency difference between the two using clock_gettime(CLOCK_MONOTONIC).
- Use perf stat -e page-faults to measure how many page faults git clone of a medium repository generates. Explain the result.
- Inspect the PTE flags for a known read-only mapping (e.g., /proc/self/maps shows r--p). Write a tool using /proc/self/pagemap to read the per-page entries (present and swapped bits and, with sufficient privilege, the PFN).
- Implement a simple demand-paging allocator using userfaultfd: allocate 1 GiB of virtual space, handle faults by allocating and zero-filling pages one at a time.
- Examine /proc/vmstat before and after running find /usr -name "*.so". Quantify how many pages were faulted in and what fraction were minor vs major.
References
- mm/memory.c: handle_mm_fault(), do_anonymous_page(), do_fault(), do_wp_page()
- arch/x86/mm/fault.c: exc_page_fault(), handle_page_fault()
- arch/x86/include/asm/pgtable.h: PTE manipulation macros
- arch/x86/include/asm/pgtable_types.h: _PAGE_PRESENT, _PAGE_RW, _PAGE_NX bit definitions
- include/linux/mm.h: vm_fault structure, fault return codes
- Intel SDM Vol. 3A, Section 4.7: Page-Fault Exceptions
- Mel Gorman, "Understanding the Linux Virtual Memory Manager", Chapter 4
- Bovet & Cesati, "Understanding the Linux Kernel", 3rd ed., Chapter 9
- proc(5) man page: /proc/PID/stat field definitions
- LWN, "Memory compaction": https://lwn.net/Articles/368869/