Page Tables
Technical Overview
Page tables are the data structures through which the CPU's MMU translates virtual addresses to physical addresses. They live in kernel memory but are walked by hardware (the MMU's page table walker) on every TLB miss. Understanding page table structure is essential for kernel developers, performance engineers, and security researchers: page tables determine memory isolation, protection granularity, memory overhead of the VM system itself, and the behavior of fork/exec/CoW.
Linux supports multiple page table depths simultaneously through a unified five-level abstraction:
- Two-level: 32-bit x86 without PAE (PGD + PTE)
- Three-level: 32-bit x86 with PAE, or ARM with 3-level tables
- Four-level: x86-64 standard (PGD→PUD→PMD→PTE)
- Five-level: x86-64 with LA57 (PGD→P4D→PUD→PMD→PTE), ARM64 with 5-level
The code abstracts over these through include/linux/pgtable.h macros that collapse unused levels to no-ops.
Prerequisites
- Virtual memory and VMA concepts (01-virtual-memory.md)
- Paging and page fault flow (02-paging.md)
- x86-64 register set (CR3, CR4.LA57)
- ARM64 TCR_EL1 and TTBR0/TTBR1 registers (for ARM coverage)
Core Content
Four-Level x86-64 Page Table Walk (Detailed)
4-Level x86-64 Page Table Walk
================================
Virtual Address: 0x0000_7F10_99E1_3EF0
Binary (low 48 bits): 0111 1111 0001 0000 1001 1001 1110 0001 0011 1110 1111 0000
Bit decomposition (48-bit canonical):
[47:39] PGD index = 0b0_1111_1110 = 0x0FE (9 bits)
[38:30] PUD index = 0b0_0100_0010 = 0x042 (9 bits)
[29:21] PMD index = 0b0_1100_1111 = 0x0CF (9 bits)
[20:12] PTE index = 0b0_0001_0011 = 0x013 (9 bits)
[11:0] page offset = 0b1110_1111_0000 = 0xEF0 (12 bits)
Step-by-step:
CR3 ──► [PGD page, 4KB]
entry[0xFE]: bits[51:12]=PUD_phys_pfn, bits[0]=P=1, bits[1]=R/W, bits[2]=U/S
│
▼
[PUD page, 4KB]
entry[0x42]: bits[51:12]=PMD_phys_pfn OR if PS=1: 1GB huge page here
│
▼
[PMD page, 4KB]
entry[0xCF]: bits[51:12]=PT_phys_pfn OR if PS=1: 2MB huge page here
│
▼
[PT (PTE array) page, 4KB]
entry[0x13]: bits[51:12]=frame_pfn, bits[63]=NX, bits[6]=D, bits[5]=A, bits[1]=R/W
│
▼
[Physical Frame] + offset 0xEF0 ──► Physical Byte
Address space coverage per entry at each level:
One PTE covers 4 KB (2^12)
One PMD entry covers 2 MB (2^21) = 512 PTEs * 4 KB
One PUD entry covers 1 GB (2^30) = 512 PMD entries * 2 MB
One PGD entry covers 512 GB (2^39) = 512 PUD entries * 1 GB
Full PGD covers 256 TB (2^48) = 512 PGD entries * 512 GB (128 TB user half + 128 TB kernel half)
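The index arithmetic and the per-level coverage can be checked with a few lines of Python. This is a minimal sketch, not kernel code: split_va() and coverage_bytes() are hypothetical helper names, and the sample address 0x00007F1099E13EF0 is simply an address whose indices are 0xFE, 0x42, 0xCF and 0x13 with offset 0xEF0, matching the entries used in the walk diagram.

```python
PAGE_SHIFT = 12        # 4 KB pages: 12 offset bits
BITS_PER_LEVEL = 9     # 512 eight-byte entries per 4 KB table
LEVELS = ("pgd", "pud", "pmd", "pte")   # top level first


def split_va(va):
    """Split a 48-bit x86-64 virtual address into table indices and page offset."""
    mask = (1 << BITS_PER_LEVEL) - 1
    shift = PAGE_SHIFT + BITS_PER_LEVEL * len(LEVELS)   # 48: one past the PGD field
    parts = {}
    for name in LEVELS:
        shift -= BITS_PER_LEVEL                          # 39, 30, 21, 12
        parts[name] = (va >> shift) & mask
    parts["offset"] = va & ((1 << PAGE_SHIFT) - 1)
    return parts


def coverage_bytes(level):
    """Bytes mapped by one entry: level 0 = PTE, 1 = PMD, 2 = PUD, 3 = PGD.

    Level 4 gives the span of a whole 512-entry PGD.
    """
    return 1 << (PAGE_SHIFT + BITS_PER_LEVEL * level)


for name, value in split_va(0x00007F1099E13EF0).items():
    print(f"{name:>6} = {value:#x}")

for level, name in enumerate(("PTE", "PMD entry", "PUD entry", "PGD entry", "full PGD")):
    exponent = coverage_bytes(level).bit_length() - 1
    print(f"{name:>9} covers 2^{exponent} bytes")
```

Each field is extracted with one shift and one mask, which is also what the hardware walker does in parallel for all four levels.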
Five-Level Extension (LA57)
With CR4.LA57=1 (supported since Linux 4.14 via CONFIG_X86_5LEVEL), a fifth level (P4D) is inserted between PGD and PUD:
Five-Level x86-64 (57-bit canonical, 128 PiB total: 64 PiB user half + 64 PiB kernel half)
==========================================================
Bit ranges:
[56:48] PGD (9 bits, 512 entries)
[47:39] P4D (9 bits, 512 entries) <-- new level
[38:30] PUD (9 bits)
[29:21] PMD (9 bits)
[20:12] PTE (9 bits)
[11:0] offset (12 bits)
Linux code: arch/x86/include/asm/pgtable_64_types.h
PTRS_PER_P4D = 512 when CONFIG_X86_5LEVEL=y
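When five-level paging is not in use (the kernel is built without 5-level support, or the CPU lacks LA57), the P4D level is folded: p4d_offset() is effectively a no-op that returns the PGD entry pointer reinterpreted as a P4D pointer. This keeps the four-level and five-level paths unified.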
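The difference between 48-bit and 57-bit addressing is easiest to see through the canonical-address rule: all bits above the highest implemented bit must be copies of that bit. The sketch below is illustrative only; is_canonical() and user_half_bytes() are hypothetical helpers, not kernel functions.

```python
def is_canonical(va, va_bits=48):
    """True if bits 63..va_bits-1 of a 64-bit address are all copies of one bit."""
    upper = (va & (2**64 - 1)) >> (va_bits - 1)      # bits 63 .. va_bits-1
    all_ones = (1 << (64 - (va_bits - 1))) - 1       # the same field, fully set
    return upper in (0, all_ones)


def user_half_bytes(va_bits):
    """Size of the lower (user) half of the canonical address space."""
    return 1 << (va_bits - 1)


for bits in (48, 57):
    print(f"{bits}-bit: user half = 2^{bits - 1} bytes, "
          f"0x0000800000000000 canonical: {is_canonical(0x0000800000000000, bits)}")
```

An address with only bit 47 set is non-canonical under four-level paging but becomes an ordinary user address once LA57 is enabled.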
ARM64 Page Tables
ARM64 uses a different terminology but the same concept. Two translation table base registers:
- TTBR0_EL1: user address space (VA[55]=0)
- TTBR1_EL1: kernel address space (VA[55]=1)
ARM64 page table levels are called L0–L3 (or sometimes PGD/PUD/PMD/PTE following Linux naming). The granule size (4KB, 16KB, or 64KB) changes the number of bits per level. With 4KB granules and 48-bit VA: 4-level tables, each level indexed by 9 bits.
ARM64 4KB / 48-bit VA Table Walk:
L0 (PGD): bits [47:39]
L1 (PUD): bits [38:30]
L2 (PMD): bits [29:21]
L3 (PTE): bits [20:12]
offset: bits [11:0]
ARM64 PTE bits differ from x86-64: AP[2:1] for access permissions, UXN/PXN for execute-never at EL0/EL1, AF (Access Flag), nG (not-global), ASID in TTBR.
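How the granule size determines the number of levels can be worked out numerically. The following sketch assumes 8-byte descriptors, so a table of one granule holds 2^(granule_shift - 3) entries; arm64_table_levels() is a hypothetical helper name used only for illustration.

```python
def arm64_table_levels(va_bits, granule_shift):
    """Number of translation levels needed for a given VA width and granule.

    granule_shift is 12 for 4 KB, 14 for 16 KB, 16 for 64 KB granules.
    """
    index_bits = granule_shift - 3            # entries per table = 2^index_bits
    to_resolve = va_bits - granule_shift      # VA bits above the page offset
    return -(-to_resolve // index_bits)       # ceiling division


for granule_kb, shift in ((4, 12), (16, 14), (64, 16)):
    levels = arm64_table_levels(48, shift)
    print(f"{granule_kb:>2} KB granule, 48-bit VA: {shift - 3} bits per level, {levels} levels")
```

With a 16 KB granule the top level resolves only the one remaining bit, so the walk still has four levels but the top table holds just two entries.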
Linux Page Table Manipulation Functions
Defined in arch/x86/include/asm/pgtable.h and include/linux/pgtable.h:
/* Navigate the page table for a given mm and virtual address */
pgd_t *pgd_offset(struct mm_struct *mm, unsigned long address);
p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
pud_t *pud_offset(p4d_t *p4d, unsigned long address);
pmd_t *pmd_offset(pud_t *pud, unsigned long address);
pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address);
pte_t *pte_offset_map(pmd_t *pmd, unsigned long address); // may kmap on HIGHMEM
/* Test entry validity */
int pgd_none(pgd_t pgd); /* true if entry is empty (P=0 and software bits 0) */
int pgd_bad(pgd_t pgd); /* true if entry is corrupt */
int pgd_present(pgd_t pgd); /* true if P=1 */
/* For PTEs specifically */
int pte_present(pte_t pte);
int pte_write(pte_t pte);
int pte_dirty(pte_t pte);
int pte_young(pte_t pte); /* Accessed bit */
int pte_exec(pte_t pte); /* !NX */
pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma);
pte_t pte_mkdirty(pte_t pte);
pte_t pte_mkyoung(pte_t pte);
pte_t pte_wrprotect(pte_t pte);
/* Allocate a new page table page */
pgd_t *pgd_alloc(struct mm_struct *mm); /* arch/x86/mm/pgtable.c */
p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr);
pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long addr);
pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr);
Page table pages are allocated with __get_free_page(GFP_PGTABLE_USER) and tracked in mm->pgtables_bytes.
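The navigate, test, and allocate pattern of these helpers can be mimicked in user space with a toy model. This is a sketch under simplifying assumptions, not the kernel API: ToyMMU and its methods are hypothetical, table pages are plain Python lists keyed by a made-up frame number, and only the Present and R/W bits are modelled.

```python
PAGE_SHIFT, BITS, ENTRIES = 12, 9, 512
P, RW = 1 << 0, 1 << 1                                   # Present and R/W bits
PFN_MASK = ((1 << 52) - 1) & ~((1 << PAGE_SHIFT) - 1)    # bits 51:12


class ToyMMU:
    def __init__(self):
        self.tables = {}              # pfn -> list of 512 entries (one table page)
        self.next_pfn = 1
        self.root = self._alloc_table()    # plays the role of CR3 / mm->pgd

    def _alloc_table(self):
        pfn = self.next_pfn
        self.next_pfn += 1
        self.tables[pfn] = [0] * ENTRIES
        return pfn

    @staticmethod
    def _index(va, level):            # level 3 = PGD, 2 = PUD, 1 = PMD, 0 = PTE
        return (va >> (PAGE_SHIFT + BITS * level)) & (ENTRIES - 1)

    def map_page(self, va, frame_pfn, writable=False):
        table = self.root
        for level in (3, 2, 1):
            idx = self._index(va, level)
            entry = self.tables[table][idx]
            if not entry & P:         # empty slot: allocate the next-level table
                entry = (self._alloc_table() << PAGE_SHIFT) | P | RW
                self.tables[table][idx] = entry
            table = (entry & PFN_MASK) >> PAGE_SHIFT
        leaf = (frame_pfn << PAGE_SHIFT) | P | (RW if writable else 0)
        self.tables[table][self._index(va, 0)] = leaf

    def translate(self, va):
        table = self.root
        for level in (3, 2, 1, 0):
            entry = self.tables[table][self._index(va, level)]
            if not entry & P:
                return None           # a real MMU would raise a page fault here
            table = (entry & PFN_MASK) >> PAGE_SHIFT
        # After the PTE step, `table` holds the data frame number.
        return (table << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))


mmu = ToyMMU()
mmu.map_page(0x7F1099E13000, frame_pfn=0x1234, writable=True)
print(hex(mmu.translate(0x7F1099E13EF0)), len(mmu.tables), "table pages")
```

Mapping a single 4 KB page allocates one table page per level, which is why sparse address spaces pay a fixed minimum cost per touched 2 MB region.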
Page Table Memory Cost
A fully populated 48-bit address space would require:
- 1 PGD page (4 KB)
- 512 PUD pages (2 MB)
- 512 × 512 PMD pages (1 GB)
- 512 × 512 × 512 PT pages (512 GB)
This is impractical; in reality only a tiny fraction of the address space is mapped. A typical process has a few dozen to a few hundred PT pages. Each page table page costs 4 KB and one PT page maps 2 MB, so a process whose mappings touch 100 distinct 2 MB regions uses about 400 KB of PT pages (usually much less than the mapped data).
For large processes (a JVM with a 32 GB heap, or a database with a 1 TB mmap), PTE-level overhead with 4 KB pages ranges from tens of MB (64 MB for 32 GB) to gigabytes (2 GB for 1 TB). Huge pages (2 MB) reduce PT depth by one level, eliminating the entire PTE table layer for those regions.
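The overhead figures follow from simple division. The estimator below is a sketch that assumes one contiguous, naturally aligned, fully populated mapping (real address spaces are scattered and need somewhat more); table_pages() and total_bytes() are hypothetical names.

```python
ENTRIES_PER_TABLE = 512
TABLE_BYTES = 4096


def table_pages(mapped_bytes, page_size=4096):
    """Page-table pages needed per level for one contiguous, aligned mapping."""
    if page_size == 4096:
        levels = ("pte", "pmd", "pud", "pgd")
    elif page_size == 2 * 2**20:
        levels = ("pmd", "pud", "pgd")      # 2 MB pages: the PMD entry is the leaf
    else:
        raise ValueError("only 4 KB and 2 MB pages are modelled")
    entries = -(-mapped_bytes // page_size)                 # leaf entries (ceiling)
    counts = {}
    for name in levels:
        counts[name] = -(-entries // ENTRIES_PER_TABLE)     # table pages at this level
        entries = counts[name]                              # one parent entry per table page
    return counts


def total_bytes(counts):
    return sum(counts.values()) * TABLE_BYTES


heap = 32 * 2**30
for size, label in ((4096, "4 KB"), (2 * 2**20, "2 MB")):
    counts = table_pages(heap, size)
    print(f"{label} pages: {counts} -> {total_bytes(counts) / 2**20:.2f} MiB of tables")
```

Almost all of the cost sits in the lowest level, so removing the PTE layer with 2 MB pages shrinks the total by roughly a factor of 512.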
Page Table Sharing: fork() and CoW
On fork(), the kernel clones the parent's page tables. Rather than copying every page:
1. dup_mmap() (kernel/fork.c) iterates the parent's VMAs and calls copy_page_range().
2. copy_page_range() walks the page table and, for writable private VMAs, write-protects both parent and child PTEs (sets R/W=0).
3. Physical frames are not copied; both processes share the same frames.
4. On first write by either process, the CoW handler (do_wp_page()) allocates a new frame, copies the content, and installs a writable PTE.
After fork():
Parent PTE → Frame X (R/W=0)
Child PTE → Frame X (R/W=0)
After child writes to the page:
Child PTE → Frame Y (R/W=1, copy of X)
Parent PTE → Frame X (R/W=0, unchanged)
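The sharing and copy-on-write sequence can be modelled in a few lines. This is a toy simulation, not the kernel's data structures: Frame, PTE, fork_pte() and write() are hypothetical names, and the in-place reuse when only one mapping remains is a simplification of what the real fault handler decides.

```python
class Frame:
    def __init__(self, data):
        self.data = bytearray(data)
        self.mapcount = 1             # number of PTEs pointing at this frame


class PTE:
    def __init__(self, frame, writable):
        self.frame = frame
        self.writable = writable


def fork_pte(parent):
    """Share the frame and write-protect both sides, as copy_page_range() does."""
    parent.writable = False
    parent.frame.mapcount += 1
    return PTE(parent.frame, writable=False)


def write(pte, offset, value):
    """Store one byte, breaking CoW first if the PTE is write-protected."""
    if not pte.writable:                      # models the write fault
        if pte.frame.mapcount > 1:            # still shared: copy to a new frame
            pte.frame.mapcount -= 1
            pte.frame = Frame(pte.frame.data)
        pte.writable = True                   # sole owner: reuse the frame in place
    pte.frame.data[offset] = value


parent = PTE(Frame(bytes(16)), writable=True)
child = fork_pte(parent)
write(child, 0, 0xAA)
print(child.frame is parent.frame, parent.frame.data[0], child.frame.data[0])
```

Only the page that is actually written gets copied; every other page stays shared for as long as neither process writes to it.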
Kernel Page Tables vs User Page Tables
Every process has its own set of user-space page tables (rooted at mm->pgd). The kernel maintains a single canonical kernel page table that maps:
- All physical RAM (the "direct map" at PAGE_OFFSET)
- vmalloc range
- Module text/data
- fixmap (compile-time fixed virtual addresses)
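On x86-64, the kernel's top-level entries are replicated into the upper half of every process's PGD, so all processes share the same lower-level kernel tables. When a new PGD is allocated, pgd_ctor() (arch/x86/mm/pgtable.c) copies the kernel PGD entries from swapper_pg_dir with clone_pgd_range(), ensuring the kernel mapping is always consistent. (pgd_populate() is a different helper: it installs a next-level table into a single PGD entry.)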
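The replication of kernel entries into each new PGD can be pictured as a slice copy. The sketch below is a simplification: it treats index 256 as the start of the kernel half, whereas the real kernel copies from its own boundary constant, and new_process_pgd() is a hypothetical name.

```python
PTRS_PER_PGD = 512
KERNEL_HALF_START = PTRS_PER_PGD // 2   # simplification: index 256 starts the upper half


def pgd_index(va):
    """Top-level index for a 48-bit x86-64 virtual address."""
    return (va >> 39) & (PTRS_PER_PGD - 1)


def new_process_pgd(reference_pgd):
    """Return a fresh PGD: empty user half, kernel half copied from the reference."""
    pgd = [0] * PTRS_PER_PGD
    pgd[KERNEL_HALF_START:] = reference_pgd[KERNEL_HALF_START:]
    return pgd


kernel_pgd = [0] * PTRS_PER_PGD
kernel_pgd[300] = 0xABC067              # made-up entry pointing at a shared kernel table
process_pgd = new_process_pgd(kernel_pgd)
print(pgd_index(0x00007FFFFFFFFFFF), pgd_index(0xFFFF800000000000), hex(process_pgd[300]))
```

Because only the top-level entries are copied, every process ends up pointing at the same lower-level kernel tables, and a change made below the PGD is visible to all of them at once.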
KPTI: Two-Page-Table Design (Meltdown Mitigation)
KPTI (Kernel Page Table Isolation), introduced in Linux 4.15 as a response to Meltdown (CVE-2017-5754), maintains two sets of page tables per process:
KPTI Dual Page Table Design
============================
User-mode page tables (loaded when CPL=3):
+-- User VMA mappings (full access)
+-- Kernel trampoline page (minimal: just syscall entry, IDT, GDT, TSS)
+-- NO kernel text, NO kernel data, NO direct map
Kernel-mode page tables (loaded when CPL=0):
+-- User VMA mappings (still mapped, needed for copy_to/from_user)
+-- Full kernel mapping (text, data, direct map, modules, vmalloc)
Context switch between user ↔ kernel:
CPL=3 → CPL=0: CR3 switches from user PCID to kernel PCID
CPL=0 → CPL=3: CR3 switches from kernel PCID to user PCID
Cost: ~100-250 cycles per syscall for CR3 reload (without PCID)
~10-20 cycles with PCID (avoids TLB flush on CR3 write)
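Implementation: the entry code in arch/x86/entry/calling.h switches CR3 on kernel entry and exit (SWITCH_TO_KERNEL_CR3 / SWITCH_TO_USER_CR3), while arch/x86/mm/tlb.c (choose_new_asid(), switch_mm_irqs_off()) manages ASIDs/PCIDs across context switches. mm->pgd points to the kernel-mode PGD; the user-mode PGD is the adjacent 4 KB page of the same 8 KB allocation, so the two CR3 values differ only in bit 12 plus the PCID bits. mm->context.ctx_id is a unique per-mm identifier used for TLB generation tracking, not a CR3 value.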
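How the two CR3 values relate can be shown with bit arithmetic. The architectural parts (PCID in bits 11:0, table base in bits 51:12, bit 63 as the no-flush hint) come from the x86 CR3 format; the two PTI_* bit positions are assumptions about the layout Linux uses for the paired user/kernel tables, and build_cr3() and user_cr3() are hypothetical helpers.

```python
CR3_PCID_MASK = 0xFFF          # bits 11:0 hold the PCID when CR4.PCIDE=1
CR3_NOFLUSH = 1 << 63          # hint: keep TLB entries tagged with this PCID
PTI_USER_PGTABLE_BIT = 12      # assumption: user PGD is the second 4 KB of an 8 KB pair
PTI_USER_PCID_BIT = 11         # assumption: user PCID = kernel PCID with bit 11 set


def build_cr3(pgd_phys, pcid, noflush=False):
    """Compose a CR3 value from a page-aligned PGD address and a PCID."""
    if pgd_phys & CR3_PCID_MASK:
        raise ValueError("PGD must be 4 KB aligned")
    if not 0 <= pcid <= CR3_PCID_MASK:
        raise ValueError("PCID is a 12-bit field")
    return pgd_phys | pcid | (CR3_NOFLUSH if noflush else 0)


def user_cr3(kernel_cr3):
    """Derive the user-mode CR3 from the kernel-mode one by setting two bits."""
    return kernel_cr3 | (1 << PTI_USER_PGTABLE_BIT) | (1 << PTI_USER_PCID_BIT)


kernel = build_cr3(0x12346000, pcid=5)      # made-up, 8 KB aligned PGD address
print(hex(kernel), hex(user_cr3(kernel)))
```

Deriving one value from the other with fixed bit flips is what lets the entry path switch tables without a memory lookup.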
Two-Level and Three-Level Tables (Legacy / 32-bit)
On 32-bit x86 without PAE:
- PGD: 1024 entries × 4 bytes = 4 KB (indices bits 31..22)
- PTE: 1024 entries × 4 bytes = 4 KB (indices bits 21..12)
- Offset: 12 bits
With PAE (Physical Address Extension):
- PGD: 4 entries × 8 bytes = 32 bytes (indices bits 31..30)
- PMD: 512 entries × 8 bytes = 4 KB (indices bits 29..21)
- PTE: 512 entries × 8 bytes = 4 KB (indices bits 20..12)
PAE allows 32-bit systems to address up to 64 GB of physical RAM by widening PTEs to 64 bits.
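Both 32-bit layouts are instances of the same field-splitting rule with different index widths. The helper below is a generic sketch; split_fields() is a hypothetical name and 0xC0101234 is an arbitrary example address.

```python
def split_fields(va, widths, page_shift=12):
    """Split va into per-level indices (top level first) and the page offset.

    widths lists the index width in bits of each level, e.g. (10, 10) for
    non-PAE x86 or (2, 9, 9) for PAE.
    """
    shift = page_shift + sum(widths)
    indices = []
    for width in widths:
        shift -= width
        indices.append((va >> shift) & ((1 << width) - 1))
    return indices, va & ((1 << page_shift) - 1)


va = 0xC0101234
print("non-PAE:", split_fields(va, (10, 10)))
print("PAE:    ", split_fields(va, (2, 9, 9)))
```

The same function with widths (9, 9, 9, 9) reproduces the x86-64 four-level split, which is the sense in which the deeper layouts extend the 32-bit ones.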
Historical Context
Early PDP-11 Unix used segmentation (base + limit registers) with no paging. The VAX (1977) used linear per-region page tables, with process page tables themselves held in paged system virtual memory, an early form of multi-level translation. MULTICS used a two-level (segment + page) scheme. The 80386's paging used a two-level design (PD + PT). The Intel Pentium Pro added Physical Address Extension (PAE), effectively adding a third level. The AMD64 architecture (2003) introduced the current four-level design with 512-entry tables at each level. Intel added five-level paging (LA57) in Ice Lake CPUs (2019).
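Linux's unified page table abstraction began as a three-level scheme (pgd_t, pmd_t, pte_t) introduced by Linus Torvalds for early 64-bit ports such as Alpha, and it later accommodated the PAE three-level case without forking the VM code. pud_t was added for x86-64 four-level paging in Linux 2.6.11 and p4d_t for five-level paging, so the same abstraction has scaled cleanly to five levels.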
Production Examples
Large JVM heaps: A JVM running with -Xmx256g and 4KB pages requires ~512 MB of page table pages just for the PTE level (64M pages × 8 bytes per PTE) once the heap is fully touched. Switching to 2MB huge pages reduces this by 512x (to ~1 MB of PMD entries). This is why UseTransparentHugePages and UseHugeTLBFS are recommended for large-heap JVMs.
Database mmap + huge pages: Oracle Database on Linux uses HugeTLB pages for its SGA (System Global Area). A 512 GB SGA with 4KB pages needs ~1 GB of PTEs (128M pages × 8 bytes) in each process that maps all of it; with 2MB pages this drops to ~2 MB. The page table walk is also shorter (3 levels vs 4).
Debugging Notes
# Total page table memory used by all processes
grep PageTables /proc/meminfo
# Page table bytes for a specific process
grep VmPTE /proc/$(pidof -s java)/status
# Walk the page table of a process (crash utility)
# crash> set <pid>
# crash> vtop <virtual_address>   (prints the entry at each level and the physical address)
# crash> pte <pte_contents>       (decodes the bits of a raw PTE value)
# Check whether 5-level paging is available (the cpuinfo flag is named la57)
grep -m1 -ow la57 /proc/cpuinfo
# Or: dmesg | grep "5-level"
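# Dump the raw pagemap entry for one virtual address
# (since Linux 4.2 the PFN field reads as zero without CAP_SYS_ADMIN)
# Via /proc/PID/pagemap: read 8 bytes per page for PFN + flags
PID=${PID:-self}   # export PID=<pid> first; "self" inspects the Python process itself
python3 - "$PID" <<'EOF'
import os, struct, sys
pid = sys.argv[1] if len(sys.argv) > 1 else "self"
addr = 0x400000  # typical non-PIE text segment; pick a mapped address from /proc/PID/maps
page_size = os.sysconf("SC_PAGE_SIZE")
with open(f"/proc/{pid}/pagemap", "rb") as f:
    f.seek((addr // page_size) * 8)   # one 64-bit entry per virtual page
    raw = f.read(8)
if len(raw) != 8:
    sys.exit(f"no pagemap entry for {addr:#x}")
entry = struct.unpack("Q", raw)[0]
present = (entry >> 63) & 1
print(f"pagemap entry for {addr:#x}: {entry:#018x}")
if present:
    print(f"  PFN: {entry & ((1 << 55) - 1):#x}")   # bits 0-54 hold the PFN only when present
print(f"  Present: {present}")
print(f"  Swapped: {(entry >> 62) & 1}")
EOF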
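The 64-bit pagemap entry packs several fields, and the low bits change meaning depending on whether the page is present or swapped. A small decoder makes that explicit. It follows the bit layout documented for /proc/PID/pagemap; decode_pagemap_entry() is a hypothetical helper name, and the sample values are synthetic rather than read from a live system.

```python
def decode_pagemap_entry(entry):
    """Decode one 64-bit /proc/PID/pagemap entry into a dict of fields."""
    def bit(n):
        return bool((entry >> n) & 1)

    info = {
        "present": bit(63),
        "swapped": bit(62),
        "file_or_shared_anon": bit(61),
        "exclusive": bit(56),
        "soft_dirty": bit(55),
    }
    if info["present"]:
        info["pfn"] = entry & ((1 << 55) - 1)             # bits 0-54
    elif info["swapped"]:
        info["swap_type"] = entry & 0x1F                  # bits 0-4
        info["swap_offset"] = (entry >> 5) & ((1 << 50) - 1)   # bits 5-54
    return info


resident = (1 << 63) | (1 << 56) | 0x1234     # synthetic: present, exclusive, PFN 0x1234
swapped = (1 << 62) | (7 << 5) | 3            # synthetic: swap type 3, offset 7
print(decode_pagemap_entry(resident))
print(decode_pagemap_entry(swapped))
```

Reporting a PFN only when the present bit is set avoids misreading a swap offset as a physical frame number.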
Security Implications
Page table spraying: An attacker who can allocate many pages can influence which physical frames are used for page table pages, enabling physical memory prediction attacks (relevant for Rowhammer).
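Limits of KPTI: KPTI mitigates only Meltdown. Spectre variant 3a (CVE-2018-3640, Rogue System Register Read) can still read system registers through speculative execution even with KPTI, and is addressed by microcode updates. Retpoline, IBRS, and IBPB target Spectre variant 2 (CVE-2017-5715) and are layered on top as separate mitigations.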
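PTE modification races: On SMP systems, a window exists between reading a PTE and acting on it. Dirty COW (CVE-2016-5195) exploited a race of this kind in the CoW path: get_user_pages() retried a write fault while a concurrent madvise(MADV_DONTNEED) discarded the private copy, so the write landed on the original read-only page.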
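Page table page reuse: If a page table page is freed and reallocated as user data while stale references to it remain, an attacker who controls that data can synthesize arbitrary PTEs. This is the "page table confusion" class of exploits. The main defense is correct lifetime handling: page table pages are released only after the relevant TLB entries have been flushed (the mmu_gather machinery), and they are tagged with a dedicated page type while in use.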
Performance Implications
- Page table walk memory bandwidth: Each TLB miss requires up to 4 memory reads (for 4-level tables). In the worst case, where every read goes to DRAM at 100 ns, a 1% TLB miss rate adds 0.01 × 4 × 100 ns = ~4 ns of average memory latency per access. Paging-structure caches and the data caches usually absorb part of this.
- Huge pages eliminate PTE level: 2MB pages require only 3 levels; 1GB pages only 2. This reduces both TLB miss penalty and the number of page table pages.
- Fork overhead for large processes: Copying page tables on fork() is O(number of PT pages). A process with 100 GB of populated anonymous memory has ~200 MB of PT pages (100 GB / 4 KB × 8 bytes), and fork() can spend tens of milliseconds or more just copying them. This is why Chromium uses zygote + fork carefully, and why posix_spawn() is preferred over fork+exec in performance-critical code.
- KPTI overhead: On kernels with KPTI enabled and hardware without PCID support, every syscall flushes the TLB (CR3 reload). Intel CPUs since Westmere support PCID; with PCID, the overhead drops from ~5% to ~0.5% for syscall-heavy workloads.
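The average cost of page table walks is a product of three numbers, which makes the effect of huge pages easy to estimate. This is a worst-case back-of-the-envelope model that assumes every walk step is a full memory read with no help from caches; walk_overhead_ns() is a hypothetical helper.

```python
def walk_overhead_ns(tlb_miss_rate, levels, read_latency_ns):
    """Average extra latency per memory access caused by page table walks."""
    return tlb_miss_rate * levels * read_latency_ns


for levels, label in ((4, "4 KB pages"), (3, "2 MB pages"), (2, "1 GB pages")):
    cost = walk_overhead_ns(0.01, levels, 100.0)
    print(f"{label}: about {cost:.1f} ns per access at a 1% miss rate")
```

In practice huge pages help twice: each walk is shorter, and the miss rate itself falls because one TLB entry covers far more memory.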
Failure Modes and Real Incidents
Page table corruption: A kernel bug that writes to the wrong physical address can corrupt a page table, causing immediate random #PF panics or silent data corruption. CONFIG_DEBUG_PAGEALLOC marks free pages as non-present to catch use-after-free.
OOM due to page table explosion: A bug in Linux 2.6 caused some workloads to create millions of single-page VMAs (in the worst case, when the VMAs are sparse, each needing its own page table page). The page table pages exhausted RAM before the working set did, causing OOM kills. Fixed by VMA merging improvements in vma_merge().
Meltdown (CVE-2017-5754): The architectural vulnerability that required KPTI. The root cause was that kernel page table entries were present in user-mode page tables (with supervisor-only bits). Speculative execution could transiently access them before the permission fault was delivered. Impact: ~5–30% syscall overhead on patched systems without PCID hardware support.
Modern Usage
- Memory encryption (AMD SME/SEV): AMD Secure Memory Encryption uses a C-bit in PTEs to mark encrypted pages. The CPU transparently encrypts/decrypts cache lines for those physical frames. SEV (Secure Encrypted Virtualization) extends this to encrypt guest memory from the hypervisor.
- Intel TME (Total Memory Encryption): Similar to SME; a global key encrypts all DRAM accesses. No PTE modification required.
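- CET (Control-flow Enforcement Technology): Shadow stack pages are marked by a dedicated PTE encoding (Write=0, Dirty=1) rather than by a protection key. Ordinary stores to them fault; only control-transfer instructions and dedicated shadow-stack instructions such as WRSS can write them, which enforces return address integrity with little software overhead.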
Future Directions
- Shared page tables: Processes sharing large read-only mappings (shared libraries, shared memory) could share page table pages, reducing memory overhead. Research prototype: "shared page tables" patch series (LWN 2022).
- Hardware-assisted page table management: ARM's "stage 2" page tables for virtualization already exist. Future extensions may allow hardware-managed page table aging without kernel involvement.
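- Capability pointers alongside PTEs: CHERI-MIPS and CHERI-RISC-V attach bounds and permissions to pointers as hardware capabilities, so fine-grained protection is checked against the capability held in a register rather than at page granularity. CHERI complements rather than replaces paging; page tables still provide address translation.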
Exercises
- Write a kernel module that walks the page table for a given PID and virtual address, printing the physical address and PTE flags. Handle the case where a level is not present.
- Measure the page table memory overhead (/proc/meminfo PageTables) before and after forking a process with a 1 GB anonymous mmap region.
- Enable CONFIG_DEBUG_PAGEALLOC in a test kernel and trigger a use-after-free on a page table page. Observe the immediate panic vs silent corruption without the config.
- Compare the fork time of a process with 100 MB of anonymous memory using 4KB pages vs 2MB huge pages. Measure using perf stat -e cycles.
- Trace CR3 loads using perf hardware events on a kernel with KPTI enabled. Count CR3 switches per second during a syscall-heavy workload (e.g., strace -c find /usr).
- Using /proc/PID/pagemap, implement a tool that reports how many physical frames are shared between two processes (identical PFN in both processes' pagemaps for the same VMA).
References
- arch/x86/include/asm/pgtable.h: PTE type definitions and manipulation
- arch/x86/include/asm/pgtable_64_types.h: PGD/PUD/PMD/PTE level widths, PTRS_PER_*
- arch/x86/mm/pgtable.c: pgd_alloc(), pgd_free()
- mm/memory.c: copy_page_range(), __copy_pte_range()
- arch/x86/mm/tlb.c: switch_mm_irqs_off(), KPTI CR3 switching
- arch/arm64/include/asm/pgtable.h: ARM64 page table types
- include/linux/pgtable.h: Architecture-independent page table API
- Intel SDM Vol. 3A, Chapter 4 (Paging), Section 4.5 (4-Level Paging)
- AMD64 Architecture Programmer's Manual, Vol. 2, Chapter 5
- Jonathan Corbet, "KAISER: hiding the kernel from user space", LWN.net (2017): KPTI design
- CVE-2017-5754 — Meltdown: https://meltdownattack.com/
- LWN: "Five-level page tables" — https://lwn.net/Articles/717293/