Virtual Memory
Technical Overview
Virtual memory is the foundational abstraction that decouples a process's view of memory from the physical RAM installed in the machine. Every process believes it has exclusive access to a large, contiguous address space — on a 64-bit Linux system, nominally 128 TiB of user-space address space — regardless of how much physical RAM is present or how many other processes are running. The kernel and hardware cooperate through a translation layer to map virtual addresses to physical frames on demand.
Virtual memory solves four distinct problems simultaneously:
- Protection: Process A cannot read or write process B's memory. The kernel enforces ownership at the page-table level; any unauthorized access raises a fault that the kernel handles by delivering SIGSEGV.
- Abstraction: Every process sees the same canonical layout regardless of where the kernel actually placed its physical pages. This makes position-independent code, shared libraries, and ASLR all tractable.
- Overcommit: The sum of all virtual address spaces across running processes can vastly exceed physical RAM. Pages are only allocated when touched (demand paging).
- Convenience primitives: mmap, CoW fork, shared memory, memory-mapped files, and guard pages all depend on the virtual-to-physical indirection.
Prerequisites
- Understanding of CPU privilege levels (ring 0 / ring 3)
- Basic familiarity with the x86-64 control registers (CR3, CR0)
- Knowledge of what a system call is and how it transitions to kernel mode
- Familiarity with the concept of a process and its lifecycle
Core Content
The Virtual Address Space
On Linux/x86-64 with a 4-level page table, virtual addresses are 48 bits wide. User space occupies the lower canonical half (addresses 0x0000000000000000 – 0x00007FFFFFFFFFFF) and kernel space occupies the upper canonical half (0xFFFF800000000000 – 0xFFFFFFFFFFFFFFFF). The 47-bit user virtual address space is 128 TiB; with five-level page tables (x86-64 LA57) virtual addresses widen to 57 bits, and the user half grows to 56 bits / 64 PiB.
64-bit Linux Virtual Address Space (per-process view)
=====================================================
0xFFFFFFFFFFFFFFFF +------------------------+
| Kernel space | (not accessible from user)
| kernel text/data |
| vmalloc |
| direct phys mapping |
| modules |
0xFFFF800000000000 +------------------------+
| |
| (non-canonical hole) | ~16 EiB hole; any access = #GP
| |
0x00007FFFFFFFFFFF +------------------------+
| Stack | grows down; expands via fault
| (+ guard page) |
+------------------------+
| ... |
+------------------------+
| mmap / shared libs | ld-linux.so, libc.so, ...
| region (grows down) |
+------------------------+
| ... |
+------------------------+
| Heap | grows up via brk()
+------------------------+
| BSS segment | zero-initialized globals
+------------------------+
| Data segment | initialized globals
+------------------------+
| Text segment | executable code (r-x)
0x0000000000400000 +------------------------+
| (unmapped / null guard)|
0x0000000000000000 +------------------------+
The exact addresses vary with ASLR, but the relative ordering is stable. The kernel randomizes the base of the stack, mmap region, and (for PIE binaries) the text segment at each exec.
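The canonical-address rule behind the two halves and the hole between them can be checked with a few lines of arithmetic. The following is a minimal sketch, not kernel code; the helper names is_canonical and user_half_bytes are made up for illustration, and the address width is passed in as a parameter (48 for 4-level paging, 57 for LA57).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: an address is canonical when every bit from
 * (va_bits - 1) up to bit 63 has the same value, i.e. the address is the
 * sign extension of its low va_bits bits. */
static bool is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    uint64_t upper    = addr >> (va_bits - 1);        /* sign bit and everything above it */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* the same field filled with 1s */
    return upper == 0 || upper == all_ones;
}

/* Hypothetical helper: size in bytes of the lower (user) canonical half,
 * which is 2^(va_bits - 1). */
static uint64_t user_half_bytes(unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    return UINT64_C(1) << (va_bits - 1);
}
```

With va_bits set to 48, the last user address 0x00007FFFFFFFFFFF and the first kernel address 0xFFFF800000000000 are both canonical, while everything in between falls into the hole.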
VMA: vm_area_struct
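The kernel does not track individual pages in the address space directly. Instead it maintains an ordered collection of Virtual Memory Areas (VMAs), each a contiguous range of pages with uniform permissions and backing. Kernels before 6.1 keep them in a sorted linked list plus a red-black tree, as in the excerpt below; since 6.1 a maple tree (mm->mm_mt) replaces both. Each VMA is described by struct vm_area_struct, defined in include/linux/mm_types.h and manipulated mainly in mm/mmap.c: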
struct vm_area_struct {
unsigned long vm_start; /* first byte of VMA */
unsigned long vm_end; /* first byte past VMA end */
struct vm_area_struct *vm_next, *vm_prev; /* sorted linked list */
struct rb_node vm_rb; /* red-black tree node */
pgprot_t vm_page_prot; /* page protection flags */
unsigned long vm_flags; /* VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, ... */
struct mm_struct *vm_mm; /* back-pointer to mm */
struct file *vm_file; /* file backing, or NULL for anonymous */
unsigned long vm_pgoff; /* offset in file (pages) */
const struct vm_operations_struct *vm_ops; /* fault handlers etc. */
/* ... more fields ... */
};
Key vm_flags:
- VM_READ / VM_WRITE / VM_EXEC: access permissions
- VM_SHARED: shared mapping (writes visible to other processes)
- VM_MAYWRITE: can become writable (used for CoW check)
- VM_GROWSDOWN: stack segment
- VM_HUGETLB: uses HugeTLB pages
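The mm_struct (one per process, pointed to by task_struct->mm) holds the root of the VMA tree, a pointer to the page global directory (pgd, whose physical address is what gets loaded into CR3), memory statistics (mm_rss_stat), and the lock that protects the VMA tree (mmap_lock, an rwsem).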
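To make the lookup concrete, here is a user-space sketch of the question the fault handler asks: which area, if any, covers this address? It assumes a sorted, non-overlapping array instead of the kernel's tree, and struct toy_vma, toy_find_vma and toy_addr_is_mapped are hypothetical names, not kernel APIs. The search mirrors the documented contract of the kernel's find_vma(): return the first area whose end lies above the address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for struct vm_area_struct. */
struct toy_vma {
    uint64_t start;     /* first byte of the area */
    uint64_t end;       /* first byte past the area */
    const char *name;   /* label for illustration only */
};

/* Return the first area with end > addr, or NULL if there is none.
 * The array must be sorted by address and non-overlapping. As with
 * find_vma(), the returned area may begin above addr. */
static const struct toy_vma *toy_find_vma(const struct toy_vma *v, size_t n,
                                          uint64_t addr)
{
    assert(v != NULL || n == 0);
    size_t lo = 0, hi = n;              /* search window [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid].end > addr)
            hi = mid;                   /* candidate; look for an earlier one */
        else
            lo = mid + 1;               /* area ends at or below addr */
    }
    return lo < n ? &v[lo] : NULL;
}

/* True when addr lies inside some area, so a fault there could be legal. */
static int toy_addr_is_mapped(const struct toy_vma *v, size_t n, uint64_t addr)
{
    const struct toy_vma *vma = toy_find_vma(v, n, addr);
    return vma != NULL && vma->start <= addr;
}
```

The binary search gives the same O(log n) behavior as a balanced tree; the kernel needs a tree because VMAs are inserted and removed constantly, which a flat array handles poorly.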
The brk System Call
brk(2) moves the program break (the top of the heap) to a new address, growing or shrinking the heap. The kernel adjusts the heap VMA in do_brk_flags() (mm/mmap.c). No pages are faulted in immediately; physical pages are allocated on first touch (demand paging). glibc's malloc uses brk for small allocations and mmap for requests above M_MMAP_THRESHOLD (128 KiB by default).
sbrk(0) → returns the current break
brk(addr) → move the break to addr (need not be page-aligned; the kernel rounds the mapped region up to a page boundary)
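The break can be observed from user space with sbrk(3). The sketch below assumes glibc on Linux; probe_brk_growth is a hypothetical helper that grows the break, measures how far it moved, and restores it. Moving the break by hand while malloc is also in use is generally discouraged, so this is meant as an experiment rather than a pattern to copy.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: grow the program break by `bytes`, then shrink it
 * back. Returns how far the break moved, or -1 if sbrk() failed. */
static long probe_brk_growth(long bytes)
{
    assert(bytes > 0);
    void *before = sbrk(0);             /* sbrk(0) reports the current break */
    if (sbrk(bytes) == (void *)-1)      /* on success sbrk returns the old break */
        return -1;
    void *after = sbrk(0);
    long moved = (long)((uintptr_t)after - (uintptr_t)before);
    sbrk(-bytes);                       /* restore the original break */
    return moved;
}
```

Calling probe_brk_growth with a value that is not a multiple of the page size is a quick way to confirm that the break itself is tracked at byte granularity.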
The mmap System Call
mmap(2) creates a new VMA in the calling process's address space. The kernel calls do_mmap() (mm/mmap.c) which:
1. Finds a suitable gap in the address space using get_unmapped_area()
2. Creates a new vm_area_struct (or extends an adjacent one if flags match)
3. Optionally installs a file reference for file-backed mappings
4. Returns the virtual address; no physical pages allocated yet
Key flags:
| Flag | Meaning |
|------|---------|
| MAP_ANONYMOUS | Not backed by a file; zero-filled |
| MAP_SHARED | Changes visible to other processes mapping same file |
| MAP_PRIVATE | CoW; changes private to this process |
| MAP_FIXED | Place at exactly the given address (dangerous) |
| MAP_POPULATE | Pre-fault pages (no page fault latency later) |
| MAP_LOCKED | Lock pages in RAM (similar to mlock(2), but mmap does not fail if the pages cannot all be populated) |
| MAP_HUGETLB | Use HugeTLB pages |
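The claim that mmap returns before any physical page exists can be checked with mincore(2), which reports per-page residency. The sketch below maps anonymous memory, writes to a few pages, and counts resident pages before and after. resident_pages and demand_paging_probe are hypothetical helpers; the exact counts depend on the kernel and its configuration (for example transparent huge pages), so the code reports numbers rather than asserting them.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count resident pages in [addr, addr + len) using
 * mincore(2). addr must be page-aligned. Returns -1 on error. */
static long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;
    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;        /* bit 0 set means the page is resident */
    }
    free(vec);
    return count;
}

/* Hypothetical helper: map npages anonymous pages, write to the first
 * `touch` of them, and report residency before and after.
 * Returns 0 on success, -1 on error. */
static int demand_paging_probe(size_t npages, size_t touch,
                               long *before, long *after)
{
    assert(touch <= npages);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    *before = resident_pages(p, len);
    for (size_t i = 0; i < touch; i++)
        p[i * page] = 1;                /* first write faults the page in */
    *after = resident_pages(p, len);
    munmap(p, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system one would expect the count before touching to be zero and the count afterwards to match the number of pages written; adding MAP_POPULATE to the flags should make the whole range resident up front.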
/proc/PID/maps and /proc/PID/smaps
/proc/PID/maps exposes the VMA list in human-readable form:
address perms offset dev inode pathname
55a1b2c3d000-55a1b2c3e000 r--p 00000000 fd:01 1234567 /usr/bin/bash
55a1b2c3e000-55a1b2c6a000 r-xp 00001000 fd:01 1234567 /usr/bin/bash
7f8a9b0c0000-7f8a9b2c0000 r--p 00000000 fd:01 7654321 /lib/x86_64-linux-gnu/libc.so.6
7f8a9b2c0000-7f8a9b418000 r-xp 00200000 fd:01 7654321 /lib/x86_64-linux-gnu/libc.so.6
...
7ffe12340000-7ffe12361000 rw-p 00000000 00:00 0 [stack]
Permission field: r read, w write, x execute, p private (CoW), s shared.
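/proc/PID/smaps adds RSS, PSS (proportional set size), private dirty, and anonymous/swap statistics per VMA. It is the most detailed view available for production memory analysis, though generating it walks the page tables and can be slow for large processes; /proc/PID/smaps_rollup gives pre-summed totals.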
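Each line of /proc/PID/maps has a fixed field order, so it can be parsed with a single sscanf call. This is a minimal sketch; struct map_entry and parse_maps_line are hypothetical names, the path buffer size of 256 is an arbitrary choice, and the device and inode fields are parsed but discarded.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one parsed line of /proc/PID/maps. */
struct map_entry {
    uint64_t start, end;    /* [start, end) virtual address range */
    char perms[5];          /* for example "r-xp" */
    uint64_t offset;        /* offset into the backing file */
    char path[256];         /* pathname or pseudo-name like "[stack]"; may be empty */
};

/* Parse one maps line into *out. Returns 1 on success, 0 on malformed input. */
static int parse_maps_line(const char *line, struct map_entry *out)
{
    unsigned dev_major, dev_minor;      /* parsed but not kept */
    unsigned long inode;                /* parsed but not kept */
    assert(line != NULL && out != NULL);
    out->path[0] = '\0';
    int n = sscanf(line,
                   "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %x:%x %lu %255[^\n]",
                   &out->start, &out->end, out->perms, &out->offset,
                   &dev_major, &dev_minor, &inode, out->path);
    return n >= 7;                      /* the pathname (8th field) is optional */
}
```

Feeding it lines read with fgets from /proc/self/maps is enough to rebuild the layout of the current process; anonymous mappings simply come back with an empty path.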
Address Space Isolation
The CR3 register on x86-64 holds the physical address of the PGD (Page Global Directory). On every context switch to a different address space, the kernel loads CR3 with the new process's page table root (switch_mm_irqs_off() in arch/x86/mm/tlb.c). Because every translation starts from CR3, this single write swaps the entire virtual-to-physical mapping, so one process's private pages are simply not reachable through another process's page tables.
Kernel address space is mapped into every process's upper half, but with supervisor-only protection bits (CPL=0 required). With KPTI (Kernel Page Table Isolation, introduced as a Meltdown mitigation in Linux 4.15), even the kernel mapping in user-mode page tables is stripped to a tiny trampoline, further tightening isolation.
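Isolation between address spaces is easy to observe from user space. In the sketch below, a forked child writes to a variable that lives at the same virtual address in both processes; because each process has its own page tables and the page is copied on write, the parent never sees the change. child_write_is_private is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write made by a forked child is
 * invisible to the parent, 0 if the parent observed it, -1 on error. */
static int child_write_is_private(void)
{
    volatile int value = 1;             /* same virtual address in both processes */
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        value = 2;                      /* write fault: child gets its own copy */
        _exit(value == 2 ? 0 : 1);
    }
    assert(pid > 0);                    /* only the parent reaches this point */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return value == 1;                  /* parent's copy is unchanged */
}
```

Placing the variable in a MAP_SHARED mapping instead would make the child's write visible, since both page tables would then point at the same physical frame.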
Historical Context
Virtual memory was pioneered at the University of Manchester in the Atlas computer (1962) and independently at MIT with MULTICS (1965). The Intel 80386 (1985) brought paging to the x86 architecture, enabling protected-mode multitasking OSes. Linux inherited the x86 paging model from early on (Linus Torvalds targeted the 386). The move from 32-bit to 64-bit (2003–2006) expanded address spaces from 4 GiB to 128 TiB, making virtual memory overcommit even more practical.
Production Examples
Memory-mapped databases: PostgreSQL allocates its shared buffer pool as an anonymous shared mmap region on most platforms (since 9.3). SQLite can optionally memory-map the database file for reads (PRAGMA mmap_size). LMDB is built entirely on mmap: its whole database file is mapped and the OS page cache serves as the buffer pool.
Java heap: The JVM allocates its heap as a large anonymous mmap region at startup. With -Xmx8g, the reservation is 8 GiB of virtual space but physical pages are only committed as the GC fills the heap.
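The reserve-then-commit pattern described for the Java heap can be sketched with plain mmap and mprotect. This is a generic illustration and not the JVM's actual code; reserve_then_commit is a hypothetical helper, and both sizes are assumed to be multiples of the page size.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `reserve` bytes of address space with no
 * access rights, then make the first `commit` bytes readable and writable.
 * Returns the base address, or NULL on failure. */
static void *reserve_then_commit(size_t reserve, size_t commit)
{
    assert(commit <= reserve);
    /* PROT_NONE + MAP_NORESERVE claims addresses only; nothing is usable yet */
    void *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    /* "Commit" a prefix by granting access; pages still appear on first touch */
    if (commit > 0 && mprotect(base, commit, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, reserve);
        return NULL;
    }
    return base;
}
```

After a call such as reserve_then_commit(64 MiB, 1 MiB), the process's VSZ grows by the full reservation while RSS only grows as the committed prefix is actually written.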
Large sparse files: fallocate(1) and posix_fallocate(3) reserve disk blocks for a file without writing every page. Conversely, a sparse file can be mapped in full with mmap; holes read back as zeros, and blocks are allocated only once a page has been written.
Debugging Notes
# Pick a single PID (pgrep -o selects the oldest match, normally the postmaster)
pid=$(pgrep -o postgres)
# Show all VMAs for a process
cat "/proc/$pid/maps"
# Detailed per-VMA memory statistics (PSS, swap, etc.)
cat "/proc/$pid/smaps"
# Summarized smaps
cat "/proc/$pid/smaps_rollup"
# Total virtual/physical memory per process
ps -o pid,vsz,rss,comm -p "$pid"
# Find processes with the most VMAs (watch for VMA limit: /proc/sys/vm/max_map_count)
for p in /proc/[0-9]*/maps; do echo "$( { wc -l < "$p"; } 2>/dev/null) $p"; done | sort -rn | head -10
# Raise VMA limit (default 65530; Java with many JARs can hit this)
sysctl -w vm.max_map_count=262144
/proc/sys/vm/max_map_count defaults to 65530. Elasticsearch and Java applications with hundreds of JARs routinely exhaust this, causing mmap to fail with ENOMEM. The fix is to raise max_map_count.
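A process can also monitor its own distance from the limit. The sketch below counts the lines of /proc/self/maps (one per VMA) and reads the sysctl value from /proc; count_self_vmas and read_max_map_count are hypothetical helper names, and both return -1 when /proc is unavailable.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of VMAs of the calling process, obtained by
 * counting lines in /proc/self/maps. Returns -1 if the file is unreadable. */
static long count_self_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL)
        return -1;
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;                    /* one line per VMA */
    fclose(f);
    return lines;
}

/* Hypothetical helper: current value of vm.max_map_count, or -1 on failure. */
static long read_max_map_count(void)
{
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (f == NULL)
        return -1;
    long limit = -1;
    if (fscanf(f, "%ld", &limit) != 1)
        limit = -1;
    fclose(f);
    return limit;
}
```

A long-running service could log both numbers periodically and alert when the VMA count approaches the limit, well before mmap starts failing with ENOMEM.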
Security Implications
ASLR: kernel.randomize_va_space (default 2 on Linux) randomizes the base of the stack, mmap region, VDSO, and (at level 2) the brk heap. It raises the cost of return-oriented programming (ROP) exploits by making addresses unpredictable. Bypass techniques include heap spraying and information leak vulnerabilities.
MAP_FIXED races: Using MAP_FIXED at an address already occupied by a VMA silently unmaps the existing region. This is a footgun; MAP_FIXED_NOREPLACE (Linux 4.17+) fails instead of clobbering.
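The difference is easy to demonstrate: request a mapping on top of an existing one and check that the kernel refuses. The sketch below assumes Linux; noreplace_refuses_overlap is a hypothetical helper, and the fallback definition of MAP_FIXED_NOREPLACE uses the value found on x86 and most other architectures for the case where older libc headers lack the macro. Kernels older than 4.17 ignore the flag and treat the address as a hint, which the code detects.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000    /* assumed value; differs on a few architectures */
#endif

/* Hypothetical helper: try to map one page on top of an existing mapping.
 * Returns 1 if the kernel refused with EEXIST, 0 if the flag was ignored and
 * the mapping landed elsewhere (pre-4.17 kernel), -1 on any other outcome. */
static int noreplace_refuses_overlap(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *first = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (first == MAP_FAILED)
        return -1;
    first[0] = 'A';                     /* data that plain MAP_FIXED would destroy */

    int result;
    void *second = mmap(first, page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (second == MAP_FAILED) {
        result = (errno == EEXIST) ? 1 : -1;
    } else if (second != (void *)first) {
        munmap(second, page);           /* old kernel: address treated as a hint */
        result = 0;
    } else {
        result = -1;                    /* unexpected: the mapping was replaced */
    }
    if (result >= 0)
        assert(first[0] == 'A');        /* original contents survived */
    munmap(first, page);
    return result;
}
```

Code that must run on older kernels should always compare the returned address with the requested one, since a successful return alone does not prove the mapping landed where it was asked to.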
Null pointer dereferences: Page zero (0x0–0xFFF) is normally unmapped, so NULL dereferences fault immediately. If vm.mmap_min_addr is set to 0 (a root-only change), a process can map page zero with mmap and MAP_FIXED, letting a kernel bug that dereferences a NULL pointer read or jump to attacker-controlled memory. Most distributions set vm.mmap_min_addr=65536.
VMA limit exhaustion: An attacker can DoS a process by inducing it to create thousands of VMAs (e.g., via a malformed ELF or library) until mmap fails.
Performance Implications
- VMA lookup: O(log n) in the red-black tree (a maple tree since Linux 6.1). Processes with tens of thousands of VMAs (Elasticsearch, JVMs with many class loaders) see measurable overhead on page fault handling.
- mmap_lock contention: mmap_lock is a per-mm rwsem. Concurrent mmap/munmap and page fault handling contend on this lock. High-concurrency applications can see CPU time in down_read(&mm->mmap_lock).
- Kernel address space layout: The kernel direct map covers all physical RAM. On machines with 1+ TiB of RAM, the direct map itself occupies a large fraction of the kernel virtual address space, potentially conflicting with vmalloc and modules regions.
Failure Modes and Real Incidents
OOM from VMA fragmentation: A bug in an early version of Chromium's zygote forking caused each renderer process to accumulate thousands of VMAs. Combined with a per-process VMA limit, this caused mmap failures and renderer crashes at scale.
Stack overflow with no guard page: If ulimit -s unlimited is set and the kernel skips the guard page, a stack overflow can silently corrupt the mmap region below it rather than delivering SIGSEGV.
Meltdown (CVE-2017-5754): Exploited the fact that kernel mappings exist in user-space page tables (just protected). Speculative execution could bypass the protection check and read kernel memory into the CPU cache before the exception was raised. Fixed by KPTI.
Modern Usage
- io_uring: Uses mmap to share submission/completion ring buffers between kernel and user space without copies.
- VDSO/vvar: The kernel maps a small read-only region ([vdso]) into every process so that certain syscalls (gettimeofday, clock_gettime) can be served in user space without a ring transition.
- userfaultfd: Allows user-space to handle page faults for its own VMAs, enabling live migration, checkpoint/restore (CRIU), and user-space memory managers.
- memfd_create: Creates anonymous files backed by tmpfs; combined with mmap this enables sealed, shareable memory regions (used by Wayland clients for shared-memory buffers).
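The memfd_create pattern can be sketched in a single process by mapping the same in-memory file twice. This assumes Linux 3.17+ and a glibc that exposes memfd_create (2.27 or later); memfd_shared_roundtrip and the label "demo-region" are made up for illustration. In real use the file descriptor would be passed to another process over a Unix socket instead of being mapped twice locally.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous in-memory file, map it twice with
 * MAP_SHARED, and check that a write through one mapping is visible through
 * the other. Returns 1 if visible, 0 if not, -1 on error. */
static int memfd_shared_roundtrip(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = memfd_create("demo-region", MFD_CLOEXEC);  /* name is only a label */
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {               /* give the file a size */
        close(fd);
        return -1;
    }
    char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                          /* the mappings keep the file alive */
    int result = -1;
    if (a != MAP_FAILED && b != MAP_FAILED) {
        assert(a != b);                 /* two distinct virtual ranges */
        strcpy(a, "hello");             /* both ranges map the same physical page */
        result = (strcmp(b, "hello") == 0);
    }
    if (a != MAP_FAILED)
        munmap(a, len);
    if (b != MAP_FAILED)
        munmap(b, len);
    return result;
}
```

Two virtual addresses backed by one physical page is exactly the indirection that shared memory relies on; sealing (MFD_ALLOW_SEALING plus fcntl F_ADD_SEALS) would additionally let the receiver trust that the region cannot be resized underneath it.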
Future Directions
- Five-level page tables (LA57): Already in mainline Linux (4.14+). Extends virtual addresses to 57 bits and user space to 64 PiB. Required for future machines with petabytes of RAM or for massive virtual address space consumers.
- Linear Address Masking (LAM): Intel extension to allow storing metadata in the upper bits of pointers (hardware masks them before address translation). Enables hardware-assisted memory tagging.
- Memory tagging (ARM MTE): ARM's Memory Tagging Extension assigns 4-bit tags to 16-byte granules. Enables hardware-checked bounds and use-after-free detection at near-zero overhead.
- Capability-based addressing (CHERI): Replaces raw pointers with hardware-enforced capabilities that carry bounds and permissions, ruling out out-of-bounds accesses at the architectural level and, combined with capability revocation, use-after-free bugs.
Exercises
- Write a C program that maps 10 GiB of anonymous memory with mmap on a machine with 4 GiB of RAM. Observe that it succeeds (overcommit); under the default heuristic policy the mapping may need MAP_NORESERVE or vm.overcommit_memory=1 to be accepted. Then touch every page and observe OOM.
- Parse /proc/self/maps from within a running process and reconstruct the VMA layout programmatically.
- Use strace to trace all mmap/munmap calls made by ls during a single invocation. Count how many distinct regions are created and destroyed.
- Set kernel.randomize_va_space=0 and run the same binary ten times; confirm the stack address is identical. Re-enable and confirm randomization.
- Write a program that deliberately exhausts vm.max_map_count by creating thousands of one-page anonymous mappings (alternate their protections so adjacent mappings are not merged into one VMA). Handle the resulting ENOMEM gracefully.
- Use userfaultfd to implement a simple demand-paging scheme in user space that logs every page fault to stdout.
References
- include/linux/mm_types.h: struct mm_struct, struct vm_area_struct
- mm/mmap.c: do_mmap(), do_brk_flags(), get_unmapped_area()
- arch/x86/mm/tlb.c: switch_mm_irqs_off(), CR3 switch
- arch/x86/entry/vdso/: VDSO implementation
- fs/proc/task_mmu.c: /proc/PID/maps and /proc/PID/smaps implementation
- mm/userfaultfd.c: userfaultfd implementation
- Linux man pages: mmap(2), brk(2), mprotect(2), mincore(2), madvise(2)
- Mel Gorman, "Understanding the Linux Virtual Memory Manager" (2004), still the definitive reference
- Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Chapter 4 (Paging)
- CVE-2017-5754 (Meltdown), KPTI design document: https://lwn.net/Articles/741878/