03 — Memory Virtualization
Prerequisites
- Virtual memory fundamentals: page tables, TLB, page faults, CR3 register
- x86 paging: 4-level page table walk (PML4 → PDPT → PD → PT → physical page)
- KVM architecture: VMCS, VMEXIT, VMX non-root mode
- Virtualization fundamentals: guest physical address vs host physical address distinction
Historical Context
Memory virtualization was the hardest part of the x86 virtualization problem. On architectures like IBM System/370, the hardware had explicit support for a two-level address translation hierarchy. On x86, the MMU had exactly one job: translate virtual addresses to physical addresses using the page tables pointed to by CR3. There was no provision for a second level of translation.
Early hypervisors like VMware (1999) solved this with shadow page tables, an entirely software-maintained extra level of mapping. This worked but was expensive, complex, and a persistent source of bugs. AMD shipped hardware support first, as Nested Page Tables (NPT, marketed as RVI) in Barcelona (2007). Intel followed with Extended Page Tables (EPT) in the Nehalem microarchitecture (2008; Xeon 5500 series in 2009). These hardware features are now universal on server CPUs.
The result was dramatic: EPT/NPT removed shadow page table maintenance from normal guest operation, reducing the overhead of memory-intensive workloads from roughly 30% to roughly 3%.
The Memory Virtualization Challenge
When a VM runs, there are three distinct address spaces:
- Guest Virtual Address (gVA): what guest userspace and kernel software uses
- Guest Physical Address (gPA): what the guest OS believes is physical memory (its page tables map gVA → gPA)
- Host Physical Address (hPA): actual DRAM addresses on the host machine
The guest's page tables map gVA → gPA. But the CPU's MMU produces hPA as the final result of a page walk. The hypervisor must arrange that when the guest's MMU traverses page tables to produce a gPA, the hardware ultimately delivers the correct hPA.
Address Space Layers:
Guest Process
| gVA (e.g., 0x7fff1000)
|
v [Guest OS page tables]
Guest OS
| gPA (e.g., 0x04000000) <-- guest thinks this is physical
|
v [Hypervisor mapping: gPA → hPA]
Hypervisor
| hPA (e.g., 0xA3000000) <-- actual DRAM location
|
v
Physical Memory
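The layering in the diagram can be sketched as two chained lookups. This is a minimal illustration that uses flat dictionaries in place of real page-table trees; the names `translate`, `guest_page_table`, and `hypervisor_map` are made up for this example, and the addresses are the ones from the diagram.

```python
PAGE_SIZE = 4096

def translate(addr, table):
    """Translate an address through a page-granular map {page base: page base}."""
    base = addr & ~(PAGE_SIZE - 1)
    offset = addr & (PAGE_SIZE - 1)
    if base not in table:
        raise KeyError(f"no mapping for {addr:#x}")
    return table[base] + offset

# Owned by the guest OS: gVA page -> gPA page
guest_page_table = {0x7FFF1000: 0x04000000}
# Owned by the hypervisor: gPA page -> hPA page
hypervisor_map = {0x04000000: 0xA3000000}

def gva_to_hpa(gva):
    gpa = translate(gva, guest_page_table)   # what the guest believes is physical
    return translate(gpa, hypervisor_map)    # where the data really lives in DRAM

print(hex(gva_to_hpa(0x7FFF1234)))  # 0xa3000234
```

The page offset passes through both stages unchanged; only the page base is remapped at each level.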
Approach 1: Shadow Page Tables
Used by VMware (pre-2007), early KVM (before EPT), and still used as a fallback.
Concept
The hypervisor maintains a shadow page table for each guest page table. The shadow PT maps gVA → hPA directly (skipping the gPA level). The CPU's CR3 actually points to the shadow PT, not the guest PT. The guest PT is only consulted by the hypervisor to keep the shadow PT in sync.
Shadow Page Table Mechanism:
Guest CR3 (as seen by guest) Shadow PT (what CPU really uses)
points to guest PML4 Shadow PML4 → Shadow PDPT → ...
| |
v v
Guest PML4 Shadow PML4
Guest PML4[i] = gPA of PDPT Shadow PML4[i] = hPA of shadow PDPT
| |
v v
Guest PDPT Shadow PDPT
Guest PDPT[j] = gPA of PD Shadow PDPT[j] = hPA of shadow PD
| |
v v
Guest PD / PT Shadow PD / PT
Guest PT[k] = gPA of page Shadow PT[k] = hPA of page (final!)
Keeping Shadow PTs Consistent
The hard part: when the guest modifies its page tables, the shadow PT must be updated. The hypervisor does this by:
- Write-protecting guest page tables: the hypervisor marks the guest PT pages as read-only in the shadow PT. Any guest write to a PT page causes a page fault VMEXIT. KVM inspects the write, updates the corresponding shadow PT entry, then allows the write.
- Flushing on CR3 load: when the guest loads a new CR3 (context switch), KVM must flush or switch the shadow PT. This is expensive — context switches cost 5,000–50,000 ns extra.
- TLB invalidation: when the guest executes INVLPG or writes CR3, KVM must invalidate corresponding shadow PT entries and TLB entries.
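The synchronization steps above can be sketched with a toy model. It is not KVM's implementation: single-level dictionaries stand in for the 4-level trees, and `ShadowMMU` with its method names is hypothetical. The point it illustrates is that every guest PT write costs one trap, and the shadow entry is always the composition of the guest entry with the hypervisor's gPA → hPA map.

```python
class ShadowMMU:
    """Toy shadow paging: flat dicts stand in for the 4-level table trees."""

    def __init__(self, gpa_to_hpa):
        self.gpa_to_hpa = gpa_to_hpa  # hypervisor-owned: gPA page -> hPA page
        self.guest_pt = {}            # guest-owned:      gVA page -> gPA page
        self.shadow_pt = {}           # CPU-visible:      gVA page -> hPA page
        self.write_traps = 0

    def guest_writes_pte(self, gva_page, gpa_page):
        """Guest PT pages are write-protected, so each write lands here."""
        self.write_traps += 1
        self.guest_pt[gva_page] = gpa_page                    # apply the guest's write
        self.shadow_pt[gva_page] = self.gpa_to_hpa[gpa_page]  # resync the shadow entry

    def guest_invlpg(self, gva_page):
        """INVLPG: drop the shadow entry; it is rebuilt on the next miss."""
        self.shadow_pt.pop(gva_page, None)

    def shadow_fault(self, gva_page):
        """Shadow miss: rebuild from the guest PT (KeyError = real guest fault)."""
        gpa_page = self.guest_pt[gva_page]
        self.shadow_pt[gva_page] = self.gpa_to_hpa[gpa_page]
        return self.shadow_pt[gva_page]

mmu = ShadowMMU({0x04000000: 0xA3000000, 0x04001000: 0xB7000000})
mmu.guest_writes_pte(0x7FFF1000, 0x04000000)
mmu.guest_writes_pte(0x7FFF2000, 0x04001000)
print(mmu.write_traps, {hex(k): hex(v) for k, v in mmu.shadow_pt.items()})
```

In this model the trap count grows linearly with guest page-table writes, which is the cost that hardware nested paging removes.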
Shadow PT Performance Problems
- Context switch overhead: every guest context switch (loading a new CR3) requires building or switching shadow PTs. O(1) amortized, but with cold cache penalty.
- Write fault overhead: every guest PT write traps to KVM. For workloads with frequent mmap/munmap (web servers, JVMs), this is very expensive.
- Memory overhead: each guest PT page requires a shadow PT page, so page-table memory roughly doubles.
- Complexity: ~5,000 lines of subtle, bug-prone code in KVM, still maintained for CPUs without EPT/NPT and for nested virtualization.
Approach 2: Hardware Extended Page Tables (EPT) / Nested Page Tables (NPT)
Intel EPT
Intel EPT (part of VT-x, available since Nehalem, 2008) adds a second hardware page table structure. The CPU performs a two-level page walk:
- Walk the guest page tables (pointed to by guest CR3, stored in VMCS Guest CR3 field) to translate gVA → gPA
- Walk the EPT (pointed to by the EPT Pointer field in VMCS) to translate gPA → hPA
Both walks happen entirely in hardware, without hypervisor involvement for normal memory accesses.
EPT Two-Level Page Walk:
Guest CR3 ──────────────────────────────┐
(gPA of guest PML4) │
│ EPT walk (gPA→hPA)
gVA bits [47:39] ──> Guest PML4 │
entry = gPA ───────┼──> EPT walk ──> hPA of PML4
Follow hPA │
│ │
v │
gVA bits [38:30] ──> Guest PDPT │
entry = gPA ───────┼──> EPT walk ──> hPA of PDPT
Follow hPA │
│ │
v │
gVA bits [29:21] ──> Guest PD │
entry = gPA ───────┼──> EPT walk ──> hPA of PD
Follow hPA │
│ │
v │
gVA bits [20:12] ──> Guest PT │
entry = gPA ───────┼──> EPT walk ──> hPA of PT
Follow hPA │
│ │
gVA bits [11:0] ──> Final gPA ─────────┘──> EPT walk ──> hPA + offset
= physical byte
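The cost of the nested walk in the diagram can be counted directly. The sketch below assumes a worst case with no TLB or paging-structure cache hits, and the function name is made up for this example. Each guest table lives at a gPA, so reading it requires a full EPT walk first, and the final data gPA needs one more EPT walk.

```python
def nested_walk_accesses(guest_levels=4, ept_levels=4):
    """Worst-case memory accesses to translate one gVA with nested paging."""
    accesses = 0
    for _ in range(guest_levels):
        accesses += ept_levels  # EPT walk: gPA of this guest table -> hPA
        accesses += 1           # read the guest table entry itself
    accesses += ept_levels      # EPT walk for the final data page's gPA
    return accesses

print(nested_walk_accesses())                # 24 with 4-level guest and 4-level EPT
print(nested_walk_accesses(ept_levels=0))    # 4: a plain native 4-level walk
```

The count is `guest_levels * (ept_levels + 1) + ept_levels`, which is why large pages and walk caches matter far more under nested paging than natively.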
EPT Entry Format
Each EPT entry is 64 bits:
- Bits 0-2: Read/Write/Execute permissions
- Bits 3-5: Memory type (WB, UC, WT, etc.; leaf entries only)
- Bit 6: Ignore PAT memory type (leaf entries only)
- Bit 7: Large page (maps 1GB if set in a PDPTE, 2MB if set in a PDE)
- Bit 8: EPT accessed flag (if accessed/dirty flags are enabled in the EPT pointer)
- Bit 9: EPT dirty flag (leaf entries, same enable condition)
- Bits 12-51: Physical address of next-level table or final page
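A short decoder makes the bit layout concrete. It is a sketch that assumes the leaf-entry layout documented in the Intel SDM (permissions in bits 0-2, memory type in bits 3-5, accessed and dirty flags in bits 8 and 9, address in bits 12-51); `decode_ept_leaf` is a hypothetical helper, not a KVM function.

```python
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF  # bits 12-51

def decode_ept_leaf(entry):
    """Split a 64-bit EPT leaf entry into its documented fields."""
    return {
        "read": bool(entry & (1 << 0)),
        "write": bool(entry & (1 << 1)),
        "execute": bool(entry & (1 << 2)),
        "memory_type": (entry >> 3) & 0x7,   # 0 = UC, 6 = WB
        "ignore_pat": bool(entry & (1 << 6)),
        "large_page": bool(entry & (1 << 7)),
        "accessed": bool(entry & (1 << 8)),
        "dirty": bool(entry & (1 << 9)),
        "phys_addr": entry & ADDR_MASK,
    }

# 4KB page at hPA 0xA3000000, RWX, write-back memory type
entry = 0xA3000000 | (6 << 3) | 0b111
fields = decode_ept_leaf(entry)
print(hex(fields["phys_addr"]), fields["memory_type"], fields["write"])
```

An entry whose three permission bits are all clear is treated as not present, which is what triggers the EPT violation path.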
EPT Violation VMEXIT
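When the hardware walks the EPT and finds an entry that is not present (read, write, and execute bits all clear, meaning the gPA is not yet mapped) or that does not permit the attempted access, it generates an EPT violation VMEXIT (exit reason 48). KVM then:
- Reads the faulting gPA from the VMCS guest-physical address field (the exit qualification describes the type of access)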
- Allocates a host physical page (via standard Linux page allocator)
- Maps it into the EPT at the correct gPA
- Issues VMRESUME
This is analogous to a regular page fault, but at the gPA → hPA level.
EPT Violation Handling:
Guest accesses gVA 0x7fff1000
|
| Hardware walks guest PT: gVA → gPA = 0x04000000
|
| Hardware walks EPT: gPA 0x04000000 → not mapped!
|
v
EPT Violation VMEXIT (reason=48)
|
| KVM reads guest-physical address field: gPA = 0x04000000
| KVM calls kvm_mmu_page_fault()
| Linux allocates hPA page (e.g., 0xA3000000)
| KVM writes EPT entry: 0x04000000 → 0xA3000000 (R/W/X)
| VMRESUME
|
v
Guest continues: gVA 0x7fff1000 → gPA 0x04000000 → hPA 0xA3000000
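The flow above amounts to demand paging at the gPA → hPA level. The following sketch models it with a dictionary as the EPT and a counter as the page allocator; `LazyEPT` and its methods are illustrative names only and do not correspond to KVM's internal functions.

```python
import itertools

PAGE_SIZE = 4096

class LazyEPT:
    """Populate gPA -> hPA mappings on first touch, as an EPT violation would."""

    def __init__(self, first_free_hpa=0xA3000000):
        self.ept = {}                                    # gPA page -> hPA page
        self._free_pages = itertools.count(first_free_hpa, PAGE_SIZE)
        self.violations = 0

    def handle_violation(self, gpa_page):
        """Stand-in for the VMEXIT handler: allocate a page and map it."""
        self.violations += 1
        self.ept[gpa_page] = next(self._free_pages)

    def access(self, gpa):
        page, offset = gpa & ~(PAGE_SIZE - 1), gpa & (PAGE_SIZE - 1)
        if page not in self.ept:          # hardware would raise exit reason 48
            self.handle_violation(page)
        return self.ept[page] + offset    # resumed access now succeeds

vm = LazyEPT()
print(hex(vm.access(0x04000010)), vm.violations)  # first touch faults once
print(hex(vm.access(0x04000020)), vm.violations)  # same page: no new fault
```

Only the first access to each guest page pays the exit cost; later accesses are served by the populated mapping.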
Shadow PT vs EPT Comparison
| Aspect | Shadow Page Tables | EPT/NPT |
|---|---|---|
| CPU requirement | Any VMX-capable CPU | VT-x + EPT (Nehalem+) |
| Page walk depth | 4 levels (gVA→hPA) | 4+4 levels (gVA→gPA→hPA) |
| Hardware TLB entries | Tagged by ASID | Tagged by VPID + ASID |
| Context switch cost | High (shadow PT rebuild/flush) | Low (just CR3 load) |
| Guest PT write cost | VMEXIT per write | None (no write-protection) |
| TLB miss penalty | Single walk (up to 4 memory accesses) | Two nested walks (up to 24 memory accesses) |
| Memory overhead | ~2× page-table memory (shadow copies) | ~0.2% of guest RAM for EPT tables at 4KB granularity |
| Typical overhead vs native | 10-30% (compute/memory) | 1-5% |
EPT Large Pages
EPT supports 2MB and 1GB large pages (similar to host hugepages). Benefits:
- Reduce EPT walk depth by eliminating PT-level entries
- Reduce TLB pressure (fewer TLB entries needed for same address space)
- Critical for workloads like databases (1GB pages map an entire huge file buffer in one EPT entry)
KVM uses EPT large pages when the host uses transparent hugepages (THP) and the guest's gPA range can be mapped with a 2MB-aligned hPA region.
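The alignment condition can be expressed as a small check. This is a simplified sketch: it only tests alignment and length, whereas a real hypervisor also needs the backing host memory to be one contiguous huge page. The helper names are hypothetical.

```python
SIZE_4K = 4 << 10
SIZE_2M = 2 << 20
SIZE_1G = 1 << 30

def largest_page_size(gpa, hpa, length):
    """Largest mapping size usable at this gPA/hPA pair for `length` bytes."""
    for size in (SIZE_1G, SIZE_2M, SIZE_4K):
        if gpa % size == 0 and hpa % size == 0 and length >= size:
            return size
    raise ValueError("addresses must be 4KB aligned and length at least 4KB")

def entries_needed(region_bytes, page_size):
    """Leaf entries (and TLB entries) needed to cover a region."""
    return -(-region_bytes // page_size)  # ceiling division

print(largest_page_size(0x00200000, 0x00600000, 4 << 20) == SIZE_2M)  # True
print(largest_page_size(0x00200000, 0x00601000, 4 << 20) == SIZE_4K)  # True: hPA misaligned
print(entries_needed(1 << 30, SIZE_4K), entries_needed(1 << 30, SIZE_2M))  # 262144 512
```

Covering 1GB takes 262,144 entries at 4KB but 512 at 2MB, which is where the TLB-pressure benefit comes from.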
Memory Overcommit and Reclamation
Hypervisors routinely run more guest RAM than available host RAM. Three mechanisms handle overcommit:
1. Balloon Driver
The balloon driver is a guest OS driver (e.g., virtio_balloon in Linux) that can be inflated or deflated by the hypervisor:
- Inflate: hypervisor tells guest balloon driver to allocate N MB of guest memory. Guest kernel allocates these pages (using its own MM), marks them as "balloon pages," and tells the hypervisor their gPAs. Hypervisor can then reclaim the backing hPA pages for other uses.
- Deflate: hypervisor tells balloon driver to release pages. Driver frees them, making them available to the guest OS again.
Balloon Driver Operation:
Guest Memory (8GB configured):
+--[guest app pages]--+--[balloon pages]--+--[free guest pages]--+
6 GB available 2 GB inflated
|
v
Host: 2 GB hPA freed for other VMs (balloon hypervisor reclaim)
Advantage: guest OS participates voluntarily, no data loss. Disadvantage: requires guest cooperation (driver); slow to react.
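The inflate/deflate protocol can be sketched from the guest's side. This toy model tracks page numbers only; `GuestBalloon` is a hypothetical class, and unlike a real guest kernel it raises an error instead of reclaiming caches when free memory runs short.

```python
class GuestBalloon:
    """Guest-side view of a balloon: pages move between 'free' and 'balloon'."""

    def __init__(self, total_pages):
        self.free_pages = set(range(total_pages))  # gPA page numbers
        self.balloon_pages = set()

    def inflate(self, count):
        """Pin `count` free pages and report them so the host can reclaim them."""
        if count > len(self.free_pages):
            raise MemoryError("guest has too few free pages to inflate")
        taken = [self.free_pages.pop() for _ in range(count)]
        self.balloon_pages.update(taken)
        return taken  # gPA page numbers handed to the hypervisor

    def deflate(self, count):
        """Return up to `count` ballooned pages to the guest allocator."""
        count = min(count, len(self.balloon_pages))
        released = [self.balloon_pages.pop() for _ in range(count)]
        self.free_pages.update(released)
        return released

guest = GuestBalloon(total_pages=8)
reclaimable = guest.inflate(2)
print(len(reclaimable), len(guest.free_pages))  # 2 6
```

The hypervisor may drop the host backing of exactly the pages returned by `inflate`, because the guest has promised not to touch them until a deflate.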
2. KSM — Kernel Same-Page Merging
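KSM is a Linux kernel feature (merged 2.6.32, 2009) that scans memory registered as mergeable (QEMU marks guest RAM with madvise(MADV_MERGEABLE)) for identical pages and merges them using Copy-on-Write:
- KSM daemon (ksmd) scans the registered pages in host physical memory
- Computes a checksum of each page's content to skip pages that are still changing
- Candidate pages are looked up in content-ordered trees and compared byte-by-byte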
- If identical: unmap both pages, map both virtual addresses to one shared hPA (read-only)
- On guest write: CoW fault triggers, duplicate page created, write proceeds
KSM Deduplication:
Before KSM: After KSM:
VM1 gPA 0x1000 → hPA A VM1 gPA 0x1000 ─┐
VM2 gPA 0x1000 → hPA B VM2 gPA 0x1000 ─┴──> hPA A (RO, shared)
(both contain "zero page") hPA B freed
- Typical savings: 30–50% memory reduction when running many similar VMs (e.g., 50 Ubuntu VMs with identical kernels)
- Zero page (all-zero 4KB page) is the most commonly merged: almost every process has large BSS/stack zero regions
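- Performance cost: ksmd scanning uses CPU, and each write to a merged page takes an extra CoW fault. Timing side channels based on memory deduplication have also been demonstrated (merged pages can be detected through CoW write latency, and shared pages enable Flush+Reload attacks).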
- KSM configuration: /sys/kernel/mm/ksm/
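The merge and copy-on-write behavior can be illustrated with a small simulation. It is deliberately simplified: candidates are found through a hash table, which is not how ksmd indexes pages, and the `Host` class with its methods is invented for this example. The byte-by-byte confirmation before merging and the CoW break on write are the parts that mirror the description above.

```python
import hashlib

class Host:
    def __init__(self):
        self.frames = {}     # frame id -> page content (bytes)
        self.mappings = {}   # (vm, gPA) -> frame id
        self._next_frame = 0

    def _alloc(self, content):
        frame = self._next_frame
        self._next_frame += 1
        self.frames[frame] = content
        return frame

    def map_page(self, vm, gpa, content):
        self.mappings[(vm, gpa)] = self._alloc(content)

    def _sharers(self, frame):
        return sum(1 for f in self.mappings.values() if f == frame)

    def ksm_scan(self):
        """Merge identical pages; returns the number of mappings redirected."""
        canonical, merged = {}, 0
        for key, frame in list(self.mappings.items()):
            digest = hashlib.sha256(self.frames[frame]).digest()
            keep = canonical.setdefault(digest, frame)
            # Confirm byte-for-byte before sharing, as the text describes.
            if keep != frame and self.frames[keep] == self.frames[frame]:
                self.mappings[key] = keep
                if self._sharers(frame) == 0:
                    del self.frames[frame]   # duplicate frame is freed
                merged += 1
        return merged

    def write(self, vm, gpa, content):
        frame = self.mappings[(vm, gpa)]
        if self._sharers(frame) > 1:         # shared frame: break CoW
            self.mappings[(vm, gpa)] = self._alloc(content)
        else:
            self.frames[frame] = content

    def read(self, vm, gpa):
        return self.frames[self.mappings[(vm, gpa)]]

host = Host()
host.map_page("vm1", 0x1000, bytes(4096))
host.map_page("vm2", 0x1000, bytes(4096))
print(host.ksm_scan(), len(host.frames))  # 1 1
host.write("vm2", 0x1000, b"\x01" * 4096)
print(len(host.frames), host.read("vm1", 0x1000) == bytes(4096))  # 2 True
```

The write by vm2 never becomes visible to vm1, which is the isolation property that makes merging safe for correctness, though not for timing.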
3. Memory Swap / Ballooning Under Pressure
When the hypervisor is critically low on memory:
1. Aggressively inflate balloon in all VMs
2. If still short: swap guest physical pages to disk (host swaps the backing pages of the EPT mapping). Guest sees high latency but no visible swap activity.
3. Last resort: OOM-kill a QEMU process (entire VM dies)
VPID — Virtual Processor ID
Without VPID, TLB flushes on every VMENTER/VMEXIT are required (because the new address space — host or guest — may have conflicting TLB entries with the old one). VPID (VT-x feature) tags TLB entries with a virtual processor ID:
- Host entries tagged with VPID=0
- Each guest vCPU assigned a unique VPID (1..65535)
- TLB entries from VPID=1 are invisible to VPID=2 and to VPID=0
- On VMENTER/VMEXIT: no global TLB flush needed, just activate the appropriate VPID
VPID reduces VMENTER/VMEXIT cost by ~500 ns on TLB-intensive workloads.
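The effect of tagging can be shown with a toy TLB keyed by (VPID, virtual page). This sketch ignores capacity, replacement, and PCID, and `TaggedTLB` is a made-up name; it only demonstrates that entries from different VPIDs coexist without a flush on each transition.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (VPID, virtual page)."""

    def __init__(self):
        self.entries = {}

    def fill(self, vpid, vpage, ppage):
        self.entries[(vpid, vpage)] = ppage

    def lookup(self, vpid, vpage):
        """Return the cached translation, or None on a miss."""
        return self.entries.get((vpid, vpage))

    def flush_vpid(self, vpid):
        """Targeted invalidation: drop only one VPID's entries."""
        self.entries = {k: v for k, v in self.entries.items() if k[0] != vpid}

tlb = TaggedTLB()
tlb.fill(0, 0x1000, 0xAAAA000)   # host (VPID 0)
tlb.fill(1, 0x1000, 0xBBBB000)   # guest vCPU with VPID 1, same virtual page
# After a VM exit and re-entry, both translations are still usable.
print(hex(tlb.lookup(0, 0x1000)), hex(tlb.lookup(1, 0x1000)))
print(tlb.lookup(2, 0x1000))     # None: VPID 2 cannot see the others' entries
```

Without the tag, the two entries for virtual page 0x1000 would collide, forcing a flush at every transition.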
Production Examples
JVM workloads in VMs: JVMs allocate large heaps with many dirty pages. Without EPT large pages, EPT walks for heap accesses add 10–15% overhead. Enabling transparent hugepages on the host (/sys/kernel/mm/transparent_hugepage/enabled = always) combined with EPT large page support reduces JVM overhead in VMs to ~2–3%.
Memory-intensive databases (Redis, Memcached): These benefit enormously from KSM on hosting fleets — Redis instances often contain similar data structures. AWS reported 40% memory savings on ElastiCache fleets using KSM.
Windows VM on KVM: Windows does not use the zero page efficiently; KSM savings from Windows VMs are typically lower (15–25%) than Linux VMs (40–60%).
Security Implications
- Rowhammer: bits in physically adjacent DRAM rows can be flipped by repeatedly accessing (hammering) a row. In VMs, an attacker VM can potentially flip bits in the hypervisor or another VM's memory. Mitigation: ECC RAM, TRR (Target Row Refresh) in DDR4, guard rows in EPT mapping.
- Side-channel via KSM: an attacker can detect whether a page has been deduplicated by measuring write latency (a CoW write is slower than a write to an unshared page). This lets the attacker test whether another VM holds a page with specific, guessed content. Mitigated by disabling KSM for security-sensitive VMs.
- EPT misconfiguration: if an EPT entry has a physical address pointing into the hypervisor's own memory, a guest can read/write hypervisor data. KVM validates EPT entries but this was a source of early CVEs.
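- Speculative EPT walks: L1TF (2018) exploited the fact that when a page-table entry is not present or has reserved bits set, the CPU may still speculatively use the entry's address bits to look up the L1 data cache. A malicious guest can craft its own page-table entries so that this address bypasses EPT translation and leaks host data resident in L1D. Mitigations include flushing L1D on VM entry and not running untrusted guests on sibling hyperthreads.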
Debugging Notes
# View KVM MMU statistics (per VM)
cat /sys/kernel/debug/kvm/*/mmu_pte_write # shadow PT writes
cat /sys/kernel/debug/kvm/*/mmu_cache_miss # MMU page cache misses
# KSM statistics
cat /sys/kernel/mm/ksm/pages_shared # pages currently shared
cat /sys/kernel/mm/ksm/pages_sharing # how many use shared pages
cat /sys/kernel/mm/ksm/pages_unshared # scanned, not merged
# Guest balloon driver status (inside guest)
cat /sys/kernel/debug/virtio-balloon/balloon_num_pages
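# Check if EPT is enabled (prints Y) and how many large pages are mapped
cat /sys/module/kvm_intel/parameters/ept
grep . /sys/kernel/debug/kvm/*/pages_2m /sys/kernel/debug/kvm/*/pages_1g # "largepages" on older kernels
# perf to count MMU page-table page lookups system-wide
perf stat -e kvmmmu:kvm_mmu_get_page -a sleep 10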
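The KSM counters listed above can be turned into a rough savings estimate. The sketch below follows the kernel's documented meaning of `pages_sharing` (the number of additional mappings that reuse a shared page) and assumes a 4KB page size unless told otherwise; the function names are made up for this example.

```python
from pathlib import Path

KSM_DIR = "/sys/kernel/mm/ksm"

def read_ksm_counters(ksm_dir=KSM_DIR):
    """Read the three KSM counters shown above from sysfs."""
    names = ("pages_shared", "pages_sharing", "pages_unshared")
    return {name: int(Path(ksm_dir, name).read_text()) for name in names}

def ksm_saved_bytes(counters, page_size=4096):
    """Approximate memory saved: one page per extra sharer of a merged page."""
    return counters["pages_sharing"] * page_size

def sharing_ratio(counters):
    """Average number of extra sharers per merged page (0 if nothing merged)."""
    shared = counters["pages_shared"]
    return counters["pages_sharing"] / shared if shared else 0.0

if __name__ == "__main__" and Path(KSM_DIR, "pages_sharing").is_file():
    stats = read_ksm_counters()
    print(f"saved ~{ksm_saved_bytes(stats) / 2**20:.1f} MiB, "
          f"ratio {sharing_ratio(stats):.2f}")
```

A high ratio of `pages_sharing` to `pages_shared` indicates effective merging, while a large `pages_unshared` suggests scanning effort with little payoff.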
Failure Modes
- EPT table exhaustion: with millions of small guest pages, EPT tables themselves consume significant memory. 512 GB of guest physical address space at 4KB granularity requires 134 million EPT entries × 8 bytes = ~1 GB just for EPT tables.
- Balloon overinflation: hypervisor inflates balloon too aggressively; guest OOM-kills critical processes. Monitor oom_kill_process events inside VMs.
- KSM scan storm: ksmd scanning all VM memory with high pages_to_scan setting consumes significant CPU and causes cache pollution. Use sleep_millisecs to rate-limit.
- THP fragmentation: host THP allocated for EPT large pages can fragment over time; khugepaged needs to run periodically to re-compact.
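The EPT table exhaustion figure can be reproduced with a few lines of arithmetic. This sketch assumes a contiguous guest physical address space, 512 entries per 4KB table, and a fully populated mapping; `ept_table_pages` is a hypothetical helper.

```python
def ept_table_pages(guest_bytes, page_size=4096, entries_per_table=512, levels=4):
    """Return (leaf entry count, table pages per level from PT up to PML4)."""
    leaf_entries = -(-guest_bytes // page_size)        # ceiling division
    pages_per_level = []
    entries = leaf_entries
    for _ in range(levels):
        tables = -(-entries // entries_per_table)      # tables holding those entries
        pages_per_level.append(tables)
        entries = tables                               # one parent entry per table
    return leaf_entries, pages_per_level

leaves, per_level = ept_table_pages(512 << 30)         # 512 GB guest
total_bytes = sum(per_level) * 4096
print(leaves, per_level, round(total_bytes / 2**30, 3))
# 134217728 [262144, 512, 1, 1] 1.002
```

Nearly all of the cost sits in the leaf (PT) level, so mapping the same range with 2MB pages removes that level and shrinks the tables by a factor of about 512.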
Modern Usage and Future Directions
AMD SEV (Secure Encrypted Virtualization): encrypts each VM's guest physical memory with a unique encryption key. The nested page tables (AMD's counterpart to EPT) map encrypted pages; even the hypervisor reading hPA sees ciphertext. Requires AMD EPYC processors (Zen 1+).
Intel TDX (Trust Domain Extensions): similar to SEV but with stronger attestation. The CPU maintains a "Trust Domain Control Structure" analogous to VMCS. Accesses to a TD's private memory from outside the TD, including from the hypervisor, are hardware-blocked.
Virtual NUMA (vNUMA): large VMs expose NUMA topology to the guest so the guest OS can make NUMA-aware allocation decisions, reducing cross-NUMA memory traffic. KVM supports this via QEMU -numa flags; when host memory is bound accordingly, each vNUMA node's gPA range is backed by hPA from the matching host node.
Memory Tagging: ARM MTE (Memory Tagging Extension) in VMs allows guest programs to use hardware memory safety with minimal overhead. Hypervisor must propagate tag bits through the EPT/Stage-2 translation.
Exercises
- Calculate: a 128 GB guest with 4KB pages requires how many EPT leaf entries? How many pages of EPT tables at 4 levels?
- Enable KSM on a Linux system (echo 1 > /sys/kernel/mm/ksm/run) and launch 5 QEMU VMs from the same image. Monitor pages_sharing over 60 seconds. Calculate the memory saved.
- Write a script that measures the time to write to a freshly allocated page vs a KSM-merged page (hint: read a page, then write to it; the CoW fault is the difference).
- Explain why shadow page tables require write-protecting guest PT pages, but EPT does not.
- Research the "Flip Feng Shui" (2016) attack. Explain how it combines rowhammer with KSM memory deduplication to corrupt an RSA public key held in a victim VM's memory.
References
- Intel. Intel 64 and IA-32 SDM, Vol. 3C, Chapter 28: VMX Support for Address Translation (EPT).
- Bhargava, R. et al. (2008). "Accelerating Two-Dimensional Page Walks for Virtualized Systems." ASPLOS 2008.
- Waldspurger, C. (2002). "Memory Resource Management in VMware ESX Server." OSDI 2002. (Balloon driver, KSM precursor).
- Arcangeli, A. (2009). "Transparent Hugepages: Moving THP to madvise or always." Linux Kernel Mailing List.
- Kim, Y. et al. (2014). "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors." ISCA 2014. (Rowhammer).
- AMD. "AMD64 Architecture Programmer's Manual, Vol. 2", Section 15.25: Nested Paging.