03 — Memory Virtualization

Prerequisites

Virtual memory fundamentals: page tables, TLB, page faults, CR3 register
x86 paging: 4-level page table walk (PML4 → PDPT → PD → PT → physical page)
KVM architecture: VMCS, VMEXIT, VMX non-root mode
Virtualization fundamentals: guest physical address vs host physical address distinction

Historical Context

Memory virtualization was the hardest part of the x86 virtualization problem. On architectures like IBM System/370, the hardware had explicit support for a two-level address translation hierarchy. On x86, the MMU had exactly one job: translate virtual addresses to physical addresses using the page tables pointed to by CR3. There was no provision for a second level of translation.

Early hypervisors like VMware (1999) solved this with shadow page tables: an entirely software-maintained extra level of mapping. This worked but was expensive, complex, and a persistent source of bugs. AMD added Nested Page Tables (NPT) in Barcelona (2007), and Intel followed with Extended Page Tables (EPT) in the Nehalem microarchitecture (2008; Xeon 5500 series in 2009). These hardware features are now standard on server CPUs.

The result was dramatic: EPT/NPT eliminated shadow page table maintenance, cutting typical overhead for memory-intensive workloads from roughly 10–30% to a few percent.

The Memory Virtualization Challenge

When a VM runs, there are three distinct address spaces:

Guest Virtual Address (gVA): what guest userspace and kernel software uses
Guest Physical Address (gPA): what the guest OS believes is physical memory (its page tables map gVA → gPA)
Host Physical Address (hPA): actual DRAM addresses on the host machine

The guest's page tables map gVA → gPA, but the CPU's MMU must produce an hPA as the final result of a page walk. The hypervisor must arrange that when the MMU translates a guest address, the hardware ultimately delivers the correct hPA for the gPA the guest intended.

Address Space Layers:

Guest Process
  |  gVA (e.g., 0x7fff1000)
  |
  v  [Guest OS page tables]
Guest OS
  |  gPA (e.g., 0x04000000)  <-- guest thinks this is physical
  |
  v  [Hypervisor mapping: gPA → hPA]
Hypervisor
  |  hPA (e.g., 0xA3000000)  <-- actual DRAM location
  |
  v
Physical Memory
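
The layering above can be sketched as two chained lookups. This is only an illustration: the dictionaries stand in for real page tables, and the names `translate`, `guest_page_table`, and `hypervisor_map` are made up for this example, using the sample addresses from the diagram.

```python
PAGE_SIZE = 4096

def translate(addr, mapping):
    """Translate an address through a page-granular map {page number: page number}."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in mapping:
        raise KeyError(f"no mapping for address {addr:#x}")
    return mapping[page] * PAGE_SIZE + offset

# Hypothetical mappings built from the example addresses in the diagram.
guest_page_table = {0x7fff1000 // PAGE_SIZE: 0x04000000 // PAGE_SIZE}  # gVA page -> gPA page
hypervisor_map = {0x04000000 // PAGE_SIZE: 0xA3000000 // PAGE_SIZE}    # gPA page -> hPA page

def gva_to_hpa(gva):
    gpa = translate(gva, guest_page_table)  # stage 1: guest OS page tables
    return translate(gpa, hypervisor_map)   # stage 2: hypervisor mapping

print(hex(gva_to_hpa(0x7fff1234)))  # prints 0xa3000234
```

The page offset (0x234 here) passes through both stages unchanged; only the page number is remapped.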

Approach 1: Shadow Page Tables

Used by VMware and early KVM before hardware support existed, and still used by KVM as a fallback when EPT/NPT is unavailable.

Concept

The hypervisor maintains a shadow page table for each guest page table. The shadow PT maps gVA → hPA directly (skipping the gPA level). The CPU's CR3 actually points to the shadow PT, not the guest PT. The guest PT is only consulted by the hypervisor to keep the shadow PT in sync.

Shadow Page Table Mechanism:

  Guest CR3 (as seen by guest)          Shadow PT (what CPU really uses)
  points to guest PML4                  Shadow PML4 → Shadow PDPT → ...
       |                                       |
       v                                       v
  Guest PML4                            Shadow PML4
  Guest PML4[i] = gPA of PDPT          Shadow PML4[i] = hPA of shadow PDPT
       |                                       |
       v                                       v
  Guest PDPT                            Shadow PDPT
  Guest PDPT[j] = gPA of PD             Shadow PDPT[j] = hPA of shadow PD
       |                                       |
       v                                       v
  Guest PD / PT                         Shadow PD / PT
  Guest PT[k] = gPA of page            Shadow PT[k] = hPA of page (final!)

Keeping Shadow PTs Consistent

The hard part: when the guest modifies its page tables, the shadow PT must be updated. The hypervisor does this by:

Write-protecting guest page tables: the hypervisor marks the guest PT pages as read-only in the shadow PT. Any guest write to a PT page causes a page fault VMEXIT. KVM decodes the write, emulates it on the guest's behalf, and updates the corresponding shadow PT entry.
Flushing on CR3 load: when the guest loads a new CR3 (context switch), KVM must flush or switch the shadow PT. This is expensive — context switches cost 5,000–50,000 ns extra.
TLB invalidation: when the guest executes INVLPG or writes CR3, KVM must invalidate corresponding shadow PT entries and TLB entries.
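
The three synchronization paths can be modeled in a few lines. This is a toy sketch, not KVM's implementation: page tables are flat dictionaries keyed by page number, and the class and method names (`ShadowMMU`, `guest_write_pte`, and so on) are invented for illustration.

```python
class ShadowMMU:
    """Toy model of shadow paging: the CPU only ever sees shadow_pt."""

    def __init__(self, gpa_to_hpa):
        self.gpa_to_hpa = gpa_to_hpa  # hypervisor's map: gPA page -> hPA page
        self.guest_pt = {}            # guest's view:     gVA page -> gPA page
        self.shadow_pt = {}           # used by the CPU:  gVA page -> hPA page
        self.traps = 0                # trapped guest page-table writes

    def guest_write_pte(self, gva_page, gpa_page):
        """Guest PT pages are write-protected, so every PTE write traps here."""
        self.traps += 1
        self.guest_pt[gva_page] = gpa_page                    # emulate the guest's write
        self.shadow_pt[gva_page] = self.gpa_to_hpa[gpa_page]  # keep the shadow entry in sync

    def guest_invlpg(self, gva_page):
        """INVLPG: drop the shadow entry; it is rebuilt on the next access."""
        self.shadow_pt.pop(gva_page, None)

    def guest_load_cr3(self, new_guest_pt):
        """CR3 load: switch guest address space and discard the old shadow PT."""
        self.guest_pt = dict(new_guest_pt)
        self.shadow_pt = {}

    def access(self, gva_page):
        """Memory access through the shadow PT; misses are filled from the guest PT."""
        if gva_page not in self.shadow_pt:
            gpa_page = self.guest_pt[gva_page]  # KeyError models a genuine guest page fault
            self.shadow_pt[gva_page] = self.gpa_to_hpa[gpa_page]
        return self.shadow_pt[gva_page]

mmu = ShadowMMU({0x4000: 0xA3000})
mmu.guest_write_pte(0x7fff1, 0x4000)
print(hex(mmu.access(0x7fff1)), mmu.traps)  # prints 0xa3000 1
```

This toy version discards the whole shadow PT on a CR3 load, the simplest of the "flush or switch" options; a real hypervisor would try to keep shadow tables cached per guest address space.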

Shadow PT Performance Problems

Context switch overhead: every guest context switch (loading a new CR3) requires building or switching shadow PTs. O(1) amortized, but with cold cache penalty.
Write fault overhead: every guest PT write traps to KVM. For workloads with frequent mmap/munmap (web servers, JVMs), this is very expensive.
Memory overhead: each guest PT page requires a shadow PT page — memory usage doubles.
Complexity: roughly 5,000 lines of subtle, bug-prone code in KVM, which remains in the tree as the fallback path and for nested virtualization.

Approach 2: Hardware Extended Page Tables (EPT) / Nested Page Tables (NPT)

Intel EPT

Intel EPT (part of VT-x, available since Nehalem, 2008) adds a second hardware page table structure. The CPU performs a two-level page walk:

Walk the guest page tables (pointed to by guest CR3, stored in VMCS Guest CR3 field) to translate gVA → gPA
Walk the EPT (pointed to by the EPT Pointer field in VMCS) to translate gPA → hPA

Both walks happen entirely in hardware, without hypervisor involvement for normal memory accesses.

EPT Two-Level Page Walk:

  Every gPA produced during the guest walk is itself translated by an EPT walk.

  Guest CR3 (gPA of guest PML4) ────────────────────────> EPT walk ──> hPA of guest PML4
                                                                         │
                                                                         v
  gVA bits [47:39] ──> Guest PML4 entry = gPA of PDPT ──> EPT walk ──> hPA of guest PDPT
                                                                         │
                                                                         v
  gVA bits [38:30] ──> Guest PDPT entry = gPA of PD ────> EPT walk ──> hPA of guest PD
                                                                         │
                                                                         v
  gVA bits [29:21] ──> Guest PD entry = gPA of PT ──────> EPT walk ──> hPA of guest PT
                                                                         │
                                                                         v
  gVA bits [20:12] ──> Guest PT entry = gPA of page ────> EPT walk ──> hPA of data page
                                                                         │
                                                                         v
  gVA bits [11:0]  ──> byte offset ──────────────────────────────────> hPA of data page + offset
                                                                       = physical byte
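
To make the cost of this walk concrete, the sketch below splits a gVA into its table indices and counts the memory references a TLB miss can trigger. It assumes 4-level guest tables, a 4-level EPT, 4KB pages, and no paging-structure caches; the function names are illustrative only.

```python
def split_gva(gva):
    """Split a 48-bit guest virtual address into four 9-bit indices and a 12-bit offset."""
    return {
        "pml4": (gva >> 39) & 0x1FF,
        "pdpt": (gva >> 30) & 0x1FF,
        "pd": (gva >> 21) & 0x1FF,
        "pt": (gva >> 12) & 0x1FF,
        "offset": gva & 0xFFF,
    }

def nested_walk_refs(guest_levels=4, ept_levels=4):
    """Count page-table memory references for one TLB miss (worst case, nothing cached)."""
    ept_refs = 0
    guest_refs = 0
    for _ in range(guest_levels):
        ept_refs += ept_levels  # translate the gPA of this guest table to an hPA
        guest_refs += 1         # read the guest entry at that hPA
    ept_refs += ept_levels      # translate the final gPA of the data page
    return {"ept": ept_refs, "guest": guest_refs, "total": ept_refs + guest_refs}

print(split_gva(0x7fff1000))
print(nested_walk_refs())  # prints {'ept': 20, 'guest': 4, 'total': 24}
```

Five guest-physical addresses need translating (CR3 plus four table entries), so the EPT alone accounts for 20 references; adding the four guest entry reads gives 24, compared with 4 for a native walk.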

EPT Entry Format

Each EPT entry is 64 bits:

Bits 0-2: Read/Write/Execute permissions
Bits 3-5: Memory type (WB, UC, WT, etc.) in entries that map a page
Bit 6: Ignore PAT memory type
Bit 7: Large page (maps 1GB if set in a PDPTE, 2MB if set in a PDE)
Bit 8: EPT accessed flag (if accessed/dirty flags are enabled in the EPT Pointer)
Bit 9: EPT dirty flag (same condition; only in entries that map a page)
Bits 12-51: Physical address of next-level table or final page
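
A small decoder shows how the low-order fields and the address field are extracted. This is a sketch covering only the permission, memory type, ignore-PAT, large-page, and address fields; `decode_ept_entry` is a hypothetical helper, and the memory type encodings are the standard x86 values (0 = UC, 1 = WC, 4 = WT, 5 = WP, 6 = WB).

```python
MEMORY_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}
ADDR_MASK = 0x000F_FFFF_FFFF_F000  # bits 12-51

def decode_ept_entry(entry, is_leaf=True):
    """Decode selected fields of a 64-bit EPT entry into a dict."""
    fields = {
        "read": bool(entry & 0x1),
        "write": bool(entry & 0x2),
        "execute": bool(entry & 0x4),
        "large_page": bool((entry >> 7) & 0x1),
        "phys_addr": entry & ADDR_MASK,
    }
    if is_leaf:
        # Memory type and ignore-PAT are only meaningful in entries that map a page.
        fields["memory_type"] = MEMORY_TYPES.get((entry >> 3) & 0x7, "reserved")
        fields["ignore_pat"] = bool((entry >> 6) & 0x1)
    return fields

# Example: a 4KB leaf mapping to hPA 0xA3000000, write-back, read/write/execute allowed.
entry = 0xA3000000 | (6 << 3) | 0x7
print(decode_ept_entry(entry))
```

An entry whose three permission bits are all zero decodes with read, write, and execute all false; hardware treats such an entry as not present.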

EPT Violation VMEXIT

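When the hardware walks the EPT and finds an entry that is not present (read, write, and execute permission bits all clear, meaning the gPA is not yet mapped) or that does not permit the attempted access, it generates an EPT violation VMEXIT (exit reason 48). KVM then: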

Reads the faulting gPA from the VMCS guest-physical address field (the exit qualification describes the type of access)
Allocates a host physical page (via standard Linux page allocator)
Maps it into the EPT at the correct gPA
Issues VMRESUME

This is analogous to a regular page fault, but at the gPA → hPA level.

EPT Violation Handling:

  Guest accesses gVA 0x7fff1000
       |
       | Hardware walks guest PT: gVA → gPA = 0x04000000
       |
       | Hardware walks EPT: gPA 0x04000000 → not mapped!
       |
       v
  EPT Violation VMEXIT (reason=48)
       |
       | KVM reads the VMCS guest-physical address field: gPA = 0x04000000
       | KVM calls kvm_mmu_page_fault()
       | Linux allocates hPA page (e.g., 0xA3000000)
       | KVM writes EPT entry: 0x04000000 → 0xA3000000 (R/W/X)
       | VMRESUME
       |
       v
  Guest continues: gVA 0x7fff1000 → gPA 0x04000000 → hPA 0xA3000000
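
The demand-population flow above can be sketched as follows. This is a simplified model, not KVM's code path: the EPT is a dictionary, the host allocator is a counter that starts at an arbitrary frame, and `EptSketch` and its methods are hypothetical names.

```python
PAGE_SIZE = 4096

class EptSketch:
    """Toy EPT that is populated lazily, one 4KB page per violation."""

    def __init__(self, guest_ram_bytes, first_free_hpa_page=0xA3000):
        self.guest_pages = guest_ram_bytes // PAGE_SIZE
        self.ept = {}                             # gPA page -> hPA page
        self.next_free_hpa_page = first_free_hpa_page  # stand-in for the host allocator
        self.violations = 0

    def handle_ept_violation(self, gpa):
        """Model of the handler: validate the gPA, allocate a host page, install the mapping."""
        gpa_page = gpa // PAGE_SIZE
        if gpa_page >= self.guest_pages:
            raise ValueError(f"gPA {gpa:#x} is outside guest RAM")
        self.violations += 1
        self.ept[gpa_page] = self.next_free_hpa_page
        self.next_free_hpa_page += 1

    def access(self, gpa):
        """Guest access by gPA; an unmapped page triggers the handler, then the access retries."""
        gpa_page, offset = divmod(gpa, PAGE_SIZE)
        if gpa_page not in self.ept:
            self.handle_ept_violation(gpa)  # VM exit, handler runs, guest resumes
        return self.ept[gpa_page] * PAGE_SIZE + offset

vm = EptSketch(guest_ram_bytes=256 * 1024 * 1024)
print(hex(vm.access(0x04000000)), vm.violations)  # prints 0xa3000000 1
print(hex(vm.access(0x04000010)), vm.violations)  # prints 0xa3000010 1
```

Only the first touch of a page pays for the exit; later accesses to the same page are translated without hypervisor involvement.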

Shadow PT vs EPT Comparison

Aspect	Shadow Page Tables	EPT/NPT
CPU requirement	Any VMX-capable CPU	VT-x + EPT (Nehalem+)
Page walk depth	4 levels (gVA→hPA)	4+4 levels (gVA→gPA→hPA)
Hardware TLB entries	Tagged by VPID (Intel) or ASID (AMD)	Tagged by VPID + EPT pointer (Intel) or ASID (AMD)
Context switch cost	High (shadow PT rebuild/flush)	Low (just CR3 load)
Guest PT write cost	VMEXIT per write	None (no write-protection)
TLB miss penalty	Single walk (up to 4 memory accesses)	Two nested walks (up to 24 memory accesses)
Memory overhead	2× page-table memory (shadow copies)	~0.2% of guest RAM for EPT tables at 4KB granularity
Typical overhead vs native	10-30% (compute/memory)	1-5%

EPT Large Pages

EPT supports 2MB and 1GB large pages (similar to host hugepages). Benefits:

Reduce EPT walk depth by eliminating PT-level entries
Reduce TLB pressure (fewer TLB entries needed for the same address space)
Critical for workloads like databases (a 1GB page maps an entire large buffer in one EPT entry)

KVM uses EPT large pages when the host uses transparent hugepages (THP) and the guest's gPA range can be mapped with a 2MB-aligned hPA region.
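
The alignment condition can be expressed as a simple check. This sketch ignores everything else a real hypervisor must verify (memory slot boundaries, host page size, permissions); `largest_mapping` and `entries_needed` are hypothetical helpers.

```python
SIZE_4K = 1 << 12
SIZE_2M = 1 << 21
SIZE_1G = 1 << 30

def largest_mapping(gpa, hpa, length):
    """Return the largest page size usable at gpa: both addresses aligned, region long enough."""
    for size in (SIZE_1G, SIZE_2M):
        if gpa % size == 0 and hpa % size == 0 and length >= size:
            return size
    return SIZE_4K

def entries_needed(region_bytes, page_size):
    """Leaf entries (and, roughly, TLB entries) needed to cover a region."""
    return -(-region_bytes // page_size)  # ceiling division

print(largest_mapping(0x04000000, 0xA3000000, 4 << 20) == SIZE_2M)  # prints True
for size in (SIZE_4K, SIZE_2M, SIZE_1G):
    print(size, entries_needed(1 << 30, size))
```

Covering 1GB takes 262144 entries at 4KB, 512 at 2MB, and 1 at 1GB, which is where the TLB pressure reduction comes from.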

Memory Overcommit and Reclamation

Hypervisors routinely promise guests more RAM in total than the host physically has. Three mechanisms handle overcommit:

1. Balloon Driver

The balloon driver is a guest OS driver (e.g., virtio_balloon in Linux) that can be inflated or deflated by the hypervisor:

Inflate: hypervisor tells guest balloon driver to allocate N MB of guest memory. Guest kernel allocates these pages (using its own MM), marks them as "balloon pages," and tells the hypervisor their gPAs. Hypervisor can then reclaim the backing hPA pages for other uses.
Deflate: hypervisor tells balloon driver to release pages. Driver frees them, making them available to the guest OS again.

Balloon Driver Operation:

  Guest Memory (8GB configured):
  +--[guest app pages]--+--[balloon pages]--+--[free guest pages]--+
       6 GB available          2 GB inflated
                                    |
                                    v
  Host: 2 GB of hPA freed for other VMs (reclaimed by the hypervisor)

Advantage: guest OS participates voluntarily, no data loss. Disadvantage: requires guest cooperation (driver); slow to react.
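
The accounting behind inflate and deflate is simple enough to sketch. This models only the bookkeeping, not the virtio protocol; `BalloonSketch` is a hypothetical class and sizes are in MB.

```python
class BalloonSketch:
    """Toy balloon accounting for one guest."""

    def __init__(self, configured_mb):
        self.configured_mb = configured_mb  # RAM the guest was configured with
        self.balloon_mb = 0                 # guest memory currently held by the balloon

    def inflate(self, mb):
        """Guest hands pages to the balloon; returns how much the host can reclaim."""
        mb = min(mb, self.configured_mb - self.balloon_mb)
        self.balloon_mb += mb
        return mb

    def deflate(self, mb):
        """Balloon releases pages back to the guest; returns how much was returned."""
        mb = min(mb, self.balloon_mb)
        self.balloon_mb -= mb
        return mb

    @property
    def guest_usable_mb(self):
        return self.configured_mb - self.balloon_mb

guest = BalloonSketch(configured_mb=8192)
reclaimed = guest.inflate(2048)
print(reclaimed, guest.guest_usable_mb)  # prints 2048 6144
```

A real guest cannot give up all of its memory; the clamp to `configured_mb` here is only an upper bound, and inflating anywhere near it would trigger the guest's own OOM killer.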

2. KSM — Kernel Same-Page Merging

KSM is a Linux kernel feature (merged in 2.6.32, 2009) that scans memory regions registered as mergeable (QEMU registers guest RAM with madvise(MADV_MERGEABLE)) for identical pages and merges them using Copy-on-Write:

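KSM daemon (ksmd) scans the registered pages in host memory
Computes a checksum of each candidate page and skips pages whose contents changed since the previous scan
Looks each stable page up by content in its search trees, comparing candidates byte-by-byte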
If identical: unmap both pages, map both virtual addresses to one shared hPA (read-only)
On guest write: CoW fault triggers, duplicate page created, write proceeds

KSM Deduplication:

Before KSM:                    After KSM:
VM1 gPA 0x1000 → hPA A         VM1 gPA 0x1000 ─┐
VM2 gPA 0x1000 → hPA B         VM2 gPA 0x1000 ─┴──> hPA A (RO, shared)
(both contain "zero page")     hPA B freed
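
The merge and copy-on-write behavior can be modeled with reference-counted frames. This is a toy sketch: a Python dictionary keyed by page content stands in for KSM's real content lookup, and `KsmSketch` and its methods are invented names.

```python
PAGE_SIZE = 4096

class KsmSketch:
    """Toy model of same-page merging with copy-on-write."""

    def __init__(self):
        self.frames = {}     # host frame id -> page content (bytes)
        self.refcount = {}   # host frame id -> number of guest mappings
        self.mappings = {}   # (vm, gPA page) -> host frame id
        self.next_frame = 0

    def _alloc(self, content):
        frame = self.next_frame
        self.next_frame += 1
        self.frames[frame] = content
        self.refcount[frame] = 1
        return frame

    def map_page(self, vm, gpa_page, content):
        self.mappings[(vm, gpa_page)] = self._alloc(content)

    def scan(self):
        """One scan pass: point identical pages at a single frame, free the duplicates."""
        by_content = {}
        merged = 0
        for key, frame in list(self.mappings.items()):
            keeper = by_content.setdefault(self.frames[frame], frame)
            if keeper != frame:
                self.mappings[key] = keeper
                self.refcount[keeper] += 1
                self.refcount[frame] -= 1
                if self.refcount[frame] == 0:
                    del self.frames[frame]
                    del self.refcount[frame]
                merged += 1
        return merged

    def write(self, vm, gpa_page, content):
        """Guest write: a shared frame is copied first (CoW), a private one is updated in place."""
        key = (vm, gpa_page)
        frame = self.mappings[key]
        if self.refcount[frame] > 1:
            self.refcount[frame] -= 1
            self.mappings[key] = self._alloc(content)
        else:
            self.frames[frame] = content

ksm = KsmSketch()
zero = bytes(PAGE_SIZE)
ksm.map_page("VM1", 0x1, zero)
ksm.map_page("VM2", 0x1, zero)
print(ksm.scan(), len(ksm.frames))  # prints 1 1
ksm.write("VM2", 0x1, b"\x01" * PAGE_SIZE)
print(len(ksm.frames))              # prints 2
```

After the write, VM1 still sees the zero page and VM2 has a private copy, matching the CoW step in the list above.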

Typical savings: 30–50% memory reduction when running many similar VMs (e.g., 50 Ubuntu VMs with identical kernels)
Zero page (all-zero 4KB page) is the most commonly merged: almost every process has large BSS/stack zero regions
Performance cost: ksmd scanning uses CPU, and the first write to a merged page takes an extra CoW fault. Deduplication also opens timing side channels: an attacker can detect merged pages by write latency, and shared pages enable cross-VM Flush+Reload cache attacks.
KSM configuration: /sys/kernel/mm/ksm/

3. Memory Swap / Ballooning Under Pressure

When the hypervisor is critically low on memory:

Aggressively inflate the balloon in all VMs
If still short: swap guest physical pages to disk (the host swaps out the pages backing the EPT mappings). The guest sees high latency but no visible swap activity.
Last resort: OOM-kill a QEMU process (the entire VM dies)

VPID — Virtual Processor ID

Without VPID, TLB flushes on every VMENTER/VMEXIT are required (because the new address space — host or guest — may have conflicting TLB entries with the old one). VPID (VT-x feature) tags TLB entries with a virtual processor ID:

Host entries tagged with VPID=0
Each guest vCPU assigned a unique VPID (1..65535)
TLB entries from VPID=1 are invisible to VPID=2 and to VPID=0
On VMENTER/VMEXIT: no global TLB flush needed, just activate the appropriate VPID

VPID reduces VMENTER/VMEXIT cost by ~500 ns on TLB-intensive workloads.
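
The effect of tagging can be seen in a toy TLB model. This is a sketch only: `TaggedTlb` is a hypothetical class, the TLB is an unbounded dictionary, and every VM entry or exit is modeled as a call to `switch_to`.

```python
class TaggedTlb:
    """Toy TLB whose entries are optionally tagged with a VPID (0 = host)."""

    def __init__(self, use_vpid):
        self.use_vpid = use_vpid
        self.entries = {}      # (tag, virtual page) -> physical page
        self.current_vpid = 0
        self.flushes = 0

    def switch_to(self, vpid):
        """Model a VM entry (vpid > 0) or VM exit (vpid == 0)."""
        if not self.use_vpid:
            self.entries.clear()  # untagged entries could alias, so everything must go
            self.flushes += 1
        self.current_vpid = vpid

    def _tag(self):
        return self.current_vpid if self.use_vpid else 0

    def fill(self, vpage, ppage):
        self.entries[(self._tag(), vpage)] = ppage

    def lookup(self, vpage):
        return self.entries.get((self._tag(), vpage))

for use_vpid in (True, False):
    tlb = TaggedTlb(use_vpid)
    tlb.switch_to(1)        # VM entry for vCPU with VPID 1
    tlb.fill(0x10, 0xA3)
    tlb.switch_to(0)        # VM exit to the host
    tlb.switch_to(1)        # VM entry again
    print(use_vpid, tlb.lookup(0x10), tlb.flushes)
```

With tagging, the guest's entry survives the round trip through the host and no flush occurs; without it, the lookup misses after re-entry and three flushes have been paid.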

Production Examples

JVM workloads in VMs: JVMs allocate large heaps with many dirty pages. Without EPT large pages, EPT walks for heap accesses add 10–15% overhead. Enabling transparent hugepages on the host (/sys/kernel/mm/transparent_hugepage/enabled = always) combined with EPT large page support reduces JVM overhead in VMs to ~2–3%.

Memory-intensive databases (Redis, Memcached): These can benefit from KSM on hosting fleets when instances hold similar data; actual savings depend on how many pages are byte-identical across VMs.

Windows VM on KVM: Windows does not use the zero page efficiently; KSM savings from Windows VMs are typically lower (15–25%) than Linux VMs (40–60%).

Security Implications

Rowhammer: repeatedly accessing (hammering) a DRAM row can flip bits in physically adjacent rows. In VMs, an attacker VM can potentially flip bits in the hypervisor's or another VM's memory. Mitigations: ECC RAM, TRR (Target Row Refresh) in DDR4, and guard rows between the host physical ranges assigned to different VMs.
Side-channel via KSM: an attacker can detect whether a page has been deduplicated by measuring write latency (a CoW write is slower than a write to an unshared page). By crafting candidate pages, the attacker can test whether a page with specific contents exists in another VM. Mitigated by disabling KSM for security-sensitive VMs.
EPT misconfiguration: if an EPT entry has a physical address pointing into the hypervisor's own memory, a guest can read/write hypervisor data. KVM validates EPT entries but this was a source of early CVEs.
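Speculative address translation: L1TF exploited the fact that when a page table entry is marked not present, the CPU may speculatively use the entry's physical address bits to read the L1 data cache, bypassing the EPT translation. A malicious guest can craft such entries in its own page tables to leak host data cached in L1. Mitigated by flushing the L1 data cache before VM entry and by restricting or disabling SMT so that untrusted guests do not share a core with other contexts.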

Debugging Notes

# KVM MMU statistics (debugfs, requires root)
cat /sys/kernel/debug/kvm/*/mmu_pte_write        # trapped guest PT writes (shadow paging)
cat /sys/kernel/debug/kvm/*/mmu_cache_miss        # MMU page cache misses (new shadow/EPT table pages built)

# KSM statistics
cat /sys/kernel/mm/ksm/pages_shared       # pages currently shared
cat /sys/kernel/mm/ksm/pages_sharing      # how many use shared pages
cat /sys/kernel/mm/ksm/pages_unshared     # scanned, not merged

# Guest balloon driver status (inside guest)
cat /sys/kernel/debug/virtio-balloon/balloon_num_pages

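# Check that EPT is enabled, then look at large page usage
cat /sys/module/kvm_intel/parameters/ept
grep . /sys/kernel/debug/kvm/*/pages_2m /sys/kernel/debug/kvm/*/pages_1g   # stat names vary by kernel version; older kernels expose a single largepages counter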

# Count MMU page-table page lookups with perf (the tracepoint lives in the kvmmmu subsystem)
perf stat -a -e kvmmmu:kvm_mmu_get_page sleep 10
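
The KSM counters above can be turned into a rough savings estimate. The sketch below assumes the usual interpretation that `pages_sharing` counts the extra mappings that would otherwise each need their own page; `ksm_saved_bytes` is a hypothetical helper, and the directory is a parameter so it can be pointed at a copy of the files.

```python
import os
from pathlib import Path

def ksm_saved_bytes(ksm_dir="/sys/kernel/mm/ksm", page_size=None):
    """Estimate memory saved by KSM from its sysfs counters."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")

    def read_counter(name):
        return int((Path(ksm_dir) / name).read_text().strip())

    shared = read_counter("pages_shared")    # merged pages actually kept in memory
    sharing = read_counter("pages_sharing")  # additional mappings pointing at them
    return {
        "pages_shared": shared,
        "pages_sharing": sharing,
        "saved_bytes": sharing * page_size,
        "sharing_ratio": sharing / shared if shared else 0.0,
    }
```

Calling `ksm_saved_bytes()` on a host with KSM running returns the estimate in bytes; a high `sharing_ratio` means each kept page is standing in for many duplicates.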

Failure Modes

EPT table exhaustion: with millions of small guest pages, EPT tables themselves consume significant memory. 512 GB of guest physical address space at 4KB granularity requires 134 million EPT entries × 8 bytes = ~1 GB just for EPT tables.
Balloon overinflation: hypervisor inflates balloon too aggressively; guest OOM-kills critical processes. Monitor OOM-killer activity inside VMs (kernel log messages, or the oom_kill counter in /proc/vmstat).
KSM scan storm: ksmd scanning all VM memory with high pages_to_scan setting consumes significant CPU and causes cache pollution. Use sleep_millisecs to rate-limit.
THP fragmentation: host THP allocated for EPT large pages can fragment over time; khugepaged needs to run periodically to re-compact.
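
The EPT table exhaustion figure can be checked with a short calculation. This sketch assumes a fully populated, contiguous guest physical range mapped with 4KB pages and 512-entry tables at every level; `ept_table_pages` is a hypothetical helper.

```python
def ept_table_pages(guest_bytes, entries_per_table=512):
    """Return (leaf entries, table pages per level from PT up to PML4) for 4KB mappings."""
    leaf_entries = -(-guest_bytes // 4096)  # one leaf entry per 4KB guest page
    pages_per_level = []
    entries = leaf_entries
    for _ in range(4):                      # PT, PD, PDPT, PML4
        tables = -(-entries // entries_per_table)
        pages_per_level.append(tables)
        entries = tables                    # each table needs one entry in the level above
    return leaf_entries, pages_per_level

leaf, levels = ept_table_pages(512 << 30)
total_bytes = sum(levels) * 4096
print(leaf, levels, total_bytes)
# prints 134217728 [262144, 512, 1, 1] 1075847168
```

Almost all of the roughly 1 GB is in the leaf-level tables; the upper three levels add only about 2 MB. Mapping the same range with 2MB pages would drop the leaf level entirely.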

Modern Usage and Future Directions

AMD SEV (Secure Encrypted Virtualization): encrypts each VM's guest physical memory with a unique encryption key. The nested page tables map encrypted pages; even the hypervisor reading the hPA sees ciphertext. Requires AMD EPYC processors (Zen 1+).

Intel TDX (Trust Domain Extensions): similar in goal to SEV, with hardware-rooted attestation. The CPU maintains a "Trust Domain Control Structure" analogous to VMCS. Accesses to a TD's private memory from outside the TD, including from the hypervisor, are hardware-blocked.

Virtual NUMA (vNUMA): large VMs expose NUMA topology to the guest so the guest OS can make NUMA-aware allocation decisions, reducing cross-NUMA memory traffic. KVM supports this via QEMU -numa flags; the host binds each vNUMA node's backing memory to a host NUMA node so that the EPT maps NUMA-local hPA for it.

Memory Tagging: ARM MTE (Memory Tagging Extension) in VMs allows guest programs to use hardware memory safety with minimal overhead. The hypervisor must preserve tags for guest memory mapped through the Stage-2 translation (ARM's counterpart to EPT).

Exercises

Calculate: a 128 GB guest with 4KB pages requires how many EPT leaf entries? How many pages of EPT tables at 4 levels?
Enable KSM on a Linux system (echo 1 > /sys/kernel/mm/ksm/run) and launch 5 QEMU VMs from the same image. Monitor pages_sharing over 60 seconds. Calculate the memory saved.
Write a script that measures the time to write to a freshly allocated page vs a KSM-merged page (hint: read a page, then write to it; the CoW fault is the difference).
Explain why shadow page tables require write-protecting guest PT pages, but EPT does not.
Research the "Flip Feng Shui" (2016) attack. Explain how it combines rowhammer with KSM page deduplication to corrupt an RSA public key in a victim VM.

References

Intel. Intel 64 and IA-32 SDM, Vol. 3C, Chapter 28: VMX Support for Address Translation (EPT).
Bhargava, R. et al. (2008). "Accelerating Two-Dimensional Page Walks for Virtualized Systems." ASPLOS 2008.
Waldspurger, C. (2002). "Memory Resource Management in VMware ESX Server." OSDI 2002. (Balloon driver, KSM precursor).
Arcangeli, A. (2009). "Transparent Hugepages: Moving THP to madvise or always." Linux Kernel Mailing List.
Kim, Y. et al. (2014). "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors." ISCA 2014. (Rowhammer).
AMD. "AMD64 Architecture Programmer's Manual, Vol. 2", Section 15.25: Nested Paging.