08 — Nested Virtualization
Prerequisites
- KVM architecture: VMCS, VMENTER/VMEXIT, VMX root/non-root modes
- Memory virtualization: EPT, two-level page table walks
- Intel VT-x: VMLAUNCH/VMRESUME instructions, VMEXIT reasons
- Virtualization fundamentals: hypervisor types, full virtualization
Historical Context
Nested virtualization, running a hypervisor inside a virtual machine, was long considered an academic curiosity. The performance overhead of double VMEXITs made it impractical. The Turtles Project (2010) showed that nesting could be implemented efficiently in software on existing VT-x hardware. Dedicated hardware assists such as Intel VMCS shadowing followed later, but production software support lagged.
The practical need emerged from the cloud era. Developers wanted to test hypervisor code in cloud VMs. VMware-on-cloud scenarios (running vSphere inside AWS EC2) became a real product request. Kubernetes-in-VM for CI/CD pipelines needed either containers or VMs for isolation. These use cases drove investment in making nested virtualization fast and correct.
KVM support for nested virtualization (nested=1 for kvm_intel / kvm_amd) reached production quality around 2012–2015. Today it is widely used: AWS bare-metal instances support full nesting, GCP offers --enable-nested-virtualization, and most cloud providers support it on at least some instance types. Azure supports nested virtualization on all Dv3/Ev3 and newer series.
Nested Virtualization Terminology
The three-level hierarchy uses standardized naming:
Nested Virtualization Levels:
+---------------------------------+
| L2 Guest (Guest-of-Guest) |
| "innermost VM" |
| e.g., an app inside a nested VM|
+---------------------------------+
| L1 Guest Hypervisor |
| "guest hypervisor" |
| e.g., KVM running inside a VM |
| or VMware ESXi in cloud |
+---------------------------------+
| L0 Host Hypervisor |
| "real hypervisor" |
| e.g., KVM on the physical host |
+---------------------------------+
| Physical Hardware |
+---------------------------------+
- L0: the hypervisor running on bare metal. It controls all hardware.
- L1: a guest VM running a hypervisor. L1 believes it has full hardware access but is actually virtualized by L0.
- L2: a VM created by L1. L2 is a VM-of-a-VM — doubly virtualized.
Use Cases
1. Testing Hypervisors in the Cloud
Kernel developers testing KVM patches run their test KVM hypervisor (L1) inside a cloud VM (L0 = cloud provider's hypervisor). The test hypervisor creates test VMs (L2). Without nested support, this requires a dedicated bare-metal machine.
2. VMware on Cloud (VMware Cloud on AWS)
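VMware vSphere (ESXi) is deployed on AWS EC2 bare-metal instances so that customers can run their existing vSphere VMs unchanged. On bare metal, ESXi runs directly on the hardware rather than under an AWS hypervisor, so ESXi takes the L0 role and the customer's VMs are ordinary guests. The truly nested variant, ESXi as L1 inside a cloud VM, is used mainly for labs and testing. Either way, the goal is migrating on-premises VMware workloads to the cloud without converting VM images.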
3. Kubernetes in VMs for CI
CI systems (GitHub Actions, Jenkins) create ephemeral VMs per build job. Container workloads inside those VMs (Docker-in-Docker, or kind, which runs Kubernetes nodes as containers) do not need nesting. Builds that must start real VMs, such as VM-based Kubernetes test clusters or device emulators, do: without nesting, nothing inside a CI VM can use KVM acceleration.
4. Cloud Gaming
Cloud gaming platforms run a Windows VM (L1) with DirectX/game support inside a GPU-partitioned host. Some architectures use nested virtualization for game state isolation — each game session is a nested VM.
5. Security Research
Hypervisor-level rootkit research requires a controlled environment. Researchers run a real OS as an L1 guest under KVM (L0) and then run malware that attempts to install hypervisor-level hooks beneath that OS. L0 allows complete monitoring of L1 and L2 without L1's knowledge.
Intel VT-x Nested Support: VMCS Shadowing
The core challenge: L1 (the guest hypervisor) wants to execute VMLAUNCH and VMRESUME to run L2. These are VMX instructions — they would normally cause a VMEXIT to L0 (the real hypervisor). Without nesting support, L0 would not know what to do with them.
Naive Approach (Pre-Hardware Assist)
Without hardware VMCS shadowing, L0 must fully emulate every VMX instruction that L1 executes:
- L1 executes VMWRITE: VMEXIT to L0. L0 notes the field/value in a shadow VMCS.
- L1 executes VMREAD: VMEXIT to L0. L0 returns the value from shadow VMCS.
- L1 executes VMLAUNCH: VMEXIT to L0. L0 merges L1's VMCS (desired L2 configuration) with L0's own requirements and creates a real VMCS for L2. L0 launches L2 itself.
- L2 causes a VMEXIT: Goes to L0 first (hardware). L0 determines if L1 should see this exit. If yes: L0 saves the exit info into L1's "virtual VMCS", then injects a VMEXIT event into L1 (synthetic exit delivery).
Every VMX instruction from L1 costs a VMEXIT to L0. For an L1 running 100 L2 VMs, each generating 10,000 VMEXITs/sec, L0 handles at least 1,000,000 nested VMEXITs/sec, before counting the additional traps caused by L1's own VMX instructions. This is a massive overhead.
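The arithmetic above can be checked with a short back-of-the-envelope sketch. The function name and the `vmx_instr_per_exit` parameter are illustrative assumptions, not measured values; the parameter models the extra traps caused by L1's own VMREAD/VMWRITE/VMRESUME instructions while it handles each L2 exit.

```python
def l0_exits_per_second(l2_vms: int, exits_per_vm: int, vmx_instr_per_exit: int = 0) -> int:
    """Estimate how many VMEXITs L0 must handle per second.

    Each L2 exit traps to L0 once. If L1 handles the exit, every VMX
    instruction it executes traps again when there is no hardware assist.
    """
    l2_exits = l2_vms * exits_per_vm
    return l2_exits * (1 + vmx_instr_per_exit)


# The figure from the text: 100 L2 VMs, 10,000 exits/sec each.
print(l0_exits_per_second(100, 10_000))      # 1000000
# Hypothetical: L1 issues 10 trapped VMX instructions per L2 exit.
print(l0_exits_per_second(100, 10_000, 10))  # 11000000
```

The second call shows why the per-instruction traps, rather than the L2 exits themselves, tend to dominate in the naive scheme.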
Intel VMCS Shadowing (Haswell, 2013)
Intel added VMCS shadowing to hardware to accelerate the most common path: VMREAD/VMWRITE from L1 do not need to VMEXIT to L0 if VMCS shadowing is enabled:
- L0 creates a shadow VMCS and points the VMCS link pointer in L1's VMCS to it
- When L1 executes VMREAD/VMWRITE, the CPU accesses the shadow VMCS directly without VMEXIT
- Read/write bitmaps control which fields are intercepted (still cause VMEXIT) vs passed through
VMLAUNCH/VMRESUME from L1 still VMEXIT to L0. L0 must:
1. Merge L1's shadow VMCS (L2 desired state) with L0's own execution controls
2. Create a merged VMCS for direct L2 execution
3. VMRESUME to launch L2 under L0's control
VMCS Shadowing: VMREAD/VMWRITE path
L1 Guest (VMX non-root)
|
| VMREAD field X
|
+-- field X in "pass-through" bitmap?
| YES: CPU reads shadow VMCS directly (no VMEXIT!)
| NO: VMEXIT to L0
|
v
Shadow VMCS (in host physical memory, pointed to by VMCS link ptr)
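The VMREAD decision in the diagram can be modelled in a few lines. This is a toy model, not KVM code: `ShadowVmcsModel` is a hypothetical class, the field names are used only as dictionary keys, and the choice of which field is intercepted is an assumption made for the example. The `intercepted` set stands in for the hardware bitmap, where a set bit means the access still exits to L0.

```python
from dataclasses import dataclass, field


@dataclass
class ShadowVmcsModel:
    """Toy model of VMREAD under VMCS shadowing."""
    shadow: dict = field(default_factory=dict)     # field name -> value
    intercepted: set = field(default_factory=set)  # fields whose bitmap bit is set
    exits_to_l0: int = 0

    def vmread(self, name: str) -> int:
        if name in self.intercepted:
            # Bitmap bit set: the access traps and L0 emulates the read.
            self.exits_to_l0 += 1
        # Either way, L1 ends up with the value held for this field.
        return self.shadow.get(name, 0)


model = ShadowVmcsModel(
    shadow={"GUEST_RIP": 0x1000, "VM_EXIT_REASON": 48},
    intercepted={"VM_EXIT_REASON"},  # assumption: L0 chose to intercept this one
)
model.vmread("GUEST_RIP")
model.vmread("GUEST_RIP")
model.vmread("VM_EXIT_REASON")
print(model.exits_to_l0)  # 1
```

Three reads cost one exit instead of three, which is the whole point of the bitmaps: L0 only pays for the fields it cannot let L1 access directly.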
AMD Nested Virtualization: VMCB Shadowing
AMD uses the VMCB (Virtual Machine Control Block) instead of VMCS. AMD nested virtualization (nSVM) works analogously:
- L1 executes VMRUN (AMD's equivalent of VMLAUNCH/VMRESUME)
- L0 intercepts, merges L1's VMCB with L0's requirements, creates a merged VMCB
- L2 runs under L0's control with the merged VMCB
- L2 VMEXIT → L0 → L0 decides if L1 should see it → synthetic VMEXIT delivery to L1
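Unlike the VMCS, the VMCB is an ordinary memory structure that L1 reads and writes with normal loads and stores, so there is no VMREAD/VMWRITE equivalent for L0 to trap. AMD's hardware assists for nesting instead target other SVM instructions, for example virtualized VMLOAD/VMSAVE, which lets L1 execute those instructions without an intercept.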
Nested VMEXIT Handling
When L2 causes a VMEXIT, the control flow is:
L2 VMEXIT Control Flow:
L2 executes sensitive instruction
|
| Hardware VMEXIT (goes to L0, because L0 owns VMX root mode)
v
L0 VMEXIT handler
|
+-- Is this exit interesting to L0?
| e.g., EPT violation for L0's own mapping, L0's timer
| YES: L0 handles it, VMRESUME back to L2 (transparent)
|
+-- Should L1 see this exit?
| e.g., L1 configured its VMCS to intercept this exit reason
| YES: L0 synthesizes a VMEXIT event for L1
|
| L0 saves L2 state into L1's "virtual VMCS" (shadow VMCS)
| L0 constructs exit reason + qualification in shadow VMCS
| L0 "injects" VMEXIT into L1:
| - Switches to L1's VMCS
| - Loads L1's exit handler RIP (from L1's host state area)
| - VMRESUME to L1
v
L1 VMEXIT handler (running in VMX non-root mode, Ring 0)
|
| L1 reads exit reason from its VMCS → reads shadow VMCS
| L1 handles the exit (e.g., emulates I/O device for L2)
| L1 executes VMRESUME
|
v VMEXIT to L0 (VMRESUME from L1 is a sensitive instruction)
L0 handles L1's VMRESUME
|
| L0 merges L1's updated shadow VMCS with L0 controls
| L0 launches L2 (VMRESUME to L2)
v
L2 resumes execution
Each L2→L1 handoff requires two L0 VMEXITs (one to deliver the synthetic VMEXIT to L1, one to process L1's VMRESUME). This doubles the VMEXIT overhead compared to non-nested operation.
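The routing decision and the exit count can be sketched as a small function. This is a simplified model under stated assumptions: `route_l2_exit` and the exit-reason strings are hypothetical names, and the count covers only the two exits inherent to the handoff, not any additional traps from L1's VMREAD/VMWRITE instructions.

```python
def route_l2_exit(reason: str, l0_owned: set, l1_intercepts: set) -> tuple:
    """Return (handler, number of exits L0 processes) for one L2 exit."""
    if reason in l0_owned:
        return ("L0", 1)  # handled transparently, then L2 is resumed
    if reason in l1_intercepts:
        # One exit to receive the event from L2, one more for L1's VMRESUME.
        return ("L1", 2)
    return ("L0", 1)      # L1 did not ask for it, so L0 deals with it


l0_owned = {"EPT_VIOLATION"}        # assumption: L0's own mapping faulted
l1_intercepts = {"IO_INSTRUCTION"}  # assumption: L1 emulates I/O for L2

print(route_l2_exit("EPT_VIOLATION", l0_owned, l1_intercepts))   # ('L0', 1)
print(route_l2_exit("IO_INSTRUCTION", l0_owned, l1_intercepts))  # ('L1', 2)
```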
Three-Level Memory: Nested EPT
With nested virtualization, address translation has three levels:
| Level | Translation | Who maintains |
|---|---|---|
| L2 guest PT | L2-gVA → L2-gPA | L2 guest OS |
| L1 EPT (shadow/nested) | L2-gPA → L1-gPA | L1 hypervisor |
| L0 EPT | L1-gPA → hPA | L0 hypervisor |
Full walk: L2-gVA → L2-gPA → L1-gPA → hPA
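The chain can be made concrete with page-granular lookup tables. This is a minimal sketch with made-up page numbers; `translate` and `compose` are illustrative helpers, and real page tables are multi-level radix trees rather than flat dictionaries. `compose` shows how the two hypervisor-owned stages can be folded into a single L2-gPA → hPA table.

```python
PAGE = 4096


def translate(addr: int, table: dict) -> int:
    """Translate one address through a page-number -> page-number mapping."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


def compose(first: dict, second: dict) -> dict:
    """Fold two translation stages into one table."""
    return {src: second[mid] for src, mid in first.items() if mid in second}


# Hypothetical page-number mappings, for illustration only.
l2_pt = {0x10: 0x20}   # L2-gVA page -> L2-gPA page (L2 guest OS)
l1_ept = {0x20: 0x30}  # L2-gPA page -> L1-gPA page (L1 hypervisor)
l0_ept = {0x30: 0x40}  # L1-gPA page -> hPA page    (L0 hypervisor)

gva = 0x10 * PAGE + 0x123
hpa = translate(translate(translate(gva, l2_pt), l1_ept), l0_ept)

# The same result using one merged table for the two lower stages.
merged = compose(l1_ept, l0_ept)
assert translate(translate(gva, l2_pt), merged) == hpa
print(hex(hpa))  # 0x40123
```

The page offset passes through every stage unchanged; only the page number is remapped.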
x86 hardware walks only two of these stages on its own: the guest page table and a single EPT (NPT on AMD). L0 therefore has to fold the two hypervisor-owned stages into one table before L2 runs.
Without nested EPT support in L0: L1 cannot be offered EPT at all, so it falls back to shadow page tables for L2, and L2 page-table updates trap. Extremely expensive.
With nested EPT: L1 builds an ordinary EPT for L2, and L0 maintains in software a merged EPT (L2-gPA → hPA) that combines L1's EPT with its own. The hardware then performs a normal two-dimensional walk. Each of the 4 guest PT levels needs a 4-level EPT walk plus one access to read the entry, and the final L2-gPA needs one more EPT walk, for 4 × (4 + 1) + 4 = 24 memory accesses per L2 TLB miss in the worst case.
The TLB is critical here — VPID tagging must include L2's VPID to avoid TLB flushes, and L2 TLB entries are tagged differently from L1 and L0 entries.
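Nested EPT Walk (conceptual worst case, if all three levels were walked with nothing cached):
L2 TLB miss for gVA X
|
v
Walk L2 guest PT (4 levels); each PT page is addressed by an L2-gPA:
For each of the 4 PT entries, and once more for the final L2-gPA:
→ Walk L1 EPT (4 levels); each L1 EPT page is addressed by an L1-gPA:
For each of the 4 EPT entries, and once more for the final L1-gPA:
→ Walk L0 EPT (4 levels) to get the hPA
L1-gPA → hPA: 4 accesses
L2-gPA → hPA: 4 × (4 + 1) + 4 = 24 accesses
L2-gVA → hPA: 4 × (24 + 1) + 24 = 124 accesses
(in practice, TLB hits reduce this dramatically)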
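The worst-case count can be derived mechanically. The sketch below is a counting model, not a description of any particular CPU's page walker. It assumes no TLB or paging-structure cache hits, and uses the strict convention that every pointer stored in a table must be translated by all lower stages before it can be dereferenced. The function name `walk_cost` is illustrative.

```python
def walk_cost(levels: list) -> int:
    """Worst-case memory accesses to resolve one address with no caching.

    levels[0] is the depth of the outermost table (the L2 guest page table);
    each later entry is the depth of the table that translates the addresses
    used by the stage before it.
    """
    if not levels:
        return 0  # already a host-physical address
    depth = levels[0]
    inner = walk_cost(levels[1:])
    # Each table entry must be located (inner accesses) and read (1 access);
    # the final output address must then be located once more (inner).
    return depth * (inner + 1) + inner


print(walk_cost([4]))        # 4
print(walk_cost([4, 4]))     # 24
print(walk_cost([4, 4, 4]))  # 124
```

For equal depths the recursion reduces to (depth + 1)^stages − 1, which makes the growth per added translation stage easy to see.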
KVM Nested Virtualization Implementation
KVM supports nested virtualization via the nested=1 module parameter:
# Enable nested virt for Intel
modprobe kvm_intel nested=1
# Or permanently, via a file in /etc/modprobe.d/ (run as root):
echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-intel.conf
# Enable nested virt for AMD
modprobe kvm_amd nested=1
# Verify it's enabled
cat /sys/module/kvm_intel/parameters/nested # Y
The guest VM (L1) must expose VMX to its vCPUs. In QEMU:
-cpu host,+vmx # Intel: expose VMX capability
-cpu host,+svm # AMD: expose SVM capability
# Or use a named CPU model that includes VMX:
-cpu Skylake-Server,+vmx
Inside L1, the guest hypervisor sees VMX support in CPUID and can use /dev/kvm normally.
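The two checks (module parameter on L0, CPU flag in L1) can also be scripted. This is a minimal sketch that reads the same sysfs and procfs files the shell commands use; the function names and the `sysfs_root` parameter are illustrative, and the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def nested_enabled(module: str, sysfs_root: str = "/sys/module"):
    """Return True/False for the module's `nested` parameter, or None if absent."""
    param = Path(sysfs_root) / module / "parameters" / "nested"
    try:
        value = param.read_text().strip()
    except OSError:
        return None  # module not loaded, or not a Linux host
    return value in ("Y", "1")


def has_virt_flag(cpuinfo_text: str) -> bool:
    """True if any `flags` line of /proc/cpuinfo content lists vmx or svm."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            if "vmx" in flags or "svm" in flags:
                return True
    return False


for mod in ("kvm_intel", "kvm_amd"):
    print(mod, nested_enabled(mod))
```

On L0, run the loop as shown. Inside L1, pass the contents of /proc/cpuinfo to `has_virt_flag` to confirm the virtualization extension was exposed.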
Key KVM Nested Code Paths
- arch/x86/kvm/vmx/nested.c: ~7,000 lines of the most complex code in KVM
- handle_vmlaunch()/handle_vmresume(): intercept L1's VMX launch
- nested_vmx_enter_non_root_mode(): merge L1 VMCS + L0 requirements, launch L2
- nested_vmx_vmexit(): handle L2 exit, decide L0 vs L1 handling
- prepare_vmcs02(): compute the "merged VMCS" (VMCS02) that is actually loaded
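The merge step can be illustrated with bitmasks. This is a toy model of the idea, not the logic of `prepare_vmcs02()`: `merge_controls`, `VmentryFailure`, and the bit assignments are all hypothetical, and real VMCS controls involve many more fields and rules.

```python
class VmentryFailure(Exception):
    """Raised when L1's requested controls cannot be honoured (toy model)."""


def merge_controls(vmcs12_ctl: int, l0_required: int, l0_allowed: int) -> int:
    """Compute the control bits for the VMCS that actually runs L2."""
    unsupported = vmcs12_ctl & ~l0_allowed
    if unsupported:
        raise VmentryFailure(f"L1 requested unsupported controls: {unsupported:#x}")
    # L2 runs with everything L1 asked for plus everything L0 insists on.
    return vmcs12_ctl | l0_required


# Illustrative bit assignments, not the real VMCS encodings.
HLT_EXITING = 1 << 0    # requested by L1
IO_EXITING = 1 << 1     # required by L0
FANCY_FEATURE = 1 << 2  # not offered to L1

allowed = HLT_EXITING | IO_EXITING
print(bin(merge_controls(HLT_EXITING, IO_EXITING, allowed)))  # 0b11
```

Requesting `FANCY_FEATURE` would raise `VmentryFailure`, which corresponds to L1 seeing its VMLAUNCH fail.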
Performance Overhead
Overhead Sources
- Double VMEXIT: each L2 VMEXIT costs two L0 VMEXITs (synthetic + VMRESUME processing)
- VMCS merging: prepare_vmcs02() runs on every L1→L2 transition (~10,000 ns)
- Triple address translation: three-level EPT walk on L2 TLB misses
- Shadow VMCS synchronization: L0 must keep shadow VMCS in sync with L1's intended VMCS
Empirical overhead: nested KVM typically adds 15–30% performance penalty vs running directly on hardware for compute workloads. I/O-intensive workloads can see 50-100% overhead due to high VMEXIT rates.
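A rough way to reason about where such percentages come from is to multiply the exit rate by the extra time each exit costs under nesting. The sketch below is a first-order model only; `nested_overhead` is an illustrative name, and the inputs in the example are assumptions chosen for the arithmetic, not measurements.

```python
def nested_overhead(exits_per_sec: float, extra_ns_per_exit: float) -> float:
    """Fraction of one CPU-second spent on the extra work of nested exits."""
    return exits_per_sec * extra_ns_per_exit / 1e9


# Illustrative inputs only: 20,000 exits/sec with 10,000 ns of extra work each.
print(f"{nested_overhead(20_000, 10_000):.0%}")  # 20%
```

The model ignores cache and TLB effects, so it is best used to compare workloads by exit rate rather than to predict absolute slowdown.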
Mitigation
- VirtIO in L2: use VirtIO devices in L2 (not emulated devices) to minimize VMEXITs
- EPT TLB coverage: use 2MB pages in EPT at L0 level to reduce EPT miss rate
- VMCS shadowing: Intel VMCS shadowing hardware (Haswell+) eliminates most VMREAD/VMWRITE VMEXITs
- Nested EPT: exposing EPT to L1 lets L0 merge L1's EPT with its own, so L1 does not have to fall back to shadow page tables for L2
Cloud Provider Support
| Provider | Nested Support | Notes |
|---|---|---|
| AWS EC2 | Bare-metal instances (*.metal) | Direct hardware: nested works natively |
| AWS EC2 | Virtualized (non-metal) instances | Historically not offered; support depends on instance family, so check current AWS documentation |
| GCP | All instances | --enable-nested-virtualization flag at VM creation |
| Azure | Dv3/Ev3/Fsv2 and newer | Hyper-V nested enabled; all recent series |
| DigitalOcean | Most instance types | KVM nested enabled by default |
| Linode/Akamai | All | KVM nested enabled |
| Hetzner | All dedicated servers | Full nested (bare-metal option too) |
AWS bare-metal (i3.metal, c5.metal, etc.): the customer's OS or hypervisor runs directly on the hardware with no provider hypervisor underneath, so it takes the L0 role itself. VMX instructions operate natively, with no trap-and-emulate overhead. This is the highest-performance option for workloads that need virtualization and is required for VMware Cloud on AWS.
Running KVM Inside QEMU (Testing)
The standard nested virtualization test setup for kernel development:
# On the host (L0 running KVM); -cpu host,+vmx exposes VMX to L1:
qemu-system-x86_64 \
-enable-kvm \
-cpu host,+vmx \
-m 4G \
-smp 2 \
-hda l1-disk.qcow2 \
-netdev user,id=n0 \
-device virtio-net,netdev=n0 \
-nographic
# Inside the L1 VM:
sudo modprobe kvm_intel # nested=1 is only needed here if L2 must itself host VMs
sudo qemu-system-x86_64 -enable-kvm -m 512M -hda l2-disk.qcow2 # L2 VM
Security Implications
- L0 attack surface expansion: supporting nested virtualization adds ~7,000 lines of KVM code, all of which is attack surface. Bugs in the nested code paths (vmx/nested.c and svm/nested.c) have been a significant source of CVEs, for example CVE-2021-3653 and CVE-2021-3656 (both AMD nested SVM privilege escalations caused by insufficient validation of L1-supplied VMCB fields).
- L1 escape via nested: a compromised L2 can attempt to escape to L1 (by exploiting L1's hypervisor bugs), and from there attempt to escape to L0 (by exploiting KVM nested handling bugs). Each layer adds attack surface.
- VMCS/VMCB field validation: L0 must validate every field in L1's shadow VMCS before merging it. Invalid fields (e.g., a non-canonical host RIP) must be rejected. This validation logic is complex and has been the source of multiple CVEs.
- Information leakage: L2's CPUID can reveal information about L0's hardware configuration. L0 must carefully synthesize CPUID responses for L2.
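One representative validation is checking that an address supplied by L1, such as the host RIP, is canonical. The sketch below assumes 48-bit virtual addresses; `is_canonical` and `check_host_rip` are illustrative names, the dictionary stands in for L1's VMCS, and a real hypervisor performs many more checks than this one.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    if not 0 <= addr < 1 << 64:
        return False
    upper = addr >> (va_bits - 1)  # the sign bit and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def check_host_rip(vmcs12: dict) -> None:
    """Reject an L1-supplied VMCS whose host RIP is not canonical."""
    if not is_canonical(vmcs12["HOST_RIP"]):
        raise ValueError("VM entry must fail: HOST_RIP is not canonical")


print(is_canonical(0x00007FFFFFFFFFFF))  # True
print(is_canonical(0xFFFF800000000000))  # True
print(is_canonical(0x0000800000000000))  # False
```

Performing the check in L0 before the merge matters because a bad value accepted here would be loaded on L0's behalf, not L1's.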
Debugging Notes
# Check if nested virtualization is active on host
cat /sys/module/kvm_intel/parameters/nested # Y or 1
# In L1: verify VMX is visible
grep vmx /proc/cpuinfo
# Should show "vmx" in flags
# In L1: check if KVM can be loaded
sudo modprobe kvm_intel
dmesg | grep -i kvm # some kernels log "Nested Virtualization enabled"; a missing message is not an error
# Monitor nested VMEXIT statistics (L0 host); stat names vary by kernel version
cat /sys/kernel/debug/kvm/*/nested_run # count of nested VM runs
cat /sys/kernel/debug/kvm/*/nested_vmexit_* # per-reason counts, if your kernel exposes them
# Trace nested VMCS operations
trace-cmd record -e 'kvm:kvm_nested_vmexit' sleep 5
trace-cmd report | grep nested
# Common failure: L1 cannot load KVM
# Check: is vmx/svm in /proc/cpuinfo?
# Check: is /dev/kvm accessible in L1?
ls -la /dev/kvm # should exist inside L1
# Debugging nested EPT issues
dmesg | grep -i "ept\|nested\|vmcs"
Failure Modes
- VMCS merge failure: if L1 sets VMCS fields to values that L0 cannot accommodate (e.g., requesting a feature L0 disallows), nested_vmx_enter_non_root_mode() returns an error. L1 sees a VMLAUNCH failure. Hard to debug without KVM source knowledge.
- L2 triple fault: L2 guest OS encounters a triple fault (e.g., kernel panic). This VMEXIT goes to L0, which forwards it to L1. L1 must handle it (typically reset the L2 VM). If L1 mishandles the triple fault forwarding, the nested stack can deadlock.
- Nested TLB invalidation bug: incorrect VPID/ASID management causes L2 to see stale TLB entries from L0 or L1. Symptoms: L2 crashes with page fault at unexpected addresses. These bugs are subtle and rare in production KVM.
- Interrupt injection deadlock: L0 tries to inject a virtual interrupt into L1 while L1 is trying to inject an interrupt into L2. Interrupt window management must be carefully coordinated across all three levels.
- Performance cliff: some workloads (nested containers, JVM inside nested VM) cause catastrophic VMEXIT rates. The 15–30% overhead estimate can reach 5-10× on pathological workloads. Monitor nested_run and nested_vmexit_* stats before deploying nested workloads in production.
Modern Usage and Future Directions
Confidential computing with nesting: combining nesting with confidential-computing technologies such as AMD SEV-SNP is an active area of work. The goal is a "cloud-within-a-cloud" in which neither the L0 operator nor the L1 operator can read L2 guest memory.
Unikernels as L1: Unikraft and similar unikernels are being explored as minimal L1 hypervisors. A 2 MB unikernel running KVM as L1 provides hypervisor services with an order-of-magnitude smaller attack surface than Linux-based L1.
Cross-cloud migration via nesting: organizations running VMware on-premises want to migrate to cloud. VMware Cloud on AWS (nested on EC2 bare-metal) is the production solution today. As cloud interconnects improve, nested virtualization enables burst-to-cloud patterns where peak load runs as L2 in cloud while steady state remains on-premises.
Hardware acceleration for three-level EPT: a page walker that resolves all three translation levels in hardware would remove the need for L0 to maintain a merged EPT in software, while keeping the cost of an L2 TLB miss comparable to an ordinary two-level EPT walk. No shipping x86 CPU does this today.
Exercises
- On a KVM host with nested=1, launch a VM with -cpu host,+vmx. Inside the VM, verify VMX is available and launch a nested VM. Measure the overhead by running sysbench cpu inside the L2 VM vs directly inside L1.
- Use trace-cmd to capture kvm_nested_vmexit events while L2 is running. Identify the most common nested VMEXIT reasons and explain what causes each.
- Research CVE-2021-3653 (AMD KVM nested SVM privilege escalation). Describe: what VMCB field was not validated, what L1 could set it to, and what happened when L0 used the malformed value.
- Explain why running a container (Docker) inside an L2 VM is generally safe from L0's perspective, but running Kata Containers (VM inside container runtime) inside L2 would require triple nesting.
- Compute the theoretical maximum number of memory accesses needed to resolve an L2 TLB miss when all three page table levels (L2 PT, L1 EPT, L0 EPT) are 4-level walks. Show your work.
References
- Ben-Yehuda, M. et al. (2010). "The Turtles Project: Design and Implementation of Nested Virtualization." USENIX OSDI 2010.
- Dall, C. et al. (2016). "KVM/ARM: Experiences Building the Linux ARM Hypervisor." ACM TOCS 2016.
- Linux kernel source: arch/x86/kvm/vmx/nested.c, arch/x86/kvm/svm/nested.c
- Intel. Intel 64 and IA-32 SDM, Vol 3C, Section 25: Nested Virtualization.
- AMD. "AMD64 Architecture Programmer's Manual, Vol. 2", Section 15.22: Nested Paging (nSVM).
- Uhlig, R. et al. (2005). "Intel Virtualization Technology." IEEE Computer.
- Ben-Yehuda, M. et al. (2006). "Utilizing IOMMUs for Virtualization in Linux and Xen." Linux Symposium 2006.