08 — Nested Virtualization

Prerequisites

KVM architecture: VMCS, VMENTER/VMEXIT, VMX root/non-root modes
Memory virtualization: EPT, two-level page table walks
Intel VT-x: VMLAUNCH/VMRESUME instructions, VMEXIT reasons
Virtualization fundamentals: hypervisor types, full virtualization

Historical Context

Nested virtualization (running a hypervisor inside a virtual machine) was long considered an academic curiosity, because the overhead of double VMEXITs made it impractical. Neither Intel VT-x nor AMD SVM initially offered dedicated hardware help for nesting, so early implementations such as the Turtles Project (2010) relied on the host hypervisor trapping and emulating every VMX instruction in software. Hardware assists such as Intel's VMCS shadowing (Haswell, 2013) arrived later.

The practical need emerged from the cloud era. Developers wanted to test hypervisor code in cloud VMs. VMware-on-cloud scenarios (running vSphere inside AWS EC2) became a real product request. Kubernetes-in-VM for CI/CD pipelines needed either containers or VMs for isolation. These use cases drove investment in making nested virtualization fast and correct.

KVM support for nested virtualization (nested=1 for kvm_intel / kvm_amd) reached production quality around 2012–2015. Today it is widely used: AWS bare-metal instances support full nesting, GCP offers --enable-nested-virtualization, and most cloud providers support it on at least some instance types. Azure supports nested virtualization on all Dv3/Ev3 and newer series.

Nested Virtualization Terminology

The three-level hierarchy uses standardized naming:

Nested Virtualization Levels:

  +---------------------------------+
  |  L2 Guest (Guest-of-Guest)      |
  |  "innermost VM"                 |
  |  e.g., an app inside a nested VM|
  +---------------------------------+
  |  L1 Guest Hypervisor            |
  |  "guest hypervisor"             |
  |  e.g., KVM running inside a VM  |
  |  or VMware ESXi in cloud        |
  +---------------------------------+
  |  L0 Host Hypervisor             |
  |  "real hypervisor"              |
  |  e.g., KVM on the physical host |
  +---------------------------------+
  |  Physical Hardware              |
  +---------------------------------+

L0: the hypervisor running on bare metal. It controls all hardware.
L1: a guest VM running a hypervisor. L1 believes it has full hardware access but is actually virtualized by L0.
L2: a VM created by L1. L2 is a VM-of-a-VM — doubly virtualized.

Use Cases

1. Testing Hypervisors in the Cloud

Kernel developers testing KVM patches run their test KVM hypervisor (L1) inside a cloud VM (L0 = cloud provider's hypervisor). The test hypervisor creates test VMs (L2). Without nested support, this requires a dedicated bare-metal machine.

2. VMware on Cloud (VMware Cloud on AWS)

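VMware vSphere (ESXi) runs on AWS EC2 bare-metal instances, and the customer's existing vSphere VMs run on top of it. Because a bare-metal instance has no hypervisor underneath the customer's software (Nitro provides I/O and management from dedicated hardware), ESXi is effectively L0 here and the customer VMs are ordinary single-level guests. The scenario is often grouped with nested virtualization because it solves the same problem: migrating on-premises VMware workloads to the cloud without converting VM images.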

3. Kubernetes in VMs for CI

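CI systems (GitHub Actions, Jenkins) create ephemeral VMs per build job. Builds that only run containers (Docker-in-Docker, or kind, which runs Kubernetes nodes as Docker containers) need no nested virtualization. Builds that start real VMs inside the CI VM (for example the Android emulator, minikube with a KVM driver, or Kata Containers) do: without nesting, /dev/kvm is unavailable in the CI VM and those tools either fall back to slow software emulation or fail.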

4. Cloud Gaming

Cloud gaming platforms run a Windows VM (L1) with DirectX/game support inside a GPU-partitioned host. Some architectures use nested virtualization for game state isolation — each game session is a nested VM.

5. Security Research

Hypervisor-level rootkit research requires a controlled environment. Researchers run a real OS as an L1 guest under KVM (L0) and then run malware inside it that attempts to install hypervisor-level hooks. L0 can monitor L1 and L2 completely without L1's knowledge.

Intel VT-x Nested Support: VMCS Shadowing

The core challenge: L1 (the guest hypervisor) wants to execute VMLAUNCH and VMRESUME to run L2. These are VMX instructions — they would normally cause a VMEXIT to L0 (the real hypervisor). Without nesting support, L0 would not know what to do with them.

Naive Approach (Pre-Hardware Assist)

Without hardware VMCS shadowing, L0 must fully emulate every VMX instruction that L1 executes:

L1 executes VMWRITE: VMEXIT to L0. L0 notes the field/value in a shadow VMCS.
L1 executes VMREAD: VMEXIT to L0. L0 returns the value from shadow VMCS.
L1 executes VMLAUNCH: VMEXIT to L0. L0 merges L1's VMCS (desired L2 configuration) with L0's own requirements and creates a real VMCS for L2. L0 launches L2 itself.
L2 causes a VMEXIT: Goes to L0 first (hardware). L0 determines if L1 should see this exit. If yes: L0 saves the exit info into L1's "virtual VMCS", then injects a VMEXIT event into L1 (synthetic exit delivery).

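Every VMX instruction from L1 costs a VMEXIT to L0. For an L1 running 100 L2 VMs that each take 10,000 VMEXITs/sec, L0 handles at least 1,000,000 nested VMEXITs/sec, before counting the additional exits caused by L1's own VMREAD, VMWRITE, and VMRESUME instructions. That is a massive overhead.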
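The arithmetic behind that estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and the `extra_vmx_insns` parameter (the average number of VMREAD/VMWRITE instructions L1 executes per handled exit) are illustrative assumptions, not measured values.

```python
def l0_exits_per_second(l2_vms, exits_per_vm, extra_vmx_insns=0):
    """Estimate VMEXITs/sec that reach L0 under pure trap-and-emulate nesting.

    Each L2 exit that L1 handles costs L0:
      - one exit for the L2 event itself,
      - one exit for L1's VMRESUME,
      - one exit per VMREAD/VMWRITE L1 executes (extra_vmx_insns, assumed).
    """
    l2_exits = l2_vms * exits_per_vm
    return l2_exits * (2 + extra_vmx_insns)


# Scenario from the text: 100 L2 VMs, 10,000 exits/sec each.
base = l0_exits_per_second(100, 10_000)          # no VMREAD/VMWRITE traffic
heavy = l0_exits_per_second(100, 10_000, 10)     # assumed 10 field accesses per exit
print(base, heavy)
```

Even with no field accesses at all, the 1,000,000 L2 exits per second turn into 2,000,000 exits at L0; the assumed ten VMREAD/VMWRITE instructions per exit raise that to 12,000,000.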

Intel VMCS Shadowing (Haswell, 2013)

Intel added VMCS shadowing to hardware to accelerate the most common path: VMREAD/VMWRITE from L1 do not need to VMEXIT to L0 if VMCS shadowing is enabled:

L0 creates a shadow VMCS and points the VMCS link pointer in L1's VMCS to it
When L1 executes VMREAD/VMWRITE, the CPU accesses the shadow VMCS directly without VMEXIT
Read/write bitmaps control which fields are intercepted (still cause VMEXIT) vs passed through

VMLAUNCH/VMRESUME from L1 still VMEXIT to L0. L0 must:

1. Merge L1's shadow VMCS (L2 desired state) with L0's own execution controls
2. Create a merged VMCS for direct L2 execution
3. VMRESUME to launch L2 under L0's control
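The merge step can be pictured with a toy model. The sketch below is not KVM's code: the `Vmcs` class, its three fields, and the string-valued controls are simplifications invented for illustration. The names vmcs01, vmcs12, and vmcs02 follow KVM's convention (the VMCS that L0 uses to run L1, the one L1 prepared for L2, and the merged one that L0 actually loads to run L2).

```python
from dataclasses import dataclass, field


@dataclass
class Vmcs:
    """Toy VMCS: guest state, host state, and the set of events that cause an exit."""
    guest_state: dict = field(default_factory=dict)
    host_state: dict = field(default_factory=dict)
    exit_controls: set = field(default_factory=set)


def merge_vmcs(vmcs01: Vmcs, vmcs12: Vmcs) -> Vmcs:
    """Build the VMCS that really runs L2 (vmcs02).

    - Guest state comes from vmcs12: it describes L2 as L1 wants it.
    - Host state comes from vmcs01: every hardware exit from L2 must land
      in L0, never directly in L1.
    - Exit controls are the union: L2 must exit whenever either L0 or L1
      wants to see the event. (Real merging has per-control exceptions.)
    """
    return Vmcs(
        guest_state=dict(vmcs12.guest_state),
        host_state=dict(vmcs01.host_state),
        exit_controls=vmcs01.exit_controls | vmcs12.exit_controls,
    )


vmcs01 = Vmcs(host_state={"rip": "l0_exit_handler"},
              exit_controls={"EPT_VIOLATION", "EXTERNAL_INTERRUPT"})
vmcs12 = Vmcs(guest_state={"rip": 0x1000},
              host_state={"rip": "l1_exit_handler"},
              exit_controls={"HLT", "IO_INSTRUCTION"})
vmcs02 = merge_vmcs(vmcs01, vmcs12)
print(vmcs02.host_state["rip"], sorted(vmcs02.exit_controls))
```

Note that L1's host state (`l1_exit_handler`) does not appear in vmcs02 at all. L0 only uses it later, when it emulates a VMEXIT into L1.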

VMCS Shadowing: VMREAD/VMWRITE path

  L1 Guest (VMX non-root)
       |
       |  VMREAD field X
       |
       +-- field X in "pass-through" bitmap?
       |         YES: CPU reads shadow VMCS directly (no VMEXIT!)
       |         NO:  VMEXIT to L0
       |
       v
  Shadow VMCS (in host physical memory, pointed to by VMCS link ptr)
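The bitmap lookup in the diagram can be modeled in a few lines. This is a sketch of the architectural rule as I understand it from the Intel SDM, not code from any hypervisor: each bitmap is one 4 KiB page indexed by bits 14:0 of the field encoding, and a set bit means "intercept" (so the diagram's "pass-through" case corresponds to a clear bit). The helper names are hypothetical; the three field encodings are the values defined in the SDM.

```python
VMCS_BITMAP_BYTES = 4096  # one 4 KiB page = 32,768 field bits

GUEST_RIP = 0x681E
VM_EXIT_REASON = 0x4402
HOST_RIP = 0x6C16


def set_intercept(bitmap: bytearray, field_encoding: int, intercept: bool) -> None:
    """Set or clear the intercept bit for one VMCS field."""
    byte, bit = divmod(field_encoding & 0x7FFF, 8)
    if intercept:
        bitmap[byte] |= 1 << bit
    else:
        bitmap[byte] &= ~(1 << bit) & 0xFF


def access_causes_vmexit(bitmap: bytearray, field_encoding: int) -> bool:
    """True if a VMREAD/VMWRITE of this field from L1 would exit to L0."""
    if field_encoding >> 15:          # encodings beyond the bitmap always exit
        return True
    byte, bit = divmod(field_encoding & 0x7FFF, 8)
    return bool(bitmap[byte] & (1 << bit))


# Start by intercepting everything, then pass through two hot fields.
vmread_bitmap = bytearray(b"\xff" * VMCS_BITMAP_BYTES)
set_intercept(vmread_bitmap, GUEST_RIP, False)
set_intercept(vmread_bitmap, VM_EXIT_REASON, False)

for name, enc in [("GUEST_RIP", GUEST_RIP),
                  ("VM_EXIT_REASON", VM_EXIT_REASON),
                  ("HOST_RIP", HOST_RIP)]:
    print(name, "exits" if access_causes_vmexit(vmread_bitmap, enc) else "shadowed")
```

Starting from "intercept everything" and whitelisting individual fields is the conservative choice: any field L0 has not explicitly considered still traps.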

AMD Nested Virtualization: VMCB Shadowing

AMD uses the VMCB (Virtual Machine Control Block) instead of VMCS. AMD nested virtualization (nSVM) works analogously:

L1 executes VMRUN (AMD's equivalent of VMLAUNCH/VMRESUME)
L0 intercepts, merges L1's VMCB with L0's requirements, creates a merged VMCB
L2 runs under L0's control with the merged VMCB
L2 VMEXIT → L0 → L0 decides if L1 should see it → synthetic VMEXIT delivery to L1

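AMD has no direct counterpart to VMCS shadowing and does not need one: the VMCB is an ordinary page of guest memory that L1 reads and writes with normal loads and stores, so there are no VMREAD/VMWRITE-style exits to eliminate. AMD's hardware assists for nesting target other instructions instead: Virtual VMLOAD/VMSAVE and virtual GIF (vGIF) let L1 execute VMLOAD, VMSAVE, STGI, and CLGI without exiting to L0.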

Nested VMEXIT Handling

When L2 causes a VMEXIT, the control flow is:

L2 VMEXIT Control Flow:

  L2 executes sensitive instruction
       |
       | Hardware VMEXIT (goes to L0, because L0 owns VMX root mode)
       v
  L0 VMEXIT handler
       |
       +-- Is this exit interesting to L0?
       |    e.g., EPT violation for L0's own mapping, L0's timer
       |    YES: L0 handles it, VMRESUME back to L2 (transparent)
       |
       +-- Should L1 see this exit?
       |    e.g., L1 configured its VMCS to intercept this exit reason
       |    YES: L0 synthesizes a VMEXIT event for L1
       |
       |  L0 saves L2 state into L1's "virtual VMCS" (shadow VMCS)
       |  L0 constructs exit reason + qualification in shadow VMCS
       |  L0 "injects" VMEXIT into L1:
       |    - Switches to L1's VMCS
       |    - Loads L1's exit handler RIP (from L1's host state area)
       |    - VMRESUME to L1
       v
  L1 VMEXIT handler (running in VMX non-root mode, Ring 0)
       |
       |  L1 reads exit reason from its VMCS → reads shadow VMCS
       |  L1 handles the exit (e.g., emulates I/O device for L2)
       |  L1 executes VMRESUME
       |
       v  VMEXIT to L0 (VMRESUME from L1 is a sensitive instruction)
  L0 handles L1's VMRESUME
       |
       |  L0 merges L1's updated shadow VMCS with L0 controls
       |  L0 launches L2 (VMRESUME to L2)
       v
  L2 resumes execution

Each L2→L1 handoff requires two L0 VMEXITs (one to deliver the synthetic VMEXIT to L1, one to process L1's VMRESUME). This doubles the VMEXIT overhead compared to non-nested operation.
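The routing decision at the top of the flow can be sketched as a small function. This is a deliberately simplified model: real hypervisors decide per exit reason with many special cases, and the function names and the string exit reasons below are illustrative assumptions.

```python
def route_l2_exit(reason, l0_owned, l1_intercepts):
    """Return which level handles an exit taken while L2 was running."""
    if reason in l0_owned:
        return "L0"   # handled transparently; L2 is resumed directly
    if reason in l1_intercepts:
        return "L1"   # reflected: L0 synthesizes a VMEXIT into L1
    return "L0"       # nobody asked for it explicitly; L0 deals with it


def l0_exits_for(reason, l0_owned, l1_intercepts):
    """Hardware exits reaching L0 for one L2 exit (ignoring VMREAD/VMWRITE)."""
    handler = route_l2_exit(reason, l0_owned, l1_intercepts)
    # A reflected exit costs the original exit plus L1's trapped VMRESUME.
    return 2 if handler == "L1" else 1


l0_owned = {"EXTERNAL_INTERRUPT"}
l1_intercepts = {"IO_INSTRUCTION", "HLT"}

for reason in ("EXTERNAL_INTERRUPT", "IO_INSTRUCTION", "HLT"):
    print(reason,
          route_l2_exit(reason, l0_owned, l1_intercepts),
          l0_exits_for(reason, l0_owned, l1_intercepts))
```

L0's own interests are checked first, matching the order in the diagram: an event L0 needs (such as a host interrupt) is never handed to L1 just because L1 also asked for that exit reason.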

Three-Level Memory: Nested EPT

With nested virtualization, address translation has three levels:

Level	Translation	Who maintains
L2 guest PT	L2-gVA → L2-gPA	L2 guest OS
L1 EPT (shadow/nested)	L2-gPA → L1-gPA	L1 hypervisor
L0 EPT	L1-gPA → hPA	L0 hypervisor

Full walk: L2-gVA → L2-gPA → L1-gPA → hPA
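The chain can be modeled with three page-granular dictionaries. This is purely an illustration: the page numbers are made up, and real page tables are multi-level radix trees rather than flat maps. The `compose` helper shows how two adjacent stages can be folded into a single table, which is the idea behind a software-maintained combined EPT.

```python
PAGE = 4096


def translate(addr, table):
    """Translate addr through a flat {source_page: target_page} mapping."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


def compose(outer, inner):
    """Fold two stages into one: apply `outer` first, then `inner`."""
    return {src: inner[mid] for src, mid in outer.items() if mid in inner}


l2_pt = {0x10: 0x20}    # L2-gVA page -> L2-gPA page (maintained by L2 guest OS)
l1_ept = {0x20: 0x30}   # L2-gPA page -> L1-gPA page (maintained by L1)
l0_ept = {0x30: 0x40}   # L1-gPA page -> hPA page    (maintained by L0)

gva = 0x10 * PAGE + 0x123

# Step-by-step walk through all three stages.
hpa = translate(translate(translate(gva, l2_pt), l1_ept), l0_ept)

# Same result using one combined table for the two lower stages.
combined = compose(l1_ept, l0_ept)   # L2-gPA page -> hPA page
hpa_combined = translate(translate(gva, l2_pt), combined)

print(hex(hpa), hex(hpa_combined))
```

Both paths yield the same host-physical address; the combined table simply removes one lookup stage at the cost of having to be rebuilt whenever either of its inputs changes.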

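x86 hardware walks only two dimensions: a guest page table and one EPT (NPT on AMD). It has no three-level walker, so the two lower stages must be combined in software.

With nested EPT: L0 exposes EPT to L1, then shadows L1's EPT against its own, producing a single combined table that maps L2-gPA directly to hPA (KVM calls the three tables EPT12, EPT01, and EPT02). The hardware then performs an ordinary two-dimensional walk of L2's guest page tables over the combined table. In the worst case, each of the 4 guest page-table levels needs a 4-level EPT walk plus the read of the entry itself (4 × 5 = 20), and the final guest-physical address needs one more 4-level EPT walk, for 4 × (4 + 1) + 4 = 24 memory accesses per L2 TLB miss. L0 write-protects L1's EPT pages so that it can update the combined table whenever L1 changes a mapping.

Without nested EPT: L1 sees no EPT capability and must use shadow page tables for L2, trapping every L2 page-table update. This is far more expensive for workloads that modify page tables frequently.

The TLB is critical here: L0 must give L2 its own VPID so that switching between L1 and L2 does not force a TLB flush, which means L2's TLB entries are tagged differently from L1's.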

Nested EPT Walk (worst case, after L0 has combined L1 EPT and L0 EPT):

  L2 TLB miss for gVA X
       |
       v
  Hardware walks L2 guest PT (4 levels):
    For each PT level access (L2-gPA of PT page):
      → Walk combined EPT (4 levels) to get hPA of the PT page
      → Read the PT entry (1 access)
    → Walk combined EPT (4 levels) for the final L2-gPA → hPA

  Total: up to 4 × (4 + 1) + 4 = 24 memory accesses
  (a hypothetical hardware walk of all three levels would need up to 124;
   in practice, TLB hits reduce the real cost dramatically)
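Worst-case walk lengths are easy to miscount by hand, so a short recursive function helps. The counting rule assumed here is the usual one for nested paging: every table pointer at one stage is an address in the next stage's space, so reading each entry first requires a full walk of the remaining stages, and the final output address needs one more such walk. The function name is illustrative.

```python
def walk_cost(levels):
    """Worst-case memory reads to translate one address with no TLB or cache hits.

    `levels` lists the depth of each translation stage, outermost
    (the guest's own page table) first.
    """
    if not levels:
        return 0
    inner = walk_cost(levels[1:])          # cost of translating one pointer
    # Per level: translate the table pointer (inner) and read the entry (1).
    # Afterwards: translate the final output address (inner).
    return levels[0] * (inner + 1) + inner


print(walk_cost([4]))         # native 4-level page walk
print(walk_cost([4, 4]))      # guest page table over a single EPT
print(walk_cost([4, 4, 4]))   # hypothetical walk of three separate stages
```

The three results are 4, 24, and 124. The jump from 24 to 124 shows why collapsing the two EPT stages into one table matters so much: a walker that had to traverse all three stages separately would pay roughly five times more per TLB miss.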

KVM Nested Virtualization Implementation

KVM supports nested virtualization via the nested=1 module parameter:

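# Enable nested virt for Intel
# (the parameter is only read at load time, so unload the module first
#  if it is already loaded; this requires that no VMs are running)
modprobe -r kvm_intel
modprobe kvm_intel nested=1
# Or permanently (applies the next time the module is loaded):
echo "options kvm_intel nested=1" > /etc/modprobe.d/kvm-intel.conf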

# Enable nested virt for AMD
modprobe kvm_amd nested=1

# Verify it's enabled
cat /sys/module/kvm_intel/parameters/nested   # Y (or 1 on older kernels)
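The same check can be scripted, which is handy in provisioning or CI code that must handle both vendors. The sketch below reads the sysfs files shown above; the function name and the `sysfs_root` parameter (included so the function can be pointed at a test directory) are illustrative choices.

```python
from pathlib import Path


def nested_enabled(module, sysfs_root="/sys"):
    """Return True/False for the module's `nested` parameter.

    Returns None if the parameter file does not exist, which normally
    means the module is not loaded.
    """
    param = Path(sysfs_root) / "module" / module / "parameters" / "nested"
    try:
        value = param.read_text().strip()
    except FileNotFoundError:
        return None
    # Newer kernels report Y/N, older ones 1/0.
    return value in ("Y", "1")


for module in ("kvm_intel", "kvm_amd"):
    print(module, nested_enabled(module))
```

A result of `None` for both modules usually means KVM is not loaded at all, which is a different problem from nesting being disabled.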

The guest VM (L1) must expose VMX to its vCPUs. In QEMU:

-cpu host,+vmx    # Intel: expose VMX capability
-cpu host,+svm    # AMD: expose SVM capability
# Or use a named CPU model that includes VMX:
-cpu Skylake-Server,+vmx

Inside L1, the guest hypervisor sees VMX support in CPUID and can use /dev/kvm normally.

Key KVM Nested Code Paths

arch/x86/kvm/vmx/nested.c: ~7,000 lines of the most complex code in KVM
handle_vmlaunch() / handle_vmresume(): intercept L1's VMX launch
nested_vmx_enter_non_root_mode(): merge L1 VMCS + L0 requirements, launch L2
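nested_vmx_reflect_vmexit(): decide whether an L2 exit is handled by L0 or reflected to L1 (name as of recent kernels)
nested_vmx_vmexit(): emulate the VMEXIT from L2 into L1 (copy exit state into L1's VMCS, switch back to running L1)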
prepare_vmcs02(): compute the "merged VMCS" (VMCS02) that is actually loaded

Performance Overhead

Overhead Sources

Double VMEXIT: each L2 VMEXIT costs two L0 VMEXITs (synthetic + VMRESUME processing)
VMCS merging: prepare_vmcs02() runs on every L1→L2 transition (~10,000 ns)
Layered address translation: L0 must shadow L1's EPT in software, and every L2 TLB miss still costs a full two-dimensional page walk
Shadow VMCS synchronization: L0 must keep shadow VMCS in sync with L1's intended VMCS

Empirical overhead: nested KVM is commonly reported to add a 15–30% performance penalty for compute workloads compared with non-nested execution. I/O-intensive workloads can see 50–100% overhead due to high VMEXIT rates. Actual figures vary widely with workload, kernel version, and hardware.
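A rough way to relate exit rates to these percentages is to multiply the exit rate by the cost of one transition. The sketch below does exactly that and nothing more; the exit rates are assumed inputs, and the per-exit cost reuses the ~10,000 ns transition figure quoted above as a stand-in rather than a measurement.

```python
def exit_time_fraction(exits_per_sec, ns_per_exit):
    """Fraction of one CPU's time consumed by exit handling."""
    return exits_per_sec * ns_per_exit / 1e9


# Assumed rates of reflected L2 exits on one vCPU.
for rate in (5_000, 20_000, 50_000):
    share = exit_time_fraction(rate, 10_000)
    print(f"{rate:>6} exits/s -> {share:.0%} of a CPU")
```

Under these assumptions, 20,000 reflected exits per second already cost about a fifth of a CPU, which is in the same range as the compute-workload figure above, and 50,000 per second costs half.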

Mitigation

VirtIO in L2: use VirtIO devices in L2 (not emulated devices) to minimize VMEXITs
EPT TLB coverage: use 2MB pages in EPT at L0 level to reduce EPT miss rate
VMCS shadowing: Intel VMCS shadowing hardware (Haswell+) eliminates most VMREAD/VMWRITE VMEXITs
Nested EPT: exposing EPT to L1 lets L2 run on a combined EPT maintained by L0, so L1 does not need shadow page tables for L2

Cloud Provider Support

Provider	Nested Support	Notes
AWS EC2	Bare-metal instances (`*.metal`)	Direct hardware: nested works natively
AWS EC2	Virtualized (non-metal) instances	Historically not supported; check current AWS documentation for the instance family
GCP	Most Intel-based machine types	`--enable-nested-virtualization` flag at VM creation; some machine families are excluded
Azure	Dv3/Ev3/Fsv2 and newer	Hyper-V nested enabled; all recent series
DigitalOcean	Most instance types	KVM nested enabled by default
Linode/Akamai	All	KVM nested enabled
Hetzner	All dedicated servers	Full nested (bare-metal option too)

AWS bare-metal (i3.metal, c5.metal, etc.): the customer's operating system or hypervisor runs directly on the hardware with no L0 hypervisor underneath. VMX instructions execute natively, so there is no nesting overhead, and a hypervisor installed on the instance is effectively L0. This is the highest-performance option for hypervisor workloads and is what VMware Cloud on AWS uses.

Running KVM Inside QEMU (Testing)

The standard nested virtualization test setup for kernel development:

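# On the host (L0 running KVM):
# -cpu host,+vmx exposes VMX to L1. Comments must not follow a trailing
# backslash, or the shell stops treating the line as continued.
qemu-system-x86_64 \
  -enable-kvm \
  -cpu host,+vmx \
  -m 4G \
  -smp 2 \
  -hda l1-disk.qcow2 \
  -netdev user,id=n0 \
  -device virtio-net,netdev=n0 \
  -nographic

# Inside the L1 VM (nested=1 is not needed here; it would only matter for L3 guests):
sudo modprobe kvm_intel
sudo qemu-system-x86_64 -enable-kvm -m 512M -hda l2-disk.qcow2 -nographic  # L2 VM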

Security Implications

L0 attack surface expansion: supporting nested virtualization adds ~7,000 lines of KVM code on Intel alone, all of which is attack surface. KVM's nested code has been a significant source of CVEs, for example CVE-2021-3653 and CVE-2021-3656, both caused by insufficient validation of L1-supplied VMCB fields in the nested SVM code.
L1 escape via nested: a compromised L2 can attempt to escape to L1 (by exploiting L1's hypervisor bugs), and then L2-via-L1 can attempt to escape to L0 (by exploiting KVM nested handling bugs). Each layer adds attack surface.
VMCS/VMCB field validation: L0 must validate every field in L1's shadow VMCS before merging it. Invalid values (e.g., a non-canonical host RIP, or reserved control bits set) must be rejected exactly as real hardware would reject them. This validation logic is complex and has been the source of multiple CVEs.
Information leakage: L2's CPUID can reveal information about L0's hardware configuration. L0 must carefully synthesize CPUID responses for L2.
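To make the field-validation point concrete, here is a minimal sketch of two checks of the kind L0 has to perform before trusting L1-supplied state. The canonical-address rule is architectural; everything else (the function names, the single `controls` bitmask, and the `allowed_controls` mask standing in for what L0 advertises to L1) is a simplification invented for illustration.

```python
def is_canonical(addr, va_bits=48):
    """True if a 64-bit address is canonical for the given virtual-address width.

    Canonical means bits 63 down to (va_bits - 1) are all equal.
    """
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def check_l1_state(host_rip, controls, allowed_controls):
    """Return a list of problems; an empty list means the checks passed."""
    errors = []
    if not is_canonical(host_rip):
        errors.append("host RIP is not canonical")
    if controls & ~allowed_controls:
        errors.append("control bits set that L0 does not allow")
    return errors


print(check_l1_state(0xFFFF_8000_0000_1000, 0b0101, 0b0111))
print(check_l1_state(0x0000_8000_0000_0000, 0b1101, 0b0111))
```

The first call passes both checks and returns an empty list; the second fails both. A real implementation must refuse the VM entry in the second case rather than silently masking the bad bits, because L1 is entitled to see the same failure real hardware would report.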

Debugging Notes

# Check if nested virtualization is active on host
cat /sys/module/kvm_intel/parameters/nested   # Y or 1

# In L1: verify VMX is visible
grep vmx /proc/cpuinfo
# Should show "vmx" in flags

# In L1: check if KVM can be loaded
sudo modprobe kvm_intel
dmesg | grep -i kvm   # kvm_amd may log "Nested Virtualization enabled"; kvm_intel usually logs nothing on success

# Monitor nested activity (L0 host; requires debugfs to be mounted)
cat /sys/kernel/debug/kvm/*/nested_run   # count of nested VM runs (recent kernels)
# Per-reason exit counts are not exported as debugfs files; use the tracepoint below

# Trace nested VMEXIT events
trace-cmd record -e 'kvm:kvm_nested_vmexit' sleep 5
trace-cmd report | grep nested

# Common failure: L1 cannot load KVM
# Check: is vmx/svm in /proc/cpuinfo?
# Check: is /dev/kvm accessible in L1?
ls -la /dev/kvm   # should exist inside L1

# Debugging nested EPT issues
dmesg | grep -i "ept\|nested\|vmcs"
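To turn a trace into per-reason counts, the report output can be post-processed with a short script. The exact text format of the kvm_nested_vmexit event differs between kernel versions, so the regular expression below is an assumption: it expects the event name followed somewhere by `reason` or `reason:` and then the reason string. The two sample lines are hypothetical and only illustrate the two spellings.

```python
import re
from collections import Counter

# Assumed format: "... kvm_nested_vmexit: ... reason[:] <NAME> ..."
REASON_RE = re.compile(r"kvm_nested_vmexit:.*?\breason:?\s+(\S+)")


def count_exit_reasons(lines):
    """Count nested VMEXIT reasons in an iterable of trace report lines."""
    counts = Counter()
    for line in lines:
        match = REASON_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "qemu-1 [002] 10.1: kvm_nested_vmexit: vcpu 0 reason EPT_VIOLATION rip 0x1000",
    "qemu-1 [002] 10.2: kvm_nested_vmexit: rip: 0x2000 reason: MSR_WRITE ext_inf1: 0x0",
    "qemu-1 [002] 10.3: kvm_nested_vmexit: vcpu 0 reason EPT_VIOLATION rip 0x1008",
    "qemu-1 [002] 10.4: kvm_entry: vcpu 0",
]
for reason, n in count_exit_reasons(sample).most_common():
    print(f"{n:6d}  {reason}")
```

For real data, pass `sys.stdin` to `count_exit_reasons` and pipe the output of `trace-cmd report` into the script; if nothing matches, inspect a few raw lines and adjust the pattern to the kernel's format.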

Failure Modes

VMCS merge failure: if L1 sets VMCS fields to values that L0 cannot accommodate (e.g., requesting a feature L0 disallows), nested_vmx_enter_non_root_mode() returns an error. L1 sees a VMLAUNCH failure. Hard to debug without KVM source knowledge.
L2 triple fault: L2 guest OS encounters a triple fault (e.g., kernel panic). This VMEXIT goes to L0, which forwards it to L1. L1 must handle it (typically reset the L2 VM). If L1 mishandles the triple fault forwarding, the nested stack can deadlock.
Nested TLB invalidation bug: incorrect VPID/ASID management causes L2 to see stale TLB entries from L0 or L1. Symptoms: L2 crashes with page fault at unexpected addresses. These bugs are subtle and rare in production KVM.
Interrupt injection deadlock: L0 tries to inject a virtual interrupt into L1 while L1 is trying to inject an interrupt into L2. Interrupt window management must be carefully coordinated across all three levels.
Performance cliff: some workloads (nested containers, JVM inside nested VM) cause catastrophic VMEXIT rates. The 15–30% overhead estimate can become a 5–10× slowdown on pathological workloads. Monitor the nested_run statistic and the kvm_nested_vmexit tracepoint before deploying nested workloads in production.

Modern Usage and Future Directions

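Confidential computing with nesting: combining nesting with confidential VMs is an active area of work rather than a finished feature. AMD SEV-SNP's Virtual Machine Privilege Levels (VMPLs) and Intel TDX's TD partitioning let a privileged component inside a confidential guest host less-privileged guest contexts. The long-term goal is a "cloud-within-a-cloud" in which neither the L0 operator nor the L1 operator can read L2 guest memory.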

Unikernels as L1: Unikraft and similar unikernels are being explored as minimal L1 hypervisors. A 2 MB unikernel running KVM as L1 provides hypervisor services with an order-of-magnitude smaller attack surface than Linux-based L1.

Cross-cloud migration via nesting: organizations running VMware on-premises want to migrate to cloud. VMware Cloud on AWS (ESXi on EC2 bare-metal) is the production solution today. As cloud interconnects improve, nested virtualization enables burst-to-cloud patterns where peak load runs as L2 in cloud while steady state remains on-premises.

Hardware acceleration for multi-level walks: today the three logical translation levels are collapsed into two by a combined EPT that L0 maintains in software. Future CPU generations may add support that reduces this shadowing cost or shortens the 24-access worst-case walk, but no shipping x86 processor walks three levels in hardware.

Exercises

On a KVM host with nested=1, launch a VM with -cpu host,+vmx. Inside the VM, verify VMX is available and launch a nested VM. Measure the overhead by running sysbench cpu inside the L2 VM vs directly inside L1.
Use trace-cmd to capture kvm_nested_vmexit events while L2 is running. Identify the most common nested VMEXIT reasons and explain what causes each.
Research CVE-2021-3653 (AMD KVM nested SVM privilege escalation). Describe: what VMCB field was not validated, what L1 could set it to, and what happened when L0 used the malformed value.
Explain why running a container (Docker) inside an L2 VM is generally safe from L0's perspective, but running Kata Containers (VM inside container runtime) inside L2 would require triple nesting.
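Compute the theoretical maximum number of memory accesses needed to resolve an L2 TLB miss if hardware had to walk all three page table levels (L2 PT, L1 EPT, L0 EPT) as separate 4-level tables, and compare it with the two-dimensional walk that results when L0 combines the two EPT levels. Show your work.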

References

Ben-Yehuda, M. et al. (2010). "The Turtles Project: Design and Implementation of Nested Virtualization." USENIX OSDI 2010.
Dall, C. et al. (2016). "KVM/ARM: Experiences Building the Linux ARM Hypervisor." ACM TOCS 2016.
Linux kernel source: arch/x86/kvm/vmx/nested.c, arch/x86/kvm/svm/nested.c
Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3C (VMX chapters, including VMCS shadowing and EPT).
AMD. AMD64 Architecture Programmer's Manual, Vol. 2: System Programming, Chapter 15: Secure Virtual Machine.
Uhlig, R. et al. (2005). "Intel Virtualization Technology." IEEE Computer.
Ben-Yehuda, M. et al. (2006). "Utilizing IOMMUs for Virtualization in Linux and Xen." Linux Symposium 2006.