Section 19: Virtualization
Purpose and Scope
Virtualization is the technique of running a complete guest system, with its own kernel, memory map, and device model, atop a host under the control of a hypervisor. This section covers the full technological landscape: the classical dichotomy of full virtualization versus paravirtualization, the hardware support that made efficient virtualization mainstream (Intel VT-x / AMD-V, EPT/NPT, SR-IOV, IOMMU), and the major hypervisor implementations (KVM, Xen, VMware, Hyper-V) together with the QEMU machine emulator. It extends to memory virtualization, I/O virtualization via VirtIO, live migration, snapshots, nested virtualization, and the lightweight isolation technologies that power serverless computing: microVMs (Firecracker, Cloud Hypervisor) and the gVisor user-space kernel sandbox.
Understanding virtualization at this level is prerequisite to reasoning about cloud instance performance, container-vs-VM isolation trade-offs, and the architectural decisions that determine the cost of a syscall in a cloud function.
Prerequisites
- Section 02 (CPU Architecture): privilege rings, TLB, PCIe, MMU, APIC
- Section 03 (OS Fundamentals): kernel/user mode, system calls, memory layout
- Section 11 (Memory Management): paging, TLB, EPT/NPT as extensions of these
- Section 14 (Device Drivers): driver model, DMA, interrupt handling
- Section 15 (Networking): virtio-net, SR-IOV, virtual switches
Learning Objectives
Upon completing this section you will be able to:
- Explain why the x86 architecture was not classically virtualizable (sensitive non-privileged instructions) and how VT-x/AMD-V solve this.
- Describe the VMCS (Virtual Machine Control Structure) and the VM-entry/VM-exit cycle.
- Explain EPT (Extended Page Tables): a second level of page table translation enabling hardware-assisted memory virtualization.
- Trace a guest I/O operation through QEMU/KVM: guest userspace → guest kernel → VM exit → KVM in host kernel → QEMU device emulation.
- Describe VirtIO's split virtqueue design and explain why it dramatically outperforms fully-emulated devices.
- Explain SR-IOV: how PFs create VFs, how VFs are assigned to guests, and the role of the IOMMU.
- Describe the live migration process: pre-copy, stop-and-copy, and the role of dirty page tracking.
- Explain Xen's architectural distinction between dom0 and domU, and the split driver model.
- Describe Firecracker's microVM architecture: minimal device model, jailer, and threat model.
- Explain nested virtualization: L0 (host hypervisor), L1 (guest hypervisor), L2 (nested guest) and the performance implications.
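The EPT objective above can be made concrete with a toy model. The following is a minimal sketch, assuming flat single-level tables represented as Python dictionaries (real page tables are multi-level radix trees); the function names are illustrative and do not come from any hypervisor. It shows the two translation stages and the commonly cited worst-case count of memory references for a nested page walk.

```python
PAGE_SIZE = 4096

def translate(addr, page_table):
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in page_table:
        raise KeyError(f"fault: page {page:#x} not mapped")
    return page_table[page] * PAGE_SIZE + offset

def guest_virtual_to_host_physical(gva, guest_pt, ept):
    """Two-stage translation: gVA -> gPA (guest tables), then gPA -> hPA (EPT/NPT)."""
    gpa = translate(gva, guest_pt)
    return translate(gpa, ept)

def nested_walk_refs(guest_levels, ept_levels):
    """Worst-case memory references for one TLB miss under nested paging.

    Each guest table pointer is itself a gPA, so reading one guest entry
    costs ept_levels references to translate it plus 1 to read it. The
    final gPA of the data page needs ept_levels more references.
    """
    return guest_levels * (ept_levels + 1) + ept_levels

# Hypothetical mappings chosen only for illustration.
guest_pt = {0x10: 0x2}    # guest virtual page 0x10 -> guest physical page 0x2
ept = {0x2: 0x7F}         # guest physical page 0x2 -> host physical page 0x7F
hpa = guest_virtual_to_host_physical(0x10123, guest_pt, ept)
print(hex(hpa), nested_walk_refs(4, 4))
```

With four-level guest tables and four-level EPT, the worst case is 24 references per miss, which is why TLB reach and large pages matter more under virtualization than on bare metal.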
Architecture Overview
Guest User Space
┌───────────────────────────────────────────────────────────────┐
│ Application code (ring 3 in guest context) │
└───────────────────────────────┬───────────────────────────────┘
│ syscall (INT 0x80 / SYSCALL)
┌───────────────────────────────▼───────────────────────────────┐
│ Guest Kernel (ring 0 in guest context / VMX non-root) │
│ Sensitive instructions → VM exit → hypervisor handles │
└───────────────────────────────┬───────────────────────────────┘
│ VM exit (hardware trap)
┌───────────────────────────────▼───────────────────────────────┐
│ KVM (kernel module) │
│ ioctl(KVM_RUN) ─ VMCS/VMCB ─ EPT/NPT ─ virtual APIC │
│ Fast path: in-kernel handling (EPT fault, hypercall, MMIO) │
│ Slow path: exit to QEMU for device emulation │
└───────────────────────────────┬───────────────────────────────┘
│ ioctl(KVM_RUN) returns to user space with an exit reason
┌───────────────────────────────▼───────────────────────────────┐
│ QEMU (user space) │
│ Device emulation: virtio-blk, virtio-net, e1000, nvme │
│ Migration, snapshot, monitor, VNC/SPICE │
└───────────────────────────────┬───────────────────────────────┘
│
┌───────────────────────────────▼───────────────────────────────┐
│ Host Kernel + Hardware │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────┐ │
│ │ EPT / NPT │ │ IOMMU/VT-d │ │ SR-IOV VF assign │ │
│ │ 2-level TLB │ │ DMA remap │ │ PCIe VF per guest │ │
│ └──────────────┘ └──────────────┘ └────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
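The fast-path / slow-path split in the KVM box above can be sketched as a dispatch loop. This is a toy model, not KVM code: the exit-reason strings and the assignment of each reason to a path are assumptions made for illustration (in a real system the split depends on which devices are emulated in the kernel).

```python
from collections import Counter

# Hypothetical classification for this sketch only.
IN_KERNEL = {"ept_fault", "hypercall"}   # handled inside the kernel module
USER_SPACE = {"port_io"}                 # forwarded to the user-space device model

def run_vcpu(exits):
    """Replay a list of exit reasons and count where each one is handled."""
    handled = Counter()
    for reason in exits:
        if reason in IN_KERNEL:
            handled["kernel"] += 1   # fast path: resume the guest immediately
        elif reason in USER_SPACE:
            handled["user"] += 1     # slow path: KVM_RUN returns, device model runs
        else:
            raise ValueError(f"unknown exit reason: {reason}")
    return handled

trace = ["ept_fault", "port_io", "hypercall", "ept_fault", "port_io"]
stats = run_vcpu(trace)
print(dict(stats))
```

The design point the sketch illustrates is that every slow-path exit adds a kernel-to-user round trip on top of the hardware exit itself, so performance work concentrates on keeping frequent events in the kernel or avoiding the exit altogether.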
VirtIO Virtqueue (split ring):
Guest ─[avail ring]──► Host backend (device) processes descriptors
Host ─[used ring]──► Guest notified via interrupt injection
Descriptor table: guest physical address + length + flags
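The ring flow above can be illustrated with a simplified in-memory model. This is a sketch under stated assumptions: the class and method names are hypothetical, Python lists stand in for the fixed-size circular rings of the real layout, and notifications (kicks and interrupts) are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Descriptor:
    addr: int        # guest physical address of the buffer
    length: int      # buffer length in bytes
    flags: int = 0   # chaining / device-writable bits in the real layout

@dataclass
class SplitVirtqueue:
    desc: list = field(default_factory=list)    # descriptor table
    avail: list = field(default_factory=list)   # driver -> device: descriptor indices
    used: list = field(default_factory=list)    # device -> driver: (index, bytes written)
    last_avail: int = 0                         # device's private read cursor

    def driver_add(self, addr, length):
        """Guest side: publish one buffer by index in the available ring."""
        self.desc.append(Descriptor(addr, length))
        self.avail.append(len(self.desc) - 1)

    def device_process(self, guest_memory):
        """Host side: consume new available entries and report them as used."""
        payloads = []
        while self.last_avail < len(self.avail):
            idx = self.avail[self.last_avail]
            d = self.desc[idx]
            payloads.append(bytes(guest_memory[d.addr:d.addr + d.length]))
            self.used.append((idx, 0))   # 0 bytes written: device only read the buffer
            self.last_avail += 1
        return payloads

mem = bytearray(64)            # stand-in for guest physical memory
mem[8:13] = b"hello"
vq = SplitVirtqueue()
vq.driver_add(addr=8, length=5)
payloads = vq.device_process(mem)
print(payloads, vq.used)
```

Because both sides communicate through shared memory, a batch of requests can be queued with a single notification, which is the main reason this design avoids the per-register exits of a fully emulated device.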
Firecracker microVM:
┌──────────────────────────────────────────────────┐
│ minimal Linux kernel (5 MB) + rootfs │
│ 2 VirtIO devices: virtio-blk + virtio-net │
│ No BIOS, No PCI bus, No USB — attack surface ↓ │
└───────────────────┬──────────────────────────────┘
│ KVM (same as QEMU-KVM)
▼
jailer (chroot + cgroup + namespace isolation; seccomp filters applied by the VMM)
Key Concepts
- Full Virtualization: Guest OS runs unmodified; hypervisor traps and emulates all privileged operations; historically required binary translation (VMware) or hardware assist (VT-x).
- Paravirtualization: Guest OS is modified to issue hypercalls instead of privileged instructions; avoids trap-and-emulate overhead; Xen PV model.
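- Hardware-Assisted Virtualization (VT-x / AMD-V): CPU extensions adding VMX root and non-root modes (Intel) or host and guest modes (AMD); the guest kernel runs directly at ring 0 in non-root mode with full isolation, while the hypervisor runs in root mode (informally called "ring -1").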
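- VMCS (Virtual Machine Control Structure): Intel per-vCPU data structure holding guest state, host state, and the control fields that decide which events cause exits; VM entry loads guest state; VM exit saves guest state, loads host state, and jumps to the hypervisor handler. The AMD equivalent is the VMCB.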
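- VM Exit: Hardware-triggered transition from guest to hypervisor; caused by I/O, privileged instructions, EPT faults, interrupts; a round trip costs on the order of a thousand or more CPU cycles when handled in the kernel, and typically several microseconds when the exit must be serviced in user space.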
- EPT (Extended Page Tables) / NPT (Nested Page Tables): Second-stage address translation: guest page tables map gVA (guest virtual) → gPA (guest physical), and EPT/NPT maps gPA → hPA (host physical); the TLB caches the combined translation; eliminates shadow page table maintenance.
- VPID / ASID: Virtual Processor ID (Intel) / Address Space ID (AMD): tags TLB entries per virtual CPU; avoids full TLB flush on VM entry/exit.
- KVM: Linux kernel module that uses VT-x/AMD-V to implement a hypervisor; VM management via /dev/kvm ioctl interface; QEMU provides device emulation.
- QEMU: User-space process providing complete machine emulation; works with KVM for accelerated execution; runs each vCPU through the KVM_RUN ioctl and emulates devices when KVM returns an exit to user space.
- Xen: Type-1 (bare metal) hypervisor; dom0 is a privileged Linux VM with hardware access; domU are unprivileged guests; split driver model passes I/O through dom0.
- VirtIO: Paravirtual device framework; defines a standard interface for virtual block, network, memory, and other devices; guest driver + host backend communicate via virtqueue (descriptor table + available ring + used ring).
- VFIO: Framework for safe device passthrough to userspace or VMs; uses IOMMU to protect host memory from DMA by the assigned device.
- SR-IOV: Physical NIC presents as a Physical Function (PF) managing N Virtual Functions (VFs); each VF has its own PCI config space, MAC, VLAN, and DMA queues; assigned directly to VMs.
- IOMMU (VT-d / AMD-Vi): Translates device DMA addresses; prevents a compromised VM/device from DMAing to arbitrary host memory.
- Live Migration: Move a running VM between hosts with minimal downtime; pre-copy phase replicates memory while VM runs; stop-and-copy phase copies final dirty pages; storage either shared or streamed.
- Dirty Page Tracking: Hypervisor marks pages as clean; writes from guest trigger EPT write-protection faults recorded in a dirty bitmap; drives live migration copy.
- Nested Virtualization: Running a hypervisor inside a VM; L0 (host), L1 (guest hypervisor), L2 (nested guest); hardware support (VMCS shadowing) required for acceptable performance.
- microVM: Minimal VM with a stripped-down device model (a handful of VirtIO devices, no PCI bus, no legacy BIOS); Firecracker targets booting to guest user space in under 125 ms; used by AWS Lambda and Fly.io (via Firecracker) and available as a Kata Containers backend.
- Firecracker: MicroVM VMM written in Rust; uses KVM; the jailer confines the VMM process with chroot, namespaces, and cgroups and drops privileges, while the VMM restricts itself further with seccomp filters; serves as the isolation boundary for AWS Lambda and Fargate.
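The interaction between live migration and dirty page tracking described above can be sketched as a simulation. This is a minimal model, not any hypervisor's algorithm: the function names, the stop threshold, and the shrinking write pattern are all assumptions chosen to make the convergence behaviour visible.

```python
def pre_copy(num_pages, dirty_per_round, max_rounds=10, stop_threshold=8):
    """Simulate iterative pre-copy.

    Returns (pages sent in each pre-copy round, pages left for stop-and-copy).
    dirty_per_round(round_no) yields the pages the guest wrote during that round.
    """
    dirty = set(range(num_pages))    # first round: every page must be sent
    sent = []
    for round_no in range(max_rounds):
        if len(dirty) <= stop_threshold:
            break                    # small enough: pause the guest and finish
        sent.append(len(dirty))      # copy the current dirty set while the guest runs
        dirty = set(dirty_per_round(round_no))   # bitmap collected meanwhile
    return sent, len(dirty)

def shrinking_writes(round_no):
    """Hypothetical workload whose write set halves each round: 64, 32, 16, ..."""
    return range(0, 64 >> round_no)

rounds, final = pre_copy(1024, shrinking_writes)
print(rounds, final)
```

If the guest dirties pages as fast as they can be copied, the dirty set never shrinks below the threshold and the loop ends only at `max_rounds`; real implementations handle that case with measures such as throttling the guest or switching strategy.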
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1972 | IBM VM/370: first practical VM system on System/370 |
| 1974 | Popek and Goldberg: formal requirements for virtualizable architectures |
| 1998 | VMware founded; VMware Workstation (released 1999) uses binary translation |
| 2000 | Xen project begins at Cambridge (Barham et al.) |
| 2001 | VMware ESX Server (Type-1 bare-metal hypervisor) |
| 2003 | Xen 1.0 released; paravirtualization for Linux/NetBSD |
| 2005 | Intel VT-x (Vanderpool) ships in Pentium 4 6x0 series |
| 2005 | Xen 3.0: hardware virtualization support (HVM) |
| 2006 | AMD-V (Pacifica) ships in Athlon 64 FX |
| 2006 | KVM merged into the Linux kernel for 2.6.20 |
| 2007 | VirtIO proposed by Rusty Russell; drivers merged for Linux 2.6.24 |
| 2008 | Intel VT-d (IOMMU) shipping in mainstream Xeon |
| 2010 | QEMU 0.13: VirtIO stabilizes as default for KVM guests |
| 2012 | Microsoft Hyper-V matures; Azure built on Hyper-V |
| 2015 | SR-IOV widely deployed in cloud NIC passthrough |
| 2017 | AWS introduces the KVM-based Nitro hypervisor (C5 instances), beginning its move away from Xen |
| 2018 | Firecracker open-sourced by AWS; microVM paradigm |
| 2018 | Kata Containers 1.0: containers in lightweight VMs |
| 2020 | Apple M1: Hypervisor.framework brings hardware-assisted virtualization to Apple silicon (ARM) |
| 2021 | Intel TDX / AMD SEV-SNP: confidential computing VMs |
| 2022 | VMPL (VM Privilege Levels) in AMD SEV-SNP |
Modern Relevance and Production Use Cases
Cloud infrastructure (AWS EC2, GCP, Azure) is built on hypervisors: AWS Nitro System uses KVM + dedicated hardware offload cards for EBS and networking, essentially moving all I/O out of the VMM; understanding this architecture explains why network I/O in EC2 is not a VM-exit bottleneck.
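Serverless platforms built on VMs (AWS Lambda, AWS Fargate) require fast VM startup; Firecracker targets a guest boot time under 125 ms by eliminating BIOS, PCI enumeration, and all but a few VirtIO devices. Not every platform takes this route: Google Cloud Run has used the gVisor sandbox, and Cloudflare Workers uses V8 isolates rather than VMs. The microVM model is nonetheless widely used for multi-tenant FaaS.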
Kubernetes with Kata Containers provides VM-level isolation with container-like ergonomics; each pod runs in its own lightweight VM (QEMU, Cloud Hypervisor, or Firecracker); understanding the VirtIO path explains the additional latency compared with native containers.
Confidential computing (AMD SEV, Intel TDX) encrypts VM memory and CPU state, preventing the hypervisor from reading guest memory; TEE-based workloads (financial, healthcare) require understanding the attestation and threat model.
ML training clusters use SR-IOV or RDMA passthrough to give VMs near-native GPU and network performance; understanding IOMMU and VF assignment is essential for GPU cluster configuration.
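Device passthrough is configured per IOMMU group, since VFIO treats the group as the unit of isolation. The sketch below lists the groups that Linux exposes under `/sys/kernel/iommu_groups`; the function name and the `root` parameter (added so the code can be exercised against a test directory) are assumptions of this example.

```python
import os

def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of device addresses}.

    An empty result means the directory is absent or empty, which on Linux
    usually indicates that the IOMMU is disabled or unsupported.
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    for name in os.listdir(root):
        dev_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(dev_dir):
            groups[int(name)] = sorted(os.listdir(dev_dir))
    return groups

if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(f"group {group}: {', '.join(devices)}")
```

A GPU or VF that shares a group with unrelated devices cannot be handed to a guest on its own, so inspecting the grouping is a sensible first step before planning passthrough on a given host.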
File Map
| File | Description |
|---|---|
| 01-virtualization-fundamentals.md | Popek-Goldberg criteria, VMM types, isolation goals |
| 02-full-vs-paravirt.md | Binary translation, Xen PV vs HVM, hypercall interface |
| 03-hardware-virt-vtx-amdv.md | VMX root/non-root, VMCS layout, VM entry/exit flow |
| 04-kvm-architecture.md | KVM as Linux module, /dev/kvm API, vCPU ioctl loop |
| 05-qemu-internals.md | QEMU machine model, device emulation, QMP monitor |
| 06-xen-hypervisor.md | dom0/domU, PV vs HVM, grant tables, event channels |
| 07-vmware-internals.md | VMware ESXi VMkernel, vSphere, vMotion architecture |
| 08-hyper-v.md | Hyper-V partitions, VSM, Generation 2 VMs, Azure Mezzanine |
| 09-memory-virtualization-ept.md | Shadow page tables (pre-EPT), EPT/NPT, VPID, TLB cost |
| 10-io-virtualization-virtio.md | VirtIO spec, split virtqueue, vhost-net, vhost-user |
| 11-sr-iov.md | PF/VF model, IOMMU mapping, guest NIC performance |
| 12-live-migration.md | Pre-copy algorithm, dirty bitmap, storage migration |
| 13-snapshots.md | COW snapshot disk, memory snapshot, checkpoint restore |
| 14-nested-virtualization.md | L0/L1/L2, VMCS shadowing, performance overhead |
| 15-microvms-firecracker.md | Firecracker VMM, jailer, minimal device model, startup time |
| 16-confidential-computing.md | AMD SEV/SEV-SNP, Intel TDX, attestation, threat model |
| 17-container-vs-vm.md | Namespace/cgroup vs full VM, Kata Containers, gVisor |
| 18-aws-nitro-system.md | Nitro card, NitroTPM, offloaded EBS/networking |
Cross-References
- Section 02 (CPU Architecture): VT-x/AMD-V as CPU features; APIC virtualization; cache and TLB in virtualized context
- Section 11 (Memory Management): EPT as two-level paging; balloon driver; memory overcommit and KSM in hypervisors
- Section 14 (Device Drivers): VFIO driver for passthrough; VirtIO driver on guest side; virtio-blk/net driver structure
- Section 15 (Networking): virtio-net, vhost-net, OVS-DPDK, SR-IOV NIC in VM context
- Section 17 (Distributed Systems): live migration as a distributed systems problem; cluster schedulers over VMs
- Section 18 (Database Internals): storage performance implications of VirtIO vs passthrough; fsync through virtualized storage