Section 19: Virtualization

Purpose and Scope

Virtualization is the technique of running a complete guest system, with its own kernel, memory map, and device model, atop a host, mediated by a hypervisor. This section covers the full technological landscape: the classical dichotomy of full virtualization versus paravirtualization, the hardware extensions that made efficient virtualization mainstream (Intel VT-x / AMD-V, EPT/NPT, SR-IOV, IOMMU), and the major hypervisor implementations (KVM, Xen, VMware, Hyper-V, QEMU). It extends to memory virtualization, I/O virtualization via VirtIO, live migration, snapshots, nested virtualization, and the lightweight isolation technologies behind serverless computing: microVM monitors (Firecracker, Cloud Hypervisor) and the gVisor user-space kernel, which is a sandbox rather than a VM.

Understanding virtualization at this level is prerequisite to reasoning about cloud instance performance, container-vs-VM isolation trade-offs, and the architectural decisions that determine the cost of a syscall in a cloud function.


Prerequisites

  • Section 02 (CPU Architecture): privilege rings, TLB, PCIe, MMU, APIC
  • Section 03 (OS Fundamentals): kernel/user mode, system calls, memory layout
  • Section 11 (Memory Management): paging, TLB, EPT/NPT as extensions of these
  • Section 14 (Device Drivers): driver model, DMA, interrupt handling
  • Section 15 (Networking): virtio-net, SR-IOV, virtual switches

Learning Objectives

Upon completing this section you will be able to:

  1. Explain why the x86 architecture was not classically virtualizable (sensitive non-privileged instructions) and how VT-x/AMD-V solve this.
  2. Describe the VMCS (Virtual Machine Control Structure) and the VM-entry/VM-exit cycle.
  3. Explain EPT (Extended Page Tables): a second level of page table translation enabling hardware-assisted memory virtualization.
  4. Trace a guest I/O operation through QEMU/KVM: guest userspace → guest kernel → VM exit → KVM in host kernel → QEMU device emulation.
  5. Describe VirtIO's split virtqueue design and explain why it dramatically outperforms fully-emulated devices.
  6. Explain SR-IOV: how PFs create VFs, how VFs are assigned to guests, and the role of the IOMMU.
  7. Describe the live migration process: pre-copy, stop-and-copy, and the role of dirty page tracking.
  8. Explain Xen's architectural distinction between dom0 and domU, and the split driver model.
  9. Describe Firecracker's microVM architecture: minimal device model, jailer, and threat model.
  10. Explain nested virtualization: L0 (host hypervisor), L1 (guest hypervisor), L2 (nested guest) and the performance implications.
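
Objective 7 can be made concrete with a toy model of pre-copy convergence. The sketch below is an illustration under simplifying assumptions (constant dirty rate, constant bandwidth, uniformly dirtied pages); the function name `precopy_plan` and its parameters are hypothetical and do not correspond to any hypervisor's API.

```python
def precopy_plan(total_pages, dirty_rate, bandwidth, max_downtime_s, max_rounds=30):
    """Toy pre-copy model (all names and units are illustrative).

    total_pages     guest memory size, in pages
    dirty_rate      pages the running guest dirties per second
    bandwidth       pages the migration link transfers per second
    max_downtime_s  longest acceptable pause for the stop-and-copy phase

    Returns (precopy_rounds, final_pages, downtime_s, converged).
    """
    to_send = total_pages  # the first round sends all of guest memory
    rounds = 0
    # Keep iterating while pausing now would exceed the downtime budget.
    while to_send / bandwidth > max_downtime_s:
        if rounds == max_rounds:
            # Dirty rate is too high relative to bandwidth: no convergence.
            return rounds, to_send, to_send / bandwidth, False
        # Pages dirtied while this round was in flight must be resent.
        to_send = min(total_pages, dirty_rate * to_send // bandwidth)
        rounds += 1
    # The remaining dirty set is small enough: pause the guest and copy it.
    return rounds, to_send, to_send / bandwidth, True


if __name__ == "__main__":
    print(precopy_plan(1_000_000, 10_000, 100_000, 0.05))
    print(precopy_plan(1_000_000, 200_000, 100_000, 0.05))
```

The second call models a guest that dirties memory faster than the link can carry it; in this model the dirty set never shrinks, which is the situation where real systems fall back to throttling the guest or to a different migration strategy.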

Architecture Overview

  Guest User Space
  ┌───────────────────────────────────────────────────────────────┐
  │  Application code (ring 3 in guest context)                   │
  └───────────────────────────────┬───────────────────────────────┘
                                  │ syscall (INT 0x80 / SYSCALL)
  ┌───────────────────────────────▼───────────────────────────────┐
  │  Guest Kernel (ring 0 in guest context / VMX non-root)        │
  │  Sensitive instructions → VM exit → hypervisor handles        │
  └───────────────────────────────┬───────────────────────────────┘
                                  │ VM exit (hardware trap)
  ┌───────────────────────────────▼───────────────────────────────┐
  │                     KVM (kernel module)                        │
  │  ioctl(KVM_RUN) ─ VMCS/VMCB ─ EPT/NPT ─ virtual APIC        │
  │  Fast path: in-kernel handling (EPT fault, hypercall, MMIO)   │
  │  Slow path: exit to QEMU for device emulation                 │
  └───────────────────────────────┬───────────────────────────────┘
                                  │ ioctl(KVM_RUN) returns to user space with an exit reason
  ┌───────────────────────────────▼───────────────────────────────┐
  │                     QEMU (user space)                          │
  │  Device emulation: virtio-blk, virtio-net, e1000, nvme        │
  │  Migration, snapshot, monitor, VNC/SPICE                       │
  └───────────────────────────────┬───────────────────────────────┘
                                  │
  ┌───────────────────────────────▼───────────────────────────────┐
  │                   Host Kernel + Hardware                        │
  │  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐  │
  │  │  EPT / NPT   │  │  IOMMU/VT-d  │  │  SR-IOV VF assign  │  │
  │  │  gPA → hPA   │  │  DMA remap   │  │  PCIe VF per guest │  │
  │  └──────────────┘  └──────────────┘  └────────────────────┘  │
  └───────────────────────────────────────────────────────────────┘
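
The fast-path / slow-path split in the KVM box can be sketched as a dispatch loop. This is a toy replay of exit reasons, not KVM code: the exit-reason strings and the `run_vcpu` function are invented for illustration and are not KVM's real constants or API.

```python
from collections import Counter

# Illustrative exit reasons only; these are NOT KVM's real constant names.
IN_KERNEL = {"ept_violation", "hypercall", "mmio_in_kernel_device"}
TO_USERSPACE = {"pio", "mmio_userspace_device", "shutdown"}


def run_vcpu(exit_trace):
    """Replay a list of exit reasons and count how each one is handled."""
    stats = Counter()
    for reason in exit_trace:
        if reason in IN_KERNEL:
            # Fast path: handled in the host kernel, guest is re-entered.
            stats["fast_path"] += 1
        elif reason in TO_USERSPACE:
            # Slow path: the run ioctl returns and the VMM emulates the device.
            stats["slow_path"] += 1
            if reason == "shutdown":
                break  # the VMM stops running this vCPU
        else:
            raise ValueError(f"unknown exit reason: {reason}")
    return dict(stats)


if __name__ == "__main__":
    trace = ["ept_violation", "pio", "hypercall",
             "mmio_userspace_device", "shutdown", "pio"]
    print(run_vcpu(trace))
```

The design point the diagram makes is visible in the counters: every slow-path exit pays for an extra kernel-to-user transition, so moving a hot device into the fast path (or avoiding the exit entirely) is the usual optimization.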

  VirtIO Virtqueue (split ring):
  Guest ─[avail ring]──► Host backend processes descriptors
  Host  ─[used  ring]──► Guest notified via interrupt injection
  Descriptor table: guest physical address + length + flags
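
The three structures sketched above can be modeled in a few lines. This is a simplified illustration, assuming unbounded Python lists in place of the fixed-size circular rings and omitting descriptor chaining, memory barriers, and notification suppression; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    addr: int      # guest physical address of the buffer
    length: int    # buffer length in bytes
    flags: int = 0


class SplitVirtqueue:
    """Toy split virtqueue: descriptor table + available ring + used ring."""

    def __init__(self, size):
        self.desc = [None] * size        # descriptor table
        self.free = list(range(size))    # free descriptor slots
        self.avail = []                  # heads published by the guest driver
        self.used = []                   # (head, bytes_written) from the device
        self.last_avail = 0              # device's private cursor into avail

    def guest_add(self, addr, length):
        """Guest driver: fill a descriptor and publish it on the avail ring."""
        head = self.free.pop(0)          # raises IndexError if the queue is full
        self.desc[head] = Descriptor(addr, length)
        self.avail.append(head)
        return head

    def device_process(self, handler):
        """Host backend: consume every newly available descriptor in one pass."""
        processed = 0
        while self.last_avail < len(self.avail):
            head = self.avail[self.last_avail]
            self.last_avail += 1
            written = handler(self.desc[head])
            self.used.append((head, written))
            processed += 1
        return processed

    def guest_reap(self):
        """Guest driver: collect completions and recycle their descriptors."""
        done = list(self.used)
        self.used.clear()
        for head, _ in done:
            self.desc[head] = None
            self.free.append(head)
        return done


if __name__ == "__main__":
    vq = SplitVirtqueue(4)
    vq.guest_add(0x1000, 512)
    vq.guest_add(0x2000, 512)
    print(vq.device_process(lambda d: d.length))
    print(vq.guest_reap())
```

Note that the guest queued two buffers before the backend ran once. That batching, one notification covering many requests, is a large part of why the design needs far fewer VM exits than a register-by-register emulated device.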

  Firecracker microVM:
  ┌──────────────────────────────────────────────────┐
  │  minimal Linux kernel (5 MB) + rootfs            │
  │  2 VirtIO devices: virtio-blk + virtio-net       │
  │  No BIOS, No PCI bus, No USB — attack surface ↓  │
  └───────────────────┬──────────────────────────────┘
                      │ KVM (same as QEMU-KVM)
                      ▼
             jailer (chroot + cgroup + namespace isolation; the VMM adds seccomp filters)

Key Concepts

  • Full Virtualization: Guest OS runs unmodified; hypervisor traps and emulates all privileged operations; on x86 this historically required binary translation (VMware) or hardware assist (VT-x).
  • Paravirtualization: Guest OS is modified to issue hypercalls instead of privileged instructions; avoids trap-and-emulate overhead; Xen PV model.
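  • Hardware-Assisted Virtualization (VT-x / AMD-V): CPU extensions that add a guest execution mode (VMX non-root on Intel, guest mode on AMD); the guest kernel runs at ring 0 inside that mode, isolated from the host; the hypervisor runs in VMX root mode, informally called "ring -1".
  • VMCS (Virtual Machine Control Structure): Intel per-vCPU data structure holding guest state, host state, and execution controls; VM entry loads guest state; VM exit saves guest state, loads host state, and jumps to the hypervisor handler. AMD's equivalent is the VMCB.
  • VM Exit: Hardware-triggered transition from guest to hypervisor; caused by I/O, privileged instructions, EPT faults, interrupts; the hardware round trip alone costs on the order of a thousand CPU cycles, and exits that must be completed in user space cost considerably more, so exit frequency is a primary driver of virtualization overhead.
  • EPT (Extended Page Tables) / NPT (Nested Page Tables): Second-stage address translation: gPA (guest physical) → hPA (host physical), applied after the guest's own gVA → gPA translation; the TLB caches the combined gVA → hPA result; eliminates shadow page table maintenance at the price of a longer page walk on a TLB miss.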
  • VPID / ASID: Virtual Processor ID (Intel) / Address Space ID (AMD): tags TLB entries per virtual CPU; avoids full TLB flush on VM entry/exit.
  • KVM: Linux kernel module that uses VT-x/AMD-V to implement a hypervisor; VM management via /dev/kvm ioctl interface; QEMU provides device emulation.
  • QEMU: User-space process providing complete machine emulation; works with KVM for accelerated execution; emulates device I/O whenever ioctl(KVM_RUN) returns to user space with an I/O or MMIO exit.
  • Xen: Type-1 (bare metal) hypervisor; dom0 is a privileged Linux VM with hardware access; domU are unprivileged guests; split driver model passes I/O through dom0.
  • VirtIO: Paravirtual device framework; defines a standard interface for virtual block, network, memory, and other devices; guest driver + host backend communicate via virtqueue (descriptor table + available ring + used ring).
  • VFIO: Framework for safe device passthrough to userspace or VMs; uses IOMMU to protect host memory from DMA by the assigned device.
  • SR-IOV: A PCIe device (typically a NIC) presents a Physical Function (PF) that manages N Virtual Functions (VFs); each VF has its own PCI config space, MAC, VLAN, and DMA queues and can be assigned directly to a VM.
  • IOMMU (VT-d / AMD-Vi): Translates device DMA addresses; prevents a compromised VM/device from DMAing to arbitrary host memory.
  • Live Migration: Move a running VM between hosts with minimal downtime; pre-copy phase replicates memory while VM runs; stop-and-copy phase copies final dirty pages; storage either shared or streamed.
  • Dirty Page Tracking: Hypervisor write-protects guest pages in the EPT; the first guest write to each page triggers an EPT write-protection fault that is recorded in a dirty bitmap; the bitmap tells live migration which pages to resend.
  • Nested Virtualization: Running a hypervisor inside a VM; L0 (host), L1 (guest hypervisor), L2 (nested guest); hardware support (VMCS shadowing) required for acceptable performance.
  • microVM: Minimal VM with a stripped-down device model (a handful of VirtIO devices, no PCI bus, no legacy BIOS); Firecracker targets boot to guest user space in under 125 ms; used by AWS Lambda and Fly.io (both via Firecracker) and available as a Kata Containers backend.
  • Firecracker: MicroVM VMM written in Rust; uses KVM; the jailer confines the VMM process with chroot, namespaces, and cgroups and drops privileges, and the VMM itself installs seccomp filters; serves as the isolation boundary for AWS Lambda and Fargate.
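
The EPT/NPT entry above describes two translation stages. A minimal sketch follows, assuming 4 KiB pages and flattening each multi-level table into a single Python dict; `translate` and `nested_walk_refs` are hypothetical helper names. The second function gives the commonly quoted worst-case count of memory references for a nested walk on a TLB miss, (g + 1)(h + 1) - 1 for g guest levels and h second-stage levels.

```python
PAGE_SHIFT = 12                      # assumes 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def translate(gva, guest_pt, ept):
    """Translate a guest virtual address to a host physical address.

    guest_pt maps guest virtual page numbers to guest physical page numbers
    (controlled by the guest kernel); ept maps guest physical page numbers to
    host physical page numbers (controlled by the hypervisor). A missing ept
    entry raises KeyError, standing in for an EPT violation VM exit.
    """
    gpa_page = guest_pt[gva >> PAGE_SHIFT]   # stage 1: gVA -> gPA
    hpa_page = ept[gpa_page]                 # stage 2: gPA -> hPA
    return (hpa_page << PAGE_SHIFT) | (gva & PAGE_MASK)


def nested_walk_refs(guest_levels, ept_levels):
    """Worst-case memory references for a two-dimensional page walk.

    Each guest page-table pointer, plus the final guest physical address,
    needs its own second-stage walk: (g + 1) * (h + 1) - 1.
    """
    return (guest_levels + 1) * (ept_levels + 1) - 1


if __name__ == "__main__":
    guest_pt = {0x10: 0x200}
    ept = {0x200: 0x7F3}
    print(hex(translate(0x10ABC, guest_pt, ept)))
    print(nested_walk_refs(4, 4))
```

With 4-level tables on both sides the worst case is 24 references, versus 4 for a native walk, which is why TLB reach, VPID/ASID tagging, and large pages matter more under virtualization than on bare metal.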

Major Historical Milestones

Year  Milestone
1972  IBM VM/370: first practical VM system on System/370
1974  Popek and Goldberg: formal requirements for virtualizable architectures
1998  VMware founded; VMware Workstation (binary translation) ships in 1999
2000  Xen project begins at Cambridge (Barham et al.)
2001  VMware ESX Server (Type-1 bare-metal hypervisor)
2003  Xen 1.0 released; paravirtualization for Linux/NetBSD
2005  Intel VT-x (Vanderpool) ships in Pentium 4 models 662 and 672
2005  Xen 3.0: hardware virtualization support (HVM)
2006  AMD-V (Pacifica) ships in Socket AM2 Athlon 64 processors
2006  KVM merged into mainline Linux; first released in kernel 2.6.20 (February 2007)
2007  VirtIO proposed by Rusty Russell; drivers first released in Linux 2.6.24 (January 2008)
2008  Intel VT-d (IOMMU) shipping in mainstream Xeon
2010  QEMU 0.13: VirtIO stabilizes as default for KVM guests
2012  Microsoft Hyper-V matures; Azure built on Hyper-V
2015  SR-IOV widely deployed in cloud NIC passthrough
2017  AWS introduces the KVM-based Nitro hypervisor (C5 instances) and begins moving EC2 off Xen
2018  Kata Containers 1.0: containers in lightweight VMs
2018  Firecracker open-sourced by AWS; microVM paradigm
2020  Apple M1 brings Hypervisor.framework to ARM64 Macs
2021  Intel TDX / AMD SEV-SNP: confidential computing VMs
2022  VMPL (VM Privilege Levels) in AMD SEV-SNP

Modern Relevance and Production Use Cases

Cloud infrastructure (AWS EC2, GCP, Azure) is built on hypervisors. The AWS Nitro System pairs a KVM-based hypervisor with dedicated hardware offload cards for EBS and networking, moving almost all I/O out of the host VMM; understanding this architecture explains why network I/O in EC2 is not a VM-exit bottleneck.

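Serverless platforms need fast sandbox startup. AWS Lambda runs functions in Firecracker microVMs, and Firecracker achieves a cold start under 125 ms by eliminating the BIOS, PCI enumeration, and most emulated devices. Other platforms make different isolation choices (Google Cloud Run has used gVisor; Cloudflare Workers uses V8 isolates rather than VMs), but the microVM model is now widely used for multi-tenant FaaS.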

Kubernetes with Kata Containers provides VM-level isolation with container-like ergonomics; each pod runs in its own lightweight VM under a supported VMM such as QEMU, Cloud Hypervisor, or Firecracker; understanding the VirtIO path explains the additional latency compared with native containers.

Confidential computing (AMD SEV-ES/SEV-SNP, Intel TDX) encrypts VM memory and protects CPU register state, preventing the hypervisor from reading guest data; TEE-based workloads (financial, healthcare) require understanding the attestation and threat model.

ML training clusters use device passthrough (VFIO) and SR-IOV to give VMs near-native GPU and RDMA network performance; understanding IOMMU and VF assignment is essential for GPU cluster configuration.
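
On Linux, VFs are usually created by writing to the PF's `sriov_numvfs` sysfs attribute, bounded by `sriov_totalvfs`. The helper below is a minimal sketch of that step; the function name `set_num_vfs` is hypothetical, the device directory is passed in explicitly (on a real host it would be something like the PF's directory under `/sys/bus/pci/devices/`), and writing there requires root and an SR-IOV capable driver.

```python
from pathlib import Path


def set_num_vfs(device_dir, count):
    """Set the number of SR-IOV Virtual Functions for one Physical Function.

    device_dir is the PF's sysfs device directory, which is expected to
    contain the sriov_totalvfs and sriov_numvfs attributes.
    """
    dev = Path(device_dir)
    total = int((dev / "sriov_totalvfs").read_text())
    if not 0 <= count <= total:
        raise ValueError(f"requested {count} VFs, device supports 0..{total}")

    numvfs = dev / "sriov_numvfs"
    # The kernel refuses to change one nonzero VF count to another,
    # so disable existing VFs before enabling the new count.
    if int(numvfs.read_text()) != 0:
        numvfs.write_text("0")
    numvfs.write_text(str(count))
    return count


if __name__ == "__main__":
    import tempfile

    # Demonstrate against a fake sysfs directory rather than real hardware.
    fake = Path(tempfile.mkdtemp())
    (fake / "sriov_totalvfs").write_text("8\n")
    (fake / "sriov_numvfs").write_text("0\n")
    print(set_num_vfs(fake, 4))
```

Creating the VFs is only the first half of assignment: each VF still has to be bound to a passthrough driver and placed behind the IOMMU before a guest can use it safely.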


File Map

File                               Description
01-virtualization-fundamentals.md  Popek-Goldberg criteria, VMM types, isolation goals
02-full-vs-paravirt.md             Binary translation, Xen PV vs HVM, hypercall interface
03-hardware-virt-vtx-amdv.md       VMX root/non-root, VMCS layout, VM entry/exit flow
04-kvm-architecture.md             KVM as Linux module, /dev/kvm API, vCPU ioctl loop
05-qemu-internals.md               QEMU machine model, device emulation, QMP monitor
06-xen-hypervisor.md               dom0/domU, PV vs HVM, grant tables, event channels
07-vmware-internals.md             VMware ESXi VMkernel, vSphere, vMotion architecture
08-hyper-v.md                      Hyper-V partitions, VSM, Generation 2 VMs, Azure Mezzanine
09-memory-virtualization-ept.md    Shadow page tables (pre-EPT), EPT/NPT, VPID, TLB cost
10-io-virtualization-virtio.md     VirtIO spec, split virtqueue, vhost-net, vhost-user
11-sr-iov.md                       PF/VF model, IOMMU mapping, guest NIC performance
12-live-migration.md               Pre-copy algorithm, dirty bitmap, storage migration
13-snapshots.md                    COW snapshot disk, memory snapshot, checkpoint restore
14-nested-virtualization.md        L0/L1/L2, VMCS shadowing, performance overhead
15-microvms-firecracker.md         Firecracker VMM, jailer, minimal device model, startup time
16-confidential-computing.md       AMD SEV/SEV-SNP, Intel TDX, attestation, threat model
17-container-vs-vm.md              Namespace/cgroup vs full VM, Kata Containers, gVisor
18-aws-nitro-system.md             Nitro card, NitroTPM, offloaded EBS/networking

Cross-References

  • Section 02 (CPU Architecture): VT-x/AMD-V as CPU features; APIC virtualization; cache and TLB in virtualized context
  • Section 11 (Memory Management): EPT as a second stage of paging; balloon driver; memory overcommit and KSM in hypervisors
  • Section 14 (Device Drivers): VFIO driver for passthrough; VirtIO driver on guest side; virtio-blk/net driver structure
  • Section 15 (Networking): virtio-net, vhost-net, OVS-DPDK, SR-IOV NIC in VM context
  • Section 17 (Distributed Systems): live migration as a distributed systems problem; cluster schedulers over VMs
  • Section 18 (Database Internals): storage performance implications of VirtIO vs passthrough; fsync through virtualized storage