02 — KVM Architecture
Prerequisites
- Virtualization fundamentals: hypervisor types, Popek–Goldberg, VMX root/non-root modes
- Linux kernel internals: kernel modules, process scheduler, memory management
- x86 architecture: privilege rings, control registers (CR0, CR3, CR4), segment registers, MSRs
- Intel VT-x basics: VMCS concept, VMENTER/VMEXIT terminology
Historical Context
KVM (Kernel-based Virtual Machine) was written by Avi Kivity at Qumranet (later acquired by Red Hat in 2008) and merged into the Linux kernel 2.6.20 in February 2007. The core insight was elegant: rather than building a new OS-like hypervisor from scratch (as VMware and Xen did), why not turn the existing Linux kernel — with its mature scheduler, memory manager, device driver ecosystem, and security infrastructure — into a hypervisor by adding VMX support?
This design decision made KVM radically simpler than its competitors. The initial KVM patch was approximately 10,000 lines of code. Xen at the time was ~250,000 lines. KVM leveraged Linux for everything it could and only added the thin VMX control layer.
QEMU, originally a pure software emulator by Fabrice Bellard (2003), became KVM's device emulation partner. The QEMU/KVM combination became the dominant open-source virtualization stack and now underpins a large share of the world's public cloud infrastructure.
KVM Design Philosophy: The "Type 1.5" Hypervisor
KVM blurs the Type 1 / Type 2 distinction. Strictly:
- Not Type 2: KVM uses hardware VMX root mode — it does not run "on top of" Linux in the guest sense; the hypervisor itself operates at Ring 0 in VMX root mode.
- Not pure Type 1: Linux is still the OS managing hardware; KVM is a kernel module, not a standalone hypervisor.
The term "Type 1.5" captures this: Linux becomes the privileged management domain (analogous to Xen's Dom0), and KVM provides the VMX control infrastructure.
KVM Architecture:
+------------------+ +------------------+
| Guest VM 1 | | Guest VM 2 |
| (VMX non-root) | | (VMX non-root) |
+------------------+ +------------------+
| |
| VMEXIT | VMEXIT
v v
+---------------------------------------+
| KVM Kernel Module |
| (kvm.ko + kvm-intel.ko/kvm-amd.ko) |
| |
| VMCS mgmt | EPT mgmt | vCPU sched |
+---------------------------------------+
| Linux Kernel |
| Scheduler | MM | VFS | Net | etc. |
+---------------------------------------+
| Physical Hardware |
| CPU (VMX) | RAM | NIC | Disk |
+---------------------------------------+
KVM Components
Kernel Modules
KVM consists of two layers of kernel modules:
kvm.ko — Architecture-independent core:
- VM lifecycle management
- vCPU creation and scheduling
- Memory slot management
- IRQ routing and virtualization
- /dev/kvm device file creation
kvm-intel.ko / kvm-amd.ko — Architecture-specific backends:
- Intel: VMX initialization, VMCS allocation and management, VMLAUNCH/VMRESUME
- AMD: SVM initialization, VMCB management, VMRUN
- EPT (Intel) / NPT (AMD) setup
- MSR bitmap configuration
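As a rough illustration of how the two backends divide the work, the sketch below picks a backend module name from the CPUID feature bits that advertise VMX (leaf 1, ECX bit 5) and SVM (leaf 0x80000001, ECX bit 2). The helper name kvm_backend_for and the macro names are hypothetical and exist only for this example; the kernel's own module loading logic is more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* CPUID feature bits that advertise hardware virtualization support.
 * VMX: CPUID leaf 1, ECX bit 5.  SVM: CPUID leaf 0x80000001, ECX bit 2. */
#define CPUID_1_ECX_VMX        (1u << 5)
#define CPUID_80000001_ECX_SVM (1u << 2)

/* Hypothetical helper: name the backend module matching the feature bits.
 * Returns NULL when neither extension is advertised. */
static const char *kvm_backend_for(uint32_t leaf1_ecx, uint32_t ext_leaf1_ecx)
{
    if (leaf1_ecx & CPUID_1_ECX_VMX)
        return "kvm-intel.ko";
    if (ext_leaf1_ecx & CPUID_80000001_ECX_SVM)
        return "kvm-amd.ko";
    return NULL;
}
```

On x86 with GCC or Clang, the two ECX values could be obtained with __get_cpuid() from <cpuid.h> and passed to this helper.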
/dev/kvm Interface
KVM exposes a device file at /dev/kvm through which userspace (QEMU or any VMM) communicates via ioctl() calls. The interface has three levels:
- System-level ioctls on the /dev/kvm fd: KVM_GET_API_VERSION, KVM_CREATE_VM, KVM_CHECK_EXTENSION
- VM-level ioctls on the VM fd returned by KVM_CREATE_VM: KVM_CREATE_VCPU, KVM_SET_USER_MEMORY_REGION, KVM_CREATE_IRQCHIP, KVM_IRQFD
- vCPU-level ioctls on the vCPU fd: KVM_RUN, KVM_GET_REGS, KVM_SET_REGS, KVM_GET_SREGS, KVM_SET_SREGS, KVM_GET_MSRS, KVM_TRANSLATE
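The system level is the easiest one to try in isolation. The sketch below opens /dev/kvm and issues KVM_GET_API_VERSION, which returns 12 on any current kernel. The helper name kvm_api_version is an assumption made for this example, and the code assumes a Linux host with the kernel UAPI headers installed.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: returns the KVM API version, or -1 if /dev/kvm
 * cannot be opened or queried (no KVM, or insufficient permissions). */
static int kvm_api_version(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);
    if (kvm_fd < 0)
        return -1;

    /* System-level ioctl: issued on the /dev/kvm fd itself. */
    int version = ioctl(kvm_fd, KVM_GET_API_VERSION, 0);
    close(kvm_fd);
    return version;
}
```

A VMM would typically refuse to continue unless the result equals KVM_API_VERSION, and only then move on to KVM_CREATE_VM.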
VMCS — Virtual Machine Control Structure
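The VMCS is a per-vCPU Intel data structure (up to 4 KB) that lives in memory. Its layout is opaque to software: the hypervisor reads and writes its fields with the VMREAD/VMWRITE instructions, and the CPU consults it on every VM entry and VM exit. It is the central data structure for VMX operation.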
VMCS Structure
VMCS Layout (per vCPU):
+------------------------------------------+
| VMCS Header |
| (revision ID, abort indicator) |
+------------------------------------------+
| Guest State Area |
| CR0, CR3, CR4, DR7 |
| RSP, RIP, RFLAGS |
| CS, SS, DS, ES, FS, GS, LDTR, TR |
| GDTR, IDTR base+limit |
| IA32_DEBUGCTL MSR |
| IA32_SYSENTER_CS/ESP/EIP MSR |
| Activity state (active/HLT/shutdown) |
| Interruptibility state |
+------------------------------------------+
| Host State Area |
| CR0, CR3, CR4 |
| RSP, RIP (VMEXIT entry point) |
| CS, SS, DS, ES, FS, GS, TR |
| GDTR, IDTR base |
| IA32_SYSENTER_CS/ESP/EIP MSR |
+------------------------------------------+
| VM-Execution Control Fields |
| Pin-based controls |
| (external interrupt exiting, |
| NMI exiting, VMX preemption timer) |
| Primary processor-based controls |
| (HLT exiting, MWAIT exiting, |
| RDPMC exiting, CR3-load exiting, |
| use MSR bitmaps, use TPR shadow) |
| Secondary processor-based controls |
| (enable EPT, enable RDTSCP, |
| unrestricted guest, VPID, XSAVES) |
| Exception bitmap (trap these exceptions)|
| I/O bitmap A & B (ports 0-7FFF, 8000-FFFF) |
| MSR bitmap (RDMSR/WRMSR intercept) |
| EPT pointer (PML4 of EPT) |
| VPID (virtual processor ID) |
| TSC offset |
+------------------------------------------+
| VM-Exit Control Fields |
| VM-exit controls |
| (save/load IA32_EFER on exit) |
| VM-exit MSR-store/load count + addr |
+------------------------------------------+
| VM-Entry Control Fields |
| VM-entry controls |
| (load IA32_EFER, IA32_PAT on entry) |
| VM-entry MSR-load count + addr |
| VM-entry interruption information |
| VM-entry exception error code |
| VM-entry instruction length |
+------------------------------------------+
| VM-Exit Information Fields |
| Exit reason |
| Exit qualification |
| VM-exit interruption information |
| IDT-vectoring information |
| VM-exit instruction information |
| Guest linear address |
| Guest physical address (EPT violation) |
+------------------------------------------+
Key VMCS Fields
| Field | Direction | Purpose |
|---|---|---|
| Guest CR3 | Guest State | Guest page table root |
| Guest RSP / RIP | Guest State | Guest stack/instruction pointer |
| Guest RFLAGS | Guest State | Guest flags (incl. IF) |
| Host CR3 | Host State | Hypervisor page table root |
| Host RSP | Host State | Hypervisor stack on VMEXIT |
| Host RIP | Host State | VMEXIT handler entry point |
| Exception bitmap | Exec control | Which exceptions cause VMEXIT |
| I/O bitmap | Exec control | Which I/O ports cause VMEXIT |
| MSR bitmap | Exec control | Which MSR accesses cause VMEXIT |
| EPT pointer | Exec control | Points to EPT PML4 |
| Exit reason | Exit info | Why VMEXIT occurred |
| Exit qualification | Exit info | Additional exit context |
| Guest physical addr | Exit info | GPA on EPT violation |
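To make one of these fields concrete, the sketch below builds an EPT pointer value from the physical address of the EPT PML4 table. It assumes the layout described in the Intel SDM (memory type in bits 2:0, page-walk length minus one in bits 5:3, accessed/dirty enable in bit 6, table address from bit 12 upward). The function and macro names are hypothetical, and real code would first check the capabilities the CPU reports before setting these bits.

```c
#include <stdint.h>

#define EPT_MEMTYPE_WB     6ull          /* bits 2:0: write-back memory type */
#define EPT_WALK_LENGTH_4  (3ull << 3)   /* bits 5:3: page-walk length minus one */
#define EPT_ACCESS_DIRTY   (1ull << 6)   /* bit 6: enable accessed/dirty flags */

/* Hypothetical helper: compose an EPTP for a 4-level EPT hierarchy. */
static uint64_t make_eptp(uint64_t pml4_phys, int enable_ad)
{
    uint64_t eptp = pml4_phys & ~0xFFFull;   /* table must be 4 KB aligned */
    eptp |= EPT_MEMTYPE_WB | EPT_WALK_LENGTH_4;
    if (enable_ad)
        eptp |= EPT_ACCESS_DIRTY;
    return eptp;
}
```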
VMENTER / VMEXIT Lifecycle
VMENTER / VMEXIT Lifecycle:
QEMU/VMM (userspace)
|
| ioctl(vcpu_fd, KVM_RUN, ...)
|
v
KVM (kernel, vmx root mode)
|
| VMLAUNCH (first entry) / VMRESUME (subsequent)
| Load guest state from VMCS Guest State Area
| Switch CR3 to guest CR3 (or EPT handles it)
|
v
Guest (vmx non-root mode)
|
| Guest OS executes normally
| System calls: handled by guest OS (not hypervisor)
|
| <-- sensitive instruction: I/O port, CPUID, HLT, etc.
|
v
VMEXIT (hardware-triggered)
|
| CPU saves guest registers to VMCS Guest State Area
| CPU loads host registers from VMCS Host State Area
| CPU jumps to Host RIP (KVM's vmexit_handler)
|
v
KVM vmexit handler (vmx root mode)
|
| Read exit_reason from VMCS
| Dispatch to handler based on reason
|
+-- I/O port access? --> emulate in KVM or signal QEMU
+-- MMIO? --> signal QEMU (KVM_EXIT_MMIO)
+-- EPT violation? --> map the page, resume
+-- CPUID? --> synthesize result, update RIP
+-- HLT? --> schedule another vCPU
+-- MSR read/write? --> emulate MSR, update regs
+-- External interrupt? --> host handles the IRQ, resume guest
|
| If needs QEMU: return from ioctl(KVM_RUN) to QEMU
| If handled: VMRESUME
|
v
Back to guest (VMX non-root)
Common VMEXIT Reasons
| Exit Reason | Code | Trigger | Handler |
|---|---|---|---|
| Exception/NMI | 0 | Guest exception in exception bitmap | Inject or handle |
| External interrupt | 1 | Host IRQ arrived | Run host IRQ handler, then resume guest |
| CPUID | 10 | Guest executes CPUID | Synthesize CPUID result |
| HLT | 12 | Guest executes HLT | Block vCPU, reschedule |
| INVLPG | 14 | Guest invalidates TLB entry | May need EPT invalidation |
| RDMSR | 31 | Guest reads MSR in bitmap | Emulate MSR read |
| WRMSR | 32 | Guest writes MSR in bitmap | Emulate MSR write |
| VM-entry failure | 33 | Invalid VMCS on entry | Debugging |
| EPT violation | 48 | Guest GPA not in EPT | Map page in EPT |
| EPT misconfiguration | 49 | Malformed EPT entry | Usually bug |
| RDTSCP | 51 | Guest reads RDTSCP | Apply TSC offset |
| I/O instruction | 30 | Guest does IN/OUT in bitmap | Emulate I/O device |
| INTERRUPT_WINDOW | 7 | Pending interrupt deliverable | Inject pending IRQ |
| PAUSE | 40 | Guest executes PAUSE (spinlock) | Yield vCPU |
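When reading traces it helps to turn the raw exit reason into a name. The sketch below covers only the codes listed in the table above and masks the value to its low 16 bits, since the basic exit reason occupies bits 15:0 of the exit reason field. The function name vmx_exit_name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: map a raw VMX exit reason to the names used in the
 * table above. Only bits 15:0 carry the basic exit reason. */
static const char *vmx_exit_name(uint32_t exit_reason)
{
    switch (exit_reason & 0xFFFF) {
    case 0:  return "Exception/NMI";
    case 1:  return "External interrupt";
    case 7:  return "INTERRUPT_WINDOW";
    case 10: return "CPUID";
    case 12: return "HLT";
    case 14: return "INVLPG";
    case 30: return "I/O instruction";
    case 31: return "RDMSR";
    case 32: return "WRMSR";
    case 33: return "VM-entry failure";
    case 40: return "PAUSE";
    case 48: return "EPT violation";
    case 49: return "EPT misconfiguration";
    case 51: return "RDTSCP";
    default: return "unknown";
    }
}
```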
KVM ioctl Interface — Userspace API
A minimal VMM using KVM directly (simplified):
// 1. Open /dev/kvm
int kvm_fd = open("/dev/kvm", O_RDWR);
// 2. Create a VM
int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
// 3. Set up memory (map host memory into guest physical address space)
struct kvm_userspace_memory_region region = {
.slot = 0,
.flags = 0,
.guest_phys_addr = 0x0,
.memory_size = 1 << 30, // 1 GB
.userspace_addr = (uint64_t)guest_mem,
};
ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
// 4. Create vCPU
int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
// 5. Map kvm_run struct (shared memory for VMEXIT communication)
int mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ|PROT_WRITE,
MAP_SHARED, vcpu_fd, 0);
// 6. Set initial registers
struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
ioctl(vcpu_fd, KVM_SET_REGS, &regs);
// 7. Run the vCPU
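int running = 1;
while (running) {
    // Blocks until an exit that userspace must handle
    if (ioctl(vcpu_fd, KVM_RUN, 0) < 0) {
        perror("KVM_RUN");
        break;
    }
    switch (run->exit_reason) {
    case KVM_EXIT_HLT:  running = 0; break;          // guest halted: stop
    case KVM_EXIT_IO:   handle_io(run); break;
    case KVM_EXIT_MMIO: handle_mmio(run); break;
    default:
        fprintf(stderr, "unhandled exit %d\n", run->exit_reason);
        running = 0;                                 // do not loop on unknown exits
    }
}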
The kvm_run structure (mapped via mmap) serves as shared memory between the kernel KVM module and userspace QEMU, avoiding an extra copy for communicating VMEXIT details.
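The snippet above calls handle_io() without showing it. A minimal sketch of what such a handler might look like follows, modelled on the approach in the LWN article listed in the references: for a port write, the data sits inside the shared kvm_run mapping at run->io.data_offset. The function name copy_serial_output and the choice of port 0x3f8 (COM1) are assumptions made for this example.

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SERIAL_PORT 0x3f8   /* assumption: the guest writes to COM1 */

/* Hypothetical handler: copy bytes the guest wrote to the serial port into
 * out (at most cap bytes). Returns the number of bytes copied, or 0 when the
 * exit is not an OUT to SERIAL_PORT. */
static size_t copy_serial_output(const struct kvm_run *run, char *out, size_t cap)
{
    if (run->exit_reason != KVM_EXIT_IO ||
        run->io.direction != KVM_EXIT_IO_OUT ||
        run->io.port != SERIAL_PORT)
        return 0;

    /* The payload lives in the same mmap'ed region as kvm_run itself. */
    size_t len = (size_t)run->io.size * run->io.count;
    if (len > cap)
        len = cap;
    memcpy(out, (const uint8_t *)run + run->io.data_offset, len);
    return len;
}
```

Because the data is read straight from the shared mapping, no additional ioctl is needed to fetch what the guest wrote.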
KVM vCPU Scheduling
Each vCPU is an ordinary thread of the userspace VMM process (QEMU creates one thread per vCPU); the thread enters guest mode whenever it calls ioctl(KVM_RUN). The Linux scheduler (CFS) schedules vCPU threads like any other thread. Key implications:
- vCPU threads can be preempted by the host scheduler; the guest sees this as "steal time"
- KVM exposes steal time to the guest through the paravirtual MSR_KVM_STEAL_TIME interface, and Linux guests report it in the steal column of /proc/stat
- CPU pinning via taskset or cgroups cpuset reduces jitter for latency-sensitive VMs
- NUMA affinity: vCPU threads and guest memory should be on the same NUMA node
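Inside a Linux guest, steal time shows up as the eighth numeric field of the aggregate cpu line in /proc/stat. The sketch below parses that field from a line of text; the function name parse_steal_ticks is hypothetical, and it expects the aggregate "cpu" line rather than the per-core "cpu0", "cpu1" lines.

```c
#include <stdio.h>

/* Hypothetical helper: extract the steal counter (in clock ticks) from the
 * aggregate "cpu" line of /proc/stat.
 * Field order: user nice system idle iowait irq softirq steal ...
 * Returns 0 on success, -1 if the line does not match. */
static int parse_steal_ticks(const char *cpu_line, unsigned long long *steal)
{
    unsigned long long user, nice_, sys, idle, iowait, irq, softirq;
    int n = sscanf(cpu_line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &user, &nice_, &sys, &idle, &iowait, &irq, &softirq, steal);
    return n == 8 ? 0 : -1;
}
```

Sampling this value twice and dividing the difference by the elapsed ticks gives the fraction of time the vCPUs were runnable but not scheduled by the host.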
vCPU Thread Lifecycle:
+----------+ +---------+ +-----------+
| RUNNABLE | ---> | RUNNING | ---> | VMENTER |
| | | (host) | | (in guest)|
+----------+ +---------+ +-----------+
^ |
| VMEXIT |
+-------- KVM handles ------------ +
(reschedule if HLT)
Performance Implications
VMEXIT frequency is the primary KVM performance knob. Each VMEXIT costs ~1–5 μs round-trip. A guest doing 100,000 I/O port accesses per second loses 100–500 ms/second in VMEXIT overhead alone.
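The arithmetic behind that estimate is simple enough to write down. The sketch below multiplies the exit rate by the per-exit cost and converts microseconds to milliseconds; the function name is hypothetical, and the 1–5 μs cost is the rough figure quoted above rather than a measured value.

```c
/* Hypothetical helper: time lost to VMEXIT handling, in milliseconds per
 * second of wall-clock time, for a given exit rate and per-exit cost. */
static double vmexit_overhead_ms_per_sec(double exits_per_sec, double cost_us)
{
    return exits_per_sec * cost_us / 1000.0;   /* microseconds to milliseconds */
}
```

With 100,000 exits per second this yields 100 ms at 1 μs per exit and 500 ms at 5 μs per exit, or 10% to 50% of one core.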
Mitigation strategies:
- MSR bitmaps: whitelist frequently-read MSRs (like TSC) to avoid VMEXITs on every read
- I/O bitmaps: only trap specific I/O ports used by emulated devices; allow others to pass through
- APICv (APIC virtualization): virtualizes the LAPIC in hardware, eliminating VMEXITs for most APIC MMIO accesses
- Posted interrupts: hardware delivers interrupts directly to guest vCPU without VMEXIT
- halt_poll_ns: KVM spins (busy-waits) for a configurable time before blocking a HLT-ted vCPU. Trades CPU for lower wake latency. Tunable at /sys/module/kvm/parameters/halt_poll_ns.
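The MSR bitmap item can be illustrated with a small sketch. It assumes the 4 KB layout described in the Intel SDM: read bits for MSRs 0x0–0x1FFF at offset 0, read bits for 0xC0000000–0xC0001FFF at offset 1024, and the corresponding write bits at offsets 2048 and 3072. A set bit means the access causes a VMEXIT. The function name is hypothetical and this is not KVM's own helper.

```c
#include <stdint.h>

#define MSR_BITMAP_SIZE 4096   /* one 4 KB page */

/* Hypothetical helper: set or clear the intercept bit for one MSR.
 * Returns 0 on success, -1 if the MSR lies outside both covered ranges
 * (such MSRs cannot be passed through via the bitmap). */
static int msr_bitmap_set_intercept(uint8_t *bitmap, uint32_t msr,
                                    int is_write, int intercept)
{
    uint32_t offset = is_write ? 2048 : 0;

    if (msr >= 0xC0000000u && msr <= 0xC0001FFFu) {
        offset += 1024;              /* high-range half of the bitmap */
        msr -= 0xC0000000u;
    } else if (msr > 0x1FFFu) {
        return -1;
    }

    uint8_t mask = (uint8_t)(1u << (msr % 8));
    if (intercept)
        bitmap[offset + msr / 8] |= mask;
    else
        bitmap[offset + msr / 8] &= (uint8_t)~mask;
    return 0;
}
```

Clearing the read bit for MSR 0x10 (the time-stamp counter) lets the guest read it without an exit, while writes can remain intercepted.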
Security Implications
- VMCS poisoning: an attacker with kernel access could modify VMCS fields to escape the VM. VMCS is in host-physical memory; KVM must validate all fields before VMENTER.
- L1TF (CVE-2018-3646): EPT entries marked "not present" but holding a non-zero physical address could be speculatively accessed, leaking host memory. Mitigation: flush the L1 data cache on VMENTER (via the IA32_FLUSH_CMD MSR; the VERW instruction addresses the related MDS/TAA issues), or use EPT paging-structure entries with PA=0 for non-present entries.
- Spectre v2 across VM boundary: attacker VM poisons branch predictor, causing hypervisor or other VMs to speculatively execute attacker-chosen code paths. Mitigation: IBRS, eIBRS, retpoline in KVM.
- KVM kernel bugs: since KVM runs in Ring 0, bugs are kernel vulnerabilities. See CVE-2021-22543 (KVM use-after-free).
- Device emulation bugs in QEMU: QEMU runs as a userspace process, limiting blast radius. A QEMU compromise must be chained with a kernel or KVM bug to achieve full host privilege escalation, which provides defense in depth.
Debugging Notes
# Enable KVM debug tracing
echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
# Watch VMEXIT reasons live
trace-cmd record -e 'kvm:kvm_exit' -e 'kvm:kvm_entry' sleep 5
trace-cmd report | head -100
# Per-VM VMEXIT stats
ls /sys/kernel/debug/kvm/
# e.g., /sys/kernel/debug/kvm/1234-0/exits (KVM_STAT_VM or per vcpu)
# Check KVM module parameters
cat /sys/module/kvm_intel/parameters/nested
cat /sys/module/kvm/parameters/halt_poll_ns
# Inspect VMCS (requires kernel debug build)
# Use VMREAD instruction via kvm-unit-tests or crash utility
# QEMU monitor: info kvm
(qemu) info kvm
KVM support: enabled
Failure Modes
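- VMLAUNCH failure: if VMCS fields are invalid (e.g., host CR3 not page-aligned), VMLAUNCH fails and signals this through RFLAGS (CF or ZF set), with the specific error number stored in the VM-instruction error field of the VMCS. KVM logs this as a fatal error.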
- vCPU lockup: guest spinning in a tight loop without a HLT/PAUSE — consumes 100% of a host CPU core. Watchdog timers (NMI watchdog) can detect and forcibly VMEXIT such guests.
- Memory overcommit OOM: host exhausts physical memory; KVM cannot fulfill EPT faults. Guest experiences stalls as KVM tries to reclaim memory via balloon or swap. Worst case: OOM killer terminates QEMU process, VM disappears.
- KSM deduplication delay: KSM merges pages asynchronously; during high-write workloads, CoW faults on merged pages add latency spikes.
Modern Usage and Future Directions
KVM is the hypervisor engine for the majority of the public cloud:
- AWS Nitro (since 2017): KVM core with Nitro hypervisor wrapper; networking/storage offloaded to Nitro cards
- Google Compute Engine: KVM with custom live migration and overcommit
- OpenStack: KVM + QEMU as default compute driver
- Firecracker (AWS): uses KVM ioctls directly without QEMU for Lambda/Fargate
Future: AMD SEV-SNP and Intel TDX add memory encryption at the KVM/hardware boundary. KVM is being extended to support these Trusted Execution Environments (TEEs) where even the hypervisor cannot read guest memory. KVM has had basic SEV support, driven through the KVM_MEMORY_ENCRYPT_OP ioctl, since Linux 4.16.
Exercises
- Write a 50-line C program that uses /dev/kvm to run a single x86 instruction (HLT) in a VM and detect the KVM_EXIT_HLT exit. Reference: "Using the KVM API" (LWN 2015).
- Boot a KVM VM with QEMU and run perf kvm stat to observe VMEXIT reason counts. Identify the top 3 exit reasons and explain what triggers each.
- Set halt_poll_ns=200000 and measure VM-to-VM ping latency vs the default. Explain the tradeoff.
- Examine the KVM source for vmx_handle_exit() in arch/x86/kvm/vmx/vmx.c. Map each exit reason to its handler function.
- Explain why a guest executing a SYSCALL instruction does not cause a VMEXIT. What happens instead?
References
- Kivity, A. et al. (2007). "KVM: the Linux Virtual Machine Monitor." Ottawa Linux Symposium 2007.
- Intel. Intel 64 and IA-32 Architectures SDM, Vol. 3C, Chapters 23–33: VMX.
- Corbet, J. (2015). "Using the KVM API." LWN.net. https://lwn.net/Articles/658511/
- Linux kernel source: arch/x86/kvm/, virt/kvm/
- Dall, C. & Nieh, J. (2014). "KVM/ARM: The Design and Implementation of the Linux ARM Hypervisor." ASPLOS 2014.
- Intel. "Intel Virtualization Technology for IA-32, Intel 64, and Intel Architecture (Intel VT-x)." Technical White Paper.