07 — Live Migration

Prerequisites

KVM architecture: VMCS, vCPU state, memory slots, EPT
Memory virtualization: EPT, dirty page tracking, balloon driver
Linux: memory mapped files, dirty page bitmaps, QEMU migration protocol
Networking: shared storage (NFS, Ceph, iSCSI), Ethernet (for migration network)

Historical Context

Live migration — moving a running virtual machine from one physical host to another with minimal service interruption — was first demonstrated on Xen by Clark et al. in the landmark paper "Live Migration of Virtual Machines" at USENIX NSDI 2005. The paper demonstrated migrations with downtimes of 60–300 ms for a 512 MB VM over a 1 Gbps network, while the VM was actively serving web requests.

The ability to live-migrate VMs without disrupting workloads was transformative for data center operations. It enabled:

Hardware maintenance without downtime: drain a physical host by migrating all VMs away, perform BIOS updates / hardware replacement, then migrate back
Load balancing: move VMs from overloaded hosts to underloaded ones
Energy savings: consolidate VMs onto fewer hosts at night, power off empty hosts
Fault tolerance: migrate VMs away from hosts showing early failure signals (ECC errors, temperature warnings)

VMware's vMotion (introduced with VirtualCenter 1.0 in 2003) was the first commercial live migration product, predating the NSDI paper, but it was proprietary. Today live migration is a fundamental feature of every production hypervisor.

Google Compute Engine performs live migration transparently as routine maintenance — tens of thousands of VM migrations occur daily without customer awareness, enabling Google to update host firmware and kernel without any scheduled maintenance windows.

The Live Migration Problem

A running VM has state distributed across:

CPU state: vCPU registers (general-purpose, FP/SSE/AVX, control registers, MSRs, APIC state)
Memory: all guest physical pages mapped via EPT (potentially gigabytes)
Device state: NIC TX/RX buffers, disk queues, timers, interrupt state
Network connections: existing TCP connections must survive the move; the VM keeps its IP and MAC address, and the network is updated to deliver its traffic to the new host

The challenge: memory is too large to transfer atomically. A 32 GB VM over a 10 Gbps link takes ~26 seconds to copy. During this time the VM continues running and dirtying pages. We need to transfer a consistent snapshot of a running, modifying system.
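The estimate above is simple arithmetic, and a small helper makes it easy to repeat for other sizes. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it ignores protocol overhead and treats 1 GB as 8 Gb, as the text does.

```python
def copy_seconds(ram_gb: float, link_gbps: float) -> float:
    """Seconds to push ram_gb of memory over a link_gbps link (no overhead)."""
    bits_in_gb = ram_gb * 8          # gigabytes -> gigabits
    return bits_in_gb / link_gbps    # gigabits / (gigabits per second)


print(copy_seconds(32, 10))  # 25.6
```

Anything the guest writes during those seconds has to be sent again, which is what the algorithms below are designed to handle.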

Pre-Copy Migration (Standard Algorithm)

Pre-copy is the dominant algorithm, used by Xen, KVM/QEMU, VMware vMotion, and Hyper-V.

Overview

Transfer memory iteratively while the VM runs. Pages dirtied after transfer are retransferred. Stop when the remaining dirty set is small enough to transfer in a short stop-and-copy phase.

Phases

Pre-Copy Migration Timeline:

Host A (Source)                Host B (Destination)
   |                                  |
   |  Phase 1: Enable dirty tracking  |
   |  KVM_MEM_LOG_DIRTY_PAGES set     |
   |  Snapshot: all pages "dirty"     |
   |                                  |
   |  Phase 2: First pass (bulk copy) |
   |  Transfer ALL pages to dest ---->|-----> Write to dest RAM
   |  VM continues running            |
   |  Some pages get re-dirtied       |
   |                                  |
   |  Phase 3: Iterative passes       |
   |  Fetch dirty bitmap              |
   |  Transfer re-dirtied pages ----->|-----> Overwrite on dest
   |  VM continues running            |
   |  Repeat while dirty_count > threshold and
   |  iteration_count < max           |
   |                                  |
   |  Phase 4: Stop-and-copy          |
   |  PAUSE VM on Host A              |
   |  Transfer final dirty pages ---->|-----> Final overwrite
   |  Transfer CPU state          --->|-----> Load vCPU state
   |  Transfer device state       --->|-----> Restore devices
   |                                  |
   |  Phase 5: Resume on Dest         |
   |                    VM resumes <--|
   |  Host A: destroy VM              |
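The control flow of the timeline can be sketched as a toy simulation. This is not QEMU code: the function name, the `redirty_ratio` parameter (writes per page sent in a pass) and the random write pattern are all assumptions chosen only to show the shape of the loop.

```python
import random


def simulate_precopy(num_pages, redirty_ratio, threshold, max_passes, seed=0):
    """Toy pre-copy loop; returns (passes, pages_sent_live, pages_sent_paused)."""
    rng = random.Random(seed)
    dirty = set(range(num_pages))        # Phase 1: every page starts "dirty"
    passes = 0
    sent_live = 0
    # Phases 2-3: keep copying while the dirty set is too big to stop on
    while len(dirty) > threshold and passes < max_passes:
        sent_live += len(dirty)          # copy the current dirty set
        # The guest keeps running and writes to some pages in the meantime.
        writes = int(len(dirty) * redirty_ratio)
        dirty = {rng.randrange(num_pages) for _ in range(writes)}
        passes += 1
    # Phase 4: whatever is still dirty is copied while the VM is paused.
    return passes, sent_live, len(dirty)
```

With `redirty_ratio` below 1 the dirty set shrinks every pass and the loop exits on the threshold; with a ratio above 1 it exits only because `max_passes` is reached, leaving a large set for the paused phase.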

Dirty Page Tracking in KVM

KVM tracks dirty pages at guest-physical-page granularity; the classic interface is a dirty bitmap with one bit per page. Three mechanisms are relevant:

Dirty bitmap (KVM_MEM_LOG_DIRTY_PAGES, the original mechanism): userspace sets the KVM_MEM_LOG_DIRTY_PAGES flag on a memory slot, and KVM write-protects the slot's EPT entries. Any guest write → EPT violation → VMEXIT → KVM sets the corresponding dirty bit and restores write access → VMRESUME. QEMU retrieves the bitmap with the KVM_GET_DIRTY_LOG ioctl, which returns the bitmap, clears it, and write-protects the reported pages again.

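KVM_CAP_DIRTY_LOG_RING (Linux 5.11, 2021): ring-buffer-based dirty tracking. KVM pushes dirty GFN (guest frame number) entries into a per-vCPU ring mapped into userspace rather than setting bits in a bitmap. QEMU harvests only the entries that were produced instead of scanning a bitmap that covers all of guest memory, which lowers collection cost for large VMs with small dirty sets. A vCPU whose ring fills up exits to userspace until its entries are harvested.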

EPT Accessed/Dirty (A/D) bits (Haswell, 2013): if EPT A/D bits are supported, hardware sets the dirty bit in the EPT entry on a guest write without causing a VMEXIT. Page Modification Logging (PML, introduced with Broadwell) builds on A/D bits by having hardware log dirtied guest physical addresses into a buffer that KVM drains. Together these avoid most EPT write-protection faults during migration.
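To make "one bit per guest physical page" concrete, the sketch below decodes a dirty bitmap into guest frame numbers. It assumes the x86 little-endian layout in which bit i of byte j stands for page 8*j + i of the memory slot; the function name and the `base_gfn` parameter are illustrative, not part of any KVM or QEMU API.

```python
def dirty_gfns(bitmap: bytes, base_gfn: int = 0):
    """Yield the guest frame numbers whose bit is set in a dirty bitmap."""
    for byte_index, byte in enumerate(bitmap):
        if byte == 0:
            continue                      # fast path: 8 clean pages
        for bit in range(8):
            if byte & (1 << bit):
                yield base_gfn + byte_index * 8 + bit


# Pages 0 and 2 (first byte) and page 15 (top bit of the second byte) are dirty.
print(list(dirty_gfns(bytes([0b00000101, 0x80]))))  # [0, 2, 15]
```

Skipping all-zero bytes mirrors the reason bitmap scanning is cheap when few pages are dirty but still costs time proportional to total guest memory.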

Dirty Page Iteration Detail

Iteration convergence condition:

Pass 1: Transfer N pages (entire RAM)
        Dirty rate: D pages/second
        Transfer rate: T pages/second

If D < T:  dirty set shrinks each pass → convergence possible
If D >= T: dirty set grows each pass → pre-copy will not converge!
           (thrashing workload like Redis with heavy writes)
           → fall back to longer stop-and-copy

Pass k:
  dirty_k = pages re-dirtied during transfer of dirty_{k-1}
  dirty_k ≈ dirty_{k-1} × (D/T)

  If D/T < 1.0: dirty set converges → stop when dirty_k < threshold

Example:
  VM: 8 GB RAM, 10 Gbps migration link
  Transfer rate: T = 1 GB/s = 256K pages/s
  Dirty rate: D = 10K pages/s (typical database)

  Pass 1: transfer 2M pages (8 GB), ~8 sec
          dirty during pass: 10K × 8 = 80K pages (~3.9% of RAM)
  Pass 2: transfer 80K pages, ~0.3 sec
          dirty during pass: 10K × 0.3 = 3K pages
  Pass 3: transfer 3K pages, ~0.01 sec
          dirty during pass: ~100 pages
  Stop-and-copy: pause VM, copy ~100 pages + CPU state
          Downtime: ~10ms
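The same model can be run for other dirty and transfer rates. The sketch below applies the recurrence dirty_k = D × (dirty_{k-1} / T) until the dirty set drops below a threshold; the function name and the threshold of 200 pages are assumptions for illustration, not QEMU defaults.

```python
def precopy_passes(total_pages, dirty_rate, transfer_rate, threshold):
    """Return (sizes of each live pass in pages, pages left for stop-and-copy)."""
    if dirty_rate >= transfer_rate:
        raise ValueError("D >= T: the dirty set never shrinks")
    sizes = []
    dirty = float(total_pages)
    while dirty > threshold:
        sizes.append(dirty)
        seconds = dirty / transfer_rate   # time needed to send this pass
        dirty = dirty_rate * seconds      # pages written during that time
    return sizes, dirty


# 8 GiB of 4 KiB pages, D = 10K pages/s, T = 256K pages/s, as in the example.
sizes, remaining = precopy_passes(2 * 1024 * 1024, 10_000, 256 * 1024, 200)
print(len(sizes), round(remaining))
```

Each pass is smaller than the previous one by the factor D/T, so the number of passes grows only logarithmically with RAM size as long as D stays well below T.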

Convergence Thresholds (QEMU defaults)

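QEMU ends the iterative phase and enters stop-and-copy when the remaining dirty data fits within the downtime budget:

remaining_dirty_bytes <= measured_bandwidth × downtime_limit (the final copy is expected to finish within one stop phase)

QEMU itself does not enforce a fixed iteration count or a total time limit. If the workload never converges, the migration keeps iterating until it is cancelled or a management layer intervenes (for example, virsh migrate --timeout).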
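As a rough illustration of a downtime-budget check, the sketch below tests whether the remaining dirty data could be sent within the downtime limit at the currently measured bandwidth. The function and parameter names are invented for this example; QEMU's actual logic lives in its migration core and differs in detail.

```python
def can_enter_stop_and_copy(remaining_bytes: int,
                            bandwidth_bytes_per_s: float,
                            downtime_limit_ms: int = 300) -> bool:
    """True if the remaining dirty data fits in the downtime budget."""
    budget_bytes = bandwidth_bytes_per_s * downtime_limit_ms / 1000
    return remaining_bytes <= budget_bytes


MiB = 2 ** 20
# 245 MiB left at 826 MiB/s: about 248 MiB fits into 300 ms, so this passes.
print(can_enter_stop_and_copy(245 * MiB, 826 * MiB))  # True
```

Raising the downtime limit or the available bandwidth enlarges the budget, which is why both are the usual tuning knobs for migrations that struggle to converge.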

QEMU exposes migration parameters, including:

downtime-limit: maximum acceptable stop-and-copy time (ms, default 300)
max-bandwidth: cap on migration bandwidth (bytes per second) to avoid saturating the network
x-checkpoint-delay: delay between periodic checkpoints (ms); this applies to COLO fault tolerance, not to ordinary pre-copy iterations
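These parameters can also be set over QMP with the migrate-set-parameters command. The helper below only builds the JSON text for that command; how it is delivered (a QMP socket, libvirt's qemu-monitor-command, or similar) is left out, and the helper name is an assumption made for this example.

```python
import json


def set_params_command(downtime_limit_ms: int, max_bandwidth_bytes: int) -> str:
    """Build a QMP migrate-set-parameters command as a JSON string."""
    command = {
        "execute": "migrate-set-parameters",
        "arguments": {
            "downtime-limit": downtime_limit_ms,     # milliseconds
            "max-bandwidth": max_bandwidth_bytes,    # bytes per second
        },
    }
    return json.dumps(command)


print(set_params_command(300, 1024 ** 3))
```

Building the command as a dictionary first keeps the units explicit and avoids hand-written JSON quoting mistakes.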

Stop-and-Copy Final Phase

Once pre-copy converges:

Pause the source VM: QEMU stops the VM (vm_stop()), kicking every vCPU thread out of KVM_RUN so that no vCPU executes guest code
Dirty bitmap final scan: transfer any pages dirtied between last iteration and pause
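CPU state transfer: QEMU reads each vCPU's register state from KVM (KVM_GET_REGS, KVM_GET_SREGS, KVM_GET_MSRS and related ioctls; KVM in turn holds this state in the VMCS) and sends it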
Device state transfer: QEMU serializes all emulated device state (NIC queues, timer state, disk queue positions)
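Resume on destination: the destination QEMU, which created its vCPUs (KVM_CREATE_VCPU) when it started in incoming mode, loads the state with KVM_SET_REGS, KVM_SET_SREGS, KVM_SET_MSRS and related ioctls, then enters KVM_RUN
Redirect network traffic: the VM keeps its own IP and MAC address; the destination announces the VM's MAC (QEMU sends RARP announcements after migration, and some environments update routes or ARP entries as well) so that switches forward the VM's traffic to the destination host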
Release source VM: destroy source VM

Typical stop-and-copy downtime: 10–100 ms for a well-converged pre-copy migration.

Post-Copy Migration

Post-copy is an alternative where the VM starts executing on the destination with minimal initial state, and pages are faulted in from the source on demand.

Post-Copy Flow:

Source Host A                    Destination Host B
   |                                     |
   | Transfer: CPU state only ---------> |
   | Transfer: minimal memory (working set estimate)
   |                                     |
   |  VM paused on A                     |  VM STARTS on B!
   |                                     |
   |  Page request for GPA 0x1000  <-----|  EPT fault on B
   |  Fetch page from A            ----> |  Map page, resume
   |  Page request for GPA 0x5000  <-----|  EPT fault on B
   |  ...                                |
   |  Continue serving pages until   <---|  All pages migrated
   |  all pages transferred              |
   |  A can be freed                     |
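The demand-fetch behaviour in the flow above can be modelled in a few lines. In this toy model a dictionary stands in for the page server on Host A and a counter stands in for network round-trips; the class and method names are hypothetical and do not correspond to QEMU's implementation.

```python
class PostCopyDestination:
    """Toy model of the destination side of a post-copy migration."""

    def __init__(self, source_pages: dict):
        self.source = source_pages    # stands in for the page server on Host A
        self.resident = {}            # pages already present on Host B
        self.remote_faults = 0

    def read(self, gpa: int) -> bytes:
        """Guest access: fetch the page from the source on first touch."""
        if gpa not in self.resident:
            self.remote_faults += 1   # one network round-trip in a real system
            self.resident[gpa] = self.source[gpa]
        return self.resident[gpa]

    def background_push(self, gpa: int) -> None:
        """Source proactively sends a page that has not been requested yet."""
        self.resident.setdefault(gpa, self.source[gpa])

    def complete(self) -> bool:
        """True once every page is resident and Host A can be freed."""
        return len(self.resident) == len(self.source)
```

Only the first access to a page pays the remote cost; background pushes shrink the window in which the guest can hit such faults.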

Post-Copy Advantages and Risks

Advantages:

Downtime covers only the transfer of CPU and device state, roughly 1–5 ms (near-zero)
Total migration time is shorter for thrashing workloads (each page is transferred once rather than re-sent every time it is re-dirtied)

Disadvantages:

Source host failure risk: if Host A fails while the VM is running on B with pages still outstanding, the VM hangs or dies. There is no recovery, because the pages on A are lost.
Remote page faults: every guest access to a not-yet-transferred page causes a page fault that is resolved by a network round-trip (typically 50–200 μs). Application performance degrades until all pages are resident on B.
Complexity: requires a page server on the source, a fault handler on the destination, and a cancellation protocol

Post-copy is available in KVM/QEMU (since 2.5) but rarely used in production due to the failure risk. Amazon EC2 live migration uses pre-copy exclusively.

Storage Migration

If the VM uses local disk (not shared storage), disk data must also be migrated:

Block migration (QEMU): transfers disk image content over the migration channel alongside memory. Extremely slow for large disks (1 TB disk at 1 Gbps = 8,000 seconds). Only practical for small instances.

Shared storage (standard for production): all VMs use network-attached storage (NFS, iSCSI, Ceph/RBD, Gluster). The storage is accessible from both hosts. Only memory, CPU, and device state need to be transferred. This is the universal production approach.

Shared Storage Migration:

  Host A                       Host B
  +--------+     (1) copy mem  +--------+
  | QEMU   | ----------------> | QEMU   |
  | VM     |                   | VM     |
  +--+-----+                   +--+-----+
     |                            |
     |  (2) storage path unchanged|
     +------------+---------------+
                  |
         +--------+--------+
         |  Ceph / NFS /   |
         |  iSCSI SAN      |
         +-----------------+

Ceph RBD (RADOS Block Device) is the dominant storage backend for KVM clouds (OpenStack, Proxmox). RBD volumes are accessible from any host with Ceph access — migration requires only memory/CPU transfer.

Memory Overcommit During Migration

Migration is expensive when the host is memory-constrained:

Destination must have free RAM: needs to allocate vm_memory_size of fresh pages. If destination is overcommitted, migration is rejected.
Balloon deflation at destination: destination may need to deflate balloons in other VMs to free RAM before accepting the migrating VM.
Compression: QEMU can compress migrated pages (zlib or zstd) or delta-encode them (XBZRLE) to reduce bandwidth requirements. XBZRLE encodes deltas between successive versions of a page and is most effective for workloads that repeatedly modify small parts of the same pages.

Live Migration in Production

Google Compute Engine

GCP performs live migration for all regular instances on every host maintenance event (kernel patching, hardware health checks, microcode updates). The user's VM is migrated to a different host. The user observes:

A brief performance dip (~30–60 ms downtime) during migration
A notification in the metadata server that maintenance is occurring
No VM restart, no IP change

GCP uses a variant of pre-copy with aggressive convergence and custom dirty tracking.

AWS EC2

EC2 does not live-migrate most instances. Instead, AWS issues scheduled maintenance events (24-hour notice) and users are expected to stop/start their instance or use multiple AZs. Some newer instance types on Nitro support live migration. The complexity and overhead of live migration across AWS's extreme scale made per-instance migration less attractive than fleet architecture with redundancy.

VMware vMotion

Enterprise standard. vMotion supports:

vMotion (live migration between hosts): standard pre-copy, downtime < 100 ms
Storage vMotion: migrate a disk from one datastore to another (online, no VM downtime)
Combined vMotion (vSphere 5.1+): move VM and storage simultaneously, without shared storage

vMotion requires network bandwidth allocation and vSphere licensing. Used extensively for hardware maintenance, DRS (Distributed Resource Scheduler) load balancing, and power consolidation.

Debugging Notes

# Monitor KVM/QEMU migration in progress
(qemu) info migrate
Migration status: active
total time: 12345 ms
expected downtime: 42 ms
setup: 1 ms
transferred ram: 3100 MiB
throughput: 826 MiB/s
remaining ram: 245 MiB
total ram: 8192 MiB
duplicate: 1524312 pages
skipped: 0 pages
normal: 804123 pages
normal bytes: 3221123456 B
dirty pages rate: 1023
dirty sync count: 4

# Show the current migration parameters (downtime-limit, max-bandwidth, and others)
(qemu) info migrate_parameters

# Enable XBZRLE delta compression
(qemu) migrate_set_capability xbzrle on

# Cap migration bandwidth at 1 GiB/s
(qemu) migrate_set_parameter max-bandwidth 1G

# Start migration
(qemu) migrate -d tcp:destination-host:4444

# Start a live migration with virsh (libvirt)
virsh migrate --live <vm-name> qemu+ssh://dest-host/system

# View statistics for the running migration job (from another shell)
virsh domjobinfo <vm-name>

# KVM dirty log
# In QEMU source: migration/ram.c - ram_save_iterate()
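For scripted monitoring, the "key: value" lines printed by info migrate are easy to parse. The sketch below assumes the output format shown in the sample above (one field per line, with "remaining ram" and "total ram" in the same unit); real QEMU versions differ in field names and units, so treat the field names as assumptions.

```python
def parse_info_migrate(text: str) -> dict:
    """Turn 'key: value' lines from 'info migrate' into a dictionary of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:                                   # skip lines without a colon
            stats[key.strip()] = value.strip()
    return stats


def remaining_fraction(stats: dict) -> float:
    """Fraction of guest RAM still to be sent (fields must share a unit)."""
    remaining = float(stats["remaining ram"].split()[0])
    total = float(stats["total ram"].split()[0])
    return remaining / total
```

Polling this fraction over time gives a quick view of whether a migration is converging or stuck re-sending the same pages.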

Security Implications

Migration channel security: migration transfers full VM memory in plaintext over TCP by default. Use TLS or a VPN-secured migration network. QEMU supports TLS for migration: create a tls-creds-x509 object holding the x509 credentials on each side and set the tls-creds migration parameter to that object's ID.
Rogue destination: a compromised destination host can inspect all migrated VM memory. Migration must only target trusted hosts. vSphere and libvirt use certificate-based mutual authentication.
CPU state transfer: migrated CPU state includes registers that may hold cryptographic material. Side-channel attacks on migration channels have been demonstrated in research.
Post-copy page server: the page server on the source must authenticate requests — otherwise an attacker on the migration network could request arbitrary VM pages.

Failure Modes

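Migration timeout: if the dirty rate never converges (Redis mass-write, memory compression), the migration never reaches the stop-and-copy condition and runs until an operator cancels it or a management-layer timeout (for example, virsh migrate --timeout) fires. Mitigation: reduce the VM's memory write rate during the maintenance window, raise downtime-limit, or use post-copy.
Network interruption during migration: if the migration TCP connection drops during pre-copy, QEMU aborts the migration and the VM keeps running on the source host. During post-copy the destination cannot fetch missing pages while the link is down; QEMU 3.0 and later can pause and later recover such a migration, but if the source is lost the VM is unrecoverable and must be reset.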
Destination out of memory: destination rejects migration with ENOMEM. Mitigation: pre-check destination capacity via management plane before initiating.
Incompatible CPU features: if destination CPU lacks features that source vCPU state depends on (e.g., AVX-512 registers saved but destination has no AVX-512), migration fails. Cloud providers pin VM instances to CPU generations.
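The memory and CPU checks described above are the kind of admission test a management plane can run before starting a migration. The sketch below is a hypothetical pre-check: the function name, its parameters and the way feature names are represented are assumptions, not a libvirt or QEMU interface.

```python
def migration_precheck(vm_mem_mib, vm_cpu_features,
                       dest_free_mib, dest_cpu_features):
    """Return a list of human-readable problems; an empty list means 'go'."""
    problems = []
    if vm_mem_mib > dest_free_mib:
        problems.append(
            f"insufficient memory: need {vm_mem_mib} MiB, have {dest_free_mib} MiB"
        )
    # Every feature exposed to the guest must exist on the destination CPU.
    missing = set(vm_cpu_features) - set(dest_cpu_features)
    if missing:
        problems.append("missing CPU features: " + ", ".join(sorted(missing)))
    return problems


print(migration_precheck(8192, {"avx2", "avx512f"}, 4096, {"avx2"}))
```

Returning all problems at once, rather than failing on the first, gives the scheduler enough information to pick a better destination in one step.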

Modern Usage and Future Directions

Live migration at container granularity: CRIU (Checkpoint/Restore In Userspace) enables live migration of Linux containers (processes). Not as efficient as VM migration (process state is harder to checkpoint) but increasingly production-ready.

Delta compression: XBZRLE (Xor Based Zero Run Length Encoding) in QEMU keeps a cache of previously sent pages, compares each dirty page against its cached version, and sends only an encoding of the bytes that changed. It can reduce migration bandwidth by 50–80% for workloads that modify only a small part of each page, and gives little benefit when pages are rewritten wholesale.
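The idea behind delta encoding can be shown with a simplified XOR-plus-zero-run scheme. This is a teaching sketch only: it is not QEMU's XBZRLE wire format, and the function names and the (zero_run, changed_bytes) pair representation are assumptions.

```python
def xor_delta_encode(old: bytes, new: bytes):
    """Encode new relative to old as (unchanged_run_length, xor_bytes) pairs."""
    if len(old) != len(new):
        raise ValueError("pages must have the same size")
    delta = bytes(a ^ b for a, b in zip(old, new))
    pairs, i, n = [], 0, len(delta)
    while i < n:
        start = i
        while i < n and delta[i] == 0:    # run of unchanged bytes
            i += 1
        zero_run = i - start
        start = i
        while i < n and delta[i] != 0:    # run of changed bytes
            i += 1
        pairs.append((zero_run, delta[start:i]))
    return pairs


def xor_delta_decode(old: bytes, pairs) -> bytes:
    """Rebuild the new page from the old page and the encoded pairs."""
    page, pos = bytearray(old), 0
    for zero_run, changed in pairs:
        pos += zero_run                   # skip bytes that did not change
        for value in changed:
            page[pos] ^= value
            pos += 1
    return bytes(page)
```

For a 4 KiB page in which four bytes changed, the encoding carries four payload bytes plus two run lengths instead of 4096 bytes, which is where the bandwidth saving comes from.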

RDMA migration: using InfiniBand RDMA for migration bypasses the CPU for data transfer, enabling 40–100 Gbps migration bandwidth. Reduces migration time for large VMs from minutes to seconds.

Zero-downtime migration research: research systems explore combining pre-copy with application-level coordination (pausing and migrating at transaction boundaries) so that downtime is not observable to clients.

Exercises

Set up two KVM hosts with shared NFS storage. Migrate a running VM between them using virsh migrate --live. Measure downtime by pinging the VM continuously and counting lost packets.
Configure QEMU migration bandwidth caps and observe how cap value affects total migration time and downtime. Use info migrate to monitor.
Implement a simple memory write loop in a VM (e.g., memset a 1 GB buffer in a loop). Attempt to migrate the VM while the loop runs. What happens? Does migration converge?
Enable XBZRLE compression for migration (migrate_set_capability xbzrle on). Compare migration traffic volume vs uncompressed for a VM running a database workload.
Explain the failure mode of post-copy migration if the source host powers off before all pages have been transferred to the destination.
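For the downtime measurement in the first exercise, one possible approach is to record the arrival time of each ping reply and look for the longest silence. The helper below is a sketch under that assumption; the function name is made up, and timestamps are taken to be in seconds.

```python
def estimated_downtime(reply_times, send_interval):
    """Longest gap between consecutive replies, minus the normal send interval."""
    gaps = [later - earlier
            for earlier, later in zip(reply_times, reply_times[1:])]
    return max(0.0, max(gaps, default=0.0) - send_interval)


# Replies every 10 ms, with a 70 ms silence between 0.02 s and 0.09 s.
print(estimated_downtime([0.0, 0.01, 0.02, 0.09, 0.10], 0.01))
```

The resolution of this estimate is bounded by the send interval, so a short interval (for example 10 ms) is needed to observe downtimes in the tens of milliseconds.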

References

Clark, C. et al. (2005). "Live Migration of Virtual Machines." USENIX NSDI 2005.
Hines, M. R. and Gopalan, K. (2009). "Post-Copy Based Live Virtual Machine Migration Using Adaptive Pre-Paging and Dynamic Self-Ballooning." ACM VEE 2009.
Google Cloud. "Live Migration." https://cloud.google.com/compute/docs/instances/live-migration
VMware. "vMotion Architecture, Performance and Best Practices in VMware vSphere 5.x." VMware Technical White Paper.
QEMU documentation: docs/devel/migration.rst
Hu, W. et al. (2017). "XBZRLE: Efficient Memory Migration in QEMU." (KVM Forum 2012 presentation).
Nelson, M. et al. (2005). "Fast Transparent Migration for Virtual Machines." USENIX ATC 2005.