06 — Firecracker and MicroVMs

Prerequisites

KVM architecture: /dev/kvm ioctls, VMCS, VMEXIT, EPT
Linux security: seccomp-BPF, namespaces (network, PID, user), cgroups, chroot
VirtIO: virtio-net, virtio-blk frontend/backend model
Containerization basics: container isolation, rootfs, overlay filesystems

Historical Context

AWS Lambda launched in November 2014, pioneering the serverless computing model. Lambda's initial architecture isolated functions with Linux containers and isolated customer accounts with virtualization: functions belonging to one customer shared a VM (an EC2 instance), while different customers always ran in different VMs. This kept tenants apart, but it forced a trade-off between the weak isolation of a shared-kernel container and the cost of a full VM per customer.

By 2016, the scale of Lambda demanded a fundamentally different isolation model. Containers on a shared kernel are inherently vulnerable to kernel exploits — a single CVE could compromise all tenant functions on a host. But traditional VMs (QEMU/KVM) were too heavy for function execution: QEMU alone consumed ~500 MB of RAM and took ~5 seconds to boot. Running 150 VMs per host was impractical with these requirements.

The answer was Firecracker: a minimal VMM (Virtual Machine Monitor) designed from first principles for serverless workloads. Firecracker was written in Rust by AWS engineers, exploiting KVM for the hardware isolation guarantee while eliminating every QEMU component not strictly necessary. It was open-sourced in November 2018.

The name "Firecracker" reflects the design goal: fast, bright, contained — like a firecracker rather than a full explosion.

MicroVM Design Goals

A MicroVM is a VM that satisfies a different set of constraints than a traditional VM:

Goal	Target	Traditional VM
Boot time	< 125 ms to userspace	5–30 seconds
Memory footprint	5 MB overhead per VM	200–500 MB QEMU overhead
Density	150+ VMs per host	10–30 VMs per host
Attack surface	Minimal device set	Full PCIe, USB, BIOS, VGA
Startup jitter	< 5 ms P99	Unpredictable
Security boundary	VM-level (kernel isolation)	Same

Lambda boots a function's execution environment in response to an invocation request, so VM startup sits directly on the user-visible cold-start path alongside network, routing, and runtime initialization. Firecracker's specification targets at most 125 ms from the InstanceStart API call to the start of the guest's /sbin/init process; customer code runs only after the guest's own init and language runtime have come up.

Firecracker Implementation

Firecracker is written entirely in Rust, chosen for:
- Memory safety without garbage collection (no GC pauses in the VM lifecycle path)
- Rich type system catching logical errors at compile time
- Safe FFI for ioctl calls to /dev/kvm
- Fearless concurrency for the API server + VMM threads

Firecracker does not use QEMU. It implements its own minimal VMM using KVM ioctls directly.

Firecracker vs QEMU/KVM Stack:

Traditional QEMU/KVM:           Firecracker/KVM:
+-----------------------+       +------------------------+
|       Guest VM        |       |       Guest VM         |
| (Linux, 512MB+)       |       | (Linux, custom kernel) |
+-----------------------+       +------------------------+
|  VirtIO devices       |       |  VirtIO devices        |
|  PCI bus              |       |  (no PCI bus)          |
|  BIOS/UEFI/ACPI       |       |  (no BIOS/ACPI)        |
+-----------------------+       +------------------------+
|       QEMU            |       |    Firecracker VMM     |
|  (~500MB RSS)         |       |    (<5MB overhead)     |
|  C code (1M+ lines)   |       |    Rust (50K lines)    |
+-----------------------+       +------------------------+
|       KVM             |       |       KVM              |
|  (/dev/kvm ioctls)    |       |  (/dev/kvm ioctls)     |
+-----------------------+       +------------------------+
|     Linux Kernel      |       |     Linux Kernel       |
+-----------------------+       +------------------------+
|      Hardware         |       |      Hardware          |
+-----------------------+       +------------------------+
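
To make "KVM ioctls directly" concrete, the following is a minimal Python sketch (Firecracker itself is Rust) that encodes a few KVM request numbers and asks the kernel for its KVM API version. The request numbers follow <linux/kvm.h>, and the `_io` encoding shown is the one used on x86_64 and aarch64. The helper name `kvm_api_version` is made up for this illustration, and the function returns None when /dev/kvm is missing or inaccessible.

```python
import fcntl
import os

KVMIO = 0xAE  # ioctl "type" byte used by KVM in <linux/kvm.h>


def _io(ioc_type: int, nr: int) -> int:
    """Encode an argument-less ioctl number like the kernel's _IO() macro
    (valid for x86_64 and aarch64, where the direction bits are zero)."""
    return (ioc_type << 8) | nr


KVM_GET_API_VERSION = _io(KVMIO, 0x00)
KVM_CREATE_VM = _io(KVMIO, 0x01)
KVM_GET_VCPU_MMAP_SIZE = _io(KVMIO, 0x04)
KVM_CREATE_VCPU = _io(KVMIO, 0x41)
KVM_RUN = _io(KVMIO, 0x80)


def kvm_api_version(path: str = "/dev/kvm"):
    """Return the KVM API version, or None if KVM is not usable here."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    except OSError:
        return None
    finally:
        os.close(fd)
```

A VMM continues from here with KVM_CREATE_VM on the /dev/kvm descriptor, KVM_CREATE_VCPU on the VM descriptor, and KVM_RUN on each vCPU descriptor.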

Firecracker's Device Model

Firecracker deliberately implements only the devices needed for serverless workloads:

Device	Implementation	Notes
virtio-net	VirtIO 1.0, TAP backend	mmio, not PCI
virtio-blk	VirtIO 1.0, file backend	mmio, not PCI
virtio-vsock	AF_VSOCK host↔guest comms	Used for init protocol
Serial console (UART 16550A)	Emulated for logs	ttyS0
i8042 keyboard controller	Minimal (required for reset)	No mouse
RTC (PL031)	Emulated on aarch64 only	x86_64 guests keep time via kvm-clock

Explicitly absent: USB, PCIe bus, VGA/GPU, BIOS/UEFI, ACPI tables, PCI device enumeration, floppy, sound, parallel port, IDE/SATA controllers. Each omitted device is one fewer code path that could contain a vulnerability.

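VirtIO devices use the MMIO transport rather than PCI, which keeps both the device model and the guest driver path small. virtio-mmio has no enumeration mechanism, so the guest kernel does not scan for devices: the VMM announces each device's address range and interrupt line explicitly, on x86_64 through virtio_mmio.device= kernel command line parameters and on aarch64 through the device tree.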
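
Because virtio-mmio devices cannot be enumerated, the guest has to be told where each one lives. On x86 this is commonly done with the kernel's `virtio_mmio.device=<size>@<baseaddr>:<irq>` parameter. The sketch below builds such parameters; the base address, first IRQ, and 4 KiB window size in the usage line are illustrative assumptions, not values taken from this document.

```python
def virtio_mmio_cmdline(base: int, count: int, first_irq: int,
                        size: int = 0x1000) -> str:
    """Build virtio_mmio.device= parameters for `count` devices laid out
    back to back, each with its own MMIO window and interrupt line."""
    params = []
    for i in range(count):
        addr = base + i * size
        params.append(
            f"virtio_mmio.device={size // 1024}K@{addr:#x}:{first_irq + i}"
        )
    return " ".join(params)


if __name__ == "__main__":
    # Hypothetical layout: two devices starting at 0xd0000000, IRQs 5 and 6.
    print(virtio_mmio_cmdline(0xD0000000, 2, 5))
```

The guest kernel needs CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y for these parameters to take effect.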

The Jailer: Multi-Layer Isolation

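The Firecracker VMM process is itself confined. The jailer, a companion tool, sets up the restrictions below and then executes Firecracker; the seccomp filter is the exception, since Firecracker installs it on itself at startup: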

cgroup: Assigns the Firecracker process to a cgroup with CPU and memory limits
Network namespace: Places the process in a dedicated network namespace (only the tap device is visible)
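chroot: Pivots the process root to a minimal directory containing only the Firecracker binary and the resources that one VM needs (kernel image, rootfs, device nodes)
seccomp-BPF: Applies a strict syscall allowlist, installed by Firecracker itself rather than by the jailer. Firecracker is allowed only the roughly two dozen syscalls it actually needs. Any other syscall → SIGSYS → process killed.
Privilege drop: The jailer switches to an unprivileged uid/gid before executing Firecracker, so the VMM never runs as root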

Isolation Layers:

  Host OS
  |
  +-- network namespace (isolated NIC)
  |   +-- cgroup (CPU/mem limits)
  |       +-- chroot (minimal rootfs)
  |           +-- seccomp-BPF (syscall filter)
  |               +-- Firecracker VMM process
  |                   +-- KVM /dev/kvm
  |                       +-- Guest VM
  |                           +-- Customer Lambda function code

The combination means: a guest VM escape (compromising Firecracker) is still contained by seccomp + namespace isolation. A complete Firecracker process escape is still contained by the outer network namespace and cgroup. Defense in depth with VM isolation as the primary boundary.
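
As a rough sketch of how a launcher might drive the jailer, the helpers below assemble a jailer command line and compute the directory that becomes Firecracker's root. The flags used (`--id`, `--exec-file`, `--uid`, `--gid`, `--chroot-base-dir`, `--netns`) are jailer options, and the `<base>/<exec-file-name>/<id>/root` layout follows the jailer documentation; the function names, IDs, and paths are hypothetical.

```python
from pathlib import Path


def jail_root(chroot_base: str, exec_file: str, vm_id: str) -> Path:
    """Directory the jailer pivots into: <base>/<exec name>/<id>/root."""
    return Path(chroot_base) / Path(exec_file).name / vm_id / "root"


def jailer_argv(vm_id: str, exec_file: str, uid: int, gid: int,
                chroot_base: str = "/srv/jailer", netns: str = None) -> list:
    """Assemble a jailer invocation; the caller would pass it to exec."""
    argv = [
        "jailer",
        "--id", vm_id,
        "--exec-file", exec_file,
        "--uid", str(uid),
        "--gid", str(gid),
        "--chroot-base-dir", chroot_base,
    ]
    if netns:
        # Path to a pre-created network namespace, e.g. /var/run/netns/<name>
        argv += ["--netns", netns]
    return argv
```

Files the VM needs (kernel image, rootfs) must be placed inside the directory returned by `jail_root`, because nothing outside it is visible to Firecracker after the pivot.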

seccomp-BPF Filter

Firecracker's seccomp filter allows approximately:

ALLOWED: read, write, open, close, mmap, mprotect, madvise,
         ioctl (allowlisted request codes only), epoll_wait, eventfd,
         signalfd, timerfd, futex, exit, exit_group,
         socket (only AF_UNIX), bind, listen, accept,
         sendmsg, recvmsg, fstat, lseek, pread64, pwrite64
BLOCKED: execve, fork, clone, ptrace, setuid, chroot,
         mount, unshare, and ~300 others

Any syscall not in the allowlist terminates Firecracker immediately with SIGSYS.
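
The decision logic of such a filter can be modeled in a few lines. This is a toy model, not Firecracker's actual filter or BPF program: the rule set is a tiny illustrative subset, and real filters match on raw syscall numbers and argument registers. It does show the two ideas that matter, default-deny and per-argument conditions.

```python
from dataclasses import dataclass, field

AF_UNIX = 1       # Linux value of AF_UNIX
KVM_RUN = 0xAE80  # _IO(0xAE, 0x80) from <linux/kvm.h>


@dataclass
class Rule:
    # Maps argument index -> required value; empty means "any arguments".
    arg_equals: dict = field(default_factory=dict)

    def matches(self, args) -> bool:
        return all(i < len(args) and args[i] == v
                   for i, v in self.arg_equals.items())


# Illustrative allowlist: a syscall is permitted if any of its rules match.
ALLOWLIST = {
    "read": [Rule()],
    "write": [Rule()],
    "exit_group": [Rule()],
    "socket": [Rule({0: AF_UNIX})],  # first argument (domain) must be AF_UNIX
    "ioctl": [Rule({1: KVM_RUN})],   # second argument (request) must be KVM_RUN
}


def decide(syscall: str, args=()) -> str:
    """Default-deny: anything without a matching rule kills the process."""
    rules = ALLOWLIST.get(syscall, [])
    return "ALLOW" if any(r.matches(args) for r in rules) else "KILL"
```

Note that the ioctl rule matches on the request code, not on which file the descriptor refers to; seccomp-BPF sees only the raw argument values.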

Firecracker Boot Process

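Firecracker bypasses BIOS/UEFI completely. On x86_64 it loads an uncompressed Linux kernel image (an ELF vmlinux built with a Firecracker-compatible config) straight into guest memory and starts the vCPU at the kernel's entry point. CONFIG_PVH is needed only when the PVH entry point is used instead of the Linux 64-bit boot protocol: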

Firecracker Boot Sequence (<125ms total):

0ms:   Firecracker process starts
       Reads VM configuration from API (JSON over Unix socket)

5ms:   KVM VM created (KVM_CREATE_VM ioctl)
       Memory slots configured (KVM_SET_USER_MEMORY_REGION)
       vCPUs created (KVM_CREATE_VCPU)
       VirtIO MMIO devices configured

10ms:  Kernel image (vmlinux) loaded into guest physical memory
       Initial ramdisk (initrd) loaded
       Kernel command line set (boot params in memory)
       vCPU registers set: RIP → kernel entry point

12ms:  KVM_RUN: guest vCPU starts executing kernel
       (no decompression step: vmlinux is loaded uncompressed)

40ms:  Kernel initializes: VirtIO probing, network config
       init process (PID 1) starts

100ms: Customer init (agent) running, ready for function invocation
125ms: Function code executing

The critical optimization: no BIOS POST, no MBR boot, no ACPI table parsing. The kernel starts at its entry point directly.
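
Setting RIP to the kernel entry point presupposes that the VMM can find it. For an ELF vmlinux the address sits in the ELF header's e_entry field. The sketch below reads that field from the first 32 bytes of a little-endian ELF64 image. It illustrates the idea only; it is not Firecracker's loader, which also has to copy the program segments into guest memory and set up boot parameters.

```python
import struct

ELF_MAGIC = b"\x7fELF"
ELFCLASS64 = 2   # e_ident[4]: 64-bit objects
ELFDATA2LSB = 1  # e_ident[5]: little-endian


def elf64_entry_point(header: bytes) -> int:
    """Return e_entry from the start of a little-endian ELF64 image."""
    if len(header) < 32 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF image")
    if header[4] != ELFCLASS64 or header[5] != ELFDATA2LSB:
        raise ValueError("expected a little-endian ELF64 image")
    # e_ident[16], e_type, e_machine, e_version, e_entry
    _ident, _type, _machine, _version, entry = struct.unpack_from(
        "<16sHHIQ", header
    )
    return entry
```

Usage would be along the lines of `elf64_entry_point(open("vmlinux", "rb").read(64))`, where the file name is a placeholder.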

Custom Minimal Kernel

AWS maintains a minimal kernel configuration for Firecracker with:
- No modules (everything compiled in)
- No ACPI
- Minimal device drivers (only VirtIO, virtio-mmio, 8250 UART)
- Direct kernel entry (CONFIG_PVH=y is needed only when booting through the PVH entry point)
- No decompression at boot on x86_64 (the kernel is supplied as an uncompressed vmlinux)
- Minimal initrd (~1MB)
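
A quick way to catch a wrong kernel config before booting is to check the .config file for the options the guest cannot work without. The option names below are real Kconfig symbols, but the list is an illustrative subset chosen for this sketch, not Firecracker's official requirements. Options set to `m` are reported as missing because a module-free guest needs them built in.

```python
REQUIRED = (
    "CONFIG_VIRTIO_MMIO",
    "CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES",
    "CONFIG_VIRTIO_BLK",
    "CONFIG_VIRTIO_NET",
    "CONFIG_SERIAL_8250_CONSOLE",
)


def missing_options(config_text: str, required=REQUIRED) -> list:
    """Return the required options that are not built in (=y)."""
    builtin = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            builtin.add(name)
    return [opt for opt in required if opt not in builtin]
```

An empty result means every option in the list is built in; anything else names the options to fix.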

Snapshot and Restore

Firecracker supports snapshotting a running MicroVM and restoring it later:

Snapshot Flow:
  Running MicroVM
       |
       | PUT /snapshot/create (API call)
       |
       v
  Firecracker:
  1. Pause all vCPUs (each vCPU thread is kicked out of KVM_RUN; the VM must be paused before a snapshot)
  2. Save vCPU state (KVM_GET_REGS, KVM_GET_SREGS, KVM_GET_MSRS, KVM_GET_LAPIC, KVM_GET_XSAVE, ...)
  3. Save device state (virtio queues, network state, disk position)
  4. Write memory snapshot (mmap guest memory, write to file)
  5. Resume (if requested) or exit
       |
       v
  Snapshot files:
    - vm-state.snapshot (vCPU registers, device state, ~100KB)
    - mem.snapshot (guest RAM contents; a full snapshot is as large as the configured guest memory)

Restore Flow:
  New Firecracker process
       |
       | PUT /snapshot/load (API call)
       |
       v
  Firecracker:
  1. Create new KVM VM
  2. Map memory snapshot (mmap or load) as guest physical memory
  3. Restore vCPU state (KVM_SET_REGS, KVM_SET_SREGS, KVM_SET_MSRS, KVM_SET_LAPIC, KVM_SET_XSAVE, ...)
  4. Restore device state
  5. Resume all vCPUs
       |
       v
  MicroVM running from saved state (<150ms from file to running)

Use case: Lambda function warm starts. Lambda can pre-initialize a function execution environment (JVM started, code loaded, warm-up invocations done), snapshot it, and serve subsequent cold starts by restoring the snapshot in ~150ms instead of starting from zero in ~3 seconds.
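
The two flows above map onto a short sequence of API requests. The sketch below only builds the request tuples; it does not send them. The endpoint paths come from the flows above, and the JSON field names are my recollection of the Firecracker 1.x API, so check them against the API specification of the release in use. The file paths are placeholders.

```python
import json


def snapshot_requests(state_path: str, mem_path: str) -> list:
    """Requests to pause a running microVM and write a full snapshot."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": state_path,
            "mem_file_path": mem_path,
        }),
    ]


def restore_request(state_path: str, mem_path: str, resume: bool = True):
    """Request for a fresh Firecracker process to load a snapshot."""
    return ("PUT", "/snapshot/load", {
        "snapshot_path": state_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
        "resume_vm": resume,
    })


if __name__ == "__main__":
    steps = snapshot_requests("vm-state.snapshot", "mem.snapshot")
    steps.append(restore_request("vm-state.snapshot", "mem.snapshot"))
    for method, path, body in steps:
        print(method, path, json.dumps(body))
```

Each tuple corresponds to one curl call against the API socket, in the same style as the commands in the Debugging Notes.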

Production Scale

AWS Lambda: millions of concurrent function executions. Lambda runs on a fleet of hosts, each running 150+ Firecracker MicroVMs. Functions are billed in 1 ms increments (100 ms increments before December 2020), so the overhead of the MicroVM must be imperceptible.

AWS Fargate: ECS/EKS Fargate runs each container task inside a dedicated Firecracker MicroVM. The customer sees a container; underneath is a VM providing kernel-level isolation.

Scale numbers (AWS 2018 announcement blog post):
- < 125 ms: typical MicroVM boot time
- < 5 MiB: Firecracker process memory overhead (excluding guest RAM)
- Up to 150 MicroVMs per second per host: creation rate (density is bounded mainly by host memory and can reach thousands per host)
- Millions of MicroVMs running simultaneously across AWS fleet

Comparing Minimal VMMs

VMM	Language	Backend	Key Use Case	Attack Surface
Firecracker	Rust	KVM	AWS Lambda/Fargate	Smallest
Cloud Hypervisor	Rust	KVM	Cloud VMs, Container VMs	Small
crosvm	Rust	KVM/VFIO	ChromeOS containers (Crostini)	Small
QEMU	C	KVM/TCG/many	General purpose, dev/test	Large
Kata Containers	Go + C	QEMU/Firecracker/dragonball	Container-in-VM	Configurable

Cloud Hypervisor (Intel): similar to Firecracker but with additional features — VFIO passthrough, large VMs (>1 TB), live migration. Used for more general cloud workloads.

crosvm (Google): powers Linux containers in ChromeOS ("Crostini"). Developed by the Chrome OS team. Supports GPU virtualization for graphics in the Linux container.

Kata Containers: an OCI-compliant container runtime that runs each container inside a VM. Uses either QEMU or Firecracker as the VMM. The container API (Docker, Kubernetes) is preserved, but isolation is VM-level.

Security Implications

VM isolation as primary boundary: unlike Docker containers, Firecracker uses KVM-enforced hardware isolation. A kernel exploit in the guest cannot compromise the host (without a KVM bug).
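seccomp filter effectiveness: Firecracker's seccomp filter limits what an attacker can do after exploiting a memory corruption bug in the VMM, because only the allowlisted syscalls with their allowlisted arguments remain reachable. This shrinks the exposed host kernel attack surface; it does not make exploitation impossible.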
Memory isolation: each MicroVM's guest memory is mapped into Firecracker's virtual address space as an mmap'd region. EPT ensures the guest cannot access memory outside its mapped region.
Side channels: MicroVMs sharing a physical host still share CPU microarchitectural state. Spectre/Meltdown mitigations (IBRS, STIBP, retpoline) are required on the host kernel, and Firecracker's production host guidance also recommends disabling SMT. These are host-level settings; the VMM alone cannot close these channels.
Snapshot security: snapshots contain full VM memory. At rest, snapshots must be encrypted. Restored snapshots must be treated as potentially compromised if the source was untrusted.

Performance Implications

Boot latency: < 125 ms for Firecracker vs 3-10s for QEMU. The difference is the absence of BIOS, ACPI, device enumeration, and a much smaller kernel config.
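Memory overhead: ~5 MB Firecracker overhead per VM vs ~200-500 MB QEMU. At 150 VMs/host, total VMM overhead is about 750 MB with Firecracker versus 30-75 GB with QEMU.
I/O performance: Firecracker uses VirtIO-mmio (not PCI). The transport affects only device discovery, configuration, and interrupt delivery; the data path uses the same shared-memory virtqueues either way. virtio-mmio has no MSI-X, so all queues of a device share one interrupt line, which can cost throughput on multi-queue devices, but the transport is much simpler to implement and audit.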
vCPU overhead: same as KVM — hardware VMX. No additional overhead from Firecracker vs QEMU for compute.
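Snapshotting overhead: full memory snapshot of a 512 MB VM takes ~200 ms. Firecracker writes memory files uncompressed; compressing them afterwards (for example with lz4) reduces file size by 3-5x for typical workloads, at the cost of a decompression step before restore.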
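
The memory overhead comparison is simple multiplication, spelled out here with the per-VM figures quoted in this section. The 150 VMs/host density is the document's working number, not a measured limit.

```python
def total_overhead_mb(per_vm_mb: int, vms: int) -> int:
    """Aggregate VMM overhead on one host, excluding guest RAM."""
    return per_vm_mb * vms


VMS_PER_HOST = 150

firecracker_mb = total_overhead_mb(5, VMS_PER_HOST)   # 750 MB
qemu_low_mb = total_overhead_mb(200, VMS_PER_HOST)    # 30,000 MB
qemu_high_mb = total_overhead_mb(500, VMS_PER_HOST)   # 75,000 MB

if __name__ == "__main__":
    print(f"Firecracker: {firecracker_mb} MB")
    print(f"QEMU: {qemu_low_mb}-{qemu_high_mb} MB")
    print(f"Saved: {qemu_low_mb - firecracker_mb}-{qemu_high_mb - firecracker_mb} MB")
```

With these inputs the saving is roughly 29 to 74 GB of host memory, which can be sold as guest RAM instead.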

Debugging Notes

# Start Firecracker manually (for testing)
./firecracker --api-sock /tmp/fc.sock --log-path /tmp/fc.log

# Configure VM via API
curl -X PUT --unix-socket /tmp/fc.sock \
  http://localhost/machine-config \
  -H 'Content-Type: application/json' \
  -d '{"vcpu_count": 1, "mem_size_mib": 128}'

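# Set kernel
curl -X PUT --unix-socket /tmp/fc.sock \
  http://localhost/boot-source \
  -H 'Content-Type: application/json' \
  -d '{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"}'

# Attach a root filesystem (rootfs.ext4 is a placeholder path)
curl -X PUT --unix-socket /tmp/fc.sock \
  http://localhost/drives/rootfs \
  -H 'Content-Type: application/json' \
  -d '{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}'

# Start the VM
curl -X PUT --unix-socket /tmp/fc.sock \
  http://localhost/actions \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "InstanceStart"}'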

# Check Firecracker log for boot timing
grep -iE 'boot_time|duration' /tmp/fc.log

# Monitor KVM VMEXITs for Firecracker VM
cat /sys/kernel/debug/kvm/*/exits

# Check seccomp filter applied to Firecracker (Seccomp: 2 means filter mode)
grep -i seccomp /proc/$(pgrep -o firecracker)/status
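
The seccomp check can also be scripted. The sketch below parses the `Seccomp:` field of /proc/<pid>/status, where the kernel reports 0 (disabled), 1 (strict), or 2 (filter). The function names are made up for this example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str):
    """Extract the seccomp mode from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # Match "Seccomp:" exactly; "Seccomp_filters:" is a different field.
        if line.startswith("Seccomp:"):
            value = int(line.split(":", 1)[1])
            return SECCOMP_MODES.get(value, "unknown")
    return None


def seccomp_mode_of(pid) -> str:
    """Read and parse the status file of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return seccomp_mode(f.read())
```

A jailed Firecracker process should report "filter"; "disabled" means it is running without its seccomp filters installed.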

Failure Modes

MicroVM boot failure: if the guest kernel panics during boot (wrong kernel config, missing virtio-mmio support), the VM exits. Firecracker logs the last serial output. Solution: check kernel config against Firecracker's documented requirements.
OOM during snapshot: snapshotting a VM with many dirty pages can cause the host to OOM if swap space is insufficient. Monitor host memory before snapshot operations.
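API socket timeout: each Firecracker process serves exactly one MicroVM through one Unix socket, and its API server handles requests on a single thread. If a long operation (such as creating a snapshot) blocks that thread, subsequent API calls to the same VM time out. Clients should use generous timeouts for such calls and avoid issuing requests to one VM concurrently.
vCPU stuck: a guest spinning in a tight loop without HLT/PAUSE keeps its vCPU thread at 100% of a host core. From the host this is an ordinary busy thread, so cgroup CPU limits (cpu.shares for relative weight, a CFS quota or cpu.max for a hard cap) are what prevent monopolization.
Snapshot restore failure: if the snapshot was taken on a different CPU model (different CPUID flags), restoring may fail or the guest may misbehave because the saved CPUID and MSR state does not match the new host. Snapshots are tied to specific CPU generations unless the CPU features exposed to the guest are normalized across hosts (for example with CPU templates).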
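
A cheap pre-flight check for the last failure mode is to compare CPU feature flags between the host that took the snapshot and the host that will restore it. The sketch below does this on /proc/cpuinfo text from x86 hosts (aarch64 uses a `Features` line instead). It is a necessary-but-not-sufficient heuristic of my own, not a check Firecracker performs in this form: identical flags do not guarantee identical MSR behavior.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_target(source_cpuinfo: str, target_cpuinfo: str) -> list:
    """Flags the snapshot host had that the restore host lacks."""
    return sorted(cpu_flags(source_cpuinfo) - cpu_flags(target_cpuinfo))
```

A non-empty result is a strong hint that the restored guest could execute instructions the new host does not support.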

Modern Usage and Future Directions

Firecracker v1.0 (released 2022): stabilized API, improved snapshot format, added metrics, balloon device support, and CPU oversubscription controls.

UEFI-less boot standardization: Firecracker pioneered the approach of booting guests directly from Linux boot protocol without BIOS/UEFI. This pattern is being adopted by Cloud Hypervisor and other minimal VMMs.

Lazy paging: Firecracker 1.0 introduced lazy page loading for snapshots — instead of loading the full memory snapshot at restore time, pages are faulted in on demand (via userfaultfd). This reduces restore time from 3 seconds to under 200 ms for a 512 MB VM.
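
Python's standard library has no userfaultfd binding, so the sketch below demonstrates the simpler file-backed form of the same idea: a private (copy-on-write) mapping of a memory file. The kernel reads a page from the file only when it is first touched, and writes stay in the process instead of reaching the file. The function name is hypothetical and the file stands in for a guest memory snapshot.

```python
import mmap

PAGE = mmap.PAGESIZE


def touch_page_privately(path: str, page_index: int):
    """Map `path` copy-on-write, modify one byte of one page, and return
    (byte before, byte after, byte still on disk)."""
    offset = page_index * PAGE
    with open(path, "r+b") as f:
        # ACCESS_COPY maps the file privately: nothing is read until a
        # page is touched, and stores never propagate back to the file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as mem:
            before = mem[offset]              # first touch faults the page in
            mem[offset] = (before + 1) % 256  # copy-on-write, private to us
            after = mem[offset]
    with open(path, "rb") as f:
        f.seek(offset)
        on_disk = f.read(1)[0]
    return before, after, on_disk
```

Because the file is never modified, many restored VMs can map the same memory file concurrently, each paying only for the pages it touches.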

Confidential MicroVMs: research work on AMD SEV + Firecracker for confidential Lambda functions. With guest memory encrypted by the CPU, the Lambda operator (AWS) could not read the function's memory while it runs.

Exercises

Clone the Firecracker repository and build it from source. Boot a minimal Linux guest using the official getting-started guide. Measure the time from InstanceStart to first shell prompt.
Compare the RSS (Resident Set Size) of a Firecracker process vs a QEMU process running equivalent VMs. Use ps aux and /proc/<pid>/status.
Examine Firecracker's seccomp filter definitions (the per-architecture JSON files under resources/seccomp/ in the repository). Count the number of allowed syscalls. Explain why execve is not in the allowlist.
Implement a Kata Containers setup using Firecracker as the backend. Run a container and verify isolation: from inside the container, attempt to read /proc/1/maps. Is this the host PID 1 or the VM's PID 1?
Read the Firecracker 2018 announcement blog post and the NSDI 2020 paper. Identify three design decisions that prioritized security over functionality.

References

Agache, A. et al. (2020). "Firecracker: Lightweight Virtualization for Serverless Applications." NSDI 2020.
AWS. (2018). "Firecracker: Secure and Fast MicroVMs for Serverless Computing." AWS Blog.
Firecracker GitHub. https://github.com/firecracker-microvm/firecracker
Firecracker Design Documentation. https://github.com/firecracker-microvm/firecracker/tree/main/docs
Intel Cloud Hypervisor. https://github.com/cloud-hypervisor/cloud-hypervisor
Google crosvm. https://github.com/google/crosvm
Kata Containers. https://katacontainers.io