06 — Firecracker and MicroVMs
Prerequisites
- KVM architecture: /dev/kvm ioctls, VMCS, VMEXIT, EPT
- Linux security: seccomp-BPF, namespaces (network, PID, user), cgroups, chroot
- VirtIO: virtio-net, virtio-blk frontend/backend model
- Containerization basics: container isolation, rootfs, overlay filesystems
Historical Context
AWS Lambda launched in November 2014, pioneering the serverless computing model. Lambda's initial architecture ran each function in a Linux container and used separate EC2 instances (VMs) to keep different customers apart. Containers gave process-level isolation between one customer's functions, but those functions all shared the instance's guest kernel, and dedicating VMs to individual customers made it hard to pack workloads densely.
By 2016, the scale of Lambda demanded a fundamentally different isolation model. Containers on a shared kernel are inherently vulnerable to kernel exploits — a single CVE could compromise all tenant functions on a host. But traditional VMs (QEMU/KVM) were too heavy for function execution: QEMU alone consumed ~500 MB of RAM and took ~5 seconds to boot. Running 150 VMs per host was impractical with these requirements.
The answer was Firecracker: a minimal VMM (Virtual Machine Monitor) designed from first principles for serverless workloads. Firecracker was written in Rust by AWS engineers, exploiting KVM for the hardware isolation guarantee while eliminating every QEMU component not strictly necessary. It was open-sourced in November 2018.
The name "Firecracker" reflects the design goal: fast, bright, contained — like a firecracker rather than a full explosion.
MicroVM Design Goals
A MicroVM is a VM that satisfies a different set of constraints than a traditional VM:
| Goal | Target | Traditional VM |
|---|---|---|
| Boot time | < 125 ms to userspace | 5–30 seconds |
| Memory footprint | 5 MB overhead per VM | 200–500 MB QEMU overhead |
| Density | 150+ VMs per host | 10–30 VMs per host |
| Attack surface | Minimal device set | Full PCIe, USB, BIOS, VGA |
| Startup jitter | < 5 ms P99 | Unpredictable |
| Security boundary | VM-level (kernel isolation) | Same |
Lambda boots a function's execution environment in response to an invocation request, so MicroVM start-up sits directly on the user-visible latency path alongside network and routing time. Firecracker's target is under 125 ms from the InstanceStart API call to the start of guest user space; whatever the customer runtime does after that point adds to the cold-start time.
Firecracker Implementation
Firecracker is written entirely in Rust, chosen for:
- Memory safety without garbage collection (no GC pauses in the VM lifecycle path)
- Rich type system catching logical errors at compile time
- Safe FFI for ioctl calls to /dev/kvm
- Fearless concurrency for the API server + VMM threads
Firecracker does not use QEMU. It implements its own minimal VMM using KVM ioctls directly.
Firecracker vs QEMU/KVM Stack:
Traditional QEMU/KVM: Firecracker/KVM:
+-----------------------+ +------------------------+
| Guest VM | | Guest VM |
| (Linux, 512MB+) | | (Linux, custom kernel) |
+-----------------------+ +------------------------+
| VirtIO devices | | VirtIO devices |
| PCI bus | | (no PCI bus) |
| BIOS/UEFI/ACPI | | (no BIOS/ACPI) |
+-----------------------+ +------------------------+
| QEMU | | Firecracker VMM |
| (~500MB RSS) | | (~5MB overhead) |
| C code (1M+ lines) | | Rust (50K lines) |
+-----------------------+ +------------------------+
| KVM | | KVM |
| (/dev/kvm ioctls) | | (/dev/kvm ioctls) |
+-----------------------+ +------------------------+
| Linux Kernel | | Linux Kernel |
+-----------------------+ +------------------------+
| Hardware | | Hardware |
+-----------------------+ +------------------------+
Firecracker's Device Model
Firecracker deliberately implements only the devices needed for serverless workloads:
| Device | Implementation | Notes |
|---|---|---|
| virtio-net | VirtIO 1.0, TAP backend | mmio, not PCI |
| virtio-blk | VirtIO 1.0, file backend | mmio, not PCI |
| virtio-vsock | AF_VSOCK host↔guest comms | Used for init protocol |
| Serial console (UART 16550A) | Emulated for logs | ttyS0 |
| i8042 keyboard controller | Minimal (required for reset) | No mouse |
| RTC (MC146818) | Real-time clock | |
Explicitly absent: USB, PCIe bus, VGA/GPU, BIOS/UEFI, ACPI tables, PCI device enumeration, floppy, sound, parallel port, IDE/SATA controllers. Each omitted device is one fewer code path that could contain a vulnerability.
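VirtIO devices use MMIO transport (not PCI), which keeps device discovery fast and the code path simple. There is no bus to enumerate: the VMM places each device's register window at a fixed guest-physical address and tells the guest where to look, for example on x86 through virtio_mmio.device= entries appended to the kernel command line.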
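As an illustration of MMIO-based discovery, the sketch below builds virtio_mmio.device= kernel parameters in the `<size>@<base>:<irq>` form that the Linux virtio-mmio driver accepts. The base address, slot size, and first IRQ are assumptions chosen for the example, not values stated in this chapter.

```python
# Sketch: one virtio_mmio.device= kernel parameter per device.
# BASE_ADDR, SLOT_SIZE and FIRST_IRQ are illustrative assumptions.
BASE_ADDR = 0xD000_0000
SLOT_SIZE = 0x1000  # one 4 KiB register window per device
FIRST_IRQ = 5


def virtio_mmio_params(device_names):
    """Return 'virtio_mmio.device=<size>@<addr>:<irq>' for each device."""
    params = []
    for index, _name in enumerate(device_names):
        addr = BASE_ADDR + index * SLOT_SIZE
        irq = FIRST_IRQ + index
        params.append(f"virtio_mmio.device=4K@{addr:#x}:{irq}")
    return params


if __name__ == "__main__":
    for param in virtio_mmio_params(["virtio-blk", "virtio-net"]):
        print(param)
```

Because each device is described by a fixed address and interrupt line, the guest needs no bus scan at all, which is part of why boot stays short.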
The Jailer: Multi-Layer Isolation
The Firecracker VMM process is itself confined by the jailer, a companion tool that sets up the following layers around it:
- cgroup: Assigns the Firecracker process to a cgroup with CPU and memory limits
- Network namespace: Places the process in a dedicated network namespace (only the tap device is visible)
- chroot: Pivots the process root to a minimal directory containing only Firecracker binary and its dependencies
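- seccomp-BPF: Applies a strict syscall allowlist covering only the few dozen syscalls Firecracker actually needs. Any other syscall raises SIGSYS and the process is killed. (In current releases the filter is installed by the Firecracker binary itself at startup rather than by the jailer.)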
- User namespace: Drops privileges; Firecracker runs as an unprivileged user
Isolation Layers:
Host OS
|
+-- network namespace (isolated NIC)
| +-- cgroup (CPU/mem limits)
| +-- chroot (minimal rootfs)
| +-- seccomp-BPF (syscall filter)
| +-- Firecracker VMM process
| +-- KVM /dev/kvm
| +-- Guest VM
| +-- Customer Lambda function code
The combination means: a guest VM escape (compromising Firecracker) is still contained by seccomp + namespace isolation. A complete Firecracker process escape is still contained by the outer network namespace and cgroup. Defense in depth with VM isolation as the primary boundary.
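A minimal sketch of how a launcher might assemble a jailer invocation for one MicroVM follows. The flag names are those of the jailer CLI as best recalled from its documentation and should be checked against the installed version; the helper names and the `/srv/jailer` base directory default are assumptions of this example.

```python
import os


def jailer_argv(vm_id, exec_file, uid, gid, netns=None,
                chroot_base="/srv/jailer"):
    """Build an argv list for the jailer (flag names should be verified)."""
    argv = [
        "jailer",
        "--id", vm_id,
        "--exec-file", exec_file,
        "--uid", str(uid),
        "--gid", str(gid),
        "--chroot-base-dir", chroot_base,
    ]
    if netns:
        # Join a pre-created network namespace holding only the tap device.
        argv += ["--netns", netns]
    return argv


def chroot_dir(vm_id, exec_file, chroot_base="/srv/jailer"):
    """Directory that becomes the VMM's root: <base>/<exec name>/<id>/root."""
    return os.path.join(chroot_base, os.path.basename(exec_file), vm_id, "root")


if __name__ == "__main__":
    print(" ".join(jailer_argv("vm1", "/usr/bin/firecracker", 1000, 1000,
                               netns="/var/run/netns/vm1")))
    print(chroot_dir("vm1", "/usr/bin/firecracker"))
```

Giving every MicroVM its own id, uid, and chroot directory means a compromised VMM process sees neither other tenants' files nor their network devices.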
seccomp-BPF Filter
Firecracker's seccomp filter allows approximately:
ALLOWED: read, write, open, close, mmap, mprotect, madvise,
ioctl (only on /dev/kvm fd), epoll_wait, eventfd,
signalfd, timerfd, futex, exit, exit_group,
socket (only AF_UNIX), bind, listen, accept,
sendmsg, recvmsg, fstat, lseek, pread64, pwrite64
BLOCKED: execve, fork, clone, ptrace, setuid, chroot,
mount, unshare, and ~300 others
Any syscall not in the allowlist terminates Firecracker immediately with SIGSYS.
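The allowlist logic can be modelled in a few lines. This is a toy model of the decision a seccomp filter makes, not the BPF program Firecracker installs; the rule table is a small subset of the list above and the function names are hypothetical.

```python
import socket

ALLOW = "ALLOW"
KILL = "KILL (SIGSYS)"

# Subset of the allowlist above. None means "any arguments are accepted";
# a callable inspects the syscall arguments, as seccomp argument rules do.
RULES = {
    "read": None,
    "write": None,
    "exit_group": None,
    # Only AF_UNIX sockets are permitted.
    "socket": lambda args: args[0] == socket.AF_UNIX,
}


def evaluate(syscall, args=()):
    """Return the filter decision for one syscall under default-deny."""
    if syscall not in RULES:
        return KILL
    check = RULES[syscall]
    if check is None or check(args):
        return ALLOW
    return KILL


if __name__ == "__main__":
    print("read   ->", evaluate("read"))
    print("execve ->", evaluate("execve"))
    print("socket(AF_INET) ->", evaluate("socket", (socket.AF_INET, 1, 0)))
```

The important property is default-deny: a syscall that nobody thought about is blocked rather than permitted.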
Firecracker Boot Process
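Firecracker bypasses BIOS/UEFI completely. It loads a Linux kernel directly into guest memory and starts it at its entry point, so the kernel must be built with a Firecracker-compatible configuration (CONFIG_PVH is needed only when the PVH entry point is used instead of the 64-bit Linux boot protocol):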
Firecracker Boot Sequence (<125ms total):
0ms: Firecracker process starts
Reads VM configuration from API (JSON over Unix socket)
5ms: KVM VM created (KVM_CREATE_VM ioctl)
Memory slots configured (KVM_SET_USER_MEMORY_REGION)
vCPUs created (KVM_CREATE_VCPU)
VirtIO MMIO devices configured
10ms: Kernel image (vmlinux) loaded into guest physical memory
Initial ramdisk (initrd) loaded
Kernel command line set (boot params in memory)
vCPU registers set: RIP → kernel entry point
12ms: KVM_RUN: guest vCPU starts executing kernel
Kernel decompresses itself (if compressed)
40ms: Kernel initializes: VirtIO probing, network config
init process (PID 1) starts
100ms: Customer init (agent) running, ready for function invocation
125ms: Function code executing
The critical optimization: no BIOS POST, no MBR boot, no ACPI table parsing. The kernel starts at its entry point directly.
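To see where the boot budget goes, the milestones in the sequence above can be turned into per-phase durations. The phase labels are shortened paraphrases of the timeline entries; the timestamps are the ones listed.

```python
# Milestones from the boot sequence above: (time in ms, phase that starts then).
MILESTONES = [
    (0, "process start, read config"),
    (5, "create VM, memory, vCPUs, devices"),
    (10, "load kernel and initrd, set registers"),
    (12, "KVM_RUN, early kernel"),
    (40, "kernel init, PID 1 starts"),
    (100, "agent ready"),
    (125, "function code executing"),
]


def phase_durations(milestones):
    """Return (label, duration_ms) for each phase between two milestones."""
    return [
        (label, next_start - start)
        for (start, label), (next_start, _) in zip(milestones, milestones[1:])
    ]


if __name__ == "__main__":
    for label, duration in phase_durations(MILESTONES):
        print(f"{duration:4d} ms  {label}")
```

In this timeline the VMM's own setup accounts for only the first 12 ms; most of the remaining budget is spent inside the guest, which is why the guest kernel configuration matters as much as the VMM.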
Custom Minimal Kernel
AWS maintains a minimal kernel configuration for Firecracker with:
- No modules (everything compiled in)
- No ACPI
- Minimal device drivers (only VirtIO, virtio-mmio, 8250 UART)
- CONFIG_PVH=y (paravirtual hardware boot)
- Very fast unpacking (lz4-compressed kernel)
- Minimal initrd (~1MB)
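A kernel .config can be checked against these requirements mechanically. The sketch below is an assumption-laden example: the option lists are one reading of the bullets above using standard Kconfig symbol names, not an official Firecracker requirements list.

```python
# Options this example expects to be built in (=y) or absent.
# The lists are an interpretation of the requirements above.
REQUIRED_ON = [
    "CONFIG_VIRTIO_MMIO",
    "CONFIG_VIRTIO_BLK",
    "CONFIG_VIRTIO_NET",
    "CONFIG_SERIAL_8250",
    "CONFIG_PVH",
]
REQUIRED_OFF = ["CONFIG_MODULES", "CONFIG_ACPI"]


def parse_config(text):
    """Parse 'CONFIG_X=value' lines; '# CONFIG_X is not set' lines are skipped."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def check_config(text):
    """Return a list of human-readable problems (empty if the config passes)."""
    options = parse_config(text)
    problems = []
    for key in REQUIRED_ON:
        if options.get(key) != "y":
            problems.append(f"{key} should be =y")
    for key in REQUIRED_OFF:
        if options.get(key) in ("y", "m"):
            problems.append(f"{key} should be unset")
    return problems


if __name__ == "__main__":
    with open(".config") as config_file:  # path is an assumption
        for problem in check_config(config_file.read()):
            print(problem)
```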
Snapshot and Restore
Firecracker supports snapshotting a running MicroVM and restoring it later:
Snapshot Flow:
Running MicroVM
|
| PUT /snapshot/create (API call)
|
v
Firecracker:
1. Pause all vCPUs (each vCPU thread is kicked out of KVM_RUN and parked)
2. Save vCPU state (registers and MSRs via KVM_GET_REGS, KVM_GET_SREGS, KVM_GET_MSRS and related ioctls)
3. Save device state (virtio queues, network state, disk position)
4. Write memory snapshot (mmap guest memory, write to file)
5. Resume (if requested) or exit
|
v
Snapshot files:
- vm-state.snapshot (vCPU registers, device state, ~100KB)
- mem.snapshot (full guest RAM, 256MB-2GB compressed)
Restore Flow:
New Firecracker process
|
| PUT /snapshot/load (API call)
|
v
Firecracker:
1. Create new KVM VM
2. Map memory snapshot (mmap or load) as guest physical memory
3. Restore vCPU state (KVM_SET_REGS, KVM_SET_SREGS, KVM_SET_MSRS and related ioctls)
4. Restore device state
5. Resume all vCPUs
|
v
MicroVM running from saved state (<150ms from file to running)
Use case, Lambda function warm starts: Lambda pre-warms function execution environments by snapshotting an initialized MicroVM (JVM started, code loaded, warm-up invocations done). Subsequent cold starts restore from the snapshot in ~150ms instead of starting from zero in ~3 seconds.
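The API calls in the two flows above carry small JSON bodies. The sketch below only builds those requests; the field names follow the Firecracker snapshot documentation as best recalled and should be checked against the API specification of the version in use. The file names are the ones from the snapshot flow.

```python
import json


def snapshot_create_requests(state_path, mem_path):
    """Requests to pause a running MicroVM and write a full snapshot."""
    return [
        # The VM must be paused before a snapshot can be taken.
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": state_path,
            "mem_file_path": mem_path,
        }),
    ]


def snapshot_load_request(state_path, mem_path, resume=True):
    """Request for a fresh Firecracker process to load a snapshot."""
    return ("PUT", "/snapshot/load", {
        "snapshot_path": state_path,
        "mem_backend": {"backend_type": "File", "backend_path": mem_path},
        "resume_vm": resume,
    })


if __name__ == "__main__":
    requests = snapshot_create_requests("vm-state.snapshot", "mem.snapshot")
    requests.append(snapshot_load_request("vm-state.snapshot", "mem.snapshot"))
    for method, path, body in requests:
        print(method, path, json.dumps(body))
```

Note that the load request must go to a new Firecracker process that has not yet booted a guest, matching step 1 of the restore flow.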
Production Scale
AWS Lambda: millions of concurrent function executions. Lambda runs on a fleet of hosts, each running 150+ Firecracker MicroVMs. Execution time is billed in 1 ms increments (100 ms increments before December 2020), so the overhead of the MicroVM must be imperceptible.
AWS Fargate: ECS/EKS Fargate runs each container task inside a dedicated Firecracker MicroVM. The customer sees a container; underneath is a VM providing kernel-level isolation.
Scale numbers (AWS 2019 blog post):
- < 125 ms: typical MicroVM boot time
- < 5 MB: Firecracker process memory overhead (excluding guest RAM)
- 150 MicroVMs per metal host: typical density
- Millions of MicroVMs running simultaneously across AWS fleet
Comparing Minimal VMMs
| VMM | Language | Backend | Key Use Case | Attack Surface |
|---|---|---|---|---|
| Firecracker | Rust | KVM | AWS Lambda/Fargate | Smallest |
| Cloud Hypervisor | Rust | KVM | Cloud VMs, Container VMs | Small |
| crosvm | Rust | KVM/VFIO | ChromeOS containers (Crostini) | Small |
| QEMU | C | KVM/TCG/many | General purpose, dev/test | Large |
| Kata Containers | Go + C | QEMU/Firecracker/dragonball | Container-in-VM | Configurable |
Cloud Hypervisor (Intel): similar to Firecracker but with additional features — VFIO passthrough, large VMs (>1 TB), live migration. Used for more general cloud workloads.
crosvm (Google): powers Linux containers in ChromeOS ("Crostini"). Developed by the Chrome OS team. Supports GPU virtualization for graphics in the Linux container.
Kata Containers: an OCI-compliant container runtime that runs each container inside a VM. Uses either QEMU or Firecracker as the VMM. The container API (Docker, Kubernetes) is preserved, but isolation is VM-level.
Security Implications
- VM isolation as primary boundary: unlike Docker containers, Firecracker uses KVM-enforced hardware isolation. A kernel exploit in the guest cannot compromise the host (without a KVM bug).
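- seccomp filter effectiveness: the seccomp filter means that even a Firecracker memory corruption bug cannot reach syscalls outside the allowlist, which sharply limits what an attacker who gains code execution in the VMM can do. Fuzzing raises confidence in this layer but cannot prove the absence of bugs.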
- Memory isolation: each MicroVM's guest memory is mapped into Firecracker's virtual address space as an mmap'd region. EPT ensures the guest cannot access memory outside its mapped region.
- Side channels: MicroVMs sharing a physical host still share CPU microarchitectural state. Spectre/Meltdown patches (IBRS, STIBP, retpoline) are required on the host kernel. Firecracker itself enables appropriate CPU isolation flags via KVM.
- Snapshot security: snapshots contain full VM memory. At rest, snapshots must be encrypted. Restored snapshots must be treated as potentially compromised if the source was untrusted.
Performance Implications
- Boot latency: < 125 ms for Firecracker vs 3-10s for QEMU. The difference is the absence of BIOS, ACPI, device enumeration, and a much smaller kernel config.
- Memory overhead: ~5 MB Firecracker overhead per VM vs ~200-500 MB for QEMU. At 150 VMs/host, Firecracker's total VMM overhead is about 750 MB, against roughly 30-75 GB for QEMU, a saving of about 29-74 GB per host.
- I/O performance: Firecracker uses VirtIO over MMIO rather than PCI. The virtqueue data path is the same for both transports; what MMIO gives up is PCI features such as MSI-X interrupts, in exchange for a simpler device model and faster discovery at boot.
- vCPU overhead: same as KVM — hardware VMX. No additional overhead from Firecracker vs QEMU for compute.
- Snapshotting overhead: full memory snapshot of a 512 MB VM takes ~200 ms. Compressed snapshots (lz4) reduce file size by 3-5x for typical workloads.
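The memory-overhead figures above can be checked with a few lines of arithmetic, using the per-VM numbers from this section (5 MB for Firecracker, 200-500 MB for QEMU) at 150 VMs per host.

```python
def vmm_overhead_mb(vm_count, per_vm_mb):
    """Total VMM overhead on one host, excluding guest RAM."""
    return vm_count * per_vm_mb


VMS_PER_HOST = 150

firecracker_mb = vmm_overhead_mb(VMS_PER_HOST, 5)
qemu_low_mb = vmm_overhead_mb(VMS_PER_HOST, 200)
qemu_high_mb = vmm_overhead_mb(VMS_PER_HOST, 500)

if __name__ == "__main__":
    print(f"Firecracker total: {firecracker_mb} MB")
    print(f"QEMU total:        {qemu_low_mb}-{qemu_high_mb} MB")
    print(f"Saved per host:    {qemu_low_mb - firecracker_mb}-"
          f"{qemu_high_mb - firecracker_mb} MB")
```

The 750 MB figure is Firecracker's total overhead at this density; the saving relative to QEMU is the difference between the two totals, roughly 29-74 GB.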
Debugging Notes
# Start Firecracker manually (for testing)
./firecracker --api-sock /tmp/fc.sock --log-path /tmp/fc.log
# Configure VM via API
curl -X PUT --unix-socket /tmp/fc.sock \
http://localhost/machine-config \
-H 'Content-Type: application/json' \
-d '{"vcpu_count": 1, "mem_size_mib": 128}'
# Set kernel
curl -X PUT --unix-socket /tmp/fc.sock \
http://localhost/boot-source \
-d '{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k"}'
# Start the VM
curl -X PUT --unix-socket /tmp/fc.sock \
http://localhost/actions \
-d '{"action_type": "InstanceStart"}'
# Check Firecracker metrics
cat /tmp/fc.log | grep -i "boot_time\|duration"
# Monitor KVM VMEXITs for Firecracker VM
cat /sys/kernel/debug/kvm/*/exits
# Check seccomp filter applied to Firecracker (Seccomp: 2 means filter mode)
grep -i seccomp /proc/$(pgrep -o firecracker)/status
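The same API calls can be scripted without curl. The sketch below uses only the Python standard library to send HTTP requests over the Unix socket; the class and function names are hypothetical, and the socket path and request bodies are the ones from the curl commands above.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def api_call(socket_path, method, path, body):
    """Send one JSON request and return (status code, response text)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request(method, path, body=json.dumps(body).encode(),
                     headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        return response.status, response.read().decode()
    finally:
        conn.close()


# Same three steps as the curl commands above.
BOOT_SEQUENCE = [
    ("PUT", "/machine-config", {"vcpu_count": 1, "mem_size_mib": 128}),
    ("PUT", "/boot-source", {"kernel_image_path": "vmlinux",
                             "boot_args": "console=ttyS0 reboot=k"}),
    ("PUT", "/actions", {"action_type": "InstanceStart"}),
]

if __name__ == "__main__":
    for method, path, body in BOOT_SEQUENCE:
        status, text = api_call("/tmp/fc.sock", method, path, body)
        print(method, path, status, text)
```

Printing the status and body of every response makes configuration errors visible immediately, since Firecracker reports them in the HTTP response rather than on the console.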
Failure Modes
- MicroVM boot failure: if the guest kernel panics during boot (wrong kernel config, missing virtio-mmio support), the VM exits. Firecracker logs the last serial output. Solution: check kernel config against Firecracker's documented requirements.
- OOM during snapshot: snapshotting a VM with many dirty pages can cause the host to OOM if swap space is insufficient. Monitor host memory before snapshot operations.
- API socket timeout: the Firecracker API server is single-threaded per VM. If a long operation blocks the API thread, subsequent API calls time out. Production deployments use per-VM API sockets.
- vCPU stuck: a guest executing an infinite tight loop without HLT/PAUSE consumes 100% of a host core. KVM watchdog (NMI watchdog) or cpu.shares cgroup limits prevent monopolization.
- Snapshot restore failure: if the snapshot was taken on a different CPU model (different CPUID flags), restoring may fail due to incompatible VMCS state. Snapshots are tied to specific CPU generations.
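A rough pre-flight check for the snapshot restore failure above is to compare the CPU feature flags of the source and target hosts. The sketch below parses the flags line of /proc/cpuinfo on x86 Linux; it is a heuristic with hypothetical function names, not the compatibility check Firecracker itself performs.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def missing_on_target(source_flags, target_flags):
    """Flags the snapshot's source host had that the restore host lacks."""
    return sorted(source_flags - target_flags)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as cpuinfo:
        local = cpu_flags(cpuinfo.read())
    # In practice the source host's flags would be recorded next to the snapshot.
    recorded = local | {"example_flag_from_source"}  # made-up flag for the demo
    print(missing_on_target(recorded, local))
```

An empty result does not guarantee a successful restore, but a non-empty one is a strong hint that the snapshot should not be loaded on this host.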
Modern Usage and Future Directions
Firecracker v1.0 (released 2022): stabilized API, improved snapshot format, added metrics, balloon device support, and CPU oversubscription controls.
UEFI-less boot standardization: Firecracker pioneered the approach of booting guests directly from Linux boot protocol without BIOS/UEFI. This pattern is being adopted by Cloud Hypervisor and other minimal VMMs.
Lazy paging: Firecracker 1.0 introduced lazy page loading for snapshots — instead of loading the full memory snapshot at restore time, pages are faulted in on demand (via userfaultfd). This reduces restore time from 3 seconds to under 200 ms for a 512 MB VM.
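The idea behind lazy paging can be demonstrated with an ordinary private file mapping. This is not userfaultfd (the Python standard library has no binding for it); it is a sketch of the same demand-paging principle, where mapping the file is cheap and a page is only brought in when it is first touched. The file here is a temporary stand-in for a memory snapshot.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def demo(pages=256):
    """Map a stand-in 'snapshot' privately, touch one page, write to it."""
    with tempfile.NamedTemporaryFile(delete=False) as snapshot:
        snapshot.write(b"\xAA" * PAGE * pages)
        path = snapshot.name
    try:
        with open(path, "rb") as snapshot:
            # ACCESS_COPY gives a private copy-on-write mapping; creating it
            # does not read the file contents.
            mem = mmap.mmap(snapshot.fileno(), 0, access=mmap.ACCESS_COPY)
            first = mem[0]    # first touch faults in a single page
            mem[0] = 0x55     # the write stays private to this mapping
            changed = mem[0]
            mem.close()
        with open(path, "rb") as snapshot:
            on_disk = snapshot.read(1)[0]
        return first, changed, on_disk
    finally:
        os.unlink(path)


if __name__ == "__main__":
    first, changed, on_disk = demo()
    print(hex(first), hex(changed), hex(on_disk))
```

The private mapping also shows why one snapshot file can back many restored MicroVMs: each guest's writes go to its own copies of the pages and the file on disk is never modified.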
Confidential MicroVMs: research work on AMD SEV + Firecracker for confidential Lambda functions. With guest memory encrypted by the CPU, the Lambda operator (AWS) cannot read the function's memory while it runs.
Exercises
- Clone the Firecracker repository and build it from source. Boot a minimal Linux guest using the official getting-started guide. Measure the time from InstanceStart to first shell prompt.
- Compare the RSS (Resident Set Size) of a Firecracker process vs a QEMU process running equivalent VMs. Use ps aux and /proc/<pid>/status.
- Examine Firecracker's seccomp filter source (src/jailer/src/env.rs). Count the number of allowed syscalls. Explain why execve is not in the whitelist.
- Implement a Kata Containers setup using Firecracker as the backend. Run a container and verify isolation: from inside the container, attempt to read /proc/1/maps. Is this the host PID 1 or the VM's PID 1?
- Read the Firecracker 2018 announcement blog post and the NSDI 2020 paper. Identify three design decisions that prioritized security over functionality.
References
- Agache, A. et al. (2020). "Firecracker: Lightweight Virtualization for Serverless Applications." NSDI 2020.
- AWS. (2018). "Firecracker: Secure and Fast MicroVMs for Serverless Computing." AWS Blog.
- Firecracker GitHub. https://github.com/firecracker-microvm/firecracker
- Firecracker Design Documentation. https://github.com/firecracker-microvm/firecracker/tree/main/docs
- Intel Cloud Hypervisor. https://github.com/cloud-hypervisor/cloud-hypervisor
- Google crosvm. https://github.com/google/crosvm
- Kata Containers. https://katacontainers.io