gVisor and Kata Containers
Technical Overview
Standard containers (runc-based) share the host kernel. This creates a fundamental security problem: any exploitable kernel vulnerability is reachable from within a container. The history of container CVEs confirms this — many high-severity container escapes are kernel vulnerabilities triggered via syscalls that containers can make.
Two major projects address this problem with different approaches:
- gVisor (Google, 2018): Implements a user-space kernel that intercepts guest syscalls in software, so they never reach the host kernel. The guest process communicates with a Go-written kernel (the Sentry) rather than the host Linux kernel.
- Kata Containers (merge of Clear Containers + runV, 2017): Runs each container inside a lightweight virtual machine. Hardware virtualization (KVM) provides the isolation boundary: the guest kernel runs isolated, and even if it is compromised, the VMM boundary must also be breached.
Both are OCI-compatible, meaning they can be dropped in as runc replacements without modifying container images or orchestrators.
Prerequisites
- Linux namespaces and cgroups (sections 01, 02)
- Container runtimes and OCI spec (section 03)
- Virtual machine concepts: hypervisor, VMM, hardware virtualization (VMX/SVM)
- System call mechanism, interrupt handling
- eBPF and ptrace fundamentals (for understanding gVisor platforms)
Historical Context
gVisor was developed at Google internally for several years before open-source release in May 2018. Google had been running container-based infrastructure (Borg) since the early 2000s, but as Kubernetes/GKE became public multi-tenant infrastructure, the security model of "containers share host kernel" was unacceptable for running untrusted user workloads. gVisor was the solution for Google Cloud Run and certain GKE sandbox modes.
Clear Containers (Intel) and runV (HyperHQ) were independent efforts at VM-based containers. Clear Containers used Intel's optimized lightweight KVM VMs; runV was a universal approach. They merged in 2017 to form Kata Containers under the OpenStack Foundation (now the Open Infrastructure Foundation). Kata Containers received major contributions from Intel, Red Hat, Hyper.sh, and later Amazon (who uses Firecracker extensively).
Firecracker (Amazon, 2018): While not a container runtime itself, Firecracker is a microVM monitor (VMM) written in Rust that targets sub-125ms VM boot times. AWS Lambda and AWS Fargate use Firecracker. It integrates with Kata Containers as an alternative to QEMU, providing OCI-compatible hardware isolation.
The Problem: Host Kernel Exposure
Standard Container (runc):
Container Process
│
│ syscall (e.g., open, read, clone, ioctl, bpf)
▼
Host Linux Kernel
(same kernel as all other containers on the node)
│
│ if kernel has exploitable vulnerability:
▼
Attacker gains host kernel code execution
→ Container escape → host compromise
The attack surface is the ENTIRE Linux kernel syscall interface.
~400 syscalls, millions of lines of kernel code.
Seccomp reduces this surface but cannot eliminate it.
gVisor Architecture
gVisor implements a Go-written user-space kernel called the Sentry that intercepts all guest system calls. The guest process never executes host kernel code for its syscalls — the Sentry handles them entirely in user space.
Components
Sentry: The core of gVisor. A sandboxed process written in Go that:
- Implements ~200 Linux system calls (a re-implementation of the Linux ABI)
- Maintains its own network stack (netstack, written in Go as part of gVisor)
- Maintains its own VFS (virtual filesystem)
- Does not use libc or the host's standard library for its own operation
- Runs as an unprivileged process under its own seccomp filter, which allows only the minimal host syscalls the Sentry itself needs (~50-70 syscalls to the host kernel, down from ~400)
Gofer: A file access proxy process. The Sentry needs to access the container's filesystem (the OCI image rootfs). Rather than accessing it directly (which would require more host syscalls), the Sentry communicates with a Gofer process over a 9P protocol socket. The Gofer does the actual file I/O against the host kernel. This ensures the Sentry's seccomp filter can be kept minimal.
runsc: The OCI runtime binary. Replaces runc. When Kubernetes or Docker invokes runsc with an OCI bundle, it starts the Sentry and Gofer, sets up the container, and the container process runs inside the Sentry's sandbox.
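The division of labour between the components can be pictured with a toy dispatcher. This is only a sketch of the idea, not gVisor code: the routing table, the backend labels, and the name `made_up_call` are assumptions made for illustration, and the real Sentry routes by file descriptor type rather than by syscall name alone.

```python
import errno

# Hypothetical routing table: which subsystem serves each guest syscall
# in this simplified model (labels are illustrative, not gVisor identifiers).
ROUTES = {
    "open": "gofer", "read": "gofer", "write": "gofer", "close": "gofer",
    "socket": "netstack", "connect": "netstack", "send": "netstack",
    "clone": "sentry", "execve": "sentry", "mmap": "sentry", "mprotect": "sentry",
}


def handle_guest_syscall(name):
    """Return (backend, errno) for a guest syscall in this toy model."""
    backend = ROUTES.get(name)
    if backend is None:
        # An unimplemented syscall fails inside the sandbox;
        # nothing is forwarded to the host kernel.
        return None, errno.ENOSYS
    return backend, 0


if __name__ == "__main__":
    for call in ("read", "connect", "mmap", "made_up_call"):
        print(call, handle_guest_syscall(call))
```

The point of the sketch is the last branch: a call the user-space kernel does not implement is rejected locally instead of falling through to the host.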
gVisor Architecture Diagram
Guest Container Process
(nginx, python app, etc.)
│
│ syscall (e.g., read(fd, buf, len))
▼
┌─────────────────────────────────────────────────────────┐
│ gVisor Sentry (userspace kernel in Go) │
│ │
│ Handles ~200 Linux syscalls: │
│ - open/read/write/close → asks Gofer via 9P │
│ - socket/connect/send → handled by gVisor netstack │
│ - fork/clone/execve → manages in Sentry VFS/ProcMgr │
│ - mmap/mprotect → manages memory in Sentry │
│ │
│ Own seccomp filter: allows only ~50 host syscalls │
└──────────┬────────────────────────┬────────────────────┘
           │                        │
           │ 9P protocol            │ host syscalls (minimal set)
           ▼                        ▼
    ┌──────────────┐   ┌──────────────────────────┐
    │ Gofer        │   │ Host Linux Kernel        │
    │ (file proxy) │   │ (only ~50 syscalls       │
    │              │   │  reachable from Sentry)  │
    │ host syscalls│   └──────────────────────────┘
    │ for file I/O │
    └──────────────┘
gVisor Platforms
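The Sentry needs a way to intercept guest process syscalls. gVisor calls each interception mechanism a "platform". The two described here are ptrace and KVM; recent gVisor releases also ship a third platform, systrap, which has replaced ptrace as the default.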
ptrace platform:
- The Sentry uses ptrace(PTRACE_SYSEMU) to intercept every syscall from the guest process before it reaches the host kernel
- When the guest makes a syscall, ptrace causes it to stop; the Sentry handles the syscall and resumes the guest
- Overhead: ptrace involves context switches between the guest process and the Sentry for every syscall
- Advantage: works on any hardware, any kernel; no hardware virtualization required
- Performance: ~10-100x overhead on syscall-heavy workloads; compute-bound workloads see minimal impact
- Use case: environments without KVM (nested virtualization, containers on VMs without hardware support)
KVM platform:
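- The Sentry uses KVM (hardware virtualization) to act as both guest kernel and VMM: application code runs in guest user mode, and the Sentry runs in guest kernel mode (ring 0) of the same VM
- A guest syscall traps to the Sentry inside the VM instead of reaching the host kernel; the Sentry leaves guest mode only when it needs to issue a host syscall of its own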
- Context switches between guest and Sentry are handled by KVM — faster than ptrace
- Overhead: ~20-50% on syscall-heavy workloads (much better than ptrace)
- Advantage: near-native performance for compute-bound workloads; syscall latency comparable to Docker for many workloads
- Requirement: /dev/kvm available — requires hardware virtualization, which may not be available in nested VM environments
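Why compute-bound workloads barely notice the platform while syscall-heavy ones suffer can be seen with a simple linear cost model. The sketch below assumes total time is compute time plus a fixed cost per syscall; the function names and every numeric input are hypothetical, chosen only to show the shape of the effect.

```python
def estimated_runtime(compute_s, n_syscalls, native_syscall_us, interception_factor):
    """Total seconds = compute time + syscall count * per-syscall cost.

    interception_factor is how many times more expensive a syscall becomes
    under a given platform (1.0 means native). All inputs are assumptions.
    """
    per_syscall_s = native_syscall_us * 1e-6 * interception_factor
    return compute_s + n_syscalls * per_syscall_s


def slowdown(compute_s, n_syscalls, native_syscall_us, interception_factor):
    """Ratio of sandboxed runtime to native runtime under the same model."""
    native = estimated_runtime(compute_s, n_syscalls, native_syscall_us, 1.0)
    sandboxed = estimated_runtime(
        compute_s, n_syscalls, native_syscall_us, interception_factor
    )
    return sandboxed / native


if __name__ == "__main__":
    # Compute-bound: 10 s of CPU work and only a thousand syscalls.
    print(slowdown(10.0, 1_000, 0.5, 50.0))
    # Syscall-bound: almost no compute, ten million syscalls.
    print(slowdown(0.01, 10_000_000, 0.5, 50.0))
```

With few syscalls the ratio stays close to 1 regardless of the interception factor; as syscalls dominate, the ratio approaches the factor itself.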
gVisor Isolation Boundary
Host kernel attack surface:
Standard container (runc): ~400 syscalls reachable
+ millions of lines of kernel code
attack surface: LARGE
gVisor (ptrace/KVM): ~50 host syscalls reachable from Sentry
+ Sentry Go code (smaller, auditable)
attack surface: MUCH SMALLER
Remaining risk:
- Vulnerabilities in the ~50 host syscalls Sentry uses
- Vulnerabilities in Sentry's own code (Go, ~150k lines)
- Platform-specific attack surface (KVM hypervisor code)
gVisor Overhead Profile
gVisor's performance impact depends heavily on workload type:
| Workload type | gVisor overhead | Reason |
|---|---|---|
| CPU-bound (no syscalls) | ~0-5% | Pure compute, no syscall interception |
| File I/O (small files, many calls) | 2-5x slower | 9P round trips to Gofer for each I/O |
| File I/O (large sequential reads) | 20-50% slower | Gofer proxying overhead |
| Network throughput | 20-50% slower | gVisor's Go netstack vs. host kernel |
| Network latency | +50-200µs | Netstack processing in Sentry |
| Syscall-heavy benchmarks | 10-100x slower | ptrace platform; 2-5x on KVM |
| Container startup | 100-300ms overhead | Sentry initialization |
Where gVisor is used in production: Google Cloud Run (serverless containers), Cloud Functions, some GKE node pool configurations. These workloads are typically short-lived HTTP handlers where the 5-20% throughput reduction is acceptable for the security isolation guarantee.
Kata Containers
Kata Containers takes a fundamentally different approach: instead of implementing a user-space kernel, it runs each container inside a real (lightweight) virtual machine. The container workload has its own guest kernel, so the host kernel is exposed only to the virtual machine monitor (VMM), not to the container process.
Architecture
┌──────────────────────────────────────────────────────────────┐
│ Kubernetes / Container Orchestrator │
│ kubelet → CRI → containerd → kata-containers shim │
└────────────────────────────────┬─────────────────────────────┘
│
┌────────────────▼────────────────┐
│ kata-containers shim (host) │
│ - creates the microVM │
│ - manages VM lifecycle │
└────────────────┬────────────────┘
│ virtio, vsock
┌────────────────▼────────────────┐
│ Lightweight VM (QEMU or │
│ Firecracker) │
│ - dedicated guest kernel │
│ - virtio devices for storage/net│
│ │
│ ┌──────────────────────────┐ │
│ │ kata-agent (in-VM) │ │
│ │ - receives CRI-like API │ │
│ │ over vsock │ │
│ │ - runs OCI container │ │
│ │ inside the VM │ │
│ │ │ │
│ │ Container Process │ │
│ │ (nginx, python, etc.) │ │
│ └──────────────────────────┘ │
└─────────────────────────────────┘
│
┌────────────────▼────────────────┐
│ Host Linux Kernel │
│ (only exposed to VMM code, │
│ not to container syscalls) │
└──────────────────────────────────┘
kata-agent
The kata-agent is a small process running inside the VM that:
- Communicates with the Kata shim on the host via a vsock (VM socket)
- Receives container lifecycle commands (create, start, exec, kill)
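- Runs the OCI container inside the VM using container-creation code built into the agent (libcontainer, the library behind runc, in Kata 1.x; the Rust rustjail crate in Kata 2.x and later)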
- Manages stdio routing between the host shim and the in-VM container process
The vsock protocol used is the Kata Containers Agent API, an RPC protocol carried over vsock (gRPC in Kata 1.x, ttRPC in Kata 2.x and later).
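The shim-to-agent exchange can be sketched as a request/response loop over a socket. The example below is a simplified stand-in, not the Kata Agent API: it uses `socket.socketpair()` in place of vsock, length-prefixed JSON in place of the real RPC encoding, and the command names and state strings are assumptions.

```python
import json
import socket
import struct


def send_msg(sock, obj):
    """Send one JSON message with a 4-byte big-endian length prefix."""
    data = json.dumps(obj).encode()
    sock.sendall(struct.pack("!I", len(data)) + data)


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def recv_msg(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length))


def agent_handle_one(sock, containers):
    """Agent side (inside the VM): apply one lifecycle command and reply."""
    req = recv_msg(sock)
    cmd, cid = req["cmd"], req["id"]
    if cmd == "create":
        containers[cid] = "created"
    elif cmd == "start" and containers.get(cid) == "created":
        containers[cid] = "running"
    elif cmd == "kill" and cid in containers:
        containers[cid] = "stopped"
    else:
        send_msg(sock, {"ok": False, "error": f"cannot {cmd} {cid}"})
        return
    send_msg(sock, {"ok": True, "state": containers[cid]})


def shim_call(shim_sock, agent_sock, containers, cmd, cid):
    """Shim side (on the host): send a command and wait for the reply."""
    send_msg(shim_sock, {"cmd": cmd, "id": cid})
    agent_handle_one(agent_sock, containers)  # runs in-process for the demo
    return recv_msg(shim_sock)


if __name__ == "__main__":
    shim, agent = socket.socketpair()
    state = {}
    for command in ("create", "start", "kill"):
        print(command, shim_call(shim, agent, state, command, "c1"))
```

In a real deployment the two ends live in different kernels and the agent runs its own event loop; the single-process call here only keeps the example self-contained.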
VM Boot Time
Traditional VMs take seconds to minutes to boot. Kata Containers achieves fast boot by:
- Using a minimal guest kernel (compiled to include only necessary drivers)
- Using initrd (initial ramdisk) instead of full disk boot
- Reusing pre-warmed VM templates, so new VMs share read-only memory (guest kernel, initrd, agent) with a template instead of booting from scratch
- Using Firecracker (vs QEMU): sub-125ms cold boot, 150ms to first request
Storage: virtiofs
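With a VMM that provides a virtio-fs device (QEMU, for example), Kata Containers uses virtiofs to share the container's rootfs from the host into the VM. Firecracker has no virtio-fs device, so with Firecracker the rootfs is attached as a virtio-block device instead.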
- virtiofsd runs on the host, serving the container's OCI image layers via FUSE
- The VM accesses the rootfs via a virtio-fs device (DAX window for direct memory mapping)
- Shared memory between host and guest eliminates data copies for I/O
With DAX (Direct Access), file reads/writes in the guest directly map to host memory pages containing the overlay filesystem — near-native I/O performance.
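The difference between copying file data and mapping it can be illustrated inside a single process. This is an analogy only: it contrasts `read()` with `mmap` on an ordinary file, and does not involve a VM, virtiofsd, or a DAX window. The function names are hypothetical.

```python
import mmap
import os
import tempfile


def read_by_copy(path, offset, length):
    """Copy path: the kernel copies bytes into a fresh user buffer."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def read_by_mapping(path, offset, length):
    """Mapping path: the file's pages are mapped and then sliced.

    The slice is materialised as bytes only so the two functions can be
    compared; a real consumer would work on the mapped view directly.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[offset:offset + length])


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789abcdef")
        path = tmp.name
    try:
        print(read_by_copy(path, 4, 6), read_by_mapping(path, 4, 6))
    finally:
        os.remove(path)
```

Both functions return the same bytes; what differs is whether the data is duplicated on the way, which is the cost DAX is meant to avoid between host and guest.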
Kata with Firecracker
Firecracker is Amazon's open-source microVM monitor written in Rust. It provides:
- Fast cold boot: the project targets at most 125 ms from the InstanceStart API call (PUT /actions) to the start of guest user space
- Minimal device model: virtio-net, virtio-block, vsock — no USB, PCI, BIOS, legacy devices
- Rust implementation: memory-safe VMM with small attack surface
- Security model: seccomp filter on the VMM process, jailer for further restriction
Kata + Firecracker combines OCI compatibility (Kata shim speaks CRI/OCI) with Firecracker's fast, minimal VMs:
kubelet → containerd → kata-fc shim → Firecracker VM → kata-agent → container
AWS Lambda's execution model is essentially this architecture, though it does not literally use the public Kata + Firecracker stack: the internal implementation predates the integration.
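To make the minimal device model concrete, the sketch below builds the sequence of REST requests a client would send to Firecracker's API socket to boot a microVM: set a boot source, attach a root drive, then start the instance. It only constructs the requests and does not send them. The file paths are placeholders and the helper name is hypothetical; in a Kata deployment the shim performs this step, not the user.

```python
import json


def firecracker_boot_requests(kernel_path, rootfs_path,
                              boot_args="console=ttyS0 reboot=k panic=1"):
    """Return (method, path, body) tuples for a minimal microVM boot."""
    return [
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": boot_args,
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


if __name__ == "__main__":
    # Placeholder paths; substitute a real kernel image and rootfs.
    for method, path, body in firecracker_boot_requests(
        "/path/to/vmlinux", "/path/to/rootfs.ext4"
    ):
        print(method, path, json.dumps(body))
```

Three calls are enough for a bootable VM, which reflects how little device configuration the VMM exposes.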
Isolation Boundary Comparison Diagram
Isolation Model Comparison:
runc (standard):
┌────────────────────────────────────────────────────────────┐
│ Host System │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Container Process │ │
│ │ (nginx, app, etc.) │ │
│ │ │ syscall │ │
│ │ ▼ │ │
│ │ Host Linux Kernel (SHARED) ← exploit target │ │
│ └────────────────────────────────────────────────────┘ │
│ Isolation: namespace + seccomp (software, bypassable) │
└────────────────────────────────────────────────────────────┘
gVisor:
┌────────────────────────────────────────────────────────────┐
│ Host System │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Sentry (userspace kernel) │ │
│ │ ┌────────────────────────────────────┐ │ │
│ │ │ Container Process │ │ │
│ │ │ │ syscall │ │ │
│ │ │ ▼ │ │ │
│ │ │ Sentry handles syscall │ │ │
│ │ │ (Go code, ~150k lines) │ │ │
│ │ └────────────────────────────────────┘ │ │
│ │ │ ~50 host syscalls only │ │
│ │ ▼ │ │
│ │ Host Linux Kernel │ │
│ └───────────────────────────────────────────────────┘ │
│ Isolation: user-space kernel intercepts all syscalls │
└────────────────────────────────────────────────────────────┘
Kata Containers:
┌────────────────────────────────────────────────────────────┐
│ Host System │
│ ┌───────────────────────────────────────────────────┐ │
│ │ Lightweight VM (Firecracker / QEMU) │ │
│ │ ┌────────────────────────────────────┐ │ │
│ │ │ Guest Kernel (isolated) │ │ │
│ │ │ ┌──────────────────────────┐ │ │ │
│ │ │ │ Container Process │ │ │ │
│ │ │ │ │ syscall │ │ │ │
│ │ │ │ ▼ │ │ │ │
│ │ │ │ Guest Linux Kernel │ │ │ │
│ │ │ └──────────────────────────┘ │ │ │
│ │ └────────────────────────────────────┘ │ │
│ │ │ KVM VM exits only │ │
│ │ ▼ │ │
│ │ VMM (Firecracker/QEMU) + Host Kernel │ │
│ └───────────────────────────────────────────────────┘ │
│ Isolation: hardware VM boundary (KVM) │
└────────────────────────────────────────────────────────────┘
Comparison Table: runc vs gVisor vs Kata
| Property | runc | gVisor | Kata (Firecracker) |
|---|---|---|---|
| Isolation mechanism | Namespaces + seccomp | User-space kernel | Hardware VM (KVM) |
| Kernel attack surface | Full host kernel (~400 syscalls) | ~50 host syscalls | VMM code only |
| Container escape risk | High (kernel vuln = escape) | Medium (Sentry bug or host syscall vuln) | Low (VM + VMM must both be broken) |
| Startup latency | 100-200ms | 200-500ms | 150-400ms (Firecracker) |
| Memory overhead | ~1MB per container | ~150MB (Sentry) | ~100-200MB (guest kernel + kata-agent) |
| CPU overhead | ~0% | 5-20% (KVM) to 2-10x (ptrace) | 2-10% |
| IO performance | Near-native | 20-50% slower | Near-native (virtio-block; Firecracker has no virtiofs device) |
| Network performance | Near-native | 20-50% slower | Near-native (virtio-net) |
| Compatibility | Full Linux ABI | ~200 syscalls (gaps exist) | Full guest kernel ABI |
| Multi-tenancy safety | Poor | Good | Excellent |
| OCI compatible | Yes | Yes (runsc) | Yes (kata-runtime) |
| Rootless | Yes | Yes | No (KVM requires privilege) |
| Kubernetes integration | Default | RuntimeClass | RuntimeClass |
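The table can be read as a decision procedure. The function below is one possible encoding of it, a sketch under simplifying assumptions rather than a recommendation: the four boolean inputs and the rule order are choices made for illustration, and real placement decisions also weigh memory overhead and startup latency.

```python
def choose_runtime(untrusted, kvm_available, needs_full_linux_abi, io_heavy):
    """Pick a runtime name from coarse workload traits.

    Returns "runc", "gvisor", "kata", or None when no option in the table
    satisfies the constraints.
    """
    if not untrusted:
        # Trusted code: the shared host kernel is an acceptable risk.
        return "runc"
    if kvm_available and (needs_full_linux_abi or io_heavy):
        # Full guest kernel ABI and near-native I/O, at a memory cost.
        return "kata"
    if needs_full_linux_abi:
        # gVisor has syscall gaps and Kata needs KVM: nothing fits.
        return None
    # gVisor can run without KVM via its ptrace platform.
    return "gvisor"


if __name__ == "__main__":
    print(choose_runtime(untrusted=True, kvm_available=False,
                         needs_full_linux_abi=False, io_heavy=False))
```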
Production Examples
Using gVisor with Kubernetes (RuntimeClass):
# RuntimeClass definition
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# Pod using gVisor
apiVersion: v1
kind: Pod
metadata:
  name: gvisor-demo  # example name; a Pod requires metadata.name
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: nginx:latest
Using Kata Containers with Kubernetes:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc  # kata-containers with Firecracker
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-fc-demo  # example name; a Pod requires metadata.name
spec:
  runtimeClassName: kata-fc
  containers:
  - name: app
    image: nginx:latest
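When the same workload has to be deployed under several runtimes, the manifests can be generated rather than copied. The sketch below builds the RuntimeClass and Pod objects as plain dictionaries and prints them as a JSON list, which kubectl accepts in place of YAML. The helper names and the pod names are hypothetical.

```python
import json


def runtime_class(name, handler):
    """RuntimeClass object mapping a name to a CRI handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,
    }


def sandboxed_pod(pod_name, image, runtime_class_name):
    """Pod that requests a specific RuntimeClass."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime_class_name,
            "containers": [{"name": "app", "image": image}],
        },
    }


def manifest_list(items):
    """Wrap objects in a v1 List so they can be applied together."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    objects = []
    for rc_name, handler in (("gvisor", "runsc"), ("kata-fc", "kata-fc")):
        objects.append(runtime_class(rc_name, handler))
        objects.append(sandboxed_pod(f"nginx-{rc_name}", "nginx:latest", rc_name))
    print(json.dumps(manifest_list(objects), indent=2))
```

The output can be piped to `kubectl apply -f -` on a cluster whose nodes have the corresponding handlers configured.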
Verify which runtime a pod is using:
# gVisor: check Sentry process on node
ps aux | grep runsc
# Kata: check Firecracker process
ps aux | grep firecracker
Debugging Notes
- gVisor missing syscall: Application fails with ENOSYS or EPERM. Check the gVisor syscall compatibility table (github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux). File an issue or implement a workaround (a different code path that avoids the missing syscall).
- gVisor network connectivity issues: gVisor's netstack has subtle differences from Linux's network stack. If a network protocol implementation is incomplete, connections may fail silently or with unexpected errors.
- Kata boot failure: Check VMM logs — either QEMU/Firecracker stderr or kata-runtime debug logs. Common causes: KVM not available, insufficient memory for guest, device model configuration error.
- Kata container startup slow: virtiofsd performance or large image layers being staged. Check if DAX is enabled for the virtiofs mount.
- Memory accounting with Kata: The guest kernel manages its own memory inside the VM, so memory usage reported by Kubernetes may not reflect actual guest memory usage. Monitor both the VMM process RSS on the host and in-guest memory metrics.
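The "different code path" workaround for a missing syscall usually takes the form of a try-then-fall-back wrapper. The sketch below uses `os.copy_file_range` purely as an example of an optional fast path; it is not a statement about which syscalls gVisor implements, and the function name `copy_bytes` is hypothetical.

```python
import errno
import os


def copy_bytes(src_fd, dst_fd, count, fast_copy=None):
    """Copy up to `count` bytes between file descriptors.

    Tries a fast-path syscall first and falls back to read/write when the
    sandbox reports the call as unavailable (ENOSYS or EPERM).
    """
    if fast_copy is None:
        # Not present on every platform or Python version.
        fast_copy = getattr(os, "copy_file_range", None)
    if fast_copy is not None:
        try:
            return fast_copy(src_fd, dst_fd, count)
        except OSError as exc:
            if exc.errno not in (errno.ENOSYS, errno.EPERM):
                raise  # a genuine I/O error, not a missing syscall
    # Portable fallback using only read and write.
    data = os.read(src_fd, count)
    written = 0
    while written < len(data):
        written += os.write(dst_fd, data[written:])
    return written
```

The `fast_copy` parameter exists so the fallback branch can be exercised in tests by passing a function that raises `OSError(errno.ENOSYS, ...)`.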
Security Implications
- gVisor security model: Sentry is the security boundary. Vulnerabilities in Sentry's ~150k lines of Go code can be exploited to escape the sandbox. The Sentry itself runs with a tight seccomp filter — if an attacker can exploit a Sentry bug, they are constrained to the ~50 syscalls Sentry is allowed.
- Kata security model: Guest kernel is the first boundary; VMM (Firecracker/QEMU) is the second. Even if the guest kernel is fully compromised, the attacker must also exploit Firecracker or KVM to reach the host.
- QEMU attack surface: Traditional QEMU has a large attack surface (emulated devices, PCI, USB). Firecracker addresses this by supporting only a minimal device set — virtio-net, virtio-block, vsock.
- Both isolate better than runc for multi-tenant: Neither is perfect, but both dramatically reduce the host kernel attack surface compared to standard containers.
Performance Implications
- gVisor syscall overhead is workload-specific: Do not use gVisor for latency-sensitive, syscall-intensive workloads (database engines, high-frequency trading, intensive log processing). Use for stateless HTTP handlers, script execution, untrusted code.
- Kata memory baseline: Each Kata VM requires memory for the guest kernel and kata-agent (~100-200MB). On nodes running thousands of containers, Kata's per-container memory overhead is a significant constraint.
- Firecracker vs QEMU in Kata: Firecracker has lower memory overhead (~3MB for the VMM process vs ~100MB for QEMU), faster boot, and smaller attack surface. QEMU is needed for features like live migration or GPU passthrough.
- virtiofs DAX: Without DAX, file reads involve copying data from host to VM memory. With DAX, file reads map directly to the host page cache, giving near-native performance. virtiofs itself requires kernel 5.4+ and QEMU 5.0+; guest-kernel DAX support for virtiofs landed later, in Linux 5.10.
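The memory baseline translates directly into pod density. The sketch below works through that arithmetic; the node size, reservation, and per-pod figures in the demo are hypothetical, with the sandbox overhead taken from the ~100-200MB range quoted above.

```python
def max_pods(node_memory_mib, reserved_mib, workload_mib, sandbox_overhead_mib):
    """How many pods fit when each pays a fixed sandbox memory cost."""
    usable = node_memory_mib - reserved_mib
    per_pod = workload_mib + sandbox_overhead_mib
    if usable <= 0 or per_pod <= 0:
        return 0
    return usable // per_pod


if __name__ == "__main__":
    node, reserved, workload = 65536, 4096, 128  # MiB, assumed values
    for label, overhead in (("runc", 0), ("kata", 150)):
        print(label, max_pods(node, reserved, workload, overhead))
```

For small workloads the fixed overhead dominates, so the number of pods that fit drops by more than half in this example.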
Failure Modes
| Failure | Symptom | Cause and remedy |
|---|---|---|
| gVisor ENOSYS | App crashes on specific operation | Unimplemented syscall; file issue or use runc fallback |
| gVisor memory leak | Sentry memory grows unbounded | Bug in Sentry memory management; restart pod |
| Kata VM OOM | Container killed unexpectedly | Guest kernel OOM kill; increase pod memory limit |
| Kata boot timeout | Pod stuck in ContainerCreating | KVM unavailable, Firecracker binary missing, or device error |
| KVM not available | Both gVisor KVM and Kata fail | Nested virtualization disabled; use gVisor ptrace or different node type |
| virtiofs performance | Slow file I/O in Kata | DAX not enabled; check virtiofsd cache mode configuration |
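Several rows above reduce to whether `/dev/kvm` is usable on the node. A preflight check along the following lines can catch that before a pod is scheduled. It is a minimal sketch: the function names are hypothetical, and a present, accessible device node does not by itself guarantee that KVM will work.

```python
import os


def kvm_usable(path="/dev/kvm"):
    """True if the KVM device node exists and is readable and writable."""
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


def pick_gvisor_platform(path="/dev/kvm"):
    """Prefer the KVM platform and fall back to ptrace without it."""
    return "kvm" if kvm_usable(path) else "ptrace"


if __name__ == "__main__":
    print("KVM usable:", kvm_usable())
    print("gVisor platform:", pick_gvisor_platform())
```

The returned name corresponds to the value that would be passed to runsc's `--platform` flag; Kata has no equivalent fallback and needs a node where the check succeeds.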
Modern Usage
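- Google Cloud Run: Uses gVisor (with KVM platform) as the isolation mechanism for serverless workloads. Containers in the first-generation execution environment are sandboxed with gVisor; the second-generation environment does not rely on gVisor's syscall emulation.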
- AWS Lambda/Fargate: Uses Firecracker MicroVMs (not Kata, but architecturally similar) for function isolation.
- GKE Sandbox: Google Kubernetes Engine offers a node pool option (gcloud flag --sandbox type=gvisor) for workloads requiring enhanced isolation.
- Confidential containers: Kata Containers + TEE (AMD SEV-SNP, Intel TDX) provides encrypted VM memory so that even the host's root user cannot read container memory. It targets multi-cloud workloads and sensitive data processing.
Future Directions
- gVisor eBPF support: Ongoing work to implement more eBPF functionality inside gVisor to support observability tools (OpenTelemetry, Pixie) that rely on eBPF inside the container.
- gVisor GPU support: Work to expose GPU devices through gVisor for ML workloads with sandboxing — currently a major gap.
- Confidential containers (CoCo): Kata + hardware TEE is becoming a serious option for healthcare, finance, and government workloads that must prevent host operator access to container data.
- WASM containers: WebAssembly runtimes (WasmEdge, Wasmtime) as a third isolation tier: lighter than gVisor, faster startup, language-level sandboxing. They can be exposed to Kubernetes through a RuntimeClass (for example, one named wasmedge).
- Hardware-accelerated virtiofs: Work in the Linux kernel and QEMU to make virtiofs performance indistinguishable from local filesystem performance.
Exercises
- Install gVisor on a Linux machine with KVM available. Run docker run --runtime=runsc nginx and verify it starts. Run gvisor-containerd-shim and observe the Sentry process.
- Benchmark gVisor vs runc: run sysbench cpu --cpu-max-prime=10000 run inside gVisor and runc containers. Compare results. Explain any difference.
- Run a file I/O benchmark (fio --rw=randread --bs=4k --numjobs=1) in both gVisor and runc. Explain the performance difference in terms of Gofer 9P round trips.
- Set up Kata Containers with Firecracker on a host with KVM. Run a container. Find the Firecracker process on the host. Identify its PID and measure its memory footprint.
- Create a Kubernetes RuntimeClass for gVisor and one for Kata. Run the same nginx workload with each. Measure pod startup latency and HTTP request latency. Build a comparison table.
- Research the gVisor syscall compatibility page. Find 3 syscalls that are not yet implemented. For each, explain what functionality in a real application would break.
References
- gVisor documentation: gvisor.dev/docs/
- gVisor paper: "gVisor: A Platform for Running Linux Containers" — USENIX ATC '19 adjacent discussion
- gVisor source: github.com/google/gvisor
- Kata Containers documentation: katacontainers.io/docs/
- Kata Containers source: github.com/kata-containers/kata-containers
- Firecracker paper: "Firecracker: Lightweight Virtualization for Serverless Applications" — USENIX NSDI '20
- Firecracker source: github.com/firecracker-microvm/firecracker
- Kubernetes RuntimeClass: kubernetes.io/docs/concepts/containers/runtime-class/
- Confidential Containers (CoCo): github.com/confidential-containers
- virtio-fs specification: virtio-fs.gitlab.io