gVisor and Kata Containers

Technical Overview

Standard containers (runc-based) share the host kernel. This creates a fundamental security problem: any exploitable kernel vulnerability is reachable from within a container. The history of container CVEs confirms this — many high-severity container escapes are kernel vulnerabilities triggered via syscalls that containers can make.

Two major projects address this problem with different approaches:

gVisor (Google, 2018): Implements a user-space kernel that intercepts guest syscalls in software, so they never reach the host kernel. The guest process communicates with a Go-written kernel (the Sentry) rather than the host Linux kernel.
Kata Containers (merge of Clear Containers + runV, 2017): Runs each container inside a lightweight virtual machine. Hardware virtualization (KVM) provides the isolation boundary — the guest kernel runs isolated, and even if it is compromised, the VMM boundary must also be breached.

Both are OCI-compatible, meaning they can be dropped in as runc replacements without modifying container images or orchestrators.

Prerequisites

Linux namespaces and cgroups (sections 01, 02)
Container runtimes and OCI spec (section 03)
Virtual machine concepts: hypervisor, VMM, hardware virtualization (VMX/SVM)
System call mechanism, interrupt handling
eBPF and ptrace fundamentals (for understanding gVisor platforms)

Historical Context

gVisor was developed at Google internally for several years before open-source release in May 2018. Google had been running container-based infrastructure (Borg) since the early 2000s, but as Kubernetes/GKE became public multi-tenant infrastructure, the security model of "containers share host kernel" was unacceptable for running untrusted user workloads. gVisor was the solution for Google Cloud Run and certain GKE sandbox modes.

Clear Containers (Intel) and runV (HyperHQ) were independent efforts at VM-based containers. Clear Containers used Intel's optimized lightweight KVM VMs; runV was hypervisor-agnostic. They merged in 2017 to form Kata Containers under the OpenStack Foundation (now the Open Infrastructure Foundation). Kata Containers received major contributions from Intel, Red Hat, Hyper.sh, and later Amazon (which uses Firecracker extensively).

Firecracker (Amazon, 2018): While not a container runtime itself, Firecracker is a microVM monitor (VMM) written in Rust that targets sub-125ms VM boot times. AWS Lambda and AWS Fargate use Firecracker. It integrates with Kata Containers as an alternative to QEMU, providing OCI-compatible hardware isolation.

The Problem: Host Kernel Exposure

Standard Container (runc):

Container Process
      │
      │  syscall (e.g., open, read, clone, ioctl, bpf)
      ▼
  Host Linux Kernel
  (same kernel as all other containers on the node)
      │
      │  if kernel has exploitable vulnerability:
      ▼
  Attacker gains host kernel code execution
  → Container escape → host compromise

The attack surface is the ENTIRE Linux kernel syscall interface.
~400 syscalls, millions of lines of kernel code.
Seccomp reduces this surface but cannot eliminate it.
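
The last point can be made concrete with a toy model. The following is a minimal Python sketch, not a real seccomp profile: the syscall sets are small and illustrative, and the VULNERABLE set is a hypothetical assumption rather than a list of real CVEs. It shows that an allowlist shrinks the reachable surface but leaves any bug in a still-allowed syscall exploitable.

```python
# Toy model: which hypothetically vulnerable syscalls remain reachable
# after a seccomp allowlist is applied. All sets are illustrative.

ALL_SYSCALLS = {"read", "write", "openat", "close", "mmap", "clone",
                "ioctl", "bpf", "keyctl", "userfaultfd", "futex"}

# Hypothetical: syscalls whose kernel code paths contain an exploitable bug.
VULNERABLE = {"bpf", "keyctl", "futex"}


def reachable_vulnerable(allowlist: set[str]) -> set[str]:
    """Return the vulnerable syscalls a container can still invoke."""
    return (ALL_SYSCALLS & allowlist) & VULNERABLE


# No filter: every syscall is allowed.
no_filter = reachable_vulnerable(ALL_SYSCALLS)

# A tighter profile blocks bpf, keyctl and userfaultfd, but has to keep
# futex because threaded programs cannot run without it.
tight = reachable_vulnerable(ALL_SYSCALLS - {"bpf", "keyctl", "userfaultfd"})

print(sorted(no_filter))  # ['bpf', 'futex', 'keyctl']
print(sorted(tight))      # ['futex']
```

The filter removes two of the three hypothetical bugs from reach, but the one in a syscall the workload genuinely needs is still exposed to the host kernel.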

gVisor Architecture

gVisor implements a Go-written user-space kernel called the Sentry that intercepts all guest system calls. The guest process never executes host kernel code for its syscalls — the Sentry handles them entirely in user space.

Components

Sentry: The core of gVisor. A sandboxed process written in Go that:
- Implements ~200 Linux system calls as an independent re-implementation of the Linux ABI
- Maintains its own network stack (netstack, a TCP/IP stack written in Go)
- Maintains its own VFS (virtual filesystem)
- Does NOT use libc or the host's standard library for its own operation
- Runs as an unprivileged process under its own seccomp filter, which allows only the minimal host syscalls the Sentry itself needs (~50-70 host syscalls, down from ~400)

Gofer: A file access proxy process. The Sentry needs to access the container's filesystem (the OCI image rootfs). Rather than opening host files directly (which would require more host syscalls), the Sentry talks to a Gofer process over a socket using the 9P protocol (recent gVisor releases replace 9P with a successor protocol, LISAFS). The Gofer performs the actual file I/O against the host kernel, which keeps the Sentry's seccomp filter minimal.

runsc: The OCI runtime binary. Replaces runc. When Kubernetes or Docker invokes runsc with an OCI bundle, it starts the Sentry and Gofer, sets up the container, and the container process runs inside the Sentry's sandbox.

gVisor Architecture Diagram

Guest Container Process
  (nginx, python app, etc.)
         │
         │  syscall (e.g., read(fd, buf, len))
         ▼
┌─────────────────────────────────────────────────────────┐
│  gVisor Sentry  (userspace kernel in Go)                │
│                                                          │
│  Handles ~200 Linux syscalls:                            │
│  - open/read/write/close → asks Gofer via 9P            │
│  - socket/connect/send  → handled by gVisor netstack     │
│  - fork/clone/execve    → manages in Sentry VFS/ProcMgr │
│  - mmap/mprotect        → manages memory in Sentry       │
│                                                          │
│  Own seccomp filter: allows only ~50 host syscalls      │
└──────────┬────────────────────────┬────────────────────┘
           │                        │
           │ 9P protocol             │ host syscalls (minimal set)
           ▼                        ▼
   ┌──────────────┐        ┌──────────────────────────┐
   │   Gofer      │        │   Host Linux Kernel       │
   │  (file proxy)│        │   (only ~50 syscalls      │
   │              │        │    reachable from Sentry) │
   │  host syscalls         └──────────────────────────┘
   │  for file I/O
   └──────────────┘
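
The division of labor in the diagram can be sketched in a few lines of Python. This is a deliberately simplified model under stated assumptions, not gVisor's actual code: the class names mirror the components, but the methods (`syscall`, `read_file`), the two-entry syscall table, and the in-memory "rootfs" are all hypothetical.

```python
import errno


class Gofer:
    """Stand-in for the file proxy: the only component touching 'host' files."""

    def __init__(self, files: dict[str, bytes]):
        self._files = files  # hypothetical in-memory rootfs

    def read_file(self, path: str) -> bytes:
        return self._files[path]


class Sentry:
    """Stand-in for the user-space kernel: dispatches guest syscalls."""

    def __init__(self, gofer: Gofer):
        self._gofer = gofer
        self._pid = 1
        # Only syscalls listed here are "implemented". Anything else fails
        # with ENOSYS, since the Sentry covers a subset of the Linux ABI.
        self._table = {
            "getpid": self._sys_getpid,
            "read_file": self._sys_read_file,
        }

    def syscall(self, name: str, *args):
        handler = self._table.get(name)
        if handler is None:
            return -errno.ENOSYS
        return handler(*args)

    def _sys_getpid(self):
        return self._pid  # answered entirely inside the Sentry

    def _sys_read_file(self, path: str):
        try:
            return self._gofer.read_file(path)  # proxied to the Gofer
        except KeyError:
            return -errno.ENOENT


sentry = Sentry(Gofer({"/etc/hostname": b"sandbox\n"}))
print(sentry.syscall("getpid"))                      # 1
print(sentry.syscall("read_file", "/etc/hostname"))  # b'sandbox\n'
print(sentry.syscall("bpf") == -errno.ENOSYS)        # True
```

The point of the split is visible in the structure: process state never leaves the Sentry object, while anything that needs host files goes through one narrow interface.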

gVisor Platforms

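The Sentry needs a way to intercept guest process syscalls. gVisor calls these mechanisms "platforms". The two described here, ptrace and KVM, implement interception differently; recent releases also ship a third platform, systrap, which has replaced ptrace as the default.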

ptrace platform:
- The Sentry uses ptrace(PTRACE_SYSEMU) to intercept every syscall from the guest process before it reaches the host kernel
- When the guest makes a syscall, ptrace stops it; the Sentry handles the syscall and resumes the guest
- Overhead: every syscall costs context switches between the guest process and the Sentry
- Advantage: works on any hardware and any kernel; no hardware virtualization required
- Performance: ~10-100x overhead on syscall-heavy workloads; compute-bound workloads see minimal impact
- Use case: environments without KVM (for example, cloud VMs where nested virtualization is unavailable)

KVM platform:
- The Sentry uses KVM (hardware virtualization) to act as both guest kernel and VMM: the application runs in guest user mode, and the Sentry runs as the kernel of that guest
- Guest syscalls trap into the Sentry through hardware privilege and address-space switching rather than through ptrace stops
- Context switches between guest and Sentry are handled by the virtualization hardware, which is faster than ptrace
- Overhead: roughly 2-5x on syscall-heavy benchmarks (versus 10-100x for ptrace)
- Advantage: near-native performance for compute-bound workloads and a much lower per-syscall cost than ptrace
- Requirement: /dev/kvm must be available, which requires hardware virtualization and may not be the case inside VMs without nested virtualization
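
The choice between the two platforms usually comes down to whether /dev/kvm is usable. The following is a small Python sketch of that check; the helper name `pick_platform` is made up for illustration, and the only gVisor-specific detail assumed is that runsc accepts a `--platform` flag.

```python
import os


def pick_platform(kvm_path: str = "/dev/kvm") -> str:
    """Return "kvm" if the KVM device exists and is read/write accessible,
    otherwise fall back to "ptrace"."""
    if os.path.exists(kvm_path) and os.access(kvm_path, os.R_OK | os.W_OK):
        return "kvm"
    return "ptrace"


platform = pick_platform()
# The resulting flag would be passed to runsc in the runtime configuration.
print(f"runsc --platform={platform}")
```

A check like this only shows that the device node is accessible. It does not prove that nested virtualization performs well, so benchmarking both platforms on the target node type is still worthwhile.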

gVisor Isolation Boundary

Host kernel attack surface:

Standard container (runc):   ~400 syscalls reachable
                              + millions of lines of kernel code
                              attack surface: LARGE

gVisor (ptrace/KVM):          ~50 host syscalls reachable from Sentry
                              + Sentry Go code (smaller, auditable)
                              attack surface: MUCH SMALLER

Remaining risk:
- Vulnerabilities in the ~50 host syscalls Sentry uses
- Vulnerabilities in Sentry's own code (Go, ~150k lines)
- Platform-specific attack surface (KVM hypervisor code)

gVisor Overhead Profile

gVisor's performance impact depends heavily on workload type:

Workload type	gVisor overhead	Reason
CPU-bound (no syscalls)	~0-5%	Pure compute, no syscall interception
File I/O (small files, many calls)	2-5x slower	9P round trips to Gofer for each I/O
File I/O (large sequential reads)	20-50% slower	Gofer proxying overhead
Network throughput	20-50% slower	gVisor's Go netstack vs. host kernel
Network latency	+50-200µs	Netstack processing in Sentry
Syscall-heavy benchmarks	10-100x slower	ptrace platform; 2-5x on KVM
Container startup	100-300ms overhead	Sentry initialization

Where gVisor is used in production: Google Cloud Run (serverless containers), Cloud Functions, some GKE node pool configurations. These workloads are typically short-lived HTTP handlers, where the overheads in the table above are an acceptable price for the security isolation guarantee.
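
The per-class figures in the overhead table can be combined into a rough estimate for a mixed workload by weighting each slowdown factor by the share of native runtime spent in that class. The sketch below does this arithmetic in Python. The factors are single values picked from inside the table's ranges, and the workload mix is hypothetical, so the result is an illustration rather than a measurement.

```python
def estimated_slowdown(time_fractions: dict[str, float],
                       slowdown: dict[str, float]) -> float:
    """Weighted slowdown: sum of (share of native runtime x per-class factor)."""
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1.0")
    return sum(share * slowdown[kind] for kind, share in time_fractions.items())


# Assumed factors, chosen from within the ranges in the overhead table.
SLOWDOWN = {"cpu": 1.05, "small_file_io": 3.0, "network": 1.35}

# Hypothetical HTTP handler: 80% compute, 5% small-file I/O, 15% network.
handler = {"cpu": 0.80, "small_file_io": 0.05, "network": 0.15}

estimate = estimated_slowdown(handler, SLOWDOWN)
print(f"estimated slowdown: {estimate:.2f}x")
```

With these assumptions the handler comes out at about 1.19x its native runtime. Shifting even a small share of time into small-file I/O moves the estimate quickly, which is why the workload profile matters more than any single headline number.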

Kata Containers

Kata Containers takes a fundamentally different approach: instead of implementing a user-space kernel, it runs each container inside a real (lightweight) virtual machine. The container workload has its own guest kernel, so the host kernel is exposed only to the VM monitor (VMM), not to the container process.

Architecture

┌──────────────────────────────────────────────────────────────┐
│  Kubernetes / Container Orchestrator                          │
│  kubelet → CRI → containerd → kata-containers shim           │
└────────────────────────────────┬─────────────────────────────┘
                                 │
                ┌────────────────▼────────────────┐
                │  kata-containers shim (host)     │
                │  - creates the microVM           │
                │  - manages VM lifecycle          │
                └────────────────┬────────────────┘
                                 │ virtio, vsock
                ┌────────────────▼────────────────┐
                │  Lightweight VM (QEMU or         │
                │  Firecracker)                    │
                │  - dedicated guest kernel        │
                │  - virtio devices for storage/net│
                │                                  │
                │  ┌──────────────────────────┐   │
                │  │  kata-agent (in-VM)       │   │
                │  │  - receives CRI-like API  │   │
                │  │    over vsock             │   │
                │  │  - runs OCI container     │   │
                │  │    inside the VM          │   │
                │  │                           │   │
                │  │  Container Process        │   │
                │  │  (nginx, python, etc.)    │   │
                │  └──────────────────────────┘   │
                └─────────────────────────────────┘
                                 │
                ┌────────────────▼────────────────┐
                │  Host Linux Kernel               │
                │  (only exposed to VMM code,      │
                │   not to container syscalls)     │
                └──────────────────────────────────┘

kata-agent

The kata-agent is a small process running inside the VM that:
- Communicates with the Kata shim on the host via a vsock (VM socket)
- Receives container lifecycle commands (create, start, exec, kill)
- Runs the OCI container inside the VM using a built-in container library rather than a separate runc binary (libcontainer in Kata 1.x, the Rust rustjail crate in 2.x and later)
- Manages stdio routing between the host shim and the in-VM container process

The protocol carried over the vsock is the Kata Containers Agent API: gRPC in Kata 1.x, and ttRPC (a lightweight gRPC variant) in 2.x and later.
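
The request/response relationship between the host shim and the in-VM agent can be sketched as follows. This is a toy model under several assumptions: `socket.socketpair()` stands in for the vsock connection, newline-delimited JSON stands in for the real RPC framing, and the three-state lifecycle is a simplification invented for the example.

```python
import json
import socket
import threading

# Allowed lifecycle steps: op -> (required current state, new state).
TRANSITIONS = {
    "create": (None, "created"),
    "start": ("created", "running"),
    "kill": ("running", "stopped"),
}


def send(sock: socket.socket, obj: dict) -> None:
    sock.sendall(json.dumps(obj).encode() + b"\n")


def agent_loop(conn: socket.socket) -> None:
    """In-VM side: apply each lifecycle command and report the new state."""
    containers: dict[str, str] = {}
    with conn, conn.makefile("r") as reader:
        for line in reader:
            req = json.loads(line)
            cid, step = req["id"], TRANSITIONS.get(req["op"])
            if step and containers.get(cid) == step[0]:
                containers[cid] = step[1]
                send(conn, {"id": cid, "state": step[1]})
            else:
                send(conn, {"id": cid, "error": f"invalid op {req['op']!r}"})


def call(sock: socket.socket, reader, op: str, cid: str) -> dict:
    """Host-shim side: send one command and wait for the agent's reply."""
    send(sock, {"op": op, "id": cid})
    return json.loads(reader.readline())


# socketpair() stands in for the host<->guest vsock connection.
host_sock, guest_sock = socket.socketpair()
threading.Thread(target=agent_loop, args=(guest_sock,), daemon=True).start()

host_reader = host_sock.makefile("r")
replies = [call(host_sock, host_reader, op, "c1")
           for op in ("create", "start", "kill")]
print([r["state"] for r in replies])  # ['created', 'running', 'stopped']
```

The shim never touches the container process directly. Every action is a message to the agent, which is why the agent and its transport sit on the critical path for both startup latency and stdio.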

VM Boot Time

Traditional VMs take seconds to minutes to boot. Kata Containers achieves fast boot by:
- Using a minimal guest kernel (compiled to include only the necessary drivers)
- Booting from an initrd (initial ramdisk) instead of a full disk image
- Reusing pre-warmed VM templates, so a new VM starts from an already-initialized guest
- Sharing memory between host and guest (DAX/virtiofs) instead of copying files into the VM
- Using Firecracker instead of QEMU: sub-125ms cold boot, about 150ms to first request

Storage: virtiofs

Kata Containers uses virtiofs to share the container's rootfs from the host into the VM (on VMMs that implement virtio-fs, such as QEMU; Firecracker does not, so Kata uses block devices there):
- virtiofsd runs on the host and serves the container's OCI image layers to the guest using the FUSE protocol
- The VM accesses the rootfs via a virtio-fs device (with a DAX window for direct memory mapping)
- Shared memory between host and guest eliminates data copies for I/O

With DAX (Direct Access), file reads/writes in the guest directly map to host memory pages containing the overlay filesystem — near-native I/O performance.
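
The difference between copying and mapping can be illustrated on a single machine with ordinary mmap. The Python sketch below is only an analogy, assuming Linux page-cache semantics: it does not involve virtiofs or a VM, but it shows the property DAX relies on, namely that a shared mapping sees the underlying pages directly while a read() result is a private snapshot.

```python
import mmap
import os
import tempfile

# A small temporary file stands in for a file on the shared rootfs.
fd, path = tempfile.mkstemp()
os.write(fd, b"version=1\n")

# Copying path: pread() fills a private buffer with a snapshot of the data.
copied = os.pread(fd, 10, 0)

# Mapping path: a shared read-only mapping refers to the page-cache pages.
mapped = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)

# The file changes after both views were obtained.
os.pwrite(fd, b"version=2\n", 0)

after = mapped[:10]
print(copied)  # b'version=1\n'  (stale copy)
print(after)   # b'version=2\n'  (mapping reflects the update)

mapped.close()
os.close(fd)
os.remove(path)
```

With DAX the guest gets the second behavior for host-backed files: no per-read copy into guest memory, at the cost of the guest and host sharing the same physical pages.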

Kata with Firecracker

Firecracker is Amazon's open-source microVM monitor written in Rust. It provides:
- Sub-125ms cold boot (specified as at most 125ms from the InstanceStart API call to the start of guest user space)
- Minimal device model: virtio-net, virtio-block, vsock; no USB, PCI, BIOS, or legacy devices
- Rust implementation: memory-safe VMM with a small attack surface
- Security model: seccomp filter on the VMM process, plus the jailer for further restriction
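
The boot-time figure is measured from the API request that starts the microVM. As a hedged sketch of what that request looks like, the Python below builds it by hand, assuming Firecracker's REST API shape of a PUT to /actions with an InstanceStart action type, served over a Unix domain socket. The socket path is whatever the VMM was launched with, and `start_instance` is defined but not called because it needs a running, fully configured Firecracker process.

```python
import json
import socket


def instance_start_request() -> bytes:
    """Build the raw HTTP request for the InstanceStart action."""
    body = json.dumps({"action_type": "InstanceStart"}).encode()
    head = (
        "PUT /actions HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode()
    return head + body


def start_instance(api_socket_path: str) -> bytes:
    """Send the request over the API Unix socket and return the raw reply.

    Not called here: it requires a live Firecracker API socket.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(api_socket_path)
        sock.sendall(instance_start_request())
        return sock.recv(4096)


print(instance_start_request().decode())
```

In a Kata deployment this call is issued by the Kata runtime, not by the operator; the sketch is only meant to show how small the control surface of the VMM is.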

Kata + Firecracker combines OCI compatibility (containerd speaks CRI to the kubelet and hands OCI bundles to the Kata shim) with Firecracker's fast, minimal VMs:

kubelet → containerd → kata-fc shim → Firecracker VM → kata-agent → container

AWS Lambda's execution model is essentially this architecture, though it does not literally use the public Kata + Firecracker stack: the internal implementation predates the integration.

Isolation Boundary Comparison Diagram

Isolation Model Comparison:

runc (standard):
┌────────────────────────────────────────────────────────────┐
│  Host System                                                │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Container Process                                  │    │
│  │  (nginx, app, etc.)                                │    │
│  │       │ syscall                                    │    │
│  │       ▼                                            │    │
│  │  Host Linux Kernel (SHARED)  ← exploit target     │    │
│  └────────────────────────────────────────────────────┘    │
│  Isolation: namespace + seccomp (software, bypassable)      │
└────────────────────────────────────────────────────────────┘

gVisor:
┌────────────────────────────────────────────────────────────┐
│  Host System                                                │
│  ┌───────────────────────────────────────────────────┐     │
│  │  Sentry (userspace kernel)                         │     │
│  │  ┌────────────────────────────────────┐           │     │
│  │  │  Container Process                  │           │     │
│  │  │       │ syscall                    │           │     │
│  │  │       ▼                            │           │     │
│  │  │  Sentry handles syscall            │           │     │
│  │  │  (Go code, ~150k lines)            │           │     │
│  │  └────────────────────────────────────┘           │     │
│  │       │ ~50 host syscalls only                    │     │
│  │       ▼                                           │     │
│  │  Host Linux Kernel                                │     │
│  └───────────────────────────────────────────────────┘     │
│  Isolation: user-space kernel intercepts all syscalls        │
└────────────────────────────────────────────────────────────┘

Kata Containers:
┌────────────────────────────────────────────────────────────┐
│  Host System                                                │
│  ┌───────────────────────────────────────────────────┐     │
│  │  Lightweight VM (Firecracker / QEMU)               │     │
│  │  ┌────────────────────────────────────┐           │     │
│  │  │  Guest Kernel (isolated)           │           │     │
│  │  │  ┌──────────────────────────┐     │           │     │
│  │  │  │  Container Process       │     │           │     │
│  │  │  │       │ syscall          │     │           │     │
│  │  │  │       ▼                  │     │           │     │
│  │  │  │  Guest Linux Kernel      │     │           │     │
│  │  │  └──────────────────────────┘     │           │     │
│  │  └────────────────────────────────────┘           │     │
│  │       │ KVM VM exits only                         │     │
│  │       ▼                                           │     │
│  │  VMM (Firecracker/QEMU) + Host Kernel             │     │
│  └───────────────────────────────────────────────────┘     │
│  Isolation: hardware VM boundary (KVM)                       │
└────────────────────────────────────────────────────────────┘

Comparison Table: runc vs gVisor vs Kata

Property	runc	gVisor	Kata (Firecracker)
Isolation mechanism	Namespaces + seccomp	User-space kernel	Hardware VM (KVM)
Kernel attack surface	Full host kernel (~400 syscalls)	~50 host syscalls	VMM code only
Container escape risk	High (kernel vuln = escape)	Medium (Sentry bug or host syscall vuln)	Low (VM + VMM must both be broken)
Startup latency	100-200ms	200-500ms	150-400ms (Firecracker)
Memory overhead	~1MB per container	~150MB (Sentry)	~100-200MB (guest kernel + kata-agent)
CPU overhead	~0%	5-20% (KVM) to 2-10x (ptrace)	2-10%
IO performance	Near-native	20-50% slower	Near-native (virtio-block on Firecracker; virtiofs+DAX on QEMU)
Network performance	Near-native	20-50% slower	Near-native (virtio-net)
Compatibility	Full Linux ABI	~200 syscalls (gaps exist)	Full guest kernel ABI
Multi-tenancy safety	Poor	Good	Excellent
OCI compatible	Yes	Yes (runsc)	Yes (kata-runtime)
Rootless	Yes	Yes	No (KVM requires privilege)
Kubernetes integration	Default	RuntimeClass	RuntimeClass

Production Examples

Using gVisor with Kubernetes (RuntimeClass):

# RuntimeClass definition
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc

---
# Pod using gVisor
apiVersion: v1
kind: Pod
metadata:
  name: nginx-gvisor    # example name; metadata.name is required
spec:
  runtimeClassName: gvisor
  containers:
  - name: app
    image: nginx:latest

Using Kata Containers with Kubernetes:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-fc
handler: kata-fc        # kata-containers with Firecracker

---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata-fc   # example name; metadata.name is required
spec:
  runtimeClassName: kata-fc
  containers:
  - name: app
    image: nginx:latest

Verify which runtime a pod is using:

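# From the control plane: which RuntimeClass the pod requested (replace <pod-name>)
kubectl get pod <pod-name> -o jsonpath='{.spec.runtimeClassName}'
# On the node, gVisor: look for the runsc sandbox (Sentry) and gofer processes
pgrep -af runsc
# On the node, Kata: look for the Firecracker VMM process
pgrep -af firecracker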
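
For scripted checks, the same information can be read from the pod's JSON representation. The following is a minimal Python sketch: the helper name `runtime_class_of` is made up, the sample document is hand-written, and in practice the input would be the output of kubectl get pod <pod-name> -o json.

```python
import json


def runtime_class_of(pod_json: str) -> str:
    """Return the pod's runtimeClassName, or a marker if none is set."""
    pod = json.loads(pod_json)
    return pod.get("spec", {}).get("runtimeClassName") or "<cluster default>"


# Hand-written sample standing in for real kubectl output.
sample = json.dumps({
    "metadata": {"name": "app"},
    "spec": {"runtimeClassName": "gvisor", "containers": [{"name": "app"}]},
})

print(runtime_class_of(sample))          # gvisor
print(runtime_class_of('{"spec": {}}'))  # <cluster default>
```

This only reports what the pod requested. Confirming that the sandbox actually exists still requires the node-level process check.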

Debugging Notes

gVisor missing syscall: Application fails with ENOSYS or EPERM. Check gVisor syscall compatibility table (github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux). File an issue or implement a workaround (different code path that avoids the missing syscall).
gVisor network connectivity issues: gVisor's netstack has subtle differences from Linux's network stack. If a network protocol implementation is incomplete, connections may fail silently or with unexpected errors.
Kata boot failure: Check VMM logs — either QEMU/Firecracker stderr or kata-runtime debug logs. Common causes: KVM not available, insufficient memory for guest, device model configuration error.
Kata container startup slow: virtiofsd performance or large image layers being staged. Check if DAX is enabled for the virtiofs mount.
Memory accounting with Kata: The guest VM has its own memory balloon — memory usage reported by Kubernetes may not accurately reflect actual guest memory usage. Monitor both host VM RSS and in-guest memory metrics.
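
For the missing-syscall case, one common workaround is to detect the failure at runtime and take a different code path. The following is a small Python sketch of that pattern. The helper name `call_with_fallback` is hypothetical, and the failing call is simulated so the example runs anywhere; it makes no claim about which syscalls a given gVisor release lacks.

```python
import errno
import os


def call_with_fallback(primary, fallback):
    """Run primary(); if the kernel reports the operation as unimplemented
    (ENOSYS) or filtered (EPERM), run fallback() instead.

    Treating EPERM as "unsupported" can hide genuine permission problems,
    so apply it only to calls known to be sandbox-sensitive.
    """
    try:
        return primary()
    except OSError as exc:
        if exc.errno in (errno.ENOSYS, errno.EPERM):
            return fallback()
        raise


def unimplemented():
    # Simulates a syscall wrapper failing inside the sandbox.
    raise OSError(errno.ENOSYS, os.strerror(errno.ENOSYS))


result = call_with_fallback(unimplemented, lambda: "used fallback")
print(result)  # used fallback
```

Other errors are re-raised unchanged, so real failures such as ENOENT or EIO are not silently masked by the fallback.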

Security Implications

gVisor security model: Sentry is the security boundary. Vulnerabilities in Sentry's ~150k lines of Go code can be exploited to escape the sandbox. The Sentry itself runs with a tight seccomp filter — if an attacker can exploit a Sentry bug, they are constrained to the ~50 syscalls Sentry is allowed.
Kata security model: Guest kernel is the first boundary; VMM (Firecracker/QEMU) is the second. Even if the guest kernel is fully compromised, the attacker must also exploit Firecracker or KVM to reach the host.
QEMU attack surface: Traditional QEMU has a large attack surface (emulated devices, PCI, USB). Firecracker addresses this by supporting only a minimal device set — virtio-net, virtio-block, vsock.
Both isolate better than runc for multi-tenant: Neither is perfect, but both dramatically reduce the host kernel attack surface compared to standard containers.

Performance Implications

gVisor syscall overhead is workload-specific: Do not use gVisor for latency-sensitive, syscall-intensive workloads (database engines, high-frequency trading, intensive log processing). Use for stateless HTTP handlers, script execution, untrusted code.
Kata memory baseline: Each Kata VM requires memory for the guest kernel and kata-agent (~100-200MB). On nodes running thousands of containers, Kata's per-container memory overhead is a significant constraint.
Firecracker vs QEMU in Kata: Firecracker has lower memory overhead (~3MB for the VMM process vs ~100MB for QEMU), faster boot, and smaller attack surface. QEMU is needed for features like live migration or GPU passthrough.
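virtiofs DAX: Without DAX, file reads involve copying data from host to VM memory. With DAX, file reads map directly to the host page cache, giving near-native performance. virtiofs itself requires guest kernel 5.4+ and QEMU 5.0+; DAX support arrived later (guest kernel 5.10+) and also depends on VMM support.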
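
The memory-baseline point can be turned into a quick capacity estimate. The arithmetic below is a Python sketch with hypothetical inputs: a 64 GiB node, a 128 MiB workload, roughly 1 MiB of per-container overhead for runc, and 150 MiB for Kata (the midpoint of the range quoted above).

```python
def pods_per_node(node_mem_mib: int, workload_mib: int, overhead_mib: int,
                  reserved_mib: int = 0) -> int:
    """How many pods fit when each needs workload + sandbox overhead."""
    usable = node_mem_mib - reserved_mib
    return max(usable // (workload_mib + overhead_mib), 0)


NODE_MIB = 64 * 1024   # hypothetical 64 GiB node
WORKLOAD_MIB = 128     # hypothetical per-pod application footprint

runc_pods = pods_per_node(NODE_MIB, WORKLOAD_MIB, overhead_mib=1)
kata_pods = pods_per_node(NODE_MIB, WORKLOAD_MIB, overhead_mib=150)

print(runc_pods)  # 508
print(kata_pods)  # 235
```

For small workloads the sandbox overhead more than halves pod density in this example. Kubernetes can account for this through the RuntimeClass overhead.podFixed field, so that the scheduler includes the sandbox cost when placing pods.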

Failure Modes

Failure	Symptom	Cause
gVisor ENOSYS	App crashes on specific operation	Unimplemented syscall; file issue or use runc fallback
gVisor memory leak	Sentry memory grows unbounded	Bug in Sentry memory management; restart pod
Kata VM OOM	Container killed unexpectedly	Guest kernel OOM kill; increase pod memory limit
Kata boot timeout	Pod stuck in ContainerCreating	KVM unavailable, Firecracker binary missing, or device error
KVM not available	Both gVisor KVM and Kata fail	Nested virtualization disabled; use gVisor ptrace or different node type
virtiofs performance	Slow file I/O in Kata	DAX not enabled; check virtiofsd cache mode configuration

Modern Usage

Google Cloud Run: Uses gVisor as the isolation mechanism for serverless workloads. Containers in the first-generation execution environment are sandboxed with gVisor; the second-generation environment uses a different, VM-based sandbox.
AWS Lambda/Fargate: Uses Firecracker MicroVMs (not Kata, but architecturally similar) for function isolation.
GKE Sandbox: Google Kubernetes Engine offers a node pool option (gcloud's --sandbox type=gvisor flag) for workloads requiring enhanced isolation.
Confidential containers: Kata Containers + TEE (AMD SEV-SNP, Intel TDX) provides encrypted VM memory so even the host's root user cannot read container memory — targeting multi-cloud workloads and sensitive data processing.

Future Directions

gVisor eBPF support: Ongoing work to implement more eBPF functionality inside gVisor to support observability tools (OpenTelemetry, Pixie) that rely on eBPF inside the container.
gVisor GPU support: Work to expose GPU devices through gVisor for sandboxed ML workloads. Coverage is still limited to specific devices and driver versions, so this remains a gap for many deployments.
Confidential containers (CoCo): Kata + hardware TEE is becoming a serious option for healthcare, finance, and government workloads that must prevent host operator access to container data.
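WASM containers: WebAssembly runtimes (WasmEdge, Wasmtime) as a third isolation tier: lighter than gVisor, faster startup, language-level sandboxing. They are typically exposed to Kubernetes through a RuntimeClass whose handler points at a Wasm-capable containerd shim; there is no single standardized RuntimeClass name.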
Hardware-accelerated virtiofs: Work in the Linux kernel and QEMU to make virtiofs performance indistinguishable from local filesystem performance.

Exercises

Install gVisor on a Linux machine with KVM available. Run docker run --runtime=runsc nginx and verify it starts. Then list the runsc processes on the host (for example with pgrep -af runsc) and identify the Sentry and Gofer processes.
Benchmark gVisor vs runc: run sysbench cpu --cpu-max-prime=10000 run inside gVisor and runc containers. Compare results. Explain any difference.
Run a file I/O benchmark (for example fio --name=randread --rw=randread --bs=4k --size=256m --numjobs=1) in both gVisor and runc. Explain the performance difference in terms of Gofer 9P round trips.
Set up Kata Containers with Firecracker on a host with KVM. Run a container. Find the Firecracker process on the host. Identify its PID and measure its memory footprint.
Create a Kubernetes RuntimeClass for gVisor and one for Kata. Run the same nginx workload with each. Measure pod startup latency and HTTP request latency. Build a comparison table.
Research the gVisor syscall compatibility page. Find 3 syscalls that are not yet implemented. For each, explain what functionality in a real application would break.

References

gVisor documentation: gvisor.dev/docs/
gVisor performance study: "The True Cost of Containing: A gVisor Case Study" (USENIX HotCloud '19)
gVisor source: github.com/google/gvisor
Kata Containers documentation: katacontainers.io/docs/
Kata Containers source: github.com/kata-containers/kata-containers
Firecracker paper: "Firecracker: Lightweight Virtualization for Serverless Applications" — USENIX NSDI '20
Firecracker source: github.com/firecracker-microvm/firecracker
Kubernetes RuntimeClass: kubernetes.io/docs/concepts/containers/runtime-class/
Confidential Containers (CoCo): github.com/confidential-containers
virtio-fs specification: virtio-fs.gitlab.io