Container Runtimes
Technical Overview
A container runtime is the software component responsible for actually running a container: setting up the isolated environment (namespaces, cgroups, filesystem), then executing the container process. The term "container runtime" is overloaded — it refers to both high-level runtimes (containerd, CRI-O) that manage the full container lifecycle including image pulling and storage, and low-level runtimes (runc, crun) that take an already-prepared bundle and execute it.
The ecosystem has organized around standardization: the Open Container Initiative (OCI) defines what a container image and container execution look like at the specification level, so that any OCI-compliant image can run on any OCI-compliant runtime. Above that, the Container Runtime Interface (CRI) defines the API Kubernetes uses to communicate with runtimes, enabling runtime choice at the Kubernetes layer.
Understanding the full stack — from Kubernetes kubelet at the top, through the CRI, through containerd, through runc, to kernel namespaces at the bottom — is essential for debugging container infrastructure.
Prerequisites
- Linux namespaces and cgroups (see sections 01 and 02)
- gRPC and Protocol Buffers basics
- Filesystem concepts: overlayfs, bind mounts
- Linux capabilities model
Historical Context
Docker launched in 2013 and initially bundled everything: image format, image registry protocol, container runtime, network management, and orchestration tooling in a single monolithic daemon. This was valuable for adoption but created a single point of failure and made it impossible for Kubernetes and other orchestrators to use Docker's internals without going through the full Docker daemon.
The industry fragmentation and standardization effort began in 2015:
- June 2015: Docker, CoreOS, and others announced the Open Container Initiative under the Linux Foundation
- April 2016: Docker 1.11 refactored the Docker daemon to run containers through containerd and runc
- December 2016: Kubernetes 1.5 introduced the Container Runtime Interface (alpha), abstracting away the runtime
- March 2017: containerd was donated to the CNCF
- July 2017: OCI 1.0 image-spec and runtime-spec published
- February 2019: containerd graduated from the CNCF
- December 2020: Kubernetes 1.20 deprecated direct Docker (dockershim) support; dockershim was removed in 1.24 (2022), and since then clusters must use a CRI-compliant runtime directly
OCI Specification
The Open Container Initiative defines two specifications:
OCI Image Spec
Defines the format of a container image:
- A manifest listing ordered layers (content-addressable by SHA256 digest)
- Each layer is a tar archive of filesystem changes, usually gzip-compressed
- An image configuration JSON describing entrypoint, environment, working directory, exposed ports, etc.
- An image index (manifest list) for multi-architecture images
Image Index (manifest list)
├── manifest (linux/amd64) → sha256:abc123
│ ├── config.json → sha256:def456
│ ├── layer 1 → sha256:layer1digest
│ ├── layer 2 → sha256:layer2digest
│ └── layer 3 → sha256:layer3digest
└── manifest (linux/arm64) → sha256:xyz789
└── ...
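The digest chain in the tree above can be sketched in a few lines. This is a minimal illustration, not how a registry client is actually written: the helper names (`digest`, `descriptor`, `build_manifest`) are hypothetical, and the placeholder byte strings stand in for a real config and real compressed layer tarballs. The media type strings are the ones the OCI image spec defines.

```python
import hashlib
import json


def digest(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def descriptor(media_type: str, blob: bytes) -> dict:
    """Build a descriptor: the (mediaType, digest, size) triple that references a blob."""
    return {"mediaType": media_type, "digest": digest(blob), "size": len(blob)}


def build_manifest(config_blob: bytes, layer_blobs: list[bytes]) -> dict:
    """Assemble a minimal image manifest pointing at one config and ordered layers."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }


# Placeholder bytes (assumption): real layers would be gzipped tar archives.
config = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
layers = [b"layer-one-bytes", b"layer-two-bytes"]
manifest = build_manifest(config, layers)
print(json.dumps(manifest, indent=2))
```

Because every reference is a digest of the referenced bytes, changing a single byte in a layer changes its digest, which changes the manifest, which changes the manifest's own digest in the index.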
OCI Runtime Spec
Defines what a container runtime receives and does. The runtime receives an OCI bundle:
/containers/my-container/
├── config.json ← runtime specification
└── rootfs/ ← root filesystem
├── bin/
├── etc/
├── lib/
└── usr/
config.json specifies:
- process: command, args, env, working directory, user, capabilities, rlimits
- root: path to rootfs
- mounts: additional bind mounts (volumes)
- linux.namespaces: which namespaces to create or join
- linux.resources: cgroup configuration (cpu, memory, pids, etc.)
- linux.seccomp: seccomp-BPF filter profile
- hooks: lifecycle hooks (prestart, now deprecated; createRuntime, createContainer, startContainer, poststart, poststop)
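To make the field list concrete, here is a sketch that builds a small config.json as a Python dictionary. The field names follow the OCI runtime spec, but the specific values (memory limit, pids limit, empty capability sets) are illustrative assumptions, and the result has not been validated against any particular runtime. In practice `runc spec` generates a fuller default that is a safer starting point.

```python
import json


def minimal_config(args, rootfs="rootfs"):
    """Return a small OCI runtime config; values are illustrative, not recommended defaults."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            # Empty sets: the process keeps no capabilities at all.
            "capabilities": {"bounding": [], "effective": [], "permitted": []},
            "noNewPrivileges": True,
        },
        "root": {"path": rootfs, "readonly": True},
        "mounts": [
            {"destination": "/proc", "type": "proc", "source": "proc"},
        ],
        "linux": {
            # No "path" key means: create a new namespace of this type.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
            "resources": {
                "memory": {"limit": 64 * 1024 * 1024},  # bytes
                "pids": {"limit": 32},
            },
        },
    }


print(json.dumps(minimal_config(["echo", "hello"]), indent=2))
```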
runc — The OCI Reference Implementation
runc is the reference implementation of the OCI runtime spec. It is the component that actually interacts with the Linux kernel to create containers.
What runc does
- Reads config.json from the OCI bundle
- Calls libcontainer (its internal Go library) to:
  - Call clone() with the specified CLONE_NEW* flags to create a new process in new namespaces
  - Write cgroup configuration to the appropriate cgroup files
  - Set up the filesystem: mount the rootfs overlay, process the mounts list, call pivot_root()
  - Apply user namespace UID/GID mappings
  - Apply seccomp-BPF filter
  - Drop capabilities to the specified set
  - Set PR_SET_NO_NEW_PRIVS
  - Execute the container process via execve()
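The first step, turning linux.namespaces into CLONE_NEW* flags, can be illustrated as follows. This is a sketch of the mapping only, not runc's code (runc is written in Go and does this inside libcontainer). The function name `plan_namespaces` and the example PID in the path are hypothetical. The flag values are the Linux constants from `<sched.h>`, and only a subset of namespace types is covered.

```python
# Linux clone(2) flag values for the namespace types handled in this sketch.
CLONE_FLAGS = {
    "mount": 0x00020000,    # CLONE_NEWNS
    "cgroup": 0x02000000,   # CLONE_NEWCGROUP
    "uts": 0x04000000,      # CLONE_NEWUTS
    "ipc": 0x08000000,      # CLONE_NEWIPC
    "user": 0x10000000,     # CLONE_NEWUSER
    "pid": 0x20000000,      # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
}


def plan_namespaces(namespaces):
    """Split a linux.namespaces list into (flags to create, paths to join).

    An entry without "path" asks for a new namespace (a clone flag).
    An entry with "path" asks to join an existing one (via setns).
    """
    clone_flags = 0
    join_paths = {}
    for ns in namespaces:
        ns_type = ns["type"]
        if ns_type not in CLONE_FLAGS:
            raise ValueError(f"unknown namespace type: {ns_type}")
        if "path" in ns:
            join_paths[ns_type] = ns["path"]
        else:
            clone_flags |= CLONE_FLAGS[ns_type]
    return clone_flags, join_paths


flags, joins = plan_namespaces([
    {"type": "pid"},
    {"type": "mount"},
    # Hypothetical path: joining the network namespace of an existing process.
    {"type": "network", "path": "/proc/1234/ns/net"},
])
print(hex(flags), joins)
```

The "join" case is what lets several containers in one Kubernetes Pod share a network namespace.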
runc invocation (by containerd)
[containerd-shim]
│
│ exec: runc create --bundle /run/containerd/... --pid-file ...
▼
[runc process]
│
│ reads config.json
│ calls clone() with CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWNET|...
▼
[container init process]
│
│ sets up mounts, pivot_root, capabilities, seccomp
│ waits for "start" signal from runc
▼
[runc start] → signals init process to execve() entrypoint
│
▼
[container entrypoint process running]
crun is an alternative OCI runtime implemented in C (vs. runc's Go), with lower overhead and faster start times. It is the default OCI runtime for Podman on Fedora and on recent RHEL/CentOS Stream releases.
containerd — High-Level Runtime
containerd is a CNCF-graduated project that manages the full container lifecycle above the OCI layer:
- Image management: Pull from registries (HTTP/HTTPS, authentication), store in content-addressable store, manage layer deduplication
- Snapshot management: Manage filesystem snapshots (overlayfs layers) for container rootfs
- Container lifecycle: Create, start, pause, resume, stop, delete containers
- Task management: A "task" is a running container — containerd tracks task state
- Events: Publish events (container started, image pulled, etc.) for subscribers
- CRI plugin: Implements the Kubernetes CRI gRPC API as a built-in plugin
containerd exposes a gRPC API over a Unix socket (/run/containerd/containerd.sock).
containerd Content Store
All image layers are stored content-addressably in /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/. A layer with the same digest is stored once and referenced by all images that include it — this is how Docker-like layer sharing is implemented.
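The deduplication property falls out of naming each blob after its own hash. The toy store below shows the idea; it is a sketch under stated assumptions, not containerd's implementation. The `ContentStore` class and its methods are hypothetical, and only the blobs/sha256/ directory layout mirrors the path above.

```python
import hashlib
import tempfile
from pathlib import Path


class ContentStore:
    """Toy content-addressable store: each blob is saved under its sha256 hex digest."""

    def __init__(self, root: Path):
        self.blobs = root / "blobs" / "sha256"
        self.blobs.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        hex_digest = hashlib.sha256(data).hexdigest()
        path = self.blobs / hex_digest
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return "sha256:" + hex_digest

    def get(self, digest: str) -> bytes:
        algo, _, hex_digest = digest.partition(":")
        if algo != "sha256":
            raise ValueError(f"unsupported digest algorithm: {algo}")
        data = (self.blobs / hex_digest).read_bytes()
        # Re-hash on read so on-disk tampering or corruption is detected.
        if hashlib.sha256(data).hexdigest() != hex_digest:
            raise ValueError(f"blob {digest} failed verification")
        return data


with tempfile.TemporaryDirectory() as tmp:
    store = ContentStore(Path(tmp))
    first = store.put(b"shared base layer")
    second = store.put(b"shared base layer")  # e.g. a second image using the same layer
    print(first == second, len(list(store.blobs.iterdir())))
```

Two images that include the same layer produce the same digest, so the second `put` finds the file already present and stores nothing new.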
containerd-shim
The containerd-shim is a small per-container process that acts as a bridge between containerd and runc. Its purpose:
- runc exits after container start: runc creates the container and exits. The shim inherits the container's stdio and the container's PID file.
- Keepalive across containerd restarts: If containerd crashes and restarts, existing running containers remain alive because the shim (not containerd) is their parent. The shim reports exit status back when containerd reconnects.
- Stdio management: The shim handles stdio routing (to logs, to a TTY, to a pipe).
Process tree during container run:
systemd (PID 1)
├── containerd
└── containerd-shim-runc-v2 (shim; reparented to PID 1, stays alive while container runs)
    └── <container PID 1> (the container's init process)
        └── <container apps>
The shim communicates with containerd via a Unix socket it creates, using the shim API (a small RPC protocol separate from the CRI; the v2 shim speaks ttrpc, a lightweight gRPC-like protocol).
CRI — Container Runtime Interface
CRI is the Kubernetes abstraction layer between kubelet and the container runtime. It is a gRPC API defined in cri-api/pkg/apis/runtime/v1/api.proto.
CRI Operations
RuntimeService:
RunPodSandbox(PodSandboxConfig) → PodSandboxId
StopPodSandbox(PodSandboxId)
RemovePodSandbox(PodSandboxId)
CreateContainer(PodSandboxId, ContainerConfig) → ContainerId
StartContainer(ContainerId)
StopContainer(ContainerId, timeout)
ExecSync(ContainerId, cmd) → stdout/stderr
Exec(ContainerId, ...) → stream URL
ImageService:
PullImage(ImageSpec, auth) → ImageRef
ListImages() → []Image
RemoveImage(ImageRef)
CRI Call Flow Diagram
Kubernetes API Server
│
│ Pod spec created
▼
kubelet
│
│ gRPC: RunPodSandbox(pod spec)
▼
containerd (CRI plugin at /run/containerd/containerd.sock)
│
│ 1. Create network namespace
│ 2. Call CNI plugin to configure networking
│ 3. Create pause container (infra container)
▼
[pause container — holds Pod namespaces]
│
│ gRPC: CreateContainer(pod sandbox id, container spec)
▼
containerd
│
│ 1. Pull image if not cached
│ 2. Create snapshot (overlay layers)
│ 3. Generate OCI bundle (config.json + rootfs)
│ 4. Invoke containerd-shim
▼
containerd-shim-runc-v2
│
│ exec: runc create
▼
runc
│
│ clone() → set up namespaces, cgroups, mounts
│ join network namespace of pause container
▼
[container process running]
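The ordering of CRI calls in the flow above can be sketched with an in-memory stand-in for the runtime. Everything here is hypothetical: `FakeRuntime` is not a gRPC client, `sync_pod` is a heavily simplified stand-in for what kubelet does, and the real kubelet performs more steps (image status checks, restarts, garbage collection). Only the recorded RPC names correspond to the CRI operations listed earlier.

```python
import itertools


class FakeRuntime:
    """In-memory stand-in for a CRI runtime (hypothetical; records call order only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.images = set()
        self.containers = {}  # container id -> state
        self.calls = []       # CRI RPC names, in the order they were invoked

    def run_pod_sandbox(self, pod_name):
        self.calls.append("RunPodSandbox")
        return f"sandbox-{next(self._ids)}"

    def pull_image(self, image):
        self.calls.append("PullImage")
        self.images.add(image)
        return image

    def create_container(self, sandbox_id, name, image):
        self.calls.append("CreateContainer")
        container_id = f"ctr-{next(self._ids)}"
        self.containers[container_id] = "created"
        return container_id

    def start_container(self, container_id):
        self.calls.append("StartContainer")
        self.containers[container_id] = "running"


def sync_pod(runtime, pod_name, containers):
    """Sandbox first, then pull/create/start each container inside it."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    started = []
    for name, image in containers:
        if image not in runtime.images:  # pull only when not cached
            runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, name, image)
        runtime.start_container(container_id)
        started.append(container_id)
    return sandbox_id, started


runtime = FakeRuntime()
sync_pod(runtime, "web", [("app", "nginx:latest"), ("sidecar", "nginx:latest")])
print(runtime.calls)
```

The second container reuses the cached image, so only one PullImage appears in the call log.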
CRI-O — Kubernetes-Native Runtime
CRI-O is a minimal CRI implementation built specifically for Kubernetes. Key differences from containerd:
- No Docker compatibility layer
- No general-purpose API (only the CRI API)
- Uses runc (or any OCI runtime) directly
- Lighter weight: less code surface area
- Default runtime in OpenShift
CRI-O's philosophy: do exactly what Kubernetes needs, nothing more.
Docker Architecture Evolution
Original Docker (2013-2016):
┌──────────────────────────────────────────┐
│ docker daemon (monolith) │
│ image pull, build, run, network, volumes│
└──────────────────────────────────────────┘
│ fork+exec
▼
container process
Modern Docker (2017+):
┌──────────────┐ gRPC ┌─────────────────────────┐
│ dockerd │ ──────────→ │ containerd │
│ (docker API) │ │ (container lifecycle) │
└──────────────┘ └─────────────────────────┘
│
│ shim API
▼
containerd-shim-runc-v2
│
│ exec
▼
runc
│
│ clone()/cgroups/mounts
▼
container process
Docker itself now uses containerd internally — docker run is essentially a convenience API that translates to containerd operations.
Container Runtime Stack Layers Diagram
┌────────────────────────────────────────────────────────────┐
│ User / Orchestrator │
│ kubectl / docker CLI / podman CLI │
└─────────────────────┬──────────────────────────────────────┘
│ REST/gRPC
┌─────────────────────▼──────────────────────────────────────┐
│ High-level Runtime │
│ containerd / CRI-O / dockerd │
│ - Image pull/push (registry protocol) │
│ - Snapshot management (overlayfs layers) │
│ - CRI gRPC server (for Kubernetes) │
└─────────────────────┬──────────────────────────────────────┘
│ shim API / exec
┌─────────────────────▼──────────────────────────────────────┐
│ containerd-shim-runc-v2 │
│ - Per-container process, survives containerd restart │
│ - Manages stdio, reports exit status │
└─────────────────────┬──────────────────────────────────────┘
│ exec (OCI bundle)
┌─────────────────────▼──────────────────────────────────────┐
│ Low-level OCI Runtime │
│ runc / crun / youki │
│ - Reads config.json │
│ - Sets up namespaces, cgroups, mounts │
│ - Drops caps, applies seccomp │
│ - execve() container entrypoint │
└─────────────────────┬──────────────────────────────────────┘
│ syscalls
┌─────────────────────▼──────────────────────────────────────┐
│ Linux Kernel │
│ namespaces (clone), cgroups (/sys/fs/cgroup), │
│ overlayfs, seccomp-BPF, capabilities, netlink │
└────────────────────────────────────────────────────────────┘
Rootless Containers
Rootless containers run the entire container stack — including containerd or podman and the containers themselves — as a non-root user. This significantly improves security: a container escape only yields the unprivileged host UID.
How it works:
1. User namespace: maps container UID 0 → unprivileged host UID (e.g., 100000)
2. Network: uses slirp4netns or pasta for userspace networking (no host network bridge required)
3. Cgroups: uses the portion of the cgroup hierarchy delegated to the user by systemd
4. Overlayfs: requires kernel support for overlayfs mounts inside user namespaces (kernel 5.11+); on older kernels, fuse-overlayfs is the fallback
Tools: rootless containerd (nerdctl), podman (default rootless), rootlesskit
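The user namespace mapping in step 1 is visible in /proc/<PID>/uid_map, where each line has three fields: first ID inside the namespace, first ID outside, and range length. The sketch below parses that format and translates a container UID to a host UID. The function names are hypothetical, and the sample mapping reuses the 100000 example from the list above rather than output from a real system.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def to_host_id(entries, container_id):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Sample mapping (assumption): container UIDs 0..65535 map to host UIDs 100000..165535.
sample = "0 100000 65536\n"
entries = parse_id_map(sample)
for uid in (0, 33, 70000):
    print(uid, "->", to_host_id(entries, uid))
```

An unmapped ID (70000 in the sample) has no host equivalent, which is why files owned by such IDs cannot be created inside the container.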
Production Examples
Inspecting what runc does for a container:
# Get container bundle path from containerd
ctr containers info <id>
# Look at the generated config.json
cat /run/containerd/io.containerd.runtime.v2.task/default/<id>/config.json
Using crictl (CRI debug tool):
# List pods
crictl pods
# List containers
crictl ps
# Exec into a container via CRI
crictl exec -it <container-id> bash
# Pull an image via CRI
crictl pull nginx:latest
Direct containerd interaction:
# Pull and run without Docker
ctr images pull docker.io/library/nginx:latest
ctr run --rm docker.io/library/nginx:latest mynginx nginx -g "daemon off;"
Debugging Notes
- "container not found": The containerd-shim may have crashed while the container is still running. The PID in /run/containerd/.../<id>/init.pid is still alive but the shim is gone. Requires manual cleanup.
- slow docker exec: The setns() call for the network namespace is a known latency contributor. The call is made by runc on the daemon side, not by the docker CLI, so attach to the container's shim and follow its children with strace -f -T -e trace=setns -p <shim-pid>, then run docker exec ....
- runc failing with permission error: Check if cgroup v2 delegation is properly set up for the user/service running runc.
- containerd snapshotter issues: If overlayfs fails, the fallback is the native snapshotter (it copies directories instead of layering them, so it is much slower). Check ctr snapshots ls for orphaned snapshots.
- CRI sandbox vs container confusion: In Kubernetes, a "pod" maps to a "sandbox" in CRI. The sandbox is the pause container. Individual containers are created inside the sandbox. crictl pods shows sandboxes; crictl ps shows containers within sandboxes.
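For the "container not found" case, a first check is whether the PID recorded in the container's init.pid file still exists. The sketch below does that with signal 0, which performs the existence and permission check without delivering a signal. The function names are hypothetical and the caller supplies the pid file path. PIDs can be reused, so a live PID is a hint to investigate further, not proof that the container survived.

```python
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_init_pid(pid_file) -> str:
    """Summarize the state of the process named in a pid file."""
    try:
        pid = int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return "no usable pid file"
    return f"pid {pid} alive" if pid_alive(pid) else f"pid {pid} not running"
```

A typical use is `check_init_pid(path)` with the init.pid path of the affected container, followed by checking whether a matching shim process is still present.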
Security Implications
- runc and host kernel: Containers started by runc share the host kernel (namespaces isolate a process's view of the kernel, they do not virtualize it), so a kernel vulnerability can be exploited from within a container to escape.
- containerd-shim process: The shim runs as root. Its socket must be protected from access by container processes.
- OCI bundle config.json injection: If an attacker can modify config.json before runc reads it, they can inject capabilities, seccomp bypasses, or additional bind mounts.
- Image layer poisoning: If the content-addressable store is tampered with, containers run malicious code. Digest verification on pull is critical.
- rootless containers: Running rootless eliminates the "root inside container = root escape" scenario, but user namespace exploits are a risk. Keep kernels patched.
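The config.json injection risk suggests a simple review step: scan a generated config for the settings an attacker would want to add. The sketch below is illustrative only. The `audit_config` function is hypothetical, and the lists of risky capabilities and sensitive mount sources are assumptions chosen for the example, not a complete policy.

```python
# Illustrative lists (assumptions), not an exhaustive security policy.
RISKY_CAPS = {"CAP_SYS_ADMIN", "CAP_SYS_MODULE", "CAP_SYS_PTRACE"}
SENSITIVE_SOURCES = {"/", "/etc", "/run/containerd/containerd.sock"}


def audit_config(config: dict) -> list[str]:
    """Return human-readable findings for risky settings in an OCI runtime config."""
    findings = []
    process = config.get("process", {})

    bounding = set(process.get("capabilities", {}).get("bounding", []))
    for cap in sorted(bounding & RISKY_CAPS):
        findings.append(f"risky capability in bounding set: {cap}")

    if not process.get("noNewPrivileges", False):
        findings.append("noNewPrivileges is not set")

    if "seccomp" not in config.get("linux", {}):
        findings.append("no seccomp profile configured")

    for mount in config.get("mounts", []):
        options = mount.get("options", [])
        is_bind = "bind" in options or "rbind" in options
        if is_bind and mount.get("source") in SENSITIVE_SOURCES:
            findings.append(f"bind mount of sensitive host path: {mount['source']}")

    return findings


suspicious = {
    "process": {"capabilities": {"bounding": ["CAP_CHOWN", "CAP_SYS_ADMIN"]}},
    "mounts": [{"destination": "/host", "source": "/", "options": ["rbind", "rw"]}],
    "linux": {},
}
for finding in audit_config(suspicious):
    print(finding)
```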
Performance Implications
- Container startup latency: Dominated by: image layer extraction (cold start), snapshot creation (overlayfs setup), cgroup creation, namespace setup, CNI plugin calls. Typical range: 100ms–2s.
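- runc vs crun: crun (C implementation) starts short-lived containers faster than runc (Go), roughly 2x in the crun project's own benchmarks, due to lower startup overhead. The difference matters less for long-running workloads.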
- containerd vs CRI-O: CRI-O has slightly lower baseline CPU usage due to minimal feature set.
- Shim overhead: Each running container has a shim process (typically ~4MB RSS). On nodes with 1000+ containers, shim memory adds up.
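The shim overhead claim is simple multiplication, worked through below using the ~4MB per-shim figure from the text as an assumed input. Actual RSS varies by shim version and workload, so measure on a real node before relying on these numbers.

```python
def shim_memory_mb(shims: int, rss_mb_per_shim: float = 4.0) -> float:
    """Total shim memory in MB, assuming a constant RSS per shim process."""
    return shims * rss_mb_per_shim


for count in (100, 1000, 5000):
    total = shim_memory_mb(count)
    print(f"{count:>5} shims -> {total:,.0f} MB (~{total / 1000:.1f} GB)")
```

At 1000 shims the total is about 4 GB, which is memory the node cannot give to workloads.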
Failure Modes
| Failure | Symptom | Diagnosis |
|---|---|---|
| containerd crash | All containers keep running (shims alive), but containers cannot be started, stopped, or otherwise managed | Check containerd logs, restart containerd; shims reconnect |
| runc hang | Container start hangs indefinitely | Check kernel logs, strace the runc process; often caused by seccomp or mount issues |
| Shim leak | Old container gone but shim still running | ps aux \| grep containerd-shim, check PID file; cleanup needed |
| CRI version mismatch | kubelet can't connect to runtime | Check CRI API version in kubelet and containerd config |
| Snapshot corrupted | Container fails with filesystem errors | ctr snapshots rm on corrupted snapshot; may require image re-pull |
Modern Usage
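- containerd is the dominant production runtime: Docker uses it internally, and most Kubernetes distributions and major cloud providers run it as their CRI runtime (Kubernetes deprecated dockershim in 1.20 and removed it in 1.24)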
- Podman: Drop-in Docker replacement using rootless containers, no daemon, uses runc/crun directly
- nerdctl: docker-compatible CLI for containerd, supports Compose, rootless mode
- WebAssembly runtimes: containerd's WasmEdge and Spin plugins enable running Wasm workloads as containers using a different "runtime" path
Future Directions
- OCI v2 image spec: Working group on improved image format with streaming layers, content deduplication at chunk level
- Confidential containers: OCI runtime for TEE (Trusted Execution Environment) containers — AMD SEV, Intel TDX; kata-containers with encrypted memory
- crun plugins (libkrun): micro-VM runner via libkrun for kernel isolation without full VM overhead
- Namespace-less containers: Exploration of BPF-based isolation that does not require traditional namespace overhead
Exercises
- Manually create an OCI bundle for a simple echo hello container. Write a minimal config.json. Run it with runc run mycontainer.
- Use strace -f -e trace=clone,unshare,setns runc run mycontainer to observe the exact namespace setup calls runc makes (-f follows the child processes, where most of the setup happens).
- Pull an image with ctr images pull and inspect its layers. Find the content blobs in /var/lib/containerd. Verify the digest matches the manifest.
- Set up crictl on a Kubernetes node. List all pods and containers. Compare output with kubectl get pods.
- Run containerd-shim-runc-v2 --help and read its flags. Start a container and find the shim's PID. Kill containerd. Verify the container is still running. Restart containerd and verify you can interact with the container again.
- Build a rootless container setup with podman (no root). Run a web server container. Inspect how the UID mapping is configured using cat /proc/<PID>/uid_map.
References
- OCI Image Spec: github.com/opencontainers/image-spec
- OCI Runtime Spec: github.com/opencontainers/runtime-spec
- containerd documentation: containerd.io/docs
- runc source: github.com/opencontainers/runc
- crun source: github.com/containers/crun
- CRI API definition: github.com/kubernetes/cri-api
- Kubernetes CRI documentation: kubernetes.io/docs/concepts/architecture/cri/
- crictl tool: github.com/kubernetes-sigs/cri-tools
- Michael Crosby's containerd design documents