Container Runtimes

Technical Overview

A container runtime is the software component responsible for actually running a container: setting up the isolated environment (namespaces, cgroups, filesystem), then executing the container process. The term "container runtime" is overloaded — it refers to both high-level runtimes (containerd, CRI-O) that manage the full container lifecycle including image pulling and storage, and low-level runtimes (runc, crun) that take an already-prepared bundle and execute it.

The ecosystem has organized around standardization: the Open Container Initiative (OCI) defines what a container image and container execution look like at the specification level, so that any OCI-compliant image can run on any OCI-compliant runtime. Above that, the Container Runtime Interface (CRI) defines the API Kubernetes uses to communicate with runtimes, enabling runtime choice at the Kubernetes layer.

Understanding the full stack — from Kubernetes kubelet at the top, through the CRI, through containerd, through runc, to kernel namespaces at the bottom — is essential for debugging container infrastructure.

Prerequisites

Linux namespaces and cgroups (see sections 01 and 02)
gRPC and Protocol Buffers basics
Filesystem concepts: overlayfs, bind mounts
Linux capabilities model

Historical Context

Docker launched in 2013 and initially bundled everything: image format, image registry protocol, container runtime, network management, and orchestration tooling in a single monolithic daemon. This was valuable for adoption but created a single point of failure and made it impossible for Kubernetes and other orchestrators to use Docker's internals without going through the full Docker daemon.

The standardization effort began in 2015:
- June 2015: Docker, CoreOS, and others announced the Open Container Initiative under the Linux Foundation
- April 2016: Docker 1.11 refactored its daemon to run containers through containerd and runc
- December 2016: Kubernetes 1.5 introduced the Container Runtime Interface (alpha), abstracting away the runtime
- March 2017: containerd was donated to CNCF
- July 2017: OCI 1.0 image-spec and runtime-spec published
- February 2019: containerd graduated from CNCF incubation
- December 2020: Kubernetes 1.20 deprecated direct Docker (dockershim) support; dockershim was removed in 1.24 (2022), after which clusters must use a CRI-compliant runtime directly

OCI Specification

The Open Container Initiative defines two specifications:

OCI Image Spec

Defines the format of a container image:
- A manifest listing ordered layers, each addressed by content digest (typically SHA-256)
- Each layer is a tar archive of filesystem changes, usually gzip-compressed (uncompressed and zstd-compressed layers are also defined)
- An image configuration JSON describing entrypoint, environment, working directory, exposed ports, etc.
- An image index (manifest list) for multi-architecture images

Image Index (manifest list)
├── manifest (linux/amd64)  → sha256:abc123
│   ├── config.json         → sha256:def456
│   ├── layer 1             → sha256:layer1digest
│   ├── layer 2             → sha256:layer2digest
│   └── layer 3             → sha256:layer3digest
└── manifest (linux/arm64)  → sha256:xyz789
    └── ...
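
As a rough illustration of how a client might walk the tree above, the following sketch picks the manifest descriptor for a platform from an image index and checks a blob against its digest. The helper names (select_manifest, digest_of) and the toy index are assumptions made for this example; a real client would fetch the index and manifest from a registry.

```python
import hashlib
import json


def digest_of(blob: bytes) -> str:
    """Return the OCI-style digest string for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def select_manifest(index: dict, os_name: str, arch: str) -> dict:
    """Return the manifest descriptor in an image index matching os/arch."""
    for desc in index.get("manifests", []):
        platform = desc.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return desc
    raise LookupError(f"no manifest for {os_name}/{arch}")


# Toy data (hypothetical): a tiny manifest body and an index that points at it.
amd64_manifest = json.dumps({"schemaVersion": 2, "layers": []}).encode()
index = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.index.v1+json",
    "manifests": [
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": digest_of(amd64_manifest),
            "size": len(amd64_manifest),
            "platform": {"os": "linux", "architecture": "amd64"},
        },
        {
            "mediaType": "application/vnd.oci.image.manifest.v1+json",
            "digest": "sha256:" + "0" * 64,  # placeholder digest
            "size": 0,
            "platform": {"os": "linux", "architecture": "arm64"},
        },
    ],
}

chosen = select_manifest(index, "linux", "amd64")
# Content addressing: the blob is only trusted if it hashes to the descriptor's digest.
verified = digest_of(amd64_manifest) == chosen["digest"]
```

The same digest check applies at every level of the tree: index to manifest, manifest to config, and manifest to each layer.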

OCI Runtime Spec

Defines what a container runtime receives and does. The runtime receives an OCI bundle:

/containers/my-container/
├── config.json     ← runtime specification
└── rootfs/         ← root filesystem
    ├── bin/
    ├── etc/
    ├── lib/
    └── usr/

config.json specifies:
- process: command, args, env, working directory, user, capabilities, rlimits
- root: path to rootfs
- mounts: filesystems to mount inside the container (proc, sysfs, tmpfs, and bind mounts such as volumes)
- linux.namespaces: which namespaces to create or join
- linux.resources: cgroup configuration (cpu, memory, pids, etc.)
- linux.seccomp: seccomp-BPF filter profile
- hooks: lifecycle hooks (prestart, which is deprecated, plus createRuntime, createContainer, startContainer, poststart, poststop)
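
To make the bundle layout concrete, here is a minimal sketch that writes a bundle directory with a small config.json. The field names follow the OCI runtime spec as far as I am aware; the helper names (minimal_spec, write_bundle), the chosen limits, and the ociVersion value are assumptions for illustration. The rootfs it creates is empty, so the bundle would need a real root filesystem before a runtime could execute it.

```python
import json
import os
import tempfile


def minimal_spec(args):
    """Build a small runtime-spec document (a sketch, not a complete spec)."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "terminal": False,
            "user": {"uid": 0, "gid": 0},
            "args": list(args),
            "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "cwd": "/",
            "noNewPrivileges": True,
        },
        "root": {"path": "rootfs", "readonly": True},
        "mounts": [
            {"destination": "/proc", "type": "proc", "source": "proc"},
        ],
        "linux": {
            # Entries without a "path" ask the runtime to create a new namespace.
            "namespaces": [{"type": t} for t in ("pid", "mount", "ipc", "uts", "network")],
            "resources": {
                "memory": {"limit": 64 * 1024 * 1024},  # bytes
                "pids": {"limit": 32},
            },
        },
    }


def write_bundle(bundle_dir, args):
    """Create <bundle_dir>/rootfs and <bundle_dir>/config.json; return the config path."""
    os.makedirs(os.path.join(bundle_dir, "rootfs"), exist_ok=True)
    config_path = os.path.join(bundle_dir, "config.json")
    with open(config_path, "w") as f:
        json.dump(minimal_spec(args), f, indent=2)
    return config_path


bundle = tempfile.mkdtemp(prefix="bundle-")
config_path = write_bundle(bundle, ["echo", "hello"])
with open(config_path) as f:
    loaded = json.load(f)
```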

runc — The OCI Reference Implementation

runc is the reference implementation of the OCI runtime spec. It is the component that actually interacts with the Linux kernel to create containers.

What runc does

1. Reads config.json from the OCI bundle
2. Calls libcontainer (its internal Go library) to perform the following (listed by function, not in strict execution order):
   - Call clone() with the specified CLONE_NEW* flags to create a new process in new namespaces
   - Write cgroup configuration to the appropriate cgroup files
   - Set up the filesystem: process the mounts list on top of the prepared rootfs (the high-level runtime has already assembled the overlay), then call pivot_root()
   - Apply user namespace UID/GID mappings
   - Apply seccomp-BPF filter
   - Drop capabilities to the specified set
   - Set PR_SET_NO_NEW_PRIVS (when noNewPrivileges is set in config.json)
3. Executes the container process via execve()
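
The mapping from linux.namespaces to clone() flags can be sketched as follows. The flag values are the Linux constants from the kernel headers; the function name and the sample spec are assumptions for this example, and runc's real logic lives in libcontainer rather than in anything like this helper.

```python
# CLONE_NEW* flag values from <linux/sched.h>, keyed by OCI namespace type.
CLONE_FLAGS = {
    "mount": 0x00020000,    # CLONE_NEWNS
    "cgroup": 0x02000000,   # CLONE_NEWCGROUP
    "uts": 0x04000000,      # CLONE_NEWUTS
    "ipc": 0x08000000,      # CLONE_NEWIPC
    "user": 0x10000000,     # CLONE_NEWUSER
    "pid": 0x20000000,      # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
    "time": 0x00000080,     # CLONE_NEWTIME
}


def namespace_plan(spec):
    """Split linux.namespaces into flags to create and existing namespaces to join.

    An entry with a "path" means "join this existing namespace" (setns);
    an entry without one means "create a new namespace of this type".
    """
    flags = 0
    join = []
    for ns in spec.get("linux", {}).get("namespaces", []):
        if "path" in ns:
            join.append((ns["type"], ns["path"]))
        else:
            flags |= CLONE_FLAGS[ns["type"]]
    return flags, join


# Hypothetical spec: new pid and mount namespaces, shared network namespace.
sample_spec = {
    "linux": {
        "namespaces": [
            {"type": "pid"},
            {"type": "mount"},
            {"type": "network", "path": "/proc/1234/ns/net"},
        ]
    }
}
flags, join = namespace_plan(sample_spec)
```

Joining by path is what lets several containers in one pod share a single network namespace.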

runc invocation (by containerd)

[containerd-shim]
       │
       │  exec: runc create --bundle /run/containerd/... --pid-file ...
       ▼
[runc process]
       │
       │  reads config.json
       │  calls clone() with CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWNET|...
       ▼
[container init process]
       │
       │  sets up mounts, pivot_root, capabilities, seccomp
       │  waits for "start" signal from runc
       ▼
[runc start]  → signals init process to execve() entrypoint
       │
       ▼
[container entrypoint process running]
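
The two-phase create/start handshake in the diagram can be imitated with a named pipe: the "init" process finishes its setup, then blocks until a separate "start" step releases it. This is only an analogy under stated assumptions (a FIFO as the rendezvous, and the hypothetical helpers create and start); it performs no namespace or mount setup and needs a Unix-like system with an echo binary on PATH.

```python
import os
import subprocess
import sys
import tempfile

# Code run by the stand-in "init" process.
INIT_SOURCE = r"""
import os, sys
fifo = sys.argv[1]
# A real init would set up mounts, pivot_root, capabilities and seccomp here.
fd = os.open(fifo, os.O_WRONLY)  # blocks until the "start" step opens the read end
os.write(fd, b"0")
os.close(fd)
os.execvp(sys.argv[2], sys.argv[2:])  # replace init with the entrypoint
"""


def create(fifo_path, argv):
    """Phase 1: launch the init process; it parks itself on the FIFO."""
    os.mkfifo(fifo_path)
    return subprocess.Popen(
        [sys.executable, "-c", INIT_SOURCE, fifo_path, *argv],
        stdout=subprocess.PIPE,
    )


def start(fifo_path):
    """Phase 2: open the read end, which releases the waiting init process."""
    with open(fifo_path, "rb") as f:
        f.read()
    os.unlink(fifo_path)


with tempfile.TemporaryDirectory() as tmp:
    fifo = os.path.join(tmp, "exec.fifo")
    init = create(fifo, ["echo", "hello"])
    blocked_before_start = init.poll() is None  # init has not exec'd yet
    start(fifo)
    output, _ = init.communicate(timeout=10)
```

Splitting creation from start gives the caller a window to attach networking or run hooks before the entrypoint begins executing.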

crun is an alternative OCI runtime implementation in C (vs. runc's Go), with lower overhead and faster start times. It is the default OCI runtime for Podman on recent Fedora and RHEL 9 releases.

containerd — High-Level Runtime

containerd is a CNCF-graduated project that manages the full container lifecycle above the OCI layer:

Image management: Pull from registries (HTTP/HTTPS, authentication), store in content-addressable store, manage layer deduplication
Snapshot management: Manage filesystem snapshots (overlayfs layers) for container rootfs
Container lifecycle: Create, start, pause, resume, stop, delete containers
Task management: A "task" is a running container — containerd tracks task state
Events: Publish events (container started, image pulled, etc.) for subscribers
CRI plugin: Implements the Kubernetes CRI gRPC API as a built-in plugin

containerd exposes a gRPC API over a Unix socket (/run/containerd/containerd.sock).

containerd Content Store

All image layers are stored content-addressably in /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/. A layer with the same digest is stored once and referenced by all images that include it — this is how Docker-like layer sharing is implemented.
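
A content-addressable store of this shape can be sketched in a few lines. The blobs/sha256/ layout mirrors the path above; the function names and the temporary store root are assumptions for the example, and containerd's real store also handles ingest state, labels, and garbage collection.

```python
import hashlib
import os
import tempfile


def blob_path(root, digest):
    """Map a digest such as 'sha256:<hex>' to <root>/blobs/sha256/<hex>."""
    algo, _, hexdigest = digest.partition(":")
    if algo != "sha256" or len(hexdigest) != 64:
        raise ValueError(f"unsupported digest: {digest}")
    return os.path.join(root, "blobs", algo, hexdigest)


def put_blob(root, data):
    """Store data under its digest; identical content is written only once."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    path = blob_path(root, digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest


def verify_blob(root, digest):
    """Re-hash the stored bytes and compare with the name they are stored under."""
    with open(blob_path(root, digest), "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return digest == "sha256:" + actual


store = tempfile.mkdtemp(prefix="content-store-")
layer = b"pretend this is a compressed layer tarball"
first = put_blob(store, layer)
second = put_blob(store, layer)  # same content, so no second copy is written
stored_files = os.listdir(os.path.join(store, "blobs", "sha256"))
ok_before = verify_blob(store, first)

# Simulate tampering with the stored blob.
with open(blob_path(store, first), "wb") as f:
    f.write(b"malicious replacement")
ok_after = verify_blob(store, first)
```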

containerd-shim

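The containerd-shim is a small process, one per container (or one per pod when containerd serves Kubernetes through the CRI plugin), that acts as a bridge between containerd and runc. Its purpose:

runc exits after container start: runc creates the container and exits. The shim, not runc, remains as the parent of the container process; it holds the container's stdio and reads the container PID from the PID file runc wrote.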
Keepalive across containerd restarts: If containerd crashes and restarts, existing running containers remain alive because the shim (not containerd) is their parent. The shim reports exit status back when containerd reconnects.
Stdio management: The shim handles stdio routing (to logs, to a TTY, to a pipe).

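Process tree during container run (the shim is reparented to the host's init process, so it is not a child of containerd):
systemd (host PID 1)
├── containerd
└── containerd-shim-runc-v2  (shim; stays alive while container runs)
    └── <container PID 1>    (the container's init process)
        └── <container apps>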

The shim communicates with containerd via a Unix socket it creates, using the shim API (a small RPC protocol separate from the CRI, carried over ttrpc, a lightweight gRPC-like transport).

CRI — Container Runtime Interface

CRI is the Kubernetes abstraction layer between kubelet and the container runtime. It is a gRPC API defined in cri-api/pkg/apis/runtime/v1/api.proto.

CRI Operations

RuntimeService:
  RunPodSandbox(PodSandboxConfig) → PodSandboxId
  StopPodSandbox(PodSandboxId)
  RemovePodSandbox(PodSandboxId)
  CreateContainer(PodSandboxId, ContainerConfig) → ContainerId
  StartContainer(ContainerId)
  StopContainer(ContainerId, timeout)
  ExecSync(ContainerId, cmd) → stdout/stderr
  Exec(ContainerId, ...) → stream URL

ImageService:
  PullImage(ImageSpec, auth) → ImageRef
  ListImages() → []Image
  RemoveImage(ImageRef)
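
The ordering these calls impose (sandbox first, then containers inside it) can be modeled with a small in-memory stand-in. FakeRuntime is hypothetical and talks to nothing; a real client would issue the gRPC calls listed above. The state names are the CRI enum values as I understand them.

```python
import itertools


class FakeRuntime:
    """In-memory stand-in for a CRI RuntimeService; not a real client."""

    def __init__(self):
        self._next = itertools.count(1)
        self.sandboxes = {}   # sandbox id -> {"name", "state"}
        self.containers = {}  # container id -> {"sandbox", "name", "state"}

    def run_pod_sandbox(self, pod_name):
        sandbox_id = f"sandbox-{next(self._next)}"
        self.sandboxes[sandbox_id] = {"name": pod_name, "state": "SANDBOX_READY"}
        return sandbox_id

    def create_container(self, sandbox_id, name):
        sandbox = self.sandboxes.get(sandbox_id)
        if sandbox is None or sandbox["state"] != "SANDBOX_READY":
            raise RuntimeError(f"sandbox {sandbox_id!r} is not ready")
        container_id = f"container-{next(self._next)}"
        self.containers[container_id] = {
            "sandbox": sandbox_id,
            "name": name,
            "state": "CONTAINER_CREATED",
        }
        return container_id

    def start_container(self, container_id):
        self.containers[container_id]["state"] = "CONTAINER_RUNNING"

    def stop_pod_sandbox(self, sandbox_id):
        # Stopping a sandbox also stops the containers running inside it.
        for container in self.containers.values():
            if container["sandbox"] == sandbox_id and container["state"] == "CONTAINER_RUNNING":
                container["state"] = "CONTAINER_EXITED"
        self.sandboxes[sandbox_id]["state"] = "SANDBOX_NOTREADY"


runtime = FakeRuntime()
sandbox_id = runtime.run_pod_sandbox("web")
container_id = runtime.create_container(sandbox_id, "nginx")
state_after_create = runtime.containers[container_id]["state"]
runtime.start_container(container_id)
state_after_start = runtime.containers[container_id]["state"]
runtime.stop_pod_sandbox(sandbox_id)
state_after_stop = runtime.containers[container_id]["state"]
```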

CRI Call Flow Diagram

Kubernetes API Server
        │
        │  Pod spec created
        ▼
    kubelet
        │
        │  gRPC: RunPodSandbox(pod spec)
        ▼
containerd (CRI plugin at /run/containerd/containerd.sock)
        │
        │  1. Create network namespace
        │  2. Call CNI plugin to configure networking
        │  3. Create pause container (infra container)
        ▼
   [pause container — holds Pod namespaces]
        │
        │  gRPC: PullImage(image) if not cached, then
        │        CreateContainer(pod sandbox id, container spec), StartContainer(container id)
        ▼
containerd
        │
        │  1. Create snapshot (overlay layers)
        │  2. Generate OCI bundle (config.json + rootfs)
        │  3. Invoke containerd-shim
        ▼
containerd-shim-runc-v2
        │
        │  exec: runc create
        ▼
     runc
        │
        │  clone() → set up namespaces, cgroups, mounts
        │  join network namespace of pause container
        ▼
   [container process running]

CRI-O — Kubernetes-Native Runtime

CRI-O is a minimal CRI implementation built specifically for Kubernetes. Key differences from containerd:

No Docker compatibility layer
No general-purpose API (only the CRI API)
Uses runc (or any OCI runtime) directly
Lighter weight: less code surface area
Default runtime in OpenShift

CRI-O's philosophy: do exactly what Kubernetes needs, nothing more.

Docker Architecture Evolution

Original Docker (2013-2016):
┌──────────────────────────────────────────┐
│ docker daemon (monolith)                  │
│  image pull, build, run, network, volumes│
└──────────────────────────────────────────┘
          │ fork+exec
          ▼
    container process

Modern Docker (since 1.11, 2016):
┌──────────────┐     gRPC      ┌─────────────────────────┐
│ dockerd      │ ──────────→   │ containerd               │
│ (docker API) │               │ (container lifecycle)    │
└──────────────┘               └─────────────────────────┘
                                          │
                                          │ shim API
                                          ▼
                               containerd-shim-runc-v2
                                          │
                                          │ exec
                                          ▼
                                        runc
                                          │
                                          │ clone()/cgroups/mounts
                                          ▼
                                  container process

Docker itself now uses containerd internally: docker run is essentially a convenience front end whose requests dockerd translates into containerd operations.

Container Runtime Stack Layers Diagram

┌────────────────────────────────────────────────────────────┐
│                   User / Orchestrator                       │
│    kubectl / docker CLI / podman CLI                        │
└─────────────────────┬──────────────────────────────────────┘
                      │ REST/gRPC
┌─────────────────────▼──────────────────────────────────────┐
│              High-level Runtime                             │
│   containerd / CRI-O / dockerd                              │
│   - Image pull/push (registry protocol)                     │
│   - Snapshot management (overlayfs layers)                  │
│   - CRI gRPC server (for Kubernetes)                        │
└─────────────────────┬──────────────────────────────────────┘
                      │ shim API / exec
┌─────────────────────▼──────────────────────────────────────┐
│           containerd-shim-runc-v2                           │
│   - Per-container process, survives containerd restart      │
│   - Manages stdio, reports exit status                      │
└─────────────────────┬──────────────────────────────────────┘
                      │ exec (OCI bundle)
┌─────────────────────▼──────────────────────────────────────┐
│              Low-level OCI Runtime                          │
│   runc / crun / youki                                       │
│   - Reads config.json                                       │
│   - Sets up namespaces, cgroups, mounts                     │
│   - Drops caps, applies seccomp                             │
│   - execve() container entrypoint                           │
└─────────────────────┬──────────────────────────────────────┘
                      │ syscalls
┌─────────────────────▼──────────────────────────────────────┐
│              Linux Kernel                                   │
│   namespaces (clone), cgroups (/sys/fs/cgroup),             │
│   overlayfs, seccomp-BPF, capabilities, netlink             │
└────────────────────────────────────────────────────────────┘

Rootless Containers

Rootless containers run the entire container stack — including containerd or podman and the containers themselves — as a non-root user. This significantly improves security: a container escape only yields the unprivileged host UID.

How it works:
1. User namespace: maps container UID 0 to the invoking user's own unprivileged host UID (e.g., 1000); the remaining container UIDs map to a subordinate range from /etc/subuid (e.g., starting at 100000)
2. Network: uses slirp4netns or pasta for userspace networking (no host network bridge required)
3. Cgroups: uses the portion of the cgroup hierarchy delegated to the user by systemd
4. Overlayfs: kernel 5.11+ supports overlayfs mounts inside user namespaces; on older kernels fuse-overlayfs serves as a fallback
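
The UID translation performed by a user namespace can be worked through with the format of /proc/<PID>/uid_map, where each line reads "inside-start outside-start count". The sample map below is an assumption modeled on a typical rootless setup (host user 1000 with a subordinate range starting at 100000), and the helper names are hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host_id(entries, container_id):
    """Translate an ID inside the namespace to the host ID, or None if unmapped."""
    for inside, outside, count in entries:
        if inside <= container_id < inside + count:
            return outside + (container_id - inside)
    return None


# Assumed sample: container root is the invoking user, everything else is a subuid.
SAMPLE_UID_MAP = """\
         0       1000          1
         1     100000      65536
"""
uid_entries = parse_id_map(SAMPLE_UID_MAP)
root_on_host = to_host_id(uid_entries, 0)
www_on_host = to_host_id(uid_entries, 33)
unmapped = to_host_id(uid_entries, 70000)
```

With this map, a process that is root inside the container acts on the host as UID 1000, and container UID 33 appears on the host as UID 100032.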

Tools: rootless containerd (nerdctl), podman (default rootless), rootlesskit

Production Examples

Inspecting what runc does for a container:

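# Show the container's metadata, including its OCI spec, as stored by containerd
# (containers created by Kubernetes live in the k8s.io namespace: ctr -n k8s.io ...)
ctr containers info <id>
# Look at the generated config.json ("default" in this path is the containerd namespace)
cat /run/containerd/io.containerd.runtime.v2.task/default/<id>/config.json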

Using crictl (CRI debug tool):

# List pods
crictl pods
# List containers
crictl ps
# Exec into a container via CRI
crictl exec -it <container-id> bash
# Pull an image via CRI
crictl pull nginx:latest

Direct containerd interaction:

# Pull and run without Docker
ctr images pull docker.io/library/nginx:latest
ctr run --rm docker.io/library/nginx:latest mynginx nginx -g "daemon off;"

Debugging Notes

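"container not found": The containerd-shim may have crashed while the container is still running. In that case the PID in /run/containerd/.../<id>/init.pid is still alive but the shim is gone, and manual cleanup is required.
slow docker exec: The setns() call for the network namespace is a known latency contributor. The docker CLI only sends an API request, so tracing it shows no setns() calls; attach to the container's shim instead and follow its children while the exec runs: strace -f -T -e trace=setns -p <shim-pid>.
runc failing with permission error: Check whether cgroup v2 delegation is properly set up for the user or service running runc.
containerd snapshotter issues: containerd uses the snapshotter named in its configuration (overlayfs by default). If overlayfs is unavailable on the host, the native snapshotter can be configured instead, at the cost of full copies per layer. Check ctr snapshots ls for orphaned snapshots.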
CRI sandbox vs container confusion: In Kubernetes, a "pod" maps to a "sandbox" in CRI. The sandbox is the pause container. Individual containers are created inside the sandbox. crictl pods shows sandboxes; crictl ps shows containers within sandboxes.
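
The sandbox/container relationship can be reconstructed by joining the JSON output of the two crictl commands. The sketch below assumes that crictl pods -o json yields an "items" list and crictl ps -o json a "containers" list whose entries carry "podSandboxId"; those field names and the sample documents are assumptions to verify against your crictl version.

```python
import json
from collections import defaultdict


def containers_by_pod(pods_json, ps_json):
    """Group container names under 'namespace/pod-name' using the sandbox ID."""
    pods = {pod["id"]: pod["metadata"] for pod in json.loads(pods_json)["items"]}
    grouped = defaultdict(list)
    for container in json.loads(ps_json)["containers"]:
        meta = pods.get(container["podSandboxId"], {})
        key = f'{meta.get("namespace", "?")}/{meta.get("name", "?")}'
        grouped[key].append(container["metadata"]["name"])
    return dict(grouped)


# Hypothetical, trimmed-down sample output.
SAMPLE_PODS = json.dumps({
    "items": [
        {"id": "sb1", "metadata": {"name": "web", "namespace": "default"}},
    ]
})
SAMPLE_PS = json.dumps({
    "containers": [
        {"id": "c1", "podSandboxId": "sb1", "metadata": {"name": "nginx"}},
        {"id": "c2", "podSandboxId": "sb1", "metadata": {"name": "sidecar"}},
    ]
})
grouped = containers_by_pod(SAMPLE_PODS, SAMPLE_PS)
```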

Security Implications

runc and host kernel: Containers started by runc share the host kernel and are isolated only by namespaces, cgroups, and related kernel features, so a kernel vulnerability can be exploited from within a container to escape.
containerd-shim process: The shim runs as root. Its socket must be protected from access by container processes.
OCI bundle config.json injection: If an attacker can modify config.json before runc reads it, they can inject capabilities, seccomp bypasses, or additional bind mounts.
Image layer poisoning: If the content-addressable store is tampered with, containers run malicious code. Digest verification on pull is critical.
rootless containers: Running rootless eliminates the "root inside container = root escape" scenario, but user namespace exploits are a risk. Keep kernels patched.
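
Several of the risks above are visible directly in config.json, so a simple audit pass over the spec can catch them before the runtime reads it. This is a minimal sketch: the list of risky capabilities, the sensitive socket paths, and the function name are illustrative assumptions, not a complete policy.

```python
RISKY_CAPS = {"CAP_SYS_ADMIN", "CAP_SYS_MODULE", "CAP_SYS_PTRACE"}
SENSITIVE_SOURCES = ("/run/containerd/containerd.sock", "/var/run/docker.sock")


def audit_spec(spec):
    """Return a list of human-readable findings for an OCI runtime spec dict."""
    findings = []
    process = spec.get("process", {})

    # Union of all capability sets (bounding, effective, permitted, ...).
    granted = set()
    for cap_set in process.get("capabilities", {}).values():
        granted.update(cap_set)
    for cap in sorted(granted & RISKY_CAPS):
        findings.append(f"risky capability granted: {cap}")

    if not process.get("noNewPrivileges", False):
        findings.append("noNewPrivileges is not set")

    linux = spec.get("linux", {})
    if not linux.get("seccomp"):
        findings.append("no seccomp profile")

    # A namespace type missing from the list is inherited from the runtime (host).
    present = {ns.get("type") for ns in linux.get("namespaces", [])}
    for ns_type in ("pid", "mount", "network"):
        if ns_type not in present:
            findings.append(f"shares the host {ns_type} namespace")

    for mount in spec.get("mounts", []):
        if mount.get("source") in SENSITIVE_SOURCES:
            findings.append(f"sensitive host path mounted: {mount['source']}")
    return findings


weak_spec = {
    "process": {"capabilities": {"bounding": ["CAP_SYS_ADMIN", "CAP_CHOWN"]}},
    "linux": {"namespaces": [{"type": "mount"}]},
    "mounts": [{"destination": "/sock", "type": "bind",
                "source": "/run/containerd/containerd.sock"}],
}
hardened_spec = {
    "process": {"capabilities": {"bounding": ["CAP_NET_BIND_SERVICE"]},
                "noNewPrivileges": True},
    "linux": {"namespaces": [{"type": "pid"}, {"type": "mount"}, {"type": "network"}],
              "seccomp": {"defaultAction": "SCMP_ACT_ERRNO"}},
    "mounts": [{"destination": "/proc", "type": "proc", "source": "proc"}],
}
weak_findings = audit_spec(weak_spec)
hardened_findings = audit_spec(hardened_spec)
```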

Performance Implications

Container startup latency: Dominated by: image layer extraction (cold start), snapshot creation (overlayfs setup), cgroup creation, namespace setup, CNI plugin calls. Typical range: 100ms–2s.
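runc vs crun: crun (C implementation) is reported by its authors to create and start trivial containers roughly 2x faster than runc (Go) due to lower startup overhead; the end-to-end gain is smaller when image extraction and network setup dominate.
containerd vs CRI-O: CRI-O may have slightly lower baseline CPU usage due to its minimal feature set; the difference is workload-dependent.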
Shim overhead: Each running container has a shim process (typically ~4MB RSS). On nodes with 1000+ containers, shim memory adds up.
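
The shim estimate is simple multiplication, worked out here using the ~4MB figure quoted above as an assumed default. The function name is hypothetical, and the per-shim RSS should be measured on a real node before it is used for capacity planning.

```python
def shim_memory_mib(shim_count, rss_mib_per_shim=4.0):
    """Estimate total resident memory used by shim processes, in MiB."""
    return shim_count * rss_mib_per_shim


total_mib = shim_memory_mib(1000)  # 1000 shims at the assumed ~4 MiB each
total_gib = total_mib / 1024
```

At 1000 shims this comes to 4000 MiB, or a little under 4 GiB, of memory that never appears in any container's own cgroup accounting for its workload.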

Failure Modes

| Failure | Symptom | Diagnosis and remediation |
|---|---|---|
| containerd crash | Running containers keep running (shims stay alive), but containers cannot be started or stopped | Check containerd logs, restart containerd; shims reconnect |
| runc hang | Container start hangs indefinitely | Check kernel logs, `strace` the runc process; often caused by seccomp or mount issues |
| Shim leak | Old container gone but shim still running | `ps aux \| grep containerd-shim`, check PID file; cleanup needed |
| CRI version mismatch | kubelet can't connect to runtime | Check CRI API version in kubelet and containerd config |
| Snapshot corrupted | Container fails with filesystem errors | `ctr snapshots rm` on corrupted snapshot; may require image re-pull |
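
Checking a PID file, as the shim-leak row suggests, comes down to asking the kernel whether the recorded PID still exists. The sketch below uses signal 0 for that; the file names and helper functions are assumptions for the example, and a production check would also confirm that the PID still belongs to the expected process, since PIDs are reused.

```python
import os
import subprocess
import sys
import tempfile


def read_pid_file(path):
    """Read an integer PID from a file such as init.pid."""
    with open(path) as f:
        return int(f.read().strip())


def pid_alive(pid):
    """Return True if a process with this PID exists (signal 0 delivers nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


workdir = tempfile.mkdtemp(prefix="shim-state-")

# A PID file pointing at a live process (this interpreter).
live_file = os.path.join(workdir, "live-init.pid")
with open(live_file, "w") as f:
    f.write(str(os.getpid()))

# A PID file pointing at a process that has already exited and been reaped.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
stale_file = os.path.join(workdir, "stale-init.pid")
with open(stale_file, "w") as f:
    f.write(str(child.pid))

live = pid_alive(read_pid_file(live_file))
stale = pid_alive(read_pid_file(stale_file))
```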

Modern Usage

containerd is the dominant production runtime, used by Docker, by most Kubernetes distributions (especially since dockershim was removed in Kubernetes 1.24), and by major cloud providers
Podman: Drop-in Docker replacement using rootless containers, no daemon, uses runc/crun directly
nerdctl: docker-compatible CLI for containerd, supports Compose, rootless mode
WebAssembly runtimes: containerd shims, such as those from the runwasi project (WasmEdge, Wasmtime) and the Spin shim, enable running Wasm workloads as containers using a different "runtime" path

Future Directions

OCI v2 image spec: Working group on improved image format with streaming layers, content deduplication at chunk level
Confidential containers: OCI runtime for TEE (Trusted Execution Environment) containers — AMD SEV, Intel TDX; kata-containers with encrypted memory
crun plugins (libkrun): micro-VM runner via libkrun for kernel isolation without full VM overhead
Namespace-less containers: Exploration of BPF-based isolation that does not require traditional namespace overhead

Exercises

Manually create an OCI bundle for a simple echo hello container. Write a minimal config.json. Run it with runc run mycontainer.
Use strace -f -e trace=clone,clone3,unshare,setns runc run mycontainer to observe the exact namespace setup calls runc makes (-f is needed because most of these calls happen in child processes).
Pull an image with ctr images pull and inspect its layers. Find the content blobs in /var/lib/containerd. Verify the digest matches the manifest.
Set up crictl on a Kubernetes node. List all pods and containers. Compare output with kubectl get pods.
Run containerd-shim-runc-v2 --help and read its flags. Start a container and find the shim's PID. Kill containerd. Verify the container is still running. Restart containerd and verify you can interact with the container again.
Build a rootless container setup with podman (no root). Run a web server container. Inspect how the UID mapping is configured using cat /proc/<PID>/uid_map.

References

OCI Image Spec: github.com/opencontainers/image-spec
OCI Runtime Spec: github.com/opencontainers/runtime-spec
containerd documentation: containerd.io/docs
runc source: github.com/opencontainers/runc
crun source: github.com/containers/crun
CRI API definition: github.com/kubernetes/cri-api
Kubernetes CRI documentation: kubernetes.io/docs/concepts/architecture/cri/
crictl tool: github.com/kubernetes-sigs/cri-tools
Michael Crosby's containerd design documents