Section 20: Containers — Overview
Section Purpose and Scope
This section provides a deep technical examination of container technology, covering the Linux kernel primitives that make containers possible, the layered software stack from OCI specification through high-level runtimes, and the security model that governs isolation. Containers are not a new OS feature — they are a compositional use of several independent kernel mechanisms that together produce an isolated, portable execution environment. Understanding containers at this level is essential for anyone designing secure, high-performance container infrastructure.
Prerequisites
- Section 03: Kernel Fundamentals (system calls, kernel/user boundary)
- Section 04: Kernel Architecture (VFS, network stack overview)
- Section 07: Process Management (fork, exec, process trees, /proc)
- Section 11: Memory Management (virtual memory, page tables)
- Section 13: Filesystems (VFS, mount namespaces, overlay filesystems)
- Section 15: Networking (network stack, interfaces, routing)
- Section 19: Virtualization (hypervisor contrast with containers)
Learning Objectives
By the end of this section you will be able to:
- Explain each Linux namespace type and what resource it isolates.
- Describe the cgroups v1 and v2 hierarchy and resource accounting model.
- Trace the lifecycle of a container from `docker run` through runc to a running process tree.
- Articulate the OCI Image Specification and Runtime Specification.
- Explain overlay filesystem construction and the role of layers.
- Analyze the container attack surface and enumerate the kernel mitigations.
- Compare rootless containers, gVisor, and Kata Containers as security posture tradeoffs.
- Debug container networking using low-level tools (ip netns, veth pairs, iptables).
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ User Request │
│ docker run / kubectl │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────────┐
│ High-Level Container Runtime │
│ Docker Daemon / containerd / CRI-O │
│ - Image pull & verify (OCI Image Spec) │
│ - Snapshot management (overlayfs layers) │
│ - CRI gRPC interface (for Kubernetes) │
└────────────────────────────┬────────────────────────────────────┘
│ OCI Runtime Spec bundle
┌────────────────────────────▼────────────────────────────────────┐
│ Low-Level Container Runtime │
│ runc / crun / youki │
│ - clone() with namespace flags │
│ - cgroup setup (memory, cpu, pids, io limits) │
│ - seccomp filter installation │
│ - capability dropping │
│ - pivot_root / chroot │
│ - exec container entrypoint │
└────────────────────────────┬────────────────────────────────────┘
│
┌────────────────────────────▼────────────────────────────────────┐
│ Linux Kernel │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ PID NS │ │ NET NS │ │ MNT NS │ │ USER NS │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ UTS NS │ │ IPC NS │ │ CGROUP NS│ │ TIME NS │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ cgroups v2 hierarchy │ │
│ │ /sys/fs/cgroup/<container-id>/ │ │
│ │ memory.max cpu.weight pids.max io.max │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ seccomp BPF filter (syscall filter) │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
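The hand-off between the high-level and low-level runtime in the diagram above is a bundle whose `config.json` names the namespaces, cgroup limits, capabilities, and seccomp policy to apply. The following is a minimal sketch of such a document built in Python. The field names follow the OCI Runtime Specification, but the function name `build_runtime_config`, the capability allowlist, and the three-syscall seccomp allowlist are hypothetical choices for illustration. A real bundle also needs mounts and other settings omitted here.

```python
import json


def build_runtime_config(args, memory_bytes, pids_limit, cpu_quota_us,
                         cpu_period_us=100_000):
    """Return a trimmed-down OCI runtime-spec style config as a dict.

    Illustrative only: mounts, hostname, user mappings, and hooks are omitted.
    """
    allowed_caps = ["CAP_NET_BIND_SERVICE"]  # hypothetical allowlist
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),
            "cwd": "/",
            "noNewPrivileges": True,
            # Every capability set is restricted to the same allowlist.
            "capabilities": {
                name: list(allowed_caps)
                for name in ("bounding", "effective", "permitted")
            },
        },
        "root": {"path": "rootfs", "readonly": False},
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [
                {"type": t}
                for t in ("pid", "network", "mount", "ipc", "uts", "cgroup")
            ],
            # Resource limits the runtime translates into cgroup settings.
            "resources": {
                "memory": {"limit": memory_bytes},
                "pids": {"limit": pids_limit},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
            },
            # Deny by default, then allow a (deliberately tiny) set of syscalls.
            "seccomp": {
                "defaultAction": "SCMP_ACT_ERRNO",
                "syscalls": [
                    {"names": ["read", "write", "exit_group"],
                     "action": "SCMP_ACT_ALLOW"}
                ],
            },
        },
    }


config = build_runtime_config(["/bin/sh"], 256 * 1024 * 1024, 100, 50_000)
print(json.dumps(config, indent=2))
```

Each block of the low-level runtime box corresponds to one subtree of this document: `linux.namespaces` to the clone flags, `linux.resources` to cgroup setup, `linux.seccomp` to the filter, and `process.capabilities` to capability dropping.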
Filesystem Layer Stack (overlayfs):
┌────────────────────────────────────────┐
│ Container RW layer (upperdir) │ ← writes go here
├────────────────────────────────────────┤
│ Image layer N (lowerdir N) │ read-only
├────────────────────────────────────────┤
│ Image layer 2 (lowerdir 2) │ read-only
├────────────────────────────────────────┤
│ Base layer (lowerdir 1) │ read-only
└────────────────────────────────────────┘
merged view presented to container
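The layer stack above can be modeled in a few lines. This is a toy simulation, not how the kernel implements overlayfs: layers are plain dictionaries mapping paths to contents, and the `WHITEOUT` sentinel and the helper names (`merged_view`, `write_file`, `delete_file`, `mount_options`) are hypothetical. It shows two behaviors worth internalizing: writes and deletions only ever touch the upper layer, and the `lowerdir=` mount option lists layers topmost first.

```python
WHITEOUT = object()  # stand-in for an overlayfs whiteout entry


def merged_view(lower_layers, upper):
    """Compute the merged view. lower_layers is ordered base layer first."""
    view = {}
    for layer in [*lower_layers, upper]:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)   # a whiteout hides entries from below
            else:
                view[path] = content   # higher layers shadow lower ones
    return view


def write_file(upper, path, content):
    """Writes land in the upper layer; lower layers are never modified."""
    upper[path] = content


def delete_file(upper, path):
    """Deleting a lower-layer file records a whiteout in the upper layer."""
    upper[path] = WHITEOUT


def mount_options(lower_dirs, upperdir, workdir):
    """Build an overlayfs option string from base-first lower directories.

    overlayfs expects the topmost lower directory first, so the list is reversed.
    """
    lowers = ":".join(reversed(lower_dirs))
    return f"lowerdir={lowers},upperdir={upperdir},workdir={workdir}"


base = {"/etc/hostname": "base", "/bin/sh": "sh-v1"}
layer2 = {"/bin/sh": "sh-v2", "/app/main": "app"}
upper = {}
write_file(upper, "/etc/hostname", "container")
delete_file(upper, "/app/main")
print(merged_view([base, layer2], upper))
print(mount_options(["/layers/base", "/layers/2"], "/upper", "/work"))
```

After the write and the delete, both image layers are unchanged, which is what allows many containers to share the same read-only layers.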
Key Concepts
- Linux Namespaces: Kernel mechanism providing isolated views of global system resources. Eight namespace types: `pid`, `net`, `mnt`, `uts`, `ipc`, `user`, `cgroup`, `time`. Created via `clone(2)` or `unshare(2)`; an existing namespace is joined via `setns(2)`.
- cgroups v1: Per-subsystem hierarchy mounted at `/sys/fs/cgroup/<subsystem>/`. Subsystems are independent trees. Known for complexity and inconsistency between controllers.
- cgroups v2: Unified hierarchy. Single tree at `/sys/fs/cgroup/`. Introduces a delegation model, PSI (pressure stall information), and proper resource distribution across nested groups. The default on most major distributions since the kernel 5.x era.
- seccomp-BPF: Berkeley Packet Filter programs attached to the seccomp syscall filter interface. Filters are evaluated per syscall before dispatch. Docker's default profile blocks roughly 44 syscalls.
- Capabilities: Fine-grained privilege decomposition of root. Containers drop all capabilities except an explicit allowlist (e.g., `CAP_NET_BIND_SERVICE`). User namespaces allow unprivileged processes to hold capabilities scoped to their namespace.
- OCI Image Specification: Defines image index, manifest, config, and layer tarballs. Images are content-addressed (SHA-256 digests). The separate Distribution Spec covers the registry push/pull protocol.
- OCI Runtime Specification: JSON bundle format (`config.json` + rootfs) describing mounts, namespaces, capabilities, hooks, and the process to exec. runc consumes this bundle.
- containerd: Industry-standard container runtime used by Docker and Kubernetes. Manages image lifecycle and snapshots, and delegates to shim processes (containerd-shim) which call runc.
- CRI-O: Lightweight CRI implementation for Kubernetes. Directly implements the Container Runtime Interface without Docker overhead.
- Overlay Filesystem: Stacks read-only lower layers with a read-write upper layer. Copy-on-write: first write to a file copies it to upperdir. Whiteout files represent deletions.
- Rootless Containers: Run the entire container stack without any privileged process. Requires user namespace support. Tools: Podman, rootless Docker, RootlessKit.
- gVisor: Google's user-space kernel (Sentry) intercepts syscalls via ptrace or KVM. Provides strong isolation at the cost of performance overhead and reduced syscall compatibility.
- Kata Containers: Hardware-virtualized containers. Each pod runs in a lightweight VM (QEMU, Cloud Hypervisor, or Firecracker). Combines container UX with a VM isolation boundary.
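To make the capability allowlist idea concrete, the sketch below decodes the hexadecimal bitmask that Linux reports in the `CapEff` field of `/proc/<pid>/status` into capability names. The bit positions follow the kernel's capability numbering; the function name `decode_caps` is hypothetical, and the sample mask is the value commonly observed for Docker's default capability set, used here as an assumption rather than a guarantee for any particular version.

```python
# Capability names indexed by bit position (Linux capability numbers 0..40).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex):
    """Translate a hex capability mask (e.g. a CapEff value) into names.

    Bits beyond the table above are ignored.
    """
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


# Assumed sample: mask typically seen inside a default Docker container.
sample = "00000000a80425fb"
for name in decode_caps(sample):
    print(name)
```

Running this against the `CapEff` line of a containerized process is a quick way to confirm that powerful capabilities such as `CAP_SYS_ADMIN` were actually dropped.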
Major Historical Milestones
| Year | Event |
|---|---|
| 2000 | FreeBSD Jails — first widely-used OS-level virtualization |
| 2004 | Solaris Zones introduced process and resource isolation |
| 2006 | Google engineers submit cgroups (process containers) to Linux kernel |
| 2008 | PID and network namespaces merged, joining the earlier mount (2002), UTS, and IPC (2006) namespaces; LXC created |
| 2008 | cgroups v1 merged into Linux 2.6.24 |
| 2013 | Docker 0.1 released, combining images and a runtime into a single developer tool |
| 2014 | Kubernetes announced by Google |
| 2015 | OCI (Open Container Initiative) founded; runc donated by Docker |
| 2016 | containerd extracted from Docker; user namespaces mature |
| 2016 | gVisor research begins at Google |
| 2017 | OCI Runtime and Image Specifications reach v1.0 |
| 2017 | Kata Containers project starts (merge of Intel Clear Containers + Hyper runV) |
| 2018 | cgroups v2 reaches production readiness (kernel 4.15+); gVisor open-sourced |
| 2019 | Rootless containers become practical |
| 2019–2021 | cgroups v2 becomes default on major distros (Fedora 31 in 2019, Ubuntu 21.10 in 2021) |
| 2021 | Seccomp user-notification API (in the kernel since 5.0) enables user-space syscall handling for containers |
| 2022 | Time namespace (merged earlier, in kernel 5.6) becomes widely adopted |
| 2023 | Widespread adoption of sigstore/cosign for container image supply chain |
Modern Relevance
Containers are the foundational unit of modern cloud-native infrastructure. Every major cloud provider's serverless and container services (AWS ECS/Fargate, GCP Cloud Run, Azure Container Instances) run on these primitives. The security properties of a container deployment depend on where the namespace isolation boundaries lie, how accurately cgroups account for resources, and how much of the syscall attack surface is exposed, so operators need to understand all three.
Performance-sensitive deployments rely on precise cgroup v2 tuning to prevent noisy-neighbor effects. The rise of multi-tenant Kubernetes clusters has pushed the security boundary conversation toward Kata and gVisor for workloads requiring stronger isolation than namespace-based containers provide.
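As a small illustration of what cgroup v2 tuning involves, the sketch below turns human-level limits into the strings written to the `cpu.max`, `memory.max`, and `pids.max` interface files. The function name `cgroup_v2_settings` and its parameters are hypothetical; the value formats (`"<quota> <period>"` in microseconds for `cpu.max`, and the literal `max` for "no limit") follow the cgroup v2 interface.

```python
def cgroup_v2_settings(cpus=None, memory_bytes=None, max_pids=None,
                       period_us=100_000):
    """Map resource limits to cgroup v2 interface file contents.

    A value of None means "unlimited", which cgroup v2 spells as "max".
    """
    def limit(value):
        return "max" if value is None else str(value)

    # cpu.max takes "<quota> <period>": run for quota microseconds per period.
    quota_us = None if cpus is None else round(cpus * period_us)
    return {
        "cpu.max": f"{limit(quota_us)} {period_us}",
        "memory.max": limit(memory_bytes),
        "pids.max": limit(max_pids),
    }


settings = cgroup_v2_settings(cpus=1.5, memory_bytes=512 * 1024**2, max_pids=256)
for filename, value in settings.items():
    print(f"{filename}: {value}")
```

Each value would be written to the file of the same name inside the container's cgroup directory, which requires write access to that part of the hierarchy.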
The supply chain security problem — ensuring container images are unmodified from build to runtime — is an active area with sigstore, SLSA frameworks, and in-toto attestations becoming requirements in regulated industries.
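The integrity primitive underneath these tools is content addressing: a blob is accepted only if its hash matches the digest recorded for it. The sketch below shows that check for a `sha256:<hex>` digest string of the kind used in OCI image manifests. The function name `verify_blob` is hypothetical, and this covers only digest comparison. It does not verify signatures or attestations, which is what sigstore and in-toto add on top.

```python
import hashlib
import hmac


def verify_blob(blob: bytes, expected_digest: str) -> bool:
    """Check a blob against a digest string of the form 'sha256:<hex>'."""
    algorithm, _, expected_hex = expected_digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    actual_hex = hashlib.sha256(blob).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(actual_hex, expected_hex.lower())


layer = b"example layer bytes"
digest = "sha256:" + hashlib.sha256(layer).hexdigest()
print(verify_blob(layer, digest))                # unmodified blob
print(verify_blob(layer + b"tampered", digest))  # modified blob
```

Because the digest is derived from the bytes themselves, any change to a layer between build and runtime changes its digest, so a manifest pinned by digest cannot silently point at altered content.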
File Map
20-containers/
├── 00-overview.md ← this file
├── 01-container-history.md ← jails, zones, LXC, Docker origins
├── 02-linux-namespaces.md ← all 8 namespace types, clone/unshare/setns
├── 03-cgroups-v1.md ← subsystem hierarchy, controller semantics
├── 04-cgroups-v2.md ← unified hierarchy, PSI, delegation
├── 05-seccomp.md ← BPF filters, default profiles, notify API
├── 06-capabilities.md ← capability model, ambient caps, user ns caps
├── 07-container-runtimes.md ← runc, crun, containerd, CRI-O, shim design
├── 08-oci-spec.md ← image spec, runtime spec, distribution spec
├── 09-docker-internals.md ← daemon architecture, build process
├── 10-container-networking.md ← veth pairs, bridge, CNI basics, NetworkPolicy
├── 11-overlay-filesystems.md ← overlayfs mechanics, whiteouts, performance
├── 12-container-security.md ← attack surface, CVEs, hardening guide
├── 13-rootless-containers.md ← user ns mechanics, uidmap, slirp4netns
├── 14-gvisor.md ← sentry architecture, gofer, ptrace vs KVM
└── 15-kata-containers.md ← VM-based containers, agent, VSock, firecracker
Cross-References
- Section 03 (Kernel Fundamentals): System call interface that seccomp filters intercept
- Section 07 (Process Management): Process trees, PID namespace interaction with /proc
- Section 11 (Memory Management): Memory cgroup accounting, OOM killer in cgroup context
- Section 13 (Filesystems): overlayfs, mount namespaces, VFS mount propagation
- Section 15 (Networking): veth pairs, bridges, network namespace plumbing
- Section 19 (Virtualization): Contrast with hypervisors; Kata Containers runs each pod inside a hypervisor-backed VM behind the container interface
- Section 22 (Kubernetes Internals): CRI, CNI plugins, and CSI all consume container primitives
- Section 26 (Security): Capabilities, seccomp, LSM (SELinux/AppArmor) in container context
- Section 27 (Kernel Exploits): Container escape techniques, namespace breakout CVEs