Section 20: Containers — Overview

Section Purpose and Scope

This section provides a deep technical examination of container technology, covering the Linux kernel primitives that make containers possible, the layered software stack from OCI specification through high-level runtimes, and the security model that governs isolation. Containers are not a new OS feature — they are a compositional use of several independent kernel mechanisms that together produce an isolated, portable execution environment. Understanding containers at this level is essential for anyone designing secure, high-performance container infrastructure.


Prerequisites

  • Section 03: Kernel Fundamentals (system calls, kernel/user boundary)
  • Section 04: Kernel Architecture (VFS, network stack overview)
  • Section 07: Process Management (fork, exec, process trees, /proc)
  • Section 11: Memory Management (virtual memory, page tables)
  • Section 13: Filesystems (VFS, mount namespaces, overlay filesystems)
  • Section 15: Networking (network stack, interfaces, routing)
  • Section 19: Virtualization (hypervisor contrast with containers)

Learning Objectives

By the end of this section you will be able to:

  1. Explain each Linux namespace type and what resource it isolates.
  2. Describe the cgroups v1 and v2 hierarchy and resource accounting model.
  3. Trace the lifecycle of a container from docker run through runc to a running process tree.
  4. Articulate the OCI Image Specification and Runtime Specification.
  5. Explain overlay filesystem construction and the role of layers.
  6. Analyze the container attack surface and enumerate the kernel mitigations.
  7. Compare rootless containers, gVisor, and Kata Containers as security posture tradeoffs.
  8. Debug container networking using low-level tools (ip netns, veth pairs, iptables).

Architecture Overview

  ┌─────────────────────────────────────────────────────────────────┐
  │                        User Request                             │
  │                    docker run / kubectl                         │
  └────────────────────────────┬────────────────────────────────────┘
                               │
  ┌────────────────────────────▼────────────────────────────────────┐
  │               High-Level Container Runtime                      │
  │          Docker Daemon / containerd / CRI-O                     │
  │   - Image pull & verify (OCI Image Spec)                        │
  │   - Snapshot management (overlayfs layers)                      │
  │   - CRI gRPC interface (for Kubernetes)                         │
  └────────────────────────────┬────────────────────────────────────┘
                               │  OCI Runtime Spec bundle
  ┌────────────────────────────▼────────────────────────────────────┐
  │               Low-Level Container Runtime                       │
  │                     runc / crun / youki                         │
  │   - clone() with namespace flags                                │
  │   - cgroup setup (memory, cpu, pids, io limits)                 │
  │   - seccomp filter installation                                 │
  │   - capability dropping                                         │
  │   - pivot_root / chroot                                         │
  │   - exec container entrypoint                                   │
  └────────────────────────────┬────────────────────────────────────┘
                               │
  ┌────────────────────────────▼────────────────────────────────────┐
  │                     Linux Kernel                                │
  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
  │  │  PID NS  │ │  NET NS  │ │  MNT NS  │ │    USER NS       │  │
  │  └──────────┘ └──────────┘ └──────────┘ └──────────────────┘  │
  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐  │
  │  │  UTS NS  │ │  IPC NS  │ │ CGROUP NS│ │    TIME NS       │  │
  │  └──────────┘ └──────────┘ └──────────┘ └──────────────────┘  │
  │                                                                 │
  │  ┌─────────────────────────────────────────────────────────┐   │
  │  │                 cgroups v2 hierarchy                    │   │
  │  │   /sys/fs/cgroup/<container-id>/                        │   │
  │  │   memory.max  cpu.weight  pids.max  io.max              │   │
  │  └─────────────────────────────────────────────────────────┘   │
  │                                                                 │
  │  ┌─────────────────────────────────────────────────────────┐   │
  │  │            seccomp BPF filter (syscall filter)          │   │
  │  └─────────────────────────────────────────────────────────┘   │
  └─────────────────────────────────────────────────────────────────┘
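
The "clone() with namespace flags" step in the low-level runtime box amounts to OR-ing together one flag per requested namespace. The sketch below shows only that bitmask computation. The flag values are the CLONE_NEW* constants from the Linux UAPI header <linux/sched.h>; the helper name clone_flags_for and the choice of OCI-style namespace type names as dictionary keys are illustrative assumptions, not runc's actual code.

```python
# CLONE_NEW* values as defined in the Linux UAPI header <linux/sched.h>.
# Keys use OCI-style namespace type names (an illustrative choice).
CLONE_FLAGS = {
    "time": 0x00000080,     # CLONE_NEWTIME
    "mount": 0x00020000,    # CLONE_NEWNS
    "cgroup": 0x02000000,   # CLONE_NEWCGROUP
    "uts": 0x04000000,      # CLONE_NEWUTS
    "ipc": 0x08000000,      # CLONE_NEWIPC
    "user": 0x10000000,     # CLONE_NEWUSER
    "pid": 0x20000000,      # CLONE_NEWPID
    "network": 0x40000000,  # CLONE_NEWNET
}


def clone_flags_for(namespaces):
    """OR together the CLONE_NEW* flag for each requested namespace type."""
    flags = 0
    for name in namespaces:
        if name not in CLONE_FLAGS:
            raise ValueError(f"unknown namespace type: {name!r}")
        flags |= CLONE_FLAGS[name]
    return flags


if __name__ == "__main__":
    # A typical non-user-namespaced container requests these five.
    print(hex(clone_flags_for(["pid", "network", "mount", "uts", "ipc"])))
```

A real runtime passes a mask like this to clone(2) or unshare(2) and then continues with cgroup setup, seccomp, and pivot_root; none of those steps are shown here.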

  Filesystem Layer Stack (overlayfs):
  ┌────────────────────────────────────────┐
  │         Container RW layer (upperdir)  │  ← writes go here
  ├────────────────────────────────────────┤
  │         Image layer N (lowerdir N)     │  read-only
  ├────────────────────────────────────────┤
  │         Image layer 2 (lowerdir 2)     │  read-only
  ├────────────────────────────────────────┤
  │         Base layer (lowerdir 1)        │  read-only
  └────────────────────────────────────────┘
             merged view presented to container
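
The layer stack above maps onto a single overlayfs mount. A minimal sketch of building the mount option string follows, assuming layers are supplied base-first as in the diagram. overlayfs lists lowerdir entries with the topmost layer first, so the list is reversed. The function name and all paths are hypothetical.

```python
def overlay_mount_options(image_layers, upperdir, workdir):
    """Build the option string for an overlayfs mount.

    image_layers is ordered base-first (layer 1 ... layer N), matching the
    diagram. overlayfs expects lowerdir entries top-first, so the list is
    reversed before joining with ':'.
    """
    if not image_layers:
        raise ValueError("at least one lower layer is required")
    lowerdir = ":".join(reversed(image_layers))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


if __name__ == "__main__":
    opts = overlay_mount_options(
        ["/layers/base", "/layers/2", "/layers/3"],  # hypothetical paths
        upperdir="/ctr/upper",
        workdir="/ctr/work",
    )
    # Would be used as: mount -t overlay overlay -o <opts> /ctr/merged
    print(opts)
```

The workdir must be an empty directory on the same filesystem as upperdir. This sketch does not escape paths that contain ':' or ','.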

Key Concepts

  • Linux Namespaces: Kernel mechanism that gives a process an isolated view of a global system resource. There are eight namespace types: pid, net, mnt, uts, ipc, user, cgroup, and time. New namespaces are created with clone(2) or unshare(2); setns(2) joins an existing one.
  • cgroups v1: Per-subsystem hierarchy mounted at /sys/fs/cgroup/<subsystem>/. Subsystems are independent trees. Known for complexity and inconsistency between controllers.
  • cgroups v2: Unified hierarchy. Single tree at /sys/fs/cgroup/. Introduces a delegation model, PSI (pressure stall information), and consistent resource distribution across nested groups. Now the default on major distributions (for example Fedora 31+ and Ubuntu 21.10+).
  • seccomp-BPF: Classic BPF (Berkeley Packet Filter) programs attached through the seccomp filter interface. The filter runs on every syscall before the kernel dispatches it and returns an action such as allow, fail with an error, or kill. The default Docker profile blocks roughly 44 of the 300+ syscalls.
  • Capabilities: Fine-grained privilege decomposition of root. Containers drop all capabilities except an explicit allowlist (e.g., CAP_NET_BIND_SERVICE). User namespaces allow unprivileged processes to hold capabilities scoped to their namespace.
  • OCI Image Specification: Defines image index, manifest, config, and layer tarballs. Images are content-addressed (SHA-256 digests). Distribution Spec covers registry push/pull protocol.
  • OCI Runtime Specification: JSON bundle format (config.json + rootfs) describing mounts, namespaces, capabilities, hooks, and process to exec. runc consumes this bundle.
  • containerd: Industry-standard container runtime used by Docker and Kubernetes. Manages image lifecycle, snapshots, and delegates to shim processes (containerd-shim) which call runc.
  • CRI-O: Lightweight CRI implementation for Kubernetes. Directly implements the Container Runtime Interface without Docker overhead.
  • Overlay Filesystem: Stacks read-only lower layers with a read-write upper layer. Copy-on-write: first write to a file copies it to upperdir. Whiteout files represent deletions.
  • Rootless Containers: Run the entire container stack as an unprivileged user, with no root daemon. Requires user namespace support. Tools: Podman, rootless Docker, rootlesskit.
  • gVisor: Google's user-space kernel (the Sentry) intercepts application syscalls via ptrace or KVM and services them itself instead of passing them to the host kernel. Provides strong isolation at the cost of performance overhead and reduced syscall compatibility.
  • Kata Containers: Hardware-virtualized containers. Each pod runs in a lightweight VM (QEMU, Cloud Hypervisor, or Firecracker). Combines the container UX with a VM isolation boundary.
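
Several of the concepts above (namespaces, capabilities, cgroup limits, the rootfs) meet in the OCI runtime bundle's config.json. The sketch below builds a pared-down configuration as a Python dict to show how the pieces fit together. The field names follow the OCI Runtime Specification, but the function name, the chosen values, and the single-capability allowlist are illustrative assumptions. A bundle that actually runs a workload normally also needs entries such as mounts for /proc, which are omitted here.

```python
import json


def minimal_runtime_config(args, memory_limit_bytes, pids_limit):
    """Return a pared-down OCI runtime config as a dict (illustrative only)."""
    allowed_caps = ["CAP_NET_BIND_SERVICE"]  # explicit allowlist; all else dropped
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),  # the container entrypoint
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
            "capabilities": {
                "bounding": allowed_caps,
                "effective": allowed_caps,
                "permitted": allowed_caps,
            },
            "noNewPrivileges": True,
        },
        # Path is relative to the bundle directory that holds config.json.
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            # One new namespace of each listed type for the container.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "mount", "ipc", "uts")
            ],
            # Translated by the runtime into cgroup settings.
            "resources": {
                "memory": {"limit": memory_limit_bytes},
                "pids": {"limit": pids_limit},
            },
        },
    }


if __name__ == "__main__":
    config = minimal_runtime_config(["/bin/sh"], 256 * 1024 * 1024, 128)
    print(json.dumps(config, indent=2))
```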

Major Historical Milestones

Year Event
2000 FreeBSD Jails — first widely-used OS-level virtualization
2004 Solaris Zones introduced process and resource isolation
2006 Google engineers submit cgroups (process containers) to Linux kernel
2008 PID and initial network namespace support merged (kernel 2.6.24), joining the earlier mount (2002) and UTS/IPC (2006) namespaces; LXC created
2008 cgroups v1 merged into Linux 2.6.24
2013 Docker 0.1 released: images and runtime combined into a single developer tool
2014 Kubernetes announced by Google
2015 OCI (Open Container Initiative) founded; runc donated by Docker
2016 containerd extracted from Docker; user namespaces mature
2016 gVisor research begins at Google
2017 Kata Containers project starts (merge of Intel Clear Containers + Hyper runV); OCI Image and Runtime Specifications reach v1.0
2018 cgroups v2 reaches production readiness (CPU controller merged in kernel 4.15); gVisor open-sourced
2019 Rootless containers become practical; Fedora 31 ships with cgroups v2 as the default
2020 Time namespace merged (kernel 5.6)
2021 Ubuntu 21.10 defaults to cgroups v2; seccomp notify API (in the kernel since 5.0) is adopted for user-space syscall handling in containers
2023 Widespread adoption of sigstore/cosign for container image supply chain

Modern Relevance

Containers are the foundational unit of modern cloud-native infrastructure. The container and serverless services of every major cloud provider (AWS ECS/Fargate, GCP Cloud Run, Azure Container Instances) build on these primitives, in some cases behind an additional sandbox or VM boundary. The security of a container deployment depends on where the namespace isolation boundaries lie, how accurately cgroups account for resources, and how much of the syscall attack surface is exposed.

Performance-sensitive deployments rely on precise cgroup v2 tuning to prevent noisy-neighbor effects. The rise of multi-tenant Kubernetes clusters has pushed the security boundary conversation toward Kata and gVisor for workloads requiring stronger isolation than namespace-based containers provide.
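
Noisy-neighbor tuning usually starts with measurement, and cgroups v2 exposes PSI data per cgroup in files such as memory.pressure, cpu.pressure, and io.pressure. The sketch below parses the text format of such a file. The function name parse_psi and the sample values are assumptions for illustration.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {"some": {...}, "full": {...}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    avg* values are percentages of time stalled; total is in microseconds.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        kind, fields = parts[0], parts[1:]
        metrics = {}
        for field in fields:
            key, _, value = field.partition("=")
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


if __name__ == "__main__":
    # Sample values are made up; on a real host, read e.g.
    # /sys/fs/cgroup/<container-id>/memory.pressure instead.
    sample = (
        "some avg10=1.50 avg60=0.75 avg300=0.10 total=12345\n"
        "full avg10=0.50 avg60=0.25 avg300=0.00 total=6789\n"
    )
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], psi["full"]["total"])
```

A sustained non-zero "full" value for memory suggests that every task in the cgroup was stalled at the same time, which is a common signal that memory.max is set too tightly for the workload.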

The supply chain security problem — ensuring container images are unmodified from build to runtime — is an active area with sigstore, SLSA frameworks, and in-toto attestations becoming requirements in regulated industries.
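
The most basic integrity check in this chain is content addressing: a blob is accepted only if its SHA-256 digest matches the digest named in the manifest. The sketch below shows that check in isolation. The function name verify_digest is hypothetical, and signature verification (as done by tools like cosign) is a separate step that is not shown.

```python
import hashlib


def verify_digest(blob: bytes, expected: str) -> bool:
    """Check a blob against an OCI-style digest string "sha256:<hex>"."""
    algorithm, sep, hex_digest = expected.partition(":")
    if sep != ":" or algorithm != "sha256":
        raise ValueError(f"unsupported digest format: {expected!r}")
    actual = hashlib.sha256(blob).hexdigest()
    return actual == hex_digest


if __name__ == "__main__":
    layer = b"example layer bytes"  # stand-in for a layer tarball
    digest = "sha256:" + hashlib.sha256(layer).hexdigest()
    print(verify_digest(layer, digest))         # matches
    print(verify_digest(layer + b"x", digest))  # tampered content
```

A digest match proves only that the bytes are the ones the manifest refers to. Whether the manifest itself is trustworthy is what signatures and attestations address.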


File Map

20-containers/
├── 00-overview.md                  ← this file
├── 01-container-history.md         ← jails, zones, LXC, Docker origins
├── 02-linux-namespaces.md          ← all 8 namespace types, clone/unshare/setns
├── 03-cgroups-v1.md                ← subsystem hierarchy, controller semantics
├── 04-cgroups-v2.md                ← unified hierarchy, PSI, delegation
├── 05-seccomp.md                   ← BPF filters, default profiles, notify API
├── 06-capabilities.md              ← capability model, ambient caps, user ns caps
├── 07-container-runtimes.md        ← runc, crun, containerd, CRI-O, shim design
├── 08-oci-spec.md                  ← image spec, runtime spec, distribution spec
├── 09-docker-internals.md          ← daemon architecture, build process
├── 10-container-networking.md      ← veth pairs, bridge, CNI basics, NetworkPolicy
├── 11-overlay-filesystems.md       ← overlayfs mechanics, whiteouts, performance
├── 12-container-security.md        ← attack surface, CVEs, hardening guide
├── 13-rootless-containers.md       ← user ns mechanics, uidmap, slirp4netns
├── 14-gvisor.md                    ← sentry architecture, gofer, ptrace vs KVM
└── 15-kata-containers.md           ← VM-based containers, agent, VSock, firecracker

Cross-References

  • Section 03 (Kernel Fundamentals): System call interface that seccomp filters intercept
  • Section 07 (Process Management): Process trees, PID namespace interaction with /proc
  • Section 11 (Memory Management): Memory cgroup accounting, OOM killer in cgroup context
  • Section 13 (Filesystems): overlayfs, mount namespaces, VFS mount propagation
  • Section 15 (Networking): veth pairs, bridges, network namespace plumbing
  • Section 19 (Virtualization): Contrast with hypervisors; Kata Containers places a hypervisor boundary underneath the container interface
  • Section 22 (Kubernetes Internals): CRI interface, CNI plugins, CSI — all consume container primitives
  • Section 26 (Security): Capabilities, seccomp, LSM (SELinux/AppArmor) in container context
  • Section 27 (Kernel Exploits): Container escape techniques, namespace breakout CVEs