04 — Container Escape Techniques

Overview

Container security is one of the most frequently misunderstood topics in modern infrastructure. The common mental model — "containers provide security isolation" — is dangerously incomplete. Linux namespaces and cgroups provide resource isolation and a separated view of the system, but they are not a security boundary equivalent to a virtual machine hypervisor. A single kernel vulnerability exploited from inside a container gives an attacker the same ring-0 access they would have on a bare metal machine. Even without kernel exploits, misconfigurations expose straightforward escape paths that require no kernel CVE at all.

This document covers the full taxonomy of container escape techniques: from configuration mistakes that give immediate host access, to CVEs that exploit the container runtime itself, to kernel vulnerabilities that work from within an unprivileged container.

Prerequisites

Linux namespaces: pid, net, mnt, user, uts, ipc (see 20-containers/02-linux-namespaces.md)
cgroups v1 and v2 (see 20-containers/03-cgroups.md)
Linux capabilities (CAP_SYS_ADMIN, CAP_NET_ADMIN, etc.)
Basic kernel exploit concepts (see 27-kernel-exploits/01-kernel-exploit-classes.md)
Docker and OCI container runtime concepts

The Container Security Assumption vs. Reality

Common mental model (WRONG):
+------------------+    +------------------+
|   Container A    |    |   Container B    |
|  [isolated]      |    |  [isolated]      |
+------------------+    +------------------+
        |                       |
        +------- wall ----------+
        (containers cannot touch each other
         or the host OS)

Reality:
+------------------+    +------------------+
|   Container A    |    |   Container B    |
|  [user processes]|    |  [user processes]|
+------------------+    +------------------+
        |                       |
        +------+  +------+------+
               |  |
       system calls (unfiltered unless seccomp applied)
               |
+--------------------------------------------+
|           LINUX KERNEL (shared)            |
|  [one bug here = both containers owned]    |
+--------------------------------------------+
               |
        physical hardware

The kernel is shared between all containers and the host. The container runtime (runc, containerd) sets up namespaces and drops capabilities, but the syscall interface to the kernel remains available. Any kernel vulnerability reachable through the syscalls and capabilities the container retains is potentially exploitable from inside it.
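One way to see what the runtime actually left in place is to read a process's status file. The following is a minimal sketch, not a complete audit: it interprets the CapEff and Seccomp fields of /proc/<pid>/status text. The helper names (parse_status, risky_capabilities, seccomp_mode) and the deliberately short capability table are assumptions made for this example.

```python
# Illustrative sketch: interpret CapEff and Seccomp from /proc/<pid>/status.
# Helper names are hypothetical; the capability table is intentionally partial.

# Bit numbers from <linux/capability.h> for a few escape-relevant capabilities.
RISKY_CAPS = {
    2: "CAP_DAC_READ_SEARCH",
    12: "CAP_NET_ADMIN",
    16: "CAP_SYS_MODULE",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

# Values of the Seccomp field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text: str) -> dict:
    """Return the 'Key: value' fields of a /proc/<pid>/status dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def risky_capabilities(cap_eff_hex: str) -> list:
    """Names of the table's capabilities that are set in a CapEff hex mask."""
    mask = int(cap_eff_hex, 16)
    return sorted(name for bit, name in RISKY_CAPS.items() if mask & (1 << bit))


def seccomp_mode(fields: dict) -> str:
    """Human-readable seccomp mode, or 'unknown' if the field is absent."""
    raw = fields.get("Seccomp")
    if raw is None:
        return "unknown"
    return SECCOMP_MODES.get(int(raw), "unknown")
```

Inside a container this could be fed with open("/proc/self/status").read(). A mask such as 00000000a80425fb, commonly seen in default Docker containers, contains none of the capabilities in the table, while a privileged container's mask contains all of them.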

Escape Taxonomy

Container Escape Techniques
|
+-- Configuration Escapes (no CVE needed)
|   +-- Privileged container
|   +-- Host PID/network/IPC namespace
|   +-- Host filesystem mount
|   +-- Docker socket mount
|   +-- Dangerous capabilities (CAP_SYS_ADMIN, etc.)
|
+-- Runtime CVEs
|   +-- CVE-2019-5736 (runc binary overwrite)
|   +-- CVE-2020-15257 (containerd abstract socket)
|   +-- CVE-2019-14271 (Docker cp race)
|
+-- Kernel CVEs from Container
|   +-- CVE-2022-0847 Dirty Pipe (overwrite read-only files)
|   +-- CVE-2016-5195 Dirty COW (write to read-only mapping)
|   +-- User namespace UAF triggers (unprivileged escalation)
|
+-- Orchestration Escapes
    +-- Kubernetes API server access
    +-- SSRF to cloud metadata endpoint
    +-- Exposed kubelet port

Privileged Container Escape

A privileged container (--privileged) is given all Linux capabilities, access to all host devices, and no seccomp or AppArmor confinement. The namespaces are still created, but with this much privilege they no longer contain anything. This is a direct, immediate escape:

# Inside privileged container:
# Step 1: Identify the host disk device
fdisk -l | grep "^/dev/"

# Step 2: Mount the host root filesystem
# (/dev/sda1 is an example; use the device found in step 1)
mkdir -p /host
mount /dev/sda1 /host

# Step 3: You now have full read/write access to host root
ls /host/etc/shadow
chroot /host bash

# Alternative: enter all host namespaces
# (PID 1 is only the host's init if the container also runs with --pid=host)
nsenter --target 1 --mount --uts --ipc --net --pid -- /bin/bash
# ^ Now running in the namespaces of the host's PID 1 == host shell

Detection: docker inspect reports HostConfig.Privileged as true for containers started with --privileged. Policy enforcement: OPA/Gatekeeper or Kyverno admission webhooks that reject privileged pods.
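The configuration escapes in the taxonomy can all be spotted in the JSON that docker inspect prints. The sketch below is a rough illustration under stated assumptions: the function name risky_settings and the short RISKY_CAPS list are made up for this example, and it only looks at HostConfig fields (Privileged, PidMode, NetworkMode, IpcMode, Binds, CapAdd), so mounts declared in other ways are not covered.

```python
# Illustrative sketch: flag escape-prone settings in `docker inspect` JSON.
# risky_settings and RISKY_CAPS are hypothetical names for this example.
import json

RISKY_CAPS = {"SYS_ADMIN", "SYS_PTRACE", "SYS_MODULE", "ALL"}


def risky_settings(inspect_json: str) -> dict:
    """Map container name -> list of findings for a `docker inspect` dump."""
    findings = {}
    for container in json.loads(inspect_json):  # inspect prints a JSON array
        host = container.get("HostConfig") or {}
        issues = []
        if host.get("Privileged"):
            issues.append("privileged")
        for key in ("PidMode", "NetworkMode", "IpcMode"):
            if host.get(key) == "host":
                issues.append(f"{key}=host")
        for bind in host.get("Binds") or []:  # "src:dst[:options]" strings
            source = bind.split(":", 1)[0]
            if source == "/":
                issues.append("host root bind-mounted")
            elif source.endswith("/docker.sock"):
                issues.append("docker socket bind-mounted")
        for cap in host.get("CapAdd") or []:
            name = cap.upper()
            if name.startswith("CAP_"):
                name = name[4:]
            if name in RISKY_CAPS:
                issues.append(f"cap_add {name}")
        if issues:
            findings[container.get("Name", "<unnamed>")] = issues
    return findings
```

A possible use is to pipe the output of docker inspect for all running containers into this function from a periodic audit job and alert on any non-empty result.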

Host PID Namespace Escape

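# Container started with --pid=host:
# /proc/1/root is the host's root filesystem
# because PID 1 is a host process (systemd, init).
# Following /proc/1/root is subject to a ptrace access check, so this
# typically also requires CAP_SYS_PTRACE (or a privileged container).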

ls /proc/1/root/etc/  # host /etc
cat /proc/1/root/etc/shadow  # host /etc/shadow

# If you can write:
cp /proc/1/root/etc/crontab /tmp/crontab_backup
echo "* * * * * root /tmp/exploit.sh" >> /proc/1/root/etc/crontab

Combined with --net=host, this provides full host network access, bypassing all container network policy.

Docker Socket Mount Escape

A container with /var/run/docker.sock mounted has full control over the Docker daemon — and therefore can launch a privileged container:

# Inside container with docker.sock mounted:
docker run --rm -it \
    --privileged \
    -v /:/host \
    ubuntu:20.04 \
    chroot /host bash

This is a complete escape. The Docker socket represents root-equivalent access to the host. Never mount /var/run/docker.sock into containers unless absolutely required and additional controls are in place.
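From the defender's side, a quick self-check inside a container is to look for a Unix socket at the usual Docker socket paths. This is a minimal sketch: the function name docker_socket_exposure and the default candidate paths are assumptions for illustration, and a socket mounted under a different name would be missed.

```python
# Illustrative sketch: detect a Docker socket exposed inside this container.
# docker_socket_exposure and DEFAULT_PATHS are hypothetical names.
import os
import stat

DEFAULT_PATHS = ("/var/run/docker.sock", "/run/docker.sock")


def docker_socket_exposure(candidates=DEFAULT_PATHS) -> list:
    """Return (path, writable) pairs for candidates that are Unix sockets."""
    found = []
    for path in candidates:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            continue  # missing or inaccessible path
        if stat.S_ISSOCK(mode):
            # Connecting to a Unix socket needs write permission on its path.
            found.append((path, os.access(path, os.W_OK)))
    return found
```

Any result with writable set to True means processes in the container can talk to the daemon and should be treated as having root on the host.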

CVE-2019-5736: runc Binary Overwrite

This CVE is technically elegant and alarming: an attacker with root inside a container can overwrite the host's runc binary when a container is started from a malicious image or when docker exec is run against a compromised container, giving persistent code execution as root on the host. Affected: runc through 1.0-rc6, Docker before 18.09.2, and Kubernetes nodes running those runtimes.

Technical Mechanism

When a container starts:
1. The host runc process opens the container's root filesystem.
2. runc executes the container's init process.
3. Briefly, while runc is still running inside the container's namespaces, its /proc/self/exe (visible to container processes as /proc/<runc_pid>/exe) refers to the host's runc binary.

The attack exploits this window:

Host side:                          Container side:

runc starts                         
  |                                 
  | sets up namespaces              
  |                                 
  | execs container init    ------> container init runs
  |                                 |
  |                                 | fd = open("/proc/<runc_pid>/exe", O_RDONLY)
  |                                 | (fd now refers to the host runc binary!)
  |                                 |
  | (brief window: runc is          | Loop: open("/proc/self/fd/<fd>", O_WRONLY)
  |  still running inside the       |   fails with ETXTBSY while runc executes
  |  container's namespaces)        |   succeeds once runc exits
  |                                 |
  | runc exits                      | Write attacker binary through the new fd
  |
  v
host runc binary now contains attacker's payload

Next container start: host executes attacker's binary as root

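The attack requires code running as root inside the container, typically a malicious image whose entrypoint resolves to /proc/self/exe so that runc ends up re-executing itself inside the container. The attacker's process opens /proc/<runc_pid>/exe read-only to obtain a file descriptor for the host's runc binary, then repeatedly reopens that descriptor through /proc/self/fd/<n> for writing. The kernel refuses the write open with ETXTBSY while runc is executing, so the overwrite lands in the short window right after runc exits.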

The Symlink Race

The exploit chains:
1. Container /proc/<runc_pid>/exe -> kernel file reference -> host runc binary
2. open() on /proc/self/fd/<n> from inside the container for writing at just the right moment (immediately after runc exits)
3. Write payload into the open fd (write-to-the-binary-being-executed attack)

Patched by making runc copy its own binary into a sealed in-memory file (memfd_create plus F_ADD_SEALS) and re-execute from that copy, so a container process can only ever reach the throwaway copy and never the binary on the host's disk.

CVE-2020-15257: containerd Abstract Socket

The containerd-shim process exposed its API on an abstract Unix domain socket. Abstract sockets are scoped to the network namespace rather than the filesystem, so file permissions and mount isolation do not protect them: before the fix, the shim's socket was reachable from any container sharing the host's network namespace (for example, one started with --net=host) and running as UID 0.

The impact: such a container could connect to the shim's socket and:
- List all containers
- Start new privileged containers
- Execute commands in other containers

Fix (containerd 1.3.9 and 1.4.3): the shim now listens on a file-based socket under /run/containerd/ with strict permissions, and the abstract namespace binding was removed.
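Abstract sockets visible in a network namespace are listed in /proc/net/unix with a leading "@" in the path column. As a rough check, the sketch below scans that table for abstract names containing a given substring. The function name abstract_sockets and the default needle "containerd-shim" are assumptions for illustration, and the exact socket names used by a given containerd version may differ.

```python
# Illustrative sketch: list abstract Unix sockets from /proc/net/unix text.
# abstract_sockets and its default needle are hypothetical choices.


def abstract_sockets(proc_net_unix: str, needle: str = "containerd-shim") -> list:
    """Return abstract socket names (prefixed with '@') containing needle."""
    names = []
    for line in proc_net_unix.splitlines()[1:]:  # first line is the header
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        parts = line.split(None, 7)
        if len(parts) == 8 and parts[7].startswith("@") and needle in parts[7]:
            names.append(parts[7])
    return names
```

Run against open("/proc/net/unix").read() from a container that shares the host's network namespace, a non-empty result suggests a shim socket is reachable from that container and the runtime should be checked against the fixed versions.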

CVE-2022-0847: Dirty Pipe

Dirty Pipe is a Linux kernel vulnerability discovered by Max Kellermann. It was introduced in Linux 5.8 and fixed in 5.16.11, 5.15.25, and 5.10.102. It allows overwriting the contents of read-only files via pipe splicing.
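The version boundaries can be turned into a first-pass triage check. The sketch below is an assumption-laden helper, not an authoritative scanner: dirty_pipe_affected is a made-up name, it compares upstream version numbers only, and distribution kernels that backported the fix without changing the upstream version will be reported as affected.

```python
# Illustrative sketch: triage a kernel release string for CVE-2022-0847.
# Upstream numbering only; distro backports are NOT detected.
import re

# First fixed patch level on each stable branch that received the fix.
FIXED_IN = {(5, 10): 102, (5, 15): 25, (5, 16): 11}


def dirty_pipe_affected(release: str) -> bool:
    """True if an upstream kernel with this version carries the bug."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) < (5, 8) or (major, minor) > (5, 16):
        return False  # bug introduced in 5.8; 5.17 and later include the fix
    fixed_patch = FIXED_IN.get((major, minor))
    # Branches with no fixed release listed (e.g. 5.9, 5.11) count as affected.
    return fixed_patch is None or patch < fixed_patch
```

A typical call would be dirty_pipe_affected(platform.release()) on each node. A True result is a prompt to check the distribution's advisory, not proof of exposure.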

Technical Mechanism

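The root cause is an uninitialized flags field. When splice() attaches a page cache page to a pipe, the kernel helpers copy_page_to_iter_pipe() and push_pipe() fill in a pipe buffer entry but leave its flags untouched, so a PIPE_BUF_FLAG_CAN_MERGE bit left over from an earlier ordinary write survives. If an attacker splices a read-only file's page into such an entry, that page is treated as mergeable, and a subsequent write to the pipe is appended by pipe_write() directly into the file's page cache page. Every process that reads the file then sees the modified contents until the page is evicted, even though the attacker never had write permission.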

Normal write to pipe:
  pipe buffer -> new page (writable, CAN_MERGE)

Dirty Pipe attack:
  1. Fill the pipe completely (sets CAN_MERGE on every buffer entry)
  2. Drain pipe (but the flag remains on each buffer entry)
  3. splice(readonly_fd, pipe, ...) -- splice read-only file to pipe
     (reuses an existing pipe buffer entry with the stale CAN_MERGE flag)
  4. write(pipe_fd, payload, ...) -- merges into the spliced page!
     NOW THE READ-ONLY FILE'S PAGE CACHE IS MODIFIED

Container Escape via Dirty Pipe

From inside a container:
1. Identify a setuid binary accessible in the container (e.g., /usr/bin/passwd)
2. Use Dirty Pipe to overwrite the setuid binary's content with a small ELF payload or shellcode (the kernel ignores the setuid bit on interpreted scripts)
3. Execute the binary: it now runs as root (setuid bit preserved, content replaced)
4. From root inside the container, use other techniques to access the host

More directly, if the host's /etc/passwd or /etc/cron.d/* is accessible (via volume mount or overlayfs), overwrite it to add a root shell backdoor. Because the page cache is shared with the host, modified pages of files in shared image layers or read-only bind mounts are also seen by the host and by other containers.
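Step 1 above is also what an image-hardening job would do in order to strip unneeded setuid bits. A minimal sketch of that enumeration follows; the function name find_setuid is an assumption for this example, and it reports regular files only.

```python
# Illustrative sketch: enumerate setuid regular files under a directory tree.
# find_setuid is a hypothetical helper name.
import os
import stat


def find_setuid(root: str) -> list:
    """Return sorted paths of regular files under root with the setuid bit."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable
            if stat.S_ISREG(st.st_mode) and st.st_mode & stat.S_ISUID:
                hits.append(path)
    return sorted(hits)
```

Running find_setuid("/") during an image build and removing the bit from everything the workload does not need shrinks the set of targets that a page cache overwrite could turn into a root shell.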

Kernel Exploits from Unprivileged Containers

User namespaces (enabled by default on many distros) allow unprivileged users to create their own namespace environments in which they hold a full set of capabilities, valid only over resources owned by that namespace. This has expanded the attack surface significantly: kernel code that was previously only reachable by root is now reachable by any user via unshare -U.

The Pattern

# Unprivileged user triggers kernel exploit via user namespace:
unshare -Urm  # Create user namespace with uid mapping
# Now have "root" inside the namespace
# Trigger kernel UAF/race/OOB that previously required root

Several high-profile kernel CVEs were exploitable via this path:
- CVE-2021-3492: shiftfs double-free
- CVE-2021-22555: Netfilter heap OOB write
- CVE-2022-1786: io_uring use-after-free

Mitigations: kernel.unprivileged_userns_clone=0 (a sysctl carried by Debian and Ubuntu kernels, not upstream) blocks user namespace creation by unprivileged users, and user.max_user_namespaces=0 disables user namespace creation altogether. However, both settings break many legitimate programs (Firefox, Chrome, Flatpak, bubblewrap sandboxes) that use user namespaces for isolation.
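Whether these mitigations are active on a node can be read back from /proc/sys. The sketch below is a hedged illustration: userns_policy is a hypothetical name, the proc_sys parameter exists only to make the function testable, and the logic assumes a kernel recent enough (4.9 or later) to expose user.max_user_namespaces. Other mechanisms that restrict user namespaces, such as LSM policy, are not considered.

```python
# Illustrative sketch: read the two user-namespace sysctls discussed above.
# userns_policy is a hypothetical helper; proc_sys is overridable for tests.
import os


def userns_policy(proc_sys: str = "/proc/sys") -> dict:
    """Report the sysctl values and whether unprivileged userns look allowed."""

    def read_int(relative_path: str):
        try:
            with open(os.path.join(proc_sys, relative_path)) as fh:
                return int(fh.read().strip())
        except (OSError, ValueError):
            return None  # sysctl not present on this kernel

    max_ns = read_int("user/max_user_namespaces")
    clone = read_int("kernel/unprivileged_userns_clone")
    # Allowed only if the limit is non-zero and the Debian/Ubuntu switch,
    # when it exists, is not set to 0.
    allowed = max_ns is not None and max_ns > 0 and clone != 0
    return {
        "max_user_namespaces": max_ns,
        "unprivileged_userns_clone": clone,
        "unprivileged_userns_likely_allowed": allowed,
    }
```

On hosts that run untrusted workloads, a True value in the last field is a signal to decide deliberately whether the compatibility benefit is worth the extra kernel attack surface.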

Detection and Monitoring

# Falco rules for container escapes:
# Rule: detect nsenter or chroot to host namespace
- rule: Container Escape via nsenter
  desc: nsenter or chroot spawned in a container with an argument containing 1 (noisy; tune before use)
  condition: spawned_process and container and
    proc.name in (nsenter, chroot) and
    proc.args contains "1"  # targeting PID 1
  output: "Possible container escape (proc=%proc.name args=%proc.args)"
  priority: WARNING

# Detect /proc/*/root access
- rule: Access Host Filesystem via /proc/1/root
  desc: A process in a container opened a path under /proc/1/root for reading
  condition: open_read and fd.name startswith "/proc/1/root"
    and container
  output: "Container reading host filesystem via /proc/1/root"
  priority: WARNING

Security Implications

Default Docker is not a security boundary: Docker's default seccomp profile still allows the large majority of syscalls, the default capability set is reduced but not empty, and user namespace remapping is off by default. A determined attacker with code execution in a container has many options.
Kubernetes default pod security: Before Pod Security Admission (PSA, stable in Kubernetes 1.25), many clusters ran pods without security context restrictions. PSA's restricted policy prevents privileged containers and host namespaces, forbids privilege escalation, requires dropping all capabilities (only NET_BIND_SERVICE may be added back), requires a seccomp profile, and requires running as non-root. It does not enforce a read-only root filesystem.
Seccomp profiles reduce attack surface: A strict seccomp profile that denies unneeded syscalls (e.g., unshare, ptrace, perf_event_open) prevents entire classes of kernel exploits. Docker's default seccomp profile blocks ~44 syscalls; a minimal profile blocks many more.
Runtime security (Falco, Tetragon): eBPF-based runtime security tools observe syscalls and process behavior at runtime, providing detection even when prevention fails.
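To check what a given seccomp profile really permits, the JSON profile can be inspected directly. The sketch below assumes the Docker profile layout (a top-level defaultAction plus a syscalls array whose entries carry names, action, and optional args, includes, and excludes) and an allowlist-style profile whose default action denies. The function names unconditionally_allowed and blocked_status are hypothetical.

```python
# Illustrative sketch: which syscalls does an allowlist seccomp profile permit?
# Assumes Docker's JSON profile layout; function names are hypothetical.
import json


def unconditionally_allowed(profile_json: str) -> set:
    """Syscall names allowed with no argument or capability conditions."""
    profile = json.loads(profile_json)
    allowed = set()
    for rule in profile.get("syscalls", []):
        conditional = rule.get("args") or rule.get("includes") or rule.get("excludes")
        if rule.get("action") == "SCMP_ACT_ALLOW" and not conditional:
            allowed.update(rule.get("names", []))
    return allowed


def blocked_status(profile_json: str, syscalls: list) -> dict:
    """Map each syscall to True if it is NOT unconditionally allowed."""
    allowed = unconditionally_allowed(profile_json)
    return {name: name not in allowed for name in syscalls}
```

Calling blocked_status on a candidate profile with a list such as unshare, ptrace, and perf_event_open gives a quick view of whether the syscalls named above are actually restricted. A True value means denied or only conditionally allowed, so conditional rules still need manual review.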

Performance Implications

Container security controls have measurable performance costs:
- Seccomp filters: Each syscall passes through the BPF filter. Modern kernels cache seccomp filter results, reducing overhead to ~1-2% for typical workloads.
- AppArmor/SELinux profiles: Policy checks add latency on file access and socket operations. 2-5% overhead is typical; it can be higher for I/O-heavy workloads.
- User namespace UID remapping: Sub-UID mapping adds overhead to permission checks and filesystem operations. Rootless Docker uses this.

Failure Modes

Over-privileged containers in production: --privileged or CAP_SYS_ADMIN is granted because a developer needed one feature (e.g., loading a kernel module for development) and is then left in production configurations.
Docker-in-Docker (DinD) via socket: CI/CD pipelines mounting /var/run/docker.sock for build jobs create a persistent escape path for any code running in a build job.
Outdated container runtime: runc and containerd have had multiple CVEs. Unpatched runtimes on worker nodes represent a persistent risk. Runtime version should be monitored alongside kernel version in vulnerability management.

Modern Defenses

gVisor (Google): Intercepts syscalls with a user-space kernel written in Go. Provides a much smaller kernel attack surface. Used in Google Cloud Run. Performance overhead: 20-100% on syscall-heavy workloads.
Kata Containers: Each container runs in a lightweight QEMU/Firecracker VM. True hypervisor isolation. Performance overhead: 10-50ms startup, 10-15% runtime vs native.
Firecracker: AWS's microVM monitor for Lambda and Fargate. Full KVM isolation with less than 5 MiB of memory overhead per microVM and boot times of roughly 125 ms. Purpose-built for container-like workloads.
Seccomp + AppArmor + capabilities: Defense in depth with no VM overhead. Not foolproof against kernel CVEs but eliminates configuration escapes and reduces exploit surface.

Exercises

In a local Docker container, verify the escape path is open with a privileged container: docker run --privileged -it ubuntu bash. Inside, mount the host disk and read a file from the host's home directory. Then run the same container without --privileged and observe why the same commands fail.
Set up a container with Docker socket mounted (-v /var/run/docker.sock:/var/run/docker.sock) and demonstrate launching a privileged container from within it. Enumerate what information about other containers is accessible.
Write a Falco rule that detects when a process inside a container attempts to open any /proc/<pid>/exe file, which is the first step of the CVE-2019-5736 attack pattern.
Apply the Dirty Pipe CVE PoC (archived at https://github.com/AlexisAhmed/CVE-2022-0847-DirtyPipe-Exploits) to an unpatched Linux 5.8 VM (use an old kernel in VirtualBox). Verify that a read-only SUID binary can be overwritten. Patch the kernel and verify the fix.
Configure a Kubernetes Pod Security Admission policy (restricted level) on a test cluster and attempt to deploy a privileged pod. Observe the rejection. Then attempt each of: hostPID: true, hostNetwork: true, mounting / as a hostPath volume. Verify each is rejected.
Analyze the Docker default seccomp profile (it is built into the daemon; the JSON source is profiles/seccomp/default.json in the moby repository). Identify 5 syscalls that are blocked and explain what kernel functionality each prevents. Identify 3 syscalls you would additionally block for a web server container.

References

Kellermann, M. (2022). "The Dirty Pipe Vulnerability." https://dirtypipe.cm4all.com/
CVE-2019-5736 write-up: https://blog.dragonsector.pl/2019/02/cve-2019-5736-escape-from-docker-and.html
Trail of Bits: "Understanding Docker container escapes." https://blog.trailofbits.com/2019/07/19/
Kerrisk, M. (2021). "Namespaces in operation" series. LWN.net.
NCC Group: "Understanding and Hardening Linux Containers." Whitepaper.
gVisor security model: https://gvisor.dev/docs/architecture_guide/security/
Falco rules for container security: https://falco.org/docs/rules/
Kubernetes Pod Security Standards: https://kubernetes.io/docs/concepts/security/pod-security-standards/
Azimov, I. "Container escapes in 2022." (BHEU 2022 talk)
CIS Docker Benchmark: https://www.cisecurity.org/benchmark/docker