Container Security

Technical Overview

Container security is a layered discipline. The fundamental insight is this: namespaces provide isolation, not security. Without a user namespace, a process running as root inside a container is still root from the kernel's perspective. If it can exploit a kernel vulnerability, it can escape the container. Robust container security therefore requires multiple independent layers (defense in depth), each reducing the attack surface so that bypassing any single layer is insufficient for a full compromise.

The security model decomposes into: who runs the process (UID/GID, user namespaces), what the process can do (capabilities, seccomp, AppArmor/SELinux), what the process can access (read-only rootfs, volume restrictions), and how the runtime is configured (privileged mode, no-new-privileges). Understanding each layer, its guarantees, and its limits is essential for secure container deployments.

Prerequisites

Linux namespaces (section 01), especially user namespaces
Linux capabilities model (capabilities(7))
System call basics, BPF fundamentals
Linux process credentials (real/effective UID, saved-set-UID)
Understanding of kernel privilege escalation paths

Historical Context

Early container deployments (Docker's first releases, 2013) ran containers as root with minimal capability restrictions and no seccomp filtering. The ecosystem has progressively hardened defaults:

2014: Docker began dropping capabilities by default (dropped CAP_SYS_ADMIN and others)
2016: Docker 1.10 added a default seccomp profile (blocks ~44 syscalls)
2016: AppArmor profile generation by default on Ubuntu/Debian
2016: Docker 1.10 added user namespace remapping (userns-remap); rootless mode followed as an experimental feature in Docker 19.03 (2019)
2019: CVE-2019-5736 (runc overwrite) demonstrated that container root, when it is also host root, can overwrite the host's runc binary; this accelerated adoption of rootless containers
2020: Rootless containerd and Podman became production-ready
2022: CVE-2022-0185 showed kernel heap overflows reachable from containers

The Container Security Model

Security Layer Stack (outermost = weakest single layer)
┌───────────────────────────────────────────────────────────┐
│  Orchestrator policies (RBAC, PodSecurity, OPA/Gatekeeper)│  ← admission control
├───────────────────────────────────────────────────────────┤
│  User Namespaces                                           │  ← UID remapping
│  (container root = unprivileged host UID)                  │
├───────────────────────────────────────────────────────────┤
│  Linux Capabilities                                        │  ← drop unnecessary caps
│  (drop all, add back only required)                        │
├───────────────────────────────────────────────────────────┤
│  Seccomp-BPF                                               │  ← syscall filtering
│  (block dangerous syscall surface)                         │
├───────────────────────────────────────────────────────────┤
│  AppArmor / SELinux profiles                               │  ← MAC policy
│  (file, network, capability rules)                         │
├───────────────────────────────────────────────────────────┤
│  Read-only rootfs + no-new-privileges                      │  ← filesystem/exec control
└───────────────────────────────────────────────────────────┘
     Each layer independently limits attacker capabilities.
     Bypassing one layer does NOT automatically grant full access.

Layer 1: User Namespaces

User namespaces map UID/GID inside the container to different (typically unprivileged) UIDs on the host:

Inside container     Host
UID 0 (root)    →   UID 100000
UID 1000        →   UID 101000
GID 0           →   GID 100000

This mapping is configured in /proc/PID/uid_map and /proc/PID/gid_map.
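The translation can be sketched in a few lines. This is a minimal illustration, assuming the three-column uid_map format (first ID inside the namespace, first ID outside, range length); the function names and the sample mapping are hypothetical and chosen to match the table above.

```python
from typing import List, Optional, Tuple

# One uid_map/gid_map line: <first ID inside ns> <first ID outside ns> <range length>
IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of a uid_map or gid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        entries.append((inside, outside, length))
    return entries


def to_host_id(id_map: IdMap, container_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the outside ID, or None if unmapped."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Hypothetical mapping: 65536 IDs starting at host UID 100000.
SAMPLE_UID_MAP = "         0     100000      65536\n"

if __name__ == "__main__":
    id_map = parse_id_map(SAMPLE_UID_MAP)
    print(to_host_id(id_map, 0))      # 100000
    print(to_host_id(id_map, 1000))   # 101000
    print(to_host_id(id_map, 70000))  # None (outside the mapped range)
```

On a real system the text would come from reading /proc/PID/uid_map for the container's init process; an ID that falls outside every range has no host identity at all.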

Significance: Even if an attacker achieves UID 0 inside the container (e.g., via a setuid binary or capability escalation inside the container), on the host they are UID 100000 — an unprivileged user. Many host resources (device files, sensitive directories, other namespaces) are not accessible to UID 100000.

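How this limits privilege escalation: Many container escapes depend on container root being host root, for example overwriting a host binary or abusing a mounted host path. With user namespaces, the attacker's host UID is already unprivileged, so those escapes yield much less power. UID remapping does not stop a kernel exploit that achieves arbitrary kernel code execution, and user namespaces expose additional kernel code paths to unprivileged users (see CVE-2022-0185 below), which is why the other layers are still needed.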

Rootless containers: The entire container stack runs as an unprivileged user. Podman and rootless containerd (with nerdctl) implement this. No root is required anywhere in the stack: container UID 0 maps to the invoking user's own UID, and network connectivity uses a userspace network stack (slirp4netns).

Layer 2: Linux Capabilities

Traditional Unix uses a binary root/non-root model. Linux capabilities divide root privileges into ~40 distinct capabilities, each independently grantable or revocable.

Docker Default Capability Set

Docker drops these capabilities from the default set (the container does NOT have them unless explicitly added). The list covers the most relevant ones and is not exhaustive:

Dropped by default:
CAP_AUDIT_CONTROL    - configure kernel audit
CAP_AUDIT_READ       - read audit log
CAP_BLOCK_SUSPEND    - prevent system suspend
CAP_DAC_READ_SEARCH  - bypass DAC read/search
CAP_IPC_LOCK         - lock memory
CAP_MAC_ADMIN        - MAC policy administration
CAP_MAC_OVERRIDE     - MAC policy override
CAP_NET_ADMIN        - network configuration
CAP_SYS_ADMIN        - wide-ranging admin (mount, ioctl, etc.)
CAP_SYS_BOOT         - reboot
CAP_SYS_MODULE       - load/unload kernel modules
CAP_SYS_NICE         - adjust process priorities
CAP_SYS_PTRACE       - trace arbitrary processes
CAP_SYS_RAWIO        - raw I/O port access
CAP_SYS_RESOURCE     - override resource limits
CAP_SYS_TIME         - set system clock
CAP_SYS_TTY_CONFIG   - configure TTY devices
CAP_WAKE_ALARM       - set timers that wake the system from suspend

Retained by default (containers have these):
CAP_AUDIT_WRITE      - write to audit log
CAP_CHOWN            - change file ownership
CAP_DAC_OVERRIDE     - bypass file permission checks
CAP_FOWNER           - bypass ownership checks on file operations
CAP_FSETID           - keep setuid/setgid bits when a file is modified
CAP_KILL             - send signals to processes
CAP_MKNOD            - create device files
CAP_NET_BIND_SERVICE - bind ports <1024
CAP_NET_RAW          - use raw sockets (needed for ping)
CAP_SETFCAP          - set file capabilities
CAP_SETGID           - switch GIDs
CAP_SETPCAP          - modify process capability sets
CAP_SETUID           - switch UIDs
CAP_SYS_CHROOT       - use chroot

Best practice: Start with --cap-drop=ALL and add back only what the application needs:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx

CAP_SYS_ADMIN is the most dangerous capability to grant — it is essentially a second root. It allows mounting filesystems, manipulating namespaces, loading eBPF programs (pre-5.8), accessing kernel memory via perf_event_open, and many other sensitive operations.
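The capability sets of a running process are exposed as hexadecimal bitmasks (CapEff, CapPrm, CapBnd) in /proc/PID/status. The following is a minimal sketch of decoding such a mask into capability names, assuming the bit numbering from linux/capability.h. The sample mask 0xa80425fb is an assumption: it is the value commonly reported for a root process under Docker's default capability set.

```python
# Capability names indexed by bit number, following linux/capability.h.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask: int) -> set:
    """Return the names of all capabilities whose bit is set in mask."""
    return {name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1}


# Assumed sample: CapEff of a root process under Docker's default settings.
SAMPLE_DEFAULT_MASK = 0x00000000A80425FB

if __name__ == "__main__":
    caps = decode_caps(SAMPLE_DEFAULT_MASK)
    print(sorted(caps))
    print("CAP_SYS_ADMIN present:", "CAP_SYS_ADMIN" in caps)
```

The same breakdown is available on the command line with capsh --decode=00000000a80425fb; the sketch is mainly useful when auditing many processes from a script.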

Layer 3: Seccomp-BPF

Seccomp (Secure Computing Mode) in BPF mode filters system calls based on a BPF program evaluated at each syscall entry. Docker installs a default seccomp profile (JSON format, ~400 lines) that blocks approximately 44 syscalls.

Key syscalls blocked by Docker default profile

ptrace          - process tracing (blocked on kernels older than 4.8, where it could be used to bypass seccomp)
mount           - mounting filesystems (would bypass namespace isolation)
kexec_load      - load new kernel
reboot          - reboot host
swapoff/swapon  - manage swap
syslog          - access kernel log (information disclosure)
unshare         - create new namespaces (denied without CAP_SYS_ADMIN)
clone (with namespace flags such as CLONE_NEWUSER) - create new namespaces (denied without CAP_SYS_ADMIN)
add_key/keyctl  - kernel keyring (historical exploit surface)
request_key     - kernel keyring
acct            - process accounting
bpf (partially) - eBPF loading restricted

Seccomp profile structure

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {
      "names": ["accept", "accept4", "access", "adjtimex", ...],
      "action": "SCMP_ACT_ALLOW"
    },
    {
      "names": ["clone"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,
          "value": 2114060288,
          "op": "SCMP_CMP_MASKED_EQ"
        }
      ]
    }
  ]
}

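The defaultAction of SCMP_ACT_ERRNO makes any syscall that is not listed return an error (EPERM by default), which is an allowlist approach. Syscalls listed with SCMP_ACT_ALLOW are permitted. Some syscalls have argument-level filtering: the clone rule above uses SCMP_CMP_MASKED_EQ to permit clone only when none of the namespace-creation flags in the mask (CLONE_NEWUSER among them) are set.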
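The value 2114060288 in the clone rule is easier to read once decomposed. The sketch below, assuming the CLONE_* constants from linux/sched.h and assuming that the omitted comparison value (valueTwo) is 0, shows which flags the mask covers and how a masked-equality check would evaluate; the helper names are hypothetical.

```python
# Namespace-creation flags from linux/sched.h.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,
    "CLONE_NEWCGROUP": 0x02000000,
    "CLONE_NEWUTS": 0x04000000,
    "CLONE_NEWIPC": 0x08000000,
    "CLONE_NEWUSER": 0x10000000,
    "CLONE_NEWPID": 0x20000000,
    "CLONE_NEWNET": 0x40000000,
}

PROFILE_MASK = 2114060288  # "value" in the clone rule (0x7E020000)
SIGCHLD = 17               # typical low bits of a fork-style clone() call


def flags_in_mask(mask: int) -> list:
    """Names of the namespace flags covered by mask."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if mask & bit)


def masked_eq(arg: int, mask: int, expected: int = 0) -> bool:
    """SCMP_CMP_MASKED_EQ semantics: (arg & mask) == expected."""
    return (arg & mask) == expected


if __name__ == "__main__":
    print(flags_in_mask(PROFILE_MASK))
    # A plain fork-like clone matches the rule and is allowed.
    print(masked_eq(SIGCHLD, PROFILE_MASK))
    # Requesting a new user namespace does not match, so the default action applies.
    print(masked_eq(CLONE_FLAGS["CLONE_NEWUSER"] | SIGCHLD, PROFILE_MASK))
```

The mask is exactly the union of the seven namespace flags, so any clone call that requests a new namespace of any kind fails the comparison and falls through to SCMP_ACT_ERRNO.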

Seccomp overhead: Each syscall traverses the BPF filter. On modern hardware with JIT-compiled BPF, overhead is 2-10ns per syscall — negligible for most workloads but measurable for syscall-heavy code.

Custom profiles: Applications that need unusual syscalls (e.g., perf_event_open for profiling, bpf for observability) require custom profiles. Tools like oci-seccomp-bpf-hook can auto-generate profiles by tracing a container's actual syscall usage.
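Once a list of the syscalls an application actually uses is available (from a tracing tool or from manual analysis), turning it into an allowlist profile is mechanical. A minimal sketch, assuming the same JSON layout as the profile shown above; the function name and the sample syscall list are hypothetical, and a real workload needs a far longer list.

```python
import json
from typing import Iterable


def build_allowlist_profile(observed: Iterable[str]) -> dict:
    """Build a seccomp profile that allows only the observed syscalls."""
    names = sorted(set(observed))  # de-duplicate and keep output stable
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Hypothetical trace output; duplicates are expected in raw traces.
    traced = ["read", "write", "openat", "close", "read", "exit_group"]
    print(json.dumps(build_allowlist_profile(traced), indent=2))
```

The resulting file can be passed to Docker with --security-opt seccomp=profile.json. An allowlist built from a single trace tends to miss rarely used code paths, so it is worth exercising error handling and shutdown paths before enforcing it.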

Layer 4: AppArmor and SELinux

Mandatory Access Control (MAC) systems enforce policy independently of user/capability checks — even root is subject to MAC policy.

AppArmor (Ubuntu/Debian default)

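Docker generates and loads an AppArmor profile named docker-default and applies it to each container by default on Ubuntu. The profile restricts:
- File access: denies writes to /proc/sys/ and /sys/ (with exceptions for safe entries)
- Mount operations: denied
- Signals: limits sending signals to processes outside the container
Network access and capabilities are allowed wholesale by the profile (the network and capability rules in the excerpt below), so raw sockets are governed by CAP_NET_RAW and seccomp rather than by AppArmor.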

# Excerpt from docker-default AppArmor profile
profile docker-default flags=(attach_disconnected,mediate_deleted) {
  network,
  capability,
  file,
  umount,
  # deny writes to kernel interfaces
  deny @{PROC}/* w,
  deny @{PROC}/{[^1-9],[^1-9][^0-9],...}/** w,
  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  /sys/fs/cgroup/** rw,
  ...
}

SELinux (RHEL/CentOS/Fedora default)

SELinux uses type enforcement — each process gets a type label, each file/resource gets a type label, and policy rules define which process types can access which resource types.

Containers running under SELinux get the container_t type. Host resources that containers should not access have incompatible types, and the SELinux policy denies those accesses even if the process is UID 0.

The --security-opt label=type:my_custom_t flag allows specifying a custom SELinux context for fine-grained container isolation.

Layer 5: Read-Only Rootfs and No-New-Privileges

Read-Only Rootfs

docker run --read-only nginx

Mounts the container's root filesystem read-only. The container process cannot write to its rootfs — it cannot modify binaries, install tools, or write logs to the filesystem. This limits persistence after an exploit and prevents many attack patterns that rely on dropping files.

Combine with --tmpfs /tmp:noexec,nosuid to allow temporary file creation without permitting execution of attacker-written code.
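To confirm from inside a container that the root filesystem is read-only and that /tmp carries noexec and nosuid, the mount options can be read from /proc/mounts. A minimal sketch follows; SAMPLE_MOUNTS is a hypothetical, shortened stand-in for real /proc/mounts content, and the function name is illustrative.

```python
def mount_options(mounts_text: str, mount_point: str) -> set:
    """Return the option set of the last mount entry for mount_point."""
    options = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = set(fields[3].split(","))  # later entries shadow earlier ones
    return options


# Hypothetical sample resembling a container started with
# --read-only --tmpfs /tmp:noexec,nosuid
SAMPLE_MOUNTS = (
    "overlay / overlay ro,relatime 0 0\n"
    "tmpfs /tmp tmpfs rw,nosuid,noexec,relatime 0 0\n"
)

if __name__ == "__main__":
    print("rootfs read-only:", "ro" in mount_options(SAMPLE_MOUNTS, "/"))
    print("/tmp hardened:", {"noexec", "nosuid"} <= mount_options(SAMPLE_MOUNTS, "/tmp"))
```

Inside a real container the text would come from open("/proc/mounts").read() instead of the sample string.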

no-new-privileges

Sets the PR_SET_NO_NEW_PRIVS prctl flag on the container process. This bit is inherited across execve() and prevents:
- setuid binaries from gaining elevated privileges (the setuid bit is ignored)
- setcap binaries from gaining additional capabilities
- AppArmor/SELinux profile transitions that would elevate privilege

docker run --security-opt=no-new-privileges nginx

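This is particularly important when the container runs as a non-root user: without no-new-privileges, that user can execute a setuid-root binary inside the container and regain every capability still present in the container's bounding set. Capabilities removed with --cap-drop are also removed from the bounding set and cannot be regained this way, so the two controls complement each other.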
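The flag itself can be set and queried from any process with prctl. The following is a Linux-only sketch using ctypes; the option numbers 38 and 39 are taken from linux/prctl.h, and the wrapper names are hypothetical. Note that the flag cannot be cleared once set.

```python
import ctypes
import os
import sys

PR_SET_NO_NEW_PRIVS = 38  # from linux/prctl.h
PR_GET_NO_NEW_PRIVS = 39


def _prctl(option: int, arg2: int = 0) -> int:
    """Call prctl(2) through libc; raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    ulong = ctypes.c_ulong
    # prctl takes unsigned long arguments; unused ones must be zero.
    result = libc.prctl(ctypes.c_int(option), ulong(arg2), ulong(0), ulong(0), ulong(0))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def set_no_new_privs() -> None:
    """Set the no_new_privs bit for the calling thread (irreversible)."""
    _prctl(PR_SET_NO_NEW_PRIVS, 1)


def get_no_new_privs() -> bool:
    """Return True if the no_new_privs bit is set."""
    return _prctl(PR_GET_NO_NEW_PRIVS) == 1


if __name__ == "__main__" and sys.platform.startswith("linux"):
    print("before:", get_no_new_privs())
    set_no_new_privs()
    print("after: ", get_no_new_privs())
```

The same state is visible without any code as the NoNewPrivs field in /proc/PID/status on kernels that expose it.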

Notable Container Escapes

CVE-2019-5736: runc /proc/self/exe Overwrite

Severity: Critical
Affected: runc through 1.0-rc6 (fixed in 1.0-rc7), Docker < 18.09.2

Mechanism:
1. Attacker runs a malicious container image, or gains root inside an existing container
2. The attacker replaces a binary that docker exec will run (e.g., /bin/sh) with a script whose interpreter line is #!/proc/self/exe
3. During docker exec, runc enters the container's namespaces and executes that target; the kernel resolves /proc/self/exe to the host's runc binary, so runc is re-executed inside the container
4. A malicious process in the container opens /proc/<runc PID>/exe read-only to obtain a file descriptor to the host runc binary, waits for that runc process to exit, then reopens the descriptor for writing via /proc/self/fd/<N> and overwrites the binary with a payload
5. Next time runc runs, it executes the attacker's code as root on the host

Timeline:
[container start] → malicious process points the exec target at /proc/self/exe
                   and starts polling /proc for a runc process
[docker exec]    → runc enters container namespaces and re-executes itself
                   Container opens /proc/<runc PID>/exe, obtaining an fd to the host binary
                   After runc exits, container overwrites the runc binary via the fd
[next runc call] → executes attacker's code as root on host

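Fix: runc now re-executes itself from a sealed in-memory copy of its binary (a memfd created with memfd_create and sealed against modification) before entering container namespaces, so /proc/self/exe inside the container refers to the copy and the host binary cannot be overwritten.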

CVE-2022-0185: Kernel Heap Overflow in fsconfig

Severity: High
Affected: Linux 5.1 through 5.16.1 (fixed in 5.16.2)

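Mechanism:
- An integer underflow in legacy_parse_param(), reached through the fsconfig() syscall of the new filesystem configuration API, led to a kernel heap overflow that could be triggered from within a container
- The vulnerable path requires CAP_SYS_ADMIN in the caller's user namespace: a container has it when run with --privileged, and an unprivileged process can obtain it by creating a new user namespace when unprivileged user namespaces are enabled
- The overflow led to kernel code execution and container escape to host root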

Mitigation: the kernel patch, seccomp profiles that block fsconfig or unshare with CLONE_NEWUSER (Docker's default profile denies unshare to containers without CAP_SYS_ADMIN), or disabling unprivileged user namespaces.

The --privileged Escape

Running a container with --privileged grants:
- All capabilities (including CAP_SYS_ADMIN)
- Access to all host devices (/dev/)
- Disabled seccomp profile
- Disabled AppArmor profile
- Full access to all cgroup controllers

A trivial privileged container escape:

# Inside privileged container:
mkdir /mnt/host
mount /dev/sda1 /mnt/host    # CAP_SYS_ADMIN allows mounting
chroot /mnt/host              # Now in host filesystem
# Write to /etc/cron.d, SSH authorized_keys, etc.

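Or simpler, when the container also shares the host PID namespace (--privileged --pid=host): nsenter --target 1 --mount --uts --ipc --net --pid -- bash enters the host init's namespaces. Without --pid=host, PID 1 inside the container is the container's own init, not the host's.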
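Settings that enable these escapes are visible in the HostConfig section of docker inspect output, which makes them straightforward to audit. A minimal sketch, assuming the HostConfig field names Privileged, CapAdd, PidMode, and SecurityOpt; the function name and the set of checks are illustrative, not a complete policy.

```python
from typing import List


def _normalize_cap(name: str) -> str:
    """Upper-case a capability name and strip an optional CAP_ prefix."""
    name = name.upper()
    return name[4:] if name.startswith("CAP_") else name


def risky_settings(host_config: dict) -> List[str]:
    """List escape-enabling settings found in a docker inspect HostConfig dict."""
    findings = []
    if host_config.get("Privileged"):
        findings.append("privileged mode")
    caps = {_normalize_cap(c) for c in host_config.get("CapAdd") or []}
    if caps & {"SYS_ADMIN", "ALL"}:
        findings.append("CAP_SYS_ADMIN granted")
    if host_config.get("PidMode") == "host":
        findings.append("host PID namespace")
    for opt in host_config.get("SecurityOpt") or []:
        normalized = opt.replace(":", "=", 1)  # accept the older colon form too
        if normalized in ("seccomp=unconfined", "apparmor=unconfined"):
            findings.append(normalized)
    return findings


if __name__ == "__main__":
    # Hypothetical HostConfig excerpt for a container started with
    # --privileged --pid=host
    sample = {"Privileged": True, "PidMode": "host", "CapAdd": None, "SecurityOpt": None}
    print(risky_settings(sample))
```

In practice the dict would come from json.loads over the output of docker inspect for each running container, with any non-empty result treated as a finding to review.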

Security Layer Diagram

Container Process Attack Path
                    │
                    │  exploit kernel vuln
                    ▼
  ┌─────────────────────────────────────────┐
  │           Security Layers               │
  │                                         │
  │  Seccomp-BPF        blocks syscall?     │ ← attacker must find allowed syscall
  │  AppArmor/SELinux   denies action?      │ ← even if syscall allowed, MAC denies
  │  Capabilities       has needed cap?     │ ← cap dropped? can't do privileged op
  │  User namespace     maps to unpriv UID? │ ← even root in ns = unpriv on host
  │                                         │
  └─────────────────────────────────────────┘
                    │  all layers bypassed
                    ▼
              Host kernel control
              (container escape)

With ALL layers active: attacker must bypass seccomp (find gap),
bypass AppArmor/SELinux (policy error), have needed cap (not dropped),
AND deal with user namespace UID remapping. Defense in depth.

Kubernetes Security Context

In Kubernetes, container security is configured via securityContext:

spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false     # no-new-privileges
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]

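PodSecurity Admission (GA in Kubernetes 1.25, replaces PodSecurityPolicy) enforces three profiles:
- privileged: No restrictions
- baseline: Prevents known privilege escalations (no privileged containers, no host namespaces, no hostPath volumes)
- restricted: Hardened (requires non-root, allowPrivilegeEscalation: false, dropping all capabilities with only NET_BIND_SERVICE allowed back, and a RuntimeDefault or Localhost seccomp profile). A read-only root filesystem is not part of this profile and must be set separately.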
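The main container-level requirements of the restricted profile can be approximated with a short check over a pod spec. This is a simplified sketch and not the admission controller's implementation: it ignores init and ephemeral containers, volume types, and host namespaces, and the function and constant names are hypothetical. The sample spec mirrors the securityContext example above.

```python
from typing import List

ALLOWED_ADDED_CAPS = {"NET_BIND_SERVICE"}
ALLOWED_SECCOMP_TYPES = {"RuntimeDefault", "Localhost"}


def restricted_violations(pod_spec: dict) -> List[str]:
    """Return human-readable violations of a subset of the restricted profile."""
    pod_sc = pod_spec.get("securityContext") or {}
    violations = []
    for container in pod_spec.get("containers") or []:
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}
        # Container-level settings take precedence over pod-level ones.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        caps = sc.get("capabilities") or {}
        if run_as_non_root is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in (caps.get("drop") or []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        extra = set(caps.get("add") or []) - ALLOWED_ADDED_CAPS
        if extra:
            violations.append(f"{name}: disallowed added capabilities {sorted(extra)}")
        if seccomp.get("type") not in ALLOWED_SECCOMP_TYPES:
            violations.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return violations


SAMPLE_POD_SPEC = {
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 1000,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "app",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        },
    }],
}

if __name__ == "__main__":
    print(restricted_violations(SAMPLE_POD_SPEC))  # []
```

A check like this is useful in CI as an early warning; enforcement still belongs to the cluster's admission control.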

Production Examples

Check seccomp status of a running container:

# Find the container's PID on the host
docker inspect --format '{{.State.Pid}}' <container>
grep Seccomp /proc/<PID>/status
# Seccomp: 0 = disabled, 1 = strict, 2 = filter (BPF)
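When checking many containers, the same field can be read programmatically. A minimal sketch that parses the text of /proc/PID/status; the sample text is a hypothetical excerpt, and the helper names are illustrative.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(status_text: str) -> dict:
    """Parse 'Key:<whitespace>value' lines from /proc/PID/status into a dict."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def seccomp_mode(status_text: str) -> str:
    """Return the seccomp mode name for a process status dump."""
    return SECCOMP_MODES[int(parse_status(status_text)["Seccomp"])]


# Hypothetical excerpt of /proc/<PID>/status for a containerized process.
SAMPLE_STATUS = (
    "Name:\tnginx\n"
    "CapEff:\t00000000a80425fb\n"
    "NoNewPrivs:\t1\n"
    "Seccomp:\t2\n"
)

if __name__ == "__main__":
    print(seccomp_mode(SAMPLE_STATUS))                 # filter
    print(parse_status(SAMPLE_STATUS)["NoNewPrivs"])   # 1
```

For a live process, the input would be the result of reading /proc/PID/status with the host PID obtained from the container runtime.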

Verify AppArmor profile:

cat /proc/<PID>/attr/current
# Should show: docker-default (enforce)

Detect privilege escalation attempts:

# Audit log for SELinux (AVC) denials involving container processes
ausearch -m AVC,USER_AVC -ts today | grep container_t

Debugging Notes

Seccomp blocking legitimate syscall: Application fails with mysterious EPERM or EACCES. Diagnose with strace -e trace=all <command> and look for syscalls returning EPERM. Then allow that syscall in a custom seccomp profile.
AppArmor false positive: Application gets Permission denied despite correct file permissions. Check dmesg | grep apparmor or journalctl | grep DENIED for AppArmor denials.
no-new-privileges breaking setuid binaries: su, sudo, ping inside container may fail. This is expected and correct behavior. Start the process as the intended user, or grant only the required capabilities.
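User namespace UID mapping issues: Files owned by host UIDs that have no mapping appear inside the container as nobody (the overflow UID, 65534), and files created inside the container appear on the host under the mapped UID (e.g., 100000). Check /proc/PID/uid_map; if the mapping is missing or incorrect, ownership will not match on either side.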

Performance Implications

Seccomp-BPF: ~2-10ns overhead per syscall with JIT. For syscall-heavy workloads (many small file operations), this can be 1-3% overhead.
AppArmor path-based checks: Each file open traverses AppArmor rules. Large rule sets add latency to file I/O. Usually negligible but measurable in benchmark conditions.
User namespace UID mapping lookup: Minor overhead on each UID translation. Not measurable in practice.
Read-only rootfs: No performance impact on reads; writes to tmpfs have normal tmpfs performance.

Failure Modes

Failure	Symptom	Cause and fix
Seccomp blocks needed syscall	EPERM on legitimate operation	Default profile too restrictive; add syscall to custom profile
AppArmor blocks file access	EACCES despite correct permissions	Profile rule missing; check dmesg or the audit log for apparmor DENIED entries
No-new-privileges breaks app	sudo/setuid fails	Expected; fix by granting needed caps directly
User NS mapping race	File ownership wrong	uid_map written after process started; runtime bug
Privileged container escape	Host compromise	Container run with --privileged; use targeted capability grants instead

Modern Usage

Confidential containers: Combining TEE (AMD SEV, Intel TDX) with container runtimes to encrypt container memory even from the host kernel/hypervisor
Sigstore/Cosign: Cryptographic signing and verification of container images to ensure image integrity before execution
Runtime security (Falco, Tetragon): eBPF-based runtime security — detect anomalous syscall patterns, unexpected file accesses, and privilege escalation attempts in real-time
Kata Containers: VM-level isolation combined with OCI compatibility (covered in section 07)

Future Directions

Landlock LSM: User-space sandboxing using a stackable LSM module — allows applications to restrict their own filesystem access without root or seccomp
Deeper hardware isolation: Trusted Execution Environments (TEEs) providing hardware-enforced memory encryption for containers
eBPF LSM: Replacing AppArmor/SELinux with programmable eBPF-based security policies — more flexible, easier to iterate on
Supply chain security: In-toto, SLSA, and SBOM integration making container image provenance verifiable end-to-end

Exercises

Run a container with --cap-drop=ALL and verify that ping fails (requires CAP_NET_RAW). Add only CAP_NET_RAW back and verify ping works.
Write a custom seccomp profile that allows all syscalls except mkdir. Build a container image that tries to call mkdir and verify it's blocked. Check the return code and errno.
Run a container as UID 1000 with --read-only and --tmpfs /tmp. Try to write a file to /etc/ (should fail) and to /tmp/ (should succeed). Verify behavior matches expectations.
Reproduce the privileged container escape: run a privileged container and mount the host's root device. Access a file on the host filesystem from inside the container.
Enable Falco on a Kubernetes cluster. Trigger a rule violation (e.g., shell spawned inside container). Observe the alert. Write a custom Falco rule.
Examine the Docker default seccomp profile. Identify 5 syscalls it blocks and explain why each is blocked from a security perspective.

References

capabilities(7) — Linux man page
seccomp(2) — Linux man page
CVE-2019-5736 writeup: unit42.paloaltonetworks.com
CVE-2022-0185 writeup: ssd-disclosure.com
Docker security documentation: docs.docker.com/engine/security/
NCC Group container security audit reports (public)
Kubernetes Pod Security Standards: kubernetes.io/docs/concepts/security/pod-security-standards/
Falco: falco.org
Tetragon (Cilium runtime security): tetragon.io
Linux AppArmor documentation, SELinux documentation
man 5 apparmor.d for AppArmor profile syntax