06 — Linux Capabilities

Technical Overview

Linux capabilities fragment the traditional all-or-nothing root privilege into ~40 distinct capabilities, each granting a specific set of privileged operations. Before capabilities (POSIX 1003.1e draft, implemented in Linux 2.2, 1999), any program needing any privilege had to run as root (UID 0)—gaining every privilege simultaneously. This violated the principle of least privilege: a DNS resolver needs to bind to port 53 but has no need to mount filesystems, load kernel modules, or modify routing tables.

With capabilities, named can be granted CAP_NET_BIND_SERVICE alone. A compromise of named grants the attacker only network binding capability—not full root. The blast radius of each exploit is minimized to the specific capabilities the compromised process holds.

Prerequisites

Linux process model (UID, GID, EUID, EGID).
ELF binary and setuid mechanics.
File permission model.
Basic understanding of kernel system call privilege checks.

Core Content

Capability Sets

Each process maintains five capability sets:

Set	Meaning
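Permitted (P)	Maximum capabilities the process can make effective. Recomputed at exec from (Inheritable ∩ file inheritable) ∪ (file permitted ∩ Bounding) ∪ Ambient.
Effective (E)	Currently active capabilities. Subset of Permitted. What the kernel checks.
Inheritable (I)	Capabilities preserved across execve. They enter the new Permitted set only if the executed file also lists them in its file inheritable set.
Ambient (A)	Capabilities kept across execve of a non-privileged program (no setuid/setgid bit, no file capabilities). Always a subset of Permitted ∩ Inheritable. Since Linux 4.3.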
Bounding (B)	Upper bound on capabilities that can ever be acquired. Can only be dropped, never raised.

Formal capability transitions on execve:

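A'(new) = 0        if the file is privileged (setuid/setgid bit or file capabilities)
          A(old)   otherwise
P'(new) = (I(old) ∩ file_inheritable) ∪ (file_permitted ∩ B(old)) ∪ A'(new)
E'(new) = P'(new)  if the file effective bit is set
          A'(new)  otherwise
I'(new) = I(old)   (inheritable set is unchanged across exec)
B'(new) = B(old)   (bounding set never changes across exec)

When the process's real or effective UID is 0 (including after exec of a setuid-root binary), the kernel treats file_inheritable and file_permitted as all ones, and treats the file effective bit as set when the effective UID is 0, unless securebits disable this behavior.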

This complexity exists to prevent privilege escalation through execve.
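
The transition rules documented in capabilities(7) can be modeled with plain set operations. The sketch below is illustrative only: the class and function names (ProcCaps, FileCaps, execve_transition) are hypothetical, capabilities are represented as strings, and the special handling of UID 0 and securebits is deliberately left out.

```python
from dataclasses import dataclass

EMPTY = frozenset()


@dataclass(frozen=True)
class ProcCaps:
    """Capability sets of a process (capability names as plain strings)."""
    permitted: frozenset
    effective: frozenset
    inheritable: frozenset
    ambient: frozenset
    bounding: frozenset


@dataclass(frozen=True)
class FileCaps:
    """File capability sets plus the bits that make a file 'privileged'."""
    permitted: frozenset = EMPTY
    inheritable: frozenset = EMPTY
    effective_bit: bool = False
    setid: bool = False  # setuid or setgid bit is set

    @property
    def privileged(self) -> bool:
        # Simplification: a file counts as privileged if it is set-ID
        # or carries any file capability data.
        return self.setid or self.effective_bit or bool(self.permitted | self.inheritable)


def execve_transition(p: ProcCaps, f: FileCaps) -> ProcCaps:
    """Apply the execve rules for a non-root process (root special cases not modeled)."""
    ambient = EMPTY if f.privileged else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective_bit else ambient
    return ProcCaps(permitted, effective, p.inheritable, ambient, p.bounding)


# Example: a process holding an ambient capability executes an ordinary script
# interpreter (no file capabilities); the capability survives the exec.
net_bind = frozenset({"CAP_NET_BIND_SERVICE"})
before = ProcCaps(net_bind, net_bind, net_bind, net_bind,
                  frozenset({"CAP_NET_BIND_SERVICE", "CAP_NET_RAW"}))
after = execve_transition(before, FileCaps())
print(sorted(after.effective))
```

Running the same process through a file that carries file capabilities clears the ambient set, which is why ambient capabilities never stack on top of setcap binaries.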

Capability Set Diagram

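Process capabilities (schematic: the kernel strictly enforces only E ⊆ P and A ⊆ P ∩ I; the other nesting shows the typical case):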
┌─────────────────────────────────────────────────────────────────────┐
│  Bounding Set (B)     — maximum limit, monotonically decreasing     │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Permitted Set (P)  — what can be activated                  │    │
│  │  ┌───────────────────────────────────────────────────────┐  │    │
│  │  │  Effective Set (E)  — what kernel checks on syscalls   │  │    │
│  │  └───────────────────────────────────────────────────────┘  │    │
│  │  ┌───────────────────────────────────────────────────────┐  │    │
│  │  │  Inheritable Set (I) — passed through exec            │  │    │
│  │  └───────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │  Ambient Set (A) — inherits across exec without setuid      │    │
│  └─────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────┘

Key Capabilities

Capability	Privileges Granted	Risk Level
CAP_NET_ADMIN	Configure network interfaces, routing, firewall rules, change IP, packet filtering	High
CAP_SYS_ADMIN	Mount filesystems, `pivot_root`, set hostname, `swapon`, `setns`, privileged `bpf`, many more	Critical (near-root)
CAP_SYS_PTRACE	Trace any process, read/write its memory	Critical
CAP_SYS_RAWIO	Raw I/O to devices, `/dev/mem`, port I/O	Critical
CAP_DAC_OVERRIDE	Bypass file read/write/execute permission checks	High
CAP_DAC_READ_SEARCH	Bypass file read and directory search checks	High
CAP_SETUID	Set arbitrary UIDs (including uid=0)	Critical
CAP_SETGID	Set arbitrary GIDs	High
CAP_NET_BIND_SERVICE	Bind to ports < 1024	Low
CAP_NET_RAW	Use raw sockets (ping, packet capture)	Medium
CAP_KILL	Send signals to arbitrary processes	Medium
CAP_CHOWN	Change file ownership	Medium
CAP_FOWNER	Bypass checks that require the caller's UID to match the file owner (`chmod`, `utime`)	Medium
CAP_SYS_MODULE	Load/unload kernel modules	Critical (kernel code exec)
CAP_SYS_CHROOT	Use chroot()	Medium
CAP_MKNOD	Create device files	Medium
CAP_AUDIT_WRITE	Write audit records to kernel audit log	Low
CAP_IPC_LOCK	Lock memory (mlock), use huge pages	Low
CAP_SYS_NICE	Set process priorities, scheduling policies	Low
CAP_SYS_RESOURCE	Override resource limits	Medium
CAP_SYS_TIME	Set system time	Medium

CAP_SYS_ADMIN: The "Bag of Everything"

CAP_SYS_ADMIN is the most dangerous capability—it grants a vast collection of unrelated privileges. Listing just some:

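CAP_SYS_ADMIN grants ability to:
  mount(2)          - mount any filesystem
  umount2(2)        - unmount
  pivot_root(2)     - change root filesystem
  sethostname(2)    - change hostname
  setdomainname(2)  - change domain
  swapon(2)         - enable and disable swap areas
  setns(2)          - join existing namespaces
  unshare(2)        - create most namespace types
  IPC_SET/IPC_RMID  - modify arbitrary System V IPC objects
  keyctl(2)         - KEYCTL_CHOWN and KEYCTL_SETPERM on kernel keyrings
  bpf(2)            - load privileged BPF program types
  ... and many more operations

Several operations often attributed to CAP_SYS_ADMIN have their own capability: ptrace(2) needs CAP_SYS_PTRACE, iopl(2) needs CAP_SYS_RAWIO, and mknod(2) for device nodes needs CAP_MKNOD.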

Any container or service that requests CAP_SYS_ADMIN should be treated with extreme suspicion: it is nearly equivalent to running as root. Many container escape exploits rely on abusing CAP_SYS_ADMIN to call mount or pivot_root. Loading kernel modules, another common escape path, requires the separate CAP_SYS_MODULE.

Capability Security Principle: Least Privilege

The correct approach for a service:

Identify required capabilities by tracing which privileged operations the service performs.
At startup, shrink the permitted and effective sets to that minimum in a single step. A capability removed from the permitted set cannot be added back later.
Acquire the privileged resources (bind the port, open the raw socket).
Drop the remaining capabilities once those resources are held.

#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>  // libcap; link with -lcap
#include <sys/prctl.h>

// Drop all capabilities (service starts as root for init setup)
void drop_capabilities(void) {
    cap_t caps = cap_init();  // empty capability set
    if (caps == NULL || cap_set_proc(caps) < 0) {
        perror("cap_set_proc");
        exit(1);
    }
    cap_free(caps);

    // Block privilege gain through setuid or file-capability binaries on later execve
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        exit(1);
    }
}

// Or keep a single capability and drop everything else:
void keep_net_bind_only(void) {
    cap_value_t caps[] = {CAP_NET_BIND_SERVICE};
    cap_t cap_set = cap_init();
    if (cap_set == NULL ||
        cap_set_flag(cap_set, CAP_PERMITTED, 1, caps, CAP_SET) < 0 ||
        cap_set_flag(cap_set, CAP_EFFECTIVE, 1, caps, CAP_SET) < 0 ||
        cap_set_proc(cap_set) < 0) {
        perror("keep_net_bind_only");
        exit(1);
    }
    cap_free(cap_set);
}

Viewing Capabilities

# View process capabilities (by PID)
cat /proc/<pid>/status | grep Cap
# CapInh: 0000000000000000
# CapPrm: 0000003fffffffff
# CapEff: 0000003fffffffff
# CapBnd: 0000003fffffffff
# CapAmb: 0000000000000000

# Decode hex capability mask
capsh --decode=0000003fffffffff

# View capabilities of current shell
capsh --print

# View capabilities in human-readable form
getpcaps <pid>
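
Where capsh is not installed, the hex masks can be decoded by hand, since bit N of a mask corresponds to capability number N in linux/capability.h. The following is a minimal sketch of that decoding; the helper names decode_mask and parse_status are hypothetical, and the name table reflects capability numbers 0 through 40 (kernels up to at least 5.9).

```python
from typing import Dict, List

# Index in this list == capability number in <linux/capability.h>.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_mask(hex_mask: str) -> List[str]:
    """Turn a hex mask such as '0000000000000400' into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Unknown high bits (newer kernels) are reported by number.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else "cap_%d" % bit)
        mask >>= 1
        bit += 1
    return names


def parse_status(text: str) -> Dict[str, List[str]]:
    """Extract and decode the Cap* lines from /proc/<pid>/status content."""
    wanted = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")
    sets = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            sets[key] = decode_mask(value.strip())
    return sets


print(decode_mask("0000000000000400"))
```

On a Linux host, parse_status(open("/proc/self/status").read()) would decode the sets of the running interpreter.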

Docker Default Capabilities

Docker keeps 14 capabilities by default when starting containers and drops the rest:

Docker drops: CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_BLOCK_SUSPEND,
              CAP_BPF, CAP_CHECKPOINT_RESTORE, CAP_DAC_READ_SEARCH,
              CAP_IPC_LOCK, CAP_IPC_OWNER, CAP_LEASE,
              CAP_LINUX_IMMUTABLE, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE,
              CAP_NET_ADMIN, CAP_NET_BROADCAST, CAP_PERFMON,
              CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_MODULE,
              CAP_SYS_NICE, CAP_SYS_PACCT, CAP_SYS_PTRACE,
              CAP_SYS_RAWIO, CAP_SYS_RESOURCE, CAP_SYS_TIME,
              CAP_SYS_TTY_CONFIG, CAP_SYSLOG, CAP_WAKE_ALARM

Docker keeps: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
              CAP_KILL, CAP_SETGID, CAP_SETUID, CAP_SETPCAP,
              CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SYS_CHROOT,
              CAP_MKNOD, CAP_AUDIT_WRITE, CAP_SETFCAP

Check and modify:

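# Check container capabilities (the stock alpine image does not ship capsh;
# decode the printed masks with capsh --decode on the host)
docker run --rm alpine grep Cap /proc/self/status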

# Add a capability
docker run --cap-add NET_ADMIN nginx

# Drop all and add specific
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx

# Add SYS_PTRACE for debugging (careful!)
docker run --cap-add SYS_PTRACE nginx
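
The effect of combining --cap-add and --cap-drop can be reasoned about as set arithmetic on the default list. The sketch below models the behavior as I understand it (ALL in --cap-add starts from every capability, ALL in --cap-drop starts from nothing, otherwise additions are applied after removals); treat that ordering as an assumption and verify it against your Docker version. The function name resolve_caps is hypothetical.

```python
DOCKER_DEFAULT = frozenset({
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FOWNER", "CAP_FSETID",
    "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_NET_BIND_SERVICE", "CAP_NET_RAW", "CAP_SYS_CHROOT",
    "CAP_MKNOD", "CAP_AUDIT_WRITE", "CAP_SETFCAP",
})

DOCKER_DROPPED = frozenset({
    "CAP_AUDIT_CONTROL", "CAP_AUDIT_READ", "CAP_BLOCK_SUSPEND", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE", "CAP_DAC_READ_SEARCH", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_LEASE", "CAP_LINUX_IMMUTABLE", "CAP_MAC_ADMIN",
    "CAP_MAC_OVERRIDE", "CAP_NET_ADMIN", "CAP_NET_BROADCAST", "CAP_PERFMON",
    "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_MODULE", "CAP_SYS_NICE",
    "CAP_SYS_PACCT", "CAP_SYS_PTRACE", "CAP_SYS_RAWIO", "CAP_SYS_RESOURCE",
    "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_SYSLOG", "CAP_WAKE_ALARM",
})

ALL_CAPS = DOCKER_DEFAULT | DOCKER_DROPPED


def _norm(name: str) -> str:
    """Accept 'net_admin', 'NET_ADMIN' or 'CAP_NET_ADMIN'; keep 'ALL' as is."""
    name = name.upper()
    if name == "ALL" or name.startswith("CAP_"):
        return name
    return "CAP_" + name


def resolve_caps(add=(), drop=()) -> frozenset:
    """Model the capability set a container would start with (assumed semantics)."""
    add_set = {_norm(c) for c in add}
    drop_set = {_norm(c) for c in drop}
    if "ALL" in add_set:
        return ALL_CAPS - drop_set
    if "ALL" in drop_set:
        return frozenset(add_set)
    return (DOCKER_DEFAULT - drop_set) | add_set


# Equivalent of: docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
print(sorted(resolve_caps(add=["NET_BIND_SERVICE"], drop=["ALL"])))
```

Comparing resolve_caps() with the set a service actually needs makes the gap between the default profile and least privilege explicit.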

File Capabilities (setcap)

Traditional setuid grants all root capabilities. File capabilities allow granting specific capabilities to a binary without setuid:

# Grant ping ability to open raw sockets without setuid
setcap cap_net_raw+ep /bin/ping

# Grant tcpdump ability to capture packets
setcap cap_net_raw,cap_net_admin+ep /usr/sbin/tcpdump

# Grant nginx ability to bind port 80 without root
setcap cap_net_bind_service+ep /usr/sbin/nginx

# View file capabilities
getcap /bin/ping
# /bin/ping = cap_net_raw+ep

# Remove capabilities
setcap -r /bin/ping

# Flags:
# e = effective (activate the capability when binary runs)
# p = permitted (capability is in the permitted set)
# i = inheritable
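
The textual form accepted by setcap (a capability list followed by operator and flag pairs) can be read mechanically. Below is a simplified sketch of such a parser, assuming only the three operators =, + and - and whitespace-separated clauses; it is not a reimplementation of libcap's cap_from_text, and the function name parse_cap_text is hypothetical.

```python
import re

FLAG_NAMES = {"e": "effective", "i": "inheritable", "p": "permitted"}

_CLAUSE = re.compile(r"^([a-z0-9_,]*)((?:[=+-][eip]*)+)$")
_OPS = re.compile(r"([=+-])([eip]*)")


def parse_cap_text(text: str, all_caps=()) -> dict:
    """Parse e.g. 'cap_net_raw,cap_net_admin+ep' into three named sets.

    all_caps is the universe used when a clause names 'all' or no capability.
    """
    sets = {name: set() for name in FLAG_NAMES.values()}
    for clause in text.lower().split():
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError("unparsable clause: %r" % clause)
        names = match.group(1)
        caps = set(all_caps) if names in ("", "all") else set(names.split(","))
        for op, flags in _OPS.findall(match.group(2)):
            if op == "=":
                # '=' first clears the listed capabilities from every set.
                for members in sets.values():
                    members -= caps
            for flag in flags:
                if op == "-":
                    sets[FLAG_NAMES[flag]] -= caps
                else:
                    sets[FLAG_NAMES[flag]] |= caps
    return sets


print(parse_cap_text("cap_net_raw,cap_net_admin+ep"))
```

Seeing the permitted and effective sets spelled out makes it easier to audit what a setcap line actually grants.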

Before file capabilities, tools that needed raw sockets (ping, tcpdump, wireshark) had to be installed setuid root or be run by root. File capabilities let them run as non-root with exactly the capability needed.

Ambient Capabilities (Linux 4.3+)

Traditional capability inheritance through execve requires either:

1. The file to have file capabilities set (setcap).
2. The file to be setuid root.

Neither works for scripts: the kernel executes the interpreter (python3, bash) and ignores setuid bits and file capabilities on the script itself, while granting capabilities to the interpreter would hand them to every script it runs.

Ambient capabilities (PR_CAP_AMBIENT, Linux 4.3) allow a process to pass capabilities to execve'd children even when the executable doesn't have file capabilities:

// Parent: pass CAP_NET_BIND_SERVICE to execve'd programs, including scripts.
// Requirement: the capability must already be in both the Permitted and
// Inheritable sets; otherwise the call fails with EPERM.
if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_NET_BIND_SERVICE, 0, 0) < 0) {
    perror("PR_CAP_AMBIENT_RAISE");
}
// An execve'd Python script then runs with CAP_NET_BIND_SERVICE

Use case: systemd unit files with AmbientCapabilities=CAP_NET_BIND_SERVICE.

# /etc/systemd/system/myapp.service
[Service]
User=myapp
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecStart=/opt/myapp/server.py

Capability Confusion Attacks

Capability confusion occurs when a library or tool adds capabilities it does not need, or a process escalates via an unexpected capability interaction.

CAP_SYS_ADMIN container escape: if a container has CAP_SYS_ADMIN, it can call mount(). The capability may have been granted for a legitimate purpose (cgroup management), but it also enables filesystem operations that reach host resources. CVE-2021-3493 (Ubuntu) shows a related failure: overlayfs mounted inside a user namespace did not validate file capabilities set through the mount, so an unprivileged user could create a binary whose capabilities were honored in the initial namespace.

CAP_NET_RAW + ARP spoofing: CAP_NET_RAW is commonly granted for ICMP (ping), but it also allows sending arbitrary Ethernet frames through packet sockets, enabling ARP poisoning and man-in-the-middle attacks on a shared network segment. Any container that holds CAP_NET_RAW, which Docker's default set includes, can attack other workloads on the same bridge or L2 segment.

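CAP_SETUID identity confusion: a process with CAP_SETUID can set its UID to 0. The UID change itself only copies the permitted set into the effective set, but the process now owns root's files, and its next execve as UID 0 grants every capability in the bounding set unless securebits prevent it. This is why CAP_SETUID should be considered as dangerous as root.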
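
The way UID changes interact with capability sets follows three rules listed in capabilities(7) under the effect of user ID changes. The sketch below models those rules with sets of strings; the function name apply_uid_change and the keep_caps parameter (standing in for the SECBIT_KEEP_CAPS flag) are illustrative names, and filesystem-UID effects are not modeled. Note that the UID change alone does not add capabilities: the full set arrives only at the next execve as UID 0.

```python
EMPTY = frozenset()


def apply_uid_change(old_uids, new_uids, permitted, effective, ambient,
                     keep_caps=False):
    """Return (permitted, effective, ambient) after a setuid-family call.

    old_uids and new_uids are (real, effective, saved) UID triples.
    """
    old_euid, new_euid = old_uids[1], new_uids[1]

    # Rule 1: some UID was 0 before and none is 0 afterwards.
    if 0 in old_uids and 0 not in new_uids:
        if not keep_caps:
            permitted = EMPTY
            effective = EMPTY
        ambient = EMPTY

    # Rule 2: effective UID goes from 0 to nonzero.
    if old_euid == 0 and new_euid != 0:
        effective = EMPTY

    # Rule 3: effective UID goes from nonzero to 0.
    if old_euid != 0 and new_euid == 0:
        effective = permitted

    return permitted, effective, ambient


# A non-root process holding CAP_SETUID switches to UID 0.
held = frozenset({"CAP_SETUID", "CAP_NET_RAW"})
print(apply_uid_change((1000, 1000, 1000), (0, 0, 0),
                       permitted=held,
                       effective=frozenset({"CAP_SETUID"}),
                       ambient=EMPTY))
```

The same model shows why a root daemon that calls setuid() to an unprivileged user loses all capabilities unless it sets keep-caps first.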

Kubernetes Security Context

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    securityContext:
      # Drop all capabilities
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE  # only add what's needed
      allowPrivilegeEscalation: false  # prevent setuid escalation
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000

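The Kubernetes Pod Security Admission Restricted profile requires, among other controls:

allowPrivilegeEscalation: false
capabilities.drop: [ALL] (only NET_BIND_SERVICE may be added back)
runAsNonRoot: true
seccompProfile.type: RuntimeDefault or Localhost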
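
A container securityContext can be checked against these requirements before it reaches the cluster. The following is a minimal sketch that covers only the capability-related fields discussed here, applied to a securityContext already loaded as a Python dict; the function name restricted_violations is hypothetical, and the real admission controller evaluates additional fields (pod-level settings, seccomp, volumes) that this sketch ignores.

```python
def restricted_violations(security_context: dict) -> list:
    """List the ways a container securityContext misses the checks modeled here."""
    problems = []

    if security_context.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")

    caps = security_context.get("capabilities") or {}
    if "ALL" not in (caps.get("drop") or []):
        problems.append("capabilities.drop must include ALL")

    # Under Restricted, NET_BIND_SERVICE is the only capability that may be added.
    extra = sorted(set(caps.get("add") or []) - {"NET_BIND_SERVICE"})
    if extra:
        problems.append("capabilities.add not allowed: " + ", ".join(extra))

    if security_context.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")

    return problems


# The securityContext from the Pod example above, as a dict.
context = {
    "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
    "runAsNonRoot": True,
    "runAsUser": 1000,
}
print(restricted_violations(context))
```

A check like this fits naturally into CI, where a non-empty result can fail the build before the manifest is applied.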

Historical Context

POSIX 1003.1e (the draft standard for capabilities and ACLs) was abandoned in 1997 without ratification. Linux's capability implementation is based on the withdrawn draft but diverges from it. The initial kernel implementation (2.2, 1999) was a simplified version, with many refinements through Linux 5.x.

The original capability system had a fundamental limitation: it could not prevent privilege escalation via setuid binaries. The permitted set was automatically maxed for root processes, meaning any root process could give itself any capability. Linux 2.6.26 (2008) added PR_SET_SECUREBITS to harden this, allowing root processes to be treated as non-root for capability purposes.

Ambient capabilities (Linux 4.3, 2015) addressed the long-standing gap of script privilege inheritance.

Production Examples

Case: nginx running as root → file capability. A legacy deployment ran nginx as root to bind ports 80 and 443. A contractor added setcap cap_net_bind_service+ep /usr/sbin/nginx and changed the service to start as www-data. The nginx master and worker processes now run as www-data with CAP_NET_BIND_SERVICE as their only capability (nginx ignores its user directive when the master is not root). An nginx vulnerability now grants the attacker the privileges of the www-data user, plus the ability to bind low ports, rather than root.

Case: runc CVE-2019-5736 (container escape). A vulnerability in the runc container runtime allowed a malicious container to overwrite the runc binary on the host through /proc/self/exe. The exploit needed only root (UID 0) inside the container and no capability beyond Docker's default set, so default containers were exploitable. Containers running as a non-root user, or with user-namespace remapping so that host root is not mapped into the container, were not. The case shows that capability restrictions work best when combined with non-root users and user namespaces.

Debugging Notes

# Decode all capability masks in /proc/<pid>/status
process=1234
while read -r name val; do
    case $name in
        Cap*) printf '%s ' "$name"; capsh --decode="$val" ;;
    esac
done < "/proc/$process/status"

# Find binaries with file capabilities under a directory
getcap -r /usr/bin 2>/dev/null

# Strace capability-related syscalls
strace -e trace=capset,capget,prctl ./myprogram

# Check systemd service capabilities
systemctl show myservice | grep -i cap

Security Implications

Capabilities reduce the blast radius of exploits but are not a complete solution:

Over-broad capabilities remain dangerous. CAP_SYS_ADMIN in a container is effectively root.
Capability regain through execve. Executing a setuid-root binary, or one with file capabilities, restores capabilities the process had dropped. Set PR_SET_NO_NEW_PRIVS to prevent this.
File capability attacks. Capabilities attached to a general-purpose binary (an interpreter, a shell, a tool that loads plugins) are available to every user who can run it. The kernel strips security.capability when a file is written, as it does for setuid bits, but the libraries and configuration such a binary loads still need protection from unprivileged writes.
CAP_NET_RAW in shared networks. Even in a container, CAP_NET_RAW allows ARP spoofing on a flat L2 network. Drop it where it is not needed, and use network policies (Kubernetes NetworkPolicy, iptables) to restrict inter-pod traffic.

Performance Implications

Capability checks are performed on every privileged operation. The core check is a bitwise test against the effective capability set, so its cost is negligible next to the syscall itself. There is no meaningful performance difference between using capabilities and running as root.

File capabilities are stored in the security.capability extended attribute, which setcap writes and getcap reads. Reading file capabilities adds one xattr lookup per execve, which is negligible except in extremely high-exec-rate workloads.
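
The security.capability attribute is a small little-endian structure (struct vfs_cap_data in linux/capability.h): a magic word holding the revision and the effective flag, followed by permitted/inheritable pairs of 32-bit words, and a root UID in revision 3. The sketch below decodes that layout from raw bytes; the function name decode_vfs_caps and the returned dict keys are illustrative choices, not part of any library.

```python
import struct

VFS_CAP_REVISION_MASK = 0xFF000000
VFS_CAP_FLAGS_EFFECTIVE = 0x00000001
REVISIONS = {0x01000000: 1, 0x02000000: 2, 0x03000000: 3}


def decode_vfs_caps(raw: bytes) -> dict:
    """Decode the raw value of the security.capability xattr."""
    if len(raw) < 4:
        raise ValueError("xattr value too short")
    (magic,) = struct.unpack_from("<I", raw, 0)
    revision = REVISIONS.get(magic & VFS_CAP_REVISION_MASK)
    if revision is None:
        raise ValueError("unknown revision in magic 0x%08x" % magic)

    words = 1 if revision == 1 else 2  # 32-bit words per capability set
    needed = 4 + 8 * words + (4 if revision == 3 else 0)
    if len(raw) < needed:
        raise ValueError("expected %d bytes, got %d" % (needed, len(raw)))

    permitted = inheritable = 0
    for i in range(words):
        # Each array element holds (permitted, inheritable) for 32 capabilities.
        perm, inh = struct.unpack_from("<II", raw, 4 + 8 * i)
        permitted |= perm << (32 * i)
        inheritable |= inh << (32 * i)

    result = {
        "revision": revision,
        "effective": bool(magic & VFS_CAP_FLAGS_EFFECTIVE),
        "permitted": permitted,
        "inheritable": inheritable,
    }
    if revision == 3:
        (result["rootid"],) = struct.unpack_from("<I", raw, 4 + 8 * words)
    return result


# Revision 2 value equivalent to cap_net_raw+ep (CAP_NET_RAW is bit 13).
sample = struct.pack("<IIIII", 0x02000001, 1 << 13, 0, 0, 0)
print(decode_vfs_caps(sample))
```

On Linux, the raw bytes for a real file can be obtained with os.getxattr(path, "security.capability"), which raises OSError when the file carries no capabilities.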

Failure Modes and Real Incidents

CVE-2021-3493 (Ubuntu OverlayFS privilege escalation): an Ubuntu-specific kernel patch allowed unprivileged users to mount OverlayFS inside a user namespace, where they hold CAP_SYS_ADMIN relative to that namespace only. OverlayFS did not validate file capabilities set through such a mount against the user namespace, so a local user could gain root. The namespace-relative interpretation of CAP_SYS_ADMIN created the confusion: a capability that was meant to be safe inside a user namespace became dangerous in combination with Ubuntu's filesystem patch.

CVE-2019-5736 (runc container escape): runc runs as host root with full capabilities while it enters the container. A malicious container image could abuse /proc/self/exe to obtain a file descriptor for the runc binary and overwrite it, gaining host code execution. This demonstrated that even well-designed capability models have unexpected interactions with Linux's /proc filesystem.

Modern Usage

In Kubernetes, capability management is enforced by Pod Security Admission (stable since Kubernetes 1.25). The Restricted policy requires drop: [ALL] and permits only NET_BIND_SERVICE under add:. The CIS Kubernetes Benchmark recommends capability restrictions for all workloads.

systemd supports CapabilityBoundingSet and, since v229, AmbientCapabilities in unit files, enabling system services to run with minimal capabilities without custom code changes.

Future Directions

Landlock (Linux 5.13): a complement to capabilities providing path-based access control (sandboxing which directories a process can access) through an unprivileged LSM with its own system calls, without requiring root or CAP_SYS_ADMIN.
Finer-grained capabilities: CAP_PERFMON and CAP_BPF (Linux 5.8) and CAP_CHECKPOINT_RESTORE (Linux 5.9) were split out of CAP_SYS_ADMIN, and proposals to split it further continue. Its current breadth is a design debt from when the capability was "everything else."
Capability-aware container runtimes: future versions of OCI runtimes may generate minimal capability sets automatically from static analysis of container images.

Exercises

Run capsh --print on your system. Identify which capabilities your current shell has. Explain what each effective capability would allow an attacker to do if they compromised this shell.
Grant CAP_NET_BIND_SERVICE to a test binary using setcap. Verify the binary can bind to port 80 when run as a non-root user. Use getcap to confirm the file capability, and cat /proc/<pid>/status | grep Cap to observe the capability sets during execution.
Write a C program that drops all capabilities after binding to a privileged port. Verify: (a) the port is bound successfully, (b) after the drop, getpcaps <pid> shows empty capability sets, (c) the process still serves connections on the bound port.
Examine the Docker default capability set. Identify three capabilities that Docker keeps enabled by default. For each: explain the legitimate use case and the potential attack vector.
Configure a systemd unit file for a Python HTTP server that uses AmbientCapabilities=CAP_NET_BIND_SERVICE to bind on port 80 without root. Verify the service starts, binds port 80, and the process runs as a non-root user.

References

Linux man page capabilities(7): https://man7.org/linux/man-pages/man7/capabilities.7.html
Hallyn, S. "POSIX.1e Capabilities." Linux kernel documentation. https://www.kernel.org/doc/html/latest/userspace-api/capabilities.html
Docker security documentation: https://docs.docker.com/engine/security/
Kubernetes Pod Security: https://kubernetes.io/docs/concepts/security/pod-security-standards/
CVE-2019-5736: https://nvd.nist.gov/vuln/detail/CVE-2019-5736
CVE-2021-3493: https://nvd.nist.gov/vuln/detail/CVE-2021-3493
libcap: https://git.kernel.org/pub/scm/libs/libcap/libcap.git
Landlock documentation: https://landlock.io/