06 — Linux Capabilities
Technical Overview
Linux capabilities fragment the traditional all-or-nothing root privilege into ~40 distinct capabilities, each granting a specific set of privileged operations. Before capabilities (POSIX 1003.1e draft, implemented in Linux 2.2, 1999), any program needing any privilege had to run as root (UID 0)—gaining every privilege simultaneously. This violated the principle of least privilege: a DNS resolver needs to bind to port 53 but has no need to mount filesystems, load kernel modules, or modify routing tables.
With capabilities, named can be granted CAP_NET_BIND_SERVICE alone. A compromise of named grants the attacker only network binding capability—not full root. The blast radius of each exploit is minimized to the specific capabilities the compromised process holds.
Prerequisites
- Linux process model (UID, GID, EUID, EGID).
- ELF binary and setuid mechanics.
- File permission model.
- Basic understanding of kernel system call privilege checks.
Core Content
Capability Sets
Each process maintains five capability sets:
| Set | Meaning |
|---|---|
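| Permitted (P) | Limiting superset for the Effective set: capabilities the process may activate. Recomputed at exec from the process's Inheritable set, the file's capabilities (masked by Bounding), and the Ambient set. |
| Effective (E) | Currently active capabilities. Subset of Permitted. What the kernel checks. |
| Inheritable (I) | Preserved unchanged across execve. A capability here reaches the new Permitted set only if the executable's file-inheritable set also contains it. |
| Ambient (A) | Capabilities kept across execve of a program that is neither setuid/setgid nor carries file capabilities. Always a subset of both Permitted and Inheritable. Since Linux 4.3. |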
| Bounding (B) | Upper bound on capabilities that can ever be acquired. Can only be dropped, never raised. |
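Formal capability transitions on execve (per capabilities(7); F denotes the file's capability sets):
A'(new) = 0 if the file is setuid/setgid or has file capabilities
A(old) otherwise
P'(new) = (I(old) ∩ F(inheritable)) ∪ (F(permitted) ∩ B(old)) ∪ A'(new)
E'(new) = P'(new) if the file's effective bit is set
A'(new) otherwise
I'(new) = I(old) (inheritable set is unchanged across exec)
B'(new) = B(old) (bounding set never changes across exec)
For a process whose real or effective UID is 0, the kernel treats F(inheritable) and F(permitted) as all ones, and treats the effective bit as set when the effective UID is 0, unless securebits disable this.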
This complexity exists to prevent privilege escalation through execve.
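The execve rules from capabilities(7) can be modelled with plain bit arithmetic. The sketch below is illustrative only: `struct proc_caps`, `struct file_caps`, and `caps_after_exec` are hypothetical names, each set is a 64-bit mask where bit n stands for capability number n, and the special handling of UID 0 is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* One 64-bit mask per process capability set (hypothetical model type). */
struct proc_caps {
    uint64_t permitted, effective, inheritable, ambient, bounding;
};

/* Capabilities attached to the executable (hypothetical model type). */
struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;   /* the single "e" flag of the file */
    bool privileged;  /* setuid/setgid, or the file has any file capabilities */
};

/* Apply the execve transformation for a non-root process. */
struct proc_caps caps_after_exec(struct proc_caps old, struct file_caps f)
{
    struct proc_caps n;

    /* Ambient capabilities are cleared when a privileged file is executed. */
    n.ambient = f.privileged ? 0 : old.ambient;
    n.permitted = (old.inheritable & f.inheritable)
                | (f.permitted & old.bounding)
                | n.ambient;
    n.effective = f.effective ? n.permitted : n.ambient;
    n.inheritable = old.inheritable;  /* unchanged across exec */
    n.bounding = old.bounding;        /* unchanged across exec */
    return n;
}
```

Feeding it an unprivileged process and a file marked cap_net_bind_service+ep (bit 10) yields a permitted and effective set containing only bit 10. Clearing bit 10 from the bounding set first yields empty sets, which is the escalation barrier the bounding set provides.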
Capability Set Diagram
Process capabilities:
┌────────────────────────────────────────────────────────────────────┐
│ Bounding Set (B): maximum limit, monotonically decreasing          │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Permitted Set (P): what can be activated                       │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ Effective Set (E): what the kernel checks on syscalls      │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ │ ┌────────────────────────────────────────────────────────────┐ │ │
│ │ │ Inheritable Set (I): passed through exec                   │ │ │
│ │ └────────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Ambient Set (A): inherits across exec without setuid           │ │
│ └────────────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
Key Capabilities
| Capability | Privileges Granted | Risk Level |
|---|---|---|
| CAP_NET_ADMIN | Configure network interfaces, routing, firewall rules, change IP, packet filtering | High |
| CAP_SYS_ADMIN | Mount filesystems, set hostname, manage swap and quotas, join namespaces with setns, many more | Critical (near-root) |
| CAP_SYS_PTRACE | Trace any process, read/write its memory | Critical |
| CAP_SYS_RAWIO | Raw I/O to devices, /dev/mem, port I/O (iopl, ioperm) | Critical |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | High |
| CAP_DAC_READ_SEARCH | Bypass file read and directory search checks | High |
| CAP_SETUID | Set arbitrary UIDs (including uid=0) | Critical |
| CAP_SETGID | Set arbitrary GIDs | High |
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Low |
| CAP_NET_RAW | Use raw sockets (ping, packet capture) | Medium |
| CAP_KILL | Send signals to arbitrary processes | Medium |
| CAP_CHOWN | Change file ownership | Medium |
| CAP_FOWNER | Bypass checks that normally require the caller to own the file (chmod, utime, setting ACLs) | Medium |
| CAP_SYS_MODULE | Load/unload kernel modules | Critical (kernel code exec) |
| CAP_SYS_CHROOT | Use chroot() | Medium |
| CAP_MKNOD | Create device files | Medium |
| CAP_AUDIT_WRITE | Write audit records to kernel audit log | Low |
| CAP_IPC_LOCK | Lock memory (mlock), use huge pages | Low |
| CAP_SYS_NICE | Set process priorities, scheduling policies | Low |
| CAP_SYS_RESOURCE | Override resource limits | Medium |
| CAP_SYS_TIME | Set system time | Medium |
CAP_SYS_ADMIN: The "Bag of Everything"
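CAP_SYS_ADMIN is the most dangerous capability: it grants a vast collection of unrelated privileges. Listing just some:
CAP_SYS_ADMIN grants ability to:
mount(2): mount any filesystem
umount2(2): unmount
pivot_root(2): change root filesystem
sethostname(2): change hostname
setdomainname(2): change domain
swapon(2), swapoff(2): manage swap areas
quotactl(2): manipulate disk quotas
setns(2), unshare(2): join or create most namespace types
System V IPC: IPC_SET and IPC_RMID on arbitrary objects
keyctl(2): KEYCTL_CHOWN and KEYCTL_SETPERM operations
bpf(2), perf_event_open(2): privileged operations (also covered by CAP_BPF and CAP_PERFMON since Linux 5.8)
... and dozens more operations
Note that ptrace(2), iopl(2), and mknod(2) for device nodes are governed by CAP_SYS_PTRACE, CAP_SYS_RAWIO, and CAP_MKNOD respectively, not by CAP_SYS_ADMIN.
Any container or service that requests CAP_SYS_ADMIN should be treated with extreme suspicion, because it is nearly equivalent to running as root. Many container escape exploits rely on abusing CAP_SYS_ADMIN to call mount or pivot_root, or to mount a cgroup v1 hierarchy and abuse its release_agent. Loading kernel modules is a separate capability (CAP_SYS_MODULE).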
Capability Security Principle: Least Privilege
The correct approach for a service:
- Identify required capabilities by tracing which privileged operations the service performs.
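- At startup, shrink the permitted and effective sets to the minimum required. A capability removed from the permitted set cannot be re-added without an execve, so decide what to keep before dropping.
- After acquiring privileged resources (bind port, open raw socket), drop the remaining capabilities.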
#include <stdio.h>
#include <stdlib.h>
#include <sys/capability.h>
#include <sys/prctl.h>
// Build with: cc service.c -lcap

// Drop all capabilities (service starts as root for init setup)
void drop_capabilities(void) {
    cap_t caps = cap_init();  // empty capability set
    if (caps == NULL || cap_set_proc(caps) < 0) {
        perror("cap_set_proc");
        exit(1);
    }
    cap_free(caps);
    // Also prevent regaining privileges by exec'ing setuid or file-capability binaries
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        exit(1);
    }
}

// Or drop everything except one capability:
void keep_net_bind_only(void) {
    cap_value_t caps[] = {CAP_NET_BIND_SERVICE};
    cap_t cap_set = cap_init();  // starts empty, so only the flags set below survive
    if (cap_set == NULL ||
        cap_set_flag(cap_set, CAP_PERMITTED, 1, caps, CAP_SET) < 0 ||
        cap_set_flag(cap_set, CAP_EFFECTIVE, 1, caps, CAP_SET) < 0 ||
        cap_set_proc(cap_set) < 0) {
        perror("keep_net_bind_only");
        exit(1);
    }
    cap_free(cap_set);
}
Viewing Capabilities
# View process capabilities (by PID)
grep Cap /proc/<pid>/status
# CapInh: 0000000000000000
# CapPrm: 0000003fffffffff
# CapEff: 0000003fffffffff
# CapBnd: 0000003fffffffff
# CapAmb: 0000000000000000
# Decode hex capability mask
capsh --decode=0000003fffffffff
# View capabilities of current shell
capsh --print
# View capabilities in human-readable form
getpcaps <pid>
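What capsh --decode does can be approximated in a few lines: each bit position in the hex mask corresponds to a capability number from <linux/capability.h>. The sketch below is a simplified stand-in, not the libcap implementation; `decode_caps` and `cap_names` are hypothetical names, and the table covers capability numbers 0 through 40.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Capability names indexed by capability number (0..40). */
static const char *const cap_names[] = {
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
};

/* Write a comma-separated list of the capabilities set in mask into buf.
 * Returns the number of names written; stops early if buf is too small. */
int decode_caps(uint64_t mask, char *buf, size_t buflen)
{
    int count = 0;
    size_t used = 0;

    if (buflen > 0)
        buf[0] = '\0';
    for (size_t i = 0; i < sizeof cap_names / sizeof cap_names[0]; i++) {
        if (!(mask & (UINT64_C(1) << i)))
            continue;
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         count ? "," : "", cap_names[i]);
        if (n < 0 || (size_t)n >= buflen - used) {
            if (used < buflen)
                buf[used] = '\0';  /* discard the partially written name */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With a 1024-byte buffer, the mask 0000003fffffffff from the example above decodes to 38 names (capability numbers 0 through 37), and 0000000000000400 decodes to cap_net_bind_service alone.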
Docker Default Capabilities
Docker keeps 14 capabilities by default when starting containers and drops all others:
Docker drops: CAP_AUDIT_CONTROL, CAP_AUDIT_READ, CAP_BLOCK_SUSPEND,
CAP_BPF, CAP_CHECKPOINT_RESTORE, CAP_DAC_READ_SEARCH,
CAP_IPC_LOCK, CAP_IPC_OWNER, CAP_LEASE,
CAP_LINUX_IMMUTABLE, CAP_MAC_ADMIN, CAP_MAC_OVERRIDE,
CAP_NET_ADMIN, CAP_NET_BROADCAST, CAP_PERFMON,
CAP_SYS_ADMIN, CAP_SYS_BOOT, CAP_SYS_MODULE,
CAP_SYS_NICE, CAP_SYS_PACCT, CAP_SYS_PTRACE,
CAP_SYS_RAWIO, CAP_SYS_RESOURCE, CAP_SYS_TIME,
CAP_SYS_TTY_CONFIG, CAP_SYSLOG, CAP_WAKE_ALARM
Docker keeps: CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
CAP_KILL, CAP_SETGID, CAP_SETUID, CAP_SETPCAP,
CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SYS_CHROOT,
CAP_MKNOD, CAP_AUDIT_WRITE, CAP_SETFCAP
Check and modify:
# Check container capabilities (the stock alpine image does not ship capsh)
docker run --rm alpine grep Cap /proc/self/status
# Add a capability
docker run --cap-add NET_ADMIN nginx
# Drop all and add specific
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
# Add SYS_PTRACE for debugging (careful!)
docker run --cap-add SYS_PTRACE nginx
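To recognise Docker's default set in /proc/<pid>/status output, it helps to know the mask the keep list produces. The sketch below builds that mask from the capability numbers in <linux/capability.h>; `docker_default_caps`, `caps_to_mask`, and `docker_default_mask` are hypothetical helper names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Capability numbers (from <linux/capability.h>) of Docker's default keep list. */
static const int docker_default_caps[] = {
    0,  /* CAP_CHOWN */
    1,  /* CAP_DAC_OVERRIDE */
    3,  /* CAP_FOWNER */
    4,  /* CAP_FSETID */
    5,  /* CAP_KILL */
    6,  /* CAP_SETGID */
    7,  /* CAP_SETUID */
    8,  /* CAP_SETPCAP */
    10, /* CAP_NET_BIND_SERVICE */
    13, /* CAP_NET_RAW */
    18, /* CAP_SYS_CHROOT */
    27, /* CAP_MKNOD */
    29, /* CAP_AUDIT_WRITE */
    31, /* CAP_SETFCAP */
};

/* OR together one bit per capability number. */
uint64_t caps_to_mask(const int *caps, size_t n)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++)
        mask |= UINT64_C(1) << caps[i];
    return mask;
}

/* Mask expected in CapPrm/CapEff of a root process in a default container. */
uint64_t docker_default_mask(void)
{
    return caps_to_mask(docker_default_caps,
                        sizeof docker_default_caps / sizeof docker_default_caps[0]);
}
```

The result is 0x00000000a80425fb. A container whose CapEff differs from this value has had capabilities added or dropped relative to the default keep list above.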
File Capabilities (setcap)
Traditional setuid grants all root capabilities. File capabilities allow granting specific capabilities to a binary without setuid:
# Grant ping ability to open raw sockets without setuid
setcap cap_net_raw+ep /bin/ping
# Grant tcpdump ability to capture packets
setcap cap_net_raw,cap_net_admin+ep /usr/sbin/tcpdump
# Grant nginx ability to bind port 80 without root
setcap cap_net_bind_service+ep /usr/sbin/nginx
# View file capabilities
getcap /bin/ping
# /bin/ping = cap_net_raw+ep   (newer libcap versions print: /bin/ping cap_net_raw=ep)
# Remove capabilities
setcap -r /bin/ping
# Flags:
# e = effective (activate the capability when binary runs)
# p = permitted (capability is in the permitted set)
# i = inheritable
Before file capabilities, tools that needed raw sockets were either installed setuid root (ping is the classic example) or had to be run as root (tcpdump, wireshark). File capabilities let them run as non-root with exactly the capability needed.
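What setcap writes is a small binary value in the security.capability extended attribute. As a rough sketch, assuming the revision 2 and 3 on-disk layouts of the kernel's vfs_cap_data structure (a little-endian magic word followed by two permitted/inheritable pairs), the value can be decoded as below. `struct file_cap_info` and `decode_file_caps` are hypothetical names, and revision 1 values are not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded view of a security.capability value (hypothetical helper type). */
struct file_cap_info {
    uint64_t permitted, inheritable;
    bool effective;     /* the "e" flag */
    unsigned revision;  /* 2 or 3 */
};

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Decode a revision 2 or 3 xattr value. Returns 0 on success, -1 otherwise. */
int decode_file_caps(const unsigned char *buf, size_t len,
                     struct file_cap_info *out)
{
    if (len < 20)
        return -1;                       /* too short for revision 2 */
    uint32_t magic = le32(buf);
    unsigned rev = magic >> 24;          /* revision lives in the top byte */
    if (rev != 2 && rev != 3)
        return -1;
    out->revision = rev;
    out->effective = (magic & 0x1) != 0; /* low bit is the effective flag */
    /* Low 32 bits come from the first pair, high 32 bits from the second. */
    out->permitted = (uint64_t)le32(buf + 4) | (uint64_t)le32(buf + 12) << 32;
    out->inheritable = (uint64_t)le32(buf + 8) | (uint64_t)le32(buf + 16) << 32;
    return 0;
}
```

On a live system the buffer would come from getxattr(path, "security.capability", buf, sizeof buf) declared in <sys/xattr.h>. For a binary marked cap_net_raw+ep the decoded permitted mask has only bit 13 set and the effective flag is true.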
Ambient Capabilities (Linux 4.3+)
Traditional capability inheritance through execve requires either:
1. The file to have file capabilities set (setcap).
2. The file to be setuid.
Neither works for scripts: the kernel executes the interpreter (python3, bash), not the script, and giving the interpreter file capabilities would extend them to every script it runs.
Ambient capabilities (PR_CAP_AMBIENT, Linux 4.3) allow a process to pass capabilities to execve'd children even when the executable doesn't have file capabilities:
// Parent: pass CAP_NET_BIND_SERVICE to all children, including scripts.
// Requirement: the capability must already be in both Permitted and
// Inheritable, otherwise the call fails with EPERM.
if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, CAP_NET_BIND_SERVICE, 0, 0) < 0)
    perror("PR_CAP_AMBIENT_RAISE");
// An execve'd Python script then runs with CAP_NET_BIND_SERVICE
Use case: systemd unit files with AmbientCapabilities=CAP_NET_BIND_SERVICE.
# /etc/systemd/system/myapp.service
[Service]
User=myapp
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecStart=/opt/myapp/server.py
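For a launcher that does by hand what AmbientCapabilities= does, the capability first has to be added to the inheritable set and only then raised into the ambient set. The sketch below uses the raw capget and capset system calls so that it does not depend on libcap; `raise_ambient` is a hypothetical helper name, and the code assumes Linux 4.3 or later with kernel headers installed. It affects only the calling thread.

```c
#include <errno.h>
#include <linux/capability.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Add cap to the inheritable set, then raise it into the ambient set.
 * Returns 0 on success, -1 on failure with errno set. */
int raise_ambient(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,  /* 0 means the calling thread */
    };
    struct __user_cap_data_struct data[2];

    if (cap < 0 || cap >= 64) {
        errno = EINVAL;
        return -1;
    }
    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;
    /* Ambient requires the capability in Inheritable as well as Permitted.
     * capset fails with EPERM if the capability is not already permitted. */
    data[cap / 32].inheritable |= 1u << (cap % 32);
    if (syscall(SYS_capset, &hdr, data) != 0)
        return -1;
    if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0) != 0)
        return -1;
    return 0;
}
```

A launcher would call raise_ambient(CAP_NET_BIND_SERVICE) after switching to the service user (with the capability retained in the permitted set) and then execve the script. For an unprivileged caller that does not hold the capability, the function fails with EPERM.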
Capability Confusion Attacks
Capability confusion occurs when a library or tool adds capabilities it does not need, or a process escalates via an unexpected capability interaction.
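CAP_SYS_ADMIN container escape: if a container has CAP_SYS_ADMIN, it can call mount(). On hosts using cgroup v1, mounting a cgroup hierarchy and setting its release_agent makes the host kernel run an attacker-chosen program as root outside the container. The capability is often granted for a legitimate purpose (cgroup management), but it enables arbitrary filesystem operations as well. CVE-2021-3493 (Ubuntu) shows a related pattern: a user holding CAP_SYS_ADMIN only inside a user namespace mounted overlayfs and used it to set file capabilities that the host then honored.
CAP_NET_RAW + ARP spoofing: CAP_NET_RAW is usually granted so that ping can send ICMP. But it also permits packet sockets, which can send arbitrary Ethernet frames, enabling ARP poisoning and man-in-the-middle attacks on a shared network segment. Because Docker keeps CAP_NET_RAW by default, any container on a shared bridge can attack its neighbors unless the capability is dropped.
CAP_SETUID identity confusion: a process with CAP_SETUID can switch to UID 0. Its capability sets do not grow at that moment, but it now owns root's files, and its next execve recomputes capabilities as for root, filling the permitted and effective sets up to the bounding set unless securebits or no_new_privs intervene. This is why CAP_SETUID should be considered as dangerous as root.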
Kubernetes Security Context
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: app
    securityContext:
      # Drop all capabilities
      capabilities:
        drop:
        - ALL
        add:
        - NET_BIND_SERVICE # only add what's needed
      allowPrivilegeEscalation: false # prevent setuid escalation
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000
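Kubernetes Pod Security Admission's Restricted profile requires, among other settings:
- allowPrivilegeEscalation: false
- capabilities.drop: [ALL] (only NET_BIND_SERVICE may be added back)
- runAsNonRoot: true
- seccompProfile.type: RuntimeDefault or Localhost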
Historical Context
POSIX 1003.1e (the draft standard for capabilities and ACLs) was withdrawn in 1997 without ratification. Linux's capability implementation is modeled on the withdrawn draft but diverges from it. The initial kernel implementation (2.2, 1999) was a simplified version, refined repeatedly through Linux 5.x; file capabilities, for example, only arrived in 2.6.24 (2008).
The original capability system had a fundamental limitation: it could not prevent privilege escalation via setuid binaries. The permitted set was automatically filled for root processes on execve, meaning any root process could regain every capability. Linux 2.6.26 (2008) added per-thread securebits (PR_SET_SECUREBITS) to harden this, allowing root processes to be treated as non-root for capability purposes.
Ambient capabilities (Linux 4.3, 2015) addressed the long-standing gap of script privilege inheritance.
Production Examples
Case: nginx running as root → file capability. A legacy deployment ran nginx as root to bind ports 80 and 443. A contractor added setcap cap_net_bind_service+ep /usr/sbin/nginx and changed the service to start nginx as www-data (the user directive in nginx.conf only takes effect when the master starts as root). The master and its workers now run as www-data holding only CAP_NET_BIND_SERVICE. An nginx vulnerability now grants the attacker the privileges of the www-data user plus the ability to bind low ports, rather than root.
Case: runc CVE-2019-5736 (container escape). A vulnerability in the runc container runtime allowed a malicious container to overwrite the runc binary on the host through /proc/self/exe. The exploit needed only root (UID 0) inside the container and no capability beyond Docker's default set, so the default capability profile did not stop it. Containers running as a non-root user or with user namespaces enabled were not exploitable, which shows that capability restrictions must be combined with dropping UID 0.
Debugging Notes
# Decode all capability masks in /proc/<pid>/status (bash)
process=1234
# Split each line on the colon and tab so $name is the field name and $val the hex mask
while IFS=$':\t' read -r name val; do
    case $name in
        Cap*) printf '%s: ' "$name"; capsh --decode="$val" ;;
    esac
done < "/proc/$process/status"
# List binaries under /usr/bin that have file capabilities
getcap -r /usr/bin 2>/dev/null
# Strace capability-related syscalls
strace -e trace=capset,capget,prctl ./myprogram
# Check systemd service capabilities
systemctl show myservice | grep -i cap
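The same information can be collected from inside a program. The sketch below parses the Cap* lines from any stream formatted like /proc/<pid>/status; `struct cap_status` and `read_cap_status` are hypothetical names, and the function takes a FILE pointer so that it can be fed either the real proc file or a saved copy.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The five capability masks reported in /proc/<pid>/status. */
struct cap_status {
    uint64_t inh, prm, eff, bnd, amb;
};

/* Parse the Cap* lines from an open status stream.
 * Returns the number of capability fields found (5 on Linux 4.3 and later). */
int read_cap_status(FILE *f, struct cap_status *out)
{
    char line[256];
    int found = 0;

    while (fgets(line, sizeof line, f) != NULL) {
        uint64_t v;
        if (sscanf(line, "CapInh: %" SCNx64, &v) == 1) {
            out->inh = v; found++;
        } else if (sscanf(line, "CapPrm: %" SCNx64, &v) == 1) {
            out->prm = v; found++;
        } else if (sscanf(line, "CapEff: %" SCNx64, &v) == 1) {
            out->eff = v; found++;
        } else if (sscanf(line, "CapBnd: %" SCNx64, &v) == 1) {
            out->bnd = v; found++;
        } else if (sscanf(line, "CapAmb: %" SCNx64, &v) == 1) {
            out->amb = v; found++;
        }
    }
    return found;
}
```

A caller would open /proc/self/status (or another PID's file) with fopen and pass the stream in. A useful sanity check on the result is that (eff & ~prm) is zero, since the effective set is always a subset of the permitted set.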
Security Implications
Capabilities reduce the blast radius of exploits but are not a complete solution:
- Over-broad capabilities remain dangerous. CAP_SYS_ADMIN in a container is effectively root.
- Capability regain through execve. Executing a setuid-root binary, or one with file capabilities, hands back privileges the process had dropped. Use PR_SET_NO_NEW_PRIVS to prevent this.
- File capability attacks. A binary with file capabilities (setcap) confers them on every user allowed to execute it, so a general-purpose tool or interpreter carrying cap_setuid or cap_dac_override is a direct path to root. Grant file capabilities only to single-purpose binaries, restrict who may execute them, and protect them from unprivileged writes.
- CAP_NET_RAW in shared networks. Even in a container, CAP_NET_RAW allows ARP spoofing on a flat L2 network. Drop it where it is not needed; network policies (Kubernetes NetworkPolicy, iptables) restrict inter-pod traffic at L3/L4 but do not by themselves stop L2 spoofing.
Performance Implications
Capability checks are performed on every privileged operation. The core check is a bitwise test against the effective capability set (plus any LSM hooks), which is negligible next to the cost of the system call itself. Using capabilities instead of full root has no measurable performance cost.
File capabilities are stored in the security.capability extended attribute, which getcap and setcap read and write. Reading them adds one xattr lookup per execve, negligible except in extremely high-exec-rate workloads.
Failure Modes and Real Incidents
CVE-2021-3493 (Ubuntu OverlayFS privilege escalation): Ubuntu kernels carried a patch allowing overlayfs mounts inside unprivileged user namespaces, where any user can hold CAP_SYS_ADMIN relative to that namespace. The overlayfs code did not validate file capabilities set from within the namespace, so a user could attach capabilities to a file through the overlay and then execute it in the initial namespace to gain root. The namespace-relative interpretation of CAP_SYS_ADMIN was sound in itself; the flaw was a filesystem that let namespace-scoped privilege leak into host-visible state.
CVE-2019-5736 (runc container escape): when runc entered a container (for example on docker exec), a process running as root inside the container could reach the host's runc binary through /proc/self/exe and overwrite it, gaining host code execution the next time runc ran. No capability beyond root in the container was needed. This demonstrated that even well-designed capability models have unexpected interactions with Linux's /proc filesystem.
Modern Usage
In Kubernetes, capability restrictions are enforced by Pod Security Admission (stable since Kubernetes 1.25). The Restricted policy requires drop: [ALL] and permits adding back only NET_BIND_SERVICE. The CIS Kubernetes Benchmark likewise recommends minimizing capabilities for all workloads.
systemd supports CapabilityBoundingSet and, since v229, AmbientCapabilities in unit files, enabling system services to run with minimal capabilities without custom code changes.
Future Directions
- Landlock (Linux 5.13): a complement to capabilities providing path-based access control (sandboxing which directories a process can access) through dedicated system calls, usable by unprivileged processes without root or CAP_SYS_ADMIN.
- Finer-grained capabilities: CAP_SYS_ADMIN's current breadth is a design debt from when the capability was "everything else." Pieces have already been split out (CAP_SYSLOG in 2.6.37, CAP_BPF and CAP_PERFMON in 5.8, CAP_CHECKPOINT_RESTORE in 5.9), and further splits continue to be proposed.
- Capability-aware container runtimes: future versions of OCI runtimes may generate minimal capability sets automatically from static analysis of container images.
Exercises
- Run capsh --print on your system. Identify which capabilities your current shell has. Explain what each effective capability would allow an attacker to do if they compromised this shell.
- Grant CAP_NET_BIND_SERVICE to a test binary using setcap. Verify the binary can bind to port 80 when run as a non-root user. Use getcap to confirm the file capability, and cat /proc/<pid>/status | grep Cap to observe the capability sets during execution.
- Write a C program that drops all capabilities after binding to a privileged port. Verify: (a) the port is bound successfully, (b) after the drop, getpcaps <pid> shows empty capability sets, (c) the process still serves connections on the bound port.
- Examine the Docker default capability set. Identify three capabilities that Docker keeps enabled by default. For each: explain the legitimate use case and the potential attack vector.
- Configure a systemd unit file for a Python HTTP server that uses AmbientCapabilities=CAP_NET_BIND_SERVICE to bind on port 80 without root. Verify the service starts, binds port 80, and the process runs as a non-root user.
References
- Linux man page capabilities(7): https://man7.org/linux/man-pages/man7/capabilities.7.html
- Hallyn, S. "POSIX.1e Capabilities." Linux kernel documentation. https://www.kernel.org/doc/html/latest/userspace-api/capabilities.html
- Docker security documentation: https://docs.docker.com/engine/security/
- Kubernetes Pod Security: https://kubernetes.io/docs/concepts/security/pod-security-standards/
- CVE-2019-5736: https://nvd.nist.gov/vuln/detail/CVE-2019-5736
- CVE-2021-3493: https://nvd.nist.gov/vuln/detail/CVE-2021-3493
- libcap: https://git.kernel.org/pub/scm/libs/libcap/libcap.git
- Landlock documentation: https://landlock.io/