03 — Seccomp

Technical Overview

Seccomp (Secure Computing Mode) is a Linux kernel feature that restricts the system calls a process can make. By reducing the available syscall surface, seccomp dramatically limits what an attacker can do after achieving code execution in a sandboxed process—even if they escape the application's own security controls.

The insight is simple: most application code needs only a small subset of the roughly 350 Linux syscalls. A web content renderer needs read, write, send, recv, and a few others. It has no legitimate reason to call ptrace, mount, kexec_load, or create_module. If a filter blocks those calls and an attacker later exploits the renderer, the attacker cannot use them to escape the sandbox.

seccomp-BPF extends the original strict mode with an arbitrary BPF filter program, enabling fine-grained policy including argument inspection.

Prerequisites

System call mechanics (syscall numbers, calling convention, the syscall instruction).
BPF (Berkeley Packet Filter) program basics.
Linux process model (fork, execve, privilege).
prctl(2) and seccomp(2) system call knowledge.

Core Content

Seccomp Strict Mode

The original seccomp (Linux 2.6.12, 2005) is extremely restrictive. Once a process enables strict mode with PR_SET_SECCOMP, only four syscalls are allowed:
- read(2): read from file descriptors opened before seccomp activation.
- write(2): write to already-open file descriptors.
- _exit(2): terminate the calling thread (the raw exit syscall, not exit_group).
- sigreturn(2): return from a signal handler.

Any other syscall results in SIGKILL. This is useful for sandboxed computation (take input, compute, write output) but impractical for most applications.

#include <sys/prctl.h>
#include <linux/seccomp.h>

prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
// From this point: only read/write/_exit/sigreturn allowed
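The strict-mode behavior can be observed safely from a parent process. The following is a minimal sketch, not a reference implementation: the function and enum names are hypothetical, and it assumes a Linux host where strict mode is available. The child enters strict mode and calls getppid, which is not on the allowlist; the parent reports how the child ended.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum strict_result {
    STRICT_UNAVAILABLE = -1,  /* fork failed or strict mode could not be enabled */
    STRICT_NOT_ENFORCED = 0,  /* child survived the forbidden syscall */
    STRICT_KILLED = 1         /* child was terminated by SIGKILL, as expected */
};

enum strict_result strict_mode_demo(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return STRICT_UNAVAILABLE;

    if (pid == 0) {
        /* Child: still unrestricted here, so libc's _exit() is safe to use. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        /* getppid is not one of the four allowed syscalls: expect SIGKILL. */
        syscall(SYS_getppid);
        /* Only reached if the kernel did not enforce strict mode. The raw
         * exit syscall is used because glibc's _exit() calls exit_group,
         * which strict mode does not allow. */
        syscall(SYS_exit, 0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return STRICT_UNAVAILABLE;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return STRICT_KILLED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return STRICT_UNAVAILABLE;
    return STRICT_NOT_ENFORCED;
}
```

On a typical Linux host the function is expected to return STRICT_KILLED; inside an environment that already restricts prctl or seccomp it may return STRICT_UNAVAILABLE instead.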

Seccomp-BPF: Filter Mode

Seccomp-BPF (Linux 3.5, 2012, Will Drewry) adds the ability to write a BPF filter program that evaluates each syscall and returns an action.

The filter receives a seccomp_data struct:

struct seccomp_data {
    int nr;                   /* system call number */
    __u32 arch;               /* AUDIT_ARCH_* */
    __u64 instruction_pointer; /* at time of syscall */
    __u64 args[6];            /* syscall arguments */
};

The filter returns one of these actions:

Return Value	Meaning
`SECCOMP_RET_ALLOW`	Allow the syscall
`SECCOMP_RET_ERRNO(n)`	Deny with errno n (e.g., `ENOSYS`); encoded as `SECCOMP_RET_ERRNO` with n in the low 16 bits
`SECCOMP_RET_TRAP`	Send SIGSYS to the calling thread (can be handled in a signal handler)
`SECCOMP_RET_TRACE`	Notify the attached ptracer (ptrace-based sandboxes)
`SECCOMP_RET_LOG`	Allow and log the syscall (Linux 4.14)
`SECCOMP_RET_KILL_THREAD`	Kill the calling thread
`SECCOMP_RET_KILL_PROCESS`	Kill the entire process, i.e. every thread in the thread group (Linux 4.14)
`SECCOMP_RET_USER_NOTIF`	Forward the syscall to a user-space supervisor that holds the notification file descriptor (Linux 5.0)
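Each return value is a single 32-bit word: the upper 16 bits select the action and the lower 16 bits carry optional data such as the errno. The sketch below shows that packing using the constants from linux/seccomp.h; the helper names are hypothetical, and SECCOMP_RET_ACTION_FULL assumes kernel headers from Linux 4.14 or later.

```c
#include <assert.h>
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* Build the value a filter returns to fail a syscall with errno `err`. */
uint32_t ret_errno(int err)
{
    /* The data field is 16 bits wide, so the errno must fit in it. */
    assert(err > 0 && (uint32_t)err <= SECCOMP_RET_DATA);
    return SECCOMP_RET_ERRNO | (uint32_t)err;
}

/* Extract the action part (upper 16 bits) of a filter return value. */
uint32_t ret_action(uint32_t ret)
{
    return ret & SECCOMP_RET_ACTION_FULL;
}

/* Extract the data part (lower 16 bits), e.g. the errno for SECCOMP_RET_ERRNO. */
uint32_t ret_data(uint32_t ret)
{
    return ret & SECCOMP_RET_DATA;
}
```

For example, ret_action(ret_errno(EPERM)) equals SECCOMP_RET_ERRNO and ret_data(ret_errno(EPERM)) equals EPERM.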

Writing a Seccomp-BPF Filter

Using libseccomp (high-level API):

#include <fcntl.h>     // O_WRONLY, O_RDWR
#include <seccomp.h>

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);  // default: kill

// Allow specific syscalls
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read),    0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write),   0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(mmap),    0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk),     0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

// Allow open only if flags do not include O_WRONLY or O_RDWR
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 1,
                 SCMP_A1(SCMP_CMP_MASKED_EQ, O_WRONLY | O_RDWR, 0));

// Apply filter
seccomp_load(ctx);
seccomp_release(ctx);

Using raw BPF macros (low-level, from linux/filter.h):

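#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>        // offsetof
#include <sys/prctl.h>
#include <sys/syscall.h>   // __NR_read, __NR_write

struct sock_filter filter[] = {
    // Kill if the syscall ABI is not the expected one (x86-64 assumed here);
    // the same syscall number means different things on other architectures
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),

    // Load syscall number
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             offsetof(struct seccomp_data, nr)),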

    // Allow read
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read,  0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Allow write
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),

    // Kill everything else
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
};

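prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);  // required unless the caller has CAP_SYS_ADMIN
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

PR_SET_NO_NEW_PRIVS is required unless the caller has CAP_SYS_ADMIN in its user namespace. Without it, an unprivileged process could install a filter and then exec a setuid program, using the filter to make that privileged program misbehave (for example by forcing one of its privilege-dropping calls to fail). With the flag set, execve can never grant more privileges, so a filter only ever constrains code that is no more privileged than the process that installed it.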
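To see how such a program is evaluated, the sketch below interprets a read/write allowlist in user space against a synthetic seccomp_data record, without installing anything. It is an illustration under stated assumptions, not the kernel's implementation: it understands only the three opcodes this filter uses, the names demo_filter, eval_filter, and decide are hypothetical, and x86-64 is assumed as the target architecture in the architecture check.

```c
#include <assert.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* Check the architecture, then allow read and write, kill on anything else. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
};

/* Run `prog` against one syscall record and return the filter's action.
 * Anything this sketch does not understand is treated as KILL_PROCESS. */
uint32_t eval_filter(const struct sock_filter *prog, size_t len,
                     const struct seccomp_data *sd)
{
    assert(prog != NULL && sd != NULL);
    uint32_t acc = 0;  /* the BPF accumulator register */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *in = &prog[pc];
        switch (in->code) {
        case BPF_LD | BPF_W | BPF_ABS:   /* acc = 32-bit word at offset k */
            if (in->k > sizeof(*sd) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const unsigned char *)sd + in->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:  /* offsets are relative to pc + 1 */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case BPF_RET | BPF_K:            /* return the constant k */
            return in->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;     /* ran off the end of the program */
}

/* Convenience wrapper: what would demo_filter decide for this syscall? */
uint32_t decide(uint32_t arch, int nr)
{
    struct seccomp_data sd = { .nr = nr, .arch = arch };
    return eval_filter(demo_filter,
                       sizeof(demo_filter) / sizeof(demo_filter[0]), &sd);
}
```

With this sketch, decide(AUDIT_ARCH_X86_64, __NR_read) yields SECCOMP_RET_ALLOW, while getpid, or any syscall presented with a different architecture value, yields SECCOMP_RET_KILL_PROCESS.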

Seccomp Decision Flow

Process executes syscall instruction
    │
    ▼
Kernel syscall entry (arch/x86/entry/common.c)
    │
    ▼
seccomp_run_filters()
    │
    ├── Is seccomp active? No → execute syscall normally
    │
    └── Yes → run BPF filter program with seccomp_data
                    │
            filter returns action:
                    │
          ┌─────────┼──────────────┐
          │         │              │
     ALLOW       ERRNO(n)    KILL_PROCESS
          │         │              │
    execute    return -n      kill entire
    syscall    to caller      process (SIGSYS)

Seccomp in Docker

Docker's default seccomp profile (moby/profiles/seccomp/default.json) is an allowlist: any syscall it does not list is denied with an error. In effect it blocks roughly 44 syscalls that have no legitimate use in containers or are known to be dangerous.

Notable blocked syscalls:
- keyctl, add_key, request_key: kernel keyring (potential privilege escalation).
- ptrace: process tracing (container escape vector); newer versions of the profile allow it again on kernels 4.8 and later.
- mount, umount2: filesystem mounting (namespace escape).
- pivot_root: change root filesystem.
- kexec_load, kexec_file_load: load new kernel.
- create_module, init_module, finit_module, delete_module: load kernel modules.
- unshare, setns: namespace manipulation.
- acct: BSD accounting.
- bdflush: obsolete filesystem flushing.

# Run container with the default seccomp profile (applied automatically, no flag needed)
docker run nginx

# Run without seccomp (dangerous)
docker run --security-opt seccomp=unconfined nginx

# Run with custom seccomp profile
docker run --security-opt seccomp=my-custom-profile.json nginx

# Inspect what profile is active
docker inspect <container_id> | jq '.[0].HostConfig.SecurityOpt'

Seccomp in Chrome

Chrome was one of the first major user-facing applications to deploy seccomp. The sandbox architecture:

Browser Process (unconfined)
    │
    ├── Renderer Process (SECCOMP_FILTER)
    │     Allowed: read, write, recv, send, mmap, futex, clock_gettime, ...
    │     Blocked: socket (new sockets), open (new file opens), fork, exec, ...
    │
    ├── GPU Process (SECCOMP_FILTER, less restrictive)
    │
    └── Utility Process (SECCOMP_FILTER)

Chrome's renderer sandbox (since Chrome 23, 2012):
- Uses seccomp-BPF on Linux.
- Also uses Linux namespaces (PID, net, user) for additional isolation.
- Handles untrusted web content in the renderer; if the renderer is compromised, the seccomp filter prevents the attacker from making arbitrary syscalls to escape to the OS.

The policy allows rendering operations (mmap, read/write to pre-opened FDs, IPC to the browser process) but blocks all socket creation, exec, and file opening.

Seccomp Overhead

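The BPF filter runs on every syscall entry. BPF is JIT-compiled in the kernel (since Linux 3.x), so filter execution is cheap. Rough figures, which vary with hardware and kernel version:
- Simple filters (allow/deny by syscall number): ~10–50 ns per syscall.
- Complex filters (argument matching): ~50–200 ns per syscall.

For a web server making one million syscalls per second, 10–50 ns per call adds up to 10–50 ms of CPU time per second: about 1–5% of a single core, and below 1% once the load is spread over several cores. For most applications the cost is negligible.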
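The estimate is plain arithmetic: syscall rate times per-call filter cost, divided by the number of cores sharing the load. A minimal sketch, with a hypothetical function name and the per-call cost taken as an input rather than measured:

```c
#include <assert.h>

/* Fraction of total CPU capacity spent executing the seccomp filter.
 * syscalls_per_sec: aggregate syscall rate of the workload
 * filter_ns:        assumed cost of one filter run, in nanoseconds
 * cores:            number of cores the workload is spread across */
double seccomp_cpu_fraction(double syscalls_per_sec, double filter_ns, int cores)
{
    assert(cores > 0);
    /* CPU-seconds consumed by the filter during one wall-clock second. */
    double busy_seconds = syscalls_per_sec * filter_ns * 1e-9;
    return busy_seconds / cores;
}
```

With one million syscalls per second at 50 ns each, the result is about 0.05 on one core (5%) and about 0.006 when the same load is spread across eight cores.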

# Measure seccomp overhead
perf stat -e syscalls:sys_enter_read ./my_program
# Compare timing with and without seccomp filter

Seccomp for Security Containers

Kubernetes exposes seccomp through the seccompProfile field of a Pod's or container's securityContext:

# Pod spec with seccomp
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault  # use container runtime's default profile
  containers:
  - name: my-container
    image: myapp:latest
    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/my-app.json  # custom profile

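Kubernetes 1.19+: the seccompProfile field is generally available. Pods that do not set it still run Unconfined unless the kubelet is started with --seccomp-default (stable since Kubernetes 1.27), which makes RuntimeDefault the default. Unconfined (no seccomp) remains available but is discouraged.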

Seccomp Bypass Techniques

New syscalls added after filter written. Seccomp filters match on syscall number. When a new syscall is added to the kernel (e.g., memfd_create in 3.17, io_uring_setup in 5.1), a denylist filter written before that kernel version will not block it. An attacker with code execution can call the new syscall if they know its number.

Mitigation: write the filter as an allowlist whose default action for unlisted syscalls is an error such as SECCOMP_RET_ERRNO(ENOSYS) rather than SECCOMP_RET_ALLOW. Docker's default profile takes this approach: it lists the allowed syscalls and returns an error for everything else.
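A default-deny filter of this shape can be generated from a list of allowed syscall numbers. The following is a minimal sketch with a hypothetical helper name; it omits the architecture check for brevity, which a real filter should perform before looking at the syscall number.

```c
#include <assert.h>
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Emit: load nr; for each allowed nr, "if equal return ALLOW";
 * finally return default_action for everything not listed.
 * Returns the number of instructions written, or 0 if `cap` is too small. */
size_t build_allowlist(struct sock_filter *out, size_t cap,
                       const int *allowed, size_t n, uint32_t default_action)
{
    assert(out != NULL && (allowed != NULL || n == 0));
    size_t need = 1 + 2 * n + 1;
    if (cap < need)
        return 0;

    size_t i = 0;
    out[i++] = (struct sock_filter)BPF_STMT(
        BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr));
    for (size_t j = 0; j < n; j++) {
        /* Match: fall through to ALLOW. No match: skip over it. */
        out[i++] = (struct sock_filter)BPF_JUMP(
            BPF_JMP | BPF_JEQ | BPF_K, (uint32_t)allowed[j], 0, 1);
        out[i++] = (struct sock_filter)BPF_STMT(
            BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    }
    /* Unlisted syscalls, including ones added by future kernels, land here. */
    out[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, default_action);
    return i;
}
```

Passing SECCOMP_RET_ERRNO | ENOSYS as default_action makes unknown syscalls look unimplemented to the caller, which lets well-behaved programs fall back to older interfaces.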

ioctl with unchecked subcommands. Many seccomp policies allow ioctl unconditionally (because blocking it would break many programs). But ioctl with a specific cmd argument can do almost anything: TIOCSTI injects characters into a terminal, KVM_CREATE_VM creates a VM, PERF_EVENT_IOC_* manipulates perf events.

Properly filtering ioctl: use seccomp argument matching to restrict allowed cmd values:

// Only allow ioctl with FIONREAD (check buffer size)
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(ioctl), 1,
                 SCMP_A1(SCMP_CMP_EQ, FIONREAD));
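For comparison, the same argument check can be written in raw BPF. This is a sketch under stated assumptions: the array name is hypothetical, the target is little-endian (so the low 32 bits of args[1] sit at the start of the 64-bit slot), the architecture check is omitted, and every syscall other than ioctl with FIONREAD is denied with EPERM, which is only useful as an illustration. BPF loads are 32 bits wide, so only the low half of the argument is compared; that is the half the kernel's ioctl entry point uses for cmd.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>

const struct sock_filter ioctl_only_fionread[] = {
    /* [0] acc = syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not ioctl: jump to the deny at [5] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 3),
    /* [2] acc = low 32 bits of args[1], the ioctl cmd */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    /* [3] cmd == FIONREAD: fall through to allow; otherwise skip to deny */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FIONREAD, 0, 1),
    /* [4] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    /* [5] deny with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
};
```

In a real policy this fragment would be one branch of a larger allowlist rather than the whole filter.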

io_uring as seccomp bypass. io_uring (Linux 5.1) operations include IORING_OP_OPENAT, IORING_OP_SOCKET, IORING_OP_CONNECT—syscall-equivalent operations performed inside the kernel without triggering traditional syscall seccomp filters (they go through io_uring_enter, not the individual syscall entries). This is a known design limitation; kernel work is ongoing to enforce seccomp on io_uring operations.

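According to the Google security blog post listed in the references, ChromeOS disables io_uring and Android's seccomp filter makes it unreachable to apps, citing the number of exploitable bugs found in io_uring. The filter bypass described here is a further reason to block it in sandboxes.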
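A quick way to check whether a sandbox policy leaves io_uring reachable is to probe io_uring_setup with deliberately invalid arguments and classify the error. The sketch below is an assumption-laden illustration: the names are hypothetical, the fallback syscall number 425 is the value used on most architectures, and the errno interpretation is a heuristic (a filter can return any errno it likes).

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425  /* assumed fallback for older kernel headers */
#endif

enum uring_status {
    URING_REACHABLE,    /* the kernel's io_uring code ran and rejected the args */
    URING_BLOCKED,      /* denied by policy (EPERM or EACCES) */
    URING_UNSUPPORTED,  /* ENOSYS: old kernel, or a filter returning ENOSYS */
    URING_UNKNOWN       /* some other errno */
};

enum uring_status probe_io_uring(void)
{
    /* A NULL params pointer is invalid, so no ring should ever be created. */
    long rc = syscall(__NR_io_uring_setup, 1, NULL);
    if (rc >= 0) {               /* not expected; clean up if it happens */
        close((int)rc);
        return URING_REACHABLE;
    }
    switch (errno) {
    case EFAULT:
    case EINVAL:
        return URING_REACHABLE;
    case EPERM:
    case EACCES:
        return URING_BLOCKED;
    case ENOSYS:
        return URING_UNSUPPORTED;
    default:
        return URING_UNKNOWN;
    }
}
```

If the probe reports URING_REACHABLE inside a sandbox whose filter is meant to block file opens or socket creation, the policy likely needs to deny io_uring_setup, io_uring_enter, and io_uring_register as well.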

Historical Context

The original seccomp (Andrea Arcangeli, 2005) was designed for computational grids—untrusted code that reads input, computes, and writes output. Its strict four-syscall model was too limited for real applications.

seccomp-BPF was designed at Google for Chrome OS (Will Drewry, 2012). The key insight was reusing the BPF virtual machine (originally designed for network packet filtering) to express syscall filtering policy. This avoided inventing a new policy language and leveraged existing kernel infrastructure.

Linux namespaces (PID, net, mount, user) combined with seccomp-BPF became the basis for mainstream container isolation (Docker, containerd). gVisor (Google, 2018) takes this further: a user-space kernel written in Go that reimplements much of the Linux syscall interface for containerized applications, and that itself runs under a seccomp filter restricting its access to the host kernel.

Production Examples

Case: Chrome renderer exploit contained by seccomp. In 2022, CVE-2022-1364 was a type confusion vulnerability in Chrome's V8 JavaScript engine allowing code execution in the renderer process. The renderer's seccomp-BPF policy prevented the attacker from escaping the renderer sandbox despite achieving JavaScript-level code execution. To fully escape, the attacker needed a separate sandbox escape vulnerability (a separate CVE). seccomp reduced a remote code execution to a renderer compromise—a significant containment. Reference: Google Project Zero blog.

Case: Docker container with strace blocked. A developer tried to run strace inside a Docker container for debugging. It failed with "Operation not permitted." Root cause: the default seccomp profile in use blocked ptrace (which strace requires). This is the behavior of Docker releases before 19.03 and of hosts with kernels older than 4.8; newer defaults allow ptrace. Fix: docker run --cap-add SYS_PTRACE --security-opt seccomp=unconfined or, preferably, a custom seccomp profile allowing ptrace. Lesson: seccomp is not transparently compatible with all debugging tools.

Debugging Notes

# Check if a process has seccomp enabled
cat /proc/<pid>/status | grep Seccomp
# Seccomp: 0 (disabled), 1 (strict), 2 (filter)

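# Find which syscall was denied: look for EPERM results or a SIGSYS signal
strace -f ./program 2>&1 | grep -E 'EPERM|SIGSYS'

# Inspect the filter attached to a process: there is no /proc file for this.
# Use ptrace(PTRACE_SECCOMP_GET_FILTER) (Linux 4.4+, requires CAP_SYS_ADMIN).
# The number of attached filters is visible in /proc (Linux 5.9+):
grep Seccomp_filters /proc/<pid>/status

# Record the syscalls a program makes, as input for a seccomp profile
strace -f -e trace=all -o strace.out ./my_program 2>&1
# Then convert the syscall list into a profile (Docker itself does not ship a generator)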

# oci-seccomp-bpf-hook: auto-generate seccomp profiles from container runs
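The same Seccomp field can be read programmatically, which is handy for a startup self-check. A minimal sketch, with a hypothetical function name, that parses a status file such as /proc/self/status:

```c
#include <assert.h>
#include <stdio.h>

/* Return the Seccomp mode recorded in a /proc/<pid>/status file:
 * 0 (disabled), 1 (strict), 2 (filter), or -1 if it cannot be read. */
int seccomp_mode_of(const char *status_path)
{
    assert(status_path != NULL);
    FILE *f = fopen(status_path, "r");
    if (f == NULL)
        return -1;

    char line[256];
    int mode = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        /* Matches "Seccomp:<whitespace>N" but not "Seccomp_filters:". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(f);
    return mode;
}
```

A sandboxed service could call seccomp_mode_of("/proc/self/status") after installing its filter and refuse to continue unless the result is 2.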

Security Implications

seccomp reduces the kernel attack surface available to a compromised process. The amount of reachable kernel code grows with the number of accessible syscalls. Restricting a process to 50 of roughly 350 syscalls removes about 85% of the syscall entry points (300 of 350), although risk is not spread evenly across syscalls.

seccomp does not protect against:
- Bugs within the allowed syscalls (e.g., a bug in read() itself).
- CPU side-channel attacks (Spectre, Meltdown).
- Attacks via allowed syscalls with dangerous arguments (see ioctl bypass above).
- Logic errors in the filter itself: the kernel only checks that the BPF program is well formed, not that the policy is the one intended.

Defense: combine seccomp with namespaces, capabilities, SELinux/AppArmor, and ASLR+NX for comprehensive defense-in-depth.

Performance Implications

BPF JIT compilation (Linux 3.x+) makes seccomp-BPF fast enough for production. The overhead is proportional to filter complexity. For most application profiles (50–100 syscall allow rules), overhead is < 1% even on syscall-heavy workloads.

Disabling BPF JIT (echo 0 > /proc/sys/net/core/bpf_jit_enable) increases seccomp overhead to ~200 ns per syscall (interpreted BPF). Always ensure BPF JIT is enabled in production.

Failure Modes and Real Incidents

Firefox seccomp-protect breakage (2017). Mozilla deployed seccomp-BPF in Firefox 54 on Linux. Several third-party media plugins called syscalls blocked by the filter, causing crashes on startup. The filter was initially too aggressive; Mozilla had to iteratively relax it based on crash reports. Lesson: generating minimal-privilege seccomp profiles requires extensive workload analysis, not just blocking "obviously dangerous" syscalls.

Kubernetes cluster escape via unconstrained seccomp (2022). A security audit found that many Kubernetes workloads ran without a seccomp profile (Unconfined). One compromised pod was able to call keyctl to access the node's Kubernetes service account tokens stored in the kernel keyring, enabling cluster-wide privilege escalation. The fix: enforce RuntimeDefault seccomp as a PodSecurityAdmission policy.

Modern Usage

seccomp is now baseline hygiene for containerized workloads:
- Docker: the default profile is applied unless --security-opt seccomp=unconfined is given.
- Kubernetes 1.27+: RuntimeDefault becomes the default for all Pods on nodes whose kubelet runs with --seccomp-default; the Restricted Pod Security Standard additionally requires every Pod to set RuntimeDefault or Localhost explicitly.
- Go programs: the github.com/seccomp/libseccomp-golang library wraps libseccomp for Go applications.
- Rust: the seccompiler crate provides safe BPF filter generation.

seccomp notify (SECCOMP_RET_USER_NOTIF, Linux 5.0): allows a supervisor process that holds the notification file descriptor to intercept filtered syscalls and handle them on behalf of the sandboxed process. Container managers such as LXD use it to service calls like mount and mknod that an unprivileged container cannot perform directly.

Future Directions

seccomp + io_uring enforcement: work is ongoing to make seccomp filters apply to io_uring operations, closing the bypass vector.
seccomp policy synthesis: tools that auto-generate minimal seccomp profiles from observed syscall traces (oci-seccomp-bpf-hook, Falco, Sysdig).
eBPF LSM + seccomp: the BPF_LSM mode (Linux 5.7) allows eBPF programs attached to LSM hooks to implement richer policies than seccomp's syscall-level filter—including context-aware decisions.

Exercises

Write a C program that installs a seccomp-BPF filter allowing only read, write, exit_group, and rt_sigreturn. Verify it by attempting to call getpid() after the filter—it should be killed.
Examine Docker's default seccomp profile (/etc/docker/seccomp.json or from the Moby repository). Identify 5 blocked syscalls and for each: explain what the syscall does, and what attack it prevents.
Use strace -c ./myprogram to record the syscall profile of a real application. Identify the 10 most-called syscalls. Design a minimal seccomp allowlist. What syscalls would you add for safety margin?
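Attempt to run strace inside a Docker container with the default seccomp profile. Document the result (expect an error only on Docker releases before 19.03 or kernels older than 4.8, where the default profile blocks ptrace). Then create a custom seccomp profile that adds ptrace to the allowlist and verify strace works.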
Read the io_uring seccomp bypass issue. Write a test program that uses io_uring_enter with IORING_OP_OPENAT to open a file after a seccomp filter blocking openat is installed. Verify that the io_uring path bypasses the filter (on a kernel without the fix).

References

Drewry, W. "Seccomp filter." Linux kernel documentation. https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html
Edge, J. "A seccomp overview." LWN.net, 2015. https://lwn.net/Articles/656307/
Krauss, J. "Making seccomp filters more expressive with BPF." Linux Plumbers Conference, 2019.
Docker seccomp profile: https://github.com/moby/moby/blob/master/profiles/seccomp/default.json
Chrome sandbox documentation: https://chromium.googlesource.com/chromium/src/+/main/docs/linux/sandboxing.md
libseccomp: https://github.com/seccomp/libseccomp
gVisor: https://gvisor.dev/
io_uring security concern: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-regarding-io.html