System Calls

Technical Overview

A system call (syscall) is the mechanism by which a user-space program requests a service from the kernel. It is the sole defined interface between user space (ring 3) and kernel space (ring 0). Every file operation, network connection, process creation, and memory allocation ultimately traverses this interface. Understanding system calls means understanding exactly how user-space software gets anything done, because without them, a process cannot interact with the world outside its own private memory.

On x86-64 Linux, there are approximately 400 system calls defined in arch/x86/entry/syscalls/syscall_64.tbl. They range from trivial (getpid, which just reads a field in the current task struct) to complex (io_uring_enter, which can submit and complete thousands of I/O operations). The syscall number, a small integer, uniquely identifies which kernel function to invoke. A convention for passing arguments and returning values is defined in the ABI specification and followed by both the C library and the kernel.

Prerequisites

02-user-space-vs-kernel-space.md: why the kernel/user boundary exists
03-cpu-privilege-rings.md: how ring transitions work
Basic C programming knowledge
Familiarity with the concept of a function call and return value

Core Content

The System Call Table

The Linux kernel maintains a table of function pointers — sys_call_table[] — indexed by syscall number. On x86-64, it is defined in arch/x86/entry/syscall_64.c and populated from arch/x86/entry/syscalls/syscall_64.tbl.

Partial syscall_64.tbl (x86-64):
  Number  ABI     Name                Entry Point
  0       common  read                sys_read
  1       common  write               sys_write
  2       common  open                sys_open
  3       common  close               sys_close
  9       common  mmap                sys_mmap
  12      common  brk                 sys_brk
  56      common  clone               sys_clone
  59      common  execve              sys_execve
  60      common  exit                sys_exit
  61      common  wait4               sys_wait4
  62      common  kill                sys_kill
  231     common  exit_group          sys_exit_group
  425     common  io_uring_setup      sys_io_uring_setup
  426     common  io_uring_enter      sys_io_uring_enter

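On Linux 6.6, the highest x86-64 syscall number is in the 450s. The numbering is not contiguous (335 through 423 are unused on x86-64), so the table holds roughly 400 entries, counting the x32-only ones. Different architectures have different tables and different numbering, which is why syscall(SYS_write, ...) uses architecture-specific constants from <sys/syscall.h>.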

How a Syscall Works on x86-64: The Complete Flow

User Space                          Kernel Space
─────────────────                   ─────────────────────────────────────

glibc write(fd, buf, len)
  │
  │  Sets up registers:
  │  rax = 1  (SYS_write)
  │  rdi = fd
  │  rsi = buf
  │  rdx = len
  │
  │  Executes SYSCALL instruction
  │  ─────────────────────────────────────────────────────────►
  │                                  CPU loads RIP from LSTAR MSR
  │                                  → entry_SYSCALL_64
  │                                  CPU clears IF (interrupt flag)
  │                                  RSP is left untouched by the CPU;
  │                                  the entry code below switches stacks
  │                                  
  │                                  entry_SYSCALL_64:
  │                                    swapgs          ; switch GS base
  │                                    mov %rsp, %gs:scratch_space   ; stash user RSP
  │                                    mov %gs:top_of_stack, %rsp    ; per-CPU kernel stack (name simplified)
  │                                    pushq %r11      ; old RFLAGS
  │                                    pushq %rcx      ; old RIP (return addr)
  │                                    PUSH_REGS       ; save all GPRs
  │                                    mov %rsp, %rdi  ; pt_regs *regs
  │                                    call do_syscall_64()
  │                                  
  │                                  do_syscall_64():
  │                                    nr = regs->ax   ; syscall number
  │                                    if (nr < NR_syscalls):
  │                                      ret = sys_call_table[nr](regs)
  │                                    regs->ax = ret  ; store return value
  │                                  
  │                                  (return path)
  │                                    POP_REGS        ; restore registers
  │                                    swapgs          ; restore GS base
  │                                    SYSRET          ; return to user, CPL=3
  │  ◄─────────────────────────────────────────────────────────
  │
  │  Returns from glibc wrapper
  │  Return value in rax
  │  errno set if rax is in [-4095, -1]

Key mechanism: The SYSCALL instruction is not a general call. It is a special fast-path instruction that:

1. Saves RIP (return address) into RCX
2. Saves RFLAGS into R11
3. Masks RFLAGS using SFMASK MSR (clears IF, preventing interrupts during entry)
4. Loads a new RIP from LSTAR MSR (set at boot to entry_SYSCALL_64)
5. Changes CPL from 3 to 0 (by loading CS/SS from STAR MSR)
6. Does NOT switch the stack; the kernel switches it early in entry_SYSCALL_64

The LSTAR MSR is configured at boot in arch/x86/kernel/cpu/common.c:

wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

Syscall Arguments: Register Convention

On x86-64 Linux (defined in the x86-64 ABI supplement):

Register	Role
`rax`	Syscall number (input), return value (output)
`rdi`	1st argument
`rsi`	2nd argument
`rdx`	3rd argument
`r10`	4th argument (note: `r10`, not `rcx` as in C calling convention)
`r8`	5th argument
`r9`	6th argument

A maximum of 6 arguments are passed in registers. Syscalls that need more data use a struct pointer as one of the arguments (e.g., io_uring_setup passes a struct io_uring_params * as the second argument).

Why r10 instead of rcx? The SYSCALL instruction saves the old RIP (return address) into rcx and the old RFLAGS into r11, destroying both. The kernel entry code therefore cannot receive the 4th argument in rcx. The ABI uses r10 instead, and glibc's syscall wrappers copy rcx into r10 (mov %rcx, %r10 in AT&T syntax) before executing SYSCALL.

Return Value and errno

The kernel places the return value in rax. For success, this is the return value (e.g., number of bytes written). For errors, the kernel returns -errno (a negative error code, such as -EBADF = -9). glibc's syscall wrappers check if the return value is in the range [-4095, -1], and if so, set errno = -ret and return -1.

// Inside glibc's write() wrapper (simplified).
// raw_syscall3() is a placeholder for the inline SYSCALL sequence, not a real
// glibc function. It returns the kernel's raw value from rax.
long ret = raw_syscall3(SYS_write, fd, buf, count);
if (ret < 0 && ret >= -4095) {
    errno = -ret;   // e.g., ret=-9 → errno=EBADF
    return -1;
}
return ret;

The errno global is thread-local (TLS) in modern glibc — each thread has its own errno.

VDSO: Virtual Dynamic Shared Object

The vDSO is a small shared library (typically one or two pages; the mapping below is 8 KiB) that the kernel maps into every process's address space. It implements a handful of system calls entirely in user space by reading kernel-maintained memory pages, which avoids the SYSCALL trap overhead.

Process virtual memory (from /proc/self/maps):
  7ffff7ffd000-7ffff7fff000 r-xp  [vdso]
  ffffffffff600000-ffffffffff601000 [vsyscall] (legacy)

Syscalls implemented in the vDSO (Linux x86-64):

clock_gettime(CLOCK_REALTIME, ...): reads from a struct vdso_data page the kernel keeps updated
clock_gettime(CLOCK_MONOTONIC, ...): same page
gettimeofday(): reads from the same page
getcpu(): obtains the current CPU number without entering the kernel (via RDPID or a per-CPU segment limit on current kernels)
clock_getres(): returns static data

How vDSO clock_gettime works:

Kernel maintains:
  struct vdso_data {
      u64 seq;         // seqcount for consistency
      u64 cycle_last;  // last TSC value when updated
      u64 mult, shift; // TSC → nanoseconds conversion
      u64 basetime;    // Wall clock base
      ...
  } at a fixed kernel virtual address, also mapped r/o into user space

vDSO clock_gettime():
  1. Read seq (verify even = no update in progress)
  2. Read TSC via RDTSC
  3. Compute: ns = (((TSC - cycle_last) * mult) >> shift) + basetime
  4. Read seq again (verify unchanged)
  5. If seq changed during read: retry
  6. Return ns (no syscall trap!)

Cost: ~5ns for clock_gettime via vDSO vs. ~100–200ns via actual syscall.

vsyscall: The older mechanism, now largely deprecated. The vsyscall page sits at a fixed address (0xffffffffff600000) in the 64-bit virtual address space. It is now implemented as a trap-and-emulate mechanism: the page is mapped but not executable, so jumping into it raises a #PF that the kernel handles by emulating the call. This keeps a page at a predictable address from serving as a source of ROP gadgets.

seccomp Syscall Filtering

seccomp (Secure Computing Mode, kernel/seccomp.c) allows a process to restrict which syscalls it or its children may invoke. It operates via a BPF (Berkeley Packet Filter) program that runs for every syscall entry:

// Simplified seccomp filter example
struct sock_fprog prog = {
    .len = ...,
    .filter = (struct sock_filter[]) {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    }
};
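// An unprivileged process must set no_new_privs before installing a filter
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);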

The BPF program receives a struct seccomp_data containing the syscall number and arguments. It returns an action: SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_ERRNO(e), SECCOMP_RET_TRAP (sends SIGSYS), or SECCOMP_RET_TRACE (notifies a tracer).

Used by: Chrome (every renderer process has a strict seccomp filter), Firefox, systemd (for sandboxed services), Docker (default seccomp profile blocks ~44 syscalls), OpenSSH.

strace Internals: ptrace

strace uses the ptrace(2) syscall to intercept every system call made by the traced process. The mechanism:

strace forks and calls ptrace(PTRACE_TRACEME) in the child before exec
The traced process stops at every syscall entry and exit
strace calls ptrace(PTRACE_SYSCALL) to continue execution until the next syscall event
At each stop, strace reads struct user_regs_struct via ptrace(PTRACE_GETREGS) to get the syscall number (in orig_rax) and arguments

The overhead is significant: every syscall becomes two ptrace stops (entry + exit), plus the overhead of the tracer process scheduling and context switching. strace adds 10–100x overhead to syscall-heavy programs. perf trace achieves similar output with lower overhead using tracepoints via perf event ring buffers.

Syscall Overhead Measurement

# Method 1: Count read() entries system-wide for one second with perf stat
perf stat -e "syscalls:sys_enter_read" -a sleep 1

# Method 2: Benchmark getpid (minimal syscall)
cat << 'EOF' > bench_syscall.c
#define _GNU_SOURCE         /* for syscall() */
#include <unistd.h>
#include <sys/syscall.h>    /* SYS_getpid */
#include <time.h>
#include <stdio.h>
int main(void) {
    struct timespec t1, t2;
    int N = 10000000;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    /* syscall() forces a real kernel entry on every iteration */
    for (int i = 0; i < N; i++) syscall(SYS_getpid);
    clock_gettime(CLOCK_MONOTONIC, &t2);
    long ns = (t2.tv_sec - t1.tv_sec) * 1000000000L + (t2.tv_nsec - t1.tv_nsec);
    printf("%.1f ns per syscall\n", (double)ns / N);
    return 0;
}
EOF
gcc -O2 -o bench_syscall bench_syscall.c && ./bench_syscall
# Typical output: 90-200 ns per syscall (depending on KPTI, Spectre mitigations)

Adding a New Syscall (Theoretical Walk-Through)

Linux uses the SYSCALL_DEFINE macro family to define syscalls. A new syscall taking two arguments would be defined as:

// In an appropriate kernel source file (e.g., kernel/mymodule.c)
#include <linux/syscalls.h>

SYSCALL_DEFINE2(my_syscall, int, arg1, unsigned long, arg2)
{
    if (arg1 < 0)
        return -EINVAL;
    // ... implementation ...
    return 0;
}

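Conceptually, this macro expands to:

// Generates: asmlinkage long sys_my_syscall(int arg1, unsigned long arg2)
// (on x86-64 the emitted symbol is a wrapper, __x64_sys_my_syscall(const struct pt_regs *regs),
// that unpacks the arguments from the saved registers)
// Plus metadata used by syscall tracing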

Then in arch/x86/entry/syscalls/syscall_64.tbl, add an entry using the next unused number in your tree (462 here is illustrative):

462   common  my_syscall  sys_my_syscall

And in include/linux/syscalls.h:

asmlinkage long sys_my_syscall(int arg1, unsigned long arg2);

In practice, new syscalls are rare and scrutinized heavily. The preferred modern approach for new kernel-user interfaces is either extending an existing syscall (via new flags), using ioctl, using netlink, or designing an io_uring-based interface.

Historical Context

The concept of a system call interface was formalized in Unix v6 (1975): 57 system calls, documented in the Unix Programmer's Manual. Bell Labs' decision to define a small, stable set of syscalls and implement them in C (rather than assembly) was revolutionary: it made the OS portable and the interface learnable.

The original mechanism for entering kernel mode on x86 was INT 0x80 (software interrupt vector 80h, used by Linux from the beginning through the 32-bit era). INT 0x80 was slow (~300 cycles) because it looked up the handler in the IDT, switched stacks, and saved/restored many registers through the full interrupt path.

Intel introduced SYSENTER/SYSEXIT in Pentium II (1997) as a faster syscall mechanism for 32-bit code. AMD introduced SYSCALL/SYSRET for 64-bit, which Intel also adopted. Linux 64-bit uses SYSCALL exclusively. Modern 32-bit Linux also uses SYSENTER when available (via the vDSO's __kernel_vsyscall).

The syscall table has grown from 57 calls (Unix v6) to roughly 400 entries on x86-64 (Linux 6.6), reflecting 50 years of OS feature growth. Each syscall represents a point where user-space software needed to ask the kernel for something it couldn't do itself.

Production Examples

PostgreSQL and system calls: A PostgreSQL query executing a sequential scan of a large table will issue millions of read(2) or pread(2) syscalls when reading from disk. Switching to io_uring (available in PostgreSQL experimental patches) batches these into ring buffer submissions, reducing syscall overhead by 60–80% for I/O-bound workloads at high concurrency.

Nginx and sendfile: Nginx uses sendfile(2) to serve static files. This syscall transfers data directly from the page cache to a socket buffer without copying it to user space. The sendfile path in the kernel (do_sendfile() in fs/read_write.c) is a zero-copy path: the kernel reads file data from the page cache and hands it to the socket layer without an intermediate user-space buffer. This is one reason Nginx can serve tens of thousands of requests per second with low CPU usage.

Chrome's seccomp sandbox: Every Chrome renderer process runs under a strict seccomp-BPF filter that allows approximately 70 syscalls out of the several hundred available. An attempt to call a blocked syscall raises SIGSYS in the offending thread. This limits the damage a renderer exploit can do: even with arbitrary code execution in the renderer, the attacker cannot call execve, socket, or ptrace.

Debugging Notes

# Count syscalls for a command
strace -c -e trace=all ls /tmp

# Show syscalls with timestamps and arguments
strace -T -tt -e trace=network curl https://example.com

# Attach to a running process
strace -p $(pgrep nginx | head -1)

# Perf-based syscall tracing (lower overhead); --duration 5 shows only calls that took longer than 5 ms
perf trace -p <pid> --duration 5

# eBPF-based: show syscall counts system-wide, exiting after 10 seconds
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); } interval:s:10 { exit(); }'

# Show blocked syscalls in a seccomp-filtered process
# (look for SIGSYS in strace output or auditd logs)
ausearch -m SECCOMP -ts today

Slow syscall diagnosis:

# Show syscalls taking >1ms, excluding calls that normally block
perf trace --max-stack 5 -e '!futex,epoll_wait,poll,select,clock_nanosleep' \
  --duration 1 -p <pid>

Security Implications

The syscall interface is the kernel's primary attack surface from user space. Security hardening focuses on:

seccomp: restricts the set of syscalls available, reducing attack surface
Argument validation: every syscall must validate all arguments (pointer validity, bounds, permissions) before acting. Validation failures and logic errors just behind the syscall boundary have led to privilege escalation (Dirty COW and Dirty Pipe are examples of the latter)
KASLR: randomizes the kernel's virtual address layout, making it harder to know where sys_call_table is (relevant for ROP-based kernel exploits)
Syscall auditing: Linux Security Module hooks and the audit subsystem (kernel/audit.c) log syscalls for auditd, used in security-sensitive environments (PCI DSS, HIPAA compliance)
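SECCOMP_FILTER_FLAG_NEW_LISTENER: returns a notification file descriptor through which a supervisor process can inspect a trapped syscall and answer it on the target's behalf (emulate it, fail it, or let it continue). Container runtimes use this to emulate privileged operations. The supervisor cannot rewrite arguments in place, and because it reads the target's memory after the fact, seccomp_unotify(2) cautions against relying on it alone for security policy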

Performance Implications

Syscall cost breakdown (approximately, on a modern x86-64 with KPTI):
Register save/restore: ~20 cycles
swapgs: ~10 cycles
CR3 switch (KPTI): ~50 cycles with PCID, ~200 without
Spectre v2 mitigation (IBRS/STIBP): 0–200 cycles depending on microcode/CPU
Actual syscall work: varies (getpid: ~5 cycles; write to disk: millions)
Total round trip (empty syscall): 100–500 ns
Eliminating syscalls: vDSO, io_uring, DPDK, RDMA all reduce or eliminate syscalls for hot paths
Syscall heavy vs. compute heavy: a process spending >20% of CPU time in syscalls is syscall-limited. Optimization involves batching, async APIs, or in-kernel processing (eBPF)

Failure Modes and Real Incidents

CVE-2022-0847 "Dirty Pipe": A bug in splice() and the pipe buffer system (fs/pipe.c) allowed user space to call splice() followed by write() in a way that overwrote pages in the page cache for read-only files — including SUID binaries like /usr/bin/passwd. Root cause: a splice path initialized pipe buffer flags incorrectly, allowing a subsequent write to pollute a page cache entry. The flaw was in the syscall implementation's interaction between splice, write, and copy-on-write semantics.

getdents() buffer size bug (historical): Early implementations of filesystem directory scanning via getdents(2) had edge cases where a tiny buffer size caused infinite loops or incorrect cursor advancement in some filesystem drivers. Programs using very small buffer sizes in getdents calls (some older find(1) implementations) triggered these bugs.

ptrace() privilege escalation (CVE-2019-13272): ptrace_link() in kernel/ptrace.c mishandled recording the credentials of a process that requests a ptrace relationship via PTRACE_TRACEME. A local user could obtain root in scenarios where a parent process drops privileges and then calls execve. Fixed in Linux 5.1.17.

Modern Usage

io_uring (Linux 5.1+): The most significant change to the Linux syscall interface in decades. Instead of one syscall per I/O operation, io_uring uses shared ring buffers between user space and kernel. A single io_uring_enter(2) call can submit and wait for thousands of operations. The kernel can also operate in IORING_SETUP_SQPOLL mode where a kernel thread polls the submission queue continuously, eliminating even the io_uring_enter syscall.

eBPF syscall (bpf(2)): A single syscall with a cmd argument selects one of ~50 operations: load a BPF program, create a map, attach a program to a hook, query program info. The BPF program, once loaded and verified, runs in the kernel without further syscalls. This is the mechanism for deploying custom kernel logic without writing kernel modules.

pidfd: Modern Linux process management uses file descriptors rather than PIDs. pidfd_open(2), pidfd_send_signal(2), clone3(2) with CLONE_PIDFD — these provide race-free process management, since a pidfd refers to a specific process identity even after the PID is reused.

Future Directions

Landlock LSM: A user-space-accessible LSM (merged in Linux 5.13) that allows processes to restrict their own filesystem access. Built on top of the syscall interface (using landlock_create_ruleset, landlock_add_rule, landlock_restrict_self syscalls).
io_uring expansion: io_uring operations continue to expand — IORING_OP_SOCKET, IORING_OP_CONNECT, IORING_OP_SEND, IORING_OP_RECV bring networking into the async io_uring model. Eventually, entire network servers may run without a traditional syscall per operation.
Syscall-less user/kernel communication: Research into shared-memory kernel-user communication (a generalization of vDSO) for low-latency data exchange without syscall trap overhead.
eBPF as syscall implementation: Some proposals involve implementing entire new "syscalls" as eBPF programs attached to the bpf_syscall hook, allowing rapid prototyping of new kernel interfaces without kernel patches.

Exercises

Use ausyscall --dump to list all syscall names and numbers on your system. Compare the x86-64 table with the ARM64 table (on an ARM system or cross-reference online). Find 5 syscalls that have different numbers on the two architectures.
Write a C program that makes a write syscall using inline assembly on x86-64 (not via glibc — directly using asm volatile("syscall"...)). Print "hello, syscall\n" to stdout. This requires setting rax=1, rdi=1, rsi=buf, rdx=len and executing syscall.
Install seccomp-tools and analyze Chrome's renderer seccomp filter (/proc/$(pgrep -f "renderer")/status shows Seccomp: 2). Use seccomp-tools dump to extract and disassemble the BPF filter. List the blocked syscalls.
Measure the cost of clock_gettime(CLOCK_REALTIME) and clock_gettime(CLOCK_REALTIME_COARSE), both of which are served by the vDSO without a syscall, against a real syscall such as syscall(SYS_getpid). Write a microbenchmark. What is the ratio?
Read the source of fs/read_write.c, specifically ksys_read() and vfs_read(). Trace the call path from sys_read() (via SYSCALL_DEFINE3(read, ...)) down to the struct file_operations .read_iter() function pointer. List every function called in sequence. What is the purpose of each?

References

Linux kernel source: arch/x86/entry/entry_64.S, arch/x86/entry/common.c, arch/x86/entry/syscalls/syscall_64.tbl, kernel/sys.c, include/linux/syscalls.h
x86-64 ABI Supplement: Michael Matz et al., "System V Application Binary Interface AMD64 Architecture Processor Supplement", Version 1.0
Michael Kerrisk, The Linux Programming Interface, No Starch Press, 2010 (the definitive reference for Linux syscalls)
Brendan Gregg, "Linux Systems Performance", USENIX LISA 2019 (YouTube)
Max Kellermann, "The Dirty Pipe Vulnerability" (CVE-2022-0847), https://dirtypipe.cm4all.com/
io_uring documentation: Documentation/block/io-uring.rst, Jens Axboe's papers: https://kernel.dk/io_uring.pdf
vDSO documentation: Documentation/vDSO/, lib/vdso/
seccomp documentation: Documentation/userspace-api/seccomp_filter.rst
Linux man-pages project: https://man7.org/linux/man-pages/