System Calls
Technical Overview
A system call (syscall) is the mechanism by which a user-space program requests a service from the kernel. It is the sole defined interface between user space (ring 3) and kernel space (ring 0). Every file operation, network connection, process creation, and memory allocation ultimately traverses this interface. Understanding system calls means understanding exactly how user-space software gets anything done, because without them, a process cannot interact with the world outside its own private memory.
On x86-64 Linux, there are approximately 400 system calls defined in arch/x86/entry/syscalls/syscall_64.tbl. They range from trivial (getpid, which just reads a field in the current task struct) to complex (io_uring_enter, which can submit and complete thousands of I/O operations). The syscall number, a small integer, uniquely identifies which kernel function to invoke. A convention for passing arguments and returning values is defined in the ABI specification and followed by both the C library and the kernel.
Prerequisites
- 02-user-space-vs-kernel-space.md: why the kernel/user boundary exists
- 03-cpu-privilege-rings.md: how ring transitions work
- Basic C programming knowledge
- Familiarity with the concept of a function call and return value
Core Content
The System Call Table
The Linux kernel maintains a table of function pointers — sys_call_table[] — indexed by syscall number. On x86-64, it is defined in arch/x86/entry/syscall_64.c and populated from arch/x86/entry/syscalls/syscall_64.tbl.
Partial syscall_64.tbl (x86-64):
Number ABI Name Entry Point
0 common read sys_read
1 common write sys_write
2 common open sys_open
3 common close sys_close
9 common mmap sys_mmap
12 common brk sys_brk
56 common clone sys_clone
59 common execve sys_execve
60 common exit sys_exit
61 common wait4 sys_wait4
62 common kill sys_kill
231 common exit_group sys_exit_group
425 common io_uring_setup sys_io_uring_setup
426 common io_uring_enter sys_io_uring_enter
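On Linux 6.6 the highest native x86-64 syscall number is in the mid-450s, but the count of distinct syscalls is lower because numbers 335–423 are unassigned (numbering was unified across architectures starting at 424). Different architectures have different tables and different numbering. This is why syscall(SYS_write, ...) uses architecture-specific constants from <sys/syscall.h>.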
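As a minimal sketch of why the symbolic constants matter, the helpers below call syscall(2) with SYS_* names rather than literal numbers, so the same source compiles against whichever table the target architecture uses. The names raw_getpid and raw_write are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid, SYS_write: per-architecture numbers */
#include <sys/types.h>
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: getpid by number, bypassing glibc's getpid() wrapper. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Hypothetical helper: write(2) by number.
 * Returns bytes written, or -1 with errno set (syscall() applies the libc convention). */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(STDOUT_FILENO, "hi\n", 3) behaves like write(2); hard-coding 1 instead of SYS_write would silently invoke a different syscall on architectures that number write differently.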
How a Syscall Works on x86-64: The Complete Flow
User Space Kernel Space
───────────────── ─────────────────────────────────────
glibc write(fd, buf, len)
│
│ Sets up registers:
│ rax = 1 (SYS_write)
│ rdi = fd
│ rsi = buf
│ rdx = len
│
│ Executes SYSCALL instruction
│ ─────────────────────────────────────────────────────────►
│ CPU loads RIP from LSTAR MSR
│ → entry_SYSCALL_64
│ CPU clears IF (interrupt flag)
│ (the CPU does NOT switch stacks; the entry
│ code below swaps GS and loads the kernel RSP)
│
│ entry_SYSCALL_64:
│ swapgs ; switch GS base
│ mov %rsp, %gs:scratch_space
│ mov %gs:cpu_tss+TSS_sp2, %rsp
│ pushq %r11 ; old RFLAGS
│ pushq %rcx ; old RIP (return addr)
│ PUSH_REGS ; save all GPRs
│ mov %rsp, %rdi ; pt_regs *regs
│ call do_syscall_64()
│
│ do_syscall_64():
│ nr = regs->ax ; syscall number
│ if (nr < NR_syscalls):
│ ret = sys_call_table[nr](regs)
│ regs->ax = ret ; store return value
│
│ (return path)
│ POP_REGS ; restore registers
│ swapgs ; restore GS base
│ SYSRET ; return to user, CPL=3
│ ◄─────────────────────────────────────────────────────────
│
│ Returns from glibc wrapper
│ Return value in rax
│ errno set if rax is in [-4095, -1]
Key mechanism: The SYSCALL instruction is not a general call — it is a special fast-path instruction that:
1. Saves RIP (return address) into RCX
2. Saves RFLAGS into R11
3. Masks RFLAGS using SFMASK MSR (clears IF, preventing interrupts during entry)
4. Loads a new RIP from LSTAR MSR (set at boot to entry_SYSCALL_64)
5. Changes CPL from 3 to 0 (by loading CS/SS from STAR MSR)
6. Does NOT switch the stack — the kernel switches it early in entry_SYSCALL_64
The LSTAR MSR is configured at boot in arch/x86/kernel/cpu/common.c:
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
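The register behavior listed above can be observed from user space. The following is a sketch, assuming GCC or Clang inline assembly on x86-64 (with a syscall(2) fallback elsewhere); raw_syscall0 is a hypothetical name. The clobber list is the part worth noticing: it tells the compiler that SYSCALL overwrites rcx and r11.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a syscall that takes no arguments. */
long raw_syscall0(long nr)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                  /* rax out: return value            */
        : "a"(nr)                    /* rax in: syscall number           */
        : "rcx", "r11", "memory");   /* SYSCALL stores RIP in rcx and
                                        RFLAGS in r11, destroying both   */
    return ret;                      /* raw value: negative errno on failure */
#else
    return syscall(nr);              /* fallback on other architectures  */
#endif
}
```

For example, raw_syscall0(SYS_getpid) should return the same value as getpid().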
Syscall Arguments: Register Convention
On x86-64 Linux (defined in the x86-64 ABI supplement):
| Register | Role |
|---|---|
| rax | Syscall number (input), return value (output) |
| rdi | 1st argument |
| rsi | 2nd argument |
| rdx | 3rd argument |
| r10 | 4th argument (note: r10, not rcx as in the C calling convention) |
| r8 | 5th argument |
| r9 | 6th argument |
A maximum of 6 arguments are passed in registers. Syscalls that need more data use a struct pointer as one of the arguments (e.g., io_uring_setup passes a struct io_uring_params * as the second argument).
Why r10 instead of rcx? The SYSCALL instruction saves the old RIP (return address) into rcx, destroying whatever was there. The kernel entry code therefore cannot receive the 4th argument in rcx. The ABI uses r10 instead, and glibc's syscall wrappers copy the value across (mov %rcx, %r10 in AT&T syntax) before executing SYSCALL.
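A sketch of the full six-argument convention, assuming GCC or Clang on x86-64 (other architectures fall back to syscall(2)). The names raw_syscall6 and raw_map_page are hypothetical; mmap is used only because it happens to take six arguments.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: six-argument syscall following the table above. */
long raw_syscall6(long nr, long a1, long a2, long a3, long a4, long a5, long a6)
{
#if defined(__x86_64__)
    long ret;
    register long r10 __asm__("r10") = a4;   /* 4th argument goes in r10, not rcx */
    register long r8  __asm__("r8")  = a5;
    register long r9  __asm__("r9")  = a6;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)
        : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9)
        : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(nr, a1, a2, a3, a4, a5, a6);
#endif
}

/* Hypothetical helper: map one anonymous 4096-byte region through the raw path. */
void *raw_map_page(void)
{
    long ret = raw_syscall6(SYS_mmap, 0, 4096, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Only values in [-4095, -1] are errors; other values are addresses. */
    return (ret < 0 && ret >= -4095) ? MAP_FAILED : (void *)ret;
}
```

The returned region can be released with munmap(p, 4096) as usual.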
Return Value and errno
The kernel places the return value in rax. On success, this is the result (e.g., number of bytes written). On failure, the kernel returns -errno (a negative error code, such as -ENOENT = -2). glibc's syscall wrappers check whether the return value lies in the range [-4095, -1], and if so, set errno = -ret and return -1.
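// Inside glibc's write() wrapper (simplified; raw_syscall is a hypothetical
// stand-in for glibc's internal macro that returns the raw rax value):
long ret = raw_syscall(SYS_write, fd, buf, count);
if (ret < 0 && ret >= -4095) {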
errno = -ret; // e.g., ret=-2 → errno=ENOENT
return -1;
}
return ret;
The errno global is thread-local (TLS) in modern glibc — each thread has its own errno.
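The translation step can be isolated as a small pure function. This is a sketch, not glibc's actual code; to_libc_result and the local MAX_ERRNO macro are names chosen for the illustration, with 4095 matching the bound described above.

```c
#include <errno.h>

#define MAX_ERRNO 4095   /* largest errno value the kernel encodes in a return */

/* Hypothetical helper: convert a raw kernel return value to the libc convention. */
long to_libc_result(long raw)
{
    if (raw < 0 && raw >= -MAX_ERRNO) {
        errno = (int)-raw;   /* e.g. raw = -2 becomes errno = ENOENT */
        return -1;
    }
    return raw;              /* success, including large values such as addresses */
}
```

The bounded range matters because some syscalls (mmap, for instance) can legitimately return values that look negative when read as a signed long; only the top 4095 values are reserved for errors.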
VDSO: Virtual Dynamic Shared Object
The vDSO is a small shared library (typically one or two 4 KiB pages) that the kernel maps into every process's address space. It implements a handful of system calls entirely in user space by reading kernel-maintained memory pages, avoiding the SYSCALL trap overhead.
Process virtual memory (from /proc/self/maps):
7ffff7ffd000-7ffff7fff000 r-xp [vdso]
ffffffffff600000-ffffffffff601000 --xp [vsyscall] (legacy)
Syscalls implemented in the vDSO (Linux x86-64):
- clock_gettime(CLOCK_REALTIME, ...): reads from a struct vdso_data page the kernel keeps updated
- clock_gettime(CLOCK_MONOTONIC, ...): same page
- gettimeofday(): reads from the same page
- getcpu(): obtains the current CPU number without entering the kernel (via RDPID, or an LSL on a per-CPU GDT segment on older CPUs)
- clock_getres(): returns static data
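A process can find its own vDSO mapping without parsing /proc/self/maps: the kernel passes the base address in the auxiliary vector. The sketch below assumes glibc's getauxval(3); vdso_present is a hypothetical helper name.

```c
#include <elf.h>        /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
#include <string.h>
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical helper.
 * Returns 1 if a vDSO with a valid ELF magic is mapped,
 *         0 if the advertised address does not hold an ELF header,
 *        -1 if the kernel did not advertise a vDSO at all. */
int vdso_present(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);
    if (base == 0)
        return -1;
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0 ? 1 : 0;
}
```

Because the mapping is an ordinary ELF image, the dynamic loader resolves symbols such as clock_gettime from it like any other shared object.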
How vDSO clock_gettime works:
Kernel maintains:
struct vdso_data {
u64 seq; // seqcount for consistency
u64 cycle_last; // last TSC value when updated
u64 mult, shift; // TSC → nanoseconds conversion
u64 basetime; // Wall clock base
...
} at a fixed kernel virtual address, also mapped r/o into user space
vDSO clock_gettime():
1. Read seq (verify even = no update in progress)
2. Read TSC via RDTSC
3. Compute: ns = (((TSC - cycle_last) * mult) >> shift) + basetime
4. Read seq again (verify unchanged)
5. If seq changed during read: retry
6. Return ns (no syscall trap!)
Cost: ~5ns for clock_gettime via vDSO vs. ~100–200ns via actual syscall.
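The retry protocol in steps 1 to 5 is a sequence lock. The sketch below models it in user space with C11 atomics. It is an illustration under simplifying assumptions, not the kernel's code: struct time_page and both function names are hypothetical, the caller supplies the cycle count instead of executing RDTSC, and the data fields are plain variables (the kernel uses READ_ONCE/WRITE_ONCE style accesses).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared data page (simplified field set). */
struct time_page {
    atomic_uint seq;        /* odd while the writer is mid-update */
    uint64_t cycle_last;
    uint32_t mult, shift;
    uint64_t basetime_ns;
};

/* Writer (the kernel's role): make seq odd, update, make seq even again. */
void time_page_update(struct time_page *p, uint64_t cycles, uint64_t base_ns)
{
    atomic_fetch_add(&p->seq, 1);   /* odd: update in progress */
    p->cycle_last = cycles;
    p->basetime_ns = base_ns;
    atomic_fetch_add(&p->seq, 1);   /* even: snapshot consistent */
}

/* Reader (the vDSO's role): retry until seq is even and unchanged. */
uint64_t time_page_read_ns(struct time_page *p, uint64_t now_cycles)
{
    unsigned start;
    uint64_t ns;
    do {
        start = atomic_load(&p->seq);
        ns = (((now_cycles - p->cycle_last) * p->mult) >> p->shift)
             + p->basetime_ns;
    } while ((start & 1) || atomic_load(&p->seq) != start);
    return ns;
}
```

The reader never takes a lock and never writes shared memory, which is what lets the kernel map the page read-only into every process.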
vsyscall: The older mechanism, now largely deprecated. The vsyscall page sits at a fixed address (0xffffffffff600000) in the 64-bit virtual address space. It is now implemented as a trap-and-emulate mechanism: the page is not actually executable, so jumping into it raises a page fault (#PF) that the kernel recognizes and emulates. This prevents the page from serving as a source of ROP gadgets at a predictable address.
seccomp Syscall Filtering
seccomp (Secure Computing Mode, kernel/seccomp.c) allows a process to restrict which syscalls it or its children may invoke. It operates via a BPF (Berkeley Packet Filter) program that runs for every syscall entry:
// Simplified seccomp filter example
struct sock_fprog prog = {
.len = ...,
.filter = (struct sock_filter[]) {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
}
};
prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); // required unless the caller has CAP_SYS_ADMIN
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
The BPF program receives a struct seccomp_data containing the syscall number, architecture, instruction pointer, and arguments. It returns an action: SECCOMP_RET_ALLOW, SECCOMP_RET_KILL_PROCESS, SECCOMP_RET_ERRNO (with the errno value carried in the low 16 bits of the return value), SECCOMP_RET_TRAP (sends SIGSYS), or SECCOMP_RET_TRACE (notifies a tracer).
Used by: Chrome (every renderer process has a strict seccomp filter), Firefox, systemd (for sandboxed services), Docker (default seccomp profile blocks ~44 syscalls), OpenSSH.
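A slightly fuller sketch than the fragment above: it checks the architecture first (production filters do this so that syscall numbers are not misinterpreted), denies a single syscall with EPERM, and runs in a forked child so the filter never affects the caller. The function name seccomp_demo, the choice of getppid as the denied call, and the exit-code scheme are all assumptions made for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Returns 0 if getppid was denied with EPERM inside the child, 1 if it was
 * not denied, 2 if the filter could not be installed, -1 on fork/wait failure. */
int seccomp_demo(void)
{
    struct sock_filter filter[] = {
        /* Kill the process outright if the architecture is unexpected. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Deny getppid with EPERM; allow every other syscall. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);                       /* seccomp unavailable here */
        errno = 0;
        long r = syscall(SYS_getppid);      /* should now fail with EPERM */
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A deny-list like this one is the weaker design; sandboxes such as those listed above generally use allow-lists, where the final statement returns a denying action instead of SECCOMP_RET_ALLOW.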
strace Internals: ptrace
strace uses the ptrace(2) syscall to intercept every system call made by the traced process. The mechanism:
- strace forks and calls ptrace(PTRACE_TRACEME) in the child before exec
- The traced process stops at every syscall entry and exit
- strace calls ptrace(PTRACE_SYSCALL) to continue execution until the next syscall event
- At each stop, strace reads struct user_regs_struct via ptrace(PTRACE_GETREGS) to get the syscall number (in orig_rax) and arguments
The overhead is significant: every syscall becomes two ptrace stops (entry + exit), plus the overhead of the tracer process scheduling and context switching. strace adds 10–100x overhead to syscall-heavy programs. perf trace achieves similar output with lower overhead using tracepoints via perf event ring buffers.
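The stop-and-resume loop can be sketched in a few dozen lines. This is a toy, not strace: it only counts syscall stops in a forked child rather than decoding registers, which keeps it architecture-neutral. The function name count_syscall_stops is hypothetical, and it returns -1 where ptrace is not permitted (some sandboxes block it).

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: trace a child that makes a few syscalls and count
 * how many syscall-entry and syscall-exit stops the tracer observes. */
long count_syscall_stops(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(1);
        raise(SIGSTOP);            /* hand control to the tracer */
        syscall(SYS_getppid);      /* two syscalls to be observed */
        syscall(SYS_getppid);
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;                 /* child could not be traced */

    /* Mark syscall stops as SIGTRAP|0x80 so they are distinguishable
     * from ordinary signal stops. */
    ptrace(PTRACE_SETOPTIONS, child, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long stops = 0;
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, child, NULL, NULL) < 0)
            break;
        if (waitpid(child, &status, 0) < 0)
            break;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;               /* one entry stop or one exit stop */
    }
    return stops;
}
```

Each traced syscall costs the child two stops and the tracer two waitpid/ptrace round trips, which is where the slowdown described above comes from.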
Syscall Overhead Measurement
# Method 1: Use perf stat
perf stat -e "syscalls:sys_enter_read" -a sleep 1
# Method 2: Benchmark getpid (minimal syscall)
cat << 'EOF' > bench_syscall.c
#include <stdio.h>
#include <sys/syscall.h>   // SYS_getpid
#include <time.h>
#include <unistd.h>        // syscall()
int main(void) {
    struct timespec t1, t2;
    int N = 10000000;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < N; i++) syscall(SYS_getpid);  // enters the kernel on every iteration
    clock_gettime(CLOCK_MONOTONIC, &t2);
    long ns = (t2.tv_sec - t1.tv_sec) * 1000000000L + (t2.tv_nsec - t1.tv_nsec);
    printf("%.1f ns per syscall\n", (double)ns / N);
    return 0;
}
EOF
gcc -O2 -o bench_syscall bench_syscall.c && ./bench_syscall
# Typical output: 90-200 ns per syscall (depending on KPTI, Spectre mitigations)
Adding a New Syscall (Theoretical Walk-Through)
Linux uses the SYSCALL_DEFINE macro family to define syscalls. A new syscall taking two arguments would be defined as:
// In an appropriate kernel source file (e.g., kernel/mymodule.c)
#include <linux/syscalls.h>
SYSCALL_DEFINE2(my_syscall, int, arg1, unsigned long, arg2)
{
if (arg1 < 0)
return -EINVAL;
// ... implementation ...
return 0;
}
This macro expands to:
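// Conceptually generates: asmlinkage long sys_my_syscall(int arg1, unsigned long arg2)
// On x86-64 (since Linux 4.17) the table entry point is a generated wrapper,
// __x64_sys_my_syscall(const struct pt_regs *regs), which unpacks the argument registers.
// Plus metadata used by syscall tracepoints.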
Then in arch/x86/entry/syscalls/syscall_64.tbl:
462 common my_syscall sys_my_syscall
And in include/linux/syscalls.h:
asmlinkage long sys_my_syscall(int arg1, unsigned long arg2);
In practice, new syscalls are rare and scrutinized heavily. The preferred modern approach for new kernel-user interfaces is either extending an existing syscall (via new flags), using ioctl, using netlink, or designing an io_uring-based interface.
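Until the C library ships a wrapper, user space reaches a newly added syscall through syscall(2) with the raw number, and must cope with kernels that lack it. The sketch below probes a number with that pattern. NR_UNASSIGNED is a hypothetical value chosen to be far outside any assigned range, and probe_syscall is likewise an illustrative name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Hypothetical number, deliberately far beyond any assigned syscall. */
#define NR_UNASSIGNED 100000L

/* Hypothetical helper: returns 0 if the call succeeded,
 * otherwise the errno value it failed with. */
int probe_syscall(long nr)
{
    errno = 0;
    return syscall(nr, 0L, 0L) == -1 ? errno : 0;
}
```

On a kernel without the syscall the result is normally ENOSYS; under some seccomp profiles an unknown number is reported as EPERM instead, so robust callers treat both as "not available" and fall back.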
Historical Context
The concept of a system call interface formalized in Unix v6 (1975): 57 system calls, documented in the Unix Programmer's Manual. Bell Labs' decision to define a small, stable set of syscalls and implement them in C (rather than assembly) was revolutionary — it made the OS portable and the interface learnable.
The original mechanism for entering kernel mode on x86 was INT 0x80 (software interrupt vector 80h, used by Linux from the beginning through the 32-bit era). INT 0x80 was slow (~300 cycles) because it looked up the handler in the IDT, switched stacks, and saved/restored many registers through the full interrupt path.
Intel introduced SYSENTER/SYSEXIT in Pentium II (1997) as a faster syscall mechanism for 32-bit code. AMD introduced SYSCALL/SYSRET for 64-bit, which Intel also adopted. Linux 64-bit uses SYSCALL exclusively. Modern 32-bit Linux also uses SYSENTER when available (via the vDSO's __kernel_vsyscall).
The number of Linux syscalls has grown from 57 (Unix v6) to ~460 on x86-64 (Linux 6.6), reflecting 50 years of OS feature growth. Each syscall represents a point where user-space software needed to ask the kernel for something it couldn't do itself.
Production Examples
PostgreSQL and system calls: A PostgreSQL query executing a sequential scan of a large table will issue millions of read(2) or pread(2) syscalls when reading from disk. Switching to io_uring (available in PostgreSQL experimental patches) batches these into ring buffer submissions, reducing syscall overhead by 60–80% for I/O-bound workloads at high concurrency.
Nginx and sendfile: Nginx uses sendfile(2) to serve static files. This syscall transfers data directly from the page cache to a socket buffer without copying it to user space. The sendfile path in the kernel (do_sendfile() in fs/read_write.c) is a zero-copy path: the kernel reads file data from the page cache and hands it to the socket's scatter-gather list without an intermediate user-space buffer. This is why Nginx can serve tens of thousands of requests/second with low CPU usage.
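The call itself is small. The following sketch (unrelated to Nginx's actual source) pushes the contents of a temporary file into one end of a UNIX socket pair with sendfile(2) and reads it back from the other end. The helper name sendfile_roundtrip and the /tmp path template are assumptions for the example.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: writes `text` to a temp file, sends it through a
 * socket with sendfile(2), and reads it back into `out` (size outsz).
 * Returns the number of bytes received, or -1 on any failure. */
ssize_t sendfile_roundtrip(const char *text, char *out, size_t outsz)
{
    char path[] = "/tmp/sendfile_demo_XXXXXX";
    size_t len = strlen(text);
    int sv[2];
    ssize_t got = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                              /* file vanishes once closed */

    if (len > outsz || write(fd, text, len) != (ssize_t)len ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;                          /* start of file; updated by the kernel */
    ssize_t sent = sendfile(sv[0], fd, &offset, len);
    if (sent > 0)
        got = read(sv[1], out, (size_t)sent);  /* the data never passed through
                                                  a user buffer on the send side */
    close(sv[0]);
    close(sv[1]);
    close(fd);
    return got;
}
```

A real server loops on sendfile until the whole range is sent, since the call may transfer fewer bytes than requested.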
Chrome's seccomp sandbox: Every Chrome renderer process runs under a strict seccomp-BPF filter that allows approximately 70 syscalls (out of ~460 available). Any attempt to call a blocked syscall raises SIGSYS. This limits the damage a renderer exploit can do: even with arbitrary code execution in the renderer, the attacker cannot call execve, socket, or ptrace.
Debugging Notes
# Count syscalls for a command
strace -c -e trace=all ls /tmp
# Show syscalls with timestamps and arguments
strace -T -tt -e trace=network curl https://example.com
# Attach to a running process
strace -p $(pgrep nginx | head -1)
# Perf-based syscall tracing (lower overhead); --duration 5 shows only syscalls that took longer than 5 ms
perf trace -p <pid> --duration 5
# eBPF-based: show syscall counts system-wide
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, args->id] = count(); } interval:s:10 { exit(); }'
# Show blocked syscalls in a seccomp-filtered process
# (look for SIGSYS in strace output or auditd logs)
ausearch -m SECCOMP -ts today
Slow syscall diagnosis:
# Show syscalls taking >1ms
perf trace --max-stack 5 -e '!futex,epoll_wait,poll,select,clock_nanosleep' \
--duration 1 -p <pid>
Security Implications
The syscall interface is the kernel's primary attack surface from user space. Security hardening focuses on:
- seccomp: restricts the set of syscalls available, reducing attack surface
- Argument validation: every kernel syscall must validate all arguments (pointer validity, bounds, permissions) before acting. Failures to validate lead to privilege escalation (Dirty COW, Dirty Pipe)
- KASLR: randomizes the kernel's virtual address layout, making it harder to know where sys_call_table is (relevant for ROP-based kernel exploits)
- Syscall auditing: Linux Security Module hooks and the audit subsystem (kernel/audit.c) log syscalls for auditd, used in security-sensitive environments (PCI DSS, HIPAA compliance)
- SECCOMP_FILTER_FLAG_NEW_LISTENER: returns a notification file descriptor through which a supervisor process can inspect an intercepted syscall and its arguments, then either perform the operation on the target's behalf or reply with a return value or error; used by container runtimes for policy enforcement
Performance Implications
- Syscall cost breakdown (approximately, on a modern x86-64 with KPTI):
  - Register save/restore: ~20 cycles
  - swapgs: ~10 cycles
  - CR3 switch (KPTI): ~50 cycles with PCID, ~200 without
  - Spectre v2 mitigation (IBRS/STIBP): 0–200 cycles depending on microcode/CPU
  - Actual syscall work: varies (getpid: ~5 cycles; write to disk: millions)
- Total round trip (empty syscall): 100–500 ns
- Eliminating syscalls: vDSO, io_uring, DPDK, RDMA all reduce or eliminate syscalls for hot paths
- Syscall heavy vs. compute heavy: a process spending >20% of CPU time in syscalls is syscall-limited. Optimization involves batching, async APIs, or in-kernel processing (eBPF)
Failure Modes and Real Incidents
CVE-2022-0847 "Dirty Pipe": A bug in splice() and the pipe buffer system (fs/pipe.c) allowed user space to call splice() followed by write() in a way that overwrote pages in the page cache for read-only files — including SUID binaries like /usr/bin/passwd. Root cause: a splice path initialized pipe buffer flags incorrectly, allowing a subsequent write to pollute a page cache entry. The flaw was in the syscall implementation's interaction between splice, write, and copy-on-write semantics.
getdents() buffer size bug (historical): Early implementations of filesystem directory scanning via getdents(2) had edge cases where a tiny buffer size caused infinite loops or incorrect cursor advancement in some filesystem drivers. Programs using very small buffer sizes in getdents calls (some older find(1) implementations) triggered these bugs.
ptrace() privilege escalation (CVE-2019-13272): ptrace_link() in kernel/ptrace.c mishandled the recording of credentials for a process creating a ptrace relationship via PTRACE_TRACEME. A local user could obtain root in scenarios where a parent process dropped privileges and then called execve. Fixed in Linux 5.1.17.
Modern Usage
io_uring (Linux 5.1+): The most significant change to the Linux syscall interface in decades. Instead of one syscall per I/O operation, io_uring uses shared ring buffers between user space and kernel. A single io_uring_enter(2) call can submit and wait for thousands of operations. The kernel can also operate in IORING_SETUP_SQPOLL mode where a kernel thread polls the submission queue continuously, eliminating even the io_uring_enter syscall.
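Production code normally uses liburing, but the underlying entry point is an ordinary syscall. The sketch below only creates and destroys a ring by calling io_uring_setup through syscall(2); it submits no I/O. The helper name io_uring_probe is hypothetical, and the code assumes kernel headers new enough to provide <linux/io_uring.h> and SYS_io_uring_setup.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>   /* struct io_uring_params */
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns the submission-queue size the kernel granted,
 * or a negative errno value if io_uring is unavailable or disabled. */
long io_uring_probe(unsigned entries)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof params);   /* unused fields must be zero */

    long fd = syscall(SYS_io_uring_setup, entries, &params);
    if (fd < 0)
        return -errno;                   /* e.g. ENOSYS or EPERM in sandboxes */

    close((int)fd);                      /* the ring lives behind this fd */
    return (long)params.sq_entries;
}
```

After a successful setup, a real application would mmap the submission and completion rings using the offsets returned in params, which is the shared memory the paragraph above refers to.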
eBPF syscall (bpf(2)): A single syscall with a cmd argument selects one of ~50 operations: load a BPF program, create a map, attach a program to a hook, query program info. The BPF program, once loaded and verified, runs in the kernel without further syscalls. This is the mechanism for deploying custom kernel logic without writing kernel modules.
pidfd: Modern Linux process management uses file descriptors rather than PIDs. pidfd_open(2), pidfd_send_signal(2), clone3(2) with CLONE_PIDFD — these provide race-free process management, since a pidfd refers to a specific process identity even after the PID is reused.
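A minimal sketch of the pidfd pattern, assuming Linux 5.3 or later for pidfd_open and headers that define SYS_pidfd_open. The helper name wait_via_pidfd is hypothetical. The descriptor becomes readable when the process terminates, so it can sit in the same poll or epoll set as sockets and timers.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child that exits with `code`, waits for its
 * termination through a pidfd, and returns the exit code.
 * Returns -1 if pidfd_open is unavailable or anything else fails. */
int wait_via_pidfd(int code)
{
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0)
        _exit(code);

    int exited = 0;
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    if (pidfd >= 0) {
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        exited = (poll(&pfd, 1, -1) == 1);   /* readable once the child terminates */
        close(pidfd);
    }

    int status;
    if (waitpid(child, &status, 0) != child) /* always reap the child */
        return -1;
    return (exited && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

Unlike a bare PID, the descriptor keeps referring to this specific child even if the numeric PID is later recycled.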
Future Directions
- Landlock LSM: A user-space-accessible LSM (merged in Linux 5.13) that allows processes to restrict their own filesystem access. Built on top of the syscall interface (using the landlock_create_ruleset, landlock_add_rule, and landlock_restrict_self syscalls).
- io_uring expansion: io_uring operations continue to expand. IORING_OP_SOCKET, IORING_OP_CONNECT, IORING_OP_SEND, and IORING_OP_RECV bring networking into the async io_uring model. Eventually, entire network servers may run without a traditional syscall per operation.
- Syscall-less user/kernel communication: Research into shared-memory kernel-user communication (a generalization of vDSO) for low-latency data exchange without syscall trap overhead.
- eBPF as syscall implementation: Some proposals involve implementing entire new "syscalls" as eBPF programs attached to the bpf_syscall hook, allowing rapid prototyping of new kernel interfaces without kernel patches.
Exercises
1. Use ausyscall --dump to list all syscall names and numbers on your system. Compare the x86-64 table with the ARM64 table (on an ARM system or cross-reference online). Find 5 syscalls that have different numbers on the two architectures.
2. Write a C program that makes a write syscall using inline assembly on x86-64 (not via glibc; directly using asm volatile("syscall"...)). Print "hello, syscall\n" to stdout. This requires setting rax=1, rdi=1, rsi=buf, rdx=len and executing syscall.
3. Install seccomp-tools and analyze Chrome's renderer seccomp filter (/proc/$(pgrep -f "renderer")/status shows Seccomp: 2). Use seccomp-tools dump to extract and disassemble the BPF filter. List the blocked syscalls.
4. Measure the cost difference between clock_gettime(CLOCK_REALTIME) (which uses the vDSO, no syscall) and clock_gettime(CLOCK_REALTIME_COARSE) vs. a real syscall like getpid(). Write a microbenchmark. What is the ratio?
5. Read the source of fs/read_write.c, specifically ksys_read() and vfs_read(). Trace the call path from sys_read() (via SYSCALL_DEFINE3(read, ...)) down to the struct file_operations.read_iter() function pointer. List every function called in sequence. What is the purpose of each?
References
- Linux kernel source: arch/x86/entry/entry_64.S, arch/x86/entry/common.c, arch/x86/entry/syscalls/syscall_64.tbl, kernel/sys.c, include/linux/syscalls.h
- x86-64 ABI Supplement: Michael Matz et al., "System V Application Binary Interface AMD64 Architecture Processor Supplement", Version 1.0
- Michael Kerrisk, The Linux Programming Interface, No Starch Press, 2010 (the definitive reference for Linux syscalls)
- Brendan Gregg, "Linux Systems Performance", USENIX LISA 2019 (YouTube)
- Max Kellermann, "The Dirty Pipe Vulnerability (CVE-2022-0847)", https://dirtypipe.cm4all.com/
- io_uring documentation: Documentation/block/io-uring.rst, Jens Axboe's papers: https://kernel.dk/io_uring.pdf
- vDSO documentation: Documentation/vDSO/, lib/vdso/
- seccomp documentation: Documentation/userspace-api/seccomp_filter.rst
- Linux man-pages project: https://man7.org/linux/man-pages/