System Call Implementation
Technical Overview
System call implementation is the kernel side of the user/kernel boundary. While 00-foundations/05-system-calls.md describes the syscall mechanism from the user-space perspective (how applications invoke syscalls, the ABI), this file documents the kernel-side implementation: how SYSCALL_DEFINE macros work, what happens in entry_SYSCALL_64 before C code is called, how the kernel dispatches calls, how it handles errors and retries, and how modern facilities like the vDSO and seccomp integrate into the dispatch path.
This is the territory of arch/x86/entry/entry_64.S, arch/x86/entry/common.c, and include/linux/syscalls.h. These files define the lowest-level plumbing of every kernel operation that user-space software depends on.
Prerequisites
- 00-foundations/05-system-calls.md: user-side perspective
- 00-foundations/03-cpu-privilege-rings.md: ring transitions
- 02-kernel-initialization.md: how LSTAR is configured at boot
- Comfort with x86-64 assembly (to read entry_64.S)
Core Content
SYSCALL_DEFINE: How Syscalls Are Defined in the Kernel
Every Linux system call is defined using the SYSCALL_DEFINE family of macros, located in include/linux/syscalls.h. The macro takes the number of arguments as a suffix (0 through 6):
// Zero-argument syscall:
SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current);
}
// One-argument syscall:
SYSCALL_DEFINE1(close, unsigned int, fd)
{
return close_fd(fd);
}
// Three-argument syscall:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
return ksys_read(fd, buf, count);
}
// Six-argument syscall (maximum on x86-64):
SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
unsigned long, prot, unsigned long, flags,
unsigned long, fd, unsigned long, off)
{
...
}
What SYSCALL_DEFINE3(read, ...) expands to:
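// Simplified expansion in the older, pre-wrapper style (the real macros also emit tracing metadata):
static inline long __do_sys_read(unsigned int fd, char __user *buf, size_t count);
asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count);
asmlinkage long sys_read(unsigned int fd, char __user *buf, size_t count)
{
    // The real stub takes every argument as long and casts it back to the
    // declared type, which guarantees correct sign extension. Audit and
    // seccomp hooks are not generated here; they run in the common entry code.
    long ret = __do_sys_read(fd, buf, count);
    return ret;
}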
static inline long __do_sys_read(unsigned int fd, char __user *buf, size_t count)
{
return ksys_read(fd, buf, count);
}
The asmlinkage attribute matters mainly on 32-bit x86, where it forces the compiler to take all arguments from the stack (regparm(0)) instead of from registers. On x86-64 it has no effect on argument passing, because arguments already arrive in registers. What does differ is the register assignment: the syscall ABI passes arguments in rdi/rsi/rdx/r10/r8/r9, while the normal C calling convention uses rdi/rsi/rdx/rcx/r8/r9. r10 takes the place of rcx because the SYSCALL instruction overwrites rcx with the return address.
Since Linux 4.17 (CONFIG_ARCH_HAS_SYSCALL_WRAPPER), x86-64 no longer relies on asmlinkage for syscalls at all. The SYSCALL_DEFINEn macros generate a wrapper named __x64_sys_<name>() that takes a single const struct pt_regs * argument, extracts the arguments from the saved registers, and passes them to the syscall body as ordinary C arguments. do_syscall_64() calls these wrappers through the dispatch table.
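The wrapper layering described above can be modeled in user space. The following is a minimal sketch, not the kernel's macro output: struct demo_regs and every demo_* function are hypothetical names invented for illustration, and the int conversions assume the usual two's-complement truncation that GCC and Clang implement.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pt_regs (argument registers only). */
struct demo_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/* The syscall body, written with the types its author declared. */
static long demo_do_sys_add3(int a, unsigned int b, long c)
{
    return (long)a + (long)b + c;
}

/* Stub that receives every argument as long and casts back to the declared
 * types, so stale upper register bits cannot leak into a 32-bit argument. */
static long demo_se_sys_add3(long a, long b, long c)
{
    return demo_do_sys_add3((int)a, (unsigned int)b, c);
}

/* pt_regs-style entry point: the only symbol a dispatch table would reference. */
long demo_x64_sys_add3(const struct demo_regs *regs)
{
    return demo_se_sys_add3((long)regs->di, (long)regs->si, (long)regs->dx);
}
```

Calling demo_x64_sys_add3() with garbage in the upper 32 bits of di gives the same result as a clean register, which is the property the sign-extension stub exists to guarantee.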
__user annotation: The char __user *buf annotation marks that buf is a pointer to user-space memory (from the Sparse static analysis tool's perspective). This means:
1. The kernel must validate the pointer before dereferencing it
2. Dereferencing it directly would be a security vulnerability (and a SMAP violation)
3. The kernel must use copy_from_user(), copy_to_user(), get_user(), or put_user() to safely access it
entry_SYSCALL_64: The Assembly Entry Point
Source: arch/x86/entry/entry_64.S
When user space executes SYSCALL, the CPU automatically:
1. Saves RIP → RCX (return address)
2. Saves RFLAGS → R11
3. Masks RFLAGS using the IA32_FMASK MSR (SFMASK in AMD documentation); Linux programs it to clear IF, TF, DF, and several other flags
4. Loads new CS:RIP from STAR/LSTAR MSRs (points to entry_SYSCALL_64)
5. Loads SS from STAR MSR
6. Does NOT switch stacks — this happens in the entry code
SYM_CODE_START(entry_SYSCALL_64)
    UNWIND_HINT_ENTRY
    swapgs                          ; Swap the user GS base with the kernel's per-CPU base
                                    ; (x86-64 user TLS normally lives in FS, not GS)
                                    ; After swapgs: GS.base = this CPU's per-CPU area
    ; Save user RSP, switch page tables, load kernel RSP:
    movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)      ; stash user RSP in a per-CPU scratch slot
    SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp             ; KPTI: switch to kernel page tables
    movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp  ; load kernel stack
    ; Now on the kernel stack. Build struct pt_regs (highest field first):
    pushq $__USER_DS                                  ; pt_regs->ss (user SS)
    pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)           ; pt_regs->sp (saved user RSP)
    pushq %r11                                        ; pt_regs->flags (RFLAGS saved by SYSCALL)
    pushq $__USER_CS                                  ; pt_regs->cs (user CS)
    pushq %rcx                                        ; pt_regs->ip (return address saved by SYSCALL)
    pushq %rax                                        ; pt_regs->orig_ax (syscall number)
    ; PUSH_AND_CLEAR_REGS rax=$-ENOSYS expands to roughly:
    pushq %rdi
    pushq %rsi
    pushq %rdx
    pushq %rcx                                        ; (already saved as ip, but keeps the layout uniform)
    pushq $-ENOSYS                                    ; pt_regs->ax: default return value
    pushq %r8
    pushq %r9
    pushq %r10
    pushq %r11
    ; ... rbx, rbp, r12-r15 follow, then the scratch registers are cleared
    ; Interrupts are still disabled here (SYSCALL cleared IF); the C entry code enables them
    ; Call the C handler:
    movq %rsp, %rdi                                   ; arg 1: struct pt_regs *regs
    movslq %eax, %rsi                                 ; arg 2: syscall number
    call do_syscall_64                                ; dispatch the syscall
    ; (exact macro and per-CPU symbol names vary between kernel versions)
The struct pt_regs built on the stack contains all the register state, and do_syscall_64() receives a pointer to it.
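The hardware side of this contract is visible from user space. The sketch below (x86-64 only, GCC/Clang inline assembly) issues getpid with a bare SYSCALL instruction and captures RCX and R11 afterwards; the struct and function names are hypothetical. On a native kernel, RCX is expected to hold the address of the instruction after SYSCALL and R11 the saved RFLAGS, though emulators and sandboxes may behave differently.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#if !defined(__x86_64__)
#error "This sketch uses the x86-64 SYSCALL instruction"
#endif

/* Hypothetical result type: return value plus the two clobbered registers. */
struct demo_syscall_result {
    long ret;            /* RAX after the syscall */
    unsigned long rcx;   /* return RIP written by SYSCALL */
    unsigned long r11;   /* RFLAGS image written by SYSCALL */
};

static struct demo_syscall_result demo_raw_getpid(void)
{
    struct demo_syscall_result out;
    long ret;
    unsigned long rcx;
    register unsigned long r11 __asm__("r11");

    /* RAX carries the syscall number in and the result out. RCX and R11 are
     * declared as outputs because the instruction overwrites both. */
    __asm__ volatile("syscall"
                     : "=a"(ret), "=c"(rcx), "=r"(r11)
                     : "0"((long)SYS_getpid)
                     : "memory");

    out.ret = ret;
    out.rcx = rcx;
    out.r11 = r11;
    return out;
}
```

This is also why the syscall ABI uses r10 for the fourth argument: anything the caller placed in rcx is gone by the time the kernel starts executing.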
Saving Registers and struct pt_regs
Source: arch/x86/include/uapi/asm/ptrace.h, arch/x86/include/asm/ptrace.h
struct pt_regs {
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long bp;
unsigned long bx;
/* arguments: non callee-saved regs: */
unsigned long r11;
unsigned long r10;
unsigned long r9;
unsigned long r8;
unsigned long ax; /* return value / syscall number */
unsigned long cx; /* RIP saved by SYSCALL */
unsigned long dx;
unsigned long si;
unsigned long di;
unsigned long orig_ax; /* original syscall number */
/* return frame: */
unsigned long ip; /* = cx (RIP) */
unsigned long cs;
unsigned long flags; /* = r11 (RFLAGS) */
unsigned long sp; /* user RSP */
unsigned long ss;
};
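This is the user-mode general-purpose register state at the point of the syscall. The kernel restores it on return to user space, except that the SYSRET fast path leaves RCX and R11 holding the return RIP and RFLAGS. ptrace reads and modifies this saved state for a traced process: PTRACE_GETREGS returns it as a struct user_regs_struct (the same registers plus the FS/GS bases and data segment selectors), and PTRACE_SETREGS overwrites it, which is how debuggers change the registers of a stopped process.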
Syscall Dispatch Table and do_syscall_64()
Source: arch/x86/entry/common.c
// The dispatch table:
extern const sys_call_ptr_t sys_call_table[]; // defined in syscall_64.c
// sys_call_ptr_t = typedef long (*)(const struct pt_regs *)
// The dispatcher:
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    add_random_kstack_offset();                   // per-syscall kernel stack offset randomization (hardening)
    nr = syscall_enter_from_user_mode(regs, nr);  // enables IRQs; ptrace, seccomp, audit, tracepoints
    instrumentation_begin();
    if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
        /* Invalid system call, but still a system call. */
        regs->ax = __x64_sys_ni_syscall(regs);    // returns -ENOSYS
    }
    instrumentation_end();
    syscall_exit_to_user_mode(regs);              // exit tracing/audit, signal delivery, rescheduling
}
static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
    /* Convert negative numbers to very high and thus out-of-range numbers */
    unsigned int unr = nr;
    if (likely(unr < NR_syscalls)) {
        unr = array_index_nospec(unr, NR_syscalls);  // Spectre v1 mitigation
        regs->ax = sys_call_table[unr](regs);
        return true;
    }
    return false;
}
array_index_nospec() is a Spectre variant 1 mitigation that prevents the CPU from speculatively executing with an out-of-bounds nr before the bounds check completes.
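The dispatch pattern and the branch-free clamp can be tried out in user space. This is a sketch under stated assumptions: all demo_* names are hypothetical, the table has two toy entries, and the mask formula follows the generic C fallback for array_index_mask_nospec() as I understand it (x86 uses an assembly variant). It relies on arithmetic right shift of negative values, which GCC and Clang provide.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical miniature register frame and handler type. */
struct demo_regs { unsigned long di, si, ax; };
typedef long (*demo_sys_call_ptr_t)(const struct demo_regs *);

static long demo_sys_add(const struct demo_regs *r) { return (long)(r->di + r->si); }
static long demo_sys_ni(const struct demo_regs *r)  { (void)r; return -ENOSYS; }

#define DEMO_NR_SYSCALLS 2
static const demo_sys_call_ptr_t demo_table[DEMO_NR_SYSCALLS] = {
    demo_sys_add,
    demo_sys_ni,
};

/* All-ones when index < size, zero otherwise, computed without a branch so
 * there is nothing for the CPU to mispredict. */
static unsigned long demo_index_mask_nospec(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * 8 - 1));
}

/* Returns 1 if nr named a table entry, 0 otherwise; the result lands in regs->ax. */
static int demo_dispatch(struct demo_regs *regs, int nr)
{
    unsigned long unr = (unsigned int)nr;   /* negative numbers become huge */

    if (unr < DEMO_NR_SYSCALLS) {
        unr &= demo_index_mask_nospec(unr, DEMO_NR_SYSCALLS);
        regs->ax = (unsigned long)demo_table[unr](regs);
        return 1;
    }
    regs->ax = (unsigned long)(long)-ENOSYS;
    return 0;
}
```

Even if the bounds check is speculatively bypassed, the AND forces the index to 0, so the speculative load stays inside the table.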
Syscall Tracing: Audit Subsystem and seccomp
Audit subsystem: syscall_enter_from_user_mode() calls into the audit subsystem (the syscall auditing code lives in kernel/auditsc.c) if auditing is enabled and rules are loaded, typically by auditd. The audit subsystem logs syscall entries and exits with their arguments for compliance (PCI DSS, HIPAA, FIPS) and security monitoring. Rules are configured with auditctl, for example: auditctl -a always,exit -F arch=b64 -S write -k writes.
seccomp: Also invoked from syscall_enter_from_user_mode(). If the current task has a seccomp filter (current->seccomp.mode == SECCOMP_MODE_FILTER), the filter BPF program is executed with struct seccomp_data as input. The filter's return action determines whether to allow, kill, or return an error.
// struct seccomp_data: what the seccomp BPF program sees
struct seccomp_data {
int nr; // syscall number
__u32 arch; // AUDIT_ARCH_X86_64
__u64 instruction_pointer; // caller's RIP (for whitelisting specific code paths)
__u64 args[6]; // syscall arguments
};
The seccomp filter runs in the kernel's BPF interpreter (or JIT-compiled native code). Its result:
- SECCOMP_RET_ALLOW: proceed
- SECCOMP_RET_KILL_PROCESS: immediately kill the process
- SECCOMP_RET_KILL_THREAD: kill only the calling thread
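- SECCOMP_RET_ERRNO: skip the syscall and return -e to user space, where e is the low 16 bits of the filter's return value (SECCOMP_RET_ERRNO | e)
- SECCOMP_RET_TRAP: skip the syscall and deliver SIGSYS to the calling thread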
- SECCOMP_RET_TRACE: notify a ptrace tracer
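The actions above can be exercised with a small classic-BPF filter. The following sketch installs a filter in a forked child that makes getppid fail with EPERM and allows everything else; the function names and the choice of getppid are illustrative assumptions. It uses the documented prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) interface and checks seccomp_data.arch first, as the seccomp documentation recommends.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Install a filter: wrong arch -> kill, getppid -> EPERM, anything else -> allow. */
static int demo_install_filter(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Unprivileged processes must set no_new_privs before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Returns 0 if the filtered call failed with EPERM, 1 if it was not blocked,
 * 2 if the filter could not be installed, -1 on fork/wait failure. */
int demo_seccomp_errno(void)
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        long r;
        if (demo_install_filter() != 0)
            _exit(2);
        errno = 0;
        r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Running the filter in a child keeps the restriction out of the calling process, since an installed seccomp filter can never be removed.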
Return to User Space: SYSRET vs. IRET
After do_syscall_64() returns, entry_SYSCALL_64 restores registers and returns to user space. Two paths:
SYSRET (fast path):
; Taken only if the saved frame is SYSRET-compatible: RCX == RIP, R11 == RFLAGS,
; user CS/SS unchanged, RIP canonical, RF and TF clear
POP_REGS restore_c_regs=1 ; restore all GPRs except RCX, R11
SWITCH_TO_USER_CR3_STACK ; KPTI: switch back to user page tables
swapgs ; restore user GS base
sysretq ; RCX→RIP, R11→RFLAGS, CS/SS from STAR, CPL→3
SYSRET is faster than IRET because it doesn't need to validate a full interrupt frame. It restores only RIP (from RCX) and RFLAGS (from R11) — the values saved by SYSCALL.
IRET (slow path):
Used whenever the saved frame cannot be reproduced by SYSRET: for example after signal delivery, ptrace, or sigreturn modified the saved registers, or when returning to 32-bit compatibility-mode code. IRET pops a full interrupt return frame (RIP, CS, RFLAGS, RSP, SS) and handles all mode transitions.
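There is a historical vulnerability related to SYSRET on Intel processors (CVE-2012-0217): if RCX (the user-space return address) is non-canonical (bits 48–63 are not copies of bit 47), SYSRET raises #GP while still in ring 0, with the user-controlled RSP already loaded. AMD processors raise the fault only after the switch to ring 3. This was used in privilege escalation attacks against several operating systems. Mitigation: Linux checks RCX for canonical-ness before SYSRET and falls back to IRET if it's non-canonical.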
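The canonical-address test itself is a sign-extension check. A minimal sketch, with a hypothetical function name and the virtual-address width passed as a parameter (48 with 4-level paging, 57 with 5-level paging); it assumes arithmetic right shift of signed values, as GCC and Clang implement:

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical when all bits above (vaddr_bits - 1) are copies of
 * bit (vaddr_bits - 1). Shifting left and then arithmetically right
 * sign-extends from that bit; a canonical address survives unchanged. */
static bool demo_is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
    unsigned int shift = 64u - vaddr_bits;

    return (uint64_t)((int64_t)(addr << shift) >> shift) == addr;
}
```

For example, 0x00007fffffffffff and 0xffff800000000000 are canonical for 48-bit addressing, while 0x0000800000000000 is not.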
Syscall Restart: ERESTARTSYS and EINTR
When a slow syscall (one that may block) is interrupted by a signal, the kernel has several choices:
User calls: read(fd, buf, 1024) // blocking read on socket with no data
│
▼
Kernel blocks in tcp_recvmsg(), waiting for data
│
Signal arrives (SIGALRM, SIGUSR1, etc.)
│
▼
tcp_recvmsg() returns -ERESTARTSYS
│
▼
syscall_exit_to_user_mode() → handle_signal()
→ kernel inspects the restart code and adjusts the saved registers
→ signal handler runs in user space, then returns via sigreturn
Case 1: SA_RESTART set in sigaction flags
→ kernel restores the syscall number into RAX and backs RIP up by 2 bytes (the length of the SYSCALL instruction)
→ after the handler returns, user space re-executes SYSCALL
→ syscall transparently restarts
Case 2: SA_RESTART not set
→ kernel sets the return value to -EINTR
→ user process's read() returns -1, errno=EINTR
→ user code must check for EINTR and retry manually
The internal restart codes:
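- -ERESTARTSYS: restart if no handler runs or if the handler was installed with SA_RESTART; otherwise return -EINTR
- -ERESTARTNOINTR: always restart (even without SA_RESTART)
- -ERESTARTNOHAND: restart only if no handler runs; return -EINTR whenever a handler is invoked
- -ERESTART_RESTARTBLOCK: like -ERESTARTNOHAND, but the restart goes through restart_syscall(2) and a saved restart block (for nanosleep, futex with timeout)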
The restart_block mechanism handles syscalls that need to re-enter with modified arguments. For nanosleep(2) interrupted midway, the remaining sleep time is different from the original. The restart_block stores the updated arguments.
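The user-visible side of Case 2 can be reproduced with an interval timer. The sketch below is illustrative only (function names are hypothetical): it installs a SIGALRM handler without SA_RESTART, blocks in read() on an empty pipe, and expects EINTR. It also shows the usual retry wrapper that callers write when they cannot rely on SA_RESTART.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static void demo_on_alarm(int sig) { (void)sig; /* only needs to interrupt */ }

/* Returns 1 if a blocking read() was interrupted with EINTR, 0 if not, -1 on setup failure. */
int demo_read_sees_eintr(void)
{
    int fds[2];
    char byte;
    ssize_t n;
    int saved_errno;
    struct sigaction sa, old;
    /* Repeat every 10 ms so a tick that fires before read() blocks is harmless. */
    struct itimerval tick = {
        .it_interval = { .tv_sec = 0, .tv_usec = 10000 },
        .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
    };
    struct itimerval off;

    memset(&off, 0, sizeof(off));
    if (pipe(fds) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = demo_on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                       /* deliberately no SA_RESTART */
    sigaction(SIGALRM, &sa, &old);

    setitimer(ITIMER_REAL, &tick, NULL);
    n = read(fds[0], &byte, 1);            /* blocks: nothing was written */
    saved_errno = errno;
    setitimer(ITIMER_REAL, &off, NULL);    /* stop the timer */

    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    return n == -1 && saved_errno == EINTR;
}

/* The manual retry loop that user code needs when SA_RESTART is not set. */
ssize_t demo_read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);
    return n;
}
```

Setting sa.sa_flags = SA_RESTART instead would make the same read() restart inside the kernel, and the caller would never observe EINTR.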
Slow vs. Fast Syscall Paths
Fast path: Syscall that completes without sleeping:
SYSCALL → entry_SYSCALL_64 → do_syscall_64 → sys_getpid → return → SYSRET
Total: ~150-500ns (including KPTI overhead)
Slow path: Syscall that blocks (e.g., read(2) on a socket with no data):
SYSCALL → entry_SYSCALL_64 → do_syscall_64 → sys_read → vfs_read →
tcp_recvmsg → sk_wait_data → schedule() → [CPU switches to another task]
... time passes ...
[NIC interrupt arrives] → sk->sk_data_ready() → task woken and rescheduled
→ sk_wait_data returns → tcp_recvmsg returns → sys_read returns
→ syscall_exit_to_user_mode → check signals → SYSRET
Total: microseconds to seconds depending on wait time
In the slow path, the process is in TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE sleep during the wait. strace -T shows the wall-clock time for each syscall, making slow syscalls immediately visible.
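The fast-path cost can be estimated with a simple loop. This is a rough sketch rather than a rigorous benchmark (no CPU pinning, no warm-up, timer overhead included); the function names are hypothetical. It calls syscall(SYS_getpid) directly so that every iteration enters the kernel.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock in nanoseconds. */
static uint64_t demo_now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock nanoseconds per getpid syscall over the given iterations. */
double demo_ns_per_getpid(unsigned long iterations)
{
    uint64_t start, end;
    unsigned long i;

    start = demo_now_ns();
    for (i = 0; i < iterations; i++)
        syscall(SYS_getpid);   /* raw syscall: always crosses into the kernel */
    end = demo_now_ns();

    return (double)(end - start) / (double)iterations;
}
```

Comparing the result across boots with and without mitigations gives a first-order view of their per-syscall cost; run it several times and take the minimum to reduce noise.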
VDSO Implementation
Source: arch/x86/entry/vdso/, lib/vdso/
The vDSO is a small shared object (ELF .so) that the kernel maps into every process's address space. On x86-64, the vDSO exports, among others:
- __vdso_clock_gettime()
- __vdso_gettimeofday()
- __vdso_clock_getres()
- __vdso_getcpu()
The VDSO functions read from a kernel-maintained page: struct vdso_data (defined in include/vdso/datapage.h). The kernel updates this page (with a seqcount for consistency) every time the timekeeping state changes.
// Simplified vDSO clock_gettime (modeled on do_hres() in lib/vdso/gettimeofday.c):
notrace int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
{
    const struct vdso_data *vd = __arch_get_vdso_data();
    u64 cycles, sec, ns;
    u32 seq;

    switch (clock) {
    case CLOCK_REALTIME:
        do {
            seq = vdso_read_begin(vd);  // read seqcount, spin while odd (update in progress)
            cycles = rdtsc_ordered();
            ns = vd->basetime[CLOCK_REALTIME].nsec;      // stored left-shifted by vd->shift
            ns += (cycles - vd->cycle_last) * vd->mult;  // scale the TSC delta
            ns >>= vd->shift;
            sec = vd->basetime[CLOCK_REALTIME].sec;
        } while (vdso_read_retry(vd, seq));  // if seqcount changed, retry
        // ns may exceed one second; fold the excess into tv_sec
        ts->tv_sec = sec + ns / NSEC_PER_SEC;
        ts->tv_nsec = ns % NSEC_PER_SEC;
        return 0;
    ...
    }
}
No SYSCALL instruction is executed. The entire function runs in user space using TSC reads and kernel-maintained multiplication factors. Cost: typically a few tens of nanoseconds at most, versus ~150 ns or more for a real syscall with KPTI enabled.
The vDSO is built as a separate ELF binary (arch/x86/entry/vdso/vdso.lds.S defines the layout), compiled into the kernel, and mapped via arch_setup_additional_pages() during exec(). The mapping address is randomized along with the rest of the address space, which makes vDSO code harder to use in ROP attacks.
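A process can locate its own vDSO without parsing /proc: the kernel passes the base address in the auxiliary vector as AT_SYSINFO_EHDR. The sketch below (hypothetical function name) reads that entry with getauxval() and checks for the ELF magic at the mapped address.

```c
#include <elf.h>
#include <string.h>
#include <sys/auxv.h>

/* Returns 1 if the vDSO mapping starts with a valid ELF magic,
 * 0 if the mapping has unexpected contents, -1 if no vDSO is mapped. */
int demo_vdso_is_elf(void)
{
    unsigned long base = getauxval(AT_SYSINFO_EHDR);

    if (base == 0)
        return -1;   /* kernel or emulator did not provide a vDSO */
    return memcmp((const void *)base, ELFMAG, SELFMAG) == 0;
}
```

From that base address, a program can walk the ELF dynamic section to resolve symbols such as __vdso_clock_gettime, which is what the C library does at startup.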
Adding a New Syscall: The Complete Walk-Through
For documentation purposes, here is how a kernel developer would add a new system call sys_my_new_syscall(int arg1, unsigned long arg2):
Step 1: Add the syscall number in arch/x86/entry/syscalls/syscall_64.tbl, using the next unassigned number in the tree you are working on (462 is only illustrative here):
462 common my_new_syscall sys_my_new_syscall
Step 2: Add the declaration in include/linux/syscalls.h:
asmlinkage long sys_my_new_syscall(int arg1, unsigned long arg2);
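Step 3: Add the entry in include/uapi/asm-generic/unistd.h (used by architectures that share the generic syscall table) and bump __NR_syscalls. Use __SYSCALL for a syscall without a separate compat variant; __SC_COMP is only needed when a compat_sys_* implementation exists:
#define __NR_my_new_syscall 462
__SYSCALL(__NR_my_new_syscall, sys_my_new_syscall)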
Step 4: Implement using SYSCALL_DEFINE in a kernel source file:
// kernel/my_module.c
#include <linux/syscalls.h>
SYSCALL_DEFINE2(my_new_syscall, int, arg1, unsigned long, arg2)
{
if (arg1 < 0)
return -EINVAL;
if (!access_ok((void __user *)arg2, sizeof(long)))
return -EFAULT;
// actual implementation
return 0;
}
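Step 5: No manual header edit is needed on x86: the build generates the user-visible unistd_64.h (including __NR_my_new_syscall) from syscall_64.tbl. If the syscall depends on a config option, add a COND_SYSCALL(my_new_syscall) entry to kernel/sys_ni.c so that kernels built without it return -ENOSYS.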
Step 6: Add a man page (strongly expected in the kernel community) and tests in tools/testing/selftests/.
In practice, new syscalls are rare and controversial. The Linux kernel community strongly prefers extending existing syscalls (via new flags) or using existing interfaces (ioctl, netlink, io_uring). New syscalls require cross-architecture support (adding to all architecture tables) and must be carefully ABI-designed since they can never be removed.
Historical Context
The first Linux system calls (Linux 0.01, 1991) were ordinary kernel functions listed by hand in sys_call_table[] and reached through INT 0x80, with the table indexed by EAX. The SYSCALL_DEFINE macro family arrived much later, in Linux 2.6.29 (2009), to guarantee correct sign extension of 32-bit arguments on 64-bit architectures.
The SYSCALL/SYSRET instructions originated on AMD processors and became the standard 64-bit entry mechanism with AMD64 on K8 (2003); Intel processors support them in 64-bit mode. Linux has used SYSCALL/SYSRET since the beginning of its x86-64 support, since INT 0x80 on a 64-bit kernel enters the 32-bit compatibility path and truncates arguments to 32 bits.
The audit subsystem was added in Linux 2.6.6 (2004) for security logging requirements. seccomp was added in Linux 2.6.12 (2005) initially as a very restrictive "only allow read/write/exit/sigreturn" mode (SECCOMP_MODE_STRICT). The more flexible seccomp-BPF (SECCOMP_MODE_FILTER) was added in Linux 3.5 (2012) by Will Drewry at Google.
KPTI (Kernel Page Table Isolation), the Meltdown mitigation, was added in Linux 4.15 (2018). This added significant overhead to the syscall path (CR3 switch on entry and exit). PCID (Process-Context Identifiers) was enabled to amortize this cost — with PCID, the CR3 switch doesn't flush the TLB but tags each entry with a context ID.
Production Examples
Google's use of seccomp: Every Chrome renderer process, Google Cloud Functions sandbox, and gVisor (which acts as a user-space kernel) uses seccomp-BPF to restrict the set of syscalls available. Chrome's renderer seccomp filter allows approximately 70 of the ~460 available syscalls. This means that even if an attacker achieves arbitrary code execution in the renderer (e.g., via a JavaScript engine bug), they cannot call execve, connect, or ptrace.
Linux io_uring syscall design: io_uring_setup(2), io_uring_enter(2), and io_uring_register(2) are three syscalls that define an entirely new I/O paradigm. The io_uring_enter implementation in io_uring/io_uring.c submits a batch of operations from the SQ (submission queue) ring and optionally waits for completions. The "batch" design minimizes SYSCALL overhead: one syscall can submit 1000 I/O operations.
Syscall audit at a financial institution: PCI DSS compliance requires logging all execve(2) and connect(2) calls for systems handling cardholder data. auditd with rules like -a always,exit -F arch=b64 -S execve logs every program execution. On a busy server making thousands of execve() calls per second, this generates substantial I/O and CPU overhead — tuning audit rules to the minimum required is a practical operations concern.
Debugging Notes
# Trace all syscalls for a process with timing:
strace -T -tt -p <pid>
# Count syscalls:
strace -c ./my_program
# Lower-overhead syscall tracing with perf:
perf trace -p <pid> 2>&1 | head -50
# eBPF: syscall entry with arguments:
bpftrace -e '
tracepoint:syscalls:sys_enter_read {
printf("read: fd=%d, count=%d\n", args->fd, args->count);
}'
# Check if seccomp is active for a process:
cat /proc/<pid>/status | grep Seccomp
# 0 = disabled, 1 = strict, 2 = filter
# Audit syscall events:
ausearch -m SYSCALL -ts today -sv no # failed syscalls today
auditctl -l # current audit rules
# Measure syscall overhead with/without KPTI:
# (requires two kernels or KPTI disabling via nopti boot param)
./bench_syscall # simple getpid() benchmark
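# Identify what's in the vDSO:
# Find the [vdso] mapping (the address differs per process under ASLR):
grep vdso /proc/self/maps
# Dump and disassemble. dd reads its *own* memory through /proc/self/mem, so the
# address below is only an example; it is valid only when ASLR is disabled
# (e.g. with setarch $(uname -m) -R) and matches the [vdso] line for that process:
dd if=/proc/self/mem bs=1 skip=$((0x7ffff7ffd000)) count=4096 of=/tmp/vdso.so 2>/dev/null
objdump -d /tmp/vdso.so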
Security Implications
TOCTOU (Time of Check to Time of Use) in syscall arguments: access_ok() only checks that a user pointer lies within the user address range; it says nothing about the data behind it. A concurrent thread can modify user memory at any time, so code that reads a value from user space, validates it, and then reads it from user space again (a "double fetch") can end up using a value that was never validated. The fix is to copy the data into kernel memory exactly once with copy_from_user(), which performs the range check itself and returns an error on a fault instead of crashing, and then to validate and use only the kernel copy.
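The single-copy discipline can be sketched in plain C. Everything here is a user-space model with hypothetical names: demo_copy_from_user() stands in for the kernel's copy_from_user(), and struct demo_request is an invented argument layout. The point is that the length is validated on the private snapshot, never re-read from the caller's buffer.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_PAYLOAD 16

/* Hypothetical request layout that a caller passes by pointer. */
struct demo_request {
    uint32_t len;
    uint8_t payload[DEMO_MAX_PAYLOAD];
};

/* Stand-in for copy_from_user(): returns 0 on success (assumption: never faults here). */
static int demo_copy_from_user(void *dst, const void *user_src, size_t n)
{
    memcpy(dst, user_src, n);
    return 0;
}

/* Copies the request once, validates the snapshot, and uses only the snapshot.
 * Returns the number of payload bytes written to out, or a negative errno. */
long demo_handle_request(const struct demo_request *user_req, uint8_t *out)
{
    struct demo_request req;   /* private copy: the caller cannot change it */

    if (demo_copy_from_user(&req, user_req, sizeof(req)) != 0)
        return -EFAULT;
    if (req.len > DEMO_MAX_PAYLOAD)
        return -EINVAL;        /* checked on the snapshot ... */

    memcpy(out, req.payload, req.len);   /* ... and used from the snapshot */
    return (long)req.len;
}
```

The buggy variant would check user_req->len and later copy user_req->len bytes, giving another thread a window to enlarge len between the two reads.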
Compat syscalls and argument truncation: On 64-bit kernels running 32-bit processes (compat mode), arguments are 32 bits. The compat_sys_* functions handle the conversion. A classic bug: a 32-bit size_t argument that, when sign-extended to 64 bits, becomes a very large number (0xffffffff → 0xffffffffffffffff). Incorrect compat_sys implementations have caused multiple privilege escalation bugs.
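The difference between the two widenings is a pair of casts. A minimal illustration with hypothetical helper names (the signed conversion assumes two's-complement behavior, as on all Linux targets):

```c
#include <stdint.h>

/* Correct widening for an unsigned 32-bit quantity such as a compat size_t. */
static uint64_t demo_zero_extend(uint32_t v)
{
    return (uint64_t)v;
}

/* What happens if the value passes through a signed 32-bit type on the way. */
static uint64_t demo_sign_extend(uint32_t v)
{
    return (uint64_t)(int64_t)(int32_t)v;
}
```

For v = 0xffffffff, the first helper yields 0x00000000ffffffff and the second 0xffffffffffffffff, which is the kind of length that later defeats a bounds check.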
array_index_nospec: Every syscall table lookup uses array_index_nospec(nr, NR_syscalls) — a Spectre v1 mitigation that prevents the CPU from speculatively executing sys_call_table[nr] with an out-of-bounds nr before the bounds check completes. Without this, Spectre gadgets in do_syscall_x64() could leak kernel memory.
Performance Implications
KPTI overhead breakdown (measured on Intel Skylake):
Without KPTI: ~100ns per syscall (getpid)
With KPTI, no PCID: ~300ns per syscall (TLB flush on every CR3 switch)
With KPTI + PCID: ~150ns per syscall (tagged TLB entries, no full flush)
PCID (used automatically on CPUs that support it; the nopcid boot parameter disables it) tags each TLB entry with a 12-bit context ID. A CR3 write can then switch address spaces without flushing entries that belong to other PCIDs. Linux does not assign one PCID per process: each CPU keeps a small set of dynamically recycled ASIDs, and with KPTI every ASID maps to a pair of PCIDs, one for the kernel page tables and one for the user page tables.
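The CR3 encoding behind this is simple bit arithmetic. In the sketch below, the hardware layout (PCID in bits 0 to 11, no-flush hint in bit 63 when CR4.PCIDE is set) follows the Intel SDM; the ASID-to-PCID mapping (kernel PCID = ASID + 1, user PCID = kernel PCID with bit 11 set) is an assumption modeled on my reading of arch/x86/mm/tlb.c and may differ between kernel versions. All names are hypothetical.

```c
#include <stdint.h>

#define DEMO_CR3_PCID_MASK  0xFFFull        /* CR3[11:0] holds the PCID */
#define DEMO_CR3_NOFLUSH    (1ull << 63)    /* keep TLB entries for this PCID */
#define DEMO_PTI_USER_BIT   11              /* assumption: mirrors the kernel's user-PCID bit */

/* Compose the value written to CR3 for a given page-table root and PCID. */
static uint64_t demo_build_cr3(uint64_t pgd_phys, uint16_t pcid, int noflush)
{
    uint64_t cr3 = (pgd_phys & ~DEMO_CR3_PCID_MASK) | (pcid & DEMO_CR3_PCID_MASK);

    if (noflush)
        cr3 |= DEMO_CR3_NOFLUSH;
    return cr3;
}

/* Assumed mapping from a per-CPU ASID slot to the kernel-side PCID. */
static uint16_t demo_kern_pcid(uint16_t asid)
{
    return (uint16_t)(asid + 1);
}

/* Assumed mapping to the matching user-side PCID used by the KPTI page tables. */
static uint16_t demo_user_pcid(uint16_t asid)
{
    return (uint16_t)(demo_kern_pcid(asid) | (1u << DEMO_PTI_USER_BIT));
}
```

With the no-flush bit set on both the entry and exit CR3 writes, a syscall switches page tables twice without discarding either half of the TLB, which is where the savings over plain KPTI come from.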
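Spectre v2 mitigation overhead: With legacy IBRS (enabled with the spectre_v2=ibrs boot parameter), the kernel writes the IA32_SPEC_CTRL MSR on every kernel entry and exit, which adds a noticeable per-syscall cost (on the order of a hundred cycles or more, depending on the microarchitecture). IBPB (Indirect Branch Predictor Barrier) is more expensive still, but it is issued on context switches between processes rather than on every syscall. Retpolines slow down indirect calls such as the dispatch through sys_call_table. CPUs with Enhanced IBRS (eIBRS) set the mode once at boot and have near-zero per-syscall overhead.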
Failure Modes and Real Incidents
CVE-2022-0847 "Dirty Pipe" (syscall interaction bug): The bug was in the interaction between splice(), write(), and the pipe buffer. When splice() attached a page-cache page to a pipe, the flags field of the new pipe_buffer was left uninitialized, so a stale PIPE_BUF_FLAG_CAN_MERGE from the slot's previous use could survive. A subsequent write() to the pipe then merged its data into that page, modifying the contents of read-only page-cache pages. Root cause: copy_page_to_iter_pipe() and push_pipe() in lib/iov_iter.c did not initialize flags. The bug was in the kernel-side implementation of the splice(2) + write(2) syscall interaction.
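CVE-2014-4014 (capability check bypass via user namespaces): The capability code did not account for the fact that user namespaces do not apply to inodes. A local user could create a user namespace with unshare(CLONE_NEWUSER), hold capabilities inside it, and then pass inode permission checks (for example, a chmod that sets the setgid bit) on files whose owner was not mapped into that namespace. It was fixed in Linux 3.14.8 by requiring that the inode's uid and gid be mapped in the caller's namespace. This is a classic example of privilege escalation through the interaction of a new feature with older syscall permission checks.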
Seccomp filter bypass via SECCOMP_RET_TRACE (kernels before 4.8): Before Linux 4.8, the seccomp filter was not re-run after a ptrace tracer had been notified through SECCOMP_RET_TRACE. A tracer could change the syscall number or arguments at that point, and the modified syscall executed without being checked against the filter, so a sandbox that permitted ptrace could be escaped. Since Linux 4.8 the filter is evaluated again after the tracer returns.
Modern Usage
io_uring IORING_OP_* as "meta-syscalls": io_uring defines a new class of operations (struct io_uring_sqe operations: IORING_OP_READ, IORING_OP_SEND, IORING_OP_SOCKET, etc.) that are submitted as ring buffer entries rather than as traditional syscalls. The kernel-side implementation in io_uring/ (io_uring/rw.c, io_uring/net.c) mirrors the traditional syscall implementation but in a batched, async-aware way. This is effectively a new syscall interface layered on top of a single io_uring_enter(2) syscall.
BPF syscall as a multiplexer: The bpf(2) syscall with its cmd argument is a modern example of a "fat" syscall that multiplexes many operations. BPF_PROG_LOAD, BPF_MAP_CREATE, BPF_PROG_ATTACH are all separate commands under one syscall number. Multiplexed interfaces like this, together with io_uring opcodes, have become a common way to grow kernel functionality without allocating new syscall numbers, although they are harder to restrict with per-syscall tools (io_uring operations, for instance, are not seen by seccomp syscall filters).
Landlock: Three syscalls added in Linux 5.13 for application-level filesystem sandboxing: landlock_create_ruleset(2), landlock_add_rule(2), landlock_restrict_self(2). These are clean SYSCALL_DEFINE implementations that interact with the LSM framework. The implementation in security/landlock/ is a well-structured example of modern syscall addition.
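Because the C library may not wrap these syscalls, programs usually invoke them through syscall(2). The sketch below probes the Landlock ABI version, following the pattern given in the kernel's Landlock documentation; the function name is hypothetical, and the fallback definitions (syscall number 444, version flag 1 << 0) are stated here as assumptions for toolchains whose headers predate Landlock.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; prefer <linux/landlock.h> when available). */
#ifndef __NR_landlock_create_ruleset
#define __NR_landlock_create_ruleset 444
#endif
#ifndef LANDLOCK_CREATE_RULESET_VERSION
#define LANDLOCK_CREATE_RULESET_VERSION (1U << 0)
#endif

/* Returns the Landlock ABI version (>= 1), or -1 with errno set
 * (ENOSYS: kernel built without Landlock; EOPNOTSUPP: disabled at boot). */
long demo_landlock_abi_version(void)
{
    return syscall(__NR_landlock_create_ruleset, NULL, (size_t)0,
                   LANDLOCK_CREATE_RULESET_VERSION);
}
```

Probing the version first lets an application enable only the access rights that the running kernel understands, which is the forward-compatibility approach the Landlock documentation recommends.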
Future Directions
- io_uring as the future I/O syscall: Long-term, the goal is for all I/O-related kernel functionality to be accessible via io_uring operations without additional syscalls. io_uring_register(IORING_REGISTER_PBUF_RING, ...) already provides buffer management; future ops may cover file operations, signals, and process management.
- Syscall user dispatch: Lets a thread mark an address range whose syscalls execute normally, while syscalls issued from anywhere else are turned into a SIGSYS that a user-space handler can emulate (enabled per thread with prctl(PR_SET_SYSCALL_USER_DISPATCH, ...)). It was designed for compatibility layers such as Wine, which need to intercept syscalls made by foreign binaries without ptrace overhead.
- Landlock expansion: Additional Landlock access rights (network access control, IPC restrictions) are being developed to make Landlock a general-purpose sandboxing tool, adding more landlock_add_rule variants.
Exercises
- Find the SYSCALL_DEFINE3(read, ...) definition in fs/read_write.c. Expand the macro manually (following include/linux/syscalls.h). What function names does it generate? What is __do_sys_read? How does it differ from ksys_read?
- Read arch/x86/entry/entry_64.S from entry_SYSCALL_64 to the first call instruction. List in order every instruction executed and what it does. Pay special attention to swapgs, the stack switch, and the pushq sequence that builds pt_regs.
- Write a minimal C program that uses seccomp(SECCOMP_SET_MODE_FILTER, ...) to install a filter that allows only read, write, exit_group, and brk syscalls. Then try to call getpid(). What happens? Inspect the signal received with sigaction(SIGSYS, ...).
- Measure the overhead of KPTI on your system. Write a microbenchmark that calls getpid() in a loop. Run it with nopti in the kernel command line (if testing in a VM) and without. What is the difference in ns/syscall? Check dmesg | grep -i "page tables isolation" to confirm KPTI status.
- Read kernel/auditsc.c, specifically audit_log_exit(). What information does it record for each audited syscall? Install auditd, add a rule to audit all execve() calls, run a command, and inspect the audit log. Identify each field in the SYSCALL audit record and match it to the kernel structure that provided it.
References
- Linux kernel source: arch/x86/entry/entry_64.S, arch/x86/entry/common.c, include/linux/syscalls.h, arch/x86/include/asm/ptrace.h, kernel/seccomp.c, kernel/audit.c
- x86-64 ABI: "System V Application Binary Interface, AMD64 Architecture Processor Supplement"
- seccomp filter documentation: Documentation/userspace-api/seccomp_filter.rst
- VDSO documentation: Documentation/vDSO/
- Michael Kerrisk, The Linux Programming Interface, Chapter 3 (System Programming Concepts) and Chapter 21 (Signals)
- Intel SDM Vol 2B: SYSCALL/SYSRET instruction descriptions
- Jann Horn, Linux security research: https://bugs.chromium.org/p/project-zero/
- Dirty Pipe blog: https://dirtypipe.cm4all.com/
- Will Drewry, "Seccomp filter for Chrome", BlackHat 2012