02 — fork(), exec(), and the Unix Process Creation Model
Technical Overview
Unix process creation is built on two orthogonal system calls: fork() duplicates the
calling process, and exec() replaces the calling process's program image with a new
one. Together they form the "fork-exec" idiom that every shell, service manager, and
container runtime uses. Linux exposes both through a single, powerful primitive —
clone() — that makes the degree of sharing between parent and child fully
configurable. Understanding these mechanics is essential for anyone writing daemons,
container runtimes, language runtimes, or anything that manages child processes.
Prerequisites
- 01-process-concept.md: task_struct, process address space, PID/TGID distinction
- Virtual memory concepts: page tables, copy-on-write (CoW), VMAs
- ELF binary format basics (load segments, entry point, interpreter path)
- File descriptor semantics (reference counting, O_CLOEXEC)
Core Content
The fork/exec/clone Relationship
User-space API            Kernel entry point
──────────────────        ──────────────────────────────────────
fork()             ────►  clone(SIGCHLD, 0)
vfork()            ────►  clone(CLONE_VFORK|CLONE_VM|SIGCHLD, 0)
pthread_create()   ────►  clone(CLONE_VM|CLONE_FS|CLONE_FILES|
                                CLONE_SIGHAND|CLONE_THREAD|
                                CLONE_SETTLS|..., stack)
                             │
                             ▼
                      kernel/fork.c:
                      sys_clone() / clone3()
                             │
                             ▼
                      copy_process()
                             │
┌──────────────┴──────────────────────────────┐
│ dup_task_struct() — new stack + PCB │
│ copy_creds() — credentials │
│ copy_mm() — address space (CoW) │
│ dup_fd() — file descriptor table │
│ copy_fs() — fs context (cwd/root) │
│ copy_sighand() — signal handlers │
│ copy_signal() — signal state │
│ copy_thread() — arch register state │
│ alloc_pid() — PID allocation │
└─────────────────────────────────────────────┘
All three user-space calls ultimately invoke copy_process(). The clone_flags bitmask
decides what is copied (independent) versus shared (pointer shared between parent
and child):
| Flag | Effect when set | Effect when clear |
|---|---|---|
| CLONE_VM | Share mm_struct (same address space) | Copy mm_struct with CoW page tables |
| CLONE_FILES | Share files_struct | Duplicate fd table |
| CLONE_FS | Share fs_struct (cwd, root, umask) | Copy fs context |
| CLONE_SIGHAND | Share sighand_struct (signal handlers) | Copy handlers |
| CLONE_THREAD | Place in same thread group (same tgid) | New thread group (the child's tgid is its own PID) |
| CLONE_NEWPID | Create new PID namespace | Inherit parent's |
| CLONE_NEWNET | Create new network namespace | Inherit parent's |
fork() Internals: copy_process() in Detail
Step 1 — dup_task_struct()
Allocates a new task_struct and a new kernel stack (8 or 16 KB depending on arch
config). On kernels built with CONFIG_THREAD_INFO_IN_TASK (x86-64 among them),
thread_info lives inside task_struct; on older or other configurations it sits at the
base of the stack and is initialized to point at the new task_struct. The contents of
task_struct are initially a shallow copy of the parent, and subsequent steps
selectively deep-copy or re-share fields.
Step 2 — copy_mm() (fork case: no CLONE_VM)
The parent's mm_struct is duplicated: a new mm_struct is allocated, and the VMA
list is cloned. However, the actual physical pages are not copied. Instead,
copy_page_range() walks the parent's page tables and marks every writable page
read-only in both parent and child page tables. This is copy-on-write (CoW).
When either process writes to a CoW page, the MMU raises a page fault. The kernel's
do_wp_page() allocates a new physical page, copies the content, updates the faulting
process's page table entry to point to the new page and mark it writable, and the write
proceeds. Unmodified pages continue to be shared indefinitely.
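Result: fork() copies page tables, not page contents. Its cost scales with the number
of VMAs and populated page-table entries, not with the number of bytes mapped. A
process with 2 GB of resident anonymous memory has about 524,288 4 KB pages, so fork
duplicates roughly 4 MB of page-table entries instead of 2 GB of data.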
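The isolation that CoW provides can be observed from user space. The following is a minimal sketch, not a benchmark: the helper name child_write_is_private() and the 1 MiB buffer size are assumptions made for illustration. The child overwrites a buffer it inherited, and the parent then checks that its own copy is unchanged.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write made by the child after fork()
 * is invisible to the parent, 0 if the parent saw it, -1 on error. */
int child_write_is_private(void)
{
    size_t len = 1 << 20;               /* 1 MiB buffer (arbitrary size) */
    char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', len);              /* populate every page in the parent */

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: the first write to each shared page faults and the kernel
         * gives the child its own private copy of that page. */
        memset(buf, 'C', len);
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0) {
        free(buf);
        return -1;
    }

    /* Parent: its pages must still hold the original contents. */
    int unchanged = (buf[0] == 'P' && buf[len - 1] == 'P');
    free(buf);
    return unchanged;
}
```

Calling child_write_is_private() from main() and printing the result is enough to try it; running the program under perf stat -e page-faults shows the extra faults taken by the child's memset.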
Step 3 — dup_fd() (fork case: no CLONE_FILES)
The file descriptor table is copied. Open file descriptions (the kernel-side objects
pointed to by struct file *) are not duplicated — both parent and child hold
references to the same open file descriptions. Consequently they share the file offset
(f_pos). This is the POSIX-specified behavior: writes from parent and child to the same
fd are not coordinated and can interleave.
File descriptors marked O_CLOEXEC are closed in the child across an execve() call
(not at fork time).
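The shared file offset is easy to demonstrate. In this sketch (the helper name offset_after_child_write() is hypothetical, and tmpfile() is used only to get a scratch file), the child writes through the inherited descriptor and the parent then asks where its own offset is.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the parent's file offset after the child
 * wrote n bytes through the inherited descriptor, or -1 on error. */
long offset_after_child_write(size_t n)
{
    FILE *tmp = tmpfile();              /* scratch file, removed on close */
    if (tmp == NULL)
        return -1;
    int fd = fileno(tmp);

    pid_t pid = fork();
    if (pid < 0) {
        fclose(tmp);
        return -1;
    }
    if (pid == 0) {
        /* Child: write one byte at a time through the shared description. */
        for (size_t i = 0; i < n; i++)
            if (write(fd, "x", 1) != 1)
                _exit(1);
        _exit(0);
    }

    int status;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;

    /* Parent never wrote, yet its offset moved: f_pos lives in the open
     * file description, which both processes reference. */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    fclose(tmp);
    return ok ? (long)pos : -1;
}
```

With n = 5 the parent observes an offset of 5 even though it never wrote to the file itself.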
Step 4 — copy_sighand() and copy_signal()
The signal handler table is copied, so the child starts with the same dispositions as
the parent. The signal mask is inherited unchanged. Pending signals are not inherited:
the child starts with empty sigpending queues, both per-thread and thread-group-wide.
Step 5 — copy_thread() (arch-specific)
On x86-64 (arch/x86/kernel/process.c:copy_thread()), the child's kernel stack is set
up so that when the scheduler first runs the child, it returns from fork() with return
value 0. In the parent, the system call returns the child's PID (copy_process() itself
hands back the new task_struct, and its caller derives the PID from it). This is the
classic fork return-value bifurcation: same code, two different return values.
vfork(): Sharing the Address Space
vfork() is clone(CLONE_VFORK | CLONE_VM | SIGCHLD). The child shares the parent's
mm_struct entirely, with no page table duplication. The parent is put to sleep (in
current kernels a killable wait rather than a plain TASK_UNINTERRUPTIBLE one) on a
vfork_done completion object until the child calls execve() or _exit(), at which
point the parent is woken up.
The performance motivation: if you immediately call exec() after fork(), all the
CoW page-table setup work is wasted because exec() discards the entire address space
anyway. vfork() avoids that cost.
Safety constraints: between vfork() and exec()/_exit(), the child must not:
- write to local variables in the function that called vfork()
- call any function that might allocate heap memory or call exit() (which calls
atexit() handlers and flushes stdio buffers shared with the parent)
- return from the function that called vfork()
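In practice, CoW makes fork() cheap enough that vfork()'s saving matters mainly for
processes with large address spaces, where even page-table duplication is expensive.
On Linux, glibc's posix_spawn() relies on the same mechanism internally
(CLONE_VM | CLONE_VFORK) as an implementation detail.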
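For callers that only need "start this program and wait for it", posix_spawn() avoids the vfork() pitfalls listed above because no user code runs in the child. A minimal sketch follows; the wrapper name spawn_and_wait() is hypothetical, and no file actions or spawn attributes are used.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical wrapper: spawns argv[0] (searched on PATH) and returns its
 * exit status, or -1 if the spawn failed or the child did not exit normally. */
int spawn_and_wait(char *const argv[])
{
    pid_t pid;

    /* posix_spawnp() returns an errno value directly; it does not set errno. */
    int err = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For example, passing the argument vector {"sh", "-c", "exit 7", NULL} returns 7.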
clone3(): The Modern Interface
Linux 5.3 introduced clone3() which takes a struct clone_args rather than cramming
everything into a flags bitmask and loose arguments. It adds:
- pidfd: the kernel can return a PID file descriptor (pidfds) instead of a raw integer,
avoiding PID reuse races in code that signals or waits on children.
- set_tid: allows specifying the desired PID (for checkpoint/restore).
- exit_signal: specifies which signal the parent receives when the child exits.
execve() Mechanics
execve(path, argv, envp) replaces the current process image entirely. The key kernel
path on Linux:
sys_execve()
└─ do_execveat_common()
├─ open_exec() — open the binary, check permissions
├─ bprm_init() — allocate linux_binprm, copy argv/envp
├─ search_binary_handler()
│ └─ load_elf_binary() — for ELF files
│ ├─ elf_check_arch()
│ ├─ load ELF LOAD segments → map into new address space
│ ├─ if PT_INTERP present:
│ │ load dynamic linker (ld.so) into address space
│ ├─ setup_new_exec()
│ │ — flush old mm (munmap everything), install new mm
│ ├─ setup_arg_pages()
│ │ — set up stack: argv, envp, auxv
│ └─ set entry point (ld.so entry or ELF entry if static)
└─ exec returns to user space at new entry point
The auxiliary vector (auxv): the kernel passes metadata to the process on the stack,
immediately after the NULL that terminates the envp array. Key entries:
- AT_PHDR: address of the ELF program header table (ld.so uses this to find DYNAMIC)
- AT_ENTRY: program entry point (so ld.so knows where to jump after relocation)
- AT_RANDOM: 16 random bytes (used by glibc as stack canary seed)
- AT_SYSINFO_EHDR: address of the vDSO page
Read auxv from a running process:
cat /proc/self/auxv | od -t x8
# or
LD_SHOW_AUXV=1 /bin/true
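From C, the same entries can be read with glibc's getauxval(), which returns 0 when an entry is absent. This sketch prints the four entries listed above; the function name print_auxv_summary() is hypothetical.

```c
#include <elf.h>        /* AT_* constants */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */

/* Hypothetical helper: prints the auxv entries discussed above and
 * returns how many of them were present. */
int print_auxv_summary(void)
{
    static const struct {
        unsigned long type;
        const char *name;
    } wanted[] = {
        { AT_PHDR,         "AT_PHDR" },
        { AT_ENTRY,        "AT_ENTRY" },
        { AT_RANDOM,       "AT_RANDOM" },
        { AT_SYSINFO_EHDR, "AT_SYSINFO_EHDR" },
    };

    int found = 0;
    for (size_t i = 0; i < sizeof wanted / sizeof wanted[0]; i++) {
        unsigned long value = getauxval(wanted[i].type);  /* 0 if absent */
        if (value != 0) {
            printf("%-16s 0x%lx\n", wanted[i].name, value);
            found++;
        }
    }
    return found;
}
```

The printed values are addresses and change from run to run when ASLR is enabled, so they are best compared against LD_SHOW_AUXV=1 output from the same process rather than across runs.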
Stack layout at process entry (x86-64):
Low address (stack pointer at entry)
┌──────────────────────────────┐
│ argc                         │ ← %rsp at _start
├──────────────────────────────┤
│ argv[0] pointer              │
│ argv[1] pointer              │
│ ...                          │
│ NULL                         │
├──────────────────────────────┤
│ envp[0] pointer              │
│ ...                          │
│ NULL                         │
├──────────────────────────────┤
│ auxv[0] {type, value}        │
│ auxv[1] {type, value}        │
│ ...                          │
│ {AT_NULL, 0}                 │
├──────────────────────────────┤
│ string data for argv/envp    │
└──────────────────────────────┘
High address (stack base)
exec() and Credential Changes (setuid)
When execve() loads a binary with the setuid bit set (S_ISUID):
effective UID ← file owner UID
The kernel calls prepare_binprm() which checks S_ISUID/S_ISGID and calls
bprm_fill_uid(). New credentials are committed via commit_creds(new_cred) after
the binary is fully loaded and just before jumping to user space.
Capability rules on exec:
- If the binary has file capabilities (stored in the security.capability extended
attribute, typically set with setcap(8)), execve() can add capabilities
to the permitted set that the process did not previously have.
- PR_SET_NO_NEW_PRIVS (set by systemd for sandboxed services, by seccomp-bpf setups)
disables setuid and file capabilities for the process and all its descendants —
a powerful sandboxing primitive.
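Setting the flag is a single prctl() call. The sketch below sets it in a forked child so the calling process is left untouched; the helper name no_new_privs_in_child() is hypothetical.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sets no_new_privs in a forked child and returns the
 * value the child reads back (1 on success), or -1 on error. */
int no_new_privs_in_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* One-way switch: it cannot be cleared and is inherited across
         * fork() and execve(). */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        /* From here on, exec-ing a setuid or file-capability binary would
         * run it without the extra privileges. */
        int value = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
        _exit(value == 1 ? 1 : 2);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status) == 1 ? 1 : -1;
}
```

A real sandbox would make the prctl() call right before installing its seccomp filter and calling execve(), instead of exiting.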
execve() and Filesystem Namespaces
execve operates within the current mount namespace. The path argument is resolved
using the process's fs_struct (root and cwd). Container runtimes use pivot_root() or
chroot() followed by execve() inside the container's mount namespace to ensure the
new program sees the container filesystem, not the host.
O_CLOEXEC and exec: any fd opened without O_CLOEXEC remains open across execve.
This is a common security bug — long-lived file descriptors (pipes to privileged parents,
sockets, memfd secrets) accidentally inherited by setuid children. Mitigation: use
open(path, O_RDONLY | O_CLOEXEC) everywhere, and call closefrom(3) or iterate
/proc/self/fd before execve in security-sensitive code.
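The effect of the close-on-exec flag can be tested directly. In this sketch the helper name fd_survives_exec() is hypothetical, and /bin/sh is assumed to exist and to accept the >&N redirection syntax. A child execs a shell that tries to write to an inherited pipe descriptor, and the parent checks whether anything arrived.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the write end of a pipe is still usable
 * by a program started with exec, 0 if it was closed, -1 on error. */
int fd_survives_exec(int set_cloexec)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    if (set_cloexec && fcntl(p[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* Shell command that writes to the inherited fd number, quietly. */
    char cmd[64];
    snprintf(cmd, sizeof cmd, "exec 2>/dev/null; echo x >&%d", p[1]);

    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                     /* exec failed */
    }

    close(p[1]);                        /* parent keeps only the read end */
    char c;
    ssize_t n = read(p[0], &c, 1);      /* 0 means all write ends are closed */
    close(p[0]);
    waitpid(pid, NULL, 0);
    return n < 0 ? -1 : (n > 0);
}
```

fd_survives_exec(0) is expected to return 1 and fd_survives_exec(1) to return 0. In new code, passing O_CLOEXEC to open() or pipe2() is preferable to a separate fcntl() call, because it leaves no window in which another thread can fork and exec with the flag still unset.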
Historical Context
The fork/exec separation dates from the original UNIX (Bell Labs, early 1970s). It is
often presented as a deliberate design choice, but Ken Thompson later said it was
almost an accident: they needed process creation but hadn't designed exec yet, so the
first version of fork() just duplicated the process and both copies continued running
the same program. exec() came later to load new programs.
The Multics OS (1965) used a single "create process and load program" primitive. Plan 9
and many microkernel designs revisited this, arguing that fork/exec wastes work. Linux's
clone() (added in 1.x) represents a third approach: a fully parametric primitive that
subsumes fork, vfork, and thread creation.
posix_spawn() (POSIX.1-2001) was standardized as an alternative for resource-
constrained systems where fork/exec overhead is unacceptable (embedded, real-time).
Linux glibc implements it via vfork() + execve() when possible.
Production Examples
Shell fork-exec cycle:
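// Simplified version of what a shell does for each command.
// Assumes pipe_fd[0] is the read end of an already-created pipe.
pid_t pid = fork();
if (pid < 0) {
    perror("fork");                     // typically EAGAIN or ENOMEM
} else if (pid == 0) {
    // child
    dup2(pipe_fd[0], STDIN_FILENO);     // plumbing: dup2 closes the old stdin itself
    close(pipe_fd[0]);                  // the original descriptor is no longer needed
    execvp(argv[0], argv);              // replace image
    _exit(127);                         // reached only if exec failed
} else {
    // parent
    waitpid(pid, &status, 0);
}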
Checking which resources a fork shares (strace output):
strace -e trace=clone,mmap,munmap -f ./fork_test 2>&1 | head -40
# clone() flags reveal exactly which resources are shared/copied
pidfd for race-free process management:
// clone3 with CLONE_PIDFD
int pidfd = -1;
struct clone_args args = {
    .flags = CLONE_PIDFD,
    .pidfd = (uint64_t)(uintptr_t)&pidfd,   // kernel stores the new fd here
    .exit_signal = SIGCHLD,
};
pid_t pid = syscall(SYS_clone3, &args, sizeof(args));
if (pid == 0)
    _exit(0);                               // child: do the work, then exit
// parent: pidfd is a file descriptor, valid even if the PID wraps around
siginfo_t info;
waitid(P_PIDFD, pidfd, &info, WEXITED);
Debugging Notes
- Fork bomb recovery: a per-user ulimit -u <N> (RLIMIT_NPROC) limit is the main pre-emptive defense. Once PIDs are exhausted, a shell cannot fork, so external commands and command substitutions such as $(pgrep -u baduser) fail. Work from an existing shell session using only operations that need no fork: the kill builtin with known PIDs (kill -9 PID), or exec pkill -9 -u baduser, which replaces the shell with pkill instead of forking.
- Tracing exec: strace -e execve -f -p PID shows every execve call in a process tree. Useful for finding unexpected interpreter invocations.
- File descriptor leaks across exec: ls -la /proc/PID/fd shows inherited fds. Compare the set before and after exec, and check /proc/PID/fdinfo/N, whose flags field includes 02000000 when close-on-exec is set.
- CoW faults: the perf page-faults counter will spike after fork if the child or parent modifies many pages. Use perf stat -e page-faults,minor-faults ./prog to quantify.
- vfork deadlock: if the child calls a function that uses malloc before exec, it can deadlock because the malloc lock in the shared address space may already be held by another parent thread.
Security Implications
- setuid exec and privilege escalation: any writable component of $PATH before a setuid binary's directory allows a PATH hijack. execve checks permissions with the caller's filesystem UID (normally the effective UID, not the real UID) when opening the binary, and applies the setuid bit only after the open; the window between open and exec is the subject of TOCTOU analysis.
- PR_SET_NO_NEW_PRIVS: once set, neither setuid bits nor file capabilities can grant privileges. Used by Chrome, systemd services, and seccomp-bpf before loading a restrictive filter (a seccomp filter cannot be bypassed by exec-ing a setuid helper if NO_NEW_PRIVS is set).
- Inherited file descriptors: if a privileged process forks and the child execs an untrusted binary without closing all fds, the untrusted code may inherit access to /dev/kmem, sockets bound to privileged ports, or open secrets. Use O_CLOEXEC by default and audit with ls /proc/PID/fd.
- symlink attacks on /proc/self/exe: a process that re-execs itself using /proc/self/exe can be tricked if an attacker can replace the executable after the original open but before the exec. Use fexecve(fd, argv, envp) (open the binary to an fd first, then exec via that fd) for re-exec security.
Performance Implications
- fork() cost is dominated by page-table duplication (O(address space VMAs)) and TLB invalidation, not by copying pages. A process with 10,000 VMAs (common in JVMs and Go runtimes due to mmap-heavy allocation) may spend 100+ µs in fork even with CoW.
- exec() cost involves at minimum: file open + permission check, ELF header read, 1–4 mmap() calls for LOAD segments, ld.so being loaded and initialized. For a dynamically linked binary with 50 shared libraries, the dynamic linker's relocation work adds tens of milliseconds of user-space startup time.
- Reducing fork overhead in JVMs: JVM-based languages that spawn many processes (e.g., Clojure build tools) suffer from large heap fragmentation of CoW pages. jemalloc tends to keep pages cleaner than glibc's allocator; transparent huge pages can worsen CoW cost (one 2 MB THP fault = one 2 MB copy).
- posix_spawn vs fork+exec: on Linux, posix_spawn is not significantly faster than vfork+exec because glibc implements it as such. On systems without virtual memory (MMU-less), the difference is dramatic.
Failure Modes
| Scenario | Error | Cause |
|---|---|---|
| Fork returns EAGAIN | EAGAIN | RLIMIT_NPROC reached for this user |
| Fork returns ENOMEM | ENOMEM | Cannot allocate task_struct or duplicate page tables |
| exec returns ENOEXEC | ENOEXEC | Binary not recognized by any registered binfmt handler |
| exec returns ETXTBSY | ETXTBSY | Binary file is currently open for writing |
| vfork child calls exit() | Corrupted parent state (stdio buffers, atexit handlers) | Child called exit() instead of _exit(), running atexit() handlers and flushing stdio buffers shared with the parent |
| fd leak across exec | Unintended access | O_CLOEXEC not set; privileged fds inherited |
Modern Usage
Container runtimes (runc, crun) use clone() with CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC to place the container's init process into
a fully isolated namespace set, then execve() the container entrypoint.
Language runtimes: Go's syscall.ForkExec uses vfork (or a clone variant) on
Linux and carefully avoids any goroutine scheduler interaction between fork and exec
because Go's runtime state is deeply multi-threaded. CPython uses os.fork() but warns
that forking a multi-threaded Python process is unsafe (lock state incoherence); it
recommends multiprocessing with the spawn start method on newer Python.
fexecve(3): execute a file via an open file descriptor, bypassing path resolution
entirely. Used by container runtimes to exec a binary that is inside a memfd or an
already-open file, without race conditions on the filesystem path.
Future Directions
- clone3 with CLONE_INTO_CGROUP: allows placing the new task directly into a specified cgroup at creation time, eliminating the racy "fork, then move to cgroup" pattern that container runtimes historically used.
- io_uring and process creation: proposals to allow clone3/execve to be submitted via io_uring for async process spawning without blocking the caller.
- Checkpoint/Restore in Userspace (CRIU): CRIU restores processes using clone3 with set_tid, set_tid_size, and custom PID namespace mappings to recreate exact PID trees across migration.
- Rust-based fork safety: the Rust standard library deliberately does not expose fork() in safe code due to multi-threading hazards. Discussion ongoing about whether a safe prefork/postfork hook mechanism would make it feasible.
Exercises
- Measuring CoW cost: write a C program that allocates 1 GB via mmap(MAP_ANONYMOUS), fills it with data, then forks. Time the fork with clock_gettime. Then repeat with the child immediately exec-ing /bin/true (measure total fork+exec time). Compare the two. Now repeat with the parent having only 10 MB allocated. Explain the differences.
- clone() flags experiment: write a C program that uses clone() directly (via syscall(SYS_clone, ...)) to create a new "process" that shares its fd table with the parent (CLONE_FILES). In the child, open a file and write to it. Back in the parent, verify the new fd is visible in the parent's own /proc/<pid>/fd. Then fork() normally and verify that an fd opened in the child is NOT visible in the parent (the fd tables are independent copies).
- execve stack inspection: write a program that, just after main() starts, walks the stack upward from the envp array, past its terminating NULL, to find and print the auxiliary vector entries. Verify against LD_SHOW_AUXV=1.
- setuid security audit: on a test system, find all setuid executables (find / -perm -4000 -type f 2>/dev/null). For each, check if it is dynamically linked and whether any directory in its DT_RPATH/DT_RUNPATH is writable. Use readelf -d <binary> | grep -E 'RPATH|RUNPATH' and verify directory permissions.
- pidfd lifecycle: write a C program using clone3 with CLONE_PIDFD. Fork 10 children, each sleeping for a random 1–5 seconds. Use poll() on all 10 pidfds to wait for whichever finishes first, then use waitid(P_PIDFD, ...) on that one. Compare this approach to a SIGCHLD handler + waitpid(-1, ...) loop.
References
- kernel/fork.c: copy_process(), dup_task_struct(), copy_mm(), dup_fd()
- fs/exec.c: do_execveat_common(), setup_new_exec(), setup_arg_pages()
- fs/binfmt_elf.c: load_elf_binary(), ELF loading internals
- arch/x86/kernel/process.c: copy_thread() (x86-64 fork return value setup)
- Kerrisk, The Linux Programming Interface, Chapters 24 (fork), 25 (process termination), 27 (exec)
- Stevens & Rago, Advanced Programming in the UNIX Environment, 3rd ed., Chapter 8
- man 2 clone, man 2 clone3, man 2 execve, man 2 vfork
- LWN: "The clone3() system call" (2019), "pidfds and a safer kill()" (2019)
- glibc source: sysdeps/unix/sysv/linux/spawni.c (posix_spawn implementation)