02 — fork(), exec(), and the Unix Process Creation Model

Technical Overview

Unix process creation is built on two orthogonal system calls: fork() duplicates the calling process, and exec() replaces the calling process's program image with a new one. Together they form the "fork-exec" idiom that every shell, service manager, and container runtime uses. On Linux, fork(), vfork(), and thread creation are all built on a single primitive, clone(), which makes the degree of sharing between parent and child configurable; exec remains a separate system call, execve(). Understanding these mechanics is essential for anyone writing daemons, container runtimes, language runtimes, or anything else that manages child processes.

Prerequisites

01-process-concept.md: task_struct, process address space, PID/TGID distinction
Virtual memory concepts: page tables, copy-on-write (CoW), VMAs
ELF binary format basics (load segments, entry point, interpreter path)
File descriptor semantics (reference counting, O_CLOEXEC)

Core Content

The fork/exec/clone Relationship

User-space APIs                       Kernel entry point
─────────────────────────────────     ──────────────────────────────────────
fork()        ────────────────────►   clone(SIGCHLD, 0)
                                             │
vfork()       ────────────────────►   clone(CLONE_VFORK|CLONE_VM|SIGCHLD, 0)
                                             │
pthread_create() ─────────────────►   clone(CLONE_VM|CLONE_FS|CLONE_FILES│
                                            CLONE_SIGHAND|CLONE_THREAD│    │
                                            CLONE_SETTLS|..., stack)        │
                                             │                              │
                                             ▼                              │
                                       kernel/fork.c:                       │
                                       sys_clone() / clone3()               │
                                             │                              │
                                             ▼                              │
                                       copy_process()  ◄────────────────────┘
                                             │
                              ┌──────────────┴──────────────────────────────┐
                              │  dup_task_struct()   — new stack + PCB       │
                              │  copy_creds()        — credentials           │
                              │  copy_mm()           — address space (CoW)   │
                              │  dup_fd()            — file descriptor table  │
                              │  copy_fs()           — fs context (cwd/root)  │
                              │  copy_sighand()      — signal handlers        │
                              │  copy_signal()       — signal state           │
                              │  copy_thread()       — arch register state    │
                              │  alloc_pid()         — PID allocation         │
                              └─────────────────────────────────────────────┘

All three user-space calls ultimately invoke copy_process(). The clone_flags bitmask decides what is copied (independent) versus shared (pointer shared between parent and child):

Flag	Effect when set	Effect when absent
`CLONE_VM`	Share `mm_struct` (same address space)	Copy `mm_struct` with CoW page tables
`CLONE_FILES`	Share `files_struct`	Duplicate fd table
`CLONE_FS`	Share `fs_struct` (cwd, root, umask)	Copy fs context
`CLONE_SIGHAND`	Share `sighand_struct` (signal handlers)	Copy handlers
`CLONE_THREAD`	Place in same thread group (same `tgid`)	New thread group (child gets its own `tgid`)
`CLONE_NEWPID`	Create new PID namespace	Inherit parent's
`CLONE_NEWNET`	Create new network namespace	Inherit parent's
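
The effect of a single flag can be observed from user space. The following is a minimal sketch, assuming glibc's clone() wrapper and a downward-growing stack; the helper names child_fn and counter_after_clone are hypothetical. The child writes to a global, and the parent reports what it sees afterwards, once with CLONE_VM and once without.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_counter = 0;

/* Child entry point: modify the global, then exit with status 0. */
static int child_fn(void *arg) {
    (void)arg;
    shared_counter = 42;
    return 0;
}

/* Run child_fn in a clone()d child created with extra_flags | SIGCHLD.
 * Returns the value of shared_counter seen by the parent afterwards,
 * or -1 on error. */
int counter_after_clone(int extra_flags) {
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    shared_counter = 0;
    /* Assumption: the stack grows down, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + stack_size, extra_flags | SIGCHLD, NULL);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    free(stack);            /* safe: the child has exited by now */
    if (reaped != pid)
        return -1;
    return shared_counter;
}
```

With CLONE_VM the child's store lands in the parent's address space; without it the child writes to its own copy-on-write copy and the parent's value is unchanged.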

`fork()` Internals: copy_process() in Detail

Step 1 — dup_task_struct()

Allocates a new task_struct and a new kernel stack (typically 8 or 16 KB depending on architecture and config; 16 KB on x86-64). On older kernels the thread_info at the base of the stack is initialized to point at the new task_struct; kernels built with CONFIG_THREAD_INFO_IN_TASK keep thread_info inside task_struct instead. The contents of task_struct start as a shallow copy of the parent's; subsequent steps selectively deep-copy or re-share fields.

Step 2 — copy_mm() (fork case: no CLONE_VM)

The parent's mm_struct is duplicated: a new mm_struct is allocated, and the VMA list is cloned. However, the actual physical pages are not copied. Instead, copy_page_range() walks the parent's page tables and marks every writable page read-only in both parent and child page tables. This is copy-on-write (CoW).

When either process writes to a CoW page, the MMU raises a page fault. The kernel's do_wp_page() allocates a new physical page, copies the content, updates the faulting process's page table entry to point to the new page and mark it writable, and the write proceeds. Unmodified pages continue to be shared indefinitely.

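Result: fork() never copies data pages up front, but it is not free. Its cost scales with the number of VMAs plus the number of populated page-table entries that copy_page_range() has to duplicate. A process with 100 VMAs and 2 GB of resident 4 KB pages has about 512 Ki PTEs (roughly 4 MB of page tables), so fork typically takes on the order of milliseconds: far less than copying 2 GB would, but noticeably more than for a small process.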
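
Copy-on-write semantics can be checked from user space. The sketch below (the function name parent_byte_after_child_write is hypothetical) maps anonymous memory, forks, lets the child overwrite the first byte, and returns what the parent still sees. Passing MAP_PRIVATE exercises the CoW path; MAP_SHARED is included only as a contrast, since shared mappings are never CoW-split.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* map_flags is MAP_PRIVATE or MAP_SHARED. Returns the byte the parent
 * sees at buf[0] after the child wrote 'C' there, or -1 on error. */
int parent_byte_after_child_write(int map_flags) {
    const size_t len = 16 * 4096;
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              map_flags | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 'P', len);          /* touch every page so it is resident */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        buf[0] = 'C';               /* private mapping: write fault, child gets its own page */
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int seen = buf[0];
    munmap(buf, len);
    return ok ? seen : -1;
}
```

For a private mapping the parent keeps seeing 'P'; for a shared mapping it sees the child's 'C'.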

Step 3 — dup_fd() (fork case: no CLONE_FILES)

The file descriptor table is copied. Open file descriptions (the kernel-side objects pointed to by struct file *) are not duplicated — both parent and child hold references to the same open file descriptions. Consequently they share the file offset (f_pos). This is the POSIX-specified behavior: writes from parent and child to the same fd are not coordinated and can interleave.

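File descriptors with the close-on-exec flag set (FD_CLOEXEC, usually requested via O_CLOEXEC at open time) are inherited across fork() like any other descriptor; they are closed only when the process later calls execve().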
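
The shared file offset is easy to demonstrate. In this sketch (the function name and the temporary-file path are illustrative choices, not part of any API), the child writes five bytes through the inherited descriptor and the parent then queries its own offset on the same descriptor number.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the parent's file offset after the child wrote 5 bytes via the
 * inherited fd, or -1 on error. */
long offset_after_child_write(void) {
    char path[] = "/tmp/fork-offset-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                   /* file lives on until the last fd closes */

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        ssize_t n = write(fd, "child", 5);
        _exit(n == 5 ? 0 : 1);
    }

    int status;
    int ok = waitpid(pid, &status, 0) == pid &&
             WIFEXITED(status) && WEXITSTATUS(status) == 0;
    /* Both descriptors refer to one open file description, so the
     * child's write moved the offset the parent observes. */
    off_t off = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return ok ? (long)off : -1;
}
```

The parent never wrote anything, yet its offset is 5, because the offset belongs to the shared open file description rather than to the descriptor table entry.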

Step 4 — copy_sighand() and copy_signal()

The signal handler table is copied, and the child inherits the parent's signal mask. Pending signals are not inherited: the child starts with an empty sigpending queue, for both thread-directed and process-directed signals.

Step 5 — copy_thread() (arch-specific)

On x86-64 (arch/x86/kernel/process.c:copy_thread()), the child's saved register state is set up so that when the scheduler first runs the child, it returns to user space from fork() with return value 0. In the parent, the caller of copy_process() (kernel_clone(), named _do_fork() on older kernels) returns the child's PID. This is the classic fork return-value bifurcation: same code, two different return values.
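
Seen from user space, the bifurcation looks like this. The sketch is illustrative only (fork_return_values_ok is a hypothetical name): the child checks that it got 0 and that its parent is the original process, and the parent checks that the PID returned by fork() is the one it later reaps.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if both sides of fork() saw the expected return values,
 * 0 if not, and -1 if fork() itself failed. */
int fork_return_values_ok(void) {
    pid_t parent = getpid();
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: fork() returned 0; our parent should be the forking process. */
        _exit(getppid() == parent ? 0 : 1);
    }
    /* Parent: fork() returned the child's PID. */
    int status;
    pid_t reaped = waitpid(pid, &status, 0);
    return reaped == pid && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```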

`vfork()`: Sharing the Address Space

vfork() is clone(CLONE_VFORK | CLONE_VM | SIGCHLD). The child shares the parent's mm_struct entirely, with no page-table duplication. The calling thread is put to sleep on a vfork_done completion (a killable wait in current kernels, so only fatal signals interrupt it) until the child calls execve() or _exit(), at which point it is woken up. Other threads in the parent keep running.

The performance motivation: if you immediately call exec() after fork(), all the CoW page-table setup work is wasted because exec() discards the entire address space anyway. vfork() avoids that cost.

Safety constraints: between vfork() and exec()/_exit(), the child must not:
- write to local variables in the function that called vfork()
- call any function that might allocate heap memory or call exit() (which calls atexit() handlers and flushes stdio buffers shared with the parent)
- return from the function that called vfork()

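In practice, CoW makes plain fork() cheap enough for small processes that vfork's optimization matters mainly for processes with large address spaces. posix_spawn() in glibc on Linux gets the same benefit internally: recent versions use clone(CLONE_VM | CLONE_VFORK) with a separate child stack rather than calling vfork() directly.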
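
For most callers, posix_spawn() is the safer way to get vfork-like performance without obeying the constraints above by hand. A minimal sketch, assuming /bin/sh exists on the system; spawn_shell is a hypothetical helper name.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Run `/bin/sh -c cmd` and return its exit status, or -1 on failure. */
int spawn_shell(const char *cmd) {
    char *const argv[] = { "sh", "-c", (char *)cmd, NULL };
    pid_t pid;

    /* posix_spawn() returns 0 on success or an errno value on failure;
     * it does not return -1 and set errno. */
    int rc = posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ);
    if (rc != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The two NULL arguments are the file-actions and attribute objects; real callers would use posix_spawn_file_actions_adddup2() and friends there for the fd plumbing that a fork-based child does by hand.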

`clone3()`: The Modern Interface

Linux 5.3 introduced clone3(), which takes a struct clone_args rather than cramming everything into a flags bitmask and loose arguments. Notable fields:
- pidfd: with CLONE_PIDFD, the kernel stores a PID file descriptor (pidfd) here in addition to returning the numeric PID, avoiding PID-reuse races in code that signals or waits on children.
- set_tid (with set_tid_size, added in a later release): allows specifying the desired PID (for checkpoint/restore).
- exit_signal: specifies which signal the parent receives when the child exits; clone() encodes the same value in the low byte of its flags argument.

`execve()` Mechanics

execve(path, argv, envp) replaces the current process image entirely. The key kernel path on Linux:

sys_execve()
  └─ do_execveat_common()
       ├─ alloc_bprm()          - allocate linux_binprm and a temporary mm for the new stack
       ├─ open the binary, check permissions (open_exec() / do_open_execat(), depending on kernel version)
       ├─ copy_strings()        - copy argv/envp strings into the new stack
       ├─ search_binary_handler()
       │    └─ load_elf_binary()   - for ELF files
       │         ├─ elf_check_arch()
       │         ├─ if PT_INTERP present: open the dynamic linker (ld.so)
       │         ├─ begin_new_exec()   (flush_old_exec() on older kernels)
       │         │    - point of no return: release the old mm, install the new one
       │         ├─ setup_new_exec()
       │         ├─ setup_arg_pages()  - finalize stack placement and permissions
       │         ├─ map PT_LOAD segments, then ld.so if present, into the new address space
       │         ├─ create_elf_tables() - write argc, argv, envp, auxv onto the stack
       │         └─ set entry point (ld.so entry, or ELF entry if static)
       └─ execve returns to user space at the new entry point

The auxiliary vector (auxv): the kernel passes metadata to the process on the stack, immediately after the NULL that terminates envp (that is, at higher addresses). Key entries:
- AT_PHDR: address of the ELF program header table (ld.so uses this to find DYNAMIC)
- AT_ENTRY: program entry point (so ld.so knows where to jump after relocation)
- AT_RANDOM: address of 16 random bytes (used by glibc to seed the stack canary)
- AT_SYSINFO_EHDR: address of the vDSO page

Read auxv from a running process:

cat /proc/self/auxv | od -t x8
# or
LD_SHOW_AUXV=1 /bin/true
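
From inside a C program, glibc's getauxval() is the simplest way to read the same entries. A small sketch (print_auxv_summary is a hypothetical name; AT_PAGESZ is an additional real entry, used here only as a sanity check against sysconf()):

```c
#include <stdio.h>
#include <sys/auxv.h>
#include <unistd.h>

/* Print a few auxv entries. Returns 1 if the values look sane
 * (non-null pointers, AT_PAGESZ agreeing with sysconf), else 0. */
int print_auxv_summary(void) {
    unsigned long phdr   = getauxval(AT_PHDR);
    unsigned long entry  = getauxval(AT_ENTRY);
    unsigned long random = getauxval(AT_RANDOM);        /* pointer to 16 bytes */
    unsigned long vdso   = getauxval(AT_SYSINFO_EHDR);  /* may be 0 if no vDSO */
    unsigned long pagesz = getauxval(AT_PAGESZ);

    printf("AT_PHDR         = %#lx\n", phdr);
    printf("AT_ENTRY        = %#lx\n", entry);
    printf("AT_RANDOM       = %#lx\n", random);
    printf("AT_SYSINFO_EHDR = %#lx\n", vdso);
    printf("AT_PAGESZ       = %lu\n", pagesz);

    return phdr != 0 && entry != 0 && random != 0 &&
           (long)pagesz == sysconf(_SC_PAGESIZE);
}
```

getauxval() returns 0 for entries that are absent, so a zero value is ambiguous for entries whose legitimate value could be 0.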

Stack layout at process entry (x86-64):

  Low address (top of stack; %rsp points here)
  ┌──────────────────────────────┐
  │  argc                        │ ← %rsp at _start
  ├──────────────────────────────┤
  │  argv[0] pointer             │
  │  argv[1] pointer             │
  │  ...                         │
  │  NULL                        │
  ├──────────────────────────────┤
  │  envp[0] pointer             │
  │  ...                         │
  │  NULL                        │
  ├──────────────────────────────┤
  │  auxv[0] {type, value}       │
  │  auxv[1] {type, value}       │
  │  ...                         │
  │  {AT_NULL, 0}                │
  ├──────────────────────────────┤
  │  string data for argv/envp   │
  └──────────────────────────────┘
  High address (stack base)

exec() and Credential Changes (setuid)

When execve() loads a binary with the setuid bit set (S_ISUID):

effective UID ← file owner UID

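The kernel calls prepare_binprm(), which checks S_ISUID/S_ISGID and computes the new credentials (via bprm_fill_uid() in many kernel versions; the helper has been reorganized over time). The setuid bit is ignored on nosuid mounts and when no_new_privs is set. The new credentials are committed with commit_creds() once exec has passed the point of no return, while the binary is being loaded and before control returns to user space; the saved set-user-ID is set to the new effective UID at the same time.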

Capability rules on exec:
- If the binary has file capabilities (set with setcap(8) and stored in the security.capability extended attribute), execve() can add capabilities to the permitted set that the process did not previously have.
- PR_SET_NO_NEW_PRIVS (set by systemd for sandboxed services and by seccomp-bpf setups) disables setuid and file capabilities for the process and all its descendants, which makes it a powerful sandboxing primitive.
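
Setting the flag is a single prctl() call. A minimal sketch (enable_no_new_privs is a hypothetical wrapper name); note that the flag is one-way and cannot be cleared once set, so this belongs in the child just before exec or at the start of a sandboxed service.

```c
#include <sys/prctl.h>

/* Set no_new_privs for the calling thread and read it back.
 * Returns 1 if the flag is now set, -1 on error. */
int enable_no_new_privs(void) {
    /* Unused prctl arguments must be 0 for this option. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```

The flag is inherited across fork() and preserved across execve(), which is what makes it usable as a guarantee about all descendants.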

execve() and Filesystem Namespaces

execve operates within the current mount namespace. The path argument is resolved using the process's fs_struct (root and cwd). Container runtimes use pivot_root() or chroot() followed by execve() inside the container's mount namespace to ensure the new program sees the container filesystem, not the host.

O_CLOEXEC and exec: any fd opened without O_CLOEXEC remains open across execve. This is a common security bug — long-lived file descriptors (pipes to privileged parents, sockets, memfd secrets) accidentally inherited by setuid children. Mitigation: use open(path, O_RDONLY | O_CLOEXEC) everywhere, and call closefrom(3) or iterate /proc/self/fd before execve in security-sensitive code.

Historical Context

The fork/exec separation dates from the original UNIX (Bell Labs, early 1970s). It looks like a deliberate design choice in hindsight, but Dennis Ritchie's account in "The Evolution of the Unix Time-sharing System" describes it as largely a matter of convenience: the early shell already loaded each command over itself in an exec-like way, and fork() could be added with very little new code.

The Multics OS (1965) used a single "create process and load program" primitive. Plan 9 and many microkernel designs revisited this, arguing that fork/exec wastes work. Linux's clone() (added in 1.x) represents a third approach: a fully parametric primitive that subsumes fork, vfork, and thread creation.

posix_spawn() (POSIX.1-2001) was standardized as an alternative for resource-constrained systems where fork/exec overhead is unacceptable or fork() cannot be implemented efficiently (embedded, real-time). glibc on Linux implements it with a vfork-style clone(CLONE_VM | CLONE_VFORK) followed by execve().

Production Examples

Shell fork-exec cycle:

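// Simplified version of what a shell does for each command.
// Assumes pipe_fd[] (from pipe()), argv, and status are set up by the caller.
pid_t pid = fork();
if (pid < 0) {
    perror("fork");                 // e.g. EAGAIN or ENOMEM
} else if (pid == 0) {
    // child: dup2() closes the old stdin itself, so no close() is needed first
    dup2(pipe_fd[0], STDIN_FILENO); // plumbing
    close(pipe_fd[0]);              // drop the now-redundant pipe ends
    close(pipe_fd[1]);
    execvp(argv[0], argv);          // replace image; returns only on failure
    _exit(127);                     // exec failed
} else {
    // parent
    waitpid(pid, &status, 0);
}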

Inspecting clone flags (strace output):

strace -e trace=clone,mmap,munmap -f ./fork_test 2>&1 | head -40
# clone() flags reveal exactly which resources are shared/copied;
# add clone3 to the trace list if your strace and libc support it

pidfd for race-free process management:

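// clone3 with CLONE_PIDFD (needs <linux/sched.h>, <sys/syscall.h>, <sys/wait.h>)
int pidfd = -1;
struct clone_args args = {
    .flags = CLONE_PIDFD,
    .pidfd = (uint64_t)(uintptr_t)&pidfd, // kernel writes the pidfd here
    .exit_signal = SIGCHLD,
};
pid_t pid = syscall(SYS_clone3, &args, sizeof(args));
if (pid == 0) {
    _exit(0);                             // child: clone3 returns 0 here, like fork()
}
// parent: pidfd refers to this exact child, even if its PID is later reused
siginfo_t info;
waitid(P_PIDFD, pidfd, &info, WEXITED);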

Debugging Notes

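Fork bomb recovery: a per-user ulimit -u <N> (RLIMIT_NPROC) and the cgroup pids controller (pids.max) are the pre-emptive defenses. Once the limit or the PID space is exhausted, an existing shell can no longer fork to run external commands, so recovery has to rely on shell builtins or exec. kill is a builtin in bash, so kill -9 <PID> or kill -STOP -- -<PGID> works from an existing session. kill -9 $(pgrep -u baduser) does not, because both the command substitution and pgrep itself require a fork; exec pkill -KILL -u baduser avoids the fork by replacing the shell with pkill.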
Tracing exec: strace -e execve -f -p PID shows every execve call in a process tree. Useful for finding unexpected interpreter invocations.
File descriptor leaks across exec: ls -la /proc/PID/fd shows inherited fds. Compare the set before and after exec with /proc/PID/fdinfo/N (shows O_CLOEXEC flag as flags: 02000000 or similar).
CoW faults: perf page-faults counter will spike after fork if the child or parent modifies many pages. Use perf stat -e page-faults,minor-faults ./prog to quantify.
vfork deadlock: if the child calls a function that uses malloc before exec, it can deadlock because the malloc lock in the shared address space may already be held by another parent thread.

Security Implications

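setuid exec and privilege escalation: a setuid program that runs helpers by bare name (execvp(), system()) searches the caller's $PATH, so an attacker-controlled or writable directory early in $PATH allows a PATH hijack. The permission check on the binary itself uses the caller's effective (filesystem) UID, not the real UID, and the setuid bits are read from the already-opened file; races arise mainly in user-space code that checks a path and then execs it by name (TOCTOU).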
PR_SET_NO_NEW_PRIVS: once set, neither setuid bits nor file capabilities can grant privileges. Used by Chrome, systemd services, and seccomp-bpf before loading a restrictive filter (a seccomp filter cannot be bypassed by exec-ing a setuid helper if NO_NEW_PRIVS is set).
Inherited file descriptors: if a privileged process forks and the child execs an untrusted binary without closing all fds, the untrusted code may inherit access to /dev/kmem, sockets bound to privileged ports, or open secrets. Use O_CLOEXEC by default and audit with ls /proc/PID/fd.
symlink attacks on /proc/self/exe: a process that re-execs itself using /proc/self/exe can be tricked if an attacker can replace the executable after the original open but before the exec. Use fexecve(fd, argv, envp) (open the binary to an fd first, then exec via that fd) for re-exec security.

Performance Implications

fork() cost is dominated by page-table duplication (proportional to the number of VMAs and populated page-table entries) and TLB invalidation, not by copying data pages. A process with 10,000 VMAs (common in JVMs and Go runtimes due to mmap-heavy allocation) may spend 100+ µs in fork even with CoW, and multi-gigabyte resident sets push that into milliseconds.
exec() cost involves at minimum: file open + permission check, ELF header read, 1–4 mmap() calls for LOAD segments, ld.so being loaded and initialized. For a dynamically linked binary with 50 shared libraries, the dynamic linker's relocation work adds tens of milliseconds of user-space startup time.
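Reducing fork overhead in JVMs: JVM-based tools that spawn many processes (e.g., Clojure build tools) pay for fork in proportion to their large resident heaps, and every page dirtied afterwards in parent or child triggers a CoW copy. For native heaps, allocator choice affects how many pages get dirtied (jemalloc tends to keep pages cleaner than glibc's allocator, though this is workload-dependent). Transparent huge pages can worsen CoW cost on kernels that copy the whole huge page on a write fault (one 2 MB THP fault = one 2 MB copy); some newer kernels split the huge page and copy a single 4 KB page instead.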
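posix_spawn vs fork+exec: on Linux, posix_spawn is not significantly faster than vfork+exec because glibc implements it in essentially the same way (clone with CLONE_VM | CLONE_VFORK). It can be much faster than plain fork+exec from a process with a large address space, and on systems without an MMU, where fork() cannot be implemented efficiently, the difference is dramatic.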

Failure Modes

Scenario	Error	Cause
Fork returns EAGAIN	`EAGAIN`	RLIMIT_NPROC reached for this user
Fork returns ENOMEM	`ENOMEM`	Cannot allocate `task_struct` or duplicate page tables
exec returns ENOEXEC	`ENOEXEC`	Binary not recognized by any registered binfmt handler
exec returns ETXTBSY	`ETXTBSY`	Binary file is currently open for writing
vfork child calls exit()	Corrupted stdio/atexit state in parent	Child called `exit()` (not `_exit()`), running atexit handlers and flushing stdio buffers shared with the parent
fd leak across exec	Unintended access	`O_CLOEXEC` not set; privileged fds inherited

Modern Usage

Container runtimes (runc, crun) use clone() with CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC to place the container's init process into a fully isolated namespace set, then execve() the container entrypoint.

Language runtimes: Go's syscall.ForkExec uses a vfork-style clone on Linux and carefully avoids any goroutine scheduler interaction between fork and exec because Go's runtime state is deeply multi-threaded. CPython exposes os.fork() but warns that forking a multi-threaded Python process is unsafe (locks held by other threads stay locked forever in the child); newer Python versions recommend the multiprocessing spawn or forkserver start methods instead.

fexecve(3): execute a file via an open file descriptor, bypassing path resolution entirely. Used by container runtimes to exec a binary that is inside a memfd or an already-open file, without race conditions on the filesystem path.
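
A minimal sketch of the fd-based pattern, assuming the target is an ELF binary (run_via_fd is a hypothetical helper name). The file is opened once, and the child executes exactly that open file regardless of what happens to the path afterwards.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Fork and execute the file at `path` through a descriptor instead of
 * by name. Returns the child's exit status, or -1 on error. */
int run_via_fd(const char *path, char *const argv[]) {
    /* O_CLOEXEC keeps the fd from leaking into the new program. This is
     * fine for ELF binaries; for #! scripts the interpreter would need
     * to reopen the fd, so close-on-exec makes fexecve() fail there. */
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        fexecve(fd, argv, environ);   /* returns only on failure */
        _exit(127);
    }

    close(fd);
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Any permission or integrity checks done on fd between open() and fexecve() apply to the file that is actually executed, which is the property a path-based execve() cannot give.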

Future Directions

clone3 with CLONE_INTO_CGROUP: allows placing the new task directly into a specified cgroup at creation time, eliminating the racy "fork, then move to cgroup" pattern that container runtimes historically used.
io_uring and process creation: proposals to allow clone3/execve to be submitted via io_uring for async process spawning without blocking the caller.
Checkpoint/Restore in Userspace (CRIU): CRIU restores processes using clone3 with set_tid, set_tid_size, and custom PID namespace mappings to recreate exact PID trees across migration.
Rust-based fork safety: the Rust standard library deliberately does not expose fork() in safe code due to multi-threading hazards. Discussion ongoing about whether a safe prefork/postfork hook mechanism would make it feasible.

Exercises

Measuring CoW cost: write a C program that allocates 1 GB via mmap(MAP_ANONYMOUS), fills it with data, then forks. Time the fork with clock_gettime. Then repeat with the child immediately exec-ing /bin/true (measure total fork+exec time). Compare the two. Now repeat with the parent having only 10 MB allocated. Explain the differences.
clone() flags experiment: write a C program that uses clone() directly (via syscall(SYS_clone, ...)) to create a new "process" that shares CLONE_FILES with the parent. In the child, open a file and write to it. Back in the parent, verify the fd is visible in /proc/PPID/fd. Then fork() normally and verify the fd is NOT shared after fork (they are independent copies).
execve stack inspection: write a program that, just after main() starts, walks the stack downward from argv[0] to find and print the auxiliary vector entries. Verify against LD_SHOW_AUXV=1.
setuid security audit: on a test system, find all setuid executables (find / -perm -4000 -type f 2>/dev/null). For each, check whether it is dynamically linked and whether any directory in its DT_RPATH/DT_RUNPATH is writable. Use readelf -d <binary> | grep -E 'RPATH|RUNPATH' and verify directory permissions.
pidfd lifecycle: write a C program using clone3 with CLONE_PIDFD. Fork 10 children, each sleeping for a random 1–5 seconds. Use poll() on all 10 pidfds to wait for whichever finishes first, then use waitid(P_PIDFD, ...) on that one. Compare this approach to a SIGCHLD handler + waitpid(-1, ...) loop.

References

kernel/fork.c — copy_process(), dup_task_struct(), copy_mm(), dup_fd()
fs/exec.c — do_execveat_common(), setup_new_exec(), setup_arg_pages()
fs/binfmt_elf.c — load_elf_binary(), ELF loading internals
arch/x86/kernel/process.c — copy_thread() (x86-64 fork return value setup)
Kerrisk, The Linux Programming Interface — Chapters 24 (fork), 25 (process termination), 27 (exec)
Stevens & Rago, Advanced Programming in the UNIX Environment, 3rd ed. — Chapter 8
man 2 clone, man 2 clone3, man 2 execve, man 2 vfork
LWN: "The clone3() system call" (2019), "pidfds and a safer kill()" (2019)
glibc source: sysdeps/unix/sysv/linux/spawni.c (posix_spawn implementation)