03 — Process Lifecycle
Technical Overview
Every process follows a well-defined lifecycle: it is created, competes for CPU time,
blocks on resources, resumes, and eventually terminates. Understanding every phase of
this journey, including the subtleties of zombie processes, orphan adoption, voluntary
yields, daemon creation patterns, and resource limits, is the foundation of reliable
systems programming. This file traces the full arc from fork() to the final cleanup of
the last kernel data structure, and covers the administrative mechanisms that
constrain and monitor processes throughout their life.
Prerequisites
- 01-process-concept.md: task_struct, process state machine, /proc/PID/
- 02-fork-and-exec.md: how processes are created
- Basic C signal handling (SIGCHLD, wait() family)
- Understanding of file descriptors and reference counting
Core Content
The Complete Lifecycle Arc
               fork() / clone()
                       │
                       ▼
               ┌───────────────┐
               │    CREATED    │  task_struct allocated, not yet runnable
               └───────┬───────┘
                       │  copy_process() completes, wake_up_new_task()
                       ▼
               ┌───────────────┐
  ┌───────────►│     READY     │  TASK_RUNNING on run queue, waiting for CPU
  │            └───────┬───────┘
  │                    │  scheduler picks task
  │                    ▼
  │            ┌───────────────┐
  │            │    RUNNING    │  TASK_RUNNING, executing on CPU
  │            └───────┬───────┘
  │                    │  I/O, lock, sleep syscall
  │                    ▼
  │            ┌───────────────┐
  │            │    BLOCKED    │  TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
  │            └───────┬───────┘
  │                    │  event occurs, wake_up_process()
  └────────────────────┘  (back to READY)

                (from RUNNING)
                       │  exit() / exit_group() / fatal signal
                       ▼
               ┌───────────────┐
               │    ZOMBIE     │  EXIT_ZOMBIE: PCB kept, mm freed
               └───────┬───────┘
                       │  parent calls wait() / waitpid()
                       ▼
               ┌───────────────┐
               │     DEAD      │  EXIT_DEAD: task_struct freed, PID recycled
               └───────────────┘
Voluntary Yield vs. Involuntary Preemption
Voluntary yield: the running process calls a syscall that puts itself to sleep.
- sched_yield(2): moves the caller to the end of the run queue for its priority level. It is only a hint: the scheduler may immediately reschedule the caller if it is the only runnable task. Rarely the right tool; usually a sign of a busy-wait that should use a proper synchronization primitive.
- nanosleep(2) / clock_nanosleep(2): sleeps for at least the requested duration. The task enters TASK_INTERRUPTIBLE. A signal can wake it early; the remaining time is returned in the rem argument.
- Blocking syscalls (read, write, recv, accept, mutex futex): all put the task into TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether the wait may be interrupted by a signal.
Involuntary preemption: the scheduler forcibly removes the CPU from a running task.
- Triggered by a timer interrupt (the scheduler tick, or an hrtimer in tickless mode).
- The scheduler sets TIF_NEED_RESCHED in the task's thread flags.
- Actual preemption happens at the next safe preemption point:
  - On return from an interrupt to user space (always safe)
  - On return from a syscall to user space
  - In the kernel, at an explicit preempt_enable() or on return from an interrupt, if CONFIG_PREEMPT is enabled
With CONFIG_PREEMPT_RT, nearly all spinlocks become sleeping locks and the kernel is
fully preemptible, allowing hard real-time latency bounds.
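The distinction between yielding and being preempted is visible from user space through the context-switch counters. A minimal sketch, assuming Linux and glibc: the helper name `voluntary_switches_during_sleep` is hypothetical, and it uses `getrusage(2)`, whose `ru_nvcsw` field counts voluntary switches (the same value that `/proc/PID/status` reports as `voluntary_ctxt_switches`).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/resource.h>
#include <time.h>

/* Hypothetical helper: returns how many voluntary context switches the
 * calling process accumulated while sleeping for `ms` milliseconds,
 * or -1 on error. */
long voluntary_switches_during_sleep(long ms)
{
    struct rusage before, after;
    struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };

    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;

    /* Blocking sleep: the task gives up the CPU voluntarily.
     * If a signal interrupts the sleep, continue with the remainder. */
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }

    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_nvcsw - before.ru_nvcsw;
}
```

On a typical Linux system a call such as `voluntary_switches_during_sleep(5)` is expected to return at least 1, while a busy loop of the same duration would mostly show up in `ru_nivcsw` (involuntary switches) instead.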
Process Waiting: wait4() and waitpid() Variants
When a child terminates it becomes a zombie. The parent must reap the zombie by calling
a wait family syscall:
waitpid(pid, &status, options) // wait for specific PID or -1 for any child
wait4(pid, &status, options, &rusage) // same + resource usage
waitid(idtype, id, &siginfo, options) // POSIX, most flexible
Key options flags:
| Flag | Meaning |
|---|---|
| WNOHANG | Return immediately if no child has exited yet (non-blocking) |
| WUNTRACED | Also return for children stopped by a signal (SIGSTOP/SIGTSTP) |
| WCONTINUED | Also return for children resumed by SIGCONT |
| WEXITED | (waitid) wait for children that have exited |
| WNOWAIT | (waitid) leave the child in waitable state (peek, do not reap) |
Status decoding macros:
WIFEXITED(status) // true if child exited normally via exit()
WEXITSTATUS(status) // exit code (low 8 bits) if WIFEXITED
WIFSIGNALED(status) // true if killed by signal
WTERMSIG(status) // signal number if WIFSIGNALED
WIFSTOPPED(status) // true if stopped (WUNTRACED)
WSTOPSIG(status) // signal that stopped the child
WIFCONTINUED(status) // true if resumed by SIGCONT
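To see the decoding macros at work, the following sketch forks a child, reaps it with `waitpid()`, and classifies the result. The function `run_and_decode` and its return convention (exit code, or the negated signal number) are assumptions made for illustration only.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child that either exits with `code`
 * (when sig == 0) or kills itself with signal `sig`, then reaps it.
 * Returns the exit code (0..255) for a normal exit, -signum for a
 * death by signal, or -1000 on error. */
int run_and_decode(int code, int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1000;
    if (pid == 0) {
        if (sig != 0)
            raise(sig);          /* e.g. SIGKILL: does not return */
        _exit(code);             /* only the low 8 bits reach the parent */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1000;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);
    return -1000;
}
```

For example, `run_and_decode(300, 0)` yields 44, because only the low 8 bits of the exit code (300 & 0xff) are passed to the parent.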
Zombie Processes: Lifecycle of a Dead But Not Reaped Task
When a process calls exit() (or receives a fatal signal), the kernel runs
do_exit() in kernel/exit.c. The sequence, simplified (the exact order varies between kernel versions):
do_exit()
  │
  ├── exit_signals()           set PF_EXITING; hand shared pending signals to other threads
  ├── exit_mm()                drop reference to mm_struct; if last reference,
  │                            free all VMAs and page tables (mm freed here)
  ├── exit_files()             close all open file descriptors (fput each)
  ├── exit_fs()                release fs_struct (cwd, root)
  ├── exit_task_namespaces()   release namespace references
  ├── taskstats_exit()         per-task accounting (taskstats)
  ├── exit_notify()            set exit_state to EXIT_ZOMBIE; send SIGCHLD to parent
  └── do_task_dead()           final schedule(); never returns, task leaves the CPU
At this point the task_struct still exists (the zombie) but:
- The mm_struct reference is dropped (no address space is left; ps reports a VSZ and RSS of 0)
- All file descriptors are closed
- The zombie still occupies a PID and a task_struct (several KB of kernel memory)
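The zombie state can be observed directly. The sketch below, which assumes a Linux system with /proc mounted, forks a child that exits at once, waits for the exit with `WNOWAIT` so the child is not reaped, and then reads the state character from `/proc/PID/stat`. The helper names `proc_state` and `state_of_unreaped_child` are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reads the state character from /proc/<pid>/stat, or '?' on error.
 * The state is the first field after the closing ')' of the comm. */
static char proc_state(pid_t pid)
{
    char path[64], buf[512];
    snprintf(path, sizeof path, "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return '?';
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');
    return (p && p[1] == ' ' && p[2] != '\0') ? p[2] : '?';
}

/* Hypothetical helper: returns the state of an exited, unreaped child. */
char state_of_unreaped_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return '?';
    if (pid == 0)
        _exit(0);

    siginfo_t info;
    /* WNOWAIT: block until the child has exited, but leave it waitable. */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return '?';

    char state = proc_state(pid);   /* the child is a zombie right now */
    waitpid(pid, NULL, 0);          /* now actually reap it */
    return state;
}
```

On Linux the returned character is expected to be `Z`, the same letter that ps prints in its STAT column for zombies.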
Zombie creation sequence:
PARENT                                   CHILD
  │
  │  fork()
  ├───────────────────────────────────────►│
  │                                        │
  │  (parent does other work)              │  exit(0)
  │                                        ▼
  │                              EXIT_ZOMBIE state:
  │                                task_struct retained,
  │                                PCB in parent's children list,
  │                                mm freed, fds closed,
  │                                SIGCHLD sent to parent
  │
  │  waitpid(child_pid, ...)
  ├───────────────────────────────────────►
  │
  │  parent reads exit status    EXIT_DEAD:
  │                                task_struct freed,
  │                                PID recycled
Zombies are harmless as long as they are temporary. They become a problem when:
- The parent never calls wait() (bug)
- The parent is in a tight loop creating and ignoring children
- PID table fills up (more critical on systems with low pid_max)
To prevent zombies: either call waitpid() / wait(), or explicitly set
SIGCHLD to SIG_IGN (on Linux this auto-reaps children without creating zombies),
or use the double-fork technique.
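The SIG_IGN route can be checked with a short experiment. In this sketch (Linux behavior assumed; the function name `children_are_auto_reaped` is hypothetical) the parent ignores SIGCHLD, forks a child, and then calls `waitpid()`: since the kernel reaps the child automatically, the call is expected to fail with ECHILD once the child has terminated.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a child that exits while SIGCHLD
 * is ignored leaves nothing to wait for, 0 otherwise, -1 on error. */
int children_are_auto_reaped(void)
{
    struct sigaction ign = { 0 }, old;
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    if (sigaction(SIGCHLD, &ign, &old) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        sigaction(SIGCHLD, &old, NULL);
        return -1;
    }
    if (pid == 0)
        _exit(0);

    /* With SIGCHLD ignored, waitpid() blocks until the child is gone
     * and then fails with ECHILD: no zombie was left behind. */
    int r = waitpid(pid, NULL, 0);
    int saved_errno = errno;

    sigaction(SIGCHLD, &old, NULL);   /* restore the previous disposition */
    return r == -1 && saved_errno == ECHILD;
}
```

The trade-off is that the exit status is lost, so this approach only suits children whose result the parent does not need.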
Orphan Adoption: init and PR_SET_CHILD_SUBREAPER
If a parent exits before its children, those children become orphans. The kernel reparents them:
- Walk the exiting task's children list
- For each child, look for the nearest ancestor with task_struct->signal->is_child_subreaper set (set via prctl(PR_SET_CHILD_SUBREAPER, 1))
- If no subreaper is found, reparent to PID 1 of the PID namespace (init / systemd)
The per-user systemd instance (systemd --user) registers as a subreaper so that grandchild
processes (daemons that double-fork) are reparented to it rather than to PID 1.
The orphaned child continues running normally — it is not killed. It simply has a
new task_struct->parent pointer.
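Reparenting to a subreaper can be demonstrated in a few lines. In the sketch below (Linux 3.4 or later assumed; the function name `orphan_adopted_by_subreaper` is hypothetical) the caller marks itself as a subreaper, creates a grandchild whose parent exits immediately, and lets the grandchild report its new parent PID through a pipe.

```c
#define _GNU_SOURCE
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the orphaned grandchild was
 * reparented to the caller, 0 if it went elsewhere, -1 on error. */
int orphan_adopted_by_subreaper(void)
{
    int fds[2];
    if (prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0) != 0 || pipe(fds) != 0)
        return -1;

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        pid_t middle = getpid();
        pid_t grandchild = fork();
        if (grandchild == 0) {
            close(fds[0]);
            while (getppid() == middle)   /* wait until the middle process exits */
                usleep(1000);
            pid_t new_parent = getppid();
            if (write(fds[1], &new_parent, sizeof new_parent)
                    != (ssize_t)sizeof new_parent)
                _exit(1);
            _exit(0);
        }
        _exit(0);                         /* orphan the grandchild */
    }

    close(fds[1]);
    waitpid(child, NULL, 0);              /* reap the middle process */

    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);

    int result = -1;
    if (n == (ssize_t)sizeof reported)
        result = (reported == getpid());
    if (result == 1)
        wait(NULL);                       /* the orphan is our child now: reap it */
    prctl(PR_SET_CHILD_SUBREAPER, 0, 0, 0, 0);
    return result;
}
```

Without the `prctl()` call the grandchild would report the PID of init (or of some other subreaper higher up the tree), and the caller would have nothing to reap.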
Double-Fork Technique for Daemons
The traditional Unix daemon creation pattern uses two consecutive forks to fully detach from the terminal and controlling process group:
// Needs <fcntl.h>, <stdlib.h>, <sys/stat.h>, <unistd.h>
// Step 1: fork; the parent exits so the shell prompt returns
pid_t pid = fork();
if (pid < 0) abort();
if (pid > 0) exit(0);           // parent exits
// Step 2: create a new session (detach from the controlling terminal)
if (setsid() < 0) abort();      // new session, new process group, no ctty
// Step 3: fork again; the child is not a session leader, so it can never
// acquire a controlling terminal by accident (and with it, SIGHUP)
pid = fork();
if (pid < 0) abort();
if (pid > 0) _exit(0);          // session leader exits without flushing stdio twice
// Now we are a fully independent daemon (grandchild)
if (chdir("/") < 0) abort();    // don't hold a mount point busy
umask(0);
// Redirect stdio to /dev/null (or a log file); dup2() closes the old targets
int fd = open("/dev/null", O_RDWR);
if (fd < 0) abort();
dup2(fd, STDIN_FILENO);
dup2(fd, STDOUT_FILENO);
dup2(fd, STDERR_FILENO);
if (fd > 2) close(fd);
// daemon code here
The second fork ensures the daemon is not a session leader. Only a session leader
can acquire a controlling terminal by opening a terminal device without O_NOCTTY,
and a controlling terminal could deliver SIGHUP when the terminal session ends.
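The effect of the second fork can be verified by asking the final process whether it leads its session. This is a small sketch, not a full daemonizer; the function name `double_forked_is_session_leader` is hypothetical, and the result travels back to the original process over a pipe.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: performs fork / setsid / fork and lets the final
 * process report whether it is a session leader.
 * Returns 0 (not a leader), 1 (leader), or -1 on error. */
int double_forked_is_session_leader(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        close(fds[0]);
        if (setsid() < 0)                 /* first child becomes session leader */
            _exit(1);
        pid_t pid2 = fork();
        if (pid2 != 0)
            _exit(pid2 < 0);              /* session leader exits */

        /* Grandchild: same session, but its PID differs from the session ID. */
        char leader = (getsid(0) == getpid());
        if (write(fds[1], &leader, 1) != 1)
            _exit(1);
        _exit(0);
    }

    close(fds[1]);
    waitpid(pid, NULL, 0);                /* reap the intermediate process */
    char leader = 0;
    ssize_t n = read(fds[0], &leader, 1);
    close(fds[0]);
    return n == 1 ? leader : -1;
}
```

Dropping the second `fork()` and reporting from the first child instead would make the same check return 1.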
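Modern alternative: under systemd, run the service in the foreground with Type=simple
or, better, Type=notify with sd_notify(3). No double-fork is needed because systemd
manages daemonization. Type=forking exists for legacy daemons that still fork themselves.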
/proc/PID/status: Key Fields
Name: nginx # comm field (truncated to 15 chars)
State: S (sleeping) # state character: R,S,D,Z,T,t,X
Tgid: 1234 # process ID (thread group leader's PID)
Ngid: 0 # NUMA group ID (scheduling hint)
Pid: 1234 # this task's PID (= Tgid for main thread)
PPid: 1 # parent PID
TracerPid: 0 # PID of ptracer (0 = not traced)
Uid: 1000 1000 1000 1000 # real, effective, saved, filesystem UIDs
Gid: 1000 1000 1000 1000
FDSize: 256 # size of fd table (not number of open fds)
Groups: 1000 4 24 27 # supplementary groups
VmPeak: 102400 kB # peak virtual address space size
VmSize: 98304 kB # current virtual address space size
VmLck: 0 kB # locked (mlock) pages
VmPin: 0 kB # pinned pages
VmHWM: 12288 kB # peak resident set size
VmRSS: 8192 kB # current RSS (physical pages)
VmData: 512 kB # data segment size (private data and heap)
VmStk: 136 kB # stack size
VmExe: 256 kB # text (code) size
VmLib: 32768 kB # shared library code size
VmPTE: 72 kB # page table entries size
VmSwap: 0 kB # swapped-out pages
Threads: 4 # number of threads in this process
SigQ: 0/62810 # pending signals / max queue length
SigPnd: 0000000000000000 # bitmask of pending signals (this thread)
ShdPnd: 0000000000000000 # bitmask of pending signals (process-wide)
SigBlk: 0000000000010000 # blocked signals bitmask
SigIgn: 0000000000001000 # ignored signals bitmask
SigCgt: 0000000180014603 # caught signals bitmask
voluntary_ctxt_switches: 1432
nonvoluntary_ctxt_switches: 87
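Because the file is plain `Key: value` text, reading a field programmatically takes only a few lines. The helper below is a sketch under the assumption that only the first integer on the line is wanted; the name `status_field` is hypothetical.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns the first integer value of field `key`
 * (e.g. "Pid", "Threads", "VmRSS") from /proc/self/status, or -1 if the
 * file or the field is missing. */
long status_field(const char *key)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    size_t klen = strlen(key);
    long value = -1;
    while (fgets(line, sizeof line, f)) {
        /* Match "key:" exactly so that "Pid" does not match "PPid". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            if (sscanf(line + klen + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

For instance, `status_field("VmRSS")` returns the resident set size in kB, and in a single-threaded process `status_field("Pid")` equals `status_field("Tgid")`.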
Resource Limits: rlimit
Every process has a set of resource limits managed by struct rlimit {rlim_t rlim_cur, rlim_max}:
| getrlimit constant | Limit | Typical default | Controls |
|---|---|---|---|
| RLIMIT_AS | Address space | unlimited | Maximum virtual address space in bytes |
| RLIMIT_NOFILE | Open file descriptors | 1024 (soft) | Max number of open fds |
| RLIMIT_NPROC | Max processes | ~30000 (scales with RAM) | Max tasks per real UID |
| RLIMIT_CPU | CPU time | unlimited | Seconds of CPU; SIGXCPU at soft, SIGKILL at hard |
| RLIMIT_FSIZE | File size | unlimited | Max file size; SIGXFSZ on violation |
| RLIMIT_MEMLOCK | Locked memory | 64 KB (8 MB since Linux 5.16) | Max bytes locked with mlock() |
| RLIMIT_STACK | Stack size | 8 MB | Stack VMA size; SIGSEGV beyond |
| RLIMIT_CORE | Core dump size | 0 | Max core file size (0 = no core) |
| RLIMIT_NICE | Nice ceiling | 0 | Lowest (best) nice value an unprivileged process can set |
| RLIMIT_RTPRIO | RT priority | 0 | Max real-time scheduling priority |
The soft limit is the value the kernel actually enforces; the hard limit is the ceiling up to which an unprivileged process may raise its soft limit. An unprivileged process may lower its hard limit, but only a process with CAP_SYS_RESOURCE (typically root) can raise it.
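struct rlimit rl;
getrlimit(RLIMIT_NOFILE, &rl);       // read current limits
rl.rlim_cur = 65536;                 // desired soft limit
if (rl.rlim_cur > rl.rlim_max)
    rl.rlim_cur = rl.rlim_max;       // the soft limit may not exceed the hard limit
setrlimit(RLIMIT_NOFILE, &rl);       // set new soft limit
// prlimit() allows setting limits on other processes:
struct rlimit new_rl = rl, old_rl;
prlimit(pid, RLIMIT_NOFILE, &new_rl, &old_rl);  // old_rl receives the previous limits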
systemd enforces limits per service via LimitNOFILE=, LimitNPROC=, etc. in unit
files; it applies them in the forked child before executing the service binary, so the service inherits them.
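Enforcement of the soft limit is easy to observe in isolation. The following sketch lowers RLIMIT_NOFILE in a child process and opens /dev/null until `open()` fails; the function name `hits_emfile_at` and its return convention are hypothetical, and `limit` is assumed to be no larger than the current hard limit.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a child restricted to `limit` file
 * descriptors sees open() fail with EMFILE, 0 if the limit was not
 * enforced as expected, -1 on error. */
int hits_emfile_at(rlim_t limit)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
            _exit(2);
        rl.rlim_cur = limit;              /* lower only the soft limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
            _exit(2);
        for (;;) {
            int fd = open("/dev/null", O_RDONLY);
            if (fd < 0)
                _exit(errno == EMFILE ? 0 : 1);
            if ((rlim_t)fd >= limit)      /* an fd at or above the limit: not enforced */
                _exit(1);
        }
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

Running the experiment in a child keeps the caller's own limits untouched, which mirrors how a supervisor would set limits between `fork()` and `execve()`.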
Historical Context
The zombie state dates to original UNIX: a dead process needed to retain its exit status until the parent read it, since the status could not be passed through signals alone (signal delivery is not reliable for data). The term "zombie" was coined in the UNIX community by the late 1970s.
Resource limits (rlimit) were introduced in BSD UNIX (4.2BSD, 1983) and later standardized
in POSIX.1. The prlimit() syscall (Linux 2.6.36, 2010) extended the interface to allow
a process to set limits on other processes, which is useful for container runtimes that
configure resource limits before exec'ing the container init.
The daemon double-fork pattern is documented in Stevens' Advanced Programming in the UNIX Environment (1992) and remains the standard approach for traditional Unix daemons, though systemd (2010) largely supersedes it in modern Linux deployments.
Production Examples
Monitor for zombie accumulation:
# Count zombies per parent (output: count, then PPID)
ps -eo ppid=,stat= | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head
# Then find the parent:
ps -o pid,comm -p <PPID>
Check resource limits for a running process:
cat /proc/$(pgrep nginx | head -1)/limits
# or
prlimit --pid $(pgrep nginx | head -1)
Increase file descriptor limit for a service at runtime:
# For a running process (as root):
prlimit --nofile=65536:65536 --pid <PID>
# Permanently in systemd service:
# [Service]
# LimitNOFILE=65536
Finding orphaned processes (reparented to init):
ps -eo pid,ppid,comm | awk '$2 == 1 {print}' | grep -v systemd
Debugging Notes
- Zombie root cause: a zombie's parent is shown in the PPid: field of /proc/ZPID/status. Use strace -e wait4 -p PPID to see whether the parent is calling wait at all. Often the bug is a SIGCHLD handler that calls waitpid(specific_pid) instead of waitpid(-1, ...) in a loop; because pending SIGCHLD signals coalesce, such a handler misses children.
- RLIMIT_NOFILE too low: nginx or a Go server printing "too many open files" means the process hit its fd limit. Check cat /proc/PID/limits and ls /proc/PID/fd | wc -l to compare current usage with the limit.
- RLIMIT_CPU behavior: processes exceeding the RLIMIT_CPU soft limit receive SIGXCPU (catchable); at the hard limit they receive SIGKILL. Catching SIGXCPU and continuing will eventually hit the hard limit.
- Daemon startup debugging: a double-forked daemon that fails silently is hard to debug. Use Type=forking in systemd and set StandardError=journal so stderr is captured even from the forked child.
- Core dumps not generated: the soft RLIMIT_CORE is 0 by default on most systems. Raise it with ulimit -c unlimited, and use sysctl kernel.core_pattern to configure the dump path. coredumpctl (systemd) is the modern interface.
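The handler pattern recommended in the first note can be sketched as follows. This is one possible shape rather than a canonical implementation; the names `on_sigchld` and `reap_with_handler` are hypothetical, and the polling loop stands in for whatever work the real program does.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t reaped;

/* Reap every child that has exited: one SIGCHLD may stand for several. */
static void on_sigchld(int sig)
{
    (void)sig;
    int saved_errno = errno;              /* waitpid() may clobber errno */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Hypothetical helper: forks `n` children that exit at once and returns
 * how many the handler reaped, or -1 on error. */
int reap_with_handler(int n)
{
    struct sigaction sa = { 0 }, old;
    sa.sa_handler = on_sigchld;
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, &old) != 0)
        return -1;

    reaped = 0;
    int forked = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        if (pid > 0)
            forked++;
    }

    while (reaped < forked)
        usleep(1000);                     /* stand-in for the program's real work */

    sigaction(SIGCHLD, &old, NULL);
    return reaped;
}
```

If the handler called `waitpid()` only once per signal, a burst of exits could leave zombies behind, since several exits may be reported by a single pending SIGCHLD.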
Security Implications
- Zombie PID squatting: a process that creates many short-lived children and delays reaping can accumulate thousands of zombies, each holding a PID. In environments with a low pid_max, this prevents other processes from spawning, which makes it a denial-of-service vector.
- RLIMIT_NPROC as DoS mitigation: setting a per-user process limit contains fork bombs from users or compromised services. Typically done in /etc/security/limits.conf or systemd unit files.
- RLIMIT_AS for untrusted code: setting RLIMIT_AS before execve-ing an untrusted binary limits its virtual memory consumption. Combined with RLIMIT_FSIZE and RLIMIT_NOFILE, it provides coarse sandboxing.
- RLIMIT_CORE and sensitive data: a core dump captures the full process address space, including cryptographic keys, passwords, and token values. Production systems commonly disable core dumps (RLIMIT_CORE=0) or configure kernel.core_pattern to pipe to a controlled handler (systemd's systemd-coredump stores dumps compressed, with restricted access).
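The RLIMIT_AS technique can be illustrated without an actual untrusted binary. In this sketch the child caps its own address space and then attempts one large anonymous mapping where a real sandbox would call `execve()`; the function name `alloc_fails_under_rlimit_as` and the sizes used in the example call are assumptions.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a child capped at `limit` bytes of
 * address space cannot map `request` bytes, 0 if the mapping succeeded,
 * -1 on error. */
int alloc_fails_under_rlimit_as(rlim_t limit, size_t request)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* soft == hard: an unprivileged child cannot raise it again */
        struct rlimit rl = { limit, limit };
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);
        /* A real sandbox would execve() the untrusted binary here;
         * resource limits are preserved across exec. */
        void *p = mmap(NULL, request, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        _exit(p == MAP_FAILED ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

A call such as `alloc_fails_under_rlimit_as(256UL << 20, 1UL << 30)` asks for 1 GiB under a 256 MiB cap and is expected to report the failure.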
Performance Implications
- Zombie cleanup latency: zombies consume no CPU, user memory, or file descriptors; only a PID and a task_struct remain. The main impact is PID table pressure. Reap promptly.
- wait() call frequency: in a server that spawns many short-lived children, draining exited children with a waitpid(-1, WNOHANG) loop on each SIGCHLD avoids dedicating a thread to a blocking waitpid(-1, 0). Use signalfd or pidfd_open for event-driven waiting without signal handler reentrancy issues.
- RLIMIT_CPU and CPU-bound workloads: even if SIGXCPU at the soft limit is caught or ignored, the kernel enforces the hard limit with SIGKILL. Use cgroup CPU quotas for finer-grained CPU control without the all-or-nothing kill semantics of RLIMIT_CPU.
- Stack size and thread count: with the default 8 MB RLIMIT_STACK, each thread reserves 8 MB of virtual address space for its stack. A server with 10,000 threads reserves about 80 GB of virtual address space even if each thread touches only a few KB. On 32-bit systems this was a hard constraint; on 64-bit systems it is a virtual address space concern, not physical memory pressure, until the pages are actually touched.
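Event-driven reaping with signalfd, mentioned in the second point, might look like the sketch below. It blocks SIGCHLD, reads notifications from the descriptor, and drains all exited children after each read. The function name `reap_with_signalfd` is hypothetical, and a real server would register the descriptor with epoll instead of blocking in `read()`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks `n` children and reaps them from a
 * signalfd read loop. Returns the number reaped, or -1 on error. */
int reap_with_signalfd(int n)
{
    sigset_t mask, old;
    sigemptyset(&mask);
    sigaddset(&mask, SIGCHLD);
    /* Block SIGCHLD so it stays pending for the signalfd to report. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) != 0)
        return -1;
    int sfd = signalfd(-1, &mask, SFD_CLOEXEC);
    if (sfd < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    int forked = 0, reaped = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        if (pid > 0)
            forked++;
    }

    while (reaped < forked) {
        struct signalfd_siginfo si;
        /* Blocks until at least one SIGCHLD is pending. */
        if (read(sfd, &si, sizeof si) != (ssize_t)sizeof si)
            break;
        /* One notification may cover several exits: drain them all. */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            reaped++;
    }

    close(sfd);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return reaped;
}
```

Since no asynchronous handler runs, the reaping code is free to call functions that are not async-signal-safe.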
Failure Modes
| Failure | Symptom | Diagnosis |
|---|---|---|
| Zombie flood | ps shows many Z entries | strace -e wait4 -p PARENT_PID |
| PID exhaustion from zombies | fork(): EAGAIN despite low process count | cat /proc/sys/kernel/pid_max; count zombies |
| fd leak exhaustion | EMFILE: too many open files | Count entries in /proc/PID/fd and compare with /proc/PID/limits |
| Daemon silently exits | No log output, process missing | Run without daemonization first; check journalctl |
| Core dumps missing | Crash with no .core file | RLIMIT_CORE=0 or fs.suid_dumpable=0; check kernel.core_pattern |
| CPU hard limit kill | Process killed without SIGTERM | RLIMIT_CPU hard limit hit; switch to cgroup CPU quota |
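The fd-leak diagnosis in the table can also be built into a process as a self-check. The sketch below counts the entries of /proc/self/fd and relates them to the soft RLIMIT_NOFILE; both function names are hypothetical, and /proc is assumed to be mounted.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Hypothetical helper: counts the caller's open file descriptors by
 * listing /proc/self/fd. Returns -1 on error. */
int count_open_fds(void)
{
    DIR *d = opendir("/proc/self/fd");
    if (!d)
        return -1;
    int self = dirfd(d), count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                     /* skip "." and ".." */
        if (atoi(e->d_name) == self)
            continue;                     /* skip the fd used for this listing */
        count++;
    }
    closedir(d);
    return count;
}

/* Hypothetical helper: percentage of the soft RLIMIT_NOFILE in use,
 * or -1 on error. */
int fd_usage_percent(void)
{
    struct rlimit rl;
    int used = count_open_fds();
    if (used < 0 || getrlimit(RLIMIT_NOFILE, &rl) != 0 || rl.rlim_cur == 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (int)(100ULL * (unsigned long long)used / rl.rlim_cur);
}
```

A long-running service could log `fd_usage_percent()` periodically and raise an alert well before `open()` starts failing with EMFILE.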
Modern Usage
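Container init processes: inside a PID namespace, the container's PID 1 is the reaper for
every orphan in that namespace. If the entrypoint is an ordinary application that never
calls wait(), zombies accumulate. tini and dumb-init are minimal init processes
purpose-built for containers that handle SIGCHLD and reap zombies that exec-style
entrypoints would not handle. A supervisor that is not PID 1 can still adopt orphans by
registering as a subreaper via prctl(PR_SET_CHILD_SUBREAPER).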
prlimit in orchestration: Kubernetes uses cgroup v2 limits (memory.max,
cpu.max) in preference to rlimit for process resource management. Rlimits such as
RLIMIT_NOFILE and RLIMIT_NPROC are still applied by the container runtime, which takes
them from its own configuration (the process.rlimits field of the OCI runtime spec).
Systemd and cgroup-based lifecycle: systemd tracks a service's lifecycle via its
cgroup. Even if the daemon double-forks into a new process tree, systemd can terminate
all members of the cgroup on systemctl stop. This makes the double-fork technique less
useful — the orphan adoption trick that daemons used to escape the parent is countered
by cgroup membership which persists across reparenting.
Future Directions
- pidfd as the universal process handle: kernel development is moving toward pidfd-based alternatives to PID-based APIs (kill, waitpid) in order to eliminate PID reuse races. pidfd_send_signal() and waitid(P_PIDFD, ...) are already stable.
- Cgroup v2 lifecycle management: increasingly, process lifecycle is managed at the cgroup level, not the individual PID level. cgroup.events (populated/empty) is the notification mechanism that container runtimes use alongside or instead of SIGCHLD to learn that a whole cgroup has emptied.
- Structured concurrency in OS APIs: inspired by structured concurrency in language runtimes, proposals exist for "process groups that automatically wait for all members" as a kernel primitive, solving the orphan/zombie problem at the API level.
Exercises
- Zombie lifecycle timer: write a C program that forks 50 children, each exiting immediately. The parent sleeps 10 seconds before calling waitpid in a loop. During the sleep, run ps aux | grep Z to observe the zombies. Measure how long the full reap loop takes with clock_gettime. Then repeat with signal(SIGCHLD, SIG_IGN) and verify no zombies appear.
- Resource limit enforcement: write a C program that allocates memory in 100 MB chunks until malloc returns NULL. Before running, set RLIMIT_AS=512MB with setrlimit. Record at what point allocation fails. Then use RLIMIT_AS=unlimited and observe the difference (be careful not to swap thrash your machine).
- Daemon implementation: implement a complete Unix daemon from scratch in C: double-fork, setsid, redirect stdio to /dev/null, write a PID file to /var/run/mydaemon.pid, handle SIGTERM to clean up the PID file and exit gracefully. Verify with systemd-run --scope ./mydaemon that it is properly independent.
- Orphan adoption tracing: write a C program where a grandparent forks a child, the child forks a grandchild, and the child immediately exits. The grandparent sleeps 5 seconds then exits. At each step, run grep PPid /proc/GRANDCHILD_PID/status to trace how PPid changes as the orphan is reparented.
- SIGCHLD vs waitpid patterns: implement two versions of a process pool (10 workers, each doing random work for 0–2 seconds): version A uses a SIGCHLD handler with waitpid(-1, WNOHANG) in a loop; version B uses signalfd + epoll for event-driven reaping. Measure total zombie lifespan (time from exit to reap) for each approach under load.
References
- kernel/exit.c: do_exit(), exit_mm(), exit_files() (zombie creation); wait_consider_task() (zombie reaping)
- include/uapi/linux/resource.h: rlimit constants
- kernel/sys.c: getrlimit(), setrlimit(), prlimit64()
- Kerrisk, The Linux Programming Interface: Chapters 26 (monitoring child processes), 27 (program execution), 28 (process creation and program execution in more detail)
- Stevens & Rago, Advanced Programming in the UNIX Environment: Chapter 13 (Daemons)
- man 2 wait, man 2 waitpid, man 2 waitid, man 2 getrlimit, man 2 prlimit
- man 2 prctl: PR_SET_CHILD_SUBREAPER
- LWN: "Managing processes via pidfds" (2020)
- systemd documentation: systemd.exec(5), Limit* directives