03 — Process Lifecycle

Technical Overview

Every process follows a deterministic lifecycle: it is created, competes for CPU time, blocks on resources, resumes, and eventually terminates. Understanding every phase of this journey — including the subtleties of zombie processes, orphan adoption, voluntary yields, daemon creation patterns, and resource limits — is the foundation of reliable systems programming. This file traces the full arc from fork() to the final cleanup of the last kernel data structure, and covers all the administrative mechanisms that constrain and monitor processes throughout their life.

Prerequisites

01-process-concept.md: task_struct, process state machine, /proc/PID/
02-fork-and-exec.md: how processes are created
Basic C signal handling (SIGCHLD, wait() family)
Understanding of file descriptors and reference counting

Core Content

The Complete Lifecycle Arc

                      fork() / clone()
                            │
                            ▼
                    ┌───────────────┐
                    │    CREATED    │  task_struct allocated, not yet runnable
                    └──────┬────────┘
                           │  copy_process() completes, wake_up_new_task()
                           ▼
                    ┌───────────────┐
                    │     READY     │  TASK_RUNNING on run queue, waiting for CPU
                    └──────┬────────┘
              ┌────────────┤  scheduler picks task
              │            ▼
              │     ┌───────────────┐
              │     │   RUNNING     │  TASK_RUNNING, executing on CPU
              │     └──────┬────────┘
              │            │  I/O, lock, sleep syscall
              │            ▼
              │     ┌───────────────┐
              │     │   BLOCKED     │  TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
              │     └──────┬────────┘
              │            │  event occurs, wake_up_process()
              └────────────┘  (back to READY)
                            │
                            │  exit() / exit_group() / fatal signal
                            ▼
                    ┌───────────────┐
                    │    ZOMBIE     │  EXIT_ZOMBIE: PCB kept, mm freed
                    └──────┬────────┘
                           │  parent calls wait() / waitpid()
                           ▼
                    ┌───────────────┐
                    │     DEAD      │  EXIT_DEAD: task_struct freed, PID recycled
                    └───────────────┘

Voluntary Yield vs. Involuntary Preemption

Voluntary yield: the running process calls a syscall that puts itself to sleep.

sched_yield(2): moves the caller to the end of the run queue for its priority level. It is only a hint: if the caller is the only runnable task at that priority, the scheduler reschedules it immediately. Rarely the right tool; it is usually a sign of a busy-wait that should use a proper synchronization primitive.
nanosleep(2) / clock_nanosleep(2): sleeps for at least the requested duration. The task enters TASK_INTERRUPTIBLE. A signal can wake it early; the remaining time is returned in the rem argument.
Blocking syscalls (read, write, recv, accept, futex-based mutex waits): all put the task into TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether the wait can safely be interrupted by a signal.
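
The early-wakeup behavior of nanosleep(2) can be handled with a small retry loop. The sketch below is one way to do it; the helper name sleep_fully is hypothetical, not part of any library.

```c
#include <errno.h>
#include <time.h>

/* Sleep for the full requested duration even if signals interrupt the call.
 * Returns 0 on success, -1 on any error other than EINTR (errno is set). */
int sleep_fully(time_t sec, long nsec)
{
    struct timespec req = { .tv_sec = sec, .tv_nsec = nsec };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;      /* e.g. EINVAL for an out-of-range tv_nsec */
        req = rem;          /* a signal woke us: resume with the remainder */
    }
    return 0;
}
```

Calling sleep_fully(0, 1000000) sleeps for at least one millisecond. For long sleeps that must not drift, clock_nanosleep(2) with TIMER_ABSTIME is usually the better choice, because restarting a relative sleep accumulates rounding error.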

Involuntary preemption: the scheduler forcibly removes the CPU from a running task.

Triggered by a timer interrupt (scheduler tick, or an hrtimer in tickless mode) or by the wakeup of a higher-priority task.
The scheduler sets TIF_NEED_RESCHED in the task's thread flags.
Actual preemption happens at the next safe preemption point:
On return from interrupt (always safe)
On return from syscall to user space
In the kernel at explicit preempt_enable() if CONFIG_PREEMPT is enabled

With CONFIG_PREEMPT_RT, nearly all spinlocks become sleeping locks and the kernel is fully preemptible, allowing hard real-time latency bounds.
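
From user space, the split between voluntary yields and involuntary preemptions is visible through getrusage(2). The following is a minimal sketch; the struct ctxt_counts type and the read_ctxt_counts helper are illustrative names only.

```c
#include <sys/resource.h>

struct ctxt_counts {
    long voluntary;    /* task blocked or yielded the CPU itself */
    long involuntary;  /* scheduler took the CPU away */
};

/* Snapshot this process's context-switch counters.
 * Returns 0 on success, -1 if getrusage() fails. */
int read_ctxt_counts(struct ctxt_counts *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;
    out->voluntary = ru.ru_nvcsw;
    out->involuntary = ru.ru_nivcsw;
    return 0;
}
```

Taking one snapshot before and one after a region of code shows whether the region mostly blocked (voluntary count grows) or mostly competed for the CPU (involuntary count grows).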

Process Waiting: wait4() and waitpid() Variants

When a child terminates it becomes a zombie. The parent must reap the zombie by calling a wait family syscall:

waitpid(pid, &status, options)   // wait for specific PID or -1 for any child
wait4(pid, &status, options, &rusage)  // same + resource usage
waitid(idtype, id, &siginfo, options)  // POSIX, most flexible

Key options flags:

Flag	Meaning
`WNOHANG`	Return immediately if no child has exited yet (non-blocking)
`WUNTRACED`	Also return for children stopped by a signal (SIGSTOP/SIGTSTP)
`WCONTINUED`	Also return for children resumed by SIGCONT
`WEXITED`	(waitid) wait for children that have exited
`WNOWAIT`	(waitid) leave the child in waitable state (peek, do not reap)
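
As an illustration of WNOHANG, the sketch below polls a child without blocking. The helper names poll_child and wnohang_demo are hypothetical; the pipe is only there to keep the child alive until the parent has polled once.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Non-blocking check on one child.
 * Returns 1 if the child was reaped (*status filled in),
 * 0 if it has not changed state yet, -1 on error. */
int poll_child(pid_t pid, int *status)
{
    pid_t r = waitpid(pid, status, WNOHANG);

    if (r == 0)
        return 0;
    return (r == pid) ? 1 : -1;
}

/* Returns 0 if the child was first seen running and later reaped
 * with exit code 3; -1 otherwise. */
int wnohang_demo(void)
{
    int gate[2], status = 0;

    if (pipe(gate) == -1)
        return -1;
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        char c;
        close(gate[1]);
        if (read(gate[0], &c, 1) < 0)   /* blocks until the parent closes its end */
            _exit(1);
        _exit(3);
    }
    close(gate[0]);

    int before = poll_child(pid, &status);  /* child is still blocked */
    close(gate[1]);                         /* EOF releases the child */
    if (waitpid(pid, &status, 0) != pid)    /* blocking reap */
        return -1;
    return (before == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 3) ? 0 : -1;
}
```

A return value of 0 from waitpid() with WNOHANG means "children exist, but none has changed state", which is distinct from the -1/ECHILD result for "no such child".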

Status decoding macros:

WIFEXITED(status)    // true if child exited normally via exit()
WEXITSTATUS(status)  // exit code (low 8 bits) if WIFEXITED
WIFSIGNALED(status)  // true if killed by signal
WTERMSIG(status)     // signal number if WIFSIGNALED
WIFSTOPPED(status)   // true if stopped (WUNTRACED)
WSTOPSIG(status)     // signal that stopped the child
WIFCONTINUED(status) // true if resumed by SIGCONT

Zombie Processes: Lifecycle of a Dead But Not Reaped Task

When a process calls exit() (or receives a fatal signal), the kernel runs do_exit() in kernel/exit.c. The sequence, simplified (the exact order and helper names vary between kernel versions):

do_exit()
   │
   ├── exit_signals()         - set PF_EXITING; hand shared pending signals to sibling threads
   ├── exit_mm()              - drop reference to mm_struct; if last reference,
   │                            free all VMAs and page tables (mm freed here)
   ├── exit_files()           - drop the fd table; close all open file descriptors (fput each)
   ├── exit_fs()              - release fs_struct (cwd, root)
   ├── exit_task_namespaces() - release namespace references
   ├── taskstats_exit()       - per-task accounting (taskstats)
   ├── exit_notify()          - reparent children, set exit_state to EXIT_ZOMBIE,
   │                            send SIGCHLD to parent
   └── do_task_dead()         - calls schedule(); never returns, task removed from CPU

At this point the task_struct still exists (the zombie), but:

- mm_struct is freed (no address space remains; VmSize = 0)
- All file descriptors are closed
- The zombie still consumes a PID and a task_struct (~8 KB of kernel memory)

Zombie creation sequence:

  PARENT                         CHILD
    │                              │
    │           fork()             │
    │◄──────────────────────────── │
    │                              │
    │     (parent does other work) │  exit(0)
    │                              ├─────────────────────┐
    │                              │                     ▼
    │                    EXIT_ZOMBIE state         mm freed,
    │                    task_struct retained      fds closed,
    │                    PCB in parent's           SIGCHLD sent
    │                    children list             to parent
    │
    │  waitpid(child_pid, ...)
    ├──────────────────────────────►
    │                              │
    │  parent reads exit status    │ EXIT_DEAD
    │                              │ task_struct freed
    │                              │ PID recycled

Zombies are harmless as long as they are temporary. They become a problem when:

- The parent never calls wait() (bug)
- The parent is in a tight loop creating and ignoring children
- The PID table fills up (more critical on systems with a low pid_max)

To prevent zombies: either call waitpid() / wait(), or explicitly set SIGCHLD to SIG_IGN (on Linux this auto-reaps children without creating zombies), or use the double-fork technique.

Orphan Adoption: init and PR_SET_CHILD_SUBREAPER

If a parent exits before its children, those children become orphans. The kernel reparents them:

Walk the task's children list
For each child, look for the nearest ancestor with signal->is_child_subreaper set (enabled via prctl(PR_SET_CHILD_SUBREAPER, 1))
If no subreaper is found, reparent to the init process of the child's PID namespace (PID 1: init / systemd on the host)

systemd's per-user manager (systemd --user) registers itself as a subreaper so that daemons that double-fork are reparented to it rather than to PID 1. Which service a process belongs to is tracked separately, through its cgroup.

The orphaned child continues running normally; it is not killed. It simply gets a new task_struct->real_parent pointer.
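
Subreaper adoption can be demonstrated with three generations of processes. The sketch below assumes Linux 3.4 or later; subreaper_demo is a hypothetical name, and the pipe is just a way for the grandchild to report the parent PID it observes after being orphaned.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the orphaned grandchild was reparented to this process,
 * 0 if it was reparented elsewhere, -1 on setup failure. */
int subreaper_demo(void)
{
    int report[2];                      /* grandchild -> us: its new PPID */

    if (prctl(PR_SET_CHILD_SUBREAPER, 1) == -1 || pipe(report) == -1)
        return -1;

    pid_t self = getpid();
    pid_t child = fork();
    if (child == -1)
        return -1;
    if (child == 0) {
        pid_t middle = getpid();
        if (fork() == 0) {              /* grandchild */
            while (getppid() == middle) /* wait until the middle process is gone */
                usleep(1000);
            pid_t ppid = getppid();
            if (write(report[1], &ppid, sizeof ppid) != (ssize_t)sizeof ppid)
                _exit(1);
            _exit(0);
        }
        _exit(0);                       /* middle exits: grandchild is orphaned */
    }
    close(report[1]);
    waitpid(child, NULL, 0);            /* reap the middle process */

    pid_t seen = -1;
    ssize_t n = read(report[0], &seen, sizeof seen);
    close(report[0]);
    while (wait(NULL) > 0) {}           /* a subreaper must reap what it adopts */
    prctl(PR_SET_CHILD_SUBREAPER, 0);
    return n == (ssize_t)sizeof seen ? (seen == self) : -1;
}
```

Without the prctl() call, the grandchild would report the PID of the namespace's init process instead. The final wait() loop matters: a subreaper that never reaps simply moves the zombie problem to itself.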

Double-Fork Technique for Daemons

The traditional Unix daemon creation pattern uses two consecutive forks to fully detach from the terminal and controlling process group:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static void daemonize(void)
{
    // Step 1: fork, parent exits → shell prompt returns
    pid_t pid = fork();
    if (pid < 0) abort();
    if (pid > 0) _exit(0);       // parent exits (no stdio flush, no atexit handlers)

    // Step 2: create new session (detach from controlling terminal)
    if (setsid() < 0) abort();   // new session, new process group, no ctty

    // Step 3: fork again → the session leader exits; the grandchild is not a
    //         session leader, so it can never acquire a controlling terminal
    pid = fork();
    if (pid < 0) abort();
    if (pid > 0) _exit(0);       // session leader exits

    // Now we are a fully independent daemon (grandchild)
    if (chdir("/") < 0) abort(); // don't hold a mount point busy
    umask(0);

    // Redirect stdio to /dev/null (or a log file). dup2() closes the old
    // descriptor itself, so no separate close() calls are needed.
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0) abort();
    dup2(fd, STDIN_FILENO);
    dup2(fd, STDOUT_FILENO);
    dup2(fd, STDERR_FILENO);
    if (fd > 2) close(fd);
}

// daemon code runs after daemonize() returns

The second fork ensures the daemon is not a session leader. Only a session leader can acquire a controlling terminal, which happens when it opens a terminal device without O_NOCTTY; a daemon that acquired one could receive SIGHUP when the terminal session ends.

Modern alternative: run the service under systemd with Type=simple or, better, Type=notify with sd_notify(3). No double-fork is needed because systemd supervises the process directly. Type=forking exists for legacy daemons that still double-fork.

/proc/PID/status: Key Fields

Name:   nginx                  # comm field (truncated to 15 chars)
State:  S (sleeping)           # state character: R,S,D,Z,T,t,X
Tgid:   1234                   # process ID (thread group leader's PID)
Ngid:   0                      # NUMA group ID (scheduling hint)
Pid:    1234                   # this task's PID (= Tgid for main thread)
PPid:   1                      # parent PID
TracerPid: 0                   # PID of ptracer (0 = not traced)
Uid:    1000  1000  1000  1000 # real, effective, saved, filesystem UIDs
Gid:    1000  1000  1000  1000
FDSize: 256                    # size of fd table (not number of open fds)
Groups: 1000 4 24 27           # supplementary groups
VmPeak: 102400 kB              # peak virtual address space size
VmSize:  98304 kB              # current virtual address space size
VmLck:       0 kB              # locked (mlock) pages
VmPin:       0 kB              # pinned pages
VmHWM:   12288 kB              # peak resident set size
VmRSS:    8192 kB              # current RSS (physical pages)
VmData:    512 kB              # data segment size (stack is reported separately as VmStk)
VmStk:     136 kB              # stack size
VmExe:     256 kB              # text (code) size
VmLib:   32768 kB              # shared library code size
VmPTE:      72 kB              # page table entries size
VmSwap:      0 kB              # swapped-out pages
Threads: 4                     # number of threads in this process
SigQ:   0/62810                # pending signals / max queue length
SigPnd: 0000000000000000       # bitmask of pending signals (this thread)
ShdPnd: 0000000000000000       # bitmask of pending signals (process-wide)
SigBlk: 0000000000010000       # blocked signals bitmask
SigIgn: 0000000000001000       # ignored signals bitmask
SigCgt: 0000000180014603       # caught signals bitmask
voluntary_ctxt_switches: 1432
nonvoluntary_ctxt_switches: 87
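
Individual fields are easy to extract programmatically. The following is a minimal sketch, assuming /proc is mounted; status_field is a hypothetical helper that returns the first integer after the field name.

```c
#include <stdio.h>
#include <string.h>

/* Look up a numeric field (e.g. "VmRSS", "Threads") in /proc/self/status.
 * Returns the first integer after "Field:", or -1 if it is not found. */
long status_field(const char *field)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    long value = -1;
    size_t n = strlen(field);

    while (fgets(line, sizeof line, f)) {
        /* Require the colon so "Pid" does not match "PPid" or "Pid2". */
        if (strncmp(line, field, n) == 0 && line[n] == ':') {
            if (sscanf(line + n + 1, "%ld", &value) != 1)
                value = -1;
            break;
        }
    }
    fclose(f);
    return value;
}
```

For example, status_field("VmRSS") yields the resident set size in kB, and status_field("voluntary_ctxt_switches") yields the counter shown in the listing above. Fields with several values, such as Uid, return only the first one.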

Resource Limits: rlimit

Every process has a set of resource limits managed by struct rlimit {rlim_t rlim_cur, rlim_max}:

`getrlimit` constant	Limit name	Default	Controls
`RLIMIT_AS`	Address space	unlimited	Maximum virtual address space bytes
`RLIMIT_NOFILE`	Open file descriptors	1024 (soft)	Max number of open fds
`RLIMIT_NPROC`	Max processes	~30000	Max tasks per real UID
`RLIMIT_CPU`	CPU time	unlimited	Seconds of CPU; SIGXCPU at soft, SIGKILL at hard
`RLIMIT_FSIZE`	File size	unlimited	Max file size; SIGXFSZ on violation
`RLIMIT_MEMLOCK`	Locked memory	64 KB	Max bytes locked with mlock()
`RLIMIT_STACK`	Stack size	8 MB	Stack VMA size; SIGSEGV beyond
`RLIMIT_CORE`	Core dump size	0	Max core file size (0 = no core)
`RLIMIT_NICE`	Nice ceiling	0	Lowest (best) nice value unprivileged process can set
`RLIMIT_RTPRIO`	RT priority	0	Max real-time scheduling priority

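The soft limit is the value the kernel enforces; the hard limit is the ceiling up to which an unprivileged process may raise its soft limit. An unprivileged process may also lower its hard limit, but that is irreversible. Only a process with CAP_SYS_RESOURCE (typically root) can raise a hard limit.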

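struct rlimit rl;
if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {       // read current limits
    rlim_t want = 65536;
    if (want > rl.rlim_max) want = rl.rlim_max; // soft limit cannot exceed hard limit
    rl.rlim_cur = want;                         // new soft limit
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)     // apply it
        perror("setrlimit");
}

// prlimit() allows reading and setting limits on other processes
// (new_rl and old_rl are struct rlimit; either pointer may be NULL):
prlimit(pid, RLIMIT_NOFILE, &new_rl, &old_rl);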

systemd enforces limits per service via LimitNOFILE=, LimitNPROC=, etc. in unit files. The limits are set with setrlimit() in the forked child before it executes the service binary, so they are inherited by everything the service spawns.
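
Soft-limit enforcement is easy to see by lowering RLIMIT_NOFILE temporarily. The sketch below uses a hypothetical helper, hit_nofile_limit, and assumes /dev/null is available; cap is expected to be at most 256 and no higher than the current soft limit.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

/* Temporarily lower the soft RLIMIT_NOFILE to `cap`, open /dev/null until
 * open() fails, then restore the limit. Returns the errno of the failing
 * open() (EMFILE is expected), or -1 on setup failure. */
int hit_nofile_limit(rlim_t cap)
{
    struct rlimit saved, lowered;

    if (getrlimit(RLIMIT_NOFILE, &saved) == -1 ||
        cap > saved.rlim_cur || cap > 256)
        return -1;
    lowered = saved;
    lowered.rlim_cur = cap;         /* lowering the soft limit needs no privilege */
    if (setrlimit(RLIMIT_NOFILE, &lowered) == -1)
        return -1;

    int fds[256], n = 0, err = 0;
    while (n < 256) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            err = errno;            /* limit reached */
            break;
        }
        fds[n++] = fd;
    }
    while (n > 0)
        close(fds[--n]);
    setrlimit(RLIMIT_NOFILE, &saved);   /* back up to the old soft limit */
    return err;
}
```

Raising the soft limit back is allowed without privilege because the hard limit was never changed. Descriptors that were already open above the new cap stay valid; the limit only constrains newly allocated descriptor numbers.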

Historical Context

The zombie state dates to original UNIX: a dead process needed to retain its exit status until the parent read it, since the status could not be passed through signals alone (signal delivery is not reliable for data). The term "zombie" was coined in the UNIX community by the late 1970s.

Resource limits (rlimit) originated in BSD UNIX (the getrlimit()/setrlimit() interface shipped in 4.2BSD, 1983) and were later standardized in POSIX.1. The prlimit() syscall (Linux 2.6.36, 2010) extended the interface to allow a process to set limits on other processes, which is essential for container runtimes that configure resource limits before execing the container init.

The daemon double-fork pattern is documented in Stevens' Advanced Programming in the UNIX Environment (1992) and remains the standard approach for traditional Unix daemons, though systemd (2010) largely supersedes it in modern Linux deployments.

Production Examples

Monitor for zombie accumulation:

# Count zombies per parent
ps -eo ppid,stat | awk '$2 ~ /^Z/ {print $1}' | sort | uniq -c | sort -rn | head
# Then find parent:
ps -o pid,comm -p <PPID>

Check resource limits for a running process:

cat /proc/$(pgrep nginx | head -1)/limits
# or
prlimit --pid $(pgrep nginx | head -1)

Increase file descriptor limit for a service at runtime:

# For a running process (as root):
prlimit --nofile=65536:65536 --pid <PID>
# Permanently in systemd service:
# [Service]
# LimitNOFILE=65536

Finding orphaned processes (reparented to init):

ps -eo pid,ppid,comm | awk '$2 == 1 {print}' | grep -v systemd

Debugging Notes

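Zombie root cause: a zombie's parent is shown in the PPid: field of /proc/ZPID/status. Use strace -e trace=wait4,waitid -p PPID to see whether the parent calls wait at all. A common bug is a SIGCHLD handler that calls waitpid() once, or only for a specific PID: standard signals are not queued, so several child exits can produce a single SIGCHLD delivery. The handler should loop on waitpid(-1, &status, WNOHANG) until it returns 0 or -1.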
RLIMIT_NOFILE too low: nginx or a Go server printing "too many open files" means the process hit its fd limit. Check cat /proc/PID/limits and ls /proc/PID/fd | wc -l to see current usage vs. limit.
RLIMIT_CPU behavior: processes exceeding RLIMIT_CPU soft limit receive SIGXCPU (catchable); at the hard limit they receive SIGKILL. Catching SIGXCPU and continuing will eventually hit the hard limit.
Daemon startup debugging: a double-forked daemon that fails silently is hard to debug. Use Type=forking in systemd and set StandardError=journal so stderr is captured even from the forked child.
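Core dumps not generated: the soft RLIMIT_CORE is commonly 0 by default. Raise it with ulimit -c unlimited, and check sysctl kernel.core_pattern, which controls where the dump is written or which handler it is piped to. coredumpctl (systemd) is the modern interface for listing and retrieving dumps.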

Security Implications

Zombie PID squatting: a process that creates many short-lived children and delays reaping can accumulate thousands of zombies, each holding a PID. In environments with low pid_max, this prevents other processes from spawning — a denial-of-service vector.
RLIMIT_NPROC as DoS mitigation: setting a per-user process limit prevents fork bombs from users or compromised services. Typically done in /etc/security/limits.conf or systemd unit files.
RLIMIT_AS for untrusted code: setting RLIMIT_AS before execve-ing an untrusted binary limits its virtual memory consumption. Combined with RLIMIT_FSIZE and RLIMIT_NOFILE, provides coarse sandboxing.
RLIMIT_CORE and sensitive data: a core dump captures the full process address space including cryptographic keys, passwords, and token values. Production systems commonly disable core dumps (RLIMIT_CORE=0) or configure kernel.core_pattern to pipe to a controlled handler (systemd-coredump, for example, stores dumps compressed and with restricted file permissions; it does not encrypt them).

Performance Implications

Zombie cleanup latency: zombies consume no CPU time, user memory, or file descriptors; all that remains is a PID and the task_struct. The main impact is PID table pressure. Reap promptly.
wait() call frequency: in a server that spawns many short-lived children, draining exits with waitpid(-1, &status, WNOHANG) in a loop on each SIGCHLD avoids dedicating a thread to a blocking waitpid(-1, &status, 0). Use signalfd or pidfd_open for event-driven waiting without signal handler reentrancy issues.
RLIMIT_CPU and CPU-bound workloads: even if SIGXCPU at the soft limit is caught or ignored, the kernel still enforces the hard limit with SIGKILL. Use cgroup CPU quotas for finer-grained CPU control without the all-or-nothing kill semantics of RLIMIT_CPU.
Stack size and thread count: the default 8 MB RLIMIT_STACK means each thread reserves 8 MB of virtual address space. A server with 10,000 threads consumes 80 GB of virtual space even if physically only a few KB are used. On 32-bit systems this was a hard constraint; on 64-bit it's a virtual address space concern but not physical memory pressure until pages are actually touched.

Failure Modes

Failure	Symptom	Diagnosis
Zombie flood	`ps` shows many `Z` entries	`strace -e wait4 -p PARENT_PID`
PID exhaustion from zombies	`fork(): EAGAIN` despite low process count	`cat /proc/sys/kernel/pid_max`; count zombies
fd leak exhaustion	`EMFILE: too many open files`	`ls /proc/PID/fd | wc -l` vs `/proc/PID/limits`
Daemon silently exits	No log output, process missing	Run without daemonization first; check `journalctl`
Core dumps missing	Crash with no `.core` file	`RLIMIT_CORE=0` or `fs.suid_dumpable=0`; check `kernel.core_pattern`
CPU hard limit kill	Process killed without SIGTERM	`RLIMIT_CPU` hard limit hit; switch to cgroup CPU quota

Modern Usage

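Container init processes: inside a PID namespace, the container's PID 1 automatically becomes the parent of orphaned processes, so it must reap them; no subreaper flag is needed for that. Helpers that are not PID 1, such as a shim process supervising the container from outside, rely on prctl(PR_SET_CHILD_SUBREAPER) instead. tini and dumb-init are minimal init processes purpose-built for containers: they forward signals and reap the zombies that an exec-style entrypoint running as PID 1 would otherwise leave behind.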

rlimits in orchestration: Kubernetes manages process resources through cgroup limits (memory.max, cpu.max on cgroup v2) rather than rlimits. Values such as RLIMIT_NOFILE and RLIMIT_NPROC are still applied to container processes by the container runtime, for example through the process.rlimits section of the OCI runtime configuration.

Systemd and cgroup-based lifecycle: systemd tracks a service's lifecycle via its cgroup. Even if the daemon double-forks into a new process tree, systemd can terminate all members of the cgroup on systemctl stop. This makes the double-fork technique less useful — the orphan adoption trick that daemons used to escape the parent is countered by cgroup membership which persists across reparenting.

Future Directions

pidfd as the universal process handle: a pidfd is a stable reference to a process, which eliminates PID reuse races. The kernel has been gaining pidfd-based counterparts to PID-based calls such as kill and waitpid: pidfd_send_signal() (Linux 5.1) and waitid(P_PIDFD, ...) (Linux 5.4) are already stable.
Cgroup v2 lifecycle management: increasingly, process lifecycle is managed at the cgroup level, not the individual PID level. cgroup.events (populated/empty) is the modern notification mechanism replacing SIGCHLD for container runtimes.
Structured concurrency in OS APIs: inspired by structured concurrency in language runtimes, proposals exist for "process groups that automatically wait for all members" as a kernel primitive, solving the orphan/zombie problem at the API level.

Exercises

Zombie lifecycle timer: write a C program that forks 50 children, each exiting immediately. The parent sleeps 10 seconds before calling waitpid in a loop. During the sleep, run ps aux | grep Z to observe the zombies. Measure how long the full reap loop takes with clock_gettime. Then repeat with signal(SIGCHLD, SIG_IGN) and verify no zombies appear.
Resource limit enforcement: write a C program that allocates memory in 100 MB chunks until malloc returns NULL. Before running, set RLIMIT_AS=512MB with setrlimit. Record at what point allocation fails. Then use RLIMIT_AS=unlimited and observe the difference (be careful not to swap thrash your machine).
Daemon implementation: implement a complete Unix daemon from scratch in C: double-fork, setsid, redirect stdio to /dev/null, write a PID file to /var/run/mydaemon.pid, handle SIGTERM to clean up the PID file and exit gracefully. Verify with systemd-run --scope ./mydaemon that it is properly independent.
Orphan adoption tracing: write a C program where grandparent forks a child, the child forks a grandchild, and the child immediately exits. The grandparent sleeps 5 seconds then exits. At each step, run grep PPid /proc/GRANDCHILD_PID/status to trace how PPid changes as the orphan is reparented.
SIGCHLD vs waitpid patterns: implement two versions of a process pool (10 workers, each doing random work for 0–2 seconds): version A uses a SIGCHLD handler with waitpid(-1, &status, WNOHANG) in a loop; version B uses signalfd + epoll for event-driven reaping. Measure total zombie lifespan (time from exit to reap) for each approach under load.

References

kernel/exit.c — do_exit(), exit_mm(), exit_files(), zombie creation
kernel/exit.c - do_wait(), wait_consider_task(), wait_task_zombie(), zombie reaping
include/uapi/linux/resource.h — rlimit constants
kernel/sys.c — getrlimit(), setrlimit(), prlimit64()
Kerrisk, The Linux Programming Interface, Chapters 26 (monitoring child processes), 27 (program execution), 28 (process creation and program execution in more detail)
Stevens & Rago, Advanced Programming in the UNIX Environment — Chapter 13 (Daemons)
man 2 wait, man 2 waitpid, man 2 waitid, man 2 getrlimit, man 2 prlimit
man 2 prctl — PR_SET_CHILD_SUBREAPER
LWN: "Managing processes via pidfds" (2020)
systemd documentation: systemd.exec(5) — Limit* directives