01 — The Process Concept
Technical Overview
A process is a program in execution. This single sentence carries enormous density: a program is a static artifact — an ELF binary on disk; a process is the living, dynamic instance of that program inside the operating system, complete with its own address space, file descriptor table, signal handlers, scheduling context, and accounting data. The distinction matters because the same program can run as dozens of concurrent processes, each fully isolated from the others.
In Linux the fundamental data structure that represents a process — or a thread — is
task_struct, defined in include/linux/sched.h. It currently spans roughly 700 fields
and is the single most important structure in the entire kernel. Linux unifies processes
and threads under the same abstraction: both are tasks. The difference is purely in
which resources are shared at creation time (covered in depth in 02-fork-and-exec.md).
Prerequisites
- Basic understanding of C and pointers
- Familiarity with the OS conceptual model (kernel/user space, system calls)
- Awareness of virtual memory concepts (pages, page tables) — see
11-memory-management/
Core Content
Program vs. Process vs. Thread vs. Task
| Concept | Kernel representation | Shared resources |
|---|---|---|
| Program | ELF file on disk | N/A (static) |
| Process | task_struct | Nothing shared with siblings |
| Thread | task_struct | VM, file table, signal handlers, FS context |
| Task | task_struct | Kernel's unified term for both of the above |
POSIX threads inside a process share the same mm_struct (the address space descriptor),
the same files_struct (open file table), and the same signal_struct. The kernel
scheduler sees every task_struct equally; whether two tasks share memory is a detail of
their creation flags, not a fundamental distinction in the scheduler.
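The sharing rules can be observed from user space. The sketch below is illustrative only: it assumes a Unix-like system where os.fork() is available and where Python threads map to kernel tasks (the case for CPython on Linux). The names shared, thread_write_is_visible, and fork_write_is_visible are made up for this example. A thread's write to a module-level object is visible to its creator, while a forked child's write lands in its own copy of the address space.

```python
import os
import threading

shared = {"value": 0}  # lives in this process's address space


def thread_write_is_visible():
    """A thread shares the creator's address space, so its write is seen here."""
    shared["value"] = 0
    worker = threading.Thread(target=lambda: shared.update(value=1))
    worker.start()
    worker.join()
    return shared["value"] == 1


def fork_write_is_visible():
    """A forked child writes to its own copy-on-write copy, not to ours."""
    shared["value"] = 0
    pid = os.fork()
    if pid == 0:            # child: modify the dict, then exit without cleanup
        shared["value"] = 1
        os._exit(0)
    os.waitpid(pid, 0)      # parent: reap the child so no zombie is left
    return shared["value"] == 1


if __name__ == "__main__":
    print("thread write visible:", thread_write_is_visible())
    print("fork write visible:  ", fork_write_is_visible())
```

Run as a script, this should report True for the thread and False for the forked child.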
The Process Control Block: Linux task_struct
The PCB is the OS's bookkeeping record for one task. In classical OS textbooks it is a compact structure; in Linux production kernels it has grown to accommodate decades of features. Key field groups:
task_struct (~700 fields, include/linux/sched.h)
├── Identity
│ ├── pid_t pid — unique ID of this task (thread)
│ ├── pid_t tgid — thread group ID = PID of the process leader
│ ├── const struct cred *cred — credentials: real/effective UIDs and GIDs (separate uid/euid fields before 2.6.29)
│ └── char comm[16] — short executable name
│
├── State
│ └── unsigned int __state — TASK_RUNNING, TASK_INTERRUPTIBLE, … (volatile long state before 5.14)
│
├── Memory
│ └── struct mm_struct *mm — address space (NULL in kernel threads)
│
├── Files
│ ├── struct files_struct *files — open file descriptor table
│ └── struct fs_struct *fs — root and cwd
│
├── Signals
│ ├── struct signal_struct *signal — shared signal state (whole process)
│ ├── struct sighand_struct *sighand — signal handlers (shared by threads)
│ └── sigset_t blocked — per-task signal mask
│
├── Scheduling
│ ├── int prio, static_prio, normal_prio
│ ├── unsigned int policy — SCHED_NORMAL, SCHED_FIFO, SCHED_RR, …
│ └── struct sched_entity se — CFS bookkeeping (vruntime, load weight)
│
├── Relationships
│ ├── struct task_struct *real_parent — biological parent
│ ├── struct task_struct *parent — recipient of SIGCHLD and wait() reports (the tracer while ptraced)
│ ├── struct list_head children — list of child tasks
│ └── struct list_head sibling — position in parent's children list
│
├── Timing
│ ├── u64 utime, stime — user/kernel time in nanoseconds
│ └── u64 start_time — monotonic time of task creation
│
└── Architecture context
└── struct thread_struct thread — saved registers, stack pointer
The pid and tgid fields are subtle: for the main (and only) thread of a process they
are equal. When pthread_create() calls clone() to create a second thread, the new
task_struct gets a fresh pid (visible as gettid()) but the same tgid as the
process leader. getpid() returns tgid; gettid() returns pid. This is why ps -e
shows one line per process while /proc/PID/task/ holds one entry per thread.
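A small user-space sketch of the pid/tgid relationship, assuming Python 3.8 or later, where threading.get_native_id() reports the kernel thread ID (the value gettid() returns on Linux). The helper name collect_ids is hypothetical. A barrier keeps all workers alive at the same time so that no thread ID can be recycled while the sample is taken.

```python
import os
import threading


def collect_ids(n_threads=2):
    """Return one (getpid(), kernel thread ID) pair per worker thread."""
    ids = []
    lock = threading.Lock()
    barrier = threading.Barrier(n_threads)  # keep every worker alive at once

    def worker():
        pair = (os.getpid(), threading.get_native_id())
        with lock:
            ids.append(pair)
        barrier.wait()

    workers = [threading.Thread(target=worker) for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return ids


if __name__ == "__main__":
    for pid, tid in collect_ids():
        print(f"getpid() = {pid}  gettid() = {tid}")
```

Every pair should carry the same getpid() value (the tgid) and a different thread ID (the per-task pid).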
Process State Machine
A task transitions among states based on I/O, scheduling, and signals. The canonical Linux states:
fork()
  │
  ▼
TASK_RUNNING (on run queue) ◄──────────────┐
  │                                        │ preemption, yield,
  │ scheduler picks task                   │ time slice expiry
  ▼                                        │
TASK_RUNNING (executing on CPU) ───────────┘
  │
  ├──► TASK_INTERRUPTIBLE    waiting; wake_up_process() or signal delivery returns it to the run queue
  ├──► TASK_UNINTERRUPTIBLE  waiting (disk I/O, mutex); ignores signals until wake_up_process()
  └──► TASK_STOPPED          SIGSTOP received or ptrace attach; SIGCONT returns it to the run queue
On exit():
TASK_RUNNING ──► EXIT_ZOMBIE ──► EXIT_DEAD
                 (parent has not  (parent called wait(),
                  yet waited)      resources freed)
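| State constant | Numeric value | Meaning |
|---|---|---|
| TASK_RUNNING | 0 | Runnable (on run queue) or currently executing |
| TASK_INTERRUPTIBLE | 1 | Sleeping; woken by signal or event |
| TASK_UNINTERRUPTIBLE | 2 | Sleeping; immune to signals (critical I/O) |
| TASK_STOPPED (__TASK_STOPPED bit) | 4 | Stopped by SIGSTOP/SIGTSTP or ptrace |
| TASK_TRACED (__TASK_TRACED bit) | 8 | Being traced by ptrace |
| EXIT_DEAD | 16 | wait() called; PCB being torn down |
| EXIT_ZOMBIE | 32 | Exited; PCB kept until parent calls wait() |
The values are those in current include/linux/sched.h; older kernels assigned 16 to EXIT_ZOMBIE and 32 to EXIT_DEAD. The two EXIT_* values are stored in the separate exit_state field rather than in the state field.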
TASK_UNINTERRUPTIBLE sleep is the source of the dreaded D state in ps. A process
stuck in D cannot be killed because SIGKILL is not delivered until the task returns to
user space from the interrupted syscall — which it cannot do until the I/O completes.
This is why a hung NFS mount can produce unkillable D-state processes.
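The state letters reported by ps come from the third field of /proc/PID/stat. The sketch below shows one way to extract it; the names STATE_NAMES and parse_state are illustrative, and the letter meanings follow proc(5). The one subtlety is that the command name is wrapped in parentheses and may itself contain spaces or a closing parenthesis.

```python
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "t": "tracing stop",
    "Z": "zombie",
    "X": "dead",
}


def parse_state(stat_line):
    """Return (pid, comm, state letter) from the text of /proc/PID/stat."""
    # comm sits in parentheses and may contain spaces or ')' itself,
    # so locate the LAST ')' instead of splitting on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


if __name__ == "__main__":
    with open("/proc/self/stat") as stat_file:
        pid, comm, state = parse_state(stat_file.read())
    print(pid, comm, state, STATE_NAMES.get(state, "unknown"))
```

A task blocked on a hung mount would show up here with the letter D.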
Process Address Space
Each process has a private virtual address space managed by a mm_struct and its list of
vm_area_struct (VMAs):
Virtual address space of a 64-bit process (not to scale)
0xFFFFFFFFFFFFFFFF ┌─────────────────────────────┐
│ Kernel space (not mappable) │
0xFFFF800000000000 ├─────────────────────────────┤
│ ... │
│ [gap — non-canonical] │
│ ... │
0x00007FFFFFFFFFFF ├─────────────────────────────┤
│ Stack (grows ▼) │ ← ASLR randomized
│ ... │
├─────────────────────────────┤
│ mmap region │ ← shared libs, anonymous
│ (grows ▼ from top) │ mmap, file-backed pages
├─────────────────────────────┤
│ Heap (grows ▲) │ ← brk/sbrk/mmap
├─────────────────────────────┤
│ BSS segment │ zero-initialized globals
├─────────────────────────────┤
│ Data segment (.data) │ initialized globals
├─────────────────────────────┤
│ Text segment (.text) │ executable code (r-x)
0x0000000000400000 └─────────────────────────────┘ ← traditional ELF load addr
Segments visible via /proc/PID/maps:
- text (.text): read+execute, loaded from ELF LOAD segment with PF_X
- rodata (.rodata): read-only constants
- data (.data): read+write, initialized global/static variables
- BSS (.bss): zero-initialized, backed by the zero page until written (demand paging)
- heap: anonymous pages obtained via brk(2) or mmap(MAP_ANONYMOUS)
- mmap region: shared libraries, file mappings, anonymous large allocations
- stack: one VMA per thread, typically 8 MB soft limit, grows down
ASLR (Address Space Layout Randomization) randomizes the base of the stack, mmap region, and (with PIE binaries) the text segment on each execution, making exploitation harder.
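Each line of /proc/PID/maps describes one VMA. As a rough sketch of how the layout above can be inspected programmatically, the parser below turns one line into a record; Vma and parse_maps_line are hypothetical names, and the column order (address range, permissions, offset, device, inode, optional path) follows proc(5).

```python
from typing import NamedTuple


class Vma(NamedTuple):
    start: int
    end: int
    perms: str      # e.g. "r-xp": read, no write, execute, private
    offset: int
    path: str       # file path, "[heap]", "[stack]", or "" if anonymous


def parse_maps_line(line):
    """Parse one line of /proc/PID/maps into a Vma record."""
    # Columns: start-end perms offset dev inode [pathname]
    fields = line.split(maxsplit=5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Vma(start, end, fields[1], int(fields[2], 16), path)


if __name__ == "__main__":
    with open("/proc/self/maps") as maps_file:
        for vma in map(parse_maps_line, maps_file):
            size_kib = (vma.end - vma.start) // 1024
            print(f"{vma.start:#018x} {vma.perms} {size_kib:>8} KiB  {vma.path}")
```

Running it twice on a PIE binary should show different base addresses for the stack, mmap region, and text, which is ASLR at work.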
PID Limits and Namespaces
The system-wide maximum PID is controlled by:
/proc/sys/kernel/pid_max
Defaults and limits:
- Default: 32,768 (0x8000), so the largest PID handed out, 32,767, fits in a signed 16-bit integer
- 32-bit kernels: 32,768 is also the ceiling
- 64-bit kernels: configurable up to 4,194,304 (2^22, about 4 M)
When no free PID remains below pid_max, fork() fails with EAGAIN ("Resource temporarily
unavailable"). On busy container hosts running many short-lived processes, PID exhaustion
is a real operational problem.
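From user space, PID exhaustion surfaces as fork() failing with EAGAIN ("Resource temporarily unavailable"). One possible way to cope with short bursts is a bounded retry, sketched below. The helper name fork_with_retry and its parameters are assumptions for illustration, and retrying is only a stopgap: EAGAIN can also come from a per-user process limit, and a persistent shortage needs the root cause fixed.

```python
import errno
import os
import time


def fork_with_retry(fork=None, attempts=3, delay=0.1):
    """Call fork(); on EAGAIN wait and retry, re-raising after `attempts` tries."""
    fork = fork or os.fork
    for attempt in range(1, attempts + 1):
        try:
            return fork()
        except OSError as exc:
            # Only EAGAIN is worth retrying; give up on the final attempt.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # linear backoff before the next try
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    # Simulated failure: a stand-in fork that fails once, then "succeeds".
    outcomes = [OSError(errno.EAGAIN, "Resource temporarily unavailable"), 1234]

    def fake_fork():
        outcome = outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    print(fork_with_retry(fork=fake_fork, delay=0))
```

The fork parameter exists only so the retry logic can be exercised without actually exhausting PIDs.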
With PID namespaces (used by containers), each namespace has its own PID number space starting at 1. A process has one PID per namespace in its ancestry chain. The host kernel still assigns a unique global PID.
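The per-namespace PIDs are exposed in the NSpid line of /proc/PID/status (available since Linux 4.1 according to proc(5)). A minimal sketch of reading it follows; pids_by_namespace is a hypothetical helper name. The leftmost value is the PID as seen from the namespace of the procfs mount, followed by the values in successively nested namespaces.

```python
def pids_by_namespace(status_text):
    """Return the task's PID in each nested PID namespace, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field missing: kernel older than 4.1, or not Linux


if __name__ == "__main__":
    with open("/proc/self/status") as status_file:
        print(pids_by_namespace(status_file.read()))
```

For a containerized process inspected from the host, the list would hold the host-side PID first and the in-container PID last; outside any nested namespace it holds a single value.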
Process Tree Structure
Every process except PID 1 has a parent. The kernel enforces this invariant: when a
parent exits before its children, the orphaned children are reparented to the nearest
ancestor that has called prctl(PR_SET_CHILD_SUBREAPER, 1), or to PID 1 (init/systemd)
if no subreaper exists.
$ pstree -p (abbreviated)
systemd(1)─┬─systemd-journal(312)
├─sshd(1024)──sshd(4096)──bash(4097)──pstree(4201)
├─nginx(2000)─┬─nginx(2001)
│ ├─nginx(2002)
│ └─nginx(2003)
└─dockerd(3000)──containerd(3001)──containerd-shim(4500)──app(4501)
Navigating the tree programmatically:
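- task_struct->real_parent: the process that created the task, or the one it was reparented to
- task_struct->parent: recipient of SIGCHLD and wait() reports; equal to real_parent unless a ptrace tracer is attached
- task_struct->children: list head of direct children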
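User space cannot follow task_struct pointers, but the same upward walk can be approximated through the PPid field of /proc/PID/status. The sketch below assumes a Linux /proc; the function names ancestry and read_ppid_from_proc are illustrative. The walk is inherently racy, because a process can exit or be reparented between two reads.

```python
import os


def read_ppid_from_proc(pid):
    """Return the PPid field of /proc/PID/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        for line in status_file:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid line for pid {pid}")


def ancestry(pid, read_ppid=read_ppid_from_proc):
    """Follow parent links upward from pid; returns [pid, parent, ..., 1]."""
    chain = [pid]
    while pid > 1:
        pid = read_ppid(pid)
        if pid <= 0 or pid in chain:   # parent outside this namespace, or a cycle
            break
        chain.append(pid)
    return chain


if __name__ == "__main__":
    print(" <- ".join(str(p) for p in ancestry(os.getpid())))
```

Applied to the pstree output above, ancestry(4201) would yield 4201, 4097, 4096, 1024, 1.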
Kernel vs. User Representation
From user space, the process is visible primarily through /proc/PID/:
/proc/4097/
├── cmdline — argv as NUL-separated string
├── environ — environment as NUL-separated string
├── exe — symlink to executable
├── fd/ — symlinks to open file descriptors
├── maps — virtual memory areas (human readable)
├── smaps — detailed per-VMA memory stats
├── status — key fields from task_struct in text form
├── stat — single-line scheduler stats (read by ps/top)
├── statm — memory sizes in pages
├── wchan — kernel function where process is sleeping
├── ns/ — symlinks to namespace inode numbers
└── task/ — subdirectory per thread (task/TID/)
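Two of these files are simple enough to parse by hand. The following sketch (helper names parse_status and parse_cmdline are made up here) turns status into a dictionary and splits cmdline on its NUL terminators. It assumes the documented formats: "Key:<tab>value" lines for status and NUL-terminated arguments for cmdline.

```python
def parse_status(text):
    """Turn the text of /proc/PID/status into a {field: value} dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields


def parse_cmdline(raw):
    """Split the bytes of /proc/PID/cmdline into a list of argv strings."""
    parts = raw.split(b"\0")
    if parts and parts[-1] == b"":   # drop the piece after the final NUL
        parts.pop()
    return [part.decode(errors="replace") for part in parts]


if __name__ == "__main__":
    with open("/proc/self/status") as status_file:
        status = parse_status(status_file.read())
    with open("/proc/self/cmdline", "rb") as cmdline_file:
        argv = parse_cmdline(cmdline_file.read())
    print(status["Name"], status["State"], status["Pid"], argv)
```

An empty cmdline, as seen for kernel threads and zombies, parses to an empty list.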
The kernel maintains tasks in two parallel data structures:
1. PID lookup structure (the pid_hash[] hash table historically; an IDR per PID namespace since 4.15): fast lookup by PID for kill(2), wait(2)
2. Doubly-linked task list (init_task.tasks): traversed by /proc and for_each_process()
Historical Context
The process concept solidified in the late 1960s with Multics and was cleanly formalized in
the UNIX paper by Ritchie and Thompson (1974). Early UNIX kept the PCB (called the "proc
table entry" and the "u area" or "user structure") in two physical structures; the u-area
was swapped out with the process while the proc entry stayed resident. Modern kernels
collapsed these into a single task_struct kept in kernel memory.
Linux originally had a simple circular linked list of task_struct pointers. The PID
hash table was added in 2.4 to handle the growth in process counts on server systems.
Namespaces were introduced in 2.4.19 (mount ns) and expanded progressively through 2.6.x,
enabling the container revolution.
Production Examples
Checking process state:
# See state column: R=running, S=sleeping, D=uninterruptible, Z=zombie, T=stopped
ps aux
# or more detail
cat /proc/$(pgrep nginx | head -1)/status | grep -E '^(State|Pid|PPid|Threads|VmRSS)'
Finding zombie processes:
ps aux | awk '$8 ~ /^Z/'
# Zombies are harmless unless they fill the PID table; find the parent and investigate:
ps -o ppid= -p <zombie_pid>
Checking pid_max and current usage:
cat /proc/sys/kernel/pid_max
# Count current processes (thread-group leaders only; each extra thread also consumes a PID):
ls /proc | grep -E '^[0-9]+$' | wc -l
Inspecting address space:
cat /proc/self/maps
pmap -x $$
Debugging Notes
- strace -p PID attaches via ptrace and moves the process to TASK_TRACED; this can change timing behaviour (observer effect).
- A process stuck in D state cannot be killed. Check cat /proc/PID/wchan to see which kernel function it is blocked in. Common culprits: NFS, broken block device, hung FUSE filesystem.
- Zombie accumulation indicates the parent is not calling wait(). Use ps --ppid PARENT_PID -o pid,stat,comm to list that parent's children and their states.
- gdb -p PID uses ptrace and will stop the process. Use gdb -p PID -batch -ex bt for a non-interactive stack trace that still pauses briefly.
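When zombies pile up because a parent never calls wait(), one common remedy is a non-blocking reap loop that the parent runs periodically or from its SIGCHLD handling path. The sketch below is one possible shape, assuming Python 3.9 or later for os.waitstatus_to_exitcode(); the function name reap_exited_children is hypothetical.

```python
import os
import time


def reap_exited_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:   # ECHILD: this process has no children
            break
        if pid == 0:                # children exist, but none has exited yet
            break
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped


if __name__ == "__main__":
    for code in (0, 3):
        if os.fork() == 0:
            os._exit(code)          # child exits at once and becomes a zombie
    time.sleep(0.2)                 # give both children time to exit
    print(reap_exited_children())   # a list of (pid, exit code) pairs
```

Because waitpid(-1, ...) collects any child, this pattern fits programs that own all of their children; code that shares a process with other child-spawning libraries may need to wait on specific PIDs instead.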
Security Implications
- PID reuse attacks: PID numbers are recycled. Code that sends signals or calls ptrace based on a stale PID may hit a new, unrelated process. Always validate identity via /proc/PID/exe or a file descriptor opened to /proc/PID before acting.
- /proc information disclosure: world-readable /proc/PID/ entries can leak memory layout (defeating ASLR) and command-line arguments (which often contain passwords). Use the hidepid=2 mount option on /proc to restrict visibility.
- task_struct corruption: kernel exploits that gain write-what-where primitives frequently target the credentials reachable from task_struct (the cred pointer and the uid/euid values behind it) to escalate privileges. KASLR and SMEP/SMAP mitigate but do not eliminate this class.
- Namespace escape: improper user namespace UID mapping can allow a container process to appear as root to the host kernel under certain conditions.
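The file-descriptor check for PID reuse can be sketched as follows, assuming a Linux /proc. A directory descriptor opened on /proc/PID stays bound to the process that owned the PID at open time, so lookups through it fail once that process has been reaped, even if the number has since been reassigned. The helper names are hypothetical, and a small window between the check and the kill() call remains, so this narrows the race rather than closing it.

```python
import os
import signal


def open_proc_handle(pid):
    """Open /proc/PID as a directory fd; it stays bound to that one process."""
    return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)


def handle_is_alive(proc_fd):
    """True while the process behind proc_fd exists (a zombie still counts)."""
    try:
        stat_fd = os.open("stat", os.O_RDONLY, dir_fd=proc_fd)
    except OSError:          # lookup fails once the task has been reaped
        return False
    try:
        return len(os.read(stat_fd, 1)) == 1
    except OSError:
        return False
    finally:
        os.close(stat_fd)


def signal_if_same_process(proc_fd, pid, sig=signal.SIGTERM):
    """Send sig to pid only if the handle opened earlier is still valid."""
    if not handle_is_alive(proc_fd):
        return False         # original process is gone; pid may be reused
    os.kill(pid, sig)        # a small check-to-kill race window remains
    return True
```

A caller would open the handle as soon as it learns the PID, keep it for as long as it cares about that process, and close it with os.close() afterwards.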
Performance Implications
- task_struct is allocated from a dedicated task_struct slab cache (kmem cache) to keep allocation fast and avoid fragmentation.
- The kernel stack (16 KB on x86-64 by default; 8 KB historically) is allocated alongside task_struct. Deep call stacks or heavy use of on-stack buffers risk stack overflow; the kernel uses CONFIG_VMAP_STACK to detect this via guard pages.
- Large process counts increase scheduler overhead: the CFS run queue is an RB-tree (O(log N) insertion/selection), but the /proc traversal is O(N).
- task_struct itself is ~7–9 KB on current kernels. On a system with 100,000 threads, that is ~700–900 MB of resident kernel memory just for PCBs, which is relevant on dense container hosts.
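The memory estimate is plain multiplication, and writing it out makes it easy to plug in other task counts. In this sketch the function name pcb_footprint_mib is made up, the 7 to 9 KB and 16 KB figures are the ones quoted above, and sizes are treated as KiB with the result in MiB, so the numbers come out slightly below the rounded decimal figures.

```python
def pcb_footprint_mib(tasks, task_struct_kib, stack_kib=0.0):
    """Approximate kernel memory, in MiB, pinned by per-task bookkeeping."""
    return tasks * (task_struct_kib + stack_kib) / 1024


if __name__ == "__main__":
    tasks = 100_000
    low = pcb_footprint_mib(tasks, 7)
    high = pcb_footprint_mib(tasks, 9)
    with_stacks = pcb_footprint_mib(tasks, 7, stack_kib=16)
    print(f"task_struct only: {low:.0f} to {high:.0f} MiB")
    print(f"with 16 KiB kernel stacks (low end): {with_stacks:.0f} MiB")
```

For 100,000 tasks this gives roughly 684 to 879 MiB for task_struct alone, and about 2.2 GiB at the low end once the 16 KB kernel stacks are counted as well.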
Failure Modes
| Failure | Symptom | Root cause |
|---|---|---|
| PID exhaustion | fork(): Resource temporarily unavailable | pid_max reached |
| Zombie flood | High zombie count in ps | Parent not calling wait(), often a bug in a thread pool or signal handler |
| D-state hang | Process unkillable, system may become sluggish | I/O blocked on unresponsive device or network filesystem |
| Stack overflow | Kernel oops: kernel stack overflow | Recursive kernel path exceeding 16 KB stack |
| task_struct slab exhaustion | OOM in kernel slab, new forks fail | Extreme process count, typically a runaway fork bomb |
Modern Usage
Container runtimes (Docker, containerd, CRI-O) create processes using clone() with
multiple CLONE_NEW* flags simultaneously, giving each container its own PID, network,
mount, and UTS namespace while sharing the host kernel's task_struct infrastructure.
systemd uses PR_SET_CHILD_SUBREAPER on service manager processes so that grandchild
processes are reparented to the service scope rather than to PID 1, enabling proper
resource tracking and cleanup.
BPF programs can attach to task_struct events (via sched_process_fork,
sched_process_exec tracepoints) to observe process lifecycle in production at near-zero
overhead.
Future Directions
- task_struct bloat: ongoing effort to reduce the size or modularize fields using per-architecture config and feature flags.
- Shadow stacks (Intel CET, CONFIG_X86_USER_SHADOW_STACK): a separate, write-protected stack that stores return addresses for user-space code, with its state tracked per task.
- Memory-safe process bookkeeping: proposals to rewrite portions of kernel/fork.c in Rust to eliminate a class of use-after-free bugs in PCB lifecycle code.
- Scheduler extensibility (sched_ext): merged in Linux 6.12, allows BPF programs to implement custom scheduling policies. The scheduler now reads task_struct fields via stable BPF accessors, driving more careful API stabilization of those fields.
Exercises
- State inspection: Write a shell one-liner that, given a PID, prints the process state character, the state's plain-English name, and the kernel function where it is sleeping (if any). Verify on a sleep(1) process and a dd if=/dev/sda process.
- Address space analysis: Run a simple C program that calls malloc() in a loop with power-of-two request sizes. After each allocation, print the heap boundary from /proc/self/maps. At what allocation size does glibc switch from brk to mmap? (Hint: M_MMAP_THRESHOLD)
- PID and TGID: Write a C program that creates two pthreads. In each thread print getpid(), gettid(), and read Pid: and Tgid: from /proc/self/status. Explain the relationship between all four values.
- Zombie lifecycle: Write a C program that forks a child, has the child exit immediately, and then sleeps for 30 seconds before calling wait(). Use ps aux and /proc to observe the zombie state during the sleep. Extend the program to show that the zombie's mm_struct has been freed (the Vm* lines are missing from /proc/ZPID/status and vsize is 0 in /proc/ZPID/stat).
- PID reuse: Write a program that forks, records the child PID, waits for the child to exit (but does NOT consume the zombie), then tries to send SIGKILL to that PID. What does the kernel return? Now consume the zombie and repeat. What happens if a new process has been assigned that PID in the meantime?
References
- Bovet & Cesati, Understanding the Linux Kernel, 3rd ed., Chapter 3 (Processes)
- include/linux/sched.h: task_struct definition (read with make TAGS or cscope)
- kernel/fork.c: copy_process(), dup_task_struct()
- fs/proc/array.c: /proc/PID/status generation
- Love, Linux Kernel Development, 3rd ed., Chapter 3
- Kerrisk, The Linux Programming Interface, Chapters 6, 26, 28
- man 5 proc: comprehensive /proc/PID/ field reference
- Linux kernel documentation: Documentation/filesystems/proc.rst