01 — The Process Concept

Technical Overview

A process is a program in execution. This single sentence carries enormous density: a program is a static artifact — an ELF binary on disk; a process is the living, dynamic instance of that program inside the operating system, complete with its own address space, file descriptor table, signal handlers, scheduling context, and accounting data. The distinction matters because the same program can run as dozens of concurrent processes, each fully isolated from the others.

In Linux the fundamental data structure that represents a process, or a thread, is task_struct, defined in include/linux/sched.h. It currently spans roughly 700 fields (the exact count depends on kernel version and configuration) and is arguably the most central structure in the kernel. Linux unifies processes and threads under the same abstraction: both are tasks. The difference is purely in which resources are shared at creation time (covered in depth in 02-fork-and-exec.md).


Prerequisites

  • Basic understanding of C and pointers
  • Familiarity with the OS conceptual model (kernel/user space, system calls)
  • Awareness of virtual memory concepts (pages, page tables) — see 11-memory-management/

Core Content

Program vs. Process vs. Thread vs. Task

Concept   Kernel representation   Shared resources
-------   ---------------------   ----------------
Program   ELF file on disk        N/A (static)
Process   task_struct             Nothing shared with siblings
Thread    task_struct             VM, file table, signal handlers, FS context
Task      task_struct             Kernel's unified term for both of the above

POSIX threads inside a process share the same mm_struct (the address space descriptor), the same files_struct (open file table), and the same signal_struct. The kernel scheduler sees every task_struct equally; whether two tasks share memory is a detail of their creation flags, not a fundamental distinction in the scheduler.
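As a rough illustration of the "creation flags" point, the sketch below maps a few clone(2) flag bits to the resource each one shares. The flag values are the ones defined in include/uapi/linux/sched.h; the helper shared_resources() and the THREAD_FLAGS / FORK_FLAGS constants are names invented for this example, and THREAD_FLAGS is a simplified subset of what a real pthread_create() passes.

```python
# Flag values as defined in include/uapi/linux/sched.h.
CLONE_VM = 0x00000100       # share the address space (mm_struct)
CLONE_FS = 0x00000200       # share root, cwd and umask (fs_struct)
CLONE_FILES = 0x00000400    # share the descriptor table (files_struct)
CLONE_SIGHAND = 0x00000800  # share signal handlers (sighand_struct)
CLONE_THREAD = 0x00010000   # join the caller's thread group (same tgid)

SHARED_BY_FLAG = {
    CLONE_VM: "address space",
    CLONE_FS: "filesystem context",
    CLONE_FILES: "open file table",
    CLONE_SIGHAND: "signal handlers",
    CLONE_THREAD: "thread group",
}

# Hypothetical, simplified flag sets used only for illustration.
THREAD_FLAGS = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD
FORK_FLAGS = 0  # a plain fork() sets none of the sharing bits above


def shared_resources(flags):
    """List the resources a new task would share with its creator."""
    return [name for bit, name in SHARED_BY_FLAG.items() if flags & bit]


print("thread:", shared_resources(THREAD_FLAGS))
print("fork  :", shared_resources(FORK_FLAGS))
```

Both calls produce a task_struct; only the set of shared resources differs, which is the whole of the process/thread distinction from the scheduler's point of view.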


The Process Control Block: Linux task_struct

The PCB is the OS's bookkeeping record for one task. In classical OS textbooks it is a compact structure; in Linux production kernels it has grown to accommodate decades of features. Key field groups:

task_struct (~700 fields, include/linux/sched.h)
├── Identity
│   ├── pid_t  pid          — unique ID of this task (thread)
│   ├── pid_t  tgid         — thread group ID = PID of the process leader
│   ├── const struct cred *cred — credentials: real/effective UIDs and GIDs, capabilities
│   └── char   comm[16]     — short executable name
│
├── State
│   └── unsigned int __state — TASK_RUNNING, TASK_INTERRUPTIBLE, … (volatile long state before 5.14)
│
├── Memory
│   └── struct mm_struct *mm        — address space (NULL in kernel threads)
│
├── Files
│   ├── struct files_struct *files  — open file descriptor table
│   └── struct fs_struct    *fs     — root and cwd
│
├── Signals
│   ├── struct signal_struct   *signal   — shared signal state (whole process)
│   ├── struct sighand_struct  *sighand  — signal handlers (shared by threads)
│   └── sigset_t               blocked   — per-task signal mask
│
├── Scheduling
│   ├── int          prio, static_prio, normal_prio
│   ├── unsigned int policy    — SCHED_NORMAL, SCHED_FIFO, SCHED_RR, …
│   └── struct sched_entity se — CFS bookkeeping (vruntime, load weight)
│
├── Relationships
│   ├── struct task_struct *real_parent  — real parent (updated when the task is reparented)
│   ├── struct task_struct *parent       — recipient of SIGCHLD and wait(); the tracer while ptraced
│   ├── struct list_head    children     — list of child tasks
│   └── struct list_head    sibling      — position in parent's children list
│
├── Timing
│   ├── u64 utime, stime      — user/kernel time in nanoseconds
│   └── u64 start_time        — monotonic time of task creation
│
└── Architecture context
    └── struct thread_struct thread  — saved registers, stack pointer

The pid and tgid fields are subtle: for the main (and only) thread of a process they are equal. When pthread_create() calls clone() to create a second thread, the new task_struct gets a fresh pid (visible as gettid()) but the same tgid as the process leader. getpid() returns tgid; gettid() returns pid. This is why ps -e shows one line per process while /proc/PID/task/ has one entry per thread.
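The pid/tgid relationship can be observed from user space without writing C. The following is a minimal sketch, assuming Python 3.8 or later, where threading.get_native_id() reports the kernel thread ID (the value gettid() returns on Linux); the helper names ids() and ids_from_worker() are made up for this example.

```python
import os
import threading


def ids():
    """Return (getpid, gettid) as seen by the calling thread."""
    # os.getpid() corresponds to the kernel's tgid;
    # threading.get_native_id() corresponds to the kernel's per-task pid.
    return os.getpid(), threading.get_native_id()


def ids_from_worker():
    """Collect the same pair from inside a second thread."""
    result = []
    worker = threading.Thread(target=lambda: result.append(ids()))
    worker.start()
    worker.join()
    return result[0]


main_pid, main_tid = ids()
worker_pid, worker_tid = ids_from_worker()
print("main  : getpid =", main_pid, " gettid =", main_tid)
print("worker: getpid =", worker_pid, " gettid =", worker_tid)
```

Both threads report the same getpid() value (the shared tgid), while the thread IDs differ because each thread is its own task.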


Process State Machine

A task transitions among states based on I/O, scheduling, and signals. The canonical Linux states:

                     ┌──────────────────────────────────────┐
                     │                                      │
             fork()  │                                      │
  (parent) ──────────▼──────────┐                          │
                 TASK_RUNNING   │ ◄────── woken up ─────────┤
                (on run queue)  │                           │
                     │          │                           │
        scheduler    │          │ yield / time slice        │
        picks task   ▼          │ expiry                    │
                 TASK_RUNNING ──┘                           │
               (executing on CPU)                           │
                     │                                      │
        ┌────────────┼─────────────────────────────────┐   │
        │            │                                  │   │
        ▼            ▼                                  ▼   │
  TASK_INTERRUPTIBLE  TASK_UNINTERRUPTIBLE         TASK_STOPPED
  (waiting, can be   (waiting, ignores signals     (SIGSTOP received,
   woken by signal)   — disk I/O, mutex)            or ptrace attach)
        │            │                                  │
        └────────────┘                                  │
        wake_up_process()                               │ SIGCONT
        or signal delivery────────────────────────────▶│
                                                        ▼
                                                  TASK_RUNNING
                                                  (back on run queue)

  On exit():
        TASK_RUNNING ──► EXIT_ZOMBIE ──► EXIT_DEAD
                         (parent not      (parent called wait(),
                          yet waited)      resources freed)
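
State constant         Numeric value   Meaning
--------------         -------------   -------
TASK_RUNNING           0               Runnable (on run queue) or currently executing
TASK_INTERRUPTIBLE     1               Sleeping; woken by signal or event
TASK_UNINTERRUPTIBLE   2               Sleeping; immune to signals (critical I/O)
TASK_STOPPED           4               Stopped by SIGSTOP/SIGTSTP or ptrace
TASK_TRACED            8               Being traced by ptrace
EXIT_DEAD              16              wait() called; PCB being torn down
EXIT_ZOMBIE            32              Exited; PCB kept until parent calls wait()

The two EXIT_* values live in task_struct->exit_state rather than in the state field. The numbers shown are those of current kernels; 2.6-era kernels used 16 for EXIT_ZOMBIE and 32 for EXIT_DEAD.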

TASK_UNINTERRUPTIBLE sleep is the source of the dreaded D state in ps. A task stuck in D cannot be killed: SIGKILL is queued as pending, but an uninterruptible sleeper is not woken by signals, and a pending signal is only acted on when the task runs again and heads back toward user space, which it cannot do until the I/O completes. This is why a hung NFS mount can produce unkillable D-state processes.
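The state letter that ps prints comes from the third field of /proc/PID/stat. A small sketch of extracting it is shown below; parse_state() and the STATE_NAMES table are illustrative names, the letter meanings follow proc(5), and the sample line is made up rather than captured from a real system.

```python
# One-letter state codes as documented in proc(5).
STATE_NAMES = {
    "R": "running or runnable",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "T": "stopped",
    "t": "tracing stop",
    "Z": "zombie",
    "X": "dead",
    "I": "idle",
}


def parse_state(stat_line):
    """Return (state letter, description) from one /proc/PID/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')' instead of
    # splitting the whole line on whitespace.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    state = after_comm.split()[0]
    return state, STATE_NAMES.get(state, "unknown")


# Made-up sample with an awkward process name.
sample = "4097 (my (odd) name) D 4096 4097 4097 0 -1"
print(parse_state(sample))
```

On a live system the same function can be fed the contents of /proc/PID/stat; pairing it with /proc/PID/wchan shows where a D-state task is blocked.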


Process Address Space

Each process has a private virtual address space managed by a mm_struct and its list of vm_area_struct (VMAs):

  Virtual address space of a 64-bit process (not to scale)
  0xFFFFFFFFFFFFFFFF ┌─────────────────────────────┐
                     │  Kernel space (not mappable) │
  0xFFFF800000000000 ├─────────────────────────────┤
                     │         ...                 │
                     │       [gap — non-canonical]  │
                     │         ...                 │
  0x00007FFFFFFFFFFF ├─────────────────────────────┤
                     │    Stack (grows ▼)           │  ← ASLR randomized
                     │    ...                       │
                     ├─────────────────────────────┤
                     │    mmap region               │  ← shared libs, anonymous
                     │    (grows ▼ from top)        │     mmap, file-backed pages
                     ├─────────────────────────────┤
                     │    Heap (grows ▲)            │  ← brk/sbrk/mmap
                     ├─────────────────────────────┤
                     │    BSS segment               │  zero-initialized globals
                     ├─────────────────────────────┤
                     │    Data segment (.data)      │  initialized globals
                     ├─────────────────────────────┤
                     │    Text segment (.text)      │  executable code (r-x)
  0x0000000000400000 └─────────────────────────────┘ ← traditional ELF load addr

Segments visible via /proc/PID/maps:

  • text (.text): read+execute, loaded from ELF LOAD segment with PF_X
  • rodata (.rodata): read-only constants
  • data (.data): read+write, initialized global/static variables
  • BSS (.bss): zero-initialized, backed by the zero page until written (demand paging)
  • heap: anonymous pages obtained via brk(2) or mmap(MAP_ANONYMOUS)
  • mmap region: shared libraries, file mappings, anonymous large allocations
  • stack: one VMA per thread, typically 8 MB soft limit, grows down
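Each line of /proc/PID/maps describes one VMA as "start-end perms offset dev inode pathname". The sketch below parses that layout into a dictionary; parse_maps_line() is a name chosen for this example and the two sample lines are invented, not taken from a real process.

```python
def parse_maps_line(line):
    """Parse one /proc/PID/maps line into a dict describing the VMA."""
    # The pathname is the sixth field and may contain spaces, so limit
    # the split to five separators and keep the remainder intact.
    parts = line.split(None, 5)
    addr, perms, offset, dev, inode = parts[:5]
    path = parts[5].strip() if len(parts) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return {
        "start": start,
        "end": end,
        "size": end - start,
        "perms": perms,
        "offset": int(offset, 16),
        "path": path,  # empty for anonymous mappings
    }


# Invented sample lines in the documented format.
text_vma = parse_maps_line(
    "00400000-00452000 r-xp 00000000 08:02 173521      /usr/bin/dbus-daemon"
)
stack_vma = parse_maps_line(
    "7ffc1a2b3000-7ffc1a2d4000 rw-p 00000000 00:00 0       [stack]"
)
print(text_vma["perms"], text_vma["size"], text_vma["path"])
print(stack_vma["perms"], stack_vma["size"], stack_vma["path"])
```

Feeding it every line of /proc/self/maps gives a quick way to total mapped bytes per permission set or per backing file.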

ASLR (Address Space Layout Randomization) randomizes the base of the stack, mmap region, and (with PIE binaries) the text segment on each execution, making exploitation harder.


PID Limits and Namespaces

The system-wide maximum PID is controlled by:

/proc/sys/kernel/pid_max

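Limits:

  • Default: 32,768 (0x8000), so the highest PID, 32,767, fits in a signed 16-bit integer. The kernel raises this at boot on machines with many CPUs, and many current distributions raise it further through sysctl settings.
  • 32-bit kernels: 32,768 is also the maximum.
  • 64-bit kernels: configurable up to 4,194,304 (2^22, about 4 M).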

When no free PID is left below pid_max, fork() fails with EAGAIN ("Resource temporarily unavailable"). On busy container hosts running many short-lived processes, PID exhaustion is a real operational problem.

With PID namespaces (used by containers), each namespace has its own PID number space starting at 1. A process has one PID per namespace in its ancestry chain. The host kernel still assigns a unique global PID.
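The per-namespace PIDs are exposed in the NSpid line of /proc/PID/status (available since Linux 4.1 according to proc(5)), ordered from the outermost namespace to the innermost. A minimal sketch of reading it follows; ns_pids() is an illustrative helper and the status text is an invented excerpt.

```python
def ns_pids(status_text):
    """Return the task's PID in each nested PID namespace, outermost first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(p) for p in line.split()[1:]]
    return []  # field missing, for example on kernels older than 4.1


# Invented excerpt of /proc/4501/status for a containerized process.
sample_status = "Name:\tapp\nPid:\t4501\nPPid:\t4500\nNSpid:\t4501\t1\n"

pids = ns_pids(sample_status)
print("host PID:", pids[0], " PID inside the container:", pids[-1])
```

In this made-up case the process is PID 4501 on the host and PID 1 inside its own namespace.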


Process Tree Structure

Every process except PID 1 has a parent. The kernel enforces this invariant: when a parent exits before its children, the orphaned children are reparented to the nearest ancestor that has called prctl(PR_SET_CHILD_SUBREAPER, 1), or to PID 1 (init/systemd) if no subreaper exists.

$ pstree -p (abbreviated)
systemd(1)─┬─systemd-journal(312)
           ├─sshd(1024)──sshd(4096)──bash(4097)──pstree(4201)
           ├─nginx(2000)─┬─nginx(2001)
           │             ├─nginx(2002)
           │             └─nginx(2003)
           └─dockerd(3000)──containerd(3001)──containerd-shim(4500)──app(4501)

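Navigating the tree programmatically:

  • task_struct->real_parent: the process that created the task, or its adoptive parent after reparenting
  • task_struct->parent: the task that receives SIGCHLD and wait() notifications; normally equal to real_parent, but the tracer while the task is ptraced
  • task_struct->children: list head of direct children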
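The parent links and the reparenting rule can be modelled in a few lines. This is a deliberately simplified user-space model, not kernel code: PARENT_OF mirrors the pstree excerpt above, ancestors() and new_parent() are hypothetical helpers, and the real kernel also considers surviving threads of the exiting process and the init of the task's PID namespace.

```python
# child PID -> parent PID, taken from the pstree excerpt.
PARENT_OF = {
    312: 1, 1024: 1, 4096: 1024, 4097: 4096, 4201: 4097,
    2000: 1, 2001: 2000, 2002: 2000, 2003: 2000,
    3000: 1, 3001: 3000, 4500: 3001, 4501: 4500,
}


def ancestors(pid, parent_of):
    """Walk the parent links from pid up to the root of the tree."""
    chain = []
    while pid in parent_of:
        pid = parent_of[pid]
        chain.append(pid)
    return chain


def new_parent(exiting_pid, parent_of, subreapers=frozenset()):
    """Pick the task that adopts the children of exiting_pid.

    Simplified rule: the nearest ancestor marked as a child subreaper,
    otherwise PID 1.
    """
    for ancestor in ancestors(exiting_pid, parent_of):
        if ancestor in subreapers:
            return ancestor
    return 1


print(ancestors(4201, PARENT_OF))
print(new_parent(4500, PARENT_OF))
print(new_parent(4500, PARENT_OF, subreapers={3001}))
```

With no subreaper, the children of containerd-shim(4500) fall to PID 1; marking containerd(3001) as a subreaper makes it the adopter instead.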


Kernel vs. User Representation

From user space, the process is visible primarily through /proc/PID/:

/proc/4097/
├── cmdline        — argv as NUL-separated string
├── environ        — environment as NUL-separated string
├── exe            — symlink to executable
├── fd/            — symlinks to open file descriptors
├── maps           — virtual memory areas (human readable)
├── smaps          — detailed per-VMA memory stats
├── status         — key fields from task_struct in text form
├── stat           — single-line scheduler stats (read by ps/top)
├── statm          — memory sizes in pages
├── wchan          — kernel function where process is sleeping
├── ns/            — symlinks to namespace inode numbers
└── task/          — subdirectory per thread (task/TID/)
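Because cmdline and environ are NUL-separated rather than newline-separated, they need a small amount of decoding before use. A sketch under the assumption that the data is UTF-8 (undecodable bytes are replaced); split_nul() and parse_environ() are names invented here, and the byte strings are sample data.

```python
def split_nul(raw):
    """Split a NUL-separated /proc buffer (cmdline or environ) into strings."""
    if not raw:
        return []  # kernel threads have an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator of the final entry
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0")]


def parse_environ(raw):
    """Turn the contents of /proc/PID/environ into a dict."""
    pairs = (item.split("=", 1) for item in split_nul(raw) if "=" in item)
    return {key: value for key, value in pairs}


# Sample buffers shaped like /proc/PID/cmdline and /proc/PID/environ.
argv = split_nul(b"nginx\0-g\0daemon off;\0")
env = parse_environ(b"HOME=/root\0PATH=/usr/bin:/bin\0")
print(argv)
print(env)
```

On a live system the inputs would come from open("/proc/PID/cmdline", "rb").read() and the matching environ file, subject to the usual permission checks.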

The kernel maintains tasks in two parallel data structures:

  1. PID lookup structure: a hash table (pid_hash[]) in older kernels, a per-namespace IDR since 4.15; gives fast lookup by PID for kill(2), wait(2)
  2. Doubly-linked task list (init_task.tasks): traversed by for_each_process()


Historical Context

The process concept solidified in the late 1960s with Multics and was cleanly formalized in the UNIX paper by Ritchie and Thompson (1974). Early UNIX kept the PCB (called the "proc table entry" and the "u area" or "user structure") in two physical structures; the u-area was swapped out with the process while the proc entry stayed resident. Modern kernels collapsed these into a single task_struct kept in kernel memory.

Linux originally had a simple circular linked list of task_struct pointers. The PID hash table was added in 2.4 to handle the growth in process counts on server systems. Namespaces were introduced in 2.4.19 (mount ns) and expanded progressively through 2.6.x and 3.x (user namespaces became generally usable in 3.8), enabling the container revolution.


Production Examples

Checking process state:

# See state column: R=running, S=sleeping, D=uninterruptible, Z=zombie, T=stopped
ps aux
# or more detail
cat /proc/$(pgrep nginx | head -1)/status | grep -E '^(State|Pid|PPid|Threads|VmRSS)'

Finding zombie processes:

# STAT can carry suffixes (Z+, Zs, ...), so match on the first letter
ps aux | awk '$8 ~ /^Z/'
# Zombies are harmless unless they fill the PID table — find parent and investigate:
ps -o ppid= -p <zombie_pid>

Checking pid_max and current usage:

cat /proc/sys/kernel/pid_max
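# Count current processes (thread-group leaders only):
ls /proc | grep -E '^[0-9]+$' | wc -l
# Threads consume PIDs too; count every task:
ps -eL -o tid= | wc -l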

Inspecting address space:

cat /proc/self/maps
pmap -x $$

Debugging Notes

  • strace -p PID attaches via ptrace; the tracee enters TASK_TRACED at every syscall stop, which can change timing behaviour (observer effect).
  • A process stuck in D state cannot be killed. Check cat /proc/PID/wchan to see which kernel function it is blocked in. Common culprits: NFS, broken block device, hung FUSE filesystem.
  • Zombie accumulation indicates the parent is not calling wait(). Use ps -o pid,stat,comm --ppid PARENT_PID to list that parent's children and their states.
  • gdb -p PID uses ptrace and will stop the process. Use gdb -p PID -batch -ex bt for a non-interactive stack trace that still pauses briefly.

Security Implications

  • PID reuse attacks: PID numbers are recycled. Code that sends signals or does ptrace based on a stale PID may hit a new unrelated process. Validate identity via /proc/PID/exe or a file descriptor opened to /proc/PID before acting, or hold a pidfd (pidfd_open(2), Linux 5.3+) and signal through it with pidfd_send_signal(2).
  • /proc information disclosure: world-readable /proc/PID/ entries can leak memory layout (defeating ASLR) and command-line arguments (which often contain passwords). Use hidepid=2 mount option on /proc to restrict visibility.
  • task_struct corruption: kernel exploits that gain write-what-where primitives frequently target the task's credentials (the cred pointer in task_struct, or the uid/euid fields of the struct cred it points to) to escalate privileges. KASLR and SMEP/SMAP mitigate but do not eliminate this class.
  • Namespace escape: improper user namespace UID mapping can allow a container process to appear as root to the host kernel under certain conditions.
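For the PID reuse item above, one common user-space heuristic is to treat the pair (PID, start time) as the process identity, since a recycled PID will carry a different starttime (field 22 of /proc/PID/stat, per proc(5)). The sketch below shows only that comparison; identity() and same_process() are hypothetical helpers, the stat lines are invented, and a check of this kind is still racy between the comparison and the signal, so it narrows the window rather than closing it.

```python
def identity(stat_line):
    """Return (pid, starttime) from one /proc/PID/stat line."""
    pid = int(stat_line.split(" ", 1)[0])
    # Split after the last ')' because comm may contain spaces.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 22 (starttime) is rest[19].
    return pid, int(rest[19])


def same_process(stat_before, stat_after):
    """True if both snapshots describe the same process instance."""
    return identity(stat_before) == identity(stat_after)


# Invented snapshots: the second pair reuses PID 4501 for a new process.
original = "4501 (app) S 4500 4501 4501 0 -1 4194304 120 0 0 0 5 3 0 0 20 0 1 0 123456 1000000 200"
later = "4501 (app) S 4500 4501 4501 0 -1 4194304 180 0 0 0 9 4 0 0 20 0 1 0 123456 1000000 220"
recycled = "4501 (cron job) R 1 4501 4501 0 -1 4194304 10 0 0 0 0 0 0 0 20 0 1 0 987654 500000 50"

print(same_process(original, later))
print(same_process(original, recycled))
```

The first comparison matches because only counters changed; the second does not, because the start time differs even though the PID is the same.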

Performance Implications

  • task_struct is allocated from a dedicated task_struct slab cache (kmem cache) to keep allocation fast and avoid fragmentation.
  • The kernel stack (16 KB on x86-64 by default; 8 KB historically) is allocated separately for each task at creation time. Deep call stacks or heavy use of on-stack buffers risk stack overflow; with CONFIG_VMAP_STACK the stack is vmalloc-backed and guard pages turn an overflow into an immediate fault.
  • Large process counts increase scheduler overhead: the CFS run queue is an RB-tree (O(log N) insertion/selection), but the /proc traversal is O(N).
  • task_struct itself is ~7–9 KB on current kernels. On a system with 100,000 threads, that is ~700–900 MB of resident kernel memory just for PCBs — relevant on dense container hosts.
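The memory estimate in the last item is simple multiplication, worked through below. The 7 KB and 9 KB sizes and the 100,000-thread count come from the text; adding the 16 KB kernel stack per task is an extra assumption made here to show the combined footprint, and pcb_bytes() is a name invented for the example.

```python
KIB = 1024


def pcb_bytes(tasks, task_struct_kib, kernel_stack_kib=0):
    """Kernel memory pinned by per-task structures, in bytes."""
    return tasks * (task_struct_kib + kernel_stack_kib) * KIB


TASKS = 100_000
for size_kib in (7, 9):
    only_pcb = pcb_bytes(TASKS, size_kib)
    with_stack = pcb_bytes(TASKS, size_kib, kernel_stack_kib=16)
    print(
        f"task_struct {size_kib} KB: "
        f"{only_pcb / 1e6:.0f} MB for PCBs, "
        f"{with_stack / 1e6:.0f} MB including 16 KB kernel stacks"
    )
```

Under these assumptions the kernel stacks cost more than the task_struct instances themselves, which is worth keeping in mind when sizing dense hosts.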

Failure Modes

Failure                       Symptom                                          Root cause
-------                       -------                                          ----------
PID exhaustion                fork(): Resource temporarily unavailable         pid_max reached
Zombie flood                  High zombie count in ps                          Parent not calling wait(), often a bug in a thread pool or signal handler
D-state hang                  Process unkillable, system may become sluggish   I/O blocked on unresponsive device or network filesystem
Stack overflow                Kernel oops: kernel stack overflow               Recursive kernel path exceeding 16 KB stack
task_struct slab exhaustion   OOM in kernel slab, new forks fail               Extreme process count, typically a runaway fork bomb

Modern Usage

Container runtimes (Docker, containerd, CRI-O) create processes using clone() with multiple CLONE_NEW* flags simultaneously, giving each container its own PID, network, mount, and UTS namespace while sharing the host kernel's task_struct infrastructure.

systemd uses PR_SET_CHILD_SUBREAPER on service manager processes so that grandchild processes are reparented to the service scope rather than to PID 1, enabling proper resource tracking and cleanup.

BPF programs can attach to task lifecycle events (via the sched_process_fork and sched_process_exec tracepoints) to observe process creation and exec in production with low overhead.


Future Directions

  • task_struct bloat: ongoing effort to reduce the size or modularize fields using per-architecture config and feature flags.
  • Shadow stacks (Intel CET, CONFIG_X86_USER_SHADOW_STACK): a separate, hardware-protected stack that holds only return addresses. Linux supports it for user-space threads on x86-64, with the shadow stack state tracked per task.
  • Memory-safe process bookkeeping: proposals to rewrite portions of kernel/fork.c in Rust to eliminate class of use-after-free bugs in PCB lifecycle code.
  • Scheduler extensibility (sched_ext): merged in Linux 6.12, allows BPF programs to implement custom scheduling policies — the scheduler now reads task_struct fields via stable BPF accessors, driving more careful API stabilization of those fields.

Exercises

  1. State inspection: Write a shell one-liner that, given a PID, prints the process state character, the state's plain-English name, and the kernel function where it is sleeping (if any). Verify on a sleep(1) process and a dd if=/dev/sda process.

  2. Address space analysis: Write a simple C program that calls malloc() in a loop, doubling the request size on each iteration. After each allocation, print the heap boundary from /proc/self/maps. At what allocation size does glibc switch from brk to mmap? (Hint: M_MMAP_THRESHOLD)

  3. PID and TGID: Write a C program that creates two pthreads. In each thread print getpid(), gettid(), and read Pid: and Tgid: from /proc/self/status. Explain the relationship between all four values.

  4. Zombie lifecycle: Write a C program that forks a child, has the child exit immediately, and then sleeps for 30 seconds before calling wait(). Use ps aux and /proc to observe the zombie state during the sleep. Extend the program to show that the zombie's mm_struct has been freed (the Vm* lines are absent from /proc/ZPID/status).

  5. PID reuse: Write a program that forks, records the child PID, waits for the child to exit (but does NOT consume the zombie), then tries to send SIGKILL to that PID. What does the kernel return? Now consume the zombie and repeat. What happens if a new process has been assigned that PID in the meantime?


References

  • Bovet & Cesati, Understanding the Linux Kernel, 3rd ed. — Chapter 3 (Processes)
  • include/linux/sched.h: task_struct definition (read with make TAGS or cscope)
  • kernel/fork.c: copy_process(), dup_task_struct()
  • fs/proc/array.c: /proc/PID/status generation
  • Love, Linux Kernel Development, 3rd ed. — Chapter 3
  • Kerrisk, The Linux Programming Interface — Chapters 6, 26, 28
  • man 5 proc — comprehensive /proc/PID/ field reference
  • Linux kernel documentation: Documentation/filesystems/proc.rst