Section 08: Threading Models — Overview
Section Purpose and Scope
Concurrency is among the most consequential and subtle topics in systems programming. The threading model chosen by a language, runtime, or operating system determines the performance ceiling, the synchronization complexity, and the failure modes of every concurrent program built on top of it.
This section provides a rigorous taxonomy of threading models from the lowest-level kernel threads through user-space threading models (M:N, green threads, fibers), through coroutines, to the highest-level async/await pattern. For each, we analyze the scheduling mechanism, the stack management strategy, the synchronization requirements, and the mapping to kernel abstractions. We cover POSIX threads, Windows threads, Go goroutines, and Rust's async runtime as concrete production implementations.
Understanding this section is prerequisite for reasoning about performance at scale, diagnosing deadlocks and livelocks, designing lock-free data structures, and understanding why languages like Go and Rust make the threading choices they do.
Prerequisites
- Section 03 (Kernel Fundamentals):
task_struct, kernel memory - Section 07 (Process Management):
clone()flags, address space sharing - Section 09 (Scheduling): will be cross-referenced; can be read in parallel
Learning Objectives
After completing this section you will be able to:
- Define kernel threads, user threads, and the M:N model with precision and explain the tradeoffs of each
- Explain why green threads, fibers, and coroutines exist and what problem they solve
- Trace the lifecycle of a POSIX thread from
pthread_create()to thread exit - Explain thread-local storage (TLS) implementation at the ABI and kernel level
- Describe thread pool design and the work-stealing scheduling algorithm
- Explain the Go goroutine scheduler (GMP model) and how it achieves M:N threading
- Explain Rust's async/await model and how it compiles to state machines
- Reason about the tradeoffs among kernel threads, goroutines, and async for specific workloads
- Identify and diagnose common threading bugs: deadlock, livelock, priority inversion, ABA problem
Architecture Overview
THREADING MODEL TAXONOMY
1:1 (Kernel Threads — Linux pthreads, Windows threads):
User Thread 1 User Thread 2 User Thread 3
│ │ │
▼ ▼ ▼
KThread 1 KThread 2 KThread 3 ← each is a task_struct
(scheduled by kernel)
+ Direct CPU scheduling by kernel
+ Blocking syscall doesn't block others
- Thread creation is a syscall (expensive)
- Kernel stack overhead (~8KB per thread)
- Context switch crosses kernel boundary
N:1 (User Threads — Green Threads, early Java):
Green Thread 1 Green Thread 2 Green Thread 3
│ │ │
▼ ▼ ▼
─────────────────────────────────
Single Kernel Thread
+ Cheap creation (no syscall)
+ Fast context switch (no kernel transition)
- Blocking syscall blocks ALL green threads
- No true parallelism on multi-core
M:N (Hybrid — Go goroutines, GHC, Erlang):
G1 G2 G3 G4 G5 G6 ... Gn ← goroutines (M = potentially millions)
│ │ │ │ │ │ │
├───┴───┤ ├───┴───┤ │
│ OS │ │ OS │ ... │
│ Thread │ │Thread │ │
│ (P1) │ │ (P2) │ ... │
└────────┘ └───────┘ │ ← N = GOMAXPROCS kernel threads
+ Cheap creation (no syscall per goroutine)
+ Parallelism via N kernel threads
+ Blocking goroutine → scheduler parks it, runs another
+ Work stealing balances across OS threads
ASYNC/AWAIT (Rust Tokio, Python asyncio, JS):
Task 1 Task 2 Task 3 ← async tasks (state machines)
│ │ │
└─────────┴─────────┘
│
Event Loop (single or multi-threaded)
│
Epoll / kqueue / io_uring
+ No stack per task (state machine)
+ Millions of concurrent tasks
- Blocking code breaks the model
- Colored function problem (async is infectious)
GO GOROUTINE SCHEDULER (GMP MODEL):
┌─────────────────────────────────────────────────────┐
│ G = Goroutine (user-level coroutine) │
│ M = Machine (OS kernel thread) │
│ P = Processor (logical CPU, owns a run queue) │
│ │
│ P0: [G1, G3, G7, ...] ←── work-steal ──► P1: [G2, G4]
│ │ │ │
│ M0 M1 M2│
│ (running G1) (running G2) │
│ │
│ Global run queue: [G8, G9, G10, ...] │
│ Network poller: blocked goroutines (epoll) │
└─────────────────────────────────────────────────────┘
Key Concepts
- Kernel Thread: A thread whose existence is known to the kernel. Each has its own kernel stack, scheduling entity, and CPU affinity. On Linux, created via
clone(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND). Scheduled by the kernel scheduler. - User Thread (Green Thread): A thread whose existence is managed entirely in user space. The kernel sees only one (or few) OS thread(s). A user-space scheduler multiplexes green threads. Cannot exploit multi-core parallelism unless M:N model is used. Fast creation/switch; vulnerable to blocking syscalls.
- M:N Threading: A hybrid model where M user-level threads are multiplexed over N kernel threads. Combines cheap creation and scheduling with real parallelism. Complex to implement correctly (signal handling, thread-local storage, blocking syscall interception are all tricky).
- Fiber: A cooperatively-scheduled user-space thread with a manually managed stack. No preemption — the fiber must explicitly yield. Very low overhead; used in game engines, coroutine libraries, and some OS schedulers (Windows UMS).
- Coroutine: A generalization of subroutines that can be suspended and resumed. "Stackful" coroutines have their own stack (like fibers); "stackless" coroutines (Rust
async fn, Python coroutines) compile to state machines that store only the minimal state needed between suspension points. - async/await: A language-level syntax for writing concurrent code without explicit thread management. The compiler transforms
async fninto a state machine (aFuturein Rust, a coroutine object in Python). An executor/event loop drives state machine transitions. - Thread Lifecycle: Created → Runnable → Running → Blocked (I/O, mutex, sleep) → Runnable → ... → Terminated. Joined by another thread (
pthread_join) or detached (pthread_detach). - Thread-Local Storage (TLS): Per-thread data that appears as global variables but is unique to each thread. Implemented via the
%fssegment register (x86-64 Linux) orTPIDR_EL0(AArch64). The kernel sets these during context switch. - Thread Pool: A fixed set of worker threads that pull tasks from a shared queue, avoiding the overhead of per-task thread creation. Standard pattern for web servers, connection handlers, and CPU-bound work.
- Work Stealing: A scheduling strategy where each worker thread has its own deque of tasks. Idle threads steal tasks from the tail of other threads' deques. Provides automatic load balancing with minimal contention. Used in Go's scheduler, Rayon (Rust), and Intel TBB.
- POSIX Threads (pthreads): The standard C API for threading on Unix systems. Key calls:
pthread_create,pthread_join,pthread_mutex_lock/unlock,pthread_cond_wait/signal,pthread_rwlock_*,pthread_key_create(TLS). - Goroutine: Go's fundamental unit of concurrency. A stackful coroutine managed by the Go runtime's M:N scheduler. Initial stack is 2–8KB (growable); creation cost ~300ns vs ~20µs for an OS thread. The runtime parks goroutines blocked on I/O via a network poller (epoll/kqueue/IOCP).
- Rust async: Stackless coroutines. An
async fnreturns aFuture. Futures are zero-cost when not polled. Tokio (the dominant async runtime) implements M:N scheduling over an epoll-based event loop.asyncis "colored" — it propagates through the call chain.
Thread Safety and Synchronization Overview
SYNCHRONIZATION PRIMITIVES TAXONOMY:
Mutual Exclusion:
Mutex (blocking) ← simple, correct, can deadlock
Spinlock ← busy-wait, use only in kernel/IRQ context
RWLock ← multiple readers OR one writer
Condition Synchronization:
Condition Variable ← wait for a predicate (always with a mutex)
Semaphore ← generalized counter (Dijkstra, 1965)
Barrier ← all threads reach point before any continue
Lock-Free (no blocking):
Atomic operations ← CAS, fetch-add, load/store with memory ordering
RCU (Read-Copy-Update) ← readers never block; writers copy-then-switch
Common Bugs:
Deadlock: T1 holds A, waits B. T2 holds B, waits A. → Cycle
Livelock: Both retry simultaneously, neither progresses
Priority Inv: Low-priority holds lock needed by high-priority thread
ABA Problem: CAS succeeds incorrectly because value changed A→B→A
Data Race: Unsynchronized concurrent access to shared mutable state
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1965 | Dijkstra: semaphores — first formal synchronization primitive |
| 1966 | Conway: coroutines concept formalized |
| 1979 | Hoare: Communicating Sequential Processes (CSP) — foundational concurrency theory |
| 1980 | UNIX processes for concurrency — fork() as isolation |
| 1987 | C threads proposal (later becomes POSIX threads) |
| 1988 | Mach: first kernel with kernel threads separate from processes |
| 1993 | POSIX.1c (pthreads) standardized |
| 1996 | Java 1.0: green threads (N:1) — blocked syscall blocks all threads |
| 1999 | Erlang released open-source: processes = lightweight M:N actors |
| 2001 | Linux NGPT (Next Generation POSIX Threads): M:N attempt |
| 2003 | Linux NPTL (Native POSIX Thread Library): 1:1 threads, efficient |
| 2003 | RCU merged into Linux 2.6 — scalable lock-free reader pattern |
| 2006 | Go concurrency design begins (Pike, Thompson, Griesemer) |
| 2009 | Go 1.0 preview: goroutines and channels |
| 2012 | Go 1.0 released: GMP scheduler with work stealing |
| 2014 | Rust 1.0 alpha: fearless concurrency, ownership prevents data races |
| 2016 | Rust 1.0 stable: Send/Sync traits formalize thread safety |
| 2018 | Rust async/await stabilization begins (Future trait in 1.26) |
| 2019 | Rust async/await syntax stabilized (1.39) |
| 2020 | Go 1.14: asynchronous preemption of goroutines (signal-based) |
| 2021 | Linux io_uring + coroutine patterns in C become practical |
| 2022 | Linux 5.18: futex2 (more flexible futex API) |
| 2023 | Go 1.21: sync.OnceFunc, improved structured concurrency primitives |
Modern Relevance and Production Use Cases
High-concurrency servers: Nginx, Envoy, and Redis use event-loop + async I/O models (N:1 or M:N). Understanding why they don't use one-thread-per-connection is fundamental to reasoning about their scalability.
Go in production: Go powers Docker, Kubernetes, Terraform, CockroachDB. The goroutine scheduler and GOMAXPROCS tuning directly affect throughput and tail latency. Understanding GMP is essential for profiling with pprof.
Rust async in production: Tokio underlies Cloudflare's infrastructure, Discord's messaging system, and many high-performance services. Pinning, Send bounds, and the executor model are not optional knowledge for Rust backend engineers.
Language runtime design: Every language runtime (Java, Python, Ruby, Haskell) has a threading model. Understanding the taxonomy allows you to evaluate GIL (Global Interpreter Lock in CPython), Java virtual threads (Project Loom, 1.21), and Kotlin coroutines.
Debugging concurrency bugs: helgrind (Valgrind), ThreadSanitizer (-fsanitize=thread), Go's race detector (-race) all detect data races. Deadlock detection requires understanding the lock acquisition graph.
File Map
08-threading-models/
├── 00-overview.md ← This file
├── 01-kernel-threads.md ← task_struct, clone(), kernel stacks, kthreads
├── 02-user-threads-green-threads.md ← N:1 model, context switching in user space
├── 03-mn-threading.md ← M:N model design, signal handling challenges
├── 04-fibers-and-coroutines.md ← Stackful vs stackless, yield mechanics
├── 05-async-await.md ← Future/Promise model, state machine compilation
├── 06-thread-lifecycle.md ← Create, join, detach, cancellation
├── 07-thread-local-storage.md ← TLS ABI, %fs register, compiler support
├── 08-thread-pools.md ← Design patterns, sizing heuristics
├── 09-work-stealing.md ← Algorithm, deque implementation, analysis
├── 10-posix-threads.md ← pthreads API reference and idioms
├── 11-windows-threads.md ← CreateThread, IOCP, fiber API
├── 12-go-goroutines.md ← GMP scheduler internals, runtime/trace
├── 13-rust-async.md ← Future trait, Pin, Tokio executor, Waker
├── 14-synchronization-primitives.md ← Mutex, condvar, semaphore, RCU, atomics
├── 15-concurrency-bugs.md ← Deadlock, livelock, ABA, data races + tools
Cross-References
- Section 07 (Process Management):
clone()as the foundation for thread creation - Section 09 (Scheduling): How threads are scheduled; CFS treats threads and processes identically
- Section 10 (Synchronization): Deep treatment of the synchronization primitives referenced here
- Section 29 (Runtime Systems): Language runtimes that implement threading models
- Section 44 (Rust and Memory Safety): Rust's ownership model and fearless concurrency
Recommended Depth of Study
Essential: Files 01–06, 10, 14–15. Core threading model knowledge is required for any concurrent systems work.
Deep dive recommended: Files 12–13 for Go/Rust engineers. Files 07–09 for runtime and systems library developers. File 11 for Windows platform work.
Hands-on: Write a thread pool in C from scratch using pthreads. Profile a goroutine-heavy Go program with go tool trace. Build a simple Tokio async server and examine the generated state machine with cargo expand.
Estimated study time: 20–25 hours including practical exercises.