Skip to content

Section 08: Threading Models — Overview

Section Purpose and Scope

Concurrency is among the most consequential and subtle topics in systems programming. The threading model chosen by a language, runtime, or operating system determines the performance ceiling, the synchronization complexity, and the failure modes of every concurrent program built on top of it.

This section provides a rigorous taxonomy of threading models from the lowest-level kernel threads through user-space threading models (M:N, green threads, fibers), through coroutines, to the highest-level async/await pattern. For each, we analyze the scheduling mechanism, the stack management strategy, the synchronization requirements, and the mapping to kernel abstractions. We cover POSIX threads, Windows threads, Go goroutines, and Rust's async runtime as concrete production implementations.

Understanding this section is prerequisite for reasoning about performance at scale, diagnosing deadlocks and livelocks, designing lock-free data structures, and understanding why languages like Go and Rust make the threading choices they do.


Prerequisites

  • Section 03 (Kernel Fundamentals): task_struct, kernel memory
  • Section 07 (Process Management): clone() flags, address space sharing
  • Section 09 (Scheduling): will be cross-referenced; can be read in parallel

Learning Objectives

After completing this section you will be able to:

  1. Define kernel threads, user threads, and the M:N model with precision and explain the tradeoffs of each
  2. Explain why green threads, fibers, and coroutines exist and what problem they solve
  3. Trace the lifecycle of a POSIX thread from pthread_create() to thread exit
  4. Explain thread-local storage (TLS) implementation at the ABI and kernel level
  5. Describe thread pool design and the work-stealing scheduling algorithm
  6. Explain the Go goroutine scheduler (GMP model) and how it achieves M:N threading
  7. Explain Rust's async/await model and how it compiles to state machines
  8. Reason about the tradeoffs among kernel threads, goroutines, and async for specific workloads
  9. Identify and diagnose common threading bugs: deadlock, livelock, priority inversion, ABA problem

Architecture Overview

THREADING MODEL TAXONOMY

1:1 (Kernel Threads — Linux pthreads, Windows threads):
  User Thread 1  User Thread 2  User Thread 3
       │               │               │
       ▼               ▼               ▼
  KThread 1      KThread 2      KThread 3      ← each is a task_struct
  (scheduled by kernel)
  + Direct CPU scheduling by kernel
  + Blocking syscall doesn't block others
  - Thread creation is a syscall (expensive)
  - Kernel stack overhead (~8KB per thread)
  - Context switch crosses kernel boundary

N:1 (User Threads — Green Threads, early Java):
  Green Thread 1  Green Thread 2  Green Thread 3
       │               │               │
       ▼               ▼               ▼
       ─────────────────────────────────
              Single Kernel Thread
  + Cheap creation (no syscall)
  + Fast context switch (no kernel transition)
  - Blocking syscall blocks ALL green threads
  - No true parallelism on multi-core

M:N (Hybrid — Go goroutines, GHC, Erlang):
  G1  G2  G3  G4  G5  G6  ...  Gn      ← goroutines (M = potentially millions)
  │   │   │   │   │   │        │
  ├───┴───┤   ├───┴───┤        │
  │  OS    │   │  OS   │   ...  │
  │ Thread │   │Thread │        │
  │  (P1)  │   │  (P2) │   ...  │
  └────────┘   └───────┘        │     ← N = GOMAXPROCS kernel threads
  + Cheap creation (no syscall per goroutine)
  + Parallelism via N kernel threads
  + Blocking goroutine → scheduler parks it, runs another
  + Work stealing balances across OS threads

ASYNC/AWAIT (Rust Tokio, Python asyncio, JS):
  Task 1    Task 2    Task 3              ← async tasks (state machines)
  │         │         │
  └─────────┴─────────┘
         │
      Event Loop (single or multi-threaded)
         │
      Epoll / kqueue / io_uring
  + No stack per task (state machine)
  + Millions of concurrent tasks
  - Blocking code breaks the model
  - Colored function problem (async is infectious)


GO GOROUTINE SCHEDULER (GMP MODEL):

  ┌─────────────────────────────────────────────────────┐
  │  G = Goroutine (user-level coroutine)               │
  │  M = Machine (OS kernel thread)                     │
  │  P = Processor (logical CPU, owns a run queue)      │
  │                                                     │
  │  P0: [G1, G3, G7, ...] ←── work-steal ──► P1: [G2, G4]
  │   │                                         │       │
  │   M0                                        M1     M2│
  │   (running G1)                          (running G2) │
  │                                                     │
  │  Global run queue: [G8, G9, G10, ...]               │
  │  Network poller: blocked goroutines (epoll)          │
  └─────────────────────────────────────────────────────┘

Key Concepts

  • Kernel Thread: A thread whose existence is known to the kernel. Each has its own kernel stack, scheduling entity, and CPU affinity. On Linux, created via clone(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND). Scheduled by the kernel scheduler.
  • User Thread (Green Thread): A thread whose existence is managed entirely in user space. The kernel sees only one (or few) OS thread(s). A user-space scheduler multiplexes green threads. Cannot exploit multi-core parallelism unless M:N model is used. Fast creation/switch; vulnerable to blocking syscalls.
  • M:N Threading: A hybrid model where M user-level threads are multiplexed over N kernel threads. Combines cheap creation and scheduling with real parallelism. Complex to implement correctly (signal handling, thread-local storage, blocking syscall interception are all tricky).
  • Fiber: A cooperatively-scheduled user-space thread with a manually managed stack. No preemption — the fiber must explicitly yield. Very low overhead; used in game engines, coroutine libraries, and some OS schedulers (Windows UMS).
  • Coroutine: A generalization of subroutines that can be suspended and resumed. "Stackful" coroutines have their own stack (like fibers); "stackless" coroutines (Rust async fn, Python coroutines) compile to state machines that store only the minimal state needed between suspension points.
  • async/await: A language-level syntax for writing concurrent code without explicit thread management. The compiler transforms async fn into a state machine (a Future in Rust, a coroutine object in Python). An executor/event loop drives state machine transitions.
  • Thread Lifecycle: Created → Runnable → Running → Blocked (I/O, mutex, sleep) → Runnable → ... → Terminated. Joined by another thread (pthread_join) or detached (pthread_detach).
  • Thread-Local Storage (TLS): Per-thread data that appears as global variables but is unique to each thread. Implemented via the %fs segment register (x86-64 Linux) or TPIDR_EL0 (AArch64). The kernel sets these during context switch.
  • Thread Pool: A fixed set of worker threads that pull tasks from a shared queue, avoiding the overhead of per-task thread creation. Standard pattern for web servers, connection handlers, and CPU-bound work.
  • Work Stealing: A scheduling strategy where each worker thread has its own deque of tasks. Idle threads steal tasks from the tail of other threads' deques. Provides automatic load balancing with minimal contention. Used in Go's scheduler, Rayon (Rust), and Intel TBB.
  • POSIX Threads (pthreads): The standard C API for threading on Unix systems. Key calls: pthread_create, pthread_join, pthread_mutex_lock/unlock, pthread_cond_wait/signal, pthread_rwlock_*, pthread_key_create (TLS).
  • Goroutine: Go's fundamental unit of concurrency. A stackful coroutine managed by the Go runtime's M:N scheduler. Initial stack is 2–8KB (growable); creation cost ~300ns vs ~20µs for an OS thread. The runtime parks goroutines blocked on I/O via a network poller (epoll/kqueue/IOCP).
  • Rust async: Stackless coroutines. An async fn returns a Future. Futures are zero-cost when not polled. Tokio (the dominant async runtime) implements M:N scheduling over an epoll-based event loop. async is "colored" — it propagates through the call chain.

Thread Safety and Synchronization Overview

SYNCHRONIZATION PRIMITIVES TAXONOMY:

Mutual Exclusion:
  Mutex (blocking)     ← simple, correct, can deadlock
  Spinlock             ← busy-wait, use only in kernel/IRQ context
  RWLock               ← multiple readers OR one writer

Condition Synchronization:
  Condition Variable   ← wait for a predicate (always with a mutex)
  Semaphore            ← generalized counter (Dijkstra, 1965)
  Barrier              ← all threads reach point before any continue

Lock-Free (no blocking):
  Atomic operations    ← CAS, fetch-add, load/store with memory ordering
  RCU (Read-Copy-Update) ← readers never block; writers copy-then-switch

Common Bugs:
  Deadlock:     T1 holds A, waits B. T2 holds B, waits A. → Cycle
  Livelock:     Both retry simultaneously, neither progresses
  Priority Inv: Low-priority holds lock needed by high-priority thread
  ABA Problem:  CAS succeeds incorrectly because value changed A→B→A
  Data Race:    Unsynchronized concurrent access to shared mutable state

Major Historical Milestones

Year Milestone
1965 Dijkstra: semaphores — first formal synchronization primitive
1966 Conway: coroutines concept formalized
1979 Hoare: Communicating Sequential Processes (CSP) — foundational concurrency theory
1980 UNIX processes for concurrency — fork() as isolation
1987 C threads proposal (later becomes POSIX threads)
1988 Mach: first kernel with kernel threads separate from processes
1993 POSIX.1c (pthreads) standardized
1996 Java 1.0: green threads (N:1) — blocked syscall blocks all threads
1999 Erlang released open-source: processes = lightweight M:N actors
2001 Linux NGPT (Next Generation POSIX Threads): M:N attempt
2003 Linux NPTL (Native POSIX Thread Library): 1:1 threads, efficient
2003 RCU merged into Linux 2.6 — scalable lock-free reader pattern
2006 Go concurrency design begins (Pike, Thompson, Griesemer)
2009 Go 1.0 preview: goroutines and channels
2012 Go 1.0 released: GMP scheduler with work stealing
2014 Rust 1.0 alpha: fearless concurrency, ownership prevents data races
2016 Rust 1.0 stable: Send/Sync traits formalize thread safety
2018 Rust async/await stabilization begins (Future trait in 1.26)
2019 Rust async/await syntax stabilized (1.39)
2020 Go 1.14: asynchronous preemption of goroutines (signal-based)
2021 Linux io_uring + coroutine patterns in C become practical
2022 Linux 5.18: futex2 (more flexible futex API)
2023 Go 1.21: sync.OnceFunc, improved structured concurrency primitives

Modern Relevance and Production Use Cases

High-concurrency servers: Nginx, Envoy, and Redis use event-loop + async I/O models (N:1 or M:N). Understanding why they don't use one-thread-per-connection is fundamental to reasoning about their scalability.

Go in production: Go powers Docker, Kubernetes, Terraform, CockroachDB. The goroutine scheduler and GOMAXPROCS tuning directly affect throughput and tail latency. Understanding GMP is essential for profiling with pprof.

Rust async in production: Tokio underlies Cloudflare's infrastructure, Discord's messaging system, and many high-performance services. Pinning, Send bounds, and the executor model are not optional knowledge for Rust backend engineers.

Language runtime design: Every language runtime (Java, Python, Ruby, Haskell) has a threading model. Understanding the taxonomy allows you to evaluate GIL (Global Interpreter Lock in CPython), Java virtual threads (Project Loom, 1.21), and Kotlin coroutines.

Debugging concurrency bugs: helgrind (Valgrind), ThreadSanitizer (-fsanitize=thread), Go's race detector (-race) all detect data races. Deadlock detection requires understanding the lock acquisition graph.


File Map

08-threading-models/
├── 00-overview.md                     ← This file
├── 01-kernel-threads.md               ← task_struct, clone(), kernel stacks, kthreads
├── 02-user-threads-green-threads.md   ← N:1 model, context switching in user space
├── 03-mn-threading.md                 ← M:N model design, signal handling challenges
├── 04-fibers-and-coroutines.md        ← Stackful vs stackless, yield mechanics
├── 05-async-await.md                  ← Future/Promise model, state machine compilation
├── 06-thread-lifecycle.md             ← Create, join, detach, cancellation
├── 07-thread-local-storage.md         ← TLS ABI, %fs register, compiler support
├── 08-thread-pools.md                 ← Design patterns, sizing heuristics
├── 09-work-stealing.md                ← Algorithm, deque implementation, analysis
├── 10-posix-threads.md                ← pthreads API reference and idioms
├── 11-windows-threads.md              ← CreateThread, IOCP, fiber API
├── 12-go-goroutines.md                ← GMP scheduler internals, runtime/trace
├── 13-rust-async.md                   ← Future trait, Pin, Tokio executor, Waker
├── 14-synchronization-primitives.md   ← Mutex, condvar, semaphore, RCU, atomics
├── 15-concurrency-bugs.md             ← Deadlock, livelock, ABA, data races + tools

Cross-References

  • Section 07 (Process Management): clone() as the foundation for thread creation
  • Section 09 (Scheduling): How threads are scheduled; CFS treats threads and processes identically
  • Section 10 (Synchronization): Deep treatment of the synchronization primitives referenced here
  • Section 29 (Runtime Systems): Language runtimes that implement threading models
  • Section 44 (Rust and Memory Safety): Rust's ownership model and fearless concurrency

Essential: Files 01–06, 10, 14–15. Core threading model knowledge is required for any concurrent systems work.

Deep dive recommended: Files 12–13 for Go/Rust engineers. Files 07–09 for runtime and systems library developers. File 11 for Windows platform work.

Hands-on: Write a thread pool in C from scratch using pthreads. Profile a goroutine-heavy Go program with go tool trace. Build a simple Tokio async server and examine the generated state machine with cargo expand.

Estimated study time: 20–25 hours including practical exercises.