08 — Thread Pools

Overview

A thread pool is one of the most fundamental concurrency patterns in systems programming. The core idea is simple: creating and destroying OS threads is expensive (typically 10–100 µs per thread), so instead of creating a new thread for each unit of work, you create a fixed-size pool of threads once at startup and reuse them across many work items. Work is submitted to a shared queue; idle threads dequeue and execute work items; when work completes the thread returns to the pool, ready for the next item.

Thread pools appear everywhere: Java's ExecutorService and ForkJoinPool, Rust's rayon and tokio::task::spawn_blocking, Go's runtime scheduler (which is itself a thread pool backing goroutines), Linux's kernel workqueue subsystem, database connection pools, and web server worker pools. Understanding thread pools deeply means understanding queue design, sizing heuristics, backpressure, deadlock modes, and the difference between pools optimized for CPU-bound vs. I/O-bound workloads.

Prerequisites

OS thread concepts and lifecycle (see 08-threading-models/01-kernel-threads.md)
POSIX threads API (see 08-threading-models/05-posix-threads.md)
Synchronization primitives: mutexes, condition variables (see 10-synchronization/)
Basic understanding of queuing theory helpful for sizing

Historical Context

Thread pool patterns emerged in the late 1990s alongside the proliferation of multi-threaded server programming. Pre-pooling, servers typically followed one of two models: a single-threaded event loop (like early versions of Nginx before worker processes) or a thread-per-connection model (like Apache's prefork). Thread-per-connection is simple but collapses under high connection counts due to thread creation cost and memory pressure from stacks (typically 8 MB per thread on Linux by default).

Java's java.util.concurrent package (Java 5, 2004), designed by Doug Lea, standardized thread pool APIs for the JVM ecosystem. Lea's ThreadPoolExecutor and later ForkJoinPool remain the reference implementations that subsequent language ecosystems have modeled. The patterns and concepts Lea introduced — work stealing, fork/join, blocking queue backpressure — appear in virtually every subsequent thread pool implementation.

Thread Pool Lifecycle

                    submit(task)
User Code ---------> [Work Queue]
                         |
                         | (dequeue)
             +-----------+-----------+
             |           |           |
         Thread-1     Thread-2    Thread-3   ... Thread-N
         [running]    [idle]      [running]      [idle]
             |           |           |
             |    task arrives       |
             |           v           |
             |       [running]       |
             |                       |
          task done               task done
             |                       |
         [idle: wait               [idle: wait
          on queue]                 on queue]
             |
        (keepAliveTime expires if
         thread count > corePoolSize)
             |
          thread exits

Lifecycle states for a ThreadPoolExecutor-style pool:

RUNNING: Accepting new tasks, processing queued tasks
SHUTDOWN: Not accepting new tasks, processing remaining queued tasks (after shutdown())
STOP: Not accepting new tasks, not processing queued tasks, interrupting in-progress tasks (after shutdownNow())
TERMINATED: All threads have stopped

Sizing Heuristics

Thread pool sizing is one of the most commonly misunderstood aspects of concurrent programming. There is no universal formula — sizing depends on workload characteristics.

CPU-Bound Workloads

For purely CPU-bound tasks (no I/O, no waiting):

optimal_threads = N_cores

More threads than cores causes context switching overhead with no benefit. Hyperthreaded cores provide partial benefit — often N_cores * 1.5 is better than N_cores for CPU-bound mixed workloads because HT lets the CPU hide some pipeline stalls.

I/O-Bound Workloads

For workloads that spend time waiting (database queries, HTTP requests, file I/O):

optimal_threads = N_cores * (1 + W/C)

Where W is average wait time and C is average compute time per task. If a task spends 90% of its time waiting for a database response and 10% computing, W/C = 9, so:

optimal_threads = N_cores * (1 + 9) = 10 * N_cores

This is the "Little's Law" derived formula. In practice, empirical measurement beats formula: benchmark with varying thread counts and choose the count that maximizes throughput or minimizes latency at your target load.

Practical Approach

1. Start with N_cores threads
2. Profile: are threads CPU-bound or waiting?
3. If waiting > 50% of time: increase threads
4. Measure throughput and tail latency at each count
5. Stop when adding threads no longer helps (or hurts)
6. Leave headroom for other processes and kernel threads

Work Queue Implementations

The work queue is the central data structure of a thread pool. Its design determines throughput, latency, and backpressure behavior.

Bounded MPMC Queue (Multi-Producer, Multi-Consumer)

Producers ----+                 +---- Consumers
              |                 |
              v                 v
+--[  task  ][  task  ][  task  ]--+  <- bounded (max N items)
|   head                    tail   |
+----------------------------------+
  block/reject when full (backpressure)

A bounded queue provides backpressure: when the queue fills up, producers block (or receive a rejection). This is critical for system stability — without backpressure, producers can enqueue work faster than consumers process it, consuming unbounded memory.

Java's ArrayBlockingQueue is a circular buffer backed by an array. offer(task, timeout, unit) blocks up to a timeout if the queue is full, enabling graceful degradation.

Unbounded Queue

Accepts all submissions. Low latency under normal load, but under overload the queue grows without bound, consuming memory until OOM. Only appropriate when submission rate is guaranteed to not exceed processing rate. Java's LinkedBlockingQueue without a capacity argument is the classic mistake — it is effectively unbounded and can cause heap exhaustion.

SynchronousQueue

A queue with no internal storage — every put must pair with a concurrent take. In ThreadPoolExecutor, this means a submitted task is handed directly to a waiting thread; if no thread is available, a new thread is created (up to maxPoolSize), or the task is rejected. Used in Executors.newCachedThreadPool() for elastic thread creation.

Work-Stealing Deque

Each worker thread has its own deque. Producers push to the back of their thread-local deque. Workers pop from the back of their own deque (LIFO for hot cache lines). When idle, a worker steals from the front of another worker's deque (FIFO for fairness). This design reduces contention — workers mostly operate on their own deque without locks, and stealing is rare.

Worker-1 deque:    [front] task-A task-B [back] <-- Worker-1 pops from back
Worker-2 deque:    [front] task-C task-D [back]
Worker-3 (idle): steal from front of Worker-1 deque --> executes task-A

Java ExecutorService and ThreadPoolExecutor

ThreadPoolExecutor is the most configurable thread pool in the Java ecosystem:

ThreadPoolExecutor executor = new ThreadPoolExecutor(
    corePoolSize,    // Threads kept alive even when idle
    maxPoolSize,     // Maximum threads ever created
    keepAliveTime,   // How long excess threads wait before dying
    TimeUnit.SECONDS,
    workQueue,       // BlockingQueue<Runnable>
    threadFactory,   // Optional: custom thread names, daemon status
    rejectionHandler // What happens when queue full and at maxPoolSize
);

Key interactions: - If threads < corePoolSize: always create a new thread for submitted tasks - If threads >= corePoolSize: try to add to workQueue - If workQueue is full and threads < maxPoolSize: create a new thread - If workQueue is full and threads == maxPoolSize: invoke rejectionHandler

Rejection policies: - AbortPolicy (default): throw RejectedExecutionException - CallerRunsPolicy: run the task on the submitting thread (natural backpressure) - DiscardPolicy: silently drop the task - DiscardOldestPolicy: drop the oldest queued task and retry submission

         corePoolSize=4    maxPoolSize=8
              |                 |
threads: 0 ---> 4              4 ---> 8 (if queue full)
              |                 |
         always make         only make
         new threads          new threads
         until core           until max

Java ForkJoinPool

ForkJoinPool (Java 7, 2011) implements work-stealing for divide-and-conquer parallelism:

ForkJoinPool pool = ForkJoinPool.commonPool();
// or:
ForkJoinPool pool = new ForkJoinPool(parallelism);

class SumTask extends RecursiveTask<Long> {
    long[] data;
    int lo, hi;

    protected Long compute() {
        if (hi - lo < THRESHOLD) {
            return sumSequential(data, lo, hi);
        }
        int mid = (lo + hi) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();            // Push to deque
        return right.compute()  // Compute right inline
             + left.join();     // Wait for left (steal if needed)
    }
}

ForkJoinPool uses work-stealing. When a thread calls join() and the joined task is incomplete, the thread doesn't block — it steals and executes other tasks from its deque until the joined task completes. This avoids deadlock on task dependencies.

Java's Stream.parallel() uses ForkJoinPool.commonPool() — a JVM-wide shared pool, which is a source of subtle bugs when blocking I/O or long-running tasks are submitted to it.

Rust: rayon

Rayon provides data parallelism via a work-stealing thread pool, par_iter(), and join():

use rayon::prelude::*;

// Parallel map+sum over a large slice
let sum: i64 = data.par_iter()
    .map(|&x| expensive_compute(x))
    .sum();

// Fork/join style
rayon::join(
    || left_half.sort(),
    || right_half.sort(),
);

Rayon's thread pool is sized to num_cpus::get() by default and uses work-stealing. Rayon is designed strictly for CPU-bound work — blocking inside a rayon task stalls a worker thread. For mixed CPU+I/O parallelism, use Tokio with spawn_blocking for the CPU work.

Go's Implicit Thread Pool

Go does not expose a thread pool API because the runtime itself is a thread pool: GOMAXPROCS OS threads back all goroutines. The scheduler multiplexes goroutines onto OS threads using M:N threading.

Goroutines: G1 G2 G3 G4 G5 G6 G7 G8 ...  (millions possible)
                |           |
         +------+     +-----+
         |             |
     OS Thread M1   OS Thread M2    ... (GOMAXPROCS threads)

When a goroutine blocks on a syscall, the runtime parks it and replaces it with another goroutine on the same OS thread (via netpoller for network I/O, or by spawning an additional thread for blocking syscalls). From the user's perspective, goroutines never block the thread pool.

Linux Kernel Workqueue (cmwq)

The Linux kernel has its own thread pool: the Concurrency-Managed Workqueue (cmwq), introduced in Linux 2.6.36 to replace the older simpler workqueue system.

System Workqueues:
  system_wq          -- general purpose, concurrent
  system_highpri_wq  -- high priority tasks
  system_long_wq     -- expected to run for a long time
  system_unbound_wq  -- not bound to a specific CPU

Per-NUMA-node worker pools:
  [CPU0, CPU1] -> worker pool (normal priority)
  [CPU0, CPU1] -> worker pool (high priority)
  [CPU2, CPU3] -> worker pool (normal priority)
  ...

cmwq maintains per-NUMA-node worker pools with a minimum of one running worker per pool. If a worker blocks (waiting for a lock or sleeping), cmwq wakes or creates a new worker to maintain concurrency. If all workers are actively running, no new workers are created (bounded concurrency). Ordered workqueues guarantee FIFO ordering — one work item at a time.

// Submitting work to kernel workqueue
static void my_work_handler(struct work_struct *work)
{
    struct my_data *data = container_of(work, struct my_data, work);
    // do work
}

INIT_WORK(&my_data->work, my_work_handler);
schedule_work(&my_data->work);  // to system_wq
// or:
queue_work(my_private_wq, &my_data->work);

Thread Pool Exhaustion Deadlock

A critical failure mode unique to thread pools: pool exhaustion deadlock occurs when all threads in the pool are blocked waiting for results from tasks that themselves need a thread from the same pool to execute.

Thread Pool (size=4):
  Thread-1: blocked, waiting for Task-B result
  Thread-2: blocked, waiting for Task-C result
  Thread-3: blocked, waiting for Task-D result
  Thread-4: blocked, waiting for Task-E result

Task Queue:
  [Task-B] [Task-C] [Task-D] [Task-E] -- all waiting for a thread!

Result: DEADLOCK -- no thread available to execute the queued tasks

This pattern appears frequently in reactive systems using CompletableFuture (Java), Future (Scala), or callbacks with blocking .get() calls. The fix:

Use non-blocking async composition (.thenApply, .thenCompose) instead of .get() inside tasks
Use ForkJoinPool (task-stealing prevents deadlock on join)
Use separate pools for "outer" and "inner" tasks
Increase pool size (treats the symptom, not the cause)

Production Examples

Nginx Worker Processes

Nginx uses a fixed number of worker processes (not threads), each running a single-threaded event loop. The "thread pool" is at the process level. worker_processes auto sets count to number of CPUs. For blocking file I/O (sendfile), Nginx optionally uses a thread pool (aio threads) to avoid blocking the event loop worker.

Apache Tomcat

Tomcat's default connector uses a thread pool for HTTP request handling (maxThreads=200 default). Each thread handles one HTTP connection at a time. Under high concurrency this means 200 concurrent requests maximum — a common production bottleneck. The NIO/NIO2 connectors use non-blocking I/O with a smaller thread pool.

PostgreSQL

PostgreSQL uses a process-per-connection model (not a thread pool). max_connections limits concurrent connections. For connection pooling, PgBouncer or pgpool-II proxy connections, multiplexing many client connections onto fewer server processes.

Debugging Notes

Java Thread Pool Monitoring

ThreadPoolExecutor tpe = (ThreadPoolExecutor) executor;
System.out.println("Pool size: " + tpe.getPoolSize());
System.out.println("Active threads: " + tpe.getActiveCount());
System.out.println("Queue size: " + tpe.getQueue().size());
System.out.println("Completed tasks: " + tpe.getCompletedTaskCount());

Expose these metrics via JMX or Micrometer for production monitoring. Alert when getQueue().size() exceeds a threshold (growing queue indicates overload).

Linux: Observing Thread Pools

# Count threads in a process
cat /proc/<pid>/status | grep Threads
ls /proc/<pid>/task/ | wc -l

# Java thread dump (shows pool thread names and states)
jstack <pid>
# or in JDK 11+:
jcmd <pid> Thread.print

Detecting Exhaustion

# Java: detect threads blocked waiting
jstack <pid> | grep -A 5 "WAITING\|BLOCKED"

# If many threads show "waiting on condition" with the same lock,
# you likely have pool exhaustion or lock contention

Security Implications

Information leakage via thread-local state: Thread-local variables (e.g., ThreadLocal<SecurityContext> in Spring Security) must be explicitly cleared between tasks. If a thread returns to the pool without clearing ThreadLocal, the next task on that thread inherits the previous task's context.
CPU exhaustion attacks: An attacker submitting computationally expensive tasks can saturate a thread pool. Use timeouts on task execution and per-user rate limiting.
Thread pool isolation: Security-sensitive operations (encryption key derivation, audit log writes) should run on dedicated thread pools with bounded sizes and monitored queues, isolated from general application work pools.

Performance Implications

Queue contention: At very high submission rates, the queue's lock becomes a bottleneck. Solutions: multiple queue shards, lock-free queues (LMAX Disruptor), or work-stealing deques.
Thread migration and cache coherence: A task that runs on different threads each time it executes pays cache miss penalties. Work-stealing maintains LIFO order for the owning thread (temporal locality) and FIFO for stealers.
Stack memory: Default stack size is 8MB per thread on Linux. 1000 threads = 8GB virtual address space just for stacks (much of it uncommitted, but address space exhaustion is possible on 32-bit systems).
Spin-wait vs. park: Under light load, sleeping a thread on Condition.await() costs wake-up latency (~10 µs). Under very high-frequency tiny tasks, brief spinning before sleeping can reduce latency significantly. Java's ForkJoinPool implements adaptive spinning.

Failure Modes

Silent task rejection: DiscardPolicy or DiscardOldestPolicy silently drops tasks. The application behaves incorrectly without any error. Always use AbortPolicy in production and handle RejectedExecutionException explicitly.
Thread leak from uncaught exceptions: If a task throws an unchecked exception, the thread pool creates a replacement thread — but if the exception recurs on every task, you get a constant cycle of thread death and creation. Wrap Runnable.run() in try-catch and log exceptions from pool threads.
Slow tasks starving fast tasks: A thread pool running a mix of fast (1ms) and slow (10s) tasks will see slow tasks occupy threads, reducing throughput for fast tasks. Separate pools by expected task duration.

Future Directions

Virtual threads (Java 21, Project Loom): Java 21 ships virtual threads — lightweight threads that mount onto OS carrier threads. This largely replaces the thread pool sizing problem for I/O-bound Java code: one virtual thread per task is now practical. ThreadPoolExecutor is no longer needed for I/O-bound concurrency in Java 21+.
Structured concurrency (Java 21, StructuredTaskScope): Provides fork/join structured scope for virtual threads, analogous to Rust's JoinSet and Go's errgroup.
Lock-free work-stealing queues: Research into fully lock-free concurrent deque implementations (Chase-Lev, IdempotentWorkStealing) for higher throughput at extreme core counts.
NUMA-aware thread pools: Pinning threads to NUMA nodes and routing work to the node owning the relevant memory region reduces remote memory access latency. Linux's cmwq already does this; JVM-level implementations are emerging.

Exercises

Implement a thread pool in C using pthreads: a fixed array of N threads, a bounded queue with mutex + two condition variables (not-empty, not-full), pool_submit(fn, arg), and pool_destroy(). Test with 100,000 tasks.
Demonstrate pool exhaustion deadlock in Java: create a ThreadPoolExecutor with 4 threads and a LinkedBlockingQueue. Submit 4 tasks that each Future.get() on a new task submitted to the same pool. Observe the deadlock. Fix it using CallerRunsPolicy or separate pools.
Benchmark Java ThreadPoolExecutor with ArrayBlockingQueue(1000) vs LinkedBlockingQueue (unbounded) under a load that exceeds processing capacity for 5 seconds. Observe memory and throughput differences.
Use rayon::ThreadPoolBuilder to create a custom-sized pool. Measure parallel sort performance for a 100M-element integer array at pool sizes 1, 2, 4, 8, 16. Plot throughput vs. thread count and identify the inflection point.
Profile thread wake-up latency: have a producer wake a consumer via Condition.signal() and measure the time from signal to the consumer running (use System.nanoTime()). Repeat 1M times and plot the distribution. Identify p50, p99, p999 latencies.
Read Linux kernel/workqueue.c and trace the code path from schedule_work() to the point where a worker thread calls the work function. Identify where NUMA affinity is enforced.

References

Lea, D. (1999). Concurrent Programming in Java: Design Principles and Patterns, 2nd ed. Addison-Wesley.
Goetz, B. et al. (2006). Java Concurrency in Practice. Addison-Wesley. (Chapter 8: "Applying Thread Pools")
Herlihy, M. & Shavit, N. (2012). The Art of Multiprocessor Programming. Morgan Kaufmann. (Chapter 16: Work-Stealing)
rayon documentation: https://docs.rs/rayon/
Tokio spawn_blocking: https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html
Linux kernel workqueue documentation: https://www.kernel.org/doc/html/latest/core-api/workqueue.html
Chase, D. & Lev, Y. (2005). "Dynamic Circular Work-Stealing Deque." SPAA 2005.
Java 21 Virtual Threads JEP 444: https://openjdk.org/jeps/444
LMAX Disruptor (lock-free queue): https://lmax-exchange.github.io/disruptor/