POSIX Threads

Technical Overview

POSIX threads (pthreads) define the standard API for multithreaded programming on UNIX-like systems. The standard (IEEE 1003.1c-1995) specifies thread creation and management, synchronization primitives (mutexes, condition variables, read-write locks, barriers), thread-specific data, and scheduling interfaces.

pthreads is the lingua franca of native concurrent programming on Linux, macOS, FreeBSD, and every other POSIX system. Understanding pthreads is a prerequisite for understanding how higher-level threading abstractions (Java threads, the Go runtime, Python threading) are implemented.

Prerequisites

Process model (fork/exec/wait)
Virtual memory and address spaces
Kernel threads (01-kernel-threads.md)
Basic mutex/semaphore concepts
C language proficiency

Core Concepts

Thread Creation and Lifecycle

#include <pthread.h>
#include <stdint.h>   // intptr_t
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // strerror

// Thread function signature: void* (*)(void*)
void* worker_function(void* arg) {
    int thread_num = *(int*)arg;
    printf("Thread %d started\n", thread_num);

    // Do work...

    printf("Thread %d exiting\n", thread_num);
    return (void*)(intptr_t)thread_num;  // return value
}

int main() {
    pthread_t threads[4];
    int args[4];

    for (int i = 0; i < 4; i++) {
        args[i] = i;
        int rc = pthread_create(
            &threads[i],      // thread handle (output)
            NULL,             // attributes (NULL = default)
            worker_function,  // start function
            &args[i]          // argument
        );
        if (rc != 0) {
            fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
            exit(1);
        }
    }

    // Join: wait for threads to complete, collect return values
    for (int i = 0; i < 4; i++) {
        void* retval;
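        int rc = pthread_join(threads[i], &retval);
        if (rc != 0) {
            fprintf(stderr, "pthread_join failed: %s\n", strerror(rc));
            exit(1);
        }
        printf("Thread %d returned: %ld\n", i, (long)(intptr_t)retval);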
    }

    return 0;
}

pthread Lifecycle Diagram
==========================

pthread_create() ──────> RUNNABLE ──> [scheduled] ──> RUNNING
                              ^                            |
                              |        preempted          |
                              +────────────────────────────+
                                                           |
                                            pthread_mutex_lock()
                                            mutex is locked
                                                           v
                                                      BLOCKED
                                            (waiting for mutex)
                                                           |
                                            mutex released by holder
                                                           v
                                                       RUNNABLE
                                                           |
                                                    pthread_exit()
                                                    or return
                                                           v
                                                       ZOMBIE
                                              (until pthread_join()
                                               or detached and done)

Detached vs. Joinable Threads

// Joinable thread (default): some thread must eventually call pthread_join()
pthread_t tid;
pthread_create(&tid, NULL, func, NULL);
pthread_join(tid, NULL);  // waits for thread, frees resources

// Detached thread: resources freed automatically when thread exits
pthread_t tid;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
pthread_create(&tid, &attr, func, NULL);
pthread_attr_destroy(&attr);
// No join needed — but also no return value available

// Detach an already-running thread:
pthread_create(&tid, NULL, func, NULL);
pthread_detach(tid);  // now detached

// Self-detach from within thread:
pthread_detach(pthread_self());

Detached threads are appropriate for fire-and-forget tasks. Failing to join a joinable thread leaks its thread control block (TCB) and stack until the process exits, which is a common resource leak.
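As a minimal sketch of the joinable case, the following shows a thread handing a heap-allocated result back through pthread_join(). The names square_worker and run_square are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical worker: squares its input and returns a heap-allocated result. */
static void *square_worker(void *arg) {
    int n = *(int *)arg;
    int *out = malloc(sizeof *out);
    if (out != NULL)
        *out = n * n;
    return out;                  /* handed to whoever calls pthread_join() */
}

/* Creates a joinable thread, joins it, and returns the squared value
 * (or -1 on failure). The join is what releases the thread's resources. */
int run_square(int n) {
    pthread_t tid;
    void *ret = NULL;

    if (pthread_create(&tid, NULL, square_worker, &n) != 0)
        return -1;
    int rc = pthread_join(tid, &ret);   /* n stays alive until this returns */
    assert(rc == 0);
    if (ret == NULL)
        return -1;

    int value = *(int *)ret;
    free(ret);                          /* the joiner owns the result */
    return value;
}
```

Calling run_square(7) should yield 49. A detached thread could not return its result this way, because nothing is allowed to join it.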

Mutexes

Mutexes are the fundamental mutual exclusion primitive:

#include <pthread.h>

// Static initialization:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Dynamic initialization (required for non-static or custom attributes):
pthread_mutex_t lock;
pthread_mutexattr_t attr;
pthread_mutexattr_init(&attr);

// Mutex types (choose one; a later settype call overrides an earlier one):
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);      // no error checks; relocking deadlocks
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);  // relock by the owner returns EDEADLK
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);   // owner may relock; needs matching unlocks
pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_DEFAULT);     // the default; maps to NORMAL on glibc

// Priority inheritance (POSIX.1-2001):
pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);  // PI mutex

pthread_mutex_init(&lock, &attr);
pthread_mutexattr_destroy(&attr);

// Usage:
pthread_mutex_lock(&lock);      // blocks if locked
// ... critical section ...
pthread_mutex_unlock(&lock);

// Trylock (non-blocking):
if (pthread_mutex_trylock(&lock) == 0) {
    // ... critical section ...
    pthread_mutex_unlock(&lock);
} else {
    // EBUSY: lock held by another thread
}

// Timed lock (POSIX.1-2001); the timeout is an absolute CLOCK_REALTIME time:
struct timespec timeout;
clock_gettime(CLOCK_REALTIME, &timeout);
timeout.tv_sec += 1;  // 1 second from now
int rc = pthread_mutex_timedlock(&lock, &timeout);
if (rc == ETIMEDOUT) { /* deadline exceeded */ }

// Cleanup:
pthread_mutex_destroy(&lock);

Condition Variables

Condition variables enable threads to wait for a condition to become true, avoiding busy-waiting:

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
int             ready = 0;

// Producer thread:
void* producer(void* arg) {
    // ... do some work ...

    pthread_mutex_lock(&mutex);
    ready = 1;                    // set the condition
    pthread_cond_signal(&cond);   // wake ONE waiting thread
    // or: pthread_cond_broadcast() to wake ALL waiting threads
    pthread_mutex_unlock(&mutex);
    return NULL;
}

// Consumer thread:
void* consumer(void* arg) {
    pthread_mutex_lock(&mutex);

    // ALWAYS loop: spurious wakeups are allowed by POSIX!
    while (!ready) {
        // Atomically: unlock mutex + sleep on condition variable
        // When woken: re-acquire mutex before returning
        pthread_cond_wait(&cond, &mutex);
    }

    // Condition is true AND mutex is held
    printf("Condition is ready!\n");
    pthread_mutex_unlock(&mutex);
    return NULL;
}

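The while (!ready) loop is mandatory. POSIX allows spurious wakeups, so pthread_cond_wait() may return even if no signal was sent, and another thread may consume the condition between the signal and the moment the waiter reacquires the mutex. Code that uses if (!ready) instead of while (!ready) can therefore proceed while the condition is false.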

// Timed wait:
struct timespec deadline;
clock_gettime(CLOCK_REALTIME, &deadline);
deadline.tv_sec += 5;  // 5 seconds from now

pthread_mutex_lock(&mutex);
while (!ready) {
    int rc = pthread_cond_timedwait(&cond, &mutex, &deadline);
    if (rc == ETIMEDOUT) {
        printf("Timed out waiting for condition\n");
        break;
    }
}
pthread_mutex_unlock(&mutex);

Read-Write Locks

When reads are far more common than writes, a read-write lock allows concurrent readers but exclusive writers:

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;

// Multiple readers can hold simultaneously:
pthread_rwlock_rdlock(&rwlock);    // blocks only if writer holds
// ... read shared data ...
pthread_rwlock_unlock(&rwlock);

// Only one writer, no readers while writing:
pthread_rwlock_wrlock(&rwlock);    // blocks if any reader or writer holds
// ... modify shared data ...
pthread_rwlock_unlock(&rwlock);

// Trylock variants:
pthread_rwlock_tryrdlock(&rwlock);  // returns EBUSY if can't acquire immediately
pthread_rwlock_trywrlock(&rwlock);

Writer starvation: if readers continuously acquire the lock, writers never get exclusive access. The pthread_rwlock implementation is allowed to give priority to writers (but not required to). Check your specific platform's behavior if writer starvation matters.

Barriers

Barriers synchronize a fixed number of threads — all must arrive before any continue:

pthread_barrier_t barrier;
pthread_barrier_init(&barrier, NULL, N_THREADS);  // wait for N threads

void* worker(void* arg) {
    // Phase 1 work:
    do_phase1_work();

    // Wait for all threads to complete phase 1:
    int rc = pthread_barrier_wait(&barrier);
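    if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {
        // Exactly one (unspecified) thread gets SERIAL_THREAD; the others get 0
        // Convention: this thread does post-barrier cleanup/merge
        merge_phase1_results();
    }
    // Caution: the other threads are already past the barrier at this point.
    // If phase 2 reads the merged results, add a second pthread_barrier_wait() here.

    // Phase 2 work (all threads guaranteed past phase 1):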
    do_phase2_work();

    pthread_barrier_wait(&barrier);  // wait again before phase 3

    return NULL;
}

Thread-Specific Data (TSD)

Thread-specific data provides per-thread storage accessible through a global key:

// Create a key (once, before threads start):
pthread_key_t tsd_key;
pthread_once_t key_once = PTHREAD_ONCE_INIT;

void create_key(void) {
    // destructor called when thread exits (for cleanup):
    pthread_key_create(&tsd_key, free);  // free() as destructor
}

// In thread code:
void* thread_func(void* arg) {
    pthread_once(&key_once, create_key);  // init key exactly once

    // Allocate per-thread data:
    int* thread_data = malloc(sizeof(int));
    if (thread_data == NULL) return NULL;
    *thread_data = 42;  // example value (pthread_t is opaque, so don't store pthread_self() in an int)

    // Associate with key for THIS thread:
    pthread_setspecific(tsd_key, thread_data);

    // Retrieve (later, possibly in a different function):
    int* data = pthread_getspecific(tsd_key);
    printf("My thread data: %d\n", *data);

    return NULL;
    // destructor (free()) called automatically with the data pointer
}

pthread_once ensures the key creation runs exactly once, even if multiple threads call pthread_once() simultaneously — it's a portable "run-once" primitive.

Modern C code often uses _Thread_local / __thread instead of TSD for simplicity, but TSD is required when the destructor callback is needed for cleanup.
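To illustrate the destructor behavior that distinguishes TSD from _Thread_local, the sketch below counts how often the key's destructor runs. Each thread lazily allocates a counter, and the destructor frees it when the thread exits. All names (counter_key, thread_counter, run_tsd_demo) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define MAX_TSD_THREADS 16

static pthread_key_t counter_key;
static pthread_once_t counter_once = PTHREAD_ONCE_INIT;
static atomic_int destructor_calls;

/* Runs in the exiting thread, with that thread's non-NULL value. */
static void destroy_counter(void *p) {
    free(p);
    atomic_fetch_add(&destructor_calls, 1);
}

static void make_key(void) {
    int rc = pthread_key_create(&counter_key, destroy_counter);
    assert(rc == 0);
}

/* Returns this thread's counter, allocating it on first use. */
static int *thread_counter(void) {
    pthread_once(&counter_once, make_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        if (p != NULL && pthread_setspecific(counter_key, p) != 0) {
            free(p);
            p = NULL;
        }
    }
    return p;
}

static void *tsd_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 3; i++) {
        int *c = thread_counter();   /* same pointer on every call */
        if (c != NULL)
            (*c)++;
    }
    return NULL;
}

/* Runs nthreads workers and returns the number of destructor calls. */
int run_tsd_demo(int nthreads) {
    pthread_t tids[MAX_TSD_THREADS];
    int started = 0;
    if (nthreads > MAX_TSD_THREADS)
        nthreads = MAX_TSD_THREADS;
    atomic_store(&destructor_calls, 0);
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tids[started], NULL, tsd_worker, NULL) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return atomic_load(&destructor_calls);
}
```

With four threads the destructor should run four times, once per thread that stored a non-NULL value.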

Thread Attributes

pthread_attr_t attr;
pthread_attr_init(&attr);

// Stack size (the glibc default follows RLIMIT_STACK, typically 8 MiB on Linux):
pthread_attr_setstacksize(&attr, 512 * 1024);  // 512KB stack

// Stack address (advanced: pre-allocate stack memory):
void* stack_memory = mmap(NULL, stack_size, PROT_READ|PROT_WRITE,
                           MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0);
pthread_attr_setstack(&attr, stack_memory, stack_size);

// Detach state:
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

// Scheduling policy and priority:
pthread_attr_setschedpolicy(&attr, SCHED_FIFO);    // real-time FIFO
// or: SCHED_RR (round-robin), SCHED_OTHER (default CFS)

struct sched_param sp = { .sched_priority = 50 };  // 1-99 for RT
pthread_attr_setschedparam(&attr, &sp);

// INHERIT vs. EXPLICIT scheduling:
// PTHREAD_INHERIT_SCHED: inherit from creating thread (default)
// PTHREAD_EXPLICIT_SCHED: use values set in attr
pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

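// Guard size (inaccessible region past the end of the stack that traps overflow;
// the default is one page, and it is ignored when pthread_attr_setstack supplies the stack):
pthread_attr_setguardsize(&attr, 8192);  // 8KB guard region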

pthread_create(&tid, &attr, func, NULL);
pthread_attr_destroy(&attr);
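Attribute setters return an error code rather than failing silently, so it is worth checking them. This sketch (noop and run_with_stack are hypothetical names) requests a stack size, reads it back from the attribute object, and runs a thread with it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static void *noop(void *arg) { return arg; }

/* Creates and joins a thread with the requested stack size.
 * Returns the size recorded in the attribute object, or 0 on failure
 * (for example when `bytes` is below PTHREAD_STACK_MIN). */
size_t run_with_stack(size_t bytes) {
    pthread_attr_t attr;
    pthread_t tid;
    size_t actual = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;

    if (pthread_attr_setstacksize(&attr, bytes) == 0 &&
        pthread_attr_getstacksize(&attr, &actual) == 0 &&
        pthread_create(&tid, &attr, noop, NULL) == 0) {
        pthread_join(tid, NULL);
        assert(actual == bytes);     /* the attribute stores the request */
    } else {
        actual = 0;
    }

    pthread_attr_destroy(&attr);     /* safe once pthread_create has returned */
    return actual;
}
```

A request of 512 * 1024 bytes should succeed, while a one-byte request is rejected by pthread_attr_setstacksize().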

Thread Cancellation

POSIX thread cancellation allows one thread to request another thread's termination:

// Set cancellation state:
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);   // can be cancelled
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);  // ignore cancel requests

// Cancellation type:
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);  // only at cancel points
pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);  // immediately (DANGEROUS)

// Explicit cancellation point:
pthread_testcancel();  // check for pending cancellation, cancel if pending

// Cleanup handlers: run on cancellation or pthread_exit(), not on a plain return.
// push and pop must be paired in the same lexical scope.
pthread_cleanup_push(cleanup_function, cleanup_arg);
// ... code that might be cancelled ...
pthread_cleanup_pop(1);  // 1 = pop and execute the handler; 0 = pop without executing

// Sending a cancel request:
pthread_cancel(target_tid);
// target_tid will be cancelled at the next cancellation point

Cancellation points are POSIX functions that check for pending cancellation. Key ones: pthread_cond_wait(), read(), write(), select(), nanosleep(), open(), close(). Most blocking system calls are cancellation points; pthread_mutex_lock() is not.

Asynchronous cancellation (PTHREAD_CANCEL_ASYNCHRONOUS) is almost never safe — it can cancel a thread in the middle of malloc(), leaving heap structures corrupted.
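A minimal sketch of deferred cancellation follows, with hypothetical names throughout. The worker registers a cleanup handler and then blocks in sleep(), which is a cancellation point. The canceller observes PTHREAD_CANCELED as the join result and can confirm that the handler ran.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_int cleanup_ran;

static void on_cancel(void *arg) {
    (void)arg;
    atomic_store(&cleanup_ran, 1);   /* real code would release resources here */
}

static void *cancellable_worker(void *arg) {
    (void)arg;
    pthread_cleanup_push(on_cancel, NULL);
    for (;;)
        sleep(1);                    /* cancellation point: the request acts here */
    pthread_cleanup_pop(0);          /* never reached; balances the push */
    return NULL;
}

/* Returns 1 if the thread was cancelled and its cleanup handler ran. */
int run_cancel_demo(void) {
    pthread_t tid;
    void *ret = NULL;

    atomic_store(&cleanup_ran, 0);
    int rc = pthread_create(&tid, NULL, cancellable_worker, NULL);
    assert(rc == 0);

    pthread_cancel(tid);             /* stays pending until a cancellation point */
    pthread_join(tid, &ret);
    return ret == PTHREAD_CANCELED && atomic_load(&cleanup_ran) == 1;
}
```

Because the default cancellation type is deferred, the request cannot take effect before the handler is pushed: nothing ahead of sleep() in the worker is a cancellation point.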

Signal Handling in Multithreaded Programs

Signal delivery in multithreaded programs is subtle:

#include <signal.h>
#include <pthread.h>

// Which thread gets a signal?
// Process-directed signals (kill(pid, sig), terminal-generated SIGINT, timers):
//   - Delivered to ONE arbitrary thread that doesn't have the signal blocked
//
// Thread-directed signals (pthread_kill(tid, sig)) and synchronous hardware
// exceptions (SIGSEGV, SIGFPE, SIGBUS):
//   - Delivered to the specific target thread, or to the thread that faulted

// Best practice: block signals in all threads except a dedicated handler thread
void setup_signal_handling() {
    sigset_t mask;
    sigfillset(&mask);  // block all signals
    pthread_sigmask(SIG_BLOCK, &mask, NULL);
    // Must be called before pthread_create() to ensure new threads inherit mask
}

void* signal_handler_thread(void* arg) {
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGUSR1);

    while (1) {
        int sig;
        sigwait(&mask, &sig);  // blocking wait for signal

        switch (sig) {
        case SIGTERM:
        case SIGINT:
            printf("Graceful shutdown requested\n");
            // Signal other threads to stop
            atomic_store(&shutdown_flag, 1);
            return NULL;
        case SIGUSR1:
            dump_stats();
            break;
        }
    }
}

pthread_atfork: Fork Safety

When a multithreaded process calls fork(), only the calling thread is duplicated in the child. If other threads held mutexes at fork time, those mutexes remain locked in the child — with no thread to unlock them (deadlock).

pthread_atfork() registers handlers to fix this:

// Register before-fork, after-fork-in-parent, after-fork-in-child handlers
pthread_atfork(
    prepare_handler,    // called in parent BEFORE fork (take all locks)
    parent_handler,     // called in parent AFTER fork (release locks)
    child_handler       // called in child AFTER fork (reinitialize mutexes)
);

// Typical implementation for a library with internal mutex:
static pthread_mutex_t lib_mutex = PTHREAD_MUTEX_INITIALIZER;

static void prepare_fork(void) {
    pthread_mutex_lock(&lib_mutex);   // ensure clean state in child
}
static void parent_after_fork(void) {
    pthread_mutex_unlock(&lib_mutex); // parent continues normally
}
static void child_after_fork(void) {
    pthread_mutex_init(&lib_mutex, NULL);  // reinitialize in child
    // (prepare_fork locked the mutex, so no vanished thread owns it and the
    //  protected data is consistent; unlocking here is the portable alternative)
}

// Libraries with internal locks should register such handlers so that callers can fork() safely
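The handler trio can be checked with a real fork(). This sketch uses hypothetical names and has the child handler unlock the mutex rather than reinitialize it. That works because the prepare handler acquired the lock in the forking thread, and the child's only thread is a copy of that thread. The child exits with status 0 only if it can take the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

static void before_fork(void)  { pthread_mutex_lock(&lib_lock); }
static void after_parent(void) { pthread_mutex_unlock(&lib_lock); }
static void after_child(void)  { pthread_mutex_unlock(&lib_lock); }

/* Forks once. Returns the child's exit status (0 means the child could
 * acquire lib_lock), or -1 on error. */
int run_atfork_demo(void) {
    static int registered;
    if (!registered) {               /* handlers persist, so register once */
        int rc = pthread_atfork(before_fork, after_parent, after_child);
        assert(rc == 0);
        registered = 1;
    }

    pid_t pid = fork();              /* runs before_fork, then both after handlers */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int ok = pthread_mutex_trylock(&lib_lock) == 0;
        _exit(ok ? 0 : 1);           /* _exit: skip the parent's atexit handlers */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

A library with several internal locks would acquire all of them in the prepare handler, in its usual lock order.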

NPTL Implementation Details

NPTL pthread_mutex_lock() Implementation
==========================================

pthread_mutex_lock(mutex):            // simplified; real NPTL code varies by mutex type
  1. Atomic CAS on mutex->lock:
     if (CAS(&mutex->lock, 0, 1) == 0):  // uncontended
         mutex->owner = tid              // thread ID cached in the TCB, no syscall
         return 0  // fast path, no syscall

  2. If lock held, mark as contended and retry:
     while (XCHG(&mutex->lock, 2) != 0):  // value 2 = contended; XCHG returns the old value

  3.     Syscall: futex(&mutex->lock, FUTEX_WAIT, 2)
         // kernel: if mutex->lock is still 2, sleep this thread
         // otherwise (value changed): return immediately and retry step 2

pthread_mutex_unlock(mutex):
  1. Atomic exchange: old = XCHG(&mutex->lock, 0)

  2. If there may be waiters (old == 2):
     futex(&mutex->lock, FUTEX_WAKE, 1)  // wake one waiter

Uncontended path: roughly 10-30 ns (one atomic operation, no syscall)
Contended path: roughly 1-5 µs (includes a context switch)
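The three-state scheme can be written out as a small Linux-only sketch using C11 atomics and the raw futex system call. This is not glibc's code: it follows the well-known textbook design (0 = unlocked, 1 = locked, 2 = locked with possible waiters), and the names futex_mutex, fm_lock, and fm_unlock are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* 0 = unlocked, 1 = locked, 2 = locked with possible waiters */
typedef struct { atomic_int state; } futex_mutex;

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void fm_lock(futex_mutex *m) {
    int c = 0;
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;                                    /* fast path: no syscall */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);         /* announce contention */
    while (c != 0) {
        futex(&m->state, FUTEX_WAIT_PRIVATE, 2);   /* sleeps only if still 2 */
        c = atomic_exchange(&m->state, 2);         /* retry; keep it marked */
    }
}

void fm_unlock(futex_mutex *m) {
    if (atomic_exchange(&m->state, 0) == 2)        /* a waiter may be asleep */
        futex(&m->state, FUTEX_WAKE_PRIVATE, 1);   /* wake one */
}

static futex_mutex demo_lock;
static long demo_counter;

static void *fm_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        fm_lock(&demo_lock);
        demo_counter++;                            /* protected by demo_lock */
        fm_unlock(&demo_lock);
    }
    return NULL;
}

/* Four threads increment a shared counter; returns the final value. */
long run_futex_demo(void) {
    pthread_t tids[4];
    demo_counter = 0;
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&tids[i], NULL, fm_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(tids[i], NULL);
    return demo_counter;
}
```

The kernel is entered only under contention. A production mutex adds ownership tracking, spinning, priority inheritance, and robust-mutex support on top of this core.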

Historical Context

The Pthreads Standard History

The pthreads standard began as draft IEEE 1003.4a and was ratified in 1995 as IEEE 1003.1c-1995 after years of debate. The standard resolved:
- Whether to use a separate pthread_t handle or integrate threads into the POSIX process model
- Exactly which functions are cancellation points
- Which features are mandatory and which are optional

Different UNIX vendors implemented pthreads independently before standardization, leading to incompatible APIs. The POSIX standard unified these.

Linux's LinuxThreads Problem

The original Linux pthreads implementation (LinuxThreads, Xavier Leroy, 1996) had fundamental POSIX violations because Linux's clone() didn't support CLONE_THREAD at the time:
- Each thread had a distinct PID (visible in ps)
- getpid() returned different values per thread
- Signal handling didn't match POSIX semantics

NPTL (2002-2003) fixed all these by leveraging the kernel CLONE_THREAD flag added specifically to support proper pthreads semantics.

Production Examples

PostgreSQL Worker Processes and Threads

PostgreSQL historically used processes (not threads) for parallelism — each connection gets a forked process. PostgreSQL 9.6 (2016) added parallel query execution using background worker processes. PostgreSQL doesn't use pthreads internally for query execution, partly due to historical fork-based architecture and partly due to per-process crash isolation.

Likewise, the WAL sender, background writer, checkpointer, and autovacuum workers run as separate processes, each a forked child rather than a pthread.

This is a deliberate architectural choice: process crash isolation over pthread-level performance. A crash in one backend cannot directly overwrite another backend's private memory.

Apache MPM Worker

Apache's Worker MPM (Multi-Processing Module) uses pthreads:

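// Apache httpd worker MPM (simplified concept, not actual httpd source;
// httpd creates its threads through APR, which wraps pthreads on POSIX):
// Creates N processes, each with M threads

// Parent process: fork N child processes (details omitted)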

// In each child process:
for (int i = 0; i < threads_per_child; i++) {
    pthread_create(&threads[i], &thread_attr, worker_thread, server_conf);
}

// worker_thread:
void* worker_thread(void* data) {
    while (!shutdown_requested) {
        accept_connection();          // blocks on accept()
        process_request();            // handle HTTP
        release_connection();
    }
    return NULL;
}

gRPC C++ Server

gRPC's C++ server uses a thread pool backed by pthreads. Each incoming RPC is dispatched to a thread pool worker. The pool size is configurable; the default is based on hardware concurrency:

// gRPC server thread pool (conceptual):
grpc::ServerBuilder builder;
builder.AddListeningPort(address, grpc::InsecureServerCredentials());
builder.RegisterService(&service);

// Set the number of completion queues used by the synchronous server
// (polling threads per queue are bounded by MIN_POLLERS / MAX_POLLERS):
builder.SetSyncServerOption(
    grpc::ServerBuilder::SyncServerOption::NUM_CQS,
    std::thread::hardware_concurrency()
);

Debugging Notes

# Detect relock deadlocks with an error-checking mutex:
# PTHREAD_MUTEX_ERRORCHECK returns EDEADLK on relock from the same thread

# Compile with ThreadSanitizer:
gcc -fsanitize=thread -g -pthread program.c -o program
./program
# WARNING: ThreadSanitizer: data race (pid=...)

# Helgrind (Valgrind tool) for data race detection:
valgrind --tool=helgrind ./program
# ==1234== Possible data race during write of size 4 at ...

# gdb multithreaded debugging:
gdb ./program
(gdb) run
(gdb) info threads                    # list all threads
(gdb) thread 3                        # switch to thread 3
(gdb) backtrace                       # stack trace of current thread
(gdb) thread apply all bt             # backtraces of all threads

# Find deadlocked threads:
# Look for several threads blocked in pthread_mutex_lock (frames such as __lll_lock_wait or a futex wait)
(gdb) thread apply all bt
# Thread 2 blocked in pthread_mutex_lock at ...
# Thread 3 blocked in pthread_mutex_lock at ...
# (if they're waiting for each other's mutexes = deadlock)

// Runtime mutex lock ordering check (poor man's deadlock detection):
// Define a canonical lock ordering and assert it (needs <assert.h> and <stdint.h>):
void locked_assert(pthread_mutex_t* to_lock, pthread_mutex_t* already_held) {
    // Lock ordering: always acquire mutexes in ascending address order.
    // The assert fires when a caller would violate the order (potential deadlock).
    assert((uintptr_t)to_lock > (uintptr_t)already_held);
    pthread_mutex_lock(to_lock);
}
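Instead of asserting the order, a helper can enforce it. The sketch below (lock_pair, unlock_pair, and the account type are hypothetical) always locks the lower-addressed mutex first, so two threads transferring in opposite directions cannot form a circular wait.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Locks two distinct mutexes in ascending address order. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    assert(a != b);
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *tmp = a;
        a = b;
        b = tmp;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}

void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);         /* unlock order does not matter */
    pthread_mutex_unlock(b);
}

typedef struct {
    pthread_mutex_t lock;
    long balance;
} account;

static account acct[2] = {
    { PTHREAD_MUTEX_INITIALIZER, 1000 },
    { PTHREAD_MUTEX_INITIALIZER, 1000 },
};

static void transfer(account *from, account *to, long amount) {
    lock_pair(&from->lock, &to->lock);
    from->balance -= amount;
    to->balance += amount;
    unlock_pair(&from->lock, &to->lock);
}

/* Moves money one unit at a time in the direction given by *arg. */
static void *mover(void *arg) {
    int dir = *(int *)arg;
    for (int i = 0; i < 100000; i++)
        transfer(&acct[dir], &acct[1 - dir], 1);
    return NULL;
}

/* Two threads transfer in opposite directions; returns the combined balance. */
long run_transfers(void) {
    pthread_t tids[2];
    int dirs[2] = { 0, 1 };
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&tids[i], NULL, mover, &dirs[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(tids[i], NULL);
    return acct[0].balance + acct[1].balance;
}
```

If each thread instead locked its own "from" account first, the two threads would acquire the locks in opposite orders and could deadlock.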

Security Implications

Race Conditions as Security Vulnerabilities

TOCTOU (Time-of-Check, Time-of-Use) races with pthreads:

// VULNERABLE: TOCTOU race condition
void process_file(const char* path, uid_t user_uid) {
    // CHECK: is user authorized?
    struct stat st;
    lstat(path, &st);
    if (st.st_uid != user_uid) return;  // not owner

    // Time gap here — another thread or attacker can replace the file

    // USE: open and read the file
    int fd = open(path, O_RDONLY);  // may now be a symlink to /etc/shadow!
    // ...
}

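In a multithreaded program, a race between check and use can lead to privilege escalation. The fix is to open the file first and then check the resulting descriptor with fstat(), optionally with O_NOFOLLOW to reject a symlink in the final path component, or to use openat(dirfd, ...) to pin the directory.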

Mutex-Guarded Data Still Needs Care

A mutex protects data from concurrent access, but not from logical races:

pthread_mutex_lock(&cache_mutex);
if (cache[key] == NULL) {
    // WRONG: the lock is dropped between the check and the update
    pthread_mutex_unlock(&cache_mutex);
    cache[key] = compute_value(key);     // race: unprotected write; two threads may both compute and store
    pthread_mutex_lock(&cache_mutex);
}
pthread_mutex_unlock(&cache_mutex);

// Correct: check and update under a single lock hold
pthread_mutex_lock(&cache_mutex);
if (cache[key] == NULL) {
    cache[key] = compute_value(key);  // still holding lock
}
pthread_mutex_unlock(&cache_mutex);

Performance Implications

Mutex Scalability Limits

A highly contended mutex becomes a bottleneck:
- 10 threads on 10 cores all competing for one mutex
- At any time, 9 threads are blocked (in kernel futex wait)
- Throughput: limited to single-threaded rate on the critical section

Solutions:
- Reduce lock granularity (per-bucket locks instead of global hash table lock)
- Reader-writer locks (if reads >> writes)
- Lock-free data structures (CAS-based)
- Per-thread data + periodic merge (avoid shared state entirely)

False sharing: threads can contend for a cache line even when they share no data logically:

// BAD: two threads' counters share one cache line (typically 64 bytes):
struct {
    long count;   // Thread 1 writes here
    long count2;  // Thread 2 writes here (same cache line on most architectures)
} shared_counters;

// GOOD: one padded, aligned slot per thread:
struct {
    long count;
    char pad[64 - sizeof(long)];  // pad the slot to 64 bytes (assumed cache line size)
} thread_data[2] __attribute__((aligned(64)));

False sharing can slow tight multithreaded loops by an order of magnitude or more. Use perf stat -e cache-misses,cache-references for a first look, or perf c2c to locate the contended cache lines.
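In standard C11 the padding can be expressed with alignas instead of a manual pad array, and a static assertion can verify the layout. The sketch below assumes 64-byte cache lines and uses hypothetical names. It checks the layout and the counting logic, not the speedup, which needs a benchmark.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 64          /* assumption: 64-byte cache lines */
#define N_COUNTERS 4
#define INCREMENTS 1000000L

typedef struct {
    alignas(CACHE_LINE) long count;   /* alignment also rounds sizeof up to 64 */
} padded_counter;

static_assert(sizeof(padded_counter) == CACHE_LINE, "one counter per cache line");

static padded_counter counters[N_COUNTERS];

static void *bump(void *arg) {
    padded_counter *c = arg;
    for (long i = 0; i < INCREMENTS; i++)
        c->count++;                   /* touches only this thread's cache line */
    return NULL;
}

/* Each thread increments its own counter; returns the sum of all counters. */
long run_padded_counters(void) {
    pthread_t tids[N_COUNTERS];
    long sum = 0;

    for (int i = 0; i < N_COUNTERS; i++) {
        counters[i].count = 0;
        int rc = pthread_create(&tids[i], NULL, bump, &counters[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < N_COUNTERS; i++) {
        pthread_join(tids[i], NULL);
        sum += counters[i].count;
    }
    return sum;
}
```

For a timing comparison against an unpadded layout, the increment may need to be made volatile or atomic so the compiler does not collapse the loop into a single addition.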

Failure Modes and Real Incidents

The MySQL InnoDB Mutex Regression

MySQL 5.1's InnoDB storage engine had a known scalability cliff: above ~4 CPU cores, adding more cores decreased throughput. Root cause: a coarse-grained mutex protecting the InnoDB kernel was contended by all threads. The "kernel" operations (buffer pool management, transaction coordination) all serialized on this single lock.

Fix (MySQL 5.5+): InnoDB mutex refactoring toward finer-grained locking, including separate mutexes per buffer pool instance.

This is the canonical "one mutex doesn't scale" production incident. Percona documented this extensively with benchmark data showing the non-linear scaling behavior.

glibc pthread_cond_broadcast Thundering Herd

pthread_cond_broadcast() wakes ALL waiting threads. If all those threads then compete for the same mutex (which they must re-acquire after waking), only one wins and the rest immediately block again. This is the "thundering herd" problem:

100 threads blocked on pthread_cond_wait()
One event occurs; pthread_cond_broadcast() called
100 threads wake, all try to acquire mutex
99 immediately block on mutex
Total wakeups: 100. Useful wakeups: 1. Wasted context switches: 99.

Use pthread_cond_signal() when only one thread should process each event. Use broadcast when all threads need to re-check their condition.
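The two wakeup calls often appear together in a work queue. In this sketch (all names hypothetical), the producer uses pthread_cond_signal() for each item because only one worker can take it, and a single pthread_cond_broadcast() at shutdown because every worker must observe the closed flag.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static int pending;      /* items waiting to be taken */
static int closed;       /* set once no more items will arrive */
static int processed;    /* items taken by workers */

static void *queue_worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&q_lock);
    for (;;) {
        while (pending == 0 && !closed)
            pthread_cond_wait(&q_cond, &q_lock);
        if (pending == 0)
            break;                   /* closed and drained */
        pending--;
        processed++;                 /* stand-in for real work */
    }
    pthread_mutex_unlock(&q_lock);
    return NULL;
}

/* Feeds nitems to nworkers threads; returns how many items were processed. */
int run_queue(int nworkers, int nitems) {
    pthread_t tids[MAX_WORKERS];
    if (nworkers > MAX_WORKERS)
        nworkers = MAX_WORKERS;
    pending = closed = processed = 0;

    for (int i = 0; i < nworkers; i++) {
        int rc = pthread_create(&tids[i], NULL, queue_worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < nitems; i++) {
        pthread_mutex_lock(&q_lock);
        pending++;
        pthread_cond_signal(&q_cond);      /* one item, one wakeup */
        pthread_mutex_unlock(&q_lock);
    }
    pthread_mutex_lock(&q_lock);
    closed = 1;
    pthread_cond_broadcast(&q_cond);       /* every worker must see this */
    pthread_mutex_unlock(&q_lock);

    for (int i = 0; i < nworkers; i++)
        pthread_join(tids[i], NULL);
    return processed;
}
```

Signalling per item stays correct because each worker rechecks the predicate in a loop. A worker that is already running simply finds pending greater than zero and never waits.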

Modern Usage

All native C/C++ concurrent code: pthreads is the de facto standard for native threading. The mainstream C++ standard libraries (libstdc++ and libc++) build std::thread on pthreads on POSIX systems.

Language runtimes: CPython's GIL uses pthreads mutexes. JVM's HotSpot thread implementation on Linux wraps pthreads. Rust's std::thread wraps pthreads.

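System daemons: Many Linux server daemons use pthreads for internal concurrency (MySQL, for example, runs one thread per connection by default). Others, including PostgreSQL, nginx, and OpenSSH, rely mainly on multiple processes or event loops, though they may still use threads for auxiliary tasks.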

Future Directions

C++ std::atomic and lock-free algorithms: Modern C++ prefers std::atomic over mutexes for shared counters and flags, and lock-free data structures for high-contention paths. A heavily contended pthreads mutex is increasingly treated as a last resort.

Linux futex2: Linux 5.16 added the futex_waitv() system call, which lets a thread wait on several futexes at once. Follow-up work in the futex2 effort targets variable-sized futexes and NUMA awareness.

WTF::Lock (WebKit): An adaptive mutex from WebKit's open-source WTF library that spins briefly before blocking the thread, reducing syscall overhead for short critical sections. Similar technique in Bionic (Android's libc).

Exercises

Deadlock Construction and Detection: Write a program with 4 threads and 4 mutexes that reliably deadlocks (circular wait: T1→L1→L2, T2→L2→L3, T3→L3→L4, T4→L4→L1). Detect with ThreadSanitizer. Fix with canonical lock ordering. Verify with Helgrind.
Condition Variable Spurious Wakeup: Write a test harness that demonstrates a spurious wakeup can occur (or use ptrace to inject one). Compare behavior of if (!ready) vs. while (!ready) in pthread_cond_wait(). Document the race condition that if introduces.
pthread_atfork Safety: Write a multithreaded server using pthreads. Add a child-process logger that fork()s to write to a file. Without pthread_atfork, demonstrate deadlock in the forked child. Implement pthread_atfork handlers to fix it.
Priority Inversion Measurement: Implement the classic three-task priority inversion scenario (low/medium/high priority pthreads with SCHED_FIFO). Measure the execution timeline showing high-priority thread starvation. Enable PTHREAD_PRIO_INHERIT on the mutex. Measure again and show the difference.
Scalability Benchmark: Write a hash map with coarse-grained locking (single mutex for entire map). Benchmark it with 1, 2, 4, 8, 16 concurrent writer threads. Plot throughput. Then refactor to fine-grained locking (one mutex per hash bucket, 256 buckets). Benchmark again. Measure the inflection point where locking overhead exceeds contention reduction benefit.

References

Butenhof, D.R. Programming with POSIX Threads. Addison-Wesley, 1997. [The definitive reference]
Kerrisk, M. The Linux Programming Interface. No Starch Press, 2010. Chapters 29-33.
IEEE 1003.1c-1995. POSIX Threads specification.
Drepper, U. "POSIX Threads and the Linux Kernel." Internal Red Hat presentation, 2002.
Lozi, J.P., et al. "The Linux Scheduler: A Decade of Wasted Cores." EuroSys '16. 2016. [Scheduler bugs in multithreaded systems]
Linux pthreads man pages: pthread_create(3), pthread_mutex_lock(3), pthread_cond_wait(3), pthread_rwlock_rdlock(3), etc.
NPTL source code: https://sourceware.org/git/glibc.git nptl/ directory
Valgrind/Helgrind documentation: https://valgrind.org/docs/manual/hg-manual.html