Section 10: Synchronization
Purpose and Scope
Synchronization is the discipline of coordinating concurrent execution units — threads, processes, or processors — to produce correct, predictable outcomes when shared state is involved. This section covers every layer of the synchronization hierarchy: from hardware primitives like compare-and-swap and memory fences, through kernel primitives like futex and spinlocks, to high-level constructs like condition variables, read-write locks, and barriers. It extends into the frontier of lock-free and wait-free algorithms, transactional memory, and the full taxonomy of concurrency hazards (deadlock, livelock, starvation, and race conditions).
The scope is intentionally broad: correctness properties, performance trade-offs, implementation strategies, and real-world failure modes are all treated as first-class concerns.
Prerequisites
- Section 02 (CPU Architecture): pipeline stages, cache coherence, memory hierarchy
- Section 03 (Operating System Fundamentals): process/thread model, kernel/user mode
- Section 05 (Scheduling): thread states, context switching, CPU affinity
- Familiarity with C or C++ at the systems level
- Basic understanding of assembly for x86-64 (LOCK prefix, MFENCE, CMPXCHG)
Learning Objectives
Upon completing this section you will be able to:
- Explain why memory reordering at the CPU and compiler level makes naive shared-memory programming incorrect.
- Implement a correct mutex using only atomic compare-and-swap and a futex syscall.
- Articulate the differences between spin-based, blocking, and adaptive locking and select appropriately for a given workload.
- Design lock-free data structures (stacks, queues) and reason formally about their progress guarantees.
- Identify deadlock, livelock, and starvation conditions from code and from runtime traces.
- Explain the C11/C++11 memory model and apply
memory_orderannotations correctly. - Describe hardware transactional memory (HTM) semantics and its fallback requirements.
- Reason about ABA problems, priority inversion, and false sharing.
Architecture Overview
User Space Kernel Space Hardware
┌─────────────────────┐ ┌──────────────────────┐ ┌──────────────┐
│ C++11 std::mutex │ │ futex (fast path) │ │ CMPXCHG │
│ pthread_mutex │──────>│ wait queue │──────>│ LOCK XADD │
│ std::atomic<T> │ │ priority inheritance │ │ MFENCE │
│ lock-free structs │ │ PI mutexes │ │ TSX/RTM │
└──────────┬──────────┘ └──────────────────────┘ └──────┬───────┘
│ │
│ fast path (uncontended): no syscall │
│ slow path (contended): FUTEX_WAIT / FUTEX_WAKE │
│ │
┌──────────▼──────────────────────────────────────────────────────────▼──────┐
│ Memory Ordering Layer │
│ Sequential Consistency ─ Acquire/Release ─ Relaxed ─ Compiler Barrier │
│ C11 _Atomic / C++11 std::atomic ─ Linux READ_ONCE / WRITE_ONCE │
└────────────────────────────────────────────────────────────────────────────┘
│
┌──────────▼──────────────────────────┐
│ Cache Coherence (MESI/MOESI) │
│ L1$─L2$─L3$ per core │
│ Invalidation traffic on contention │
└─────────────────────────────────────┘
Key Concepts
- Critical Section: A region of code that must execute atomically with respect to other threads accessing the same data.
- Mutual Exclusion (Mutex): Guarantee that at most one thread executes the critical section at any time.
- Spinlock: A mutex whose waiting threads burn CPU cycles polling the lock word rather than sleeping.
- Semaphore: A generalized integer counter allowing N concurrent holders; binary semaphore equals a mutex.
- Condition Variable: A mechanism for threads to sleep until a predicate becomes true, always paired with a mutex.
- Read-Write Lock (rwlock): Allows multiple concurrent readers or one exclusive writer.
- Barrier: A rendezvous point where all participating threads must arrive before any may proceed.
- Atomic Operation: An indivisible read-modify-write; hardware ensures no interleaved access.
- Memory Ordering: Constraints on the order in which loads and stores become visible to other processors.
- Futex: Linux's "fast userspace mutex" — an atomic word in user memory backed by a kernel wait queue, enabling lock acquisition without a syscall in the common uncontended case.
- Lock-Free: A progress property guaranteeing that at least one thread makes progress in a finite number of steps, even if others are delayed.
- Wait-Free: A stronger property guaranteeing every thread makes progress in a bounded number of its own steps.
- ABA Problem: A compare-and-swap false positive when a value changes A→B→A between read and CAS.
- Transactional Memory (TM): Optimistic concurrency where a set of memory accesses executes speculatively and commits atomically or aborts on conflict.
- Priority Inversion: A high-priority thread blocked behind a low-priority thread holding a lock; solved by priority inheritance.
- False Sharing: Two threads writing different data that reside on the same cache line, causing unnecessary coherence traffic.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1965 | Dijkstra introduces semaphores and the mutual exclusion problem |
| 1968 | Dijkstra states the dining philosophers problem |
| 1972 | Hoare introduces monitors and condition variables |
| 1974 | Lamport publishes the bakery algorithm (software mutual exclusion without atomics) |
| 1987 | Herlihy & Wing define linearizability as a correctness criterion |
| 1990 | Herlihy proves the universality of compare-and-swap |
| 1991 | Intel 486 ships with CMPXCHG instruction |
| 1993 | Herlihy introduces wait-free synchronization and the "consensus hierarchy" |
| 1996 | Linux 2.0 introduces kernel spinlocks; SMP support matures |
| 2002 | Futex introduced into Linux (Rusty Russell, Hubertus Franke) |
| 2003 | Michael & Scott non-blocking queue algorithm widely adopted |
| 2004 | Read-Copy-Update (RCU) mainlined into Linux kernel |
| 2009 | Intel TSX (Transactional Synchronization Extensions) announced |
| 2011 | C++11 and C11 standardize a formal memory model with std::atomic |
| 2013 | Intel Haswell ships with RTM/HLE (TSX hardware transactional memory) |
| 2017 | TSX errata lead to disabling on many SKUs via microcode |
| 2020 | Linux adds futex2 proposal (multi-wait, variable size) |
Modern Relevance and Production Use Cases
Database engines (PostgreSQL, MySQL InnoDB, RocksDB) are built on careful lock hierarchies — buffer pool latches, page-level rwlocks, and lock-free freelist structures are tuned obsessively because lock contention is a primary OLTP bottleneck.
High-frequency trading systems use wait-free queues and lock-free ring buffers to pass market data between threads with deterministic latency. A single cache-line bounce can cost hundreds of nanoseconds.
Linux kernel employs the full spectrum: RCU for read-mostly structures (routing tables, inode caches), spinlocks for short critical sections in interrupt context, mutexes for sleeping locks, and sequence locks (seqlocks) for infrequently-written, frequently-read counters.
Go runtime scheduler, Java's java.util.concurrent, and Rust's std::sync all implement adaptive mutexes that spin briefly then park, based on empirical observations that most locks are released quickly.
GPU programming (CUDA, ROCm) requires barrier synchronization across thousands of threads executing in SIMT fashion; __syncthreads() is a barrier within a thread block.
File Map
| File | Description |
|---|---|
01-hardware-primitives.md |
CAS, LL/SC, LOCK prefix, memory fences, x86 vs ARM atomics |
02-memory-ordering.md |
Memory models, acquire/release/relaxed/SC, C++11 memory_order, compiler barriers |
03-spinlocks.md |
Test-and-set, ticket spinlocks, MCS lock, queued spinlock in Linux |
04-mutexes.md |
Blocking mutexes, adaptive mutexes, pthread_mutex internals |
05-futex.md |
Futex design, FUTEX_WAIT/WAKE, priority inheritance, robust futexes |
06-semaphores.md |
POSIX semaphores, counting semaphores, use cases and pitfalls |
07-condition-variables.md |
Spurious wakeups, wait/signal/broadcast, monitor pattern |
08-rwlocks.md |
Reader/writer locks, read starvation, seqlocks, RCU |
09-barriers.md |
Thread barriers, memory barriers, Linux barrier() and smp_mb() |
10-atomic-operations.md |
std::atomic, fetch_add, compare_exchange_weak vs strong |
11-lock-free-data-structures.md |
Lock-free stack, Michael-Scott queue, hazard pointers, epochs |
12-wait-free-algorithms.md |
Universal construction, wait-free queues, practical considerations |
13-transactional-memory.md |
HTM (RTM/HLE), STM, conflict detection, abort handling |
14-deadlock.md |
Conditions, detection (cycle detection), prevention, avoidance (Banker's) |
15-race-conditions.md |
Data races, TOCTOU, detecting with TSan/Helgrind |
16-starvation-livelock.md |
Progress guarantees, fairness, livelock examples and fixes |
17-rcu.md |
Read-Copy-Update: grace periods, quiescent states, usage in Linux |
18-priority-inversion.md |
Priority inversion, priority inheritance, priority ceiling protocol |
19-false-sharing.md |
Cache line alignment, padding, __cacheline_aligned, profiling |
Cross-References
- Section 02 (CPU Architecture): cache coherence protocols (MESI), out-of-order execution, store buffers
- Section 03 (OS Fundamentals): kernel vs user mode, syscall overhead
- Section 05 (Scheduling): priority, real-time scheduling, SCHED_FIFO and priority inversion
- Section 08 (IPC): pipes, message queues, shared memory synchronization
- Section 11 (Memory Management): memory barriers and NUMA-aware allocation
- Section 17 (Distributed Systems): distributed locking, consensus, and their relation to local synchronization primitives
- Section 18 (Database Internals): MVCC, WAL latching, buffer pool synchronization