Section 10: Synchronization

Purpose and Scope

Synchronization is the discipline of coordinating concurrent execution units — threads, processes, or processors — to produce correct, predictable outcomes when shared state is involved. This section covers every layer of the synchronization hierarchy: from hardware primitives like compare-and-swap and memory fences, through kernel primitives like futex and spinlocks, to high-level constructs like condition variables, read-write locks, and barriers. It extends into the frontier of lock-free and wait-free algorithms, transactional memory, and the full taxonomy of concurrency hazards (deadlock, livelock, starvation, and race conditions).

The scope is intentionally broad: correctness properties, performance trade-offs, implementation strategies, and real-world failure modes are all treated as first-class concerns.

Prerequisites

Section 02 (CPU Architecture): pipeline stages, cache coherence, memory hierarchy
Section 03 (Operating System Fundamentals): process/thread model, kernel/user mode
Section 05 (Scheduling): thread states, context switching, CPU affinity
Familiarity with C or C++ at the systems level
Basic understanding of assembly for x86-64 (LOCK prefix, MFENCE, CMPXCHG)

Learning Objectives

Upon completing this section you will be able to:

Explain why memory reordering at the CPU and compiler level makes naive shared-memory programming incorrect.
Implement a correct mutex using only atomic compare-and-swap and a futex syscall.
Articulate the differences between spin-based, blocking, and adaptive locking and select appropriately for a given workload.
Design lock-free data structures (stacks, queues) and reason formally about their progress guarantees.
Identify deadlock, livelock, and starvation conditions from code and from runtime traces.
Explain the C11/C++11 memory model and apply memory_order annotations correctly.
Describe hardware transactional memory (HTM) semantics and its fallback requirements.
Reason about ABA problems, priority inversion, and false sharing.

Architecture Overview

  User Space                    Kernel Space                   Hardware
 ┌─────────────────────┐       ┌──────────────────────┐       ┌──────────────┐
 │  C++11 std::mutex   │       │  futex (fast path)   │       │  CMPXCHG     │
 │  pthread_mutex      │──────>│  wait queue          │──────>│  LOCK XADD   │
 │  std::atomic<T>     │       │  priority inheritance │       │  MFENCE      │
 │  lock-free structs  │       │  PI mutexes           │       │  TSX/RTM     │
 └──────────┬──────────┘       └──────────────────────┘       └──────┬───────┘
            │                                                         │
            │  fast path (uncontended): no syscall                    │
            │  slow path (contended): FUTEX_WAIT / FUTEX_WAKE         │
            │                                                         │
 ┌──────────▼──────────────────────────────────────────────────────────▼──────┐
 │                      Memory Ordering Layer                                  │
 │  Sequential Consistency ─ Acquire/Release ─ Relaxed ─ Compiler Barrier     │
 │  C11 _Atomic / C++11 std::atomic ─ Linux READ_ONCE / WRITE_ONCE            │
 └────────────────────────────────────────────────────────────────────────────┘
            │
 ┌──────────▼──────────────────────────┐
 │       Cache Coherence (MESI/MOESI)  │
 │  L1$─L2$─L3$ per core              │
 │  Invalidation traffic on contention │
 └─────────────────────────────────────┘

Key Concepts

Critical Section: A region of code that must execute atomically with respect to other threads accessing the same data.
Mutual Exclusion (Mutex): Guarantee that at most one thread executes the critical section at any time.
Spinlock: A mutex whose waiting threads burn CPU cycles polling the lock word rather than sleeping.
Semaphore: A generalized integer counter allowing N concurrent holders; binary semaphore equals a mutex.
Condition Variable: A mechanism for threads to sleep until a predicate becomes true, always paired with a mutex.
Read-Write Lock (rwlock): Allows multiple concurrent readers or one exclusive writer.
Barrier: A rendezvous point where all participating threads must arrive before any may proceed.
Atomic Operation: An indivisible read-modify-write; hardware ensures no interleaved access.
Memory Ordering: Constraints on the order in which loads and stores become visible to other processors.
Futex: Linux's "fast userspace mutex" — an atomic word in user memory backed by a kernel wait queue, enabling lock acquisition without a syscall in the common uncontended case.
Lock-Free: A progress property guaranteeing that at least one thread makes progress in a finite number of steps, even if others are delayed.
Wait-Free: A stronger property guaranteeing every thread makes progress in a bounded number of its own steps.
ABA Problem: A compare-and-swap false positive when a value changes A→B→A between read and CAS.
Transactional Memory (TM): Optimistic concurrency where a set of memory accesses executes speculatively and commits atomically or aborts on conflict.
Priority Inversion: A high-priority thread blocked behind a low-priority thread holding a lock; solved by priority inheritance.
False Sharing: Two threads writing different data that reside on the same cache line, causing unnecessary coherence traffic.

Major Historical Milestones

Year	Milestone
1965	Dijkstra introduces semaphores and the mutual exclusion problem
1968	Dijkstra states the dining philosophers problem
1972	Hoare introduces monitors and condition variables
1974	Lamport publishes the bakery algorithm (software mutual exclusion without atomics)
1987	Herlihy & Wing define linearizability as a correctness criterion
1990	Herlihy proves the universality of compare-and-swap
1991	Intel 486 ships with CMPXCHG instruction
1993	Herlihy introduces wait-free synchronization and the "consensus hierarchy"
1996	Linux 2.0 introduces kernel spinlocks; SMP support matures
2002	Futex introduced into Linux (Rusty Russell, Hubertus Franke)
2003	Michael & Scott non-blocking queue algorithm widely adopted
2004	Read-Copy-Update (RCU) mainlined into Linux kernel
2009	Intel TSX (Transactional Synchronization Extensions) announced
2011	C++11 and C11 standardize a formal memory model with `std::atomic`
2013	Intel Haswell ships with RTM/HLE (TSX hardware transactional memory)
2017	TSX errata lead to disabling on many SKUs via microcode
2020	Linux adds `futex2` proposal (multi-wait, variable size)

Modern Relevance and Production Use Cases

Database engines (PostgreSQL, MySQL InnoDB, RocksDB) are built on careful lock hierarchies — buffer pool latches, page-level rwlocks, and lock-free freelist structures are tuned obsessively because lock contention is a primary OLTP bottleneck.

High-frequency trading systems use wait-free queues and lock-free ring buffers to pass market data between threads with deterministic latency. A single cache-line bounce can cost hundreds of nanoseconds.

Linux kernel employs the full spectrum: RCU for read-mostly structures (routing tables, inode caches), spinlocks for short critical sections in interrupt context, mutexes for sleeping locks, and sequence locks (seqlocks) for infrequently-written, frequently-read counters.

Go runtime scheduler, Java's java.util.concurrent, and Rust's std::sync all implement adaptive mutexes that spin briefly then park, based on empirical observations that most locks are released quickly.

GPU programming (CUDA, ROCm) requires barrier synchronization across thousands of threads executing in SIMT fashion; __syncthreads() is a barrier within a thread block.

File Map

File	Description
`01-hardware-primitives.md`	CAS, LL/SC, LOCK prefix, memory fences, x86 vs ARM atomics
`02-memory-ordering.md`	Memory models, acquire/release/relaxed/SC, C++11 `memory_order`, compiler barriers
`03-spinlocks.md`	Test-and-set, ticket spinlocks, MCS lock, queued spinlock in Linux
`04-mutexes.md`	Blocking mutexes, adaptive mutexes, pthread_mutex internals
`05-futex.md`	Futex design, FUTEX_WAIT/WAKE, priority inheritance, robust futexes
`06-semaphores.md`	POSIX semaphores, counting semaphores, use cases and pitfalls
`07-condition-variables.md`	Spurious wakeups, wait/signal/broadcast, monitor pattern
`08-rwlocks.md`	Reader/writer locks, read starvation, seqlocks, RCU
`09-barriers.md`	Thread barriers, memory barriers, Linux `barrier()` and `smp_mb()`
`10-atomic-operations.md`	`std::atomic`, fetch_add, compare_exchange_weak vs strong
`11-lock-free-data-structures.md`	Lock-free stack, Michael-Scott queue, hazard pointers, epochs
`12-wait-free-algorithms.md`	Universal construction, wait-free queues, practical considerations
`13-transactional-memory.md`	HTM (RTM/HLE), STM, conflict detection, abort handling
`14-deadlock.md`	Conditions, detection (cycle detection), prevention, avoidance (Banker's)
`15-race-conditions.md`	Data races, TOCTOU, detecting with TSan/Helgrind
`16-starvation-livelock.md`	Progress guarantees, fairness, livelock examples and fixes
`17-rcu.md`	Read-Copy-Update: grace periods, quiescent states, usage in Linux
`18-priority-inversion.md`	Priority inversion, priority inheritance, priority ceiling protocol
`19-false-sharing.md`	Cache line alignment, padding, `__cacheline_aligned`, profiling

Cross-References

Section 02 (CPU Architecture): cache coherence protocols (MESI), out-of-order execution, store buffers
Section 03 (OS Fundamentals): kernel vs user mode, syscall overhead
Section 05 (Scheduling): priority, real-time scheduling, SCHED_FIFO and priority inversion
Section 08 (IPC): pipes, message queues, shared memory synchronization
Section 11 (Memory Management): memory barriers and NUMA-aware allocation
Section 17 (Distributed Systems): distributed locking, consensus, and their relation to local synchronization primitives
Section 18 (Database Internals): MVCC, WAL latching, buffer pool synchronization