Mutual Exclusion

Overview

Mutual exclusion is the foundational problem in concurrent programming: ensuring that when multiple threads or processes need to access a shared resource, only one may do so at any given time. The term was coined by Edsger W. Dijkstra in his landmark 1965 paper "Solution of a Problem in Concurrent Programming Control," which described what is now known as the critical section problem. Every lock in every operating system kernel, every database transaction isolation mechanism, every concurrent data structure ultimately rests on the solution to this problem. Understanding mutual exclusion from first principles — including why naive software approaches fail and why hardware primitives are necessary — is essential for anyone working on systems software.

Prerequisites

Basic understanding of threads and processes
Familiarity with memory models and CPU registers
Understanding of assembly language at a conceptual level
Knowledge of what a critical section is

Core Technical Content

The Critical Section Problem

A critical section is any segment of code that must not be executed concurrently by more than one thread. The mutual exclusion problem requires designing an entry protocol (lock acquisition) and exit protocol (lock release) such that:

Mutual Exclusion: At most one process executes in the critical section at any time.
Progress: If no process is in the critical section and some wish to enter, only those not executing in the remainder section participate in the decision, and this decision cannot be postponed indefinitely.
Bounded Waiting: There exists a bound on the number of times other processes can enter before a waiting process is admitted.

Dekker's Algorithm (1965)

The first correct software-only solution for two processes, attributed to Dutch mathematician Th. J. Dekker, discovered by Dijkstra:

// Shared variables
int turn = 0;
bool flag[2] = {false, false};

// Process 0
flag[0] = true;
while (flag[1]) {
    if (turn != 0) {
        flag[0] = false;
        while (turn != 0) {} // busy wait
        flag[0] = true;
    }
}
// critical section
turn = 1;
flag[0] = false;

Peterson's Algorithm (1981)

Gary Peterson published a simpler two-process solution in 1981 that is easier to understand and prove correct:

// Shared variables
bool flag[2] = {false, false};
int turn;

// Process i (j = 1-i)
void lock(int i) {
    int j = 1 - i;
    flag[i] = true;       // "I want to enter"
    turn = j;             // "But you go first"
    while (flag[j] && turn == j) {
        // busy wait
    }
}

void unlock(int i) {
    flag[i] = false;
}

Proof sketch: If both processes are spinning, turn can only be one value — the last writer wins. The other process will be allowed in. When process i exits, it sets flag[i] = false, allowing process j in if it is waiting.

Why Software-Only Solutions Fail on Modern CPUs

Peterson's and Dekker's algorithms are provably correct under sequential consistency — a memory model where all memory operations appear to execute in some global sequential order consistent with each program's order. Modern processors, however, implement relaxed memory models:

x86 (TSO — Total Store Order): Stores can be delayed relative to loads from other processors. A store by CPU 0 may not be visible to CPU 1 until after CPU 1 has completed its own loads.
ARM/POWER (Relaxed): Both loads and stores can be reordered with respect to each other.

On an x86 TSO machine, Peterson's algorithm can fail:

CPU 0:                    CPU 1:
flag[0] = true            flag[1] = true
turn = 1                  turn = 0
// Reads flag[1]          // Reads flag[0]
// Store buffer not       // Store buffer not
// flushed yet!           // flushed yet!
// Sees flag[1] = false   // Sees flag[0] = false
// Enters CS              // Enters CS  <-- VIOLATION

Both processors read stale values from their store buffers and both enter the critical section simultaneously. Without memory barriers (MFENCE on x86), software-only mutual exclusion is broken on real hardware.

Hardware Primitives

Test-and-Set (TAS)

The simplest hardware primitive, available since the IBM System/360 (1964):

// Atomic: reads old value, writes 1, returns old value
bool test_and_set(bool *lock) {
    bool old = *lock;
    *lock = true;
    return old;
}

// Spinlock using TAS
void lock(bool *l) {
    while (test_and_set(l)) {} // spin until we get 0 (unlocked)
}
void unlock(bool *l) {
    *l = false;
}

x86 instruction: XCHG reg, mem (exchange is implicitly atomic and acts as a full memory barrier on x86).

Compare-and-Swap (CAS)

More powerful than TAS; the foundation of most modern lock-free algorithms:

// Atomic: if *ptr == expected, write new_val, return true
// Otherwise, leave *ptr unchanged, return false
bool compare_and_swap(int *ptr, int expected, int new_val) {
    if (*ptr == expected) {
        *ptr = new_val;
        return true;
    }
    return false;
}

x86: CMPXCHG (with LOCK prefix for atomicity across CPUs). ARM64: CAS instruction (ARMv8.1-A), or the older LDXR/STXR loop. C11: atomic_compare_exchange_strong() / atomic_compare_exchange_weak() Linux kernel: cmpxchg() macro in arch/x86/include/asm/cmpxchg.h

// Linux kernel CAS usage
old = cmpxchg(&lock->val, 0, 1);
if (old == 0) {
    // acquired
}

Load-Linked / Store-Conditional (LL/SC)

Used on ARM, MIPS, PowerPC, RISC-V. Avoids the ABA problem that CAS suffers from:

// ARM64 spinlock acquire (simplified)
1: ldaxr   w0, [x1]      // Load-Acquire Exclusive
   cbnz    w0, 1b        // If not 0, spin
   stxr    w2, w3, [x1]  // Store Exclusive: write 1
   cbnz    w2, 1b        // If store failed (contention), retry

The store-conditional fails if any other CPU has written to the location since the load-linked. This is stronger than CAS: it detects any intervening write, not just value changes.

The Fetch-and-Add Primitive

// Atomic: return old value, increment by delta
int fetch_and_add(int *ptr, int delta) {
    int old = *ptr;
    *ptr += delta;
    return old;
}

x86: XADD with LOCK prefix. Used in ticket spinlocks to ensure FIFO ordering (see 02-spinlocks.md).

From Hardware to OS Locks

Hardware TAS/CAS/LL-SC
        |
        v
  Spinlock (busy-wait, short duration)
        |
        v
  Mutex (sleep on contention, backed by futex)
        |
        v
  Higher-level sync (semaphore, rwlock, condvar)

In the Linux kernel, arch/x86/include/asm/spinlock.h implements the architecture-specific spinlock primitive using LOCK CMPXCHG or XADD. On ARM64, arch/arm64/include/asm/spinlock.h uses LDAXR/STXR pairs.

Historical Context

Edsger W. Dijkstra first formally stated and solved the critical section problem in 1965 in "Solution of a Problem in Concurrent Programming Control" (Communications of the ACM, 8(9):569). This was motivated by the need to coordinate multiple asynchronous processes sharing a single-user computer. Dijkstra attributed the two-process solution to Dekker but generalized it to N processes.

Peterson's 1981 solution ("Myths about the Mutual Exclusion Problem," IPL 12(3)) simplified Dekker's algorithm dramatically and became the canonical textbook example.

Hardware test-and-set appeared in the IBM System/360 architecture (1964) as the TS instruction, predating the formal software solutions. This is a recurring theme in systems: hardware designers and systems programmers discovered empirical solutions before theoreticians provided formal frameworks.

Leslie Lamport introduced the bakery algorithm (1974), a software solution for N processes, notable for working even if memory reads/writes are not atomic (some early shared-memory machines had non-atomic word access).

Production Examples

Linux kernel: Every spinlock_t, mutex, rwsem, and semaphore in the kernel builds on CAS/LL-SC hardware primitives. There are approximately 10,000+ lock sites in the kernel source.
glibc NPTL: pthread_mutex_lock() uses futex, which uses CAS in the fast path.
JVM: Java's synchronized keyword and java.util.concurrent.locks use CAS via sun.misc.Unsafe.compareAndSwapInt().
PostgreSQL: Buffer manager uses spinlocks (s_lock()) built on TAS for short critical sections protecting buffer pin counts.
Redis: Single-threaded but uses lock-free techniques in its cluster replication path.
Intel TBB: Task Building Blocks uses CAS extensively in its work-stealing scheduler.

Debugging Notes

Detecting missing mutual exclusion: Use ThreadSanitizer (TSan), which instruments every memory access and reports data races. Enable with CC="clang -fsanitize=thread".
Linux kernel: Use CONFIG_LOCKDEP which tracks lock ordering and detects potential deadlocks, and CONFIG_PROVE_LOCKING for stronger checking.
Helgrind (Valgrind tool): Detects lock ordering violations and data races.
GDB: Inspect lock state with info threads, thread apply all bt for deadlock detection.
Perf: Use perf lock to trace kernel lock contention events.

Common symptoms of broken mutual exclusion: - Sporadic incorrect results (non-deterministic bugs) - Crashes only under load or on many-core machines - Bugs that disappear under debugger (timing changes)

Security Implications

TOCTOU (Time-of-Check Time-of-Use): A classic race condition where a security check is performed before a privileged operation, but the state changes between the check and the use. Example: checking file permissions then opening the file — an attacker can swap the file with a symlink. Mitigated by using atomic operations that combine check and act.
CVE-2016-5195 (Dirty COW): A famous Linux kernel privilege escalation caused by a race condition in the copy-on-write page handling. A write race in mm/memory.c allowed unprivileged users to write to read-only memory-mapped files, including /etc/passwd. This is arguably the most famous mutual exclusion failure in Linux history.
Double-fetch vulnerabilities: Kernel reads a value from user space twice; an attacker modifies it between reads. Mitigated by reading once into kernel memory.
Lock-bypass attacks: If an attacker can cause a lock to be skipped (e.g., by triggering an exception between lock check and lock acquire), they can exploit the unprotected window.

Performance Implications

Uncontended CAS: ~4-10 cycles on x86 (same as a regular load/store pair plus a fence).
Contended CAS: Cache line must be transferred exclusively; on a 2-socket NUMA system this can be 300-500 cycles.
Amdahl's Law: If 5% of code is serial (under a global lock), maximum speedup on 128 cores is ~17x, not 128x.
False sharing: Two variables on the same cache line contend even though they protect different data. Use __cacheline_aligned (____cacheline_aligned_in_smp in Linux kernel) to pad structures.
Cache line bouncing: A heavily contended lock causes the cache line holding it to bounce between CPUs, saturating the interconnect. The MCS lock (see 02-spinlocks.md) solves this by having each CPU spin on a local variable.

Common Pitfalls and Failure Modes

Assuming x86 = sequential consistency: Peterson's algorithm silently breaks. Always use proper synchronization primitives with appropriate memory ordering.
Compiler reordering: Even on sequentially consistent hardware, compilers can reorder memory accesses. Use volatile in C (insufficient) or proper atomic types with std::atomic / _Atomic.
Lock elision without testing: Intel's TSX (Hardware Lock Elision) can silently fall back to the lock path. Code that assumed elision would happen may have unexpected performance cliffs.
Recursive locking: Many spinlock implementations are not reentrant. Calling a function that acquires the same lock from within a critical section causes a deadlock.
Interrupt context: Acquiring a sleeping mutex from interrupt context is illegal in Linux. Must use spin_lock_irqsave() to disable interrupts and prevent deadlock with interrupt handlers.

Real-World Failure Case Studies

Dirty COW (CVE-2016-5195, October 2016): The madvise(MADV_DONTNEED) call could race with a write to a copy-on-write mapping. The kernel's write_protect_page() path had a window where a page could be dirtied without going through the COW mechanism, allowing a write to a read-only file-backed mapping. Affected all Linux kernels from 2.6.22 (2007) to the patch in 4.8.3. Exploited actively in the wild within days of disclosure.

MySQL InnoDB double-write race (pre-5.5): A race condition in the buffer pool could allow two threads to simultaneously flush the same page, causing corruption under high write load. Required careful lock ordering fixes in the buffer manager.

OpenSSL PRNG fork safety (2008): The OpenSSL PRNG was not fork-safe — after fork(), parent and child shared identical PRNG state because the PID-based reseeding had a race condition. Duplicate TLS session keys were generated. Fixed by adding a fork handler with pthread_atfork().

Modern Usage and Cloud-Scale Considerations

At cloud scale, contention on a single mutex becomes a serious bottleneck: - Lock sharding: Instead of one global lock, use an array of locks and hash the key to select one. Linux's dcache_lock was sharded into per-dentry seqlocks. - Lock-free algorithms: Avoid locks entirely for hot paths (see 09-lock-free-data-structures.md). - RCU: For read-heavy workloads, RCU provides zero-overhead reads (see 10-rcu.md). - Partitioned data structures: Design data so that different CPUs own different partitions, eliminating shared mutable state. - Kernel bypass (DPDK, io_uring): Minimizing kernel entries reduces opportunities for lock contention.

On ARM-based cloud instances (AWS Graviton, Ampere Altra), the weaker memory model means that code relying on implicit x86 ordering assumptions will fail. Porting code from x86 to ARM often reveals latent mutual exclusion bugs.

Future Directions

Rust ownership model: The borrow checker enforces at compile time that mutable references are exclusive, making many mutual exclusion bugs impossible. Mutex<T> in Rust wraps data, not code, making it impossible to access protected data without holding the lock.
Formal verification: Tools like TLA+, SPIN, and the Linux kernel's memory model (tools/memory-model/) allow algorithmic verification of synchronization protocols.
Hardware transactional memory: Intel TSX (partially deprecated due to bugs) promised to make mutual exclusion implicit. See 11-transactional-memory.md.
Capability-based OS design: Approaches like seL4 use capability-passing to eliminate shared mutable state, making mutual exclusion largely unnecessary at the kernel level.

Summary Table

Property	TAS	CAS	LL/SC	Peterson	Dekker
Hardware required	Yes	Yes	Yes	No	No
Works on modern CPUs	Yes	Yes	Yes	No*	No*
N-process support	Yes	Yes	Yes	No (2 only)	No (2 only)
ABA-free	No	No	Yes	N/A	N/A
Wait-free	No	No	No	Yes (bounded)	Yes (bounded)
Needs memory fence	Implicit	Explicit	Implicit	Yes	Yes

*Without explicit memory fences.

Exercises

Implement and break Peterson's algorithm: Write a C program using pthreads that runs Peterson's algorithm without memory barriers on an x86 machine. Add a counter incremented in the critical section and verify that the final count is sometimes wrong. Then add __sync_synchronize() barriers and verify it becomes correct.
CAS-based counter: Implement a thread-safe counter using only atomic_compare_exchange_weak. Handle the retry loop correctly. Benchmark it against a mutex-protected counter at varying thread counts (2, 4, 8, 16, 32) and plot contention vs throughput.
TOCTOU exploit: Write a program that has a TOCTOU race on file access. Use inotifywait or a racing thread to exploit it. Then fix it using open() + fstat() instead of stat() + open().
Memory ordering experiment: On an x86 machine, write an assembly or C program (using relaxed atomics) that demonstrates store-load reordering is possible. On ARM, demonstrate load-load reordering.
Study Dirty COW: Read the CVE-2016-5195 patch (mm/memory.c). Identify exactly which lines were added and what race they close. Draw a timing diagram showing the pre-patch race.

References

Dijkstra, E.W. (1965). "Solution of a Problem in Concurrent Programming Control." CACM 8(9):569.
Peterson, G.L. (1981). "Myths about the Mutual Exclusion Problem." Information Processing Letters 12(3):115-116.
Lamport, L. (1974). "A New Solution of Dijkstra's Concurrent Programming Problem." CACM 17(8):453-455.
Herlihy, M. & Shavit, N. (2008). The Art of Multiprocessor Programming. Morgan Kaufmann.
Preshing, J. (2012). "Memory Ordering at Compile Time." https://preshing.com/20120625/memory-ordering-at-compile-time/
Linux kernel source: arch/x86/include/asm/cmpxchg.h, arch/arm64/include/asm/spinlock.h
CVE-2016-5195: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-5195
Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, §8.2 (Memory Ordering)