Mutexes and Semaphores

Overview

Mutexes and semaphores are the two fundamental sleeping synchronization primitives in operating systems. While both prevent concurrent access to shared resources, they differ critically in semantics, ownership model, and appropriate use cases. Edsger W. Dijkstra introduced the semaphore in 1965 as part of the THE multiprogramming system, defining the P (proberen, "to test") and V (verhogen, "to increment") operations that remain in use today. The mutex — a binary semaphore with strict ownership semantics and additional properties like priority inheritance — came later as programmers recognized that semaphores were too general and too easy to misuse. Understanding both primitives, their kernel implementations, and their performance characteristics is essential for writing correct and efficient systems software.

Prerequisites

Understanding of spinlocks and busy-waiting
Knowledge of scheduler data structures (wait queues)
Familiarity with futex system call
Understanding of priority inversion problem

Core Technical Content

Semaphore: Definition and Operations

A semaphore is an integer variable on which only two atomic operations are defined:

P (down/wait/acquire): Decrement the counter. If the result is negative, block (add self to wait queue and sleep).
V (up/signal/release): Increment the counter. If there were blocked threads (counter was negative), wake one.

struct semaphore {
    int count;      // negative = |count| threads waiting
    wait_queue_t waiters;
};

void down(semaphore_t *sem) {
    if (atomic_dec_return(&sem->count) < 0) {
        // Add to wait queue, sleep
        sleep_on(&sem->waiters);
    }
}

void up(semaphore_t *sem) {
    if (atomic_inc_return(&sem->count) <= 0) {
        // Wake one waiter
        wakeup_one(&sem->waiters);
    }
}

Binary semaphore: Initialized to 1. P acquires, V releases. Functions like a mutex but lacks ownership.

Counting semaphore: Initialized to N. Allows up to N concurrent access. Used for resource pools (e.g., connection pools, buffer slots).

Linux Kernel Semaphore

Defined in include/linux/semaphore.h, implemented in kernel/locking/semaphore.c:

struct semaphore {
    raw_spinlock_t  lock;
    unsigned int    count;
    struct list_head wait_list;
};

#define DEFINE_SEMAPHORE(name) \
    struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)

Key operations: - down(sem): Uninterruptible wait (cannot be killed by signal while waiting) - down_interruptible(sem): Returns -EINTR if signal received - down_trylock(sem): Non-blocking attempt - up(sem): Release

The kernel semaphore uses a raw_spinlock_t internally to protect the count and wait_list, then calls schedule() to sleep. This means the semaphore implementation has two layers: a spinlock for short critical section around state update, and sleeping for the actual wait.

Mutex: Definition

A mutex (mutual exclusion lock) is a binary semaphore with the following additional constraints:

Ownership: Only the thread that acquired the mutex may release it.
Non-recursive: A thread attempting to reacquire a mutex it already holds deadlocks (by default).
Priority inheritance: The mutex holder inherits the priority of the highest-priority waiter (optional but standard in RT systems).

These constraints make mutexes safer and easier to reason about than semaphores, but less flexible.

Linux Kernel Mutex

Defined in include/linux/mutex.h, implemented in kernel/locking/mutex.c:

struct mutex {
    atomic_long_t   owner;    // task_struct pointer of owner, or 0
    raw_spinlock_t  wait_lock; // protects wait_list
    struct list_head wait_list;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
    struct optimistic_spin_queue osq; // MCS-based spin queue
#endif
};

The owner field encodes both the owner pointer and flags in the low bits (since task_struct is aligned to at least 16 bytes on 64-bit systems, the low 4 bits are free):

owner bits:
  [63:4] = task_struct pointer
  [3]    = WAITERS flag (waiters in wait_list)
  [2]    = PICKUP flag (lock being handed off)
  [1]    = HANDOFF flag
  [0]    = (unused/reserved)

Fast Path

void mutex_lock(struct mutex *lock) {
    // Fast path: CAS NULL -> current task_struct
    if (!__mutex_trylock_fast(lock))
        __mutex_lock_slowpath(lock);
}

static inline bool __mutex_trylock_fast(struct mutex *lock) {
    unsigned long curr = (unsigned long)current;
    // atomic CAS: if owner == 0, set to curr
    return atomic_long_try_cmpxchg_acquire(&lock->owner, &(unsigned long){0}, curr);
}

If the lock is uncontended (owner == 0), this is a single CAS with acquire semantics — no syscall, no spinlock, O(1).

Slow Path: Optimistic Spinning

Before sleeping, the kernel tries optimistic spinning: if the mutex owner is currently running on another CPU, spin for a short time on an MCS node (osq) hoping it will release quickly. This avoids the expensive context switch for short critical sections.

static noinline void __mutex_lock_slowpath(struct mutex *lock) {
    // Try optimistic spinning (owner is running, maybe quick)
    if (mutex_can_spin_on_owner(lock)) {
        if (mutex_optimistic_spin(lock, ...))
            return; // got lock
    }
    // Add to wait_list, sleep
    __mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, ...);
}

This "adaptive mutex" design (spin if owner is running, sleep otherwise) was popularized by Solaris and adopted by Linux. It greatly reduces mutex latency for short critical sections on multi-core systems.

Unlock

void mutex_unlock(struct mutex *lock) {
    // Fast path: CAS owner -> 0 if no waiters
    if (__mutex_unlock_fast(lock))
        return;
    __mutex_unlock_slowpath(lock, _RET_IP_);
}

If no waiters, unlock is a single atomic store. If there are waiters, wake the next waiter and hand off ownership.

POSIX pthread_mutex Internals

pthread_mutex_lock() in glibc's NPTL (nptl/pthread_mutex_lock.c) uses the futex system call:

// futex word layout (glibc NPTL):
// 0 = unlocked
// TID = locked by thread TID, no waiters
// TID | FUTEX_WAITERS = locked, threads waiting

int pthread_mutex_lock(pthread_mutex_t *mutex) {
    int tid = gettid();
    // Fast path: CAS 0 -> tid
    if (cmpxchg(&mutex->__data.__lock, 0, tid) == 0)
        return 0;
    // Slow path: mark WAITERS bit, call futex(WAIT)
    return __pthread_mutex_lock_full(mutex);
}

The futex(FUTEX_WAIT) system call puts the thread to sleep in the kernel if the futex word still has the expected value. futex(FUTEX_WAKE) on unlock wakes one waiter. See 07-futex.md for full details.

Priority Inheritance Mutex

Priority inversion occurs when a high-priority task is blocked waiting for a lock held by a low-priority task, and a medium-priority task preempts the lock holder. The high-priority task is effectively at the mercy of the medium-priority task.

High priority:  [BLOCKED waiting for lock held by Low]
Medium:         [RUNNING, preempted Low]
Low:            [PREEMPTED while holding lock]

Result: High is blocked behind Medium, violating priority ordering.

Priority Inheritance (PI): When High blocks on Low's lock, Low's effective priority is raised to High's until it releases the lock. Linux implements PI mutexes (kernel/locking/rtmutex.c, include/linux/rtmutex.h):

struct rt_mutex {
    raw_spinlock_t  wait_lock;
    struct rb_root_cached waiters; // priority-sorted red-black tree
    struct task_struct *owner;
};

Waiters are in a red-black tree sorted by priority, so the highest-priority waiter can be found in O(log N). The PI chain propagation (boost Low's priority, and transitively boost if Low is waiting on another lock held by VeryLow) is implemented in __rt_mutex_slowlock().

PI futexes (FUTEX_LOCK_PI, FUTEX_UNLOCK_PI) bring this to userspace — glibc uses them for PTHREAD_MUTEX_ROBUST with PTHREAD_PRIO_INHERIT.

Robust Mutex

A robust mutex automatically releases when the owning process dies (crashes). The kernel maintains a list of robust mutexes per thread. On process death, the kernel marks them as "owner dead" so waiters can detect the failure:

// In glibc
pthread_mutex_t m;
pthread_mutexattr_t attr;
pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
pthread_mutex_init(&m, &attr);

// After lock holder dies, a waiter gets:
int ret = pthread_mutex_lock(&m);
// ret == EOWNERDEAD
pthread_mutex_consistent(&m);  // mark as recovered

The kernel side is in kernel/futex.c: exit_robust_list() is called during do_exit() and walks the thread's robust list, unlocking each futex with a special "owner died" marker.

Linux Kernel Synchronization Hierarchy

Scenario                     Primitive
---------                    ---------
Short CS, any context        raw_spinlock_t
Short CS, process context    spinlock_t
Long CS, process context     mutex (kernel/locking/mutex.c)
Long CS, interruptible       mutex_lock_interruptible()
Read-mostly data             rwsem (kernel/locking/rwsem.c)
Many readers, rare writers   RCU (kernel/rcu/)
Event signaling              completion (kernel/locking/completion.c)
Resource counting            semaphore (kernel/locking/semaphore.c)

A key rule from Documentation/locking/mutex-design.rst: "The kernel mutex has stricter semantics than a semaphore. Use a mutex where possible."

Specifically, kernel mutex cannot be used: - From interrupt context (use spinlock) - If mutex_unlock would be called from a different task than mutex_lock (use semaphore) - If recursive locking is needed (avoid this design; use recursive_mutex if unavoidable)

rwsem (Read-Write Semaphore)

For workloads with many readers and few writers, rwsem (kernel/locking/rwsem.c) allows concurrent reads:

struct rw_semaphore {
    atomic_long_t count;       // high bits=readers, low bits=flags
    atomic_long_t owner;
    struct optimistic_spin_queue osq;
    raw_spinlock_t wait_lock;
    struct list_head wait_list;
};

The count encoding: - RWSEM_UNLOCKED_VALUE (0): unlocked - Positive: number of active readers - RWSEM_FLAG_WAITERS: writers waiting - RWSEM_WRITER_LOCKED: write lock held

Historical Context

Dijkstra introduced the semaphore (P and V operations) in his 1965 report on the THE system. The name comes from railroad semaphores — signals that indicate whether a track section is occupied.

The concept of a mutex with ownership was formalized in the context of monitors (Hoare, 1974; Hansen, 1973), though the explicit mutex abstraction emerged from real-time operating systems in the late 1970s and early 1980s, particularly in work on priority inversion (Mars Pathfinder's famous 1997 priority inversion bug predates Linux's PI mutex).

The POSIX thread standard (1995) formalized pthread_mutex_t with its various types and attributes.

Production Examples

Linux kernel (kernel/locking/mutex.c): The struct mutex is used for nearly all process-context synchronization in drivers, filesystems, and subsystems.
glibc NPTL: pthread_mutex_lock() wraps futex, used by all POSIX programs.
Java synchronized/ReentrantLock: Uses OS mutexes via JNI at the slow path.
Mars Pathfinder (1997): Priority inversion on a VxWorks mutex caused the spacecraft to reset. The fix was enabling priority inheritance on one mutex. This incident drove adoption of PI mutexes in embedded RTOS.

Debugging Notes

CONFIG_DEBUG_MUTEXES: Adds checks for lock ordering violations, double unlock, unlock-from-wrong-task.
CONFIG_LOCKDEP: Tracks mutex acquisition order graph.
cat /proc/locks: Shows all held POSIX locks (file locks, not mutex).
strace -e futex: Shows futex system calls, reveals mutex contention (slow path).
perf trace -e futex: Trace futex calls with timing.
High FUTEX_WAIT rate in strace indicates heavy mutex contention.
Mutex deadlock: gdb thread apply all bt to identify all threads and their wait states.

Security Implications

Double-free via mutex misuse: If two threads both believe they hold a lock due to a logic error, double-free of shared data becomes possible.
Priority inversion as DoS: In RT systems, a low-priority task can permanently block high-priority tasks if PI is not used. Historically exploited in embedded systems.
Robust mutex bypass: If the pthread_mutex_consistent() call is skipped after an EOWNERDEAD error, the mutex is left in an inconsistent state, potentially allowing future incorrect behavior.
CVE-2014-3153 (futex escalation): A bug in the futex requeue code (futex_requeue() in kernel/futex.c) allowed privilege escalation by manipulating futex kernel wait queue. Affected Linux ≤3.14.5.

Performance Implications

Uncontended mutex lock/unlock: ~10-30 ns (single CAS + memory fence), no syscall.
Contended, owner running: ~50-500 ns (optimistic spin), no syscall.
Contended, owner sleeping: ~1-10 µs (futex WAIT syscall, context switch pair).
Mutex vs spinlock: For critical sections > 1-2 µs, mutex is better (sleeping is cheaper than spinning). For < 100 ns, spinlock is better.
Priority inheritance overhead: O(depth) lock chain boost propagation. Can be expensive for deep PI chains.

Common Pitfalls

Using semaphore for mutex: Semaphore allows unlock from a different thread, enabling subtle bugs where error paths skip the release.
Holding mutex across copy_to/from_user: The copy may fault and sleep, which is fine, but be aware it holds the lock during the fault — ensure no lock ordering violation.
mutex_lock in atomic context: Will BUG() in debug kernels since mutex may sleep.
Forgetting mutex_lock_interruptible in syscalls: Using mutex_lock() (uninterruptible) in syscall code means the process cannot be killed while waiting. Use mutex_lock_interruptible() instead.

Real-World Failure Cases

Mars Pathfinder Priority Inversion (1997): The spacecraft's information bus management task was higher priority than a meteorological data task. Both used a shared mutex (without priority inheritance). A medium-priority communications task preempted the met-data task while it held the mutex, blocking the high-priority bus manager. The watchdog timer reset the spacecraft. Fixed in flight by enabling priority inheritance on the mutex via remote upload.

CVE-2014-3153: The futex FUTEX_REQUEUE operation could requeue futexes from a non-PI queue to a PI queue. A bug in the requeue path allowed manipulation of the kernel PI waiter structures, leading to arbitrary kernel memory write and privilege escalation. Fixed in kernel/futex.c by adding proper validation.

Modern Usage and Cloud-Scale

Lock-free alternatives: For highly contended mutexes, consider per-CPU data (eliminates sharing), RCU (for read-mostly), or lock-free data structures.
Mutex profiling: Linux CONFIG_LOCK_STAT and BPF-based mutex latency histograms (via bpftrace) are used in production to find hot mutexes.
Rust std::sync::Mutex: Wraps the OS mutex and, uniquely, wraps the data, making it impossible to access the data without holding the lock. Panics (rather than deadlocking) on re-entrant lock in same thread.

Future Directions

Lock elision via TSX: Proposed to speculatively execute critical sections without actually acquiring the mutex. Largely abandoned due to TSX bugs (see 11-transactional-memory.md).
Futex2: A redesigned futex API proposed for Linux to support 8/16/32/64-bit futex words and NUMA-aware wakeup.
User-space RCU for mutexes: For read-dominated workloads, replacing mutexes with RCU+CAS update eliminates all lock contention for readers.

Summary Table

Property	Semaphore	Mutex	PI Mutex	Robust Mutex
Ownership	None	Strict (acquirer must release)	Strict	Strict
Count	N (counting)	Binary	Binary	Binary
Priority inheritance	No	Optional	Yes	Yes
Process death handling	No	No	No	Yes (EOWNERDEAD)
Recursive	Yes (count)	No (deadlock)	No	No
ISR-safe	No (sleeps)	No (sleeps)	No	No
Linux type	`struct semaphore`	`struct mutex`	`struct rt_mutex`	futex+ROBUST

Exercises

Measure mutex fast/slow path: Write a program that locks/unlocks a pthread_mutex in a tight loop (uncontended). Use strace -c to confirm zero futex calls. Then add a second thread and observe futex calls appear under contention.
Priority inversion demo: On a real-time Linux system (SCHED_FIFO), create three threads at priorities 10, 20, 30. Have the low-priority thread (10) hold a mutex while the high-priority (30) waits. Show medium (20) preempts and causes latency. Then enable PTHREAD_PRIO_INHERIT and show it is fixed.
Robust mutex recovery: Write a program where a child process acquires a robust mutex and _exit()s. The parent should detect EOWNERDEAD and call pthread_mutex_consistent().
Kernel mutex tracing with ftrace: Using trace-cmd, record mutex lock/unlock events for a specific kernel subsystem (e.g., the VFS inode mutex). Identify the longest hold time.
Implement a counting semaphore using a mutex + condition variable: This is the standard implementation pattern. Verify correctness with a bounded buffer producer-consumer test.

References

Dijkstra, E.W. (1968). "Cooperating Sequential Processes." In Programming Languages (Genuys, ed.). Academic Press.
Hoare, C.A.R. (1974). "Monitors: An Operating System Structuring Concept." CACM 17(10):549-557.
Linux kernel source: kernel/locking/mutex.c, include/linux/mutex.h
Linux kernel source: kernel/locking/rtmutex.c (priority inheritance)
Linux kernel source: kernel/locking/semaphore.c
Linux kernel documentation: Documentation/locking/mutex-design.rst
Drepper, U. (2011). "Futexes are Tricky." Red Hat. https://www.akkadia.org/drepper/futex.pdf
Yodaiken, V. (2004). "Against Priority Inheritance." FSMLabs.