Mutexes and Semaphores
Overview
Mutexes and semaphores are the two fundamental sleeping synchronization primitives in operating systems. While both prevent concurrent access to shared resources, they differ critically in semantics, ownership model, and appropriate use cases. Edsger W. Dijkstra introduced the semaphore in 1965 as part of the THE multiprogramming system, defining the P (proberen, "to test") and V (verhogen, "to increment") operations that remain in use today. The mutex — a binary semaphore with strict ownership semantics and additional properties like priority inheritance — came later as programmers recognized that semaphores were too general and too easy to misuse. Understanding both primitives, their kernel implementations, and their performance characteristics is essential for writing correct and efficient systems software.
Prerequisites
- Understanding of spinlocks and busy-waiting
- Knowledge of scheduler data structures (wait queues)
- Familiarity with futex system call
- Understanding of priority inversion problem
Core Technical Content
Semaphore: Definition and Operations
A semaphore is an integer variable on which only two atomic operations are defined:
- P (down/wait/acquire): Decrement the counter. If the result is negative, block (add self to wait queue and sleep).
- V (up/signal/release): Increment the counter. If there were blocked threads (counter was negative), wake one.
struct semaphore {
int count; // negative = |count| threads waiting
wait_queue_t waiters;
};
void down(semaphore_t *sem) {
if (atomic_dec_return(&sem->count) < 0) {
// Add to wait queue, sleep
sleep_on(&sem->waiters);
}
}
void up(semaphore_t *sem) {
if (atomic_inc_return(&sem->count) <= 0) {
// Wake one waiter
wakeup_one(&sem->waiters);
}
}
Binary semaphore: Initialized to 1. P acquires, V releases. Functions like a mutex but lacks ownership.
Counting semaphore: Initialized to N. Allows up to N concurrent access. Used for resource pools (e.g., connection pools, buffer slots).
Linux Kernel Semaphore
Defined in include/linux/semaphore.h, implemented in kernel/locking/semaphore.c:
struct semaphore {
raw_spinlock_t lock;
unsigned int count;
struct list_head wait_list;
};
#define DEFINE_SEMAPHORE(name) \
struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1)
Key operations:
- down(sem): Uninterruptible wait (cannot be killed by signal while waiting)
- down_interruptible(sem): Returns -EINTR if signal received
- down_trylock(sem): Non-blocking attempt
- up(sem): Release
The kernel semaphore uses a raw_spinlock_t internally to protect the count and wait_list, then calls schedule() to sleep. This means the semaphore implementation has two layers: a spinlock for short critical section around state update, and sleeping for the actual wait.
Mutex: Definition
A mutex (mutual exclusion lock) is a binary semaphore with the following additional constraints:
- Ownership: Only the thread that acquired the mutex may release it.
- Non-recursive: A thread attempting to reacquire a mutex it already holds deadlocks (by default).
- Priority inheritance: The mutex holder inherits the priority of the highest-priority waiter (optional but standard in RT systems).
These constraints make mutexes safer and easier to reason about than semaphores, but less flexible.
Linux Kernel Mutex
Defined in include/linux/mutex.h, implemented in kernel/locking/mutex.c:
struct mutex {
atomic_long_t owner; // task_struct pointer of owner, or 0
raw_spinlock_t wait_lock; // protects wait_list
struct list_head wait_list;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
struct optimistic_spin_queue osq; // MCS-based spin queue
#endif
};
The owner field encodes both the owner pointer and flags in the low bits (since task_struct is aligned to at least 16 bytes on 64-bit systems, the low 4 bits are free):
owner bits:
[63:4] = task_struct pointer
[3] = WAITERS flag (waiters in wait_list)
[2] = PICKUP flag (lock being handed off)
[1] = HANDOFF flag
[0] = (unused/reserved)
Fast Path
void mutex_lock(struct mutex *lock) {
// Fast path: CAS NULL -> current task_struct
if (!__mutex_trylock_fast(lock))
__mutex_lock_slowpath(lock);
}
static inline bool __mutex_trylock_fast(struct mutex *lock) {
unsigned long curr = (unsigned long)current;
// atomic CAS: if owner == 0, set to curr
return atomic_long_try_cmpxchg_acquire(&lock->owner, &(unsigned long){0}, curr);
}
If the lock is uncontended (owner == 0), this is a single CAS with acquire semantics — no syscall, no spinlock, O(1).
Slow Path: Optimistic Spinning
Before sleeping, the kernel tries optimistic spinning: if the mutex owner is currently running on another CPU, spin for a short time on an MCS node (osq) hoping it will release quickly. This avoids the expensive context switch for short critical sections.
static noinline void __mutex_lock_slowpath(struct mutex *lock) {
// Try optimistic spinning (owner is running, maybe quick)
if (mutex_can_spin_on_owner(lock)) {
if (mutex_optimistic_spin(lock, ...))
return; // got lock
}
// Add to wait_list, sleep
__mutex_lock_common(lock, TASK_UNINTERRUPTIBLE, ...);
}
This "adaptive mutex" design (spin if owner is running, sleep otherwise) was popularized by Solaris and adopted by Linux. It greatly reduces mutex latency for short critical sections on multi-core systems.
Unlock
void mutex_unlock(struct mutex *lock) {
// Fast path: CAS owner -> 0 if no waiters
if (__mutex_unlock_fast(lock))
return;
__mutex_unlock_slowpath(lock, _RET_IP_);
}
If no waiters, unlock is a single atomic store. If there are waiters, wake the next waiter and hand off ownership.
POSIX pthread_mutex Internals
pthread_mutex_lock() in glibc's NPTL (nptl/pthread_mutex_lock.c) uses the futex system call:
// futex word layout (glibc NPTL):
// 0 = unlocked
// TID = locked by thread TID, no waiters
// TID | FUTEX_WAITERS = locked, threads waiting
int pthread_mutex_lock(pthread_mutex_t *mutex) {
int tid = gettid();
// Fast path: CAS 0 -> tid
if (cmpxchg(&mutex->__data.__lock, 0, tid) == 0)
return 0;
// Slow path: mark WAITERS bit, call futex(WAIT)
return __pthread_mutex_lock_full(mutex);
}
The futex(FUTEX_WAIT) system call puts the thread to sleep in the kernel if the futex word still has the expected value. futex(FUTEX_WAKE) on unlock wakes one waiter. See 07-futex.md for full details.
Priority Inheritance Mutex
Priority inversion occurs when a high-priority task is blocked waiting for a lock held by a low-priority task, and a medium-priority task preempts the lock holder. The high-priority task is effectively at the mercy of the medium-priority task.
High priority: [BLOCKED waiting for lock held by Low]
Medium: [RUNNING, preempted Low]
Low: [PREEMPTED while holding lock]
Result: High is blocked behind Medium, violating priority ordering.
Priority Inheritance (PI): When High blocks on Low's lock, Low's effective priority is raised to High's until it releases the lock. Linux implements PI mutexes (kernel/locking/rtmutex.c, include/linux/rtmutex.h):
struct rt_mutex {
raw_spinlock_t wait_lock;
struct rb_root_cached waiters; // priority-sorted red-black tree
struct task_struct *owner;
};
Waiters are in a red-black tree sorted by priority, so the highest-priority waiter can be found in O(log N). The PI chain propagation (boost Low's priority, and transitively boost if Low is waiting on another lock held by VeryLow) is implemented in __rt_mutex_slowlock().
PI futexes (FUTEX_LOCK_PI, FUTEX_UNLOCK_PI) bring this to userspace — glibc uses them for PTHREAD_MUTEX_ROBUST with PTHREAD_PRIO_INHERIT.
Robust Mutex
A robust mutex automatically releases when the owning process dies (crashes). The kernel maintains a list of robust mutexes per thread. On process death, the kernel marks them as "owner dead" so waiters can detect the failure:
// In glibc
pthread_mutex_t m;
pthread_mutexattr_t attr;
pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
pthread_mutex_init(&m, &attr);
// After lock holder dies, a waiter gets:
int ret = pthread_mutex_lock(&m);
// ret == EOWNERDEAD
pthread_mutex_consistent(&m); // mark as recovered
The kernel side is in kernel/futex.c: exit_robust_list() is called during do_exit() and walks the thread's robust list, unlocking each futex with a special "owner died" marker.
Linux Kernel Synchronization Hierarchy
Scenario Primitive
--------- ---------
Short CS, any context raw_spinlock_t
Short CS, process context spinlock_t
Long CS, process context mutex (kernel/locking/mutex.c)
Long CS, interruptible mutex_lock_interruptible()
Read-mostly data rwsem (kernel/locking/rwsem.c)
Many readers, rare writers RCU (kernel/rcu/)
Event signaling completion (kernel/locking/completion.c)
Resource counting semaphore (kernel/locking/semaphore.c)
A key rule from Documentation/locking/mutex-design.rst: "The kernel mutex has stricter semantics than a semaphore. Use a mutex where possible."
Specifically, kernel mutex cannot be used:
- From interrupt context (use spinlock)
- If mutex_unlock would be called from a different task than mutex_lock (use semaphore)
- If recursive locking is needed (avoid this design; use recursive_mutex if unavoidable)
rwsem (Read-Write Semaphore)
For workloads with many readers and few writers, rwsem (kernel/locking/rwsem.c) allows concurrent reads:
struct rw_semaphore {
atomic_long_t count; // high bits=readers, low bits=flags
atomic_long_t owner;
struct optimistic_spin_queue osq;
raw_spinlock_t wait_lock;
struct list_head wait_list;
};
The count encoding:
- RWSEM_UNLOCKED_VALUE (0): unlocked
- Positive: number of active readers
- RWSEM_FLAG_WAITERS: writers waiting
- RWSEM_WRITER_LOCKED: write lock held
Historical Context
Dijkstra introduced the semaphore (P and V operations) in his 1965 report on the THE system. The name comes from railroad semaphores — signals that indicate whether a track section is occupied.
The concept of a mutex with ownership was formalized in the context of monitors (Hoare, 1974; Hansen, 1973), though the explicit mutex abstraction emerged from real-time operating systems in the late 1970s and early 1980s, particularly in work on priority inversion (Mars Pathfinder's famous 1997 priority inversion bug predates Linux's PI mutex).
The POSIX thread standard (1995) formalized pthread_mutex_t with its various types and attributes.
Production Examples
- Linux kernel (
kernel/locking/mutex.c): Thestruct mutexis used for nearly all process-context synchronization in drivers, filesystems, and subsystems. - glibc NPTL:
pthread_mutex_lock()wraps futex, used by all POSIX programs. - Java synchronized/ReentrantLock: Uses OS mutexes via JNI at the slow path.
- Mars Pathfinder (1997): Priority inversion on a VxWorks mutex caused the spacecraft to reset. The fix was enabling priority inheritance on one mutex. This incident drove adoption of PI mutexes in embedded RTOS.
Debugging Notes
CONFIG_DEBUG_MUTEXES: Adds checks for lock ordering violations, double unlock, unlock-from-wrong-task.CONFIG_LOCKDEP: Tracks mutex acquisition order graph.cat /proc/locks: Shows all held POSIX locks (file locks, not mutex).strace -e futex: Shows futex system calls, reveals mutex contention (slow path).perf trace -e futex: Trace futex calls with timing.- High
FUTEX_WAITrate instraceindicates heavy mutex contention. - Mutex deadlock:
gdbthread apply all btto identify all threads and their wait states.
Security Implications
- Double-free via mutex misuse: If two threads both believe they hold a lock due to a logic error, double-free of shared data becomes possible.
- Priority inversion as DoS: In RT systems, a low-priority task can permanently block high-priority tasks if PI is not used. Historically exploited in embedded systems.
- Robust mutex bypass: If the
pthread_mutex_consistent()call is skipped after an EOWNERDEAD error, the mutex is left in an inconsistent state, potentially allowing future incorrect behavior. - CVE-2014-3153 (futex escalation): A bug in the futex requeue code (
futex_requeue()inkernel/futex.c) allowed privilege escalation by manipulating futex kernel wait queue. Affected Linux ≤3.14.5.
Performance Implications
- Uncontended mutex lock/unlock: ~10-30 ns (single CAS + memory fence), no syscall.
- Contended, owner running: ~50-500 ns (optimistic spin), no syscall.
- Contended, owner sleeping: ~1-10 µs (futex WAIT syscall, context switch pair).
- Mutex vs spinlock: For critical sections > 1-2 µs, mutex is better (sleeping is cheaper than spinning). For < 100 ns, spinlock is better.
- Priority inheritance overhead: O(depth) lock chain boost propagation. Can be expensive for deep PI chains.
Common Pitfalls
- Using semaphore for mutex: Semaphore allows unlock from a different thread, enabling subtle bugs where error paths skip the release.
- Holding mutex across copy_to/from_user: The copy may fault and sleep, which is fine, but be aware it holds the lock during the fault — ensure no lock ordering violation.
- mutex_lock in atomic context: Will BUG() in debug kernels since mutex may sleep.
- Forgetting
mutex_lock_interruptiblein syscalls: Usingmutex_lock()(uninterruptible) in syscall code means the process cannot be killed while waiting. Usemutex_lock_interruptible()instead.
Real-World Failure Cases
Mars Pathfinder Priority Inversion (1997): The spacecraft's information bus management task was higher priority than a meteorological data task. Both used a shared mutex (without priority inheritance). A medium-priority communications task preempted the met-data task while it held the mutex, blocking the high-priority bus manager. The watchdog timer reset the spacecraft. Fixed in flight by enabling priority inheritance on the mutex via remote upload.
CVE-2014-3153: The futex FUTEX_REQUEUE operation could requeue futexes from a non-PI queue to a PI queue. A bug in the requeue path allowed manipulation of the kernel PI waiter structures, leading to arbitrary kernel memory write and privilege escalation. Fixed in kernel/futex.c by adding proper validation.
Modern Usage and Cloud-Scale
- Lock-free alternatives: For highly contended mutexes, consider per-CPU data (eliminates sharing), RCU (for read-mostly), or lock-free data structures.
- Mutex profiling: Linux
CONFIG_LOCK_STATand BPF-based mutex latency histograms (viabpftrace) are used in production to find hot mutexes. - Rust std::sync::Mutex: Wraps the OS mutex and, uniquely, wraps the data, making it impossible to access the data without holding the lock. Panics (rather than deadlocking) on re-entrant lock in same thread.
Future Directions
- Lock elision via TSX: Proposed to speculatively execute critical sections without actually acquiring the mutex. Largely abandoned due to TSX bugs (see
11-transactional-memory.md). - Futex2: A redesigned futex API proposed for Linux to support 8/16/32/64-bit futex words and NUMA-aware wakeup.
- User-space RCU for mutexes: For read-dominated workloads, replacing mutexes with RCU+CAS update eliminates all lock contention for readers.
Summary Table
| Property | Semaphore | Mutex | PI Mutex | Robust Mutex |
|---|---|---|---|---|
| Ownership | None | Strict (acquirer must release) | Strict | Strict |
| Count | N (counting) | Binary | Binary | Binary |
| Priority inheritance | No | Optional | Yes | Yes |
| Process death handling | No | No | No | Yes (EOWNERDEAD) |
| Recursive | Yes (count) | No (deadlock) | No | No |
| ISR-safe | No (sleeps) | No (sleeps) | No | No |
| Linux type | struct semaphore |
struct mutex |
struct rt_mutex |
futex+ROBUST |
Exercises
-
Measure mutex fast/slow path: Write a program that locks/unlocks a
pthread_mutexin a tight loop (uncontended). Usestrace -cto confirm zerofutexcalls. Then add a second thread and observefutexcalls appear under contention. -
Priority inversion demo: On a real-time Linux system (
SCHED_FIFO), create three threads at priorities 10, 20, 30. Have the low-priority thread (10) hold a mutex while the high-priority (30) waits. Show medium (20) preempts and causes latency. Then enablePTHREAD_PRIO_INHERITand show it is fixed. -
Robust mutex recovery: Write a program where a child process acquires a robust mutex and
_exit()s. The parent should detectEOWNERDEADand callpthread_mutex_consistent(). -
Kernel mutex tracing with ftrace: Using
trace-cmd, record mutex lock/unlock events for a specific kernel subsystem (e.g., the VFS inode mutex). Identify the longest hold time. -
Implement a counting semaphore using a mutex + condition variable: This is the standard implementation pattern. Verify correctness with a bounded buffer producer-consumer test.
References
- Dijkstra, E.W. (1968). "Cooperating Sequential Processes." In Programming Languages (Genuys, ed.). Academic Press.
- Hoare, C.A.R. (1974). "Monitors: An Operating System Structuring Concept." CACM 17(10):549-557.
- Linux kernel source:
kernel/locking/mutex.c,include/linux/mutex.h - Linux kernel source:
kernel/locking/rtmutex.c(priority inheritance) - Linux kernel source:
kernel/locking/semaphore.c - Linux kernel documentation:
Documentation/locking/mutex-design.rst - Drepper, U. (2011). "Futexes are Tricky." Red Hat. https://www.akkadia.org/drepper/futex.pdf
- Yodaiken, V. (2004). "Against Priority Inheritance." FSMLabs.