Skip to content

Futex: Fast Userspace Mutex

Overview

The futex (Fast Userspace muTEX) is the fundamental synchronization mechanism underlying virtually all user-space synchronization primitives on Linux — mutexes, semaphores, condition variables, and barriers. Its key insight is that uncontended lock operations should require no system call at all: the lock state is maintained in a user-space integer variable, and the kernel is only involved in the slow path when a thread must actually sleep. Invented by Hubertus Franke, Rusty Russell, and Matthew Kirkwood at IBM and published in 2002, futex solved the long-standing problem of efficient user-space synchronization on Linux by providing a minimal kernel interface: just enough to put threads to sleep and wake them up, with the user-space library managing all lock logic.

Prerequisites

  • Understanding of system calls and the user-kernel boundary
  • Knowledge of atomic operations (CAS, fetch-and-add)
  • Familiarity with kernel wait queues
  • Understanding of why mutexes need kernel support for sleeping

Core Technical Content

The Core Insight

Before futex, user-space synchronization either used inefficient syscalls (semget/semop from System V IPC) for every lock/unlock, or used POSIX shared-memory plus a busy-wait spin. Syscall overhead is ~100-1000 ns — too slow for highly contended applications. Futex solved this with a split design:

User Space:                    Kernel Space:
+--------------------+         +--------------------+
| futex_word (int)   | <-----> | futex hash table   |
| CAS/atomic ops     |         | wait queues        |
| lock/unlock logic  |         | futex_wait()       |
+--------------------+         | futex_wake()       |
         |                     +--------------------+
         | Only if contended
         v
    syscall(futex, ...)

The futex_word is a plain integer in user-space memory (mapped shared or private). The kernel never touches it except in futex_wait() where it atomically checks the value.

The Futex System Call

#include <linux/futex.h>
#include <syscall.h>

long futex(uint32_t *uaddr, int futex_op, uint32_t val,
           const struct timespec *timeout,
           uint32_t *uaddr2, uint32_t val3);

Primary operations:

FUTEX_WAIT: If *uaddr == val, put the calling thread to sleep on the queue for uaddr. The check-and-sleep is atomic from the kernel's perspective (via the hash bucket spinlock), preventing the "missed wake" race.

// Thread wants to sleep:
// 1. Set WAITERS bit in futex_word
// 2. Re-check that lock is still held
// 3. Call futex_wait
syscall(SYS_futex, &futex_word, FUTEX_WAIT, expected_val, NULL, NULL, 0);
// Returns 0 on wake, -ETIMEDOUT on timeout, -EINTR on signal

FUTEX_WAKE: Wake at most val threads sleeping on uaddr.

syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1 /*wake one*/, NULL, NULL, 0);

FUTEX_REQUEUE: Atomically wake val threads on uaddr and move remaining waiters to uaddr2. Prevents thundering herd on condition variable broadcast (broadcast wakes one, requeues rest to mutex).

FUTEX_CMP_REQUEUE: Like REQUEUE, but atomically checks *uaddr == val3 before requeuing. Used in condition variable broadcast.

FUTEX_WAIT_BITSET / FUTEX_WAKE_BITSET: Wait/wake only threads matching a specific bitset. Used in condition variable timed waits (32 different clocks, matching clocks via bitset).

FUTEX_FD (deprecated): Was used to integrate futex wakeup with select()/poll().

The Futex Fast Path (No Syscall)

For a mutex:

// Lock (simplified glibc NPTL pthread_mutex_lock):
int tid = gettid();
if (cmpxchg(&futex_word, 0, tid) == 0) {
    // SUCCESS: fast path, no syscall
    return 0;
}
// Slow path: contended
mutex_lock_slowpath(&futex_word, tid);

For unlock:

// Unlock:
if (atomic_exchange(&futex_word, 0) == tid) {
    // No waiters bit set, no syscall needed
    return 0;
}
// Waiters present: call futex_wake
syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);

The WAITERS bit encoding allows unlock to know whether to call the kernel: if no thread is sleeping, no syscall is needed.

Futex Word Encoding (glibc mutex)

glibc's pthread_mutex_t uses the __data.__lock field as the futex word:

Bit layout of futex_word:
31           1    0
+------------+----+
| owner TID  | W  |
+------------+----+
             W=1: WAITERS present (others sleeping)

Value 0:   unlocked
Value TID: locked by TID, no waiters
Value TID|FUTEX_WAITERS: locked by TID, waiters sleeping

Mutex lock slowpath:

void mutex_lock_slowpath(int *futex_word, int tid) {
    int c;
    // Mark that we are waiting
    while ((c = cmpxchg(futex_word, 0, tid | FUTEX_WAITERS)) != 0) {
        if (c != tid | FUTEX_WAITERS)
            atomic_or(futex_word, FUTEX_WAITERS);
        // Sleep if futex_word still has WAITERS set
        syscall(SYS_futex, futex_word, FUTEX_WAIT,
                *futex_word, NULL, NULL, 0);
    }
}

Futex Hash Table (Kernel Internals)

The kernel maintains a global hash table of futex wait queues in kernel/futex/core.c:

// Simplified
struct futex_hash_bucket {
    atomic_t waiters;
    spinlock_t lock;
    struct plist_head chain;  // priority-sorted waiters
};

static struct futex_hash_bucket futex_queues[1 << FUTEX_HASHBITS];
// FUTEX_HASHBITS = 8 (256 buckets) on 32-bit, up to 13 (8192) on 64-bit

futex_wait() kernel path: 1. Compute hash = hash_long((unsigned long)uaddr, FUTEX_HASHBITS). 2. Lock futex_queues[hash].lock (spinlock). 3. Read *uaddr (with a get_user() to handle page faults). 4. If *uaddr != expected_val, return EWOULDBLOCK immediately. 5. Enqueue struct futex_q in the chain. 6. Release the bucket spinlock. 7. schedule() — sleep.

This approach ensures atomicity: the check of *uaddr and the enqueueing happen under the bucket spinlock, so a concurrent futex_wake() cannot miss the sleeper.

For private (process-local) futexes (FUTEX_PRIVATE_FLAG), the hash is based on the physical page address, allowing more efficient hashing and avoiding cross-process wakeups.

Robust Futex

A robust futex automatically becomes "dead owner" when the owning thread exits without releasing it. The kernel implements this via a per-thread robust list:

// Thread sets up robust list via prctl(PR_SET_ROBUST_LIST):
struct robust_list_head {
    struct robust_list list;     // linked list of locked robust futexes
    long futex_offset;           // offset of futex word in enclosing struct
    struct robust_list *list_op_pending; // in-progress operation
};

On do_exit() (kernel/exit.c), exit_robust_list() walks this list and for each entry: 1. Sets the FUTEX_OWNER_DIED bit in the futex word. 2. Calls futex_wake() to wake one waiter. 3. The woken waiter receives EOWNERDEAD from pthread_mutex_lock().

Robust futex is used by glibc's PTHREAD_MUTEX_ROBUST and database connection pools that need to handle server process crashes.

Priority Inheritance Futex (PI Futex)

PI futex (FUTEX_LOCK_PI, FUTEX_UNLOCK_PI, FUTEX_TRYLOCK_PI) implements priority inheritance in user space with kernel support:

// PI futex word encoding:
// Bit 30: FUTEX_WAITERS
// Bit 31: FUTEX_OWNER_DIED
// Bits 0-29: TID of owner

// Acquire PI futex:
if (cmpxchg(&futex_word, 0, tid) != 0) {
    // Contended: let kernel handle priority inheritance
    syscall(SYS_futex, &futex_word, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

In the kernel, futex_lock_pi() (kernel/futex/pi.c): 1. Identifies the owner thread (from TID in futex word). 2. Boosts owner's priority to waiter's priority. 3. Handles PI chain propagation (if owner is also waiting on another PI futex). 4. Puts waiter to sleep in a priority-sorted wait queue.

glibc uses PI futex for PTHREAD_MUTEX_ROBUST | PTHREAD_PRIO_INHERIT mutex type, essential for real-time applications and embedded systems.

Futex-Based Implementations

pthread_mutex:

lock:   CAS(futex, 0, tid)  --> fast path (no syscall)
        CAS fails?          --> futex_wait slowpath
unlock: XCHG(futex, 0)     --> if WAITERS bit: futex_wake

pthread_semaphore:

down:   fetch_add(futex, -1)  --> if result < 0: futex_wait
up:     fetch_add(futex, 1)   --> if old was < 0: futex_wake

pthread_cond_wait (glibc implementation):

1. Increment cond->__data.__wakeup_seq (under mutex)
2. Unlock mutex
3. futex_wait(cond->futex, expected_seq)
4. Lock mutex on wakeup

Broadcast uses FUTEX_CMP_REQUEUE to atomically wake one waiter and requeue the rest to the mutex futex, preventing all waiters from piling onto the mutex simultaneously.

pthread_barrier:

count--; if count > 0: futex_wait on barrier futex
else (last thread): futex_wake(INT_MAX) all waiters

futex2: The Proposed New API

Linux 5.16+ development saw proposals for futex2() (kernel/futex/futex2.c), addressing futex limitations: - Wait on multiple futexes simultaneously (like select()). - Support 8/16/64-bit futex words (not just 32-bit). - Per-futex timeout clocks. - NUMA-aware wakeup (wake threads on specific NUMA nodes).

As of Linux 6.x, some futex2 features have been merged incrementally.

Historical Context

Futex was invented by Hubertus Franke, Rusty Russell, and Matthew Kirkwood, presented at the Ottawa Linux Symposium in 2002: "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux." The name "FUSS" (Fast Userspace Synchronization) was the original working name.

Prior to futex, Linux's user-space synchronization relied on LinuxThreads (based on clone()) and slow semaphore syscalls. The introduction of futex enabled NPTL (Native POSIX Thread Library) by Ulrich Drepper and Ingo Molnar in 2003, replacing LinuxThreads with a POSIX-compliant, high-performance threading library.

The robust futex mechanism was added in Linux 2.6.17 (2006) by Ingo Molnar and Thomas Gleixner.

PI futex was added in Linux 2.6.18 (2006) by Thomas Gleixner, Ingo Molnar, and Arjan van de Ven.

Production Examples

  • glibc NPTL: All pthread_mutex_*, pthread_cond_*, pthread_sem_*, pthread_barrier_* use futex. This covers virtually all multi-threaded C/C++ programs.
  • Java HotSpot JVM: The JVM's monitor implementation uses futex on Linux for synchronized blocks and Object.wait().
  • Go runtime: Uses futex directly (runtime/os_linux.go) for goroutine parking/unparking.
  • Rust std: std::sync::Mutex uses parking_lot crate or OS futex.
  • Wine: The Windows synchronization objects (mutex, event, semaphore) are emulated using futex on Linux.
  • PostgreSQL: Uses futex-based semaphores for inter-process synchronization in shared memory.

Debugging Notes

  • strace -e futex: Shows all futex syscalls with arguments. High FUTEX_WAIT frequency = contention.
  • perf trace -e futex: Similar to strace but lower overhead.
  • /proc/PID/status: voluntary_ctxt_switches and nonvoluntary_ctxt_switches show blocking behavior.
  • BPF/bpftrace: Trace futex wait durations with custom scripts:
# Trace futex wait time > 1ms:
bpftrace -e '
tracepoint:syscalls:sys_enter_futex /args->futex_op == 128/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex /retval == 0 && @start[tid]/ {
    $lat = (nsecs - @start[tid]) / 1000;
    if ($lat > 1000) { printf("PID %d waited %d us\n", pid, $lat); }
    delete(@start[tid]);
}'
  • futex: hash collision kernel message: If the hash table is too small for the workload, collisions cause spurious wakeups and performance degradation.
  • Deadlock in futex code: If a thread calls FUTEX_WAIT on a futex_word that will never change to a different value, it sleeps forever. Always double-check the WAKEUP condition.

Security Implications

  • CVE-2014-3153: futex_requeue() could requeue from a non-PI queue to a PI queue, leading to type confusion in the kernel PI waiter structures. An attacker could use this to write arbitrary values to kernel memory and escalate privileges. The bug was in kernel/futex.c, futex_requeue(). Patch: validate queue type before requeue.

  • CVE-2021-3153 / CVE-2021-26708: Multiple vulnerabilities in kernel/futex.c in the 2020-2021 timeframe related to futex WAITV handling and edge cases in PI futex chains.

  • User-space futex address manipulation: An attacker who can control the uaddr argument to futex() might be able to put threads to sleep waiting on kernel-space addresses (rejected by the kernel, which validates user-space ranges), or manipulate the futex word of another process's mutex (only possible via shared memory, which requires explicit setup).

  • Robust futex list poisoning: If an attacker can modify the robust list head in the target thread's task_struct, they can cause arbitrary wake/dead-owner notifications on process exit.

Performance Implications

  • Uncontended mutex lock: ~3-10 ns (single CAS on x86, stays entirely in userspace).
  • Contended mutex lock (owner running): Optimistic spin for ~1-10 µs, then futex_wait.
  • Contended mutex lock (owner sleeping): futex_wait + schedule() ≈ 2-10 µs (two context switches).
  • Futex hash table: 256-8192 buckets. Many threads contending on different mutexes that hash to the same bucket can cause false contention (each bucket has its own spinlock). Use FUTEX_PRIVATE_FLAG for process-private futexes (different hash function).
  • FUTEX_WAKE cost: O(N) where N = threads woken. For broadcast of 1000 threads, this is significant. FUTEX_CMP_REQUEUE amortizes this.

Common Pitfalls

  1. Not using FUTEX_PRIVATE_FLAG: For mutexes that are not shared between processes, forgetting FUTEX_PRIVATE_FLAG uses a global hash table that is shared with all processes, causing unnecessary contention.
  2. Spurious wakeup not handled: futex_wait() can return 0 even if no futex_wake() was called (e.g., signal delivery). Always re-check the condition after wakeup.
  3. Race in robust futex list registration: If a thread crashes between registering a futex and actually locking it, the dead-owner notification fires prematurely. Robust futex's list_op_pending field handles this case.
  4. Futex on stack: Futex requires the futex word to remain at a stable virtual address. A futex word on a thread's stack is problematic if the stack grows or is unmapped (e.g., thread exit without unlock).
  5. 32-bit futex word assumption: Standard futex is 32-bit. Building a 64-bit counter on top requires two 32-bit futex words or the futex2 API.

Real-World Failure Cases

CVE-2014-3153 (Towelroot exploit): Demonstrated by George "geohot" Hotz in May 2014. The futex requeue bug allowed local privilege escalation on Android and Linux. The exploit queued a futex on a specific address, then used the buggy requeue to corrupt a kernel PI waiter structure, leading to an arbitrary kernel write. Patched in Linux 3.14.5.

glibc robust mutex data corruption (2012): A bug in glibc 2.15's implementation of robust mutexes caused the list_op_pending field to be incorrectly cleared, meaning a crash between lock registration and actual lock acquisition would not trigger the EOWNERDEAD notification. Fixed in glibc 2.16.

Java HotSpot futex starvation (OpenJDK 2009): A bug in the JVM's Linux futex usage for monitor wait/notify caused starvation under specific thread scheduling patterns. Threads blocked in Object.wait() could remain blocked even after Object.notifyAll() because the futex WAKE count was set to 1 instead of INT_MAX.

Modern Usage and Cloud-Scale

  • Containers and futex namespacing: Futex keys are based on physical page addresses, so two containers sharing the same page (via copy-on-write) could interfere with each other's futexes. Linux 5.7 added FUTEX_32 flag to improve NUMA locality.
  • NUMA-aware futex: Waking threads on a remote NUMA node is expensive. The futex2 FUTEX_NUMA_FLAG allows specifying which NUMA node to wake threads on.
  • io_uring and futex: Linux 6.7 added IORING_OP_FUTEX_WAIT and IORING_OP_FUTEX_WAKE to allow futex operations in io_uring submission queues, enabling async futex patterns.

Future Directions

  • futex2: Multi-wait, 64-bit words, NUMA-aware wakeup.
  • io_uring futex integration: Async synchronization without blocking a thread.
  • Kernel bypass synchronization: DPDK and RDMA applications use shared memory atomics without any kernel involvement for maximum throughput — the logical extreme of futex's "kernel only when needed" design.

Summary Table

Futex Operation Kernel Involvement Use Case
FUTEX_WAIT Syscall + sleep Mutex slow path, semaphore down
FUTEX_WAKE Syscall Mutex unlock with waiters
FUTEX_REQUEUE Syscall Cond broadcast (requeue to mutex)
FUTEX_CMP_REQUEUE Syscall Cond broadcast (atomic requeue)
FUTEX_LOCK_PI Syscall PI mutex slow path
FUTEX_UNLOCK_PI Syscall PI mutex unlock with waiters
Fast path (CAS) None (user-space) Uncontended lock/unlock

Exercises

  1. Implement a futex mutex from scratch: Write a futex_mutex_t using syscall(SYS_futex, ...) directly (not through glibc). Implement lock, trylock, and unlock. Test with 4 threads and verify no data races.

  2. Strace a mutex benchmark: Run a contended pthread_mutex benchmark under strace -e futex -c. Identify the ratio of futex syscalls to total lock/unlock operations. Observe how the ratio changes as contention increases.

  3. Robust mutex recovery: Write a server that holds a robust mutex in shared memory. Kill it with SIGKILL mid-operation. Have a client recover the mutex using EOWNERDEAD handling. Verify no deadlock.

  4. Futex hash collision measurement: Using perf stat -e lock-inst, measure performance of 256 threads each locking a different pthread_mutex_t. Check if collisions in the futex hash table cause performance anomalies by comparing aligned vs unaligned mutex arrays.

  5. PI futex chain: Create a three-thread priority inversion scenario (low=SCHED_FIFO prio 10, medium=20, high=30). Lock a PI mutex (PTHREAD_MUTEX_DEFAULT with PTHREAD_PRIO_INHERIT). Verify via top -H that the low-priority thread's priority is boosted when the high-priority thread blocks on it.

References

  • Franke, H., Russell, R., & Kirkwood, M. (2002). "Fuss, Futexes and Furwocks: Fast Userlevel Locking in Linux." Ottawa Linux Symposium 2002.
  • Drepper, U. (2011). "Futexes Are Tricky." Red Hat. https://www.akkadia.org/drepper/futex.pdf
  • Linux kernel source: kernel/futex/core.c, kernel/futex/pi.c, include/uapi/linux/futex.h
  • Linux kernel source: kernel/exit.c (exit_robust_list())
  • glibc source: nptl/pthread_mutex_lock.c, nptl/pthread_cond_wait.c
  • CVE-2014-3153: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-3153
  • Molnar, I. (2006). "Robust Futexes." LWN.net. https://lwn.net/Articles/172149/