Skip to content

Condition Variables

Overview

A condition variable is a synchronization primitive that allows threads to efficiently wait for a specific condition to become true, while atomically releasing a mutex during the wait. Without condition variables, a thread would have to busy-poll or sleep for fixed intervals to check whether a condition has been met — wasting CPU or suffering latency. Condition variables solve this by pairing with a mutex: the waiting thread atomically releases the mutex and suspends, and the signaling thread uses the mutex to safely modify shared state before waking the waiter. The concept was introduced by C.A.R. Hoare in his 1974 paper on monitors, and remains the canonical building block for producer-consumer systems, thread pools, event loops, and any system where threads coordinate around state changes.

Prerequisites

  • Understanding of mutexes and why they are needed
  • Familiarity with producer-consumer problem
  • Basic POSIX threading (pthread_mutex_t)
  • Understanding of spurious wakeups and why while not if is correct

Core Technical Content

The Core Abstraction

A condition variable has three operations: - wait(cond, mutex): Atomically release mutex and suspend. On wakeup, re-acquire mutex before returning. - signal(cond): Wake one waiting thread. - broadcast(cond): Wake all waiting threads.

The atomic release-and-suspend is crucial. Without it, there is a window between releasing the mutex and suspending where the signaling thread could fire, missing the waiter.

Mesa vs Hoare Semantics

Hoare semantics (original 1974 paper): When signal() is called, the signaling thread is immediately preempted and the woken thread runs immediately with the condition guaranteed. The woken thread does not need to re-check the condition because no other thread could have run between the signal and the wakeup.

Mesa semantics (Lampson & Redell, 1980, Mesa language): signal() merely moves a thread from the wait queue to the run queue. The woken thread competes for the mutex and may not run immediately. Another thread may run first and change the condition. Therefore, the woken thread must re-check the condition in a while loop:

// Hoare semantics (WRONG for POSIX):
pthread_cond_wait(&cond, &mutex);
if (condition_true) { ... }  // condition guaranteed

// Mesa semantics (CORRECT for POSIX):
while (!condition_true) {
    pthread_cond_wait(&cond, &mutex);
}
// condition is now guaranteed true

All POSIX condition variables use Mesa semantics. This is also why spurious wakeups are permitted.

Spurious Wakeups

POSIX explicitly allows pthread_cond_wait() to return even when no signal() or broadcast() was called. This is not a bug — it is a deliberate concession to implementation efficiency on multiprocessors and cluster systems.

Reasons spurious wakeups exist: 1. Signal delivery during wait: On Linux, pthread_cond_wait() uses a futex internally. If a signal is delivered to the thread, it may be woken up to handle the signal and then returned to waiting — but futex(WAIT) returns EINTR which surfaces as a spurious wakeup. 2. Implementation simplification: On some architectures and cluster systems, it is expensive to guarantee exactly-once wakeup semantics. Allowing spurious wakeups simplifies the implementation. 3. Memory model freedom: On relaxed-memory architectures, ensuring that a woken thread sees consistent state requires fences; spurious wakeups allow deferring this to the while-loop re-check.

The mandatory pattern:

pthread_mutex_lock(&mutex);
while (!condition) {           // ALWAYS while, NEVER if
    pthread_cond_wait(&cond, &mutex);
}
// condition is now true, mutex is held

Bounded Buffer (Producer-Consumer)

The canonical condition variable example:

#define N 16
int buffer[N];
int head = 0, tail = 0, count = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

void produce(int item) {
    pthread_mutex_lock(&mutex);
    while (count == N) {                          // buffer full
        pthread_cond_wait(&not_full, &mutex);     // wait for space
    }
    buffer[tail] = item;
    tail = (tail + 1) % N;
    count++;
    pthread_cond_signal(&not_empty);              // wake a consumer
    pthread_mutex_unlock(&mutex);
}

int consume(void) {
    pthread_mutex_lock(&mutex);
    while (count == 0) {                          // buffer empty
        pthread_cond_wait(&not_empty, &mutex);    // wait for item
    }
    int item = buffer[head];
    head = (head + 1) % N;
    count--;
    pthread_cond_signal(&not_full);               // wake a producer
    pthread_mutex_unlock(&mutex);
    return item;
}

Note the use of two separate condition variables (not_empty and not_full) rather than one. Using a single condition variable and broadcast() for both producer and consumer wakeups is simpler but causes unnecessary wakeups.

Signal vs Broadcast

  • pthread_cond_signal(): Wake exactly one waiting thread. Use when waking any single waiter is sufficient (e.g., "there is one new item in the queue").
  • pthread_cond_broadcast(): Wake all waiting threads. Use when the condition change potentially affects all waiters (e.g., "shutdown initiated", "configuration changed").

When to use broadcast: 1. Multiple threads are waiting on different sub-conditions under the same condvar (common pattern: one condvar for "state changed", each thread checks its own condition). 2. Exactly-which-waiter-to-wake is semantically important and signal() might wake the wrong one. 3. Shutdown/teardown: all threads must be notified.

Thundering Herd Problem

broadcast() wakes all waiting threads, but only one can hold the mutex at a time. They all compete, all but one go back to sleep. On a system with 1000 threads waiting on a condition:

broadcast() called
    |
    v
1000 threads wake up
    |
    v
All 1000 try to acquire mutex
    |
    v
1 succeeds, 999 immediately block again
    |
    v
999 more context switches (wasted)

This is the thundering herd. Mitigations: - Use signal() instead of broadcast() when one waiter is sufficient. - Use FUTEX_WAKE with limited count (Linux futex2 API). - Design the wait condition so that each woken thread can make progress (e.g., thread pool with work queue: each woken thread pulls one item and does not need to block again). - Linux kernel: wake_up_nr() wakes exactly N tasks, avoiding full broadcast.

Monitor Pattern

A monitor is a higher-level pattern encapsulating shared state, a mutex, and condition variables:

+----------------------------------+
|           MONITOR                |
|  +--------+    +--------+        |
|  | mutex  |    | state  |        |
|  +--------+    +--------+        |
|  +--------+    +--------+        |
|  | cond1  |    | cond2  |        |
|  +--------+    +--------+        |
|  method1() { lock; check; wait } |
|  method2() { lock; modify; sig } |
+----------------------------------+

Java's synchronized blocks implement a monitor: every object has an implicit mutex and a single condition variable (accessed via wait(), notify(), notifyAll()). The limitation of one condvar per object led to java.util.concurrent.locks.Condition which allows multiple condition variables per lock.

Linux Kernel: wait_event / wake_up Family

The kernel does not use POSIX condvars but has an equivalent: wait queues (struct wait_queue_head) with wait_event* / wake_up* macros:

// Declaration
DECLARE_WAIT_QUEUE_HEAD(wq);

// Wait (process context only):
wait_event(wq, condition);              // uninterruptible
wait_event_interruptible(wq, cond);     // returns -ERESTARTSYS on signal
wait_event_timeout(wq, cond, timeout);  // timeout in jiffies

// Wake:
wake_up(&wq);            // wake one exclusive + all non-exclusive
wake_up_all(&wq);        // wake all
wake_up_interruptible(&wq); // only wake TASK_INTERRUPTIBLE threads

Internally, wait_event(wq, cond) expands to:

// Simplified expansion of wait_event(wq, condition)
DEFINE_WAIT(wait);
for (;;) {
    prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);
    if (condition)
        break;
    schedule();  // sleep
}
finish_wait(&wq, &wait);

This is the kernel's Mesa-semantics condvar: the condition is re-checked in a loop (for (;;) with if (condition) break).

The kernel also has completion (include/linux/completion.h), a simplified one-shot condvar useful for "wait for initialization to finish":

DECLARE_COMPLETION(done);
// Thread 1:
wait_for_completion(&done);
// Thread 2 (when done):
complete(&done);

POSIX Implementation Details

pthread_cond_wait() in glibc (nptl/pthread_cond_wait.c) is implemented using futex:

  1. Atomically add self to the cond's internal waiter count.
  2. Unlock the mutex (via pthread_mutex_unlock()).
  3. Call futex(FUTEX_WAIT_BITSET, ...) on a per-condvar futex word.
  4. On wakeup (real or spurious), reacquire the mutex.

The "atomic release and wait" property is achieved because futex(WAIT) checks the futex word against an expected value before sleeping. If pthread_cond_signal() fired between step 2 and step 3, the futex word would already have changed and futex(WAIT) returns immediately.

A key glibc implementation challenge: cancellation safety. If a thread is cancelled while inside pthread_cond_wait(), it must reacquire the mutex before running cancellation handlers (POSIX requirement). The glibc implementation uses a cleanup handler that re-acquires the mutex.

Historical Context

C.A.R. Hoare introduced condition variables in "Monitors: An Operating System Structuring Concept" (CACM, 1974), with Hoare semantics (signal-and-urgent-wait). Per Brinch Hansen independently developed a similar concept.

Butler Lampson and David Redell published "Experience with Processes and Monitors in Mesa" (CACM, 1980) describing the Mesa language's simpler semantics, which became the dominant approach because Hoare semantics are difficult to implement efficiently on real hardware.

POSIX standardized pthread_cond_* in 1995, adopting Mesa semantics and explicitly permitting spurious wakeups.

Production Examples

  • glibc NPTL: pthread_cond_wait() is the basis of most blocking synchronization in userspace.
  • Linux kernel: wait_event_* / wake_up_* are used in virtually every subsystem — disk I/O completion, network socket data availability, pipe read/write, process exit notification, etc.
  • Java: Object.wait()/notify() and java.util.concurrent.locks.Condition power BlockingQueue, CountDownLatch, Semaphore, etc.
  • Go: sync.Cond implements condition variables; however, Go idioms prefer channels for inter-goroutine communication.
  • PostgreSQL: Uses ConditionVariable (PG-internal, src/backend/storage/lmgr/condvar.c) for buffer manager, lock manager, and parallel query coordination.

Debugging Notes

  • Missed signal: If signal() is called before wait(), the signal is lost (condvars are not counted). Always use a while-loop with a condition variable and a state variable. The state variable persists; the condvar wakeup does not.
  • Deadlock with condvar: If the mutex is not released inside wait(), broadcast will hang. If wait() is called without holding the mutex, behavior is undefined (and pthread_cond_wait() will return EPERM in glibc on error-checking mutexes).
  • helgrind --tool=helgrind: Detects condvar usage without mutex, and lock order violations.
  • strace -e futex: Reveals all condvar waits as futex(FUTEX_WAIT_BITSET).
  • Thundering herd detection: perf sched can show mass context switches following a broadcast().
  • In the kernel, ftrace with sched:sched_wakeup event shows all wake_up() calls and which tasks were woken.

Security Implications

  • Signal-before-wait race (missing signal): A security-critical event (e.g., "new auth token available") signaled before the receiver calls wait() is lost. The receiver may wait indefinitely or miss an authorization window. Always combine condvar with a state variable protected by the same mutex.
  • Broadcast DoS: An attacker with ability to call pthread_cond_broadcast() on a shared condvar (e.g., via shared memory) can cause thundering herd DoS in a multi-threaded server.
  • Cancellation and resource leaks: If a thread waiting on a condvar is cancelled without a proper cleanup handler, it may leave the mutex locked or resources unreleased.

Performance Implications

  • Uncontended signal+wake: ~2-5 µs (mutex lock, signal, mutex unlock, futex wakeup, waiter resumes).
  • Broadcast to N threads: O(N) futex wakeups, O(N) context switches, O(N) mutex acquisitions serialized.
  • Avoid condvar for high-frequency events: A condvar wakeup is expensive (syscall + context switch). For very high event rates, use lock-free queues or eventfd with epoll.
  • Lock granularity: Using one mutex for both "not_empty" and "not_full" conditions (single condvar pattern) forces producers and consumers to serialize, even though they could operate on different parts of the buffer.

Common Pitfalls

  1. if instead of while: The single most common condvar bug. Using if instead of while to check the condition means a spurious wakeup causes incorrect behavior.
  2. Signal outside mutex: pthread_cond_signal() can technically be called outside the mutex, but this creates a race: the signal may fire between the condition check and the wait() call in the waiter.
  3. Broadcast for single-consumer: Using broadcast() when only one consumer needs to wake is wasteful (thundering herd).
  4. Holding mutex too long: Keeping the mutex locked while doing expensive work (I/O, memory allocation) delays signalers unnecessarily.
  5. Condvar without associated state variable: Using wait() without a condition to check — relying solely on signal() to indicate something happened. This loses signals if the producer fires before the consumer waits.

Real-World Failure Cases

glibc condvar glibc-2.3.x bug (2004): An early NPTL implementation had a race condition in pthread_cond_wait() where a broadcast could be missed if it arrived between the mutex unlock and the futex wait. Fixed by using a futex-sequence counter.

MySQL connection pool thundering herd (pre-5.7): The MySQL thread cache used a single pthread_cond_broadcast() to wake all idle threads when a new connection arrived. Under high connection rates, this caused severe contention. MySQL 5.7 restructured this to use signal() to wake exactly one thread.

Linux kernel completion misuse (various drivers): Several drivers incorrectly used complete() as a general-purpose event and then waited with wait_for_completion() without ensuring the complete happened-before the wait, causing indefinite hangs. Added wait_for_completion_timeout() as a safer alternative.

Modern Usage and Cloud-Scale

  • Go channels: Go idiom prefers channels over condvars for inter-goroutine communication. Channels implement a typed bounded buffer with condvar-like blocking semantics but with select-based multiplexing.
  • Rust Condvar: std::sync::Condvar in Rust is similar to POSIX but the borrow checker prevents using the condvar without the associated MutexGuard, making the "signal without mutex" error impossible at compile time.
  • io_uring and eventfd: For high-performance event notification (network, disk), eventfd + epoll is preferred over condvars because it supports multiplexing many event sources without per-source threads.
  • Kernel io_uring: Uses kernel wait queues internally, but exposes completion via ring buffer polling, avoiding condvar overhead for high-frequency I/O events.

Future Directions

  • Futex2: A proposed Linux kernel API (futex2()) supports waiting on multiple futexes simultaneously (like select() for locks), enabling more efficient condvar implementations.
  • C++20 std::atomic::wait(): Provides atomic wait/notify as a first-class language feature, often implementable without a mutex via futex, avoiding lock overhead for simple state changes.
  • Structured concurrency: Languages like Swift (with actors) and Kotlin (with coroutines) replace condvars with structured async/await patterns that are easier to reason about.

Summary Table

Property POSIX condvar Java Object.wait Linux wait_event Go channel
Semantics Mesa Mesa Mesa N/A (typed)
Spurious wakeups Yes (allowed) Yes (allowed) No (macro loop) No
Signal/broadcast Both notify/notifyAll wake_up/wake_up_all send/close
Associated lock Explicit mutex Object monitor Implicit (macro) Internal
Timeout support Yes Yes Yes select+timer
Cancellation safe Yes (cleanup) Via interrupt Via -ERESTARTSYS Via context

Exercises

  1. Bounded buffer with two condvars: Implement the bounded buffer above. Verify correctness with 4 producers and 4 consumers running for 10 seconds. Use an atomic counter to confirm every item produced is consumed exactly once.

  2. Spurious wakeup injection: Modify pthread_cond_wait() via LD_PRELOAD (wrap the function) to randomly return spurious wakeups 10% of the time. Test your bounded buffer implementation to confirm it handles them correctly (i.e., the while-loop makes it safe).

  3. Thundering herd measurement: Implement a workqueue with 100 worker threads waiting on a single condvar. Measure the time between pthread_cond_broadcast() and all 100 threads completing one unit of work. Then redesign with individual signal() per item and compare.

  4. Kernel wait_event tracing: Write a kernel module that uses wait_event_interruptible(). Use ftrace to trace sched_wakeup events and confirm the waiter is woken correctly. Inject a signal to confirm wait_event_interruptible returns -ERESTARTSYS.

  5. Condvar vs eventfd benchmark: Implement a high-frequency event notification system (100k events/sec) using (a) condvar+mutex, (b) eventfd+epoll. Measure throughput and latency of each approach.

References

  • Hoare, C.A.R. (1974). "Monitors: An Operating System Structuring Concept." CACM 17(10):549-557.
  • Lampson, B.W. & Redell, D.D. (1980). "Experience with Processes and Monitors in Mesa." CACM 23(2):105-117.
  • POSIX.1-2017 (pthread_cond_wait(3), pthread_cond_signal(3))
  • Linux kernel source: include/linux/wait.h, kernel/sched/wait.c
  • glibc NPTL source: nptl/pthread_cond_wait.c, nptl/pthread_cond_signal.c
  • Drepper, U. (2003). "NPTL Design." https://www.akkadia.org/drepper/nptl-design.pdf
  • Birrell, A.D. (1989). "An Introduction to Programming with Threads." DEC SRC Research Report 35.