Deadlock

Overview

A deadlock is a situation where two or more threads are permanently blocked, each waiting for a resource held by another thread in the cycle. Unlike a live system under high load, a deadlocked system makes no progress at all — it is permanently stuck. Deadlocks are among the most serious and least reproducible bugs in concurrent systems: they may occur only under specific timing conditions, infrequently enough to escape testing, but frequently enough to cause production outages. The Linux kernel's lockdep validator is the most sophisticated deadlock-prevention tool in production systems today, capable of detecting potential deadlocks at first lock acquisition — before the deadlock actually occurs — by proving that no sequence of events can lead to a circular lock dependency.

Prerequisites

Understanding of mutexes and spinlocks
Familiarity with process/thread scheduling
Knowledge of resource management
Basic graph theory (directed graphs, cycles)

Core Technical Content

The Four Coffman Conditions

Deadlock can occur if and only if all four Coffman conditions (1971) hold simultaneously:

Mutual Exclusion: Resources cannot be shared — only one thread can use a resource at a time.
Hold-and-Wait: A thread holds at least one resource while waiting to acquire additional resources held by other threads.
No Preemption: Resources cannot be forcibly taken from a thread; they must be released voluntarily.
Circular Wait: A circular chain of threads exists, where each thread holds a resource needed by the next thread in the chain.

ABBA Deadlock (circular wait of depth 2):

Thread A:          Thread B:
lock(mutex_1)      lock(mutex_2)
  lock(mutex_2)      lock(mutex_1)  <-- DEADLOCK
  ...                ...

Eliminating ANY one of the four conditions prevents deadlock. This is the basis for all deadlock prevention strategies.

Resource Allocation Graph

A resource allocation graph (RAG) visually represents the state of resource allocation and requests:

Nodes:
  - Square: Resource (e.g., mutex)
  - Circle: Thread/Process

Edges:
  - Thread -> Resource: "Thread is waiting for this resource"
  - Resource -> Thread: "Resource is held by this thread"

No deadlock:
  T1 --> R1 --> T2

  T1 wants R1, R1 is held by T2.
  T2 is not waiting for anything. Will finish and release R1.

Deadlock (cycle):
  T1 --> R1 --> T2 --> R2 --> T1

  T1 holds R2, waits for R1.
  T2 holds R1, waits for R2.
  Circular dependency --> DEADLOCK.

For multi-instance resources (resources with multiple units), a cycle in the RAG is necessary but not sufficient for deadlock. Deadlock requires that the cycle cannot be broken by satisfying any thread's request.

Banker's Algorithm (Deadlock Avoidance)

Dijkstra's Banker's Algorithm (1965) checks whether granting a resource request would leave the system in a "safe state" — one where all threads can eventually complete:

State: (allocated, maximum, available)
  Thread 1: allocated=[1,0,0], max=[7,5,3]
  Thread 2: allocated=[3,2,2], max=[3,2,2]
  Thread 3: allocated=[2,1,1], max=[9,0,2]
  Available: [3,3,2]

Safety check: can we find an order in which all threads complete?
  Thread 2 needs (max - allocated) = [0,0,0] -> can satisfy!
  After T2 completes, available = [3+3, 3+2, 2+2] = [6,5,4]
  Thread 3 needs [7,-1,1] -- wait, needs < available for all -> yes
  After T3, available = [8,6,5]
  Thread 1 can complete.
  Order T2, T3, T1 is SAFE.

If a request would make the state unsafe, deny it.

Banker's Algorithm is theoretically sound but practically impractical for OS kernels because: - Must know the maximum resource needs of every process in advance. - Resource types (locks) are created dynamically and vary per execution path. - The overhead of checking safety on every lock acquisition is prohibitive.

Deadlock Detection

When prevention/avoidance is too costly, detect deadlocks after they occur:

Periodic cycle detection: Run a cycle-detection algorithm on the resource allocation graph periodically. O(V+E) per check.

Wait-for graph: A simplified RAG where only threads are nodes, with an edge T1→T2 meaning "T1 is waiting for a resource held by T2." A cycle in the wait-for graph indicates deadlock.

Full RAG:    T1 --wants--> R1 --held_by--> T2 --wants--> R2 --held_by--> T1

Wait-for:    T1 --> T2 --> T1   (deadlock!)

OS-level deadlock detection: Most OS kernels do not implement general deadlock detection (too expensive). Instead, tools like lockdep detect potential deadlocks statically.

Timeout-based detection: If a thread has been waiting for a lock for longer than a threshold, assume deadlock. Used in databases (InnoDB deadlock timeout). Problem: false positives under high load.

Deadlock Prevention Strategies

Eliminate Mutual Exclusion: Use lock-free data structures where possible. Not always feasible.

Eliminate Hold-and-Wait: - Require all resources to be acquired atomically (either get all or none). Impractical for dynamically determined resource sets. - Two-phase locking (2PL): Acquire all locks before using them (lock phase), then release (unlock phase). Prevents deadlock but reduces concurrency. - trylock + backoff: Try to acquire additional locks with a timeout. If any trylock fails, release all held locks and retry from scratch:

void transfer(mutex_t *m1, mutex_t *m2) {
    while (1) {
        lock(m1);
        if (trylock(m2)) break;  // got both
        unlock(m1);              // release and retry
        cpu_relax();             // brief backoff
    }
    // transfer
    unlock(m2);
    unlock(m1);
}

Eliminate No-Preemption: Allow locks to be forcibly released. Uncommon in kernel mutexes (too dangerous for consistency), but used in some RTOS designs and database systems (transaction rollback = preemption).

Eliminate Circular Wait: Impose a total ordering on all resources and require that threads acquire resources only in increasing order. This is the most practical kernel strategy.

Lock Ordering Convention

The most important deadlock prevention technique in practice: assign every lock a global "lock class" or "lock number" and require that all code acquires locks only in order of increasing number.

// Wrong (can deadlock):
lock(A); lock(B);  // in some paths
lock(B); lock(A);  // in other paths

// Correct (total order):
#define LOCK_ORDER_A  1
#define LOCK_ORDER_B  2
// Always lock A before B (lower order first)
lock(A); lock(B);   // everywhere

In the Linux kernel, the lock ordering is documented in Documentation/locking/lockdep-design.rst and enforced by lockdep. Common ordered lock groups: - RCU → rw_semaphore → mutex → spinlock (broader to narrower scope) - inode_lock (outer) → block_device_lock (inner) - Subsystem lock (outer) → per-object lock (inner)

Linux lockdep

CONFIG_LOCKDEP enables the Linux lock dependency validator in kernel/locking/lockdep.c. It is the most sophisticated production deadlock-detection tool in existence.

Lock classes: Each unique lock (identified by its source code location, not its runtime address) is a "lock class." All spinlock_t variables declared in the same call site share a lock class.

Lock dependency graph: lockdep maintains a directed graph where an edge A→B means "lock B was acquired while holding lock A." Any cycle in this graph is a potential deadlock.

Key insight: lockdep detects potential deadlocks at first acquisition, not just when a deadlock actually occurs. If the system takes a path where A→B is acquired, and another path acquires B→A, lockdep warns immediately — even if these paths have never actually run simultaneously.

[ lockdep output example ]
WARNING: possible circular locking dependency detected
...
Chain exists of:
  &inode->i_rwsem --> &mm->mmap_lock --> &inode->i_rwsem

but this lock ordering was taken:
  task/1234: inode_lock (inode->i_rwsem) -> mmap_lock
  task/5678: mmap_lock -> inode_lock

This could trigger a deadlock when both tasks run simultaneously.

lockdep lock classes with explicit naming:

// In kernel/locking/lockdep.c
static struct lock_class_key my_lock_key;
lockdep_set_class(&my_lock, &my_lock_key);

Interrupt safety tracking: lockdep tracks whether a lock is ever acquired in interrupt context (IRQ disabled) or not. If a lock is sometimes held in interrupt context and sometimes with interrupts enabled, and another lock follows the same pattern inversed, lockdep warns — the interrupt handler could trigger the deadlock.

SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected:
  holding:   &spinlock_A  (held in softirq context)
  wanting:   &mutex_B
  but this chain exists: &mutex_B -> &spinlock_A
  (mutex_B held in process ctx, spinlock_A acquired in softirq)
  --> softirq fires while holding mutex_B, tries to take spinlock_A
  --> deadlock if spinlock_A is held elsewhere trying to take mutex_B

lockdep overhead: ~20-40% performance reduction in kernel. Only enabled in debug builds. CONFIG_LOCK_STAT additionally tracks lock hold/wait times.

ABBA Deadlock Pattern

The most common kernel deadlock pattern — two threads acquiring the same two locks in opposite orders:

Time:  CPU 0              CPU 1
  t0:  lock(mutex_a)
  t1:                     lock(mutex_b)
  t2:  lock(mutex_b) <-- BLOCKS (mutex_b held by CPU 1)
  t3:                     lock(mutex_a) <-- BLOCKS (mutex_a held by CPU 0)
  ...  DEADLOCK!

The fix: establish a global order. Always lock mutex_a before mutex_b, or use trylock + release-all + retry.

Real Kernel Deadlock Incidents

Linux NFS + inode lock deadlock (2008, fixed in 2.6.27): The NFS client would call nfs_lookup() while holding inode->i_mutex, which then called into the writeback path, which tried to acquire inode->i_mutex again — a self-deadlock. Fixed by using lockdep_set_class() to distinguish the outer and inner inode locks and restructuring the call path.

Linux btrfs triple-lock deadlock (2017): btrfs had a complex three-way lock cycle between fs_info->reloc_mutex, the chunk mutex, and the commit mutex in specific tree relocation paths. Discovered by lockdep during stress testing. The fix involved adding a third lock class to differentiate acquisition contexts.

PostgreSQL lock manager deadlock (pre-9.2): PostgreSQL's lock manager used a single "LockManager" mutex protecting all lock operations. Under high concurrency, a deadlock between the lock manager mutex and the buffer pool mutex was possible in error recovery paths. Fixed in 9.2 by fine-grained locking.

Historical Context

E.J. Dijkstra discussed deadlock in the context of the THE operating system (1965). He described it as "deadly embrace" and showed that spinlocks and semaphores could cause it.

Coffman, Elphick, and Shoshani formalized the four necessary conditions for deadlock in "System Deadlocks" (CACM, 1971) — the conditions now bear Coffman's name.

Dijkstra's Banker's Algorithm was described in "EWD622: On-the-fly Garbage Collection" (1975) and formalized by Habermann.

Ingo Molnar implemented the Linux lockdep validator in 2006, submitted with Linux 2.6.18. It has been called "perhaps the most impressive debugging tool in the kernel" and has found hundreds of real and potential deadlocks over its lifetime.

Production Examples

Linux kernel lockdep: Every major kernel subsystem has been audited and deadlock-proofed via lockdep. The networking, VFS, and memory management subsystems have particularly complex lock hierarchies enforced by lockdep.
PostgreSQL deadlock detection: InnoDB and PostgreSQL both implement deadlock detection in their lock managers using wait-for graph cycle detection, timing out and rolling back one transaction in the cycle.
Oracle RAC: Distributed deadlock detection across a cluster of Oracle instances, periodically exchanging global wait-for graphs.
Java java.util.concurrent: ReentrantLock with tryLock(timeout) provides deadlock avoidance in Java applications. The JVM's JStack tool (kill -3 or jstack) dumps thread states including lock holders and waiters for deadlock diagnosis.

Debugging Notes

CONFIG_LOCKDEP: Enable in debug kernels. Catches potential deadlocks immediately.
CONFIG_DEBUG_LOCK_ALLOC: Verifies proper lock initialization and destruction.
cat /proc/PID/task/TID/wchan: Shows the kernel function a thread is blocked in (e.g., mutex_lock, schedule).
sysrq-t (echo t > /proc/sysrq-trigger): Dumps all thread backtraces to the kernel log. Essential for diagnosing deadlocks in production.
gdb with threads: info threads, thread apply all bt shows all threads and their stacks, identifying which thread holds which lock.
strace -e futex -p PID: Shows a process stuck in futex(WAIT) — indicates blocking on a mutex.
Java jstack PID: Prints thread dumps including "Found one Java-level deadlock" messages.
lsof: Shows file locks held by processes; for file-level deadlocks.
D-state processes: ps aux | grep ' D ' shows processes in uninterruptible sleep, often waiting for a lock or I/O. Sustained D-state is a red flag.

Security Implications

Deadlock as DoS: A carefully crafted request sequence can trigger a deadlock in a server, causing it to hang indefinitely. If the server handles requests single-threaded (with one global lock), one such request crashes the entire service.
Lock ordering bypass: If an attacker can control the order in which the kernel acquires locks (e.g., by manipulating system call arguments), they may be able to trigger a deadlock path not detected by lockdep.
Deadlock-based privilege escalation: Some privilege escalation exploits use deadlocks strategically — causing a high-privilege thread to block waiting for a resource controlled by the attacker.
CVE-2021-3347 (futex PI deadlock): A bug in the futex PI (priority inheritance) code could cause a deadlock in the kernel's PI chain-walking code, leading to a kernel hang (DoS). The futex PI mutex implementation had incorrect handling of chains that created circular references.
Recovery from deadlock: Forced lock preemption (breaking the "no preemption" Coffman condition) can recover from deadlock, but this may leave data in an inconsistent state — a security risk if the inconsistent state is observable.

Performance Implications

Lock ordering overhead: Acquiring locks in a fixed order may require lock+release+reacquire sequences when natural code flow demands a different order. This adds latency but prevents deadlock.
trylock + backoff overhead: In high-contention scenarios, frequent failed trylocks and retries cause livelock-like behavior. Limit retries and add randomized backoff.
lockdep overhead: ~20-40% slowdown with CONFIG_LOCKDEP. Not acceptable in production; only for debug builds.
Deadlock recovery cost: In databases, rolling back a transaction to break a deadlock is expensive (undo all changes). Database deadlock rates (from SHOW ENGINE INNODB STATUS) should be minimized through lock ordering.

Common Pitfalls

Lock inversion in callback-based code: A function acquires lock A, then calls a user-supplied callback, which acquires lock B. Another code path acquires B then A. Lockdep may not catch this if the callback is only set at runtime.
Nested locks with same class: Two different instances of the same struct both have a mutex. If one mutex can be acquired while holding another of the same class (e.g., nested inode->i_mutex), lockdep needs lockdep_set_class() to distinguish them.
Thread A waits for Thread B to initialize, Thread B waits for Thread A: Not a traditional resource lock deadlock, but a deadlock on a condition/signal. Common in initialization code with circular dependencies.
Forgetting that spin_lock_irqsave changes interrupt state: Code that acquires a spinlock with spin_lock() (no IRQ disable) and then tries to acquire a lock that is sometimes held with IRQ disabled can deadlock with an interrupt handler.
Lock forgotten across error path: A lock acquired in a function is not released on an error return. The next thread to acquire the lock deadlocks. Use RAII or explicit goto unlock patterns in kernel code.

Real-World Failure Cases

Git 2.x Windows deadlock (2018): Git's credential_fill() function on Windows acquired the credential lock, then called out to an external credential helper, which in some configurations tried to acquire the same lock. Self-deadlock in credential management code.

Android Binder deadlock (2015): The Android Binder IPC driver had a lock ordering issue where the Binder transaction lock could be acquired while holding a process's inner lock, and vice versa in another path. This caused rare but reproducible deadlocks in apps using both Binder and internal locks. Fixed in drivers/android/binder.c with lock ordering documentation.

MySQL InnoDB LOCK_trx_sys → LOCK_trx deadlock (pre-5.7): A transaction acquiring LOCK_trx_sys (global transaction system lock) while holding a per-transaction LOCK_trx, while another code path reversed the order. Rare but caused production hangs. Fixed by stricter lock ordering enforcement and code restructuring in 5.7.

ZooKeeper deadlock (2014, ZOOKEEPER-1999): A deadlock between ZooKeeper's ZKDatabase lock and FileTxnSnapLog lock in the takeSnapshot path. Under concurrent leader election and snapshot, these were acquired in opposite orders. Fixed by establishing a consistent lock order.

Modern Usage and Cloud-Scale

Microservices distributed deadlock: At cloud scale, deadlocks manifest at the distributed level: Service A calls Service B (and waits), Service B calls Service A (and waits). Detected via distributed tracing (Jaeger, Zipkin) by finding circular call chains, or via timeout-based detection.
Database connection pool deadlocks: Application servers with fixed-size connection pools can deadlock if transactions hold connections while waiting for other connections to complete sub-transactions. Solved with async connection acquisition + timeout.
Kubernetes controller deadlocks: Complex Kubernetes controllers with multiple informer caches and event handlers can create deadlock scenarios when callbacks acquire shared locks. Mitigated by using work queues to decouple event handling from event generation.
Rust's type system and deadlocks: Rust's Mutex<T> cannot prevent deadlocks (they are still possible), but the borrow checker prevents many classes of unsafe concurrent access. Libraries like parking_lot provide try_lock with timeout for deadlock recovery.

Future Directions

Static deadlock analysis: Tools like Facebook's RacerD, Intel's Inspector, and research tools attempt compile-time deadlock detection without the runtime overhead of lockdep. Increasingly practical as LLVM analysis passes improve.
Rust deadlock detection: no-deadlock and tracing-mutex crates attempt lockdep-like runtime detection in Rust.
Formal verification of lock ordering: TLA+ and Alloy specifications can formally verify that a locking protocol's ordering invariants prevent deadlock. Used in distributed system protocol verification.
Hierarchical lock classes: Future lockdep improvements may support hierarchical lock classes that automatically infer safe acquisition orders from data structure hierarchies.

Summary Table

Strategy	Prevents Deadlock?	Coffman Condition Broken	Cost	Practical?
Lock ordering	Yes	Circular wait	Low (design)	Yes (most common)
Trylock + retry	Yes	Hold-and-wait	Livelock risk	Yes (with backoff)
Banker's algorithm	Yes	Hold-and-wait	High (per-request)	No (OS-level)
Lock-free algorithms	Yes	Mutual exclusion	High (complexity)	Sometimes
Deadlock detection + recovery	No (reacts)	N/A	Medium	Yes (databases)
Timeout + abort	No (reacts)	N/A	Low	Yes (databases)
lockdep (Linux)	Finds potential DLs	N/A (instrumentation)	20-40% overhead	Debug builds only

Exercises

Write a deadlock: Write a C program with two threads that reliably deadlocks (ABBA pattern). Use gdb to examine the deadlock: info threads, thread apply all bt. Identify which thread holds which mutex (examine pthread_mutex_t internals).
Fix with lock ordering: Fix the ABBA deadlock by imposing a lock ordering. Verify the fix eliminates the deadlock by running 1 million iterations.
Implement deadlock detection: Write a simple wait-for graph tracker for 5 threads and 5 mutexes. After each lock() call, update the wait-for graph. After each unlock(), remove edges. Detect cycles using DFS. Test with a controlled deadlock scenario.
Linux lockdep exercise: Write a kernel module with an intentional lock ordering inversion (acquire A then B in one path, B then A in another). Boot with CONFIG_LOCKDEP. Observe the WARNING: possible circular locking dependency detected output. Fix the inversion and verify the warning disappears.
Database deadlock simulation: Using two psql sessions to PostgreSQL, create two rows. Session 1: UPDATE row1; UPDATE row2;. Session 2 (concurrently): UPDATE row2; UPDATE row1;. Observe the deadlock and PostgreSQL's automatic rollback of one transaction. Note the ERROR: deadlock detected message.

References

Coffman, E.G., Elphick, M.J., & Shoshani, A. (1971). "System Deadlocks." ACM Computing Surveys 3(2):67-78.
Dijkstra, E.W. (1965). "Cooperating Sequential Processes." Technical Report EWD-123, Eindhoven.
Havender, J.W. (1968). "Avoiding Deadlock in Multitasking Systems." IBM Systems Journal 7(2):74-84.
Molnar, I. (2006). "Lock dependency validator (lockdep)." Linux kernel patch, 2.6.18.
Linux kernel source: kernel/locking/lockdep.c, Documentation/locking/lockdep-design.rst
Herlihy, M. & Shavit, N. (2008). The Art of Multiprocessor Programming, Chapter 2 (Mutual Exclusion) and Chapter 8 (Monitors and Blocking Synchronization).
Silberschatz, A., Galvin, P.B., & Gagne, G. (2018). Operating System Concepts, 10th ed., Chapter 8 (Deadlocks).
MySQL documentation: SHOW ENGINE INNODB STATUS (deadlock section).