Lock-Free Data Structures

Overview

Lock-free data structures guarantee that at least one thread makes progress in a finite number of steps, regardless of what other threads are doing. They eliminate the deadlock and priority inversion problems inherent to lock-based designs, and can deliver significant throughput gains in highly concurrent workloads. However, lock-free algorithms are substantially harder to design, prove correct, and debug than their lock-based counterparts. The ABA problem, memory reclamation, and subtle memory ordering requirements make lock-free code a minefield. In practice, lock-free structures are not always faster — understanding when they help and when they hurt is as important as knowing how to implement them.

Prerequisites

Deep understanding of CAS and atomic operations (see 08-atomics-and-cas.md)
Understanding of the ABA problem and solutions
Familiarity with memory ordering (acquire/release)
Ability to reason about concurrent interleavings

Core Technical Content

Progress Hierarchy

From weakest to strongest:

Obstruction-free: A thread running in isolation (no other thread executing) completes in a finite number of steps. Contention may prevent progress, but isolation guarantees it.

Lock-free: At least one thread among all contending threads makes progress in a finite number of steps. Other threads may starve temporarily. CAS-based retry loops are typically lock-free.

Wait-free: Every thread completes its operation in a bounded number of steps, regardless of contention. No thread can starve indefinitely. Much harder to implement efficiently.

wait-free
    |  (strictly stronger)
lock-free
    |  (strictly stronger)
obstruction-free
    |  (strictly stronger)
lock-based (no progress guarantee under contention)

Treiber Stack (Lock-Free Stack)

Published by R. Kent Treiber (IBM Research, 1986), this is the simplest non-trivial lock-free data structure:

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct {
    atomic_uintptr_t top;   // tagged_ptr if using version counter
} stack_t;

void push(stack_t *s, node_t *n) {
    node_t *old_top;
    do {
        old_top = atomic_load_explicit(&s->top, memory_order_relaxed);
        n->next = old_top;
    } while (!atomic_compare_exchange_weak_explicit(
        &s->top, (uintptr_t*)&old_top, (uintptr_t)n,
        memory_order_release,   // success: publish n and n->next
        memory_order_relaxed    // failure: just reload top
    ));
}

node_t *pop(stack_t *s) {
    node_t *old_top;
    do {
        old_top = atomic_load_explicit(&s->top, memory_order_acquire);
        if (old_top == NULL) return NULL;
    } while (!atomic_compare_exchange_weak_explicit(
        &s->top, (uintptr_t*)&old_top, (uintptr_t)old_top->next,
        memory_order_acquire,
        memory_order_relaxed
    ));
    return old_top;
}

ABA hazard: Pop reads top=A, gets preempted. Another thread pops A, pops B, then pushes A back. Our CAS succeeds with top=A, but old_top->next now points to freed memory (or the wrong node). Fix with tagged pointers or hazard pointers.

Michael-Scott Queue (Non-Blocking Queue)

Maged Michael and Michael Scott's 1996 paper describes the definitive lock-free FIFO queue, used as the basis for Java's ConcurrentLinkedQueue:

typedef struct node {
    int value;
    atomic_uintptr_t next;
} ms_node_t;

typedef struct {
    atomic_uintptr_t head;  // dummy node (sentinel)
    atomic_uintptr_t tail;
} ms_queue_t;

void ms_enqueue(ms_queue_t *q, ms_node_t *new_node) {
    new_node->next = 0;
    ms_node_t *tail, *next;
    while (1) {
        tail = (ms_node_t*)atomic_load_explicit(&q->tail, memory_order_acquire);
        next = (ms_node_t*)atomic_load_explicit(&tail->next, memory_order_acquire);
        if (tail == atomic_load(&q->tail)) {  // tail consistent?
            if (next == NULL) {
                // Try to link new node
                if (atomic_compare_exchange_weak_explicit(
                    &tail->next, (uintptr_t*)&next, (uintptr_t)new_node,
                    memory_order_release, memory_order_relaxed))
                    break;  // success
            } else {
                // Tail is lagging; advance it
                atomic_compare_exchange_weak(&q->tail, (uintptr_t*)&tail,
                                              (uintptr_t)next);
            }
        }
    }
    // Advance tail (best-effort; another thread can do it too)
    atomic_compare_exchange_weak(&q->tail, (uintptr_t*)&tail, (uintptr_t)new_node);
}

The MS queue uses a "helping" scheme: a lagging tail pointer doesn't block enqueue; any thread seeing a stale tail advances it. This is a key lock-free technique: threads help each other complete pending operations.

Queue state diagram:

Steady state:     head --> [dummy] --> [a] --> [b] --> NULL
                                                        ^
                                                       tail

Mid-enqueue:      head --> [dummy] --> [a] --> [b] --> [c] --> NULL
                                                        ^
                                                       tail (lagging)
                  Any thread can advance tail to [c].

Harris Lock-Free Linked List

Timothy Harris (2001) showed how to build a lock-free sorted linked list supporting insert, delete, and search. The key insight: use a "logical deletion" marker — the low bit of the next pointer — to indicate a node is logically deleted before physically removing it:

// Node with mark bit in pointer
typedef uintptr_t marked_ptr_t;

#define IS_MARKED(p)    ((p) & 1)
#define MARK(p)         ((p) | 1)
#define UNMARK(p)       ((p) & ~1UL)
#define GET_PTR(p)      ((node_t*)UNMARK(p))

// Delete: first mark next pointer (logical delete)
do {
    next = atomic_load(&node->next);
} while (!CAS(&node->next, UNMARK(next), MARK(next)));

// Then CAS predecessor's next to skip this node (physical delete)
CAS(&pred->next, (marked_ptr_t)node, GET_PTR(next));

A concurrent search that encounters a marked node physically removes it and retries — "helping" the delete to complete. Insert must ensure neither predecessor nor the new node's successor is marked.

Memory Reclamation for Lock-Free Structures

The hardest problem in lock-free programming: when is it safe to free a node that has been removed from a data structure? A concurrent reader may still hold a pointer to the node.

Safe Memory Reclamation (SMR) options:

Hazard Pointers: Each thread has K hazard pointer slots. Before dereferencing a shared pointer, publish it in a hazard slot. Reclaimer scans all hazard pointers before freeing. O(P*K) scan per reclamation, where P = thread count.

// Thread-local hazard pointers
__thread void *hazard[2];

void *safe_read(void **ptr) {
    void *p;
    do {
        p = atomic_load(ptr);
        hazard[0] = p;
    } while (p != atomic_load(ptr)); // verify not changed
    return p;
}

Epoch-Based Reclamation (EBR): Three epochs in rotation. Retired objects go into the current epoch's limbo list. The epoch advances when all threads have acknowledged the current epoch. Objects in epoch E-2 are safe to free once epoch is E.

// Per-thread epoch and limbo lists
struct epoch_record {
    atomic_int epoch;     // thread's current epoch
    void *limbo[3][N];    // objects retired in each epoch
};

// Read side: announce epoch
rcu_read_lock():  thread_epoch = global_epoch;
rcu_read_unlock(): thread_epoch = -1;  // not in read-side CS

// Write side: retire object
retire(obj): add to limbo[global_epoch]; try_advance_epoch();

// Epoch advance: if all threads have epoch >= global_epoch,
// increment global_epoch and free limbo[old_epoch - 2]

RCU: Linux kernel RCU is an optimized EBR where "epochs" are implicit in preemption points (quiescent states). See 10-rcu.md.

Quiescent State Based Reclamation (QSBR): Variant of EBR where threads explicitly call quiescent() to indicate safe points, allowing reclamation without per-operation epoch announcements.

Reference Counting with deferred decrement: std::shared_ptr / boost::intrusive_ptr. Overhead per access. Can be made lock-free with split reference counts (Folly AtomicSharedPtr).

Practical Limits of Lock-Free

Lock-free algorithms are NOT always faster. Situations where lock-free is slower or equivalent:

Low contention: A well-implemented mutex with futex fast path is as fast as a lock-free CAS in the uncontended case (~3-10 ns each). The benefit of lock-free only materializes under contention.
Long operations: If the critical section is long (microseconds), the per-operation CAS retry cost is small relative to the work done. A mutex is simpler and equally fast.
CAS retry storms: Under very high contention, N threads racing on a CAS have O(N) total retries per success — effectively O(N²) total work. A mutex serializes this to O(N) total operations.
Memory reclamation overhead: Hazard pointer scans add overhead to every reclamation. EBR can delay memory reclamation significantly. A locked structure with free() is simpler.
Cache line bouncing amplified: With CAS retries, a contested cache line bounces more than with a mutex (which serializes access). Paradoxically, a contended mutex can cause less bus traffic than a highly contended CAS.

Rule: Use lock-free when (a) contention is high but individual operations are short, (b) lock-freedom is a correctness requirement (interrupt context, avoiding scheduler involvement), or (c) a specific wait-free guarantee is needed.

Java `ConcurrentLinkedQueue`

Java's java.util.concurrent.ConcurrentLinkedQueue implements the Michael-Scott queue with hazard-pointer-like logic using Java's GC for reclamation (no explicit hazard pointers needed — GC prevents premature freeing):

public class ConcurrentLinkedQueue<E> {
    private volatile Node<E> head;
    private volatile Node<E> tail;
    // Uses Unsafe.compareAndSwapObject for CAS on head/tail
}

Java's GC eliminates the ABA problem for references: addresses are never reused (GC moves objects with different addresses). The Java CLQ can use simple CAS without tagged pointers.

Linux Kernel Lock-Free Structures

llist (Lock-Less List): include/linux/llist.h. A single-linked list with lock-free push (CAS on head) and batch-pop (XCHG entire list). Used in work queues, IRQ thread wakeup.

// Lock-free push:
static inline bool llist_add_batch(struct llist_node *new_first,
                                   struct llist_node *new_last,
                                   struct llist_head *head) {
    struct llist_node *first = READ_ONCE(head->first);
    do {
        new_last->next = first;
    } while (!try_cmpxchg(&head->first, &first, new_first));
    return !first;
}

hlist_bl (Hash List with Bit Lock): A hash list that uses the low bit of the first pointer as a spinlock, used in the dentry cache (fs/dcache.c).

Per-CPU queues (include/linux/percpu_counter.h): Not lock-free in the classical sense, but eliminate shared-state contention by keeping per-CPU counters and batching updates to a global counter.

Historical Context

Treiber published his lock-free stack as an IBM Research Report in 1986 — possibly the first published lock-free data structure. Michael and Scott's 1996 PODC paper on the non-blocking queue was the catalyst for serious academic work on lock-free algorithms. Maurice Herlihy's 1991 paper on wait-free synchronization provided the theoretical foundation for lock-free research.

Maged Michael's 2004 paper on hazard pointers solved the long-standing problem of lock-free memory reclamation without GC. Epoch-based reclamation was popularized by Fraser (2004) and is the basis for Linux's RCU.

Production Examples

Java ConcurrentLinkedQueue: Used in Java EE application servers, thread pool work queues, event processing systems.
Folly (Facebook): folly::MPMCQueue (multi-producer multi-consumer bounded queue), folly::ConcurrentSkipList, folly::AtomicHashMap.
Linux kernel llist: Used by kthread mechanism, SMP call functions, timer wheel.
LMAX Disruptor: High-performance financial trading ring buffer using a single producer / multiple consumer lock-free design with sequence numbers.
Chromium/V8: Lock-free work-stealing deques in V8's task scheduler.
RocksDB: Lock-free skip list used as the in-memory write buffer (MemTable).

Debugging Notes

Relacy (Dmitry Vyukov): Systematically explores all thread interleavings of a lock-free algorithm. Most reliable tool for finding correctness bugs.
CDSChecker: Model checker for C11 atomic programs.
ThreadSanitizer: Detects data races but may not detect all logical correctness issues in lock-free code (a race-free lock-free algorithm can still be logically wrong).
KCSAN: Linux kernel equivalent of TSan.
Adding instrumentation: For debugging, add a counter that tracks how many times each CAS loop retries. More than 5-10 retries per operation indicates high contention.
Symptoms of broken lock-free code: sporadic crashes, memory corruption, infinite loops (livelock), incorrect results. Non-reproducible on single-core or under debugger.

Security Implications

Use-after-free from reclamation races: If hazard pointers are not correctly set before dereferencing, a freed node can be accessed. In kernel context, leads to privilege escalation.
Lock-free structures in kernel (CVE-2017-10661): A use-after-free in the kernel's lock-free timer implementation (timer_list). An attacker could manipulate timer expiry to trigger the UAF.
TOCTOU in optimistic algorithms: Lock-free algorithms that read data and then CAS are vulnerable to ABA-like TOCTOU if the check and the action are not atomic with respect to each other.

Performance Implications

Uncontended CAS: ~5-10 cycles (same as mutex fast path).
Contended CAS (8 threads): ~50-200 cycles per operation due to retries.
Contended mutex (8 threads): ~100-5000 cycles per operation (futex syscall if sleeping).
Crossover: Lock-free often wins at 2-8 threads. At 32+ threads, it depends heavily on the algorithm and contention pattern.
Memory reclamation overhead: Hazard pointer scan: O(P*K) per reclamation. EBR: amortized, but memory may be held longer (affecting RSS).

Common Pitfalls

Forgetting memory ordering: Using memory_order_relaxed for all CAS operations. The CAS itself is atomic, but the data it publishes is not visible to other threads without release/acquire ordering.
ABA without tagged pointers or hazard pointers: Particularly dangerous in lock-free stacks and lists where nodes can be reallocated.
Incorrect helping: In helping schemes (MS queue), a thread that helps complete another thread's operation must do so correctly — incorrect helping corrupts the data structure.
Livelock in poorly designed CAS loops: Two threads each repeatedly failing their CAS due to the other. Always add exponential backoff or cpu_relax() for very tight CAS retry loops.
Reclamation without quiescence: Freeing a node before all readers have finished using it. Always use a safe memory reclamation scheme.

Real-World Failure Cases

Facebook Folly AtomicLinkedList bug (2015): A concurrency bug in Folly's lock-free list allowed a node to be appended to a list that was being traversed with a hazard pointer, leading to incorrect behavior when the hazard pointer was cleared before the traversal was complete. Fixed by adjusting the memory ordering on hazard pointer publication.

Java ConcurrentHashMap striped lock regression (Java 7 → 8): Java 7's ConcurrentHashMap used striped locking (16 segments). Java 8 replaced this with a CAS-based approach. Under extremely high contention with many threads, the Java 8 version showed higher variance in latency due to CAS retry storms — a regression from lock-striping for some workloads. A cautionary example of lock-free not always being better.

Linux kthread llist race (pre-3.14): A race condition in the lock-free push to kthread_create_list allowed a kthread to see a partially-initialized work item. Fixed by adding a smp_wmb() before the llist push.

Modern Usage and Cloud-Scale

Disruptor pattern: The LMAX Disruptor uses a pre-allocated ring buffer with sequence numbers, eliminating dynamic allocation and ABA. Achieves ~6M events/second on a single core.
io_uring: Linux's io_uring uses lock-free ring buffers (submission and completion queues) for kernel-bypass I/O, eliminating syscall overhead for individual operations.
DPDK rte_ring: Lock-free MPMC ring buffer using CAS on head/tail indices. Achieves near-line-rate packet processing.
FoundationDB: Uses lock-free skip lists for its in-memory storage layer.
Cloud-scale RCU: Linux RCU is the ultimate lock-free read-side, used throughout the networking and VFS stacks handling millions of requests/second.

Future Directions

Transactional memory: Hardware transactional memory (HTM) attempts to make any code block atomic without CAS, potentially simplifying lock-free code. See 11-transactional-memory.md.
Verified lock-free algorithms: Formal verification tools (Iris, Coq) are being used to mechanically prove lock-free algorithm correctness. The Iris framework has verified the Michael-Scott queue.
Persistent lock-free structures: Designing lock-free data structures for non-volatile memory (NVM/Optane) where crash consistency is also required.

Summary Table

Structure	Progress	ABA-safe	Reclamation	Complexity	Use Case
Treiber stack	Lock-free	No (tagged ptr needed)	Hazard/EBR	Low	Work queues
MS queue	Lock-free	No (tagged ptr needed)	Hazard/EBR/GC	Medium	MPMC queues
Harris list	Lock-free	No	Hazard/EBR	High	Sorted sets
Skip list	Lock-free	No	Hazard/EBR	High	Ordered maps
Disruptor ring	Wait-free	N/A (no pointer ABA)	None (static)	Medium	Single-prod queues
Linux llist	Lock-free	No (push-only)	N/A (GC managed)	Low	IRQ/kthread lists

Exercises

Implement and test Treiber stack: Implement the Treiber stack in C with tagged pointers. Write a test that intentionally triggers the ABA scenario (three threads with sleeps at precise points). Verify that without tagged pointers the test fails, and with tagged pointers it passes.
Michael-Scott queue benchmark: Implement the MS queue. Benchmark with 1, 2, 4, 8 producers and 1, 2, 4, 8 consumers. Compare against a mutex-protected std::queue. Plot throughput vs thread count.
Hazard pointer reclamation: Implement a simple hazard pointer system. Use it to safely reclaim Treiber stack nodes. Verify with Valgrind that no node is freed while still referenced.
CAS retry counting: Add a retry counter to a lock-free CAS loop. Run under increasing thread counts (2, 4, 8, 16, 32). Plot average retries per operation. Identify the contention cliff.
Relacy model checking: Use Relacy (http://1024cores.net/home/relacy-race-detector) to model-check your Treiber stack implementation. Explore all possible interleavings and verify ABA freedom and linearizability.

References

Treiber, R.K. (1986). "Systems Programming: Coping with Parallelism." IBM Research Report RJ 5118.
Michael, M.M. & Scott, M.L. (1996). "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms." PODC '96.
Harris, T. (2001). "A Pragmatic Implementation of Non-Blocking Linked-Lists." DISC 2001.
Herlihy, M. (1991). "Wait-Free Synchronization." TOCS 9(1):124-149.
Michael, M.M. (2004). "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects." IEEE TPDS.
Fraser, K. (2004). "Practical Lock-Freedom." PhD thesis, Cambridge.
Herlihy, M. & Shavit, N. (2008). The Art of Multiprocessor Programming, Chapters 9-13.
Linux kernel source: include/linux/llist.h, lib/llist.c
Java source: java/util/concurrent/ConcurrentLinkedQueue.java