Spinlocks

Overview

A spinlock is a synchronization primitive where a thread attempting to acquire a held lock repeatedly polls (spins) in a tight loop until the lock becomes available, rather than yielding the CPU and sleeping. Spinlocks are the most primitive locking mechanism used in operating systems and are the backbone of all higher-level synchronization. They are appropriate when the expected hold time is short — shorter than the overhead of a context switch (~1-10 microseconds) — and when sleeping is prohibited, such as in interrupt handlers or during early kernel initialization. Understanding the evolution from naive spinlocks to ticket locks, MCS locks, and the Linux qspinlock reveals how subtle the interaction between cache coherency, NUMA topology, and fairness requirements is.

Prerequisites

Understanding of cache coherency protocols (MESI/MOESI)
Knowledge of hardware atomic operations (CAS, fetch-and-add)
Familiarity with NUMA architecture
Basic understanding of interrupts and interrupt context

Core Technical Content

Naive Test-and-Set Spinlock

The simplest spinlock uses the test-and-set (TAS) instruction:

typedef struct { int locked; } spinlock_t;

void spin_lock(spinlock_t *lock) {
    while (atomic_exchange(&lock->locked, 1) == 1)
        ; // spin
}

void spin_unlock(spinlock_t *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}

Problem: Under contention, every spinning CPU continuously issues XCHG (which is an implicit LOCK), causing the cache line to bounce between all spinning CPUs in Modified state. If N threads are spinning, the holder's unlock operation triggers N cache coherency messages simultaneously — a "thundering herd" on the cache bus. This is O(N²) bus traffic.

Ticket Spinlock

Invented to solve the unfairness of TAS (last-writer-wins is not FIFO) and to reduce bus traffic marginally:

typedef struct {
    uint16_t owner;   // the ticket currently being served
    uint16_t next;    // the next ticket to hand out
} ticket_spinlock_t;

void ticket_spin_lock(ticket_spinlock_t *lock) {
    uint16_t my_ticket = atomic_fetch_add(&lock->next, 1);
    while (lock->owner != my_ticket)
        cpu_relax(); // PAUSE on x86, YIELD on ARM
}

void ticket_spin_unlock(ticket_spinlock_t *lock) {
    lock->owner++;  // atomic on x86 due to single aligned write
}

Still O(N) cache line bouncing: every lock->owner write wakes all N spinning CPUs. Linux used ticket spinlocks from kernel 2.6.25 until they were replaced by qspinlock in 3.15.

MCS Lock (Mellor-Crummey Scott, 1991)

The MCS lock solves cache line bouncing entirely: each waiter spins on its own locally-allocated node, not on the shared lock word. Unlock only affects the immediate successor's node.

typedef struct mcs_node {
    struct mcs_node *next;
    int locked;  // 1 = spinning, 0 = go
} mcs_node_t;

typedef mcs_node_t *mcs_lock_t;  // points to tail of queue

void mcs_lock(mcs_lock_t *lock, mcs_node_t *my_node) {
    my_node->next = NULL;
    my_node->locked = 1;
    mcs_node_t *prev = atomic_exchange(lock, my_node);
    if (prev != NULL) {
        prev->next = my_node;
        while (my_node->locked) cpu_relax(); // spin on LOCAL node
    }
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *my_node) {
    if (my_node->next == NULL) {
        // Try to clear tail pointer
        if (cmpxchg(lock, my_node, NULL) == my_node)
            return; // no waiters
        // Someone is enqueuing; wait for them
        while (my_node->next == NULL) cpu_relax();
    }
    my_node->next->locked = 0; // wake successor
}

MCS Queue Structure:

  Lock tail pointer
       |
       v
  +--------+    +--------+    +--------+
  | nodeA  |--->| nodeB  |--->| nodeC  |
  |locked=0|    |locked=1|    |locked=1|
  +--------+    +--------+    +--------+
  (owner,        (spinning     (spinning
   in CS)         on local)     on local)

Each CPU spins on its own locked field. When A releases, it writes to B's locked field (which is in B's L1 cache), causing exactly one cache line invalidation. Total cost: O(1) cache operations per lock/unlock, regardless of contention.

Downside: MCS lock requires per-thread node storage, making it awkward for embedded usage. The lock itself needs 2 words (pointer + node). This was acceptable in high-performance computing but less convenient in the kernel where spinlocks are embedded in thousands of data structures.

CLH Lock (Craig-Landin-Hagersten, 1993)

A variant of MCS where threads spin on the predecessor's node (implicit queue via pred pointer), rather than their own. Nodes are logically in a linked list but predecessor's node is the spinning point:

pred_node (spin here)  <-- my_node
     [locked=1] -----------> [locked=?]

CLH has slightly simpler unlock (just set own node to locked=0), but requires reading predecessor's node, which may be on a different NUMA node. MCS is generally preferred on NUMA.

qspinlock (Linux kernel, v3.15+)

The Linux qspinlock (include/asm-generic/qspinlock.h, kernel/locking/qspinlock.c) is a compact MCS-based lock that fits in a single 32-bit word — the same size as the old ticket spinlock — while providing MCS semantics for contended cases.

Layout of the 32-bit lock word:

Bit:  31          9 8      7 6      0
      +------------+--------+--------+
      | tail index | tail   | locked |
      | + cpu num  | (2b)   | (8b)   |
      +------------+--------+--------+

Bits 0-7: locked byte (0=unlocked, 1=locked by owner)
Bits 8-9: "pending" — first waiter sets this instead of queuing
Bits 10-31: tail — encodes the CPU number of the tail of the MCS queue

The key insight: for the common case of low contention (0-2 waiters), no MCS node is needed. The "pending" bit serves as a single-waiter fast path. Only when there are 3 or more contenders is the full MCS queue used.

Per-CPU MCS nodes are stored in qnodes array in kernel/locking/qspinlock.c:

static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

This is why qspinlock achieves both compactness (32 bits) and scalability (MCS queue for high contention).

Spinlocks in Interrupt Context

A fundamental rule in the Linux kernel: never acquire a sleeping lock from interrupt context. A spinlock, by definition, never sleeps, but there is still a danger: if a spinlock is held by a thread and an interrupt fires on the same CPU, and the interrupt handler tries to acquire the same lock, deadlock results (the CPU is spinning waiting for itself).

Solution: spin_lock_irqsave(lock, flags) disables interrupts on the local CPU before acquiring the lock. spin_unlock_irqrestore(lock, flags) re-enables them after.

unsigned long flags;
spin_lock_irqsave(&my_lock, flags);
// critical section — interrupts disabled on this CPU
spin_unlock_irqrestore(&my_lock, flags);

For spinlocks shared with softirqs (bottom halves), use spin_lock_bh() / spin_unlock_bh().

PREEMPT_RT Spinlock Changes

The PREEMPT_RT (real-time) patchset, which became part of mainline Linux in v6.1, converts most spinlocks to sleeping spinlocks (rtmutex-based). Under PREEMPT_RT:

spinlock_t becomes a sleeping lock backed by a priority-inheriting RT mutex.
raw_spinlock_t remains a true spinlock (non-sleeping).
Interrupt handlers run in threaded context, so they can sleep on RT mutexes.

This distinction is why Linux introduced raw_spinlock_t for the small number of truly non-sleeping spinlocks (scheduler internals, interrupt controller code).

Spinlock Debugging with lockdep

CONFIG_LOCKDEP enables the lock validator in kernel/locking/lockdep.c. It tracks a directed graph of lock acquisition ordering and checks for:

Circular dependency (potential deadlock): If you acquire A then B in one path, and B then A in another, lockdep warns immediately — before a deadlock actually occurs.
Lock class: Lockdep uses "lock classes" (defined by the call site where a lock is declared) to generalize ordering rules.
Interrupt safety: Tracks whether a lock is ever acquired with interrupts disabled vs enabled, detecting potential interrupt-context deadlocks.

WARNING: possible circular locking dependency detected
...
Chain exists of: &(&mapping->i_mmap_rwsem)->(name)
--> fs/inode.c: inode_lock() --> kernel/locking/lockdep.c

Performance: Cache Line Bouncing and NUMA

Cache Line Bouncing

A highly contended spinlock causes the cache line containing the lock word to transfer between CPUs in Modified state. On a dual-socket Intel system, inter-socket cache line transfer takes ~100-150 ns compared to ~5 ns for L1 cache hit. With 32 threads contending:

                   Socket 0                   Socket 1
             [CPU0][CPU1][CPU2][CPU3]    [CPU4][CPU5][CPU6][CPU7]
                     |                           |
              QPI/UPI link (~100 ns)

The lock word ping-pongs across the QPI link, burning interconnect bandwidth. This is why MCS/qspinlock's property of local spinning is so valuable: the cache line with the shared locked bit only moves when the lock is transferred, not during spinning.

NUMA Effects

On NUMA systems, a spinlock located in socket 0's memory costs 5ns for socket 0 CPUs but 150ns for socket 1 CPUs to acquire. Linux's numa_balancing and per-NUMA-node data structures mitigate this, but fundamental lock migration to "follow" the workload is non-trivial.

The Linux kernel's kernel/locking/qspinlock_paravirt.h implements a "PV qspinlock" for hypervisors: a vCPU that is spinning but has been scheduled out (not running) should be put to sleep rather than wasting the physical CPU. PV spinlocks use a "kick" mechanism via VCPU_KICK hypercall.

Historical Context

John Mellor-Crummey and Michael Scott published the MCS lock in "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors" (TOCS, 1991), one of the most cited systems papers. It solved the cache line bouncing problem that had plagued early shared-memory multiprocessors (BBN Butterfly, Sequent Symmetry).

Travis Craig (1993) and independently Magnusson, Landin, and Hagersten (1994) developed the CLH lock, which is used in Java's AbstractQueuedSynchronizer (AQS) — the foundation of ReentrantLock, Semaphore, and CountDownLatch.

Linux adopted ticket spinlocks in 2008 (2.6.25) to fix the unfairness of TAS locks. They were replaced by qspinlock in 2015 (3.15) for better scalability.

Production Examples

Linux kernel (kernel/locking/qspinlock.c): All spinlock_t uses qspinlock.
Java AQS (java/util/concurrent/locks/AbstractQueuedSynchronizer.java): Uses CLH-based queue.
DPDK (lib/librte_eal/common/rte_spinlock.h): Uses CAS-based spinlock for packet processing fast paths.
PostgreSQL (src/backend/storage/lmgr/s_lock.c): Uses TAS spinlock for buffer manager short critical sections.
RocksDB (util/mutexlock.h): Uses std::mutex (which is futex-backed) but falls back to spinlock for very short critical sections.

Debugging Notes

perf lock: Traces kernel lock/unlock events, shows contention statistics per lock.
/proc/lock_stat (requires CONFIG_LOCK_STAT): Per-lock-class contention counts, wait times, hold times.
lockdep_rcu_dereference: Ensures RCU-protected pointers are only read under rcu_read_lock().
spin_is_locked(): Non-atomic check for debugging (not for production logic).
Symptoms of spinlock contention: high %sys in top, CPU not making progress, high IPC but low throughput in perf stat.

Security Implications

Spinlock starvation DoS: A kernel path holding a spinlock for a long time (e.g., waiting for a hardware register) can prevent all other CPUs from making progress. Avoid long critical sections under spinlocks.
Spectre variant 1 in spinlock loops: Speculative execution during a spin loop can cause cache side-channel leaks. cpu_relax() includes a PAUSE instruction which also acts as a speculation barrier on newer Intel CPUs.
Interrupt handler spinlock injection: If an attacker can trigger a specific interrupt at a precise moment (via hardware or rowhammer), they may be able to interfere with lock state. Largely theoretical but considered in hardware security research.

Common Pitfalls and Failure Modes

Holding a spinlock across a sleeping operation: Calling kmalloc(GFP_KERNEL) (which may sleep) while holding a spinlock is illegal. Use GFP_ATOMIC or rework the code to release the lock first.
Nested spinlocks without documenting ordering: Lock A → Lock B is safe, but any other path taking B → A will deadlock. Use lockdep and document lock ordering in header files.
Per-CPU variables without disabling preemption: On a preemptible kernel, accessing per-CPU variables requires either disabling preemption (get_cpu()) or being in a non-preemptible context (interrupt handler, spinlock held).
Trylock in a tight loop without backoff: spin_trylock() in a tight loop without cpu_relax() skips the PAUSE/YIELD hint, preventing the CPU from hinting to the hardware that it is in a spin loop, degrading hyperthreading performance.
qspinlock with too many MCS nodes: MAX_NODES=4 in the Linux kernel. If preemption occurs while holding an MCS position, another CPU may take a 5th slot and spin on the wrong path. This is why qspinlock is not safe with full preemption (only raw_spinlock_t is used in truly preemption-sensitive code under PREEMPT_RT).

Real-World Failure Case Studies

Linux Ticket Spinlock Starvation (pre-qspinlock): On virtualized environments (Xen, KVM), a vCPU holding a ticket spinlock could be preempted by the hypervisor scheduler. Other vCPUs would spin waiting for their ticket to be called, but since the holder was not running, the wait could be unbounded. This "lock-waiter preemption" problem was the primary motivation for PV spinlocks and eventually qspinlock's paravirt support.

Google's futex_requeue deadlock (2013): A bug in the futex requeue path could cause spinlock-like busy loops in the kernel under specific contention patterns, reported via Google's kernel team. The interaction between futex hash bucket spinlocks and the requeue logic created a livelock.

Amazon EC2 spinlock regression (2017): A Linux 4.12 kernel change to qspinlock behavior on heavily oversubscribed instances caused measurable latency spikes in the networking stack. The PV spinlock kick threshold needed tuning for specific instance types, fixed in 4.14.

Modern Usage and Cloud-Scale Considerations

At cloud scale, spinlocks in the kernel are under extreme pressure: - CPU oversubscription: vCPUs spinning on spinlocks waste physical CPUs. PV spinlocks mitigate this. - Lock-free networking: DPDK and XDP bypass kernel spinlocks entirely for packet processing. - io_uring: Reduces syscall overhead and per-request locking by batching operations. - Per-CPU data: Replacing shared spinlocked data structures with per-CPU structures (percpu counters, percpu freelists) eliminates contention.

ARM-based cloud instances (Graviton, Neoverse) use WFE (Wait For Event) instruction in spinlock loops rather than x86's PAUSE, waking immediately when the cache line changes rather than after a fixed delay.

Future Directions

Cohort locks: Lock passing within a NUMA node before releasing to another node, dramatically reducing inter-socket traffic. Implemented in research (Dice et al., 2012) but not yet in mainline Linux.
Delegation-style locks: ffwd (Fatourou & Kallimanis), where a designated server thread executes all critical sections on behalf of waiting threads. Eliminates cache line bouncing at the cost of serialized execution.
Hardware queue locks: Future ISA extensions that implement MCS-like queueing in hardware cache coherency protocol.

Summary Table

Lock Type	Space	Fairness	Cache Bouncing	Spinning Target	Used In
TAS Spinlock	1 word	No (LIFO)	O(N²)	Shared	Old kernels
Ticket Spinlock	2 bytes	FIFO	O(N)	Shared	Linux 2.6.25-3.14
MCS Lock	1 word + N nodes	FIFO	O(1)	Local	Research, Java AQS
CLH Lock	1 word + N nodes	FIFO	O(1)	Predecessor	Java AQS
qspinlock	1 word (32-bit)	FIFO	O(1)	Local (MCS)	Linux ≥3.15

Exercises

Implement ticket spinlock: Write a ticket spinlock in C using __sync_fetch_and_add. Add a contention stat counter. Benchmark at 2, 4, 8, 16 threads with a 1ns critical section and plot wait time distribution.
MCS lock implementation: Implement the MCS lock in C. Verify FIFO ordering by recording timestamps when each thread acquires the lock. Confirm they match the order of ticket acquisition.
Measure cache line bouncing: Use perf stat -e cache-misses to measure cache misses with TAS spinlock vs qspinlock equivalent under high contention. Observe the difference.
Lockdep exercise: In a kernel module, intentionally create a lock ordering inversion (acquire A then B in one function, B then A in another). Boot with CONFIG_LOCKDEP and observe the warning. Fix it.
PAUSE instruction effect: Write a spinloop with and without _mm_pause() (x86) / __yield() (ARM). Use perf stat -e cpu-cycles to measure cycles spent spinning. Observe how PAUSE reduces power and allows hyperthreading partner to run.

References

Mellor-Crummey, J.M. & Scott, M.L. (1991). "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors." TOCS 9(1):21-65.
Craig, T.S. (1993). "Building FIFO and priority-queuing spin locks from atomic swap." UW Tech Report 93-02-02.
Linux kernel source: kernel/locking/qspinlock.c, include/asm-generic/qspinlock.h, kernel/locking/lockdep.c
Linux kernel source: kernel/locking/qspinlock_paravirt.h (PV spinlocks)
Dice, D., Marathe, V.J., & Shavit, N. (2012). "Lock Cohorting." PPoPP '12.
Intel Optimization Reference Manual, §2.4.4 (PAUSE instruction)