Kernel Locking Overview
Technical Overview
Concurrent access to shared data structures is the central challenge of kernel programming. The Linux kernel runs on multicore machines where dozens or hundreds of CPUs may simultaneously execute kernel code, access shared data structures, and interact with interrupt handlers — which can fire at any point on any CPU. Getting locking wrong means data corruption, kernel panics, deadlocks, and security vulnerabilities.
Linux provides a rich taxonomy of locking primitives, each designed for a specific context: some can be used in interrupt handlers (where sleeping is prohibited), some scale well across hundreds of CPUs, some prioritize read-heavy workloads, and some provide merely ordering guarantees without exclusion. Choosing the wrong primitive wastes performance; using a primitive in the wrong context causes a kernel bug.
The lockdep validator, built into the kernel, provides runtime proof that lock usage is correct — it maintains a dependency graph of lock acquisition orders and detects potential deadlocks before they occur.
Prerequisites
02-kernel-initialization.md: when in the boot sequence locking is available03-kernel-memory-model.md: what shared data structures exist04-kernel-threads.md: kernel thread and interrupt context00-foundations/06-interrupts.md: interrupt context rules
Core Content
Lock Taxonomy in Linux
Linux Kernel Lock Hierarchy (from least to most blocking):
RCU — Lock-free reads, copy-on-write updates
↕ lowest overhead
Spinlock — Busy-wait, no sleep, usable in IRQ context
seqlock — Writer-preferring, readers retry on write
rwlock — Read-write spinlock
↕
Completion — One-shot barrier (wait-for-event)
↕
Semaphore — Legacy, general purpose, may sleep
mutex — Single-owner, may sleep
rwsem — Read-write semaphore, may sleep
↕ highest overhead/flexibility
RT mutex — Priority-inheriting mutex (PREEMPT_RT)
Spinlock: Atomic, IRQ-Safe Variants
Source: include/linux/spinlock.h, arch/x86/include/asm/spinlock.h
A spinlock is a lock where the waiting CPU continuously executes a loop (spins) checking the lock state rather than sleeping. This makes it safe in interrupt context (no sleep possible) but wasteful if the critical section is long (the waiting CPU does no useful work).
spinlock_t my_lock;
spin_lock_init(&my_lock); // dynamic init
DEFINE_SPINLOCK(my_lock); // static init (preferred)
// Basic spinlock (disables preemption):
spin_lock(&my_lock);
// critical section — preemption disabled
spin_unlock(&my_lock);
// With IRQ disable (use if lock shared between process context AND IRQ handlers):
unsigned long flags;
spin_lock_irqsave(&my_lock, flags); // save IRQ state, disable IRQs, lock
// critical section — preemption AND interrupts disabled
spin_unlock_irqrestore(&my_lock, flags); // unlock, restore IRQ state
// With just IRQ disable (not save/restore — only if you know IRQs were enabled):
spin_lock_irq(&my_lock);
spin_unlock_irq(&my_lock);
// Non-blocking try:
if (spin_trylock(&my_lock)) {
// got the lock
spin_unlock(&my_lock);
} else {
// lock was busy
}
When to use irqsave vs. plain spin_lock:
If a lock is accessed both in process context AND in an interrupt handler on the same CPU:
- Without irqsave: process context takes the lock, interrupt fires, IRQ handler tries to take the same lock → deadlock (spinning forever on a lock held by the interrupted code on the same CPU)
- With irqsave: disabling interrupts while holding the lock prevents the IRQ from firing, preventing the deadlock
If the lock is only accessed in IRQ handlers (never in process context), plain spin_lock suffices (IRQ handlers on the same CPU cannot preempt each other by default).
Implementation on x86-64: Modern Linux uses queued spinlocks (qspinlock, include/asm-generic/qspinlock.h). A qspinlock maintains a queue of waiting CPUs, granting the lock in FIFO order, preventing starvation. The lock word contains: locked bit (bit 0), pending bit (bit 8), and a wait queue tail index (bits 16+).
Mutex: Sleepable, Not in IRQ Context
Source: include/linux/mutex.h, kernel/locking/mutex.c
A mutex (MUTual EXclusion) allows at most one holder. When a CPU tries to acquire a contended mutex, instead of spinning, it goes to sleep (calls schedule()), allowing other work to proceed. When the mutex is released, the waiter is woken up.
struct mutex my_mutex;
mutex_init(&my_mutex); // dynamic init
DEFINE_MUTEX(my_mutex); // static init
mutex_lock(&my_mutex); // acquire (sleeps if contended)
// critical section — may sleep here
mutex_unlock(&my_mutex);
// Non-blocking:
if (mutex_trylock(&my_mutex)) {
mutex_unlock(&my_mutex);
}
// Interruptible (returns -EINTR if signal arrives while waiting):
int ret = mutex_lock_interruptible(&my_mutex);
if (ret == -EINTR) {
// signal received, lock not acquired
return -EINTR;
}
mutex_unlock(&my_mutex);
Mutex vs. spinlock choice rules:
- Interrupt context → must use spinlock (no sleep allowed)
- Critical section holds for > a few hundred nanoseconds → use mutex (avoids wasting CPU spinning)
- Critical section does I/O, calls schedule(), or calls any function that might sleep → must use mutex
- Performance-critical, very short critical section on SMP → spinlock (avoids context switch overhead)
Mutex implementation: Linux's mutex uses an adaptive spinning optimization: instead of immediately sleeping, a waiter briefly spins if the lock holder is currently running on another CPU (it will release "soon"). This is the osq_lock (optimistic spinning queue) mechanism. If the holder is not running (blocked in scheduler), the waiter immediately goes to sleep.
rwsem: Read-Mostly Concurrent Access
Source: include/linux/rwsem.h, kernel/locking/rwsem.c
A reader/writer semaphore allows multiple concurrent readers OR one exclusive writer:
struct rw_semaphore my_rwsem;
init_rwsem(&my_rwsem);
DECLARE_RWSEM(my_rwsem);
// Reader path:
down_read(&my_rwsem); // acquire read lock (sleeps if writer holds it)
// read shared data
up_read(&my_rwsem);
// Writer path:
down_write(&my_rwsem); // acquire exclusive write lock (sleeps if any reader)
// modify shared data
up_write(&my_rwsem);
// Downgrade write lock to read lock (without releasing):
downgrade_write(&my_rwsem);
// Try variants:
if (down_read_trylock(&my_rwsem)) { ... up_read(&my_rwsem); }
if (down_write_trylock(&my_rwsem)) { ... up_write(&my_rwsem); }
Real usage: mmap_lock (formerly mmap_sem), the per-process struct mm_struct read-write semaphore, is the most frequently used rwsem in the kernel. VMA lookups (page fault handling, mmap, munmap) acquire it as a read lock; VMA modifications (mmap(2), munmap(2), mprotect(2)) acquire it as a write lock.
Semaphore: Legacy
Source: include/linux/semaphore.h
POSIX-style counting semaphores exist in the kernel but are largely legacy. For binary semaphores (mutex-like), use mutex. For read-mostly patterns, use rwsem. For one-time completion events, use completion. The kernel semaphore (struct semaphore) is documented but discouraged for new code.
seqlock: Writer-Preferring
Source: include/linux/seqlock.h
A seqlock allows high-frequency readers at the cost of potential retries. Writers always proceed (no waiting for readers); readers check a sequence counter before and after their read — if the counter changed (a write occurred), they retry:
seqlock_t my_seqlock;
seqlock_init(&my_seqlock);
// Writer:
write_seqlock(&my_seqlock); // increment sequence counter (now odd = write in progress)
// modify data
write_sequnlock(&my_seqlock); // increment again (now even = write done)
// Reader (retries on concurrent write):
unsigned seq;
do {
seq = read_seqbegin(&my_seqlock); // read sequence counter (waits if odd)
// read data into local variables
} while (read_seqretry(&my_seqlock, seq)); // retry if counter changed
Used for: kernel timekeeping (timekeeper_seq), jiffies_64 updates, per-CPU clock data. The kernel timer tick updates the timekeeping ~1000 times/second; every clock_gettime() call reads it. seqlock enables lock-free reads with minimal overhead.
Seqlock warning: Readers must not use pointers from the seqlock-protected region (the data might be partially updated). Seqlock is best for simple numeric data (timestamps, counters) where reading a partially-updated value is detectable (the retry mechanism catches it) and safe (no pointer dereference in the read path).
Completion: One-Shot Synchronization
Source: include/linux/completion.h
A completion is used when one part of the kernel needs to wait for something specific to happen:
struct completion my_comp;
init_completion(&my_comp); // or DECLARE_COMPLETION(my_comp)
// Waiter:
wait_for_completion(&my_comp); // blocks until complete() is called
// Or with timeout:
unsigned long remain = wait_for_completion_timeout(&my_comp,
msecs_to_jiffies(5000)); // 5 seconds timeout
if (!remain) {
// timed out
}
// Signaler (in a different context):
complete(&my_comp); // wake one waiter
complete_all(&my_comp); // wake all waiters
Used for: waiting for a kernel thread to finish initialization, waiting for DMA to complete, waiting for a device to respond. kthread_stop() uses a completion internally to wait for the thread to exit.
RCU: Read-Copy-Update (Most Scalable Reads)
Source: include/linux/rcupdate.h, kernel/rcu/
RCU is the most important synchronization mechanism for read-mostly, dynamically-allocated data structures in the kernel. It allows readers to proceed without any locking at all — no atomic operations, no memory barriers (on x86-64 for non-preemptable kernels), no cache line bouncing.
Core concept: Readers read data as-is. Writers create a new version, atomically publish it (one atomic pointer store), and then wait for all existing readers to finish before freeing the old version.
// Protected data:
struct config {
int value;
char name[32];
};
struct config __rcu *global_config;
// Reader (zero overhead on non-preemptable kernels):
rcu_read_lock(); // marks entry to RCU read-side critical section
struct config *cfg = rcu_dereference(global_config); // READ_ONCE + memory barrier
if (cfg)
use(cfg->value, cfg->name); // safe: cfg cannot be freed while we hold lock
rcu_read_unlock(); // marks exit
// Writer:
struct config *old, *new;
new = kmalloc(sizeof(*new), GFP_KERNEL);
new->value = 42;
strcpy(new->name, "updated");
spin_lock(&update_lock); // only one writer at a time
old = rcu_dereference_protected(global_config, lockdep_is_held(&update_lock));
rcu_assign_pointer(global_config, new); // atomic publish: smp_store_release
spin_unlock(&update_lock);
// Wait for all readers that saw 'old' to finish:
synchronize_rcu(); // blocks until RCU grace period completes
kfree(old); // safe to free now
RCU grace period: An RCU grace period is a time interval during which every CPU has passed through at least one "quiescent state" (a point where no CPU holds an RCU read-side critical section that was active at the grace period start). On a non-preemptable kernel, any context switch or voluntary schedule constitutes a quiescent state. synchronize_rcu() waits for a full grace period — typically a few milliseconds.
call_rcu() (asynchronous grace period):
struct my_rcu_struct {
struct rcu_head rcu; // embedded RCU callback head
// ... data ...
};
static void my_rcu_free(struct rcu_head *head)
{
struct my_rcu_struct *s = container_of(head, struct my_rcu_struct, rcu);
kfree(s);
}
// Non-blocking: schedule free for after next grace period
call_rcu(&old->rcu, my_rcu_free);
// Returns immediately; free happens later via kworker thread
RCU-protected linked lists: The most common RCU pattern — many readers iterate a list, infrequent writers modify it:
// Reader:
rcu_read_lock();
list_for_each_entry_rcu(pos, &my_list, list) {
use(pos->data);
}
rcu_read_unlock();
// Writer (add):
spin_lock(&list_lock);
list_add_rcu(&new->list, &my_list); // publishes atomically
spin_unlock(&list_lock);
// Writer (remove):
spin_lock(&list_lock);
list_del_rcu(&entry->list); // removes from list (readers may still see it)
spin_unlock(&list_lock);
synchronize_rcu();
kfree(entry);
Real usage: The network protocol list (net/core/net-procfs.c), the module list (kernel/module/core.c), process credential updates (task->cred, updated atomically via RCU), and the PID hash table all use RCU. These are among the hottest data structures in the kernel, read millions of times per second.
Lock Usage Rules: Which Lock in Which Context
Context Allowed locks
────────────────────────────────────────────────────────────────
Hard IRQ handler spinlock (irqsave/irq variants), RCU read
(NO mutex, NO rwsem, NO semaphore, NO completion)
Softirq handler spinlock (bh variants), RCU read
(NO mutex, NO rwsem)
Process context All lock types (spinlock, mutex, rwsem, RCU, completion)
- Spinlock held: no sleep functions (might_sleep checks)
- Mutex held: may sleep (I/O, schedule allowed)
NMI handler ONLY: per-CPU data (no lock needed per-CPU)
NMI-safe spinlock variants (extremely limited)
Violating these rules:
- Sleeping with spinlock held → might_sleep() WARN → BUG if CONFIG_DEBUG_SPINLOCK
- Mutex in IRQ context → deadlock risk, caught by lockdep
- Using spinlock where mutex is needed → performance waste (spinning unnecessarily)
lockdep: The Kernel Lock Validator
Source: kernel/locking/lockdep.c, enabled by CONFIG_PROVE_LOCKING=y (implies CONFIG_LOCKDEP=y)
lockdep is a runtime lock validator that detects potential deadlocks before they occur in production. It works by:
-
Lock class keys: Every lock instance is associated with a "lock class" based on its declaration site (address of the static key). All
spinlock_tvariables declared in the same scope share a class. This groups locks logically. -
Dependency graph: Every time a lock acquisition occurs while another lock is held, lockdep records an edge: "lock A → lock B" means "there exists a code path that holds A then acquires B."
-
Cycle detection: After recording each new edge, lockdep checks whether a cycle exists in the dependency graph. If
A→BandB→Aexist, there's a potential deadlock: CPU1 holds A, tries to acquire B; CPU2 holds B, tries to acquire A. -
Lock state tracking: lockdep tracks which "irq state context" a lock is used in — SOFTIRQ-safe, IRQ-safe, etc. A lock acquired from IRQ context must also be acquired from process context with IRQs disabled (via
irqsave).
Example lockdep report:
======================================================
WARNING: possible circular locking dependency detected
6.1.0 #1 Not tainted
------------------------------------------------------
[process] is trying to acquire lock:
(lock_B){+.+.}-{3:3} at: function_B+0x... [module.ko]
but task is already holding lock:
(lock_A){+.+.}-{3:3} at: function_A+0x... [module.ko]
the existing dependency chain:
-> #1 (lock_B){+.+.}:
lock_acquire+0x...
_raw_spin_lock+0x...
function_B+0x...
-> #0 (lock_A){+.+.}:
lock_acquire+0x...
_raw_spin_lock+0x...
function_A+0x...
stack backtrace:
[backtrace of current code path]
Reading lockdep reports:
- The arrow chain shows the sequence of locks held
- {+.+.} shows lock usage flags (+ = used in IRQ context, etc.)
- {3:3} shows the lock class nesting level
CONFIG_PROVE_LOCKING options:
- CONFIG_PROVE_LOCKING: enables lockdep
- CONFIG_DEBUG_LOCKDEP: extra lockdep debugging
- CONFIG_LOCK_STAT: lock contention statistics (counts times a lock was contended, how long waits were)
- CONFIG_DEBUG_SPINLOCK: validates spinlock usage
- CONFIG_DEBUG_MUTEXES: validates mutex usage
All of these have runtime overhead — production kernels rarely ship with CONFIG_PROVE_LOCKING enabled, but kernel developers always use it during development.
Lock class annotations:
// Override lock class for a lock (when auto-detection is wrong):
lockdep_set_class(&my_lock, &my_lock_class_key);
// Assert a lock is held:
lockdep_assert_held(&my_lock); // panics if not held
// Mark a lock as conditionally acquired (for lockdep analysis):
mutex_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
// Tells lockdep that parent/child locks can be nested in this order
The nested variants are used for recursive locking patterns (e.g., locking a parent directory then a child directory) where the same lock class is acquired twice but always in the same order (parent before child).
Historical Context
Linux's locking evolved through several major phases:
Big Kernel Lock (BKL): Linux 2.0–2.4 used the "Big Kernel Lock" — a single recursive spinlock that protected almost the entire kernel. Systems with two CPUs would see the second CPU spin endlessly waiting for the BKL to be released. This was deliberately chosen for correctness in the early SMP era; performance came later.
Fine-grained locking: Linux 2.4 introduced per-subsystem locks. The mmap_sem (now mmap_lock) replaced BKL for VMA operations. The BKL was completely removed in Linux 3.9 (2013) after years of gradual conversion.
RCU introduction: Paul McKenney introduced RCU to Linux in 2001 (Linux 2.5.43). It was initially used for the network routing table, then expanded to hundreds of subsystems. McKenney's RCU implementation evolved through multiple variants: Classic RCU → Tree RCU → Tiny RCU (for embedded) → SRCU (Sleepable RCU for use in sleepable contexts).
lockdep: Introduced by Ingo Molnar and Peter Zijlstra in Linux 2.6.17 (2006). Before lockdep, deadlock bugs could lurk for years, only manifesting in specific multithreaded workloads on multicore machines. lockdep made deadlock bugs detectable in development.
Queued spinlocks (qspinlock): Introduced in Linux 4.2 (2015), replacing the MCS (Mellor-Crummey and Scott) spinlocks used on NUMA machines. qspinlock is more compact (4 bytes vs. 8 bytes) and cache-efficient.
PREEMPT_RT: The Real-Time Linux patchset (now partially merged upstream as of Linux 6.x) converts all spinlocks to mutexes (except truly atomic contexts), making interrupt handlers preemptible and reducing worst-case latency from milliseconds to microseconds.
Production Examples
Networking stack lock contention: The Linux TCP stack's socket lock (sock->sk_lock) is a spinlock-protected mutex hybrid (it uses a spinlock for SoftIRQ/process synchronization and a wait queue for sleeping). On a 10 GbE server processing 1M connections, socket lock contention between the NIC softirq context and application process context is a primary scalability bottleneck. This drove the development of SO_REUSEPORT (multiple sockets on the same port, each handled by a dedicated CPU) and XDP (processing packets before socket locking).
mmap_lock in the page fault handler: Every page fault requires reading the VMA tree with mmap_lock held for reading. A mmap(2) or munmap(2) call acquires mmap_lock for writing, blocking all concurrent page faults for that process. On a JVM with many threads and frequent GC (which calls mmap/mprotect), mmap_lock contention causes latency spikes. This motivated the mmap locking scalability work in Linux 5.8 and the range-locking proposals.
jbd2 journaling lock: ext4's journal subsystem uses a complex hierarchy of locks to protect journal transactions. The j_state_lock (rwlock), j_list_lock (spinlock), and per-handle locks must be acquired in a specific order. lockdep was essential in detecting lock ordering violations in jbd2 during the Linux 5.x development cycle.
Debugging Notes
# Check if lockdep is enabled (significant overhead if yes):
grep CONFIG_PROVE_LOCKING /boot/config-$(uname -r)
# View lock statistics (if CONFIG_LOCK_STAT=y):
cat /proc/lock_stat | head -50
# Shows: lock name, contended count, wait time, hold time
# Trigger a lockdep report (test):
# Any call to a wrong lock API in the wrong context will trigger it
dmesg | grep "possible circular locking dependency"
dmesg | grep "possible recursive locking"
dmesg | grep "WARNING: bad lock"
# Find spinlock holders on a hung system:
echo t > /proc/sysrq-trigger # dump all task stacks
dmesg | grep -B5 "spin_lock"
# RCU stall detection:
dmesg | grep "RCU stall"
# A CPU not completing a quiescent state for >10 seconds triggers this
# Usually indicates a CPU stuck in a loop with preemption disabled
# Check lock owner (CONFIG_DEBUG_MUTEXES or CONFIG_DEBUG_SPINLOCK):
# mutex->owner field contains the task_struct* of the holder
# eBPF: trace lock acquisition latency
bpftrace -e '
kprobe:mutex_lock_slow { @start[arg0] = nsecs; }
kretprobe:mutex_lock_slow { @lat = hist(nsecs - @start[arg0]); delete(@start[arg0]); }'
Identifying deadlocks in production:
# The kernel prints lock dependency chains on deadlock/lockdep warning
# Look for "sleeping function called from invalid context":
dmesg | grep "sleeping function"
# Check /proc/lockdep_stats for validation activity:
cat /proc/lockdep_stats
# watchdog/N (one per CPU) kills the system after 10s hard lockup.
# Before it fires, check:
echo l > /proc/sysrq-trigger # show all locks held by tasks
Security Implications
Lock-based vulnerabilities: Race conditions in lock code are a frequent source of kernel security vulnerabilities: - Double-free races: if two paths can trigger a free without proper mutual exclusion, the object is freed twice, corrupting the allocator - Use-after-free via TOCTOU: a pointer is read, the lock is released, the object is freed by another thread, then the first thread uses the stale pointer - Lock inversion leading to deadlock: if code holds lock A and tries to acquire B, while another path holds B and tries to acquire A — lockdep catches this before it causes a real freeze
RCU and security: RCU's callback queue (rcu_work) processes call_rcu() callbacks. If a freed object's memory is reallocated before the RCU callback fires (which is possible if kfree_rcu() is used incorrectly or the grace period is unexpectedly long), a type-confusion bug can occur. The SLAB_TYPESAFE_BY_RCU slab flag prevents this by ensuring that freed memory is only returned to its original slab cache, preventing type-confusion attacks across the grace period.
lockdep as security tool: lockdep prevents entire classes of race conditions from reaching production by catching them in development. Many kernel CVEs could have been prevented if the offending code had been developed with lockdep enabled.
Performance Implications
Spinlock vs. mutex cost: - Uncontended spinlock: ~15 cycles (atomic test-and-set) - Uncontended mutex: ~30 cycles (slightly more complex state tracking) - Contended spinlock (spinning 10μs): ~10,000 cycles wasted - Contended mutex (one context switch): ~10,000 cycles but frees the CPU for other work
For critical sections shorter than ~100ns, spinlock wins if contention is rare. For longer critical sections or when contention causes significant spinning, mutex wins (the CPU does useful work while waiting).
RCU read overhead:
- Non-preemptable kernel: rcu_read_lock()/rcu_read_unlock() = zero overhead (just a preempt_disable()/preempt_enable(), which itself is optimized away in non-preemptable builds)
- Preemptable kernel (PREEMPT_RT): rcu_read_lock saves the task's state for grace period tracking — ~5 cycles overhead
- synchronize_rcu(): blocks for ~1–5ms (a full grace period). Use call_rcu() for non-blocking updates.
Lock contention profiling:
# perf: show lock contention hotspots
perf lock record ./my_program
perf lock report
# CONFIG_LOCK_STAT output:
# class name: contentions wait_total wait_max
# &rq->lock: 12345 1234567ns 5678ns
# mmap_lock: 1234 234567ns 3456ns
Failure Modes and Real Incidents
Dirty COW (CVE-2016-5195) — locking race: The Dirty COW vulnerability was a race condition between two concurrent page fault handlers operating on the same vm_area_struct. The VMA's mmap_lock was held as a read lock (allowing concurrent faults), but the write to the physical page was not properly serialized. The fix added additional locking around the copy-on-write decision point in do_wp_page().
ext4 deadlock in transaction commit (2019): A lockdep warning identified a potential deadlock in ext4's jbd2_journal_commit_transaction(): the journal transaction lock was acquired while holding the slab's kmem_cache lock (for statistics accounting). Another path held the journal transaction lock while doing slab allocation. Not a real observed deadlock but caught by lockdep before it could manifest in production.
Packet processing RCU stall under DDoS (2020): Under a SYN flood attack, Linux's network stack was processing millions of packets per second. Each new connection lookup required an RCU read-side critical section. When the system became overloaded and CPU scheduling was disrupted (by interrupt storms), some CPUs failed to complete RCU quiescent states for >10 seconds, triggering RCU stall warnings. The fix involved tuning RCU grace period detection thresholds and implementing rate limiting in the SYN flood path.
kfree_rcu() memory accumulation: In production deployments using Kubernetes with many namespaces being created/destroyed, kfree_rcu() was accumulating deferred free callbacks faster than grace periods could process them. Each namespace teardown queued dozens of kfree_rcu() callbacks. Under load, this caused memory to grow unboundedly until the next quiescent period. Fixed by reducing the number of kfree_rcu() calls in namespace teardown code.
Modern Usage
SRCU (Sleepable RCU): include/linux/srcu.h. An RCU variant where the read-side critical section can sleep. Used for subsystems where readers must do I/O (e.g., filesystem operations). More expensive than classic RCU but still much cheaper than a mutex for read-heavy patterns. Used in: KVM (for vCPU and memory slot operations), notifier_call_chain (kernel notifier chains).
Per-CPU RCU (per_cpu_ptr inside RCU): Combining per-CPU variables with RCU: data that is per-CPU in normal operation but needs consistent access during certain operations (e.g., CPU hotplug). The access path: normal operation = per-CPU (no lock); special operation = RCU quiescent state forces all CPUs to serialize.
seqlock + RCU for timekeeping: The kernel's timekeeping subsystem (kernel/time/timekeeping.c) uses both a seqcount (timekeeper_seq) for reading the timekeeper struct and RCU for the clocksource pointer. Reads are effectively lock-free (seqcount retry if write happens, which is rare). This path is in the vDSO hot path for clock_gettime().
Future Directions
- Lock elision: Using TSX (Intel Transactional Synchronization Extensions) hardware to speculatively execute critical sections without taking locks, falling back to the actual lock only on conflict. TSX had reliability issues (disabled by microcode updates) but the concept remains viable for future hardware.
- Concurrent data structure primitives (CDS): Research into wait-free and lock-free alternatives to specific kernel data structures (skip lists as lock-free alternatives to rb_trees for the scheduler).
- Rust memory safety eliminating some locks: Rust's ownership model can replace some locking patterns with compile-time guarantees. A
Mutex<T>in Rust enforces that the data T is only accessed while the lock is held — a correctness guarantee that Cspinlock_tdoesn't enforce. As Rust expands in the kernel, some lock/data pairing bugs will become compiler errors. - PREEMPT_RT full upstream merge: The PREEMPT_RT patchset converts spinlocks to sleeping locks in most contexts. Linux 6.12 merged the core PREEMPT_RT infrastructure. As it stabilizes, production systems (medical devices, industrial control) can use mainline Linux for real-time applications.
Exercises
-
Write a kernel module that demonstrates a potential deadlock: two spinlocks
AandB, and two code paths that acquire them in opposite orders. EnableCONFIG_PROVE_LOCKING=y(in a VM). Run the module and observe the lockdep report. Then fix the ordering and confirm the report disappears. -
Implement a simple RCU-protected linked list in a kernel module. The list stores
struct { int id; struct list_head node; struct rcu_head rcu; }. Implement: add element (writer, useslist_add_rcu), remove element (writer, useslist_del_rcu+call_rcu), read/iterate (reader, useslist_for_each_entry_rcu). Test with concurrent readers and writers. -
Read
kernel/locking/mutex.c, specifically__mutex_lock_slowpath(). What is the "optimistic spin" optimization? When does it kick in? What condition causes a mutex waiter to give up spinning and go to sleep? How does this affect the latency distribution of mutex acquisitions? -
Use
perf lock recordandperf lock reporton a database server (PostgreSQL or MySQL) under load. Identify the top 3 most-contended kernel locks. Research in the kernel source what each lock protects. Explain why those specific operations are hot. -
Read
include/linux/rcupdate.h. Findrcu_dereference()andrcu_assign_pointer(). Expand the macros and identify the memory barriers used. Why doesrcu_dereference()need a load barrier? Why doesrcu_assign_pointer()need a store barrier? What would happen without these barriers on a weakly-ordered architecture (ARM)?
References
- Linux kernel source:
include/linux/spinlock.h,include/linux/mutex.h,include/linux/rwsem.h,include/linux/rcupdate.h,include/linux/seqlock.h,include/linux/completion.h,kernel/locking/lockdep.c - Linux kernel documentation:
Documentation/locking/,Documentation/RCU/ - Paul McKenney, "Is Parallel Programming Hard, And, If So, What Can You Do About It?" (free online, comprehensive RCU coverage)
- Paul McKenney, "What is RCU?": https://www.kernel.org/doc/html/latest/RCU/whatisRCU.html
- Robert Love, Linux Kernel Development, 3rd ed., Chapter 9 (Kernel Synchronization Methods), Chapter 10 (Kernel Synchronization Introduction)
- Ingo Molnar, "Lockdep: The Runtime Locking Correctness Validator", LWN.net 2006
- Peter Zijlstra, "How to use the lockdep validator", LWN.net
- Brendan Gregg, "Linux Lock Analysis", BPF Performance Tools, Chapter 14
- Ulrich Drepper, "Futexes Are Tricky" (for comparison: user-space synchronization)
- PREEMPT_RT documentation: https://wiki.linuxfoundation.org/realtime/documentation/