Readers-Writer Locks

Overview

A readers-writer (RW) lock allows concurrent read access from multiple threads while providing exclusive access for writers. The core insight is that reads are safe to parallelize as long as no write is occurring — many real-world workloads are dominated by reads (routing tables, configuration data, caches, in-memory indexes), making RW locks an attractive optimization over plain mutexes. However, RW locks are more complex than mutexes and their performance benefits evaporate or reverse under certain workloads and hardware conditions. The Linux kernel's evolution from rwlock_t to rwsem, percpu_rwsem, and ultimately RCU tells the story of progressively solving the fundamental tension between read scalability and write latency.

Prerequisites

Understanding of mutexes and spinlocks
Knowledge of NUMA architecture and cache coherency
Familiarity with the reader-writer problem (Courtois et al., 1971)
Understanding of why per-CPU data structures scale better than shared ones

Core Technical Content

Reader-Writer Problem

Formal constraints: 1. Multiple readers may be in the critical section simultaneously. 2. Only one writer may be in the critical section at a time. 3. No reader may be in the critical section when a writer is. 4. No writer may be in the critical section when a reader is.

The freedom to choose which threads to favor when both are waiting defines three policies:

Reader preference: New readers are admitted even if a writer is waiting. Readers are never blocked by waiting writers. Problem: writer starvation — if readers arrive continuously, writers may never get access.

Writer preference: New readers are blocked if a writer is waiting. No new reader is admitted once a writer is queued. Problem: reader starvation under high write rates (rare in practice).

Fair (FIFO): Service in arrival order. Readers that arrive before a writer go first; readers that arrive after a waiting writer must wait. Provides starvation freedom for both.

Linux Kernel `rwlock_t` (Spinlock-Based)

include/linux/rwlock.h, arch/x86/include/asm/spinlock.h:

typedef struct {
    arch_rwlock_t raw_lock;
} rwlock_t;

// x86 implementation (simplified):
// raw_lock is a 32-bit integer:
// bit 31: write lock (1 = held)
// bits 0-30: reader count

// Read lock
read_lock(&rwlock);
// ...
read_unlock(&rwlock);

// Write lock
write_lock(&rwlock);
// ...
write_unlock(&rwlock);

As with spinlock_t, there are _irq, _irqsave, _bh variants for interrupt safety.

Limitation: rwlock_t is a spinlock-based RW lock — readers and writers busy-wait. This means: - Readers compete with each other on write to the reader count field (cache line bouncing on every read_lock/read_unlock). - On a 64-core machine, a read-lock-heavy workload causes 64 CPUs to hammer the same cache line, which is worse than a simple mutex.

This is the rwlock scalability fallacy: naively assuming more readers = faster is wrong because the shared reader counter becomes the bottleneck.

Linux Kernel `rwsem` (Sleeping RW Lock)

include/linux/rwsem.h, kernel/locking/rwsem.c. For process-context code that may sleep:

struct rw_semaphore {
    atomic_long_t count;       // reader count + flags
    atomic_long_t owner;       // writer owner pointer + flags
    struct optimistic_spin_queue osq;  // MCS spin queue
    raw_spinlock_t wait_lock;
    struct list_head wait_list;
};

The count field encodes: - RWSEM_UNLOCKED_VALUE (0): unlocked - Positive with RWSEM_READER_BIAS increments: shared by readers - RWSEM_WRITER_LOCKED: exclusive write lock

Operations: down_read(), up_read(), down_write(), up_write(), downgrade_write().

downgrade_write() is a key optimization: atomically convert a write lock to a read lock (without releasing and re-acquiring), useful when the write phase is done but the read phase still needs the data.

percpu_rwsem

include/linux/percpu-rwsem.h. For read-dominant workloads, percpu_rwsem eliminates reader-side cache bouncing:

struct percpu_rw_semaphore {
    struct rcu_sync    rss;
    unsigned int __percpu *read_count;  // per-CPU reader counters
    struct rcuwait     writer;
    wait_queue_head_t  waiters;
    atomic_t           block;
};

Read path:

void percpu_down_read(struct percpu_rw_semaphore *sem) {
    // Increment local CPU's counter (no cache line bouncing)
    this_cpu_inc(*sem->read_count);
    // Check if a writer is waiting (via RCU)
    if (unlikely(!rcu_sync_is_idle(&sem->rss))) {
        // Slow path: decrement and use traditional wait
        this_cpu_dec(*sem->read_count);
        percpu_down_read_slowpath(sem);
    }
}

Write path must sum all per-CPU counters and wait for them to drop to zero — expensive for writers, but reads are now truly local and generate no contention.

Used in Linux for critical paths like freeze_super(), CPU hotplug cpus_read_lock().

Seqlock

include/linux/seqlock.h. A seqlock favors writers at the cost of reader retry:

typedef struct {
    struct seqcount seqcount;  // even = no write in progress, odd = write
    spinlock_t lock;
} seqlock_t;

// Writer (exclusive):
write_seqlock(&sl);
// ... modify shared data ...
write_sequnlock(&sl);

// Reader (optimistic, may retry):
unsigned seq;
do {
    seq = read_seqbegin(&sl);  // read sequence number
    // ... read shared data ...
} while (read_seqretry(&sl, seq));  // retry if write occurred

If seqcount is odd when the reader reads it, a write is in progress — spin. If the count changed between read_seqbegin and read_seqretry, a write occurred during the read — retry. Readers do not modify shared state at all (no counter increment), so there is zero reader-side write traffic. Writers still use a spinlock for mutual exclusion.

Seqlock is ideal for data that is read far more often than written and can be read speculatively (retried cheaply), such as: - jiffies (nanosecond timestamp): kernel/time/timer.c - x86 timekeeper: kernel/time/timekeeping.c - Network routing table sequence number

Limitation: Cannot be used if the reader dereferences pointers in the protected data (pointer may be freed during retry — use RCU instead).

RCU as Scalable Reader-Writer Pattern

RCU (Read-Copy-Update, kernel/rcu/) is the ultimate read-scalable primitive: readers pay zero synchronization cost (just rcu_read_lock()/rcu_read_unlock(), which disable preemption but do no atomic operations). See 10-rcu.md for full coverage.

The progression in Linux:

rwlock_t        --> Simple but not scalable (cache bounce on read_lock)
rwsem           --> Sleeping, scalable for moderate workloads
percpu_rwsem    --> Per-CPU counters, near-zero reader overhead
seqlock         --> Zero reader writes, retry on conflict
RCU             --> Zero reader overhead, copy-on-update for writers

When RW Locks Help vs Hurt

RW lock helps when: - Read operations are significantly longer than the overhead of lock acquisition. - The workload is genuinely read-dominant (>95% reads). - The read operation holds the lock for microseconds or longer.

RW lock hurts or is neutral when: - Critical section is very short (< 100ns): Lock acquisition overhead dominates. - Many readers run simultaneously: Reader count cache line becomes a bottleneck. - Writers are frequent: Readers get no parallelism benefit if frequently blocked. - NUMA: Reader count updates cause inter-socket traffic even when "just reading".

Rule of thumb: Benchmark before replacing a mutex with an RW lock. Under high reader concurrency on multi-socket NUMA, a plain mutex often outperforms a naive rwlock.

Historical Context

The reader-writer problem was formally stated by Courtois, Heymans, and Parnas in "Concurrent Control with 'Readers' and 'Writers'" (CACM, 1971), which presented both reader-preference and writer-preference solutions.

Seqlock was invented by Stephen Hemminger and Andrea Arcangeli for the Linux kernel circa 2001 to accelerate the jiffies timestamp without requiring readers to write to a lock word.

Percpu rwsem was added to Linux in 2012 (v3.4) by Tejun Heo for the CPU hotplug notifier chain, which needed near-zero reader overhead since it was called on every scheduling tick.

Production Examples

Linux VFS: inode->i_rwsem (formerly i_mutex) is a struct rw_semaphore protecting inode metadata. Multiple read() calls can proceed in parallel; write() is exclusive.
Linux network routing: FIB (Forwarding Information Base) uses RCU + seqlock for the routing table, allowing fast parallel route lookups.
Linux memory management: mm->mmap_lock is a rwsem allowing concurrent page faults (read) while mmap()/munmap() are exclusive (write).
PostgreSQL: Buffer manager uses LWLock (lightweight lock), a reader-preference rwlock, for buffer pin operations. Heavy read workloads benefit from parallel access.
Redis Cluster: Uses a seqlock-like epoch mechanism for cluster topology changes.

Debugging Notes

/proc/lock_stat (CONFIG_LOCK_STAT): Shows per-rwlock contention stats.
perf lock: Traces kernel lock events including rwsem operations.
Starvation detection: If writers are starved, add timing to down_write() and log wait times. If consistently > N ms, investigate reader hold times.
lockdep for rwsem: Lockdep tracks rwsem ordering; it will warn if a rwsem is acquired for write in one context and read in another in a circular dependency.
CONFIG_RWSEM_SPIN_ON_OWNER: Enables optimistic spinning for rwsem, reducing context switch overhead. Check kernel config.

Security Implications

Reader starvation exploit: An attacker who can flood a system with read operations may prevent writers from updating security-critical data (e.g., access control lists, revoked certificates).
Double-checked locking with rwlock: A common pattern is to read-lock, check a condition, upgrade to write-lock, re-check, modify. Upgrading is not atomic — there is a window between releasing read lock and acquiring write lock where state can change. Must re-check condition after write lock.
CVE-2017-2636 (Linux rwlock race in n_hdlc): A double-free vulnerability caused by a race condition in the n_hdlc TTY line discipline. The n_hdlc_release() function could be called concurrently due to insufficient locking, leading to a use-after-free. Exploitable for privilege escalation.

Performance Implications

Reader count cache line: On x86, read_lock() does an atomic increment of a shared 32-bit counter. On a 16-core machine with 16 simultaneous readers, this counter cache line bounces 16 times per read. For a 1-second read operation, this is trivial. For a 1-microsecond read operation, the atomic increment dominates.
NUMA amplification: On a 2-socket NUMA system, reading the shared rwlock counter from socket 1 requires a ~150ns cross-socket access every time. percpu_rwsem eliminates this.
Writer starvation measurement: With reader-preference rwlock and continuous readers, measure writer wait time as reader arrival rate increases. This characterizes starvation risk.

Common Pitfalls

Read lock → write lock upgrade: rwlock_t does not support atomic upgrade from read to write (would deadlock: holder of read lock waits for all readers to leave, including itself). Release read lock first, then acquire write lock, and re-validate the condition.
Holding read lock while calling sleeping function: rwlock_t is a spinlock — no sleeping. rwsem allows sleeping, but be careful about lock ordering.
Using rwlock for short critical sections: The overhead of rwlock acquisition (atomic ops) may exceed the benefit of parallelism.
Seqlock with pointer dereference: Reading a pointer value under seqlock and then dereferencing it after the seqlock is released is unsafe — the pointed-to data may have been freed. Use RCU for pointer-containing data.

Real-World Failure Cases

Linux mmap_sem (now mmap_lock) scalability regression: In earlier kernels, the per-process mmap_sem rwsem was a significant bottleneck on multi-threaded workloads doing page faults (readers) while any thread called mmap() (writer). The mmap_lock fairness upgrade in 5.8 and eventual per-VMA locking (5.18+) addressed this.

PostgreSQL LWLock starvation (pre-9.0): Under extreme read loads, LWLock writers (e.g., checkpoint process) could be starved for seconds, causing checkpoint lag and potential data integrity issues. Fixed by implementing a writer-preference wakeup in the LWLock code.

Linux CVE-2017-2636: The n_hdlc TTY driver's n_hdlc_release() function had inadequate locking allowing a race that freed the same buffer twice. The fix added proper locking to prevent concurrent execution of the release path.

Modern Usage and Cloud-Scale

Per-object locking: Rather than a global rwlock, modern kernels trend toward per-inode, per-VMA, per-buffer locks to reduce contention surface.
RCU everywhere: Linux increasingly replaces rwlocks with RCU where the update pattern allows (append-only, pointer swap). This is the most scalable read path possible.
Distributed rwlocks: In distributed systems, implementing read-write semantics across nodes requires distributed consensus for writes while allowing local reads from replicas. This is the foundation of single-leader replication in databases.

Future Directions

Per-VMA locking (Linux 5.18+): Instead of a single mmap_lock rwsem per process, each VMA has its own lock. Allows parallel page faults on different VMAs, critical for large multi-threaded processes on many-core systems.
Range locks: Lock a sub-range of a data structure rather than the whole thing (used in Linux's byte-range file locks, fcntl(F_SETLK)).

Summary Table

Lock Type	Reader Cost	Writer Cost	Starvation	Sleeping	Use Case
`rwlock_t`	Atomic inc/dec	Exclusive spin	Reader pref.	No	Short CS, mixed
`rwsem`	Atomic inc/dec	Exclusive sleep	Fair (OS queue)	Yes	Long CS, mixed
`percpu_rwsem`	Per-CPU inc	Sum all CPUs	Fair	Yes	Read-dominant
`seqlock`	None (retry)	Exclusive spin	Writer pref.	No	Read-dominant, no ptrs
RCU	None	Copy+pointer swap	N/A	No (readers)	Read-dominant

Exercises

Measure rwlock vs mutex: Write a benchmark with 8 readers and 1 writer, critical section = 1 µs. Compare throughput with pthread_rwlock_t vs pthread_mutex_t. Repeat with 100 ns critical section and observe the crossover.
Implement a seqlock: Write a user-space seqlock protecting a struct { int x, y; } (2D point). Writer updates atomically. Reader reads and retries. Inject artificial write races by sleeping in the middle of a write. Verify readers always see consistent data.
Percpu vs shared counter: Implement a read counter using (a) a shared atomic, (b) per-CPU counters (using __thread as a proxy). Benchmark with 16 threads incrementing 1M times. Measure cache misses with perf stat.
Starvation experiment: Implement a reader-preference rwlock. Spawn 8 continuous reader threads. Measure how long a writer must wait. Then implement writer-preference and measure reader wait time.
Kernel module rwsem trace: Write a kernel module that uses down_read/up_read/down_write/up_write on a shared data structure. Use ftrace or bpftrace to trace lock acquisition times and identify any contention.

References

Courtois, P.J., Heymans, F., & Parnas, D.L. (1971). "Concurrent Control with 'Readers' and 'Writers'." CACM 14(10):667-668.
Linux kernel source: include/linux/rwlock.h, kernel/locking/rwsem.c, include/linux/percpu-rwsem.h
Linux kernel source: include/linux/seqlock.h
Lameter, C. (2014). "Percpu rwsem." LWN.net. https://lwn.net/Articles/587810/
CVE-2017-2636: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-2636
Linux kernel documentation: Documentation/locking/