Skip to content

RCU: Read-Copy-Update

Overview

Read-Copy-Update (RCU) is a synchronization mechanism that allows reads to proceed with zero overhead — no atomic operations, no memory barriers, no cache line bouncing — while writes proceed by creating updated copies of data and atomically publishing the new version. RCU is the most widely used synchronization primitive in the Linux kernel, protecting the routing table, the dentry cache, the module list, and thousands of other read-mostly data structures. Its power comes from a fundamental insight: readers do not need to coordinate with each other or with writers; they simply need to see a consistent snapshot of the data, and the system just needs to know when it is safe to free old versions. Understanding RCU requires understanding grace periods, quiescent states, and the memory ordering implications of pointer-based updates.

Prerequisites

  • Understanding of memory barriers and acquire/release semantics
  • Familiarity with spinlocks and read-write locks
  • Knowledge of the Linux scheduler and preemption
  • Understanding of linked list manipulation at the pointer level

Core Technical Content

The Three RCU Primitives

RCU is built on three operations:

  1. rcu_read_lock() / rcu_read_unlock(): Mark the start and end of an RCU read-side critical section. Readers inside this section may safely dereference RCU-protected pointers. On a non-preemptible kernel, this is simply preempt_disable() / preempt_enable() — zero atomic overhead.

  2. synchronize_rcu(): Called by the writer after publishing a new version. Blocks until all pre-existing RCU read-side critical sections have completed. After this returns, no reader can still hold a reference to the old data.

  3. rcu_assign_pointer(p, val) / rcu_dereference(p): Publish a new pointer value with release semantics (rcu_assign_pointer) and read a pointer value with acquire semantics (rcu_dereference). These ensure proper memory ordering between the data update and the pointer update.

The Grace Period Concept

A grace period is the interval between when old data is logically removed and when it is safe to physically free. Any reader that started before the grace period may still hold the old pointer; any reader that started after the grace period sees only the new pointer.

A quiescent state for a CPU is any point where that CPU is not in an RCU read-side critical section — for example, a context switch, a return to user space, an idle period, or an rcu_read_unlock().

A grace period is complete when every CPU (or every thread, in SRCU) has passed through at least one quiescent state.

Grace Period Timeline

Time: ----+-----------+----------+----------+----------+-->
          |           |          |          |          |
          t0          t1         t2         t3         t4
          |           |          |          |          |
Writer:   [delete A,  |          |          |          |
           install B] |          |          |          |
          |           |          |          |          |
CPU 0:    [rcu_read_lock  |  rcu_read_unlock]          |
           reading A  |                    |           |
          |           |                    |           |
CPU 1:    |    [rcu_read_lock .. rcu_read_unlock]      |
          |           |                    |           |
CPU 2:    |           | ctx_switch |       |           |
                      QS          QS
                                          |
                                    Grace period END
                                    (all CPUs have QS)
                                    Safe to free A at t3/t4

At t0, writer removes A from the list and installs B. CPU 0 was already in rcu_read_lock() — it still sees A and that is safe. CPU 2 has a context switch at t2 (quiescent state). CPU 0's rcu_read_unlock() at t2 is another quiescent state. CPU 1's rcu_read_unlock() at ~t3 is the last. At t3, all CPUs have passed through a quiescent state, so the grace period is over. It is now safe to free A.

RCU Read-Copy-Update Pattern

The canonical RCU update of a pointer in a linked list:

// Protected list node
struct foo {
    int data;
    struct rcu_head rcu;  // for deferred free
    struct list_head list;
};

// READ side:
rcu_read_lock();
list_for_each_entry_rcu(p, &foo_list, list) {
    if (p->data == target) {
        // Use p safely; it won't be freed until after rcu_read_unlock
        process(p);
    }
}
rcu_read_unlock();

// UPDATE side (remove an entry):
spin_lock(&foo_lock);  // serialize writers with each other
list_del_rcu(&old->list);  // remove from list (RCU-safe)
spin_unlock(&foo_lock);

synchronize_rcu();  // wait for grace period
kfree(old);         // now safe to free

// Or async:
call_rcu(&old->rcu, foo_reclaim);  // kfree in RCU callback
// foo_reclaim is called after grace period, potentially in softirq context

Memory Ordering in RCU

rcu_assign_pointer(p, val) is:

#define rcu_assign_pointer(p, v) \
    smp_store_release(&(p), v)

rcu_dereference(p) is:

#define rcu_dereference(p) \
    smp_load_acquire(&(p))

This release/acquire pair ensures that all modifications to the data structure (the new node's fields) are visible to any reader that dereferences the new pointer. Without these barriers, a reader on ARM could see the new pointer but stale data in the pointed-to object.

On x86 (TSO), plain load/store suffice due to the stronger hardware memory model, and rcu_dereference compiles to just a load with a compiler barrier.

RCU Implementation Variants

Classic RCU (original): Checks that each CPU has passed through a context switch. Simple but heavyweight — requires waiting for all CPUs. Used before Tree RCU.

Tree RCU (kernel/rcu/tree.c): Hierarchical implementation using a tree structure to aggregate quiescent states from all CPUs, scaling to thousands of CPUs. Each CPU node in the tree reports its quiescent state upward; a grace period completes when the root receives reports from all leaf nodes.

                      Root RCU node
                     /              \
            RCU node                RCU node
           /        \              /        \
        CPU 0..3  CPU 4..7    CPU 8..11  CPU 12..15

SRCU (Sleepable RCU) (kernel/rcu/srcutree.c): Allows the read-side critical section to sleep (block on I/O, sleep on a mutex). Classical RCU's read-side must be non-sleeping because it relies on context switches as quiescent states. SRCU uses per-CPU counters to track readers independently of scheduling:

struct srcu_struct my_srcu;
init_srcu_struct(&my_srcu);

// Read side (can sleep inside):
int idx = srcu_read_lock(&my_srcu);
// ... may sleep ...
srcu_read_unlock(&my_srcu, idx);

// Write side:
synchronize_srcu(&my_srcu);  // waits for all SRCU readers

// Or:
call_srcu(&my_srcu, &rcu_head, callback);

SRCU is used in filesystems (notify_change, fsnotify) and anywhere readers may block.

TASKS RCU (kernel/rcu/tasks.h): Waits for all tasks (threads) to pass through a scheduling point, rather than waiting for CPUs. Used for code patching (live kernel patching, BPF) where the unit of protection is a task, not a CPU.

Tiny RCU (kernel/rcu/tiny.c): Simplified RCU for uniprocessor (UP) builds and small embedded kernels. Without SMP, grace periods are trivially instantaneous.

call_rcu: Deferred Reclamation

synchronize_rcu() blocks the caller, which is expensive if done frequently. call_rcu() is the non-blocking alternative: register a callback that fires after the next grace period:

struct my_obj {
    int data;
    struct rcu_head rcu;
};

void my_obj_free(struct rcu_head *head) {
    struct my_obj *obj = container_of(head, struct my_obj, rcu);
    kfree(obj);
}

// Writer (after removing from RCU-protected structure):
call_rcu(&obj->rcu, my_obj_free);
// Returns immediately; my_obj_free is called after grace period
// in a softirq context (RCU_SOFTIRQ)

RCU Memory Ordering Guarantees

RCU provides a happens-before guarantee: - Everything a writer does before rcu_assign_pointer() happens-before everything a reader does after rcu_dereference() of the same pointer. - Everything a reader does inside rcu_read_lock()/rcu_read_unlock() happens-before synchronize_rcu() returns.

This is weaker than a mutual exclusion lock but sufficient for the read-modify-copy-publish update pattern.

RCU Reclamation and Quiescent States

The rcu_expedited variants (synchronize_rcu_expedited()) force CPUs to report quiescent states immediately by sending IPIs, completing the grace period in ~10-100 µs instead of up to tens of milliseconds. Useful for module unloading, kprobes attachment.

The "grace period kthread" (rcu_gp_kthread) runs on each NUMA node and drives grace period state machines in kernel/rcu/tree.c.

RCU in userspace: liburcu

liburcu (Userspace RCU, https://liburcu.org) provides RCU for user-space programs:

#include <urcu.h>

// Read side:
rcu_read_lock();
struct node *p = rcu_dereference(shared_ptr);
// use p
rcu_read_unlock();

// Write side:
rcu_assign_pointer(shared_ptr, new_ptr);
synchronize_rcu();
free(old_ptr);

Multiple flavors: - liburcu-bp (bulletproof): Works without thread registration, uses signals. - liburcu-mb (memory barriers): Uses smp_mb() on read side. - liburcu-qsbr (quiescent state based): Fastest, requires explicit rcu_quiescent_state() calls.

Used in: Userspace Envoy proxy, LTTng, some database implementations.

Historical Context

Kung and Robinson proposed the general idea of read-copy-update in "On Optimistic Methods for Concurrency Control" (1981, TODS), though without the name RCU. The term "RCU" and its Linux kernel implementation were developed by Dipankar Sarma and Paul McKenney at IBM starting around 1998-2002. The first Linux kernel submission was for 2.5.43 (2002). McKenney has been the primary RCU maintainer and researcher ever since, publishing dozens of papers on RCU theory and implementation.

Tree RCU replaced Classic RCU in Linux 2.6.29 (2009) to scale to systems with hundreds of CPUs. Preemptible RCU (SRCU precursor) was added for -rt kernels.

Production Examples

  • Linux networking: The route table (fib_table), ARP table, and neighbor table are all protected by RCU. Route lookups — happening millions of times per second — are lock-free reads.
  • Linux VFS: The dentry cache (fs/dcache.c) uses RCU for path name lookups. Every open(), stat(), exec() traverses the dentry tree under RCU.
  • Linux module system: kernel/module.c uses RCU to protect the module list. __module_address() is called under RCU to find the kernel module for a given address (e.g., in stack traces).
  • netfilter: Connection tracking table (net/netfilter/nf_conntrack_core.c) uses RCU for connection lookup.
  • eBPF: BPF programs attached to kernel hooks are called under RCU read-side. TASKS RCU is used to synchronize program replacement.

Debugging Notes

  • CONFIG_PROVE_RCU: Enables lockdep-like checking for RCU rules (e.g., dereferencing RCU pointer outside rcu_read_lock()).
  • RCU_LOCKDEP_WARN(condition, msg): In the kernel, emits a warning if an RCU-protected pointer is accessed without holding rcu_read_lock().
  • rcu_dereference_check(p, condition): Assert that either rcu_read_lock() is held OR some other condition is true (e.g., holding a spinlock that implies RCU protection).
  • /proc/rcu_state (via CONFIG_RCU_TRACE): Shows current grace period state for debugging long grace periods.
  • rcutorture (kernel/rcu/rcutorture.c): Kernel module that stress-tests RCU implementation. Run to verify RCU correctness after kernel modifications.
  • Long grace periods (> 1 second): Caused by a CPU that never gets a quiescent state (e.g., stuck in a long interrupt handler or kernel thread that never sleeps). Check with rcu_cpu_stall_suppress and rcu_stall_timeout.

Security Implications

  • Use-after-free via grace period violation: If free() is called before synchronize_rcu(), a reader may access freed memory. In kernel context, this is a security vulnerability.
  • CVE-2019-11477 (SACK Panic): While not directly an RCU bug, the underlying mechanism (TCP SACK processing under the socket lock) demonstrated how high-rate packet processing could interact badly with RCU grace periods, delaying reclamation and causing memory exhaustion.
  • BPF program replacement race: If a BPF program is replaced and the old program is freed before TASKS RCU grace period, a running BPF program could access freed memory. TASKS RCU is specifically used to prevent this.
  • synchronize_rcu() in atomic context: Calling synchronize_rcu() from interrupt context or with a spinlock held is illegal (it may sleep). A BUG() in debug kernels catches this.

Performance Implications

  • Read-side cost (non-preemptible kernel): Just preempt_disable() + preempt_enable() — ~2-4 cycles. Zero atomic operations. Zero cache line bouncing.
  • Read-side cost (preemptible kernel / SRCU): Counter increment/decrement — ~10-20 cycles.
  • synchronize_rcu() latency: 1-50 ms (wait for all CPUs to schedule at least once). Acceptable for infrequent updates.
  • synchronize_rcu_expedited(): 10-100 µs (IPI to all CPUs). More expensive for the system but lower latency for the caller.
  • call_rcu() throughput: Can queue thousands of callbacks per second. Grace period batching amortizes overhead.
  • Comparison to rwlock: RCU read path is 10-100x faster than read_lock() on rwlock (which does an atomic increment). This is why the kernel routing table lookup was moved to RCU.

Common Pitfalls

  1. Sleeping inside rcu_read_lock(): Only allowed for SRCU. Regular RCU is preemption-based; sleeping causes a preemption, which is a quiescent state, potentially ending the grace period while still holding the read lock — use-after-free risk.
  2. Using stale RCU pointer after rcu_read_unlock(): The read-side critical section ends; after that point, the pointer may be freed. Store needed data before unlocking.
  3. Forgetting rcu_assign_pointer: Direct assignment (p = new_val) without rcu_assign_pointer lacks the release barrier, allowing readers to see new pointer but stale data on ARM/POWER.
  4. call_rcu accumulation: If writers call call_rcu faster than grace periods complete, the callback list grows unboundedly. rcu_barrier() waits for all pending callbacks.
  5. Using address of stack variable as RCU-protected pointer: The stack may be freed during the grace period. Only use heap-allocated objects with RCU.

Real-World Failure Cases

Linux DCCP module use-after-free (CVE-2017-6074): A use-after-free in the DCCP network protocol implementation. A double-free of a socket buffer occurred during path MTU discovery. While not a pure RCU bug, it demonstrated how reclamation ordering bugs lead to privilege escalation in kernel networking code. Fixed by adding proper reference counting.

RCU stall during CPU hotplug (Linux 3.x): When a CPU was offlined, it needed to pass through a quiescent state to allow grace periods to complete. A bug caused the offline CPU to not report its quiescent state properly, hanging synchronize_rcu() calls for tens of seconds. Manifested as kernel stall warnings (rcu_sched self-detected stall). Fixed in kernel/rcu/tree_plugin.h.

Amazon EC2 RCU stall (2016): Highly oversubscribed EC2 instances with CPU steal time caused vCPUs to appear busy (from the guest kernel's perspective) for long periods, preventing quiescent state detection. RCU saw apparent stalls and emitted warnings. Linux 4.12 improved RCU's awareness of paravirt environments.

Modern Usage and Cloud-Scale

  • RCU everywhere: The Linux kernel has ~15,000 rcu_read_lock()/rcu_read_unlock() call sites. It is the most-used synchronization primitive in the kernel.
  • Cloud hypervisors: KVM/QEMU uses RCU for vCPU data structures and IOMMU page tables.
  • Container density: With hundreds of containers per host, the zero-cost RCU read path is critical for not adding synchronization overhead to per-container kernel operations.
  • io_uring RCU: io_uring uses RCU to protect the list of registered file descriptors and BPF programs, allowing lock-free access from the io_uring fast path.

Future Directions

  • Userspace RCU primitives: C++26 proposes a std::rcu-like API. Until then, liburcu provides this for C programs.
  • Persistent RCU: Extending RCU semantics to non-volatile memory (PMDK) for crash-consistent read-mostly data structures.
  • Range-based RCU: Protecting sub-trees or ranges of a data structure with RCU rather than the entire structure, allowing finer-grained concurrency.
  • TASKS TRACE RCU: A newer variant for tracing infrastructure that needs to trace even idle/nohz CPUs.

Summary Table

RCU Variant Read-side cost Reader can sleep? Grace period Use Case
Classic/Tree RCU ~2 cycles (preempt disable) No Context switch General kernel
Preemptible RCU ~10 cycles (counter) No (yields quickly) Scheduler tick PREEMPT kernel
SRCU ~15 cycles (per-CPU counter) Yes Per-SRCU counter Filesystems, VFS
TASKS RCU ~0 (no read-side API) Yes Task scheduling BPF, kprobes
liburcu-qsbr ~0 (explicit QS) No quiescent_state() Userspace

Exercises

  1. Implement a read-mostly hash table: Use liburcu (or kernel RCU in a module) to protect a hash table. Writers call rcu_assign_pointer to update buckets. Readers call rcu_dereference and are lock-free. Benchmark read throughput vs a mutex-protected hash table.

  2. Grace period visualization: Write a kernel module that logs timestamps at rcu_read_lock/unlock and synchronize_rcu start/end. Run with multiple reader threads. Visualize the grace period on a timeline.

  3. SRCU in a filesystem: Study fs/notify/fsnotify.c's use of SRCU. Trace a path where srcu_read_lock() is acquired, the reader sleeps waiting for a kernel event, and srcu_read_unlock() is called. Verify the SRCU synchronize call waits correctly.

  4. RCU stall injection: Use a kernel module to hold rcu_read_lock() for 5 seconds. Observe the RCU CPU stall warning message in dmesg. Note the stall timeout and the warning format.

  5. Compare RCU vs rwlock in routing table: Write a benchmark simulating route lookups: 99 reader threads, 1 writer thread updating every 100 ms. Measure throughput and latency with (a) rwlock, (b) SRCU. Observe the difference.

References

  • McKenney, P.E. & Slingwine, J.D. (1998). "Read-Copy Update: Using Execution History to Solve Concurrency Problems." PCOSC '98.
  • McKenney, P.E. (2004). "Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques in Operating System Kernels." PhD thesis, OGI.
  • McKenney, P.E. (ongoing). "Is Parallel Programming Hard, And If So, What Can You Do About It?" https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/
  • Linux kernel source: kernel/rcu/tree.c, kernel/rcu/srcutree.c, include/linux/rcupdate.h
  • Linux kernel documentation: Documentation/RCU/
  • liburcu: https://liburcu.org
  • Kung, H.T. & Robinson, J.T. (1981). "On Optimistic Methods for Concurrency Control." TODS 6(2):213-226.