Memory Barriers and Fences

Overview

Memory barriers (also called memory fences) are instructions that enforce ordering constraints on memory operations. They exist because modern CPUs and compilers aggressively reorder memory operations to improve performance — out-of-order execution, store buffers, load queues, and compiler instruction scheduling can all cause operations to appear to other processors in an order different from the program order. For single-threaded programs, this is transparent. For multi-threaded programs, it can cause subtle and catastrophic bugs where one thread observes incomplete or inconsistent state from another thread. Understanding memory barriers is essential for writing correct lock-free code, implementing synchronization primitives from scratch, and understanding why "obvious" optimizations in concurrent code can be wrong.

Prerequisites

Understanding of CPU pipelines and out-of-order execution
Knowledge of cache coherency (MESI protocol)
Familiarity with atomic operations (CAS, LL/SC)
Basic understanding of multi-threaded programming

Core Technical Content

Why Reordering Happens

Modern CPUs contain multiple mechanisms that cause visible reordering:

Store buffer: Writes are placed in a per-CPU store buffer before being committed to the cache hierarchy. Another CPU cannot see the write until it drains from the store buffer. Meanwhile, a later load by the writing CPU from a different address can complete from cache, so other CPUs observe that load as if it happened before the store. (A later load from the same address is satisfied from the store buffer and sees the new value.)
Load queue / forwarding: A CPU may speculate ahead and issue loads before all preceding stores have committed. On x86, a load from an address with a pending buffered store is completed from the store buffer (store-to-load forwarding), so the CPU sees its own store before any other CPU does.
Out-of-order execution: Instructions execute as their inputs become ready, not in program order. A load that misses in L1 cache stalls the instructions that depend on it, while later independent loads and stores can proceed.
Compiler: The compiler may reorder instructions, eliminate redundant loads and stores, and cache values in registers, unless told not to via volatile accesses, memory ordering annotations, or compiler barriers.

The Classic Store-Load Reordering Example

This demonstrates the one reordering that TSO (Total Store Order, the x86 memory model) allows: a store followed by a load from a different address.

Initial: x = 0, y = 0

CPU 0:                  CPU 1:
x = 1;                  y = 1;
r1 = y;                 r2 = x;

// Can we observe: r1 == 0 AND r2 == 0 ?

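Under sequential consistency: NO. All four operations fall into one global order. For r1 to be 0, CPU 0's load of y must come before CPU 1's y=1. CPU 0's x=1 comes even earlier and CPU 1's load of x comes later, so r2 must be 1. At least one CPU always sees the other's write.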

Under x86 TSO: YES. CPU 0's x=1 is in its store buffer; CPU 0 reads y=0 from cache before its store drains. Simultaneously, CPU 1's y=1 is in its store buffer; CPU 1 reads x=0. Both stores drain after both reads, yielding r1=0, r2=0.
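The store-buffer argument can be checked mechanically. The following is a minimal sketch, not a model of any real CPU: it assumes each CPU has a one-entry store buffer and explores every schedule of "issue store", "drain store", and "load" steps for the two-CPU example. All names (state_t, explore, both_zero_reachable) are hypothetical. With fenced set, a load may not run while the CPU's own store is still buffered, which is the effect a full fence has in this example.

```c
#include <assert.h>

/* Toy model of the two-CPU example with one-entry store buffers. */
typedef struct {
    int mem[2];  /* mem[0] = x, mem[1] = y (shared memory) */
    int buf[2];  /* buf[c] = 1 while CPU c's store is still buffered */
    int pc[2];   /* 0 = nothing done, 1 = store issued, 2 = load done */
    int r[2];    /* r[0] = r1, r[1] = r2 */
} state_t;

/* Returns 1 if some schedule from state s ends with r1 == 0 && r2 == 0. */
static int explore(state_t s, int fenced)
{
    if (s.pc[0] == 2 && s.pc[1] == 2)
        return s.r[0] == 0 && s.r[1] == 0;

    int found = 0;
    for (int c = 0; c < 2; c++) {
        if (s.buf[c]) {                 /* drain: the store becomes visible */
            state_t n = s;
            n.buf[c] = 0;
            n.mem[c] = 1;               /* CPU 0 writes x, CPU 1 writes y */
            found |= explore(n, fenced);
        }
        if (s.pc[c] == 0) {             /* issue the store into the buffer */
            state_t n = s;
            n.buf[c] = 1;
            n.pc[c] = 1;
            found |= explore(n, fenced);
        } else if (s.pc[c] == 1 && !(fenced && s.buf[c])) {
            state_t n = s;              /* load the other CPU's variable */
            n.r[c] = n.mem[1 - c];
            n.pc[c] = 2;
            found |= explore(n, fenced);
        }
    }
    return found;
}

/* fenced = 0: plain TSO-like behavior; fenced = 1: fence between store and load. */
int both_zero_reachable(int fenced)
{
    assert(fenced == 0 || fenced == 1);
    state_t init = { {0, 0}, {0, 0}, {0, 0}, {-1, -1} };
    return explore(init, fenced);
}
```

In this model both_zero_reachable(0) returns 1 (the outcome is reachable when loads may pass buffered stores) and both_zero_reachable(1) returns 0. Because each CPU loads the other CPU's variable, store-to-load forwarding never applies and is left out of the model.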

This result is observable on real hardware. Try it:

// Thread 0:
x = 1;
asm volatile("mfence" ::: "memory");  // with fence: prevents r1=0,r2=0
r1 = y;

// Thread 1:
y = 1;
asm volatile("mfence" ::: "memory");  // with fence
r2 = x;

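Without MFENCE, the r1 == 0, r2 == 0 result does occur on Intel hardware, though rarely: on the order of 1 in 10,000 iterations, with the exact rate depending on the CPU, core placement, and timing. With MFENCE, it never occurs.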
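For a version that does not depend on x86 inline assembly, the experiment can be sketched with C11 atomics and POSIX threads. This is an illustrative harness under stated assumptions, not a tuned reproducer: the function and variable names are hypothetical, and it starts two fresh threads per iteration, which keeps the code short but makes the unfenced reordering much harder to hit than a harness that keeps threads running and releases them simultaneously.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y;   /* shared variables from the example */
static int r1, r2;        /* per-thread results, read only after join */
static int use_fence;     /* set before the threads are created */

static void *cpu0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *cpu1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (use_fence)
        atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs the test `iterations` times; returns how often r1 == 0 && r2 == 0. */
int count_both_zero(int iterations, int fenced)
{
    int hits = 0;
    use_fence = fenced;
    for (int i = 0; i < iterations; i++) {
        pthread_t t0, t1;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        int rc0 = pthread_create(&t0, NULL, cpu0, NULL);
        int rc1 = pthread_create(&t1, NULL, cpu1, NULL);
        assert(rc0 == 0 && rc1 == 0);
        (void)rc0;
        (void)rc1;
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r1 == 0 && r2 == 0)
            hits++;
    }
    return hits;
}
```

With fenced set to 1, the C11 rules for sequentially consistent fences forbid the both-zero outcome, so count_both_zero(n, 1) returns 0 for any n. With fenced set to 0 the count may be nonzero, but how often depends heavily on the machine and on how closely the two threads overlap.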

Memory Ordering Models

Sequential Consistency (SC): All memory operations appear to execute in some single global total order consistent with each thread's program order. No reordering visible. Expensive to implement in hardware. Lamport (1979).

Total Store Order (TSO): x86 model. Loads may be reordered before prior stores to different addresses (the store buffer case above). Loads are not reordered relative to other loads. Stores are not reordered relative to other stores. Store → Load is the one allowed reordering.

Partial Store Order (PSO): An optional SPARC model (SPARC also defines TSO). In addition to TSO's relaxation, it allows store → store reordering to different addresses.

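Relaxed (Weak) Ordering: ARM, POWER, RISC-V. Any reordering of accesses to different addresses is allowed (load-load, load-store, store-load, store-store) unless fences, acquire/release operations, or dependencies between the accesses constrain it. Compilers and hardware have maximum freedom.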
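The four models can be summarized as a small lookup table. This is only a restatement of the descriptions above in code form, with hypothetical names; it ignores finer points such as dependency ordering on ARM and POWER.

```c
#include <assert.h>

/* Each flag names a pair "earlier op -> later op" that may be reordered. */
enum reorder {
    LOAD_LOAD   = 1,
    LOAD_STORE  = 2,
    STORE_LOAD  = 4,
    STORE_STORE = 8
};

enum model { MODEL_SC, MODEL_TSO, MODEL_PSO, MODEL_RELAXED, MODEL_COUNT };

/* Reorderings (to different addresses) that each model permits. */
static const unsigned allowed[MODEL_COUNT] = {
    [MODEL_SC]      = 0,
    [MODEL_TSO]     = STORE_LOAD,
    [MODEL_PSO]     = STORE_LOAD | STORE_STORE,
    [MODEL_RELAXED] = LOAD_LOAD | LOAD_STORE | STORE_LOAD | STORE_STORE,
};

/* Returns nonzero if model m permits reordering r without a fence. */
int model_allows(enum model m, enum reorder r)
{
    assert((int)m >= 0 && (int)m < MODEL_COUNT);
    return (allowed[m] & (unsigned)r) != 0;
}
```

For example, model_allows(MODEL_TSO, STORE_LOAD) is nonzero while model_allows(MODEL_TSO, STORE_STORE) is zero, which is why message passing works on x86 without hardware fences but the store-load example above does not.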

Compiler Barriers

A compiler barrier prevents the compiler from moving memory operations across the barrier:

// x86 / GCC compiler barrier:
asm volatile("" ::: "memory");

// Linux kernel macro:
barrier();  // defined as: asm volatile("" ::: "memory")

This does NOT emit any hardware instruction. It only prevents the compiler from reordering memory accesses, or keeping values cached in registers, across the barrier point. It is needed even on hardware that never reorders, because the compiler can reorder on its own.

volatile in C is NOT a substitute for a compiler barrier in multi-threaded code — it prevents compiler caching of a single variable but does not prevent reordering between volatile and non-volatile accesses.
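A minimal sketch of what a compiler barrier does and does not buy, using hypothetical names (payload, ready_flag, publish, try_consume). It uses the GCC/Clang __asm__ spelling so that it also compiles in strict C11 mode. The barrier keeps the compiler from swapping the two stores or the two loads; it does nothing about hardware reordering, and sharing plain int variables between threads this way is still a data race in C11 terms. Real code would use atomics or the kernel's READ_ONCE/WRITE_ONCE.

```c
#include <assert.h>
#include <stddef.h>

/* Compiler-only barrier: emits no instruction (GCC/Clang syntax). */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static int payload;     /* hypothetical shared data */
static int ready_flag;  /* hypothetical "data is valid" flag */

void publish(int value)
{
    payload = value;
    compiler_barrier();  /* compiler may not sink the payload store below */
    ready_flag = 1;
}

/* Returns 1 and fills *out if the flag is set, 0 otherwise. */
int try_consume(int *out)
{
    assert(out != NULL);
    if (!ready_flag)
        return 0;
    compiler_barrier();  /* compiler may not hoist the payload load above */
    *out = payload;
    return 1;
}
```

On x86 this pattern happens to be sufficient at the hardware level because TSO keeps stores in order and loads in order. On ARM or POWER the same source can fail, since the CPU itself may reorder the accesses.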

x86 Hardware Memory Fences

x86 provides three fence instructions:

MFENCE (Memory Fence): Serializes all prior loads and stores. No subsequent load or store begins until all prior loads and stores are globally visible. Full barrier.

mfence

Linux: mb() expands to asm volatile("mfence" ::: "memory") on x86.

LFENCE (Load Fence): Serializes prior loads only. All prior loads complete before any subsequent loads or instructions. On modern Intel, also a speculation barrier (prevents speculative execution past the fence).

lfence

Linux: rmb() expands to asm volatile("lfence" ::: "memory") on x86. smp_rmb() is just a compiler barrier, since x86 does not reorder loads with other loads.

SFENCE (Store Fence): Serializes prior stores. All prior stores are globally visible before any subsequent stores. Used primarily with non-temporal (streaming) stores (MOVNT*).

sfence

Linux: wmb() expands to asm volatile("sfence" ::: "memory") on x86. Note: For regular (cached) stores, smp_wmb() on x86 is just a compiler barrier, since x86 does not reorder stores with other stores. SFENCE is needed only for non-temporal stores and weakly ordered memory types such as write-combining.

LOCK prefix: Any LOCK-prefixed instruction (e.g., LOCK XADD, LOCK CMPXCHG) acts as a full memory barrier, as does XCHG with a memory operand (implicitly locked).

ARM Memory Fences

ARM has a richer set of barrier instructions because its memory model is fully relaxed:

DMB (Data Memory Barrier): Ensures all memory accesses before the barrier are observed before any accesses after it, within the specified shareability domain. Variants:
- DMB ISH: Inner Shareable domain (typically all CPUs running the same OS image)
- DMB OSH: Outer Shareable domain
- DMB SY: Full system
- DMB LD, DMB ST (and domain-qualified forms such as ISHLD, ISHST): Load-only or store-only variants

DSB (Data Synchronization Barrier): Stronger than DMB. Not only orders memory operations but also ensures completion before subsequent instructions. Used before cache maintenance operations.

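ISB (Instruction Synchronization Barrier): Flushes the pipeline so that instructions after the ISB are fetched again; it does not itself flush the instruction cache. Used after changing system registers or page tables, and as the final step after updating self-modifying code (following the required cache maintenance and a DSB). Not a memory ordering barrier in the conventional sense.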

Linux ARM64 mappings:
- mb() → dsb(sy); smp_mb() → dmb(ish)
- rmb() → dsb(ld); smp_rmb() → dmb(ishld) (load barrier)
- wmb() → dsb(st); smp_wmb() → dmb(ishst) (store barrier)
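As a rough userspace analogue of the SMP barrier mappings, the following sketch selects an instruction per architecture and falls back to C11 fences elsewhere. The function names are hypothetical, the inline assembly uses GCC/Clang syntax, and the x86 store and load fences are compiler barriers only, which assumes ordinary cached memory (no non-temporal stores).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Full barrier: orders all earlier accesses before all later ones. */
static inline void full_fence(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("mfence" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("dmb ish" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Store-store barrier: TSO already orders stores, so x86 needs no instruction. */
static inline void store_fence(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("dmb ishst" ::: "memory");
#else
    atomic_thread_fence(memory_order_release);
#endif
}

/* Load-load barrier: TSO already orders loads, so x86 needs no instruction. */
static inline void load_fence(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("" ::: "memory");
#elif defined(__aarch64__)
    __asm__ __volatile__("dmb ishld" ::: "memory");
#else
    atomic_thread_fence(memory_order_acquire);
#endif
}

/* The store-load pattern from the classic example, with a full fence. */
int store_then_load(int *slot, const int *other, int value)
{
    assert(slot != NULL && other != NULL);
    *slot = value;
    full_fence();
    return *other;
}
```

In portable code the C11 fences in the fallback branches are usually the better choice everywhere; the per-architecture branches are shown only to make the instruction mapping concrete.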

Acquire-Release Semantics

Acquire: A load with acquire semantics. All memory operations after the acquire (in program order) are not moved before it. No subsequent load or store can appear before the acquire.

Release: A store with release semantics. All memory operations before the release (in program order) are not moved after it. No prior load or store can appear after the release.

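Acquire-release pairs implement the "happens-before" relationship: a release store on one CPU synchronizes with an acquire load on another CPU that reads the stored value from the same variable. Everything written before the release is then visible after the acquire.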

Thread 0 (publisher):          Thread 1 (subscriber):
data = 42;                     while (!ready.load(acquire)) ;
ready.store(1, release);       assert(data == 42); // guaranteed

This is cheaper than a full memory fence: acquire/release only need to prevent reordering in one direction.

x86: Acquire is a plain load (no extra instruction needed due to TSO). Release is a plain store. Full fence (MFENCE) is only needed for store-load ordering.

ARM64: Acquire load = LDAR (Load-Acquire). Release store = STLR (Store-Release). These are specific instructions that encode the barrier semantics.
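A lock is the most familiar use of this pairing: taking the lock is an acquire, dropping it is a release, so everything done inside one critical section is visible to the next. The following is a minimal test-and-set spinlock sketch using C11 atomic_flag. The demo_* names are hypothetical, and a production lock would add backoff and fairness.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_flag held;  /* set while some thread owns the lock */
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Acquire: accesses in the critical section cannot move above this. */
static bool demo_trylock(demo_spinlock *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void demo_lock(demo_spinlock *l)
{
    while (!demo_trylock(l))
        ;  /* spin until the current holder releases */
}

/* Release: accesses in the critical section cannot move below this. */
static void demo_unlock(demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Neither operation needs sequential consistency: the acquire only stops later accesses from moving up, and the release only stops earlier accesses from moving down, which is exactly the one-directional ordering described above.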

C11 / C++11 Memory Model

The C11 standard (ISO C11, stdatomic.h) provides portable memory ordering:

#include <stdatomic.h>

atomic_int ready = 0;  // plain initialization; ATOMIC_VAR_INIT is deprecated in C17 and removed in C23
int data = 0;

// Publisher:
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);

// Subscriber:
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) {}
// data is guaranteed to be 42 here

Memory order values:
- memory_order_relaxed: No ordering guarantee beyond atomicity.
- memory_order_acquire: Acquire semantics (loads, and the load side of RMW operations).
- memory_order_release: Release semantics (stores, and the store side of RMW operations).
- memory_order_acq_rel: Both acquire and release (for RMW operations like CAS).
- memory_order_seq_cst: Sequentially consistent; the strongest guarantee and the default for atomic_* functions without the _explicit suffix.
- memory_order_consume: Intended for data-dependency ordering (pointer loads); current compilers treat it as acquire because tracking dependencies through optimizations proved impractical.
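Reference counting is a common case where two different orderings appear side by side. The sketch below uses hypothetical names (object_t, object_get, object_put): taking a reference only needs atomicity, so it is relaxed, while dropping one uses acq_rel so that the thread that drops the last reference sees every other thread's earlier writes to the object before tearing it down.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_int refs;  /* number of live references */
    int payload;      /* hypothetical object contents */
} object_t;

void object_init(object_t *o, int payload)
{
    assert(o != NULL);
    atomic_init(&o->refs, 1);  /* the creator holds the first reference */
    o->payload = payload;
}

/* Caller already holds a reference, so no ordering is needed here. */
void object_get(object_t *o)
{
    atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

/* Returns true if this call dropped the last reference. */
bool object_put(object_t *o)
{
    /* release: our writes to *o are published before the count drops;
     * acquire: the final dropper observes all other threads' writes. */
    int prev = atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel);
    assert(prev > 0);
    return prev == 1;
}
```

A caller would free or recycle the object when object_put returns true. Using relaxed for the decrement as well would be a bug, because the teardown could then run before another thread's last writes to the object are visible.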

Linux Kernel Memory Barriers

The Linux kernel defines a rich set of barriers in include/asm-generic/barrier.h and arch-specific headers:

mb()         // Full memory barrier (mandatory: also orders accesses seen by devices)
rmb()        // Read memory barrier
wmb()        // Write memory barrier

smp_mb()     // SMP full barrier (compiler barrier only on UP builds)
smp_rmb()    // SMP read barrier
smp_wmb()    // SMP write barrier

smp_mb__before_atomic()  // barrier before atomic op
smp_mb__after_atomic()   // barrier after atomic op

// Paired acquire/release:
smp_load_acquire(p)      // load with acquire semantics
smp_store_release(p, v)  // store with release semantics

READ_ONCE(x) and WRITE_ONCE(x, v) prevent the compiler from caching, refetching, or tearing an access to a value:

// Without READ_ONCE, compiler may cache 'flag' in a register:
while (flag) {}  // infinite loop if compiler hoists flag into register

// With READ_ONCE, every iteration re-reads flag from memory:
while (READ_ONCE(flag)) {}

In simplified form, READ_ONCE is defined as:

#define READ_ONCE(x) (*(const volatile typeof(x) *)&(x))

The volatile cast prevents compiler caching; the typeof preserves the type. This constrains the compiler for this one access only: it is not a general compiler barrier and implies no hardware ordering.
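Outside the kernel, the same idea can be sketched with the GCC/Clang __typeof__ extension. These are simplified userspace stand-ins, not the kernel's actual macros, and wait_for_flag is a hypothetical helper that bounds the spin so the example terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel macros (GCC/Clang extension). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Polls *flag up to max_spins times; returns 1 if it was seen nonzero. */
int wait_for_flag(int *flag, int max_spins)
{
    assert(flag != NULL);
    for (int i = 0; i < max_spins; i++) {
        /* The volatile access forces a fresh load on every iteration. */
        if (READ_ONCE(*flag))
            return 1;
    }
    return 0;
}
```

Without READ_ONCE, a compiler is entitled to load *flag once before the loop and never look again. The macros say nothing about ordering relative to other variables, so a flag that guards data still needs acquire/release or an explicit barrier.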

The LKMM (Linux Kernel Memory Model)

Since Linux 4.17, the kernel ships a formal memory model (tools/memory-model/) written in the cat language and checked with the herd7 tool. The LKMM defines the semantics of kernel synchronization operations in terms of relations such as happens-before and reads-from, so small concurrent code fragments (litmus tests) can be checked automatically:

tools/memory-model/linux-kernel.cat  -- the formal model
tools/memory-model/litmus-tests/     -- test cases
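For orientation, a litmus test for the classic store-load example looks roughly like the following. This is a sketch in the C-flavored litmus format that herd7 accepts, modeled on the store-buffering tests shipped in the litmus-tests directory; treat the exact text as illustrative rather than a copy of any file in the tree.

```
C SB+poonceonces

{}

P0(int *x, int *y)
{
    int r0;

    WRITE_ONCE(*x, 1);
    r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
    int r0;

    WRITE_ONCE(*y, 1);
    r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)
```

The exists clause names the outcome being asked about, here both loads returning 0. Since nothing orders each store against the following load, the model is expected to report that this outcome can happen; placing smp_mb() between the store and the load in both processes should make it forbidden.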

Historical Context

Leslie Lamport introduced sequential consistency in "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" (IEEE TC, 1979).

The recognition that hardware memory reordering was observable and dangerous spread in the 1990s as shared-memory multiprocessors became common. Sarita Adve and Kourosh Gharachorloo published the seminal "Shared Memory Consistency Models: A Tutorial" (IEEE Computer, 1996).

The C11/C++11 memory model was designed by Boehm, Adve, and others (2008 proposal, standardized 2011) to give programmers a portable way to reason about concurrent memory access without relying on architecture-specific behavior.

Production Examples

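Linux kernel spinlock: On x86 (see arch/x86/include/asm/spinlock.h and the generic queued spinlock code), lock acquisition uses a LOCK-prefixed atomic such as LOCK CMPXCHG, which has implicit full barrier semantics. Unlock needs only release ordering, which on x86 is a plain store.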
Linux RCU: Uses smp_mb() at grace period boundaries to ensure memory ordering between RCU-protected data access and reclamation.
Linux jiffies: Updated with WRITE_ONCE() to prevent compiler optimization of the update.
glibc pthread_mutex: Uses LOCK CMPXCHG (full barrier) in the fast path.
Disruptor (LMAX): High-performance financial trading ring buffer, written in Java. It publishes its sequence number with a release-ordered write (the counterpart of memory_order_release in C11 terms) to ensure consumers see all data before seeing the sequence number advance.
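The sequence-publication idea can be sketched in C11 as a single-producer, single-consumer ring buffer. This is not the Disruptor's code; it is a minimal illustration with hypothetical names (ring_t, ring_push, ring_pop) in which each side publishes its index with a release store and reads the other side's index with an acquire load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u  /* must be a power of two */

typedef struct {
    int slots[RING_SIZE];
    atomic_uint head;  /* next slot to write; advanced only by the producer */
    atomic_uint tail;  /* next slot to read; advanced only by the consumer */
} ring_t;

void ring_init(ring_t *r)
{
    assert(r != NULL);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
bool ring_push(ring_t *r, int value)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->slots[head % RING_SIZE] = value;
    /* release: the slot write is visible before the new head is */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
bool ring_pop(ring_t *r, int *out)
{
    assert(out != NULL);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail % RING_SIZE];
    /* release: the slot read completes before the slot is handed back */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Each index has exactly one writer, so no CAS is needed and each side can read its own index with relaxed ordering. The design only works for one producer and one consumer; multiple producers would need an atomic claim step.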

Debugging Notes

ThreadSanitizer: Detects data races (including those caused by missing barriers). Compile with -fsanitize=thread.
Linux LKMM litmus tests: tools/memory-model/scripts/runlitmus.sh runs formal model checks.
perf mem: Measures memory access latency, can reveal frequent cache misses from barrier-induced traffic.
litmus7 (herdtools7 suite): Runs litmus tests on real hardware to experimentally detect reorderings such as the TSO store-load case on x86.
In kernel debugging, missing smp_mb() shows up as subtle data visibility bugs that are non-reproducible on UP or lightly-loaded systems.

Security Implications

Spectre/Meltdown: Speculative execution crosses security boundaries. LFENCE on newer Intel CPUs acts as a speculation barrier, preventing the CPU from speculatively executing past it. Some Spectre v2 mitigations use it as well: the retpoline sequence places an LFENCE in its capture loop to stop speculation there.
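Spectre v1 (bounds check bypass): The CPU speculatively executes past an array bounds check, loading from an out-of-bounds address. The kernel's usual fix is array_index_nospec(), which clamps the index with a branchless, data-dependent mask rather than a fence. Where masking does not fit, barrier_nospec() (an LFENCE on x86) is used instead.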
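Data race → security bug: Races in kernel code can become exploitable. CVE-2016-5195 (Dirty COW) is the best-known example: a race in copy-on-write handling allowed writes to read-only mappings. That bug was a logic race rather than a missing barrier, but a missing barrier opens the same kind of window.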
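The branchless index clamping used against bounds check bypass can be sketched as follows. The mask computation is modeled on the generic helper in the kernel's include/linux/nospec.h, but the function names here are hypothetical. It assumes index and size do not exceed LONG_MAX and that right-shifting a negative long is an arithmetic shift, which holds for GCC and Clang. The kernel additionally hides the index from the optimizer, which this sketch omits.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Returns ~0UL when index < size and 0 otherwise, without a branch. */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    /* If index >= size, (size - 1 - index) wraps and sets the top bit,
     * so the OR is "negative", its complement is non-negative, and the
     * arithmetic shift yields 0. Otherwise the shift yields all ones. */
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds-checked load whose index is clamped even under misspeculation. */
int load_clamped(const int *array, unsigned long size, unsigned long index)
{
    assert(array != NULL);
    if (index >= size)
        return -1;  /* architectural bounds check */
    /* If the CPU speculates past the check, the mask forces index to 0. */
    index &= index_mask_nospec(index, size);
    return array[index];
}
```

Because the clamp is a data dependency rather than a branch, the CPU cannot speculate around it, and no fence is needed on the fast path.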

Performance Implications

Full barrier cost: MFENCE on x86 is one of the most expensive instructions — drains the entire store buffer and load queue. ~30-100 cycles on modern Intel.
Acquire load on x86: Free — no extra instruction.
Release store on x86: MOV to memory (no MFENCE needed for release). ~4-10 cycles.
ARM barriers: DMB ISH ~5-30 cycles depending on pending memory operations.
smp_mb() in loops: Critical hot paths in kernel code avoid smp_mb() by using acquire/release pairs which are cheaper than full barriers.

Common Pitfalls

volatile instead of atomic: Using C volatile for shared variables does not provide memory ordering — only prevents compiler caching of that single variable.
memory_order_relaxed for everything: Optimizing prematurely by using relaxed ordering everywhere, forgetting that CAS success on one thread must synchronize with other threads reading the written value.
Missing barrier between data write and flag write: Classic publish-subscribe bug. Write the data, write the flag, but without a release/barrier between them the observer may see the new flag but old data.
Assuming x86 doesn't need barriers: x86 still needs compiler barriers (READ_ONCE/WRITE_ONCE) and a full barrier (MFENCE or a LOCK-prefixed instruction) wherever store-load ordering matters. Not every algorithm needs one, but the ones that do are subtle.

Real-World Failure Cases

Alpha CPU and Linux (1990s): DEC Alpha had the weakest memory model of any commercial CPU. It could reorder even dependent loads: after loading a pointer, a load through that pointer could still return stale data from before the pointer was published. Linux's RCU implementation had to handle this explicitly by adding smp_read_barrier_depends() calls. Alpha's behavior drove much of the Linux memory barrier infrastructure.
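The pattern at stake is pointer publication: one thread fills in a structure and then publishes a pointer to it, and readers load the pointer and dereference it. A minimal C11 sketch with hypothetical names (struct config, publish_config, read_timeout) uses a release store and an acquire load, which is sufficient on every architecture including Alpha. The kernel's RCU primitives express the same idea with rcu_assign_pointer() and rcu_dereference().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int timeout_ms;
    int retries;
};

/* Shared pointer to the current configuration; NULL until published. */
static _Atomic(struct config *) current_config;

/* Publisher: the caller must have finished initializing *c. */
void publish_config(struct config *c)
{
    assert(c != NULL);
    /* release: the fields of *c become visible before the pointer does */
    atomic_store_explicit(&current_config, c, memory_order_release);
}

/* Reader: returns the published timeout, or fallback if none yet. */
int read_timeout(int fallback)
{
    /* acquire: the dereference below cannot observe pre-publication data */
    struct config *c =
        atomic_load_explicit(&current_config, memory_order_acquire);
    return c != NULL ? c->timeout_ms : fallback;
}
```

On most CPUs the address dependency from the pointer load to the field load would order the two accesses by itself. Alpha gave no such guarantee, which is why the reader side needed an explicit barrier there.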

MySQL InnoDB TSX/barrier regression: A performance optimization in InnoDB that removed some memory barriers broke correctness on non-x86 (ARM) architectures, discovered during ARM server testing in 2016. The stores to the buffer pool's modify_clock were not visible in the right order.

Go runtime memory model bug (2012): A data race in the Go runtime garbage collector caused by a missing memory barrier between the write of a GC mark bit and the read of the mark bit by the sweep phase. Fixed by adding explicit atomic operations.

Modern Usage and Cloud-Scale

ARM Graviton / Neoverse: Many AWS cloud instances run on ARM, where ordering must be requested explicitly (DMB, or the LDAR/STLR acquire/release instructions). Porting x86-only code that relied on implicit TSO ordering to ARM reveals latent barrier bugs.
RISC-V: Uses a fence instruction (FENCE predecessor, successor) with explicit specification of which operation types to order — the most explicit and pedagogically clear fence design.
C++20 std::atomic_ref: Allows applying atomic semantics to existing (non-atomic) variables, useful when interfacing with C libraries that have their own synchronization.

Future Directions

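Hardware memory model simulation: Emulators such as QEMU generally do not reproduce weak-memory reorderings (an ARM guest emulated on an x86 host observes ordering at least as strong as the host's), so letting developers test ARM memory ordering behavior on x86 development machines remains an open direction.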
Formal verification of barrier usage: The LKMM and tools like herd7 are increasingly used to automatically check that kernel synchronization code correctly uses barriers.
ISA simplification: RISC-V's explicit fence design is likely to influence future ISA design — being explicit about what is being ordered reduces surprising behavior.

Summary Table

Operation	x86 instruction	ARM instruction	Closest C11 equivalent	Cost
Full barrier	MFENCE	DMB ISH	atomic_thread_fence(memory_order_seq_cst)	High
Load barrier	LFENCE	DMB ISHLD	atomic_thread_fence(memory_order_acquire)	Low (x86)
Store barrier	SFENCE	DMB ISHST	atomic_thread_fence(memory_order_release)	Low (x86)
Acquire load	MOV (plain)	LDAR	atomic_load_explicit(..., memory_order_acquire)	~free (x86)
Release store	MOV (plain)	STLR	atomic_store_explicit(..., memory_order_release)	Low
Compiler only	asm volatile("" ::: "memory")	asm volatile("" ::: "memory")	atomic_signal_fence(memory_order_seq_cst)	Zero hardware

Exercises

Reproduce TSO store-load reordering: Write a C program with two threads as in the classic example. Use __atomic_store_n(relaxed) and __atomic_load_n(relaxed). Count how many iterations produce the r1==0, r2==0 result. Then add a full fence (__atomic_thread_fence(__ATOMIC_SEQ_CST), the userspace counterpart of the kernel's smp_mb()) between the store and load in each thread and verify the result disappears.
Acquire-release correctness proof: Write a publish-subscribe pattern where publisher writes 16 bytes of data and then does a release store of a sequence number. Subscriber spins on acquire load of sequence number then reads data. Use TSan to verify no races. Then intentionally break it (use relaxed) and observe TSan finding the race.
Compile to assembly: Write three small C11 programs: atomic_store(relaxed), atomic_store(release), atomic_store(seq_cst). Compile with -O2 -S for x86 and ARM64 (cross-compile). Observe which assembly instructions are generated for each.
LKMM litmus test: Download the Linux source, navigate to tools/memory-model/litmus-tests/. Run the MP+poonceonces.litmus test (message passing pattern). Understand the output. Write your own litmus test for the classic store-load reordering.
Kernel module barrier bug: Write a kernel module with a deliberately missing smp_mb() in a producer-consumer path. Use CONFIG_KCSAN (Kernel Concurrency Sanitizer) to detect it. Then add the barrier and verify the report disappears.

References

Adve, S.V. & Gharachorloo, K. (1996). "Shared Memory Consistency Models: A Tutorial." IEEE Computer 29(12):66-76.
Lamport, L. (1979). "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs." IEEE TC 28(9):690-691.
Boehm, H. & Adve, S.V. (2008). "Foundations of the C++ Concurrency Memory Model." PLDI '08.
McKenney, P.E. (2017). "Is Parallel Programming Hard, And If So, What Can You Do About It?" https://kernel.org/pub/linux/kernel/people/paulmck/perfbook/
Linux kernel source: include/asm-generic/barrier.h, tools/memory-model/
ARM Architecture Reference Manual (ARM DDI 0487), §B2 (Memory Model)
Intel 64 Architecture SDM, Volume 3A §8.2 (Memory Ordering)
Preshing, J. (2012). "Memory Reordering Caught in the Act." https://preshing.com/20120515/