Atomics and Compare-and-Swap

Overview

Atomic operations are indivisible memory operations that appear to complete instantaneously from the perspective of all other processors. They are the hardware foundation upon which every synchronization primitive — locks, queues, reference counters, and lock-free data structures — is built. The compare-and-swap (CAS) operation is the most powerful and widely used atomic primitive, capable of conditionally updating a memory location in a single atomic step and serving as the building block for everything from simple mutexes to complex lock-free algorithms. However, atomics come with subtle pitfalls — the ABA problem being the most notorious — and their performance characteristics vary dramatically depending on contention level and hardware generation. Understanding atomics at the hardware level is essential for systems programmers, compiler writers, and anyone implementing concurrent data structures.

Prerequisites

Understanding of cache coherency (MESI protocol)
Knowledge of memory ordering models (TSO, relaxed)
Familiarity with basic lock implementations
Comfort reading assembly language (x86, ARM64)

Core Technical Content

Atomic Operation Definition

An operation on a memory location is atomic if: 1. It completes entirely or not at all (no visible partial state). 2. No other processor can observe the memory location in an intermediate state. 3. From all other processors' perspectives, it appears to happen at a single point in time.

On single-core CPUs, any instruction that fits in a bus transaction is naturally atomic. On multicore CPUs, atomicity requires the cache coherency protocol to serialize access — typically by granting the cache line exclusive ownership before the operation.

Hardware Atomics: x86

x86's LOCK prefix turns any read-modify-write instruction into an atomic operation by asserting the bus lock (older chips) or gaining exclusive cache line ownership (newer chips with Split Lock):

; Test-and-set: write 1, return old value
lock xchg [mutex], eax      ; atomic exchange

; Fetch-and-add: return old, add delta
lock xadd [counter], eax    ; eax = old value, [counter] += eax

; Compare-and-swap: if [ptr]==eax, write ecx
lock cmpxchg [ptr], ecx     ; result in eax (old value)

; Atomic increment
lock inc [counter]

; Atomic OR/AND/XOR
lock or [flags], eax

The XCHG instruction is always atomic on x86 (LOCK prefix is implicit for XCHG with memory operand). CMPXCHG requires explicit LOCK prefix.

LOCK CMPXCHG on x86-64:

; bool CAS(int *ptr, int expected, int new_val)
; Input: rdi=ptr, esi=expected (in eax), edx=new_val
movl    %esi, %eax         ; eax = expected
lock cmpxchgl %edx, (%rdi) ; if [rdi]==eax, [rdi]=edx, eax=old
setz    %al                ; set return value to ZF

Hardware Atomics: ARM64

ARM64 uses the Load-Exclusive / Store-Exclusive (LL/SC) idiom for read-modify-write:

// Atomic increment (ARM64)
retry:
    ldxr    w0, [x1]         // Load-Exclusive: load, mark reservation
    add     w0, w0, #1       // increment
    stxr    w2, w0, [x1]     // Store-Exclusive: store only if reservation still valid
    cbnz    w2, retry        // if store failed (another CPU wrote), retry

LDXR/STXR gives STXR an exclusive "reservation" on the cache line. If any other CPU writes to the cache line between LDXR and STXR, the STXR fails (writes 1 to the status register) and the loop retries. This is an optimistic approach that avoids bus locking.

ARMv8.1-A added CAS (compare-and-swap) as a direct instruction:

cas     w0, w1, [x2]    // if [x2]==w0, [x2]=w1, w0=old value
casa    ...             // acquire variant
casl    ...             // release variant
casal   ...             // acquire+release variant

ARMv8.1 CAS is often faster than LL/SC loops under high contention because it is a single instruction without retry overhead.

RISC-V Atomics

RISC-V provides both LR/SC (analogous to ARM64 LL/SC) and AMO (Atomic Memory Operation) instructions:

// AMO: Atomic fetch-and-add with acquire semantics
amoadd.w.aq rd, rs2, (rs1)   // rd = old *rs1; *rs1 += rs2 (acquire)

// Compare-and-swap (RISC-V uses LR/SC)
cas:
    lr.w.aq t0, (a0)          // Load-Reserved
    bne     t0, a1, fail      // if *a0 != expected, fail
    sc.w.rl t0, a2, (a0)      // Store-Conditional (release)
    bnez    t0, cas           // retry if failed
    li      a0, 1             // success
    ret
fail:
    li      a0, 0
    ret

Compare-and-Swap in Detail

CAS is the most fundamental building block for lock-free algorithms:

// C11 atomic CAS (strong):
bool atomic_compare_exchange_strong_explicit(
    atomic_int *obj,
    int *expected,
    int desired,
    memory_order success,
    memory_order failure
);

// Returns true if swap occurred, false if *obj != *expected
// On failure, *expected is updated with the current value of *obj

Strong vs Weak CAS: - atomic_compare_exchange_strong: Guaranteed to succeed if *obj == *expected. Never spuriously fails. - atomic_compare_exchange_weak: May spuriously fail (return false) even when *obj == *expected, due to LL/SC reservation loss on ARM. Cheaper in a retry loop.

Pattern for CAS retry loop:

int old = atomic_load(&counter);
while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
    // old is updated by CAS on failure; no manual reload needed
}

The ABA Problem

ABA is the fundamental correctness hazard of CAS-based lock-free algorithms:

Initial: stack top -> A -> B -> C

Thread 0: reads top=A (about to CAS A -> B to pop A)
Thread 0: gets preempted

Thread 1: pops A (top -> B -> C)
Thread 1: pops B (top -> C)
Thread 1: pushes A back (top -> A -> C)

Thread 0 resumes: CAS(top, A, B) SUCCEEDS
          because top == A (even though the list changed!)
Result: top -> B, but B has been freed!
        C is now orphaned!

CAS cannot distinguish between "this is the original A" and "this is a new A that happens to have the same address." The A value was A, then B (not-A), then A again — ABA — and CAS saw only A→A and concluded nothing changed.

ABA Solutions

1. Tagged Pointers (Version Counter)

Append a monotonically incrementing version/stamp to the pointer. CAS operates on the combined (pointer, version) pair:

typedef struct {
    void *ptr;
    uintptr_t tag;
} tagged_ptr_t;

// Using 128-bit CAS (x86-64 CMPXCHG16B):
typedef __int128 tagged_ptr128_t;

bool cas_tagged(tagged_ptr128_t *loc,
                tagged_ptr128_t old,
                tagged_ptr128_t new) {
    return __sync_bool_compare_and_swap(loc, old, new);
}

// Push: new.tag = old.tag + 1

x86-64 CMPXCHG16B provides 128-bit CAS. On ARM64, use two 64-bit registers and CASP (Compare-and-Swap Pair) from ARMv8.1.

On 64-bit platforms, the high bits of a pointer are unused (virtual addresses are 48-bit on x86-64). Tag the pointer in the high bits:

#define TAG_SHIFT  48
#define PTR_MASK   ((1ULL << TAG_SHIFT) - 1)
#define TAG_MASK   (~PTR_MASK)

typedef uint64_t tagged_ptr_t;

static inline tagged_ptr_t make_tagged(void *ptr, uint16_t tag) {
    return (tagged_ptr_t)ptr | ((uint64_t)tag << TAG_SHIFT);
}

2. Hazard Pointers

Each thread maintains a small set of "hazard pointers" — pointers to objects it is currently accessing. An object cannot be freed if any hazard pointer points to it:

// Thread wants to dereference a shared pointer:
do {
    ptr = atomic_load(&shared_head);
    hazard_ptr[0] = ptr;         // publish hazard pointer
    // Verify ptr is still valid:
} while (ptr != atomic_load(&shared_head));

// Use ptr safely
// ...

hazard_ptr[0] = NULL;  // clear hazard pointer

// Reclaimer checks all hazard pointers before freeing

Used in: Java's java.util.concurrent.ConcurrentLinkedQueue, Folly (Facebook), some PostgreSQL lock-free paths.

3. Epoch-Based Reclamation (EBR)

Divide time into epochs. Readers announce their epoch on entry. Memory freed in epoch E can only be truly reclaimed when all readers have observed epoch > E:

atomic_int global_epoch = 0;
__thread int local_epoch;
__thread bool in_critical = false;

void rcu_read_lock() {
    local_epoch = atomic_load(&global_epoch);
    in_critical = true;
}

void rcu_read_unlock() {
    in_critical = false;
}

// Garbage collector: advance epoch and defer freeing
// Object is safe to free when all threads have epoch > retirement_epoch

This is the mechanism underlying Linux RCU (see 10-rcu.md).

Linux Kernel `atomic_t`

include/linux/atomic.h, arch/x86/include/asm/atomic.h:

typedef struct { int counter; } atomic_t;
typedef struct { long counter; } atomic64_t;

// Operations:
atomic_set(&v, i);                  // v = i (not atomic!)
atomic_read(&v);                    // read (barrier on ARM)
atomic_add(i, &v);                  // v += i
atomic_sub(i, &v);                  // v -= i
atomic_inc(&v);                     // v++
atomic_dec(&v);                     // v--
atomic_inc_return(&v);              // return v++ (new value)
atomic_dec_and_test(&v);            // return (--v == 0)
atomic_add_negative(i, &v);        // return ((v += i) < 0)
atomic_cmpxchg(&v, old, new);      // CAS, returns old value
atomic_xchg(&v, new);              // exchange, returns old

For per-CPU reference counting (e.g., page frame counts), the kernel uses atomic_t with explicit smp_mb() around state transitions.

C11 / C++11 Atomics

#include <stdatomic.h>

// Atomic integer
atomic_int x = ATOMIC_VAR_INIT(0);

// Operations with explicit memory ordering:
int old = atomic_load_explicit(&x, memory_order_acquire);
atomic_store_explicit(&x, 42, memory_order_release);
int prev = atomic_fetch_add_explicit(&x, 1, memory_order_acq_rel);
bool ok = atomic_compare_exchange_strong_explicit(
    &x, &old, 42,
    memory_order_acq_rel,   // success ordering
    memory_order_acquire    // failure ordering
);

Rules for choosing memory ordering: - relaxed: Just atomicity, no ordering. For counters where exact order doesn't matter (statistics). - acquire (load): Prevents subsequent memory ops from moving before this load. - release (store): Prevents prior memory ops from moving after this store. - acq_rel (RMW): Both acquire on the load part and release on the store part. - seq_cst: Full sequential consistency — use as default, optimize later.

Lock-Free Reference Counting with atomics

A common pattern: safe reference counting without locks:

typedef struct {
    atomic_int refcount;
    // ... data ...
} ref_counted_t;

void ref_get(ref_counted_t *obj) {
    atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
    // Relaxed: incrementing a refcount held by current code doesn't need barriers
}

void ref_put(ref_counted_t *obj) {
    // Release: all prior accesses must be visible before decrement
    if (atomic_fetch_sub_explicit(&obj->refcount, 1, memory_order_release) == 1) {
        // Acquire: ensure we see all prior releases before freeing
        atomic_thread_fence(memory_order_acquire);
        free(obj);
    }
}

This is the Boost shared_ptr/std::shared_ptr model. The release on decrement ensures all modifications are visible; the acquire fence before free ensures the last decrementer sees all of them.

Linux kernel: kref_get(), kref_put() in include/linux/kref.h implement the same pattern using atomic_inc() and atomic_dec_and_test().

Historical Context

Test-and-set appeared in the IBM System/360 (1964) as the TS instruction, the first commercial hardware atomic operation.

Compare-and-swap was introduced in the IBM System/370 (1970) as the CS instruction. Maurice Herlihy proved in "Wait-Free Synchronization" (1991) that CAS has consensus number infinity — it can solve wait-free consensus for any number of processes, making it more powerful than test-and-set (consensus number 2).

The ABA problem was first noted by IBM engineers working on the System/370 lock-free stack in the 1970s and formally described by Michael and Scott in their 1996 lock-free queue paper.

Maged Michael introduced hazard pointers in "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects" (IEEE TPDS, 2004).

Production Examples

Linux kernel: atomic_t and atomic64_t are used in thousands of places — page frame reference counts (page->_refcount), network socket counters, module reference counts.
Java java.util.concurrent: AtomicInteger, AtomicLong, AtomicReference use Unsafe.compareAndSwapInt(), which maps to LOCK CMPXCHG on x86.
Go sync/atomic: atomic.AddInt64(), atomic.CompareAndSwapPointer() — used extensively in the Go runtime for goroutine scheduling.
Redis: Uses atomic operations for its event loop counters and replication offset tracking.
PostgreSQL: Atomic operations (pg_atomic_fetch_add_u32() in src/include/port/atomics.h) for buffer pin counts and lightweight lock spin waiting.
DPDK: Atomics for packet ring buffer head/tail pointers, with relaxed ordering for high-throughput paths.

Debugging Notes

ThreadSanitizer: Detects non-atomic access to shared variables and incorrect memory ordering.
KCSAN (Kernel Concurrency Sanitizer, CONFIG_KCSAN): Detects data races in the kernel by instrumenting all memory accesses.
-fsanitize=address,thread: Compile-time sanitizers for user-space code.
Relacy: A systematic correctness checker for lock-free algorithms (by Dmitry Vyukov). Exhaustively explores all thread interleavings.
CDSChecker: Another model checker for C11 atomics programs.
perf stat -e bus-cycles: High bus cycles may indicate heavy atomic contention and cache line bouncing.

Security Implications

ABA leading to use-after-free: As demonstrated, ABA can cause a freed object to be reused, leading to heap corruption. In kernel context, this can lead to privilege escalation.
Integer overflow in atomics: An atomic counter that overflows (wraps to negative or back to initial value) can lead to reference count underflow and premature object freeing. Use atomic_inc_not_zero() for safety in reference counting.
CVE-2016-8655 (packet_set_ring): A race condition in the Linux AF_PACKET socket code. A missing lock between two operations allowed a use-after-free. The fix added an atomic check. Local privilege escalation to root.
CAS in security-critical paths: Ensure that CAS failures are handled: a failed CAS means the assumption was wrong and the operation should not proceed silently.

Performance Implications

Uncontended atomic: Same as a regular memory access plus a fence (~4-10 cycles).
Lightly contended CAS (1-4 threads): ~10-50 cycles including retry overhead.
Heavily contended CAS (16+ threads, same cache line): Can degrade to thousands of cycles as CPUs retry in a serialized queue. The "atomic transaction" is still fast per operation, but only one succeeds per try.
Fetch-and-add vs CAS: fetch_and_add is cheaper than CAS for simple increment because it never fails (no compare phase, no retry). Use fetch_and_add for counters, CAS for conditional updates.
Relaxed ordering: memory_order_relaxed operations are cheaper than seq_cst on ARM (no fence instruction). On x86, the cost is similar (TSO makes loads/stores naturally ordered).

Common Pitfalls

CAS loop without limit: A CAS loop under heavy contention can retry indefinitely (livelock). Add backoff (exponential or cpu_relax()).
Double-word CAS alignment: CMPXCHG16B requires 16-byte alignment. Misaligned use causes #GP fault.
Relaxed CAS on AArch64: atomic_compare_exchange_weak with memory_order_relaxed on ARM may not include the necessary fences for proper lock implementation. Use at least acq_rel for CAS used in locks.
Non-atomic read of atomic variable: Reading an atomic_int without atomic_load() (e.g., dereferencing directly in debug code) is undefined behavior in C11 and may produce torn reads on 64-bit values on 32-bit platforms.
Store-then-CAS vs CAS-then-read: The order of operations in a CAS loop matters. Reading the current value inside the CAS call (via the expected pointer update on failure) is safer than re-loading manually.

Real-World Failure Cases

FreeBSD uma_zone ABA (2010): A lock-free slab allocator in FreeBSD's UMA suffered an ABA problem in the CAS-based freelist. An object freed and reallocated between a read and a CAS caused the freelist to be corrupted. Fixed by using tagged pointers.

Linux x86 CMPXCHG8B and early Pentiums (F00F bug, 1997): On early Pentium processors, executing LOCK CMPXCHG8B on a non-quadword-aligned address caused a deadlock in the processor's internal bus logic. This is the infamous F00F (0xF0 0x0F 0xC7 0xC8) bug. The processor would hang until reset. Linux worked around this by patching the instruction to a dummy lock.

GCC memory_order_consume (C++11): The consume ordering was intended to be implemented as a data-dependency barrier (cheaper than acquire on POWER/ARM). However, compiler implementations could not correctly track dependency chains, causing compilers to treat it as acquire (more expensive). As of C++17, it is essentially deprecated in practice.

Modern Usage and Cloud-Scale

ARM cloud instances: The shift to ARM (Graviton, Neoverse) has required careful attention to acquire/release ordering in code that worked implicitly on x86 TSO.
lock-free epoch managers: Systems like Jemalloc and TCMalloc use epochs and per-thread atomic counters to safely reclaim memory.
Persistent memory (PMDK): Intel Optane DCPMM requires atomic 8-byte stores for crash-consistent updates; larger updates need undo/redo logging since atomic 64-byte writes are not available.
GPU atomics: CUDA and ROCm provide atomic operations on GPU global memory, but with much higher latency than CPU atomics (~100-1000x). GPU algorithms avoid atomics where possible.

Future Directions

Hardware CAS with backoff: Future ISAs may provide a "backoff hint" to the CPU's atomic retry hardware, reducing power consumption during contention.
MESI protocol evolution: Emerging interconnects (CXL) extend cache coherency across CPU-to-accelerator links, enabling atomics across heterogeneous compute.
Rseq (Restartable Sequences): Linux rseq(2) (v4.18) provides a per-CPU critical section that is restarted if preempted, enabling fast atomic per-CPU operations without disabling interrupts.

Summary Table

Operation	x86 instruction	ARM64 instruction	C11 function	Consensus #
Test-and-Set	LOCK XCHG	LDXR+STXR or SWP	atomic_exchange	2
Fetch-and-Add	LOCK XADD	LDADD (v8.1) or LR/SC	atomic_fetch_add	∞
Compare-and-Swap	LOCK CMPXCHG	CAS (v8.1) or LR/SC	atomic_compare_exchange	∞
Load	MOV	LDR/LDAR	atomic_load	N/A
Store	MOV	STR/STLR	atomic_store	N/A

Exercises

ABA demonstration: Implement a lock-free Treiber stack. Write a test that demonstrates the ABA problem by timing three threads: two that pop/push a node, and one that does a CAS pop that succeeds incorrectly. Fix using tagged pointers.
Performance comparison: Benchmark atomic_fetch_add vs a mutex-protected increment at 1, 2, 4, 8, 16, 32 threads. Plot throughput. Find the crossover point where mutex outperforms atomic due to contention.
C11 memory ordering: Implement a message-passing pattern in C11: thread A writes a buffer then does atomic_store(ready, 1, release). Thread B spins on atomic_load(ready, acquire) then reads the buffer. Verify (with TSan) that it is race-free. Break it by using relaxed and observe TSan detecting the race on ARM.
Hazard pointer implementation: Implement a hazard pointer system. Write a lock-free stack using it. Verify that no object is freed while still pointed to by a hazard pointer.
Linux kernel kref study: Read include/linux/kref.h and lib/kref.c. Trace through what happens when kref_put() is called by the last reference holder. Identify the memory barriers and explain why they are needed.

References

Herlihy, M. (1991). "Wait-Free Synchronization." TOCS 9(1):124-149.
Michael, M.M. (2004). "Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects." IEEE TPDS 15(6):491-504.
Michael, M.M. & Scott, M.L. (1996). "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms." PODC '96.
Herlihy, M. & Shavit, N. (2008). The Art of Multiprocessor Programming, Chapters 5-7.
Linux kernel source: include/linux/atomic.h, arch/x86/include/asm/atomic.h, include/linux/kref.h
Intel SDM Volume 2: CMPXCHG, LOCK prefix description
ARM Architecture Reference Manual (ARM DDI 0487), §B2.9 (Atomic instructions)
Vyukov, D. Relacy: http://1024cores.net/home/relacy-race-detector