ARM Architecture: From Embedded to Apple Silicon

Technical Overview

ARM (originally Acorn RISC Machine, later Advanced RISC Machines) has evolved from a niche British microprocessor designed for Acorn's successor to the BBC Micro into the dominant processor architecture for mobile devices and, since Apple's 2020 M1, a formidable competitor in laptop and server markets. The AArch64 ISA, introduced with ARMv8-A, is a clean 64-bit design that sits alongside the 32-bit AArch32 state and retains a fixed-width instruction encoding, a load-store architecture, and a weak memory model. Modern ARM implementations (Apple M2, ARM Cortex-X3, Qualcomm Oryon) are among the most sophisticated microarchitectures ever designed, rivaling and in many cases surpassing Intel's Golden Cove in performance-per-watt.

Prerequisites

Basic assembly language concepts (registers, instructions, addressing modes)
Understanding of load-store architectures vs register-memory architectures
Familiarity with privilege levels and OS design concepts
Knowledge of memory ordering and barriers (see 07-memory-coherence-protocols.md)
Understanding of CPU pipeline concepts (see 01-cpu-pipeline-deep-dive.md)

Core Content

ARM History

1983: Acorn Computers (Cambridge, UK) begins designing the ARM1 processor for the successor to the BBC Micro; first silicon arrives in April 1985. Lead designers: Sophie Wilson (ISA) and Steve Furber (chip). The key design philosophy: simplicity and efficiency over raw performance.

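1987: ARM2 in the Acorn Archimedes, the first personal computer with an ARM processor. 8 MHz, roughly 4 MIPS from only about 30,000 transistors, which kept power consumption very low. The Intel 286 of the same era delivered about 2.7 MIPS with about 134,000 transistors, so ARM was already competitive on performance-per-watt.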

1990: ARM Ltd formed as a joint venture between Acorn, Apple, and VLSI Technology. The business model: design IP, license to chip makers. ARM never manufactured chips itself.

1994: ARM7TDMI, the breakthrough embedded design. 3-stage pipeline, Thumb 16-bit compressed instruction set (T), JTAG debug (D), fast multiplier (M), and EmbeddedICE on-chip debug logic (I). Used in: Nokia phones (late 1990s), Game Boy Advance (16.78 MHz ARM7TDMI), original iPod.

2005: ARM Cortex-A8, the first Cortex-A application processor. 13-stage dual-issue in-order pipeline, VFPv3 floating point, NEON SIMD. Used in iPhone 3GS (Samsung S5PC100), Texas Instruments OMAP3. (The original iPhone's Samsung S5L8900 used the older ARM11 core.)

2011: ARMv8-A announced, the debut of the 64-bit AArch64 ISA.

2012: ARM Cortex-A15 ships. 15-stage pipeline, OoO, ARMv7-A. Used in Samsung Exynos 5 Dual (Nexus 10).

2013: Apple A7 (iPhone 5s) — first consumer 64-bit ARM processor. Apple's custom "Cyclone" microarchitecture. Apple began full custom ARM design (licensed ISA, custom microarchitecture) years before announcing Apple Silicon.

2020: Apple M1 — Apple Silicon revolution. Custom ARM microarchitecture (Firestorm + Icestorm cores), 5nm TSMC, unified memory, Neural Engine. Demonstrated ARM performance exceeding Intel Core i9 on single-threaded workloads.

2022: Apple M2, followed in 2023 by M2 Pro/Max/Ultra. Second generation, on TSMC's enhanced 5nm process (N5P). M2 Ultra: up to 192 GB unified memory.

2022: ARM Cortex-X3, ARM's reference high-performance core for Android. Used in Qualcomm Snapdragon 8 Gen 2 and MediaTek Dimensity 9200.

2023: Apple M3, M3 Pro/Max on TSMC 3nm. Hardware-accelerated ray tracing.

2024: Qualcomm Oryon — Custom ARM microarchitecture (derived from Nuvia acquisition). Used in Snapdragon X Elite for Windows laptops. Targets direct Apple M3 competition.

AArch64 ISA (ARMv8-A and later)

AArch64 is a clean 64-bit ISA with strong RISC principles:

General-Purpose Registers:

X0-X30:  64-bit general purpose registers (31 registers)
         (Note: NOT X31 — that encodes as SP or XZR depending on context)

XZR:     Zero register (reads as 0, writes discarded) — encoded as register 31
         when SP is not expected (ALU/logic instructions)
SP:      Stack Pointer — encoded as register 31 in instructions that expect SP
         (load/store with stack base addressing)

PC:      Program Counter — NOT directly accessible as a named register
         (unlike x86 which has RIP and allows RIP-relative LEA)
         Access via: ADR/ADRP for PC-relative addresses, BL (saves PC in LR)

LR (X30): Link Register. BL and BLR write the return address here in hardware,
          and RET branches to X30 by default; saving and restoring it is up to software

W0-W30:  32-bit views of X0-X30 (lower 32 bits)
         Writing W-register zero-extends to 64 bits (same as x86-64)
WZR:     32-bit zero register

FP (X29): Frame Pointer — software convention (ABI requirement)

Key ISA properties:
- Fixed 32-bit instruction encoding: Every instruction is exactly 4 bytes. This simplifies fetch, decode, and branch target calculation, and eliminates the instruction-length pre-decode that x86 requires.
- Load-store architecture: Memory is accessed ONLY via load (LDR) and store (STR) instructions. ALU operations work exclusively on registers. The x86 form ADD [mem], reg has no AArch64 equivalent.
- Condition codes: AArch64 has NZCV flags (Negative, Zero, Carry, oVerflow), but the conditional execution of most instructions found in AArch32 was removed. What remains is conditional branches plus a small set of conditional data-processing instructions: CSEL, CSINC, CSINV, CSNEG (with aliases such as CSET), and the conditional compares CCMP/CCMN.
- No segment registers: Clean virtual address space, no segmentation complexity.
- PC not directly writable: The PC is not a general-purpose register, so jump-to-address requires an indirect branch (BR Xn) rather than MOV PC, Xn as in AArch32.
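To make the fixed 32-bit encoding concrete, here is a minimal Python sketch that assembles one instruction form by hand: the 64-bit shifted-register ADD with no shift applied. The helper name encode_add_shifted_reg is hypothetical and exists only for illustration; the field layout (Rm in bits 20:16, Rn in bits 9:5, Rd in bits 4:0) follows the A64 encoding for this form, where register number 31 means XZR.

```python
def encode_add_shifted_reg(rd: int, rn: int, rm: int) -> int:
    """Encode 64-bit `ADD Xd, Xn, Xm` (shifted-register form, LSL #0)."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 31:
            raise ValueError("register fields are 5 bits wide")
    base = 0x8B000000  # sf=1 (64-bit), op=0 (ADD), S=0, fixed opcode bits
    return base | (rm << 16) | (rn << 5) | rd


word = encode_add_shifted_reg(0, 1, 2)       # ADD X0, X1, X2
print(f"{word:08X}")                         # the 32-bit instruction word
print(word.to_bytes(4, "little").hex())      # byte order as stored in memory
```

Every instruction occupies exactly one such word, which is why a decoder can locate instruction boundaries without examining the bytes first.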

Instruction set samples:

// Data movement
LDR X0, [X1]          // load 8 bytes from [X1] into X0
LDR X0, [X1, #8]      // load from [X1+8]
LDR X0, [X1, X2]      // load from [X1+X2]
LDR X0, [X1, #8]!     // pre-index: X1 = X1+8; load from [X1]
LDR X0, [X1], #8      // post-index: load from [X1]; X1 = X1+8
LDXR X0, [X1]         // load exclusive (for LL/SC atomics)
STXR W0, X2, [X1]     // store exclusive (W0=0 success, 1 fail)
STP X0, X1, [SP, #-16]!  // store pair (X0 and X1) + update SP

// Arithmetic
ADD X0, X1, X2        // X0 = X1 + X2
ADD X0, X1, #100      // X0 = X1 + 100 (12-bit immediate)
ADDS X0, X1, X2       // same + set NZCV flags (S suffix)
// SUB, SUBS, MUL, SDIV, UDIV: similar

// MADD (Multiply-Add): X0 = X1*X2 + X3
MADD X0, X1, X2, X3

// Shift
LSL X0, X1, #3        // X0 = X1 << 3
LSR X0, X1, X2        // X0 = X1 >> X2 (logical, unsigned)
ASR X0, X1, #4        // X0 = X1 >> 4 (arithmetic, signed)

// Bit manipulation
// BFM (Bit Field Move), UBFM, SBFM: extract/insert bit fields
// RBIT: reverse bits; REV: reverse bytes; CLZ: count leading zeros; CLS: count leading sign bits

// Branches
B label               // unconditional branch (PC-relative ±128MB)
BL label              // branch + link (X30 = PC+4, then jump)
BR X0                 // branch to address in X0
BLR X0                // branch + link to register
RET                   // return: branch to X30
RET X2                // return to address in X2

// Conditional
CBZ X0, label         // branch if X0 == 0
CBNZ X0, label        // branch if X0 != 0
TBZ X0, #3, label     // branch if bit 3 of X0 == 0
B.EQ, B.NE, B.LT, B.GE, etc.  // conditional branches on NZCV

// CSEL (conditional select — no conditional execution for ALU ops)
CSEL X0, X1, X2, EQ  // X0 = (Z==1) ? X1 : X2  (branch-free conditional)
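The three immediate addressing modes in the data-movement samples differ only in when the base register is updated. The Python sketch below models that difference; the function ldr, the dictionary-based register file, and the word-addressed memory are all illustrative assumptions, not a real emulator API.

```python
def ldr(mem, regs, rt, rn, imm=0, mode="offset"):
    """Model LDR Xt, [Xn, #imm] in its three immediate addressing modes."""
    base = regs[rn]
    if mode == "offset":      # LDR Xt, [Xn, #imm]   base register unchanged
        addr = base + imm
    elif mode == "pre":       # LDR Xt, [Xn, #imm]!  update base, then load
        addr = base + imm
        regs[rn] = addr
    elif mode == "post":      # LDR Xt, [Xn], #imm   load, then update base
        addr = base
        regs[rn] = base + imm
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    regs[rt] = mem[addr]


mem = {0x1000: 11, 0x1008: 22, 0x1010: 33}   # address -> 64-bit value
regs = {"X0": 0, "X1": 0x1000}
ldr(mem, regs, "X0", "X1", 8)                # loads from 0x1008, X1 stays 0x1000
ldr(mem, regs, "X0", "X1", 8, "pre")         # X1 becomes 0x1008, loads from there
ldr(mem, regs, "X0", "X1", 8, "post")        # loads from 0x1008, X1 becomes 0x1010
print(regs)
```

Post-indexing is what makes a pointer-walking loop a single instruction per element; pre-indexing with a negative offset is the usual way to push onto the stack, as in the STP sample.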

ARM Exception Levels

AArch64 defines 4 Exception Levels (ELs), providing progressively higher privilege:

     ┌─────────────────────────────────────────────────────┐
EL3  │ Secure Monitor (TrustZone EL3)                      │
     │ - Highest privilege                                  │
     │ - Manages transition between Secure and Normal world │
     │ - Runs Trusted Firmware-A (ARM-TF)                   │
     │ - SMC instruction: EL1/2 calls into EL3              │
     └─────────────────────────────────────────────────────┘
             ↕ (SMC)
     ┌───────────────────────────────┐ ┌───────────────────┐
EL2  │ Hypervisor (Normal World)     │ │ Secure EL2 (added │
     │ - KVM, Xen, Type-1 hypervisor │ │  in ARMv8.4-A)    │
     │ - Stage-2 address translation │ │                   │
     │ - VM-enter/exit via ERET/HVC  │ │                   │
     └───────────────────────────────┘ └───────────────────┘
             ↕ (HVC)
     ┌───────────────────────────────┐ ┌───────────────────┐
EL1  │ OS Kernel (Normal World)      │ │ Secure EL1        │
     │ - Linux, Windows, macOS kernel│ │ - Trusted OS      │
     │ - Page table setup            │ │ - OP-TEE, Trusty  │
     │ - Interrupt handlers          │ │                   │
     └───────────────────────────────┘ └───────────────────┘
             ↕ (SVC)
     ┌───────────────────────────────┐ ┌───────────────────┐
EL0  │ User Applications             │ │ Secure EL0        │
     │ - Normal user processes       │ │ - Trusted apps    │
     │ - System call via SVC #0      │ │ - TrustZone apps  │
     └───────────────────────────────┘ └───────────────────┘

Key instructions:
  SVC #imm   : EL0 → EL1 (syscall)
  HVC #imm   : EL1 → EL2 (hypercall)
  SMC #imm   : EL1/EL2 → EL3 (secure monitor call)
  ERET       : return to the same or a lower EL (restores PC from ELR_ELn and PSTATE from SPSR_ELn)

TrustZone: ARM's hardware isolation between Secure World and Normal World. The SCR_EL3.NS bit controls which world the lower exception levels run in. Secure World can access both secure and non-secure memory; Normal World sees only the non-secure view. Used for: DRM key storage, biometric authentication, mobile payments. (Apple's Secure Enclave on M1/M2 serves a similar purpose but is a separate coprocessor with its own memory protection rather than a TrustZone world.)

EL2 Stage-2 Translation: In a virtualized system, EL1 (guest OS) manages Stage-1 page tables (VA → IPA, Intermediate Physical Address). EL2 manages Stage-2 page tables (IPA → PA, Physical Address). On a TLB miss inside a VM the hardware performs a two-stage walk, in which every Stage-1 table access is itself translated through Stage-2; TLBs cache the combined VA → PA result so most accesses avoid the walk. The ARM SMMU (System Memory Management Unit) provides the same stage-2 translation for DMA devices.
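The composition of the two stages can be sketched in a few lines of Python. This is a deliberately simplified model: each stage is a flat dictionary from page number to page number, whereas real hardware walks multi-level tables and checks permissions at both stages. The names translate and guest_access and the page numbers are made up for illustration.

```python
PAGE = 4096  # assumed 4 KiB translation granule


def translate(table, addr, stage):
    """Look up one stage: map the page number, keep the page offset."""
    page, offset = divmod(addr, PAGE)
    if page not in table:
        raise LookupError(f"stage-{stage} translation fault at {addr:#x}")
    return table[page] * PAGE + offset


def guest_access(stage1, stage2, va):
    ipa = translate(stage1, va, 1)      # guest OS (EL1) tables: VA -> IPA
    return translate(stage2, ipa, 2)    # hypervisor (EL2) tables: IPA -> PA


stage1 = {0x400: 0x10}     # guest maps VA page 0x400 to IPA page 0x10
stage2 = {0x10: 0x8F2}     # hypervisor maps IPA page 0x10 to PA page 0x8F2
print(hex(guest_access(stage1, stage2, 0x400123)))
```

A missing entry in stage2 models the case where the hypervisor has not backed a guest "physical" page: the fault is taken to EL2, not to the guest kernel.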

ARM Memory Model: Weak Ordering vs TSO

ARM implements a weakly ordered memory model—substantially more relaxed than x86's TSO (Total Store Order).

What "weakly ordered" means:

ARM example of observable reordering:
Thread 1:              Thread 2:
  STR X1, [&data]        LDR X0, [&flag]
  STR X2, [&flag]        LDR X3, [&data]

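// Initially data = 0 and flag = 0; Thread 1 stores data = 1, then flag = 1.
// Thread 2 might observe flag=1 (updated) but data=0 (not yet updated).
// This is a valid ARM execution: the two stores may become visible out of
// order, and the two loads may also be satisfied out of order.

// The same code on x86 (TSO) would never allow flag=1, data=0,
// because x86 keeps a thread's stores in order and its loads in order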
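The difference can be reproduced with a toy enumeration in Python. The sketch below is not the formal ARM memory model: it simply assumes that, without barriers, each thread's two accesses to different addresses may take effect in either order, and then lists every interleaving against a single shared memory. The names T1, T2, interleavings, and outcomes are illustrative.

```python
from itertools import permutations

T1 = [("W", "data", 1), ("W", "flag", 1)]   # writer: data first, then flag
T2 = [("R", "flag"), ("R", "data")]         # reader: flag first, then data


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def outcomes(allow_reorder):
    """Return the set of (flag, data) pairs the reader can observe."""
    orders1 = list(permutations(T1)) if allow_reorder else [tuple(T1)]
    orders2 = list(permutations(T2)) if allow_reorder else [tuple(T2)]
    seen = set()
    for o1 in orders1:
        for o2 in orders2:
            for run in interleavings(list(o1), list(o2)):
                mem, obs = {"data": 0, "flag": 0}, {}
                for op in run:
                    if op[0] == "W":
                        mem[op[1]] = op[2]
                    else:
                        obs[op[1]] = mem[op[1]]
                seen.add((obs["flag"], obs["data"]))
    return seen


print("program order kept:", sorted(outcomes(False)))
print("reordering allowed:", sorted(outcomes(True)))
```

Only the second set contains (1, 0), the "flag set but data stale" outcome. Making the flag store a release and the flag load an acquire removes exactly the reorderings that produce it.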

ARM memory barrier instructions:

DMB ISH     // Data Memory Barrier — Inner Shareable
            // All preceding memory accesses complete before subsequent ones
            // ISH = applies to all processors in the Inner Shareable domain

DSB ISH     // Data Synchronization Barrier — stronger than DMB
            // All preceding memory accesses, cache/TLB operations complete
            // Placed after page-table updates and after TLBI / cache maintenance to wait for completion

ISB         // Instruction Synchronization Barrier
            // Flushes pipeline + re-fetches from point of synchronization
            // Required after: system register writes (MSR), cache ops, code modification
            // (loosely comparable to a serializing instruction such as CPUID on x86)

STLR/LDAR:      // Store-Release / Load-Acquire instructions
STLR X0, [X1]   // release store: no earlier access can be reordered after this store
LDAR X0, [X1]   // acquire load: no later access can be reordered before this load
LDAXR/STLXR     // exclusive variants (for LL/SC atomics with acquire/release)

C11/C++11 memory model mapping to ARM:

// Relaxed:  no barrier  (std::memory_order_relaxed)
// Acquire:  LDAR         (std::memory_order_acquire)
// Release:  STLR         (std::memory_order_release)
// SeqCst:   LDAR / STLR  (std::memory_order_seq_cst); no extra DMB is needed,
//           because an LDAR is never reordered before an earlier STLR

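ARM LRCPC (Load-Acquire RCpc, ARMv8.3-A): Adds LDAPR, a weaker and potentially cheaper acquire load. Unlike LDAR, an LDAPR may be reordered before an earlier STLR to a different address, so it does not provide sequential consistency, but it is sufficient for C++ memory_order_acquire and for many lock implementations.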

NEON SIMD

NEON (Advanced SIMD) is ARM's fixed-width SIMD extension:
- 32 registers: V0-V31, each 128 bits
- Named scalar sub-views: B0-B31 (8-bit), H0-H31 (16-bit), S0-S31 (32-bit), D0-D31 (64-bit), Q0-Q31 (128-bit)
- AArch32 exposed only Q0-Q15; AArch64 vector instructions use the V notation with an arrangement suffix (for example V0.4S)
- Operates on vectors of 2×64, 4×32, 8×16, or 16×8 elements

LD1 {V0.4S}, [X0]      // load 4 × 32-bit floats from [X0] into V0
LD1 {V0.4S, V1.4S, V2.4S, V3.4S}, [X0]  // load 4 consecutive vectors (use LD4 to de-interleave)

FMLA V0.4S, V1.4S, V2.4S   // V0 += V1 * V2 (4 floats fused multiply-add)
FADD V0.4S, V0.4S, V1.4S   // vector add
FMAXV S0, V0.4S             // reduce max across vector → scalar

// Integer SIMD
ADD V0.4S, V1.4S, V2.4S    // 4 × INT32 add
UMULL V0.2D, V1.2S, V2.2S  // widening multiply: 2 × U32 → 2 × U64
UADDLV S0, V0.8H            // unsigned widening add across lanes: 8 × U16 → one U32
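The lane-wise behavior of the integer instructions above can be modeled with plain Python lists, one list element per lane. The helper names below are hypothetical stand-ins for the instructions named in their docstrings; the point of the sketch is the difference between a same-width operation, which wraps, and a widening operation, which cannot overflow.

```python
def add_4s(vn, vm):
    """ADD Vd.4S, Vn.4S, Vm.4S: lane-wise 32-bit add, wrapping modulo 2**32."""
    return [(a + b) & 0xFFFFFFFF for a, b in zip(vn, vm)]


def umull_2d(vn, vm):
    """UMULL Vd.2D, Vn.2S, Vm.2S: two U32 x U32 products kept as full U64."""
    return [(a & 0xFFFFFFFF) * (b & 0xFFFFFFFF) for a, b in zip(vn, vm)]


def uaddlv_8h(vn):
    """Unsigned widening add across eight 16-bit lanes into one 32-bit sum."""
    if len(vn) != 8:
        raise ValueError("an 8H arrangement has exactly 8 lanes")
    return sum(x & 0xFFFF for x in vn)


print(add_4s([0xFFFFFFFF, 1, 2, 3], [1, 1, 1, 1]))   # first lane wraps to 0
print([hex(x) for x in umull_2d([0xFFFFFFFF, 2], [0xFFFFFFFF, 3])])
print(hex(uaddlv_8h([0xFFFF] * 8)))                  # exceeds 16 bits
```

Because eight 16-bit values can sum to more than 16 bits, the across-lanes add needs a destination wider than its source elements, which is what the "L" (long) in the mnemonic signals.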

NEON use cases: Audio/video encoding (H.264, VP9), image processing, cryptography (AES/SHA via dedicated NEON instructions on ARMv8-A), ML inference (BLAS kernels).

SVE / SVE2 (Scalable Vector Extension)

SVE (an optional extension from ARMv8.2-A onward, aimed initially at HPC) introduces vector length agnostic (VLA) programming: code executes correctly on implementations with any vector width from 128 to 2048 bits in steps of 128.

// SVE code example: dst[i] = a[i] + b[i], without knowing the vector width
// X0 = i (starts at 0), X1 = a, X2 = b, X3 = n, X4 = dst
  WHILELT P0.S, X0, X3    // P0 = predicate mask: active lanes where X0 + lane < X3
.loop:
  LD1W {Z0.S}, P0/Z, [X1, X0, LSL #2]   // load active elements from first src
  LD1W {Z1.S}, P0/Z, [X2, X0, LSL #2]   // load from second src
  FADD Z0.S, P0/M, Z0.S, Z1.S           // add (P0/M = merging predication)
  ST1W {Z0.S}, P0, [X4, X0, LSL #2]     // store active elements
  INCW X0                               // increment by vector length (in 32-bit words)
  WHILELT P0.S, X0, X3                  // recompute the predicate; sets NZCV
  B.FIRST .loop                         // loop while the first lane is still active
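The vector-length-agnostic idea in the loop above is easy to check with a scalar model. The Python sketch below imitates the predicate computation (WHILELT), the predicated load/add/store, and the index increment (INCW) for 32-bit elements. The function name sve_vector_add and the vl_bits parameter are illustrative; the model ignores everything about SVE except lane activity.

```python
def sve_vector_add(a, b, vl_bits):
    """Model a predicated SVE loop over 32-bit elements at a given vector length."""
    if vl_bits % 128 or not 128 <= vl_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in 128..2048")
    lanes = vl_bits // 32                 # what INCW adds to the index
    n, i = len(a), 0
    dst = [0.0] * n
    while True:
        pred = [i + lane < n for lane in range(lanes)]   # WHILELT
        if not pred[0]:                   # first lane inactive: nothing left
            break
        for lane, active in enumerate(pred):
            if active:                    # inactive lanes touch no memory
                dst[i + lane] = a[i + lane] + b[i + lane]
        i += lanes                        # INCW
    return dst


a = [float(x) for x in range(11)]         # 11 elements: not a multiple of any VL
b = [10.0] * 11
for vl in (128, 256, 512, 2048):
    assert sve_vector_add(a, b, vl) == [x + 10.0 for x in a]
print("same result at every vector length")
```

The tail of the array needs no separate scalar loop: the final iteration simply runs with some lanes switched off, which is the main practical advantage over fixed-width NEON code.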

SVE on HPC and server hardware:
- ARM Neoverse V1 (AWS Graviton3): 256-bit SVE
- ARM Neoverse V2 (AWS Graviton4): 128-bit SVE2
- Fujitsu A64FX (Fugaku supercomputer): 512-bit SVE; Fugaku topped the TOP500 list from June 2020 until Frontier overtook it in June 2022

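SVE2: Introduced with ARMv9-A. SVE focused on HPC-style floating-point and integer loops; SVE2 adds the fixed-point, DSP, and bit-manipulation operations needed to cover NEON's application space. Arm's own ARMv9-A Cortex and Neoverse cores implement it.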

ARM big.LITTLE and DynamIQ

big.LITTLE (2011): Heterogeneous multiprocessing that pairs large OoO cores (originally Cortex-A15, later A57/A72/A73) with efficient in-order cores (Cortex-A7, later A53) in separate clusters. The OS scheduler places compute-intensive threads on big cores and background threads on LITTLE cores.

DynamIQ (2017): Successor to big.LITTLE. All cores in a single cluster (vs separate clusters in big.LITTLE). Supports heterogeneous configurations: 1×Cortex-X1 + 3×Cortex-A78 + 4×Cortex-A55. Enables more fine-grained power management within a cluster (per-core clock gating, voltage adjustment).

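Thread Director analog: The ARM architecture has no direct equivalent of Intel Thread Director's hardware feedback to the scheduler. On Linux and Android, task placement on heterogeneous cores is handled in software by Energy Aware Scheduling (EAS), which combines per-core capacity and energy models with per-task utilization tracking; PMU counters remain available for profiling IPC and SIMD utilization.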

Apple M1/M2 Microarchitecture

Apple Silicon represents the most aggressive implementation of ARM ISA for high-performance desktop/laptop use. Apple licenses the ARM ISA but designs its own microarchitecture completely.

Apple M2 specifications (2022):
- Process: TSMC 5nm (N5P, enhanced vs M1's N5)
- Core configuration: 4 Avalanche (performance) + 4 Blizzard (efficiency); the M1 equivalents are Firestorm and Icestorm
- Avalanche (P-core): 8-wide decode, a reorder buffer estimated at 300+ entries, roughly 12 µop execution ports (third-party estimates)
- Blizzard (E-core): a smaller OoO core tuned for efficiency
- GPU: 8- or 10-core Apple GPU (up to 19 cores in M2 Pro)
- Neural Engine: 16-core, 15.8 TOPS
- Unified Memory: 8–24 GB LPDDR5, 100 GB/s bandwidth
- Media engines: hardware H.264/H.265/ProRes decode+encode

Firestorm/Avalanche microarchitecture highlights:
- 8-wide instruction decode: vs Intel Golden Cove's 6-wide. A wider front end raises the ceiling on IPC for code that is not limited by dependencies or memory latency.
- Massive ROB (~300–600 entries estimated): Enables deep speculation over long-latency memory operations. Apple has not published exact numbers; the estimates come from reverse-engineering.
- Unified memory: No discrete GPU with separate VRAM. CPU and GPU share one LPDDR5 memory pool, which eliminates PCIe transfer overhead for GPU workloads. The tradeoff: LPDDR5 delivers less bandwidth per pin than GDDR6X; Apple mitigates this with a wide memory interface on the larger chips.
- Instruction fetch: The front end reportedly fetches at least 8 instructions per cycle, enough to keep the 8-wide decoder fed.

Why Apple M2 is competitive with Intel Alder Lake on single-thread at far lower power:
1. Higher instruction decode width (8 vs 6)
2. Larger ROB allows deeper speculation (latency hiding)
3. High IPC at a moderate clock (P-cores peak at about 3.5 GHz), which keeps voltage and power low
4. Sustained clocks: the low power draw leaves little need for thermal throttling
5. Simpler frontend (fixed-width ARM instructions are easier to decode in parallel than variable-width x86)

Apple M2 Ultra: Two M2 Max dies connected via Apple's UltraFusion interconnect (2.5 TB/s, die-to-die). It appears to the OS as a single chip with up to 192 GB of unified memory, with both dies in one shared coherence domain.

Historical Context

ARM's success is partly a business story. Acorn needed a low-power, simple processor and found no existing CPU that fit its needs and budget. Sophie Wilson's design was intentionally minimal: the original ARM1 was produced by a very small team in about 18 months (1983–1985). The licensing model (ARM doesn't manufacture, just designs and licenses IP) was pioneered out of necessity and became the dominant semiconductor IP model. By 2023, ARM's licensees had shipped over 250 billion ARM-based chips. NVIDIA's $40B acquisition of ARM, announced in 2020, was abandoned in February 2022 under regulatory pressure; ARM IPO'd in September 2023 at ~$55B valuation.

Production Examples

AWS Graviton4 (2024): ARM Neoverse V2 cores (96 cores), 512 GB DDR5, SVE2, and 3× compute density improvement vs Graviton3. Powers AWS EC2 C8g, M8g, R8g instances.

Apple M2 in MacBook Pro (2022): Single-chip ARM SoC delivering performance competitive with Intel Alder Lake laptop processors at substantially lower power. Demonstrated ARM viability for professional compute.

Qualcomm Snapdragon 8 Gen 3 (2023): Cortex-X4 performance core (ARM reference), LPDDR5X memory, Adreno 750 GPU. Used in Samsung Galaxy S24. Claims near-Apple M2 single-thread performance.

Fujitsu Fugaku (2020): ARM A64FX, 442 PFLOP/s on LINPACK (HPL). 158,976 nodes × 48 compute cores per node = 7,630,848 ARM cores. Japan's national supercomputer, used for COVID-19 research and climate simulation.

Debugging Notes

ARM assembly debugging in GDB:

(gdb) info registers     # GPRs X0-X30, SP, PC, CPSR
(gdb) p/x $x0            # print X0 in hex
(gdb) info registers vector  # NEON/FP registers V0-V31 (x86's xmm names do not exist on ARM)

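Memory barrier debugging: Neither valgrind --tool=helgrind nor ThreadSanitizer (-fsanitize=thread) models ARM's weak memory ordering directly. Both detect data races, meaning conflicting accesses with no happens-before relationship, at the language level. That is exactly the class of bug that stays hidden under x86's stronger TSO and manifests on ARM, so a race reported by either tool on x86 is a strong hint of a failure that will surface on ARM.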

Exception level inspection (Linux):

# Not directly accessible from EL0
# Kernel provides EL-related info via /proc/cpuinfo and /sys/devices/system/cpu/
grep Features /proc/cpuinfo  # shows fp, asimd (NEON), sve, etc.

TrustZone debugging (EL3/S-EL1): Requires a hardware JTAG debugger (ARM DSTREAM, Lauterbach TRACE32) attached to the SoC. Software tools can't access Secure World from Normal World EL1. ARM CoreSight provides ETM (Embedded Trace Macrocell) for instruction tracing.

Security Implications

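PAC (Pointer Authentication Codes, ARMv8.3-A): Embeds a cryptographic MAC (typically computed with the QARMA cipher from the pointer, a context modifier such as SP, and a key held in system registers) in the unused high bits of a pointer. On use, the MAC is verified: a corrupted pointer (for example a return address overwritten by a stack overflow to start a ROP chain) fails authentication and faults. iOS uses PAC for return addresses (protecting against stack-smashing) and function pointers. Linux has supported PAC for user space since kernel 5.0 and PAC-protected return addresses inside the kernel since 5.7.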
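The sign-then-verify flow can be sketched in Python. This is an analogy, not the hardware algorithm: a truncated HMAC-SHA256 stands in for the hardware cipher, the 48-bit address width is an assumption, and on real cores a failed authentication either faults directly or yields an invalid pointer that faults when used. The names sign and authenticate are hypothetical counterparts of the PACIA and AUTIA instructions.

```python
import hashlib
import hmac

VA_BITS = 48                       # assumed virtual-address width
ADDR_MASK = (1 << VA_BITS) - 1     # the 16 bits above hold the code


def _mac(ptr, modifier, key):
    """16-bit code over (pointer, modifier); HMAC is a stand-in cipher."""
    msg = ptr.to_bytes(8, "little") + modifier.to_bytes(8, "little")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "little")


def sign(ptr, modifier, key):
    """Place the code in the unused high bits of the pointer."""
    addr = ptr & ADDR_MASK
    return (_mac(addr, modifier, key) << VA_BITS) | addr


def authenticate(signed, modifier, key):
    """Recompute the code; return the clean pointer only if it matches."""
    addr = signed & ADDR_MASK
    if signed >> VA_BITS != _mac(addr, modifier, key):
        raise ValueError("pointer authentication failed")
    return addr


key = b"per-process secret key"            # illustrative key material
ret_addr, sp = 0x0000_7FFF_1234_5678, 0x0000_7FFF_FFFF_E000
signed = sign(ret_addr, sp, key)
print(hex(authenticate(signed, sp, key)))  # the original return address
```

Using the stack pointer as the modifier ties a signed return address to one stack frame, so a validly signed pointer copied from elsewhere is very likely to fail verification too.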

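BTI (Branch Target Identification, ARMv8.5-A): Similar to Intel CET IBT. In guarded pages, indirect branches (BR, BLR) can only target instructions marked with BTI landing pads. A JOP/COP chain that jumps into the middle of a function faults; return-oriented attacks are covered by PAC rather than BTI.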

MTE (Memory Tagging Extension, ARMv8.5-A): Each 16-byte granule of memory carries a 4-bit allocation tag, held in separate tag storage, and each pointer carries a 4-bit tag in its top byte (bits 59:56). On loads and stores the hardware compares the pointer's tag with the memory's tag. A mismatch raises a tag check fault, either synchronously or asynchronously depending on the configured mode. Detects: use-after-free, buffer overflow. Google Pixel 8 (Tensor G3, Cortex-X3) is the first Android phone to expose MTE, as an opt-in developer option.
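A small Python model shows how tag checking catches both bug classes. It assumes 16-byte granules, a 4-bit tag in pointer bits 59:56, granule-aligned allocations, and a default tag of 0 for untouched memory; the TaggedMemory class and its methods are illustrative and do not correspond to any real allocator API.

```python
GRANULE = 16        # bytes covered by one memory tag
TAG_SHIFT = 56      # pointer tag occupies bits 59:56


class TaggedMemory:
    def __init__(self):
        self.tags = {}                    # granule index -> 4-bit memory tag

    def allocate(self, addr, size, tag):
        """Tag every granule of [addr, addr+size) and return a tagged pointer."""
        first = addr // GRANULE
        last = (addr + size + GRANULE - 1) // GRANULE
        for granule in range(first, last):
            self.tags[granule] = tag & 0xF
        return ((tag & 0xF) << TAG_SHIFT) | addr

    def check(self, ptr):
        """Compare pointer tag with memory tag, as done on each load/store."""
        addr = ptr & ((1 << TAG_SHIFT) - 1)
        ptr_tag = (ptr >> TAG_SHIFT) & 0xF
        if self.tags.get(addr // GRANULE, 0) != ptr_tag:
            raise MemoryError(f"tag check fault at {addr:#x}")
        return addr


heap = TaggedMemory()
p = heap.allocate(0x1000, 32, tag=5)
heap.check(p + 31)                        # last byte of the buffer: allowed
try:
    heap.check(p + 32)                    # one past the end: different tag
except MemoryError as fault:
    print("overflow caught:", fault)
heap.allocate(0x1000, 32, tag=9)          # free + reuse retags the memory
try:
    heap.check(p)                         # stale pointer still carries tag 5
except MemoryError as fault:
    print("use-after-free caught:", fault)
```

With only 16 possible tags, detection is probabilistic when an adjacent or recycled allocation happens to receive the same tag, which is why allocators try to give neighboring and successive allocations different tags.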

TrustZone exploitation: If the Secure Monitor (EL3) or the trusted OS (S-EL1) has a vulnerability, an attacker in Normal World can gain access to Secure World (which holds cryptographic keys and DRM content). Publicly documented cases include memory-corruption bugs in Qualcomm's QSEE (for example CVE-2015-6639 in the Widevine trusted application) and in trusted applications running on Trustonic's Kinibi.

Performance Implications

Weak memory model performance advantage: ARM's weak ordering lets the hardware buffer and reorder stores and loads freely when no ordering is requested, so code that doesn't need ordering (pure computation) pays nothing for it. x86 TSO must preserve a stronger order, which implementations recover through speculation and extra tracking hardware. Any resulting ARM advantage on correctly written lock-free algorithms is workload-dependent; figures of ~5–10% are sometimes quoted but are rough estimates.

Thumb-2 code density: AArch32 Thumb-2 (mixed 16/32-bit instructions) achieves ~26% smaller code than ARM32. In AArch64, all instructions are 32-bit (no Thumb mode). AArch64 code density is nonetheless roughly comparable to x86-64, whose variable-length encoding advantage is offset by REX prefixes and legacy encoding overhead.

NEON vs AVX-512 throughput: ARM NEON operates on 128-bit vectors, while x86 AVX-512 operates on 512-bit vectors, 4× wider per instruction. However, Apple M2 has four 128-bit NEON execution units per P-core, and ARM SVE at 256 bits (Graviton3) narrows the gap further. For quantized INT8 ML inference, Apple M2's Neural Engine (15.8 TOPS) outperforms what x86 CPU cores can deliver.
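The width-versus-unit-count tradeoff is simple arithmetic, sketched below in Python. The formula counts a fused multiply-add as two floating-point operations. The pipe counts for the SVE and AVX-512 rows are assumptions chosen for illustration, not vendor specifications, and real throughput also depends on clock speed, memory bandwidth, and any frequency reduction under wide-vector load.

```python
def peak_fp32_flops_per_cycle(vector_bits: int, fma_pipes: int) -> int:
    """Peak FP32 FLOPs per cycle per core for a given SIMD configuration."""
    lanes = vector_bits // 32          # FP32 elements per vector
    return lanes * fma_pipes * 2       # one FMA = multiply + add = 2 FLOPs


configs = {
    "NEON, 4 x 128-bit pipes": (128, 4),
    "SVE, 2 x 256-bit pipes (assumed)": (256, 2),
    "AVX-512, 2 x 512-bit pipes (assumed)": (512, 2),
}
for name, (bits, pipes) in configs.items():
    print(f"{name}: {peak_fp32_flops_per_cycle(bits, pipes)} FLOPs/cycle/core")
```

Under these assumptions four narrow pipes reach half the per-cycle peak of two 512-bit pipes, so the 4× difference in vector width overstates the gap in achievable throughput.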

Failure Modes and Real Incidents

Incident: ARM Linux kernel spin_lock data race on Cortex-A9 (2011): The Cortex-A9 implemented cache coherency in a way that exposed a previously latent race in the Linux ARM spinlock implementation. The spinlock used a LDREX/STREX sequence without a full DSB barrier. On Cortex-A9 with L2 PL310 cache controller, a store could complete after the STREX without the unlock being visible to all cores. Fixed by adding DSB before the unlock store.

Incident: Apple M1 microarchitectural side channels (2022): "Augury" demonstrated a prefetcher-based side channel unique to Apple Silicon: the data memory-dependent prefetcher (DMP) dereferences pointer-like values in memory and can thereby leak data the program never architecturally reads. In the same year, "PACMAN" showed that speculative execution can be used to test guessed pointer authentication codes on M1 without triggering a crash. Both stem from hardware behavior, which limits what OS-level mitigations can do.

Incident: Android heap pointer tagging breakage (Google, 2020): In preparation for MTE, Android 11 began placing a tag in the top byte of native heap pointers on arm64, relying on Top Byte Ignore (TBI). Applications that stored their own metadata in the top bits of pointers, or that passed modified pointers back to the allocator, collided with the tag and crashed. Resolution: affected apps had to stop using the top byte, with a manifest opt-out (android:allowNativeHeapPointerTagging) as a temporary escape hatch.

Modern Usage

AWS Graviton3 (ARM Neoverse V1, 2022): 64 Neoverse V1 cores. AWS claims up to 25% better compute performance and up to 2× better floating-point performance than Graviton2. SVE at 256 bits. Used for ML inference and web serving at lower cost than x86.

Windows on ARM (Qualcomm Snapdragon X Elite, 2024): Microsoft Surface Pro 11 uses Qualcomm Oryon CPU (custom ARM). Windows ARM emulates x86-64 via Prism emulator at near-native speed on simple code. Native ARM64EC ABI for Windows apps.

Alibaba Yitian 710 (2021): Custom ARM server chip for Alibaba Cloud. 128 ARMv9 cores (reported to be based on Neoverse N2), 5nm TSMC. Used in Alibaba's e-commerce serving tier.

Future Directions

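Realm Management Extension (RME, ARMv9-A): the hardware basis of the Arm Confidential Compute Architecture (CCA); guest VMs ("realms") are protected even from the hypervisor
SME / SME2 (Scalable Matrix Extension, ARMv9-A): matrix-tile operations and a streaming SVE mode for ML and HPC; the architectural SVE vector length limit remains 2048 bits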
MPAM (Memory Partitioning and Monitoring, ARMv8.4): Hardware QoS for cache and memory bandwidth partitioning in multi-tenant cloud
Apple M4 and beyond: Further widening of decode, larger caches, higher unified memory bandwidth; competition with Intel/AMD intensifying
RISC-V vs ARM: Open ISA pressure from RISC-V may challenge ARM's licensing model, especially in custom accelerators and embedded

Exercises

AArch64 calling convention: Write an AArch64 assembly function that computes fibonacci(N) iteratively. Use the AAPCS64 (ARM Procedure Call Standard) calling convention: X0 = input N, X0 = return value, preserve X19-X28. Verify by linking with a C caller.
Barrier analysis: Write a C program with two pthreads communicating via a shared variable (without mutex). Compile for ARM and disassemble. Identify where the C11 atomic load/store maps to LDAR/STLR. Then deliberately remove the barriers and use ThreadSanitizer to detect the resulting race.
SVE benchmarking: On a system with SVE support (AWS Graviton3 or Fujitsu A64FX), write an SVE-optimized dot product (using Z registers). Benchmark vs NEON, scalar, and auto-vectorized versions. Report GFLOP/s for each.
TrustZone exploration: On a Raspberry Pi 4 (ARM Cortex-A72), explore the TrustZone boundary using OP-TEE (open-source TrustZone OS). Write a simple trusted application that stores a secret in Secure World. Verify that Normal World cannot read the secret directly.
PAC pointer authentication: On Apple Silicon or a Cortex-X3 system with PAC support, write a C++ program that deliberately overwrites a return address on the stack. First build without return-address signing and observe that the overwrite redirects control flow. Then rebuild with the compiler flag -mbranch-protection=pac-ret (GCC/Clang) and verify that the corrupted return address now causes an authentication fault (crash) instead.

References

ARM Architecture Reference Manual (ARM ARM), ARMv8-A: https://developer.arm.com/documentation/ddi0487/
ARM Architecture Reference Manual, ARMv9-A (Supplement to ARM ARM)
Procedure Call Standard for the Arm 64-bit Architecture (AAPCS64), ARM Ltd 2023
Patterson & Hennessy, "Computer Organization and Design: ARM Edition," 2017
Apple M2 System on a Chip: https://www.apple.com/mac-mini/specs/ (publishes unified memory and SoC info)
Arm Cortex-X3 Technical Reference Manual
Fugaku Supercomputer Architecture, RIKEN, 2020
Kocher et al., "Spectre Attacks: Exploiting Speculative Execution," IEEE S&P 2019 (ARM variant analysis)
MTE overview: https://developer.arm.com/documentation/102925/