x86-64 Internals: Registers, System Calls, Memory Model, and Security

Prerequisites

Basic assembly language: instructions, operands, registers
Virtual memory: page tables, privilege rings, kernel/user distinction
CPU pipeline (01-cpu-pipeline.md): fetch, decode, execute
Operating system concepts: system calls, interrupt handling, privilege levels
Cache coherence (06-cache-coherence.md): memory ordering basics

Technical Overview

x86-64 (also known as AMD64, Intel 64, EM64T, or x86_64) is the 64-bit extension of the x86 ISA and one of the most widely deployed general-purpose CPU architectures in the world. As of 2024, x86-64 still powers most servers, desktops, and laptops, running Linux, Windows, or macOS (Intel Macs), although Arm-based designs such as Apple silicon and Arm cloud server CPUs hold a growing share.

Understanding x86-64 internals at the hardware/kernel level requires knowing:
1. The complete register set and its semantics
2. System call mechanics (SYSCALL/SYSRET instruction pair and MSRs)
3. Privileged control registers and their bits
4. The System V AMD64 ABI calling convention
5. The x86 memory model (TSO) and its implications for lock-free code
6. Security extensions: SMEP, SMAP, CET, PKU, SGX

This is not an assembly language tutorial. The focus is on the mechanisms that a kernel engineer, hypervisor developer, or systems security researcher needs to understand: how the hardware enforces privilege separation, how exceptions propagate, how the transition between user and kernel mode works at the instruction level.

Historical Context

1978 — Intel 8086 (16-bit x86): The original. Segmented addressing, 8 16-bit registers (AX, BX, CX, DX, SI, DI, BP, SP), 20-bit physical address space (1 MB). Used in the IBM PC 5150 (1981).

1985 — Intel 80386 (32-bit i386, IA-32): Extended to 32-bit. Registers expanded to EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP. 32-bit physical addressing (4 GB). Added paging and a flat 32-bit protected mode on top of the privilege rings introduced by the 80286 (1982). This architecture defined the "x86" we know.

1999 — AMD64 design begins: With Intel betting on Itanium (IA-64), a new 64-bit architecture that was not binary-compatible with x86 and had not yet shipped, AMD designed a 64-bit extension that remained fully backward compatible with x86. Lead architects: Mike Uhrig, Fred Weber, and others at AMD.

2003 — AMD Opteron (K8, AMD64 launch): First x86-64 processor. Registers expanded to 64-bit (RAX, RBX, ...), 8 new general-purpose registers (R8-R15), 64-bit virtual addresses (48-bit canonical, extendable to 57-bit with LA57), SYSCALL/SYSRET as fast system call mechanism.

2004 — Intel EM64T (Extended Memory 64 Technology): Intel's implementation of AMD64, later renamed "Intel 64". The ISA is nearly identical, with a few minor differences (LAHF/SAHF handling in early parts, SYSCALL/SYSRET corner cases, some SMM behavior). Intel adopted AMD's design rather than competing with an incompatible extension.

2011 — AVX (Advanced Vector Extensions): Intel extended SIMD from 128-bit SSE (XMM) to 256-bit (YMM). VEX encoding prefix introduced, enabling 3-operand instructions and non-destructive operations.

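2013 — AVX-512: Intel announced the extension to 512-bit SIMD (ZMM0-ZMM31). It first shipped in Xeon Phi (Knights Landing, 2016) and server Xeons (Skylake-SP, 2017). Some Intel client CPUs (Ice Lake, Tiger Lake, Rocket Lake) supported it, but Intel dropped it from client parts starting with Alder Lake (2021). AMD adopted AVX-512 in Zen 4 (2022), executing 512-bit operations on 256-bit datapaths.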

2019 — 5-level paging (LA57): A fifth page-table level extends virtual addresses to 57 bits (128 PiB of virtual address space). Intel introduced it with the Sunny Cove core (Ice Lake generation); AMD added it in Zen 4 (2022).

2020 — Intel CET (Control-flow Enforcement Technology): Shadow stack + Indirect Branch Tracking. Specified in 2016, first shipped in Tiger Lake (2020), now mainstream.

Complete x86-64 Register Set

General-Purpose Registers (GPRs)

x86-64 has 16 64-bit GPRs. The naming conventions reflect the 8086 heritage:

  64-bit  32-bit  16-bit  8-bit high  8-bit low   Conventional use
  ──────────────────────────────────────────────────────────────────
  RAX     EAX     AX      AH          AL          Accumulator, return value
  RBX     EBX     BX      BH          BL          Base, callee-saved
  RCX     ECX     CX      CH          CL          Counter, arg4 (clobbered by SYSCALL)
  RDX     EDX     DX      DH          DL          Data, arg3 / return-2nd
  RSI     ESI     SI      (none)      SIL         Source index, arg2
  RDI     EDI     DI      (none)      DIL         Dest index, arg1
  RBP     EBP     BP      (none)      BPL         Base pointer (frame), callee-saved
  RSP     ESP     SP      (none)      SPL         Stack pointer
  R8      R8D     R8W     (none)      R8B         arg5
  R9      R9D     R9W     (none)      R9B         arg6
  R10     R10D    R10W    (none)      R10B        Caller-saved; arg4 for SYSCALL (replaces RCX)
  R11     R11D    R11W    (none)      R11B        Scratch (destroyed by SYSCALL)
  R12     R12D    R12W    (none)      R12B        Callee-saved
  R13     R13D    R13W    (none)      R13B        Callee-saved
  R14     R14D    R14W    (none)      R14B        Callee-saved
  R15     R15D    R15W    (none)      R15B        Callee-saved

  Note: Writing to the 32-bit form (EAX, R8D, etc.) ZERO-EXTENDS to 64 bits.
        Writing to the 8-bit or 16-bit form does NOT zero-extend (legacy behavior).
        This creates a common bug: `mov al, 5` does not clear the upper 56 bits of RAX.
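
The zero-extension rule is easy to get wrong, so here is a minimal Python sketch of it. The helper name `write_gpr` is hypothetical, and the model covers only writes to the low part of a register (AL/AX/EAX/RAX), not the high-byte forms such as AH.

```python
MASK64 = (1 << 64) - 1

def write_gpr(old: int, value: int, width: int) -> int:
    """Return the 64-bit register value after writing `value` to its low `width` bits."""
    if width not in (8, 16, 32, 64):
        raise ValueError("width must be 8, 16, 32, or 64")
    low_mask = (1 << width) - 1
    value &= low_mask
    if width >= 32:
        # 32-bit writes zero-extend into bits 63:32; 64-bit writes replace everything.
        return value
    # 8- and 16-bit writes merge into the old value and leave the upper bits intact.
    return (old & ~low_mask & MASK64) | value

rax = 0xDEADBEEFCAFEF00D
print(hex(write_gpr(rax, 5, 8)))   # mov al, 5  -> 0xdeadbeefcafef005
print(hex(write_gpr(rax, 5, 32)))  # mov eax, 5 -> 0x5
```

The first result shows why `mov al, 5` followed by a 64-bit use of RAX reads stale upper bits.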

Special-Purpose Registers

  Register    Width   Description
  ─────────────────────────────────────────────────────────────────
  RIP         64-bit  Instruction pointer (program counter)
  RFLAGS      64-bit  Status flags (CF, PF, AF, ZF, SF, TF, IF, DF, OF, IOPL, NT, RF, VM, AC, VIF, VIP, ID)
                      RFLAGS[21] = ID → if writable, CPUID supported
                      RFLAGS[9]  = IF → interrupt enable flag (CLI/STI)
                      RFLAGS[8]  = TF → trap flag (single-step)
  RSP         64-bit  Stack pointer (also a GPR but ABI-constrained)
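
When reading a raw flags value out of a debugger or a saved register frame, a small decoder helps. The sketch below uses the architectural bit positions; the function name `decode_rflags` and the returned dictionary shape are illustrative choices, not part of any tool.

```python
# Single-bit RFLAGS fields by bit position (IOPL is a 2-bit field at bits 13:12).
RFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF", 8: "TF", 9: "IF",
    10: "DF", 11: "OF", 14: "NT", 16: "RF", 17: "VM", 18: "AC",
    19: "VIF", 20: "VIP", 21: "ID",
}

def decode_rflags(value: int) -> dict:
    """Split an RFLAGS value into the set flag names and the IOPL field."""
    flags = [name for bit, name in sorted(RFLAGS_BITS.items()) if (value >> bit) & 1]
    iopl = (value >> 12) & 0b11
    return {"flags": flags, "iopl": iopl}

print(decode_rflags(0x246))  # a typical user-mode value: PF, ZF, IF set
```

Bit 1 is reserved and always reads as 1, which is why common values such as 0x246 or 0x202 have it set even though it names no flag.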

Segment Registers

In 64-bit mode, segmentation is largely vestigial. CS, DS, ES, SS are all treated as base=0, so segment addressing is effectively flat. However, FS and GS retain their base addresses and are used for thread-local storage (TLS) and per-CPU data:

  CS: Code segment. The L bit of the referenced descriptor selects 64-bit mode vs compat mode.
  DS: Data segment. Base=0, ignored in 64-bit mode.
  ES: Extra segment. Base=0, ignored in 64-bit mode.
  SS: Stack segment. Base=0, ignored in 64-bit mode.
  FS: Loaded via WRFSBASE or MSR 0xC0000100 (FS.Base MSR)
      User space: TLS (Thread-Local Storage) base
      e.g., on Linux: FS:0x28 = stack canary (read by compiler-generated stack-protector code)
            FS points to the TLS struct (struct pthread in glibc)
  GS: Loaded via WRGSBASE or MSR 0xC0000101 (GS.Base MSR)
      User space: some ABIs use GS for TLS (macOS uses GS for pthread struct)
      Kernel space: SWAPGS instruction swaps GS.Base with KernelGSBase (MSR 0xC0000102)
                    Kernel uses GS to access per-CPU data (on Linux: the per-CPU area behind the this_cpu_*() helpers)

SWAPGS mechanism (critical for kernel entry):

User space:     GS.Base = user TLS (e.g., pthread struct)
                KernelGSBase MSR = per-CPU kernel struct

Kernel entry (syscall, or interrupt/exception arriving from user mode; entries from kernel mode must skip SWAPGS):
  SWAPGS    → swaps GS.Base ↔ KernelGSBase
              Now GS.Base = per-CPU kernel struct
              KernelGSBase = (old) user TLS (saved for return)

Kernel exit:
  SWAPGS    → swaps back
  SYSRET    → returns to user mode

Control Registers

Register  Bits  Description
─────────────────────────────────────────────────────────────────────────────
CR0       [0]   PE  = Protection Enable (0=real mode, 1=protected mode)
          [16]  WP  = Write Protect (supervisor writes to read-only pages fault
                      even at CPL=0 → enables CoW and read-only kernel data)
          [31]  PG  = Paging Enable (0=physical addressing, 1=virtual+paging)

CR1       (reserved; any access raises #UD)

CR2       [63:0] Page Fault Linear Address (PFLA)
                 On #PF exception: CR2 contains the virtual address that caused the fault
                 Read by page fault handler to determine which page to load

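CR3       [11:0]  PCID (Process Context Identifier, if PCIDE bit in CR4 set)
          [51:12] PML4/PML5 physical base address (top-level page table; upper limit is MAXPHYADDR)
          [63]    NOFLUSH hint (only meaningful in the value written while CR4.PCIDE=1)
                  MOV CR3, rax : flushes non-global TLB entries (with PCIDE=1 only those of the
                                 target PCID, and none at all if bit 63 is set)
                  Every address-space switch: load new CR3 for new process's page tables
                  KPTI: uses 2 CR3 values per process (user and kernel page tables)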

CR4       [4]   PSE  = Page Size Extensions (4MB pages in 32-bit mode)
          [5]   PAE  = Physical Address Extension (64-bit PTEs; mandatory in long mode)
          [7]   PGE  = Page Global Enable (global TLB entries persist across CR3 changes)
          [9]   OSFXSR = OS support for FXSAVE/FXRSTOR (SSE)
          [12]  LA57 = 5-level paging enable (57-bit virtual addresses, 128 PiB space)
          [17]  PCIDE = Process Context Identifier Enable
          [20]  SMEP = Supervisor Mode Execution Prevention
                       Prevents ring 0 (kernel) from executing user-space pages
                       Mitigates kernel code injection attacks (cannot JMP to user buffer)
          [21]  SMAP = Supervisor Mode Access Prevention
                       Prevents ring 0 from reading/writing user-space pages
                       WITHOUT explicitly setting RFLAGS.AC first
                       Mitigates exploits that trick the kernel into dereferencing user-space pointers
          [22]  PKE  = Protection Key Enable (user-mode)
          [23]  CET  = Control-flow Enforcement Technology enable

CR8       [3:0] TPR = Task Priority Register (mirrors bits 7:4 of the local APIC TPR)
                      Blocks external interrupts whose priority class is <= CR8.
                      CR8=15 → all maskable external interrupts blocked.
                      Ring 0 only: MOV to/from CR8 at CPL > 0 raises #GP.
                      Windows uses CR8 to implement IRQL; Linux leaves it at 0 and
                      uses CLI/STI (RFLAGS.IF) for local_irq_disable().
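
To check which protections a given CR4 value enables, the bits can be decoded mechanically. The following Python sketch uses the bit positions documented in the Intel SDM (note that LA57 is bit 12); the function name `decode_cr4` and the sample value are illustrative.

```python
# CR4 feature bits discussed in this section, by bit position (per the Intel SDM).
CR4_BITS = {
    4: "PSE", 5: "PAE", 7: "PGE", 9: "OSFXSR", 12: "LA57",
    17: "PCIDE", 20: "SMEP", 21: "SMAP", 22: "PKE", 23: "CET",
}

def decode_cr4(value: int) -> list:
    """Return the names of the known CR4 bits that are set in `value`."""
    return [name for bit, name in sorted(CR4_BITS.items()) if (value >> bit) & 1]

# Hypothetical value: PAE, PGE, PCIDE, SMEP and SMAP enabled.
sample = (1 << 5) | (1 << 7) | (1 << 17) | (1 << 20) | (1 << 21)
print(decode_cr4(sample))  # ['PAE', 'PGE', 'PCIDE', 'SMEP', 'SMAP']
```

The table is deliberately partial: it lists only the bits covered here, so unknown bits in a real CR4 value are silently ignored.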

CR3 Switching and KPTI

Context switch (process A → process B):
  mov rax, [process_B.cr3]   ; MOV to CR3 only accepts a register source
  mov cr3, rax
  → triggers TLB flush for all non-global pages
  → new PML4 table takes effect immediately
  → all subsequent virtual address translations use process B's page tables

KPTI context:
  Each process has TWO CR3 values:
  [1] User CR3: user page tables plus only the minimal kernel entry/exit mappings
  [2] Kernel CR3: full page tables (user + kernel)

  Entering kernel (syscall/interrupt):
    Load Kernel CR3 → TLB now sees kernel pages
  Returning to user:
    Load User CR3 → TLB loses kernel mappings (prevents Meltdown)

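  Performance: a MOV to CR3 typically costs on the order of hundreds of cycles,
  plus the TLB refill misses that follow a flush.
  PCID optimization: TLB entries are TAGGED with the PCID that created them, so a CR3
  switch need not flush them. Reported KPTI overhead for syscall-heavy code drops from
  roughly ~15% to ~3%, though the figures vary widely by workload and CPU.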
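
The value written to CR3 packs three things together: the page-table base, the PCID, and (when PCIDs are enabled) a no-flush hint. Here is a minimal Python sketch of that packing. It assumes the Intel SDM layout (base in bits 51:12, PCID in bits 11:0, no-flush hint in bit 63); the helper names and the sample address are hypothetical.

```python
PCID_MASK = 0xFFF
ADDR_MASK = ((1 << 52) - 1) & ~0xFFF   # bits 51:12
NOFLUSH = 1 << 63                      # honored only when CR4.PCIDE = 1

def make_cr3(table_phys: int, pcid: int = 0, noflush: bool = False) -> int:
    """Build the value for MOV CR3 from a top-level table address and a PCID."""
    if table_phys & ~ADDR_MASK:
        raise ValueError("table address must be 4 KiB aligned and below 2**52")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID is a 12-bit field")
    return table_phys | pcid | (NOFLUSH if noflush else 0)

def split_cr3(value: int) -> dict:
    """Inverse of make_cr3, for inspecting a CR3 write."""
    return {
        "table_phys": value & ADDR_MASK,
        "pcid": value & PCID_MASK,
        "noflush": bool(value & NOFLUSH),
    }

# Hypothetical: switch to a table at 0x1234000 with PCID 5, keeping its TLB entries.
print(hex(make_cr3(0x1234000, pcid=5, noflush=True)))  # 0x8000000001234005
```

Bit 63 is a property of the write, not of the register: reading CR3 back does not return it.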

SIMD Registers: SSE/AVX/AVX-512

  Register       Width    ISA          Introduced
  ─────────────────────────────────────────────────────
  XMM0-XMM15    128-bit  SSE/SSE2     Pentium III (1999) / Pentium 4 (2000); XMM8-15 need 64-bit mode
  YMM0-YMM15    256-bit  AVX/AVX2     Sandy Bridge (2011) / Haswell (2013)
  ZMM0-ZMM31    512-bit  AVX-512      Knights Landing (2016), Skylake Xeon (2017), Zen 4 (2022)
  K0-K7          64-bit  AVX-512      Predicate/mask registers (16 bits used by AVX-512F, 64 with AVX-512BW)

  Physical relationship:
  ZMM0: [bits 511:256=upper AVX-512 half | bits 255:128=YMM0 upper | bits 127:0=XMM0]
        Writing XMM0 (legacy SSE encoding): upper bits 511:128 are PRESERVED
        Writing XMM0 (VEX/EVEX encoding):   upper bits 511:128 are ZEROED
        Writing YMM0 (VEX/EVEX encoding):   upper bits 511:256 are ZEROED
        Writing ZMM0: full 512-bit update

  AVX-512 features (varies by CPU):
  - 32 ZMM registers (vs 16 YMM in AVX2)
  - Mask registers (K0-K7): per-element predication
  - Embedded broadcast: {1to8}, {1to16} → broadcast a scalar to all vector lanes
  - Scatter/gather: ZMM-indexed loads/stores to non-contiguous memory

  AVX-512 penalty on Intel (most pronounced on Skylake-era server parts):
    Executing heavy AVX-512 instructions can cause the core to drop clock frequency
    (power: 512-bit SIMD consumes significantly more power)
    Recovery latency after the last AVX-512 instruction: on the order of a millisecond
    → Can HURT performance for code that mixes AVX-512 and non-SIMD
    → Heavy AVX-512 loops: sustained throughput gain (e.g., 2× for DGEMM)
    → Sparse AVX-512 use: frequency penalty + transition overhead = net loss

System V AMD64 ABI Calling Convention

The System V AMD64 ABI is the standard calling convention for Linux, macOS, the BSDs, and most other Unix-like systems on x86-64. Windows uses a different convention (the Microsoft x64 calling convention, with 4 register args: RCX, RDX, R8, R9).

Integer/Pointer Arguments

Argument #  Register  Preserved by callee?
──────────────────────────────────────────────
  1          RDI       No (caller-saved)
  2          RSI       No (caller-saved)
  3          RDX       No (caller-saved)
  4          RCX       No (caller-saved)
  5          R8        No (caller-saved)
  6          R9        No (caller-saved)
  7+         Stack     (pushed right-to-left before CALL)

Return values:
  RAX  : Primary return value (integer/pointer)
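  RDX  : Secondary return value (upper half of 128-bit integers; second eightbyte of small structs)
  XMM0 : Floating-point return value
  Structs larger than 16 bytes are returned in memory through a hidden pointer passed in RDI.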

Floating-Point Arguments

  FP/SSE arg 1   XMM0
  FP/SSE arg 2   XMM1
  ...
  FP/SSE arg 8   XMM7
  FP/SSE arg 9+  Stack
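
The two register sequences are consumed independently, which is the part people most often get wrong. The sketch below assigns locations for scalar arguments only. It is a simplification under stated assumptions: each argument is classified as either `"int"` (INTEGER class) or `"sse"` (SSE class), and structs, `long double`, and varargs are ignored. The name `assign_args` and the `stack+N` labels (byte offset from the first stack-argument slot) are illustrative.

```python
INT_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
SSE_REGS = [f"xmm{i}" for i in range(8)]

def assign_args(kinds):
    """Map a list of 'int'/'sse' argument classes to registers or stack slots."""
    free = {"int": list(INT_REGS), "sse": list(SSE_REGS)}
    locations, stack_offset = [], 0
    for kind in kinds:
        if kind not in free:
            raise ValueError(f"unsupported argument class: {kind!r}")
        if free[kind]:
            # Each class draws from its own register sequence.
            locations.append(free[kind].pop(0))
        else:
            # Registers of this class are exhausted: use the next 8-byte stack slot.
            locations.append(f"stack+{stack_offset}")
            stack_offset += 8
    return locations

# f(long a, double b, long c, double d)
print(assign_args(["int", "sse", "int", "sse"]))  # ['rdi', 'xmm0', 'rsi', 'xmm1']
```

Note that the second integer argument still lands in RSI even though a floating-point argument sits between the two: positions in the argument list do not reserve registers of the other class.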

Callee-Saved Registers (must be preserved across a function call)

  RBX, RBP, R12, R13, R14, R15
  → A function may use these registers only if it saves/restores them (push/pop or explicit save)

  XMM0-XMM7 : caller-saved (not preserved)
  XMM8-XMM15: caller-saved (not preserved)
  → All XMM/YMM/ZMM registers are caller-saved in SysV ABI
  → Code that dirtied the upper half of any YMM/ZMM register should execute VZEROUPPER before
    calling or returning to code that may use legacy SSE, to avoid AVX-SSE transition penalties

Stack Frame and Red Zone

  RSP immediately before CALL: must be 16-byte aligned
  RSP at function entry:       RSP % 16 == 8 (CALL pushed the 8-byte return address)
  RSP in function body:        must be 16-byte aligned again before issuing another CALL

  Red zone: The 128 bytes BELOW RSP are "owned" by the current function.
            Signal handlers will not clobber this region (the kernel skips it when
            it builds the signal frame).
            Leaf functions (no calls) can use the red zone without adjusting RSP.
            The Linux kernel is compiled with -mno-red-zone: an interrupt arriving in
            kernel mode pushes its frame at the current RSP and would overwrite the red zone.

   High addresses
   ┌─────────────────────────┐
   │     Caller's frame      │
   ├─────────────────────────┤ ← RSP before CALL
   │  Return address (8B)    │
   ├─────────────────────────┤ ← RSP after CALL (= current function entry RSP)
   │  Callee frame (locals)  │
   ├─────────────────────────┤ ← RSP after local allocation
   │  Red zone (128 bytes)   │ ← usable by leaf functions without SUB RSP
   └─────────────────────────┘
   Low addresses
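
The alignment rule turns into simple arithmetic when sizing a frame. As a rough sketch (the function names are hypothetical and the model ignores alignment requirements of individual locals beyond 16 bytes), the amount to subtract from RSP can be computed like this:

```python
def frame_size(locals_bytes: int, pushed_regs: int = 0) -> int:
    """Bytes to SUB from RSP so that RSP is 16-byte aligned before the next CALL.

    At entry RSP % 16 == 8 because CALL pushed the return address; every
    callee-saved register pushed afterwards moves RSP by another 8 bytes.
    """
    used = 8 + 8 * pushed_regs + locals_bytes
    padding = -used % 16
    return locals_bytes + padding

def fits_red_zone(locals_bytes: int, is_leaf: bool) -> bool:
    """A leaf function can keep up to 128 bytes of locals below RSP without SUB."""
    return is_leaf and locals_bytes <= 128

print(frame_size(24, pushed_regs=1))  # push rbp + 24 bytes of locals -> sub rsp, 32
```

This is why compilers often emit `sub rsp, 8` in functions that have no locals at all: the 8 bytes exist only to restore alignment before a call.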

syscall vs function call argument differences

Function call ABI (System V AMD64):
  Arg1=RDI, Arg2=RSI, Arg3=RDX, Arg4=RCX, Arg5=R8, Arg6=R9

SYSCALL ABI (Linux x86-64):
  Syscall#=RAX, Arg1=RDI, Arg2=RSI, Arg3=RDX, Arg4=R10, Arg5=R8, Arg6=R9
  Note: RCX → R10 (because SYSCALL destroys RCX — used to save return address)
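
A raw syscall stub therefore has to move the fourth argument from RCX to R10 before executing SYSCALL. The sketch below only builds the register assignment. The helper name `syscall_registers` is illustrative, and the `mmap` argument values are example constants (PROT_READ|PROT_WRITE = 3, MAP_PRIVATE|MAP_ANONYMOUS = 0x22 on Linux).

```python
SYSCALL_ARG_REGS = ("rdi", "rsi", "rdx", "r10", "r8", "r9")
CLOBBERED_BY_SYSCALL = ("rcx", "r11")  # overwritten with the return RIP and RFLAGS

def syscall_registers(number: int, *args: int) -> dict:
    """Return the register values the Linux x86-64 syscall ABI expects."""
    if len(args) > len(SYSCALL_ARG_REGS):
        raise ValueError("Linux syscalls take at most 6 arguments")
    regs = {"rax": number}
    regs.update(zip(SYSCALL_ARG_REGS, args))
    return regs

# mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) is syscall 9
print(syscall_registers(9, 0, 4096, 3, 0x22, -1, 0))
```

After the instruction returns, RAX holds the result (a negative errno value on failure) and RCX and R11 must be treated as garbage.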

SYSCALL/SYSRET: Fast System Call Mechanism

The SYSCALL/SYSRET instruction pair is the modern x86-64 mechanism for transitioning between user mode (ring 3) and kernel mode (ring 0). It replaced the older INT 0x80 / IRET path (INT80 is still supported for 32-bit compatibility on Linux).

MSRs Controlling SYSCALL/SYSRET

MSR 0xC0000080 (EFER — Extended Feature Enable Register):
  Bit 0 (SCE) = System Call Extensions: must be 1 to enable SYSCALL/SYSRET
  (Bit 8 is LME, long mode enable; bit 10 is LMA; bit 11 is NXE)

MSR 0xC0000081 (STAR — System Target Address Register):
  Bits [63:48]: SYSRET CS and SS selectors
  Bits [47:32]: SYSCALL CS and SS selectors
  Bits [31:0]:  (32-bit syscall target in compat mode, not used in pure 64-bit)

MSR 0xC0000082 (LSTAR — Long mode System Target Address Register):
  Bits [63:0]: Linear address of kernel entry point for SYSCALL in 64-bit mode
  = Address of kernel's syscall handler (e.g., `entry_SYSCALL_64` in Linux)
  Set during kernel initialization: wrmsr(MSR_LSTAR, (uint64_t)entry_SYSCALL_64)

MSR 0xC0000083 (CSTAR — Compat mode System Target Address Register):
  Kernel entry for 32-bit code issuing SYSCALL (used by 32-bit processes on 64-bit kernel)

MSR 0xC0000084 (SFMASK — SYSCALL FLAGS MASK):
  On SYSCALL: RFLAGS &= ~SFMASK
  Typically: SFMASK = 0x47700 (clears: TF, DF, IF, IOPL, AC, NT)
  → Disables interrupts at kernel entry, clears trap flag, ensures clean RFLAGS
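
The selector arithmetic hidden in STAR and the effect of SFMASK can be worked through in a few lines. The sketch below follows the SYSCALL/SYSRET pseudocode in the Intel SDM. The function names are illustrative, and the sample STAR value (base selectors 0x10 and 0x23) reflects the GDT layout Linux is understood to use, which should be treated as an assumption.

```python
def star_selectors(star: int) -> dict:
    """Derive the selectors SYSCALL and 64-bit SYSRET load from a STAR value."""
    syscall_base = (star >> 32) & 0xFFFF
    sysret_base = (star >> 48) & 0xFFFF
    return {
        "syscall_cs": syscall_base & 0xFFFC,            # RPL forced to 0
        "syscall_ss": (syscall_base + 8) & 0xFFFF,
        "sysret_cs64": ((sysret_base + 16) & 0xFFFF) | 3,  # RPL forced to 3
        "sysret_ss": ((sysret_base + 8) & 0xFFFF) | 3,
    }

def rflags_after_syscall(rflags: int, sfmask: int) -> int:
    """SYSCALL clears every RFLAGS bit that is set in SFMASK."""
    return rflags & ~sfmask

star = (0x23 << 48) | (0x10 << 32)   # assumed Linux-style value
print({k: hex(v) for k, v in star_selectors(star).items()})
print(hex(rflags_after_syscall(0x246, 0x47700)))  # IF cleared -> 0x46
```

The +8 and +16 offsets are why the GDT must place the kernel stack segment right after the kernel code segment, and the user data and 64-bit user code segments in that order after the SYSRET base selector.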

SYSCALL Instruction Execution Sequence

User executes SYSCALL instruction:

Step 1: Save state
  RCX ← RIP  (return address: next instruction after SYSCALL)
  R11 ← RFLAGS (saved user flags)

Step 2: Load kernel entry
  RIP ← LSTAR MSR  (kernel entry point)
  RFLAGS ← RFLAGS & ~SFMASK  (clear interrupt flag etc.)
  CS ← STAR[47:32] (kernel code selector, ring 0)
  SS ← STAR[47:32] + 8 (kernel stack selector)

Step 3: Kernel executes
  (RSP still points to user stack! Kernel must switch to kernel stack immediately)
  Linux entry_SYSCALL_64 (simplified sketch; per-CPU slot names are illustrative):
    swapgs                          ; GS.Base → per-CPU kernel data
    mov gs:[user_rsp_scratch], rsp  ; stash user RSP in a per-CPU slot
    mov rsp, gs:[kernel_stack_top]  ; load this task's kernel stack
    push r11                        ; save user RFLAGS (captured by SYSCALL)
    push rcx                        ; save user return address (instruction after SYSCALL)
    push rdi ... r9                 ; save syscall arguments (builds struct pt_regs)
    call do_syscall_64              ; dispatch on the syscall number in RAX
    pop ...                         ; restore user registers and user RSP
    swapgs                          ; restore user GS.Base
    sysretq                         ; return to user mode

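Step 4: SYSRET Instruction (64-bit return, SYSRETQ)
  RIP ← RCX  (restore user return address)
  RFLAGS ← (R11 & 0x3C7FD7) | 2  (restore user flags; RF and VM cleared, reserved bit 1 set)
  CS ← (STAR[63:48] + 16) | 3  (64-bit user code selector, ring 3)
  SS ← (STAR[63:48] + 8) | 3   (user stack selector)
  Note: SYSRET does not touch GS.Base or RSP. The kernel executes SWAPGS and
        restores the user RSP itself before issuing SYSRET.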

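SYSRET Security Bug (CVE-2012-0217): On Intel CPUs, SYSRET with a non-canonical RCX (i.e., RCX bits [63:48] are not a sign extension of bit 47) raises #GP (General Protection Fault) while the CPU is still at CPL=0. By that point the kernel has typically already restored the user-controlled RSP, so the #GP handler runs in ring 0 on a stack chosen by the attacker. This is exploitable for privilege escalation. AMD CPUs raise the fault only after the switch to ring 3. Linux and all major OSes now verify that the return address is canonical before using SYSRET.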
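
The canonical-address test at the heart of this mitigation is a sign-extension check. A minimal Python sketch follows. The function names are hypothetical, `va_bits` is 48 for 4-level paging and 57 with LA57, and the code illustrates the rule rather than the kernel's actual check.

```python
MASK64 = (1 << 64) - 1

def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..va_bits-1 of `addr` are all zeros or all ones."""
    top = (addr & MASK64) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones

def canonicalize(addr: int, va_bits: int = 48) -> int:
    """Sign-extend bit va_bits-1 into the upper bits."""
    addr &= (1 << va_bits) - 1
    if addr >> (va_bits - 1):
        addr |= MASK64 & ~((1 << va_bits) - 1)
    return addr

print(is_canonical(0x00007FFFFFFFFFFF))  # True: top of the user half
print(is_canonical(0x0000800000000000))  # False: first non-canonical address
```

The attack relied on placing a SYSCALL instruction at the very end of the user half, so that the saved return address (the next instruction) was the first non-canonical address.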

x86 TSO Memory Model

x86 implements Total Store Order (TSO): close to sequential consistency, with one relaxation. A store can sit in the core's store buffer while a later load to a different address completes, so the store may become globally visible after that load.

Allowed and Forbidden Reorderings

  Operation pair     Allowed to reorder?   Notes
  ──────────────────────────────────────────────────────────────
  Load → Load        NO                   Memory reads are ordered
  Store → Store      NO                   Memory writes are ordered (FIFO store buffer)
  Load → Store       NO                   Load before store is maintained
  Store → Load       YES!                 A store can be reordered AFTER a subsequent load
                                          This is the TSO relaxation
                                          Root cause: store buffer — stores sit in
                                          the store buffer while later loads proceed

The TSO store-load reordering:
  Thread 0:           Thread 1:
  store x = 1         store y = 1
  r0 = load y         r1 = load x

  Under TSO: r0=0, r1=0 is possible!
  (Both threads' stores are in their store buffers, not yet globally visible,
   when both loads execute — each thread sees its own store but not the other's)
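
The litmus test above can be checked exhaustively with a toy model. The Python sketch below is an abstract store-buffer machine, not a description of any real pipeline: each thread has a FIFO store buffer, loads forward from the thread's own buffer, and a `fence` operation (standing in for MFENCE) can only retire once the buffer is empty. All names are illustrative.

```python
def outcomes(programs):
    """Return every final register assignment reachable under the store-buffer model."""
    n = len(programs)
    results = set()

    def replace(tup, i, item):
        return tup[:i] + (item,) + tup[i + 1:]

    def explore(pcs, buffers, memory, regs):
        progressed = False
        for t in range(n):
            # Choice 1: the oldest buffered store of thread t reaches memory.
            if buffers[t]:
                progressed = True
                addr, val = buffers[t][0]
                explore(pcs, replace(buffers, t, buffers[t][1:]),
                        {**memory, addr: val}, regs)
            if pcs[t] == len(programs[t]):
                continue
            # Choice 2: thread t executes its next instruction.
            op = programs[t][pcs[t]]
            next_pcs = replace(pcs, t, pcs[t] + 1)
            if op[0] == "store":
                progressed = True
                entry = (op[1], op[2])
                explore(next_pcs, replace(buffers, t, buffers[t] + (entry,)), memory, regs)
            elif op[0] == "load":
                progressed = True
                # Forward from the thread's own newest buffered store, else read memory.
                own = [v for a, v in buffers[t] if a == op[1]]
                val = own[-1] if own else memory.get(op[1], 0)
                explore(next_pcs, buffers, memory, {**regs, op[2]: val})
            elif op[0] == "fence":
                # A full fence retires only once this thread's buffer has drained.
                if not buffers[t]:
                    progressed = True
                    explore(next_pcs, buffers, memory, regs)
            else:
                raise ValueError(f"unknown op {op!r}")
        if not progressed:
            results.add(tuple(sorted(regs.items())))

    explore((0,) * n, ((),) * n, {}, {})
    return results

def store_buffering(fenced):
    """Build the two-thread litmus test, optionally with a fence after each store."""
    fence = [("fence",)] if fenced else []
    t0 = [("store", "x", 1), *fence, ("load", "y", "r0")]
    t1 = [("store", "y", 1), *fence, ("load", "x", "r1")]
    return [t0, t1]

both_zero = (("r0", 0), ("r1", 0))
print(both_zero in outcomes(store_buffering(fenced=False)))  # True
print(both_zero in outcomes(store_buffering(fenced=True)))   # False
```

Without fences the model reaches all four combinations of r0 and r1; with a fence between each store and load, only the r0=0, r1=0 outcome disappears.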

Memory Barriers / Fence Instructions

  MFENCE  : Full memory fence. Ensures all loads/stores before MFENCE
             complete before any loads/stores after MFENCE.
             Cost: ~20-100 cycles (serializes the store buffer flush)

  SFENCE  : Store fence. Orders stores only. Later stores cannot
             pass earlier stores. Used with non-temporal stores (MOVNT).
             Cost: ~5-15 cycles

  LFENCE  : Load fence. Prevents later loads from issuing before prior loads.
             Also prevents speculative execution from passing the LFENCE
             (used as speculation barrier in Spectre mitigations).
             Cost: ~3-10 cycles

  LOCK prefix: Atomically performs the operation AND acts as a full fence.
  (e.g., LOCK CMPXCHG, LOCK XADD, LOCK XCHG)
  XCHG with a memory operand: always has implicit LOCK semantics (even without prefix).

  Practical note:
    C11/C++11 atomic operations map onto x86 as follows (typical compiler output):
    memory_order_seq_cst store → XCHG (or MOV followed by MFENCE)
    memory_order_seq_cst load  → plain MOV
    memory_order_acquire load  → plain MOV (loads are already ordered)
    memory_order_release store → plain MOV (stores are already ordered; NT stores need SFENCE)
    → x86 TSO is "almost free" for acquire/release; the main cost is seq_cst stores

x86-64 Register Map Table

  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                        x86-64 REGISTER MAP                                  │
  ├──────────┬───────────┬────────┬────────┬──────────────────────────────────┤
  │  64-bit  │  32-bit   │ 16-bit │ 8-bit  │        Primary Use               │
  ├──────────┼───────────┼────────┼────────┼──────────────────────────────────┤
  │  RAX     │   EAX     │   AX   │ AH/AL  │ Return value, syscall number     │
  │  RBX     │   EBX     │   BX   │ BH/BL  │ Callee-saved, base register      │
  │  RCX     │   ECX     │   CX   │ CH/CL  │ Arg4 (function), saved by SYSCALL│
  │  RDX     │   EDX     │   DX   │ DH/DL  │ Arg3 (function), return value 2  │
  │  RSI     │   ESI     │   SI   │  SIL   │ Arg2 (function), string source   │
  │  RDI     │   EDI     │   DI   │  DIL   │ Arg1 (function), string dest     │
  │  RBP     │   EBP     │   BP   │  BPL   │ Callee-saved, frame pointer      │
  │  RSP     │   ESP     │   SP   │  SPL   │ Stack pointer                    │
  │  R8      │   R8D     │  R8W   │  R8B   │ Arg5 (function/syscall)          │
  │  R9      │   R9D     │  R9W   │  R9B   │ Arg6 (function/syscall)          │
  │  R10     │   R10D    │  R10W  │  R10B  │ Arg4 (syscall), caller-saved     │
  │  R11     │   R11D    │  R11W  │  R11B  │ Saved RFLAGS (SYSCALL), scratch  │
  │  R12     │   R12D    │  R12W  │  R12B  │ Callee-saved                     │
  │  R13     │   R13D    │  R13W  │  R13B  │ Callee-saved                     │
  │  R14     │   R14D    │  R14W  │  R14B  │ Callee-saved                     │
  │  R15     │   R15D    │  R15W  │  R15B  │ Callee-saved                     │
  ├──────────┼───────────┴────────┴────────┴──────────────────────────────────┤
  │  RIP     │  Instruction pointer                                            │
  │  RFLAGS  │  Condition codes + control flags (CF, ZF, SF, OF, IF, DF, TF) │
  ├──────────┴──────────────────────────────────────────────────────────────────┤
  │  Segment │  CS DS ES FS GS SS  (FS/GS retain base addresses in 64-bit mode)│
  ├───────────────────────────────────────────────────────────────────────────┤
  │  SIMD    │  XMM0-15 (128b)  YMM0-15 (256b)  ZMM0-31 (512b, AVX-512)     │
  │          │  K0-K7 (mask registers, up to 64-bit, AVX-512)                 │
  ├───────────────────────────────────────────────────────────────────────────┤
  │  Control │  CR0 CR2 CR3 CR4 CR8  (privileged, ring 0 only)               │
  │  Debug   │  DR0-DR3 (breakpoint address), DR6 (status), DR7 (control)    │
  │  MSR     │  Accessed via RDMSR/WRMSR (ring 0). Key: EFER, STAR, LSTAR,  │
  │          │  SFMASK, FS.Base, GS.Base, KernelGSBase, IA32_SPEC_CTRL      │
  └───────────────────────────────────────────────────────────────────────────┘

Intel CET: Control-Flow Enforcement Technology

CET provides hardware enforcement of forward-edge and backward-edge control flow integrity (CFI).

Shadow Stack (backward-edge CFI)

  Normal stack:                Shadow stack (new, in pages marked as shadow-stack memory):
  ┌─────────────────┐         ┌─────────────────┐
  │ return address  │         │ return address  │ (copy of return addr on CALL)
  │ local vars      │         └─────────────────┘
  │ ...             │
  └─────────────────┘

  On CALL:  Push return addr to BOTH normal stack AND shadow stack
  On RET:   Compare popped return addr (from normal stack)
            with top of shadow stack
            → MISMATCH → #CP (Control Protection Exception)

  Shadow stack: not writable by ordinary stores (a normal MOV to it faults)
               written only by CALL and dedicated instructions (WRSS/WRUSS, RSTORSSP, SAVEPREVSSP)
               Prevents ROP (Return-Oriented Programming) attacks
               Attacker cannot overwrite shadow stack via buffer overflow
               (it lives in separate pages with a special page-table encoding: R/W=0, Dirty=1)
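
The CALL/RET comparison can be modeled in a few lines. This Python sketch is a toy: the class and function names are hypothetical, the normal stack holds only return addresses, and a Python exception stands in for #CP.

```python
class ControlProtectionFault(Exception):
    """Stands in for the #CP exception."""

class ShadowStackCpu:
    def __init__(self):
        self.stack = []         # normal stack: attacker-writable in this model
        self.shadow_stack = []  # shadow stack: only call()/ret() touch it

    def call(self, return_address: int) -> None:
        # CALL pushes the return address onto both stacks.
        self.stack.append(return_address)
        self.shadow_stack.append(return_address)

    def ret(self) -> int:
        # RET pops both copies and faults when they disagree.
        target = self.stack.pop()
        expected = self.shadow_stack.pop()
        if target != expected:
            raise ControlProtectionFault(
                f"return to {target:#x}, shadow stack says {expected:#x}")
        return target

def returns_cleanly(cpu: ShadowStackCpu) -> bool:
    """Execute one RET and report whether it completed without a fault."""
    try:
        cpu.ret()
        return True
    except ControlProtectionFault:
        return False

cpu = ShadowStackCpu()
cpu.call(0x401000)
cpu.stack[-1] = 0xDEADBEEF       # simulated stack-smashing overwrite
print(returns_cleanly(cpu))      # False: the mismatch is detected at RET
```

The model also shows the limit of the mechanism: it protects return addresses only, so corrupted function pointers need forward-edge protection.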

Indirect Branch Tracking (IBT, forward-edge CFI)

  All valid indirect branch targets MUST begin with ENDBR64 (or ENDBR32) instruction.
  ENDBR64 is a NOP on CPUs without CET. On CET-enabled CPUs:
    If an indirect JMP or CALL lands anywhere that is NOT an ENDBR64 instruction
    → #CP exception

  ENDBR64 encoding: F3 0F 1E FA  (4 bytes)
  Restricts code reuse (JOP/COP): even if an attacker redirects an indirect branch,
  the target must start with ENDBR64, so it cannot land in the middle of a function.

  CET IBT with Linux:
    GCC -fcf-protection=full (or =branch): emits ENDBR64 at function entries and other indirect-branch targets
    Kernel: CONFIG_X86_KERNEL_IBT enables CET IBT for kernel

  Limitation: still allows ROP-like attacks using ENDBR64-prefixed gadgets
  (any existing function entry is a valid target). Does not prevent
  Spectre gadgets in the speculative path.
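
Checking whether an address is a legal IBT landing pad comes down to comparing four bytes. The sketch below does that on a raw byte string. The function names and the sample machine code (a trivial function body) are illustrative.

```python
ENDBR64 = bytes.fromhex("f30f1efa")
ENDBR32 = bytes.fromhex("f30f1efb")

def is_valid_ibt_target(code: bytes, offset: int, long_mode: bool = True) -> bool:
    """True if an indirect branch to `offset` would land on an ENDBR instruction."""
    if offset < 0:
        return False
    marker = ENDBR64 if long_mode else ENDBR32
    return code[offset:offset + 4] == marker

def find_ibt_targets(code: bytes) -> list:
    """All byte offsets where the ENDBR64 pattern occurs (including accidental ones)."""
    return [i for i in range(len(code) - 3) if code[i:i + 4] == ENDBR64]

# endbr64; push rbp; mov rbp, rsp; pop rbp; ret
code = bytes.fromhex("f30f1efa" "55" "4889e5" "5d" "c3")
print(is_valid_ibt_target(code, 0))  # True: function entry
print(is_valid_ibt_target(code, 4))  # False: lands on push rbp
```

A byte-level scan like `find_ibt_targets` also finds the pattern inside immediates or data, which is one reason toolchains take care not to emit those four bytes by accident.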

SMEP, SMAP, and Privilege Separation

  Ring Protection:
    Ring 0 (CPL=0): Kernel mode — can execute any instruction, access any memory
    Ring 3 (CPL=3): User mode — cannot execute privileged instructions
    Rings 1,2: Not used in modern OS designs (drivers run in ring 0)

  SMEP (Supervisor Mode Execution Prevention, CR4[20]):
    Ring 0 cannot execute a page that is user-accessible (U/S bit set in PTE).
    Prevents kernel from being redirected to execute shellcode in user buffers.
    Attack prevented: kernel write primitive + CALL to user buffer.
    Linux has enabled SMEP since 3.0 (if the CPU supports it).

  SMAP (Supervisor Mode Access Prevention, CR4[21]):
    Ring 0 cannot implicitly READ or WRITE user-accessible pages.
    To access user memory, kernel must explicitly set RFLAGS.AC bit (via STAC instruction)
    and clear it afterwards (via CLAC).
    Prevents: kernel accidentally following a user-controlled pointer into user data.
    Also blocks exploits that stage fake kernel objects or ROP stacks in user memory.
    Linux has supported SMAP since 3.7.

    Kernel usage pattern:
      copy_from_user():  stac; mov [user_ptr]; clac
      copy_to_user():    stac; mov [user_ptr] = val; clac

  PKU (Protection Keys for Userspace, CR4[22]):
    Each page table entry has a 4-bit "protection key" field (bits 62:59).
    A per-thread PKRU register (accessible without syscall via RDPKRU/WRPKRU)
    stores 32 bits: 2 bits per key (AD=access-disable, WD=write-disable) for 16 keys.
    Application can mark memory regions with a key and toggle access without TLB flush.
    Exposed on Linux via pkey_alloc()/pkey_mprotect(); used by in-process isolation research such as ERIM.

    Performance advantage: WRPKRU takes on the order of tens of cycles, versus a full syscall (and possible TLB shootdown) for mprotect
    Useful for: JIT sandboxes, safe memory partitioning within a process without syscall overhead
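
The PKRU layout is compact enough to decode by hand: key k owns bits 2k (AD) and 2k+1 (WD). A minimal Python sketch of that encoding follows. The function names are hypothetical, and 0x55555554 is used as the sample because it is the default Linux is understood to install (every key except key 0 access-disabled).

```python
def pkru_rights(pkru: int, key: int) -> dict:
    """Data-access rights that PKRU grants for pages tagged with `key`."""
    if not 0 <= key < 16:
        raise ValueError("protection keys are 4 bits (0..15)")
    access_disable = (pkru >> (2 * key)) & 1
    write_disable = (pkru >> (2 * key + 1)) & 1
    return {"read": not access_disable,
            "write": not access_disable and not write_disable}

def pkru_set(pkru: int, key: int, *, access_disable: bool, write_disable: bool) -> int:
    """Return a new PKRU value with the two bits for `key` replaced."""
    if not 0 <= key < 16:
        raise ValueError("protection keys are 4 bits (0..15)")
    shift = 2 * key
    pkru &= ~(0b11 << shift)
    pkru |= (int(access_disable) | (int(write_disable) << 1)) << shift
    return pkru & 0xFFFFFFFF

default = 0x55555554
print(pkru_rights(default, 0))  # key 0: readable and writable
print(pkru_rights(default, 1))  # key 1: no data access
```

PKRU restricts data accesses only. Instruction fetches are not affected, so a key cannot by itself make a region non-executable.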

Debugging Notes

# Dump all registers via GDB
gdb ./program
(gdb) break main
(gdb) run
(gdb) info registers        # all GPRs + flags + segment registers (info all-registers adds FP/SIMD)
(gdb) p/x $eflags           # flags in hex; GDB names the register eflags (decode: bit 9=IF, bit 7=SF, bit 6=ZF, bit 0=CF)
(gdb) x/32xg $rsp           # dump stack as 64-bit words

# Read/write MSRs from Linux (requires root)
# Install msr-tools: apt install msr-tools
modprobe msr
rdmsr 0xC0000082  # read LSTAR (kernel SYSCALL entry address)
rdmsr 0xC0000080  # read EFER
# Example: LSTAR = ffffffffb1c00000 → Linux kernel entry_SYSCALL_64 address (KASLR'd)

# Check SMEP/SMAP status
grep -o -w -E 'smep|smap' /proc/cpuinfo | sort -u

# Check CET support
grep -o -w -E 'user_shstk|shstk|ibt' /proc/cpuinfo | sort -u

# Debug SYSCALL path with strace
strace -e trace=openat,read,write,mmap ./program
# syscall numbers for x86-64: asm/unistd_64.h (e.g., /usr/include/x86_64-linux-gnu/asm/unistd_64.h on Debian/Ubuntu)
# read=0, write=1, open=2, mmap=9, brk=12, rt_sigaction=13, ...

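# ftrace kernel function tracing (entry_SYSCALL_64 itself is assembly and cannot be
# traced, so filter on a syscall handler such as __x64_sys_openat)
echo __x64_sys_openat > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace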

# Inspect CET shadow stack mappings (shadow-stack VMAs carry the "ss" flag in smaps VmFlags)
grep -E '^VmFlags:.* ss( |$)' /proc/$$/smaps

Security Implications

Kernel stack exhaustion (stack overflow): The Linux kernel stack on x86-64 is 16 KB by default (it was 8 KB before Linux 3.15). Without guard pages, deep recursion in kernel context overflows into whatever lies below the stack, causing silent corruption; CONFIG_VMAP_STACK adds guard pages that turn the overflow into a fault. Real exploits (e.g., CVE-2016-1583 ecryptfs stack overflow) used this to overwrite adjacent kernel data.
SMAP bypass: If a kernel function stores a user-controlled pointer in memory and later dereferences it without stac/clac protection (or with a SMAP-disabled window that's too wide), an attacker can redirect the kernel to read/write attacker-controlled data.
KASLR (Kernel Address Space Layout Randomization): The kernel is loaded at a random base address in virtual memory. On x86-64 Linux the text base is randomized in 2 MiB steps within a window of about 1 GiB, which gives only around 9 bits of entropy. Combined with SMEP/SMAP, it forces an attacker to leak a kernel address before exploiting control-flow redirection.
KPTI and the PCID optimization: Without PCID support, KPTI requires a full TLB flush on every kernel entry/exit (every syscall). With PCID (CR4[17]), the kernel and user page tables each get a distinct PCID. TLB entries are tagged by PCID, so no full flush is needed when switching between the two CR3 values for the same process.
Return address protection: Intel's CET shadow stack is the x86-64 counterpart to ARM's Pointer Authentication Codes (PAC). ARM PAC signs return addresses cryptographically; CET keeps a second copy on a dedicated shadow stack. Both target return-address corruption (ROP); forward-edge attacks (JOP) are covered by IBT on x86 and BTI on ARM.

Performance Implications

Register pressure and callee-saves: Functions that use RBX/RBP/R12-R15 must save/restore them. High register pressure from many local variables forces spills to stack, increasing memory traffic. Compilers use R8-R11 (caller-saved) first to avoid this.
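VZEROUPPER overhead: Code mixing AVX (256-bit YMM) with legacy SSE (128-bit XMM) instructions should execute VZEROUPPER at the boundary to avoid AVX-SSE transition penalties on Intel CPUs (a costly state save/restore before Skylake, false dependencies on the upper halves from Skylake on). Compilers insert it automatically, and the instruction itself is cheap.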
REX prefix encoding: Instructions using R8-R15, 64-bit operand size, or the byte registers SIL/DIL/BPL/SPL require a REX prefix byte (and an instruction with a REX prefix cannot address AH/BH/CH/DH). The extra byte lowers code density, slightly reducing effective fetch bandwidth. Not typically a performance issue on modern wide-fetch CPUs.
Partial register writes: Writing AL (low 8 bits of RAX) and then reading RAX forces the CPU to merge old and new bits. Depending on the microarchitecture this shows up as a false dependency on the previous RAX value or as an inserted merge micro-op (older Intel cores stalled for several cycles). Prefer 32-bit writes (which zero-extend) over 8-bit writes in critical paths.
MFENCE cost: MFENCE on modern Intel CPUs costs ~20-100 cycles due to store buffer drain. In lock-free code, prefer LOCK prefix (which is an implicit fence) over separate MFENCE where possible. XCHG [mem], reg has implicit LOCK and provides both the exchange and the fence semantics at once.

Modern Usage

Linux Kernel Architecture-Specific Code

Key x86-64 kernel files:
- arch/x86/entry/entry_64.S: SYSCALL entry point, exception handlers, SWAPGS
- arch/x86/kernel/cpu/common.c: CPU feature detection, SMEP/SMAP/CET enable
- arch/x86/mm/pgtable.c: Page table management, PCID allocation
- arch/x86/include/asm/msr-index.h: All MSR numbers
- arch/x86/include/asm/processor.h: CPU-specific structs, feature flags

# Verify kernel entry point (LSTAR MSR)
sudo rdmsr 0xC0000082
# Then: sudo grep -w entry_SYSCALL_64 /proc/kallsyms
# Addresses should match: kallsyms reports the runtime (KASLR-shifted) address.
# Without root, kallsyms may show zeros because of kptr_restrict.

# Check which security features are enabled
grep -o -w -E 'smep|smap|pku|ospke|user_shstk|ibt|ibrs|ibpb' /proc/cpuinfo | sort -u

Future Directions

APX (Advanced Performance Extensions, announced by Intel in 2023): Adds 16 new 64-bit general-purpose registers (R16-R31), for 32 in total, encoded via the new REX2 prefix and extended EVEX. Reduces register pressure for compilers and spills in large functions. Also adds PUSH2/POP2 instructions that save or restore two registers with one instruction.
AMX (Advanced Matrix Extensions): Introduced with Intel Sapphire Rapids: tile-based matrix multiply instructions. 8 tile registers (TMM0-TMM7), each up to 16 rows × 64 bytes (1 KiB). Instructions such as TDPBSSD (int8) and TDPBF16PS (bf16) run on the TMUL unit and multiply-accumulate whole tiles. Intended to replace hand-written AVX-512 matrix kernels for deep learning inference.
RAO-INT (Remote Atomic Operations on Integer): Intel extension (AADD, AAND, AOR, AXOR) for performing atomic integer updates on memory without first pulling the cache line into the local core. Relevant for NUMA and contended counters, since it allows incrementing a remote counter without a cache-line transfer.
x86S (x86 Simplification): Intel proposal (2023) to remove 16-bit and 32-bit legacy operating modes from future CPUs, leaving a pure 64-bit architecture. It would remove legacy boot and mode-switching complexity. Intel indicated in late 2024 that it is no longer pursuing the proposal.
CET expansion: Future ISA extensions expected to close remaining CET gaps (ENDBR64 gadgets), integrate with hardware capabilities (CHERI-like features) for memory safety without full CHERI adoption.

Exercises

Register tracing: Write an assembly function (in a .asm file or inline __asm__) that computes the sum of an array using as many GPRs as practical as partial-sum accumulators (RSP and the registers holding the array pointer and loop counter are unavailable) to exploit instruction-level parallelism. Verify with objdump -d that the accumulators were not merged. Measure IPC vs a single-accumulator version.
SYSCALL path tracing: Implement a minimal printf-like function that uses the raw SYSCALL instruction (not libc) to invoke write(1, buf, len). Verify the system call number from unistd_64.h (syscall 1). Use strace to confirm the syscall is being made correctly.
TSO memory model test: Implement the store-load reordering test in C using atomic operations with memory_order_relaxed. Run with 2 threads on a 2-core machine. Count how many iterations result in r0=0, r1=0 (the TSO-allowed "both stores delayed" outcome). Add memory_order_seq_cst and verify this outcome becomes impossible.
CR3 switching measurement: Write a kernel module (or use KVM/VMX to observe) that measures the cycle cost of MOV CR3, rax with and without PCID. Specifically: (a) measure TLB flush overhead by accessing a large array before and after CR3 switch, (b) compare PCID-tagged CR3 switch vs non-PCID full flush.
CET shadow stack validation: On a CET-enabled Linux system (user-space shadow stacks need kernel 6.6+ and a glibc with shadow-stack support), compile a program with -fcf-protection=return. Verify shadow stack presence through the "ss" flag in /proc/self/smaps. Attempt to overwrite a return address on the normal stack (via controlled stack overflow in a test function) and confirm the resulting #CP exception kills the process with SIGSEGV.

References

AMD Corporation. (2024). AMD64 Architecture Programmer's Manual. Volumes 1-5. https://developer.amd.com/resources/developer-guides-manuals/
Intel Corporation. (2024). Intel 64 and IA-32 Architectures Software Developer's Manual. Volumes 1-4. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
System V Application Binary Interface: AMD64 Architecture Processor Supplement. (2023). https://gitlab.com/x86-psABIs/x86-64-ABI
Corbet, J. (2018). The current state of kernel page-table isolation. LWN.net. https://lwn.net/Articles/741878/
Intel Corporation. (2020). Control-flow Enforcement Technology Specification. https://www.intel.com/content/www/us/en/develop/download/intel-cet-technology-preview.html
Lipp, M., et al. (2020). Meltdown: Reading Kernel Memory from User Space. Communications of the ACM, 63(6).
Drepper, U. (2004). ELF Handling For Thread-Local Storage. Red Hat.
Fog, A. (2023). Calling Conventions for Different C++ Compilers and Operating Systems. https://www.agner.org/optimize/calling_conventions.pdf
Intel Corporation. (2023). Advanced Performance Extensions (APX) Architecture Specification. https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html