x86-64 Internals: Registers, System Calls, Memory Model, and Security
Prerequisites
- Basic assembly language: instructions, operands, registers
- Virtual memory: page tables, privilege rings, kernel/user distinction
- CPU pipeline (01-cpu-pipeline.md): fetch, decode, execute
- Operating system concepts: system calls, interrupt handling, privilege levels
- Cache coherence (06-cache-coherence.md): memory ordering basics
Technical Overview
x86-64 (also known as AMD64, Intel 64, EM64T, or x86_64) is the 64-bit extension of the x86 ISA and the dominant general-purpose architecture for servers, desktops, and laptops. As of 2024, most server and PC CPUs are x86-64, running Linux, Windows, or macOS (Intel Macs), although Arm-based parts (Apple Silicon, AWS Graviton, Ampere) hold a growing share.
Understanding x86-64 internals at the hardware/kernel level requires knowing:
1. The complete register set and its semantics
2. System call mechanics (SYSCALL/SYSRET instruction pair and MSRs)
3. Privileged control registers and their bits
4. The System V AMD64 ABI calling convention
5. The x86 memory model (TSO) and its implications for lock-free code
6. Security extensions: SMEP, SMAP, CET, PKU, SGX
This is not an assembly language tutorial. The focus is on the mechanisms that a kernel engineer, hypervisor developer, or systems security researcher needs to understand: how the hardware enforces privilege separation, how exceptions propagate, how the transition between user and kernel mode works at the instruction level.
Historical Context
1978 — Intel 8086 (16-bit x86): The original. Segmented addressing, 8 16-bit registers (AX, BX, CX, DX, SI, DI, BP, SP), 20-bit physical address space (1 MB). Used in the IBM PC 5150 (1981).
1985 — Intel 80386 (32-bit i386, IA-32): Extended to 32-bit. Registers expanded to EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP. 32-bit addressing (4 GB) and paging were added to the protected mode and privilege rings introduced by the 80286. This architecture defined the "x86" we know.
1999 — AMD64 design begins: With Intel's planned 64-bit architecture, Itanium (IA-64), incompatible with x86, AMD designed a 64-bit extension that was fully backwards compatible with x86. Lead architects: Mike Uhrig, Fred Weber, and others at AMD.
2003 — AMD Opteron (K8, AMD64 launch): First x86-64 processor. Registers expanded to 64-bit (RAX, RBX, ...), 8 new general-purpose registers (R8-R15), 64-bit virtual addresses (48-bit canonical, extendable to 57-bit with LA57), SYSCALL/SYSRET as fast system call mechanism.
2004 — Intel EM64T (Extended Memory 64 Technology): Intel's implementation of AMD64, later renamed "Intel 64". Identical ISA except for a few minor differences (LAHF/SAHF handling, SYSCALL/SYSRET corner cases, some SMM behavior). Intel adopted AMD's design rather than competing with an incompatible extension.
2011 — AVX (Advanced Vector Extensions): Intel extended SIMD from 128-bit SSE (XMM) to 256-bit (YMM). VEX encoding prefix introduced, enabling 3-operand instructions and non-destructive operations.
2013 — AVX-512: Intel announced 512-bit SIMD (ZMM0-ZMM31) in 2013; it first shipped in Xeon Phi Knights Landing (2016) and Skylake-SP Xeons (2017). Several client generations (Ice Lake, Tiger Lake, Rocket Lake) supported it, but Intel disabled it in client CPUs from Alder Lake (2021) onward. AMD adopted AVX-512 in Zen 4 (2022), executing 512-bit operations on 256-bit datapaths.
2019 — 5-level paging (LA57): 57-bit virtual addresses, giving a 128 PiB virtual address space. Intel introduced it with the Ice Lake generation; AMD added it in Zen 4 (2022).
2020 — Intel CET (Control-flow Enforcement Technology): Shadow stack + Indirect Branch Tracking. First shipped in Tiger Lake (2020). Linux gained kernel IBT support in 5.18 (2022) and user-space shadow stacks in 6.6 (2023).
Complete x86-64 Register Set
General-Purpose Registers (GPRs)
x86-64 has 16 64-bit GPRs. The naming conventions reflect the 8086 heritage:
64-bit  32-bit  16-bit  8-bit high  8-bit low  Conventional use
─────────────────────────────────────────────────────────────────────────────────────
RAX     EAX     AX      AH          AL         Accumulator, return value, syscall number
RBX     EBX     BX      BH          BL         Base, callee-saved
RCX     ECX     CX      CH          CL         Counter, arg4 (overwritten by SYSCALL with the return RIP)
RDX     EDX     DX      DH          DL         Data, arg3 / second return value
RSI     ESI     SI      (none)      SIL        Source index, arg2
RDI     EDI     DI      (none)      DIL        Destination index, arg1
RBP     EBP     BP      (none)      BPL        Base pointer (frame), callee-saved
RSP     ESP     SP      (none)      SPL        Stack pointer
R8      R8D     R8W     (none)      R8B        arg5
R9      R9D     R9W     (none)      R9B        arg6
R10     R10D    R10W    (none)      R10B       Scratch; arg4 in the Linux syscall ABI (replaces RCX)
R11     R11D    R11W    (none)      R11B       Scratch (overwritten by SYSCALL with the saved RFLAGS)
R12     R12D    R12W    (none)      R12B       Callee-saved
R13     R13D    R13W    (none)      R13B       Callee-saved
R14     R14D    R14W    (none)      R14B       Callee-saved
R15     R15D    R15W    (none)      R15B       Callee-saved
Note: Writing to the 32-bit form (EAX, R8D, etc.) ZERO-EXTENDS to 64 bits.
Writing to the 8-bit or 16-bit form does NOT zero-extend (legacy behavior).
This creates a common bug: `mov al, 5` does not clear the upper 56 bits of RAX.
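The write rules in the note can be modelled in plain C on a 64-bit value. This is only a sketch of the architectural behaviour, not code that touches real registers; the helper names (write32, write16, write8_low, write8_high, subreg_demo) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of x86-64 sub-register writes applied to a 64-bit register value.
 * All names here are hypothetical helpers, not part of any real API. */

/* mov eax, imm32: the result is zero-extended, the old upper half is lost. */
static inline uint64_t write32(uint64_t reg, uint32_t v) {
    (void)reg;                       /* previous contents are discarded */
    return (uint64_t)v;
}

/* mov ax, imm16: bits 63:16 are preserved. */
static inline uint64_t write16(uint64_t reg, uint16_t v) {
    return (reg & ~UINT64_C(0xFFFF)) | v;
}

/* mov al, imm8: bits 63:8 are preserved. */
static inline uint64_t write8_low(uint64_t reg, uint8_t v) {
    return (reg & ~UINT64_C(0xFF)) | v;
}

/* mov ah, imm8: only bits 15:8 change. */
static inline uint64_t write8_high(uint64_t reg, uint8_t v) {
    return (reg & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}

/* The bug from the note: after `mov al, 5`, RAX is 5 only if it was zero before. */
static inline void subreg_demo(void) {
    uint64_t rax = UINT64_C(0xDEADBEEF00000000);
    rax = write8_low(rax, 5);
    assert(rax == UINT64_C(0xDEADBEEF00000005));   /* upper 56 bits survive */
    rax = write32(rax, 5);
    assert(rax == 5);                              /* 32-bit write zero-extends */
}
```

In real code the usual fix is to write the 32-bit form (or use MOVZX) whenever the full register will be read afterwards.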
Special-Purpose Registers
Register Width Description
─────────────────────────────────────────────────────────────────
RIP 64-bit Instruction pointer (program counter)
RFLAGS 64-bit Status flags (CF, PF, AF, ZF, SF, TF, IF, DF, OF, IOPL, NT, RF, VM, AC, VIF, VIP, ID)
RFLAGS[21] = ID → if writable, CPUID supported
RFLAGS[9] = IF → interrupt enable flag (CLI/STI)
RFLAGS[8] = TF → trap flag (single-step)
RSP 64-bit Stack pointer (also a GPR but ABI-constrained)
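When reading a raw RFLAGS value from a debugger or a crash dump, a small decoder helps. The sketch below uses the architectural bit positions of the flags listed above; the function and table names are illustrative, not taken from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural RFLAGS bit positions (IOPL is the 2-bit field at 13:12). */
struct flag_name { unsigned bit; const char *name; };

static const struct flag_name k_flags[] = {
    {0, "CF"},  {2, "PF"},  {4, "AF"},  {6, "ZF"},  {7, "SF"},  {8, "TF"},
    {9, "IF"},  {10, "DF"}, {11, "OF"}, {14, "NT"}, {16, "RF"}, {17, "VM"},
    {18, "AC"}, {19, "VIF"}, {20, "VIP"}, {21, "ID"},
};

static inline int rflags_bit(uint64_t rflags, unsigned bit) {
    return (int)((rflags >> bit) & 1);
}

static inline unsigned rflags_iopl(uint64_t rflags) {
    return (unsigned)((rflags >> 12) & 3);
}

/* Writes a space-separated list of the set flags into buf and returns buf.
 * Output is silently truncated if buf is too small. */
static inline char *rflags_describe(uint64_t rflags, char *buf, size_t len) {
    size_t used = 0;
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof k_flags / sizeof k_flags[0]; i++) {
        if (!rflags_bit(rflags, k_flags[i].bit))
            continue;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", k_flags[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;                   /* out of space: keep what fits */
        used += (size_t)n;
    }
    return buf;
}
```

For example, the value 0x246 that is often seen in user-mode register dumps has bits 2, 6, and 9 set (plus the always-one reserved bit 1), so it decodes to PF, ZF, and IF.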
Segment Registers
In 64-bit mode, segmentation is largely vestigial. CS, DS, ES, SS are all treated as base=0, so segment addressing is effectively flat. However, FS and GS retain their base addresses and are used for thread-local storage (TLS) and per-CPU data:
CS: Code segment. The L bit of the selected descriptor determines 64-bit mode vs compat mode.
DS: Data segment. Base=0, ignored in 64-bit mode.
ES: Extra segment. Base=0, ignored in 64-bit mode.
SS: Stack segment. Base=0, ignored in 64-bit mode.
FS: Loaded via WRFSBASE or MSR 0xC0000100 (FS.Base MSR)
User space: TLS (Thread-Local Storage) base
e.g., on Linux: FS:0x28 = stack canary (loaded by GCC)
FS points to the TLS struct (struct pthread in glibc)
GS: Loaded via WRGSBASE or MSR 0xC0000101 (GS.Base MSR)
User space: some ABIs use GS for TLS (macOS uses GS for pthread struct)
Kernel space: SWAPGS instruction swaps GS.Base with KernelGSBase (MSR 0xC0000102)
Kernel uses GS to access per-CPU data (the per-CPU variable area on Linux)
SWAPGS mechanism (critical for kernel entry):
User space: GS.Base = user TLS (e.g., pthread struct)
KernelGSBase MSR = per-CPU kernel struct
Kernel entry (SYSCALL, or an interrupt/exception arriving from user mode):
SWAPGS → swaps GS.Base ↔ KernelGSBase
Now GS.Base = per-CPU kernel struct
KernelGSBase = (old) user TLS (saved for return)
Kernel exit:
SWAPGS → swaps back
SYSRET → returns to user mode
Control Registers
Register Bits Description
─────────────────────────────────────────────────────────────────────────────
CR0 [0] PE = Protection Enable (0=real mode, 1=protected mode)
[16] WP = Write Protect (supervisor writes to read-only pages fault
even at CPL=0 → enables CoW and read-only data for the kernel)
[31] PG = Paging Enable (0=physical addressing, 1=virtual+paging)
CR1 (reserved, #UD fault if accessed)
CR2 [63:0] Page Fault Linear Address (PFLA)
On #PF exception: CR2 contains the virtual address that caused the fault
Read by page fault handler to determine which page to load
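CR3 [11:0] PCID (Process Context Identifier, if the PCIDE bit in CR4 is set)
[51:12] PML4/PML5 physical base address (top-level page table; limited by MAXPHYADDR)
[63] NOFLUSH (only on a MOV to CR3 with PCIDE=1: keep TLB entries tagged with the new PCID)
MOV CR3, rax : TLB flush of non-global entries (unless NOFLUSH bit set with PCID enabled)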
Every context switch: load new CR3 for new process's page tables
KPTI: uses 2 CR3 values per process (user and kernel page tables)
CR4 [4] PSE = Page Size Extensions (4MB pages in 32-bit mode)
[5] PAE = Physical Address Extension (enables 64-bit PTEs in 32-bit mode)
[7] PGE = Page Global Enable (global TLB entries persist across CR3 changes)
[9] OSFXSR = OS support for FXSAVE/FXRSTOR (SSE)
[17] PCIDE = Process Context Identifier Enable
[20] SMEP = Supervisor Mode Execution Prevention
Prevents ring 0 (kernel) from executing user-space pages
Mitigates kernel code injection attacks (cannot JMP to user buffer)
[21] SMAP = Supervisor Mode Access Prevention
Prevents ring 0 from reading/writing user-space pages
WITHOUT explicitly setting RFLAGS.AC first
Mitigates exploits that make the kernel dereference attacker-controlled user-space pointers or use fake objects staged in user memory
[22] PKE = Protection Key Enable (user-mode)
[23] CET = Control-flow Enforcement Technology enable
[12] LA57 = 5-level paging enable (57-bit virtual addresses, 128 PiB space)
CR8 [3:0] TPR = Task Priority Register
On x86-64 with APIC: CR8 controls which interrupt priority levels
are accepted. CR8=15 → no interrupts accepted.
Accessible only at CPL 0 (MOV to/from CR8 is a privileged instruction).
Windows uses CR8 to implement IRQL; Linux masks interrupts with CLI/STI instead.
CR3 Switching and KPTI
Context switch (process A → process B):
mov rax, [process_B.cr3] ; MOV to CR3 only accepts a register source
mov cr3, rax
→ triggers TLB flush for all non-global pages (unless PCIDs are in use)
→ new PML4 table takes effect immediately
→ all subsequent virtual address translations use process B's page tables
KPTI context:
Each process has TWO CR3 values:
[1] User CR3: user page tables plus only the minimal kernel entry/exit trampoline mappings
[2] Kernel CR3: full page tables (user + kernel)
Entering kernel (syscall/interrupt):
Load Kernel CR3 → TLB now sees kernel pages
Returning to user:
Load User CR3 → TLB loses kernel mappings (prevents Meltdown)
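Performance: a MOV to CR3 is a serializing operation that typically costs on the order of a hundred cycles or more, plus the TLB misses that follow a flush
PCID optimization: with PCIDs enabled, TLB entries are TAGGED with the PCID
and need not be flushed on a CR3 switch. This cuts KPTI overhead substantially for syscall-heavy code (exact figures vary by CPU and workload).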
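The CR3 layout described above (PCID in bits 11:0, page-table base in bits 51:12, and the write-only NOFLUSH hint in bit 63) can be sketched as a few pack/unpack helpers. The names cr3_make, cr3_pml4_phys, and cr3_pcid are hypothetical, and actually loading the value requires ring 0, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PCID_MASK  UINT64_C(0x0000000000000FFF)  /* bits 11:0            */
#define CR3_ADDR_MASK  UINT64_C(0x000FFFFFFFFFF000)  /* bits 51:12           */
#define CR3_NOFLUSH    (UINT64_C(1) << 63)           /* only meaningful on a
                                                        MOV to CR3, PCIDE=1  */

/* Build the value a kernel would load into CR3 (illustrative helper). */
static inline uint64_t cr3_make(uint64_t pml4_phys, unsigned pcid, int noflush) {
    assert((pml4_phys & ~CR3_ADDR_MASK) == 0);   /* 4 KiB aligned, < 2^52 */
    assert(pcid <= CR3_PCID_MASK);               /* 12-bit identifier     */
    return pml4_phys | pcid | (noflush ? CR3_NOFLUSH : 0);
}

static inline uint64_t cr3_pml4_phys(uint64_t cr3) {
    return cr3 & CR3_ADDR_MASK;
}

static inline unsigned cr3_pcid(uint64_t cr3) {
    return (unsigned)(cr3 & CR3_PCID_MASK);
}
```

Under KPTI with PCID, a kernel would typically build two such values per process (one per page-table root, each with its own PCID) and set NOFLUSH when it knows the tagged entries are still valid.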
SIMD Registers: SSE/AVX/AVX-512
Register      Width     ISA        Introduced
──────────────────────────────────────────────────────────────────────────────
XMM0-XMM15    128-bit   SSE/SSE2   Pentium III (1999) / Pentium 4 (2000); XMM8-XMM15 need 64-bit mode
YMM0-YMM15    256-bit   AVX/AVX2   Sandy Bridge (2011) / Haswell (2013)
ZMM0-ZMM31    512-bit   AVX-512    Knights Landing (2016), Skylake Xeon (2017), Zen 4 (2022)
K0-K7         64-bit    AVX-512    Predicate/mask registers (16 bits used with AVX-512F alone)
Physical relationship:
ZMM0: [bits 511:256=upper AVX-512 half | bits 255:128=YMM0 upper | bits 127:0=XMM0]
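Writing XMM0 with a VEX-encoded instruction: ZERO-EXTENDS through the full register width
Writing XMM0 with a legacy SSE instruction: upper bits are PRESERVED (the source of AVX-SSE transition penalties)
Writing YMM0 (VEX-encoded): ZERO-EXTENDS to 512 bits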
Writing ZMM0: full 512-bit update
AVX-512 features (varies by CPU):
- 32 ZMM registers (vs 16 YMM in AVX2)
- Mask registers (K0-K7): per-element predication
- Embedded broadcast: {1to8}, {1to16} → broadcast a scalar to all vector lanes
- Scatter/gather: ZMM-indexed loads/stores to non-contiguous memory
AVX-512 penalty on Intel (most pronounced on Skylake-era Xeons, smaller on later generations):
Executing heavy AVX-512 instructions causes the CPU to drop clock frequency
(power: 512-bit SIMD consumes significantly more power)
Recovery latency after the last AVX-512 instruction: on the order of 1 ms
→ Can HURT performance for code that mixes AVX-512 and non-SIMD
→ Heavy AVX-512 loops: sustained throughput gain (e.g., 2× for DGEMM)
→ Sparse AVX-512 use: frequency penalty + transition overhead = net loss
System V AMD64 ABI Calling Convention
The System V AMD64 ABI is the standard calling convention for Linux, macOS, the BSDs, and most other Unix-like systems on x86-64. Windows uses a different calling convention (a fastcall variant with 4 register args: RCX, RDX, R8, R9).
Integer/Pointer Arguments
Argument # Register Preserved by callee?
──────────────────────────────────────────────
1 RDI No (caller-saved)
2 RSI No (caller-saved)
3 RDX No (caller-saved)
4 RCX No (caller-saved)
5 R8 No (caller-saved)
6 R9 No (caller-saved)
7+ Stack (pushed right-to-left before CALL)
Return values:
RAX : Primary return value (integer/pointer)
RDX : Secondary return value (for 128-bit returns or struct returns)
Floating-Point Arguments
FP/SSE arg 1 XMM0
FP/SSE arg 2 XMM1
...
FP/SSE arg 8 XMM7
FP/SSE arg 9+ Stack
Callee-Saved Registers (must be preserved across a function call)
RBX, RBP, R12, R13, R14, R15
→ A function may use these registers only if it saves/restores them (push/pop or explicit save)
XMM0-XMM7 : caller-saved (not preserved)
XMM8-XMM15: caller-saved (not preserved)
→ All XMM/YMM/ZMM registers are caller-saved in SysV ABI
→ A function that uses any YMM/ZMM register should execute VZEROUPPER before returning to, or calling, code that may use legacy SSE, to avoid AVX-SSE transition penalties
Stack Frame and Red Zone
RSP immediately before CALL: 16-byte aligned
RSP at function entry: RSP % 16 == 8 (CALL pushed the 8-byte return address)
RSP in function body: must be 16-byte aligned again before issuing another CALL
Red zone: The 128 bytes BELOW RSP are reserved for the current function.
Signal handlers must not clobber this region.
Leaf functions (no calls) can use the red zone without adjusting RSP.
The Linux kernel is compiled with -mno-red-zone: hardware interrupt delivery
pushes its frame onto the current kernel stack just below RSP, which would overwrite a red zone.
High addresses
┌─────────────────────────┐
│ Caller's frame │
├─────────────────────────┤ ← RSP before CALL
│ Return address (8B) │
├─────────────────────────┤ ← RSP after CALL (= current function entry RSP)
│ Callee frame (locals) │
├─────────────────────────┤ ← RSP after local allocation
│ Red zone (128 bytes) │ ← usable by leaf functions without SUB RSP
└─────────────────────────┘
Low addresses
Syscall vs Function Call Argument Differences
Function call ABI (System V AMD64):
Arg1=RDI, Arg2=RSI, Arg3=RDX, Arg4=RCX, Arg5=R8, Arg6=R9
SYSCALL ABI (Linux x86-64):
Syscall#=RAX, Arg1=RDI, Arg2=RSI, Arg3=RDX, Arg4=R10, Arg5=R8, Arg6=R9
Note: RCX → R10 (because SYSCALL destroys RCX — used to save return address)
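The register assignment above can be exercised directly from C with GCC/Clang inline assembly. The following is a minimal sketch, assuming Linux on x86-64 (it falls back to the libc write() wrapper elsewhere); raw_write is a hypothetical helper name, and the syscall number 1 for write is the one listed in unistd_64.h.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* raw_write (hypothetical name): invoke write(2) with the SYSCALL
 * instruction, bypassing the libc wrapper on Linux x86-64.
 * Returns the byte count, or a negative value on failure
 * (-errno on the raw path, -1 on the libc fallback path). */
static inline long raw_write(int fd, const void *buf, size_t len) {
    assert(buf != NULL || len == 0);
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ volatile(
        "syscall"
        : "=a"(ret)                  /* RAX: return value                */
        : "a"(1L),                   /* RAX: syscall number (write == 1) */
          "D"((long)fd),             /* RDI: arg1                        */
          "S"(buf),                  /* RSI: arg2                        */
          "d"(len)                   /* RDX: arg3                        */
        : "rcx", "r11", "memory");   /* SYSCALL overwrites RCX and R11   */
    return ret;
#else
    return (long)write(fd, buf, len);
#endif
}
```

A call with four or more arguments would additionally need the fourth argument pinned to R10, for example with a `register long r10 __asm__("r10")` local variable, since there is no single-letter constraint for that register.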
SYSCALL/SYSRET: Fast System Call Mechanism
The SYSCALL/SYSRET instruction pair is the modern x86-64 mechanism for transitioning between user mode (ring 3) and kernel mode (ring 0). It replaced the older INT 0x80 / IRET path (INT80 is still supported for 32-bit compatibility on Linux).
MSRs Controlling SYSCALL/SYSRET
MSR 0xC0000080 (EFER — Extended Feature Enable Register):
Bit 8 (SCE) = System Call Extensions: must be 1 to enable SYSCALL/SYSRET
MSR 0xC0000081 (STAR — System Target Address Register):
Bits [63:48]: SYSRET CS and SS selectors
Bits [47:32]: SYSCALL CS and SS selectors
Bits [31:0]: (32-bit syscall target in compat mode, not used in pure 64-bit)
MSR 0xC0000082 (LSTAR — Long mode System Target Address Register):
Bits [63:0]: Linear address of kernel entry point for SYSCALL in 64-bit mode
= Address of kernel's syscall handler (e.g., `entry_SYSCALL_64` in Linux)
Set during kernel initialization: wrmsr(MSR_LSTAR, (uint64_t)entry_SYSCALL_64)
MSR 0xC0000083 (CSTAR — Compat mode System Target Address Register):
Kernel entry for 32-bit code issuing SYSCALL in compatibility mode (AMD only; Intel CPUs do not support SYSCALL outside 64-bit mode, so 32-bit processes use SYSENTER or INT 0x80 there)
MSR 0xC0000084 (SFMASK — SYSCALL FLAGS MASK):
On SYSCALL: RFLAGS &= ~SFMASK
Typically: SFMASK = 0x47700 (clears: TF, DF, IF, IOPL, AC, NT)
→ Disables interrupts at kernel entry, clears trap flag, ensures clean RFLAGS
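As a quick check of the arithmetic, the mask value quoted above is exactly the OR of the six listed flag fields. The sketch below builds it from the architectural RFLAGS bit positions and applies it the way SYSCALL does; the macro and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bits named in the SFMASK description above. */
#define FLAG_TF    (UINT64_C(1) << 8)    /* trap flag                 */
#define FLAG_IF    (UINT64_C(1) << 9)    /* interrupt enable          */
#define FLAG_DF    (UINT64_C(1) << 10)   /* direction flag            */
#define FLAG_IOPL  (UINT64_C(3) << 12)   /* I/O privilege level field */
#define FLAG_NT    (UINT64_C(1) << 14)   /* nested task               */
#define FLAG_AC    (UINT64_C(1) << 18)   /* alignment check / SMAP    */

#define SFMASK_EXAMPLE \
    (FLAG_TF | FLAG_IF | FLAG_DF | FLAG_IOPL | FLAG_NT | FLAG_AC)

static_assert(SFMASK_EXAMPLE == UINT64_C(0x47700),
              "matches the SFMASK value quoted in the text");

/* What SYSCALL does to RFLAGS after copying the old value into R11. */
static inline uint64_t rflags_after_syscall(uint64_t user_rflags,
                                            uint64_t sfmask) {
    return user_rflags & ~sfmask;
}
```

With a typical user-mode RFLAGS of 0x246, the kernel therefore starts executing with 0x46: interrupts are off until the entry code re-enables them.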
SYSCALL Instruction Execution Sequence
User executes SYSCALL instruction:
Step 1: Save state
RCX ← RIP (return address: next instruction after SYSCALL)
R11 ← RFLAGS (saved user flags)
Step 2: Load kernel entry
RIP ← LSTAR MSR (kernel entry point)
RFLAGS ← RFLAGS & ~SFMASK (clear interrupt flag etc.)
CS ← STAR[47:32] (kernel code selector, ring 0)
SS ← STAR[47:32] + 8 (kernel stack selector)
Step 3: Kernel executes
(RSP still points to user stack! Kernel must switch to kernel stack immediately)
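Linux entry_SYSCALL_64 (simplified pseudo-assembly; the real code also switches CR3 under KPTI, builds a full pt_regs frame, and dispatches through do_syscall_64):
swapgs ; GS.Base → per-CPU kernel area
mov gs:[cpu_tssp], rsp ; save user RSP in a per-CPU scratch slot
mov rsp, gs:[cpu_kernel_stack] ; load this CPU's kernel stack
push rcx ; save user return address (instruction after SYSCALL)
push r11 ; save user RFLAGS
push rdi..r9 ; save syscall arguments
call [sys_call_table + rax*8] ; dispatch on the syscall number in RAX
pop ... ; restore registers, including user RSP
swapgs ; restore user GS.Base (still in ring 0)
sysretq ; return to user
Step 4: SYSRET instruction (64-bit operand size)
RIP ← RCX (restore user return address)
RFLAGS ← (R11 & 0x3C7FD7) | 2 (restore user flags; RF and VM cleared, reserved bits forced)
CS ← STAR[63:48] + 16 (user 64-bit code selector, RPL forced to 3)
SS ← STAR[63:48] + 8 (user stack selector, RPL forced to 3)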
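The selector arithmetic on the STAR MSR is easy to get wrong, so here is a small decoder following the rules in the vendor manuals: SYSCALL loads CS from STAR[47:32] (RPL cleared) and SS from that value plus 8; a 64-bit SYSRET loads CS from STAR[63:48] plus 16 and SS from STAR[63:48] plus 8, both with RPL forced to 3. The struct and function names are hypothetical, and the STAR value in the demo uses the selector bases 0x10 and 0x23 that Linux is generally described as programming; treat those two numbers as an assumption.

```c
#include <assert.h>
#include <stdint.h>

struct star_selectors {
    uint16_t syscall_cs, syscall_ss;   /* loaded by SYSCALL         */
    uint16_t sysret_cs, sysret_ss;     /* loaded by 64-bit SYSRET   */
};

/* Derive the four selectors from a raw STAR MSR value (illustrative). */
static inline struct star_selectors star_decode(uint64_t star) {
    uint16_t kbase = (uint16_t)(star >> 32);   /* STAR[47:32] */
    uint16_t ubase = (uint16_t)(star >> 48);   /* STAR[63:48] */
    struct star_selectors s;
    s.syscall_cs = (uint16_t)(kbase & 0xFFFC);       /* RPL forced to 0 */
    s.syscall_ss = (uint16_t)(kbase + 8);
    s.sysret_cs  = (uint16_t)((ubase + 16) | 3);     /* RPL forced to 3 */
    s.sysret_ss  = (uint16_t)((ubase + 8) | 3);
    return s;
}

/* Assumed Linux-style layout: kernel base 0x10, user base 0x23. */
static inline void star_demo(void) {
    uint64_t star = (UINT64_C(0x23) << 48) | (UINT64_C(0x10) << 32);
    struct star_selectors s = star_decode(star);
    assert(s.syscall_cs == 0x10 && s.syscall_ss == 0x18);
    assert(s.sysret_cs  == 0x33 && s.sysret_ss  == 0x2B);
}
```

The fixed +8/+16 offsets are why the GDT must place the kernel data, user data, and user 64-bit code descriptors in a specific order relative to the two base selectors.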
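SYSRET Security Bug (CVE-2012-0217): On Intel CPUs, SYSRET with a non-canonical RCX (bits [63:48] not a sign-extension of bit 47) raises #GP (General Protection Fault) while the CPU is still at CPL=0, but after the kernel has already restored the user's RSP. The fault handler therefore runs in ring 0 on an attacker-controlled stack pointer, which is exploitable. AMD CPUs raise the fault only after the switch to ring 3. Linux and other major OSes now verify that the return address is canonical before using SYSRET and fall back to IRET otherwise.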
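The canonicality test a kernel needs before SYSRET is a few lines of integer arithmetic. This is a sketch with a hypothetical function name; it checks that every bit above the top implemented address bit equals that bit, for either 48-bit or 57-bit (LA57) virtual addresses.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if va is canonical for the given virtual-address width.
 * va_bits is 48 with 4-level paging or 57 with 5-level paging (LA57). */
static inline int is_canonical(uint64_t va, unsigned va_bits) {
    assert(va_bits == 48 || va_bits == 57);
    /* Keep bit (va_bits-1) and everything above it. */
    uint64_t upper = va >> (va_bits - 1);
    uint64_t all_ones = (UINT64_C(1) << (65 - va_bits)) - 1;
    /* Canonical means those bits are all 0 (lower half)
     * or all 1 (upper half). */
    return upper == 0 || upper == all_ones;
}
```

For example, with 48-bit addressing 0x00007FFFFFFFFFFF is the highest canonical user address and 0x0000800000000000 is the first non-canonical one; with LA57 that second address becomes canonical.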
x86 TSO Memory Model
x86 implements Total Store Order (TSO), a model close to sequential consistency with one relaxation: a store may sit in the core's store buffer and become globally visible only after a later load (to a different address) on the same core has already completed.
Allowed and Forbidden Reorderings
Operation pair Allowed to reorder? Notes
──────────────────────────────────────────────────────────────
Load → Load NO Memory reads are ordered
Store → Store NO Memory writes are ordered (FIFO store buffer)
Load → Store NO Load before store is maintained
Store → Load YES! A store can be reordered AFTER a subsequent load
This is the TSO relaxation
Root cause: store buffer — stores sit in
the store buffer while later loads proceed
The TSO store-load reordering:
Thread 0: Thread 1:
store x = 1 store y = 1
r0 = load y r1 = load x
Under TSO: r0=0, r1=0 is possible!
(Both threads' stores are in their store buffers, not yet globally visible,
when both loads execute — each thread sees its own store but not the other's)
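The outcome above can be reproduced deterministically with a toy model of two cores, each with a FIFO store buffer in front of shared memory. This is a single-threaded simulation for illustration, not a measurement on real hardware; all type and function names are invented for the sketch, and cpu_mfence models a fence simply as "drain my store buffer".

```c
#include <assert.h>

enum { X = 0, Y = 1, NVARS = 2, SB_CAP = 4 };

struct store   { int addr, val; };
struct cpu     { struct store sb[SB_CAP]; int head, count; };  /* FIFO */
struct machine { int mem[NVARS]; struct cpu cpu[2]; };

/* A store enters the issuing core's buffer; memory is not yet updated. */
static inline void cpu_store(struct machine *m, int c, int addr, int val) {
    struct cpu *p = &m->cpu[c];
    assert(p->count < SB_CAP);
    p->sb[(p->head + p->count) % SB_CAP] = (struct store){addr, val};
    p->count++;
}

/* A load forwards from the youngest matching entry in its OWN buffer,
 * otherwise it reads shared memory. */
static inline int cpu_load(const struct machine *m, int c, int addr) {
    const struct cpu *p = &m->cpu[c];
    for (int i = p->count - 1; i >= 0; i--) {
        const struct store *s = &p->sb[(p->head + i) % SB_CAP];
        if (s->addr == addr)
            return s->val;
    }
    return m->mem[addr];
}

/* Fence model: drain the buffer in FIFO order (store-store order kept). */
static inline void cpu_mfence(struct machine *m, int c) {
    struct cpu *p = &m->cpu[c];
    while (p->count > 0) {
        m->mem[p->sb[p->head].addr] = p->sb[p->head].val;
        p->head = (p->head + 1) % SB_CAP;
        p->count--;
    }
}

/* Store-buffering litmus test with both stores issued before either load.
 * Returns r0 | (r1 << 1): 0 means r0=0,r1=0; 3 means r0=1,r1=1. */
static inline int sb_litmus(int with_fence) {
    struct machine m = {0};
    cpu_store(&m, 0, X, 1);            /* thread 0: store x = 1 */
    cpu_store(&m, 1, Y, 1);            /* thread 1: store y = 1 */
    if (with_fence) {
        cpu_mfence(&m, 0);
        cpu_mfence(&m, 1);
    }
    int r0 = cpu_load(&m, 0, Y);       /* thread 0: r0 = load y */
    int r1 = cpu_load(&m, 1, X);       /* thread 1: r1 = load x */
    return r0 | (r1 << 1);
}
```

In this interleaving sb_litmus(0) yields r0=0, r1=0, while sb_litmus(1) yields r0=1, r1=1, because each fence forces the preceding store into memory before the load runs.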
Memory Barriers / Fence Instructions
MFENCE : Full memory fence. Ensures all loads/stores before MFENCE
complete before any loads/stores after MFENCE.
Cost: ~20-100 cycles (serializes the store buffer flush)
SFENCE : Store fence. Orders stores only. Later stores cannot
pass earlier stores. Used with non-temporal stores (MOVNT).
Cost: ~5-15 cycles
LFENCE : Load fence. Prevents later loads from issuing before prior loads.
Also prevents speculative execution from passing the LFENCE
(used as speculation barrier in Spectre mitigations).
Cost: ~3-10 cycles
LOCK prefix: Atomically performs the operation AND acts as a full fence.
(e.g., LOCK CMPXCHG, LOCK XADD, LOCK XCHG)
XCHG with a memory operand: always has implicit LOCK semantics (even without prefix).
Practical note:
C11/C++11 atomic operations compile to the following on x86:
memory_order_seq_cst store → XCHG (or MOV + MFENCE); seq_cst load → plain MOV
memory_order_acquire load → plain MOV (loads are already ordered)
memory_order_release store → plain MOV (NT stores additionally need SFENCE)
→ x86 TSO is "almost free" for acquire/release; the main cost is seq_cst stores and read-modify-write operations
x86-64 Register Map Table
┌─────────────────────────────────────────────────────────────────────────────┐
│                             x86-64 REGISTER MAP                             │
├──────────┬───────────┬────────┬────────┬────────────────────────────────────┤
│ 64-bit   │ 32-bit    │ 16-bit │ 8-bit  │ Primary Use                        │
├──────────┼───────────┼────────┼────────┼────────────────────────────────────┤
│ RAX      │ EAX       │ AX     │ AH/AL  │ Return value, syscall number       │
│ RBX      │ EBX       │ BX     │ BH/BL  │ Callee-saved, legacy base register │
│ RCX      │ ECX       │ CX     │ CH/CL  │ Arg4 (function), SYSCALL return RIP│
│ RDX      │ EDX       │ DX     │ DH/DL  │ Arg3 (function), return value 2    │
│ RSI      │ ESI       │ SI     │ SIL    │ Arg2 (function), string source     │
│ RDI      │ EDI       │ DI     │ DIL    │ Arg1 (function), string dest       │
│ RBP      │ EBP       │ BP     │ BPL    │ Callee-saved, frame pointer        │
│ RSP      │ ESP       │ SP     │ SPL    │ Stack pointer                      │
│ R8       │ R8D       │ R8W    │ R8B    │ Arg5 (function/syscall)            │
│ R9       │ R9D       │ R9W    │ R9B    │ Arg6 (function/syscall)            │
│ R10      │ R10D      │ R10W   │ R10B   │ Arg4 (syscall), caller-saved       │
│ R11      │ R11D      │ R11W   │ R11B   │ Saved RFLAGS (SYSCALL), scratch    │
│ R12      │ R12D      │ R12W   │ R12B   │ Callee-saved                       │
│ R13      │ R13D      │ R13W   │ R13B   │ Callee-saved                       │
│ R14      │ R14D      │ R14W   │ R14B   │ Callee-saved                       │
│ R15      │ R15D      │ R15W   │ R15B   │ Callee-saved                       │
├──────────┼───────────┴────────┴────────┴────────────────────────────────────┤
│ RIP      │ Instruction pointer                                              │
│ RFLAGS   │ Condition codes + control flags (CF, ZF, SF, OF, IF, DF, TF)     │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ Segment  │ CS DS ES FS GS SS (FS/GS keep base addresses in 64-bit mode)     │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ SIMD     │ XMM0-15 (128b) YMM0-15 (256b) ZMM0-31 (512b, AVX-512)            │
│          │ K0-K7 (mask registers, up to 64 bits, AVX-512)                   │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ Control  │ CR0 CR2 CR3 CR4 CR8 (privileged, ring 0 only)                    │
│ Debug    │ DR0-DR3 (breakpoint address), DR6 (status), DR7 (control)        │
│ MSR      │ Accessed via RDMSR/WRMSR (ring 0). Key: EFER, STAR, LSTAR,       │
│          │ SFMASK, FS.Base, GS.Base, KernelGSBase, IA32_SPEC_CTRL           │
└──────────┴──────────────────────────────────────────────────────────────────┘
Intel CET: Control-Flow Enforcement Technology
CET provides hardware enforcement of forward-edge and backward-edge control flow integrity (CFI).
Shadow Stack (backward-edge CFI)
Normal stack: Shadow stack (new, in CR4.CET-protected memory):
┌─────────────────┐ ┌─────────────────┐
│ return address │ │ return address │ (copy of return addr on CALL)
│ local vars │ └─────────────────┘
│ ... │
└─────────────────┘
On CALL: Push return addr to BOTH normal stack AND shadow stack
On RET: Compare popped return addr (from normal stack)
with top of shadow stack
→ MISMATCH → #CP (Control Protection Exception)
Shadow stack: not writable by ordinary stores (a normal MOV to it faults)
written only by CALL and by dedicated instructions (WRSSQ/WRUSSQ, SAVEPREVSSP, RSTORSSP) where enabled
Prevents ROP (Return-Oriented Programming) attacks
Attacker cannot overwrite shadow stack via buffer overflow
(shadow-stack pages use a dedicated page-table encoding, so ordinary writes fault)
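The CALL/RET rule above can be illustrated with a toy software model: two parallel stacks, where only the normal one is reachable by a simulated overflow. This is a sketch of the check, not of the hardware mechanism; the names are invented, and a return value of 0 stands in for the #CP exception.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_DEPTH 64

/* Toy model: `stack` is attacker-writable, `shadow` is not. */
struct cet_model {
    uint64_t stack[MODEL_DEPTH];
    uint64_t shadow[MODEL_DEPTH];
    int sp;                              /* number of live frames */
};

/* CALL: the return address is pushed to both stacks. */
static inline void model_call(struct cet_model *m, uint64_t ret_addr) {
    assert(m->sp < MODEL_DEPTH);
    m->stack[m->sp]  = ret_addr;
    m->shadow[m->sp] = ret_addr;
    m->sp++;
}

/* RET: pop both copies and compare.
 * Returns 1 and stores the target on a match, 0 where #CP would fire. */
static inline int model_ret(struct cet_model *m, uint64_t *target) {
    assert(m->sp > 0);
    m->sp--;
    if (m->stack[m->sp] != m->shadow[m->sp])
        return 0;                        /* mismatch: control protection fault */
    *target = m->stack[m->sp];
    return 1;
}
```

Overwriting m->stack[m->sp - 1] between a model_call and the matching model_ret plays the role of a stack buffer overflow: the return is refused because the shadow copy still holds the original address.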
Indirect Branch Tracking (IBT, forward-edge CFI)
All valid indirect branch targets MUST begin with ENDBR64 (or ENDBR32) instruction.
ENDBR64 is a NOP on CPUs without CET. On CET-enabled CPUs:
If an indirect JMP or CALL lands anywhere that is NOT an ENDBR64 instruction
→ #CP exception
ENDBR64 encoding: F3 0F 1E FA (4 bytes)
Restricts code-reuse attacks (JOP/COP): even if an attacker redirects an indirect branch,
the target must start with ENDBR64, so it cannot point into the middle of a function.
CET IBT with Linux:
GCC -fcf-protection=full: adds ENDBR64 to all function entry points
Kernel: CONFIG_X86_KERNEL_IBT enables CET IBT for kernel
Limitation: still allows ROP-like attacks using ENDBR64-prefixed gadgets
(any existing function entry is a valid target). Does not prevent
Spectre gadgets in the speculative path.
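Because the ENDBR64 encoding is a fixed 4-byte sequence, checking whether a code address is a valid IBT landing pad is a simple byte comparison. The sketch below works on a byte buffer (for example, bytes copied out of an ELF .text section); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ENDBR64 encoding from the text: F3 0F 1E FA. */
static const unsigned char k_endbr64[4] = {0xF3, 0x0F, 0x1E, 0xFA};

/* Does the code at this address begin with ENDBR64? */
static inline int starts_with_endbr64(const unsigned char *code, size_t len) {
    assert(code != NULL);
    return len >= sizeof k_endbr64 &&
           memcmp(code, k_endbr64, sizeof k_endbr64) == 0;
}

/* Count every byte offset at which the ENDBR64 pattern occurs.
 * This includes accidental occurrences inside other instructions or
 * immediates, which are also valid targets as far as the CPU can tell. */
static inline size_t count_endbr64(const unsigned char *code, size_t len) {
    size_t hits = 0;
    assert(code != NULL);
    for (size_t i = 0; i + sizeof k_endbr64 <= len; i++)
        if (memcmp(code + i, k_endbr64, sizeof k_endbr64) == 0)
            hits++;
    return hits;
}
```

Counting landing pads in a binary gives a rough measure of the remaining attack surface described in the limitation above: every hit is a place an indirect branch may still legally land.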
SMEP, SMAP, and Privilege Separation
Ring Protection:
Ring 0 (CPL=0): Kernel mode — can execute any instruction, access any memory
Ring 3 (CPL=3): User mode — cannot execute privileged instructions
Rings 1,2: Not used in modern OS designs (drivers run in ring 0)
SMEP (Supervisor Mode Execution Prevention, CR4[20]):
Ring 0 cannot execute a page that is user-accessible (U/S bit set in PTE).
Prevents kernel from being redirected to execute shellcode in user buffers.
Attack prevented: kernel write primitive + CALL to user buffer.
Linux enables SMEP since 3.0 and SMAP since 3.7 (if the CPU supports them).
SMAP (Supervisor Mode Access Prevention, CR4[21]):
Ring 0 cannot implicitly READ or WRITE user-accessible pages.
To access user memory, kernel must explicitly set RFLAGS.AC bit (via STAC instruction)
and clear it afterwards (via CLAC).
Prevents: kernel accidentally following a user-controlled pointer into user data.
Also blocks exploits that stage fake kernel objects in user memory and trick the kernel into dereferencing them.
Kernel usage pattern:
copy_from_user(): stac; mov [user_ptr]; clac
copy_to_user(): stac; mov [user_ptr] = val; clac
PKU (Protection Keys for Userspace, CR4[22]):
Each page table entry has a 4-bit "protection key" field (bits 62:59).
A per-thread PKRU register (readable and writable from user mode via RDPKRU/WRPKRU, no syscall needed)
stores 32 bits: 2 bits per key (AD=access-disable at bit 2k, WD=write-disable at bit 2k+1).
Application can mark memory regions with a key and toggle access without TLB flush.
Exposed on Linux through pkey_alloc() and pkey_mprotect(), with glibc's pkey_set()/pkey_get() wrapping the PKRU access; used by in-process isolation and sandboxing work.
Performance advantage: WRPKRU costs on the order of tens of cycles, far less than an mprotect() syscall plus TLB invalidation
Useful for: JIT sandboxes, safe memory partitioning within a process without syscall overhead
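The PKRU layout is compact enough to compute by hand: key k owns bit 2k (access-disable) and bit 2k+1 (write-disable). The helpers below only manipulate a 32-bit value and extract the key from a page-table entry; they are an illustrative sketch with made-up names and do not execute RDPKRU/WRPKRU.

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD(key) (UINT32_C(1) << (2 * (key)))       /* access-disable */
#define PKRU_WD(key) (UINT32_C(1) << (2 * (key) + 1))   /* write-disable  */

/* Return a copy of pkru with the two permission bits of `key` replaced. */
static inline uint32_t pkru_set(uint32_t pkru, unsigned key,
                                int access_disable, int write_disable) {
    assert(key < 16);                       /* 4-bit key: 16 domains */
    pkru &= ~(PKRU_AD(key) | PKRU_WD(key));
    if (access_disable) pkru |= PKRU_AD(key);
    if (write_disable)  pkru |= PKRU_WD(key);
    return pkru;
}

static inline int pkru_can_read(uint32_t pkru, unsigned key) {
    assert(key < 16);
    return !(pkru & PKRU_AD(key));
}

/* Writes need both bits clear: AD blocks all data access, WD blocks writes. */
static inline int pkru_can_write(uint32_t pkru, unsigned key) {
    assert(key < 16);
    return !(pkru & (PKRU_AD(key) | PKRU_WD(key)));
}

/* Protection key stored in bits 62:59 of a leaf page-table entry. */
static inline unsigned pte_pkey(uint64_t pte) {
    return (unsigned)((pte >> 59) & 0xF);
}
```

For instance, the value 0x55555554 has the access-disable bit set for keys 1 through 15 and leaves key 0 fully accessible, so pkru_can_read() is true only for key 0.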
Debugging Notes
# Dump all registers via GDB
gdb ./program
(gdb) break main
(gdb) run
(gdb) info registers # GPRs + flags + segment registers (use "info all-registers" to include SIMD/x87)
(gdb) p/x $eflags # flags in hex; GDB names the register eflags (decode: bit 9=IF, bit 7=SF, bit 6=ZF, bit 0=CF)
(gdb) x/32xg $rsp # dump stack as 64-bit words
# Read/write MSRs from Linux (requires root)
# Install msr-tools: apt install msr-tools
modprobe msr
rdmsr 0xC0000082 # read LSTAR (kernel SYSCALL entry address)
rdmsr 0xC0000080 # read EFER
# Example: LSTAR = ffffffffb1c00000 → Linux kernel entry_SYSCALL_64 address (KASLR'd)
# Check SMEP/SMAP status
grep -E "smep|smap" /proc/cpuinfo | head -2
# Check CET support
grep -E "user_shstk| ibt" /proc/cpuinfo
# Debug SYSCALL path with strace
strace -e trace=openat,read,write,mmap ./program
# syscall numbers for x86-64: /usr/include/asm/unistd_64.h
# read=0, write=1, open=2, mmap=9, brk=12, rt_sigaction=13, ...
# ftrace kernel function tracing (entry_SYSCALL_64 itself is assembly and cannot be traced; filter on a handler instead)
echo __x64_sys_write > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace
# Inspect CET shadow stack (via /proc/PID/maps - if kernel CET enabled)
cat /proc/$$/maps | grep -E "\[stack\]|\[shadow]"
Security Implications
- Kernel stack exhaustion (stack overflow): The kernel stack on x86-64 Linux is small and fixed-size: 16 KB (THREAD_SIZE) on current kernels, 8 KB before Linux 3.15. Without guard pages, deep recursion in kernel context overflows into adjacent memory and causes silent corruption; CONFIG_VMAP_STACK adds guard pages so that an overflow faults instead. Real exploits (e.g., the CVE-2016-1583 ecryptfs stack overflow) used this to overwrite adjacent kernel data.
- SMAP bypass: If a kernel function stores a user-controlled pointer in memory and later dereferences it inside an open stac/clac window (for example, a SMAP-disabled window that is wider than necessary), SMAP does not fault and an attacker can redirect the kernel to read/write attacker-controlled data.
- KASLR (Kernel Address Space Layout Randomization): The kernel is loaded at a random base address in virtual memory (on x86-64 the text base is chosen at 2 MB granularity within a window of roughly 1 GB, so entropy is limited to about 9 bits). Combined with SMEP/SMAP, this forces an attacker to leak a kernel address before exploiting control-flow redirections.
- KPTI and the PCID optimization: Without PCID support, KPTI requires a full TLB flush on every kernel entry/exit (every syscall). With PCID (CR4[17]), the kernel and user page tables each get a distinct PCID. TLB entries are tagged by PCID, so no full flush is needed when switching between the two CR3 values for the same process.
- Return address signing (future): Intel's CET shadow stack is the x86-64 answer to ARM's Pointer Authentication Codes (PAC). ARM PAC signs return addresses cryptographically; CET uses a dedicated shadow stack. Both target the same threat model (ROP/JOP).
Performance Implications
- Register pressure and callee-saves: Functions that use RBX/RBP/R12-R15 must save/restore them. High register pressure from many local variables forces spills to stack, increasing memory traffic. Compilers generally allocate caller-saved registers (RAX, RCX, RDX, RSI, RDI, R8-R11) first to avoid this.
- VZEROUPPER overhead: Code mixing 256-bit AVX (YMM) with legacy-encoded SSE (XMM) instructions should execute VZEROUPPER to avoid AVX-SSE transition penalties on Intel CPUs (the form of the penalty differs between pre-Skylake and later cores). Compilers insert it automatically; the instruction itself is cheap.
- REX prefix encoding: Instructions that use R8-R15, SIL/DIL/BPL/SPL, or a 64-bit operand size require a REX prefix byte (the legacy high-byte registers AH/BH/CH/DH cannot be encoded together with REX). This reduces code density, slightly reducing effective fetch bandwidth. Not typically a performance issue on modern wide-fetch CPUs.
- Partial register writes: Writing AL (low 8 bits of RAX) followed by reading RAX creates a false dependency or a merge penalty on some CPUs (Intel pre-Ivy Bridge: roughly a 3-cycle penalty for reading the merged result). Prefer 32-bit writes (which zero-extend) over 8-bit writes in critical paths.
- MFENCE cost: MFENCE on modern Intel CPUs costs ~20-100 cycles due to store buffer drain. In lock-free code, prefer a LOCK-prefixed instruction (which is an implicit fence) over a separate MFENCE where possible. XCHG [mem], reg has implicit LOCK and provides both the exchange and the fence semantics at once.
Modern Usage
Linux Kernel Architecture-Specific Code
Key x86-64 kernel files:
- arch/x86/entry/entry_64.S: SYSCALL entry point, exception handlers, SWAPGS
- arch/x86/kernel/cpu/common.c: CPU feature detection, SMEP/SMAP/CET enable
- arch/x86/mm/pgtable.c: Page table management (PCID/ASID handling lives in arch/x86/mm/tlb.c)
- arch/x86/include/asm/msr-index.h: All MSR numbers
- arch/x86/include/asm/processor.h: CPU-specific structs, feature flags
# Verify kernel entry point (LSTAR MSR)
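sudo rdmsr 0xC0000082
# Then: sudo grep -w entry_SYSCALL_64 /proc/kallsyms
# The two addresses should match: kallsyms already reports KASLR-adjusted addresses when read as root
# (some KPTI-era kernels point LSTAR at an entry trampoline instead)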
# Check which security features are enabled
grep -o -w -E "smep|smap|pku|ospke|ibt|user_shstk|ibrs|ibpb" /proc/cpuinfo | sort -u
Future Directions
- APX (Advanced Performance Extensions, announced by Intel in 2023): Adds 16 new 64-bit general-purpose registers (R16-R31), for 32 in total, encoded via the REX2 prefix. Reduces register pressure for compilers and reduces spills for large functions. Also adds PUSH2/POP2 instructions to save or restore two registers with one instruction.
- AMX (Advanced Matrix Extensions): Intel Sapphire Rapids: tile-based matrix multiply instructions. 8 tile registers (TMM0-TMM7), each up to 16 rows × 64 bytes (1 KB). The TMUL unit executes tile dot-product instructions such as TDPBSSD and TDPBF16PS. Intended to replace hand-written AVX-512 matrix kernels for deep learning inference.
- RAO-INT (Remote Atomic Operations on Integer): Intel specification (instructions AADD, AAND, AOR, AXOR) for performing atomic integer operations on memory without first loading the line into the local cache. Relevant for NUMA and contended counters: it allows updating a remote counter without a cache-line transfer to the requesting core.
- x86S (x86 Simplification): Intel proposal (2023) to remove 16-bit and 32-bit legacy operating modes from future CPUs, leaving a 64-bit-only architecture. It would remove legacy boot and mode-switching complexity and allow some microarchitectural simplification. Never committed to; Intel has since reportedly dropped the proposal.
- CET expansion: Future ISA extensions expected to close remaining CET gaps (ENDBR64 gadgets), integrate with hardware capabilities (CHERI-like features) for memory safety without full CHERI adoption.
Exercises
- Register tracing: Write an assembly function (in a .asm file or as __asm__ inline) that computes the sum of an array using several GPRs as independent partial-sum accumulators, to exploit instruction-level parallelism (RSP and the registers holding the array pointer and loop counter are not available as accumulators). Verify with objdump -d that the compiler is not merging them. Measure IPC vs a single-accumulator version.
- SYSCALL path tracing: Implement a minimal printf-like function that uses the raw SYSCALL instruction (not libc) to invoke write(1, buf, len). Verify the system call number from unistd_64.h (syscall 1). Use strace to confirm the syscall is being made correctly.
- TSO memory model test: Implement the store-load reordering test in C using atomic operations with memory_order_relaxed. Run with 2 threads on a machine with at least 2 cores. Count how many iterations result in r0=0, r1=0 (the TSO-allowed "both stores delayed" outcome). Switch to memory_order_seq_cst and verify this outcome becomes impossible.
- CR3 switching measurement: Write a kernel module (or use KVM/VMX to observe) that measures the cycle cost of MOV CR3, rax with and without PCID. Specifically: (a) measure TLB flush overhead by accessing a large array before and after the CR3 switch, (b) compare a PCID-tagged CR3 switch vs a non-PCID full flush.
- CET shadow stack validation: On a CET-enabled Linux system (user-space shadow stacks need kernel 6.6+ and a glibc built with shadow-stack support), compile a program with -fcf-protection=return. Verify shadow stack presence in /proc/self/maps. Attempt to overwrite a return address on the normal stack (via a controlled stack overflow in a test function) and confirm the resulting #CP exception kills the process with SIGSEGV.
References
- Advanced Micro Devices, Inc. (2024). AMD64 Architecture Programmer's Manual. Volumes 1-5. https://developer.amd.com/resources/developer-guides-manuals/
- Intel Corporation. (2024). Intel 64 and IA-32 Architectures Software Developer's Manual. Volumes 1-4. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
- System V Application Binary Interface: AMD64 Architecture Processor Supplement. (2023). https://gitlab.com/x86-psABIs/x86-64-ABI
- Corbet, J. (2017). The current state of kernel page-table isolation. LWN.net. https://lwn.net/Articles/741878/
- Intel Corporation. (2020). Control-flow Enforcement Technology Specification. https://www.intel.com/content/www/us/en/develop/download/intel-cet-technology-preview.html
- Lipp, M., et al. (2020). Meltdown: Reading Kernel Memory from User Space. Communications of the ACM, 63(6).
- Drepper, U. (2004). ELF Handling For Thread-Local Storage. Red Hat.
- Fog, A. (2023). Calling Conventions for Different C++ Compilers and Operating Systems. https://www.agner.org/optimize/calling_conventions.pdf
- Intel Corporation. (2023). Advanced Performance Extensions (APX) Architecture Specification. https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html