x86-64 Architecture Deep Dive

Technical Overview

x86-64 is one of the most commercially successful instruction set architectures in computing history, powering the large majority of PC and server workloads. Descended from Intel's 8086 (1978) and extended to 64 bits by AMD in 2003 as AMD64, x86-64 carries 46 years of backward compatibility while supporting modern features: 64-bit registers and virtual addressing (48 or 57 bits implemented), AVX-512 SIMD, hardware virtualization, memory protection extensions, and cryptographic instructions. Understanding x86-64 at the architecture level (registers, paging, segmentation, privilege levels, and the legacy modes that remain for backward compatibility) is essential for OS kernel development, performance engineering, security research, and systems programming.

Prerequisites

  • Familiarity with computer organization: registers, ALU, memory, I/O
  • Understanding of assembly language fundamentals
  • Knowledge of virtual memory and address translation
  • Understanding of privilege levels in operating systems
  • Familiarity with interrupt and exception handling

Core Content

ISA History: 46 Years of Backward Compatibility

1978: Intel 8086 — 16-bit, 20-bit address space (1MB), real mode only
1982: Intel 80286 — 16-bit protected mode, 24-bit address (16MB), 286 segmentation
1985: Intel 80386 (386) — 32-bit (IA-32 born), 32-bit protected mode, 32-bit paging, 4GB addr
1989: Intel 80486 — on-chip FPU (no external 80387 needed), 8KB L1 cache integrated
1993: Intel Pentium — 64-bit data bus (internal 32-bit), superscalar (2-wide), P5 µarch
1995: Intel Pentium Pro (P6) — OoO execution, BTB, 3-wide decode, Spectre-class µarch origin
1997: MMX (first x86 SIMD) — 8 × 64-bit MMX regs (aliased to FP stack)
1999: SSE (Streaming SIMD Extensions) — 8 × 128-bit XMM registers (independent of FP)
2003: AMD Opteron / Athlon 64 (AMD64) — x86-64: 64-bit registers, 48-bit virtual addresses, 16 GPRs, RIP-relative addressing
2004: Intel EM64T (Intel 64) — Intel's AMD64-compatible extension in Pentium 4 Prescott
2006: Intel Core / Core 2 — dual-core; SSE3 on Core (Yonah), SSSE3 and 64-bit on Core 2; efficient OoO
2008: SSE4.1 (Core 2 Penryn), SSE4.2 (Nehalem)
2011: Intel Sandy Bridge — AVX (256-bit SIMD with YMM registers)
2013: Intel Haswell — AVX2 (256-bit integer SIMD), FMA3 (fused multiply-add)
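2017: Intel Skylake-SP — AVX-512 (512-bit SIMD, 32 ZMM registers); first shipped in Xeon Phi Knights Landing (2016)
2021: Intel Golden Cove (Alder Lake P-core) — AVX-512 present in P-core silicon but disabled on hybrid consumer parts (the E-cores lack it)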
2022: Intel Raptor Lake — same P-core, more cores, higher frequency
2024: Intel Arrow Lake — ships without AVX-512, focused on power efficiency

Backward compatibility cost: An x86-64 CPU still supports Real Mode (16-bit), Virtual 8086 Mode, 16-bit and 32-bit Protected Mode, and Long Mode (64-bit, with a 32-bit compatibility submode). The CPU starts in Real Mode, and the firmware (BIOS/UEFI) or bootloader steps through these modes to reach Long Mode. Supporting all of them adds substantial hardware and microcode complexity, and the interactions between modes have been the source of several security vulnerabilities.

x86-64 Register Set

General-Purpose Registers (GPRs):

64-bit  32-bit  16-bit  8-bit high  8-bit low
RAX     EAX     AX      AH          AL
RBX     EBX     BX      BH          BL
RCX     ECX     CX      CH          CL
RDX     EDX     DX      DH          DL
RSI     ESI     SI      (none)      SIL
RDI     EDI     DI      (none)      DIL
RBP     EBP     BP      (none)      BPL
RSP     ESP     SP      (none)      SPL
R8      R8D     R8W     (none)      R8B
R9      R9D     R9W     (none)      R9B
R10     R10D    R10W    (none)      R10B
R11     R11D    R11W    (none)      R11B
R12     R12D    R12W    (none)      R12B
R13     R13D    R13W    (none)      R13B
R14     R14D    R14W    (none)      R14B
R15     R15D    R15W    (none)      R15B
RIP     EIP     IP      (none)      (none)  ← Instruction Pointer
RFLAGS  EFLAGS  FLAGS   (none)      (none)  ← Status flags

Important RFLAGS bits:
  CF (bit 0): Carry flag
  PF (bit 2): Parity flag
  ZF (bit 6): Zero flag
  SF (bit 7): Sign flag
  TF (bit 8): Trap flag (single-step debug)
  IF (bit 9): Interrupt flag (enables maskable interrupts)
  DF (bit 10): Direction flag (string operations)
  OF (bit 11): Overflow flag
  IOPL (bits 12-13): I/O Privilege Level
  NT (bit 14): Nested Task
  RF (bit 16): Resume Flag (suppresses #DB on next instruction)
  VM (bit 17): Virtual-8086 Mode
  AC (bit 18): Alignment Check
  VIF/VIP (bits 19-20): Virtual Interrupt Flag / Virtual Interrupt Pending (for VME)
  ID (bit 21): CPUID instruction supported
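
As a minimal sketch of how the bit positions above are used in practice, the following C helpers test individual flags and extract the 2-bit IOPL field from a saved RFLAGS value. The macro and function names are illustrative, not taken from any particular kernel or library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions copied from the RFLAGS list above (names are illustrative). */
#define RFLAGS_CF (1ULL << 0)
#define RFLAGS_ZF (1ULL << 6)
#define RFLAGS_IF (1ULL << 9)
#define RFLAGS_DF (1ULL << 10)

/* True if every bit in 'mask' is set in the saved RFLAGS value. */
static inline bool rflags_test(uint64_t rflags, uint64_t mask)
{
    return (rflags & mask) == mask;
}

/* IOPL is a 2-bit field occupying bits 13:12. */
static inline unsigned rflags_iopl(uint64_t rflags)
{
    return (unsigned)((rflags >> 12) & 0x3);
}
```

For example, a saved value of 0x246 has ZF and IF set, CF clear, and IOPL 0.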

Writing to 32-bit subregisters zero-extends to 64 bits. This is a critical x86-64 rule:

mov eax, 1    ; RAX = 0x0000000000000001  (zero-extends, not sign-extends)
mov ax, 1     ; RAX unchanged in bits 63:16; bits 15:0 = 0x0001 (NO zero-extend of upper)
mov al, 1     ; RAX unchanged in bits 63:8; bits 7:0 = 0x01  (NO zero-extend)
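
The rule can be modeled in plain C. This is only a sketch of the architectural behavior (the function names are made up for illustration); it treats a register as a uint64_t and returns the value it would hold after each kind of write.

```c
#include <stdint.h>

/* Model of writing a 32-bit subregister (e.g. EAX): the upper 32 bits are
 * cleared, so the old register contents do not matter. */
static inline uint64_t write_r32(uint64_t old, uint32_t v)
{
    (void)old;
    return (uint64_t)v;
}

/* Model of writing a 16-bit subregister (e.g. AX): bits 63:16 are preserved. */
static inline uint64_t write_r16(uint64_t old, uint16_t v)
{
    return (old & ~0xFFFFULL) | v;
}

/* Model of writing a low 8-bit subregister (e.g. AL): bits 63:8 are preserved. */
static inline uint64_t write_r8(uint64_t old, uint8_t v)
{
    return (old & ~0xFFULL) | v;
}
```

Starting from a register full of ones, write_r32 leaves 0x1, while write_r16 and write_r8 leave the upper bits untouched. That preserved-upper-bits behavior is also why 8-bit and 16-bit writes can create false dependencies on the old register value.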

SIMD Registers:

XMM0-XMM15  (SSE: 128-bit)       — named "legacy" in AVX context
YMM0-YMM15  (AVX/AVX2: 256-bit)  — upper 128 bits new; lower 128 bits alias XMM
ZMM0-ZMM31  (AVX-512: 512-bit)   — upper 256 bits new; lower 256 bits alias YMM, lower 128 bits alias XMM
k0-k7       (AVX-512 opmask registers: 16-64 bits) — for predicated SIMD operations

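Note: AMD64 added XMM8-15 to the original eight SSE registers (XMM0-7); AVX later widened all sixteen to YMM0-15.
      AVX-512 added registers 16-31; XMM16-31 and YMM16-31 exist as the low bits of ZMM16-31 but are reachable only through EVEX-encoded instructions (AVX-512VL).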

Segment Registers:

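CS  Code Segment  — low 2 bits of the selector give the current privilege level (CPL); in 64-bit mode base and limit are ignored, but the descriptor's L bit still selects 64-bit vs compatibility mode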
SS  Stack Segment — stack operations (PUSH/POP/CALL/RET)
DS  Data Segment  — data access (in 64-bit mode, base=0 for all, effectively unused)
ES  Extra Segment — (legacy string operations)
FS  F Segment     — 64-bit: FS.Base set to arbitrary value via WRFSBASE/MSR 0xC0000100
                          Used for thread-local storage (TLS) in userspace (Linux: %fs:0x28 = stack canary)
GS  G Segment     — 64-bit: GS.Base via WRGSBASE/MSR 0xC0000101
                          SWAPGS instruction: swap GS.Base with MSR_KERNEL_GS_BASE
                          Used by Linux kernel for per-CPU data pointer (%gs-relative access)

FS/GS for TLS (Thread-Local Storage):

// Linux glibc thread implementation
// Each thread's FS.Base points to its TLS block
// Access via __seg_fs or __thread (compiler handles)
// gcc: stack canary stored at FS:40 (0x28)
//    __builtin_ia32_rdfsbase64() reads FS.Base
// In glibc: struct pthread starts at FS:0

// Kernel: GS.Base → per-CPU struct
// On syscall entry: SWAPGS swaps GS.Base with MSR_KERNEL_GS_BASE
//   → GS now points to percpu area
// On syscall return: SWAPGS again, restores user GS.Base

Control Registers:

CR0: Machine control
  PE (bit 0):  Protected Mode Enable
  MP (bit 1):  Monitor coprocessor (FPU)
  EM (bit 2):  Emulation (FPU emulated if set)
  TS (bit 3):  Task Switched (for lazy FPU context switch — set on task switch; FPU use causes #NM)
  ET (bit 4):  Extension Type (386 only)
  NE (bit 5):  Numeric Error (FPU error signaling)
  WP (bit 16): Write Protect (when set, Ring 0 writes to read-only pages fault; needed for copy-on-write and kernel text protection)
  AM (bit 18): Alignment Mask
  NW (bit 29): Not Write-through
  CD (bit 30): Cache Disable
  PG (bit 31): Paging Enable (setting PG with IA32_EFER.LME=1 and CR4.PAE=1 activates Long Mode)

CR2: Page Fault Linear Address — set by MMU on #PF to the faulting virtual address
CR3: Page Directory Base Register — physical address of PML4 (top-level page table)
     Bit 3 (PWT), Bit 4 (PCD): cache hints for page table walks
     PCID (bits 11:0): Process Context Identifier (if CR4.PCIDE set)
CR4: Extended features
  PSE (bit 4):  Page Size Extension (4MB pages in 32-bit mode)
  PAE (bit 5):  Physical Address Extension (enables 36-bit+ physical addresses; required for Long Mode)
  PGE (bit 7):  Page Global Enable
  OSFXSR (bit 9): OS supports FXSAVE/FXRSTOR (required for SSE)
  UMIP (bit 11): User-Mode Instruction Prevention
  LA57 (bit 12): 5-level paging (57-bit VA space, PML5 as top-level page table)
  PCIDE (bit 17): Process Context Identifier Enable
  SMEP (bit 20): Supervisor Mode Execution Prevention (Ring 0 cannot execute user pages)
  SMAP (bit 21): Supervisor Mode Access Prevention (Ring 0 cannot access user pages without STAC)
  PKE (bit 22): Protection Key Enable
CR8: Task Priority Register (TPR) — controls which interrupts are serviced (0=all, 15=none)

Debug Registers:

DR0-DR3: Linear addresses of hardware breakpoints (up to 4 breakpoints)
DR6: Debug Status — records which breakpoint triggered, why (instruction/data/I/O)
DR7: Debug Control — enable/disable each breakpoint, condition (execution/write/read-write), size

Hardware breakpoints don't need to modify code (unlike INT3 software breakpoints).
Useful for: read/write watchpoints, non-invasive debugging of ROM/flash code.
gdb: watch <var> → DR0-DR3 hardware watchpoint
ptrace(PTRACE_POKEUSER, ..., DR7, ...) — set from debugger
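
To make the DR7 layout concrete, here is a hedged sketch of how a debugger might compute the DR7 value that arms one hardware breakpoint. It uses the architectural layout (local-enable bit Ln at bit 2n, R/Wn at bits 16+4n, LENn at bits 18+4n); the enum and function names are illustrative.

```c
#include <stdint.h>

/* R/W field encodings: what kind of access triggers the breakpoint. */
enum dr7_rw  { DR7_RW_EXEC = 0, DR7_RW_WRITE = 1, DR7_RW_IO = 2, DR7_RW_READWRITE = 3 };
/* LEN field encodings: note that 8 bytes is encoded as 2 and 4 bytes as 3. */
enum dr7_len { DR7_LEN_1 = 0, DR7_LEN_2 = 1, DR7_LEN_8 = 2, DR7_LEN_4 = 3 };

/* Return 'dr7' with breakpoint n (0..3) locally enabled for the given
 * condition and size. Execution breakpoints must use DR7_LEN_1, and
 * DR7_RW_IO is only meaningful when CR4.DE is set. */
static inline uint64_t dr7_arm(uint64_t dr7, unsigned n, enum dr7_rw rw, enum dr7_len len)
{
    unsigned ctl = 16 + 4 * n;            /* start of R/Wn, followed by LENn */
    dr7 &= ~(0xFULL << ctl);              /* clear the old R/Wn and LENn fields */
    dr7 |= (uint64_t)rw << ctl;
    dr7 |= (uint64_t)len << (ctl + 2);
    dr7 |= 1ULL << (2 * n);               /* Ln: local enable for breakpoint n */
    return dr7;
}
```

Arming breakpoint 0 as a 4-byte write watchpoint from an empty DR7 gives 0xD0001; the watched address itself goes into DR0.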

MSRs (Model-Specific Registers): x86-64 CPUs have thousands of MSRs, accessed via RDMSR/WRMSR (privileged instructions).

Notable MSRs (addresses):
  0x10     IA32_TIME_STAMP_COUNTER (TSC) — read via RDTSC instruction
  0x174    IA32_SYSENTER_CS — SYSENTER target CS
  0x175    IA32_SYSENTER_ESP — SYSENTER target RSP
  0x176    IA32_SYSENTER_EIP — SYSENTER target RIP
  0xC0000080 IA32_EFER — Extended Feature Enable Register
           SCE (bit 0): SYSCALL/SYSRET enable
           LME (bit 8): Long Mode Enable
           LMA (bit 10): Long Mode Active (read-only, set by CPU when paging+LME)
           NXE (bit 11): No-Execute Enable (enables NX bit in page tables)
  0xC0000081 IA32_STAR — SYSCALL/SYSRET CS/SS selectors
  0xC0000082 IA32_LSTAR — SYSCALL 64-bit handler RIP
  0xC0000084 IA32_FMASK — RFLAGS mask for SYSCALL
  0xC0000100 FS.Base — FS segment base (also writable by WRFSBASE instruction)
  0xC0000101 GS.Base
  0xC0000102 MSR_KERNEL_GS_BASE — swapped with GS.Base on SWAPGS
  0x1B       IA32_APIC_BASE — local APIC base address + enable bit
  0x277      IA32_PAT — Page Attribute Table (memory type overrides)
  0xC1-0xC4  IA32_PMC0-3 — general-purpose Performance Monitoring Counters (fixed counters at 0x309-0x30B; IA32_PERF_GLOBAL_CTRL at 0x38F)

x86-64 Paging: 4-Level and 5-Level

4-Level Paging (default, LA48 — 48-bit linear address):

48-bit Virtual Address:
  Bits 47:39 = PML4 index  (9 bits → 512 entries in PML4)
  Bits 38:30 = PDPT index  (9 bits → 512 entries in PDPT)
  Bits 29:21 = PD index    (9 bits → 512 entries in PD)
  Bits 20:12 = PT index    (9 bits → 512 entries in PT)
  Bits 11:0  = Page Offset (12 bits → 4096-byte pages)

Translation:
  CR3 → PML4 base (physical)
  PML4[VA[47:39]] → PDPT base
  PDPT[VA[38:30]] → PD base  (OR: 1GB huge page if PS=1 in PDPT entry)
  PD[VA[29:21]]   → PT base  (OR: 2MB large page if PS=1 in PD entry)
  PT[VA[20:12]]   → physical frame number
  Physical address = frame_number << 12 | VA[11:0]
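
The index extraction in the translation steps above can be sketched in C as follows. The struct and function names are illustrative; the shifts and masks follow directly from the 9/9/9/9/12 bit split.

```c
#include <stdint.h>

/* Table indices and page offset for a 48-bit (4-level) virtual address. */
struct va_parts {
    unsigned pml4;    /* bits 47:39 */
    unsigned pdpt;    /* bits 38:30 */
    unsigned pd;      /* bits 29:21 */
    unsigned pt;      /* bits 20:12 */
    unsigned offset;  /* bits 11:0  */
};

static inline struct va_parts split_va(uint64_t va)
{
    struct va_parts p;
    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For instance, 0xFFFFFFFF80000000 (the usual base of the Linux kernel text mapping) decomposes to PML4 index 511 and PDPT index 510, with the remaining indices zero.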

Page Table Entry (PTE) format (64-bit):
  Bit 0:    P — Present (if 0, page not in memory → #PF)
  Bit 1:    R/W — Read/Write (0 = read-only)
  Bit 2:    U/S — User/Supervisor (0 = kernel only)
  Bit 3:    PWT — Write Through (cache policy hint)
  Bit 4:    PCD — Cache Disable
  Bit 5:    A — Accessed (set by hardware on first access)
  Bit 6:    D — Dirty (set by hardware on first write, PTE only)
  Bit 7:    PS — Page Size (in PD: 1 = 2MB large page; in PDPT: 1 = 1GB huge page)
  Bit 8:    G — Global (not flushed by CR3 reload, used for kernel pages)
  Bits 11:9: AVL — Available for OS use
  Bits 51:12: Physical Frame Number (40 bits → 52-bit physical address, up to 4 PB)
  Bits 58:52: AVL — Available for OS use
  Bits 62:59: PK — Protection Key (4-bit key index, used when CR4.PKE=1; otherwise ignored)
  Bit 63:   NX — No Execute (requires IA32_EFER.NXE=1)
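
A short sketch of decoding a leaf (4KB) PTE according to the format above. The macro and function names are illustrative. It looks at a single entry only; on real hardware the effective permissions are the combination of the entries at every level of the walk.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P         (1ULL << 0)              /* Present */
#define PTE_RW        (1ULL << 1)              /* Writable */
#define PTE_US        (1ULL << 2)              /* User-accessible */
#define PTE_NX        (1ULL << 63)             /* No-execute */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL    /* bits 51:12 */

/* Physical address of the 4KB frame this entry points to. */
static inline uint64_t pte_phys_addr(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* True if this entry alone permits user-mode writes. */
static inline bool pte_user_writable(uint64_t pte)
{
    const uint64_t need = PTE_P | PTE_RW | PTE_US;
    return (pte & need) == need;
}

/* True if the page is present and not marked no-execute. */
static inline bool pte_executable(uint64_t pte)
{
    return (pte & PTE_P) != 0 && (pte & PTE_NX) == 0;
}

/* Protection key index stored in bits 62:59. */
static inline unsigned pte_pkey(uint64_t pte)
{
    return (unsigned)((pte >> 59) & 0xF);
}
```

An entry value of 0x8000000012345067 therefore describes a present, user-writable, non-executable page at physical address 0x12345000 with protection key 0.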

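5-Level Paging (LA57, 57-bit VA; Intel Ice Lake server and later, AMD Zen 4):

  • Adds a PML5 level above PML4
  • Bits 56:48 = PML5 index (9 bits)
  • VA space: 2^57 bytes = 128 PiB in total, split evenly between the lower (user) and upper (kernel) canonical halves
  • Enabled via CR4.LA57 = 1; CPUID.(EAX=7,ECX=0):ECX bit 16 reports support
  • Linux: support was merged in 4.14 and has been enabled by default in the kernel configuration since 5.5; it is used when the CPU supports it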

TLB (Translation Lookaside Buffer), with figures typical of a recent Intel core (exact sizes, associativity, and latencies vary by microarchitecture):

  • L1 DTLB: 64 entries (4KB pages), 4-way, 1-cycle hit latency
  • L1 ITLB: 256 entries (4KB pages), 8-way
  • L2 TLB (STLB): 2048 entries (unified D+I), 8-way, ~12-cycle hit
  • TLB miss: a 4-level page walk needs up to 4 memory accesses (often served from the L1/L2 cache or paging-structure caches if nearby pages were walked recently)
  • INVLPG: invalidates the TLB entry for a single page
  • CR3 reload: flushes all non-global TLB entries; the refills that follow make it expensive (hundreds of cycles or more)
  • PCID (Process Context ID): lets several address spaces coexist in the TLB; CR3[11:0] holds the current PCID and TLB entries are tagged with it, so a CR3 write with bit 63 set can skip the flush

x86 Port I/O

Legacy from 8086: separate I/O address space (64KB, 65536 ports).

IN  AL, DX    ; read byte from I/O port DX into AL
OUT DX, AL    ; write byte AL to I/O port DX
IN  AX, DX    ; read word from I/O port DX
OUT DX, AX    ; write word AX to I/O port DX
IN  EAX, DX   ; read dword
OUT DX, EAX   ; write dword

Port I/O privilege: Requires CPL ≤ IOPL (RFLAGS bits 13:12), or the TSS I/O permission bitmap must allow the specific port. Kernel code (CPL=0) can always access all ports.
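
The permission check can be modeled in C. This is a simplified sketch of the architectural rule, not code from any kernel: IOPL comes from RFLAGS bits 13:12, a set bit in the TSS I/O bitmap denies the corresponding port, and every byte of a multi-byte access must be allowed. The function name and the flat bitmap parameter are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if an IN/OUT of 'width' bytes (1, 2 or 4) at 'port' would be
 * permitted. 'bitmap' models the TSS I/O permission bitmap: bit N set means
 * port N is denied. Ports beyond the end of the bitmap are denied. */
static inline bool port_io_allowed(unsigned cpl, unsigned iopl,
                                   const uint8_t *bitmap, size_t bitmap_bytes,
                                   uint16_t port, unsigned width)
{
    if (cpl <= iopl)
        return true;                       /* privileged enough: bitmap not consulted */

    for (unsigned i = 0; i < width; i++) {
        uint32_t p = (uint32_t)port + i;   /* each byte of the access is checked */
        if (p / 8 >= bitmap_bytes)
            return false;
        if (bitmap[p / 8] & (1u << (p % 8)))
            return false;
    }
    return true;
}
```

With a two-byte bitmap {0x00, 0xFF}, a CPL 3 task with IOPL 0 may read port 0x04 but not port 0x08, and a 2-byte access at port 0x07 is refused because it spills into port 0x08.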

Legacy devices via port I/O: the ACPI SMI command port (commonly 0xB2, as reported in the FADT), PCI configuration space at ports 0xCF8/0xCFC, and the keyboard controller (8042) at 0x60/0x64. Modern firmware and operating systems prefer memory-mapped I/O (MMIO), for example PCIe ECAM regions described by ACPI tables.
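
As an illustration of the 0xCF8/0xCFC mechanism, the sketch below builds the 32-bit value written to the CONFIG_ADDRESS port (0xCF8) before the data is read or written through CONFIG_DATA (0xCFC). Only the address computation is shown, so it runs anywhere; the function name is illustrative.

```c
#include <stdint.h>

/* Build a PCI configuration mechanism #1 address:
 *   bit 31      enable
 *   bits 23:16  bus
 *   bits 15:11  device (0..31)
 *   bits 10:8   function (0..7)
 *   bits 7:2    register, dword-aligned (low two bits forced to 0) */
static inline uint32_t pci_config_address(uint8_t bus, uint8_t dev,
                                          uint8_t func, uint8_t offset)
{
    return 0x80000000u
         | ((uint32_t)bus << 16)
         | ((uint32_t)(dev & 0x1F) << 11)
         | ((uint32_t)(func & 0x07) << 8)
         | ((uint32_t)offset & 0xFCu);
}
```

A driver would write this value with OUT to port 0xCF8 and then use IN or OUT on port 0xCFC; reading offset 0 of bus 0, device 0, function 0 returns the vendor and device ID of the host bridge.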

CPUID Instruction

CPUID is the mechanism for software to discover CPU capabilities at runtime.

; Enumerate CPUID leaf 0 (max leaf and vendor string)
xor eax, eax
cpuid
; EAX = maximum leaf supported by this CPU
; EBX:EDX:ECX = vendor string (12 bytes)
; "GenuineIntel", "AuthenticAMD", "HygonGenuine", etc.

; Leaf 1: Feature flags
mov eax, 1
cpuid
; ECX bit 28: AVX
; ECX bit 20: SSE4.2
; ECX bit 12: FMA
; EDX bit 23: MMX
; EDX bit 25: SSE
; EDX bit 26: SSE2
; EBX bits 15:8: CLFLUSH line size / 8 (typically 8 → 64-byte cache lines)

; Leaf 7 (structured extended feature): AVX-512, SMAP, SMEP, etc.
mov eax, 7
xor ecx, ecx
cpuid
; EBX bit 3: BMI1
; EBX bit 5: AVX2
; EBX bit 7: SMEP (Supervisor Mode Execution Prevention)
; EBX bit 8: BMI2
; EBX bit 16: AVX512F (Foundation)
; EBX bit 20: SMAP (Supervisor Mode Access Prevention)
; ECX bit 2: UMIP (User Mode Instruction Prevention)
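
The register values returned by CPUID still need decoding. The following C sketch takes raw register values as plain integers, so it does not execute CPUID itself; on x86 the inputs would typically come from the compiler's cpuid.h helpers. It assembles the vendor string in EBX, EDX, ECX order and tests two leaf 7 feature bits. The function names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assemble the 12-byte vendor string from leaf 0. Each register holds four
 * ASCII characters, lowest byte first, in the order EBX, EDX, ECX. */
static inline void cpuid_vendor(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xFF);
    out[12] = '\0';
}

/* Leaf 7, subleaf 0, EBX bit 5: AVX2. */
static inline bool leaf7_has_avx2(uint32_t ebx)
{
    return ((ebx >> 5) & 1u) != 0;
}

/* Leaf 7, subleaf 0, EBX bit 16: AVX512F. */
static inline bool leaf7_has_avx512f(uint32_t ebx)
{
    return ((ebx >> 16) & 1u) != 0;
}
```

Feeding in EBX=0x756e6547, EDX=0x49656e69, ECX=0x6c65746e yields "GenuineIntel". Note that a set AVX or AVX-512 feature bit is not sufficient on its own: software should also confirm through XGETBV that the OS has enabled the corresponding register state.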

CPUID in hypervisors: When running in a VM, the hypervisor intercepts CPUID and may mask/expose features selectively. This is how cloud instances report a subset of host CPU features, or how a KVM guest sees a "QEMU Virtual CPU."

Linux cpu_has() / boot_cpu_has(): The Linux kernel runs CPUID at boot to populate a feature bitmap per CPU (x86_capability[] in struct cpuinfo_x86) and provides boot_cpu_has(X86_FEATURE_AVX512F), cpu_has(c, X86_FEATURE_AVX512F), and similar helpers for runtime checks.

x86 Legacy Baggage

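Real Mode (startup mode for all x86):

  • After reset, the first instruction is fetched from physical address 0xFFFFFFF0 (CS selector F000 with a hidden base of 0xFFFF0000, IP = FFF0); the original 8086 started at 0xFFFF0
  • Segment:Offset addressing, 20-bit address space (1 MB)
  • No protection, no paging, 16-bit default operand and address size
  • Linux is entered via a bootloader (e.g. GRUB); the mode switches are performed by the firmware, the bootloader, or the kernel's own setup code, depending on the boot path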

System Management Mode (SMM):

  • Entered via SMI (System Management Interrupt); more privileged than Ring 0 and transparent to the OS
  • SMM handler runs in SMRAM (reserved DRAM region, not accessible to the OS)
  • Used for: ACPI power management, hardware error handling, legacy USB emulation
  • Security concern: firmware bugs at this level can provide rootkit-level persistence (ThinkPwn, 2016, was an SMM code-execution flaw; LoJax, 2018, was a UEFI rootkit persisting in SPI flash)

A20 Line:

  • The 8086 had a 20-bit address bus, so segment:offset addresses above 1 MB wrapped around to 0. Some IBM PC software relied on that wraparound, and the 80286 with its 24-bit bus broke the assumption.
  • The IBM PC/AT added an A20 gate (controlled via the 8042 keyboard controller) that can force address line 20 to zero, preserving the wrap behavior.
  • The BIOS or bootloader enables A20 before switching to protected mode; otherwise every odd megabyte would alias the one below it.
  • Linux still enables A20 in early boot code (arch/x86/boot/a20.c).
  • Modern CPUs and firmware no longer have the A20 problem, but the code remains for compatibility.
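
The wraparound is easy to reproduce with real-mode address arithmetic. The sketch below (function name illustrative) computes the linear address for a segment:offset pair and models a disabled A20 gate by forcing address bit 20 to zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Real-mode linear address: segment * 16 + offset. The largest result is
 * 0xFFFF0 + 0xFFFF = 0x10FFEF, which needs address bit 20. With the A20
 * gate disabled, that bit is forced to 0 and the address wraps to low memory. */
static inline uint32_t real_mode_linear(uint16_t seg, uint16_t off, bool a20_enabled)
{
    uint32_t linear = ((uint32_t)seg << 4) + off;
    if (!a20_enabled)
        linear &= ~(1u << 20);
    return linear;
}
```

FFFF:0010 maps to 0x100000 with A20 enabled and to 0x00000 with it disabled, which is exactly the 8086 wrap behavior that old software depended on.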

Global Descriptor Table (GDT): In 64-bit mode, x86 segmentation is largely disabled (all segment bases = 0 except FS and GS), but the GDT still exists and must be set up correctly:

GDT entry (8 bytes):
  Bytes 0-1: Limit 15:0
  Bytes 2-4: Base 23:0
  Byte 5:    Access byte (type, privilege, present)
  Byte 6:    Limit 19:16 (low nibble) + Flags (high nibble: granularity, size, L-bit for 64-bit code segment)
  Byte 7:    Base 31:24

In 64-bit mode, CS.L=1 selects 64-bit code execution.
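Ring 0 code, Ring 0 data, Ring 3 code, and Ring 3 data segments are needed in practice (SYSCALL/SYSRET derive their selectors from fixed offsets relative to the values in MSR_STAR).
A TSS (Task State Segment) descriptor in the GDT (16 bytes in 64-bit mode) is required: the TSS supplies RSP0 and the IST stacks used on privilege change.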
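
The byte layout of an 8-byte code or data descriptor can be expressed as a small packing function. This is a sketch with an illustrative name; it places limit bits 19:16 in the low nibble of byte 6 and the flags in the high nibble.

```c
#include <stdint.h>

/* Pack an 8-byte code/data segment descriptor.
 *   access: byte 5 (present, DPL, type)
 *   flags:  high nibble of byte 6 (bit 3 = G, bit 2 = D/B, bit 1 = L, bit 0 = AVL) */
static inline uint64_t gdt_entry(uint32_t base, uint32_t limit,
                                 uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);                 /* limit 15:0  -> bits 15:0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;          /* base 23:0   -> bits 39:16 */
    d |= (uint64_t)access << 40;                     /* access byte -> bits 47:40 */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;      /* limit 19:16 -> bits 51:48 */
    d |= (uint64_t)(flags & 0xF) << 52;              /* flags       -> bits 55:52 */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;      /* base 31:24  -> bits 63:56 */
    return d;
}
```

With access 0x9A and flags 0xA (G=1, L=1) this produces 0x00AF9A000000FFFF, a commonly used Ring 0 64-bit code segment; access 0x92 with flags 0xC gives 0x00CF92000000FFFF for a flat data segment.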

x86-64 System Call Path:

User mode (Ring 3):
  SYSCALL instruction → saves RIP to RCX, RFLAGS to R11
                      → loads RIP from MSR_LSTAR (kernel syscall handler)
                      → sets CS/SS from MSR_STAR
                      → masks RFLAGS per MSR_FMASK
                      → does NOT change RSP (kernel does it via SWAPGS + percpu RSP)

Kernel (Ring 0) entry (arch/x86/entry/entry_64.S, Linux; simplified, offset names illustrative):
  SWAPGS             → GS now points to per-CPU struct
  mov [gs:user_rsp_scratch], rsp   → stash user RSP (SYSCALL did not save it)
  mov rsp, [gs:percpu_rsp_offset]  → switch to kernel stack
  push ...           → build the saved frame: user SS, RSP, R11 (RFLAGS), CS, RCX (RIP), then remaining registers
  call do_syscall_64 → dispatch to system call handler
  ...
  pop ...            → restore registers, including user RSP
  SWAPGS             → restore user GS
  SYSRETQ            → restore RIP from RCX, RFLAGS from R11, drop to Ring 3
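
The step "sets CS/SS from MSR_STAR" follows fixed arithmetic, which is why the GDT entries must be laid out in a specific order. A minimal sketch, with illustrative names, of how the selectors are derived from the 64-bit STAR value:

```c
#include <stdint.h>

struct syscall_selectors {
    uint16_t syscall_cs, syscall_ss;   /* loaded by SYSCALL (Ring 0) */
    uint16_t sysret_cs, sysret_ss;     /* loaded by 64-bit SYSRET (Ring 3) */
};

/* STAR[47:32] is the kernel selector base, STAR[63:48] the user base.
 * SYSCALL:        CS = base & ~3,        SS = base + 8
 * SYSRET (64-bit): CS = (base + 16) | 3,  SS = (base + 8) | 3 */
static inline struct syscall_selectors star_selectors(uint64_t star)
{
    uint16_t kbase = (uint16_t)((star >> 32) & 0xFFFF);
    uint16_t ubase = (uint16_t)((star >> 48) & 0xFFFF);
    struct syscall_selectors s;
    s.syscall_cs = (uint16_t)(kbase & 0xFFFC);
    s.syscall_ss = (uint16_t)(kbase + 8);
    s.sysret_cs  = (uint16_t)((ubase + 16) | 3);
    s.sysret_ss  = (uint16_t)((ubase + 8) | 3);
    return s;
}
```

With a kernel base of 0x10 and a user base of 0x23 (values that match the selector layout Linux commonly uses), SYSCALL loads CS=0x10, SS=0x18 and SYSRET loads CS=0x33, SS=0x2B.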

x86-64 Register Layout Diagram

x86-64 Register Overview:

General Purpose:
  RAX [63:0] → EAX [31:0] → AX [15:0] → AH [15:8] / AL [7:0]
  RBX [63:0] → EBX [31:0] → BX [15:0] → BH / BL
  RCX, RDX, RSI, RDI, RBP, RSP (similar sub-register aliases)
  R8-R15 [63:0] → R8D-R15D [31:0] → R8W-R15W [15:0] → R8B-R15B [7:0]

  RIP [63:0]      ← Instruction Pointer
  RFLAGS [63:0]   ← Status + Control flags

SIMD/FP:
  ZMM0-ZMM31 [511:0]  (AVX-512)
    YMM0-YMM15 [255:0] ← lower half of ZMM0-15 (AVX/AVX2)
      XMM0-XMM15 [127:0] ← lower half of YMM0-15 (SSE)
  k0-k7 [63:0]    ← Opmask registers (AVX-512 predicates)

Segment:
  CS, SS, DS, ES, FS, GS
  GDTR, IDTR (16-bit limit + 64-bit base in long mode), LDTR, TR (16-bit selectors plus hidden descriptor state)

Control:
  CR0, CR2, CR3, CR4, CR8

Debug:
  DR0-DR3, DR6, DR7

MSRs (accessed via RDMSR/WRMSR): ~3000 registers
  Key: IA32_EFER, MSR_LSTAR, MSR_STAR, FS.Base, GS.Base, TSC, PMCs

Historical Context

AMD shipped x86-64 (AMD64) in 2003, at a time when Intel was pursuing IA-64 (Itanium), an incompatible 64-bit ISA that required applications to be recompiled. Intel's long-term 64-bit strategy centered on IA-64 rather than on extending x86. AMD bet that backward compatibility would win, extended x86 to 64 bits cleanly (16 GPRs, RIP-relative addressing, a clean 64-bit mode), and was vindicated when Opteron and Athlon 64 outsold Itanium by a wide margin. Intel adopted AMD64 as "EM64T" (later "Intel 64") in 2004 after the market failure of Itanium became clear. Itanium was discontinued in 2021. The System V AMD64 ABI is now the standard 64-bit calling convention on Linux, macOS, and the BSDs; Windows uses its own x64 convention. AMD's decision to design for backward compatibility rather than purity defines the processor landscape today.

Production Examples

Linux kernel x86_64: arch/x86/ contains 230,000+ lines of x86-64-specific kernel code. Key files: entry_64.S (system call/interrupt entry), head_64.S (kernel startup), mm/ (page table management), kernel/cpu/ (CPUID, feature detection).

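Intel Alder Lake (2021): Uses x86-64 with LA48 paging (4-level). The P-cores contain AVX-512 hardware but the E-cores do not, and because an OS expects every core to expose the same ISA, Intel disabled AVX-512 on these hybrid parts (early firmware allowed enabling it only with the E-cores turned off). Raptor Lake and Arrow Lake likewise ship without AVX-512.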

AMD Zen 4 (2022): Supports AVX-512 on all cores, executing 512-bit operations on 256-bit datapaths. 5-level paging (LA57) is supported. AMD-specific features are reported through extended CPUID leaves such as 0x80000001 and 0x80000021.

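Docker/OCI container security: Standard containers are isolated by kernel namespaces, cgroups, and seccomp, not by hardware virtualization, so CPUID executes natively and reports the host's features. VM-backed runtimes (for example Kata Containers or Firecracker) add hardware isolation through the x86-64 virtualization extensions (VMX on Intel, SVM on AMD); there the hypervisor intercepts CPUID and decides which features the guest sees.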

Debugging Notes

CPUID enumeration tool:

cpuid        # userspace tool, shows all CPUID leaves
cat /proc/cpuinfo  # Linux: shows flags from CPUID interpretation

Register inspection in gdb:

(gdb) info registers   # shows general-purpose + flags
(gdb) p/x $rax         # print RAX in hex
(gdb) p/x $ymm0        # print full 256-bit YMM0
(gdb) info registers xmm0  # lower 128-bit of YMM0

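Detecting missed use of 32-bit zero-extension: In compiler output (or perf annotate on a hot loop), look for an explicit zero-extending instruction after a 32-bit operation whose result the hardware has already zero-extended. gcc -O2 -S -fverbose-asm produces annotated assembly that makes these choices visible.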

GDT debugging (kernel):

// In kernel code:
store_gdt(&gdt_desc);  // save current GDT
// gdt_desc.size = N*8-1, gdt_desc.address = virtual address of GDT
// Verify GDT entries with /proc/kcore or crash dump tools

MSR reading from userspace (msr-tools, requires root):

modprobe msr           # load the msr driver, which exposes /dev/cpu/N/msr
rdmsr -p 0 0xC0000080  # read IA32_EFER on CPU 0

Security Implications

SMEP (Supervisor Mode Execution Prevention): CR4.SMEP prevents Ring 0 from executing code in user-space pages. Mitigates kernel exploits that jump to shellcode placed in user memory. Supported since Linux 3.0 and enabled by default on CPUs that implement it.

SMAP (Supervisor Mode Access Prevention): CR4.SMAP prevents Ring 0 from reading or writing user-space memory unless access is explicitly opened with STAC and closed again with CLAC. Mitigates kernel exploits that make the kernel dereference attacker-controlled user pointers on unexpected code paths.

UMIP (User Mode Instruction Prevention): CR4.UMIP prevents user mode from executing SGDT, SIDT, SLDT, STR, SMSW — which could leak kernel address layout. Enabled by default on modern kernels.

CET (Control-flow Enforcement Technology): Shadow stack (SHSTK) + indirect branch tracking (IBT). Shadow stack stores a second copy of return addresses in a CPU-protected region; RET must match the shadow stack or the CPU raises a control-protection fault (#CP). IBT requires indirect branch targets to begin with an ENDBR64 instruction; an indirect jump or call to any other address faults. Shadow stacks block most ROP (Return-Oriented Programming) exploits, and IBT restricts JOP/COP-style attacks. Linux has supported kernel IBT since 5.18 and user-space shadow stacks since 6.6.
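
To illustrate the IBT rule, this sketch checks whether a code address would be a valid indirect-branch target by looking for the ENDBR64 encoding (F3 0F 1E FA) at its start. It is a model of the check only (the function name is illustrative); the real enforcement is done by the CPU.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the first four bytes at 'code' are ENDBR64 (F3 0F 1E FA).
 * On CPUs without CET this encoding executes as a NOP, which is why
 * compilers can emit it unconditionally. */
static inline bool starts_with_endbr64(const uint8_t *code, size_t len)
{
    static const uint8_t endbr64[4] = { 0xF3, 0x0F, 0x1E, 0xFA };
    if (len < sizeof endbr64)
        return false;
    for (size_t i = 0; i < sizeof endbr64; i++)
        if (code[i] != endbr64[i])
            return false;
    return true;
}
```

A function compiled with gcc or clang using -fcf-protection=branch typically begins with these four bytes, followed by the usual prologue.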

A20 and real mode vectors: SMM code often has access to real-mode IVT at physical address 0. A bug in SMM handler processing the A20 transition can corrupt the interrupt vector table.

CR0.WP bypass: If CR0.WP = 0, Ring 0 can write to read-only pages. A kernel vulnerability that clears CR0.WP allows overwriting any kernel code. SMEP additionally prevents executing user pages after this.

Performance Implications

ABI (Application Binary Interface) calling conventions: Linux x86-64 ABI (System V AMD64 ABI): arguments in RDI, RSI, RDX, RCX, R8, R9 (then stack). Return value in RAX (+ RDX for 128-bit). Windows x64 ABI: arguments in RCX, RDX, R8, R9, then stack. This difference requires careful attention for cross-platform code or when calling Windows code from Linux.

RIP-relative addressing: AMD64 added RIP-relative addressing mode: [RIP + offset]. Enables position-independent code (PIC) without segment tricks. Essential for shared libraries (DSOs) and kernel modules. Eliminates the need for GOT (Global Offset Table) indirect loads for local symbols.
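
The address computation is simple but has one common pitfall: the displacement is relative to the address of the next instruction, not the current one. A minimal sketch (function name illustrative) of how a disassembler or loader would resolve a RIP-relative operand:

```c
#include <stdint.h>

/* Effective address of a [RIP + disp32] operand.
 * insn_addr: address of the first byte of the instruction
 * insn_len:  total encoded length of the instruction in bytes
 * disp32:    signed 32-bit displacement from the encoding */
static inline uint64_t rip_relative_target(uint64_t insn_addr, unsigned insn_len,
                                           int32_t disp32)
{
    uint64_t next_rip = insn_addr + insn_len;       /* RIP after fetch/decode */
    return next_rip + (uint64_t)(int64_t)disp32;    /* sign-extend, wrap mod 2^64 */
}
```

A 7-byte instruction at 0x401000 with displacement 0x2000 therefore references 0x403007. Because the displacement is 32 bits, the target must lie within ±2 GB of the instruction.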

REX prefix overhead: 64-bit instructions accessing R8-R15 or requiring 64-bit operand size need a REX prefix byte. The cost is code size (one extra byte per such instruction, and therefore slightly more instruction-cache and fetch-bandwidth pressure); the prefix does not add decode latency on its own.
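
For reference, the REX byte has the fixed form 0100WRXB. The sketch below (function name illustrative) assembles it from its four fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Build a REX prefix byte: 0100 W R X B.
 *   W: 64-bit operand size
 *   R: extends the ModRM.reg field (selects R8-R15)
 *   X: extends the SIB.index field
 *   B: extends ModRM.rm, SIB.base, or the opcode register field */
static inline uint8_t rex_prefix(bool w, bool r, bool x, bool b)
{
    return (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | (b << 0));
}
```

For example, mov rax, rbx encodes as 48 89 D8, where 0x48 is REX with only W set, while the 32-bit mov eax, ebx is just 89 D8 with no prefix.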

AVX-512 frequency throttling: On Intel CPUs, sustained AVX-512 use can reduce clock frequency by roughly 100-400 MHz (the "AVX-512 license" levels), because the wide SIMD units consume significantly more power. The effect was strongest on Skylake-SP-era parts and is smaller on later generations. The absence of AVX-512 on hybrid consumer parts (Alder Lake through Arrow Lake) is driven mainly by the E-cores lacking it rather than by throttling. AMD Zen 4 executes AVX-512 on 256-bit datapaths and does not apply a separate AVX-512 frequency license.

Failure Modes and Real Incidents

Incident: Spectre/Meltdown (2018): Exploited out-of-order and speculative execution to read memory across privilege boundaries. Spectre affects most high-performance CPUs from several vendors; Meltdown mainly affected Intel. Meltdown relied on kernel pages being mapped (but marked U/S=0) in every process's page tables for performance (avoiding a CR3 switch on syscall). KPTI removed those mappings from the user-mode page tables.

Incident: AMD Zen 2 RDRAND firmware bug (2019): With launch firmware on Ryzen 3000 (Zen 2) systems, RDRAND returned 0xFFFFFFFF on every call while still signaling success (CF=1). Software that consumed RDRAND output directly was silently weakened or broke outright (some Linux distributions failed to boot because systemd relied on it). AMD fixed the issue with a firmware (AGESA) update.

Incident: Intel SYSRET privilege escalation (CVE-2012-0217): On Intel CPUs, SYSRET in long mode with a non-canonical address in RCX raises #GP (General Protection Fault) while still in Ring 0, because the fault is delivered before the privilege drop. The exception handler then runs on the Ring 3 RSP value, which the attacker controls, enabling a stack pivot. Fixed by operating systems checking that the return address is canonical before SYSRET and returning via IRET otherwise.
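
The check that guards against this class of bug is a canonical-address test. A minimal sketch, with an illustrative function name, that works for both 48-bit (4-level) and 57-bit (5-level) paging:

```c
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical when bits 63 down to (va_bits - 1) are all equal,
 * i.e. the upper bits are a sign extension of the top implemented bit.
 * va_bits is 48 for 4-level paging and 57 for 5-level paging. */
static inline bool is_canonical(uint64_t va, unsigned va_bits)
{
    uint64_t upper = va >> (va_bits - 1);                    /* bits 63..va_bits-1 */
    uint64_t all_ones = (1ULL << (64 - va_bits + 1)) - 1;
    return upper == 0 || upper == all_ones;
}
```

With 48-bit addressing, 0x00007FFFFFFFFFFF and 0xFFFF800000000000 are canonical while 0x0000800000000000 is not; the same value becomes canonical under 57-bit addressing. A kernel return path would take the slower IRET route whenever the user RIP fails this test.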

Incident: A20 remnant in modern kernel causing startup hang: A legacy BIOS that didn't properly enable A20 before transferring to the bootloader caused Linux to hang during kernel decompression on a specific OEM laptop (2015, bugzilla). The kernel's own A20 enable code in arch/x86/boot/a20.c fixed it, but only after a 5-second delay trying multiple A20 methods (keyboard controller, BIOS call, fast A20 gate on port 0x92).

Modern Usage

Intel CET deployed (2020+): Chrome enabled CET shadow stacks on Windows in 2021, and Linux 6.6 added user-space shadow stack support. Kernel IBT (indirect branch tracking) has been available for hardening since Linux 5.18. Firefox deployed CET in 2020 for Windows.

5-level paging (LA57) in production: Linux has enabled LA57 support by default since 5.5. AMD EPYC Genoa and Intel Sapphire Rapids support it in hardware. Most cloud providers don't enable it by default (few workloads need more than 128 TiB of user address space, and the fifth level lengthens page walks slightly).

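PKU (Protection Keys for Userspace): Tags each page with a 4-bit key; the per-thread PKRU register holds an access-disable and a write-disable bit for each of the 16 keys and can be updated from user mode with WRPKRU, without a syscall. Linux exposes it through pkey_alloc() and pkey_mprotect(), with wrappers in glibc. Typical uses are in-process sandboxing, such as isolating JIT code or protecting key material.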
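
The PKRU register packs two bits per key: bit 2k is access-disable (AD) and bit 2k+1 is write-disable (WD) for key k. The sketch below models how a PKRU value is interpreted and updated; the function names are illustrative and it does not execute RDPKRU or WRPKRU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Data reads of pages tagged with 'key' are allowed when AD (bit 2*key) is clear. */
static inline bool pkru_can_read(uint32_t pkru, unsigned key)
{
    return ((pkru >> (2 * key)) & 1u) == 0;
}

/* Writes additionally require WD (bit 2*key + 1) to be clear. */
static inline bool pkru_can_write(uint32_t pkru, unsigned key)
{
    return ((pkru >> (2 * key)) & 3u) == 0;
}

/* Return 'pkru' with the AD/WD bits of 'key' replaced. */
static inline uint32_t pkru_set(uint32_t pkru, unsigned key, bool ad, bool wd)
{
    pkru &= ~(3u << (2 * key));
    pkru |= ((uint32_t)ad | ((uint32_t)wd << 1)) << (2 * key);
    return pkru;
}
```

Linux starts threads with PKRU = 0x55555554, which leaves key 0 fully accessible and access-disables every other key until the program opens it. Protection keys restrict data accesses only; instruction fetches are not affected.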

Future Directions

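  • APX (Advanced Performance Extensions, announced by Intel in 2023): Doubles GPRs to 32 (R16-R31), adds 3-operand forms with a separate destination register, and reduces register spills. Not implemented in Arrow Lake; planned for later Intel cores.
  • RAO-INT (Remote Atomic Operations on Integers, Intel): Weakly ordered atomic add/and/or/xor updates (AADD, AAND, AOR, AXOR) that return no result and can be performed close to where the data resides instead of pulling the cache line into the core; aimed at contended shared counters and flags
  • LAM (Linear Address Masking): Lets user space store metadata in otherwise-unused upper pointer bits, which the CPU ignores on dereference instead of raising a non-canonical fault (useful for software memory tagging, e.g. HWASan-style sanitizers)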
  • x86-64 deprecation question: ARM server share growing (AWS Graviton, Ampere Altra). x86 ISA complexity is real but backward compatibility moat remains formidable for general server workloads

Exercises

  1. CPUID feature detector: Write a C program using inline assembly (__cpuid() or raw CPUID) to detect: (a) AVX-512F support, (b) CET shadow stack support, (c) PKU support, (d) LA57 support. Print results and compare to /proc/cpuinfo flags.

  2. GDT construction: Write a 64-bit x86 assembly program (can run as UEFI app or kernel module) that: reads the current GDTR, dumps all GDT entries, identifies the Ring 0/Ring 3 code and data segments, and validates that the TSS descriptor is present and correct.

  3. Page table walker: Write a Linux kernel module that walks the 4-level page table for a given user-space virtual address. Print each level's physical address, entry value, and flags (P, R/W, U/S, NX). Compare with /proc/<pid>/pagemap.

  4. SIMD register inspection: Write a C function using AVX intrinsics that loads a 256-bit value into a YMM register. Use GDB to inspect the full YMM/ZMM register state. Demonstrate zero-extension behavior of 32-bit register writes.

  5. SYSCALL path timing: Measure the overhead of the SYSCALL/SYSRET path. Use getpid() (a minimal syscall) and clock_gettime(CLOCK_MONOTONIC). Subtract the overhead of a function call. Report cycles and nanoseconds. Compare with SYSENTER on 32-bit code (if available). Analyze how KPTI and Spectre mitigations affect the measurement.

References

  • Intel 64 and IA-32 Architectures Software Developer's Manual (SDM), 4 volumes, Intel 2024
  • Volume 1: Basic Architecture
  • Volume 2: Instruction Set Reference
  • Volume 3: System Programming Guide
  • Volume 4: Model-Specific Registers
  • AMD64 Architecture Programmer's Manual, AMD 2023 (volumes 1-5)
  • System V Application Binary Interface AMD64 Architecture Processor Supplement v1.0
  • Intel Control-flow Enforcement Technology Specification
  • Cyberus Technology, Meltdown Technical Paper, 2018
  • Lipp et al., "Meltdown: Reading Kernel Memory from User Space," USENIX Security 2018
  • Intel APX Specification, 2024
  • Linux kernel source: arch/x86/include/asm/, arch/x86/entry/entry_64.S