CPU Privilege Rings

Technical Overview

CPU privilege rings are a hardware-enforced mechanism for segmenting software into trust levels. The CPU tracks the current privilege level in hardware state that software cannot directly modify — only the hardware itself changes it in response to defined instructions and exception mechanisms. Code running at a lower privilege cannot execute instructions reserved for higher privilege levels; attempting to do so causes a hardware exception. This mechanism, present in some form on every modern general-purpose CPU, is the foundation upon which operating system security models are built.

The x86 architecture defines four rings (0–3), ARM defines four exception levels (EL0–EL3), and RISC-V defines three privilege modes (U/S/M). While the details differ, the core idea is universal: hardware supports a small set of privilege tiers, the OS uses them to separate kernel code from application code, and the hardware enforces the boundaries without trusting the software to do so.

Prerequisites

01-what-is-a-kernel.md: kernel vs OS distinction
02-user-space-vs-kernel-space.md: the fundamental separation
Basic understanding of CPU instruction execution
Familiarity with the concept of memory-mapped registers

Core Content

Intel x86 Protection Rings

Intel's ring architecture was introduced with the iAPX 286 (1982) and carried forward into the 32-bit 386 (1985). It defines four privilege levels. The Current Privilege Level (CPL) is held in bits 0–1 of the CS (Code Segment) register:

         MOST PRIVILEGED
               |
  ┌────────────────────────┐
  │        Ring 0          │  CPL=00
  │  Kernel / OS core      │
  │  All instructions      │
  │  I/O port access       │
  │  Control register R/W  │
  │  Can access all rings  │
  └────────────────────────┘
               |
  ┌────────────────────────┐
  │        Ring 1          │  CPL=01
  │  (mostly unused)       │
  │  OS services (Multics) │
  │  VMware original use   │
  └────────────────────────┘
               |
  ┌────────────────────────┐
  │        Ring 2          │  CPL=10
  │  (mostly unused)       │
  │  Device drivers in     │
  │  some OS designs       │
  └────────────────────────┘
               |
  ┌────────────────────────┐
  │        Ring 3          │  CPL=11
  │  User applications     │
  │  Restricted instrs     │
  │  No direct I/O         │
  │  No control registers  │
  └────────────────────────┘
               |
         LEAST PRIVILEGED

The CPU checks the CPL before every instruction that touches a protected resource. Software cannot write the CPL directly; it changes only through controlled transfers:

1. A SYSCALL/SYSENTER instruction is executed (user → ring 0)
2. An interrupt or exception fires (any → ring 0 handler)
3. An IRET/SYSRET/SYSEXIT instruction returns (ring 0 → previous CPL)
4. A far call through a call gate (deliberate CPL change through a descriptor)

Why Only Ring 0 and Ring 3 Are Used Today

The original intent was that OS services, device drivers, and applications would use rings 1–3 in a tiered trust model. OS/2 1.x (1987–1988) used ring 2 for application code segments that were granted I/O privilege, while its kernel and device drivers ran in ring 0. In practice, the complexity of managing three non-kernel privilege levels (each requiring separate code segments, gate descriptors, and stack switches) outweighed the benefit.

Unix-derived OSes adopted a binary model: kernel (ring 0) and user (ring 3), skipping rings 1 and 2. Windows NT, designed by Dave Cutler at Microsoft in the late 1980s, made the same choice. The simplification meant:

- No call gate setup for rings 1/2
- Simpler segment descriptor tables
- Drivers either fully trusted (ring 0) or user-space (ring 3), with no middle ground

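Rings 1 and 2 remain architecturally valid, but paging distinguishes only supervisor (CPL 0–2) from user (CPL 3), so page tables cannot protect ring-0 memory from code in rings 1 and 2, and every transition through them still pays for segment and gate handling. No mainstream OS uses them: on Linux and Windows, CPL=01 and CPL=10 never occur.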

Ring 1/2 Historical Use: VMware's Original x86 Virtualization

Before Intel VT-x hardware virtualization extensions (first shipped in late 2005), VMware faced a fundamental problem: virtualizing x86 on x86 using pure software. Guest OS kernels expected to run in ring 0 and execute privileged instructions. VMware's solution (circa 1999): run the guest kernel in ring 1 and the VMM (Virtual Machine Monitor) in ring 0.

Guest ring-0 code was binary-translated so that sensitive but non-privileged instructions (which don't fault in ring 1 but don't virtualize correctly) were replaced with calls to ring 0 stubs. The guest thought it was in ring 0; the VMM ran in actual ring 0 and trapped the truly privileged operations.

This is one of the few major production uses of rings 1 and 2 on x86 since OS/2; 32-bit Xen paravirtualization similarly ran guest kernels in ring 1.

Hypervisor Rings: Ring -1 via Intel VMX

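Intel VT-x (Virtualization Technology for x86, first shipped in late 2005) introduced VMX (Virtual Machine Extensions), adding a new mode that sits below ring 0 in privilege: VMX root mode. This is informally called "ring -1." Strictly speaking, VMX root and non-root are modes orthogonal to the rings, and each has its own rings 0–3.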

  ┌──────────────────────────────────┐
  │  VMX Root Mode ("Ring -1")       │  Most privileged
  │  Hypervisor (KVM, Hyper-V, ESXi) │
  │  Controls VM entry/exit          │
  └──────────────────────────────────┘
               |  VMLAUNCH / VMRESUME (VM entry)
               |  VMEXIT (VM exit — any privileged op)
  ┌──────────────────────────────────┐
  │  VMX Non-Root Mode               │
  │  Ring 0: Guest OS kernel         │
  │  Ring 3: Guest user applications │
  └──────────────────────────────────┘

In VMX non-root mode, the guest kernel runs at CPL=0 and most of its instructions execute natively. Only the operations that the hypervisor has configured in the VMCS (Virtual Machine Control Structure), plus a small set that always exit, are intercepted and hand control to the VMM (hypervisor) through a VM exit (VMEXIT).

KVM (arch/x86/kvm/) in Linux implements this. When a guest executes CPUID or HLT, accesses an intercepted MSR, or (when EPT is not in use) writes CR3, a VMEXIT fires and control returns to the KVM host kernel. KVM handles the operation and resumes the guest with VMRESUME.
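The dispatch step described above can be sketched as a lookup on the exit reason. This is an illustration only, not KVM's code: the function name `vmx_exit_reason_name` and the enum names are hypothetical, and the numeric values are assumed from the Intel SDM's appendix of VMX basic exit reasons and should be checked against the manual.

```c
#include <stdint.h>

/* Basic VM-exit reasons: the low 16 bits of the VMCS exit-reason field.
 * Values are assumed from the Intel SDM appendix on basic exit reasons. */
enum {
    EXIT_REASON_CPUID         = 10,
    EXIT_REASON_HLT           = 12,
    EXIT_REASON_CR_ACCESS     = 28,
    EXIT_REASON_IO            = 30,
    EXIT_REASON_MSR_READ      = 31,
    EXIT_REASON_MSR_WRITE     = 32,
    EXIT_REASON_EPT_VIOLATION = 48
};

/* Map a raw exit-reason field to a readable name (hypothetical helper).
 * The upper bits carry flags, so they are masked off first. */
const char *vmx_exit_reason_name(uint32_t exit_reason_field)
{
    switch (exit_reason_field & 0xFFFFu) {
    case EXIT_REASON_CPUID:         return "CPUID";
    case EXIT_REASON_HLT:           return "HLT";
    case EXIT_REASON_CR_ACCESS:     return "CR_ACCESS";
    case EXIT_REASON_IO:            return "IO_INSTRUCTION";
    case EXIT_REASON_MSR_READ:      return "MSR_READ";
    case EXIT_REASON_MSR_WRITE:     return "MSR_WRITE";
    case EXIT_REASON_EPT_VIOLATION: return "EPT_VIOLATION";
    default:                        return "OTHER";
    }
}
```

A real hypervisor would index a table of handler functions with the same masked value instead of returning a string.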

AMD's equivalent is AMD-V / SVM (Secure Virtual Machine), built around the VMRUN instruction and #VMEXIT events. ARM's equivalent is EL2 (hypervisor exception level).

ARM Exception Levels (EL0–EL3)

ARM AArch64 (64-bit ARM, used in Apple Silicon, Snapdragon, AWS Graviton) uses a cleaner four-level hierarchy:

  EL3 ┌─────────────────────────────┐  Most privileged
      │  Secure Monitor (TrustZone) │
      │  ARM Trusted Firmware (ATF) │
      │  Manages Secure/Normal World│
      └─────────────────────────────┘
               |  SMC (Secure Monitor Call)
  EL2 ┌─────────────────────────────┐
      │  Hypervisor                 │
      │  KVM on ARM, Xen, Type-1    │
      │  Stage-2 page tables        │
      └─────────────────────────────┘
               |  HVC (Hypervisor Call)
  EL1 ┌─────────────────────────────┐
      │  OS Kernel                  │
      │  Linux kernel, XNU          │
      │  Page tables (TTBR0/TTBR1)  │
      └─────────────────────────────┘
               |  SVC (Supervisor Call = syscall)
  EL0 ┌─────────────────────────────┐  Least privileged
      │  User Applications          │
      │  Same as ring 3 on x86      │
      └─────────────────────────────┘

SMC (Secure Monitor Call) traps to the secure monitor at EL3, which switches between the Normal and Secure worlds; it is the entry point for TrustZone services such as secure key storage, DRM, and biometric authentication. HVC (Hypervisor Call) transitions to EL2. SVC (Supervisor Call) is the syscall instruction for EL0→EL1. Exception return (ERET) returns to the same or a lower EL.
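When one of these calls is taken, the receiving exception level can tell which instruction caused it from the exception class field of its syndrome register. A minimal sketch of that decoding follows; the helper names are hypothetical, and the field positions and class values are assumed from the Arm Architecture Reference Manual's description of ESR_ELx for calls made from AArch64 state.

```c
#include <stdint.h>

/* Exception class (EC) values for calls made from AArch64 state
 * (assumed from the Arm Architecture Reference Manual). */
enum {
    ESR_EC_SVC64 = 0x15,
    ESR_EC_HVC64 = 0x16,
    ESR_EC_SMC64 = 0x17
};

/* EC occupies bits [31:26] of ESR_ELx. */
unsigned esr_exception_class(uint64_t esr)
{
    return (unsigned)((esr >> 26) & 0x3Fu);
}

/* For SVC/HVC/SMC, the instruction's 16-bit immediate is in ISS bits [15:0]. */
unsigned esr_call_immediate(uint64_t esr)
{
    return (unsigned)(esr & 0xFFFFu);
}

/* Name the call instruction that produced this syndrome value. */
const char *esr_call_name(uint64_t esr)
{
    switch (esr_exception_class(esr)) {
    case ESR_EC_SVC64: return "SVC";
    case ESR_EC_HVC64: return "HVC";
    case ESR_EC_SMC64: return "SMC";
    default:           return "other";
    }
}
```

Linux on AArch64 issues `svc #0` for every system call and passes the syscall number in a register, so the immediate is normally zero there.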

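Apple Silicon Macs depart from this textbook layout. The Secure Enclave is a separate coprocessor with its own firmware, not EL3 software on the application cores, and Apple's cores are reported not to implement EL3 at all. The CPU's virtualization support (EL2) is exposed through Hypervisor.framework, which lets hypervisors such as Parallels and UTM/QEMU run without kernel extensions. XNU is the privileged kernel, and all user-space applications run at EL0.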

RISC-V Privilege Modes

RISC-V defines three privilege modes:

  M (Machine)     - Firmware, bootloader (OpenSBI)
                  - Most privileged, accesses all hardware
  |  ECALL from S (up) / MRET (down)
  S (Supervisor)  - OS kernel
                  - Similar to ring 0 / EL1
  |  ECALL from U (up) / SRET (down)
  U (User)        - Applications
                  - Least privileged

RISC-V is notable for its clean, formal specification: trap behavior is defined explicitly in the privileged ISA specification rather than left to implementation convention. By default every trap is taken in M-mode. Firmware normally sets the corresponding bit in the medeleg register so that an ECALL from U-mode (the environment-call-from-U-mode exception) is delivered directly to S-mode (the kernel). ECALL from S-mode is handled by M-mode (the firmware/SBI layer).
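The delegation rule can be written down in a few lines. The sketch below models only synchronous exceptions on a hart that implements S-mode; the function and enum names are hypothetical, while the mode encodings and cause numbers are taken from the RISC-V privileged specification as assumed here (U=0, S=1, M=3; ECALL causes 8, 9, and 11).

```c
#include <stdint.h>

/* Privilege mode encodings from the RISC-V privileged specification. */
enum priv_mode { MODE_U = 0, MODE_S = 1, MODE_M = 3 };

/* Synchronous exception cause codes for environment calls. */
enum {
    CAUSE_ECALL_FROM_U = 8,
    CAUSE_ECALL_FROM_S = 9,
    CAUSE_ECALL_FROM_M = 11
};

/* Decide which mode takes a synchronous exception (hypothetical helper).
 * Traps go to M-mode unless the trap came from S or U and the matching
 * medeleg bit is set; traps from M-mode are never delegated downward. */
enum priv_mode trap_target_mode(enum priv_mode current, unsigned cause,
                                uint64_t medeleg)
{
    if (current != MODE_M && cause < 64 && ((medeleg >> cause) & 1u))
        return MODE_S;
    return MODE_M;
}
```

With only bit 8 set in `medeleg`, a user-mode ECALL lands in the kernel while a supervisor-mode ECALL (cause 9) still reaches the SBI firmware in M-mode.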

Operations That Require Ring 0 on x86-64

The following operations cause a General Protection Fault (#GP, vector 13) if attempted from ring 3:

Operation	Instruction / Register	Why Restricted
Load interrupt descriptor table	`LIDT`	Would allow redirecting all exceptions
Load global descriptor table	`LGDT`	Would allow changing segment permissions
Load task register	`LTR`	Controls task state segment
Clear interrupt flag	`CLI`	Would allow disabling all interrupts
Set interrupt flag	`STI`	Unsafe to enable ints in user context
Halt processor	`HLT`	Power management requires kernel control
I/O port access	`IN`, `OUT`	Direct hardware access
Read or write CR0, CR3, CR4	`MOV CRn`	Page tables, CPU features
Write MSR	`WRMSR`	CPU configuration registers
Invalidate TLB	`INVLPG`	Only kernel should manage TLB
Swap GS base	`SWAPGS`	Only the kernel entry path should switch to the kernel's GS base

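CR3 is the page table base register. Writing it with a different value switches the entire virtual address space, so only the kernel should do this (for example, during context switches). CLI would prevent interrupt delivery, potentially hanging the system. CLI, STI, IN, and OUT are gated by the IOPL field in RFLAGS rather than by CPL alone; with IOPL=0, which is what Linux gives ordinary processes, they fault in ring 3 like the rest. Any attempt to execute these instructions in ring 3 then triggers #GP.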
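The effect is easy to observe from user space. The following sketch, which assumes Linux on x86 and a GCC- or Clang-compatible compiler, executes HLT in ring 3 and catches the SIGSEGV that the kernel sends in response to the resulting #GP. The function name `try_hlt_from_user_mode` is hypothetical, and on other platforms the function simply reports that it cannot run the experiment.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__))
static sigjmp_buf fault_return;

/* SIGSEGV handler: jump back to the point saved before the HLT. */
static void on_sigsegv(int signo)
{
    (void)signo;
    siglongjmp(fault_return, 1);
}
#endif

/* Returns 1 if HLT faulted (the expected result in ring 3),
 * 0 if it somehow executed, and -1 if the platform is unsupported. */
int try_hlt_from_user_mode(void)
{
#if defined(__linux__) && (defined(__x86_64__) || defined(__i386__))
    struct sigaction sa, old;
    int result;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(fault_return, 1) == 0) {
        __asm__ volatile("hlt");   /* privileged: #GP when CPL != 0 */
        result = 0;
    } else {
        result = 1;                /* reached via siglongjmp from the handler */
    }

    sigaction(SIGSEGV, &old, NULL); /* restore the previous handler */
    return result;
#else
    return -1;
#endif
}
```

Calling `try_hlt_from_user_mode()` from an ordinary process on x86 Linux is expected to return 1; without the handler, the process would be killed by the signal.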

What Happens on a Privilege Violation

When CPL=3 code attempts a restricted operation:

1. CPU detects the violation before executing the instruction
2. CPU looks up the #GP handler in IDT entry 13 and transitions to CPL=0
3. CPU loads the kernel stack pointer from the TSS (Task State Segment)
4. CPU pushes the interrupted SS, RSP, RFLAGS, CS, and RIP, followed by an error code, onto that kernel stack
5. The #GP handler runs (do_general_protection() in arch/x86/kernel/traps.c; exc_general_protection() in newer kernels)
6. If the fault was in user space: kernel sends SIGSEGV to the process
7. If the fault was in kernel space: kernel oops or panic (a bug)
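The last two steps hinge on the CS value the CPU saved. A simplified sketch of that decision is shown below; the struct and function names are hypothetical, the field order reflects the 64-bit exception frame as the CPU pushes it (lowest address first), and the real Linux handler does considerably more, such as consulting exception fixup tables before declaring a kernel bug.

```c
#include <stdint.h>

/* State the CPU pushes on the kernel stack for #GP in 64-bit mode,
 * listed from the lowest address (last pushed) upward. */
struct gp_fault_frame {
    uint64_t error_code;
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

enum fault_action { SEND_SIGSEGV, KERNEL_OOPS };

/* The low two bits of the saved CS hold the CPL at the time of the fault. */
enum fault_action classify_gp_fault(const struct gp_fault_frame *frame)
{
    return (frame->cs & 3u) == 3u ? SEND_SIGSEGV : KERNEL_OOPS;
}
```

A frame with `cs = 0x33` is classified as a user-space fault, while `cs = 0x10` indicates the fault happened in kernel code.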

SMEP: Supervisor Mode Execution Prevention

SMEP (introduced on Intel Ivy Bridge, 2012; controlled by CR4 bit 20) prevents ring 0 code from executing pages marked as user-accessible in the page tables. Without SMEP, a kernel exploit could:

1. Place shellcode in user space
2. Trick the kernel into jumping to that user-space address
3. The shellcode would execute in ring 0

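With SMEP enabled, any attempt by ring-0 code to fetch instructions from a user-accessible page causes a #PF (page fault) whose error code has the instruction-fetch (I/D) bit set. The kernel treats this as a fatal fault (an oops, or a panic if so configured). Linux enables SMEP during CPU setup when the processor supports it (setup_smep() in arch/x86/kernel/cpu/common.c sets X86_CR4_SMEP).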

Without SMEP:
  Ring-0 exploit → jump to user-space shellcode → executes in ring 0

With SMEP (CR4.SMEP=1):
  Ring-0 code fetches from user page → #PF (Instruction Fetch in Supervisor Mode)
  → kernel panic, exploit fails
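A fault handler can recognize this case from the page-fault error code. The sketch below uses the architectural bit layout of that error code; the macro and function names are hypothetical rather than the kernel's own, and the caller is assumed to have already determined whether the faulting address lies in the user half of the address space.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error code bits. */
#define PF_PRESENT (1u << 0)  /* fault on a present page (protection violation) */
#define PF_WRITE   (1u << 1)  /* access was a write */
#define PF_USER    (1u << 2)  /* access originated at CPL=3 */
#define PF_RSVD    (1u << 3)  /* reserved bit set in a paging structure */
#define PF_INSTR   (1u << 4)  /* access was an instruction fetch */

/* A SMEP violation is a supervisor-mode instruction fetch from a present
 * page at a user address. The address check separates it from an NX
 * violation on a kernel address, which has the same error code. */
bool looks_like_smep_violation(uint32_t error_code, bool fault_addr_is_user)
{
    return (error_code & PF_PRESENT) != 0 &&
           (error_code & PF_USER) == 0 &&
           (error_code & PF_INSTR) != 0 &&
           fault_addr_is_user;
}
```

An error code of 0x11 (present plus instruction fetch) at a user address matches; the same fetch performed from ring 3 would carry 0x15 and does not.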

SMAP: Supervisor Mode Access Prevention

SMAP (Intel Broadwell, 2014; CR4 bit 21) prevents ring-0 code from reading or writing user-space memory unless the code explicitly uses STAC (Set AC Flag) and CLAC (Clear AC Flag) instructions around the access.

Legitimate kernel code that needs to copy data to/from user space uses copy_to_user() / copy_from_user(), which are wrapped with STAC/CLAC. An exploit that tricks the kernel into accessing a user-controlled pointer outside of these sanctioned windows will trigger a #PF.

Linux's access_ok() check (a prerequisite for all user-space pointer operations) combined with SMAP makes it dramatically harder to exploit confused-deputy attacks where the kernel is tricked into reading attacker-controlled data.
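The SMAP rule itself is a small predicate over CR4, RFLAGS, the CPL, and the page's user/supervisor bit. The following is a simplified model for explicit data accesses only, with hypothetical names; STAC sets and CLAC clears the AC flag (RFLAGS bit 18) that the model tests.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMAP_MASK  ((uint64_t)1 << 21)  /* CR4.SMAP */
#define RFLAGS_AC_MASK ((uint64_t)1 << 18)  /* set by STAC, cleared by CLAC */

/* Returns true if SMAP would block an explicit data access
 * (simplified model; implicit supervisor accesses are not covered). */
bool smap_blocks_access(uint64_t cr4, uint64_t rflags, unsigned cpl,
                        bool page_is_user_accessible)
{
    if ((cr4 & CR4_SMAP_MASK) == 0)
        return false;                 /* SMAP disabled */
    if (cpl == 3)
        return false;                 /* SMAP restricts supervisor mode only */
    if (!page_is_user_accessible)
        return false;                 /* kernel pages are unaffected */
    return (rflags & RFLAGS_AC_MASK) == 0;  /* blocked unless inside STAC/CLAC */
}
```

In this model, a stray ring-0 dereference of a user pointer is blocked, while the same access inside a STAC/CLAC window, as in copy_from_user(), is allowed.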

Historical Context

The ring concept originated with Multics. On the GE-645 mainframe (1965–1969) rings were implemented largely in software; its successor, the Honeywell 6180 (1973), enforced eight rings in hardware. Multics used them extensively: the kernel ran in ring 0, file system servers in ring 1, user programs in ring 4, and untrusted programs in outer rings. Inter-ring calls went through gates.

The Intel 286 (1982) brought rings to the PC. The 286 had a protected mode with rings 0–3, segment-based protection, and hardware task switching. OS/2 1.x (1987) actually used three of them: the kernel and device drivers in ring 0, application code with I/O privilege in ring 2, and ordinary applications in ring 3.

The 386 (1985) added paging on top of segmentation, giving per-page user/supervisor and read/write permissions (a per-page no-execute bit arrived only later, with AMD64). This made paging the primary protection mechanism and reduced segmentation to a legacy layer. Linux on x86 settled on a GDT with flat segments (base 0, limit max) and relies on page table permissions for protection.

AMD64 (2003) effectively retired most segmentation in 64-bit mode (segment bases are 0, only FS/GS bases are meaningful), making the CPL in CS bits 0–1 the sole relevant segment protection mechanism. The ring model was streamlined to what it always really was in practice: two levels.

Production Examples

KVM hypervisor on Linux: When you run a virtual machine with qemu-kvm, the guest kernel believes it is in ring 0 but actually runs in VMX non-root mode. Intercepted operations such as CPUID, HLT, and port I/O trigger a VMEXIT to the host Linux kernel. KVM handles thousands of VMEXITs per second per vCPU. VMEXIT overhead is a primary contributor to VM-vs-bare-metal performance gaps for I/O- and interrupt-heavy workloads; ordinary guest syscalls stay inside the guest and do not cause VMEXITs.

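Android TrustZone: Android phones built on Arm SoCs rely on TrustZone. The Android kernel runs at EL1 in the normal world. Secure key storage, DRM playback (Widevine), and fingerprint verification run in the Trusted Execution Environment (TEE), whose OS runs at Secure EL1 with trusted applications at Secure EL0; the secure monitor at EL3 switches between the two worlds. The normal-world kernel (EL1) is prevented by hardware from reading TEE memory. The SMC instruction is the normal world's software entry point into the secure world.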

AWS Nitro Hypervisor: A KVM-based hypervisor that relies on Intel VT-x on Intel-based instances. The Nitro hypervisor runs in VMX root mode. Guest kernels (your EC2 instance's Linux) run in VMX non-root mode. Nitro is designed to minimize VMEXITs for device I/O by moving device emulation to dedicated hardware (Nitro cards), leaving the hypervisor footprint extremely small.

Debugging Notes

Determining current privilege level in a crash dump:

# CS register in panic output encodes CPL in bits 0-1:
CS: 0010  → CPL=00 → ring 0 (kernel crash)
CS: 0033  → CPL=11 → ring 3 (user-space crash, kernel delivered signal)
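The same decoding can be done programmatically. A minimal sketch, with a hypothetical struct and function name, that splits a segment selector into its three architectural fields:

```c
#include <stdint.h>

/* The three fields of an x86 segment selector. */
struct selector_fields {
    unsigned rpl;    /* bits 0-1: privilege level (the CPL when read from CS) */
    unsigned table;  /* bit 2: 0 = GDT, 1 = LDT */
    unsigned index;  /* bits 3-15: descriptor table index */
};

/* Split a 16-bit selector value into its fields. */
struct selector_fields decode_selector(uint16_t selector)
{
    struct selector_fields fields;
    fields.rpl   = selector & 0x3u;
    fields.table = (selector >> 2) & 0x1u;
    fields.index = selector >> 3;
    return fields;
}
```

Applied to the two values above, `decode_selector(0x0010)` yields privilege level 0 with GDT index 2, and `decode_selector(0x0033)` yields privilege level 3 with GDT index 6.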

Checking SMEP/SMAP enablement:

# /proc/cpuinfo flags show whether the CPU supports each feature and the kernel has not disabled it
grep -m1 -ow smep /proc/cpuinfo   # SMEP flag in CPU features
grep -m1 -ow smap /proc/cpuinfo   # SMAP flag in CPU features
# The raw CR4 value is printed in oops/panic register dumps, not in a normal boot log
dmesg | grep "CR4:"
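Once a CR4 value is in hand, for example from an oops register dump, the two bits can be checked directly. The helper names below are hypothetical and the sample values in the note are made up for illustration; the bit positions are the ones given earlier (SMEP is bit 20, SMAP is bit 21).

```c
#include <stdbool.h>
#include <stdint.h>

#define CR4_SMEP_BIT 20
#define CR4_SMAP_BIT 21

/* True if CR4.SMEP (bit 20) is set in the given CR4 value. */
bool cr4_smep_enabled(uint64_t cr4)
{
    return ((cr4 >> CR4_SMEP_BIT) & 1u) != 0;
}

/* True if CR4.SMAP (bit 21) is set in the given CR4 value. */
bool cr4_smap_enabled(uint64_t cr4)
{
    return ((cr4 >> CR4_SMAP_BIT) & 1u) != 0;
}
```

For a value such as 0x3706e0 both helpers return true; for 0x1706e0 only SMEP is reported as enabled.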

VMX VMEXIT analysis:

# Record VMEXITs for a running VM (KVM) for one second, then summarize
perf kvm stat record -p <qemu-pid> sleep 1
perf kvm stat report --event=vmexit
# The report breaks exits down by reason (CPUID, I/O, EPT violation, etc.)

Security Implications

The privilege ring model is the hardware foundation of all OS security. Its soundness depends on:

The hardware implementation being correct (Intel/AMD security advisories regularly cover privilege handling bugs)
The kernel correctly entering and exiting privileged mode (kernel entry/exit paths are among the most audited code in existence)
The kernel correctly validating user-supplied data before acting on it in ring 0

Notable ring-related security failures:

- CVE-2017-5754 (Meltdown): x86 speculative execution allowed ring-3 code to transiently read ring-0 memory. The ring boundary held architecturally but failed under speculative execution.
- CVE-2018-8897 (MOV SS / POP SS): MOV SS and POP SS delay debug exceptions until after the following instruction. If that instruction is SYSCALL or INT, the pending debug exception is delivered after the CPU has already entered ring 0, in a state the kernel did not expect, which led to privilege escalation on multiple OS implementations.
- VMCS misconfiguration: Hypervisors that incorrectly configure VMCS fields (e.g., leaving certain MSRs unintercepted) can allow guest VMs to affect the host. Full VMCS auditing is required for production hypervisors.

Performance Implications

VMEXIT cost: roughly 500–1000 ns per VMEXIT round trip (vs ~100 ns for a syscall); exact figures vary with CPU generation and enabled mitigations. Hypervisors are designed to minimize VMEXITs. Paravirtual drivers (virtio) batch operations to reduce VMEXIT frequency.
SMEP/SMAP overhead: Near-zero at runtime. SMEP adds no instructions, and the STAC/CLAC pair adds only a small fixed cost to each sanctioned user-memory access window.
Mode switch cost: Ring-0 ↔ ring-3 transitions (syscalls) are ~100–200 ns. VMX non-root ↔ VMX root ("ring -1", hypervisor) transitions are 5–10x more expensive due to VMCS state saves/restores.
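These figures translate into CPU time with simple arithmetic: the fraction of a vCPU's time lost to exits is the exit rate multiplied by the cost per exit. The helper below is a hypothetical back-of-the-envelope calculator, and the rates in the note are illustrative inputs rather than measurements.

```c
/* Fraction of one vCPU's wall-clock time spent handling VM exits,
 * given an exit rate and an average cost per exit in nanoseconds. */
double vmexit_time_fraction(double exits_per_second, double ns_per_exit)
{
    return exits_per_second * ns_per_exit / 1e9;
}
```

At 10,000 exits per second and 1,000 ns per exit the result is about 0.01, or 1% of the vCPU; at 200,000 exits per second the same cost consumes about 20%.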

Failure Modes and Real Incidents

CVE-2018-8897 (May 2018): Affected Windows, Linux, macOS, Xen, and other x86 kernels and hypervisors. A MOV or POP that loads the SS register suppresses debug exceptions until after the next instruction. When that next instruction transferred control to ring 0 (SYSCALL, INT), the deferred debug exception was delivered with the CPU already at CPL=0 but before the kernel had established its own state (for example, its stack or GS base). On Windows, this allowed privilege escalation from ring 3 to ring 0. Root cause: OS interrupt delivery code did not account for this documented but easily misread CPU behavior at the ring-3/ring-0 transition boundary.

VMware CVE-2022-22972: A critical authentication bypass in VMware Workspace ONE Access, Identity Manager, and vRealize Automation. Not a ring-level bug, and not in ESXi itself, but a reminder that the hypervisor layer (ring -1) and the management plane around it are extremely high-value targets: compromising the hypervisor compromises every guest VM on the host.

Intel SGX Side Channels: SGX (Software Guard Extensions) is sometimes described as adding a "ring -2": enclave code runs at ring 3, but its memory is encrypted and inaccessible even to ring 0 and the hypervisor. Multiple side-channel attacks (SGAxe, PLATYPUS) have demonstrated that software-observable microarchitectural and power side channels can defeat the enclave isolation, even though the ring model itself held.

Modern Usage

On a modern x86-64 Linux system, the rings are used as follows at runtime:

Code	Ring	Notes
Linux kernel, interrupt handlers	0	CPL=00, all instructions allowed
KVM hypervisor (when VMX active)	-1 (VMX root)	Runs below the guest's ring 0
Device drivers (kernel modules)	0	Same ring as kernel core
VDSO functions	3	User-mode execution of kernel-provided code
glibc, application code	3	CPL=11, restricted
UEFI firmware (before boot)	0	After ExitBootServices(), kernel owns ring 0

The ARM picture (on a typical Android phone):

- EL0: APK code, ART runtime
- EL1: Linux kernel
- EL2: unused or minimal hypervisor (depends on OEM)
- EL3: secure monitor; the TrustZone TEE (keymaster, gatekeeper, DRM) runs in the secure world at S-EL1/S-EL0

Future Directions

Intel TDX (Trust Domain Extensions): introduces a new "ring -2" concept where even the hypervisor cannot read VM memory. An Intel-signed TDX module, running in a new CPU mode (SEAM), acts as a trusted intermediary between the hypervisor and the guest. This changes the threat model: even a compromised hypervisor cannot decrypt guest data.
Arm CCA (Confidential Compute Architecture): similar to TDX. Introduces "Realms" at a new privilege level, managed by a Realm Management Monitor (RMM) at EL2 and firmware at EL3.
RISC-V hypervisor extension (H-extension): adds a VS (Virtual Supervisor) and VU (Virtual User) mode below S and U, enabling efficient hardware virtualization on RISC-V similar to Intel VT-x.
eBPF as ring-0 extension: BPF programs run in ring 0 but are verified before execution, creating a controlled sub-domain within ring 0. This is the direction for safe kernel extension without full module privileges.

Exercises

Read the Intel SDM Volume 3A, Section 5.5 "Privilege Levels." Write a one-paragraph explanation of how the DPL (Descriptor Privilege Level), CPL (Current Privilege Level), and RPL (Requested Privilege Level) interact when a program accesses a segment.
On a Linux machine with KVM available, run perf kvm stat record -a sleep 5 followed by perf kvm stat report to collect VMEXIT statistics for any running VMs. Identify the top 3 VMEXIT reasons. Look up what kernel code handles each in arch/x86/kvm/vmx/vmx.c (the kvm_vmx_exit_handlers table).
Write a C program that reads CR0 or CR4. Compile it and run it as a regular user. What happens? Now write a kernel module that reads CR4 and logs the value to dmesg. What is bit 20 (SMEP)? Bit 21 (SMAP)?
Research ARM TrustZone from the perspective of a mobile payment application. What is the attestation flow? How does the payment app (EL0) interact with the TEE in the secure world to prove to a payment network that the device is legitimate?
Find the entry_SYSCALL_64 function in arch/x86/entry/entry_64.S. Identify the instruction that changes the privilege level from ring 3 to ring 0. Is it a single instruction? What registers does it use? What does it do with the old RSP?

References

Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Chapter 5 (Protection)
Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3C, Chapters 23–30 (VMX)
AMD64 Architecture Programmer's Manual, Volume 2: System Programming
ARM Architecture Reference Manual, AArch64 (Chapter D1, AArch64 System Level Architecture)
RISC-V Privileged Architecture Specification v1.12: https://github.com/riscv/riscv-isa-manual
Linux kernel source: arch/x86/entry/entry_64.S, arch/x86/kernel/traps.c, arch/x86/kvm/
Original VMware Binary Translation paper: Keith Adams & Ole Agesen, "A Comparison of Software and Hardware Techniques for x86 Virtualization", ASPLOS 2006
Meltdown paper: Lipp et al., USENIX Security 2018
CVE-2018-8897: Nick Peterson & Nemanja Mulasmajic, "POP SS/MOV SS Vulnerability" (2018)