Traps, Faults, and Exceptions

Technical Overview

Exceptions are synchronous CPU events that interrupt normal instruction execution when the CPU detects an error condition or a programmed transition request. Unlike hardware interrupts, which are asynchronous (they can fire at any moment), exceptions are deterministic: they occur at a specific instruction as a direct consequence of executing that instruction. Every CPU architecture defines its own exception taxonomy, and the OS kernel must handle each class correctly.

Intel's x86 architecture classifies exceptions into three types: faults (the instruction can be retried after the exception is handled), traps (execution resumes at the instruction following the one that caused the exception), and aborts (the processor is in an unrecoverable state and cannot reliably restart execution). Understanding these distinctions is critical for kernel developers because misclassifying an exception breaks resumption: treating a trap as a fault re-executes the trapping instruction in an endless loop, treating a fault as a trap silently skips the instruction that needed to be retried, and treating an abort as a fault resumes from an address that cannot be trusted.

Linux's exception handling code is among the most carefully written in the kernel. The page fault entry point alone (arch/x86/mm/fault.c, do_page_fault() in older kernels, exc_page_fault() in current ones) covers demand paging, copy-on-write, stack expansion, memory-mapped files, MMIO faults in drivers, and invalid accesses. That path must execute correctly in every context, including faults raised from within the kernel itself.

Prerequisites

02-user-space-vs-kernel-space.md: ring context matters for exception handling
03-cpu-privilege-rings.md: CPL affects which exceptions are delivered
06-interrupts.md: exceptions use the same IDT mechanism as hardware interrupts

Core Content

x86 Exception Taxonomy

Faults

A fault is an exception that is reported "before" the faulting instruction — the saved RIP (instruction pointer) points to the faulting instruction itself, not the next instruction. After the handler runs and resolves the condition, the CPU restarts the faulting instruction. This is correct behavior for:

Page Fault (#PF, Vector 14): The instruction tried to access a virtual address that has no valid page table mapping. The handler maps the page (allocates a physical frame, fills it from disk, etc.) and restarts the instruction, which now succeeds.
General Protection Fault (#GP, Vector 13): Protection violation. The handler cannot "fix" this for user-space code: it delivers SIGSEGV to the process. In kernel space, it triggers an oops unless an exception-table fixup covers the faulting instruction.
Divide Error (#DE, Vector 0): Division by zero or overflow. In user space: SIGFPE. In kernel space: oops.
Bound Range Exceeded (#BR, Vector 5): BOUND instruction check failed. Delivers SIGSEGV.
Stack Fault (#SS, Vector 12): Stack segment limit exceeded, invalid stack segment, or (in 64-bit mode) a non-canonical stack address. Linux delivers SIGBUS.
Invalid TSS (#TS, Vector 10): Invalid Task State Segment during task switch. Usually fatal.
Segment Not Present (#NP, Vector 11): Segment descriptor has the present bit clear. Delivers SIGBUS.

Traps

A trap is an exception reported "after" the trapping instruction — the saved RIP points to the instruction following the one that caused the exception. Used for debugging and programmed transitions:

Breakpoint (#BP, Vector 3): The INT3 instruction (opcode 0xCC). Debuggers inject breakpoints by replacing the first byte of the target instruction with 0xCC. The trap handler (do_int3()) notifies the debugger (via ptrace) or kprobes. The saved RIP points at the byte after the 0xCC, which is the middle of the original instruction, so the debugger restores the original byte and moves RIP back by one before resuming.
Debug Exception (#DB, Vector 1): Triggered by the hardware debug registers (DR0–DR3 hold breakpoint and watchpoint addresses, DR7 controls them), single-step mode (RFLAGS.TF=1), or a task-switch debug trap. Data watchpoints and single-stepping are reported as traps; instruction-fetch breakpoints are reported as faults. Used by debuggers (gdb) and perf for hardware breakpoints and single-stepping.
Overflow (#OF, Vector 4): INTO instruction when OF flag is set. Rarely used in modern code.
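
As an illustration of how DR7 "controls" DR0–DR3, the sketch below builds the DR7 value that arms one hardware watchpoint slot. The bit positions (local-enable at bit 2n, R/W field at bit 16+4n, LEN field at bit 18+4n) follow the Intel SDM layout; the function and constant names are hypothetical, and real code would set the register through ptrace or perf_event_open rather than compute it by hand.

```python
# R/W field encodings for a debug-register slot
RW_EXECUTE, RW_WRITE, RW_IO, RW_READ_WRITE = 0b00, 0b01, 0b10, 0b11

# LEN field encodings, keyed by watched length in bytes
LEN_ENCODING = {1: 0b00, 2: 0b01, 8: 0b10, 4: 0b11}


def dr7_for_watchpoint(slot, rw, length):
    """Return the DR7 bits that enable watchpoint `slot` (0-3) locally."""
    if slot not in range(4):
        raise ValueError("x86 has four address breakpoint slots: 0-3")
    if length not in LEN_ENCODING:
        raise ValueError("length must be 1, 2, 4 or 8 bytes")
    local_enable = 1 << (2 * slot)              # L0, L1, L2, L3
    rw_bits = rw << (16 + 4 * slot)             # R/Wn
    len_bits = LEN_ENCODING[length] << (18 + 4 * slot)  # LENn
    return local_enable | rw_bits | len_bits


if __name__ == "__main__":
    # Watch 4-byte writes to the address held in DR0
    print(hex(dr7_for_watchpoint(0, RW_WRITE, 4)))
```

The watched address itself goes into the matching DRn register; DR7 only selects the access type, the length, and whether the slot is armed.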

Aborts

An abort is a severe exception from which the faulting instruction's address cannot be reliably determined. The processor may be in an inconsistent state:

Double Fault (#DF, Vector 8): Occurs when the CPU hits a second exception while trying to deliver a first one and the pair falls into one of the architecturally defined double-fault combinations. The classic case is a kernel stack overflow: the access that runs off the stack faults, and the CPU then faults again when it tries to push the exception frame onto the same exhausted stack. The kernel has a separate, dedicated double-fault stack (selected through the IST, the Interrupt Stack Table) to handle this. If the double-fault handler itself cannot be invoked, the result is a triple fault, which resets the machine.
Machine Check Exception (#MC, Vector 18): Hardware error — uncorrectable ECC memory error, CPU internal error, etc. This is an abort; the data in registers may be corrupt. The kernel's MCE handler decides whether to panic or attempt recovery.

x86 Exception Classification Summary:

FAULTS (RIP → faulting instruction, retriable):
  #DE  0   Divide Error          → SIGFPE / oops
  #BR  5   Bound Range           → SIGSEGV
  #UD  6   Invalid Opcode        → SIGILL
  #NM  7   Device Not Available  → FPU state restore (lazy FPU, historical)
  #TS 10   Invalid TSS           → SIGSEGV (usually fatal)
  #NP 11   Segment Not Present   → SIGBUS
  #SS 12   Stack Fault           → SIGBUS
  #GP 13   General Protection    → SIGSEGV / oops
  #PF 14   Page Fault            → demand paging / SIGSEGV
  #MF 16   x87 FP Error          → SIGFPE
  #AC 17   Alignment Check       → SIGBUS
  #XM 19   SIMD FP Exception     → SIGFPE

TRAPS (RIP → next instruction):
  #DB  1   Debug                 → debugger notification (instruction breakpoints report as faults)
  #BP  3   Breakpoint (INT3)     → debugger / kprobes
  #OF  4   Overflow (INTO)       → SIGSEGV

ABORTS (address unreliable):
  #DF  8   Double Fault          → kernel panic
  #MC 18   Machine Check         → panic / recovery attempt
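
The practical consequence of the three classes is which address the CPU saves. The sketch below encodes a few rows of the summary as a lookup table and computes the saved RIP for a given instruction. It is a teaching model only: ExcClass, EXCEPTIONS and saved_rip are hypothetical names, and the table is deliberately incomplete.

```python
from enum import Enum


class ExcClass(Enum):
    FAULT = "fault"
    TRAP = "trap"
    ABORT = "abort"


# vector -> (mnemonic, class); a subset of the summary table
EXCEPTIONS = {
    0: ("#DE", ExcClass.FAULT),
    3: ("#BP", ExcClass.TRAP),
    4: ("#OF", ExcClass.TRAP),
    8: ("#DF", ExcClass.ABORT),
    13: ("#GP", ExcClass.FAULT),
    14: ("#PF", ExcClass.FAULT),
    18: ("#MC", ExcClass.ABORT),
}


def saved_rip(vector, insn_addr, insn_len):
    """Return the RIP pushed for this exception, or None if unreliable."""
    _, cls = EXCEPTIONS[vector]
    if cls is ExcClass.FAULT:
        return insn_addr             # retry the same instruction
    if cls is ExcClass.TRAP:
        return insn_addr + insn_len  # continue after it
    return None                      # abort: no trustworthy restart point


if __name__ == "__main__":
    for vec in (14, 3, 8):
        print(EXCEPTIONS[vec][0], saved_rip(vec, 0x1000, 1))
```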

Page Fault: The Most Important Exception

The page fault handler is fundamental to how modern operating systems work. It runs constantly on any active system: on every first touch of memory that has not been populated yet, every copy-on-write trigger, and every stack growth event.

Page Fault Error Code (pushed by CPU onto stack at entry):

Bit 0 (P):  0 = not-present page, 1 = protection violation
Bit 1 (W):  0 = read access, 1 = write access
Bit 2 (U):  0 = supervisor access, 1 = user access
Bit 3 (R):  1 = reserved bit in page table entry was set
Bit 4 (I):  1 = instruction fetch (NX violation)
Bit 5 (PK): 1 = protection-key violation
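
The bit layout above translates directly into a decoder. The following is a minimal sketch covering only the six bits listed; decode_pf_error_code and PF_BITS are hypothetical names, not kernel symbols.

```python
# (bit, meaning when set, meaning when clear or None)
PF_BITS = [
    (0, "protection violation", "not-present page"),
    (1, "write access", "read access"),
    (2, "user access", "supervisor access"),
    (3, "reserved bit set in page table entry", None),
    (4, "instruction fetch", None),
    (5, "protection-key violation", None),
]


def decode_pf_error_code(code):
    """Return human-readable descriptions for a #PF error code."""
    out = []
    for bit, when_set, when_clear in PF_BITS:
        desc = when_set if code & (1 << bit) else when_clear
        if desc is not None:
            out.append(desc)
    return out


if __name__ == "__main__":
    # 0x2 is what a kernel NULL-pointer write typically reports
    print(decode_pf_error_code(0x2))
```

The same function can be applied to the four-digit code printed on the "Oops:" line of a kernel oops.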

do_page_fault() / exc_page_fault() decision tree (arch/x86/mm/fault.c):

fault_address = CR2 register (contains the faulting virtual address)

if (fault in kernel space) {
    if (fault in kernel's fixup table) → recover (ex_handler_*)
    else → oops/panic
}

if (fault in user space) {
    find_vma(mm, fault_address)

    if (no VMA or address below VMA start) {
        if (near stack and growing down) → expand_stack()
        else → SIGSEGV (bad address)
    }

    if (VMA found) {
        if (protection violation: W access through a read-only PTE) {
            if (VMA permits writes: private COW mapping) → do_wp_page() (copy-on-write)
            else → SIGSEGV
        }
        }

        if (not-present page) {
            if (file-backed mapping) → do_fault() → read from filesystem
            if (anonymous mapping) → do_anonymous_page() → alloc zero page
            if (swap) → do_swap_page() → read from swap
        }
    }
}
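
The user-space half of the decision tree can be modelled as a small pure function. This is a toy sketch under strong simplifications: Vma, find_vma and classify_user_fault are hypothetical Python stand-ins rather than kernel APIs, and locking, rlimits, and most VMA flags are ignored. The returned strings name the kernel path that the tree above would take.

```python
from dataclasses import dataclass

PF_PRESENT = 1 << 0  # error-code bit 0
PF_WRITE = 1 << 1    # error-code bit 1


@dataclass
class Vma:
    start: int
    end: int                  # exclusive
    writable: bool
    anonymous: bool
    grows_down: bool = False  # stack-like area


def find_vma(vmas, addr):
    """Return the lowest VMA whose end lies above addr, or None."""
    for vma in sorted(vmas, key=lambda v: v.start):
        if addr < vma.end:
            return vma
    return None


def classify_user_fault(vmas, addr, error_code, swapped_out=False):
    vma = find_vma(vmas, addr)
    if vma is None:
        return "SIGSEGV"
    if addr < vma.start:
        # Address falls in the gap below the VMA
        return "expand_stack" if vma.grows_down else "SIGSEGV"
    write = bool(error_code & PF_WRITE)
    if write and not vma.writable:
        return "SIGSEGV"
    if error_code & PF_PRESENT:
        # Page is mapped but the PTE forbids the access
        return "do_wp_page" if write else "SIGSEGV"
    if swapped_out:
        return "do_swap_page"
    return "do_anonymous_page" if vma.anonymous else "do_fault"


if __name__ == "__main__":
    stack = Vma(0x7000, 0x8000, writable=True, anonymous=True, grows_down=True)
    print(classify_user_fault([stack], 0x6FF8, 0x6))
```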

Copy-on-write (COW) via page fault: When fork() creates a child process, the parent's memory is not copied immediately. Instead, both parent and child share the same physical pages with write-protect bits set. When either writes to a shared page, a page fault fires. The handler sees: "write access to a read-only page in a private anonymous VMA" → allocates a new physical page, copies the content, updates the PTE to point to the new page and make it writable. This is transparent to the application.
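
The isolation that COW provides is easy to observe from user space. The sketch below (Linux/Unix only, and cow_demo is a hypothetical helper name) forks, lets the child overwrite a buffer it inherited, and shows that the parent still sees the original bytes. It demonstrates the visible guarantee, not the page-level mechanics: which pages the kernel actually copies depends on the allocator and the interpreter.

```python
import os


def cow_demo():
    """Fork, modify a buffer in the child, return (parent_view, child_view)."""
    buf = bytearray(b"original")
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands on pages shared with the parent, so the
        # kernel gives the child private copies before it proceeds.
        try:
            os.close(read_fd)
            buf[:] = b"modified"
            os.write(write_fd, bytes(buf))
        finally:
            os._exit(0)
    # Parent: collect what the child saw, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        child_view = reader.read()
    os.waitpid(pid, 0)
    return bytes(buf), child_view


if __name__ == "__main__":
    print(cow_demo())
```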

General Protection Fault (#GP)

do_general_protection() / exc_general_protection() (arch/x86/kernel/traps.c):

The #GP fault covers a wide range of protection violations:

Segment limit violation
Accessing a segment with the wrong privilege level
Executing a privileged instruction from ring 3 (LGDT, HLT, or IN/OUT without I/O permission)
Accessing a non-canonical address in 64-bit mode
Misaligned memory operand for SIMD instructions that require alignment (e.g., MOVAPS)
Writing to a read-only segment
Various malformed instruction conditions

In user space: the kernel sends SIGSEGV. The signal carries a siginfo_t with si_code = SI_KERNEL. Because the CPU does not report a faulting address for #GP (CR2 is written only for page faults), si_addr is not meaningful and is typically 0.

In kernel space: an unexpected #GP is a serious kernel bug. It triggers oops_begin() → register dump → stack backtrace → oops_end(). If CONFIG_PANIC_ON_OOPS=y (common in production), this immediately panics the machine.

Signal Delivery: SIGSEGV from a Page Fault

User process writes to address 0x0 (NULL dereference):

CPU: #PF (a non-canonical address would raise #GP instead)
       ↓
exc_page_fault() (do_page_fault() in older kernels)
  → determine: user-space fault with no valid mapping
  → call force_sig_fault(SIGSEGV, SEGV_MAPERR, fault_addr)
       ↓
force_sig_fault()
  → fills in struct kernel_siginfo
  → forces delivery to the current task, even if SIGSEGV is blocked or ignored
       ↓
The signal is queued to the task's signal queue

On return to user space:
  → do_notify_resume() checks TIF_SIGPENDING
  → handle_signal() sets up the signal frame on user stack
  → jumps to signal handler (or default action = terminate + core dump)

Default SIGSEGV action:
  → do_coredump() (if ulimit -c allows core files)
  → do_group_exit(SIGSEGV)
  → process terminates with signal 11 (SIGSEGV)
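
The end of this flow is observable from a parent process. The sketch below starts a child interpreter that writes one byte to address 0 through ctypes and reports how the child ended; on Linux a process killed by a signal shows up in subprocess as a negative return code. run_null_write is a hypothetical helper name, and the example assumes a POSIX system where address 0 is unmapped.

```python
import signal
import subprocess
import sys


def run_null_write():
    """Run a child that writes to address 0 and return its exit status."""
    # memset(NULL, 0, 1) performs a one-byte write to virtual address 0
    child_code = "import ctypes; ctypes.memset(0, 0, 1)"
    proc = subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


if __name__ == "__main__":
    status = run_null_write()
    if status < 0:
        print("child killed by", signal.Signals(-status).name)
    else:
        print("child exited with", status)
```

A shell reports the same termination as exit status 128 + 11 = 139.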

Double Fault and Triple Fault

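Double fault (#DF, Vector 8): Occurs when the CPU cannot deliver an exception. Specifically, a second exception occurs during delivery of a first, and the pair is one of the Intel-defined combinations: a contributory exception (#DE, #TS, #NP, #SS, #GP) followed by another contributory exception, or a page fault followed by either a page fault or a contributory exception. Other combinations are handled serially.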

The most common cause: kernel stack overflow. With a guard page below the kernel stack (CONFIG_VMAP_STACK), the overflowing access takes a #PF, and the CPU's attempt to push the exception frame onto the same exhausted stack takes a second #PF. This double faults. Linux allocates a separate double-fault stack (IST entry 1 in the TSS, 4 KiB by default) specifically for this handler. The handler logs the state and panics.

Normal kernel stack overflow scenario:
  Deep recursion → stack grows past bottom → hits guard page
  → #PF on guard page access
  → CPU tries to push #PF frame onto already-exhausted stack
  → second #PF (the frame push also lands in the guard page)
  → Double fault (#DF) → switches to dedicated DF stack → kernel panic
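
Whether a second exception escalates to #DF or is simply handled after the first depends on the class of each exception. The sketch below encodes the classic Intel classification (contributory: #DE, #TS, #NP, #SS, #GP; page fault: #PF; everything else benign). The function names are hypothetical, and newer vectors such as #VE and #CP are left out as a simplifying assumption.

```python
CONTRIBUTORY = {0, 10, 11, 12, 13}  # #DE, #TS, #NP, #SS, #GP
PAGE_FAULT = {14}                   # #PF


def exception_class(vector):
    if vector in CONTRIBUTORY:
        return "contributory"
    if vector in PAGE_FAULT:
        return "page_fault"
    return "benign"


def escalates_to_double_fault(first, second):
    """True if `second`, raised while delivering `first`, produces #DF."""
    a, b = exception_class(first), exception_class(second)
    if a == "contributory":
        return b == "contributory"
    if a == "page_fault":
        return b in ("contributory", "page_fault")
    return False  # a benign first exception never escalates


if __name__ == "__main__":
    # Kernel stack overflow: #PF, then #PF again while pushing the frame
    print(escalates_to_double_fault(14, 14))
```

Note the asymmetry: a #GP raised while delivering a #PF escalates, but a #PF raised while delivering a #GP is handled serially.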

Triple fault: If delivering the double fault itself faults (e.g., the double-fault stack or its IDT entry is unusable), there is no further vector to escalate to. The CPU enters the shutdown state, and platform logic normally responds by resetting the processor, which is why a triple fault appears as an immediate reboot on bare metal. Hypervisors intercept the shutdown instead: KVM reports it to user space, and QEMU resets the guest by default or stops it when started with -no-reboot.

NMI: Non-Maskable Interrupt

The NMI (Vector 2) fires even when the interrupt flag (IF) is cleared by CLI. It cannot be masked by software. Uses:

Hardware watchdog: CONFIG_HARDLOCKUP_DETECTOR programs a PMU (Performance Monitoring Unit) counter to raise a periodic overflow NMI. The NMI handler checks whether the CPU's timer interrupt count has advanced since the previous check; if it has not for the watchdog threshold (watchdog_thresh, 10 seconds by default), the CPU is stuck with interrupts disabled. The handler (watchdog_overflow_callback()) then logs "Watchdog detected hard LOCKUP on cpu N" together with a backtrace.

Hardware error reporting: The chipset or CPU asserts NMI for uncorrectable hardware errors before they escalate to MCE. The NMI handler queries the hardware error status registers.

Profiling: perf uses the PMU NMI for CPU profiling. Every N instructions (or every N cycles), a PMU overflow fires an NMI, and the NMI handler records the current RIP. Over thousands of samples, this produces a statistical profile of CPU time, including time spent in code that runs with interrupts disabled.

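NMI handling is challenging because NMIs can occur in any context. The CPU blocks further NMIs until the handler executes IRET, but an exception taken inside the NMI handler (a page fault or breakpoint, for example) returns with IRET and unblocks them early, so a second NMI can arrive while the first is still running. Linux uses a dedicated NMI stack (IST entry 2) and tracks NMI nesting to keep a nested NMI from corrupting the first one's state.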

Machine Check Exception (#MC)

MCE (Vector 18) reports hardware errors detected by the CPU's Machine Check Architecture (MCA). Error types:

Uncorrectable ECC memory errors
Cache errors (L1/L2/L3 parity errors)
CPU internal bus errors
Memory controller errors

The MCE handler in Linux (arch/x86/kernel/cpu/mce/core.c, do_machine_check()):

Reads MCA registers: MCG_STATUS, MCi_STATUS, MCi_ADDR, MCi_MISC for each bank
Determines if the error is correctable or uncorrectable
For correctable errors: logs to /dev/mcelog or mcelog(8), continues
For uncorrectable errors affecting the current execution: panics
For uncorrectable errors in "firmware first" mode: BIOS handles, notifies kernel via APEI (ACPI Platform Error Interface)
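
The first two steps above come down to reading flag bits out of each bank's MCi_STATUS register. The sketch below decodes the architectural flags (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) and the two 16-bit error-code fields. decode_mci_status and coarse_severity are hypothetical names, and the severity grading is a deliberate simplification of what the kernel's MCE code actually evaluates.

```python
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid data
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = {name: bool((status >> bit) & 1)
             for name, bit in MCI_STATUS_FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
    }


def coarse_severity(status):
    """Very rough grading; the kernel's real rules are far more detailed."""
    flags = decode_mci_status(status)["flags"]
    if not flags["VAL"]:
        return "no error logged"
    if not flags["UC"]:
        return "corrected"
    return "uncorrected, context corrupt" if flags["PCC"] else "uncorrected"


if __name__ == "__main__":
    sample = (1 << 63) | (1 << 61) | (1 << 58) | 0x0135  # made-up value
    print(decode_mci_status(sample), coarse_severity(sample))
```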

mcelog daemon or rasdaemon (modern replacement) parses MCE events and logs them. Production monitoring systems alert on MCE events — an uncorrectable DIMM error is a hardware failure that requires replacement before it causes data corruption.

# Check for recent MCE events
journalctl -u mcelog
# or query rasdaemon's database
ras-mc-ctl --errors

# Ask a running mcelog daemon for its accumulated error data
mcelog --client

Historical Context

x86 exception handling has its roots in the Intel 8086 (1978), which had a simple interrupt vector table and a handful of exceptions (divide error, single step, NMI, breakpoint, overflow); the bounds check exception arrived with the BOUND instruction on the 80186. The protected mode of the 286 (1982) added protection-related exceptions (#GP, #TS, #NP, #SS). The 386 (1985) added the page fault mechanism, enabling demand paging, and introduced the three-level exception classification (fault/trap/abort).

Double fault handling as a separate IDT entry was always in the design. The IST (Interrupt Stack Table) mechanism, which allows specifying separate stacks for specific exception vectors, was added in AMD64 (2003) specifically to improve double-fault and NMI handling reliability, since 32-bit task-gate-based alternative stacks were unwieldy.

The machine-check exception was introduced with the Pentium (1993), which reported errors through a single pair of registers. The P6 family (Pentium Pro, 1995) introduced the banked Machine Check Architecture. Modern Intel and AMD CPUs have 20+ MCA banks covering every major CPU subsystem, and firmware-first MCE handling (where the platform firmware handles MCE before the OS) is now standard for enterprise systems.

Production Examples

Copy-on-write in container startup: When Docker starts a container from a layered image, the image layers are read-only and a copy-on-write storage driver sits on top. With overlay2, the first write to a file that came from an image layer copies the whole file up into the container's writable layer; this happens in the filesystem, not in the page fault handler. The page-fault form of COW is exercised heavily as well: every fork() in the startup scripts and every private mapping of a binary or shared library is populated and split by faults. The entire container startup sequence for a complex application may trigger hundreds of thousands of page faults.

Page fault rate as a performance indicator: A database server under memory pressure shows a rising major fault count in /proc/PID/stat (field 12, majflt; ps -o maj_flt prints the same counter). Each major page fault means the data had to be read from storage (swap or a memory-mapped file). A spike in major faults correlates with latency spikes. Watching sar -B 1 (fault/s and majflt/s) is a standard production metric.
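
For scripted monitoring, the per-process counters can be read straight from /proc/PID/stat, where minflt is field 10 and majflt is field 12 as documented in proc(5). The sketch below is a minimal parser; parse_fault_counts and fault_counts are hypothetical helper names. The only subtlety is that the command name in field 2 may contain spaces or parentheses, so the line is split after the last closing parenthesis.

```python
def parse_fault_counts(stat_line):
    """Extract minflt/majflt from the contents of /proc/PID/stat."""
    # Everything after the final ')' starts at field 3 (state)
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3, so field 10 is rest[7] and field 12 is rest[9]
    return {"minflt": int(rest[7]), "majflt": int(rest[9])}


def fault_counts(pid="self"):
    """Read the current fault counters for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_fault_counts(stat_file.read())


if __name__ == "__main__":
    print(fault_counts())
```

Sampling fault_counts() twice and dividing the difference by the interval gives a per-process fault rate comparable to what sar reports system-wide.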

MCE in AWS EC2 (2019, Spectre-variant patches): After Spectre mitigations were deployed on EC2 hosts, some hypervisor MCE injection to guest VMs was mishandled, causing guest Linux kernels to panic when the host injected an MCE notification. This affected instances running kernels without the corresponding MCE injection handling fix. Root cause: MCE delivery path changed by Spectre mitigation patches inadvertently altered MCE delivery semantics.

Debugging Notes

# Monitor paging activity
vmstat 1
# si/so columns = memory swapped in/out per second; sustained si implies major faults
# Cumulative system-wide fault counters
grep -E "^(pgfault|pgmajfault) " /proc/vmstat

# Per-process fault counts
ps -o pid,min_flt,maj_flt,comm -p <PID>

# ftrace: trace page faults
echo 1 > /sys/kernel/debug/tracing/events/exceptions/page_fault_user/enable
cat /sys/kernel/debug/tracing/trace

# kprobe: trace user-mode page faults with the faulting address
# (arg0 = pt_regs, arg1 = error code, arg2 = address)
bpftrace -e 'kprobe:do_user_addr_fault { printf("%s: fault at %lx\n", comm, arg2); }'

# Trap kernel GP faults (uncommon, but useful when debugging drivers)
# Look for "general protection fault" in dmesg after reproducing the issue
dmesg | grep -i "general protection"

# MCE events
dmesg | grep -i "machine check"
# Long-form MCE decode
mcelog --client 2>/dev/null || ras-mc-ctl --errors

Decoding a kernel oops from #PF:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PGD 0 P4D 0
Oops: 0002 [#1] SMP NOPTI
...
RIP: 0010:my_driver_function+0x3c/0x80

Error code breakdown: 0002
  bit 0 = 0: not-present page
  bit 1 = 1: write access
  bit 2 = 0: supervisor access (kernel fault)

Security Implications

Exception handlers as attack vectors: Each exception handler (especially #GP and #PF) runs in ring 0. A bug in do_page_fault() that mishandles an attacker-controlled fault address is a potential kernel exploit. Dirty COW (CVE-2016-5195) was exactly this — a race condition in the page fault handler's COW path.

SMAP and page fault: SMAP causes a kernel #PF when ring-0 code accesses user space without STAC. If an attacker can cause a kernel code path to access a pointer they control (without STAC/CLAC), SMAP ensures this causes a fault rather than silently reading attacker data. The copy_from_user() path explicitly uses STAC/CLAC.

#UD (Invalid Opcode) for detection: UD2 is a guaranteed-invalid instruction. Guest code can execute it deliberately and observe how the resulting #UD is handled, either to detect that a hypervisor is present or to probe the hypervisor's instruction emulation for exploitable bugs. VMMs must handle #UD carefully.

Kernel fixup table (arch/x86/mm/extable.c): The kernel has a table of (fault_address, fixup_address) pairs. When the kernel itself causes a page fault (e.g., copy_from_user() faults because the user pointer is bad), the fault handler checks if the faulting address is in this table. If so, it jumps to the fixup code (which returns -EFAULT) rather than panicking. This controlled fault handling is essential for safe user/kernel data copying.
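
The lookup itself is a binary search over a table sorted by faulting instruction address. The sketch below models it with absolute addresses; EXTABLE and search_extable are hypothetical names and the addresses are made up. The real x86 table stores relative offsets plus a handler type per entry, which this sketch leaves out.

```python
import bisect

# (address of instruction that may fault, address of its fixup code),
# kept sorted by the first element
EXTABLE = [
    (0x1000, 0x9000),
    (0x1040, 0x9010),
    (0x2000, 0x9020),
]


def search_extable(table, fault_rip):
    """Return the fixup address for fault_rip, or None if not covered."""
    # (fault_rip,) sorts just before any (fault_rip, fixup) entry
    index = bisect.bisect_left(table, (fault_rip,))
    if index < len(table) and table[index][0] == fault_rip:
        return table[index][1]
    return None


def handle_kernel_fault(table, fault_rip):
    """Model of the decision: resume at the fixup, or oops."""
    fixup = search_extable(table, fault_rip)
    return ("resume", fixup) if fixup is not None else ("oops", fault_rip)


if __name__ == "__main__":
    print(handle_kernel_fault(EXTABLE, 0x1040))
    print(handle_kernel_fault(EXTABLE, 0x1044))
```

Only exact matches count: a fault one instruction away from a registered address is a genuine kernel bug and must not be silently fixed up.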

Performance Implications

Page fault overhead: A minor page fault (mapping an anonymous page) costs ~1–2 microseconds. A major page fault (reading from disk) costs 100 microseconds to milliseconds (disk seek time). The Linux MM subsystem is heavily optimized to minimize fault path overhead: the hot path through handle_mm_fault() and do_anonymous_page() is designed to minimize locking and cache misses.

Transparent Huge Pages (THP) and faults: THP causes the kernel to try to allocate 2 MiB pages on fault. This reduces TLB pressure but can cause khugepaged to run and cause occasional latency spikes as it collapses 4K pages into 2M pages. Redis and other latency-sensitive applications disable THP (echo never > /sys/kernel/mm/transparent_hugepage/enabled) for this reason.
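
Before changing the THP setting it helps to check what is currently active. The sysfs file lists every mode and marks the selected one with square brackets (for example "always madvise [never]"). The sketch below parses that format; parse_thp_mode and current_thp_mode are hypothetical helper names.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the bracketed (active) mode from the sysfs file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode; returns None if the file is missing."""
    try:
        with open(path) as sysfs_file:
            return parse_thp_mode(sysfs_file.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", current_thp_mode())
```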

userfaultfd for fault handling in user space: userfaultfd(2) (Linux 4.3) allows user space to resolve page faults for registered memory ranges. The faulting thread blocks, and a designated handler thread (in the same process, or in another process that holds the file descriptor) receives the fault information and can map the page with UFFDIO_COPY or UFFDIO_ZEROPAGE. Used for: VM live migration (QEMU post-copy), checkpointing (CRIU), and lazy memory copying. Cost: each fault goes through kernel → user handler → kernel, roughly 3 context switches per fault.

Failure Modes and Real Incidents

Dirty COW (CVE-2016-5195): A race condition in the COW fault path (do_wp_page() as reached from get_user_pages()). The race: one thread writes through /proc/self/mem to a private, read-only mapping of a file it may only read (e.g., /etc/passwd or an SUID binary), which triggers a COW fault, while a second thread calls madvise(MADV_DONTNEED) on the same range to discard the freshly made private copy. If the discard lands between the COW break and the retried write, the kernel writes to the original page-cache page rather than to a private copy. Exploited to overwrite /etc/passwd or SUID binaries without write permission.

Ghost GLIBC gethostbyname buffer overflow → SIGSEGV (2015): CVE-2015-0235. A heap buffer overflow in glibc's __nss_hostname_digits_dots() could corrupt memory such that a subsequent memory access triggered a SIGSEGV. The overflow was in user space, but the SIGSEGV was the visible symptom. Demonstrates how user-space exceptions are the visible manifestation of memory corruption.

x86 FXSAVE/FXRSTOR oops under KVM (2019): A race between lazy FPU context switching and KVM_RUN caused a #NM exception (Device Not Available, vector 7) to fire in kernel context where it wasn't expected. The do_device_not_available() handler triggered an oops because it assumed the exception occurred in user context. Fixed in Linux 5.0.

Modern Usage

kprobes and int3: Linux's kprobes mechanism (kernel/kprobes.c) patches kernel functions with INT3 (breakpoint, vector 3) to intercept execution. When the patched instruction is hit, the #BP handler fires, calls the kprobe handler, then single-steps the original instruction and returns. This is the foundation for bpftrace kernel probes and many ftrace plugins. The entire mechanism is built on the fault/trap infrastructure.
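
The byte-patching step can be modelled on a plain byte buffer. The sketch below follows the simpler debugger-style scheme (restore the original byte and rewind RIP) rather than kprobes' approach of stepping a saved copy of the instruction. The Breakpoints class and its methods are hypothetical names, and "addresses" are just offsets into a bytearray.

```python
INT3 = 0xCC


class Breakpoints:
    """Toy model of INT3 insertion over a writable code buffer."""

    def __init__(self, text):
        self.text = text   # bytearray standing in for code memory
        self.saved = {}    # address -> original first byte

    def insert(self, addr):
        self.saved[addr] = self.text[addr]
        self.text[addr] = INT3

    def on_trap(self, saved_rip):
        """Handle #BP: return the address execution should resume at."""
        addr = saved_rip - 1                      # INT3 is one byte long
        self.text[addr] = self.saved.pop(addr)    # restore original byte
        return addr                               # re-run the instruction


if __name__ == "__main__":
    code = bytearray(b"\x48\x89\xe5\x90")         # mov rbp, rsp; nop
    bps = Breakpoints(code)
    bps.insert(0)
    print(code.hex(), "->", hex(bps.on_trap(saved_rip=1)), code.hex())
```

To keep the breakpoint armed, a real debugger single-steps the restored instruction and then writes 0xCC back.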

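Live patching and text patching: Linux's live patching (kernel/livepatch/) does not rely on page faults; it redirects calls to a patched function through an ftrace handler at the function's entry. The exception machinery is involved one level down, in how x86 rewrites kernel text safely: text_poke_bp() first writes an INT3 over the first byte of the instruction being changed, so any CPU that reaches the site mid-update takes a #BP and is steered around it instead of executing a half-written instruction.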

userfaultfd in production: QEMU uses userfaultfd for post-copy live migration of VMs. In post-copy mode, execution moves to the destination host before all guest memory has been transferred. The destination VM starts running immediately, with remaining pages fetched on demand: each access to an unmigrated page triggers a userfaultfd event that causes QEMU to fetch the page from the source host over the network.

Future Directions

Control Protection Exception (#CP, Vector 21): Introduced with Intel CET (Control-flow Enforcement Technology). Fires when a RET instruction's return address doesn't match the shadow stack, or when an indirect CALL/JMP lands on an instruction other than ENDBR64/ENDBR32 (indirect branch tracking). Linux 6.6 enables CET shadow stacks for user space. This adds a new fault type to user-space exception handling.
Memory tagging faults: ARM MTE (Memory Tagging Extension) stores a tag in the unused upper pointer bits and checks it against a tag kept with the memory. A mismatch raises a tag check fault, which Linux delivers as SIGSEGV with si_code SEGV_MTESERR (synchronous) or SEGV_MTEAERR (asynchronous). Intel LAM (Linear Address Masking) only makes the CPU ignore the upper pointer bits so that software can keep tags there; it performs no tag check and raises no fault of its own.
eBPF exception handlers: Ongoing discussion about allowing BPF programs to intercept user-space exception delivery, enabling more sophisticated signal handling frameworks without ptrace overhead.
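
The shadow-stack check behind #CP is simple to model: every CALL records the return address twice, and every RET compares the two copies. The sketch below is a toy simulation with hypothetical names (ShadowStackCpu, ControlProtectionFault); real hardware keeps the shadow stack in specially protected pages that ordinary stores cannot modify.

```python
class ControlProtectionFault(Exception):
    """Stands in for #CP (vector 21)."""


class ShadowStackCpu:
    def __init__(self):
        self.stack = []   # normal stack: writable by the program
        self.shadow = []  # shadow stack: written only by call()

    def call(self, return_addr):
        self.stack.append(return_addr)
        self.shadow.append(return_addr)

    def ret(self):
        addr = self.stack.pop()
        expected = self.shadow.pop()
        if addr != expected:
            raise ControlProtectionFault(
                f"return to {addr:#x}, shadow stack says {expected:#x}")
        return addr


if __name__ == "__main__":
    cpu = ShadowStackCpu()
    cpu.call(0x401000)
    cpu.stack[-1] = 0xDEADBEEF  # simulated stack-smashing overwrite
    try:
        cpu.ret()
    except ControlProtectionFault as fault:
        print("#CP:", fault)
```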

Exercises

Write a C program that intentionally causes a SIGSEGV by writing to address (int*)0. Install a signal handler for SIGSEGV using sigaction() with SA_SIGINFO. In the handler, print si_addr from the siginfo_t. Compile and run. Confirm that si_addr is 0x0.
Write a C program that calls fork(), then has the child write to a variable (causing COW). Use MADV_WIPEONFORK and read about its effect. Alternatively, use /proc/self/pagemap before and after the COW fault to observe the page frame number change for the shared-then-CoW'd page.
Cause a kernel oops in a VM by writing a kernel module that deliberately dereferences a NULL pointer in its init function. Capture the oops output from dmesg. Decode the function+offset from the RIP line using scripts/faddr2line from the kernel tree against the module's .ko built with debug info (addr2line -e vmlinux applies only to code built into the kernel image). Identify the exact line of C source code that caused the fault.
Read arch/x86/mm/fault.c, specifically the bad_area_nosemaphore() function. Trace the call path from exc_page_fault() to the force_sig_fault() call for a user-space NULL dereference. List every function called in sequence. What kernel data structures are involved?
Enable the MCE injector (if available on your system: CONFIG_X86_MCE_INJECT=m). Use mce-inject to inject a correctable MCE. Observe the output in dmesg and rasdaemon. What fields are logged? What would happen if you injected an uncorrectable MCE?

References

Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3A, Chapter 6 (Interrupt and Exception Handling)
Intel SDM Vol. 3B, Chapter 15 (Machine Check Architecture)
Linux kernel source: arch/x86/kernel/traps.c, arch/x86/mm/fault.c, arch/x86/kernel/cpu/mce/core.c, kernel/kprobes.c
Linux kernel documentation: Documentation/x86/exception-tables.rst (under Documentation/arch/x86/ in recent kernel trees)
Linus Torvalds, "Dirty COW" analysis: https://lkml.org/lkml/2016/10/20/879
CVE-2016-5195 (Dirty COW): https://dirtycow.ninja/
Andrea Arcangeli, "userfaultfd: user space fault handling", LWN.net, 2015
Thomas Gleixner & Ingo Molnar, x86 exception handling refactoring, Linux 4.15
Brendan Gregg, "Page Faults" in BPF Performance Tools, Addison-Wesley, 2019