User Space vs. Kernel Space

Technical Overview

Every modern operating system enforces a hard partition between two execution environments: kernel space, where the kernel and its code run with unrestricted access to hardware and all memory, and user space, where all application code runs under enforced constraints. This separation is the foundational security and stability mechanism of every production operating system. It is implemented in hardware — CPUs provide privilege levels (rings, exception levels) that restrict what code can do based on which level it currently runs at. The OS configures these hardware mechanisms at boot time and enforces the boundary on every transition.

Understanding this boundary explains why a crashing Firefox tab does not take down the entire system, why a system call costs roughly 100 nanoseconds, and why writing a kernel driver is more dangerous than writing an application.

Prerequisites

  • Understanding of what a process is
  • Basic knowledge of virtual memory concepts
  • Familiarity with what a CPU instruction is
  • Read 01-what-is-a-kernel.md first

Core Content

Why the Separation Exists

Two orthogonal concerns motivate this separation:

Protection: User programs must not be able to corrupt kernel data structures, overwrite other processes' memory, or directly access hardware in ways that violate security policy. Without privilege separation, a malicious or buggy program could overwrite the kernel's process table, disable interrupts, or read any physical memory address.

Stability: Even without malicious intent, a buggy user program may dereference a null pointer or corrupt its own heap. If it ran with kernel privileges, this could crash the entire system. With the separation, the kernel kills the offending process, but the OS continues running.

These are not just theoretical concerns. The CrowdStrike incident of July 2024, which crashed an estimated 8.5 million Windows machines with a BSOD, was caused by a fault in a kernel-mode security driver: code running in kernel space, where a single bad pointer dereference takes down the whole machine.

CPU Privilege Modes

The hardware enforces the boundary. x86 processors have four privilege rings (0–3); only rings 0 and 3 are used by mainstream OSes:

                    Most Privileged
                          |
  Ring 0 ┌───────────────────────────────┐
         │           KERNEL              │
         │  - All instructions allowed   │
         │  - Access any physical memory │
         │  - Set control registers      │
         │  - Enable/disable interrupts  │
         └───────────────────────────────┘
                          |
  Ring 1 ┌───────────────────────────────┐
  Ring 2 │   Historically unused on      │
         │   modern x86 Linux/Windows    │
         └───────────────────────────────┘
                          |
  Ring 3 ┌───────────────────────────────┐
         │         USER SPACE            │
         │  - Restricted instruction set │
         │  - Access only own VA space   │
         │  - Cannot directly do I/O     │
         │  - Cannot change page tables  │
         └───────────────────────────────┘
                          |
                    Least Privileged

ARM processors use Exception Levels (EL0–EL3):

  • EL0: user applications
  • EL1: kernel
  • EL2: hypervisor
  • EL3: secure monitor (TrustZone)

RISC-V uses privilege modes: U (user), S (supervisor/kernel), M (machine/firmware).

The concept is the same across architectures: a piece of hardware state tracks the current privilege level, hardware checks that state before executing certain instructions, and a defined exception mechanism transfers control to higher-privilege code when needed.

What Lives Where

Kernel space contains:

  • The kernel text and data (code, variables, stacks for kernel execution)
  • Interrupt handlers (ISRs): must run in ring 0 to manipulate hardware
  • The scheduler (kernel/sched/core.c): must access all process state
  • The memory manager (mm/): must manipulate page tables, a ring-0-only operation
  • The VFS and filesystem drivers: must call block device drivers
  • Network stack: must access NIC hardware via drivers
  • Kernel modules: device drivers, filesystems, network protocols
  • Per-CPU interrupt stacks, NMI stacks, double-fault stacks
  • The struct page array (mem_map) describing all physical memory

User space contains:

  • All application code (web servers, databases, shells, browsers)
  • The C standard library (glibc, musl): wraps syscalls, provides malloc, printf
  • Language runtimes (JVM, CPython, V8): entirely in user space
  • System daemons (systemd, sshd, nginx, postgres): even privileged daemons run in ring 3; they use syscalls to ask the kernel for privileged operations
  • Dynamic linker (ld.so): maps shared libraries into process address space
  • vDSO: a special shared object the kernel maps into every process, containing a few functions (clock_gettime, gettimeofday) implemented without a syscall trap

Address Space Separation

On x86-64, virtual addresses are 48 bits wide with 4-level paging (256 TiB in total, 128 TiB for each half) or 57 bits wide with 5-level paging. The convention, enforced by the kernel, splits the space in two:

Virtual Address Space (x86-64, 4-level paging)

0x0000000000000000  +---------------------------------+
                    |   User space (0 - 128 TiB)      |
                    |   Stack, heap, mmap, text       |
                    |   Shared libraries, vDSO        |
0x00007fffffffffff  +---------------------------------+
                    |   Non-canonical hole (invalid)  |
                    |   (hardware restriction)        |
0xffff800000000000  +---------------------------------+
                    |   Kernel space (128 TiB)        |
                    |   Direct physical mapping       |
                    |   vmalloc area, modules area    |
                    |   Kernel text at -2 GiB         |
0xffffffffffffffff  +---------------------------------+

The kernel is mapped into the upper half of every process's address space. Before KPTI (Kernel Page Table Isolation, a Meltdown mitigation), this meant the kernel's page table entries were present in every process's page tables (the hierarchy that CR3 points to). With KPTI, each process has a second, user-mode set of page tables that contains only the minimal kernel mappings needed to enter and exit the kernel (entry_SYSCALL_64, interrupt vectors).

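Code running in ring 3 cannot access kernel virtual addresses: kernel page table entries are not marked user-accessible, so the MMU rejects the access. Any attempt by user code to read 0xffff888000000000 (the default start of the kernel's direct physical map when KASLR is off) raises a page fault (a protection violation, or a not-present fault when KPTI has unmapped the kernel), and the kernel responds by sending SIGSEGV to the process.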

How Transitions Happen

There are exactly three mechanisms by which execution moves from user space to kernel space:

1. System Calls (deliberate, synchronous) The user program executes the SYSCALL instruction (x86-64). The CPU loads the kernel entry point from the LSTAR MSR (set at boot by wrmsrl(MSR_LSTAR, entry_SYSCALL_64)), saves the return address in RCX and RFLAGS in R11, and switches to ring 0. SYSCALL itself does not switch stacks; the entry code loads the kernel stack pointer for the current task from per-CPU data. The kernel validates arguments, performs the operation, and returns via SYSRET. The cost is approximately 100ns on a modern machine (including cache effects, KPTI overhead if enabled, and speculative execution barriers).

2. Hardware Interrupts (asynchronous) A device (NIC, disk controller, timer) raises an electrical signal on an IRQ line. The CPU saves the current instruction pointer and flags, looks up the handler in the IDT (Interrupt Descriptor Table), and transfers control to ring 0. The kernel's interrupt handler runs, services the device, and returns via IRET. The interrupted user process resumes exactly where it was.

3. Exceptions / Faults (synchronous, involuntary) The CPU itself raises an exception when it detects an error condition: a page fault when a memory access hits a page table entry that is not present or that forbids the access, a General Protection Fault when code violates a protection rule, a divide-by-zero fault. The kernel handles the exception by either resolving it (as with a demand-paged page fault) or delivering a signal to the process (SIGSEGV for an access violation).

Transition Cost: ~100ns

Mode switching is not free. A typical syscall on a modern Linux system costs 100–200ns due to:

  1. The SYSCALL instruction itself (~20 cycles)
  2. Saving all registers to struct pt_regs on the kernel stack
  3. Looking up the syscall in sys_call_table
  4. KPTI: switching CR3 from user page tables to kernel page tables (~50 cycles with PCID, ~200 cycles without)
  5. Spectre mitigations: IBRS/IBPB/STIBP barriers (worst case ~200 cycles)
  6. Return path: restore registers, switch CR3 back, SYSRET

This is why io_uring, DPDK, and similar systems go to great lengths to batch or eliminate syscalls.

Measurement:

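// Measure getpid() syscall overhead (Linux; build: cc -O2 -D_GNU_SOURCE bench.c)
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>
int main(void) {
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < 1000000; i++) syscall(SYS_getpid);  // raw syscall on every iteration
    clock_gettime(CLOCK_MONOTONIC, &t2);
    // Elapsed nanoseconds divided by 1,000,000 iterations gives the per-syscall cost
    printf("%.1f ns/syscall\n", ((t2.tv_sec - t1.tv_sec) * 1e9 + (t2.tv_nsec - t1.tv_nsec)) / 1e6); }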

Kernel Modules and Ring 0 Access

Kernel modules (.ko files, loaded via insmod/modprobe) run entirely in ring 0. A module loaded with insmod my_driver.ko has the same privilege as the core kernel. It can:

  • Read and write any physical memory address
  • Access I/O ports
  • Call any exported kernel function (EXPORT_SYMBOL)
  • Register interrupt handlers
  • Modify page tables

This is why loading a kernel module requires CAP_SYS_MODULE (normally held only by root), why a buggy driver can cause a kernel panic, and why out-of-tree modules built and installed through DKMS (Dynamic Kernel Module Support) add significant attack surface. Secure Boot with enforced module signing (CONFIG_MODULE_SIG_FORCE) prevents unsigned modules from loading, closing one avenue of rootkit installation.

User-Space Drivers: UIO and VFIO

To reduce risk, some drivers are implemented in user space:

UIO (Userspace I/O, drivers/uio/) Maps a device's memory-mapped I/O regions directly into a user-space process's address space. The process reads/writes device registers directly. Interrupt delivery goes through a file descriptor. Used for simple industrial devices and test equipment.

VFIO (Virtual Function I/O, drivers/vfio/) More complete. Uses IOMMU hardware to give a user-space process safe direct access to a PCI device, including full DMA capability, with IOMMU isolation preventing the device from performing DMA into arbitrary physical memory. VFIO is the basis for passing PCIe devices through to VMs (QEMU/KVM device passthrough) and for DPDK's high-performance packet processing.

The tradeoff: user-space drivers are isolated from each other and cannot crash the kernel, but they cannot use kernel infrastructure (IRQ handling, DMA mapping APIs, power management). VFIO bridges this by keeping a thin kernel stub.


Historical Context

The concept of privileged execution modes appeared in hardware as early as 1959 on the IBM 709, which had a "supervisory" mode. The Atlas computer (1961, Manchester) introduced paged virtual memory with kernel/user separation. Multics, begun in 1964, formalized the ring structure, defining eight protection rings supported by both hardware and software.

Unix (1969–1971) adopted a two-mode model (kernel/user) that proved simpler to implement and reason about while providing sufficient protection. This binary model, rather than Multics' multiple rings, became the standard.

The emergence of virtual machine monitors (VMMs) in the 1970s (IBM VM/370) and their revival with x86 virtualization in the 2000s led to the addition of a hypervisor level (Intel VT-x "ring -1" / VMX root mode, AMD-V), effectively adding a third privilege level below ring 0. This is covered in 03-cpu-privilege-rings.md.


Production Examples

Containerization: Docker and container runtimes like containerd run application code entirely in user space. The container's kernel is the host kernel; there is no separate kernel per container. Namespace and cgroup isolation are kernel mechanisms, but the application code runs in ring 3, exactly as on bare metal. This is why containers start in milliseconds (no kernel boot), and also why a kernel vulnerability in the host can potentially be exploited from inside a container.

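Database direct I/O: Some database engines (for example MySQL's InnoDB when configured with innodb_flush_method=O_DIRECT) open their data files with O_DIRECT so that pages held in their own buffer pool are not cached a second time in the kernel page cache. The data moves between user-space buffers and the disk through the kernel's block layer without being copied into the page cache. PostgreSQL, by contrast, has traditionally relied on buffered I/O and the kernel page cache. Direct I/O gives the database control over caching but requires understanding the user/kernel data path.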

gVisor: Google's gVisor runs an application's syscalls through a user-space kernel (the "Sentry"), which then issues a reduced set of real syscalls to the host kernel. This interposes a second user-space/kernel boundary, reducing attack surface at the cost of ~10% performance overhead. Used in Google Cloud Run.


Debugging Notes

Identifying which space a crash occurred in:

  • Kernel panic: RIP in kernel address range (0xffffffff...), register dump with CS:0010 (ring 0)
  • Segfault in user space: SIGSEGV delivered, core dump generated, CS:0033 (ring 3)

Tracing the boundary:

  • strace shows every system call (crossing user→kernel→user) with timing
  • perf stat reports context-switches (how often the kernel switched the CPU to another task) and cpu-migrations
  • /proc/PID/status shows voluntary_ctxt_switches (the task blocked or yielded) and nonvoluntary_ctxt_switches (the task was preempted)

Slow syscalls:

  • Use perf trace (an strace-like tool with lower overhead) to find which syscalls dominate wall time
  • offcputime-bpfcc (BCC tool) shows time blocked inside kernel calls


Security Implications

The user/kernel boundary is the primary security primitive:

  • Privilege escalation attacks aim to get user-space code executing in ring 0 or to get the kernel to perform unauthorized operations on behalf of user space.
  • Kernel hardening, including work from the Kernel Self-Protection Project (KSPP), reinforces the boundary. Key features: SMEP (bit 20 of CR4; prevents ring 0 from executing pages marked user-accessible), SMAP (bit 21 of CR4; prevents ring 0 from reading or writing user pages outside a STAC/CLAC window), KASLR (randomizes the kernel's virtual address layout).
  • KPTI: after Meltdown (CVE-2017-5754), the kernel uses separate page tables for user and kernel mode. The user-space page table contains almost no kernel mappings, closing the Meltdown side channel. Cost: 5–30% on syscall-heavy workloads.
  • seccomp: allows a process to install a BPF program that filters which syscalls it is allowed to make. Used by Chrome, Firefox, and systemd. Reduces kernel attack surface by preventing processes from calling syscalls they have no legitimate reason to use.

Performance Implications

  • Syscall batching: io_uring (io_uring_setup(2), Linux 5.1) lets an application queue many I/O operations in a shared submission ring (up to 32,768 entries) and submit them with a single io_uring_enter(2) call. The completion side can be polled without any syscall.
  • vDSO: clock_gettime(CLOCK_REALTIME), gettimeofday, and clock_getres are implemented in the vDSO as pure user-space reads from a kernel-maintained memory page. Cost: ~5ns vs ~100ns for a real syscall.
  • Huge pages for kernel mappings: The kernel maps its own text and data with 2 MiB pages, reducing TLB pressure for kernel code paths.
  • CPU isolation / pinning: High-performance applications (DPDK, real-time systems) isolate CPUs from the kernel scheduler (isolcpus=, nohz_full=) to minimize involuntary kernel entry (such as timer interrupts) on those CPUs.

Failure Modes and Real Incidents

Stack overflow in kernel: The kernel stack is fixed-size (16 KiB on x86-64, THREAD_SIZE). Deep call chains or recursion can exhaust it. With CONFIG_VMAP_STACK, Linux allocates kernel stacks with guard pages, so an overflow faults immediately instead of silently corrupting adjacent memory. The overflow then produces an oops or panic.

Dirty COW (CVE-2016-5195): A race in mm/memory.c's copy-on-write path allowed user space to cause the kernel to write to a read-only file (e.g., SUID binaries). The boundary between user intent and kernel action was exploitable. Present in kernels from 2.6.22 to 4.8.2.

Spectre/Meltdown (2018): Meltdown (CVE-2017-5754) exploited the fact that kernel memory was mapped in user-space page tables. Speculative execution could read kernel data before the permission check raised a fault, leaking it via a cache side channel. The fix (KPTI) restored the isolation the hardware was supposed to provide but that speculative execution had violated.

Cloudflare Kernel Panic (2019): A Linux kernel bug in the bpf verifier caused a null pointer dereference in kernel space when Cloudflare deployed a new eBPF firewall rule. All affected machines rebooted simultaneously, causing a brief but global outage.


Modern Usage

  • Container security now focuses heavily on reducing the kernel attack surface exposed to untrusted container workloads. seccomp profiles, AppArmor/SELinux policies, and user namespaces all restrict the syscall interface between container code and the kernel.
  • The emergence of user-space networking stacks (DPDK, RDMA, io_uring with IORING_OP_RECVMSG) is blurring the traditional boundary by moving more protocol logic into user space.
  • eBPF is perhaps the most profound recent development: it allows (verified, sandboxed) code to run in kernel context, callable from kernel hooks, without the full privileges of a kernel module. It is a controlled, policy-enforced extension of the kernel/user boundary.

Future Directions

  • Hardware enforcement of kernel integrity: Intel CET (Control-flow Enforcement Technology) adds shadow stacks in hardware, making return-address-overwrite exploits harder even in ring 0.
  • Rust kernel code: Rust's memory safety guarantees apply even to ring-0 code, reducing the class of bugs that can compromise the kernel/user boundary.
  • Confidential VMs (Intel TDX, AMD SEV-SNP): the kernel now has to manage multiple trust levels even within kernel space — some kernel data must be encrypted and inaccessible to the hypervisor. The boundary model gains a new dimension.
  • eBPF-as-kernel: BPF programs running in sched_ext and NFS BPF hooks implement kernel policy in a sandboxed, user-writeable way. The line between "kernel code" and "user-controlled kernel extension" is intentionally being made more nuanced.

Exercises

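  1. Run getconf PAGE_SIZE and cat /proc/PID/maps for a running process. Identify the vDSO region (labeled [vdso]). Copy that address range out of /proc/PID/mem into a file (for example with dd, using the start address and length from the maps entry), then disassemble the file with objdump -d and find clock_gettime. Observe that the common path reads the time without executing SYSCALL; any SYSCALL you find belongs to the fallback path for clocks the vDSO cannot serve.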

  2. Write a C program that deliberately triggers a fault from user mode by touching a kernel address: attempt to write to 0xffff888000000000 (a kernel virtual address). Compile and run it. Observe the SIGSEGV and any message the kernel logs in dmesg. Explain why the process receives a signal rather than the system crashing.

  3. Use perf stat -e context-switches,cpu-migrations ./your_program on a CPU-bound program vs. an I/O-bound program. Compare the counts. Explain the difference in terms of voluntary vs. involuntary kernel entry.

  4. Install bpftrace and run bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' for 10 seconds. Which processes make the most system calls? What does this tell you about their architecture?

  5. Read the source of arch/x86/entry/entry_64.S in the Linux kernel source tree, specifically the entry_SYSCALL_64 label. List in order the operations performed before the C function do_syscall_64() is called. What does this tell you about the cost of a system call?


References

  • Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 3, Chapter 5 (Protection)
  • AMD64 Architecture Programmer's Manual, Vol. 2, Chapter 4
  • Linux kernel source: arch/x86/entry/entry_64.S, arch/x86/entry/common.c, arch/x86/include/asm/current.h
  • KPTI commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=kaiser
  • Jonathan Corbet, "The current state of kernel page-table isolation", LWN.net, December 2017
  • Brendan Gregg, Systems Performance, 2nd ed., Addison-Wesley, 2020
  • Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman, Linux Device Drivers, 3rd ed., O'Reilly, 2005 (free online)
  • Meltdown paper: Lipp et al., "Meltdown: Reading Kernel Memory from User Space", USENIX Security 2018