02 — Privilege Escalation Techniques

Technical Overview

Kernel privilege escalation is the process of converting an arbitrary bug (memory corruption, logic flaw, race condition) into complete control of the system. The canonical goal is calling commit_creds(prepare_kernel_cred(NULL)) from ring 0, which assigns the current process the credentials of PID 0 (root with all capabilities). Everything between "the bug fires" and "commit_creds is called" is the exploit chain.

A modern kernel exploit is a multi-stage attack: each mitigation adds a stage. An exploit against a fully hardened 2024 kernel may require an information leak, a KASLR bypass, a stack pivot, a ROP chain, a SMEP bypass, and careful heap grooming — all coordinated to execute within a narrow timing window.

Prerequisites

Kernel virtual memory layout and KASLR
x86-64 calling convention, ROP (return-oriented programming) fundamentals
Linux credential structures (struct cred)
Interrupt and exception handling (CR4, ring transitions)
Kernel exploit classes (01-kernel-exploit-classes.md)

The Goal: commit_creds

The kernel represents process credentials in struct cred:

struct cred {
    atomic_t    usage;
    kuid_t      uid, euid, suid, fsuid;
    kgid_t      gid, egid, sgid, fsgid;
    struct group_info *group_info;
    kernel_cap_t cap_inheritable;
    kernel_cap_t cap_permitted;
    kernel_cap_t cap_effective;     /* all-1s = all capabilities */
    kernel_cap_t cap_bset;
    // ...
    struct user_namespace *user_ns;
    // ...
};

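prepare_kernel_cred(NULL) returns a new struct cred copied from init_cred, the credentials of the initial task: all UIDs and GIDs set to 0 and all capabilities set. commit_creds(cred) installs these credentials into the current task_struct. Since Linux 6.2, prepare_kernel_cred() no longer accepts NULL, so the same effect requires passing &init_task instead. The classic sequence: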

/* Kernel shellcode (ring 0 execution) */
commit_creds(prepare_kernel_cred(NULL));

After this call, the process has uid=0 and a full capability set in the initial user namespace, which is effectively root. Any subsequent execve("/bin/sh", ...) from user space spawns a root shell.
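From user space, a process's credentials are visible in /proc/<pid>/status, which is a convenient way to confirm what a task is actually running as. The sketch below parses the Uid and CapEff lines of that file. The helper name and the sample text are illustrative assumptions, not part of any kernel interface.

```python
def parse_status_creds(status_text):
    """Extract real/effective UID and the effective capability mask
    from the text of /proc/<pid>/status."""
    creds = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        fields = value.split()
        if key == "Uid" and len(fields) >= 2:
            # Field order on the Uid line: real, effective, saved, filesystem
            creds["uid"] = int(fields[0])
            creds["euid"] = int(fields[1])
        elif key == "CapEff" and fields:
            # CapEff is printed as a hexadecimal bitmask
            creds["cap_effective"] = int(fields[0], 16)
    return creds


# Made-up sample for an unprivileged process (assumption for illustration).
SAMPLE_STATUS = (
    "Name:\tbash\n"
    "Uid:\t1000\t1000\t1000\t1000\n"
    "CapEff:\t0000000000000000\n"
)

print(parse_status_creds(SAMPLE_STATUS))
```

On a Linux host the same function can be fed the contents of /proc/self/status; a root process shows euid 0 and a non-zero CapEff mask.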

Exploit Chain Overview

  ┌────────────────────────────────────────────────────────────────┐
  │                EXPLOIT CHAIN (modern hardened kernel)          │
  │                                                                │
  │  1. TRIGGER BUG                                                │
  │     └── Memory corruption / race / logic flaw                 │
  │                                                                │
  │  2. INFORMATION LEAK                                           │
  │     └── Read kernel pointer → defeat KASLR                    │
  │         Know kernel .text base address                         │
  │                                                                │
  │  3. HEAP GROOMING (if needed)                                  │
  │     └── Control allocator to place attacker data adjacent     │
  │         to vulnerable object                                   │
  │                                                                │
  │  4. PRIMITIVE ESTABLISHMENT                                    │
  │     ├── Write primitive: overwrite kernel function pointer     │
  │     ├── Read primitive: read arbitrary kernel memory          │
  │     └── Control flow hijack                                   │
  │                                                                │
  │  5. SMEP/SMAP BYPASS                                           │
  │     └── Pivot to kernel ROP gadgets (never execute user pages)│
  │         or disable SMEP/SMAP in CR4                           │
  │                                                                │
  │  6. commit_creds(prepare_kernel_cred(NULL))                    │
  │     └── Install root credentials in current task              │
  │                                                                │
  │  7. RETURN TO USER SPACE                                       │
  │     └── swapgs + iretq to user-mode root shell               │
  └────────────────────────────────────────────────────────────────┘

Stage 1: Information Leak — Defeating KASLR

KASLR (Kernel Address Space Layout Randomization) randomizes the base address of the kernel image at each boot. Without KASLR, the kernel is always loaded at the same virtual address (e.g., 0xffffffff81000000 on x86-64), allowing attackers to hardcode gadget addresses. With KASLR on x86-64, the base is chosen within a 1 GiB window at 2 MiB alignment, which gives at most 512 possible positions (about 9 bits of entropy).
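The practical entropy of the kernel base follows from counting aligned slots in the randomization window. A small worked calculation, assuming the 1 GiB window and 2 MiB alignment quoted above and ignoring the space taken by the image itself (so the result is an upper bound):

```python
import math

GIB = 1 << 30
MIB = 1 << 20


def kaslr_entropy_bits(window_bytes, align_bytes):
    """Upper bound on KASLR entropy: log2 of the number of aligned
    slots in the window (the image size is ignored)."""
    slots = window_bytes // align_bytes
    return slots, math.log2(slots)


slots, bits = kaslr_entropy_bits(1 * GIB, 2 * MIB)
print(f"{slots} slots, about {bits:.0f} bits of entropy")
```

A search space of a few hundred candidates is why side channels that merely distinguish mapped from unmapped addresses are enough to locate the kernel.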

An attacker must find the kernel base before programming ROP gadgets. Common information leak techniques:

/proc/kallsyms (Restricted Access)

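With unrestricted pointer printing, /proc/kallsyms shows real addresses to all users. The kptr_restrict sysctl (added in Linux 2.6.38) controls this: when it is in effect, readers without CAP_SYSLOG see 0000000000000000 for all symbols. Since Linux 4.15 the kernel is stricter still, and unprivileged readers generally see zeros even with kptr_restrict=0.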

cat /proc/kallsyms | grep " T prepare_kernel_cred"
# Root: ffffffff81076480 T prepare_kernel_cred
# Non-root (restricted): 0000000000000000 T prepare_kernel_cred

Kernel addresses still leak through other interfaces:
- /proc/modules: module load addresses (historically leaked; masked for unprivileged readers on modern kernels, and for everyone under kptr_restrict=2)
- dmesg: kernel pointers in printk output (restricted with kernel.dmesg_restrict=1)
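When auditing a host, it is useful to confirm that an unprivileged reader really does get masked addresses. The sketch below checks /proc/kallsyms-style text for that; the function name is a hypothetical helper, and the sample lines are the ones shown earlier in this section.

```python
def kallsyms_masked(kallsyms_text):
    """Return True if every symbol address in the given kallsyms text
    is zero, i.e. the reader is not shown real kernel addresses."""
    addresses = []
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Expected line shape: "<hex address> <type> <symbol> [module]"
        if len(fields) >= 3:
            addresses.append(int(fields[0], 16))
    # Empty input proves nothing, so report it as "not masked".
    return bool(addresses) and all(addr == 0 for addr in addresses)


ROOT_VIEW = "ffffffff81076480 T prepare_kernel_cred\n"
USER_VIEW = "0000000000000000 T prepare_kernel_cred\n"

print(kallsyms_masked(ROOT_VIEW), kallsyms_masked(USER_VIEW))
```

Running it as an unprivileged user against the real file (read with open("/proc/kallsyms")) should return True on a correctly restricted system.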

Kernel Pointer Leaks in Vulnerable Code

Most modern exploits use a bug to directly read kernel memory. If the bug allows an out-of-bounds read into adjacent kernel structures, or a use-after-free read of a freed object that was reallocated with data containing kernel pointers, the attacker can extract pointer values.

/* Example: UAF read leaks a kernel pointer */
// 1. Allocate obj_A (the bug leaves a readable reference to it in victim_ptr)
// 2. Free obj_A (victim_ptr still points at the freed slot)
// 3. Trigger allocation of obj_B in the same slot (SLUB reuses it)
//    obj_B holds a pointer into the kernel image, for example an ops-table
//    pointer such as f_op (which points into .rodata, not .text)
// 4. Read via victim_ptr at that field's offset → leaked_ptr
// 5. Subtract the symbol's offset within the image from leaked_ptr → runtime base

kernel_base = leaked_ptr - (symbol_link_addr - KERNEL_BASE_COMPILE_TIME);
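The base calculation is plain pointer arithmetic, and a sanity check on alignment catches the "wrong offset" failure early. A minimal sketch, assuming the default x86-64 link-time base; the symbol address and the leaked value in the example are made up for illustration.

```python
DEFAULT_LINK_BASE = 0xFFFFFFFF81000000  # default x86-64 link-time base of the image
KASLR_ALIGN = 2 * 1024 * 1024           # the base moves in 2 MiB steps


def kernel_base_from_leak(leaked_ptr, symbol_link_addr, link_base=DEFAULT_LINK_BASE):
    """Return (runtime_base, slide) given a leaked runtime pointer to a
    symbol whose link-time address is known from vmlinux."""
    offset_in_image = symbol_link_addr - link_base
    if offset_in_image < 0:
        raise ValueError("symbol lies below the link-time base")
    runtime_base = leaked_ptr - offset_in_image
    if runtime_base % KASLR_ALIGN != 0:
        # A misaligned result means the leak was matched to the wrong symbol.
        raise ValueError("computed base is not 2 MiB aligned")
    return runtime_base, runtime_base - link_base


# Hypothetical numbers: symbol linked at ...8201b2c0, observed at ...a081b2c0.
base, slide = kernel_base_from_leak(0xFFFFFFFFA081B2C0, 0xFFFFFFFF8201B2C0)
print(hex(base), hex(slide))
```

The alignment check is a cheap guard: a value that is not 2 MiB aligned cannot be a valid kernel base under the layout described above.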

Side-Channel Leaks

TSX timing: Intel TSX aborts suppress page faults, so timing a transactional access to a kernel address can reveal whether it is mapped (the DrK attack). TAA (CVE-2019-11135) is a separate TSX flaw that leaks stale data from microarchitectural buffers
Spectre variant 1 / 2: Speculative execution leaks values from kernel memory to user space via cache timing. The Spectre mitigations (retpoline, IBRS, IBPB) are imperfect and new microarchitectural variants continue to be discovered

Stage 2: Control Flow Hijack

Once the kernel base is known, the attacker must redirect execution. Common control flow hijack targets:

Function Pointer Overwrite

The kernel is full of function pointer tables. A memory corruption primitive that can write an 8-byte value to a known location can overwrite a function pointer:

/* struct file_operations — overwriting fop->read */
struct file_operations {
    struct module *owner;
    loff_t (*llseek)(...);
    ssize_t (*read)(...);        /* offset 16: overwrite this */
    ssize_t (*write)(...);
    // ...
};

/* After overwrite: user calls read() on this fd → kernel calls attacker's address */

Target structures with function pointers:
- struct file_operations (triggered by read/write/ioctl on a file)
- struct proto_ops (triggered by socket calls such as send/recv)
- struct seq_operations (triggered by read on a seq_file-backed /proc file)
- struct timer_list (triggered at timer expiry)
- struct notifier_block (triggered on network events, reboot, etc.)

Return Address Overwrite

Overwriting a saved return address on the kernel stack is the classic stack buffer overflow technique. It is less common today: stack protector canaries (CONFIG_STACKPROTECTOR) detect linear overflows before the function returns, and VMAP_STACK places unmapped guard pages around each kernel stack so that overflows past the end of the stack fault instead of corrupting adjacent memory.

If an overflow can reach the return address anyway (e.g., a non-linear write that skips the canary, or a canary leaked beforehand), the attacker points it at a ROP gadget in the kernel.

Stage 3: SMEP Bypass — No User Page Execution

SMEP (Supervisor Mode Execution Prevention): CR4 bit 20. When set, any attempt by ring-0 code to fetch instructions from a user-space page causes a page fault (#PF). This prevents the naive "overwrite function pointer with user-space shellcode address" approach.

Bypass 1: Kernel ROP (Return-Oriented Programming)

Instead of executing user-space shellcode, chain together small sequences of existing kernel code ("gadgets") that each end with a ret instruction. The chain is stored on the kernel stack; each gadget executes then returns to the next:

Kernel stack (controlled by attacker):
  ┌─────────────────────────────┐
  │ addr of gadget1: pop rdi; ret│  ← return address
  ├─────────────────────────────┤
  │ 0x0 (NULL for prepare_kernel_cred) │  ← argument (rdi)
  ├─────────────────────────────┤
  │ addr of prepare_kernel_cred │  ← gadget2
  ├─────────────────────────────┤
  │ addr of gadget3: mov rdi,rax; ret │  ← gadget3 (move return val to rdi)
  ├─────────────────────────────┤
  │ addr of commit_creds        │  ← gadget4
  ├─────────────────────────────┤
  │ addr of swapgs; ret         │  ← restore GS base for user mode
  ├─────────────────────────────┤
  │ addr of iretq               │  ← return to user mode
  ├─────────────────────────────┤
  │ user-mode RIP (user shell)  │
  │ CS, RFLAGS, RSP, SS         │  ← iretq frame
  └─────────────────────────────┘

ROP gadgets are found using tools like ROPgadget or ropper applied to vmlinux. Since the kernel base is known (from stage 1), gadget addresses are computed as kernel_base + gadget_offset.

Bypass 2: CR4 Write Gadget

If the attacker finds a kernel gadget that writes a controlled value to CR4, they can clear bit 20 (SMEP) and then execute user-space shellcode:

; Gadget in kernel: mov cr4, rdi; ret
; Attacker sets rdi = (current_cr4 & ~(1<<20))  — clears SMEP bit
; Now user-space pages are executable from ring 0

This technique became less viable after Linux 5.3 introduced CR4 pinning: native_write_cr4() checks each write against cr4_pinned_bits and restores any pinned bit that the new value tries to clear. The SMEP and SMAP bits are pinned on CPUs that support them.
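The effect of pinning can be pictured as a filter applied to every CR4 write. The following is a toy model only: the function name, the choice of pinned bits, and the sample CR4 value are assumptions for illustration, and the real check lives in the kernel's native_write_cr4().

```python
CR4_UMIP = 1 << 11
CR4_SMEP = 1 << 20
CR4_SMAP = 1 << 21


def pinned_cr4_write(requested, pinned_mask, pinned_bits):
    """Toy model of CR4 pinning.

    Returns (value_written, bits_restored): any pinned bit whose requested
    state differs from its pinned state is forced back before the write."""
    if (requested & pinned_mask) != pinned_bits:
        restored = (requested & pinned_mask) ^ pinned_bits
        value = (requested & ~pinned_mask) | pinned_bits
        return value, restored
    return requested, 0


# Assumed configuration: SMEP, SMAP and UMIP are all pinned on.
mask = CR4_SMEP | CR4_SMAP | CR4_UMIP
current = 0x300EF0                       # made-up CR4 value with all three set
attempt = current & ~CR4_SMEP            # a write that tries to clear SMEP
written, restored = pinned_cr4_write(attempt, mask, mask)
print(hex(written), hex(restored))
```

In this model the attempted write has no effect on SMEP: the value that reaches the register equals the original one, and the restored-bits result identifies what was tampered with.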

Stage 4: SMAP Bypass — No User Memory Access

SMAP (Supervisor Mode Access Prevention): CR4 bit 21. When set, ring-0 code accessing user-space virtual addresses causes a fault. This prevents kernel ROP chains from reading attacker-controlled data placed in user space.
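Both protections are single bits in CR4 (bit 20 for SMEP, bit 21 for SMAP), so a CR4 value taken from a debugger session or crash dump can be decoded directly. A minimal sketch; the function name and the sample values are illustrative assumptions.

```python
CR4_SMEP_BIT = 20
CR4_SMAP_BIT = 21


def cr4_protections(cr4):
    """Report whether SMEP and SMAP are enabled in a raw CR4 value."""
    return {
        "smep": bool(cr4 & (1 << CR4_SMEP_BIT)),
        "smap": bool(cr4 & (1 << CR4_SMAP_BIT)),
    }


for value in (0x3006F0, 0x1006F0, 0x0006F0):
    print(hex(value), cr4_protections(value))
```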

STAC/CLAC: The kernel uses stac (set the AC flag, temporarily disabling SMAP) and clac (clear AC, re-enabling SMAP) around legitimate user-memory accesses such as copy_to_user and copy_from_user. Because each is a single instruction, a stac; ret gadget disables SMAP for whatever runs next, and any gadget that falls inside a user-copy routine runs with SMAP already disabled.

Bypass strategy: Structure the ROP chain so that it never reads user-space data. Instead:
1. Keep the ROP chain on the kernel stack, which SMAP does not protect
2. If attacker data is needed, store it in kernel memory before triggering the exploit (e.g., pass it through setsockopt, which copies the buffer into the kernel with copy_from_user, then reference the kernel copy from the exploit)

Stage 5: Return to User Space

After commit_creds runs, the process has root credentials but is still executing in ring 0 with the wrong GS base register (kernel GS is loaded; user GS must be restored). The return sequence:

; Save user-mode return context before exploit:
;   user-mode CS, SS, RFLAGS, RSP, and the RIP to resume at

; After commit_creds:
swapgs          ; restore user GS base (from MSR_KERNEL_GS_BASE)
iretq           ; interrupt return: pops RIP, CS, RFLAGS, RSP, SS
                ; returns to user-mode address with user-mode stack

; In user space:
execve("/bin/sh", NULL, NULL)  ; spawn root shell

On kernels with KPTI (Kernel Page Table Isolation, merged in 4.15 for Meltdown mitigation), there is an additional trampoline: returning to user space also switches CR3 to the user-space page table. The exploit must use swapgs_restore_regs_and_return_to_usermode (a helper in the kernel's entry_64.S) rather than raw iretq.

Heap Grooming

Before triggering the bug, the attacker "grooms" the kernel heap to ensure specific structures are allocated in predictable locations adjacent to the vulnerable object.

Heap grooming strategy for a UAF exploit:

  1. Spray N objects of size X to fill current partial slab
  2. Free every other object → creates "holes"
  3. Allocate the vulnerable object → lands in a hole
  4. Trigger UAF to free it
  5. Spray target objects (e.g., msg_msg, pipe_buffer, seq_operations)
     of same size → lands in freed slot
  6. Access via dangling pointer → accesses attacker-controlled target
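Why the freed slot is handed straight back can be seen in a toy model of a fixed-size object cache with a last-in, first-out free list. This is a deliberately simplified sketch: the class and its names are hypothetical, and real SLUB adds per-CPU lists, freelist randomization (CONFIG_SLAB_FREELIST_RANDOM) and other hardening that make reuse far less deterministic.

```python
class ToySlabCache:
    """Toy model of one fixed-size object cache with a LIFO free list."""

    def __init__(self, nslots):
        # Reversed so that pop() hands out slot 0 first.
        self.free = list(range(nslots - 1, -1, -1))
        self.owner = {}

    def alloc(self, tag):
        slot = self.free.pop()
        self.owner[slot] = tag
        return slot

    def release(self, slot):
        del self.owner[slot]
        # The most recently freed slot is the next one handed out.
        self.free.append(slot)


cache = ToySlabCache(8)
fillers = [cache.alloc("filler") for _ in range(4)]
victim_slot = cache.alloc("victim")       # the object with the lifetime bug
cache.release(victim_slot)                # freed, but a stale reference remains
replacement_slot = cache.alloc("other")   # next same-size allocation

print(victim_slot, replacement_slot, cache.owner[victim_slot])
```

In the model, the stale slot index now refers to an object owned by someone else, which is the essence of a use-after-free; the hardening features listed above exist to break exactly this predictability.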

Objects and techniques commonly used for heap spraying:
- msg_msg (message queue): variable-size kernel allocation, user-controlled content via msgsnd()
- pipe_buffer: the per-pipe array of struct pipe_buffer is a kmalloc allocation (the data pages are allocated separately); content partially user-controlled
- sk_buff (socket buffer): network packet buffers, user-controlled via send()
- userfaultfd pages: user-space-backed pages that fault on demand. Not a spray object, but they allow pausing a kernel copy at a specific point during a race

Historical Context

2003-2008: "Kernel shellcode" era. SMEP/SMAP did not exist. Exploits simply placed shellcode in user space, overwrote a kernel function pointer with the shellcode address, and triggered it. Code like:

uint8_t shellcode[] = {
    /* commit_creds(prepare_kernel_cred(NULL)) in hand-coded asm */
};
mprotect(shellcode_page, PAGE_SIZE, PROT_EXEC);
/* overwrite kernel function pointer with shellcode address */

2011-2014: SMEP appeared (Intel Ivy Bridge, 2012). Plain ret2usr attacks died, and kernel ROP chains that stay within kernel text emerged. KASLR was not yet standard, so gadget addresses were fixed per kernel build.

2015-2018: KASLR widely deployed. Two-stage exploits: info leak first, then ROP. SMAP added complexity. commit_creds gadgets became standardized.

2019-present: Multi-stage exploitation with heap grooming, cross-cache attacks, and careful timing. Mitigations create an arms race: each new defense raises the bar but doesn't eliminate exploitation.

Production Examples

CVE-2022-2588 (cls_route use-after-free): A use-after-free in the route classifier (cls_route) of the traffic-control subsystem, reported by Zhenpeng Lin. A ROP-style chain for this bug follows the stages above:
1. Use the UAF in cls_route to read a net_device function pointer (info leak, KASLR bypass)
2. Use the write primitive to overwrite a seq_operations->start function pointer with a controlled address
3. Trigger the overwritten pointer via a read on /proc/net/...
4. ROP chain: commit_creds(prepare_kernel_cred(NULL))
5. Target: Ubuntu 22.04 LTS with a 5.15 kernel that predates the fix

The publicly released exploit for this CVE takes a data-only route instead (DirtyCred) and needs no ROP chain.

Dirty Pipe (CVE-2022-0847): A page-cache overwrite primitive in the pipe splice code. It is not a classical memory corruption but a flag initialization error: a stale PIPE_BUF_FLAG_CAN_MERGE flag let writes to a pipe land in page-cache pages spliced from a file. This allowed overwriting the cached contents of files the attacker could only read, including SUID binaries and /etc/passwd. No kernel code execution was needed; exploitation was pure data manipulation. Discovered and documented by Max Kellermann.

Debugging Notes

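# Check which exploit mitigations are compiled in
# (KASLR is CONFIG_RANDOMIZE_BASE; SMEP has no config option of its own)
grep -E "CONFIG_RANDOMIZE_BASE|CONFIG_X86_SMAP|CONFIG_VMAP_STACK" /boot/config-$(uname -r)
# Check that the CPU advertises SMEP/SMAP
grep -o -w -E "smep|smap" /proc/cpuinfo | sort -u

# Check current KASLR offset (root only)
grep " T startup_64" /proc/kallsyms
# Compare to build-time offset in vmlinux

# Monitor setuid-family syscalls via audit
# (a kernel exploit that calls commit_creds directly never issues these)
auditctl -a always,exit -F arch=b64 -S setuid,setgid,setresuid,setresgid

# Kernel live-patching protects against known escalation vectors
systemctl status kpatch  # or livepatch

# Check if unprivileged user namespaces are enabled (major attack vector)
# (this sysctl is a Debian/Ubuntu patch; it is absent on mainline kernels)
cat /proc/sys/kernel/unprivileged_userns_clone
# Ubuntu 23.10+: restricted via apparmor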
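The sysctl side of these checks can be scripted. The sketch below reads a handful of hardening sysctls and lists the ones set to a permissive value. The thresholds are assumptions chosen for illustration (distributions differ on what they consider hardened), and the helper names are hypothetical.

```python
from pathlib import Path

# sysctl path relative to /proc/sys -> (predicate for "hardened", purpose)
HARDENING_CHECKS = {
    "kernel/kptr_restrict": (lambda v: v >= 1, "hide kernel pointers"),
    "kernel/dmesg_restrict": (lambda v: v == 1, "restrict dmesg"),
    "kernel/perf_event_paranoid": (lambda v: v >= 2, "limit unprivileged perf"),
    "vm/unprivileged_userfaultfd": (lambda v: v == 0, "limit userfaultfd"),
}


def read_sysctls(root="/proc/sys"):
    """Read each checked sysctl; None means absent or unreadable."""
    values = {}
    for name in HARDENING_CHECKS:
        try:
            values[name] = int((Path(root) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[name] = None
    return values


def weak_settings(values):
    """Return the names of sysctls whose value fails its hardening check."""
    weak = []
    for name, (is_hardened, _purpose) in HARDENING_CHECKS.items():
        value = values.get(name)
        if value is not None and not is_hardened(value):
            weak.append(name)
    return weak


print(weak_settings(read_sysctls()))
```

Missing sysctls are skipped rather than flagged, since several of them only exist on certain kernel versions.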

Security Implications

The post-exploitation impact is total:
- Credential theft: read /etc/shadow, extract keys from the memory of any process
- Backdoor installation: modify the kernel syscall table, insert a rootkit module
- Container escape: switch the task into the host's namespaces and regain the host view
- Hypervisor escape (rare; requires additional bugs): escalate from guest kernel to hypervisor

Performance Implications

Well-crafted exploits target rarely-exercised kernel paths to avoid performance regression before the exploit fires. However, the exploitation process itself — heap spraying, race condition driving, ROP chain execution — can cause system instability, particularly if the exploit crashes before completing.

Failure Modes

Kernel oops during exploitation: Double-free, invalid dereference in exploit path. The kernel logs an oops and kills the offending task; with panic_on_oops set, the system panics and often reboots. Forensics: kdump captures the crash dump with full context.
Timing failure in race: Race condition exploit requires precise timing. On loaded systems, or where stalling primitives such as unprivileged userfaultfd are disabled, the timing may not be achievable.
Wrong KASLR bypass: Info leak parsed incorrectly. ROP gadget addresses wrong by offset. Kernel panics.

Modern Usage

In 2025, reliable kernel exploits against fully patched Ubuntu/RHEL/Fedora require significant research effort. Bug classes continue to shift:
- io_uring's asynchronous execution model creates new race condition opportunities
- eBPF verifier continues to have logic bugs
- netfilter (nftables/iptables) has been a source of multiple CVEs (2022-2024)

Future Directions

Shadow stacks (Intel CET, x86-64): Hardware-enforced shadow stack stores return addresses separately from the regular stack. An attacker cannot overwrite a return address on the regular stack — the shadow stack check will fail. Return address ROP gadgets become infeasible. Linux merged CET shadow stack support in 6.6 for user space; kernel space support is in progress.

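KCFI (Kernel Control Flow Integrity): Forward-edge CFI in which the compiler emits a type hash before each function and a check before each indirect call. A call [rax] in the kernel only proceeds if the hash stored ahead of the target matches the prototype expected at that call site; otherwise the kernel traps. Clang's earlier jump-table CFI was merged for ARM64 in 5.13; KCFI replaced it in 6.1, which also added x86-64 support.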
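The forward-edge check can be pictured as a type-hash comparison made at each indirect call. The following is a conceptual model only: CRC32 of a prototype string stands in for the compiler's real type hash, and all class and function names are hypothetical.

```python
import zlib


def type_hash(prototype):
    """Stand-in for the compiler-generated type hash of a prototype."""
    return zlib.crc32(prototype.encode())


class CheckedFunction:
    """A callable tagged with the hash of its prototype."""

    def __init__(self, func, prototype):
        self.func = func
        self.hash = type_hash(prototype)


def indirect_call(target, expected_prototype, *args):
    """Call target only if its hash matches what the call site expects."""
    if target.hash != type_hash(expected_prototype):
        raise RuntimeError("CFI failure: prototype mismatch at indirect call")
    return target.func(*args)


READ_PROTO = "ssize_t(struct file *, char *, size_t, loff_t *)"
TIMER_PROTO = "void(struct timer_list *)"

read_handler = CheckedFunction(lambda n: n * 2, READ_PROTO)
print(indirect_call(read_handler, READ_PROTO, 21))
```

The model also shows the granularity of the scheme: the check passes for any function with a matching prototype, so it narrows the set of reachable targets rather than pinning a call site to one function.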

Exercises

Set up a QEMU VM running Ubuntu 22.04 with a debug kernel (no KASLR, no SMEP). Write a simple kernel module with an intentional null pointer dereference. Map page 0 from user space and place commit_creds(prepare_kernel_cred(NULL)) shellcode. Verify the null dereference executes it.
Enable KASLR and SMEP (real or emulated via kernel config). Port the above exploit to use a kernel ROP chain. Use ROPgadget --binary vmlinux to find pop rdi; ret and ret gadgets.
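Write a script that, run as root, reads the address of _text from /proc/kallsyms and computes the kernel slide against the link-time address in vmlinux. Then inspect /proc/modules and note that module addresses are randomized separately from the kernel image.
Reproduce the Dirty Pipe exploit (CVE-2022-0847) in a disposable VM on a kernel between 5.8 and 5.16.10. Modify /etc/passwd to set root's password to a known value. Understand why this works without code execution.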
Audit a recent kernel CVE: pick a CVE from 2024, find the patch commit on kernel.org, and write a one-page analysis identifying: (a) the bug class, (b) the exploitability conditions, (c) the patch approach.

References

Jann Horn, Google Project Zero: https://googleprojectzero.blogspot.com
CVE-2022-2588 exploit (Zhenpeng Lin): https://github.com/Markakd/CVE-2022-2588
Dirty Pipe analysis (Max Kellermann): https://dirtypipe.cm4all.com/
"Exploiting the Linux Kernel via packet sockets" — Andrey Konovalov
"Linux kernel exploitation technique: struct seq_operations" — pwndbg blog
LWN "Kernel hardening" series: https://lwn.net/Kernel/Index/
arch/x86/entry/entry_64.S — ring transition code
Intel 64 and IA-32 Architectures SDM, Vol. 3A: Section 2.5 (control registers, including CR4.SMEP and CR4.SMAP) and Section 4.6 (access rights)
KCFI design document: Documentation/kbuild/llvm.rst