01 — Kernel Panic Analysis

Technical Overview

A kernel panic is the Linux kernel's response to an unrecoverable error — a situation where continuing execution would risk data corruption or undefined behavior worse than stopping immediately. From a production SRE perspective, kernel panics are among the most disruptive failure modes: they halt all processes on the machine, requiring a reboot or a hard reset. Understanding how to read panic messages, capture crash dumps, and extract root causes from vmcore files is an essential skill for anyone running production Linux at scale.

The terminology matters: a kernel oops is an error the kernel may survive (it logs the fault, kills the offending task, and attempts to continue); a kernel panic is fatal (the system halts). An oops is promoted to a panic if panic_on_oops is set, or if it occurs in interrupt context, where there is no task to kill and recovery is impossible. In production systems, panic_on_oops=1 is often set to ensure crash dumps are generated rather than allowing the kernel to limp along in a potentially corrupted state.

Prerequisites

Linux kernel fundamentals (virtual memory, interrupt handling, kernel/user space boundary)
Basic C programming (reading C struct layouts, pointer arithmetic)
Familiarity with x86-64 register conventions
Understanding of ELF binaries and symbol tables

Core Content

Anatomy of a Kernel Oops/Panic

When the kernel detects an unrecoverable error, it calls panic() (for fatal conditions) or BUG() / BUG_ON(condition) (for assertion failures in kernel code). The resulting message is printed to the kernel log buffer (dmesg) and serial console. Understanding its structure is the first step in diagnosis.

ANNOTATED KERNEL OOPS MESSAGE

[  1234.567890] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
                ^                                                          ^
                timestamp (seconds since boot)                            faulting address

[  1234.567891] PGD 0 P4D 0
                ^ page table entries — all zero means completely unmapped address

[  1234.567892] Oops: 0002 [#1] PREEMPT SMP PTI
                       ^^^^                 ^^^
                       error code:          PTI = Page Table Isolation active
                       0002 = write access to non-present page (0000: read, 0002: write)
                       [#1] = first oops since boot

[  1234.567893] CPU: 3 PID: 12345 Comm: mydriver Tainted: G           O      5.15.0-76-generic #83-Ubuntu
                ^^^^      ^^^^^    ^^^^^^^^^^^   ^^^^^^^^^^^^^^^^^^^^^^
                CPU core  PID      process name  taint flags (O = an out-of-tree module is loaded)

[  1234.567894] Hardware name: Dell Inc. PowerEdge R750/0HRT4F, BIOS 1.8.2 01/15/2023

[  1234.567895] RIP: 0010:my_driver_write+0x78/0x1a0 [mydriver]
                ^^^^ ^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
                Reg   CS   symbol+offset/total_function_size [module]
                Instruction Pointer = where the crash occurred

[  1234.567896] Code: 48 8b 47 10 48 85 c0 74 0a 48 8b 00 48 85 c0 74 02 ff d0 31 d2 ...
                ^ Raw opcodes surrounding the faulting instruction (for disassembly)

[  1234.567897] RSP: 0018:ffffb3c8c1233ca8 EFLAGS: 00010282
[  1234.567898] RAX: 0000000000000000 RBX: ffff9a7b82f3d800 RCX: 0000000000000000
[  1234.567899] RDX: 0000000000000000 RSI: 00007fffeab12340 RDI: ffff9a7b82f3d800
[  1234.567900] RBP: ffffb3c8c1233ce8 R08: 0000000000000000 R09: 0000000000000000
[  1234.567901] R10: 0000000000000000 R11: 0000000000000246 R12: ffff9a7b82f3d800
[  1234.567902] R13: 0000000000000400 R14: 00007fffeab12340 R15: 0000000000000400
                ^ All 64-bit general-purpose registers at crash time

[  1234.567903] FS:  00007f8b2c5d5740(0000) GS:ffff9a7bc5f80000(0000) knlGS:0000000000000000
[  1234.567904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  1234.567905] CR2: 0000000000000018
                ^^^^^^^^^^^^^^^^^^^
                CR2 = faulting virtual address (the NULL+0x18 that triggered the fault)

[  1234.567906] Call Trace:
[  1234.567907]  <TASK>
[  1234.567908]  vfs_write+0xb1/0x290
[  1234.567909]  ksys_write+0x55/0xd0
[  1234.567910]  __x64_sys_write+0x1d/0x30
[  1234.567911]  do_syscall_64+0x3b/0x90
[  1234.567912]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
                ^ Call stack (read top to bottom = most recent → oldest frame)

[  1234.567913] Modules linked in: mydriver(O) kvm_intel kvm irqbypass ...
                                             ^
                                             (O) = out-of-tree (not upstream) module
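
The two fields that usually settle the fault type, the hex error code after "Oops:" and the address in CR2, can be decoded mechanically. The following is a minimal sketch, not part of any kernel tooling: the helper names decode_pf_error_code and classify_fault_address are hypothetical, the bit meanings follow the x86 page-fault error code layout, and treating any address below 0x1000 as "NULL plus a field offset" is a heuristic assumed here.

```shell
# Decode the hex value printed after "Oops:" for an x86 page fault.
# Bit 0: protection violation (else not-present page), bit 1: write,
# bit 2: user mode, bit 3: reserved bit set, bit 4: instruction fetch.
decode_pf_error_code() (
    code=$(( 0x$1 ))
    if [ $(( code & 1 )) -ne 0 ]; then cause="protection violation"; else cause="not-present page"; fi
    if [ $(( code & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( code & 16 )) -ne 0 ]; then access="instruction fetch"; fi
    if [ $(( code & 4 )) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    extra=""
    if [ $(( code & 8 )) -ne 0 ]; then extra=", reserved bit set"; fi
    printf '%s-mode %s, %s%s\n' "$mode" "$access" "$cause" "$extra"
)

# Heuristic (an assumption of this sketch): a CR2 value below 0x1000 is
# most likely a NULL struct pointer plus a field offset.
classify_fault_address() (
    addr=${1#0x}
    digits=${addr#"${addr%%[!0]*}"}          # strip leading zeros
    if [ -z "$digits" ]; then digits=0; fi
    if [ "${#digits}" -le 3 ]; then
        printf 'likely NULL pointer + field offset %d (0x%s)\n' "$(( 0x$digits ))" "$digits"
    else
        printf 'not near NULL: 0x%s\n' "$addr"
    fi
)

decode_pf_error_code 0002
classify_fault_address 0000000000000018
```

Applied to the annotated example, the first call reports a kernel-mode write to a not-present page and the second reports a NULL pointer plus offset 24, which matches the manual reading of the Oops and CR2 lines.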

Kernel Panic Types

NULL pointer dereference (most common): accessing address 0 or near-0. In the example above, CR2=0x18 means the code accessed struct->field where struct was NULL (field offset = 0x18 = 24 bytes). Fix: add null checks before dereferencing pointers.

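Stack overflow: each kernel thread has a small, fixed-size stack (16KB on x86-64 since Linux 3.15; 8KB before that). Deep recursion or large on-stack allocations overflow it. With CONFIG_VMAP_STACK the overflow hits a guard page and is reported as "BUG: stack guard page was hit", often escalating to a double fault; without it, the overflow silently corrupts adjacent memory. A panic in __stack_chk_fail ("stack-protector: Kernel stack is corrupted in: ...") indicates a smashed stack canary, that is, a buffer overrun inside one frame rather than exhaustion of the whole stack. Kernel stack overflows are usually fatal immediately.

Double fault: a fault raised while the CPU is trying to deliver another exception. Usually indicates a kernel stack overflow (the CPU cannot push the exception frame) or severe memory corruption.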

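Machine check exception (MCE): hardware error (ECC memory error, CPU internal error, bus error). Exposed under /sys/devices/system/machinecheck/ and logged by a userspace daemon (mcelog or rasdaemon). MCE panics usually indicate failing hardware: use the logged bank and status to identify the component, then replace it.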

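RCU stall: a CPU is blocking Read-Copy-Update grace periods, preventing memory reclamation. Common cause: a CPU spinning in an interrupt handler, softirq, or preempt-disabled section for longer than the stall timeout (21 seconds by default upstream). Often indicates runaway interrupt handling or firmware bugs. By default a stall is only a warning; it becomes a panic when kernel.panic_on_rcu_stall=1.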

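Softlockup: a CPU has been running kernel code without scheduling for more than 2 x kernel.watchdog_thresh seconds (20 seconds with the default threshold of 10). Detection is done by a per-CPU high-resolution timer, not by the NMI. Common cause: a spinlock held too long, or an infinite loop in kernel code. It is a warning unless kernel.softlockup_panic=1.

Hardlockup: a CPU has been running with interrupts disabled for more than kernel.watchdog_thresh seconds (10 by default), so not even timer interrupts are handled. The NMI watchdog detects this precisely because the NMI can still fire when ordinary interrupts are masked. Often indicates a kernel bug that spins with interrupts off, or hardware and firmware issues. It is a warning unless kernel.hardlockup_panic=1.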

Hung task: a task has been uninterruptible (D state) for more than kernel.hung_task_timeout_secs (default 120s). Common cause: NFS mount hanging, SCSI device I/O timing out without error, or deadlock in kernel I/O path. It is a warning unless kernel.hung_task_panic=1.
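
The relationship between the two lockup detectors is easy to get backwards, so here is a small sketch of the arithmetic. The helper name lockup_thresholds is hypothetical; the only inputs assumed are the value of kernel.watchdog_thresh (10 by default) and the rule that the soft lockup threshold is twice that value.

```shell
# Derive lockup-detector thresholds from kernel.watchdog_thresh (seconds).
# Hard lockup: watchdog_thresh. Soft lockup: 2 * watchdog_thresh.
lockup_thresholds() (
    thresh=${1:-10}      # 10 is the kernel default
    printf 'hardlockup=%ss softlockup=%ss\n' "$thresh" "$(( thresh * 2 ))"
)

lockup_thresholds 10
# On a live system the current value could be passed in instead:
#   lockup_thresholds "$(cat /proc/sys/kernel/watchdog_thresh)"
```

Lowering watchdog_thresh therefore tightens both detectors at once; the two cannot be tuned independently through this sysctl.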

Decoding Symbols

The oops shows my_driver_write+0x78/0x1a0. To find the exact source line:

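# Method 1: faddr2line (kernel script; requires the module or vmlinux built with debug info)
# Plain "addr2line -e mydriver.ko 0x78" is wrong here: 0x78 is an offset from the
# start of my_driver_write, not from the start of the .text section.
./scripts/faddr2line /lib/modules/$(uname -r)/kernel/drivers/mydriver/mydriver.ko \
  my_driver_write+0x78/0x1a0
# Output: my_driver_write at mydriver.c:145

# Method 2: objdump (disassemble the function, then find the instruction at start + 0x78)
# --disassemble=SYMBOL needs binutils 2.32 or newer
objdump -d --disassemble=my_driver_write \
  /lib/modules/$(uname -r)/kernel/drivers/mydriver/mydriver.ko

# Method 3: /proc/kallsyms (run as root on the boot that crashed; lists built-in and module symbols)
grep "my_driver_write" /proc/kallsyms
# Output: ffffffffc0a23100 t my_driver_write [mydriver]
# Then: 0xffffffffc0a23100 + 0x78 = 0xffffffffc0a23178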

# Method 4: decode_stacktrace.sh (kernel script, handles full oops automatically)
# Feed the oops text to decode_stacktrace.sh with the kernel build directory
./scripts/decode_stacktrace.sh /path/to/vmlinux /usr/src/linux < oops.txt
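
When triaging many oopses it helps to split the symbol+offset/size notation and do the address arithmetic in a script. The sketch below is illustrative only: parse_rip and add_offset are hypothetical helper names, and add_offset assumes a 16-hex-digit address as printed by /proc/kallsyms. It adds the offset in 32-bit halves so it does not depend on how a particular shell handles 64-bit overflow.

```shell
# Split "symbol+0xOFF/0xSIZE" (as printed on the RIP line) into its parts.
parse_rip() (
    loc=$1
    sym=${loc%%+*}       # text before the first "+"
    rest=${loc#*+}
    off=${rest%%/*}      # offset of the faulting instruction in the function
    size=${rest#*/}      # total size of the function
    printf 'symbol=%s offset=%s size=%s\n' "$sym" "$off" "$size"
)

# Add an offset to a 16-hex-digit symbol address, working on the high and
# low 32-bit halves separately to stay within portable shell arithmetic.
add_offset() (
    addr=${1#0x}
    off=$2
    hi=${addr%????????}  # first 8 hex digits
    lo=${addr#"$hi"}     # last 8 hex digits
    sum=$(( 0x$lo + $off ))
    printf '0x%08x%08x\n' "$(( 0x$hi + (sum >> 32) ))" "$(( sum & 0xffffffff ))"
)

parse_rip 'my_driver_write+0x78/0x1a0'
add_offset ffffffffc0a23100 0x78
```

The second call reproduces the manual calculation from Method 3 and yields the absolute address of the faulting instruction for that boot.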

kdump Setup and vmcore Capture

kdump uses kexec to preload a second "crash kernel" into a memory region reserved at boot time. When the production kernel panics, it jumps directly into the crash kernel without passing through firmware, so the memory of the crashed kernel is left intact. The crash kernel exposes that memory as /proc/vmcore, which is then saved to disk or sent over the network.

# Install kdump-tools (Ubuntu/Debian)
apt install linux-crashdump kdump-tools

# Reserve memory for crash kernel (add to GRUB_CMDLINE_LINUX)
# /etc/default/grub:
GRUB_CMDLINE_LINUX="crashkernel=512M"  # Reserve 512MB for crash kernel
# update-grub && reboot

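# Verify kdump is configured and crash kernel is loaded
kdump-config show
grep -o "crashkernel=[^ ]*" /proc/cmdline
cat /sys/kernel/kexec_crash_loaded  # 1 = crash kernel is loaded
cat /sys/kernel/kexec_crash_size    # bytes of memory reserved

# Configure where vmcore is saved (Ubuntu/Debian: /etc/default/kdump-tools;
# RHEL uses "path /var/crash" in /etc/kdump.conf)
KDUMP_SYSCTL="kernel.panic_on_oops=1"
KDUMP_COREDIR="/var/crash"

# Configure the makedumpfile dump level (reduce vmcore size by filtering pages)
# -d 31 = most aggressive filtering (skip zero, cache, user, and free pages); -l adds LZO compression
# (RHEL equivalent: "core_collector makedumpfile -l --message-level 1 -d 31" in /etc/kdump.conf)
MAKEDUMP_ARGS="-l --message-level 1 -d 31"

# After a crash, the dump appears in a directory named after the crash time:
ls /var/crash/
# RHEL-style layout:     <host>-<date>-<time>/vmcore and vmcore-dmesg.txt
# Ubuntu/Debian layout:  <timestamp>/dump.<timestamp> and dmesg.<timestamp>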
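
A readiness check like the one above is worth scripting so that it can run from monitoring. The following is a sketch under stated assumptions: kdump_ready is a hypothetical helper name, it reads the kexec_crash_loaded and kexec_crash_size files under /sys/kernel, and the optional directory argument exists only so the function can be exercised against a test directory.

```shell
# Report whether a crash kernel is loaded and how much memory is reserved.
# Optional argument: directory holding the two sysfs files (default /sys/kernel).
kdump_ready() (
    dir=${1:-/sys/kernel}
    loaded=$(cat "$dir/kexec_crash_loaded" 2>/dev/null || echo 0)
    size=$(cat "$dir/kexec_crash_size" 2>/dev/null || echo 0)
    if [ "$loaded" = "1" ] && [ "$size" -gt 0 ]; then
        printf 'kdump ready: %s MiB reserved\n' "$(( size / 1048576 ))"
    else
        printf 'kdump NOT ready (loaded=%s, reserved bytes=%s)\n' "$loaded" "$size"
        false            # non-zero exit status for callers and monitoring
    fi
)

kdump_ready || true
```

A non-zero exit status means a panic on this machine would not produce a vmcore, which is usually worth an alert on its own.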

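makedumpfile reduces vmcore size significantly. A 256GB system produces a roughly 256GB raw vmcore; with -d 31 plus compression the result is typically a small fraction of that (often in the range of 100-500MB, more when the kernel itself holds a lot of memory), because pages that are not relevant to kernel crash analysis are skipped: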

# Compress vmcore for transfer/storage
makedumpfile -l --message-level 1 -d 31 /proc/vmcore /var/crash/vmcore.compressed

# Extract dmesg from vmcore (no crash tool needed)
makedumpfile --dump-dmesg /proc/vmcore /tmp/dmesg.txt
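
The dump level passed to -d is a bitmask, which explains why 31 is the maximum. As a sketch (decode_dump_level is a hypothetical helper name; the bit assignments follow the dump level table in the makedumpfile manual page), the excluded page types can be listed like this:

```shell
# List the page types that a makedumpfile dump level (-d N) excludes.
# Bits: 1 zero pages, 2 non-private cache, 4 private cache, 8 user data, 16 free pages.
decode_dump_level() (
    level=$1
    out=""
    if [ $(( level & 1 ))  -ne 0 ]; then out="$out zero-pages"; fi
    if [ $(( level & 2 ))  -ne 0 ]; then out="$out cache-pages"; fi
    if [ $(( level & 4 ))  -ne 0 ]; then out="$out private-cache-pages"; fi
    if [ $(( level & 8 ))  -ne 0 ]; then out="$out user-pages"; fi
    if [ $(( level & 16 )) -ne 0 ]; then out="$out free-pages"; fi
    printf 'excluded:%s\n' "${out:- nothing}"
)

decode_dump_level 31
decode_dump_level 1
```

Intermediate values are useful when a problem involves user memory or page cache: for example, a level of 17 drops only zero and free pages and keeps everything else.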

crash Tool Analysis

The crash utility (from Dave Anderson, Red Hat) analyzes vmcore files using the vmlinux debug symbol file. It provides an interactive gdb-like interface for navigating kernel data structures.

# Install crash
apt install crash   # or yum install crash

# Open a vmcore (requires vmlinux with debug symbols matching the panicking kernel)
crash /usr/lib/debug/boot/vmlinux-5.15.0-76-generic \
      /var/crash/2024-01-15-14:32/vmcore

# Inside crash:

# bt: backtrace of the panicking thread
crash> bt
PID: 12345  TASK: ffff9a7b82f3d800  CPU: 3   COMMAND: "mydriver"
 #0 [ffffb3c8c1233b80] machine_kexec at ffffffffa5e6b3d8
 #1 [ffffb3c8c1233bd8] __crash_kexec at ffffffffa5f3e2a4
 #2 [ffffb3c8c1233ca0] oops_end at ffffffffa5e3bc27
 #3 [ffffb3c8c1233cc0] no_context at ffffffffa5e4c1f3
 #4 [ffffb3c8c1233d00] __bad_area_nosemaphore at ffffffffa5e4c47e
 #5 [ffffb3c8c1233d48] bad_area at ffffffffa5e4c543
 #6 [ffffb3c8c1233d60] do_page_fault at ffffffffa5e4ca72
 #7 [ffffb3c8c1233db0] page_fault at ffffffffa5ea016e
 #8 [ffffb3c8c1233e48] my_driver_write at ffffffffc0a23178 [mydriver]
 #9 [ffffb3c8c1233eb0] vfs_write at ffffffffa5e9b2c1

# bt -a: backtrace of the active task on every CPU (use "foreach bt" for ALL tasks)
crash> bt -a | grep "PID:" | head -20

# ps: list all processes
crash> ps
   PID    PPID  CPU  TASK             ST  %MEM  VSZ   RSS  COMM
      1       0   0  ffff9a7b80010000  IN   0.2  168736  12416  systemd
  ...

# vm: virtual memory areas of a process
crash> vm 12345

# files: open files for a process
crash> files 12345

# log: print kernel message buffer (dmesg equivalent)
crash> log
# or for last N messages:
crash> log | tail -100

# struct: print a kernel struct
crash> struct task_struct ffff9a7b82f3d800
crash> struct sk_buff ffff9a7bc2f40000

# dis: disassemble a function
crash> dis my_driver_write
crash> dis -rl my_driver_write+0x78   # from function start up to the fault, with source lines

# kmem: kernel memory usage
crash> kmem -i  # memory info summary
crash> kmem -s  # slab cache info

# mod: loaded modules
crash> mod
crash> mod -s mydriver /lib/modules/5.15.0-76-generic/kernel/drivers/mydriver/mydriver.ko
# loads symbols for the module, enabling bt and dis to show source info

# rd: read memory at address
crash> rd ffff9a7b82f3d800   # read one 64-bit word at address
crash> rd -S ffff9a7b82f3d800 20  # read 20 words; values that point into slab objects are annotated
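
The same commands can be run unattended, which is useful for producing a first-pass report as soon as a vmcore lands. The sketch below only writes a command file; write_triage_commands is a hypothetical helper name, and the crash invocation shown in the trailing comment uses placeholder paths and assumes crash's -s (silent) and -i (input file) options.

```shell
# Write a crash command file for unattended first-pass triage.
write_triage_commands() {
    cat > "$1" <<'EOF'
bt
bt -a
log
mod
kmem -i
exit
EOF
}

cmds=$(mktemp)
write_triage_commands "$cmds"
echo "wrote triage commands to $cmds"
# Hypothetical invocation (replace the placeholder paths with your own files):
#   crash -s /path/to/vmlinux /path/to/vmcore -i "$cmds" > triage.txt
```

The trailing exit matters: without it crash drops into its interactive prompt after running the file.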

KASAN and KFENCE in Oops Messages

Modern kernels ship with additional memory safety tools that produce extended oops information:

KASAN (Kernel Address Sanitizer) — enabled with CONFIG_KASAN:

BUG: KASAN: use-after-free in my_driver_write+0x78/0x1a0
Write of size 8 at addr ffff8880052f3d80 by task kworker/0:1/45
                         ^^^^^^^^^^^^^^^^ the bad address

CPU: 0 PID: 45 Comm: kworker/0:1

Allocated by task 12345:
 kmalloc+0x... (called from my_driver_alloc+0x40)
 my_driver_alloc+0x40/0x80

Freed by task 12345:
 kfree+0x... (called from my_driver_release+0x2c)
 my_driver_release+0x2c/0x80

KASAN catches: use-after-free, heap buffer overflow, stack buffer overflow, global variable overflow. Overhead: ~1.5-2x memory, ~2x CPU. Not for production — for development/testing.

KFENCE (Kernel Electric Fence) — enabled with CONFIG_KFENCE, production-safe (<1% overhead):

BUG: KFENCE: out-of-bounds write in my_driver_write+0x78/0x1a0
Out-of-bounds write at 0x00000000052f3d88 (8 bytes right of kfence-#42):
 my_driver_write+0x78/0x1a0

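KFENCE samples allocations on a timer rather than guarding all of them: once per sample interval (kfence.sample_interval, 100 ms by default) the next slab allocation is placed in a guarded pool. Only that small sample is protected, so a given bug may need to occur many times across a fleet before it is caught, but the overhead stays low enough for production kernels.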
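
Both tools start their reports with a headline of the same shape, which makes them easy to pull out of a large log. A minimal sketch, assuming headlines of the form shown in the examples above (summarise_sanitizer_reports is a hypothetical helper name, and real reports may carry extra trailing fields such as a module name):

```shell
# Summarise KASAN/KFENCE headline lines read from stdin as
# "TOOL|bug type|function+offset/size".
summarise_sanitizer_reports() {
    sed -n -E 's/^.*BUG: (KASAN|KFENCE): (.+) in ([^ ]+).*$/\1|\2|\3/p'
}

printf '%s\n' \
    '[  12.345678] BUG: KASAN: use-after-free in my_driver_write+0x78/0x1a0' \
    'BUG: KFENCE: out-of-bounds write in my_driver_write+0x78/0x1a0' \
    'an unrelated log line' | summarise_sanitizer_reports
```

Piping the result through sort and uniq -c gives a quick count of distinct bug sites across many hosts.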

Historical Context

Kernel panic analysis has been part of Unix systems administration since the BSD era (1970s-1980s). The equivalent in BSD was the core dump written to the swap partition (savecore). Linux adopted similar mechanisms.

The crash tool was developed by Dave Anderson at Red Hat in the early 2000s, building on the work of the LKCD (Linux Kernel Crash Dumps) project. It has been the standard tool for vmcore analysis ever since. kexec-based kdump (Vivek Goyal, IBM, 2004) replaced earlier mechanisms such as LKCD, netdump, and diskdump, which wrote the dump from inside the crashed kernel instead of from a separately booted one.

KASAN was contributed by Andrey Ryabinin in 2015 (Linux 4.0). KFENCE (Google, 2021 — Linux 5.12) was specifically designed to be production-safe through probabilistic sampling, making always-on kernel memory error detection feasible.

Production Examples

# Check if kdump service is active
systemctl status kdump  # RHEL/CentOS
systemctl status kdump-tools  # Ubuntu/Debian

# Force a kernel panic to test kdump (WARNING: crashes the system)
echo s > /proc/sysrq-trigger   # optional: sync filesystems first
echo c > /proc/sysrq-trigger   # triggers a deliberate crash and panic

# Verify crash kernel loaded
dmesg | grep -i "crashkernel"
# Output: crashkernel reserved: 0x000000007e000000 - 0x000000009e000000 (512 MB)

# After a crash: extract dmesg from vmcore without running crash
makedumpfile --dump-dmesg /var/crash/vmcore /tmp/dmesg-from-crash.txt

# Check vmcore validity
file /var/crash/vmcore
# Output: /var/crash/vmcore: ELF 64-bit LSB core file, x86-64, version 1 (SYSV)

# Quick triage: get just the call trace from the previous boot's kernel messages
# (needs a persistent journal; the last messages before a panic often never reach disk)
journalctl -k -b -1 | grep -A 50 "Oops:"
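
A fixed grep -A 50 window either truncates long traces or drags in unrelated lines. As an alternative sketch (extract_first_oops is a hypothetical helper name; the start and end markers and the 200-line cap are assumptions of this example), awk can print from the first fault headline to the "---[ end" marker:

```shell
# Print the first oops/panic report found on stdin: from the first line
# containing "BUG:", "Oops:" or "Kernel panic" through the "---[ end ..."
# marker, or at most 200 lines if no end marker is found.
extract_first_oops() {
    awk '
        !started && /BUG:|Oops:|Kernel panic/ { started = 1 }
        started { print; n++ }
        started && (/---\[ end / || n >= 200) { exit }
    '
}

printf '%s\n' \
    'boot ok' \
    'BUG: unable to handle kernel NULL pointer dereference' \
    'Oops: 0002 [#1] SMP' \
    'Call Trace:' \
    '---[ end trace 0000000000000000 ]---' \
    'later message' | extract_first_oops
```

The same function works on a saved vmcore-dmesg.txt or on journalctl -k output piped into it.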

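Debugging Notes

"Cannot read memory" in crash tool: the vmcore was created with -d 31 (aggressive filtering), and the page you are trying to read (typically user or page-cache memory) was filtered out. An existing vmcore cannot be refilled after the fact; lower the dump level in the kdump configuration (for example -d 1, which drops only zero pages) so that the next crash preserves more memory.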

Module symbols not loaded in crash: run mod -s module_name /path/to/module.ko to load the module's symbol table. Without this, bt shows raw addresses for module functions.

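Oops in interrupt context: if the call trace contains an <IRQ> section, or interrupt entry points such as common_interrupt or asm_sysvec_apic_timer_interrupt (do_IRQ and apic_timer_interrupt on older kernels), the oops occurred in interrupt context. There is no task to kill, so recovery is impossible and the oops is promoted to a panic. The code that was interrupted appears below the interrupt frames in the stack.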

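Decoding KASLR-randomized addresses: kernels with KASLR (Kernel Address Space Layout Randomization) randomize the kernel base address at boot. The crash tool handles this automatically when using a matching vmlinux. For manual decoding, take the offset from the "Kernel Offset:" line that the kernel prints when it panics, or compute it as the runtime address of _text (from /proc/kallsyms, read as root) minus the link-time address of _text in vmlinux (nm vmlinux | grep ' _text$'). Subtract that offset from a runtime address before looking it up in vmlinux.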
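
The manual offset calculation is a single subtraction. The sketch below is illustrative: kaslr_offset is a hypothetical helper name, both inputs are assumed to be 16-hex-digit addresses whose upper 32 bits are identical (true for x86-64 kernel text), and the sample values are made up for the example, with 0xffffffff81000000 being the usual x86-64 link-time base.

```shell
# Compute the KASLR offset: runtime _text minus link-time _text.
# Only the low 32 bits are subtracted; the upper halves must match.
kaslr_offset() (
    run=${1#0x}
    link=${2#0x}
    run_hi=${run%????????}
    link_hi=${link%????????}
    if [ "$run_hi" != "$link_hi" ]; then
        echo "addresses differ in the upper 32 bits" >&2
        exit 1
    fi
    run_lo=${run#"$run_hi"}
    link_lo=${link#"$link_hi"}
    printf '0x%x\n' "$(( 0x$run_lo - 0x$link_lo ))"
)

# Example values: runtime _text from /proc/kallsyms, link-time _text from nm vmlinux.
kaslr_offset ffffffffa5e00000 ffffffff81000000
```

The result should equal the value on the "Kernel Offset:" line of the same boot, which makes a convenient cross-check.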

Security Implications

vmcore files contain a snapshot of all kernel and process memory at crash time. This includes encryption keys, TLS session keys, user data in kernel buffers, and process address spaces. Treat vmcore files as highly sensitive. Store encrypted; transmit over encrypted channels only.
kdump must run as root. The crash kernel itself has unrestricted access to the main kernel's memory image — this is inherent to the design. Protect the kdump mechanism from unauthorized access.
Oops messages in dmesg may reveal kernel memory addresses (defeating KASLR until the next reboot) and internal kernel data structures. Setting kernel.dmesg_restrict=1 (/proc/sys/kernel/dmesg_restrict) limits dmesg to users with CAP_SYSLOG on production systems.
Kernel panics are a denial-of-service vector: an attacker who can trigger a kernel bug, or run code in the kernel, can crash the system. Defense: limit what reaches kernel code (seccomp, LSM policies, module signing).

Performance Implications

kdump/kexec: during normal operation the crashkernel memory is simply reserved. It is unavailable to the running system, but there is no runtime performance cost. The crash kernel is loaded into that region at boot and stays dormant until a panic.
Kernel panic → vmcore capture time: on a 256GB machine with makedumpfile -d 31, capturing takes 30-120 seconds before the system reboots. With -d 0 (no filtering), it could take 30+ minutes.
KASAN overhead: ~1.5-2x memory and ~2x CPU. Never for production.
KFENCE overhead: <1% CPU. Suitable for production detection of memory errors.
panic_on_oops=1: promotes all oops to panics, causing reboots on recoverable errors. This trades availability (one crash causes a reboot) for reliability (corrupt state doesn't propagate). Recommended for production.

Failure Modes and Real Incidents

AWS EC2 kernel panic from ENA driver bug (2019): An ENA (Elastic Network Adapter) driver bug in certain kernel versions caused NULL pointer dereferences under high network packet rates. The oops was in ena_io_irq_handler+0x60/0xa0 [ena]. Analysis via kdump + crash tool revealed that the ena_ring pointer was being accessed after being freed during a driver reset. AWS issued an AMI update; the fix was a missing mutex around the reset path.

Memory corruption from NIC DMA (Cloudflare post, 2021): A production server experienced periodic kernel panics with KFENCE reports showing heap corruption in the network driver's DMA buffer management. The crash tool showed the corruption pattern was 8 bytes beyond a skb_shared_info struct. Root cause: the NIC firmware was writing beyond its DMA buffer on specific packet sizes (firmware bug). Identified via kdump + KFENCE; fixed by firmware update.

RCU stall from VM spinlock (production hypervisor incident): A KVM hypervisor node showed RCU stall warnings followed by softlockup panics. Crash analysis showed a VCPU thread holding a spinlock in the KVM exit handler for >100ms while waiting for shadow page table flush. Root cause: 200 VMs all receiving a TLB shootdown simultaneously. Fix: batching TLB shootdowns across VMs.

Modern Usage

Ubuntu 22.04+: uses linux-crashdump metapackage; kdump configured out-of-box.
RHEL 9 / AlmaLinux 9: uses kexec-tools + kdump.service; systemd unit manages crash kernel loading.
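Kubernetes: kdump runs at the node level, outside any pod, because /proc/vmcore exists only inside the crash kernel. A privileged pod with host access can configure kdump on the node and, after the reboot, collect saved dumps from the host's crash directory. Tools like kdump-operator automate vmcore capture and upload to S3.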
crash 8.x (2023): added Python scripting support, improved module symbol loading, better handling of compressed vmcores.

Future Directions

Live kernel patching + panic reduction: Ksplice (Oracle), kpatch (Red Hat), and livepatch (SUSE/Canonical) allow applying security fixes without reboots, reducing the number of panics caused by known vulnerabilities.
Kernel Address Sanitizer in production (KFENCE improvement): ongoing work to increase KFENCE coverage without overhead increase; probabilistic sampling rate tuning.
Automated vmcore triage: systems like Red Hat's ABRT (Automatic Bug Reporting Tool) attempt to automatically classify kernel crashes by matching call traces against known bug patterns.

Exercises

Oops decoding: Find a kernel oops message online (Linux kernel mailing list archives have thousands). Decode each field: identify the fault type from the error code, the faulting address from CR2, the function and offset from RIP, and read the call trace. Use /proc/kallsyms from a running system to look up symbol addresses.
kdump setup: On a test VM (not production), configure kdump. Set crashkernel=128M in the bootloader. Verify the crash kernel is loaded (/sys/kernel/kexec_crash_loaded reads 1 and /sys/kernel/kexec_crash_size is non-zero). Trigger a controlled panic with echo c > /proc/sysrq-trigger. Locate the vmcore and extract the dmesg with makedumpfile --dump-dmesg.
crash tool navigation: Download a publicly available vmcore from the kernel test farm or generate one in a VM. Open it with crash. Run: bt, ps | head -20, log | tail -50, mod. Find the panicking process and decode its call trace manually using dis and /proc/kallsyms.
KFENCE testing: Compile a minimal kernel module that deliberately performs a use-after-free (allocate with kmalloc, free with kfree, then write to the freed memory). Load the module on a CONFIG_KFENCE-enabled kernel. Because KFENCE only guards a sample of allocations, repeat the allocation in a loop or lower kfence.sample_interval until the bug is caught. Observe the KFENCE report in dmesg and decode its allocation and free stacks.
Production triage simulation: Given a kernel oops message (fabricate one or use a real one from bugzilla.kernel.org), perform the full analysis: (a) identify the fault type, (b) find the crashing kernel function and line number using addr2line, (c) read the call trace and explain what the kernel was doing, (d) hypothesize the root cause based on the register values and code context.

References

Linux Kernel Documentation: Documentation/admin-guide/kdump/kdump.rst
Linux Kernel Documentation: Documentation/admin-guide/bug-hunting.rst
crash tool manual: man crash and https://github.com/crash-utility/crash
Anderson, Dave. "crash Utility: Crash Dump Analysis." Red Hat Summit 2009.
KASAN documentation: Documentation/dev-tools/kasan.rst
KFENCE documentation: Documentation/dev-tools/kfence.rst
Venkateswaran, Sreekrishnan. Essential Linux Device Drivers. Prentice Hall, 2008. Chapter 21 (Debugging).
Gregg, Brendan. "Linux Performance: Advanced Debugging." USENIX LISA 2014.