What Is a Kernel?
Technical Overview
The kernel is the central component of an operating system: a privileged software layer that sits directly above the hardware and provides a controlled, portable interface through which all other software accesses machine resources. It is not the operating system itself — the OS is the kernel plus the user-space programs, libraries, and services that make a system usable — but it is the irreplaceable core around which everything else is built.
At its most fundamental level, the kernel does two things: it manages hardware resources (CPU time, memory, I/O devices) and it provides abstractions of those resources to programs running in user space. A process does not talk directly to a physical memory chip; it talks to the kernel's virtual memory subsystem, which maps logical addresses to physical pages and enforces isolation. A process does not write bytes directly to a disk platter; it calls write(2), and the kernel's VFS layer routes that call through a filesystem driver to a block device driver to the actual storage hardware.
Prerequisites
- Basic understanding of what a CPU, RAM, and storage are
- Familiarity with the concept of a program and a process
- Awareness that software runs at different privilege levels
- Some exposure to the C programming language (kernel code is primarily C)
Core Content
Kernel Responsibilities
The Linux kernel (used throughout this archive as the reference implementation) organizes its responsibilities into several major subsystems:
Process Management
The kernel creates, schedules, and destroys processes. It maintains a struct task_struct for every process and kernel thread — on a busy production server, this list can contain thousands of entries. The scheduler (kernel/sched/core.c) decides which process runs on which CPU core at any given microsecond, balancing fairness, latency, and throughput.
Memory Management
The MM subsystem (mm/) manages physical memory (page frames tracked in struct page), virtual address spaces (described by struct mm_struct and a tree of struct vm_area_struct), page tables, the page cache, the swap subsystem, and memory allocators: the buddy allocator for pages, the slab allocators for kernel objects (historically SLAB, SLUB, and SLOB; recent kernels keep only SLUB), and kmalloc, layered on top of the slab allocator, for arbitrary kernel allocations.
Device Management
Drivers (drivers/) allow the kernel to speak the protocol of each piece of hardware. The kernel provides a unified driver model (drivers/base/) so that a USB storage device, a PCI NIC, and an I2C temperature sensor are all registered, enumerated, and power-managed through the same infrastructure.
Filesystem Support
The Virtual Filesystem Switch (VFS, fs/) provides a single set of system calls (open, read, write, stat, mmap) regardless of whether the underlying storage is ext4, XFS, Btrfs, tmpfs, NFS, or a FUSE-mounted object store. Each real filesystem registers its own struct file_operations and struct inode_operations implementing those abstract operations.
Networking
The networking stack (net/) implements the protocol suite from Ethernet frames up through TCP/IP to the socket API exposed to applications. It is one of the most complex subsystems, with tens of thousands of lines devoted to TCP alone.
Security
The Linux Security Module (LSM) framework (security/) allows security policies to be enforced at kernel hook points. SELinux and AppArmor plug in here; seccomp is a separate mechanism that filters system calls at syscall entry rather than an LSM. The kernel itself enforces the fundamental access controls: a process can only access files it has permission to access, can only signal processes whose user ID matches its own (unless it holds CAP_KILL), and cannot read another process's memory unless it passes the ptrace access checks.
The Kernel vs. the Operating System
This distinction matters because it affects how you think about problems:
| Layer | Examples |
|---|---|
| Kernel | Linux, Windows NT kernel, XNU (macOS), FreeBSD kernel |
| System libraries | glibc, musl, ntdll.dll, libSystem.dylib |
| System daemons | systemd, launchd, svchost.exe |
| Shell | bash, zsh, cmd.exe, PowerShell |
| Applications | Firefox, PostgreSQL, OpenSSH |
"Linux" in common usage refers to the entire GNU/Linux ecosystem. Strictly, "Linux" is only the kernel. The distinction matters when you debug a production issue: a glibc bug is not a kernel bug, even though glibc sits between your application and the kernel.
The Kernel as Resource Manager and Abstraction Layer
These two roles are intertwined but conceptually separate:
As a resource manager, the kernel is an accountant. It tracks who owns which physical pages, which CPU cycles are allocated to which process, which file descriptors are open. It enforces limits (cgroups, ulimits, capability checks). It arbitrates contention (the scheduler, the block I/O elevator, network queueing disciplines).
As an abstraction layer, the kernel is a translator. It makes every storage device look like files made of bytes, every network card reachable through sockets, every CPU look like a sequential instruction machine regardless of how many cores or NUMA nodes exist. This is what lets the same PostgreSQL code run unmodified on a Raspberry Pi and on a 256-core AMD EPYC server; only the compiled binary differs, because the two machines use different instruction sets.
ASCII Layered Diagram
+----------------------------------------------------------+
|                       APPLICATIONS                       |
|          (Firefox, PostgreSQL, sshd, bash, ...)          |
+----------------------------------------------------------+
                             |
                 System Calls (open, read,
                 write, mmap, socket, ...)
                             |
+----------------------------------------------------------+
|                     SYSTEM LIBRARIES                     |
|    (glibc / musl: wraps syscalls, provides C runtime)    |
+----------------------------------------------------------+
                             |
              Traps into kernel via SYSCALL /
                    SYSENTER instruction
                             |
+----------------------------------------------------------+
|                          KERNEL                          |
|  +----------------+  +----------+  +------------------+  |
|  | Process/Sched  |  | Memory   |  | VFS / Filesystem |  |
|  +----------------+  +----------+  +------------------+  |
|  +----------------+  +----------+  +------------------+  |
|  | Networking     |  | Security |  | Device Drivers   |  |
|  +----------------+  +----------+  +------------------+  |
+----------------------------------------------------------+
                             |
             Architecture-specific code (arch/)
              reads/writes hardware registers
                             |
+----------------------------------------------------------+
|                         HARDWARE                         |
|    CPU cores   Physical RAM   NIC   Disk   USB   ...     |
+----------------------------------------------------------+
Monolithic vs. Microkernel: A Preview
Linux is a monolithic kernel: all the subsystems above live in a single address space running in ring 0 (the most privileged CPU mode on x86). A function call from the scheduler to the memory allocator is just a function call, with no message passing and no context switch.
In a microkernel (Mach, L4, seL4, QNX), only the absolute minimum runs in ring 0: address space management, inter-process communication, and thread scheduling. File systems, device drivers, and networking run as user-space servers. The overhead of IPC between components is higher, but a crashing driver cannot corrupt the kernel.
In practice, most production operating systems are hybrid. macOS uses XNU, which is a Mach microkernel with a large BSD kernel component merged in. Windows NT has a microkernel-style object manager but with most executive services in ring 0 for performance.
Kernel Size and Complexity Evolution
The growth of the Linux kernel is a concrete measure of how much hardware and feature complexity the kernel has had to absorb:
| Year | Version | Lines of Code | Notable additions |
|---|---|---|---|
| 1991 | 0.01 | ~10,000 | x86 only, no networking, no modules |
| 1994 | 1.0 | ~176,000 | Networking, module support |
| 1999 | 2.2 | ~1.8M | SMP support, many new drivers |
| 2003 | 2.6 | ~5.9M | NPTL threads, device model rewrite |
| 2011 | 3.0 | ~14.6M | Btrfs, KVM, cgroups |
| 2015 | 4.0 | ~19.5M | Live patching, eBPF |
| 2020 | 5.10 | ~27.8M | io_uring, BPF CO-RE |
| 2023 | 6.6 | ~32M+ | EEVDF scheduler, Rust infrastructure |
This growth is driven overwhelmingly by drivers. The core kernel (scheduler, MM, VFS, networking) has grown proportionally much less. Every new GPU generation, every new WiFi chipset, every new storage protocol adds tens of thousands of lines.
The Kernel/Userspace Contract
The most important guarantee Linux makes: the kernel ABI toward user space is kept stable. This is Linus Torvalds' well-known "we do not break user space" rule: the system call interface, the documented behavior of /proc and /sys, and the behavior of signals do not change in ways that would break existing binaries, and a change that does break them is treated as a regression to be fixed.
This is why a statically linked x86-64 binary compiled in 2003 still runs on a 6.6 kernel. The contract lives at the system call boundary, which the kernel describes to user space through its exported UAPI headers (installed under /usr/include/linux and /usr/include/asm).
The kernel makes no such promise to kernel modules or between internal subsystems, a position explained in Documentation/process/stable-api-nonsense.rst. Internal APIs change between versions, which is why out-of-tree drivers frequently break on kernel updates.
Historical Context
The concept of a privileged operating nucleus dates to the late 1950s. The IBM 709 and later the Compatible Time-Sharing System (CTSS, 1961) at MIT demonstrated that a central supervisor could multiplex hardware among users. The Multics project (1964–1969) formalized the idea of hierarchical rings of protection and an integrated file system. Unix (1969, Bell Labs) distilled Multics' ideas into something small enough to be rewritten in C in 1973, which later made it one of the first kernels to be ported across machine architectures.
The term "kernel" became standard with BSD Unix in the late 1970s. Dijkstra's THE system (1968) introduced a strictly layered operating system structure, and Brinch Hansen's nucleus for the RC 4000 (1969) anticipated the microkernel concept. The monolithic vs. microkernel debate erupted publicly in 1992 in the Usenet comp.os.minix thread between Andrew Tanenbaum and Linus Torvalds, a debate whose practical resolution is the hybrid architectures used in production today.
Production Examples
Google's use of the Linux kernel: Google runs a modified Linux kernel across its entire fleet. Their kernel team maintains patches for features like cgroup v2 improvements, TCP modifications (BBR congestion control, which they contributed upstream), and custom scheduler tuning. The kernel is a first-class production concern.
Android kernel fragmentation: Android uses a Linux kernel but OEMs maintain their own device-specific forks. The kernel on a Samsung Galaxy S23 may be based on Linux 5.15 LTS with hundreds of Samsung-specific patches. The Generic Kernel Image (GKI) project at Google attempts to standardize the kernel core while moving OEM code into loadable modules, a practical application of the kernel/driver boundary.
AWS Nitro: Amazon's Nitro system moves device emulation (networking, EBS) out of the hypervisor and into dedicated hardware controllers. From the guest kernel's perspective, it talks to ordinary-looking PCIe devices through standard drivers: NVMe for EBS volumes and the Elastic Network Adapter (ENA) for networking. The host side runs a lightweight KVM-based hypervisor. The kernel abstraction is what allows this hardware/software boundary to be moved without the guest noticing.
Debugging Notes
When a kernel bug manifests, the primary artifact is the oops or panic message in the kernel log. Learning to read one is essential:
BUG: kernel NULL pointer dereference, address: 0000000000000010
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:some_driver_function+0x48/0x120
- RIP (instruction pointer) tells you exactly where in kernel code the fault occurred.
- The function name + offset allows addr2line or gdb vmlinux to find the source line.
- CONFIG_KALLSYMS=y is required for symbol resolution in production kernels.
Key debugging tools: dmesg, /proc/kmsg, ftrace (/sys/kernel/debug/tracing/), perf, kprobes, eBPF.
Security Implications
The kernel is the ultimate trust boundary. A process running in user space is constrained. Code running in ring 0 (kernel space) is not. Therefore:
- Every kernel vulnerability is a potential full system compromise.
- Privilege escalation attacks (CVE-2016-5195 "Dirty COW", CVE-2022-0847 "Dirty Pipe") exploit kernel bugs to gain ring 0 execution or write to files the attacker shouldn't be able to write.
- Defense layers: SMEP (prevents kernel from executing user-space pages), SMAP (prevents kernel from accessing user-space data without explicit stac/clac), KASLR (randomizes kernel load address), CFI (Control Flow Integrity), and seccomp (limits which syscalls a process may invoke).
Kernel attack surface grows with kernel size: more code in ring 0 means more places for an exploitable bug. This is a fundamental argument for microkernels in high-security environments (seL4 has a formally verified kernel of roughly 9,000 lines).
Performance Implications
The kernel is in the critical path of almost every I/O operation. Performance-critical systems spend significant effort reducing kernel involvement:
- DPDK (Data Plane Development Kit): moves NIC polling from the kernel into user space, eliminating interrupt overhead and the kernel networking stack for packet processing. Used by telcos and cloud providers for line-rate packet forwarding.
- io_uring (Linux 5.1+, io_uring_setup(2)): submits I/O operations in batches via a shared ring buffer, dramatically reducing syscall overhead for high-IOPS workloads.
- vDSO (virtual Dynamic Shared Object): maps certain kernel data (current time, etc.) into user space so that clock_gettime(CLOCK_REALTIME) doesn't require a syscall trap at all.
- CPU time in kernel: perf stat reports task-clock along with the split between user and sys time. As a rough rule of thumb, a well-tuned database server should spend only a small share of CPU time (on the order of 5% or less) in kernel mode during sequential reads.
Failure Modes and Real Incidents
The 2010 Linux Kernel OOM Killer Incident (various): Under memory pressure, the kernel's Out-Of-Memory killer selects and kills a process. Misconfigured overcommit settings (/proc/sys/vm/overcommit_memory) can cause the OOM killer to fire unexpectedly, killing critical daemons. This has caused database process terminations at scale.
CrowdStrike / Windows BSOD (July 2024): A faulty content update for the CrowdStrike Falcon sensor, which runs as a ring-0 kernel driver on Windows, caused an out-of-bounds memory read in the driver during boot, triggering a BSOD (the Windows equivalent of a kernel panic). An estimated 8.5 million machines were affected, many of them stuck in a boot loop until manually repaired. This is a direct consequence of driver code running in ring 0: one bad memory access crashes the entire system.
Dirty COW (CVE-2016-5195): A race condition in the kernel's copy-on-write memory subsystem allowed an unprivileged user to write to read-only memory-mapped files, including /etc/passwd. Exploited in the wild before a patch was available. Affected every Linux kernel from 2.6.22 through 4.8.2.
Modern Usage
Kernel development in 2024 is dominated by:
- eBPF: programs that run in a kernel-verified virtual machine, allowing safe kernel extension without writing kernel modules. Used for observability (Falco), networking (Cilium, XDP), and security (BPF LSM). Note that seccomp filters use classic BPF rather than eBPF.
- Rust in the kernel: Linux 6.1 merged the first Rust infrastructure (rust/). Rust modules can be written for subsystems like device drivers, reducing the class of memory safety bugs endemic to C.
- io_uring: reshaping how high-performance I/O is written, enabling fully async, batched I/O with minimal syscall overhead.
- CXL (Compute Express Link): new hardware for memory pooling between CPUs and accelerators is forcing kernel memory management to evolve significantly.
Future Directions
- Rust-first kernel components: long-term, safety-critical subsystems may be rewritten in Rust. The rust-for-linux project has upstream support. First real drivers (NVMe, network) are being submitted.
- eBPF as a kernel extension mechanism: BPF programs can now implement entire schedulers (sched_ext, merged in 6.12), TCP congestion algorithms, and filesystem operations. The kernel may evolve toward a smaller, more stable core with policy implemented in BPF.
- Confidential computing: Intel TDX and AMD SEV-SNP require new kernel abstractions for encrypted memory and attestation, with significant MM subsystem changes.
- Exokernel revival: DPDK, SPDK, and RDMA applications are effectively building exokernel-style systems where applications manage hardware directly. This trend will continue.
Exercises
- Run uname -r on a Linux machine and look up the source of that kernel version at https://elixir.bootlin.com. Navigate to init/main.c and find start_kernel(). List the first 10 function calls made in start_kernel() and briefly describe what each does.
- Use strace -c ls /tmp to count the system calls made by ls. Which syscall is called most frequently? What does that tell you about what ls does?
- Read /proc/meminfo and /proc/slabinfo on a running Linux system. Identify which kernel slab cache is consuming the most memory. Research what objects that cache holds.
- Find the definitions of struct task_struct and struct mm_struct in the Linux kernel source (include/linux/sched.h and include/linux/mm_types.h). Count the number of fields in each. What does the size of these structures tell you about kernel complexity?
- Write a minimal C program that makes a raw system call using the syscall(2) wrapper (e.g., syscall(SYS_getpid)). Compile it and run it under strace. Confirm the syscall number used matches the kernel's arch/x86/entry/syscalls/syscall_64.tbl.
References
- Linus Torvalds, comp.os.minix post announcing Linux, August 25, 1991
- Andrew Tanenbaum, Modern Operating Systems, 4th ed., Pearson, 2014
- Robert Love, Linux Kernel Development, 3rd ed., Addison-Wesley, 2010
- Linux kernel source: init/main.c, include/linux/sched.h, mm/, fs/, net/, security/
- Kernel documentation: Documentation/process/stable-api-nonsense.rst
- Linus Torvalds on ABI stability: https://lkml.org/lkml/2012/12/23/75
- LWN.net — the authoritative source for Linux kernel development coverage: https://lwn.net
- Linux kernel cross-reference: https://elixir.bootlin.com