Kernel Initialization
Technical Overview
The Linux kernel's boot sequence is a precisely ordered, carefully orchestrated process that transforms raw hardware — fresh out of firmware's hands — into a fully functional, multi-process operating system. Every step must happen in the correct order: you cannot initialize the page allocator before you know how much physical memory exists; you cannot start the scheduler before you have a per-CPU run queue; you cannot run init before the filesystem is mounted. The initialization code lives primarily in init/main.c and the various arch/ directories, and it represents some of the most carefully sequenced code in the kernel.
The entire boot sequence, from the first line of start_kernel() to the execution of PID 1 (/sbin/init or /lib/systemd/systemd), takes roughly 50–500ms on modern hardware. Understanding it illuminates why the kernel is organized the way it is, what "built-in vs. module" means, and how early boot issues manifest.
Prerequisites
- 01-what-is-a-kernel.md: what the kernel does
- 02-user-space-vs-kernel-space.md: why the early boot environment is special
- 04-hardware-abstraction.md: ACPI, device tree, hardware discovery
- Basic understanding of the boot process (UEFI/BIOS → bootloader → kernel)
Core Content
The Entry Point: start_kernel()
The architecture-independent entry point is start_kernel() in init/main.c. On x86-64 it is reached after the architecture-specific startup code (arch/x86/boot/compressed/head_64.S → arch/x86/kernel/head_64.S → x86_64_start_kernel() in arch/x86/kernel/head64.c) has:
- Switched to 64-bit long mode (x86-64)
- Set up a minimal page table (identity map + kernel mapping)
- Set up a temporary stack
- Decompressed the kernel (if using bzImage)
- Called start_kernel()
At the entry to start_kernel(), there is exactly one CPU running (the Bootstrap Processor, BSP), no scheduler, no memory allocator (only fixed allocations from the bootloader's memory map), and no interrupt handling.
Initialization Sequence
start_kernel() in init/main.c
│
├─ set_task_stack_end_magic(&init_task) // Mark bottom of init task stack
├─ smp_setup_processor_id() // Record which CPU we're on (BSP=0)
├─ debug_objects_early_init() // Early debug object tracking
│
├─ cgroup_init_early() // Very early cgroup setup
├─ local_irq_disable() // Disable interrupts (we have no IDT yet)
│
├─ boot_cpu_init() // Mark BSP as online/active
├─ page_address_init() // Init high-memory page address table
│
├─ pr_notice(linux_banner) // Print "Linux version 6.x..." to console
│
├─ early_security_init() // LSM early hooks
│
├─ setup_arch(&command_line) // ARCHITECTURE-SPECIFIC INIT (see below)
│
├─ setup_boot_config() // Parse bootconfig data (later exposed as /proc/bootconfig)
├─ setup_command_line(command_line) // Save kernel command line
├─ setup_nr_cpu_ids() // Count CPUs
├─ setup_per_cpu_areas() // Allocate per-CPU data sections
├─ smp_prepare_boot_cpu() // Prepare BSP for SMP
├─ boot_cpu_hotplug_init()
│
├─ build_all_zonelists() // Set up NUMA zone lists
├─ page_alloc_init() // Register page alloc CPU hotplug callbacks
├─ pr_notice("Kernel command line: %s") // Log command line
├─ jump_label_init() // Static key optimization init
├─ parse_early_param() // Parse early_param() parameters
├─ parse_args() // Parse remaining command line args
│
├─ mm_init() // Initialize memory management (see below)
│
├─ ftrace_init() // Function tracer init
├─ early_trace_init() // Early tracing infrastructure
│
├─ sched_init() // Initialize scheduler
├─ preempt_disable() // Disable preemption
│
├─ idr_init_cache() // IDR allocator cache
├─ rcu_init() // Initialize RCU subsystem
│
├─ trace_init() // Tracing infrastructure
│
├─ radix_tree_init() // Radix tree allocator
├─ maple_tree_init() // Maple tree allocator
│
├─ init_IRQ() // Set up IDT, APIC, PIC
├─ tick_init() // Tick/timer subsystem
├─ rcu_init_nohz() // NO_HZ/tickless RCU setup
├─ init_timers() // Timer wheel init
├─ hrtimers_init() // High-resolution timer init
├─ softirq_init() // Software interrupt init
├─ timekeeping_init() // Timekeeping (clock source, etc.)
│
├─ time_init() // Architecture-specific time init
├─ perf_event_init() // perf event infrastructure
│
├─ profile_init() // Kernel profiling
│
├─ call_function_init() // IPI call function setup
│
├─ local_irq_enable() // INTERRUPTS NOW ENABLED
│
├─ kmem_cache_init_late() // Finalize slab allocator
├─ console_init() // Initialize console output
│
├─ lockdep_init() // Lock dependency validator
├─ locking_selftest() // Lock validator self-test
│
├─ mem_encrypt_init() // AMD SME/SEV memory encryption
├─ kmemleak_init() // Kernel memory leak detector
│
├─ setup_per_cpu_pageset() // Per-CPU page set allocator
│
├─ numa_policy_init() // NUMA memory policy
├─ late_time_init() // Late architecture time init
├─ calibrate_delay() // Compute "bogomips" (jiffy calibration)
│
├─ pid_idr_init() // PID allocator
├─ anon_vma_init() // Anonymous VMA slab
├─ thread_stack_cache_init()
├─ cred_init() // Credentials slab
├─ fork_init() // Process fork infrastructure
├─ proc_caches_init() // Slab caches for proc/fs
│
├─ uts_ns_init() // UTS namespace
├─ key_init() // Kernel keyring
├─ security_init() // LSM full initialization
├─ dbg_late_init() // Late debug init
├─ net_ns_init() // Initial network namespace
│
├─ vfs_caches_init() // VFS: dentry, inode caches
├─ pagecache_init() // Page cache
│
├─ signals_init() // Signal queue slab
├─ seq_file_init() // seq_file infrastructure
├─ proc_root_init() // Mount /proc
│
├─ nsfs_init() // Namespace filesystem
├─ cpuset_init() // cpuset cgroup controller
│
├─ cgroup_init() // Full cgroup initialization
│
├─ taskstats_init_early()
├─ delayacct_init()
│
├─ check_bugs() // CPU bug workarounds
│
├─ acpi_subsystem_init() // ACPI full init
├─ arch_post_acpi_subsys_init()
├─ kcsan_init() // Kernel Concurrency Sanitizer init
│
├─ arch_call_rest_init()
│ └─ rest_init() // THE KEY TRANSITION (see below)
setup_arch(): Architecture-Specific Initialization
setup_arch() is the largest and most hardware-specific function called from start_kernel(). On x86-64 (arch/x86/kernel/setup.c), it:
- Parses the boot parameters (struct boot_params from the bootloader)
- Calls e820__memory_setup(): reads the E820 memory map from BIOS/UEFI (which physical memory ranges are usable RAM, reserved, ACPI, etc.)
- Initializes the early ACPI tables (acpi_boot_init())
- Sets up the initial page tables (init_mem_mapping())
- Initializes the APIC (apic_intr_mode_init())
- Sets up the GDT and TSS
- Calls x86_init.oem.arch_setup() for OEM-specific initialization
mm_init(): Memory Management Initialization
static void __init mm_init(void)
{
    page_ext_init_flatmem();   // page extension tables (CONFIG_PAGE_EXTENSION)
    init_debug_pagealloc();    // debug page allocation tracking
    report_meminit();          // log memory init policy
    kmem_cache_init();         // Initialize the SLAB/SLUB allocator
    pgtable_init();            // Page table type init
    vmalloc_init();            // vmalloc area initialization
    ioremap_huge_init();       // huge page ioremap support
    init_espfix_bsp();         // x86: ESPFIX (16-bit stack fix)
    pti_init();                // Page Table Isolation (Meltdown mitigation)
}
After mm_init(), kmalloc() is available. Before mm_init(), the kernel uses memblock — a simple first-fit allocator that reserves ranges from the boot memory map.
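To make the first-fit idea concrete, here is a minimal user-space sketch of a memblock-style allocator. It is a toy under stated assumptions: every `toy_` name is hypothetical, the region table is fixed-size, and alignment padding is simply discarded. The real memblock API (memblock_add(), memblock_alloc() and friends) is considerably richer.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_REGIONS 8

/* One contiguous range of free "physical" memory: [base, base + size). */
struct toy_region {
    uint64_t base;
    uint64_t size;
};

static struct toy_region toy_free[TOY_MAX_REGIONS];
static size_t toy_nr_free;

/* Register a usable RAM range, as the firmware memory map would describe it. */
static int toy_memblock_add(uint64_t base, uint64_t size)
{
    if (toy_nr_free == TOY_MAX_REGIONS || size == 0)
        return -1;
    toy_free[toy_nr_free].base = base;
    toy_free[toy_nr_free].size = size;
    toy_nr_free++;
    return 0;
}

/*
 * First-fit allocation: carve `size` bytes, aligned to `align` (a power of
 * two), out of the first region that can hold them. Returns 0 on failure.
 * Padding skipped for alignment is lost, which keeps the sketch short.
 */
static uint64_t toy_memblock_alloc(uint64_t size, uint64_t align)
{
    for (size_t i = 0; i < toy_nr_free; i++) {
        struct toy_region *r = &toy_free[i];
        uint64_t start = (r->base + align - 1) & ~(align - 1);
        uint64_t pad = start - r->base;

        if (size == 0 || pad > r->size || r->size - pad < size)
            continue;               /* does not fit here, try the next region */
        r->base = start + size;     /* shrink the region from the front */
        r->size -= pad + size;
        return start;
    }
    return 0;
}
```

There is no free list, no per-size cache, and no locking. That simplicity is what makes this style of allocator usable before any other kernel infrastructure exists.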
sched_init(): Scheduler Initialization
sched_init() (kernel/sched/core.c):
1. Allocates per-CPU struct rq (run queues)
2. Initializes each rq: CFS rq, RT rq, DL rq
3. Marks the current task, init_task (the idle/swapper task, PID 0), as the boot CPU's idle task
4. Sets up the load balancer
After sched_init(), the scheduler can track the current task but cannot yet schedule between tasks (no timer interrupt yet).
rest_init(): The Key Transition
rest_init() creates the first two kernel threads and then converts the current CPU context (which has been running as PID 0 / idle) into the idle task:
noinline void __ref rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();

    // Create kernel_init thread (will become PID 1 / init)
    pid = kernel_thread(kernel_init, NULL, CLONE_FS);

    // Create kthreadd (PID 2: kernel thread daemon)
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);

    // Allow kernel_init and kthreadd to run:
    system_state = SYSTEM_SCHEDULING;
    complete(&kthreadd_done);

    // The current context (init_task, PID 0) becomes the idle task
    schedule_preempt_disabled(); // First scheduling decision

    // Now running as the idle task. Loop forever:
    cpu_startup_entry(CPUHP_ONLINE); // → arch_cpu_idle() (HLT/WFI) in a loop
}
After rest_init(), the system has three tasks:
- PID 0 (swapper/0): the idle task. Runs arch_cpu_idle() (HLT on x86) when no other task is runnable. Never actually appears in ps output (it's the "no task" state).
- PID 1 (kernel_init thread, about to become init/systemd)
- PID 2 (kthreadd): the kernel thread daemon
PID 0, PID 1, PID 2
PID 0 — idle/swapper
init_task is statically defined (not created by fork()):
// init/init_task.c (abridged)
struct task_struct init_task = { /* statically initialized fields */ };
This is the only task in the system not created by fork(). The idle task is per-CPU: during SMP bring-up, fork_idle() creates one idle task for each secondary CPU. Each of these also has PID 0 and shows up as swapper/N (where N is the CPU number) in traces and oops messages, but none of them is visible in /proc.
PID 1 — kernel_init → /sbin/init
kernel_init() (init/main.c):
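1. Calls kernel_init_freeable(), which does the bulk of late initialization:
- Waits for kthreadd_done
- Brings up the secondary CPUs (smp_init())
- do_basic_setup() → do_initcalls() (runs all built-in initcalls; the initramfs is unpacked by one of them)
- Opens /dev/console as stdin/stdout/stderr
- Mounts the real root filesystem (prepare_namespace()) if the initramfs does not provide /init
2. Frees init memory (free_initmem()) and sets system_state to SYSTEM_RUNNING
3. Calls run_init_process() on /init from the initramfs if present; otherwise on the program named by init=, if given; otherwise on the first of /sbin/init, /etc/init, /bin/init and /bin/sh that can be executed
4. run_init_process() calls kernel_execve(): the kernel thread transforms into a user-space process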
After this execve, PID 1 is running in user space, executing systemd/SysV init/OpenRC/etc.
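The choice of which binary becomes PID 1 can be sketched as a small decision function. This is a simplified model of the logic in kernel_init() as I understand it, not the kernel code itself: `toy_pick_init`, `toy_can_exec_t` and the two example predicates are hypothetical names, and a NULL return stands in for the kernel panicking.

```c
#include <stddef.h>
#include <string.h>

/* Caller-supplied check; a stand-in for "can this path be exec'd?". */
typedef int (*toy_can_exec_t)(const char *path);

/*
 * Pick the program that becomes PID 1.
 *  - rdinit:     path inside the initramfs ("/init" by default), or NULL
 *                if no initramfs provided one
 *  - init_param: value of init= from the command line, or NULL
 * Returns the chosen path, or NULL where the kernel would panic.
 */
static const char *toy_pick_init(const char *rdinit, const char *init_param,
                                 toy_can_exec_t can_exec)
{
    static const char *const fallbacks[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };

    if (rdinit && can_exec(rdinit))
        return rdinit;
    if (init_param)     /* explicit init=: no fallback if it fails */
        return can_exec(init_param) ? init_param : NULL;
    for (size_t i = 0; i < sizeof(fallbacks) / sizeof(fallbacks[0]); i++)
        if (can_exec(fallbacks[i]))
            return fallbacks[i];
    return NULL;
}

/* Example predicates for trying the function out without touching the disk. */
static int toy_only_bin_sh(const char *path) { return strcmp(path, "/bin/sh") == 0; }
static int toy_everything(const char *path)  { (void)path; return 1; }
```

With `toy_only_bin_sh` as the predicate and no init= value, the function falls through the default list to /bin/sh. This is the same fallback that makes a system with a broken /sbin/init drop to a shell.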
PID 2 — kthreadd
kthreadd() (kernel/kthread.c) is the kernel thread daemon. When other kernel code calls kthread_create(), it adds a request to kthread_create_list. kthreadd wakes up, processes the list, and creates new kernel threads by calling copy_process(). This ensures all kernel threads are proper children of PID 2 in the process tree.
Process tree (early boot):
PID 0: swapper/0 (idle, per-CPU)
├── PID 1: init (kernel_init → execve /sbin/init)
└── PID 2: kthreadd (kernel thread daemon)
├── PID 3: kworker/0:0
├── PID 4: kworker/0:0H
├── PID 5: kworker/u8:0
├── PID 6: mm_percpu_wq
├── PID 7: ksoftirqd/0
├── PID 8: migration/0
├── PID 9: rcu_gp
├── PID 10: rcu_par_gp
├── PID 11: slub_flushwq
├── PID 12: netns
...
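The request-queue pattern behind kthreadd can be illustrated with a single-threaded user-space simulation. All `toy_` names are hypothetical; the struct is only loosely modelled on the kernel's kthread_create_info, and there is no real concurrency, wakeup, or thread creation here. The point is only that the caller queues a request and the daemon performs the creation, so every new thread records PID 2 as its parent.

```c
#include <stddef.h>

#define TOY_KTHREADD_PID 2
#define TOY_MAX_REQS 16

/* A pending creation request. */
struct toy_create_req {
    const char *name;   /* thread name requested by the caller */
    int pid;            /* filled in by the daemon; 0 while pending */
    int ppid;           /* parent PID recorded at creation time */
};

static struct toy_create_req *toy_create_list[TOY_MAX_REQS];
static size_t toy_pending;
static int toy_next_pid = 3;    /* PIDs 1 and 2 are already taken */

/* Caller side: queue the request instead of creating the thread directly. */
static int toy_kthread_create(struct toy_create_req *req, const char *name)
{
    if (toy_pending == TOY_MAX_REQS)
        return -1;
    req->name = name;
    req->pid = 0;
    req->ppid = 0;
    toy_create_list[toy_pending++] = req;
    return 0;
}

/* Daemon side: drain the queue in FIFO order; every thread gets PPID 2. */
static size_t toy_kthreadd_drain(void)
{
    size_t created = 0;

    for (size_t i = 0; i < toy_pending; i++) {
        struct toy_create_req *req = toy_create_list[i];

        req->pid = toy_next_pid++;
        req->ppid = TOY_KTHREADD_PID;
        created++;
    }
    toy_pending = 0;
    return created;
}
```

Because creation always happens in the daemon's context, a new thread never inherits state (open files, signal masks, credentials) from whichever task happened to ask for it.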
Kernel Command Line Parsing
The bootloader passes a command line string to the kernel (e.g., root=/dev/sda1 ro quiet splash). The kernel parses it in two passes:
Early parsing (parse_early_param(), before mm_init()): Handles parameters marked with early_param("name", handler_fn). Examples: earlycon=, earlyprintk=, memmap=, initrd=.
Late parsing (parse_args(), after setup_arch()): Handles parameters marked with __setup("name=", handler_fn). Examples: root=, ro, rw, init=, panic=, nohz=, isolcpus=.
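Module parameters (module_param() in modules, core_param() in core kernel code): for built-in code they are given on the command line as module.param=value and handled by the same parse_args() pass; for loadable modules they are applied when the module is loaded.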
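The two-pass scheme can be sketched in user space with one handler table and an `early` flag per entry, which is roughly how the kernel's own parameter table distinguishes early_param() from __setup() entries. Everything prefixed `toy_` is hypothetical, the three registered parameters are only examples, and the tokenizer ignores quoting, which the real parser handles.

```c
#include <stdio.h>
#include <string.h>

/* Values the handlers fill in; stand-ins for kernel globals. */
static char toy_earlycon[64];
static char toy_root[64];
static int toy_read_only;

struct toy_param {
    const char *name;                 /* e.g. "root=" or "ro" */
    int early;                        /* 1: handled in the early pass */
    void (*handler)(const char *val); /* val is the text after name */
};

static void set_earlycon(const char *v) { snprintf(toy_earlycon, sizeof(toy_earlycon), "%s", v); }
static void set_root(const char *v)     { snprintf(toy_root, sizeof(toy_root), "%s", v); }
static void set_ro(const char *v)       { (void)v; toy_read_only = 1; }

static const struct toy_param toy_params[] = {
    { "earlycon=", 1, set_earlycon },
    { "root=",     0, set_root },
    { "ro",        0, set_ro },
};

/* Run one pass over the command line, calling only handlers of that pass. */
static int toy_parse_pass(const char *cmdline, int early)
{
    char buf[256];
    int handled = 0;

    snprintf(buf, sizeof(buf), "%s", cmdline);  /* strtok modifies its input */
    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
        for (size_t i = 0; i < sizeof(toy_params) / sizeof(toy_params[0]); i++) {
            const struct toy_param *p = &toy_params[i];
            size_t n = strlen(p->name);
            int is_flag = p->name[n - 1] != '=';
            int match;

            if (p->early != early)
                continue;
            /* "name=" matches by prefix; a bare flag must match exactly. */
            match = is_flag ? strcmp(tok, p->name) == 0
                            : strncmp(tok, p->name, n) == 0;
            if (!match)
                continue;
            p->handler(is_flag ? "" : tok + n);
            handled++;
        }
    }
    return handled;
}
```

Calling `toy_parse_pass(cmdline, 1)` first and `toy_parse_pass(cmdline, 0)` later mirrors the ordering in start_kernel(): the early pass can configure a console before the allocator exists, while the later pass may rely on a more complete environment.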
__init and __initdata Sections
Functions and data marked __init and __initdata are placed in special ELF sections (.init.text, .init.data):
static int __init my_subsystem_init(void)
{
    // This code is only run once, during boot
    return 0;
}
After kernel_init_freeable() completes, free_initmem() is called. This unmaps and frees the .init.text and .init.data sections — typically 1–4MB of kernel memory. You'll see dmesg output like:
Freeing unused kernel image (initmem) memory: 2912K
This is why an __init function must never be called after boot: its code has been freed. The build catches most such mistakes, because modpost reports a section mismatch when non-init code references an __init symbol.
__initdata: Variables only needed during init (e.g., boot-time string parsing buffers) are similarly freed.
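The section-placement trick can be tried in user space. The sketch below assumes GCC or Clang on an ELF platform such as Linux, where the linker defines `__start_<name>` and `__stop_<name>` symbols for any section whose name is a valid C identifier. The `toy_init` macro and the other `toy_` names are hypothetical, and nothing is actually freed here; the code only shows how the boundaries of an init section can be found.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical analogue of the kernel's __init: place code in a named section. */
#define toy_init __attribute__((section("toy_init_text"), noinline, used))

/* Provided by GNU ld and lld for sections named like C identifiers. */
extern const char __start_toy_init_text[];
extern const char __stop_toy_init_text[];

static int toy_boot_done;

/* Boot-only code: in the kernel, its memory would be released after boot. */
static toy_init int toy_subsystem_init(void)
{
    toy_boot_done = 1;
    return 0;
}

/* Ordinary code, placed in the normal text section. */
static int toy_runtime_helper(void)
{
    return toy_boot_done;
}

/* Size of the range that a free_initmem()-style pass would reclaim. */
static size_t toy_init_text_size(void)
{
    return (size_t)(__stop_toy_init_text - __start_toy_init_text);
}

/* True if fn lives inside the init section, so unsafe to call after boot. */
static int toy_in_init_text(int (*fn)(void))
{
    uintptr_t p = (uintptr_t)fn;

    return p >= (uintptr_t)__start_toy_init_text &&
           p < (uintptr_t)__stop_toy_init_text;
}
```

The kernel's linker script plays the role of these automatic symbols: it gathers every .init.* section between explicit begin and end markers, and that range is what gets handed back to the page allocator.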
The initcall Mechanism
do_initcalls() (init/main.c) is called during kernel_init_freeable() and is how all built-in subsystems and drivers register themselves. The mechanism:
// A built-in driver registers its init function:
static int __init e1000_init_module(void)
{
    return pci_register_driver(&e1000_driver);
}
module_init(e1000_init_module); // expands to: device_initcall(e1000_init_module)
module_init(), when compiled into the kernel (not as a module), expands to device_initcall(fn), which in turn expands to __define_initcall(fn, 6) (level 6). This places a reference to e1000_init_module in the .initcall6.init ELF section.
do_initcalls() iterates through levels 0–7 in order, calling every function in each level. Early initcalls are the exception: they run before SMP bring-up, from do_pre_smp_initcalls().
| Level | Macro | Used for |
|---|---|---|
| early | early_initcall() | Must run before anything else |
| 0 | pure_initcall() | Pure/platform init |
| 1 | core_initcall() | Core kernel subsystems |
| 2 | postcore_initcall() | Post-core init |
| 3 | arch_initcall() | Architecture-specific init |
| 4 | subsys_initcall() | Bus subsystems (PCI, USB...) |
| 5 | fs_initcall() | Filesystems |
| 6 | device_initcall() / module_init() | Drivers |
| 7 | late_initcall() | Late initialization |
Within each level, the order depends on the link order of the object files.
When compiled as a module (CONFIG_E1000=m): module_init(e1000_init_module) registers the function to be called when the module is loaded by modprobe. The kernel's do_initcalls() never calls it.
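The level-then-link-order rule can be shown with a small simulation. This sketch replaces the linker-section machinery with an explicit table, so it illustrates the ordering only, not how the kernel actually collects the pointers. The four init functions and all `toy_` names are made up for the example.

```c
#include <stddef.h>

typedef int (*toy_initcall_t)(void);

struct toy_initcall {
    int level;              /* 0..7, as in the table above */
    toy_initcall_t fn;
};

/* Records the order in which the calls ran, one letter per call. */
static char toy_trace[16];
static size_t toy_trace_len;

static int toy_note(char c)
{
    if (toy_trace_len < sizeof(toy_trace) - 1)
        toy_trace[toy_trace_len++] = c;
    return 0;
}

static int pci_bus_init(void)     { return toy_note('P'); }  /* level 4 */
static int ext4_fs_init(void)     { return toy_note('F'); }  /* level 5 */
static int nic_driver_init(void)  { return toy_note('N'); }  /* level 6 */
static int disk_driver_init(void) { return toy_note('D'); }  /* level 6 */

/* Array position stands in for link order; it is deliberately not sorted. */
static const struct toy_initcall toy_initcalls[] = {
    { 6, nic_driver_init },
    { 4, pci_bus_init },
    { 6, disk_driver_init },
    { 5, ext4_fs_init },
};

/* Run every call level by level; within a level, keep table order. */
static int toy_do_initcalls(void)
{
    const size_t n = sizeof(toy_initcalls) / sizeof(toy_initcalls[0]);
    int failures = 0;

    for (int level = 0; level <= 7; level++)
        for (size_t i = 0; i < n; i++)
            if (toy_initcalls[i].level == level &&
                toy_initcalls[i].fn() != 0)
                failures++;
    return failures;
}
```

After `toy_do_initcalls()` the trace reads "PFND": the bus comes first, then the filesystem, then the two drivers in their table order. This is why a driver can assume its bus is registered, but cannot assume anything about another driver at the same level.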
Boot Sequence Timeline
UEFI/BIOS firmware
│ 0-2 seconds: POST, hardware init, boot device selection
│
▼
Bootloader (GRUB2 / systemd-boot)
│ ~100ms: loads kernel image + initrd from disk
│
▼
arch/x86/boot/compressed/head_64.S
│ switches to 64-bit mode
│ sets up identity mapping
│ decompresses the kernel (bzImage)
│
▼
arch/x86/kernel/head_64.S
│ final page table setup
│ clears BSS
│
▼
start_kernel() ← "T+0" for kernel timing purposes
│ T+0ms: setup_arch() - memory map, ACPI, page tables
│ T+5ms: mm_init() - slab allocator ready
│ T+6ms: sched_init() - scheduler ready
│ T+7ms: init_IRQ() - interrupts configured
│ T+8ms: local_irq_enable() - INTERRUPTS ENABLED
│ T+10ms: console_init() - kernel messages visible on console
│ T+15ms: rest_init() → PID 1 and PID 2 created
│
▼
kernel_init (PID 1)
│ T+20ms: do_initcalls() - all built-in drivers probe
│ T+100ms: initramfs unpacked
│ T+120ms: /sbin/init exec'd
│
▼
systemd (PID 1, user space)
│ T+200ms-3s: unit files processed, services started
▼
Login prompt / service ready
Historical Context
The start_kernel() function has existed since essentially Linux 0.01 (1991), where it was much shorter — perhaps 50 lines. By Linux 6.6, start_kernel() is ~180 lines and calls into dozens of subsystems.
The initcall mechanism was introduced in Linux 2.3.x (1999) to replace the ad-hoc #ifdef-guarded initialization calls that littered start_kernel(). Before initcalls, adding a new built-in subsystem required editing main.c. After initcalls, a driver could be made built-in or modular by simply changing the Kconfig option without touching main.c.
The roles of PID 0 and PID 1 have been stable since the earliest kernels; PID 2 became kthreadd later. Before kthreadd (added in Linux 2.6.22, 2007), kernel threads were direct children of PID 1 (the init process), which was awkward: it meant PID 1 could inadvertently affect kernel threads.
__init sections and free_initmem() have existed since the 2.4 era. The memory savings are significant — on a minimal embedded Linux system, reclaiming init memory can be a sizeable fraction of total kernel memory.
Production Examples
Boot time optimization at Google: Google's bare-metal servers are reprovisioned frequently. Boot time matters. Key techniques: using a custom initrd that only contains the minimum drivers needed, disabling PCI enumeration for unused buses (via ACPI SSDT overrides), and using kexec to reboot into a new kernel without going through UEFI POST.
Android init sequence: Android's init process (PID 1) is not systemd or SysV init; it is Android's custom init binary. It parses .rc files (Android Init Language), mounts filesystems, sets up SELinux policy, starts the Zygote process (the pre-initialized Android Runtime process from which every app process is forked), and starts system services. The early boot phases (up to init starting) are identical to mainline Linux.
Container init (PID 1 in Docker): When Docker starts a container with a single process (e.g., nginx), nginx becomes PID 1 in the container's PID namespace. PID 1 has special semantics: the kernel does not apply default signal actions to it, so the SIGTERM sent when the container is stopped is ignored unless the process installs a handler, and it inherits orphaned processes and must reap them. Many containers use tini or dumb-init as a minimal PID 1 that forwards signals and reaps zombies, and that runs the actual application as a child process.
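The two duties of a container PID 1, forwarding signals and reaping children, fit in a few dozen lines of C. This is a minimal sketch in the spirit of tini, not its actual implementation. `toy_init_run` and `toy_forward` are hypothetical names, only SIGTERM and SIGINT are forwarded, and the small window between fork() and installing the handlers is ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t toy_child_pid;

/* Forward termination requests to the application instead of ignoring them. */
static void toy_forward(int sig)
{
    if (toy_child_pid > 0)
        kill((pid_t)toy_child_pid, sig);
}

/*
 * Start argv[0] as a child, then reap every child that exits (including
 * orphans re-parented to us) until the main child is gone.
 * Returns the main child's exit status, 128 + signal number if it was
 * killed by a signal, or -1 on error.
 */
static int toy_init_run(char *const argv[])
{
    struct sigaction sa;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* exec failed */
    }

    toy_child_pid = child;
    sa.sa_handler = toy_forward;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);
    sigaction(SIGINT, &sa, NULL);

    for (;;) {
        int status;
        pid_t done = waitpid(-1, &status, 0);

        if (done < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a forwarded signal */
            return -1;
        }
        if (done != child)
            continue;               /* reaped an orphaned zombie; keep going */
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return 128 + WTERMSIG(status);
    }
}
```

A wrapper's main() would call `toy_init_run(argv + 1)` and return the result as its own exit status, so the container runtime sees the application's exit code.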
Debugging Notes
# See full boot messages (kernel ring buffer):
dmesg | head -200
# Measure kernel init timing:
dmesg | grep "Freeing unused kernel" # marks end of initcall phase
systemd-analyze # time spent in kernel, initrd, userspace
systemd-analyze blame # which units took longest
# Which initcalls ran and how long they took:
# Requires the initcall_debug kernel parameter
dmesg | grep "initcall"
# Find the slowest initcalls (duration in usecs is the second-to-last field):
dmesg | grep "initcall.*returned" | awk '{print $(NF-1), $0}' | sort -rn | head
# See all kernel threads (descendants of PID 2):
ps -eo pid,ppid,comm | awk '$2==2'
# Verify kernel command line:
cat /proc/cmdline
# See init memory freed:
dmesg | grep "Freeing unused kernel image"
Debugging boot hangs:
- If the system hangs before console_init(): no console output. Add earlycon=uart8250,io,0x3f8,115200 to kernel command line for serial output.
- If hangs during do_initcalls(): add initcall_debug to kernel command line. The last "calling" line shows which initcall is stuck.
- If hangs waiting for PID 1: init=/bin/sh kernel parameter boots to a minimal shell.
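When the initcall_debug output is long, it can help to post-process it programmatically. The sketch below parses one completion line, assuming the format "initcall NAME+OFFSET/SIZE returned RET after N usecs", which is how built-in initcalls are usually reported. The struct and function names are hypothetical, and lines that carry a trailing "[module]" tag are not handled.

```c
#include <stdio.h>
#include <string.h>

struct toy_initcall_report {
    char name[64];      /* symbol name, without the +offset/size suffix */
    int ret;            /* value returned by the initcall */
    long long usecs;    /* duration reported by the kernel */
};

/*
 * Parse one "initcall ... returned ... after ... usecs" line.
 * Returns 1 on success, 0 if the line is not an initcall completion report.
 */
static int toy_parse_initcall(const char *line, struct toy_initcall_report *out)
{
    const char *p = strstr(line, "initcall ");
    char sym[64];

    if (!p)
        return 0;
    if (sscanf(p, "initcall %63s returned %d after %lld usecs",
               sym, &out->ret, &out->usecs) != 3)
        return 0;
    sym[strcspn(sym, "+")] = '\0';      /* drop the "+0x0/0x70" suffix */
    snprintf(out->name, sizeof(out->name), "%s", sym);
    return 1;
}
```

Feeding every dmesg line through this function and keeping the largest `usecs` values gives the list of slowest initcalls. A "calling" line with no matching completion report points at the initcall that hung.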
Security Implications
init= kernel parameter: The init= parameter overrides which binary is run as PID 1. An attacker with physical access who can modify the kernel command line (via GRUB, if not password-protected) can run init=/bin/sh to get a root shell. Mitigation: GRUB password protection, UEFI Secure Boot, and a built-in command line that ignores the bootloader-provided one (CONFIG_CMDLINE_FORCE on several architectures, CONFIG_CMDLINE_OVERRIDE on x86).
Initramfs trust: The initramfs is loaded by the bootloader and executed as early-stage PID 1 code. If it is not signed and verified by Secure Boot, a compromised initramfs can compromise the entire system. Distributions using Unified Kernel Images (UKI) bundle the kernel, initramfs, and cmdline into a single PE binary signed for Secure Boot.
__init code removal and security: Removing __init code after boot reduces the kernel's executable code footprint, making ROP (Return-Oriented Programming) chains harder to construct — fewer gadgets exist in executable kernel memory.
Performance Implications
do_initcalls() parallelization: All built-in driver probing currently happens sequentially (within each initcall level). This is a bottleneck on servers with many PCI devices. Experimental work on parallel initcalls has been proposed but not merged, partly because some initcalls have undocumented ordering dependencies.
SMP bringup cost: Starting secondary CPUs (Application Processors) takes 1–5ms per CPU on typical hardware when done one at a time, so a 256-CPU server can spend 256–1280ms on CPU bringup alone. Bring-up runs through the CPU hotplug state machine, the same mechanism that CONFIG_HOTPLUG_CPU uses to take CPUs online and offline at runtime. Linux 6.5 added parallel bring-up of secondary CPUs on x86-64 to reduce this cost.
Memory freed by free_initmem(): Typically 1–4MB on a server kernel. Not significant on a machine with 64GB of RAM, but meaningful on embedded systems. Embedded builds often combine it with CONFIG_TRIM_UNUSED_KSYMS (which drops exported symbols not referenced by any module) to shrink the kernel further.
Failure Modes and Real Incidents
Kernel panic on first schedule(): If sched_init() has a bug or the per-CPU runqueue is corrupt, the first call to schedule() in rest_init() causes a panic with "Kernel panic - not syncing: Attempted to kill the idle task!" or a NULL pointer dereference in scheduler code.
PID 1 exit causes kernel panic: If PID 1 (init/systemd) exits for any reason, the kernel panics immediately: "Kernel panic - not syncing: Attempted to kill init! exitcode=0x..." This is the mechanism that converts a systemd crash into a system crash. Production systems set panic=5 (reboot 5 seconds after panic) so that the machine reboots instead of hanging.
initramfs failure to mount root: If the initramfs cannot find the root filesystem (wrong UUID, missing driver, encrypted volume), the system drops to a recovery shell in the initramfs. The boot message is "Give root password for maintenance" or similar. This is the most common cause of non-booting Linux systems after kernel updates that fail to update initramfs with the correct modules.
AWS/GCP kernel upgrades causing boot failures (historical): Cloud providers occasionally push kernel updates that change the kernel command line format or require new initramfs content. If the AMI's initramfs doesn't include the new driver for the virtio storage device variant, the instance fails to boot. Recovery requires attaching the root volume to another instance.
Modern Usage
kexec: The kexec_load(2) and kexec_file_load(2) system calls load a new kernel into memory, and the system then "reboots" into it without going through BIOS/UEFI POST. Control jumps to the new kernel's entry point, which reaches start_kernel() with the hardware in whatever state the old kernel left it. Used by: kdump (crash kernel loading) and kernel updates on servers that cannot tolerate UEFI boot time.
Unified Kernel Image (UKI): A single PE/COFF binary combining the kernel image, initramfs, kernel cmdline, and the splash screen, signed for UEFI Secure Boot. systemd's ukify tool creates these. The initcall sequence and PID 1 are identical; the change is in how the bootloader loads and verifies the kernel image.
Rust init code: As Rust kernel code is added (the samples in samples/rust/ and, increasingly, real drivers), its __init-equivalent setup runs through the same initcall mechanism: the module! macro generates the init and exit entry points. For built-in code, initialization ordering follows the same initcall levels.
Future Directions
- Parallel initcalls: Enabling parallel execution of independent initcalls (particularly level 6 = device_initcall) could reduce kernel init time from ~100ms to ~20ms on servers with many independent PCI devices. Requires analysis of initcall dependency graphs.
- ACPI on ARM servers: Some ARM systems are moving away from Device Tree toward full ACPI support, allowing unified server management stacks. This changes setup_arch() significantly for ARM64.
- Reduced start_kernel() complexity: Ongoing discussions about breaking start_kernel() into a more formal "phase" API with clear dependency tracking, rather than the current linear sequential ordering.
Exercises
1. Download the Linux 6.x source and read init/main.c from start_kernel() through rest_init(). Make a list of every function called and categorize each as: memory management, CPU/scheduling, interrupt handling, device/bus, timing, or security. What is the ratio of categories?
2. Add initcall_debug to your VM's kernel command line (edit GRUB, append to the linux line). Reboot and collect dmesg. Find the 5 slowest initcalls. What are they? What kernel subsystems or drivers do they initialize?
3. Write a kernel module that uses module_init() and module_exit(). Then change it to use device_initcall() instead of module_init() and compile it built-in (not as a module). Verify it runs during do_initcalls() by checking dmesg at boot. Explain what would have to change to make it a module again.
4. Trace the execution of kernel_init() from when it runs as a kernel thread to when it calls execve("/sbin/init"). Read kernel_init_freeable() in init/main.c. What is the last thing it does before returning to kernel_init()? What happens to the kernel_init thread stack after execve?
5. Use systemd-analyze (or bootchart) to measure the time your system spends in: firmware, bootloader, kernel (from start_kernel() until init is exec'd), initrd, and user-space init. What percentage of total boot time is in the kernel? What subsystem dominates kernel boot time according to initcall_debug?
References
- Linux kernel source: init/main.c, arch/x86/kernel/setup.c, arch/x86/kernel/head_64.S, kernel/sched/core.c, kernel/kthread.c
- Linux kernel documentation: Documentation/admin-guide/kernel-parameters.rst, Documentation/process/changes.rst
- Robert Love, Linux Kernel Development, 3rd ed., Chapter 17 (Kernel Initialization)
- Ulrich Drepper & Jakub Jelinek, "How To Write Shared Libraries": explains ELF sections including .init
- Werner Almesberger, "Booting Linux: The History and the Future", Ottawa Linux Symposium 2000
- LWN.net, "Parallel init scripts and related topics": https://lwn.net/Articles/240697/
- Unified Kernel Image: https://uapi-group.org/specifications/specs/unified_kernel_image/
- kexec documentation: Documentation/admin-guide/kdump/kdump.rst