Kernel Threads

Technical Overview

A kernel thread is a schedulable entity that runs entirely in kernel space — it has a struct task_struct, a kernel stack, and is visible to the scheduler, but it never runs user-space code and never has a user-space virtual address space. Kernel threads are the OS's way of performing background work asynchronously: writing dirty pages to disk, compacting memory, migrating tasks across CPUs, processing deferred interrupt work, running the RCU subsystem, and hundreds of other essential maintenance tasks that the kernel needs to do concurrently with user-space execution.

On a typical production Linux server, there are dozens to hundreds of kernel threads running at any time. Understanding which threads exist, what they do, and how they are created is essential for interpreting ps, top, and performance profiler output — kernel threads consume CPU time, appear in performance profiles, and their blocking behavior directly affects application latency.

Prerequisites

02-kernel-initialization.md: PID 2 (kthreadd) and the first threads
03-kernel-memory-model.md: kernel stack allocation
06-kernel-locking-overview.md (adjacent file): synchronization in threads
Understanding of POSIX threads (for comparison)

Core Content

Kernel Thread Concept

A kernel thread is:
- A struct task_struct with mm == NULL (no user address space)
- A kernel stack (THREAD_SIZE = 16 KiB on x86-64)
- Schedulable by the kernel scheduler (CFS, RT, DL policies apply)
- Visible in ps as [thread_name] (square brackets indicate kernel thread)
- A child of PID 2 (kthreadd) in the process tree

Kernel threads differ from user processes in key ways:
- No user-space address space of their own (current->mm == NULL)
- Cannot access user memory by default: copy_to_user()/copy_from_user() have no mm to operate on unless the thread temporarily adopts one with kthread_use_mm()
- Ignore signals by default; a thread must opt in with allow_signal() before it can see one
- Do not hold file descriptors or a working directory on behalf of a user; they share fs and files state with their parent, kthreadd
- Cannot be strace'd (no syscall boundary)
- Their CPU time shows up as %sys in top, directly attributable to the thread
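
From user space, the most reliable marker of a kernel thread is the PF_KTHREAD bit in the task flags, which is field 9 of /proc/<pid>/stat. The following is a minimal user-space sketch of that check; the helper names stat_line_is_kthread() and pid_is_kthread() are hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* PF_KTHREAD from include/linux/sched.h: set in task->flags for kernel threads. */
#define PF_KTHREAD 0x00200000UL

/*
 * Parse one /proc/<pid>/stat line and classify the task.
 * Returns 1 (kernel thread), 0 (user task), or -1 (malformed line).
 * comm may contain spaces and ')' characters, so search for the LAST ')'.
 */
static int stat_line_is_kthread(const char *line)
{
    assert(line != NULL);
    const char *p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    char state;
    long ppid, pgrp, session, tty_nr, tpgid;
    unsigned long flags;
    /* Fields after comm: state ppid pgrp session tty_nr tpgid flags ... */
    if (sscanf(p + 1, " %c %ld %ld %ld %ld %ld %lu",
               &state, &ppid, &pgrp, &session, &tty_nr, &tpgid, &flags) != 7)
        return -1;
    return (flags & PF_KTHREAD) ? 1 : 0;
}

/* Convenience wrapper: read /proc/<pid>/stat and classify the task. */
static int pid_is_kthread(int pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    return ok ? stat_line_is_kthread(buf) : -1;
}
```

Calling pid_is_kthread(2) on a host (not inside a PID namespace) would be expected to report kthreadd as a kernel thread. The flags test is more robust than checking for ppid == 2, because kthreadd itself has parent PID 0.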

`kthread_create()` and `kthread_run()`

Source: include/linux/kthread.h, kernel/kthread.c

// Create a kernel thread (not yet running):
struct task_struct *task = kthread_create(threadfn, data, namefmt, ...);
// threadfn: int (*threadfn)(void *data) — the thread's main function
// data: opaque void * passed to threadfn
// namefmt: printf-style name for the thread (shown in ps)
// Returns ERR_PTR on failure, task_struct pointer on success

if (IS_ERR(task)) {
    pr_err("Failed to create thread: %ld\n", PTR_ERR(task));
    return PTR_ERR(task);
}

// Optionally set CPU affinity before starting:
kthread_bind(task, cpu_number);

// Start the thread (make it runnable):
wake_up_process(task);

// Combined create + start:
struct task_struct *task = kthread_run(threadfn, data, namefmt, ...);
// kthread_run = kthread_create + wake_up_process

// Create a thread on a specific CPU:
struct task_struct *task = kthread_create_on_cpu(threadfn, data, cpu, namefmt);

How kthread_create() works internally:

Creates a kthread_create_info struct with the arguments
Adds it to kthread_create_list (a global list)
Wakes up kthreadd (PID 2)
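kthreadd dequeues the request and calls kernel_thread(), which reaches copy_process() to create the new task
The new task starts in kthread() (a wrapper in kernel/kthread.c), signals the completion that the caller of kthread_create() is waiting on, and sleeps until it is first woken; only then does it call threadfn(data)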
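
The hand-off described in the steps above can be pictured with a small single-threaded model. This is only an illustrative sketch in user-space C, not kernel code: struct create_req stands in for kthread_create_info, the done flag stands in for the completion the caller sleeps on, and the function names (model_kthread_create, model_kthreadd_drain, model_threadfn) and the starting PID of 100 are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct kthread_create_info (kernel/kthread.c). */
struct create_req {
    int (*threadfn)(void *data);
    void *data;
    int done;          /* models the completion the caller waits on */
    int result_pid;    /* models the task_struct handed back to the caller */
    struct create_req *next;
};

/* Models kthread_create_list: a FIFO of pending creation requests. */
static struct create_req *create_list_head, *create_list_tail;
static int next_pid = 100;   /* arbitrary starting PID for the model */

/* Hypothetical thread function used only to exercise the model. */
static int model_threadfn(void *data)
{
    return *(int *)data;
}

/* Steps 1-3: package the request and append it to the list. */
static void model_kthread_create(struct create_req *req,
                                 int (*fn)(void *), void *data)
{
    assert(req != NULL && fn != NULL);
    req->threadfn = fn;
    req->data = data;
    req->done = 0;
    req->result_pid = -1;
    req->next = NULL;
    if (create_list_tail)
        create_list_tail->next = req;
    else
        create_list_head = req;
    create_list_tail = req;
}

/* Steps 4-5: the kthreadd role drains the list and "creates" each task. */
static int model_kthreadd_drain(void)
{
    int created = 0;
    while (create_list_head) {
        struct create_req *req = create_list_head;
        create_list_head = req->next;
        if (create_list_head == NULL)
            create_list_tail = NULL;
        req->result_pid = next_pid++;   /* the new task now exists */
        req->done = 1;                  /* caller may stop waiting */
        created++;
    }
    return created;
}
```

The point of the indirection is that every kernel thread is forked from kthreadd's clean context rather than from whatever task happened to call kthread_create(). In the real kernel the list is protected by a spinlock and the caller genuinely sleeps; the model collapses both into plain function calls.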

`kthread_stop()` and `kthread_should_stop()`

Clean thread lifecycle management requires cooperation between the creator and the thread:

// The thread function:
static int my_thread_fn(void *data)
{
    struct my_data *d = data;

    // Initialize local state

    while (!kthread_should_stop()) {
        // Do work

        // When there's nothing to do, sleep until woken or stop requested:
        wait_event_interruptible(d->wq,
            d->has_work || kthread_should_stop());

        if (kthread_should_stop())
            break;

        process_work(d);  // expected to clear d->has_work, or the loop never sleeps again
    }

    // Cleanup
    return 0;  // Return value captured by kthread_stop()
}

// Creator thread stopping it:
kthread_stop(task);  // Sets kthread_should_stop(), wakes thread, waits for exit

kthread_should_stop() returns true after kthread_stop() has been called. kthread_stop() is a blocking call: it sets the stop flag, wakes the thread with wake_up_process(), waits for it to exit, and returns the thread function's return value. A thread sleeping in wait_event_interruptible() re-evaluates its condition on that wakeup, sees kthread_should_stop() return true, and leaves the wait.

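Important: Never call kthread_stop() on a thread that has already exited (by returning from its thread function or by calling do_exit() or kthread_exit()) unless you hold a reference taken with get_task_struct(); otherwise the task_struct may already have been freed. Also never kfree a task_struct; the kernel manages its lifetime.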

Common Kernel Threads

On a typical x86-64 Linux server, examining ps -eo args= | grep '^\[' | head -40 reveals:

Housekeeping threads (per-CPU):

Name	Purpose	Source
`kworker/N:M`	Generic worker threads for workqueue work	`kernel/workqueue.c`
`kworker/N:MH`	High-priority worker threads	`kernel/workqueue.c`
`kworker/u<N>:M`	Unbound worker threads (global pool)	`kernel/workqueue.c`
`ksoftirqd/N`	Softirq daemon for CPU N	`kernel/softirq.c`
`migration/N`	Per-CPU stopper thread, used for task migration between CPUs	`kernel/stop_machine.c`
`idle_inject/N`	CPU idle injection for power capping	`drivers/powercap/`
`watchdog/N`	Soft-lockup detector per CPU (kernels before 4.19)	`kernel/watchdog.c`
`cpuhp/N`	CPU hotplug events	`kernel/cpu.c`

Memory management threads:

Name	Purpose	Source
`kswapd0`	Per-NUMA-node page reclaim (swapping)	`mm/vmscan.c`
`kcompactd0`	Per-NUMA-node memory compaction	`mm/compaction.c`
`khugepaged`	Transparent huge page collapsing	`mm/khugepaged.c`
`kthrotld`	Block I/O throttling (cgroup blkio)	`block/blk-throttle.c`
`oom_reaper`	Reclaims the address space (mm) of an OOM-killed process	`mm/oom_kill.c`

I/O related threads:

Name	Purpose	Source
`kblockd`	Block device I/O work	`block/blk-core.c`
`kworker` (flush)	Writeback of dirty pages	`mm/backing-dev.c`
`jbd2/sdaN-M`	Journal commit for ext4 (per filesystem)	`fs/jbd2/`
`xfsaild/sdaN`	XFS AIL (active item list) push daemon	`fs/xfs/xfs_trans_ail.c`
`nvme-wq`	NVMe async event worker	`drivers/nvme/host/`

Networking:

Name	Purpose	Source
`napi/<dev>-<id>`	Threaded NAPI polling (present only when threaded NAPI is enabled)	`net/core/dev.c`
`rpciod`	NFS RPC I/O daemon	`net/sunrpc/`
`nfsiod`	NFS async I/O	`fs/nfs/`

RCU subsystem:

Name	Purpose	Source
`rcu_preempt` / `rcu_sched`	RCU grace-period kthread (name depends on the RCU flavor)	`kernel/rcu/tree.c`
`rcu_exp_gp`	Expedited RCU grace periods	`kernel/rcu/tree.c`
`rcu_tasks_kthread`	Tasks RCU	`kernel/rcu/tasks.h`

Workqueues

Source: include/linux/workqueue.h, kernel/workqueue.c

Workqueues are the modern, preferred mechanism for deferring work from interrupt context or from code that cannot sleep. Instead of creating dedicated kernel threads for every deferred operation, workqueues share a pool of threads.

Basic usage:

// Define a work item, embedded in the object it operates on:
struct my_struct {
    struct work_struct work;
    struct delayed_work dwork;
    // ... driver state ...
};

// Work function:
static void my_work_fn(struct work_struct *work)
{
    // This runs in a kworker thread context
    // CAN sleep (but typically shouldn't for long)
    // CAN allocate with GFP_KERNEL
    // CAN acquire mutexes
    struct my_struct *s = container_of(work, struct my_struct, work);
    process_deferred_work(s);
}

// Initialize (s points to an allocated struct my_struct;
// my_delayed_work_fn has the same signature as my_work_fn):
INIT_WORK(&s->work, my_work_fn);
INIT_DELAYED_WORK(&s->dwork, my_delayed_work_fn);

// Queue work on the system workqueue:
schedule_work(&s->work);                    // queue on system_wq

// Or queue with a delay:
schedule_delayed_work(&s->dwork, msecs_to_jiffies(100));
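
The work function receives only a pointer to the work item, so it relies on container_of() to get back to the structure that embeds it. The following user-space sketch shows the pointer arithmetic involved. The macro is a simplified version of the kernel's, struct work_struct here is a one-field stand-in for the real type, and run_work() is a hypothetical helper that models a kworker invoking the item.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space version of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct work_struct; the real one lives in linux/workqueue.h. */
struct work_struct {
    void (*func)(struct work_struct *work);
};

/* Hypothetical driver state with the work item embedded in it. */
struct my_struct {
    int id;
    int processed;
    struct work_struct work;
};

/* Work function: recover the enclosing object from the embedded member. */
static void my_work_fn(struct work_struct *work)
{
    struct my_struct *s = container_of(work, struct my_struct, work);
    s->processed++;
}

/* Models what a kworker does with a dequeued item: call its function. */
static void run_work(struct work_struct *work)
{
    assert(work != NULL && work->func != NULL);
    work->func(work);
}
```

Embedding the work item is what makes this work: a standalone struct work_struct variable has no enclosing object to recover, so container_of() on it would compute a bogus pointer.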

Custom workqueues:

// Create dedicated workqueue:
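struct workqueue_struct *my_wq = alloc_workqueue("my_wq",
    WQ_MEM_RECLAIM |  // rescuer thread guarantees progress under memory pressure
    WQ_FREEZABLE,     // frozen during suspend
    max_active        // 0 = use the default (WQ_DFL_ACTIVE, 256 on most kernels)
);
if (!my_wq)
    return -ENOMEM;   // alloc_workqueue() returns NULL on failure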

// Queue on specific workqueue:
queue_work(my_wq, &my_work);

// Flush (wait for all queued work to complete):
flush_workqueue(my_wq);

// Destroy:
destroy_workqueue(my_wq);

Workqueue flags:

Flag	Meaning
`WQ_UNBOUND`	Not bound to a specific CPU (global pool)
`WQ_FREEZABLE`	Frozen during system suspend/resume
`WQ_MEM_RECLAIM`	Has dedicated rescue worker for OOM situations
`WQ_HIGHPRI`	Workers run at elevated priority
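`WQ_CPU_INTENSIVE`	Work items may be CPU-intensive; they are excluded from concurrency management so they do not hold back other items in the pool
`WQ_ORDERED`	Work items execute one at a time in submission order (normally requested through alloc_ordered_workqueue())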

System workqueues (pre-allocated, shared):
- system_wq: default, non-ordered, general purpose
- system_highpri_wq: high-priority tasks
- system_long_wq: tasks expected to run for a long time
- system_unbound_wq: not bound to any specific CPU
- system_freezable_wq: frozen during suspend
- system_power_efficient_wq: power-efficient (may use fewer CPUs)

Per-CPU workqueues: For work that must run on a specific CPU (e.g., synchronizing per-CPU state), schedule_work_on(cpu, &work) queues on a specific CPU. The kworker/N:M threads are bound to CPU N.

Kernel Thread vs. Tasklet vs. Softirq

These are three different mechanisms for deferred work, with different constraints:

                    Sleep?  Preemptible?  Context
─────────────────────────────────────────────────────
Softirq             No      No           softirq ctx
Tasklet             No      No           softirq ctx
Workqueue/kthread   Yes     Yes          process ctx

Softirqs (include/linux/interrupt.h): The lowest-overhead deferred work mechanism. Runs directly after hardirq handling or, under load, in ksoftirqd/N. Cannot sleep. Pending vectors are serviced in a fixed order. Used for networking (NET_RX_SOFTIRQ, NET_TX_SOFTIRQ), timer expiry (TIMER_SOFTIRQ), and block I/O completion (BLOCK_SOFTIRQ). Limited to 10 types defined at compile time.

Tasklets: Built on softirqs (TASKLET_SOFTIRQ and HI_SOFTIRQ). Cannot sleep. Can be created dynamically per-device. Serialized: a given tasklet runs on at most one CPU at a time. Considered deprecated since around Linux 5.10; drivers should migrate to workqueues.

Workqueues / kernel threads: The modern approach. Can sleep. Fully preemptible. Used for any work that might take time, acquire locks, or do I/O. The kworker threads that implement workqueues account for a significant portion of kernel CPU usage on busy systems.
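
For reference, the ten softirq vectors mentioned above can be written out as follows. The enum mirrors the one in include/linux/interrupt.h and the strings match the row labels in /proc/softirqs; the lookup helpers softirq_name() and softirq_index() are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the softirq enum in include/linux/interrupt.h. */
enum {
    HI_SOFTIRQ = 0,     /* high-priority tasklets */
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    IRQ_POLL_SOFTIRQ,
    TASKLET_SOFTIRQ,    /* normal-priority tasklets */
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ,
    RCU_SOFTIRQ,
    NR_SOFTIRQS
};

/* Names as they appear in the first column of /proc/softirqs. */
static const char *const softirq_names[NR_SOFTIRQS] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
};

/* Vector number to name; NULL if the number is out of range. */
static const char *softirq_name(unsigned int nr)
{
    return nr < (unsigned int)NR_SOFTIRQS ? softirq_names[nr] : NULL;
}

/* Reverse lookup; returns -1 for an unknown name. */
static int softirq_index(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < NR_SOFTIRQS; i++)
        if (strcmp(softirq_names[i], name) == 0)
            return i;
    return -1;
}
```

Lower numbers are serviced first, which is why tasklets scheduled on HI_SOFTIRQ run ahead of those on TASKLET_SOFTIRQ. Because the set is fixed at compile time, drivers cannot add vectors and must use tasklets or workqueues instead.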

`kthread_worker` API

Source: include/linux/kthread.h

For cases where a dedicated single-threaded kernel thread is needed but workqueue overhead is too high, kthread_worker provides a hybrid:

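static struct kthread_worker *my_worker;
static struct kthread_work my_work;

// Create the worker and start its thread. kthread_run_worker() exists in
// recent kernels; older kernels use kthread_create_worker() the same way.
my_worker = kthread_run_worker(0, "my_worker");
if (IS_ERR(my_worker))
    return PTR_ERR(my_worker);

// Initialize work item (my_work_fn takes a struct kthread_work *):
kthread_init_work(&my_work, my_work_fn);

// Queue work:
kthread_queue_work(my_worker, &my_work);

// Flush:
kthread_flush_work(&my_work);

// Stop worker (flushes pending work, stops the thread, frees the worker):
kthread_destroy_worker(my_worker);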

This creates exactly one dedicated thread (my_worker) that processes all items queued to it, in order. Users include the SPI core's message pump and the crypto engine.

Thread Naming Conventions (`/proc/PID/comm`)

# List kernel threads with their names:
ps -eo pid,ppid,stat,comm | awk '$2 == 2'

# Examples:
#   7  2  S  kworker/0:0        (kworker on CPU 0, worker 0)
#  11  2  S  kworker/u8:0       (unbound worker, pool 8, worker 0)
#  16  2  S  ksoftirqd/0        (softirq daemon for CPU 0)
#  17  2  S  migration/0        (task migration for CPU 0)
# 179  2  S  kswapd0            (page reclaim for NUMA node 0)
# 207  2  S  jbd2/sda1-8        (ext4 journal for /dev/sda1)

The kernel thread name is set by the namefmt argument to kthread_create(). Kernel code can change it later with set_task_comm(). The /proc/PID/comm file exposes the name; task->comm itself holds at most 15 characters plus a terminating NUL (TASK_COMM_LEN is 16), so longer names are truncated there.
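
The 15-character limit is easy to observe from user space, because user threads use the same comm field and prctl(PR_SET_NAME) truncates silently. The sketch below assumes a Linux system with glibc; set_and_get_comm() is a hypothetical helper name, and TASK_COMM_LEN is redefined locally because the kernel constant is not exported to user space under that name.

```c
#include <assert.h>
#include <string.h>
#include <sys/prctl.h>

#define TASK_COMM_LEN 16   /* kernel constant: 15 characters plus the NUL */

/*
 * Set the calling thread's comm, then read it back into out.
 * out must have room for TASK_COMM_LEN bytes. Returns 0 on success.
 */
static int set_and_get_comm(const char *name, char out[TASK_COMM_LEN])
{
    assert(name != NULL && out != NULL);
    if (prctl(PR_SET_NAME, name, 0, 0, 0) != 0)
        return -1;
    memset(out, 0, TASK_COMM_LEN);
    return prctl(PR_GET_NAME, out, 0, 0, 0);
}
```

Passing a 23-character name such as "a_very_long_thread_name" yields the first 15 characters on read-back. This is why kernel thread names are kept short and encode their CPU or device compactly, as in kworker/0:0 or jbd2/sda1-8.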

Historical Context

Early Linux (1.0, 1994) had very few kernel threads. kswapd appeared in Linux 2.0 (1996) for background memory reclaim. The bdflush and kupdated threads handled dirty page writeback until Linux 2.6 replaced them with pdflush (2003) and then the per-bdi (per backing-device-info) flush threads in 2.6.32 (2009).

The current workqueue system (CMWQ — Concurrency Managed Workqueue) was introduced in Linux 2.6.36 (2010) by Tejun Heo, replacing the earlier keventd and multiple independent workqueue implementations. CMWQ dynamically creates and destroys kworker threads based on actual work load, preventing both thread starvation (too few) and thread explosion (too many).

kthreadd (PID 2) was introduced in Linux 2.6.22 (2007). Before this, kernel threads were created as children of the calling process, which could be any task. The result was a tangled process tree in which kernel threads appeared under random user processes.

Tasklets, once the recommended approach for deferred interrupt work in drivers, are now considered deprecated. Kernel developers have discouraged new tasklet use since around Linux 5.10, recommending workqueues or threaded interrupt handlers instead. Removing existing tasklet usage is an ongoing cleanup project.

Production Examples

kswapd thrashing under memory pressure: When a production server runs low on memory, kswapd0 (and kswapd1 on NUMA systems) runs continuously, scanning LRU lists, writing back dirty pages, and swapping out anonymous memory. Heavy kswapd activity shows up as high %sys on specific CPUs in top. This is a classic signal of memory pressure. Monitoring vmstat 1 and watching the si/so (swap in/out) columns or kswapd CPU usage is standard practice.

jbd2 thread and write latency: Every ext4 filesystem has a jbd2/sdaN-M thread that commits the journal every 5 seconds (default commit=5). If a disk is slow, the jbd2 thread blocks on disk I/O, and any fsync() calls from user-space processes block waiting for the journal commit. Tuning commit=1 for lower latency or commit=60 for higher throughput is a common production ext4 tuning knob.

khugepaged and latency spikes: khugepaged scans anonymous memory regions and collapses groups of 512 4K pages into a single 2M transparent huge page. This collapsing work happens in a kernel thread and involves acquiring mmap lock. On systems with large mmap-ed regions (databases, JVMs), khugepaged activity can cause latency spikes as it holds the mmap lock. Redis, JVM applications, and MongoDB commonly disable transparent huge pages (echo never > /sys/kernel/mm/transparent_hugepage/enabled) to eliminate this jitter.
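
Before changing the THP setting it helps to check the current mode. The sysfs file lists every mode and brackets the active one, for example "always madvise [never]". The following sketch parses that format; the helper names thp_active_mode() and thp_read_active_mode() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) word from the sysfs contents into out.
 * Returns 0 on success, -1 if there is no bracketed word or it does
 * not fit in out.
 */
static int thp_active_mode(const char *contents, char *out, size_t outlen)
{
    assert(contents != NULL && out != NULL);
    const char *lb = strchr(contents, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (rb == NULL)
        return -1;
    size_t len = (size_t)(rb - lb - 1);
    if (len == 0 || len >= outlen)
        return -1;
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the live setting; returns -1 if the file is absent or unreadable. */
static int thp_read_active_mode(char *out, size_t outlen)
{
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof(buf), f);
    fclose(f);
    return ok ? thp_active_mode(buf, out, outlen) : -1;
}
```

A deployment check could call thp_read_active_mode() at startup and log a warning when the result is "always" on a latency-sensitive service. The file is missing on kernels built without THP support, which the sketch reports as -1.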

Debugging Notes

# List all kernel threads
ps -eo pid,ppid,stat,pcpu,comm | awk '$1==2 || $2==2' | sort -k4 -rn | head -30

# Identify which kworker is using CPU
# kworker names include CPU:id, find the hot one:
ps -eo pid,pcpu,comm | grep kworker | sort -k2 -rn | head -5

# Then identify what work item it's running (requires bpftrace or perf):
perf top -p <kworker_pid>

# Trace work items as they are queued (all workqueues, not only system_wq):
bpftrace -e 'tracepoint:workqueue:workqueue_queue_work {
    printf("%s queued work: %s\n", comm, ksym(args->function));
}'

# Find stuck tasks (blocked in D state; STAT may carry suffixes such as D+ or Ds):
ps aux | awk '$8 ~ /^D/'
# D = uninterruptible sleep, usually blocked on disk I/O or a kernel lock

# Stack traces of all blocked (D state) tasks:
echo w > /proc/sysrq-trigger   # writes to the kernel log; use 't' to dump every task
dmesg | grep -A 20 "task:kswapd"

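# Inspect workqueue attributes exposed through sysfs (WQ_SYSFS workqueues only):
ls /sys/devices/virtual/workqueue/
cat /sys/devices/virtual/workqueue/cpumask   # CPUs available to unbound workqueues

# Per-workqueue statistics (script in the kernel source tree, requires drgn):
tools/workqueue/wq_monitor.py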

Diagnosing a stuck kernel thread:

# If a kernel thread is in D state for >120s, kernel prints warning:
# "INFO: task kworker/3:0:1234 blocked for more than 120 seconds"
# The stack trace shows where it's stuck

# In a live system, get the stack:
cat /proc/<pid>/wchan    # wait channel (function where thread is sleeping)
cat /proc/<pid>/stack    # full kernel stack trace

Security Implications

Kernel thread hijacking: An attacker who can execute kernel code (e.g., via a kernel module or exploit) can hijack a kernel thread by overwriting its thread_fn function pointer or by queuing malicious work items to a workqueue. Kernel thread integrity is as important as any other kernel code.

kworker as a rootkit hiding point: Kernel rootkits sometimes hide their activities in kernel threads or by queuing work to existing workqueues. The kworker threads are a natural cover because there are many of them and they do varied work. eBPF-based detection tools (Falco, Tracee) can monitor workqueue item submissions and detect suspicious patterns.

kthread_stop() and signal delivery: Kernel threads do not receive POSIX signals from user space under normal circumstances. They can receive kernel-internal notifications via kthread_should_stop() and similar mechanisms. This separation means a user process cannot kill a kernel thread with kill -9 <kthread_pid> — the signal is ignored.

Performance Implications

kworker CPU usage: kworker threads collectively appear prominently in CPU profiles on busy systems. In perf top, a busy kworker usually indicates deferred work items such as block I/O completions, writeback, or driver work; deferred softirq processing shows up under ksoftirqd/N instead. Concurrency per workqueue is capped by max_active, which cannot exceed WQ_MAX_ACTIVE (512 on most kernels), to prevent over-threading.

Binding considerations: Per-CPU kernel threads (migration/N, ksoftirqd/N) cannot be migrated from their CPU. Workqueue threads are either per-CPU (bound) or unbound. On a system with CPU isolation (isolcpus=), unbound kworkers won't run on isolated CPUs, but bound kworkers still will — this needs explicit management for strict real-time requirements.

Thread count impact: Having too many kernel threads increases scheduling overhead (more entries in the run queue) and increases memory usage (each thread has a 16 KiB kernel stack plus its task_struct). A few hundred threads cost only a few MiB of stack, which is manageable. A misconfigured system with thousands of workqueue threads can see scheduler overhead affecting latency.
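
The stack cost is simple to estimate. The sketch below multiplies a thread count by the 16 KiB x86-64 stack size quoted in this document; kthread_stack_bytes() is a hypothetical helper, and the figure excludes task_struct and other per-task allocations.

```c
#include <assert.h>
#include <stdint.h>

#define THREAD_SIZE_BYTES (16u * 1024u)   /* x86-64 kernel stack, per the text */

/* Kernel-stack memory pinned by nr_threads kernel threads, in bytes. */
static uint64_t kthread_stack_bytes(uint64_t nr_threads)
{
    /* Guard against overflow of the 64-bit product. */
    assert(nr_threads <= UINT64_MAX / THREAD_SIZE_BYTES);
    return nr_threads * THREAD_SIZE_BYTES;
}
```

For example, 100 threads pin 1,638,400 bytes (about 1.6 MiB) of stack, while 4096 threads pin 64 MiB. Stack memory is therefore rarely the limiting factor; scheduler and cache effects usually matter first.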

Failure Modes and Real Incidents

kswapd soft lockup: Under extreme memory pressure, kswapd can enter a loop scanning LRU lists without making progress (no pages are freeable because all are mapped and dirty). The kernel's soft lockup detector fires once a CPU has gone about 20 seconds without scheduling (twice kernel.watchdog_thresh, by default) and logs a warning. The only recovery is usually an OOM kill or manual intervention.

kcompactd fragmentation stall: On production servers with long uptimes and many page allocations, memory becomes fragmented at the order-9 level (2 MiB contiguous allocations with 4 KiB pages). kcompactd runs to defragment by migrating pages, but this is expensive and can cause latency spikes as it holds per-zone lock. Affected workloads: applications using huge pages (THP), DMA coherent buffers, kernel module loading.
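
An order-n allocation spans 2^n physically contiguous pages, so the size of a given order follows directly from the page size. A minimal sketch of the arithmetic, with order_to_bytes() as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of an order-`order` allocation for a given page size. */
static uint64_t order_to_bytes(unsigned int order, uint64_t page_size)
{
    assert(order < 32);          /* keep the shift well within 64 bits */
    return page_size << order;   /* page_size * 2^order */
}
```

With 4 KiB pages, order 9 is 512 pages or 2 MiB, which is the size of an x86-64 transparent huge page; order 10 is 4 MiB.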

Workqueue deadlock (historically): A workqueue can deadlock if work item A blocks waiting for work item B, and both are queued to the same single-threaded workqueue. CMWQ avoids the common form of this by starting another worker when a running item sleeps, and lockdep annotations warn about such flush dependencies; an ordered workqueue can still deadlock this way. Pre-CMWQ (Linux <2.6.36), this was a real and hard-to-debug deadlock pattern in drivers.

jbd2 thread stall causing ext4 hang: If the underlying storage device experiences a latency spike (due to SSD garbage collection, RAID rebuild, or cloud EBS I/O throttling), the jbd2/sdaN thread blocks waiting for journal I/O. All fsync() calls from user processes then block, causing widespread application stalls. The iostat -x output shows this as a high await (w_await) on the device while the journal commit is pending.

Modern Usage

Dedicated workqueues in NVMe: The NVMe host driver allocates its own workqueues (nvme-wq, nvme-reset-wq, nvme-delete-wq) instead of relying on system_wq, so that controller reset and deletion work cannot get stuck behind ordinary work items.

eBPF and kworker observation: bpftrace can be attached to workqueue entry/exit tracepoints to attribute kworker CPU time to specific work functions:

bpftrace -e '
tracepoint:workqueue:workqueue_execute_start {
    @start[tid] = nsecs;
    @fn[tid] = args->function;
}
tracepoint:workqueue:workqueue_execute_end {
    if (@start[tid]) {
        @time[ksym(@fn[tid])] = hist(nsecs - @start[tid]);
        delete(@start[tid]); delete(@fn[tid]);
    }
}'

This immediately shows which work functions are consuming kworker time — invaluable for diagnosing "high %sys CPU" issues attributed to kworker threads.

io_uring and worker threads: When io_uring cannot complete an operation inline (e.g., a file read that blocks), it offloads to a pool of io-wq worker threads managed by io_uring. These are not kworker threads but purpose-built io_uring workers, created as threads of the submitting task with one pool per task, and visible as iou-wrk-PID in process listings.

Future Directions

Tasklet removal: Ongoing effort to replace all tasklet usage in drivers with workqueues. As of Linux 6.6, ~200 drivers still use tasklets. The goal is eventual tasklet removal from the kernel.
Preemptible workqueues: PREEMPT_RT replaces some softirq/tasklet/kworker patterns with fully preemptible kernel threads, allowing high-priority real-time tasks to preempt even "interrupt-level" work.
Work item priority inheritance: Proposal to allow workqueue items to inherit the priority of the waiting process, so that high-priority user processes don't stall waiting for low-priority kworker threads.
BPF sleep context: Work is ongoing to allow BPF programs to run in "sleepable" contexts, potentially replacing some uses of kernel threads for policy enforcement.

Exercises

Run ps -eo args= | grep '^\[' | wc -l to count kernel threads on your system. Then categorize them by prefix (kworker, ksoftirqd, kswapd, etc.). What is the most common category? What does this tell you about what the kernel does most in the background?
Write a kernel module that creates a kernel thread using kthread_run(). The thread should increment a per-CPU counter every second, sleeping with msleep(1000) between increments. After 10 seconds, stop the thread with kthread_stop(). Log the total count across all CPU-seconds.
Use bpftrace -e 'tracepoint:workqueue:workqueue_queue_work { @[ksym(args->function)] = count(); } interval:s:30 { exit(); }' to count workqueue submissions by function over 30 seconds. What are the top 5 work functions? Research where each is defined in the kernel source.
Read kernel/workqueue.c, specifically the worker_thread() function. This is the main loop for every kworker thread. What is the flow when there is work to do vs. when the worker is idle? How does CMWQ decide to create more worker threads? What is WORKER_IDLE vs. WORKER_RUNNING?
Find a driver in the kernel that uses kthread_create() directly (rather than workqueues). Good candidates: drivers/md/raid5.c, drivers/scsi/. Read its thread function and identify: (a) how it waits for work, (b) how it processes work, (c) how it shuts down. Compare this pattern to a workqueue-based approach and explain the trade-offs.

References

Linux kernel source: kernel/kthread.c, kernel/workqueue.c, include/linux/kthread.h, include/linux/workqueue.h
Linux kernel documentation: Documentation/core-api/workqueue.rst
Tejun Heo, "Concurrency Managed Workqueue (cmwq)", LWN.net: https://lwn.net/Articles/403891/
Robert Love, Linux Kernel Development, 3rd ed., Chapter 3 (Process Management) and Chapter 8 (Bottom Halves)
LWN.net, "The deprecation of tasklets": https://lwn.net/Articles/830964/
Brendan Gregg, BPF Performance Tools, Chapter 14 (Kernel Internals) — kworker tracing examples
Jonathan Corbet, "Kernel threads and the kthread_worker API", LWN.net
Linux kernel source for common threads: mm/vmscan.c (kswapd), mm/compaction.c (kcompactd), fs/jbd2/journal.c (jbd2)