Kernel Threads

Technical Overview

A kernel thread is an execution context that the kernel scheduler manages directly. It has a Thread Control Block (TCB) in kernel memory, is visible to the kernel scheduler, and can be scheduled onto a CPU independently of other threads. In Linux, threads and processes are unified: task_struct represents both, and the scheduler makes no fundamental distinction between them. The difference lies in which resources they share. Throughout this section, "kernel thread" means any kernel-scheduled thread, not only the kernel's internal worker threads, which have no user address space.

Understanding kernel threads requires understanding clone() — Linux's fundamental creation primitive — and the threading model it enables.

Prerequisites

Process concept (address space, file descriptors, signal handlers)
Virtual memory and page tables
Kernel/user space boundary and syscall mechanism
CPU scheduling basics (run queues, context switches)
x86-64 segment registers (for TLS understanding)

Core Concepts

The `task_struct` and Kernel Representation

Every thread in Linux is represented by a task_struct in the kernel. This is the Thread Control Block — it contains everything the kernel needs to manage the thread:

task_struct Layout (simplified, key fields)
============================================

struct task_struct {
    /* Thread/process identity */
    pid_t               pid;         // thread ID (unique)
    pid_t               tgid;        // thread group ID (= PID of first thread)
    char                comm[16];    // command name

    /* Scheduling state */
    volatile long       state;       // TASK_RUNNING, TASK_INTERRUPTIBLE, ... (unsigned int __state since 5.14)
    int                 prio;        // effective priority
    int                 static_prio;
    struct sched_entity se;          // CFS scheduling entity

    /* Memory */
    struct mm_struct    *mm;         // address space (NULL for kernel threads)
    struct mm_struct    *active_mm;  // active mm (may be borrowed)

    /* Files */
    struct files_struct *files;      // open file descriptors

    /* Signals */
    struct sighand_struct *sighand;  // signal handlers
    sigset_t            blocked;     // blocked signals

    /* Namespaces */
    struct nsproxy      *nsproxy;    // PID/net/mount/etc. namespaces

    /* Credentials */
    const struct cred   *cred;       // uid, gid, capabilities

    /* Stack */
    void                *stack;      // kernel stack pointer

    /* TLS (Thread-Local Storage) */
    struct thread_struct thread;     // arch-specific: x86 TLS entries
};

Thread vs. Process: The `clone()` Unification

Linux does not have separate "process creation" and "thread creation" syscalls. Both fork() and pthread_create() ultimately call clone(), differing only in which resources are shared:

clone() Flags and Resource Sharing
=====================================

fork() = clone(SIGCHLD, ...)
  Creates new:
    - Address space (copy-on-write)
    - File descriptor table
    - Signal handler table
  Shares: nothing (independent process)

pthread_create() = clone(CLONE_VM | CLONE_FS | CLONE_FILES |
                         CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM |
                         CLONE_SETTLS | CLONE_PARENT_SETTID |
                         CLONE_CHILD_CLEARTID, ...)
  Creates new:
    - Stack (allocated by libpthread)
    - Thread ID (pid in kernel)
    - TLS block
  Shares:
    - CLONE_VM: address space (same page tables)
    - CLONE_FILES: file descriptor table
    - CLONE_SIGHAND: signal handlers
    - CLONE_FS: filesystem context (cwd, umask)

vfork() = clone(CLONE_VFORK | CLONE_VM | SIGCHLD, ...)
  Shares address space temporarily, blocks parent
  Used for exec() optimization (no CoW needed)

unshare(): reverse direction — take shared resource private
  unshare(CLONE_FILES) creates private fd table
  Used by container tools, shells, etc.

The kernel implementation perspective: a "process" is simply a task_struct that doesn't share mm_struct with other tasks. A "thread" is a task_struct that shares mm_struct (and usually files_struct, sighand_struct) with related tasks. The kernel scheduler sees both as task_struct instances — the distinction is purely about sharing.

Kernel Thread Lifecycle

Kernel Thread Lifecycle
========================

                    clone()/pthread_create()
                            |
                            v
                    TASK_RUNNING (runnable)
                    [on run queue]
                            |
                    scheduler picks it up
                            |
                            v
                    TASK_RUNNING (running on CPU)
                            |
              +-------------+-------------+
              |             |             |
         I/O wait       sleep()     preempted
              |             |             |
              v             v             v
    TASK_INTERRUPTIBLE  TASK_INTERRUPTIBLE  TASK_RUNNING
    (sleeping, signal-   (sleeping, signal-  (back on runqueue)
     wakeup possible)     wakeup possible)
              |             |
         signal or      signal or
         I/O completes  timeout
              |             |
              +------+-------+
                     |
                     v
              TASK_RUNNING (runnable again)

Other states:
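  TASK_UNINTERRUPTIBLE (D): deep sleep, signals cannot wake it
    (classic: waiting for NFS, broken disk, kernel bug)
  EXIT_ZOMBIE (Z): thread exited, parent hasn't wait()ed yet
    (recorded in exit_state rather than state)
  TASK_STOPPED (T): stopped by SIGSTOP or SIGTSTP
    (ptrace stops are reported as TASK_TRACED, shown as t)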

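The critical state is TASK_UNINTERRUPTIBLE (D state). A thread in D state is blocked inside the kernel, typically on I/O or a lock, at a point where the wait cannot safely be abandoned. Signals, including SIGKILL, stay pending until the wait completes, so kill -9 has no immediate effect. Waits marked TASK_KILLABLE are the exception: they ignore ordinary signals but do respond to fatal ones. If a kernel bug or a dead device leaves a thread in D state forever, the only resolution is a reboot. This is why hanging NFS mounts are particularly dangerous.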

Thread Stack

Each thread has two stacks:

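1. User-space stack:
   - Default size: 8MB on Linux, taken from the stack resource limit (check with ulimit -s)
   - Adjustable: pthread_attr_setstacksize() per thread, or ulimit -s before the program starts
   - Location: the main thread's stack sits near the top of the user address space and grows down; stacks for other threads are mmap()ed regions allocated by the threading library
   - Guard page: one page (4KB) below the stack, mapped with no access rights, so a stack overflow triggers a segfault rather than silent corruption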

# Check default stack size
ulimit -s  # reports in KB, default 8192 = 8MB

// Set the stack size for a new thread (C)
pthread_t tid;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 2 * 1024 * 1024);  // 2MB
pthread_create(&tid, &attr, thread_func, NULL);
pthread_attr_destroy(&attr);

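2. Kernel stack:
   - Default size: 8KB on 32-bit x86, 16KB on x86-64 (since Linux 3.15)
   - Allocated in kernel address space
   - Used during syscalls and interrupt handling while the thread is executing
   - Stack overflow in the kernel is fatal: with CONFIG_VMAP_STACK (Linux 4.9 and later) a guard page turns it into an immediate oops or panic; without it, the overflow silently corrupts adjacent kernel memory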

Memory Layout for a Thread
===========================

Virtual Address Space (one process, two threads)

High addresses
+------------------------+
| Thread 1 stack (8MB)  | ← grows downward
| [guard page]          |
+------------------------+
| Thread 2 stack (8MB)  | ← grows downward
| [guard page]          |
+------------------------+
| heap (shared)         |
+------------------------+
| BSS (shared)          |
| data (shared)         |
| text/code (shared)    |
+------------------------+
Low addresses

Kernel address space (not visible to user):
+------------------------+
| task_struct (thread 1) |
| kernel stack (16KB)    |
+------------------------+
| task_struct (thread 2) |
| kernel stack (16KB)    |
+------------------------+

Thread-Local Storage (TLS)

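TLS provides per-thread variables: each thread gets its own private copy. Every copy lives at a different virtual address but at the same offset from that thread's TLS base, so identical code works in every thread.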

// Declare TLS variable
__thread int errno_copy;        // traditional, non-POSIX
_Thread_local int thread_id;    // C11 standard

// glibc errno is TLS: this is thread-safe
// Each thread has its own errno
int result = some_syscall();
if (result == -1) {
    // errno expands to (*__errno_location()), backed by a TLS int in glibc
    perror("syscall failed");  // reads THIS thread's errno
}

TLS Implementation on x86-64:

TLS is implemented using the FS segment register. The address of the user-space Thread Control Block (glibc's per-thread descriptor, not the kernel's task_struct) is stored in FS.base (MSR_FS_BASE). TLS variables in the static TLS block sit at negative offsets from the TCB:

x86-64 TLS Layout
==================

FS.base → [TCB - Thread Control Block]
           |
           +-- self pointer to the TCB (at offset 0)
           +-- dtv pointer (dynamic TLS vector)
           +-- stack guard canary (at offset 0x28)

TLS variables (at negative offsets from FS.base; the offsets
shown are illustrative, the linker assigns the real ones):

  FS.base - 4  →  errno (glibc's per-thread errno)
  FS.base - 8  →  __thread int my_var  (user TLS)
  ...

Assembly access to TLS variable:
  movl %fs:-4, %eax      # load errno (a 32-bit int)
  movl $0, %fs:-4        # clear errno

The kernel restores FS.base on each context switch so that it points to the current thread's TCB. This is why TLS access is nearly free at runtime: it compiles to a single segment-relative memory access.

Observing Threads in /proc

Linux exposes threads through /proc:

# Threads appear as separate directories under /proc/PID/task/
ps aux | grep myapp
# myapp  1234  ...

ls /proc/1234/task/
# 1234  1235  1236  1237   ← each is a thread (same TGID=1234)

# Thread details
cat /proc/1234/task/1235/status
# Name:   myapp
# Tgid:   1234   ← thread group ID (= process PID)
# Pid:    1235   ← this thread's PID (= TID)
# ...

# gettid() vs getpid()
# getpid() returns TGID (same for all threads in process)
# gettid() returns TID (unique per thread)
# Used in logging, profiling, debugging

#include <sys/syscall.h>
pid_t tid = syscall(SYS_gettid);  // no glibc wrapper until glibc 2.30
// glibc 2.30+ (define _GNU_SOURCE before any #include):
#include <unistd.h>
pid_t tid = gettid();

NPTL (Native POSIX Thread Library)

Before NPTL (pre-2003 Linux), Linux had LinuxThreads, a threads implementation that used a separate process for each thread, returned a different getpid() value in each thread, and had numerous other POSIX non-conformances.

NPTL (Ulrich Drepper and Ingo Molnár, 2002-2003) introduced:

1:1 threading model: Every pthread maps to exactly one kernel thread (task_struct). Direct, transparent.
Correct POSIX semantics: Thread group ID matches process ID for all threads.
Fast thread creation: Using CLONE_VM + CLONE_FILES etc., a new thread can be created in ~1-5µs vs. >100µs for a full process.
Futex-based synchronization: Mutex operations fast-path entirely in userspace; kernel involvement only on contention.

# Verify NPTL is in use
getconf GNU_LIBPTHREAD_VERSION
# NPTL 2.35

Thread Creation Cost

Thread vs. Process Creation Cost (Linux x86-64, 2024)
=======================================================

Operation               | Latency  | Notes
------------------------|----------|----------------------------------
pthread_create()        | 1-5 µs   | Shares mm, fd, signals
fork()                  | 50-500 µs| CoW page table copy, fd dup
fork() + exec()         | 500µs-5ms| Load new binary, dynamic linker

Memory cost per thread:
  task_struct:          ~1.7 KB    (kernel data)
  Kernel stack:         16 KB      (kernel, non-swappable)
  User stack (default): 8 MB       (virtual, lazily allocated)
  TLS block:            ~128 bytes (for typical pthread TLS)

Minimum real memory per thread: ~18 KB kernel + pages touched in stack
Maximum virtual space per thread: ~8 MB (stack reservation)

A system with 10,000 threads consumes:
- Kernel memory: ~170 MB (task_structs + kernel stacks), non-swappable
- Virtual address space: ~80 GB (stack reservations), rarely actually allocated

This is why high-concurrency servers avoid one-thread-per-connection designs: 10,000 threads consume significant kernel resources even if most are sleeping.

Historical Context

LinuxThreads (1996-2003)

LinuxThreads, the pre-NPTL threading library, worked by creating truly separate processes with clone() but without CLONE_THREAD. Each "thread" had a distinct PID, visible in ps. The POSIX spec requires all threads in a process to share a PID; LinuxThreads violated this.

Symptoms: a multi-threaded program on old Linux would show N processes in ps instead of N threads of 1 process. getpid() returned different values in different threads. Signal delivery was non-POSIX.

Red Hat's and SUSE's decision to ship NPTL in 2003 was a significant Linux maturity milestone for server workloads.

The Solaris Thread Model Influence

Solaris 2 (early 1990s) introduced a two-level threading model (M:N — user threads multiplexed on kernel LWPs). Many UNIX systems followed. Linux chose the simpler 1:1 model for NPTL, which was controversial at the time. In retrospect, the 1:1 model proved simpler to implement correctly and performs well on modern multicore hardware.

Production Examples

Java on Linux

The JVM uses NPTL for its thread model (since Java 1.4). Each Java thread maps to exactly one kernel thread. A Java application with 100 threads creates 100 task_structs. This is why large Java applications with many threads (e.g., a thread-per-connection server) can show thousands of entries in ps or top.

# See JVM threads as kernel threads
jps  # find Java process PID
ps -L -p <pid>  # list all threads (LWPs)
# Or: cat /proc/<pid>/status | grep Threads

Apache MPM Prefork/Worker/Event

Apache's threading model evolution reflects a growing understanding of kernel thread cost:
- MPM Prefork: One process per connection (fork overhead, no threading)
- MPM Worker: N processes × M threads (thread pool, NPTL-efficient)
- MPM Event: Thread pool + async I/O (threads not blocked during keepalive)

The MPM Event model is essentially the modern solution to kernel thread overhead for I/O-bound workloads.

Debugging Notes

# strace a single thread
strace -p <tid>  # TID, not PID
strace -f -p <pid>  # trace all threads of a process

# gdb: switch between threads
gdb -p <pid>
(gdb) info threads
# 1 Thread 0x7f... (LWP 1234) "main_thread"
# 2 Thread 0x7f... (LWP 1235) "worker_1"
(gdb) thread 2
(gdb) bt  # backtrace of thread 2

# perf: per-thread CPU profiling
# (-t selects a single thread; combining it with -p profiles the whole process)
perf record -g -t <tid> -- sleep 10
perf report

# /proc virtual filesystem for thread inspection
cat /proc/<pid>/task/<tid>/wchan  # what the thread is waiting on
cat /proc/<pid>/task/<tid>/syscall  # current syscall (if in syscall)
cat /proc/<pid>/task/<tid>/status  # full status

# Check thread state across the whole process
for task in /proc/<pid>/task/*; do
    printf 'TID %s: ' "$(basename "$task")"
    grep State "$task/status"
done

Deadlock Detection

# If threads are stuck in TASK_INTERRUPTIBLE (S) waiting for mutex:
# Check /proc/<pid>/task/<tid>/wchan for each thread

# futex-based deadlock detection:
# strace needs -f to follow every thread; it then shows those blocked in futex():
strace -f -p <pid> -e trace=futex

# Linux perf: lock analysis (covers kernel locks, not user-space pthread mutexes)
perf lock record -a
perf lock report  # shows lock contention, hold times, wait times

Security Implications

Thread ID Predictability

gettid() returns predictable, incrementing TIDs. A vulnerability that requires knowing a target thread's TID can often succeed by brute-force if the process can observe TIDs (e.g., via /proc). Use of randomized TIDs is not standard in Linux.

TLS Canary

glibc uses TLS to store the stack guard canary:

// In glibc's stack protection:
// The canary is stored at fs:0x28 (offset from TCB)
// Stack-smashing protection reads this value to detect overwrites

If an attacker can read TLS (via an info leak), they can bypass stack canary protection. This is a known technique in exploit development.

Thread Safety of Signal Handlers

Only async-signal-safe functions can be called from signal handlers. A process-directed signal can be delivered to any thread in a multi-threaded process that has not blocked it (via pthread_sigmask). A signal handler that calls non-async-signal-safe functions (malloc(), printf(), fopen()) can cause deadlock or corruption, for example when it interrupts a thread that already holds the allocator's lock.

Performance Implications

Context Switch Cost

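A kernel thread context switch involves:
1. Save FPU/SIMD state (if dirty); expensive: ~100-300 cycles for AVX-512
2. Switch page tables (CR3 write), only when the next thread belongs to a different process; threads that share an mm_struct skip this step. Cost: ~50 cycles + TLB refill cost, lower when PCID lets the CPU keep tagged TLB entries
3. Update task_struct state
4. Restore new thread's registers and FPU state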

Typical kernel thread context switch: ~1-2 µs on modern x86-64.

KPTI adds a page-table (CR3) switch on every kernel entry and exit. With PCID this avoids a full TLB flush; without PCID, each transition flushes the TLB (~50-200 cycles for TLB refill amortized).

Futex: Fast Userspace Mutex

The futex (Fast Userspace muTEX) allows most mutex operations to complete in userspace without a syscall:

Futex Uncontended Lock (no kernel involvement):
  1. Atomic CAS on futex word in user memory
  2. If 0 → 1 succeeds: thread holds lock
  3. No syscall, no kernel involvement
  Latency: ~10-30 ns (atomic operation on x86)

Futex Contended Lock (kernel required):
  1. Atomic CAS fails (futex word already locked)
  2. syscall futex(FUTEX_WAIT): thread suspends
  3. ... wait ...
  4. Unlocking thread does CAS, checks waiters
  5. syscall futex(FUTEX_WAKE): wakes waiter
  Latency: ~1-5 µs (includes context switch)

Failure Modes and Real Incidents

glibc Thread Cancellation Bugs

Thread cancellation (pthread_cancel) uses asynchronous signal delivery internally. Cancellation at the wrong point during a syscall can leave file descriptors or mutexes in inconsistent states. Multiple glibc versions have had cancellation-related bugs that caused resource leaks or crashes in long-running multithreaded servers.

Best practice: avoid pthread_cancel. Instead, have each worker poll an explicit stop flag at points where it holds no locks or half-updated state, and let it exit on its own.

Apache SIGPIPE Multi-Thread Storm

Early Apache versions had a bug where a SIGPIPE (broken client connection) could be delivered to any thread in the process. The signal handler called non-async-safe functions. Under high load, SIGPIPE delivery to a thread holding a mutex caused the mutex to never be released — the process hung with all threads waiting for the mutex.

Fix: block SIGPIPE in all threads except a dedicated signal handler thread, or use MSG_NOSIGNAL flag on send operations.

Modern Usage

Kernel threads remain the foundation of all concurrent programming on Linux. Go goroutines, Java virtual threads, and the Node.js event loop all ultimately run on top of kernel threads. Even user-space M:N runtimes only multiplex their own tasks onto a pool of kernel threads; they do not replace them.

Future Directions

Java Virtual Threads (Project Loom): Java 21 introduced virtual threads — N:M threading where millions of Java threads multiplex onto a smaller pool of kernel threads. This shifts from one-kernel-thread-per-Java-thread, directly addressing the scaling limits of kernel thread stacks. See 02-user-threads-and-green-threads.md.

io_uring and Thread Reduction: io_uring allows single-threaded programs to handle thousands of concurrent I/O operations without needing one thread per operation. For I/O-bound servers, this can reduce thread counts by 10-100x.

Kernel Thread vs. Lightweight Thread: As hardware adds more cores (64-128 cores in server CPUs), the overhead of per-core scheduling becomes relevant. Future Linux scheduler work (EEVDF, etc.) focuses on reducing cross-core cache thrashing during thread migration.

Exercises

Thread vs. Process Creation Benchmark: Write a benchmark that measures pthread_create() + pthread_join() overhead vs. fork() + waitpid(). Run 10,000 iterations of each. Analyze: where does each spend its time? Use perf stat to count page faults, context switches, and LLC misses.
TLS Implementation Inspection: Write a C program with a __thread variable. Use gdb to find its address, then compute its offset from FS.base (read via arch_prctl(ARCH_GET_FS)). Confirm the offset matches the glibc ABI. Add a second thread and verify each has its own copy at the same offset.
/proc Thread Exploration: Write a multi-threaded program with four threads, each in a different state (running, sleeping with sleep(), blocked on a mutex, blocked on I/O). Use /proc/PID/task/TID/status and /proc/PID/task/TID/wchan to observe each state. Map the states to the task_struct state constants.
Futex Behavior Under Contention: Write a microbenchmark that measures mutex lock/unlock latency at different contention levels (0 threads contending, 2, 4, 8, 16). Plot the distribution. Identify the contention threshold where futex FUTEX_WAIT syscalls start dominating (observable in strace -c output).
Clone Flags Experiment: Write a program that uses clone() directly (with the syscall, not pthread_create) to create tasks with different flag combinations. Experiment with: (a) sharing only mm (not files), (b) sharing only files (not mm), (c) sharing nothing (essentially fork). Observe the behavior when the child modifies shared state.

References

Bovet, D.P. and Cesati, M. Understanding the Linux Kernel, 3rd ed. O'Reilly, 2005. Chapters 3 (Processes) and 11 (Signals).
Drepper, U. and Molnar, I. "The Native POSIX Thread Library for Linux." Red Hat, 2003. (White paper describing the NPTL architecture)
Love, R. Linux Kernel Development, 3rd ed. Addison-Wesley, 2010. Chapter 3: Process Management.
Kerrisk, M. The Linux Programming Interface. No Starch Press, 2010. Chapters 28 (clone()) and 29-33 (Threads).
glibc NPTL source code: https://sourceware.org/git/glibc.git (nptl/ directory)
Linux man pages: clone(2), pthread_create(3), futex(2), gettid(2)
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide. [FS/GS segment register TLS details]