07 — Context Switching
Technical Overview
A context switch is the operation by which the OS saves the execution state of one task (process or thread) and restores the execution state of another, transferring CPU control between them. It is the fundamental primitive that makes multitasking possible. Context switches are also one of the most significant sources of latency in any OS: they are invisible to application code but have measurable and sometimes dominant costs in throughput-sensitive systems. Understanding exactly what is saved and restored, why, and at what cost enables principled optimization decisions.
Prerequisites
- 01-process-concept.md: task_struct, process states, scheduling fields
- CPU architecture basics: registers, privilege levels, page tables (cr3), XSAVE
- 09-scheduling/ (see that section for full CFS/scheduler details)
Core Content
What Constitutes "Context"
The execution context of a user-space task comprises everything the CPU needs to continue executing it correctly:
Context of one task_struct:
┌─────────────────────────────────────────────────────┐
│ User-space registers │
│ ┌─────────────────────────────────────────────┐ │
│ │ General-purpose: rax, rbx, rcx, rdx, │ │
│ │ rsi, rdi, rbp, rsp, r8–r15 │ │
│ │ Instruction pointer: rip │ │
│ │ Flags register: rflags │ │
│ │ Segment registers: cs, ss, ds, es, fs, gs │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ FPU / SIMD state │
│ ┌─────────────────────────────────────────────┐ │
│ │ x87 FPU state (80-bit registers ×8) │ │
│ │ SSE/AVX state (xmm0–15 or ymm0–15/zmm0–31) │ │
│ │ Saved via XSAVE / XRSTOR │ │
│ │ Lazy: only saved if task used FPU │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Kernel execution state (in thread_struct) │
│ ┌─────────────────────────────────────────────┐ │
│ │ Kernel stack pointer (rsp after syscall) │ │
│ │ fs/gs base (TLS pointers for user space) │ │
│ │ Callee-saved registers (rbx, rbp, r12–r15) │ │
│ │ saved in kernel stack frame by switch_to()│ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Address space reference │
│ ┌─────────────────────────────────────────────┐ │
│ │ mm_struct pointer │ │
│ │ cr3 value (top-level page table address) │ │
│ │ Only reloaded if switching to a different mm │ │
│ └─────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Callee-saved registers (rbx, rbp, r12, r13, r14, r15 in the x86-64 SysV ABI) are the
ones that switch_to() must save and restore, because the ABI requires a callee to
preserve them across a function call. The caller-saved registers (rax, rcx, rdx, rsi,
rdi, r8–r11) need no explicit save: switch_to() is reached through an ordinary function
call, so the compiler already treats them as clobbered and has spilled any live values
to the kernel stack beforehand.
The Context Switch Path in Linux
Context switches happen inside __schedule() in kernel/sched/core.c. They occur
when:
1. The current task calls a blocking syscall (schedule() is called explicitly)
2. The timer interrupt fires and TIF_NEED_RESCHED is set
3. A higher-priority task is woken up (preemption if CONFIG_PREEMPT)
context switch sequence:
─────────────────────────────────────────────────────────────────────
__schedule() kernel/sched/core.c
│
├── pick_next_task(rq, prev) — CFS/RT/DL picks "next" task
│
├── context_switch(rq, prev, next)
│ │
│ ├── prepare_task_switch(rq, prev, next)
│ │ — perf/tracing hooks
│ │
│ ├── switch_mm_irqs_off(prev->mm, next->mm, next)
│ │ — if prev->mm != next->mm (different process):
│ │ load_new_mm_cr3() → write cr3
│ │ (TLB flush: all non-global entries invalidated)
│ │ — if same mm (threads of same process):
│ │ cr3 write skipped (same page tables)
│ │
│ └── switch_to(prev, next, prev)
│ arch/x86/kernel/process_64.c
│ │
│ ├── save callee-saved regs of "prev" onto prev's kernel stack
│ ├── swap kernel stack pointer (rsp): prev→task, next→task
│ ├── restore callee-saved regs of "next" from next's kernel stack
│ ├── update TSS.sp0 (kernel stack pointer for next syscall)
│ ├── reload fs/gs base (for TLS)
│ └── XSAVE/XRSTOR (FPU state) — lazy: only if CR0.TS is clear
│
└── finish_task_switch() — decrease prev's usage counter, handle signals
─────────────────────────────────────────────────────────────────────
After switch_to() completes, execution continues in the new task at the point it was
last preempted (or at the initial task start for a newly created task).
FPU/SIMD State: Lazy Save
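Saving and restoring the full AVX-512 state (zmm0–zmm31 × 512 bits = 2 KB) on every context switch would be expensive for tasks that never use floating-point or SIMD. The classic answer is lazy FPU save, which Linux used on x86 for many years:
- On context switch: the kernel does not touch the FPU registers. It sets the CR0.TS (task switched) flag, so the next FPU instruction executed by the new task raises #NM (device-not-available exception).
- In the #NM handler: the kernel saves the previous owner's FPU state (now that the new task is known to need the FPU), loads the new task's FPU state, and clears CR0.TS.
Mainline Linux removed this CR0.TS-based mode in 4.9, and the Lazy FP state restore vulnerability (CVE-2018-3665) later showed that it could leak register contents between tasks. Current kernels save the outgoing task's state at switch time and defer only the restore, until the task is about to return to user space (TIF_NEED_FPU_LOAD).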
Modern CPUs use XSAVE/XRSTOR which save only the components actually modified
(tracked via a xstate_bv bitmap in the XSAVE area), further reducing the save cost.
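With AVX-512 on Skylake-X and later, XSAVE costs ~100 ns and saves ~2 KB. A kernel
built with CONFIG_X86_USER_SHADOW_STACK on Intel CET hardware also adds shadow stack
state to the per-task context (CONFIG_X86_KERNEL_IBT covers indirect branch tracking,
not shadow stacks).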
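The 2 KB figure, and the saving that the xstate_bv bitmap makes possible, can be checked with a little arithmetic. The sketch below is a simplified model rather than kernel code: the component sizes follow from the architectural register counts and widths, the helper name `xsave_bytes` is hypothetical, and the legacy-region layout and the XSAVE header are deliberately ignored.

```python
# Simplified model of the SIMD-related XSAVE state components.
# Keys are the architectural state-component bit numbers; sizes are derived
# from register counts and widths. The x87 component, the XSAVE header and
# the exact legacy-region layout are ignored in this sketch.
COMPONENTS = {
    1: ("SSE (xmm0-15, low 128 bits)", 16 * 16),
    2: ("AVX (ymm0-15, upper 128 bits)", 16 * 16),
    5: ("opmask (k0-k7)", 8 * 8),
    6: ("ZMM_Hi256 (zmm0-15, upper 256 bits)", 16 * 32),
    7: ("Hi16_ZMM (zmm16-31, full 512 bits)", 16 * 64),
}

def xsave_bytes(xstate_bv: int) -> int:
    """Bytes of SIMD state that must be written for the given bitmap."""
    return sum(size for bit, (_, size) in COMPONENTS.items()
               if xstate_bv & (1 << bit))

sse_only = 1 << 1
avx2 = sse_only | (1 << 2)
avx512 = avx2 | (1 << 5) | (1 << 6) | (1 << 7)

# zmm0-31 alone: everything in the AVX-512 set except the opmask registers.
zmm_total = xsave_bytes(avx512) - COMPONENTS[5][1]
print(xsave_bytes(sse_only), xsave_bytes(avx2), zmm_total)  # 256 512 2048
```

In this model a task that only ever touched SSE registers needs an eighth of the bytes that a full AVX-512 user does, which is the effect the bitmap tracking is meant to capture.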
Address Space Switch: cr3 and TLB
The page table base (cr3) identifies the current address space. Writing cr3 is one
of the most expensive individual operations in a context switch:
switch_mm() cost breakdown (approx, modern x86-64):
──────────────────────────────────────────────────────
Operation Cost
───────────────────────── ────────────────
cr3 write (same ASID) ~20 ns (PCID: no TLB flush)
cr3 write (different ASID, ~200–1000 ns depending on TLB size
full TLB flush) and how many TLB entries must be
re-walked on subsequent accesses
L1/L2 cache cold miss (first ~5–10 ns per cache line
access to new task's pages)
L3 cache cold miss ~40–100 ns per cache line
──────────────────────────────────────────────────────
PCID (Process-Context Identifier) (Intel: CR4.PCIDE, Linux: X86_FEATURE_PCID):
the CPU tags TLB entries with a 12-bit PCID taken from the low bits of cr3. When cr3 is
loaded with bit 63 (the no-flush bit) set, entries tagged with the new PCID are kept;
with bit 63 clear, only that PCID's entries are flushed. Linux has used PCIDs since 4.14
to avoid full TLB flushes on context switches: each CPU keeps a small set of PCID slots
and assigns one to an mm_struct while it is recently active on that CPU.
Context switches between threads of the same process (same mm_struct) skip the
cr3 write entirely — the biggest performance optimization possible for thread-heavy
workloads.
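How a PCID rides along in cr3 can be illustrated with plain bit arithmetic. This is an illustrative sketch, not kernel code: `build_cr3` and `pcid_of` are hypothetical helpers, and the layout assumed is the architectural one with CR4.PCIDE set (PCID in bits 0–11, the 4 KB-aligned page-table base above it, bit 63 as the no-flush hint).

```python
PCID_MASK = 0xFFF          # bits 0-11 carry the PCID when CR4.PCIDE = 1
NOFLUSH = 1 << 63          # bit 63: keep TLB entries tagged with this PCID

def build_cr3(pgd_phys: int, pcid: int, noflush: bool) -> int:
    """Compose a cr3 value from a 4 KB-aligned page-table base and a PCID."""
    if pgd_phys & PCID_MASK:
        raise ValueError("page-table base must be 4 KB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    return pgd_phys | pcid | (NOFLUSH if noflush else 0)

def pcid_of(cr3: int) -> int:
    """Recover the PCID field from a cr3 value."""
    return cr3 & PCID_MASK

cr3 = build_cr3(0x1234_5000, pcid=3, noflush=True)
print(hex(cr3), pcid_of(cr3))  # 0x8000000012345003 3
```

Two address spaces with different PCIDs can therefore coexist in the TLB, and switching between them only changes which tag the CPU matches against.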
Meltdown Mitigation: KPTI
Kernel Page-Table Isolation (KPTI), added in Linux 4.15 to mitigate Meltdown (CVE-2017-5754), doubles the cr3 write cost by maintaining two page table roots per process:
- User page tables: kernel text is unmapped (prevents Meltdown speculation reads)
- Kernel page tables: full mapping, used during kernel execution
Every syscall entry/exit must now switch cr3 between user and kernel page tables.
On CPUs with PCID support (ideally with INVPCID as well), the user and kernel page tables
use separate PCIDs, so the switch needs no TLB flush. On older CPUs each switch is a full
TLB flush, adding ~1–3 µs to every syscall on workloads with large TLB footprints.
Context Switch Cost: Numbers
Context switch cost breakdown (approximate, modern x86-64, Linux 5.x):
────────────────────────────────────────────────────────────────────────
Component Duration Notes
───────────────────────────── ────────── ────────────────────────
Register save (callee-saved) ~30–50 ns Push/pop to kernel stack
cr3 write (PCID, no flush) ~20–30 ns Tagged TLB retained
cr3 write (full TLB flush) ~200–500 ns TLB cold on next accesses
FPU save (XSAVE, lazy) ~100–200 ns Only if FPU was used
Cache warm-up (L1/L2 cold) ~500 ns–5 µs Depends on working set fit
Thread switch (same mm) ~0.5–2 µs No cr3 change
Process switch (different mm) ~2–10 µs cr3 + cache effects
Process switch (mm large/cold) ~10–100 µs Full cache eviction + TLB
────────────────────────────────────────────────────────────────────────
The dominant cost is almost never the register save — it is the cache cold effect. A process that has a 32 MB working set occupying L3 cache lines will experience significant slowdown after a context switch because its lines have been evicted by the incoming task.
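To turn the per-switch figures above into a CPU budget, multiply by the switch rate. The sketch below is back-of-the-envelope arithmetic only: `switch_overhead` is a hypothetical helper, and the inputs are illustrative values picked from the ranges in the table, not measurements.

```python
def switch_overhead(switches_per_sec: float, cost_us: float) -> float:
    """Fraction of one CPU's time spent on context switching."""
    return switches_per_sec * cost_us / 1_000_000

# 20,000 switches/s at 1 µs (thread switch, warm cache) versus 10 µs
# (process switch including cache effects), both taken from the table.
for cost_us in (1.0, 10.0):
    frac = switch_overhead(20_000, cost_us)
    print(f"{cost_us:>4} µs/switch -> {frac:.1%} of one core")
```

The same switch rate costs ten times more CPU when each switch lands on a cold cache, which is why the cache rows in the table matter more than the register rows.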
Measuring Context Switches
Per-process statistics from /proc/PID/status:
grep ctxt_switches /proc/$(pgrep nginx | head -1)/status
# voluntary_ctxt_switches: 1432
# nonvoluntary_ctxt_switches: 87
- voluntary: task called schedule() explicitly (blocking syscall, sched_yield)
- nonvoluntary: task was preempted by the scheduler against its will (time slice expiry or higher-priority task woke up)
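The two counters can also be read programmatically. A minimal sketch, assuming the `key:<tab>value` layout of /proc/PID/status shown in the grep output above; the function names are hypothetical.

```python
def parse_ctxt_switches(status_text: str) -> dict:
    """Extract the two context-switch counters from /proc/PID/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value.strip())
    return counts

def read_ctxt_switches(pid="self") -> dict:
    """Read the counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())

sample = ("Name:\tnginx\n"
          "voluntary_ctxt_switches:\t1432\n"
          "nonvoluntary_ctxt_switches:\t87\n")
print(parse_ctxt_switches(sample))
```

Calling `read_ctxt_switches()` twice with a known delay between the calls and subtracting the results gives a per-process switch rate, since the counters are cumulative.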
System-wide context switch rate:
vmstat 1 # cs column: context switches per second
sar -w 1 # cswch/s column
perf stat -e context-switches ./program
Per-CPU context switch breakdown:
perf stat -a -A -e context-switches,cpu-migrations sleep 5 # -A (--no-aggr) prints one row per CPU
# cpu-migrations: task moved from one CPU to another (more expensive than local switch)
Latency tracing with perf sched:
perf sched record -- sleep 5
perf sched latency # per-task scheduling latency distribution
perf sched timehist # timeline of context switches
Context Switch Sequence Diagram
CPU timeline:
Time ─────────────────────────────────────────────────────────────►
Task A Interrupt/Syscall Task B Task A
(running) │ (running) (resumes)
───────────────────┼───────────────────────────────────────
│
┌────▼────────────────────────────────┐
│ 1. Hardware: │
│ save rip, rsp, rflags to TSS/ │
│ kernel stack (as pt_regs) │
│ switch to kernel stack │
│ switch to ring 0 │
├──────────────────────────────────────┤
│ 2. Kernel entry (interrupt/syscall): │
│ save remaining pt_regs │
│ handle interrupt / syscall work │
│ call schedule() if needed │
├──────────────────────────────────────┤
│ 3. __schedule(): │
│ pick_next_task() → Task B │
│ context_switch(prev=A, next=B) │
├──────────────────────────────────────┤
│ 4. switch_mm() (if A.mm != B.mm): │
│ write cr3 → B's page tables │
│ (TLB flush or PCID switch) │
├──────────────────────────────────────┤
│ 5. switch_to(A→B): │
│ push callee-saved regs (A) │
│ swap kernel rsp: A.rsp → B.rsp │
│ pop callee-saved regs (B) │
│ update TSS.sp0 for B │
│ reload fs/gs TLS base for B │
├──────────────────────────────────────┤
│ 6. Return from kernel: │
│ restore pt_regs (B's user regs) │
│ iret/sysret → Task B in ring 3 │
└──────────────────────────────────────┘
↑ ↑
Task A saved Task B resumes
here; A's here; B's
kernel stack kernel stack
preserved preserved
Voluntary vs. Involuntary in Practice
A high nonvoluntary_ctxt_switches rate indicates the task is CPU-bound and is being
preempted before it completes its work slice. This is generally fine — it means the
scheduler is fairly sharing the CPU. However, if this causes latency issues:
- Use SCHED_FIFO or SCHED_RR (real-time scheduling) to avoid preemption
- Use sched_setaffinity() to pin the task to a dedicated CPU
- Use isolcpus kernel parameter to prevent the scheduler from migrating other tasks
onto specific CPUs
A high voluntary_ctxt_switches rate with low CPU utilization indicates excessive
blocking (waiting on locks, I/O, or nanosleep). This is often intentional but can
indicate lock contention or excessive small-granularity I/O.
Historical Context
The first Unix systems performed process switches by swapping the entire process image to disk and loading another (literal swapping). Early PDP-11 Unix had a tiny memory (64 KB address space limit) and context switches were expensive by modern standards because of the swap.
The shift to paging and demand-load (BSD 4.0, 1980) meant context switches no longer
required full memory copies but introduced TLB flush overhead. The x86 architecture's
Task State Segment (TSS) mechanism was designed for hardware-assisted context switching
(x86 IRET to a new TSS), but Linux (like most OSes) ignores hardware task switching
and does software context switching via switch_to() — hardware switching was found to
be slower and less flexible.
PCIDs for TLB tagging were introduced in Intel Westmere (2010), but Linux did not use them until 4.14 (late 2017), shortly before the Meltdown mitigation (KPTI, early 2018) made cr3 switches on every syscall routine.
Production Examples
Finding high context-switch processes:
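pidstat -w 1 10 # per-process context switch rate (sysstat package)
# or rank processes by their cumulative counters from /proc:
while true; do
  for f in /proc/[0-9]*/status; do
    pid="${f#/proc/}"; pid="${pid%/status}" # strip prefix and suffix to get the PID
    # print "PID counter_name value"; ignore processes that exit mid-scan
    awk -v pid="$pid" '/ctxt_switches/ {print pid, $1, $2}' "$f" 2>/dev/null
  done | sort -k3 -rn | head -10
  sleep 1
done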
Measuring raw context switch latency:
# lmbench lat_ctx benchmark (-s takes the working-set size in KB):
lat_ctx -s 0 2 # 0 KB working set, 2 processes: ~1–3 µs typically
lat_ctx -s 64 2 # 64 KB working set (about L1 size): ~2–5 µs
lat_ctx -s 256 2 # 256 KB working set (fits in L2): ~5–15 µs
lat_ctx -s 16384 2 # 16 MB working set (L3 thrash): ~50–200 µs
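When lmbench is not available, a pipe ping-pong gives a rough equivalent. The following is a minimal sketch, assuming Linux and a fork-capable Python; `pingpong_ns` is a hypothetical helper, and the result includes pipe and interpreter overhead, so it is an upper bound on switch cost rather than a clean measurement.

```python
import os
import time

def pingpong_ns(rounds: int = 2000) -> float:
    """Average nanoseconds per one-way hop between two processes over pipes."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo every byte back, then exit
        try:
            os.close(p2c_w)
            os.close(c2p_r)
            for _ in range(rounds):
                os.write(c2p_w, os.read(p2c_r, 1))
        finally:
            os._exit(0)        # never return into the parent's code path
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter_ns()
    for _ in range(rounds):
        os.write(p2c_w, b"x")  # wake the child ...
        os.read(c2p_r, 1)      # ... and block until it answers
    elapsed = time.perf_counter_ns() - start
    os.waitpid(pid, 0)
    os.close(p2c_w)
    os.close(c2p_r)
    return elapsed / (2 * rounds)

if __name__ == "__main__":
    print(f"{pingpong_ns():.0f} ns per hop (pipe + interpreter overhead included)")
```

Each hop only forces a context switch when both processes share a CPU, so run the script under `taskset -c 0` to pin parent and child to the same core before comparing against lat_ctx.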
Pinning a latency-sensitive process:
# Run on CPU 3 only, isolated from scheduler migration:
taskset -c 3 ./latency-sensitive-app
# Or from within the process:
cpu_set_t cpus; CPU_ZERO(&cpus); CPU_SET(3, &cpus);
sched_setaffinity(0, sizeof(cpus), &cpus);
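The same pinning is available from Python through the standard library's wrappers around the affinity syscalls. A minimal sketch, assuming Linux; `pin_to_cpu` is a hypothetical helper, and it refuses CPUs outside the current mask instead of letting the call fail.

```python
import os

def pin_to_cpu(cpu: int, pid: int = 0) -> set:
    """Restrict `pid` (0 = calling process) to one CPU; return the old mask."""
    previous = os.sched_getaffinity(pid)
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(pid, {cpu})
    return previous

if __name__ == "__main__":
    target = min(os.sched_getaffinity(0))   # pick any CPU we may run on
    old = pin_to_cpu(target)
    print("now pinned to", os.sched_getaffinity(0))
    os.sched_setaffinity(0, old)            # restore the original mask
```

Returning the previous mask makes it easy to pin only around a latency-critical phase and restore normal scheduling afterwards.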
Debugging Notes
- perf sched latency is the gold standard for diagnosing scheduling latency. It measures the time from when a task became runnable to when it actually started running (run queue latency), which includes context switch overhead and scheduling policy effects.
- High nonvoluntary count without CPU pressure: this can indicate a real-time task at higher priority preempting the measured task repeatedly. Use perf sched timehist to see which tasks are causing preemptions.
- cpu-migrations in perf stat: a task migrated between CPUs invalidates its L1/L2 cache lines on the source CPU (cache coherence protocol). If cpu-migrations are high, use sched_setaffinity to pin the task.
- schedule_hrtimeout vs nanosleep: both cause voluntary context switches. schedule_hrtimeout is used by the kernel internally for sub-jiffy sleeps; from user space, clock_nanosleep(CLOCK_MONOTONIC, ...) is the high-resolution path.
- /proc/PID/wchan: for tasks in voluntary sleep, this file shows the kernel function where the task is waiting. ep_poll = in epoll; do_futex = waiting on a futex; pipe_wait = waiting on a pipe; sk_wait_data = waiting on socket data.
Security Implications
- Spectre (CVE-2017-5715) and context switches: Spectre variant 2 exploits branch predictor state that is shared across processes. The kernel mitigates via IBRS (Indirect Branch Restricted Speculation) or IBPB (Indirect Branch Predictor Barrier) on context switches. IBPB costs ~1–4 µs and was enabled by default on many distros, adding significant context switch overhead. Retpoline mitigates without IBPB overhead for kernel code, but user-space cross-process speculation still requires IBPB in high-security deployments.
- MDS/RIDL mitigations: Microarchitectural Data Sampling vulnerabilities require clearing CPU microarchitectural buffers on context switch. The VERW instruction (used by the kernel via mds_user_clear on return-to-user) adds ~50 ns per context switch. It can be disabled (mds=off) in environments where cross-process trust allows.
- KPTI overhead: the cr3 double-switch introduced by KPTI adds ~1–3 µs per syscall on Meltdown-vulnerable CPUs. AMD CPUs are immune to Meltdown; they can boot with nopti safely.
- Time-of-context-switch side channel: precise measurement of context switch timing can leak information about the scheduler state and thus about other processes' activity. Timing attacks in shared cloud environments exploit this.
Performance Implications
- Thread pools vs. processes: threads of the same process share mm_struct, so thread-to-thread context switches skip the cr3 write. For workloads doing frequent context switches (server with many concurrent connections), threads are significantly cheaper than processes.
- Reducing context switches with async I/O: io_uring allows batching many I/O operations in a single syscall (or zero syscalls with IORING_SETUP_SQPOLL). A busy-loop kernel thread polls the submission queue, eliminating context switches for I/O dispatch. For high-IOPS storage workloads, this removes tens of thousands of context switches per second.
- NUMA effects: context switches that move a task between NUMA nodes not only lose L1/L2 cache locality but also incur the remote DRAM access penalty (~100 ns vs. ~40 ns for local access). The NUMA-aware scheduler tries to avoid this but may migrate for load balance. Pin with numactl --cpunodebind=0 --membind=0 for latency-critical tasks.
- Scheduler tick vs. tickless: CONFIG_HZ_1000 means the scheduler timer fires 1000 times per second, limiting time slices to 1 ms minimum and adding 1000 interrupts/s of base overhead. CONFIG_NO_HZ_FULL (tickless) suppresses ticks on CPUs running a single task, reducing context-switch-related interrupts to near-zero.
Failure Modes
| Failure | Symptom | Diagnosis |
|---|---|---|
| Context switch storm | High CPU sys%, many cs in vmstat | pidstat -w; find process switching >10000/s |
| Scheduling latency spike | Request tail latency p99 >> p50 | perf sched latency; check GC/compaction |
| CPU migration thrash | Poor cache utilization, high IPC loss | perf stat -e cpu-migrations; use affinity |
| Meltdown/Spectre mitigation overhead | 10–30% syscall regression | dmesg \| grep -i spectre; benchmark with/without mitigations |
| NUMA migration latency | Memory access slowdown after migration | numastat -p PID; pin with numactl |
| Priority inversion | High-priority task waiting for low-priority | Check RT_MUTEX and SCHED_FIFO priority chains |
Modern Usage
io_uring SQPOLL: the io_uring kernel thread polls the submission queue without any
epoll_wait or read syscall from the application. The application submits I/O requests
by writing to a shared ring buffer. This eliminates context switches for I/O submission
and often for completion as well.
eBPF scheduler (sched_ext): merged in Linux 6.12, sched_ext allows userspace BPF
programs to implement the scheduling policy. The BPF scheduler's dispatch function
runs in the context switch path. This brings scheduling policy close to user-space
control while keeping it in kernel context for safety.
Google's per-CPU kernel threads (ghOSt): Google published a system where the Linux scheduler can delegate per-CPU scheduling decisions to a userspace agent, allowing nanosecond-granularity policy control. This effectively makes the context switch path a plugin boundary.
Future Directions
- Hardware task switching renaissance: proposals for hardware-assisted context switches using Intel's FRED (Flexible Return and Event Delivery) and AMX (Advanced Matrix Extensions) state management may revive interest in hardware-managed context data — though software still provides more flexibility.
- Larger XSAVE state: as CPUs add more SIMD extensions (AMX tiles = 8 KB of state, AVX-512 = 2 KB), XSAVE save/restore cost grows. Lazy save with fine-grained component tracking (xstate_bv) will become more important, not less.
- Shadow stacks in context switch: with Intel CET shadow stack (merged in Linux 6.6), switch_to() must now also save/restore the shadow stack pointer (SSP) alongside the regular stack pointer.
- BPF programs in switch_to() path: ongoing work to allow BPF hooks at key points in context_switch() for custom scheduling telemetry without modifying the kernel.
Exercises
1. Measure context switch cost directly: write two processes communicating via a pipe (ping-pong benchmark). Each process reads 1 byte, writes 1 byte back. Measure total round-trip time with clock_gettime(CLOCK_MONOTONIC). Divide by 2 to estimate single context switch latency. Compare with lmbench lat_ctx -s 0 2.
2. Thread vs. process switch cost: extend the ping-pong above to use (a) two processes (different mm) and (b) two threads in the same process (same mm). Compare context switch latency. Use perf stat -e context-switches to verify counts.
3. TLB effect measurement: write a program that allocates a working set of size W (parameter), forks a second process with a different working set of size W, and does ping-pong context switches. Measure round-trip time as a function of W: 4 KB, 64 KB, 512 KB, 4 MB, 32 MB. Plot the "TLB cliff" where context switch time increases sharply.
4. FPU lazy save: write two threads. Thread A does no floating-point. Thread B does heavy AVX2 computation. Measure context switch latency between them (a) when B is active, (b) when B is idle. Use perf stat -e fp_comp_ops_exe or CR0.TS probing via a trap handler to verify the lazy save is actually triggered.
5. Spectre/KPTI overhead: on an Intel CPU, run a syscall-intensive benchmark (strace -c ./program to count syscalls). Boot once with mitigations=off and once with default mitigations. Compare total syscall time. Identify which mitigation flag in /sys/devices/system/cpu/vulnerabilities/ corresponds to the overhead.
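For the mitigation exercise, it helps to record the kernel's reported mitigation state next to each benchmark run. A minimal sketch, assuming the sysfs directory named in the exercise contains one small text file per vulnerability; `read_mitigations` is a hypothetical helper, and the directory is a parameter so the function can be tried on any path.

```python
import os

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"

def read_mitigations(directory: str = VULN_DIR) -> dict:
    """Map each vulnerability name to the kernel's reported status string."""
    result = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path) as f:
                result[name] = f.read().strip()
    return result

if __name__ == "__main__":
    if os.path.isdir(VULN_DIR):
        for name, status in read_mitigations().items():
            print(f"{name:24} {status}")
```

Saving this output for the default boot and for the mitigations=off boot gives a direct diff of which entries changed between the two measurements.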
References
- kernel/sched/core.c: __schedule(), context_switch()
- arch/x86/kernel/process_64.c: switch_to(), copy_thread()
- arch/x86/mm/tlb.c: switch_mm_irqs_off(), PCID management
- arch/x86/kernel/fpu/core.c: switch_fpu_return(), XSAVE/XRSTOR
- arch/x86/include/asm/switch_to.h: switch_to() macro
- Love, Linux Kernel Development, 3rd ed., Chapter 4 (Process Scheduling), context switch discussion
- Bovet & Cesati, Understanding the Linux Kernel, 3rd ed., Chapter 3 (hardware context, switch_to implementation)
- lmbench: http://www.bitmover.com/lmbench/ (lat_ctx for context switch measurement)
- "An Analysis of Linux Scalability to Many Cores" (Boyd-Wickizer et al., OSDI 2010)
- "The Spectre in the Machine: Spectre/Meltdown and Linux", LWN series (2018)
- Intel SDM, Vol. 3A: Chapter 9 (Processor Management), Chapter 14 (XSAVE)
- man 7 sched: scheduling policies and priority overview
- man 2 sched_setaffinity, man 2 sched_yield, man 2 sched_setscheduler
- perf-sched tutorial: man perf-sched