07 — Context Switching

Technical Overview

A context switch is the operation by which the OS saves the execution state of one task (process or thread) and restores the execution state of another, transferring CPU control between them. It is the fundamental primitive that makes multitasking possible. Context switches are also one of the most significant sources of latency in any OS: they are invisible to application code but have measurable and sometimes dominant costs in throughput-sensitive systems. Understanding exactly what is saved and restored, why, and at what cost enables principled optimization decisions.


Prerequisites

  • 01-process-concept.md: task_struct, process states, scheduling fields
  • CPU architecture basics: registers, privilege levels, page tables (cr3), XSAVE
  • 09-scheduling/ (see that section for full CFS/scheduler details)

Core Content

What Constitutes "Context"

The execution context of a user-space task comprises everything the CPU needs to continue executing it correctly:

Context of one task_struct:
┌─────────────────────────────────────────────────────┐
│  User-space registers                               │
│  ┌─────────────────────────────────────────────┐   │
│  │ General-purpose: rax, rbx, rcx, rdx,        │   │
│  │   rsi, rdi, rbp, rsp, r8–r15                │   │
│  │ Instruction pointer: rip                    │   │
│  │ Flags register: rflags                      │   │
│  │ Segment registers: cs, ss, ds, es, fs, gs   │   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
│  FPU / SIMD state                                   │
│  ┌─────────────────────────────────────────────┐   │
│  │ x87 FPU state (80-bit registers ×8)         │   │
│  │ SSE/AVX state (xmm0–15 or ymm0–15/zmm0–31) │   │
│  │ Saved via XSAVE / XRSTOR                    │   │
│  │ Lazy: only saved if task used FPU           │   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
│  Kernel execution state (in thread_struct)          │
│  ┌─────────────────────────────────────────────┐   │
│  │ Kernel stack pointer (rsp after syscall)    │   │
│  │ fs/gs base (TLS pointers for user space)    │   │
│  │ Callee-saved registers (rbx, rbp, r12–r15)  │   │
│  │   saved in kernel stack frame by switch_to()│   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
│  Address space reference                           │
│  ┌─────────────────────────────────────────────┐   │
│  │ mm_struct pointer                            │   │
│  │ cr3 value (top-level page table address)     │   │
│  │ Only reloaded if switching to a different mm │   │
│  └─────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘

Callee-saved registers (rbx, rbp, r12, r13, r14, r15 in the x86-64 SysV ABI) are the ones switch_to() must save and restore, because the ABI requires a callee to preserve them across a call. The caller-saved registers (rax, rcx, rdx, rsi, rdi, r8–r11) need no explicit save: switch_to() is reached through an ordinary function call, so the compiler already treats them as clobbered and has spilled any live values to the kernel stack beforehand.


The Context Switch Path in Linux

Context switches happen inside __schedule() in kernel/sched/core.c. They occur when:

  1. The current task calls a blocking syscall (schedule() is called explicitly)
  2. The timer interrupt fires and TIF_NEED_RESCHED is set
  3. A higher-priority task is woken up (preemption if CONFIG_PREEMPT)

context switch sequence:
─────────────────────────────────────────────────────────────────────
__schedule()                     kernel/sched/core.c
   │
   ├── pick_next_task(rq, prev)  — CFS/RT/DL picks "next" task
   │
   ├── context_switch(rq, prev, next)
   │      │
   │      ├── prepare_task_switch(rq, prev, next)
   │      │      — perf/tracing hooks
   │      │
   │      ├── switch_mm_irqs_off(prev->mm, next->mm, next)
   │      │      — if prev->mm != next->mm (different process):
   │      │          load_new_mm_cr3() → write cr3
   │      │          (TLB flush: all non-global entries invalidated)
   │      │      — if same mm (threads of same process):
   │      │          cr3 write skipped (same page tables)
   │      │
   │      └── switch_to(prev, next, prev)
   │             arch/x86/kernel/process_64.c
   │             │
   │             ├── save callee-saved regs of "prev" onto prev's kernel stack
   │             ├── swap kernel stack pointer (rsp): save into prev->thread.sp, load from next->thread.sp
   │             ├── restore callee-saved regs of "next" from next's kernel stack
   │             ├── update TSS.sp0 (kernel stack pointer for next syscall)
   │             ├── reload fs/gs base (for TLS)
   │             └── save prev's FPU state (XSAVE family) if live; next's restore is deferred to return-to-user
   │
   └── finish_task_switch()     (release the runqueue lock, drop references held on prev)
─────────────────────────────────────────────────────────────────────

After switch_to() completes, execution continues in the new task at the point it was last preempted (or at the initial task start for a newly created task).
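
The register and stack handoff performed by switch_to() can be illustrated with a toy model. The sketch below is not kernel code: `Task`, `Cpu`, and this Python `switch_to` are hypothetical names, the "kernel stack" is a plain list, and `saved_sp` only stands in for the saved stack pointer. It shows why pushing the callee-saved registers onto the outgoing task's own stack is enough to resume it later, and why a task that has never run needs a pre-built frame.

```python
CALLEE_SAVED = ("rbx", "rbp", "r12", "r13", "r14", "r15")


class Task:
    """Toy task: a private 'kernel stack' plus a saved stack pointer."""

    def __init__(self, name):
        self.name = name
        # A new task starts with a fabricated frame so that the first
        # switch into it has something to pop.
        self.kernel_stack = [0] * len(CALLEE_SAVED)
        self.saved_sp = len(self.kernel_stack)   # models the saved rsp


class Cpu:
    """Toy CPU that holds only the callee-saved registers."""

    def __init__(self, first_task):
        self.regs = dict.fromkeys(CALLEE_SAVED, 0)
        self.current = first_task
        first_task.kernel_stack.clear()          # running task: nothing saved yet


def switch_to(cpu, next_task):
    prev = cpu.current
    # 1. Push prev's callee-saved registers onto prev's own stack.
    for reg in CALLEE_SAVED:
        prev.kernel_stack.append(cpu.regs[reg])
    # 2. Remember where prev's stack ends, then adopt next's stack.
    prev.saved_sp = len(prev.kernel_stack)
    stack = next_task.kernel_stack
    # 3. Pop next's registers in reverse push order.
    for reg in reversed(CALLEE_SAVED):
        cpu.regs[reg] = stack.pop()
    cpu.current = next_task
    return prev


a, b = Task("A"), Task("B")
cpu = Cpu(a)
cpu.regs["rbx"] = 111            # A's live value
switch_to(cpu, b)                # B starts from its fabricated (zeroed) frame
cpu.regs["rbx"] = 222            # B overwrites the register
switch_to(cpu, a)
print(cpu.regs["rbx"])           # 111: A resumes with the value it had
```

Nothing in the model copies register values between tasks directly. Each task's state lives on its own stack, and selecting a stack selects the task.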


FPU/SIMD State: Lazy Save

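Saving and restoring the full AVX-512 register file (zmm0–zmm31 × 512 bits = 2 KB) on every context switch is wasted work for tasks that never use floating-point or SIMD. The classic x86 answer is lazy FPU switching:

  • On context switch: the kernel leaves the outgoing task's FPU registers in place and sets CR0.TS (task switched), so that the next FPU instruction executed by the new task raises #NM (device-not-available exception).
  • On #NM handler: the kernel saves the previous owner's FPU state (now that the new task is known to need the FPU), loads the new task's FPU state, and clears CR0.TS.

Linux used this scheme for many years, made eager saving the default in 4.6, and has since removed the CR0.TS path. The LazyFP vulnerability (CVE-2018-3665) later showed that leaving another task's registers in place is also a security risk. Recent kernels save the outgoing task's state at switch time and defer only the restore, until just before the incoming task returns to user space.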

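Modern CPUs provide the XSAVE family of instructions. The optimized variants (XSAVEOPT, XSAVES) skip state components that are still in their initial configuration or unmodified since the last restore, and record which components the save area holds in the XSTATE_BV bitmap of its header, further reducing the save cost.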

With AVX-512 on Skylake-X and later, XSAVE costs ~100 ns and saves ~2 KB. A kernel built with user shadow stack support (CONFIG_X86_USER_SHADOW_STACK) on a CPU with Intel CET also saves the task's shadow-stack state here.
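
The 2 KB figure follows directly from the register file dimensions. Here is a small arithmetic sketch; the helper name `register_file_bytes` is made up for illustration, and the AMX line assumes 8 tile registers of 16 rows × 64 bytes each.

```python
def register_file_bytes(count, bits_each):
    """Bytes needed to hold `count` registers of `bits_each` bits."""
    return count * bits_each // 8


xmm = register_file_bytes(16, 128)    # SSE:     256 bytes
ymm = register_file_bytes(16, 256)    # AVX:     512 bytes
zmm = register_file_bytes(32, 512)    # AVX-512: 2048 bytes (the 2 KB figure)
amx_tiles = 8 * 16 * 64               # AMX:     8192 bytes of tile data

print(xmm, ymm, zmm, amx_tiles)       # 256 512 2048 8192
```

The in-memory XSAVE area is somewhat larger than the raw register file, because it also contains a legacy region and a header.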


Address Space Switch: cr3 and TLB

The page table base (cr3) identifies the current address space. Writing cr3 is one of the most expensive individual operations in a context switch:

switch_mm() cost breakdown (approx, modern x86-64):
──────────────────────────────────────────────────────
Operation                         Cost
─────────────────────────         ────────────────
cr3 write (same ASID)             ~20 ns (PCID: no TLB flush)
cr3 write (different ASID,        ~200–1000 ns depending on TLB size
  full TLB flush)                 and how many TLB entries must be
                                  re-walked on subsequent accesses
L1/L2 cache cold miss (first      ~5–10 ns per cache line
  access to new task's pages)
L3 cache cold miss                ~40–100 ns per cache line
──────────────────────────────────────────────────────

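PCID (Process-Context Identifier; enabled via CR4.PCIDE, exposed in Linux as X86_FEATURE_PCID): the CPU tags each TLB entry with a 12-bit PCID taken from the low bits of cr3, and a lookup only hits entries tagged with the current PCID. A cr3 write that sets bit 63 (the no-flush bit) switches address space without invalidating the entries already cached for the new PCID. Linux has used PCIDs since 4.14 to avoid full TLB flushes on context switches: each CPU hands out a small pool of PCIDs to the mm_structs that ran on it most recently.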

Context switches between threads of the same process (same mm_struct) skip the cr3 write entirely. This removes the address-space cost altogether and is a major saving for thread-heavy workloads.
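
To make the tagging concrete: with CR4.PCIDE set, the low 12 bits of cr3 carry the PCID, the page-table base sits above them, and bit 63 of the written value asks the CPU to keep existing TLB entries. The sketch below only does the bit arithmetic. `make_cr3` and `split_cr3` are illustrative names, not kernel functions, and user-space code cannot write cr3 at all.

```python
PCID_BITS = 12
PCID_MASK = (1 << PCID_BITS) - 1
CR3_NOFLUSH = 1 << 63


def make_cr3(pgd_phys, pcid, noflush):
    """Compose a cr3 value from a page-table base, a PCID, and the no-flush bit."""
    if pgd_phys & PCID_MASK:
        raise ValueError("page-table base must be 4 KB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = pgd_phys | pcid
    if noflush:
        value |= CR3_NOFLUSH
    return value


def split_cr3(value):
    """Return (page-table base, PCID, noflush) for a composed cr3 value."""
    noflush = bool(value & CR3_NOFLUSH)
    value &= ~CR3_NOFLUSH
    return value & ~PCID_MASK, value & PCID_MASK, noflush


cr3 = make_cr3(0x1234000, pcid=5, noflush=True)
print(hex(cr3))                  # 0x8000000001234005
print(split_cr3(cr3))            # (19087360, 5, True)
```

Twelve bits allow 4096 distinct tags, far fewer than the number of address spaces a busy system may hold, which is why an OS has to recycle PCIDs rather than assign one permanently to each process.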


Meltdown Mitigation: KPTI

Kernel Page-Table Isolation (KPTI), added in Linux 4.15 to mitigate Meltdown (CVE-2017-5754), raises the cost of cr3 handling by maintaining two page table roots per process:

  • User page tables: almost all kernel mappings are removed, leaving only the entry/exit trampoline, so Meltdown-style speculative reads have nothing to reach
  • Kernel page tables: full mapping, used during kernel execution

Every syscall entry/exit must now switch cr3 between user and kernel page tables. On CPUs with PCID support (ideally with INVPCID as well), the two roots are given separate PCIDs and the switch avoids a TLB flush. On older CPUs each switch is a full TLB flush, adding ~1–3 µs to every syscall for workloads with large TLB footprints.


Context Switch Cost: Numbers

Context switch cost breakdown (approximate, modern x86-64, Linux 5.x):
────────────────────────────────────────────────────────────────────────
Component                         Duration      Notes
─────────────────────────────     ──────────    ────────────────────────
Register save (callee-saved)      ~30–50 ns     Push/pop to kernel stack
cr3 write (PCID, no flush)        ~20–30 ns     Tagged TLB retained
cr3 write (full TLB flush)        ~200–500 ns   TLB cold on next accesses
FPU save (XSAVE, lazy)           ~100–200 ns   Only if FPU was used
Cache warm-up (L1/L2 cold)       ~500 ns–5 µs  Depends on working set fit
Thread switch (same mm)          ~0.5–2 µs     No cr3 change
Process switch (different mm)    ~2–10 µs      cr3 + cache effects
Process switch (mm large/cold)   ~10–100 µs    Full cache eviction + TLB
────────────────────────────────────────────────────────────────────────

The dominant cost is almost never the register save — it is the cache cold effect. A process that has a 32 MB working set occupying L3 cache lines will experience significant slowdown after a context switch because its lines have been evicted by the incoming task.
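
A back-of-the-envelope estimate shows when these numbers start to matter. The sketch below multiplies a switch rate by a per-switch cost taken from the upper ends of the ranges in the table. The function name and the 10,000 switches/s rate are illustrative assumptions, not measurements.

```python
def switch_overhead_fraction(switches_per_sec, cost_us):
    """Fraction of one CPU's time spent on context switches."""
    return switches_per_sec * cost_us / 1_000_000


scenarios = [("thread switch", 2), ("process switch", 10), ("cold process switch", 100)]
for label, cost_us in scenarios:
    share = switch_overhead_fraction(10_000, cost_us)
    print(f"{label:20s} {share:.0%} of one CPU at 10,000 switches/s")
```

With these inputs the shares come out at 2%, 10%, and 100%. The same switch rate that is negligible for warm thread switches saturates a core once every switch lands on a cold cache.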


Measuring Context Switches

Per-process statistics from /proc/PID/status:

grep ctxt_switches /proc/$(pgrep nginx | head -1)/status
# voluntary_ctxt_switches:     1432
# nonvoluntary_ctxt_switches:  87
  • voluntary: task called schedule() explicitly (blocking syscall, sched_yield)
  • nonvoluntary: task was preempted by the scheduler against its will (time slice expiry or higher-priority task woke up)
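
The same two counters can be read programmatically. This is a minimal sketch; the function names are illustrative, and the sample text simply reuses the values shown above.

```python
def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from /proc/PID/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key.strip()] = int(value)
    return counts


def read_ctxt_switches(pid="self"):
    """Read the counters for a PID (default: the calling process)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())


sample = "Name:\tnginx\nvoluntary_ctxt_switches:\t1432\nnonvoluntary_ctxt_switches:\t87\n"
print(parse_ctxt_switches(sample))
# {'voluntary_ctxt_switches': 1432, 'nonvoluntary_ctxt_switches': 87}
```

Both counters are cumulative since the task started, so a rate requires two reads and a subtraction.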

System-wide context switch rate:

vmstat 1          # cs column: context switches per second
sar -w 1          # cswch/s column
perf stat -e context-switches ./program
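
The system-wide rate can also be derived by hand from the cumulative `ctxt` line in /proc/stat, sampled twice. This is a minimal sketch with illustrative function names.

```python
import time


def parse_total_switches(stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def switches_per_second(interval=1.0, path="/proc/stat"):
    """Sample the counter twice and return the average rate over `interval` seconds."""
    with open(path) as f:
        before = parse_total_switches(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_total_switches(f.read())
    return (after - before) / interval
```

Calling `switches_per_second(1.0)` on a Linux host gives a figure comparable to the `cs` column above, averaged over the chosen interval.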

Per-CPU context switch breakdown:

perf stat -a -A -e context-switches,cpu-migrations sleep 5    # -A (--no-aggr): one row per CPU
# cpu-migrations: task moved from one CPU to another (more expensive than local switch)

Latency tracing with perf sched:

perf sched record -- sleep 5
perf sched latency   # per-task scheduling latency distribution
perf sched timehist  # timeline of context switches

Context Switch Sequence Diagram

CPU timeline:

Time ─────────────────────────────────────────────────────────────►
      Task A        Interrupt/Syscall     Task B       Task A
      (running)          │               (running)    (resumes)
      ───────────────────┼───────────────────────────────────────
                         │
                    ┌────▼────────────────────────────────┐
                    │ 1. Hardware:                         │
                    │    save rip, rsp, rflags to TSS/     │
                    │    kernel stack (as pt_regs)         │
                    │    switch to kernel stack            │
                    │    switch to ring 0                  │
                    ├──────────────────────────────────────┤
                    │ 2. Kernel entry (interrupt/syscall): │
                    │    save remaining pt_regs            │
                    │    handle interrupt / syscall work   │
                    │    call schedule() if needed         │
                    ├──────────────────────────────────────┤
                    │ 3. __schedule():                     │
                    │    pick_next_task() → Task B         │
                    │    context_switch(prev=A, next=B)    │
                    ├──────────────────────────────────────┤
                    │ 4. switch_mm() (if A.mm != B.mm):   │
                    │    write cr3 → B's page tables      │
                    │    (TLB flush or PCID switch)        │
                    ├──────────────────────────────────────┤
                    │ 5. switch_to(A→B):                  │
                    │    push callee-saved regs (A)        │
                    │    swap kernel rsp: A.rsp → B.rsp   │
                    │    pop callee-saved regs (B)         │
                    │    update TSS.sp0 for B              │
                    │    reload fs/gs TLS base for B       │
                    ├──────────────────────────────────────┤
                    │ 6. Return from kernel:               │
                    │    restore pt_regs (B's user regs)  │
                    │    iret/sysret → Task B in ring 3    │
                    └──────────────────────────────────────┘
                                  ↑               ↑
                         Task A saved         Task B resumes
                         here; A's            here; B's
                         kernel stack         kernel stack
                         preserved            preserved

Voluntary vs. Involuntary in Practice

A high nonvoluntary_ctxt_switches rate indicates the task is CPU-bound and is being preempted before it finishes its time slice. This is generally fine: it means the scheduler is sharing the CPU fairly. However, if this causes latency issues:

  • Use SCHED_FIFO or SCHED_RR (real-time scheduling) to avoid preemption by normal-priority tasks
  • Use sched_setaffinity() to pin the task to a dedicated CPU
  • Use the isolcpus kernel parameter to keep the scheduler from placing other tasks on specific CPUs
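
As a sketch of the real-time option, the helper below switches the calling thread to SCHED_FIFO for the duration of one function call and then restores the previous policy. `run_with_fifo` is a hypothetical name. The policy change normally needs CAP_SYS_NICE or root, so the sketch falls back to the current policy when the kernel refuses it.

```python
import os


def run_with_fifo(priority, fn):
    """Run fn() under SCHED_FIFO if permitted; always restore the old policy.

    Returns (used_fifo, result). If the policy change is refused
    (typically EPERM without CAP_SYS_NICE), fn() runs under the
    current policy instead.
    """
    old_policy = os.sched_getscheduler(0)        # 0 = calling thread
    old_param = os.sched_getparam(0)
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        used_fifo = True
    except OSError:
        used_fifo = False
    try:
        return used_fifo, fn()
    finally:
        if used_fifo:
            os.sched_setscheduler(0, old_policy, old_param)
```

A call such as `used, value = run_with_fifo(10, critical_section)` keeps the real-time window short. This matters because a SCHED_FIFO task that never blocks can starve everything else on its CPU.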

A high voluntary_ctxt_switches rate with low CPU utilization indicates excessive blocking (waiting on locks, I/O, or nanosleep). This is often intentional but can indicate lock contention or excessive small-granularity I/O.
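
The two patterns are easy to reproduce. The sketch below uses getrusage(), whose `ru_nvcsw` and `ru_nivcsw` fields report voluntary and involuntary switches for the process, to compare a task that keeps sleeping with one that spins. `switch_delta`, `sleeper`, and `spinner` are illustrative names.

```python
import resource
import time


def switch_delta(fn):
    """Run fn() and return the (voluntary, involuntary) switches it incurred."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    fn()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_nvcsw - before.ru_nvcsw,
            after.ru_nivcsw - before.ru_nivcsw)


def sleeper():
    for _ in range(50):
        time.sleep(0.001)        # each sleep blocks: a voluntary switch


def spinner():
    deadline = time.monotonic() + 0.2
    while time.monotonic() < deadline:
        pass                     # never blocks: only preemption can switch it out


print("sleeper:", switch_delta(sleeper))
print("spinner:", switch_delta(spinner))
```

On an otherwise idle machine the sleeper typically shows voluntary switches dominating, and the spinner shows few or none. Involuntary counts depend on what else is competing for the CPU, so exact numbers vary from run to run.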


Historical Context

The first Unix systems performed process switches by swapping the entire process image to disk and loading another (literal swapping). Early PDP-11 Unix had a tiny memory (64 KB address space limit) and context switches were expensive by modern standards because of the swap.

The shift to paging and demand-load (BSD 4.0, 1980) meant context switches no longer required full memory copies but introduced TLB flush overhead. The x86 architecture's Task State Segment (TSS) mechanism was designed for hardware-assisted context switching (x86 IRET to a new TSS), but Linux (like most OSes) ignores hardware task switching and does software context switching via switch_to() — hardware switching was found to be slower and less flexible.

PCIDs for TLB tagging were introduced in Intel Westmere (2010), but Linux did not use them until 4.14 (late 2017), shortly before the KPTI Meltdown mitigation in 4.15 made frequent cr3 switches unavoidable.


Production Examples

Finding high context-switch processes:

pidstat -w 1 10        # per-process context switch rate (sysstat package)
# or
while true; do
  for f in /proc/[0-9]*/status; do
    p="${f#/proc/}"; p="${p%/status}"        # numeric PID from the path
    awk -v pid="$p" '/ctxt_switches/{print pid, $0}' "$f" 2>/dev/null
  done | sort -k3 -rn | head -10             # counts are cumulative since task start
  sleep 1
done
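
The loop above ranks tasks by cumulative counts, which favors long-running processes. Ranking by rate needs two snapshots and a subtraction. This is a minimal sketch with illustrative function names.

```python
import glob
import time


def snapshot():
    """Map PID -> total (voluntary + nonvoluntary) switches for readable tasks."""
    totals = {}
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                text = f.read()
        except OSError:          # task exited or is not readable
            continue
        pid = int(path.split("/")[2])
        totals[pid] = sum(int(line.split(":")[1])
                          for line in text.splitlines()
                          if "ctxt_switches" in line)
    return totals


def switch_rates(before, after, interval):
    """Per-PID switches/second for PIDs present in both snapshots, highest first."""
    rates = {pid: (after[pid] - before[pid]) / interval
             for pid in after if pid in before}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def top_switchers(interval=1.0, limit=10):
    """Take two snapshots `interval` seconds apart and return the busiest PIDs."""
    before = snapshot()
    time.sleep(interval)
    return switch_rates(before, snapshot(), interval)[:limit]
```

`top_switchers()` returns a list of `(pid, switches_per_second)` pairs. Tasks that start or exit between the two snapshots are skipped.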

Measuring raw context switch latency:

# lmbench lat_ctx benchmark (-s takes the per-process working set in KB):
lat_ctx -s 0 2           # 0 KB working set, 2 processes: ~1–3 µs typically
lat_ctx -s 64 2          # 64 KB working set (larger than a typical L1d): ~2–5 µs
lat_ctx -s 256 2         # 256 KB working set (L2-sized): ~5–15 µs
lat_ctx -s 16384 2       # 16 MB working set (L3 thrash): ~50–200 µs
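
A rough user-space equivalent of the zero-working-set case is a pipe ping-pong between two processes. The sketch below is an approximation with an illustrative function name. Python's interpreter overhead inflates the result, and unless both processes are pinned to the same CPU (for example by running the script under `taskset -c 0`) the figure reflects cross-CPU wakeups rather than switches on one core.

```python
import os
import time


def pingpong_ns(round_trips=10_000):
    """Estimate one-way handoff latency (ns) between parent and child via pipes."""
    p2c_r, p2c_w = os.pipe()     # parent -> child
    c2p_r, c2p_w = os.pipe()     # child -> parent
    pid = os.fork()
    if pid == 0:
        # Child: echo one byte back for every byte received, exit on EOF.
        os.close(p2c_w)
        os.close(c2p_r)
        try:
            while os.read(p2c_r, 1):
                os.write(c2p_w, b"x")
        finally:
            os._exit(0)
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter_ns()
    for _ in range(round_trips):
        os.write(p2c_w, b"x")    # wake the child
        os.read(c2p_r, 1)        # block until it answers
    elapsed = time.perf_counter_ns() - start
    os.close(p2c_w)              # EOF tells the child to exit
    os.close(c2p_r)
    os.waitpid(pid, 0)
    return elapsed / (2 * round_trips)
```

Each round trip involves two handoffs, hence the division by `2 * round_trips`. The result is best treated as an upper bound on switch cost, since it also includes the pipe read and write syscalls.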

Pinning a latency-sensitive process:

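# Run on CPU 3 only. This pins the task; keeping other tasks off CPU 3 needs isolcpus or cpusets:
taskset -c 3 ./latency-sensitive-app
# Or from within the process (C; _GNU_SOURCE must be defined before the includes):
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
cpu_set_t cpus; CPU_ZERO(&cpus); CPU_SET(3, &cpus);
if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0)   /* pid 0 = calling thread */
    perror("sched_setaffinity");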

Debugging Notes

  • perf sched latency is the gold standard for diagnosing scheduling latency. It measures the time from when a task became runnable to when it actually started running (run queue latency), which includes context switch overhead and scheduling policy effects.
  • High nonvoluntary count without CPU pressure: this can indicate a real-time task at higher priority preempting the measured task repeatedly. Use perf sched timehist to see which tasks are causing preemptions.
  • cpu-migrations in perf stat: a task migrated between CPUs invalidates its L1/L2 cache lines on the source CPU (cache coherence protocol). If cpu-migrations are high, use sched_setaffinity to pin the task.
  • schedule_hrtimeout vs nanosleep: both cause voluntary context switches. schedule_hrtimeout is used by the kernel internally for sub-jiffy sleeps; from user space, clock_nanosleep(CLOCK_MONOTONIC, ...) is the high-resolution path.
  • /proc/PID/wchan: for tasks in voluntary sleep, this file shows the kernel function where the task is waiting. Typical values include ep_poll (in epoll), do_futex or futex_wait_queue (waiting on a futex), pipe_wait or pipe_read (waiting on a pipe), and sk_wait_data (waiting on socket data); exact symbol names vary by kernel version.
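
Reading the wait channel together with the task state makes it easier to tell a sleeping task from a running one. This is a minimal sketch; `read_wait_channel` is an illustrative name, and it assumes the usual /proc/PID/stat layout in which the state letter follows the parenthesised command name.

```python
def read_wait_channel(pid):
    """Return (state, wchan) for a task, or None if it vanished or is unreadable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # Split after the last ')' so command names containing spaces are safe.
            state = f.read().rsplit(")", 1)[1].split()[0]
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip()
    except OSError:
        return None
    return state, wchan


print(read_wait_channel("self"))
```

For the calling process the state is normally `R` and the wait channel is empty or `0`, because a task that is reading its own entry is running, not waiting.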

Security Implications

  • Spectre variant 2 (CVE-2017-5715) and context switches: Spectre variant 2 exploits branch predictor state that is shared across processes. The kernel mitigates via IBRS (Indirect Branch Restricted Speculation) or IBPB (Indirect Branch Predictor Barrier) on context switches. IBPB costs ~1–4 µs and was enabled by default on many distros, adding significant context switch overhead. Retpoline mitigates without IBPB overhead for kernel code, but user-space cross-process speculation still requires IBPB in high-security deployments.
  • MDS/RIDL mitigations: Microarchitectural Data Sampling vulnerabilities require clearing CPU microarchitectural buffers on context switch. VERW instruction (used by kernel via mds_user_clear on return-to-user) adds ~50 ns per context switch. Can be disabled (mds=off) in environments where cross-process trust allows.
  • KPTI overhead: the extra cr3 switches introduced by KPTI add ~1–3 µs per syscall on Meltdown-vulnerable CPUs. AMD CPUs are not affected by Meltdown, so the kernel does not enable KPTI on them by default.
  • Time-of-context-switch side channel: precise measurement of context switch timing can leak information about the scheduler state and thus about other processes' activity. Timing attacks in shared cloud environments exploit this.
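
Which of these mitigations are active on a given machine can be read from sysfs. This is a minimal sketch that lists every entry under /sys/devices/system/cpu/vulnerabilities/; the function name is illustrative, and the status strings are whatever the running kernel reports.

```python
from pathlib import Path

VULN_DIR = "/sys/devices/system/cpu/vulnerabilities"


def read_mitigations(base=VULN_DIR):
    """Map each vulnerability name to the kernel's one-line status string."""
    status = {}
    base_path = Path(base)
    if not base_path.is_dir():
        return status            # directory absent: nothing to report
    for entry in sorted(base_path.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "<unreadable>"
    return status


for name, text in read_mitigations().items():
    print(f"{name:24s} {text}")
```

Comparing this output between a default boot and a boot with mitigations disabled shows which entries change, and therefore which mitigations a measured overhead can be attributed to.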

Performance Implications

  • Thread pools vs. processes: threads of the same process share mm_struct, so thread-to-thread context switches skip the cr3 write. For workloads doing frequent context switches (server with many concurrent connections), threads are significantly cheaper than processes.
  • Reducing context switches with async I/O: io_uring allows batching many I/O operations in a single syscall (or zero syscalls with IORING_SETUP_SQPOLL). A busy-loop kernel thread polls the submission queue, eliminating context switches for I/O dispatch. For high-IOPS storage workloads, this removes tens of thousands of context switches per second.
  • NUMA effects: a context switch that moves a task to another NUMA node loses L1/L2 cache locality and also incurs the remote DRAM access penalty (~100 ns vs. ~40 ns for node-local memory). The NUMA-aware scheduler tries to avoid this but may migrate for load balance. Pin with numactl --cpunodebind=0 --membind=0 for latency-critical tasks.
  • Scheduler tick vs. tickless: CONFIG_HZ_1000 means the scheduler timer fires 1000 times per second, limiting time slices to 1 ms minimum and adding 1000 interrupts/s of base overhead. CONFIG_NO_HZ_FULL (tickless) suppresses ticks on CPUs running a single task, reducing context-switch-related interrupts to near-zero.

Failure Modes

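──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Failure                       Symptom                           Diagnosis
───────────────────────────   ───────────────────────────────   ─────────────────────────────────────────────
Context switch storm          High CPU sys%, many cs in vmstat  pidstat -w; find process switching >10000/s
Scheduling latency spike      Request tail latency p99 >> p50   perf sched latency; check GC/compaction
CPU migration thrash          Poor cache utilization,           perf stat -e cpu-migrations; use affinity
                                high IPC loss
Meltdown/Spectre mitigation   10–30% syscall regression         dmesg | grep -i spectre; benchmark
  overhead                                                        with/without mitigations
NUMA migration latency        Memory access slowdown after      numastat -p PID; pin with numactl
                                migration
Priority inversion            High-priority task waiting for    Check RT_MUTEX and SCHED_FIFO priority chains
                                low-priority
──────────────────────────────────────────────────────────────────────────────────────────────────────────────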

Modern Usage

io_uring SQPOLL: the io_uring kernel thread polls the submission queue without any epoll_wait or read syscall from the application. The application submits I/O requests by writing to a shared ring buffer. This eliminates context switches for I/O submission and often for completion as well.

eBPF scheduler (sched_ext): merged in Linux 6.12, sched_ext allows BPF programs loaded from user space to implement the scheduling policy. The BPF scheduler's dispatch callbacks run in the scheduling path that precedes each context switch. This puts scheduling policy under user-space control while keeping its execution in kernel context for safety.

Google's user-space scheduling delegation (ghOSt): Google published a system in which the Linux scheduler delegates scheduling decisions to user-space agents, allowing microsecond-scale policy control. This effectively makes the context switch path a plugin boundary.


Future Directions

  • Hardware task switching renaissance: proposals for hardware-assisted context switches using Intel's FRED (Flexible Return and Event Delivery) and AMX (Advanced Matrix Extensions) state management may revive interest in hardware-managed context data — though software still provides more flexibility.
  • Larger XSAVE state: as CPUs add more SIMD extensions (AMX tiles = 8 KB of state, AVX-512 = 2 KB), XSAVE save/restore cost grows. Lazy save with fine-grained component tracking (xstate_bv) will become more important, not less.
  • Shadow stacks in context switch: with Intel CET shadow stack (merged in Linux 6.6), switch_to() must now also save/restore the shadow stack pointer (SSP) alongside the regular stack pointer.
  • BPF programs in switch_to() path: ongoing work to allow BPF hooks at key points in context_switch() for custom scheduling telemetry without modifying the kernel.

Exercises

  1. Measure context switch cost directly: write two processes communicating via a pipe (ping-pong benchmark). Each process reads 1 byte, writes 1 byte back. Measure total round-trip time with clock_gettime(CLOCK_MONOTONIC). Divide by 2 to estimate single context switch latency. Compare with lmbench lat_ctx -s 0 2.

  2. Thread vs. process switch cost: extend the ping-pong above to use (a) two processes (different mm) and (b) two threads in the same process (same mm). Compare context switch latency. Use perf stat -e context-switches to verify counts (cs is an alias for the same event).

  3. TLB effect measurement: write a program that allocates a working set of size W (parameter), forks a second process with a different working set of size W, and does ping-pong context switches. Measure round-trip time as a function of W: 4 KB, 64 KB, 512 KB, 4 MB, 32 MB. Plot the "TLB cliff" where context switch time increases sharply.

  4. FPU lazy save: write two threads. Thread A does no floating-point. Thread B does heavy AVX2 computation. Measure context switch latency between them (a) when B is active, (b) when B is idle. Use perf stat -e fp_comp_ops_exe or CR0.TS probing via a trap handler to check whether a lazy save is triggered at all on your kernel version.

  5. Spectre/KPTI overhead: on an Intel CPU, run a syscall-intensive benchmark (strace -c ./program to count syscalls). Boot once with mitigations=off and once with default mitigations. Compare total syscall time. Identify which mitigation flag in /sys/devices/system/cpu/vulnerabilities/ corresponds to the overhead.


References

  • kernel/sched/core.c: __schedule(), context_switch()
  • arch/x86/kernel/process_64.c: switch_to(), copy_thread()
  • arch/x86/mm/tlb.c: switch_mm_irqs_off(), PCID management
  • arch/x86/kernel/fpu/core.c: switch_fpu_return(), XSAVE/XRSTOR
  • arch/x86/include/asm/switch_to.h: switch_to() macro
  • Love, Linux Kernel Development, 3rd ed. — Chapter 4 (Process Scheduling), context switch discussion
  • Bovet & Cesati, Understanding the Linux Kernel, 3rd ed. — Chapter 3 (hardware context, switch_to implementation)
  • lmbench: http://www.bitmover.com/lmbench/ — lat_ctx for context switch measurement
  • "An analysis of Linux scalability to many cores" (Clements et al., OSDI 2012)
  • "The Spectre in the Machine: Spectre/Meltdown and Linux" — LWN series (2018)
  • Intel SDM, Vol. 3A: Chapter 9 (Processor Management), Chapter 14 (XSAVE)
  • man 7 sched — scheduling policies and priority overview
  • man 2 sched_setaffinity, man 2 sched_yield, man 2 sched_setscheduler
  • perf-sched tutorial: man perf-sched