Skip to content

Section 09: Scheduling — Overview

Section Purpose and Scope

The scheduler is the executive function of the operating system. It decides which runnable task runs next, on which CPU, for how long, and under what conditions it is preempted. Every latency measurement, every throughput benchmark, and every fairness guarantee in a running system is ultimately a product of scheduling decisions made thousands of times per second.

This section provides a comprehensive treatment of scheduling theory and practice: the formal goals and metrics, the classic algorithms and their theoretical properties, the Linux Completely Fair Scheduler (CFS) in implementation detail, the Windows scheduler's priority model, real-time scheduling theory (EDF, RMS), NUMA-aware scheduling, gang scheduling for HPC, work-stealing, and the kernel data structures and code paths that implement scheduling in Linux. Priority inversion — a subtle and historically catastrophic bug — receives dedicated treatment.


Prerequisites

  • Section 06 (CPU Architecture): CPU topology, NUMA, SMT — the hardware the scheduler manages
  • Section 07 (Process Management): task_struct, process states, runqueues
  • Section 08 (Threading Models): threads as the units of scheduling

Learning Objectives

After completing this section you will be able to:

  1. State the four fundamental scheduling goals and explain how they conflict
  2. Describe FIFO, SJF, STCF, Round Robin, MLFQ, and CFS algorithms with their complexity and fairness properties
  3. Explain how the Linux CFS works: red-black tree, virtual runtime, weight, and the sched_entity
  4. Describe the Linux real-time scheduler (SCHED_FIFO, SCHED_RR) and explain when to use it
  5. Explain EDF and Rate Monotonic Scheduling for hard real-time systems
  6. Describe NUMA-aware scheduling and the cost of remote memory access on scheduling decisions
  7. Explain priority inversion, the Mars Pathfinder incident, and priority inheritance/ceiling protocols
  8. Trace a scheduling decision through Linux kernel source: schedule(), pick_next_task(), context switch
  9. Explain the scheduler implementation details: run queue, load balancing, tickless kernel

Architecture Overview

LINUX SCHEDULER ARCHITECTURE (Linux 6.x)

  ┌─────────────────────────────────────────────────────────────┐
  │                    schedule() function                      │
  │                                                             │
  │  1. Check need_resched flag                                 │
  │  2. Disable preemption                                      │
  │  3. Pick next task via scheduler class hierarchy:           │
  │                                                             │
  │  stop_sched_class   ← SCHED_STOP (highest priority)        │
  │       ▼ (if empty)                                          │
  │  dl_sched_class     ← SCHED_DEADLINE (EDF)                 │
  │       ▼ (if empty)                                          │
  │  rt_sched_class     ← SCHED_FIFO / SCHED_RR               │
  │       ▼ (if empty)                                          │
  │  fair_sched_class   ← SCHED_NORMAL / SCHED_BATCH (CFS)     │
  │       ▼ (if empty)                                          │
  │  idle_sched_class   ← SCHED_IDLE / swapper                 │
  │                                                             │
  │  4. Perform context switch (switch_to macro)                │
  └─────────────────────────────────────────────────────────────┘

CFS RUN QUEUE (per CPU):

  cfs_rq
  ├── min_vruntime = 1000ns  (tracks leftmost node)
  ├── nr_running = 5
  └── tasks_timeline (red-black tree, keyed by vruntime):

       [1020ns]
      /         \
  [1010ns]    [1030ns]
   /    \      /    \
[1005] [1015][1025] [1040]
    ▲
    └── next to run (minimum vruntime)

  Virtual Runtime formula:
  vruntime += delta_exec × (NICE_0_LOAD / se.load.weight)
  Higher weight (nicer = lower number) → vruntime grows slower → runs more often


CONTEXT SWITCH SEQUENCE:

  Currently running: Task A        Next task: Task B
  ─────────────────────────        ─────────────────

  save_context(A):
    push rsp, rbp, rbx, r12–r15 onto kernel stack A
    save FPU/SSE state (lazy)
    save %fs (TLS pointer)

  load_context(B):
    restore %fs (B's TLS)
    restore FPU/SSE state (lazy)
    pop r15–r12, rbx, rbp, rsp from kernel stack B
    ret → resumes B at its last kernel return address

  CR3 switch (if A and B in different processes):
    mov cr3, B.mm.pgd  ← TLB flush (or PCID if supported)


NUMA SCHEDULING:

  ┌──────────────────────┐       ┌──────────────────────┐
  │    Node 0            │       │    Node 1            │
  │  CPU0 CPU1 CPU2 CPU3 │       │  CPU4 CPU5 CPU6 CPU7 │
  │  ┌──────────────┐   │       │  ┌──────────────┐   │
  │  │  runqueue    │   │       │  │  runqueue    │   │
  │  │  G: [T1,T2] │   │  QPI  │  │  G: [T3,T4] │   │
  │  └──────────────┘   │◄─────►│  └──────────────┘   │
  │  Local DRAM          │       │  Local DRAM          │
  │  (T1's memory here) │       │                      │
  └──────────────────────┘       └──────────────────────┘

  NUMA Balancing: kernel periodically unmaps pages (PROT_NONE)
  to trigger page faults → measures access patterns → migrates
  pages and tasks to co-locate working set with CPU

Key Concepts

  • Scheduling Goals: (1) Fairness — each process gets its entitled share; (2) Efficiency — CPU is never idle when work is available; (3) Throughput — maximize completed jobs per unit time; (4) Latency / Response Time — minimize time from ready to first execution. These goals fundamentally conflict: maximizing throughput favors long-running batch jobs; minimizing latency favors short-interactive jobs.
  • Preemptive Scheduling: The scheduler can forcibly remove a running task from the CPU (on a timer interrupt or at scheduling points). Necessary for latency guarantees and fairness. Modern general-purpose OSes are fully preemptive.
  • Cooperative Scheduling: Tasks voluntarily yield the CPU. Simpler but vulnerable to a single non-yielding task starving all others. Used in early Windows (3.x), many embedded RTOSes, and async event loops.
  • FIFO / FCFS: First job in the ready queue runs to completion. Simple. Convoy effect: one long job delays many short ones. Average waiting time can be terrible.
  • Shortest Job First (SJF) / STCF: Optimal for minimizing average turnaround time but requires knowing future job lengths (impossible in general). Starvation of long jobs is a problem.
  • Round Robin (RR): Each task runs for a fixed time quantum, then is preempted and placed at the back of the queue. Quantum too small → excessive context switch overhead; too large → degrades to FIFO. Typical quantum: 1–10ms.
  • Multi-Level Feedback Queue (MLFQ): Multiple priority queues. New tasks enter the highest-priority queue. If a task uses its entire quantum, it's demoted. If it yields early (I/O bound), it stays/rises. Approximates SJF without knowing job lengths. Used in many real-world schedulers as the conceptual basis.
  • CFS (Completely Fair Scheduler): Linux's default scheduler since 2.6.23. Allocates CPU time proportional to task weight (derived from nice value). Tracks virtual runtime (vruntime) per task — time the task has run, normalized by weight. Always runs the task with the lowest vruntime (leftmost node in a red-black tree). Achieves O(log n) scheduling with perfect fairness over time.
  • sched_entity: The scheduling entity embedded in task_struct (and task_group for group scheduling). Contains vruntime, sum_exec_runtime, load weight, and a red-black tree node.
  • SCHED_DEADLINE: Linux's EDF-based scheduler (Linux 3.14). Tasks specify period, runtime, and deadline. The kernel admits tasks only if the system can guarantee all deadlines (admission control). Highest priority after stop tasks.
  • Real-Time Scheduling (SCHED_FIFO / SCHED_RR): Fixed-priority preemptive scheduling. SCHED_FIFO: runs until it blocks or a higher-priority task becomes runnable. SCHED_RR: like SCHED_FIFO but with a time quantum among equal-priority tasks. Starvation of lower-priority tasks is by design.
  • EDF (Earliest Deadline First): Optimal real-time algorithm for preemptive uniprocessor: always run the task whose deadline is soonest. Optimal in the sense that if any algorithm can meet all deadlines, EDF can.
  • Rate Monotonic Scheduling (RMS): Assigns fixed priorities based on task period — shorter period → higher priority. Optimal among fixed-priority preemptive algorithms. CPU utilization bound: ln(2) ≈ 69.3% for n tasks.
  • Load Balancing: In SMP systems, the scheduler must distribute tasks across CPUs. Linux uses a hierarchical domain structure (SMT → core → NUMA node → system) with periodic and triggered load balancing. Imbalanced load wastes CPUs; overly aggressive migration hurts cache affinity.
  • NUMA-Aware Scheduling: Awareness that migrating a task to a remote NUMA node is costly if its memory is local. Linux NUMA balancing (kernel 3.8+) uses page fault trapping to detect remote accesses and proactively migrates tasks and memory to the same node.
  • Gang Scheduling: Scheduling multiple cooperating threads or processes on different CPUs simultaneously — critical for tightly coupled parallel workloads (MPI jobs, GPU+CPU pipelines). Avoids spin-waiting for a partner thread that isn't scheduled.
  • Work Stealing: Each CPU maintains a local run queue. Idle CPUs steal tasks from the tail of busy CPUs' queues. Minimizes synchronization; used in Go's scheduler, Tokio, and Intel TBB.
  • Priority Inversion: A high-priority task is blocked waiting for a resource held by a low-priority task, which is preempted by medium-priority tasks. Effectively, the high-priority task runs at the priority of the low-priority task.
  • Priority Inheritance: When a low-priority task holds a resource needed by a high-priority task, the low-priority task temporarily inherits the higher priority until it releases the resource. POSIX PTHREAD_PRIO_INHERIT mutex protocol.
  • Priority Ceiling Protocol: Each mutex has a ceiling = the highest priority of any task that will ever lock it. A task that acquires the mutex inherits the ceiling priority, preventing priority inversion without needing to know the actual waiting tasks.

The Mars Pathfinder Priority Inversion (1997)

A canonical real-world case study:

  Task Priorities:
  High:   Meteorological data collection (bc_dist)
  Medium: Communication task
  Low:    Information bus management (bc_sched) — holds a shared mutex

  Scenario:
  1. bc_sched (Low) acquires mutex, updates shared data
  2. bc_dist (High) becomes runnable, preempts bc_sched
  3. bc_dist tries to acquire mutex → BLOCKED (held by bc_sched)
  4. Communication task (Medium) preempts bc_sched!
  5. bc_sched cannot run → mutex held → bc_dist blocked indefinitely
  6. Watchdog timer detects bc_dist not running → system RESET

  Fix: Enable priority inheritance in the mutex configuration
  (the fix was uploaded to the Mars lander via a command sequence)

Major Historical Milestones

Year Milestone
1961 CTSS: first preemptive time-sharing scheduler (quantum = 200ms)
1968 Dijkstra: semaphores enable the synchronization that scheduling requires
1970 THE OS: strict layered scheduling — theoretical basis
1973 UNIX: simple round-robin with priorities; /proc not yet invented
1982 Liu & Layland paper: EDF and RMS analysis — foundation of RT scheduling theory
1992 Priority Inversion studied formally; priority inheritance proposed
1994 Linux 1.0: O(n) scheduler — scanned all tasks each reschedule
1996 Linux 2.0: SMP scheduling; global runqueue with spin lock
1997 Mars Pathfinder priority inversion incident
2000 Linux 2.4: improved SMP but still O(n) with global lock
2002 Linux 2.5: O(1) scheduler (Con Kolivas, Ingo Molnar) — O(1) pick_next_task
2007 Linux 2.6.23: CFS replaces O(1) scheduler (Ingo Molnar)
2008 Linux 2.6.24: cgroup-based group scheduling in CFS
2009 Linux 2.6.32: nohz/tickless kernel — no timer ticks on idle CPUs
2012 Linux 3.8: NUMA scheduling and automatic NUMA balancing
2014 Linux 3.14: SCHED_DEADLINE — EDF in mainline Linux
2016 Linux 4.6: Deadline scheduling for groups
2018 Linux 4.20: Energy-aware scheduling for mobile (EAS)
2021 Linux 5.14: Core scheduling for SMT security (same-thread trust)
2023 EEVDF (Earliest Eligible Virtual Deadline First) proposed as CFS successor
2023 Linux 6.6: EEVDF merged — replaces the CFS red-black tree algorithm

Modern Relevance and Production Use Cases

Latency-sensitive services: Databases (PostgreSQL, MySQL), network proxies (Envoy, Nginx), and trading systems use SCHED_FIFO or CPU affinity pinning (taskset, numactl) to guarantee bounded scheduling latency. Understanding CFS priority and latency targets is essential for SLA engineering.

Container resource management: Kubernetes CPU requests/limits map to CFS bandwidth control (cpu.cfs_quota_us, cpu.cfs_period_us). A container throttled by CFS shows high "iowait" in metrics even when doing no I/O — understanding CFS prevents misdiagnosis.

Go runtime tuning: GOMAXPROCS controls the number of Ps in the GMP model. Setting it to NUMA node size instead of total CPU count can reduce cross-NUMA memory traffic. runtime.LockOSThread() pins a goroutine to a kernel thread for real-time latency.

Cloud scheduler efficiency: Hyperscalers (Google, AWS, Meta) invest heavily in custom scheduler patches. Google's Autopilot, Meta's Twine, and AWS's Nitro scheduler all extend or replace CFS for their workload mixes. Understanding CFS limitations is the starting point.

Embedded and automotive: PREEMPT_RT (now in mainline Linux) combined with SCHED_FIFO brings hard real-time guarantees to Linux. Automotive applications (ROS2, Drive PX) rely on carefully configured RT priorities and isolation.


File Map

09-scheduling/
├── 00-overview.md                     ← This file
├── 01-scheduling-goals-metrics.md     ← Throughput, latency, fairness, utilization
├── 02-preemptive-vs-cooperative.md    ← Mechanics, tradeoffs, hybrid approaches
├── 03-classic-algorithms.md           ← FIFO, SJF, STCF, RR, MLFQ with analysis
├── 04-cfs-completely-fair-scheduler.md← vruntime, RB tree, weight, group scheduling
├── 05-linux-rt-scheduler.md           ← SCHED_FIFO, SCHED_RR, priority model
├── 06-sched-deadline.md               ← EDF in Linux, admission control, CBS
├── 07-real-time-theory.md             ← EDF, RMS, utilization bounds, response time
├── 08-smp-load-balancing.md           ← Domain hierarchy, steal triggers, cache affinity
├── 09-numa-aware-scheduling.md        ← NUMA balancing, page migration, numactl
├── 10-gang-scheduling.md              ← Coscheduling, MPI implications, GPU sync
├── 11-work-stealing.md                ← Algorithm, deque, correctness analysis
├── 12-priority-inversion.md           ← Inheritance, ceiling, Mars Pathfinder case study
├── 13-scheduler-implementation.md     ← schedule(), pick_next_task(), context_switch()
├── 14-windows-scheduler.md            ← Priority boost, DFSS, UMS, realtime classes
├── 15-energy-aware-scheduling.md      ← EAS, big.LITTLE, power-performance tradeoffs
├── 16-eevdf.md                        ← EEVDF algorithm, comparison to CFS, Linux 6.6+

Cross-References

  • Section 06 (CPU Architecture): NUMA topology, SMT — the hardware the scheduler manages
  • Section 07 (Process Management): task_struct, sched_entity, process states
  • Section 08 (Threading Models): Go GMP scheduler; Tokio work-stealing; kernel thread scheduling
  • Section 10 (Synchronization): Priority inversion connects scheduling and synchronization
  • Section 20 (Containers): CFS bandwidth control = CPU limits in Kubernetes
  • Section 35 (Real-Time Systems): Full RT scheduling theory and PREEMPT_RT

Essential: Files 01–05, 12–13. Scheduling goals, CFS, and priority inversion are non-negotiable knowledge for production systems work.

Deep dive recommended: Files 06–07 for real-time systems. Files 08–09 for NUMA and high-core-count systems. Files 14–16 for Windows engineers or modern Linux scheduler research.

Hands-on: Use chrt to set a process to SCHED_FIFO. Use perf sched record/report to visualize scheduling decisions. Write a Go program that reveals GOMAXPROCS effects on throughput. Simulate priority inversion with three threads in C.

Estimated study time: 20–25 hours including source code reading and experiments.