Classic Scheduling Algorithms
Technical Overview
Before examining Linux's production scheduler, understanding the classic algorithms is essential. They form the theoretical foundation on which all modern schedulers build. Each algorithm makes a different tradeoff between simplicity, optimality, and practicality. Many are provably optimal for specific objectives under specific assumptions — and catastrophically bad when those assumptions break.
This file traces the evolution from trivially simple queuing (FIFO) through adaptive feedback-driven policies (MLFQ) and probabilistic approaches (lottery scheduling). The patterns of thought — burst prediction, feedback loops, priority aging — recur throughout modern scheduler design.
Prerequisites
- Scheduling fundamentals (01-scheduling-fundamentals.md)
- Understanding of process states and context switching
- Basic probability for lottery scheduling section
- Queuing theory concepts (optional but helpful)
FIFO / FCFS (First-Come, First-Served)
The simplest possible policy: process requests are served in the order they arrive. No priorities, no preemption, no prediction.
Mechanics: A single FIFO queue of runnable tasks. When the CPU is free, dequeue the head task and run it to completion (or until it blocks).
The Convoy Effect: The critical failure mode. If a long CPU-bound job arrives first, all subsequent jobs — even very short ones — queue behind it.
Arrival order: Job A (burst=100ms), Job B (burst=5ms), Job C (burst=5ms)
Gantt Chart (FCFS):
Time: 0 100 105 110
|─────────A───────|─B──|─C──|
Turnaround times:
A: 100ms
B: 105ms (waited 100ms for A)
C: 110ms (waited 105ms)
Average turnaround: 105ms
Compare to optimal ordering (SJF):
Time: 0 5 10 110
|─B─|─C─|────A────|
Turnaround times:
B: 5ms
C: 10ms
A: 110ms
Average turnaround: 41.7ms ← 2.5x better
Convoy effect is common in practice: A slow database query holding a connection pool slot, a large file read blocking an I/O queue, a fat HTTP request on a single-threaded server — all are FCFS convoys. Solutions: preemption, priority queues, or workload segregation.
When FCFS is acceptable: Batch processing systems where all jobs have similar length. Job submission queues (print queues, render farms) where starvation is acceptable and simplicity is valued. Also used internally within same-priority bands of more sophisticated schedulers.
SJF/SRTF: Shortest Job First / Shortest Remaining Time First
SJF (non-preemptive): When the CPU becomes free, select the job with the shortest estimated next CPU burst. Provably optimal for minimizing average turnaround time in a batch (non-preemptive) model.
SRTF (preemptive SJF): When a new job arrives, if its estimated burst is shorter than the remaining burst of the current job, preempt the current job. Optimal for minimizing average turnaround time including preemption.
The Prediction Problem: SJF requires knowing job burst length in advance. For offline batch scheduling, this is possible (operator provides estimates). For online scheduling, we must predict.
Exponential Averaging for Burst Prediction:
The scheduler maintains an estimate τ of the next burst length, updated after each observed burst t:
τ(n+1) = α · t(n) + (1 - α) · τ(n)
Where:
t(n) = actual last CPU burst duration
τ(n) = previous estimate
α = smoothing factor (typically 0.5)
τ(n+1) = new estimate
With α = 0.5:
τ(n+1) = 0.5 · t(n) + 0.5 · τ(n)
= 0.5 · t(n) + 0.25 · t(n-1) + 0.125 · t(n-2) + ...
Geometric decay: recent history weighted more heavily.
This is the same formula used in TCP RTT estimation (with different α) and in CFS's load tracking (PELT, Per-Entity Load Tracking uses exponential decay over 32ms windows).
Starvation Risk: SJF can starve long jobs indefinitely if short jobs keep arriving. In a system with continuous short job arrivals, a 10-minute job may never run. Mitigated by aging: progressively increase the effective priority (or decrease the effective burst estimate) of waiting tasks.
Starvation example:
Continuous arrivals of 5ms jobs: J1, J2, J3, J4, ...
Long job L (burst = 1000ms) arrives at t=0
SJF schedule:
J1(5ms) J2(5ms) J3(5ms) ... → L never runs
With aging (priority boost after 200ms of waiting):
J1 J2 J3 ... [L's priority boosted] ... L finally runs
Round Robin (RR)
The canonical preemptive fair-sharing algorithm. Each task gets a fixed time quantum q. After q milliseconds, if the task has not blocked, it is preempted and moved to the tail of the ready queue.
Mechanics: 1. Maintain a circular FIFO queue of runnable tasks 2. Pick the head task, run for at most q time units 3. If task blocks before quantum expires: remove from queue (re-enqueue when unblocked) 4. If quantum expires: preempt, move to tail of queue, pick next head
The Quantum Dilemma:
Quantum too small (q = 1ms, context switch cost = 0.1ms):
Overhead = 0.1 / (1 + 0.1) ≈ 9% just on context switches
Each task appears to get ~1ms slice: high responsiveness
But thrashing if tasks have 100ms bursts: 100 preemptions per job
Quantum too large (q = 1000ms):
Essentially FCFS: short jobs wait up to 1 second
Low overhead but poor response time
Sweet spot (q = 4-100ms):
Linux time slice: historically 100ms, now 4-100ms adaptive
Response time acceptable for interactive use
Context switch overhead ≈ 0.1-1%
RR Gantt Chart (q = 4ms, 3 tasks A=12ms, B=8ms, C=4ms):
Time: 0 4 8 12 16 20 24
|─A─|─B─|─C─|─A─|─B─|─A─|
Turnaround times:
C: 12ms (ran 4ms in each of first 3 quanta: t=8-12 completes)
B: 20ms
A: 24ms
Average: 18.7ms
Note: No convoy effect. Interactive response is at most q=4ms away.
RR and I/O-bound tasks: RR is naturally good for I/O-bound tasks because they voluntarily block before their quantum expires, giving back CPU time. They don't "waste" their quantum. CPU-bound tasks always use their full quantum.
RR and cache behavior: With many tasks and a small quantum, each task runs for only 4ms before a context switch. On return, its L1 cache is cold. This is the fundamental tension between responsiveness and efficiency in round-robin systems.
Multi-Level Queue (MLQ)
Partition tasks into multiple queues, each with its own scheduling algorithm and priority level. Queues have fixed priorities relative to each other.
Classic two-queue system (interactive vs batch):
┌─────────────────────────────────────────┐
│ Foreground Queue (Interactive) │ Priority: HIGH
│ Algorithm: Round Robin (small quantum) │
│ Tasks: shells, editors, browsers │
└─────────────────────────────────────────┘
│ (only if foreground queue empty)
▼
┌─────────────────────────────────────────┐
│ Background Queue (Batch) │ Priority: LOW
│ Algorithm: FCFS │
│ Tasks: compilers, simulations, backups │
└─────────────────────────────────────────┘
Problems with MLQ: Rigid. Once a task is classified into a queue, it stays there. A background compilation that the user is now actively waiting on still runs at low priority. Starvation of background queue if foreground is always busy. Tasks must be manually classified — how does the OS know which queue?
MLQ is useful when workload types are clearly distinct and externally classified. It's used in network QoS (different queues for VoIP, video streaming, best-effort data) and in OS priority levels for system vs user processes.
MLFQ: Multi-Level Feedback Queue
The most influential classical algorithm. Solves MLFQ's rigidity by using task behavior to dynamically adjust priority. Described by Corbató (1962) for CTSS and formalized by subsequent work.
Rules: 1. If Priority(A) > Priority(B): A runs, B does not 2. If Priority(A) = Priority(B): A and B run in round robin 3. New task enters at highest priority queue 4. If task uses full quantum without blocking: demoted to next lower queue 5. If task blocks before quantum expires: stays at current priority (or is promoted) 6. Periodically: boost all tasks back to highest priority (prevents starvation)
MLFQ Queues (example with 4 levels):
Q0 (highest, q=8ms): [new tasks enter here]
Q1 (q=16ms): [tasks that used full Q0 quantum]
Q2 (q=32ms): [tasks that used full Q1 quantum]
Q3 (lowest, FCFS): [long-running CPU-bound jobs settle here]
Feedback:
Task blocks before quantum → stays in current queue (or promoted)
Task uses full quantum → demoted one level
Periodic boost (every 1s) → all tasks move to Q0
MLFQ approximates SJF without needing prior knowledge: short tasks stay at the top (they complete before using their quantum), long tasks sink to the bottom (they repeatedly use their full quantum and get demoted).
MLFQ Gantt Chart (single CPU, q=4ms at Q0, q=8ms at Q1):
Tasks: A (CPU-bound, 20ms), B (interactive, 3ms bursts with I/O gaps)
Time: 0 4 8 12 16 20 24
Queue: Q0 Q1 Q0 Q1 Q2
A A B A A
(A used full quantum: demoted to Q1)
(B arrived, preempts A in Q0, blocks after 3ms: stays Q0)
(A resumes in Q1, uses full quantum: demoted to Q2)
Result: B (interactive) gets low latency despite sharing CPU with A
MLFQ Starvation and the Gaming Problem:
- Starvation: If Q0 is always busy, Q3 tasks never run. Periodic priority boost (rule 6) prevents this.
- Gaming: A smart task can issue a dummy I/O just before its quantum expires, staying at high priority forever. BSD's scheduler was famously gameable this way. Mitigation: track total CPU time at current level, not just per-slice.
Historical implementations:
- BSD Unix (4.2BSD, 1983): Used a multilevel feedback queue with 32 priority bands. Priority decayed as a function of recent CPU usage (using exponential decay similar to burst prediction). Interactive tasks naturally got high priority; CPU hogs sank. This was the dominant Unix scheduler for two decades.
- Windows NT/2000/XP: 32 priority levels (0-31). Levels 16-31 are real-time (fixed). Levels 1-15 are dynamic (user space, OS can boost/decay). A "boost" system gives temporary priority increases to foreground windows and threads that wake from I/O. Similar to MLFQ concepts.
- Solaris: Used a multi-level dispatch table with explicit priority tables configurable per class.
Priority Scheduling with Aging
Pure priority scheduling (always run the highest-priority runnable task) suffers from starvation of low-priority tasks. Aging is the universal cure.
Mechanism: Increase a task's effective priority as it waits. Common implementations:
Simple aging: priority(t) = base_priority + waiting_time / aging_factor
Example (aging_factor = 10ms):
Task L: base_priority=10, waiting 100ms → effective priority = 10 + 10 = 20
Task H: base_priority=50, just arrived → effective priority = 50
H runs first, but L will eventually reach priority 50 after 400ms of waiting
Linux's CFS uses vruntime as an implicit aging mechanism: the longer a task waits, the further its vruntime falls behind the global minimum, and the higher its implicit priority when it next runs.
Priority donation (related concept): When a high-priority task is blocked waiting for a resource held by a low-priority task, temporarily "donate" the high priority to the low-priority task so it can complete quickly and release the resource. This is the essence of priority inheritance. See 08-priority-inversion.md.
Lottery Scheduling
Proposed by Waldspurger and Weihl (1994), lottery scheduling is a probabilistic proportional-share algorithm. Each task is assigned a number of "lottery tickets." At each scheduling decision, a ticket is drawn randomly; the task holding that ticket runs next.
Mechanics:
Task A: 100 tickets (wants 50% CPU share)
Task B: 50 tickets (wants 25% CPU share)
Task C: 50 tickets (wants 25% CPU share)
Total: 200 tickets
Draw a random number r in [1, 200]:
r ∈ [1, 100] → A runs (50% probability)
r ∈ [101, 150] → B runs (25% probability)
r ∈ [151, 200] → C runs (25% probability)
Properties: - Probabilistically fair: Over time, each task gets CPU proportional to its tickets. - No starvation: Every task has a nonzero probability of winning each draw. - Simple hierarchical control: Groups can hold tickets and distribute them internally. A group with 100 tickets gives 100% of its share to its internal tasks. - Ticket inflation: A task in a group can create new tickets for itself, stealing share from siblings. Requires trust or enforcement.
Ticket currency and inflation:
Organization structure:
User Alice: 100 tickets (50% of system)
Process A1: 500 Alice-tickets → 500/1000 = 50% of Alice's share = 25% system
Process A2: 500 Alice-tickets → 25% system
User Bob: 100 tickets (50% of system)
Process B1: 100 Bob-tickets → 100% of Bob's share = 50% system
Each user can issue as many tickets to their own processes as they want
(currency inflation within their allocation) without affecting other users' shares.
Limitations: - High variance at short timescales. A task with 1% of tickets may go 100+ quanta without running (though probability decreases exponentially). - Random draws are wasteful to compute compared to deterministic ordering. - No support for task priorities within a group without explicit ticket allocation. - Rarely used in production kernels due to variance and overhead.
Stride scheduling (related): Deterministic alternative to lottery. Each task has a stride inversely proportional to its tickets. A global pass counter advances; the task with the minimum pass value runs next. Identical long-run proportionality to lottery but zero variance. Used in Xen's credit scheduler for CPU allocation to VMs.
Comparing Algorithms: Key Metrics
Algorithm | Throughput | Response | Fairness | Starvation | Complexity
────────────────────────────────────────────────────────────────────────
FCFS | High | Poor | Poor | No* | O(1)
SJF | Optimal | Poor | Poor | Yes | O(n) est.
SRTF | Optimal | Good | Poor | Yes | O(n) est.
RR | Medium | Good | Good | No | O(1)
MLFQ | Good | Good | Good | With boost | O(levels)
Priority+ | Variable | Variable | Good | No | O(log n)
Lottery | Variable | Good | Good | No | O(n)
* FCFS has no starvation but severe convoy delays for short jobs
Debugging Notes
- Identifying MLFQ-like behavior in Linux:
/proc/[pid]/schedshowsprioandse.sum_exec_runtime. Highnr_involuntary_switchesindicates frequent preemption (task is CPU-bound and being quantum-limited). - Simulating classic algorithms: The
cgroupsCPU scheduler can approximate some of these. SCHED_FIFO with varying priorities approximates strict priority queues. schedtool: Can set scheduling policy and priority for running processes.schedtool -R -p 50 [pid]sets SCHED_RR at priority 50.- Detecting starvation: If
getdelays(from Linux kernel tools/accounting) shows high CPU delay for a task, it is runnable but not getting scheduled — a starvation signal. - Quantum visualization:
perf sched timehistshows the timeline of task scheduling events, revealing quantum lengths and preemption patterns.
Security Implications
- MLFQ gaming in priority systems: An adversarial process that knows the scheduling algorithm can manipulate its burst behavior to stay in high-priority queues, gaining more CPU than its fair share. Linux mitigates this in CFS by tracking total CPU usage, not just per-interval behavior.
- Lottery ticket hoarding: In multi-user lottery systems, a user who accumulates many tickets dominates. Requires kernel enforcement of ticket caps per user.
- Priority inversion in priority scheduling: A fundamental security and reliability concern, especially in RTOS and safety-critical systems. See 08-priority-inversion.md.
Performance Implications
- Round Robin quantum tuning: Linux's
kernel/sched/fair.ccomputes the effective quantum assched_latency_ns / nr_running, bounded bysched_min_granularity_ns. This adaptive quantum keeps scheduling latency bounded while reducing context switch overhead when few tasks compete. - MLFQ queue count and quantum doubling: BSD used 32 levels with priority recalculated every second. The recalculation itself was O(n tasks) and showed up in profiling on loaded systems. Linux CFS avoids this with the rb-tree, which updates in O(log n).
- Context switch frequency is the primary overhead: MLFQ with many levels and short top-level quanta can cause more context switches than pure RR. Profile with
vmstat 1cs column (context switches per second).
Modern Usage
Classic algorithms live on in modern systems in layered form:
- Network packet scheduling: Linux
tc(traffic control) implements FIFO, SFQ (Stochastic Fair Queuing — a variant of RR for network flows), FQ-CoDel (Fair Queuing with Controlled Delay), and HFSC (Hierarchical Fair Service Curves). These are essentially MLFQ and lottery scheduling for network packets. - Disk I/O scheduling: The Linux block layer has
mq-deadline(priority + deadline),kyber(latency-targeted), andbfq(Budget Fair Queuing — lottery/proportional share for I/O). See storage systems section. - Thread pools in databases: Connection schedulers in PostgreSQL, MySQL, and Oracle use FIFO queues within priority classes, with aging to prevent long query starvation.
- Job schedulers (Kubernetes, YARN, Slurm): Cluster-level job scheduling uses MLFQ, fair-share algorithms, and deadline scheduling for batch and real-time workloads.
Future Directions
- Learning-based burst prediction: ML models can predict CPU burst length and I/O behavior from historical data, potentially outperforming exponential averaging. Research systems (e.g., Paragon, Quasar from Cornell) use offline profiling to enable SJF-like scheduling in cloud systems.
- BPF-extensible scheduling (sched_ext): Allows custom scheduling policies in BPF (Linux 6.12), bringing academic algorithm experimentation to production kernels without patching.
- Work-stealing deques: Parallel workloads (Go runtime, Tokio async runtime, Java ForkJoinPool) use local FIFO + work-stealing LIFO deques per thread — a modern evolution of MLFQ ideas.
Exercises
- Gantt chart practice: Given arrivals [A:0ms/20ms burst, B:5ms/3ms burst, C:8ms/3ms burst], compute average turnaround time for FCFS, SJF (non-preemptive), SRTF, and RR (q=4ms). Which algorithm is best for average turnaround? Which for response time?
- Exponential averaging: Implement burst prediction in Python with α=0.5 and α=0.125. Given observed bursts [6, 4, 6, 4, 13, 13, 13], plot the estimates. How quickly does each α adapt to the regime change at burst 5?
- MLFQ simulator: Write a simple MLFQ simulator with 3 queues (q=2ms, 4ms, 8ms) and a boost every 50ms. Run it with 5 tasks: 2 interactive (3ms bursts, 10ms I/O) and 3 CPU-bound (100ms bursts). Measure average response time and CPU utilization.
- Convoy detection: Using
strace -T -e trace=read,writeon a single-threaded server, identify cases where one slow syscall delays subsequent requests. This is an application-level FCFS convoy. - Lottery scheduling: Implement stride scheduling in Python. Assign strides inversely proportional to weights. Verify that over 1000 scheduling decisions, each task gets CPU proportional to its weight within 1%.
References
- Corbató, F.J., et al., "An Experimental Time-Sharing System", AFIPS 1962 — original MLFQ description
- Waldspurger, C.A. and Weihl, W.E., "Lottery Scheduling: Flexible Proportional-Share Resource Management", OSDI 1994
- Waldspurger, C.A., "Stride Scheduling: Deterministic Proportional-Share Resource Management", MIT Technical Report, 1995
- Arpaci-Dusseau, "Operating Systems: Three Easy Pieces", Chapters 7-9
- McKenney, P., "Selecting the right CPU scheduler", IBM developerWorks, 2008
- Aas, J., "Understanding the Linux 2.6.8.1 CPU Scheduler", Silicon Graphics, 2005
- Silberschatz, Galvin, Gagne, "Operating System Concepts", 10th ed., Chapter 5