01 — Scheduler and Race Condition Failures
Technical Overview
The scheduler is one of the most consequential components in any operating system. It decides which thread runs, for how long, and at what priority. When the scheduler has bugs — or when application-level concurrency assumptions interact badly with scheduling policy — the results range from severe latency degradation to complete system hangs to catastrophic spacecraft resets. This document examines five landmark failures caused by scheduler bugs, priority inversion, and race conditions, each of which produced lasting design changes.
Prerequisites
- Understanding of preemptive vs cooperative scheduling
- Thread priority levels (real-time, normal, idle)
- Mutex semantics and blocking behavior
- Watchdog timer concepts
- Linux scheduler history: O(N) → O(1) → CFS
- Linux RCU (Read-Copy-Update) fundamentals
- cgroups CPU bandwidth control
Historical Context
Scheduling has been a research-active area since the 1960s. The tension between fairness and priority-correctness has never been fully resolved. Real-time systems introduced strict priority requirements that traditional Unix schedulers ignored. The growth of multicore systems exposed new classes of scalability bugs that simply didn't exist on uniprocessors. The story of scheduler failures is the story of implicit assumptions meeting adversarial reality.
Case Study 1: Mars Pathfinder Priority Inversion (1997)
What Happened
On July 4, 1997, the Mars Pathfinder spacecraft successfully landed on Mars — a massive engineering triumph. Within days of landing, the spacecraft began experiencing mysterious full system resets, wiping the lander's in-flight data buffers and forcing science operations to halt. Engineers at JPL observed the resets happening, traced them to a watchdog timer firing, but could not easily reproduce the failure on Earth.
The spacecraft ran VxWorks, a deterministic real-time operating system widely used in embedded and aerospace systems. VxWorks supports priority-based preemptive scheduling with mutex support, and includes optional priority inheritance — a mechanism that was, crucially, not enabled.
Technical Root Cause
The Pathfinder software contained three relevant threads:
Priority      Thread
----------------------------------------------
HIGH (P=3)    Information Bus Management Thread
MEDIUM (P=2)  Communications Thread
LOW (P=1)     Meteorological Data Thread (ASI/MET)
The low-priority meteorological thread (ASI/MET) briefly took a mutex protecting shared data structures on the information bus whenever it published sensor readings. The high-priority bus management thread needed the same mutex to move data through the bus. The communications thread ran at medium priority, did not need the mutex, and could run for long stretches.
Classic priority inversion scenario:
Time →
LOW thread acquires mutex M
LOW thread preempted by MEDIUM thread (M still held by LOW)
HIGH thread becomes runnable, needs mutex M
HIGH thread blocks on M
MEDIUM thread runs (indefinitely — it does not need M)
LOW thread never gets CPU to release M
HIGH thread starved
Watchdog fires: HIGH thread has not executed within deadline
SYSTEM RESET
The watchdog was effectively monitoring the high-priority thread. That thread could not run because the mutex it needed was held by the low-priority thread, which in turn could not run because the medium-priority communications thread kept preempting it. The medium-priority thread acted as an unintentional barrier between the low-priority mutex holder and the high-priority waiter.
This is textbook priority inversion: a high-priority thread's effective priority drops to that of a low-priority thread holding a resource, because intermediate-priority threads can interpose.
The formal definition:
Priority Inversion occurs when:
- Thread H (high priority) is blocked on resource R
- Resource R is held by thread L (low priority)
- Thread M (medium priority) is runnable and does not need R
- M preempts L, keeping L from releasing R
- H is starved by M, despite H > M in priority
VxWorks supports priority inheritance as an optional mutex flag (SEM_INVERSION_SAFE). When enabled, if a low-priority thread holds a mutex that a higher-priority thread is waiting for, the low-priority thread temporarily inherits the higher thread's priority until it releases the mutex. JPL engineers had disabled this feature for performance reasons and because they believed the scenario would not occur.
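The effect of the inheritance flag can be illustrated with a toy tick-based simulation. This is only a sketch under simplifying assumptions: one CPU, strict fixed-priority preemption, and three hypothetical tasks (`L`, `M`, `H`) whose release times and work amounts are invented for illustration. It does not model VxWorks itself.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes runnable
    work: int          # ticks of CPU time still needed
    uses_mutex: bool   # True if all of its work happens while holding mutex M
    done_at: int = -1  # tick at which it finished (-1 = not finished)


def simulate(inherit: bool, m_work: int = 50, horizon: int = 200) -> dict:
    """Run L, M, H on one CPU and return the tick at which each finishes."""
    tasks = [
        Task("L", 1, release=0, work=3, uses_mutex=True),
        Task("M", 2, release=1, work=m_work, uses_mutex=False),
        Task("H", 3, release=2, work=2, uses_mutex=True),
    ]
    owner = None  # task currently holding mutex M
    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.work > 0]
        if not ready:
            break
        # A task that needs M while another task holds it is blocked.
        blocked = [t for t in ready
                   if t.uses_mutex and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective(t):
            # With inheritance, the mutex owner runs at the priority of its
            # most urgent waiter for as long as it holds the mutex.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        current = max(runnable, key=effective)
        if current.uses_mutex and owner is None:
            owner = current          # acquire M on first run
        current.work -= 1
        if current.work == 0:
            current.done_at = tick + 1
            if owner is current:
                owner = None         # release M on completion
    return {t.name: t.done_at for t in tasks}


if __name__ == "__main__":
    print("without inheritance:", simulate(inherit=False))
    print("with inheritance:   ", simulate(inherit=True))
```

With these invented parameters, H finishes at tick 55 without inheritance (after M's entire 50-tick run) and at tick 6 with it. A watchdog deadline anywhere between those two values separates a reset from normal operation.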
Debugging Methodology
JPL engineers received telemetry showing the watchdog reset. They reviewed the VxWorks trace logs embedded in the crash dump and identified which task had failed its deadline. By correlating task state at the moment of reset with mutex ownership logs, they reconstructed the dependency chain.
The breakthrough came when an engineer realized the scenario was reproducible on Earth if the lab system was subjected to the same communication bus load that occurred on Mars during peak science data collection. Under lighter load, the race window was too narrow to hit consistently.
VxWorks provided taskInfo() and semaphore tracing APIs. The engineers used these to instrument a test system and confirm the priority inversion chain.
Fix
The fix was a software patch uploaded to Mars over the Deep Space Network, with a one-way signal delay of roughly ten minutes at the time. The patch set SEM_INVERSION_SAFE on the relevant mutex. After the patch was uplinked and applied, the system resets stopped entirely.
This was accomplished without a hardware reset of the spacecraft, which would have been far more disruptive. The mission continued successfully.
Lessons Learned
- Enable priority inheritance by default in RTOS environments. The performance overhead is negligible compared to the correctness guarantee.
- Watchdog timers must be robust to priority inversion. A watchdog that can be fooled by priority inversion provides false safety.
- Stress-test under representative load. The priority inversion was not caught in testing because lab loads were lighter than Mars science-collection loads.
- Document every synchronization primitive decision. The decision to disable SEM_INVERSION_SAFE was made without documentation, so no one revisited it.
Case Study 2: Linux Scheduler Scalability: O(N) → O(1) → CFS
What Happened
In early Linux kernels (pre-2.6), the scheduler had O(N) complexity for picking the next task — it iterated over all runnable tasks on every scheduling decision. On systems with hundreds of processes, this was tolerable. On large SMP systems with thousands of threads and heavy load, the scheduler became a bottleneck: a significant fraction of CPU time was consumed deciding who ran next, and the scheduler lock became a global contention point.
By the Linux 2.4 era (early 2000s), large production servers running databases and web workloads were reporting scheduler-induced latency spikes and unfairness.
Technical Root Cause: O(N) Scheduler
The original Linux scheduler maintained a single run queue protected by a global spinlock. When schedule() was called, it walked the entire run queue computing a "goodness" value for each task:
/* Simplified O(N) goodness function (illustrative, not the exact kernel source) */
static int goodness(struct task_struct *p, int this_cpu, ...) {
    int weight = p->priority;
    if (p->mm == current->mm) weight += 1; // same memory space bonus
    if (p->processor == this_cpu) weight += PROC_CHANGE_PENALTY; // bonus for staying on the same CPU
    return weight;
}
/* Walk all runnable tasks to find the best candidate: O(N) */
list_for_each(tmp, &runqueue_head) {
    p = list_entry(tmp, struct task_struct, run_list);
    int w = goodness(p, ...); /* evaluate once per task */
    if (w > c) { next = p; c = w; }
}
On a 16-CPU system with 2000 runnable threads, every scheduling event required iterating 2000 entries under a global spinlock — serializing all 16 CPUs. Under the workloads of early 2000s enterprise Linux deployments, this caused measurable scheduler overhead.
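The O(1) Scheduler (2.6.0 to 2.6.22)
Ingo Molnar introduced the O(1) scheduler in the development kernel Linux 2.5.2 (2002); it shipped in the stable 2.6.0 release in 2003 and stayed the default until CFS replaced it in 2007. The design used per-CPU run queues and a two-array bitmap scheme: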
Active array: tasks with remaining timeslice
Expired array: tasks that exhausted timeslice
Priority bitmap: 140 bits (100 real-time + 40 nice levels)
Finding next task: sched_find_first_bit() on the bitmap, O(1) regardless of task count
Per-CPU queues eliminated the global lock contention. The bitmap reduced "find highest priority runnable task" to a few instructions on architectures with bit-scan instructions such as bsfl/bsfq (the 140-bit bitmap spans more than one machine word).
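The bitmap lookup can be sketched in a few lines. This is a toy model, not kernel code: the class name `PriorityArray` and its methods are hypothetical, a Python integer stands in for the fixed-size bitmap, and index 0 is taken as the most urgent priority.

```python
from collections import deque

NUM_PRIOS = 140  # 100 real-time levels + 40 nice levels, as in the text


class PriorityArray:
    """Toy model of one run-queue array: a FIFO per priority plus a bitmap."""

    def __init__(self):
        self.bitmap = 0  # bit i is set when queues[i] is non-empty
        self.queues = [deque() for _ in range(NUM_PRIOS)]

    def enqueue(self, prio, task):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit and convert it to an index.
        # The cost does not depend on how many tasks are queued.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        task = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)  # queue drained: clear its bit
        return task


if __name__ == "__main__":
    arr = PriorityArray()
    arr.enqueue(120, "batch-a")
    arr.enqueue(5, "rt")
    arr.enqueue(120, "batch-b")
    print([arr.pick_next() for _ in range(4)])
```

The lookup never walks the task lists, which is the property that distinguished this design from the O(N) goodness loop.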
However, the O(1) scheduler computed "interactive" vs "batch" heuristics using sleep/run time ratios that were fragile. A task sleeping in small bursts was classified as interactive and given priority boosts. This could be gamed (intentionally or accidentally), causing unfair starvation of CPU-bound tasks. The heuristics were tuned and re-tuned through the 2.6.x series without ever being fully satisfactory.
CFS: Completely Fair Scheduler (2.6.23 — 2007)
Ingo Molnar replaced the scheduler again in 2007 with CFS (Completely Fair Scheduler). CFS remained the default Linux scheduler until Linux 6.6 (2023), when EEVDF replaced its task-selection logic.
CFS models an idealized "perfectly fair" CPU — one that runs all N tasks simultaneously each at 1/N speed. It tracks vruntime (virtual runtime) for each task: how much CPU time a task has received, weighted by its priority (nice value). The task with the smallest vruntime is always scheduled next.
vruntime += real_cpu_time * (NICE_0_LOAD / task_weight)
Red-black tree ordered by vruntime:
Leftmost node = task that has received least CPU = next to run
Scheduling complexity: O(log N) insert/delete, O(1) find-min
The red-black tree provides O(log N) insertion and removal, and the kernel caches the leftmost node, so picking the next task does not require a tree walk. In practice scheduling decisions are very fast.
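The vruntime rule can be sketched with a small simulation. The assumptions here are mine: a binary heap stands in for the red-black tree, every task is always runnable, each pick runs for a fixed hypothetical slice, and the task names and weights are invented.

```python
import heapq

NICE_0_LOAD = 1024


def simulate_cfs(weights, slice_ns=1_000_000, total_slices=10_000):
    """Return the CPU time each task receives when the smallest vruntime runs next."""
    # Heap entries are (vruntime, name); the heap minimum plays the role
    # of the leftmost node in the red-black tree.
    heap = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(heap)
    cpu_time = {name: 0 for name in weights}
    for _ in range(total_slices):
        vruntime, name = heapq.heappop(heap)
        cpu_time[name] += slice_ns
        # Heavier tasks accumulate vruntime more slowly, so they run more often.
        vruntime += slice_ns * NICE_0_LOAD / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return cpu_time


if __name__ == "__main__":
    result = simulate_cfs({"a": 1024, "b": 1024, "c": 2048})
    total = sum(result.values())
    for name, t in result.items():
        print(name, round(t / total, 3))
```

Task `c` has twice the weight of the others and ends up with about half of the CPU, which matches the proportional-share behavior the formula implies.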
CFS Failure Mode: cgroup CPU Throttling Latency
CFS introduced CPU bandwidth control via cgroups (cpu.cfs_quota_us / cpu.cfs_period_us). This is used extensively in Kubernetes to enforce CPU limits on pods.
The mechanism: each cgroup has a quota of CPU time per period (default 100ms). When a cgroup exhausts its quota, all tasks in it are throttled until the next period begins.
Production failure pattern:
Container CPU limit: 1 CPU
CFS period: 100ms
CFS quota: 100ms per 100ms period
Scenario:
t=0: Container starts processing request
t=0-10ms: 10 threads run in parallel on 10 cores and use the full 100ms of quota
t=10ms: Throttled for remaining 90ms of period
t=100ms: Period resets, container unthrottled
Result: 90ms latency spike on request
This is a well-documented production issue at companies including Netflix, Twitter, and Google. A container might have CPU headroom available (other CPUs idle) but be throttled because it burst its quota early in the period.
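The arithmetic behind the throttle window can be written down directly. The helper below is a back-of-the-envelope model, assuming every busy thread runs flat out on its own core from the start of the period; the function name is hypothetical and the model ignores how the kernel hands out quota in slices.

```python
def throttle_window_ms(quota_ms, period_ms, busy_threads):
    """Return (run_ms, throttled_ms) for one period.

    The cgroup burns quota at busy_threads ms of CPU per ms of wall time,
    so the quota lasts quota_ms / busy_threads of wall time.
    """
    run_ms = min(period_ms, quota_ms / busy_threads)
    return run_ms, period_ms - run_ms


if __name__ == "__main__":
    # The scenario above: 1 CPU limit, 100ms period, 10 parallel threads.
    print(throttle_window_ms(100, 100, 10))  # (10.0, 90.0)
    # Same ratio with a 10ms period: the stall shrinks to 9ms.
    print(throttle_window_ms(10, 10, 10))    # (1.0, 9.0)
```

The second call shows why shortening the period helps: the throttled fraction is unchanged, but the longest single stall scales with the period.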
Linux 5.4 included a fix by Dave Chiluk (commit de53fd7aedb1, 2019) that removed the expiration of unused CPU-local quota slices, which eliminated one source of unnecessary throttling. It did not remove the latency cliff itself. Common mitigations are to set CPU limits higher than the average need, or to drop CPU limits and rely on CPU requests for scheduling weight only.
Reference: "Container isolation gone wrong" — Blanco et al., Netflix Tech Blog 2019.
Lessons Learned
- Global locks are scheduler killers on SMP. Per-CPU run queues are essential for scalability.
- Heuristic-based scheduling (O(1) interactivity detection) is fragile. Model-based approaches (CFS vruntime) are more predictable.
- CFS period/quota bandwidth control creates latency cliffs. Set periods shorter (e.g., 10ms) or disable CPU limits in latency-sensitive services.
- Scheduler design is inseparable from workload characteristics. No scheduler is universally optimal.
Case Study 3: Linux RCU Stall Panics in Production
What Happened
Read-Copy-Update (RCU) is a synchronization mechanism in the Linux kernel that allows lock-free reads of shared data. Writes proceed by: (1) making a copy of the data, (2) updating the copy, (3) replacing the pointer atomically, and (4) waiting for all ongoing reads to complete (a "grace period") before freeing the old copy.
In production environments — particularly virtualized guests, heavily loaded systems, or systems with long-running kernel code paths — the RCU grace period machinery fires a stall detector:
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu: 0-...: (1 GPs behind) idle=...
rcu: (detected by 3, t=21002 jiffies, g=12345, q=1024)
In severe cases the kernel panics after reporting the stall, if the kernel.panic_on_rcu_stall sysctl is set.
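The `t=` field in the stall message is measured in jiffies, so converting it to seconds requires the kernel's tick rate. A minimal sketch, assuming HZ=1000 (a build-time setting that may differ on your kernel) and a hypothetical helper name:

```python
import re


def stall_seconds(log_line: str, hz: int = 1000) -> float:
    """Convert the 't=<n> jiffies' field of an RCU stall line to seconds."""
    match = re.search(r"\bt=(\d+) jiffies", log_line)
    if match is None:
        raise ValueError("no 't=<n> jiffies' field found")
    return int(match.group(1)) / hz


if __name__ == "__main__":
    line = "rcu: (detected by 3, t=21002 jiffies, g=12345, q=1024)"
    print(stall_seconds(line))          # about 21 seconds at HZ=1000
    print(stall_seconds(line, hz=250))  # about 84 seconds at HZ=250
```

At HZ=1000 the sample message corresponds to a stall of about 21 seconds.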
Technical Root Cause
RCU grace periods require that each CPU execute a "quiescent state" — a point where no RCU read-side critical section is active. Quiescent states occur at: context switches, idle loops, and user-space execution.
A CPU can block a grace period by:
1. Being in a long kernel code path with no context switch
2. Being in a tight spin loop holding a spinlock (prevents scheduling)
3. Being in an infinite loop bug (kernel thread stuck)
4. Running for a long time with preemption disabled (PREEMPT_RT removes some of these sections)
5. Running at 100% CPU in a VM and being descheduled by the hypervisor for extended periods
Production scenario — VM CPU steal:
VM Guest CPU: running workload
Hypervisor: overcommitted, descheduling guest vCPU for >21s (steal time)
Guest perspective: CPU appears to run continuously
RCU stall detector: fires after 21 seconds with no quiescent state
Result: RCU stall warning, possible panic
The stall detection threshold is controlled by the rcupdate.rcu_cpu_stall_timeout module parameter (default 21 seconds upstream; some distributions raise it). On a heavily overcommitted hypervisor, a VM can see more than 21 seconds of "stolen" CPU time without executing any quiescent state.
Production scenario — long kernel path:
/* Example: deep nested spinlock path blocking quiescent state */
spin_lock(&lock_a);
/* ... work ... */
spin_lock(&lock_b);
/* ... work that takes 30+ seconds under load ... */
spin_unlock(&lock_b);
spin_unlock(&lock_a);
/* RCU stall fires because this CPU never context-switched */
Debugging Methodology
- Capture the full RCU stall message; it shows which CPU is stuck and the approximate grace period age
- Examine sysrq-t output (all tasks) to find what that CPU is executing
- Check the steal field in /proc/stat or vmstat for hypervisor CPU steal
- Use perf record -a -g for a few seconds to see what kernel paths are hot
- Check /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout; increase it if the stalls come from benign overload
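The steal check from the list above can be scripted by comparing two samples of the aggregate `cpu` line in /proc/stat. This is a sketch with a hypothetical helper name; it assumes the line carries at least eight counters in the order user, nice, system, idle, iowait, irq, softirq, steal, and the sample values in the demo are invented.

```python
def steal_percent(before: str, after: str) -> float:
    """Percentage of CPU time stolen between two aggregate 'cpu' lines."""
    def fields(line):
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(x) for x in parts[1:]]

    a, b = fields(before), fields(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Sum the first eight counters only: guest time is already
    # accounted inside user and nice.
    total = sum(deltas[:8])
    return 100.0 * deltas[7] / total if total else 0.0


if __name__ == "__main__":
    # Invented samples; on a real system read the first line of
    # /proc/stat twice, a few seconds apart.
    t0 = "cpu  100 0 50 800 10 0 5 35 0 0"
    t1 = "cpu  200 0 100 1500 20 0 10 170 0 0"
    print(steal_percent(t0, t1))  # 13.5
```

A sustained steal percentage in the double digits on a guest that reports RCU stalls points toward the hypervisor rather than the guest kernel.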
Production fix options:
- Increase rcu_cpu_stall_timeout on known-overloaded systems
- Reduce hypervisor overcommit ratio
- Investigate and fix any kernel code paths holding locks too long
- Use the nohz_full= and rcu_nocbs= boot parameters on dedicated-CPU real-time workloads
Lessons Learned
- RCU stalls on VMs are usually hypervisor overcommit problems, not kernel bugs. Tune accordingly.
- RCU panics in production are jarring. Consider setting kernel.panic_on_rcu_stall=0 on systems where availability beats correctness auditing.
- Spinlock critical sections must be bounded. Unbounded spinlock holding blocks RCU grace periods and causes cascading issues.
- Monitor steal time on VMs. High steal time correlates with RCU stalls, scheduler latency, and missed timeouts.
Case Study 4: CFS Nice Value Inversions and CPU Throttling Latency Spikes
What Happened
After CFS deployment, production teams at Google, Netflix, and cloud providers encountered a class of latency issues caused by interactions between nice values, cgroup hierarchies, and CPU bandwidth throttling. These were not bugs in the traditional sense but emergent behaviors of CFS's weighted fair-queuing model interacting with real workloads.
Technical Root Cause: Nice Value Interaction
CFS assigns weights to tasks based on nice values:
Nice -20: weight = 88761
Nice 0: weight = 1024 (baseline)
Nice +19: weight = 15
Weight ratio nice-20 to nice+19: ~5917:1
In a two-task scenario, a nice -20 task competing with a nice +19 task receives 88761/(88761+15) ≈ 99.98% of CPU. This is expected. However, in a production environment with cgroup hierarchies, the weighting applies within each cgroup's share, and parent-child cgroup interactions can produce non-intuitive results.
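The share calculation generalizes to any mix of runnable tasks in one cgroup: each task's share is its weight divided by the sum of all weights. A small sketch using only the three weights quoted above (the helper name is hypothetical, and the model assumes all tasks are continuously runnable on one CPU):

```python
# Weights quoted in the text for three nice levels.
WEIGHTS = {-20: 88761, 0: 1024, 19: 15}


def cpu_share(nice, competing):
    """Fraction of CPU for a task at `nice` against tasks at `competing` nice levels."""
    own = WEIGHTS[nice]
    return own / (own + sum(WEIGHTS[n] for n in competing))


if __name__ == "__main__":
    print(round(WEIGHTS[-20] / WEIGHTS[19]))  # weight ratio, about 5917
    print(cpu_share(-20, [19]))               # about 0.9998
    print(cpu_share(0, [0]))                  # 0.5
    print(cpu_share(19, [0, 0]))              # about 0.007
```

The last line shows how little a nice +19 task receives once two ordinary tasks compete with it.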
Real production failure: A Java garbage collection thread was running at nice 0 inside a Kubernetes pod. The application threads were niced to +5 (by a well-intentioned but misguided performance tuning). The GC thread consumed the pod's entire CPU quota during STW (stop-the-world) GC pauses, while application threads (niced +5) received proportionally less time within the remaining quota window. Combined with CPU bandwidth throttling, the application threads saw 50-100ms latency spikes correlated with GC cycles.
CFS throttling latency spike — technical mechanism:
cgroup.cpu_quota = 200ms / 100ms period (2 CPUs worth)
Service runs on 4-core node
Service bursts to 4 CPUs briefly → exhausts 200ms quota in 50ms
Throttled for remaining 50ms of period
All threads in cgroup block simultaneously
Incoming network requests queue
Period expires, threads unblock, queue drained
Latency spike: 50ms+ on P99
This pattern was independently discovered and documented by multiple companies. The Netflix blog post "Container isolation gone wrong" (2019) and the Uber Engineering post "Avoiding CPU Throttling in a Containerized Environment" (2020) both describe this failure mode in production Kubernetes deployments.
Fix
- CFS period tuning: Reduce cpu.cfs_period_us from 100000 (100ms) to 10000 (10ms). A smaller period means a smaller maximum throttle window and therefore a smaller latency spike. Linux 5.4 also fixed a bandwidth-accounting problem that caused unnecessary throttling.
- CPU limit removal for latency-sensitive services: Set only CPU requests (which affect CFS weight) without limits. Requests control scheduling priority under contention; limits add throttling.
- Kernel patches: Commit de53fd7aedb1 (Dave Chiluk, Linux 5.4) removed the expiration of CPU-local quota slices, which had throttled cgroups that had not used their full quota; that behavior was a side effect of commit 512ac999d275. A cpu.idle cgroup knob, added in 2021, lets an entire cgroup run at SCHED_IDLE priority.
- SCHED_OTHER → SCHED_IDLE for background tasks: SCHED_IDLE gives background tasks an even lower weight than nice +19, so they receive only a sliver of CPU whenever anything else is runnable.
Lessons Learned
- CFS fairness is within a cgroup level, not globally. Nested cgroup hierarchies require understanding the full tree.
- CPU limits in containers create latency cliffs. For P99 latency SLAs, disable CPU limits.
- Nice values still matter in CFS. High-priority threads in the same cgroup as lower-priority threads can starve them.
- 100ms CFS period is too coarse for millisecond-sensitive services. Reduce it.
Case Study 5: Real-Time Priority Inversion in Production — Watchdog Starvation
What Happened
A production embedded control system (industrial automation, not publicly named) running a Linux PREEMPT_RT kernel experienced periodic system unresponsiveness of 200-500ms every few hours. The system used SCHED_FIFO threads for control loops and a hardware watchdog that expected periodic keepalive writes.
Technical Root Cause
The system architecture:
Thread Policy Priority
----------------------------------------------
Control Loop SCHED_FIFO 99 (highest)
Watchdog Keepalive SCHED_FIFO 98
Network I/O Handler SCHED_FIFO 85
Data Logger SCHED_OTHER 0
Hardware watchdog timeout: 5 seconds
The Data Logger ran as a SCHED_OTHER thread in the same process as the RT threads and used mmap() to write log data. During large log flushes, touching the mapping took a page fault; the fault handler called filemap_fault(), which called into the block layer, which (on this system with a slow SATA disk) blocked in a completion wait that was not RT-aware.
The critical path:
Data Logger (SCHED_OTHER) triggers page fault
Page fault takes disk I/O → kernel blocks in io_schedule()
The blocked task keeps its SCHED_OTHER priority while it waits in the kernel
Meanwhile, high-priority SCHED_FIFO threads run normally
BUT: the page fault is holding mm->mmap_sem (now mmap_lock) in read mode
Control Loop thread (SCHED_FIFO 99) attempts an mmap operation that needs mmap_lock in write mode
Control Loop blocks on mmap_lock
Watchdog Keepalive (SCHED_FIFO 98) also needs mmap_lock for a different operation
Watchdog Keepalive blocks
All RT threads blocked by a SCHED_OTHER thread doing slow disk I/O
Watchdog keepalive not written for 5+ seconds
Hardware watchdog fires: system reset
The mmap_lock is a read-write semaphore. PREEMPT_RT converts many spinlocks to RT-mutexes (which support priority inheritance) but mmap_lock in the kernel version used on this system was not fully PI-aware. The SCHED_OTHER data logger thread held a read lock on mmap_lock for the duration of its I/O operation, and the RT threads waiting for write access could not inherit priority to the lock holder.
Debugging Methodology
- Enable kernel tracing (echo 1 > /sys/kernel/debug/tracing/tracing_on) and use latencytop
- Use cyclictest to measure RT scheduling latency over hours
- Check /proc/latency_stats for high-latency kernel operations
- Use ftrace with the function_graph tracer on the suspect paths
- Correlate watchdog resets with disk I/O activity in system logs
The debugging team found 400ms latency spikes in cyclictest correlating exactly with large Data Logger writes. Ftrace confirmed the mmap_lock contention chain.
Fix
- Isolate the data logger from RT threads. Move to a separate process with its own address space to eliminate mmap_lock sharing.
- Use O_DIRECT writes in the data logger to bypass the page cache and avoid page-fault-induced mmap_lock hold times.
- Run the watchdog keepalive from a dedicated RT thread with NO shared mmap regions.
- Upgrade to a newer kernel. Linux 5.8 introduced the mmap_lock wrapper API, and per-VMA locking in later kernels (6.4+) lets many page faults avoid mmap_lock entirely.
- Use a memory-backed tmpfs for logs with periodic sync to disk from a non-RT context.
Lessons Learned
- mmap_lock is a process-wide latency hazard on PREEMPT_RT. Any operation triggering page faults in a process shared with RT threads is dangerous.
- RT threads should avoid any kernel paths touching disks, network, or file systems unless those paths are verified RT-safe.
- Watchdog keepalive threads require isolation. They cannot share resources with anything that might block indefinitely.
- PREEMPT_RT converts many locks to PI-aware RT-mutexes, but not all. Always verify the specific kernel version's RT-lock coverage.
- cyclictest is the standard tool for RT latency auditing. Run it under representative load before production deployment.
ASCII Diagram: Priority Inversion Timeline
Priority
Level
HIGH ─── [H: Needs mutex M] ─────────────────────── [BLOCKED on M] ──────────────────
Thread H runnable │
│ blocked by
MED ───────────────────── [M: Running, no mutex] ────────┼──────── [M: Still running]
Thread M becomes runnable │
│
LOW ─── [L: Holds mutex M] ─── [L: Preempted by M] ──────┼───────── [L: Eventually runs, releases M]
Thread L acquires M │
│
WATCHDOG FIRES ─────┘
(H missed deadline)
With Priority Inheritance:
When H blocks on M held by L:
L.effective_priority = H.priority
M cannot preempt L (M.priority < L.effective_priority)
L runs, releases M quickly
H runs, meets deadline
Watchdog satisfied
Production Examples
- Mars Pathfinder (1997): VxWorks priority inversion, watchdog reset, fixed by uplinked patch
- Linux RT-preempt production systems: Various industrial controllers, automotive ECUs
- Kubernetes CFS throttling: Netflix, Uber, Cloudflare all documented throttle-induced latency spikes
- Linux O(N) scheduler: Lmbench benchmarks on 2.4-era systems showed scheduler overhead > 5% on large process counts
Debugging Notes
# Check CFS throttle statistics
cat /sys/fs/cgroup/cpu/my-service/cpu.stat
# nr_throttled: count of throttled periods
# throttled_time: nanoseconds throttled
# Check current scheduler for a process
cat /proc/<pid>/sched
# Set RT priority
chrt -f 99 <command>
# Check RCU stall timeout
cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
# cyclictest for RT latency
cyclictest -p 99 -t 4 -n -h 400 -D 60
# ftrace scheduler events
echo 'sched:sched_switch' > /sys/kernel/debug/tracing/set_event
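The throttle counters in cpu.stat are easier to interpret as a ratio. The sketch below parses the file's text and reports the fraction of enforcement periods in which the cgroup was throttled. The helper name is hypothetical, the sample values are invented, and it assumes the file contains `nr_periods` and `nr_throttled` lines in `key value` form.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    # Invented sample; on a real system pass the contents of the
    # cgroup's cpu.stat file instead.
    sample = "nr_periods 1200\nnr_throttled 300\nthrottled_time 4500000000\n"
    print(throttle_ratio(sample))  # 0.25
```

A ratio that stays well above zero during normal traffic suggests the limit, not the workload, is shaping tail latency.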
Security Implications
- Priority manipulation as DoS: An unprivileged user with RLIMIT_RTPRIO set can use SCHED_FIFO to starve other processes. Proper rlimit configuration is essential.
- cgroup CPU limits bypass: An attacker who can exploit a cgroup misconfiguration can consume disproportionate CPU, degrading co-tenant services.
- Watchdog suppression: An attacker who can trigger priority inversion on a safety-critical system can prevent watchdog keepalives, causing resets.
Performance Implications
- CFS overhead is O(log N) per scheduling event — negligible for N < 10000
- SCHED_FIFO at priority 99 can starve all SCHED_OTHER threads on its CPU; by default, RT throttling (sched_rt_runtime_us = 950000 per 1s period) reserves 5% for non-RT tasks
- CPU bandwidth throttling (cpu.cfs_quota_us) with 100ms periods creates latency spikes approaching 100ms
- Nice value range (-20 to +19) creates a roughly 5900x CPU weight differential; use carefully
Failure Modes
| Failure | Trigger | Detection | Recovery |
|---|---|---|---|
| Priority inversion | Mutex held by low-pri, needed by high-pri | Watchdog fire, missed deadline | Enable priority inheritance |
| CFS throttle spike | CPU quota exhausted early in period | P99 latency spike, cpu.stat nr_throttled | Reduce period, raise quota, remove limit |
| RCU stall | CPU stuck in kernel, no quiescent state | Kernel warning/panic | Fix long kernel paths, reduce hypervisor overcommit |
| RT starvation | SCHED_FIFO thread monopolizes CPU | Other threads starved | Use SCHED_DEADLINE or time-limit RT threads |
Modern Usage
- Linux PREEMPT_RT (merged into mainline in the 6.x series) provides bounded, low scheduling latency; hard real-time guarantees still depend on hardware, drivers, and configuration
- SCHED_DEADLINE (Linux 3.14+) implements EDF (Earliest Deadline First) with bandwidth reservation; EDF is provably optimal on a single CPU
- Kubernetes cpu.cfs_period_us tuning is now a standard recommendation in the Kubernetes performance guide
- Priority inheritance is now the default in most modern RTOS (FreeRTOS, Zephyr, QNX)
Future Directions
- EEVDF (Earliest Eligible Virtual Deadline First): Merged in Linux 6.6 (2023) as replacement for CFS's pick_next algorithm — better latency properties
- Kernel-wide RT-awareness of locks: The long-term PREEMPT_RT goal of fully converting all kernel lock paths to priority-inheriting RT-mutexes
- Scheduler extensibility via BPF: Linux 6.12+ sched_ext allows writing custom schedulers in BPF (eBPF), enabling per-application scheduling policies without kernel patches
Exercises
- On a Linux system, reproduce priority inversion: write three threads (high, medium, low priority) where low holds a mutex needed by high and medium spins. Observe with htop and strace. Then add priority inheritance (pthread PTHREAD_PRIO_INHERIT) and verify the inversion disappears.
- In a container with a CPU limit of 0.5 CPUs and a 100ms CFS period, use a tight CPU-bound loop to exhaust the quota, then measure request latency. Reduce the period to 10ms and remeasure P99.
- Use cyclictest -p 99 -t 1 -n -D 60 on a non-RT and a PREEMPT_RT kernel. Compare maximum latency values. Introduce background disk I/O with fio and observe the impact on each.
- Read the Linux CFS source at kernel/sched/fair.c. Find the pick_next_entity() function and trace how vruntime comparison drives scheduling decisions.
- Simulate the Mars Pathfinder scenario in Python using threading with a Lock. Observe that Python's GIL does not prevent priority inversion (the OS scheduler still applies). Use os.sched_setparam() to set thread priorities and reproduce the starvation.
References
- Reeves, Glenn. "What Really Happened on Mars?" JPL Internal Memo, 1997. Widely reproduced online.
- Molnar, Ingo. "[ANNOUNCE] O(1) scheduler for SMP." Linux-kernel mailing list, January 2002.
- Molnar, Ingo. "[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]." Linux-kernel mailing list, April 2007.
- Heo, Tejun et al. "CPU Bandwidth Control for CFS." Linux kernel documentation, Documentation/scheduler/sched-bwc.rst
- Ts'o, Theodore. "Linux Scheduler Latency." LWN.net, various articles 2007-2023.
- Blanco, Titus et al. "Container isolation gone wrong." Netflix Tech Blog, 2019.
- Gleixner, Thomas. "PREEMPT_RT: An introduction." Embedded Linux Conference Europe, 2019.
- Linux kernel source: kernel/sched/fair.c, kernel/rcu/tree_stall.h