01 — Scheduler and Race Condition Failures

Technical Overview

The scheduler is one of the most consequential components in any operating system. It decides which thread runs, for how long, and at what priority. When the scheduler has bugs — or when application-level concurrency assumptions interact badly with scheduling policy — the results range from severe latency degradation to complete system hangs to catastrophic spacecraft resets. This document examines five landmark failures caused by scheduler bugs, priority inversion, and race conditions, each of which produced lasting design changes.

Prerequisites

Understanding of preemptive vs cooperative scheduling
Thread priority levels (real-time, normal, idle)
Mutex semantics and blocking behavior
Watchdog timer concepts
Linux scheduler history: O(N) → O(1) → CFS
Linux RCU (Read-Copy-Update) fundamentals
cgroups CPU bandwidth control

Historical Context

Scheduling has been a research-active area since the 1960s. The tension between fairness and priority-correctness has never been fully resolved. Real-time systems introduced strict priority requirements that traditional Unix schedulers ignored. The growth of multicore systems exposed new classes of scalability bugs that simply didn't exist on uniprocessors. The story of scheduler failures is the story of implicit assumptions meeting adversarial reality.

Case Study 1: Mars Pathfinder Priority Inversion (1997)

What Happened

On July 4, 1997, the Mars Pathfinder spacecraft successfully landed on Mars — a massive engineering triumph. Within days of landing, the spacecraft began experiencing mysterious full system resets, wiping the lander's in-flight data buffers and forcing science operations to halt. Engineers at JPL observed the resets happening, traced them to a watchdog timer firing, but could not easily reproduce the failure on Earth.

The spacecraft ran VxWorks, a deterministic real-time operating system widely used in embedded and aerospace systems. VxWorks supports priority-based preemptive scheduling with mutex support, and includes optional priority inheritance — a mechanism that was, crucially, not enabled.

Technical Root Cause

The Pathfinder software contained three relevant threads:

Thread Priority   Name
----------------------------------------------
HIGH (P=3)        Information Bus Management Thread
MEDIUM (P=2)      Communications Thread
LOW (P=1)         Meteorological Data Thread (ASI/MET)

The low-priority Meteorological Data Thread (ASI/MET) acquired a mutex protecting shared data structures on the information bus whenever it published sensor readings. The high-priority Information Bus Management Thread needed the same mutex to move data through the bus. The Communications Thread ran at medium priority, ran for long stretches, and did not need the mutex.

Classic priority inversion scenario:

Time →
LOW thread acquires mutex M
LOW thread preempted by MEDIUM thread (M still held by LOW)
HIGH thread becomes runnable, needs mutex M
HIGH thread blocks on M
MEDIUM thread runs (for a long time; it does not need M)
LOW thread never gets CPU to release M
HIGH thread starved
Watchdog fires: HIGH thread has not executed within deadline
SYSTEM RESET

The watchdog timer was monitoring the information bus management thread. That thread held the highest priority but could not run because the mutex it needed was held by the low-priority ASI/MET thread, which in turn could not run because the medium-priority communications thread kept the CPU. The medium-priority thread acted as an unintentional barrier between the low-priority mutex holder and the high-priority waiter.

This is textbook priority inversion: a high-priority thread's effective priority drops to that of a low-priority thread holding a resource, because intermediate-priority threads can interpose.

The formal definition:

Priority Inversion occurs when:
  - Thread H (high priority) is blocked on resource R
  - Resource R is held by thread L (low priority)
  - Thread M (medium priority) is runnable and does not need R
  - M preempts L, keeping L from releasing R
  - H is starved by M, despite H > M in priority
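
The definition above can be explored with a small discrete-time model. The sketch below is a simplified simulation, not VxWorks code: the task set, the tick counts, and the simulate() helper are assumptions chosen for illustration. Each tick, the runnable task with the highest effective priority runs; with inherit=True, the mutex owner temporarily takes the priority of the highest-priority task blocked on the mutex.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int       # larger number = more urgent
    release: int        # tick at which the task becomes runnable
    work: int           # ticks of CPU time still needed
    needs_mutex: bool   # True if all of its work happens under the mutex
    done_at: int = -1   # tick at which the task finished


def simulate(inherit: bool, medium_work: int = 10, horizon: int = 100) -> dict:
    """Return the completion tick of each task (hypothetical task set)."""
    tasks = [
        Task("L", 1, release=0, work=3, needs_mutex=True),
        Task("M", 2, release=1, work=medium_work, needs_mutex=False),
        Task("H", 3, release=2, work=1, needs_mutex=True),
    ]
    owner = None  # task currently holding the mutex
    for tick in range(horizon):
        ready = [t for t in tasks if t.release <= tick and t.work > 0]
        if not ready:
            break
        # A task is blocked if it needs the mutex and someone else owns it.
        blocked = [t for t in ready
                   if t.needs_mutex and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]

        def effective(t):
            # Priority inheritance: the owner borrows the top waiter's priority.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        running = max(runnable, key=effective)
        if running.needs_mutex:
            owner = running           # acquire (or keep) the mutex
        running.work -= 1
        if running.work == 0:
            running.done_at = tick + 1
            if owner is running:
                owner = None          # release the mutex on completion
    return {t.name: t.done_at for t in tasks}
```

Under these assumed parameters, H finishes at tick 14 without inheritance and at tick 5 with it. Lengthening M's burst delays H only in the no-inheritance case, which is the signature of priority inversion.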

VxWorks supports priority inheritance as an optional mutex flag (SEM_INVERSION_SAFE). When enabled, if a low-priority thread holds a mutex that a higher-priority thread is waiting for, the low-priority thread temporarily inherits the higher thread's priority until it releases the mutex. JPL engineers had disabled this feature for performance reasons and because they believed the scenario would not occur.

Debugging Methodology

JPL engineers received telemetry showing the watchdog reset. They reviewed the VxWorks trace logs embedded in the crash dump and identified which task had failed its deadline. By correlating task state at the moment of reset with mutex ownership logs, they reconstructed the dependency chain.

The breakthrough came when an engineer realized the scenario was reproducible on Earth if the lab system was subjected to the same communication bus load that occurred on Mars during peak science data collection. Under lighter load, the race window was too narrow to hit consistently.

VxWorks can record a trace of system events, including context switches, semaphore operations, and interrupts. The engineers used this tracing on a test system to confirm the priority inversion chain.

Fix

The fix was a software patch uploaded to Mars over the Deep Space Network, with a one-way signal delay of roughly 10 minutes at the time. The patch set SEM_INVERSION_SAFE on the relevant mutex. After the patch was uplinked and applied, the system resets stopped entirely.

This was accomplished without a hardware reset of the spacecraft, which would have been far more disruptive. The mission continued successfully.

Lessons Learned

Enable priority inheritance by default in RTOS environments. The performance overhead is negligible compared to the correctness guarantee.
Watchdog timers must be robust to priority inversion. A watchdog that can be fooled by priority inversion provides false safety.
Stress-test under representative load. The priority inversion was not caught in testing because lab loads were lighter than Mars science-collection loads.
Document every synchronization primitive decision. The decision to disable SEM_INVERSION_SAFE was made without documentation, so no one revisited it.

Case Study 2: Linux Scheduler Scalability: O(N) → O(1) → CFS

What Happened

In early Linux kernels (pre-2.6), the scheduler had O(N) complexity for picking the next task — it iterated over all runnable tasks on every scheduling decision. On systems with hundreds of processes, this was tolerable. On large SMP systems with thousands of threads and heavy load, the scheduler became a bottleneck: a significant fraction of CPU time was consumed deciding who ran next, and the scheduler lock became a global contention point.

By the Linux 2.5/2.6 era (early 2000s), large production servers running databases and web workloads were reporting scheduler-induced latency spikes and unfairness.

Technical Root Cause: O(N) Scheduler

The original Linux scheduler maintained a single run queue protected by a global spinlock. When schedule() was called, it walked the entire run queue computing a "goodness" value for each task:

/* Simplified O(N) goodness function */
static int goodness(struct task_struct *p, int this_cpu, ...) {
    int weight = p->priority;
    if (p->mm == current->mm) weight += 1;  // same memory space bonus
    if (p->processor == this_cpu) weight += PROC_CHANGE_PENALTY;
    return weight;
}

/* Walk all tasks to find the best candidate: O(N) per scheduling decision */
list_for_each(tmp, &runqueue_head) {
    p = list_entry(tmp, struct task_struct, run_list);
    int weight = goodness(p, ...);  /* evaluate once per task */
    if (weight > c) { c = weight; next = p; }
}

On a 16-CPU system with 2000 runnable threads, every scheduling event required iterating 2000 entries under a global spinlock — serializing all 16 CPUs. Under the workloads of early 2000s enterprise Linux deployments, this caused measurable scheduler overhead.

The O(1) Scheduler (2.6.0 — 2007)

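Ingo Molnar announced the O(1) scheduler in January 2002. It was merged during the 2.5 development series (2.5.2) and first shipped in a stable release with 2.6.0 in December 2003. The design used per-CPU run queues and a two-array bitmap scheme: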

Active array:   tasks with remaining timeslice
Expired array:  tasks that exhausted timeslice

Priority bitmap: 140 bits (100 real-time + 40 nice levels)
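Finding next task: sched_find_first_bit() on the bitmap, O(1) regardless of task count

Per-CPU queues eliminated the global lock contention. Because the bitmap has a fixed size of 140 bits, "find highest priority runnable task" reduces to a few bit-scan instructions (bsfl/bsfq on x86), independent of how many tasks are runnable.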
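
The bitmap lookup can be sketched in a few lines. This is a toy model of one priority array, not the kernel's data structure: the PrioArray class and its method names are hypothetical, and a Python integer stands in for the 140-bit bitmap. Lower index means higher priority, as in the kernel.

```python
from collections import deque

NUM_PRIOS = 140  # 0..99 real-time, 100..139 nice levels


class PrioArray:
    """Toy model of one O(1)-scheduler priority array (hypothetical names)."""

    def __init__(self):
        self.bitmap = 0  # bit i set <=> queue i is non-empty
        self.queues = [deque() for _ in range(NUM_PRIOS)]

    def enqueue(self, prio, task):
        self.queues[prio].append(task)
        self.bitmap |= 1 << prio

    def pick_next(self):
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit, then convert it to an index.
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        queue = self.queues[prio]
        task = queue.popleft()
        if not queue:
            self.bitmap &= ~(1 << prio)  # queue drained: clear its bit
        return task
```

The cost of pick_next() depends only on the fixed bitmap width, never on the number of queued tasks, which is the property the O(1) design relied on.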

However, the O(1) scheduler computed "interactive" vs "batch" heuristics using sleep/run time ratios that were fragile. A task sleeping in small bursts was classified as interactive and given priority boosts. This could be gamed (intentionally or accidentally), causing unfair starvation of CPU-bound tasks. The heuristics were tuned and re-tuned through the 2.6.x series without ever being fully satisfactory.

CFS: Completely Fair Scheduler (2.6.23 — 2007)

Ingo Molnar replaced the scheduler again in 2007 with CFS (Completely Fair Scheduler). CFS remained the default Linux scheduler, with modifications, until Linux 6.6 replaced its task-selection logic with EEVDF.

CFS models an idealized "perfectly fair" CPU — one that runs all N tasks simultaneously each at 1/N speed. It tracks vruntime (virtual runtime) for each task: how much CPU time a task has received, weighted by its priority (nice value). The task with the smallest vruntime is always scheduled next.

vruntime += real_cpu_time * (NICE_0_LOAD / task_weight)

Red-black tree ordered by vruntime:
  Leftmost node = task that has received least CPU = next to run

Scheduling complexity: O(log N) insert/delete, O(1) find-min

Insertions and deletions in the red-black tree cost O(log N). Picking the next task is O(1) because the scheduler caches a pointer to the leftmost node, so scheduling decisions stay fast even with many runnable tasks.
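
The vruntime rule can be checked with a short simulation. The sketch below is an assumption-laden toy: a binary heap (heapq) stands in for the red-black tree, every task is always runnable, and each scheduling decision grants one fixed time slice. The run_cfs() helper and its parameters are illustrative names, not kernel interfaces.

```python
import heapq

NICE_0_LOAD = 1024


def run_cfs(weights, slice_ns=1_000_000, total_slices=3000):
    """Return CPU time (ns) each task receives under a toy vruntime scheduler."""
    # Heap entries are (vruntime, task_name); the smallest vruntime runs next.
    heap = [(0.0, name) for name in weights]
    heapq.heapify(heap)
    cpu_time = {name: 0 for name in weights}
    for _ in range(total_slices):
        vruntime, name = heapq.heappop(heap)
        cpu_time[name] += slice_ns
        # Heavier tasks accumulate vruntime more slowly, so they run more often.
        vruntime += slice_ns * NICE_0_LOAD / weights[name]
        heapq.heappush(heap, (vruntime, name))
    return cpu_time
```

For example, run_cfs({"a": 1024, "b": 1024, "c": 2048}) gives task c about half of the CPU time and tasks a and b about a quarter each, matching the ratio of their weights.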

CFS Failure Mode: cgroup CPU Throttling Latency

CFS introduced CPU bandwidth control via cgroups (cpu.cfs_quota_us / cpu.cfs_period_us). This is used extensively in Kubernetes to enforce CPU limits on pods.

The mechanism: each cgroup has a quota of CPU time per period (default 100ms). When a cgroup exhausts its quota, all tasks in it are throttled until the next period begins.

Production failure pattern:

Container CPU limit: 1 CPU
CFS period: 100ms
CFS quota: 100ms per 100ms period

Scenario:
  t=0:   Container starts processing request
  t=0-10ms: 10 threads on 10 cores consume the full 100ms of quota
  t=10ms: Throttled for remaining 90ms of period
  t=100ms: Period resets, container unthrottled
  Result: 90ms latency spike on request

This is a well-documented production issue at companies including Netflix, Twitter, and Google. A container might have CPU headroom available (other CPUs idle) but be throttled because it burst its quota early in the period.
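
The arithmetic behind the scenario is simple enough to express as a function. The sketch below assumes every busy thread runs flat out on its own CPU from the start of the period and ignores the kernel's slice-based distribution of quota to CPUs; throttle_stall_ms() is a hypothetical helper name.

```python
def throttle_stall_ms(quota_ms, period_ms, busy_threads):
    """Time per period (ms) that a cgroup spends throttled.

    Assumes busy_threads threads each burn one CPU continuously
    starting at the beginning of the period.
    """
    if busy_threads <= 0:
        raise ValueError("busy_threads must be positive")
    time_to_exhaust = quota_ms / busy_threads  # wall-clock ms until quota is gone
    if time_to_exhaust >= period_ms:
        return 0.0  # quota lasts the whole period: no throttling
    return period_ms - time_to_exhaust
```

With a 100ms quota, a 100ms period, and 10 busy threads, the function returns 90.0, matching the 90ms stall in the timeline. A single busy thread under the same limit is never throttled.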

Linux 5.4 included commit de53fd7aedb1 (2019) by Dave Chiluk (Indeed), which removed the expiration of CPU-local quota slices and so eliminated one source of throttling below quota. It provided partial relief but did not change the period/quota model itself. Common mitigations are to set CPU limits higher than needed, or to drop CPU limits and rely on CPU requests for scheduling weight only.

Reference: "Container isolation gone wrong" — Blanco et al., Netflix Tech Blog 2019.

Lessons Learned

Global locks are scheduler killers on SMP. Per-CPU run queues are essential for scalability.
Heuristic-based scheduling (O(1) interactivity detection) is fragile. Model-based approaches (CFS vruntime) are more predictable.
CFS period/quota bandwidth control creates latency cliffs. Set periods shorter (e.g., 10ms) or disable CPU limits in latency-sensitive services.
Scheduler design is inseparable from workload characteristics. No scheduler is universally optimal.

Case Study 3: Linux RCU Stall Panics in Production

What Happened

Read-Copy-Update (RCU) is a synchronization mechanism in the Linux kernel that allows lock-free reads of shared data. Writes proceed by: (1) making a copy of the data, (2) updating the copy, (3) replacing the pointer atomically, and (4) waiting for all ongoing reads to complete (a "grace period") before freeing the old copy.
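
The four write steps can be illustrated with a single-threaded toy. This is a conceptual sketch only: real RCU does not track individual readers with tokens but infers reader completion from quiescent states, and the ToyRcu class and all of its method names are hypothetical.

```python
class ToyRcu:
    """Toy read-copy-update cell; illustrates grace periods, not kernel RCU."""

    def __init__(self, value):
        self._current = value
        self._active_readers = {}  # token -> snapshot the reader may still use
        self._next_token = 0
        self._retired = []         # (old_value, tokens of pre-existing readers)
        self.freed = []            # old values whose grace period has ended

    def read_lock(self):
        token = self._next_token
        self._next_token += 1
        self._active_readers[token] = self._current
        return token, self._current

    def read_unlock(self, token):
        del self._active_readers[token]
        self._reclaim()

    def update(self, fn):
        old = self._current
        new = fn(dict(old))                  # (1) copy, (2) update the copy
        self._current = new                  # (3) publish the new pointer
        waiting = set(self._active_readers)  # readers that may still see `old`
        self._retired.append((old, waiting))
        self._reclaim()                      # (4) free once those readers finish

    def _reclaim(self):
        still_waiting = []
        for old, waiting in self._retired:
            waiting = {t for t in waiting if t in self._active_readers}
            if waiting:
                still_waiting.append((old, waiting))
            else:
                self.freed.append(old)       # grace period over
        self._retired = still_waiting
```

A reader that never calls read_unlock() keeps the old copy pinned forever. In this toy model, that is the analogue of a CPU that never reports a quiescent state.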

In production environments — particularly virtualized guests, heavily loaded systems, or systems with long-running kernel code paths — the RCU grace period machinery fires a stall detector:

rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
rcu:    0-...: (1 GPs behind) idle=...
rcu:    (detected by 3, t=21002 jiffies, g=12345, q=1024)

In severe cases, the kernel panics on a detected stall if the kernel.panic_on_rcu_stall sysctl is set.

Technical Root Cause

RCU grace periods require that each CPU execute a "quiescent state" — a point where no RCU read-side critical section is active. Quiescent states occur at: context switches, idle loops, and user-space execution.

A CPU can block a grace period by:
1. Executing a long kernel code path with no context switch
2. Spinning in a tight loop while holding a spinlock (which prevents scheduling)
3. Being stuck in an infinite loop because of a bug (for example, a wedged kernel thread)
4. Running for a long time with preemption or interrupts disabled (PREEMPT_RT removes some of these sections)
5. Running in a VM whose vCPU is descheduled by the hypervisor for extended periods

Production scenario — VM CPU steal:

VM Guest CPU: running workload
Hypervisor: overcommitted, descheduling the guest vCPU for seconds at a time (steal time)
Guest perspective: no instructions execute on that vCPU, but guest time still advances
RCU stall detector: fires after 21 seconds with no quiescent state
Result: RCU stall warning, possible panic

The stall detection threshold is controlled by the rcupdate.rcu_cpu_stall_timeout module parameter (default 21 seconds, readable at /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout). On a heavily overcommitted hypervisor, a vCPU can lose more than 21 seconds to "stolen" time without passing through a quiescent state.

Production scenario — long kernel path:

/* Example: deep nested spinlock path blocking quiescent state */
spin_lock(&lock_a);
  /* ... work ... */
  spin_lock(&lock_b);
    /* ... work that takes 30+ seconds under load ... */
  spin_unlock(&lock_b);
spin_unlock(&lock_a);
/* RCU stall fires because this CPU never context-switched */

Debugging Methodology

Capture the full RCU stall message — it shows which CPU is stuck and the approximate grace period age
Examine sysrq-t output (all tasks) to find what that CPU is executing
Check steal field in /proc/stat or vmstat for hypervisor CPU steal
Use perf record -a -g for a few seconds to see what kernel paths are hot
Check /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout and raise it if the stalls come from benign overload
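
The steal check in the list above can be scripted. The sketch below parses the aggregate cpu line of /proc/stat and computes the steal percentage between two samples. The helper names cpu_times() and steal_percent() are illustrative; the field order follows the proc(5) layout (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```python
FIELDS = ["user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice"]


def cpu_times(proc_stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of clock ticks."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate cpu line found")


def steal_percent(before, after):
    """Percentage of elapsed CPU time stolen by the hypervisor between samples."""
    delta = {key: after[key] - before[key] for key in before}
    # Guest time is already counted inside user/nice, so leave it out of the total.
    total = sum(v for k, v in delta.items() if k not in ("guest", "guest_nice"))
    return 100.0 * delta["steal"] / total if total else 0.0
```

In practice the two samples would come from reading /proc/stat a few seconds apart. A sustained steal percentage in the double digits on a guest that reports RCU stalls points at hypervisor overcommit rather than a kernel bug.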

Production fix options:
- Increase rcu_cpu_stall_timeout on known-overloaded systems
- Reduce the hypervisor overcommit ratio
- Investigate and fix any kernel code paths that hold locks too long
- Use nohz_full= together with rcu_nocbs= on CPUs dedicated to real-time workloads

Lessons Learned

RCU stalls on VMs are usually hypervisor overcommit problems, not kernel bugs. Tune accordingly.
RCU panics in production are jarring. Consider kernel.panic_on_rcu_stall=0 on systems where availability beats correctness auditing.
Spinlock critical sections must be bounded. Unbounded spinlock holding blocks RCU grace periods and causes cascading issues.
Monitor steal time on VMs. High steal time correlates with RCU stalls, scheduler latency, and missed timeouts.

Case Study 4: CFS Nice Value Inversions and CPU Throttling Latency Spikes

What Happened

After CFS deployment, production teams at Google, Netflix, and cloud providers encountered a class of latency issues caused by interactions between nice values, cgroup hierarchies, and CPU bandwidth throttling. These were not bugs in the traditional sense but emergent behaviors of CFS's weighted fair-queuing model interacting with real workloads.

Technical Root Cause: Nice Value Interaction

CFS assigns weights to tasks based on nice values:

Nice -20: weight = 88761
Nice   0: weight = 1024  (baseline)
Nice +19: weight = 15

Weight ratio nice-20 to nice+19: 88761/15 ≈ 5917:1

In a two-task scenario, a nice -20 task competing with a nice +19 task receives 88761/(88761+15) ≈ 99.98% of CPU. This is expected. However, in a production environment with cgroup hierarchies, the weighting applies within each cgroup's share, and parent-child cgroup interactions can produce non-intuitive results.
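
The share calculation follows directly from the weight table. The sketch below uses only the three weights listed above and assumes all tasks are runnable and compete in the same cgroup; cpu_shares() is a hypothetical helper.

```python
# Weights for the three nice levels quoted in the table above.
WEIGHTS = {-20: 88761, 0: 1024, 19: 15}


def cpu_shares(nice_values):
    """Fraction of CPU each task receives, given the tasks' nice values."""
    weights = [WEIGHTS[n] for n in nice_values]
    total = sum(weights)
    return [w / total for w in weights]
```

For example, cpu_shares([-20, 19]) assigns about 99.98% of the CPU to the nice -20 task, while cpu_shares([0, 0]) splits it evenly.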

Real production failure: A Java garbage collection thread was running at nice 0 inside a Kubernetes pod. The application threads were niced to +5 (by a well-intentioned but misguided performance tuning). The GC thread consumed the pod's entire CPU quota during STW (stop-the-world) GC pauses, while application threads (niced +5) received proportionally less time within the remaining quota window. Combined with CPU bandwidth throttling, the application threads saw 50-100ms latency spikes correlated with GC cycles.

CFS throttling latency spike — technical mechanism:

cgroup.cpu_quota = 200ms / 100ms period (2 CPUs worth)
Service runs on 4-core node
Service bursts to 4 CPUs briefly → exhausts 200ms quota in 50ms
Throttled for remaining 50ms of period
All threads in cgroup block simultaneously
Incoming network requests queue
Period expires, threads unblock, queue drained
Latency spike: 50ms+ on P99

This pattern was independently discovered and documented by multiple companies. The Netflix blog post "Container isolation gone wrong" (2019) and the Uber Engineering post "Avoiding CPU Throttling in a Containerized Environment" (2020) both describe this failure mode in production Kubernetes deployments.

Fix

CFS period tuning: Reduce cpu.cfs_period_us from 100000 (100ms) to 10000 (10ms). Smaller period = smaller maximum throttle window = lower latency spike. Linux kernel 5.4 improved the accuracy of short-period bandwidth accounting.
CPU limit removal for latency-sensitive services: Set only CPU requests (which affect CFS weight) without limits. Requests control scheduling priority under contention; limits add throttling.
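Kernel patches: Commit 512ac999d275 (Linux 4.18) fixed a bandwidth timer clock drift condition, and commit de53fd7aedb1 (Linux 5.4) removed the expiration of CPU-local quota slices. The cgroup-level cpu.idle knob, added in 2021, extends SCHED_IDLE behavior to whole cgroups.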
SCHED_OTHER → SCHED_IDLE for background tasks: Using SCHED_IDLE (not just nice +19) ensures background tasks yield entirely under any system load.

Lessons Learned

CFS fairness is within a cgroup level, not globally. Nested cgroup hierarchies require understanding the full tree.
CPU limits in containers create latency cliffs. For P99 latency SLAs, disable CPU limits.
Nice values still matter in CFS. High-priority threads in the same cgroup as lower-priority threads can starve them.
A 100ms CFS period is too coarse for services with millisecond-level latency targets. Reduce it.

Case Study 5: Real-Time Priority Inversion in Production — Watchdog Starvation

What Happened

A production embedded control system (industrial automation, not publicly named) running a Linux PREEMPT_RT kernel experienced periodic system unresponsiveness of 200-500ms every few hours. The system used SCHED_FIFO threads for control loops and a hardware watchdog that expected periodic keepalive writes.

Technical Root Cause

The system architecture:

Thread                    Policy       Priority
----------------------------------------------
Control Loop              SCHED_FIFO   99 (highest)
Watchdog Keepalive        SCHED_FIFO   98
Network I/O Handler       SCHED_FIFO   85
Data Logger               SCHED_OTHER  0

Hardware watchdog timeout: 5 seconds

The Data Logger was SCHED_OTHER and used mmap() to write log data. During large log flushes, the kernel path called filemap_fault() which took a page fault, which called into the block layer, which — on this system with a slow SATA disk — blocked in a completion wait that was not RT-aware.

The critical path:

Data Logger (SCHED_OTHER) triggers page fault
Page fault takes disk I/O → kernel blocks in io_schedule()
The blocked task keeps its SCHED_OTHER priority while it sleeps in the kernel
Meanwhile, high-priority SCHED_FIFO threads run normally

BUT: the page fault is holding mm->mmap_sem (now mmap_lock) in read mode
Control Loop thread (SCHED_FIFO 99) attempts mmap operation
Control Loop blocks on mmap_lock
Watchdog Keepalive (SCHED_FIFO 98) also needs mmap_lock for a different operation
Watchdog Keepalive blocks

All RT threads blocked by a SCHED_OTHER thread doing slow disk I/O
Watchdog keepalive not written for 5+ seconds
Hardware watchdog fires: system reset

The mmap_lock is a read-write semaphore. PREEMPT_RT converts many spinlocks to RT-mutexes (which support priority inheritance) but mmap_lock in the kernel version used on this system was not fully PI-aware. The SCHED_OTHER data logger thread held a read lock on mmap_lock for the duration of its I/O operation, and the RT threads waiting for write access could not inherit priority to the lock holder.

Debugging Methodology

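Enable a kernel latency tracer, for example echo wakeup_rt > /sys/kernel/debug/tracing/current_tracer followed by echo 1 > /sys/kernel/debug/tracing/tracing_on; latencytop gives a coarser per-process view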
Use cyclictest to measure RT scheduling latency over hours
Check /proc/latency_stats for high-latency kernel operations
Use ftrace with function_graph tracer on the suspect paths
Correlate watchdog resets with disk I/O activity in system logs

The debugging team found 400ms latency spikes in cyclictest correlating exactly with large Data Logger writes. Ftrace confirmed the mmap_lock contention chain.

Fix

Isolate the data logger from RT threads. Move to a separate process with its own address space to eliminate mmap_lock sharing.
Use O_DIRECT writes in the data logger to bypass page cache and avoid page-fault-induced mmap_lock.
Run the watchdog keepalive from a dedicated RT thread with NO shared mmap regions.
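Upgrade to a newer kernel and re-measure. Linux 5.8 introduced the mmap_lock wrapper API (replacing direct mmap_sem use), but the lock's RT behavior should be verified on the exact kernel version deployed.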
Use a memory-backed tmpfs for logs with periodic sync to disk from a non-RT context.

Lessons Learned

mmap_lock is a system-wide latency hazard on PREEMPT_RT. Any operation triggering page faults in a process shared with RT threads is dangerous.
RT threads should avoid any kernel paths touching disks, network, or file systems unless those paths are verified RT-safe.
Watchdog keepalive threads require isolation. They cannot share resources with anything that might block indefinitely.
PREEMPT_RT converts many locks to PI-aware RT-mutexes, but not all. Always verify the specific kernel version's RT-lock coverage.
cyclictest is the standard tool for RT latency auditing. Run it under representative load before production deployment.

ASCII Diagram: Priority Inversion Timeline

Priority
Level
  HIGH ─── [H: Needs mutex M] ─────────────────────── [BLOCKED on M] ──────────────────
            Thread H runnable                                │
                                                             │ blocked by
  MED  ───────────────────── [M: Running, no mutex] ────────┼──────── [M: Still running]
            Thread M becomes runnable                        │
                                                             │
  LOW  ─── [L: Holds mutex M] ─── [L: Preempted by M] ──────┼───────── [L: Eventually runs, releases M]
            Thread L acquires M                              │
                                                             │
                                         WATCHDOG FIRES ─────┘
                                         (H missed deadline)

With Priority Inheritance:
  When H blocks on M held by L:
    L.effective_priority = H.priority
    M cannot preempt L (M.priority < L.effective_priority)
    L runs, releases M quickly
    H runs, meets deadline
    Watchdog satisfied

Production Examples

Mars Pathfinder (1997): VxWorks priority inversion, watchdog reset, fixed by uplinked patch
Linux RT-preempt production systems: Various industrial controllers, automotive ECUs
Kubernetes CFS throttling: Netflix, Uber, Cloudflare all documented throttle-induced latency spikes
Linux O(N) scheduler: Lmbench benchmarks on 2.4-era systems showed scheduler overhead > 5% on large process counts

Debugging Notes

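# Check CFS throttle statistics (cgroup v1 path shown)
cat /sys/fs/cgroup/cpu/my-service/cpu.stat
# nr_periods: enforcement periods elapsed
# nr_throttled: count of throttled periods
# throttled_time: nanoseconds throttled (cgroup v2 reports throttled_usec instead)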

# Check current scheduler for a process
cat /proc/<pid>/sched

# Set RT priority
chrt -f 99 <command>

# Check RCU stall timeout
cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

# cyclictest for RT latency
cyclictest -p 99 -t 4 -n -h 400 -D 60

# ftrace scheduler events
echo 'sched:sched_switch' > /sys/kernel/debug/tracing/set_event
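
The cpu.stat counters from the first command above are easier to act on as ratios. The sketch below parses the file's "key value" lines and reports the fraction of periods that were throttled. The helper names are illustrative, and the sample text in the usage note assumes the cgroup v1 field names.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0
```

Given the text "nr_periods 1200\nnr_throttled 300\nthrottled_time 4500000000", throttle_ratio() returns 0.25, meaning one period in four hit the quota. Dividing throttled_time by nr_throttled gives the mean stall per throttled period, 15ms in this sample.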

Security Implications

Priority manipulation as DoS: An unprivileged user with RLIMIT_RTPRIO set can use SCHED_FIFO to starve other processes. Proper rlimit configuration is essential.
cgroup CPU limits bypass: An attacker who can exploit a misconfigured cgroup hierarchy can consume disproportionate CPU, degrading co-tenant services.
Watchdog suppression: An attacker who can trigger priority inversion on a safety-critical system can prevent watchdog keepalives, causing resets.

Performance Implications

CFS overhead is O(log N) per scheduling event — negligible for N < 10000
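SCHED_FIFO at priority 99 can starve all SCHED_OTHER threads on its CPU; by default, RT throttling (kernel.sched_rt_runtime_us = 950000 per 1s period) leaves them only about 5% of CPU time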
CPU bandwidth throttling (cpu.cfs_quota) with 100ms periods creates up to 100ms latency spikes
Nice value range (-20 to +19) creates a roughly 5900x CPU weight differential (88761 vs 15); use carefully

Failure Modes

Failure	Trigger	Detection	Recovery
Priority inversion	Mutex held by low-pri, needed by high-pri	Watchdog fire, missed deadline	Enable priority inheritance
CFS throttle spike	CPU quota exhausted early in period	P99 latency spike, `cpu.stat nr_throttled`	Reduce period, raise quota, remove limit
RCU stall	CPU stuck in kernel, no quiescent state	Kernel warning/panic	Fix long kernel paths, reduce hypervisor overcommit
RT starvation	SCHED_FIFO thread monopolizes CPU	Other threads starved	Use SCHED_DEADLINE or time-limit RT threads

Modern Usage

Linux PREEMPT_RT (merged into mainline in Linux 6.12) provides bounded scheduling latency suitable for many hard real-time workloads
SCHED_DEADLINE (Linux 3.14+) implements EDF (Earliest Deadline First) with CBS bandwidth reservation; EDF is provably optimal on a single CPU
Kubernetes exposes the CFS period through the kubelet's cpuCFSQuotaPeriod setting (--cpu-cfs-quota-period flag), so cpu.cfs_period_us tuning can be applied per node
Priority inheritance is now the default in most modern RTOS (FreeRTOS, Zephyr, QNX)

Future Directions

EEVDF (Earliest Eligible Virtual Deadline First): Merged in Linux 6.6 (2023) as replacement for CFS's pick_next algorithm — better latency properties
Kernel-wide RT-awareness of locks: The long-term PREEMPT_RT goal of fully converting all kernel lock paths to priority-inheriting RT-mutexes
Scheduler extensibility via BPF: Linux 6.12+ sched_ext allows writing custom schedulers in BPF (eBPF), enabling per-application scheduling policies without kernel patches

Exercises

On a Linux system, reproduce priority inversion: write three threads (high, medium, low priority) where low holds a mutex needed by high and medium spins. Observe with htop and strace. Then add priority inheritance (pthread PTHREAD_PRIO_INHERIT) and verify the inversion disappears.
In a container with a CPU limit of 0.5 CPUs and a 100ms CFS period, use a tight CPU-bound loop to exhaust the quota, then measure request latency. Reduce the period to 10ms and remeasure P99.
Use cyclictest -p 99 -t 1 -n -D 60 on a non-RT and PREEMPT_RT kernel. Compare maximum latency values. Introduce background disk I/O with fio and observe the impact on each.
Read the Linux CFS source at kernel/sched/fair.c. Find the pick_next_entity() function and trace how vruntime comparison drives scheduling decisions.
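Simulate the Mars Pathfinder scenario in Python using threading with a Lock. Observe that Python's GIL does not prevent priority inversion (the OS scheduler still applies). On Linux, call os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority)) from inside each thread (this requires root or CAP_SYS_NICE), and pin the process to one CPU with os.sched_setaffinity() to reproduce the starvation.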

References

Reeves, Glenn. "What Really Happened on Mars?" JPL Internal Memo, 1997. Widely reproduced online.
Molnar, Ingo. "[ANNOUNCE] O(1) scheduler for SMP." Linux-kernel mailing list, January 2002.
Molnar, Ingo. "[patch] Modular Scheduler Core and Completely Fair Scheduler [CFS]." Linux-kernel mailing list, April 2007.
Heo, Tejun et al. "CPU Bandwidth Control for CFS." Linux kernel documentation, Documentation/scheduler/sched-bwc.rst
Ts'o, Theodore. "Linux Scheduler Latency." LWN.net, various articles 2007-2023.
Blanco, Titus et al. "Container isolation gone wrong." Netflix Tech Blog, 2019.
Gleixner, Thomas. "PREEMPT_RT: An introduction." Embedded Linux Conference Europe, 2019.
Linux kernel source: kernel/sched/fair.c, kernel/rcu/tree_stall.h