Linux Real-Time Scheduling

Technical Overview

Real-time scheduling is not about being fast — it is about being predictable. A real-time system must complete a task by a deadline, not merely quickly on average. Missing a deadline in a hard real-time system can mean physical harm: a robot arm that doesn't stop in time, a nuclear plant control loop that doesn't respond within 100µs, an airbag that fires 50ms too late. Soft real-time systems (audio, video streaming) tolerate occasional misses with degraded quality rather than catastrophic failure.

Linux supports three real-time scheduling policies: SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE. These live in rt_sched_class and dl_sched_class, both of which rank above CFS in the scheduling hierarchy. Additionally, the PREEMPT_RT work (developed out of tree for years and selectable in mainline since Linux 6.12 on x86, arm64, and RISC-V) turns the kernel into a fully preemptible, low-latency system that can reach sub-100µs worst-case latency on well-tuned hardware.

Understanding Linux real-time scheduling requires grasping the tension: the kernel must simultaneously provide deterministic latency for RT tasks while preventing RT tasks from starving the rest of the system — and while executing on hardware designed for throughput, not determinism.

Prerequisites

01-scheduling-fundamentals.md (preemption models, policy hierarchy)
03-linux-cfs.md (CFS context)
Understanding of interrupt handling and softirqs
Basic POSIX real-time API (sched_setscheduler, pthread_attr_setschedpolicy)

SCHED_FIFO

SCHED_FIFO is the simplest real-time policy: run until you block, yield, or are preempted by a higher-priority task. There is no time quantum. The task has absolute priority over all lower-priority tasks and runs to completion of its CPU burst.

Priority range: 1 (lowest RT) to 99 (highest RT). A SCHED_FIFO task at priority 50 preempts any CFS task (CFS tasks have an effective RT priority of 0) but is itself preempted by any RT task at priority 51 or higher.

Mechanics:
- Per-priority FIFO queues in the RT runqueue
- pick_next_task_rt() scans from priority 99 downward, returning the head of the first non-empty queue
- A SCHED_FIFO task at priority P is only preempted by a task at priority P+1 or higher being enqueued
- At the same priority: tasks are queued in FIFO order, run in FIFO order (no round-robin)

RT Runqueue (per CPU):
Priority 99: [Task H]       ← runs first, runs until blocked/completed
Priority 50: [Task M1][M2]  ← M1 runs when H blocks; M2 queued after M1
Priority  1: [Task L]       ← only runs when 50 and 99 queues are empty
CFS (pri 0): [Task C1][C2]  ← runs only when all RT queues empty
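
The queue structure above can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel's implementation: the names rt_runqueue, rt_enqueue, rt_dequeue_head, and rt_pick_next are hypothetical, the queues are small fixed-size rings, and the linear scan stands in for the bitmap search the kernel uses to find the highest non-empty priority.

```c
#include <stddef.h>

#define RT_PRIO_LEVELS 100  /* index 0 unused; 1..99 are the RT priorities */
#define QUEUE_CAP 8         /* arbitrary small capacity for this sketch */

struct fifo_queue {
    int tasks[QUEUE_CAP];   /* task ids in arrival order (ring buffer) */
    size_t head;
    size_t len;
};

struct rt_runqueue {
    struct fifo_queue queue[RT_PRIO_LEVELS];
};

/* Append a task at the tail of its priority level. Returns 0 on success. */
int rt_enqueue(struct rt_runqueue *rq, int prio, int task_id)
{
    if (prio < 1 || prio >= RT_PRIO_LEVELS)
        return -1;
    struct fifo_queue *q = &rq->queue[prio];
    if (q->len == QUEUE_CAP)
        return -1;
    q->tasks[(q->head + q->len) % QUEUE_CAP] = task_id;
    q->len++;
    return 0;
}

/* Remove the head of a level (the task blocked or exited). Returns its id or -1. */
int rt_dequeue_head(struct rt_runqueue *rq, int prio)
{
    if (prio < 1 || prio >= RT_PRIO_LEVELS)
        return -1;
    struct fifo_queue *q = &rq->queue[prio];
    if (q->len == 0)
        return -1;
    int task_id = q->tasks[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->len--;
    return task_id;
}

/* Head of the highest non-empty level, or -1 if no RT task is runnable. */
int rt_pick_next(const struct rt_runqueue *rq)
{
    for (int prio = RT_PRIO_LEVELS - 1; prio >= 1; prio--) {
        const struct fifo_queue *q = &rq->queue[prio];
        if (q->len > 0)
            return q->tasks[q->head];
    }
    return -1;
}
```

With tasks M1 and M2 queued at priority 50 and H at priority 99, rt_pick_next() returns H until H is dequeued, then M1, then M2. A return value of -1 corresponds to the point where the scheduler would fall through to CFS.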

SCHED_FIFO and starvation: A CPU-bound SCHED_FIFO task at priority 50 will permanently starve all lower-priority tasks on that CPU. This is by design — the application promised the kernel that this task needs real-time priority and will not monopolize the CPU. If it does, that is an application bug. The system-level mitigation is RT throttling.

API:

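#include <pthread.h>
#include <sched.h>
#include <stdio.h>

struct sched_param sp = { .sched_priority = 50 };
if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)  /* 0 = calling thread */
    perror("sched_setscheduler");  /* EPERM if neither CAP_SYS_NICE nor RLIMIT_RTPRIO permits it */

/* Or at thread creation: */
pthread_t thread;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
pthread_attr_setschedparam(&attr, &sp);
/* Without PTHREAD_EXPLICIT_SCHED the policy and priority set above are ignored. */
pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
int err = pthread_create(&thread, &attr, fn, arg);  /* returns an errno value such as EPERM */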

Requires CAP_SYS_NICE or RLIMIT_RTPRIO set appropriately. Modern systems use systemd's AmbientCapabilities or LimitRTPRIO to grant RT access to services.
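
Before requesting a priority, a program can check what the system will accept. The sketch below is illustrative only: clamp_fifo_priority and rtprio_soft_limit are hypothetical helper names, and the code assumes Linux, where RLIMIT_RTPRIO is available.

```c
#include <sched.h>
#include <sys/resource.h>

/* Clamp a requested priority to the valid SCHED_FIFO range on this system. */
int clamp_fifo_priority(int wanted)
{
    int lo = sched_get_priority_min(SCHED_FIFO);  /* 1 on Linux */
    int hi = sched_get_priority_max(SCHED_FIFO);  /* 99 on Linux */
    if (wanted < lo)
        return lo;
    if (wanted > hi)
        return hi;
    return wanted;
}

/* Highest RT priority the soft RLIMIT_RTPRIO permits without CAP_SYS_NICE.
 * Returns 0 if unprivileged RT scheduling is not allowed, -1 on error. */
long rtprio_soft_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_RTPRIO, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return sched_get_priority_max(SCHED_FIFO);
    return (long)rl.rlim_cur;
}
```

A service that wants priority 50 could compare it against rtprio_soft_limit() at startup and log a clear message when the limit is too low, instead of failing later with EPERM.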

SCHED_RR

SCHED_RR is SCHED_FIFO with a time quantum. If a SCHED_RR task exhausts its quantum without blocking, it is moved to the tail of its priority queue and the next SCHED_RR task at the same priority runs.

Default quantum: 100ms (tuneable via /proc/sys/kernel/sched_rr_timeslice_ms).

Query the quantum programmatically:

struct timespec ts;
sched_rr_get_interval(pid, &ts);
/* ts.tv_sec, ts.tv_nsec: the RR quantum for this process */

When to use SCHED_RR vs SCHED_FIFO:
- SCHED_FIFO: single real-time task, or multiple tasks at different priorities where explicit preemption is handled by priority levels
- SCHED_RR: multiple tasks at the same priority that should share CPU fairly. Example: 3 audio processing threads at priority 80 that should each get 100ms slices.

Starvation of lower priority remains: SCHED_RR only round-robins within its priority level. Lower-priority tasks still starve.
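
The quantum handling can be illustrated with a small model of a single priority level. This is a sketch under simplifying assumptions, not kernel code: rr_level and rr_tick are hypothetical names, time is accounted in whole milliseconds, and the queue is a short array that is shifted on rotation.

```c
#include <stddef.h>

#define RR_QUEUE_CAP 8

struct rr_level {
    int tasks[RR_QUEUE_CAP];  /* task ids; index 0 is the running head */
    size_t len;
    int slice_left_ms;        /* remaining quantum of the head task */
    int quantum_ms;           /* full quantum, e.g. 100 */
};

/* Charge elapsed_ms to the running head task. When its quantum is used up,
 * move it to the tail and give the new head a fresh quantum.
 * Returns the id of the task that runs next, or -1 if the level is empty. */
int rr_tick(struct rr_level *lv, int elapsed_ms)
{
    if (lv->len == 0)
        return -1;
    lv->slice_left_ms -= elapsed_ms;
    if (lv->slice_left_ms <= 0) {
        int expired = lv->tasks[0];
        for (size_t i = 1; i < lv->len; i++)
            lv->tasks[i - 1] = lv->tasks[i];
        lv->tasks[lv->len - 1] = expired;
        lv->slice_left_ms = lv->quantum_ms;
    }
    return lv->tasks[0];
}
```

With three tasks and a 100ms quantum, the head keeps the CPU for 100ms of accumulated ticks and then rotates to the tail. A SCHED_FIFO level would simply never call rr_tick, which is the whole difference between the two policies.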

RT Throttling

The critical safety valve: SCHED_FIFO and SCHED_RR tasks are throttled to prevent complete CPU starvation of CFS tasks and the kernel's own housekeeping threads.

/proc/sys/kernel/sched_rt_period_us  = 1000000  (1 second)
/proc/sys/kernel/sched_rt_runtime_us = 950000   (0.95 seconds)

Interpretation: RT tasks may run at most 950ms out of every 1000ms.
The remaining 50ms/s is guaranteed to CFS and the kernel.

Setting sched_rt_runtime_us = -1 disables throttling entirely.
(Required for some hard RT systems, but dangerous: a runaway RT task
locks up the system with no recovery path except NMI or hardware reset.)

RT throttling is implemented via per-runqueue bandwidth accounting:
- rt_rq->rt_time: CPU time used by RT tasks in current period
- rt_rq->rt_runtime: budget for current period
- When rt_time > rt_runtime: RT runqueue is throttled, CFS gets CPU
- At period boundary: rt_time resets, throttling lifted
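
The accounting steps listed above can be sketched as a small state machine. This follows the simplified description in this section (a full reset at each period boundary) and is not the kernel's code: rt_bandwidth_model, rt_account, and rt_period_boundary are hypothetical names, and time is tracked in microseconds to match the sysctl units.

```c
#include <stdbool.h>
#include <stdint.h>

struct rt_bandwidth_model {
    int64_t  runtime_us;   /* budget per period; negative means unlimited (-1) */
    uint64_t rt_time_us;   /* RT CPU time consumed in the current period */
    bool     throttled;
};

/* Charge delta_us of RT execution. Returns true if RT tasks may keep running. */
bool rt_account(struct rt_bandwidth_model *bw, uint64_t delta_us)
{
    bw->rt_time_us += delta_us;
    if (bw->runtime_us >= 0 && bw->rt_time_us > (uint64_t)bw->runtime_us)
        bw->throttled = true;  /* budget exceeded: CFS gets the CPU */
    return !bw->throttled;
}

/* Called at each period boundary: reset usage and lift throttling. */
void rt_period_boundary(struct rt_bandwidth_model *bw)
{
    bw->rt_time_us = 0;
    bw->throttled = false;
}
```

With runtime_us set to 950000, the model lets RT tasks run for exactly 950ms and throttles them on the next charge; setting runtime_us to -1 reproduces the "throttling disabled" case.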

Per-cgroup RT bandwidth (RT group scheduling):

# Allow a cgroup's RT tasks to use 500ms per second
# (cgroup v1 cpu controller, kernel built with CONFIG_RT_GROUP_SCHED):
echo 500000 > /sys/fs/cgroup/cpu/mygroup/cpu.rt_runtime_us
echo 1000000 > /sys/fs/cgroup/cpu/mygroup/cpu.rt_period_us

This allows per-container RT task budgets — essential in multi-tenant environments where one container's RT tasks should not starve others.

SCHED_DEADLINE

SCHED_DEADLINE implements Earliest Deadline First (EDF) scheduling with Constant Bandwidth Server (CBS) admission control. It is the most sophisticated and most correct real-time scheduling policy in Linux.

The Sporadic Task Model: Each task is characterized by three parameters:
- Runtime (runtime, WCET): worst-case execution time per job (in nanoseconds)
- Deadline (deadline): relative deadline from job arrival (nanoseconds)
- Period (period): minimum time between consecutive job arrivals (nanoseconds)

Example: a control loop task with period=10ms, runtime=2ms, deadline=8ms means: "Every 10ms, a new job arrives. It needs at most 2ms of CPU. It must complete within 8ms of arrival."

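Admission Control: The kernel refuses to accept a SCHED_DEADLINE task if doing so would push the total deadline utilization past the configured limit. That limit is sched_rt_runtime_us / sched_rt_period_us per CPU (0.95 with the default settings), summed over the M CPUs of the task's root domain. Specifically:

For each SCHED_DEADLINE task i: utilization_i = runtime_i / period_i
Total utilization U = Σ(runtime_i / period_i)
Limit = M × (sched_rt_runtime_us / sched_rt_period_us)   (0.95 for one CPU by default)

If U > Limit: admission refused (EBUSY)
If U ≤ Limit: admitted

Example (single CPU, default limit):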
  Task 1: runtime=2ms, period=10ms → U1 = 0.2
  Task 2: runtime=3ms, period=15ms → U2 = 0.2
  Task 3: runtime=1ms, period=5ms  → U3 = 0.2
  Total U = 0.6 → admitted

  Task 4 wants runtime=5ms, period=10ms → U4 = 0.5
  Total would be 1.1 → REFUSED

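On a single CPU with deadline equal to period, this admission test guarantees that every accepted task meets its deadlines, a guarantee no other Linux scheduling policy makes. On multiprocessor root domains, where Linux uses global EDF, passing the test bounds how late a job can finish but does not by itself rule out a missed deadline.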
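
The utilization arithmetic above can be checked with a short sketch. It is an illustration under stated assumptions rather than the kernel's code: dl_params, dl_bandwidth, and dl_admit are hypothetical names, the utilization cap is passed in by the caller, and ratios are stored as 20-bit fixed-point integers so that no floating point is needed.

```c
#include <stddef.h>
#include <stdint.h>

#define BW_SHIFT 20
#define BW_UNIT  ((uint64_t)1 << BW_SHIFT)  /* fixed-point representation of 1.0 */

struct dl_params {
    uint64_t runtime_ns;
    uint64_t period_ns;
};

/* runtime/period as a fixed-point fraction. Assumes runtime_ns < 2^44 so the
 * shift cannot overflow; returns 0 for an invalid zero period. */
uint64_t dl_bandwidth(struct dl_params p)
{
    return p.period_ns ? (p.runtime_ns << BW_SHIFT) / p.period_ns : 0;
}

/* Returns 1 if the candidate fits under cap_bw together with the n tasks
 * already admitted, 0 otherwise. */
int dl_admit(const struct dl_params *admitted, size_t n,
             struct dl_params candidate, uint64_t cap_bw)
{
    if (candidate.period_ns == 0 || candidate.runtime_ns > candidate.period_ns)
        return 0;
    uint64_t total = dl_bandwidth(candidate);
    for (size_t i = 0; i < n; i++)
        total += dl_bandwidth(admitted[i]);
    return total <= cap_bw;
}
```

Feeding in the three example tasks and then the 5ms/10ms candidate reproduces the result above: the first three fit, the fourth is refused, both with a cap of BW_UNIT (1.0) and with a 0.95 cap computed as dl_bandwidth({950000, 1000000}).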

Constant Bandwidth Server (CBS): Prevents a task from using more than its declared runtime in a period, even if it runs longer than expected. CBS tracks each task's remaining budget and replenishes it at the start of each period. If a task exhausts its budget, it is throttled until the next replenishment.
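
The budget tracking described above can be modeled as follows. This is a deliberately reduced sketch that mirrors the paragraph (budget consumed while running, throttle at zero, refill at each period start); it omits the deadline-postponement rules of the real CBS algorithm, and cbs_server, cbs_init, cbs_run, and cbs_advance are hypothetical names. It assumes period_ns is greater than zero.

```c
#include <stdbool.h>
#include <stdint.h>

struct cbs_server {
    uint64_t runtime_ns;         /* declared budget per period */
    uint64_t period_ns;          /* must be > 0 */
    uint64_t budget_ns;          /* budget left in the current period */
    uint64_t next_replenish_ns;  /* absolute time of the next refill */
    bool     throttled;
};

void cbs_init(struct cbs_server *s, uint64_t runtime_ns, uint64_t period_ns,
              uint64_t now_ns)
{
    s->runtime_ns = runtime_ns;
    s->period_ns = period_ns;
    s->budget_ns = runtime_ns;
    s->next_replenish_ns = now_ns + period_ns;
    s->throttled = false;
}

/* The task wants to run for want_ns. Returns how much CPU time it is granted
 * before the server throttles it. */
uint64_t cbs_run(struct cbs_server *s, uint64_t want_ns)
{
    if (s->throttled)
        return 0;
    uint64_t granted = want_ns < s->budget_ns ? want_ns : s->budget_ns;
    s->budget_ns -= granted;
    if (s->budget_ns == 0)
        s->throttled = true;  /* budget exhausted until the next refill */
    return granted;
}

/* Advance the clock; refill the budget for every period boundary crossed. */
void cbs_advance(struct cbs_server *s, uint64_t now_ns)
{
    while (now_ns >= s->next_replenish_ns) {
        s->budget_ns = s->runtime_ns;
        s->throttled = false;
        s->next_replenish_ns += s->period_ns;
    }
}
```

A task declared with runtime=2ms and period=10ms that tries to run for 2.5ms is granted 2ms and then throttled; it becomes runnable again only once cbs_advance() sees the 10ms boundary.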

Setting SCHED_DEADLINE (requires CAP_SYS_NICE or root):

struct sched_attr attr = {
    .size         = sizeof(attr),
    .sched_policy = SCHED_DEADLINE,
    .sched_flags  = 0,
    .sched_runtime  = 2000000,  /* 2ms */
    .sched_deadline = 8000000,  /* 8ms */
    .sched_period   = 10000000, /* 10ms */
};
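/* Older glibc versions ship no sched_setattr() wrapper, so use the raw syscall
   (needs <sys/syscall.h>, <unistd.h>, <stdio.h>). */
if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1)
    perror("sched_setattr");  /* EBUSY: admission refused; EPERM: missing privilege */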

/* Application loop: */
while (1) {
    do_control_work();
    sched_yield();  /* indicate job completion, sleep until next period */
}

SCHED_DEADLINE and CPU affinity: Linux schedules deadline tasks with global EDF inside each root domain, and admission control is accounted per root domain rather than per CPU. A SCHED_DEADLINE task therefore cannot have an affinity mask smaller than its root domain; attempts to restrict it are rejected (the check lives in dl_task_check_affinity). To get partitioned behavior, create exclusive cpusets and place the deadline tasks inside them.

dl_server for CFS integration: Recent kernels add dl_server, a deadline server that wraps the CFS runqueue (the infrastructure landed in Linux 6.8, and the fair-class server followed in 6.12). This gives CFS tasks a minimum CPU bandwidth guarantee even in the presence of RT tasks, preventing complete starvation without completely disabling RT preemption.

PREEMPT_RT Patchset

The PREEMPT_RT patchset (originally by Ingo Molnar and Thomas Gleixner, ongoing development by the RT Linux community) transforms Linux into a fully preemptible RTOS. Key changes:

Spinlocks → RT Mutexes

In a standard kernel, spinlocks busy-wait and disable preemption. A thread holding a spinlock cannot be preempted, even by a higher-priority RT task. This creates unbounded latency: if a low-priority interrupt handler holds a spinlock, an RT task waiting to acquire it spins (or blocks) until the interrupt handler completes, which may take milliseconds.

PREEMPT_RT replaces most spinlocks (spinlock_t) with rt_mutex-based sleeping locks that support priority inheritance; only raw_spinlock_t remains a true spinning lock:

Standard spinlock acquisition:
  preempt_disable()
  spin_until_acquired()
  [critical section — preemption disabled, RT tasks can't run]
  spin_unlock()
  preempt_enable()

PREEMPT_RT rt_mutex:
  lock_rt_mutex()  [if contended: task SLEEPS, another task can run]
  [critical section — preemptible, RT tasks CAN preempt]
  unlock_rt_mutex()
  [wake up any waiters with priority inheritance]

This allows the RT task to preempt even kernel code, provided that code is not in an NMI handler or other truly non-preemptible context.
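
Priority inheritance itself is a small idea: while a high-priority task waits on a lock, the owner temporarily runs at the waiter's priority. The model below is a sketch of that rule for a single lock, not the kernel's rt_mutex: task, pi_lock, pi_lock_acquire, and pi_lock_release are hypothetical names, and the release path simply drops the owner back to its base priority, which ignores nested locks and inheritance chains.

```c
#include <stddef.h>

#define MAX_WAITERS 4

struct task {
    int base_prio;  /* priority assigned by the application */
    int eff_prio;   /* priority the scheduler currently uses */
};

struct pi_lock {
    struct task *owner;
    struct task *waiters[MAX_WAITERS];
    size_t nwaiters;
};

/* Returns 1 if t acquired the lock, 0 if t was queued as a waiter (and the
 * owner was boosted if needed), -1 if the waiter table is full. */
int pi_lock_acquire(struct pi_lock *l, struct task *t)
{
    if (l->owner == NULL) {
        l->owner = t;
        return 1;
    }
    if (l->nwaiters == MAX_WAITERS)
        return -1;
    l->waiters[l->nwaiters++] = t;
    if (t->eff_prio > l->owner->eff_prio)
        l->owner->eff_prio = t->eff_prio;  /* owner inherits waiter's priority */
    return 0;
}

/* Drop the owner's boost and hand the lock to the highest-priority waiter.
 * Returns the new owner, or NULL if nobody was waiting. */
struct task *pi_lock_release(struct pi_lock *l)
{
    if (l->owner != NULL)
        l->owner->eff_prio = l->owner->base_prio;
    if (l->nwaiters == 0) {
        l->owner = NULL;
        return NULL;
    }
    size_t best = 0;
    for (size_t i = 1; i < l->nwaiters; i++)
        if (l->waiters[i]->eff_prio > l->waiters[best]->eff_prio)
            best = i;
    struct task *next = l->waiters[best];
    l->waiters[best] = l->waiters[--l->nwaiters];
    l->owner = next;
    return next;
}
```

If a priority-10 task owns the lock and a priority-80 task blocks on it, the owner's eff_prio becomes 80 until it releases the lock, so a priority-50 task cannot preempt it in the meantime.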

Threaded IRQ Handlers

In a standard kernel, hardware interrupt handlers run with interrupts disabled, at the highest priority level, non-preemptible. An interrupt handler that takes 500µs is 500µs of latency for any RT task waiting to run.

PREEMPT_RT converts interrupt handlers to kernel threads:

Standard IRQ flow:
  Hardware IRQ fires
  → CPU saves state, enters interrupt context
  → irq_handler() runs [NON-PREEMPTIBLE, interrupts disabled]
  → return from interrupt context

PREEMPT_RT threaded IRQ flow:
  Hardware IRQ fires
  → Minimal "hard IRQ" handler: acknowledges IRQ, wakes irq_thread
  → irq_thread() is a schedulable kernel thread at RT priority
  → Scheduler runs: can preempt irq_thread for higher-priority RT task
  → irq_thread() eventually runs, handles IRQ

This means an RT task at priority 80 will preempt an interrupt handler at priority 50, achieving the "RT task first" semantics the application programmer expects.

Preemptible RCU

Read-Copy-Update (RCU) grace periods in a standard kernel require all CPUs to pass through a quiescent state. PREEMPT_RT makes RCU fully preemptible (PREEMPT_RCU), allowing tasks in RCU read-side critical sections to be preempted.

Resulting Latency Profile

System Type              Typical Latency  Worst Case
─────────────────────────────────────────────────────
CONFIG_PREEMPT_NONE      1-10ms           seconds
CONFIG_PREEMPT           100µs-1ms        10ms+
PREEMPT_RT (x86)         10-50µs          <100µs
PREEMPT_RT (ARM Cortex)  20-100µs         <200µs
Dedicated RTOS (VxWorks) 1-10µs           <50µs

PREEMPT_RT has been merged into mainline Linux piece by piece. Since Linux 6.12, CONFIG_PREEMPT_RT can be selected in an unpatched mainline kernel on x86, arm64, and RISC-V; some remaining pieces are still carried in the out-of-tree patch queue.

Measuring RT Latency: cyclictest

cyclictest (part of the rt-tests package) is the standard tool for measuring scheduling latency on RT Linux systems:

# Basic cyclictest run: measure latency of 1 RT thread at priority 80
cyclictest --priority=80 --interval=1000 --loops=100000

# Multi-core measurement: one thread per CPU
cyclictest --priority=80 --interval=1000 --loops=100000 \
           --smp --mlockall --histogram=400 --histfile=/tmp/latency.hist

# With stress load (simulate real workload)
stress-ng --cpu 0 --io 4 --vm 2 &
cyclictest --priority=80 --interval=1000 --loops=1000000 --mlockall

Output:

T: 0 ( 1234) P:80 I:1000 C:100000 Min:    4 Act:   8 Avg:    9 Max:   47
              thread   prio interval  count  minimum current average maximum

Key: Max is the worst-case latency in microseconds. For a soft RT system (audio), Max under 1ms is acceptable. For hard RT (industrial control), Max under 100µs is typically required.

Latency sources to investigate:
- NMI handlers (hardware watchdog, PMU): disable the NMI watchdog with nmi_watchdog=0 on the kernel command line
- SMIs (System Management Interrupts): hardware-generated and invisible to the OS; measure them with hwlatdetect
- IRQ coalescing: disable with ethtool -C eth0 rx-usecs 0
- Memory page faults: lock and pre-fault memory with mlockall(MCL_CURRENT|MCL_FUTURE)
- CPU frequency scaling: disable with cpupower frequency-set -g performance
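
For the page-fault item, a common pattern is to lock memory once at startup and touch every page of the buffers the RT path will use, so that no fault occurs inside the time-critical loop. The helpers below are a minimal sketch of that pattern; rt_lock_memory and rt_prefault_heap are hypothetical names, and mlockall() can fail with EPERM or ENOMEM when RLIMIT_MEMLOCK is too low.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Lock all current and future pages of the process into RAM.
 * Returns 0 on success, -1 on failure with errno set. */
int rt_lock_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Allocate `bytes` of heap and write to every page so the kernel maps it now
 * rather than on first use. Returns the buffer, or NULL on failure. */
void *rt_prefault_heap(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    if (bytes == 0 || page <= 0)
        return NULL;

    unsigned char *buf = malloc(bytes);
    if (buf == NULL)
        return NULL;

    /* volatile keeps the compiler from discarding the page-touching stores */
    volatile unsigned char *touch = buf;
    for (size_t off = 0; off < bytes; off += (size_t)page)
        touch[off] = 0;
    touch[bytes - 1] = 0;  /* buffer may not be page-aligned: cover the last page */
    return buf;
}
```

A typical startup sequence calls rt_lock_memory() first, then rt_prefault_heap() for each working buffer, and only then switches to SCHED_FIFO and enters the control loop.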

Use Cases in Practice

JACK Audio Server: Linux's professional audio framework. Audio callback thread runs at SCHED_FIFO priority 70. Must deliver 128 or 256 audio frames every 2.9ms or 5.8ms (at 44.1kHz) without gaps. Without RT scheduling, buffer underruns occur whenever the audio thread is delayed by CFS. With SCHED_FIFO, audio thread preempts all CFS tasks and runs immediately when needed. JACK popularized the need for user-space RT access on desktop Linux.

Robotics (ROS 2): Robot Operating System 2 uses RT scheduling for motor control loops that must execute at 1kHz (1ms period) with <200µs jitter. Exceeding jitter bounds causes vibration, positioning errors, or mechanical damage. ROS 2's Executor has explicit RT thread support, and the ros2_realtime_benchmarks project tracks scheduling latency regression.

Industrial Control (Siemens PLCs, Beckhoff TwinCAT): Industrial controllers running EtherCAT bus protocols require 250µs-1ms cycle times with jitter below 10µs. Some use PREEMPT_RT Linux as the real-time OS, with a separate CFS system co-running for the HMI and logging. The RT tasks are isolated to specific CPUs with isolcpus and irqaffinity.

Mars Pathfinder (VxWorks): One of the most famous real-time scheduling case studies (see 08-priority-inversion.md). The Mars rovers Spirit and Opportunity used VxWorks with SCHED_FIFO-style fixed-priority preemptive scheduling. The more recent Mars helicopter Ingenuity, by contrast, ran Linux.

Financial Trading: High-frequency trading systems require order response latency under 10µs. They typically use SCHED_FIFO, CPU pinning, isolcpus, DPDK (user-space networking), and specialized hardware. Some use custom kernels based on PREEMPT_RT.

Debugging Notes

# Check current RT parameters for a process
chrt -p [pid]
# Output: scheduling policy and priority

# Set SCHED_FIFO priority 50:
chrt -f -p 50 [pid]

# Monitor RT throttling:
watch -n1 "grep -r . /proc/sys/kernel/sched_rt_*"
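# If sched_rt_runtime_us < sched_rt_period_us, throttling is enabled (-1 disables it)

# Check whether an RT runqueue is currently throttled (needs scheduler debug
# output; field availability depends on kernel version and config):
grep rt_throttled /sys/kernel/debug/sched/debug   # /proc/sched_debug on older kernels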

# View SCHED_DEADLINE task parameters:
grep 'dl\.' /proc/[pid]/sched  # shows dl.runtime and dl.deadline for deadline tasks

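# ftrace for RT scheduling (trace events report the kernel prio; RT tasks have prio < 100):
echo nop > /sys/kernel/debug/tracing/current_tracer
echo sched_switch sched_wakeup > /sys/kernel/debug/tracing/set_event
echo 'next_prio < 100' > /sys/kernel/debug/tracing/events/sched/sched_switch/filter
cat /sys/kernel/debug/tracing/trace_pipe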

# Identify long interrupt handlers (source of RT latency):
cat /proc/interrupts  # count per-CPU; watch for rapid growth of specific IRQs
perf top -e irq:irq_handler_entry  # which handlers consume most time

Security Implications

Denial of service via RT scheduling: A process at SCHED_FIFO priority 99 can monopolize a CPU indefinitely. If an attacker gains CAP_SYS_NICE (or RLIMIT_RTPRIO allows it), they can pin a CPU with an RT task and starve everything else on it. Mitigations: sched_rt_runtime_us throttling, container-level cpu.rt_runtime_us limits, and seccomp filters that block sched_setscheduler() and sched_setattr().

RT scheduling in containers: Kubernetes does not expose SCHED_FIFO to containers by default (it requires CAP_SYS_NICE, which is not granted). Privileged containers can set RT policies, which makes a misconfigured privileged container a denial-of-service vector against the host.

Priority inversion as a security vulnerability: A carefully crafted priority inversion scenario can create a denial-of-service even without RT privileges. The Mars Pathfinder case showed this as an operational safety issue; in adversarial contexts it can be a security exploit.

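SCHED_DEADLINE and resource exhaustion: Admission control prevents over-subscribing the CPU with SCHED_DEADLINE tasks. However, an attacker with the right privileges could reserve up to the admission limit (95% of each CPU by default), leaving only 5% for the rest of the system. The cgroup cpu.rt_runtime_us limits apply to SCHED_FIFO and SCHED_RR tasks, so deadline bandwidth has to be constrained through the global sched_rt_runtime_us / sched_rt_period_us limit and through control over who is granted CAP_SYS_NICE.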

Performance Implications

PREEMPT_RT's rt_mutex vs spinlock conversion adds overhead to every lock acquisition: a sleeping mutex has higher overhead than a spinlock when uncontended (~20-50ns extra). This is the PREEMPT_RT throughput tax for latency determinism.
Threaded IRQ handlers add scheduling overhead for every interrupt. On systems with very high interrupt rates (>100K/s network IRQs), this can add measurable CPU overhead.
mlockall() for RT processes prevents page faults during execution but requires the working set to be physically resident at all times — increases memory pressure on the system.
SCHED_DEADLINE tasks cannot be pinned to a subset of their root domain with sched_setaffinity(); placement is controlled through exclusive cpusets. NUMA-aware placement of deadline tasks therefore requires manual cpuset configuration.

Failure Modes

Runaway RT task: SCHED_FIFO task enters infinite loop, holds CPU forever. Mitigation: RT throttling. Detection: watchdog daemon using a timer at lower RT priority that resets a counter; if counter not updated, trigger recovery.
Priority inversion: Low-priority task holds lock needed by RT task, medium-priority task preempts low-priority task, RT task starved. Full solution: priority inheritance (rt_mutex). See 08-priority-inversion.md.
Deadline miss: SCHED_DEADLINE task exceeds its declared runtime per period. CBS throttles the task until the next period. If this happens consistently, the task's runtime declaration is wrong — must be corrected (re-measurement, margin increase).
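RT task blocked on a non-preemptible lock: If an RT task calls a kernel function that contends for a spinlock on a standard (non-PREEMPT_RT) kernel, its latency depends on how long the holder runs with preemption disabled and is effectively unbounded. PREEMPT_RT fixes this; standard kernels require careful audit of RT code paths.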

Future Directions

PREEMPT_RT mainlining: Linux 6.12 made CONFIG_PREEMPT_RT selectable in Linus Torvalds's tree on x86, arm64, and RISC-V. The remaining pieces, such as support for further architectures and subsystem-specific latency fixes, continue to be merged incrementally.

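SCHED_DEADLINE improvements: Linux schedules deadline tasks with global EDF inside each root domain, which keeps CPUs busy but means the utilization-based admission test only bounds tardiness on multiprocessors. Open questions include tighter admission tests that account for migration, and better support for partitioned placement, which is easier to analyze but can underutilize CPUs.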

RT + CFS unified deadline scheduling: The dl_server work (infrastructure in Linux 6.8, fair-class server in 6.12) is a step toward treating CFS as a bandwidth server with deadline guarantees, unifying RT and fair scheduling in a single framework.

Hardware support: ARM's FEAT_ETE (Embedded Trace Extension) and Intel's PT (Processor Trace) can provide hardware-level timing data, enabling finer-grained RT latency measurement and diagnosis than software tracing.

Exercises

Basic SCHED_FIFO: Write a C program that sets itself to SCHED_FIFO priority 50, then measures the jitter of a 1ms sleep loop using clock_gettime(CLOCK_MONOTONIC). Compare jitter (max - min wakeup delay) with and without RT scheduling.
RT throttling observation: Set sched_rt_runtime_us=500000 (50% budget). Run a SCHED_FIFO CPU hog. Verify that it consumes about 50% CPU using top. What happens to the system when the RT task is throttled?
cyclictest baseline: Run cyclictest --priority=80 --interval=1000 --loops=100000 --mlockall on your system. Record Min/Avg/Max latency. Then introduce CPU stress with stress-ng --cpu $(nproc) and repeat. How much does max latency change?
SCHED_DEADLINE task: Write a simple deadline task (period=10ms, runtime=2ms). Use sched_setattr(). Verify with chrt -p [pid] that SCHED_DEADLINE is active. Instrument with CLOCK_THREAD_CPUTIME_ID to verify the task consumes ~20% CPU.
Priority inversion setup: Using three threads at different SCHED_FIFO priorities and a shared pthread mutex with the default protocol (no priority inheritance), create a priority inversion scenario. Measure how long the high-priority thread is delayed. Then set PTHREAD_PRIO_INHERIT on the mutex attributes with pthread_mutexattr_setprotocol() and measure again.

References

POSIX.1b Real-Time Extensions specification (<sched.h>, sched_setscheduler)
Linux Kernel Documentation: Documentation/scheduler/sched-deadline.rst
Linux Kernel Documentation: Documentation/scheduler/sched-rt-group.rst
Thomas Gleixner, "PREEMPT_RT: Real Time Linux", Linux Plumbers Conference, various years
Abeni, L. and Buttazzo, G., "Integrating Multimedia Applications in Hard Real-Time Systems", RTSS 1998 — CBS algorithm
Buttazzo, G., "Hard Real-Time Computing Systems", Springer — comprehensive RTOS theory
rt-tests package: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
JACK Audio: jackd source, JackPosixThread.cpp for RT thread setup
Fohrenbach, N. et al., "Analysis of Real-Time Linux", Embedded Linux Conference 2019
Linux source: kernel/sched/rt.c, kernel/sched/deadline.c, kernel/locking/rtmutex.c