PREEMPT_RT Linux

Overview

PREEMPT_RT is the real-time preemption work (long maintained as an out-of-tree patchset, in mainline since Linux 6.12) that turns the Linux kernel from a high-throughput general-purpose operating system into a real-time-capable one. Rather than replacing Linux with a purpose-built RTOS, PREEMPT_RT surgically modifies the kernel's concurrency and interrupt model to bound worst-case latency, preserving the full Linux userspace while enabling worst-case response times below 100µs on suitably tuned commodity hardware.

The work is significant because it demonstrates that a production-grade, feature-complete operating system can achieve bounded, real-time-capable latency through principled engineering of its concurrency primitives, without sacrificing the POSIX API, device driver ecosystem, or networking stack that make Linux so valuable.

Prerequisites

Linux kernel internals: spinlocks, RCU, softirqs, interrupt handling
Real-time fundamentals: scheduling theory, jitter, WCET (see 01-real-time-fundamentals.md)
POSIX real-time APIs: sched_setscheduler, SCHED_FIFO, SCHED_RR
Basic understanding of hardware clock sources and timers
Familiarity with kernel build system (Kconfig, make)

Historical Context

Timeline of PREEMPT_RT development:

2004  Ingo Molnar posts first PREEMPT_RT patch to LKML
      Initial focus: ARM and x86, basic spinlock-to-mutex conversion

2005  Thomas Gleixner begins hrtimer (high-resolution timer) work
      Steve Rostedt joins, focuses on latency tracing infrastructure

2006  First "production use" reports from industrial users
      Dual-kernel approach (Xenomai, RTLinux) loses industrial momentum

2007  TimeSys, Mentor Graphics, Wind River shipping PREEMPT_RT products
      cyclictest becomes standard benchmark

2009  "The Big Kernel Lock" (BKL) removal accelerates — enables better RT

2011  Kernel 3.0: key PREEMPT_RT pieces already in mainline (hrtimers since 2.6.16, threaded IRQs since 2.6.30)
      Wake-up latency consistently <500µs on x86 with patches

2015  Active development: priority inheritance in RCU, lockdep for RT mutexes
      Sub-100µs latency achieved on i7 hardware
      Linux Foundation Real-Time Linux collaborative project formed

2017  Red Hat, SUSE ship PREEMPT_RT kernels for industrial/telco customers

2021  5.15 LTS: Significant PREEMPT_RT infrastructure merged into mainline
      rtmutex-based sleeping spinlocks and rwlocks merged; printk rework still pending

2024  Linux 6.12: PREEMPT_RT fully merged into mainline Linux kernel
      No separate patch needed on x86, ARM64, and RISC-V

The 20-year journey from patchset to mainline reflects the difficulty of making a complex, widely-deployed kernel surgically real-time without breaking anything. Every change had to maintain backward compatibility and pass the scrutiny of the entire Linux kernel maintainer community.

Architecture: What PREEMPT_RT Changes

Standard Linux Kernel (without PREEMPT_RT):
+--------------------------------------------------+
|  Userspace SCHED_FIFO task at priority 99        |
+--------------------------------------------------+
    | (task wants to run)
    v
+--------------------------------------------------+
|  Kernel: doing network softirq processing        |
|  Spinlock held, interrupts enabled, NOT          |
|  preemptible by userspace — even priority 99     |
|  Kernel section may run for 1-10ms               |
+--------------------------------------------------+
    | (eventually finishes, returns to userspace)
    v
    Task runs — but 1-10ms late

PREEMPT_RT Kernel:
+--------------------------------------------------+
|  Userspace SCHED_FIFO task at priority 99        |
+--------------------------------------------------+
    | (task wants to run)
    v
+--------------------------------------------------+
|  Kernel: network softirq in an RT-mutex section  |
|  Running as a threaded softirq at lower priority |
|  Priority 99 task PREEMPTS kernel thread         |
+--------------------------------------------------+
    v
    Task runs — within <100µs

The Five Core Changes

1. Threaded Interrupt Handlers

Before: Hardware interrupt handlers (hardirq) run in atomic context — interrupts disabled, cannot sleep, cannot be preempted.

After: Most hardirq handlers are converted to run in a dedicated kernel thread. The hardware IRQ line triggers a minimal "hardirq top half" that acknowledges the hardware and wakes the IRQ thread. The bulk of interrupt processing happens in the thread, which is a normal schedulable entity.

Before PREEMPT_RT:
  Hardware IRQ fires
    -> CPU enters hardirq context (atomic, non-preemptible)
    -> Full handler runs (may be 10s-100s µs)
    -> Return to interrupted context

After PREEMPT_RT:
  Hardware IRQ fires
    -> CPU enters minimal hardirq (atomic, ~1µs)
       - Acknowledge hardware
       - Wake IRQ thread
    -> Return to whatever was interrupted
    -> Scheduler runs IRQ thread at its configured priority
    -> IRQ thread runs full handler (preemptible by higher-priority RT task)

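Configuration: on PREEMPT_RT kernels forced threading is always active. Non-RT kernels built with CONFIG_IRQ_FORCED_THREADING=y can enable it with the threadirqs boot parameter. Individual drivers can opt out with IRQF_NO_THREAD for handlers that must run in hard interrupt context (timer interrupts, for example).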

2. RT Mutexes Replacing Spinlocks

Before: Spinlocks are the kernel's primary mutual exclusion primitive. They busy-wait (spin) and disable preemption — ensuring bounded, atomic critical sections but making the kernel non-preemptible during their hold.

After: spinlock_t (and rwlock_t) are reimplemented on top of rt_mutex, a sleepable, priority-inheriting lock. Driver source does not change: the same spin_lock() call can now block (sleep) on contention, allowing higher-priority tasks to run.

// Same source on both kernels
spin_lock(&some_lock);
// Non-RT: preemption disabled, waiters busy-wait
// PREEMPT_RT: section is preemptible, waiters sleep with priority inheritance
do_work();
spin_unlock(&some_lock);

// Sections that must stay atomic on PREEMPT_RT opt in explicitly
raw_spin_lock(&some_raw_lock);
// critical section - non-preemptible on every kernel, keep it short
do_atomic_work();
raw_spin_unlock(&some_raw_lock);

Crucially, rt_mutex implements the priority inheritance protocol: if a high-priority task waits for an rt_mutex held by a low-priority task, the low-priority task temporarily inherits the high-priority task's scheduling priority. This prevents priority inversion within the kernel.

Not all spinlocks can be converted: raw spinlocks (raw_spinlock_t) remain for truly atomic sections (NMI handlers, scheduler internals). PREEMPT_RT distinguishes these explicitly.
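The priority inheritance protocol described above is also reachable from userspace, where PI futexes are built on the kernel's rt_mutex. The following is a minimal sketch using the POSIX mutex attribute API; the helper name pi_mutex_init is hypothetical and not part of any library.

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Initialise *m as a priority-inheriting mutex.
 * Returns 0 on success or a positive errno value (pthread convention).
 * pi_mutex_init is an illustrative helper, not a library function. */
int pi_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err)
        return err;

    /* PTHREAD_PRIO_INHERIT: while a higher-priority thread is blocked on
     * the mutex, the owner is boosted to that thread's priority. */
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}
```

Any mutex shared between an RT thread and lower-priority threads is a candidate for this attribute; the default protocol (PTHREAD_PRIO_NONE) leaves the RT thread exposed to priority inversion.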

3. Preemptible RCU

Read-Copy-Update (RCU) is Linux's scalable reader-writer synchronization mechanism. In the classic (non-preemptible) implementation, a read-side critical section runs with preemption disabled, and a grace period ends once every CPU has passed through a quiescent state such as a context switch. That keeps readers cheap, but every rcu_read_lock() section becomes a region that a high-priority task cannot preempt.

PREEMPT_RT builds on CONFIG_PREEMPT_RCU (also used by CONFIG_PREEMPT kernels): RCU readers can be preempted. A preempted reader is tracked, and the grace period waits for it to leave its critical section. This adds overhead but removes a critical source of unbounded latency.

4. High-Resolution Timers (hrtimer)

Before: Linux timer wheel operates at HZ resolution (typically 100-1000Hz). schedule_timeout(1) sleeps for one jiffy — up to 10ms at HZ=100. All wakeups are aligned to the next tick.

After: hrtimer subsystem provides nanosecond-resolution timers backed by hardware clock sources (HPET, TSC deadline timer on x86, arch timer on ARM).

Standard Linux timer:
  Request sleep for 500µs
  |<--- up to 10ms --->|  (aligned to next 10ms tick)
  Actually wakes here

hrtimer:
  Request sleep for 500µs
  |<- ~500µs ± jitter ->|
  Actually wakes here (programmed directly to hardware comparator)

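POSIX clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, ...) is backed by hrtimers on any kernel built with CONFIG_HIGH_RES_TIMERS. The deadline is programmed with nanosecond resolution; on a tuned PREEMPT_RT system the achieved wake-up accuracy is typically in the microsecond range. CLOCK_REALTIME works the same way but is subject to wall-clock adjustments, so periodic RT loops usually prefer CLOCK_MONOTONIC.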
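A periodic loop built on absolute deadlines might look like the sketch below. It uses CLOCK_MONOTONIC so wall-clock adjustments cannot move the deadline, and it measures wake-up latency the way cyclictest does (actual wake time minus programmed deadline). The function names timespec_add_ns and run_periodic are hypothetical helpers chosen for illustration.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns (ns >= 0), keeping tv_nsec in [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_nsec += ns;
    while (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec++;
    }
}

/* Run `cycles` periods of `period_ns` and return the largest observed
 * wake-up latency in nanoseconds, or -1 on error. */
int64_t run_periodic(long period_ns, int cycles)
{
    struct timespec next, now;
    int64_t max_lat = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < cycles; i++) {
        timespec_add_ns(&next, period_ns);

        /* Absolute deadline: if a signal interrupts the sleep (EINTR),
         * sleeping again until the same deadline is all that is needed. */
        int err;
        do {
            err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (err == EINTR);
        if (err != 0)
            return -1;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        int64_t lat = (int64_t)(now.tv_sec - next.tv_sec) * NSEC_PER_SEC
                    + (now.tv_nsec - next.tv_nsec);
        if (lat > max_lat)
            max_lat = lat;

        /* periodic work would run here */
    }
    return max_lat;
}
```

Because each deadline is derived from the previous deadline rather than from the current time, a late wake-up in one cycle does not shift the following cycles.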

5. Fully Preemptible Kernel

The combination of the above changes enables the PREEMPT_RT preemption model — the kernel is preemptible everywhere except:

raw_spinlock_t sections (used only for truly atomic operations)
NMI handlers
The scheduler's own critical sections

This means a userspace SCHED_FIFO priority 99 thread can preempt almost any kernel execution path, bounding scheduling latency to the time the scheduler itself takes plus any raw spinlock hold time.

Preemption Models in Linux

Linux configures the preemption model at compile time via Kconfig:

Preemption model options (Kconfig choice):

  PREEMPT_NONE (server default):
    Kernel preemption only at explicit schedule() calls
    Lowest overhead, worst RT latency (ms range)

  PREEMPT_VOLUNTARY (desktop default):
    PREEMPT_NONE plus extra voluntary rescheduling points (might_sleep())
    Slightly better latency at negligible throughput cost

  PREEMPT (low-latency desktop):
    Preemption at all non-spinlock-held kernel sections
    Better latency than NONE, some overhead
    Typical worst-case: hundreds of µs

  PREEMPT_RT (real-time):
    Fully preemptible kernel (spinlocks -> rt_mutex)
    Threaded IRQs
    Worst-case latency: ~20-100µs on tuned x86/ARM64 hardware
    Available in mainline since Linux 6.12

Achieving <100µs Latency: System Configuration

Getting sub-100µs latency on real hardware requires more than just a PREEMPT_RT kernel. The full stack must be tuned:

CPU Isolation

# Isolate CPUs 2 and 3 from the general scheduler
# (kernel bootarg)
isolcpus=2,3 rcu_nocbs=2,3 nohz_full=2,3

# Pin RT task to isolated CPU
taskset -c 2 cyclictest --priority=99 --interval=200
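An application can also pin itself and select its scheduling policy from code instead of relying on taskset and chrt. The sketch below assumes Linux with glibc; make_rt_on_cpu is a hypothetical helper name. Without CAP_SYS_NICE or a sufficient RLIMIT_RTPRIO the policy change is expected to fail with EPERM.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/* Pin the calling thread to `cpu` and switch it to SCHED_FIFO at `prio`.
 * Returns 0 on success or a negative errno value. */
int make_rt_on_cpu(int cpu, int prio)
{
    /* Validate the priority first so a bad argument is reported as
     * -EINVAL regardless of privileges. */
    if (prio < sched_get_priority_min(SCHED_FIFO) ||
        prio > sched_get_priority_max(SCHED_FIFO))
        return -EINVAL;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* 0 = calling thread */
        return -errno;

    struct sched_param sp = { .sched_priority = prio };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -errno;

    return 0;
}
```

Calling make_rt_on_cpu(2, 80) early in main() would correspond to the isolated-CPU setup above; the isolcpus boot argument is still needed to keep other tasks off that CPU.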

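IRQ Affinity

# Move all non-critical IRQs off RT CPUs (run as root; stop irqbalance first
# or it will undo these settings)
for irq in /proc/irq/*/smp_affinity_list; do
    # Per-CPU and managed IRQs reject the write; ignore those errors
    echo 0-1 > "$irq" 2>/dev/null  # Route IRQs to CPUs 0-1 only
done

# Verify: print per-IRQ counts for CPU2 and CPU3; they should stop increasing
awk 'NR > 1 { print $1, $4, $5 }' /proc/interrupts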

CPU Frequency Scaling

P-states and C-states (deep sleep states) cause latency spikes when the CPU must ramp up frequency or wake from deep sleep:

# Disable CPU frequency scaling on RT CPUs
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor

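# Disable deep C-states (latency vs power tradeoff)
# Writing 0 to /dev/cpu_dma_latency requests "no deep sleep", but the request
# only lasts while the file descriptor stays open, so hold it in the shell:
exec 3> /dev/cpu_dma_latency
echo 0 >&3  # dropped again when fd 3 is closed (exec 3>&-) or the shell exits

# Or disable individual idle states per CPU:
# echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/state<N>/disable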
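An RT application can hold the PM QoS request itself for as long as it runs. The sketch below writes a 32-bit value to the device and returns the descriptor, which the caller must keep open; the helper name hold_cpu_dma_latency and the path parameter are illustrative choices, with "/dev/cpu_dma_latency" as the normal argument.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Open the PM QoS device at `path` (normally "/dev/cpu_dma_latency") and
 * request a maximum CPU wake-up latency of `max_us` microseconds.
 * Returns the open file descriptor, or -1 on failure.
 * The kernel drops the request as soon as the descriptor is closed. */
int hold_cpu_dma_latency(const char *path, int32_t max_us)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* The device accepts the limit as a binary 32-bit integer. */
    if (write(fd, &max_us, sizeof(max_us)) != (ssize_t)sizeof(max_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A typical use is to call hold_cpu_dma_latency("/dev/cpu_dma_latency", 0) during start-up and close the returned descriptor only at shutdown. cyclictest applies the same mechanism while it measures.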

Memory Locking

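#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

// Prevent RT task from faulting on memory access during execution
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
    perror("mlockall");  // needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK

// Pre-fault stack to avoid page fault in critical section
// (MAX_STACK_SIZE: the task's worst-case stack depth, chosen by the application)
char stack_buf[MAX_STACK_SIZE];
memset(stack_buf, 0, sizeof(stack_buf));  // touch all stack pages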

Interrupt Coalescing

# NIC interrupt coalescing causes latency spikes
# Disable for RT network processing
ethtool -C eth0 rx-usecs 0 tx-usecs 0

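# netdev_budget_usecs caps how long one NAPI polling cycle may run (in µs);
# 0 is the most aggressive setting, so verify its effect on your driver
echo 0 > /proc/sys/net/core/netdev_budget_usecs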

cyclictest: Measuring RT Performance

cyclictest output interpretation:

$ cyclictest -l10000000 -m -Sp99 -i200 -h400 -q

T: 0 ( 7890) P:99 I:200 C:10000000 Min:   3 Act:  6 Avg:   7 Max:  47

Column:   T=thread  PID  Priority  Interval(µs)  Count  Min  Act  Avg  Max

Histogram:
 000003   47382    <- 3µs bucket: 47382 occurrences
 000004  291847
 000005  8731024  <- mode: most samples here
 000006  891234
 ...
 000047       1   <- max: 47µs, only 1 occurrence

Rule-of-thumb targets for industrial RT (application-specific; functional-safety standards such as IEC 61508 do not prescribe latency numbers):
  - Max < 100µs: acceptable for most soft-RT
  - Max < 50µs:  good for control applications
  - Max < 20µs:  required for sub-millisecond control loops
  - Max > 1ms:   indicates tuning issue or non-RT kernel

Stress Testing

The worst-case latency only appears under realistic system stress:

# Concurrent stress while measuring with cyclictest:
# CPU load
stress-ng --cpu 4 --io 4 &

# Memory pressure
stress-ng --vm 2 --vm-bytes 80% &

# Network load
iperf3 -s &
sleep 1; iperf3 -c localhost -t 60 &

# File I/O
fio --name=stress --rw=randrw --bs=4k --size=1G --ioengine=libaio &

# Run cyclictest with all stress active
cyclictest -l1000000 -p99 -i200 -m

PREEMPT_RT and the printk Problem

One of the last significant sources of latency in PREEMPT_RT was printk. The kernel's print function acquired a console driver lock, which could hold for milliseconds while a slow serial console flushed.

The solution, completed in mainline with Linux 6.12 as the last prerequisite for enabling PREEMPT_RT, was threaded printk: log writes go to an in-kernel ring buffer, and dedicated kernel printing threads flush to consoles. RT tasks calling printk no longer block waiting for console I/O.

This change eliminated a class of latency spikes that was difficult to diagnose because it only appeared when kernel code paths hit debug printk calls under load.

Latency Tracing with ftrace

# Enable latency tracer
echo preemptirqsoff > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on

# Run workload, then capture worst-case trace
cat /sys/kernel/debug/tracing/trace | head -100

# Output shows exact kernel functions executing during worst-case latency:
# preemptirqsoff latency trace v1.1.5 on 5.15.0-rt17
# ------------------------------------------------------------
# latency: 67 us, #4/4, CPU#2 | (M:RT VP:0, KP:0, SP:0 HP:0)
#    -----------------
#    | task: cyclictest-1234 (uid:0 nice:0 policy:1 rt_prio:99)
#    -----------------
#    =>  started at: _raw_spin_lock_irqsave <some_driver_lock>
#    =>  ended at:   _raw_spin_unlock_irqrestore

The rtla timerlat tool (Linux 5.17+) provides an automated, structured analysis:

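# rtla timerlat top: shows per-CPU latency breakdown
# -P f:99 runs the measurement threads at SCHED_FIFO 99 (-p would set the
# timer period instead), -d sets the duration, -T 100 stops the trace when
# thread latency exceeds 100µs
rtla timerlat top -P f:99 -d 60s -T 100

# Illustrative output (layout simplified):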
# Timer Latency (µs)                     [0-99µs histogram]
#                IRQ    Kernel   User
# CPU  0: avg   3.1     4.2     7.5    max  12    18    31
# CPU  1: avg   2.9     4.0     7.2    max  11    16    28
# CPU  2: avg   3.0     4.1     7.3    max  10    15    26

IRQ latency = hardware timer fires to IRQ handler entry
Kernel latency = IRQ handler entry to wakeup posted
User latency = wakeup posted to task actually executing

Hardware Considerations

PREEMPT_RT latency is ultimately bounded by hardware behavior:

SMIs (System Management Interrupts)

SMIs are the worst latency villain on x86. The CPU silently enters System Management Mode (SMM) at the BIOS/firmware's request — completely invisible to the OS, uninterruptible, can run for 50-200µs.

# Detect SMI activity (Intel CPUs; requires msr-tools and the msr kernel module):
rdmsr -p 2 0x34  # Read SMI counter on CPU 2
# Run workload for 60s
rdmsr -p 2 0x34  # If count increased, SMIs occurred

# Latency spike coinciding with SMI is identifiable because:
# - ftrace shows no kernel activity during the gap
# - The CPU appears to "disappear" for the SMI duration

SMIs cannot be disabled on most commercial hardware. Industrial RT platforms (Beckhoff Industrial PCs, ADLINK) specifically disable all non-essential SMI sources in firmware.
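The "CPU disappears" symptom can be roughly approximated from userspace by polling a clock back-to-back and recording the largest gap between consecutive readings. This is only a sketch of the idea: unlike an in-kernel detector, a userspace loop also sees preemption and ordinary interrupts, so it is meaningful only on an isolated CPU at high RT priority. The helper name max_clock_gap_ns is hypothetical.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Poll the clock continuously for about `duration_ns` and return the
 * largest gap (in ns) seen between two consecutive readings. */
int64_t max_clock_gap_ns(int64_t duration_ns)
{
    int64_t start = now_ns();
    int64_t prev = start;
    int64_t max_gap = 0;

    for (;;) {
        int64_t t = now_ns();
        if (t - prev > max_gap)
            max_gap = t - prev;
        prev = t;
        if (t - start >= duration_ns)
            break;
    }
    return max_gap;
}
```

A gap far above the usual back-to-back reading time, combined with an increase in the SMI counter over the same interval, points at firmware rather than the kernel.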

APIC Timer Behavior

On x86, TSC-deadline mode (APIC_LVT_TIMER_TSCDEADLINE) programs the local APIC timer to fire at an exact TSC cycle count. This eliminates the timer hardware's quantization error, enabling sub-microsecond timer resolution. The kernel selects this mode automatically when the CPU supports it; check for the tsc_deadline_timer flag in /proc/cpuinfo.

ARM64 Considerations

ARM Cortex-A systems generally have more predictable interrupt latency than x86:

No SMI equivalent
GIC (Generic Interrupt Controller) has deterministic interrupt routing
ARM generic timer (CNTP_TVAL_EL0) provides reliable high-resolution timing
PREEMPT_RT on ARM64 typically achieves <50µs on modern Cortex-A57/A72/A78 class hardware

Kernel Threads and Priorities

After enabling PREEMPT_RT, kernel threads run at specific priorities. RT application threads must be at higher priority than competing kernel threads:

# View kernel thread priorities (RT threads):
ps -eo pid,rtprio,class,comm | grep -E "FF|RR" | sort -k2 -rn

# Critical kernel threads and typical default priorities (verify with ps above):
# irq/XX-<name>   SCHED_FIFO priority 50 (IRQ threads, adjustable with chrt)
# ktimers/X       SCHED_FIFO priority 1 (timer softirq thread)
# ksoftirqd/X     SCHED_OTHER (softirq daemon)
# kworker/X       SCHED_OTHER (no RT priority)

# Application RT tasks usually sit above the IRQ threads (for example 80-98);
# priority 99 is shared with kernel threads such as migration/X
# OR: raise critical kernel threads (irq/XX) above application threads if needed

The key insight: on PREEMPT_RT, there is a continuous priority space from userspace to kernel threads. Designing priority assignments requires understanding which kernel threads compete with your RT tasks.

Production Examples

Beckhoff TwinCAT 3: Industrial automation runtime on standard x86 hardware. TwinCAT has traditionally run on Windows (and later TwinCAT/BSD) with Beckhoff's own real-time extension; Beckhoff has more recently announced a real-time Linux runtime. EtherCAT master with 250µs cycle time, PLC logic, and servo drive control. Ships in CNC machines, packaging lines, and assembly robots worldwide.

KUKA Robot Controllers: The KR C4 generation pairs VxWorks (real-time control) with Windows (HMI); KUKA's newer Linux-based iiQKA.OS line is the relevant example here. Robot arm trajectory control loop at 4kHz (250µs). The full robot OS is Linux-based (visualization, PLC programming, network communication), with RT threads for servo control.

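ABB Robot Controllers: IRC5 is built on VxWorks rather than Linux, so treat it as an architectural reference, not a PREEMPT_RT deployment. The pattern it illustrates carries over: safety functions (ISO 10218) run on a dedicated safety processor, separate from the OS that handles non-safety motion control.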

Intel FlexRAN: 5G base station software. Runs on Intel Xeon with PREEMPT_RT, DPDK for network I/O, and AVX-512 for DSP processing. L1 (physical layer) processing must complete within one slot, 500µs at 30 kHz subcarrier spacing, to meet 5G NR timing.

Xilinx/AMD MPSoC (Zynq UltraScale+): FPGA + quad Cortex-A53 + dual Cortex-R5. PREEMPT_RT on Cortex-A53 for control logic; Cortex-R5 (FreeRTOS, bare-metal) for hard RT. Used in medical imaging, SDR, industrial control.

Debugging Notes

Identifying latency sources: Always use rtla timerlat first. It decomposes latency into IRQ, kernel, and user contributions. If kernel latency dominates: find which raw_spinlock holds longest. If user latency dominates: check mlock, CPU affinity, frequency scaling.
hwlat tracer (formerly the hwlat_detector module): Kernel tracer that detects SMI-induced latency by spinning with interrupts disabled and monitoring timestamp gaps. Enable with echo hwlat > /sys/kernel/debug/tracing/current_tracer. Spikes >10µs that appear without any software activity indicate SMIs or other hardware-induced stalls.
Interrupted sleeps: clock_nanosleep returns EINTR if a signal handler runs during the sleep. With TIMER_ABSTIME, call it again with the same absolute deadline; with relative sleeps, sleep for the returned remaining time (common RT programming pattern).
False sharing on RT: If an RT task shares a cache line with a non-RT task, cache invalidation from the non-RT task causes memory latency on the RT task's next access. Use alignas(64) or __attribute__((aligned(64))) in userspace (__cacheline_aligned in kernel code) and padding to separate hot data.
Checking RT kernel is active: uname -v shows kernel version with PREEMPT_RT in the version string. cat /sys/kernel/realtime returns 1 on PREEMPT_RT kernels.
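For the false-sharing note above, a userspace layout that keeps RT and non-RT data on separate cache lines could be sketched as follows. The 64-byte line size is an assumption (typical for x86-64 and most Cortex-A cores), and the struct and field names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */

/* Each hot field starts on its own cache line, so writes by the
 * background thread never invalidate the line the RT thread uses. */
struct shared_state {
    alignas(CACHE_LINE) uint64_t rt_counter;  /* written by the RT thread */
    alignas(CACHE_LINE) uint64_t bg_counter;  /* written by a non-RT thread */
};

/* Compile-time check that the two fields are at least one line apart. */
static_assert(offsetof(struct shared_state, bg_counter) -
              offsetof(struct shared_state, rt_counter) >= CACHE_LINE,
              "hot fields must not share a cache line");
```

The cost is memory: the struct grows from 16 to 128 bytes, which is usually acceptable for a handful of hot variables.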

Security Implications

Denial of service via RT priority: A SCHED_FIFO task at high priority can starve all other processes, including security services. Restrict with RLIMIT_RTPRIO per user/group (ulimit -r 0 for untrusted users). Keep RT throttling enabled: the default kernel.sched_rt_runtime_us = 950000 with kernel.sched_rt_period_us = 1000000 reserves 5% of CPU time for non-RT work.
CPU isolation exposure: Isolated CPUs with RT tasks bypass many kernel fairness mechanisms. Malicious RT code on an isolated CPU can perform unrestricted computation. Combine with seccomp and capabilities restriction.
Timing side-channels: High-resolution timers and low-jitter execution enable more precise timing side-channel attacks (Spectre timing variants). On shared infrastructure, PREEMPT_RT may slightly amplify this.
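Watchdog starvation: The NMI watchdog (kernel.nmi_watchdog=1) can detect hard lockups even while SCHED_FIFO tasks monopolize a CPU, because NMIs fire regardless of scheduling. Its periodic NMIs add jitter, however, so some RT tuning profiles disable it; decide per deployment and make sure some mechanism (hardware watchdog, supervisor process) still catches hung RT tasks.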
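Whether the RLIMIT_RTPRIO restriction mentioned above is in effect for a given process can be checked at run time. The following sketch reads the soft limit with getrlimit; the helper name max_unprivileged_rtprio is hypothetical, and the cap of 99 reflects the Linux SCHED_FIFO priority range.

```c
#define _GNU_SOURCE
#include <sys/resource.h>

/* Return the highest SCHED_FIFO/SCHED_RR priority this process may request
 * without CAP_SYS_NICE, according to its soft RLIMIT_RTPRIO.
 * 0 means RT policies are not permitted; -1 signals an error. */
long max_unprivileged_rtprio(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RTPRIO, &rl) != 0)
        return -1;

    /* Linux RT priorities range from 1 to 99, so clamp larger limits. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > 99)
        return 99;
    return (long)rl.rlim_cur;
}
```

A service that expects an RT priority can log this value at start-up, which makes a misconfigured limits.conf or systemd LimitRTPRIO setting visible before the first missed deadline.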

Performance Implications

Throughput vs. latency tradeoff: PREEMPT_RT reduces worst-case latency at some cost to average throughput. Reductions of roughly 3-10% in network throughput on loaded x86 servers are commonly cited, but the figure is workload-dependent, so measure your own. Acceptable for RT use cases; not ideal for pure throughput workloads.
Spinlock conversion overhead: rt_mutex acquisition is slower than spinlock acquisition (involves scheduler data structures). Critical sections that were microseconds with spinlocks may be tens of microseconds with rt_mutex. Profile with ftrace to identify hot rt_mutex paths.
IRQ thread overhead: Each IRQ thread is a schedulable kernel thread. High-rate interrupts (10kHz NIC) become 10k wakeups/second per CPU. For very high interrupt rates, IRQF_NO_THREAD keeps the handler in atomic context, at the cost of a non-preemptible handler that adds directly to RT latency on that CPU.

Failure Modes

Missing mlockall: RT task faults on first access to stack or heap page, triggering the kernel's memory allocator (which may take many microseconds). Symptoms: occasional large latency spikes only on first execution of code paths.
IRQ affinity misconfiguration: A high-rate IRQ is affined to the RT CPU, consuming cycles. Symptom: irq/* thread at high CPU utilization on the RT CPU; cyclictest shows regularly spaced latency spikes at the IRQ rate.
Unthreaded driver IRQ: A legacy driver using IRQF_NO_THREAD runs its full handler in atomic context on the RT CPU, causing latency spikes. Fix: review driver, remove IRQF_NO_THREAD, or pin its IRQ to non-RT CPUs.
SMI latency spikes: Unpredictable spikes of 50-200µs that appear as gaps in ftrace — no kernel activity logged but TSC advances. Indicate firmware SMI. No software fix; requires hardware/firmware change.

Modern Usage

Linux 6.12 PREEMPT_RT mainline: No patching required on supported architectures (x86, ARM64, RISC-V). make menuconfig -> General Setup -> Preemption Model -> Fully Preemptible Kernel (Real-Time).
Red Hat / RHEL for Real Time: Officially supported RHEL kernel with PREEMPT_RT, optimized for telecom (vRAN, DU) and industrial use cases.
OSADL (Open Source Automation Development Lab): Maintains long-term PREEMPT_RT latency test results across dozens of hardware platforms and kernel versions. The OSADL latency monitor provides continuous RT performance data.
ROS 2 real-time: The rclcpp executor supports priority-based callback execution on PREEMPT_RT Linux. Used in autonomous vehicle and drone platforms (Dronecode, Autoware).

Future Directions

RT Linux + eBPF: eBPF programs running in kernel context need to be analyzable for RT impact. Work is ongoing to provide RT-safe eBPF execution semantics.
Per-CPU kernel thread priorities: More granular control over which kernel threads compete for which CPU resources, reducing the need for manual IRQ affinity management.
Memory bandwidth partitioning: Intel MBA (Memory Bandwidth Allocation) and similar DRAM bandwidth partitioning to prevent non-RT memory bandwidth consumption from creating latency spikes in RT tasks.
Formal verification of RT primitives: Ongoing academic and industrial work to formally verify that the PREEMPT_RT rt_mutex and priority inheritance implementation is correct under all scheduler interleavings.

Exercises

Build a PREEMPT_RT kernel from source for your target architecture. Verify PREEMPT_RT appears in uname -v. Run cyclictest with and without load stress. Compare the latency histograms.
Identify and isolate two CPUs on a multi-core machine. Run cyclictest on an isolated CPU while running stress-ng --cpu $(nproc) on non-isolated CPUs. Measure the difference in max latency vs. non-isolated configuration.
Write a POSIX real-time application that: (a) sets SCHED_FIFO priority 80, (b) calls mlockall, (c) pre-faults its stack, (d) sleeps with clock_nanosleep in a 1ms loop, (e) measures and prints histogram of actual wake latency.
Use rtla timerlat to identify whether your worst-case latency is dominated by IRQ, kernel, or user-space contribution. Then diagnose and fix the dominant contributor (e.g., move IRQs away from the RT CPU, disable C-states, set the performance cpufreq governor).
Deliberately create a priority inversion scenario in a PREEMPT_RT system using a raw spinlock (simulate a kernel subsystem with a raw_spinlock_t held for 500µs). Measure its impact on a priority-99 cyclictest thread. What is the worst-case latency? Does priority inheritance help here, and why or why not?

References

Ingo Molnar, Thomas Gleixner, Steven Rostedt — PREEMPT_RT LKML posts, 2004-present
Steven Rostedt, "RT Linux in the Real World" (LinuxCon 2012)
Thomas Gleixner, "Realtime Linux — The Long Way" (ELCE 2008)
OSADL Real-Time Linux QA Farm: https://www.osadl.org/OSADL-QA-Farm-Real-time.linux-real-time.0.html
Linux Foundation Real-Time Linux Wiki: https://wiki.linuxfoundation.org/realtime/start
cyclictest source: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
rtla documentation: https://www.kernel.org/doc/html/latest/tools/rtla/
Red Hat Performance Tuning Guide for Real Time: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/
Carsten Emde, "Using and Understanding the Real-Time Cyclictest Benchmark" (OSADL, 2011)
Daniel Bristot de Oliveira, "Demystifying the Real-Time Linux Scheduling Latency" (Real-Time Summit 2020)