PREEMPT_RT Linux
Overview
PREEMPT_RT is a patchset that transforms the Linux kernel from a high-throughput general-purpose operating system into a real-time-capable operating system. Rather than replacing Linux with a purpose-built RTOS, PREEMPT_RT surgically modifies the kernel's concurrency and interrupt model to bound worst-case latency, preserving the full Linux userspace while enabling <100µs response times on commodity hardware.
The patchset is significant because it shows that a production-grade, feature-complete operating system can achieve bounded worst-case latency through principled engineering of its concurrency primitives, without sacrificing the POSIX API, device driver ecosystem, or networking stack that make Linux so valuable.
Prerequisites
- Linux kernel internals: spinlocks, RCU, softirqs, interrupt handling
- Real-time fundamentals: scheduling theory, jitter, WCET (see 01-real-time-fundamentals.md)
- POSIX real-time APIs: sched_setscheduler, SCHED_FIFO, SCHED_RR
- Basic understanding of hardware clock sources and timers
- Familiarity with kernel build system (Kconfig, make)
Historical Context
Timeline of PREEMPT_RT development:
2004 Ingo Molnar posts first PREEMPT_RT patch to LKML
Initial focus: ARM and x86, basic spinlock-to-mutex conversion
2005 Thomas Gleixner begins hrtimer (high-resolution timer) work
Steve Rostedt joins, focuses on latency tracing infrastructure
2006 First "production use" reports from industrial users
Dual-kernel approach (Xenomai, RTLinux) loses industrial momentum
2007 TimeSys, Mentor Graphics, Wind River shipping PREEMPT_RT products
cyclictest becomes standard benchmark
2009 "The Big Kernel Lock" (BKL) removal accelerates — enables better RT
2011 Kernel 3.0: parts of PREEMPT_RT already in mainline (hrtimers since 2.6.16, threaded IRQs since 2.6.30)
Wake-up latency consistently <500µs on x86 with patches
2015 Active development: priority inheritance in RCU, lockdep for RT mutexes
Sub-100µs latency achieved on i7 hardware
2017 Red Hat, SUSE ship PREEMPT_RT kernels for industrial/telco customers
Linux Foundation Real-Time Linux collaborative project (launched 2015) funds the mainlining effort
2021 5.15 LTS: Significant PREEMPT_RT infrastructure merged into mainline
rt_mutex-based sleeping spinlocks and rwlocks merged; printk rework still outstanding
2024 Linux 6.12: PREEMPT_RT fully merged into mainline Linux kernel
No separate patch needed for most architectures
The 20-year journey from patchset to mainline reflects the difficulty of retrofitting real-time behavior onto a complex, widely deployed kernel without breaking anything. Every change had to maintain backward compatibility and pass the scrutiny of the entire Linux kernel maintainer community.
Architecture: What PREEMPT_RT Changes
Standard Linux Kernel (without PREEMPT_RT):
+--------------------------------------------------+
| Userspace SCHED_FIFO task at priority 99 |
+--------------------------------------------------+
| (task wants to run)
v
+--------------------------------------------------+
| Kernel: doing network softirq processing |
| Spinlock held, interrupts enabled, NOT |
| preemptible by userspace — even priority 99 |
| Kernel section may run for 1-10ms |
+--------------------------------------------------+
| (eventually finishes, returns to userspace)
v
Task runs — but 1-10ms late
PREEMPT_RT Kernel:
+--------------------------------------------------+
| Userspace SCHED_FIFO task at priority 99 |
+--------------------------------------------------+
| (task wants to run)
v
+--------------------------------------------------+
| Kernel: network softirq in an RT-mutex section |
| Running as a threaded softirq at lower priority |
| Priority 99 task PREEMPTS kernel thread |
+--------------------------------------------------+
v
Task runs — within <100µs
The Five Core Changes
1. Threaded Interrupt Handlers
Before: Hardware interrupt handlers (hardirq) run in atomic context — interrupts disabled, cannot sleep, cannot be preempted.
After: Most hardirq handlers are converted to run in a dedicated kernel thread. The hardware IRQ line triggers a minimal "hardirq top half" that acknowledges the hardware and wakes the IRQ thread. The bulk of interrupt processing happens in the thread, which is a normal schedulable entity.
Before PREEMPT_RT:
Hardware IRQ fires
-> CPU enters hardirq context (atomic, non-preemptible)
-> Full handler runs (may be 10s-100s µs)
-> Return to interrupted context
After PREEMPT_RT:
Hardware IRQ fires
-> CPU enters minimal hardirq (atomic, ~1µs)
- Acknowledge hardware
- Wake IRQ thread
-> Return to whatever was interrupted
-> Scheduler runs IRQ thread at its configured priority
-> IRQ thread runs full handler (preemptible by higher-priority RT task)
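Configuration: PREEMPT_RT always forces IRQ threading. Non-RT kernels built with CONFIG_IRQ_FORCED_THREADING=y can opt in with the threadirqs boot parameter. Individual drivers can opt out with IRQF_NO_THREAD for truly performance-critical handlers.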
2. RT Mutexes Replacing Spinlocks
Before: Spinlocks are the kernel's primary mutual exclusion primitive. They busy-wait (spin) and disable preemption — ensuring bounded, atomic critical sections but making the kernel non-preemptible during their hold.
After: On PREEMPT_RT, spinlock_t and rwlock_t are reimplemented on top of rt_mutex, a sleepable, priority-inheriting lock. Driver source does not change: the same spin_lock() call now blocks (sleeps) on contention instead of spinning, which lets higher-priority tasks run.
// Non-RT kernel: spin_lock() busy-waits and disables preemption
spin_lock(&some_lock);
// critical section - kernel non-preemptible
do_work();
spin_unlock(&some_lock);
// PREEMPT_RT kernel: identical source, but spinlock_t is now rt_mutex-based
spin_lock(&some_lock);
// critical section - preemptible by higher-priority tasks; may sleep on contention
do_work();
spin_unlock(&some_lock);
Crucially, rt_mutex implements the priority inheritance protocol: if a high-priority task waits for an rt_mutex held by a low-priority task, the low-priority task temporarily inherits the high-priority task's scheduling priority. This prevents priority inversion within the kernel.
Not all spinlocks can be converted: raw spinlocks (raw_spinlock_t) remain for truly atomic sections (NMI handlers, scheduler internals). PREEMPT_RT distinguishes these explicitly.
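The same priority inheritance protocol is available to userspace through POSIX mutexes, which on Linux are backed by the kernel's rt_mutex machinery via PI futexes. The following is a minimal sketch, not code from the PREEMPT_RT tree; the helper name pi_mutex_init is hypothetical, and it assumes a glibc-style pthread implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/*
 * Hypothetical helper: initialize *m as a priority-inheriting mutex.
 * Returns 0 on success or a positive errno value (for example ENOTSUP
 * on systems where PI futexes are unavailable).
 */
int pi_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* Request priority inheritance instead of the default PTHREAD_PRIO_NONE. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

A mutex shared between a SCHED_FIFO thread and lower-priority threads would typically be created this way, so that the lock holder is boosted while the RT thread waits.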
3. Preemptible RCU
Read-Copy-Update (RCU) is Linux's scalable reader-writer synchronization mechanism. In a non-preemptible kernel, an RCU read-side critical section runs with preemption disabled, and a grace period ends once every CPU has passed through a quiescent state such as a context switch. A long read-side section is therefore a non-preemptible region that delays any RT task waiting for that CPU.
PREEMPT_RT relies on preemptible RCU (CONFIG_PREEMPT_RCU): RCU readers can be preempted. A preempted reader is tracked, and the grace period waits for it to leave its critical section. This adds bookkeeping overhead but removes a critical source of unbounded latency.
4. High-Resolution Timers (hrtimer)
Before: Linux timer wheel operates at HZ resolution (typically 100-1000Hz). schedule_timeout(1) sleeps for one jiffy — up to 10ms at HZ=100. All wakeups are aligned to the next tick.
After: hrtimer subsystem provides nanosecond-resolution timers backed by hardware clock sources (HPET, TSC deadline timer on x86, arch timer on ARM).
Standard Linux timer:
Request sleep for 500µs
|<--- up to 10ms --->| (aligned to next 10ms tick)
Actually wakes here
hrtimer:
Request sleep for 500µs
|<- ~500µs ± jitter ->|
Actually wakes here (programmed directly to hardware comparator)
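POSIX clock_nanosleep(..., TIMER_ABSTIME, ...) is backed by hrtimers on any kernel built with CONFIG_HIGH_RES_TIMERS, so the requested wake-up time has nanosecond resolution; PREEMPT_RT then bounds how late the woken task actually runs. Periodic RT loops should use CLOCK_MONOTONIC rather than CLOCK_REALTIME, because CLOCK_REALTIME shifts whenever the wall clock is set.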
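A periodic loop built on absolute-deadline sleeps might look like the sketch below. It is illustrative only: the function names (timespec_add_ns, timespec_diff_ns, run_periodic) are hypothetical, CLOCK_MONOTONIC is an assumed choice of clock, and the loop simply records the worst observed lateness rather than doing real work.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance *t by ns nanoseconds (ns >= 0), keeping tv_nsec in [0, 1e9). */
void timespec_add_ns(struct timespec *t, long ns)
{
    t->tv_sec += ns / NSEC_PER_SEC;
    t->tv_nsec += ns % NSEC_PER_SEC;
    if (t->tv_nsec >= NSEC_PER_SEC) {
        t->tv_nsec -= NSEC_PER_SEC;
        t->tv_sec += 1;
    }
}

/* Signed difference a - b in nanoseconds. */
long long timespec_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(a->tv_sec - b->tv_sec) * NSEC_PER_SEC
         + (a->tv_nsec - b->tv_nsec);
}

/*
 * Sleep through `cycles` periods of `period_ns` each, using absolute
 * deadlines so that lateness in one cycle does not accumulate as drift.
 * Returns the largest wake-up lateness seen in ns, or -1 on error.
 */
long long run_periodic(long period_ns, int cycles)
{
    struct timespec next, now;
    long long worst = 0;

    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < cycles; i++) {
        timespec_add_ns(&next, period_ns);

        int rc;
        do {
            /* With TIMER_ABSTIME an interrupted sleep is retried unchanged. */
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return -1;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        long long late = timespec_diff_ns(&now, &next);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Computing each deadline from the previous deadline, rather than from the current time, is what keeps the period stable even when an individual wake-up is late.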
5. Fully Preemptible Kernel
The combination of the above changes enables the PREEMPT_RT preemption model — the kernel is preemptible everywhere except:
- raw_spinlock_t sections (used only for truly atomic operations)
- NMI handlers
- The scheduler's own critical sections
This means a userspace SCHED_FIFO priority 99 thread can preempt almost any kernel execution path, bounding scheduling latency to the time the scheduler itself takes plus any raw spinlock hold time.
Preemption Models in Linux
Linux configures the preemption model at compile time via Kconfig:
CONFIG_PREEMPT_* options:
PREEMPT_NONE (server default):
Kernel preemption only at explicit schedule() calls
Lowest overhead, worst RT latency (ms range)
PREEMPT_VOLUNTARY (desktop default):
Adds explicit preemption points to long-running kernel paths
PREEMPT (low-latency desktop):
Preemption at all non-spinlock-held kernel sections
Better latency than NONE, some overhead
Typical worst-case: hundreds of µs
PREEMPT_RT (real-time):
Fully preemptible kernel (spinlocks -> rt_mutex)
Threaded IRQs
Worst-case latency: ~20-100µs on tuned x86/ARM64 hardware
Available in mainline since Linux 6.12
Achieving <100µs Latency: System Configuration
Getting sub-100µs latency on real hardware requires more than just a PREEMPT_RT kernel. The full stack must be tuned:
CPU Isolation
# Isolate CPUs 2 and 3 from the general scheduler
# (kernel bootarg)
isolcpus=2,3 rcu_nocbs=2,3 nohz_full=2,3
# Pin RT task to isolated CPU
taskset -c 2 cyclictest --priority=99 --interval=200
NUMA and IRQ Affinity
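# Move all movable IRQs off the RT CPUs
for irq in /proc/irq/*/smp_affinity_list; do
# Some IRQs (per-CPU timers, IPIs) cannot be moved; ignore those write errors
echo 0-1 > "$irq" 2>/dev/null || true # Route IRQs to CPUs 0-1 only
done
# Verify: the CPU2 and CPU3 columns should stop increasing for device IRQs
cat /proc/interrupts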
CPU Frequency Scaling
P-states and C-states (deep sleep states) cause latency spikes when the CPU must ramp up frequency or wake from deep sleep:
# Disable CPU frequency scaling on RT CPUs
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
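# Disable deep C-states (latency vs power tradeoff)
# Writing 0 to /dev/cpu_dma_latency requests the shallowest idle state, but the
# request is dropped as soon as the file is closed, so a plain "echo 0 >" has no
# lasting effect. Keep the descriptor open for as long as the limit is needed:
exec 3> /dev/cpu_dma_latency; echo -n 0 >&3 # fd 3 stays open in this shell
# Or disable individual idle states per CPU (N = idle state index):
# echo 1 > /sys/devices/system/cpu/cpu2/cpuidle/stateN/disable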
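Because the /dev/cpu_dma_latency request only lasts while the file stays open, an RT application usually holds the descriptor itself. The sketch below shows one way to do that; the function name hold_cpu_dma_latency is hypothetical, and the path is passed as a parameter purely so the helper can be exercised against an ordinary file.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a latency target in microseconds as a
 * 32-bit integer to a PM QoS node such as /dev/cpu_dma_latency and
 * return the still-open descriptor, or -1 on failure.
 * The request is active only while the descriptor is open, so the
 * caller keeps it for the lifetime of the RT work and then close()s it.
 */
int hold_cpu_dma_latency(const char *path, int32_t target_us)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (write(fd, &target_us, sizeof(target_us)) != (ssize_t)sizeof(target_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A typical call would be hold_cpu_dma_latency("/dev/cpu_dma_latency", 0) during start-up, before entering the real-time loop, with root privileges or suitable permissions on the device node.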
Memory Locking
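// Needs <stdio.h>, <string.h>, <sys/mman.h>; run at the start of the RT thread
// Prevent RT task from faulting on memory access during execution
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
perror("mlockall"); // usually means CAP_IPC_LOCK or RLIMIT_MEMLOCK is missing
// Pre-fault stack to avoid page fault in critical section
// MAX_STACK_SIZE is application-defined and must fit within the thread's stack limit
char stack_buf[MAX_STACK_SIZE];
memset(stack_buf, 0, sizeof(stack_buf)); // touch all stack pages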
Interrupt Coalescing
# NIC interrupt coalescing causes latency spikes
# Disable for RT network processing
ethtool -C eth0 rx-usecs 0 tx-usecs 0
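# netdev_budget_usecs caps how long one NAPI poll cycle may run; lowering it
# shortens softirq bursts at a throughput cost (0 is the most aggressive setting)
echo 0 > /proc/sys/net/core/netdev_budget_usecs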
cyclictest: Measuring RT Performance
cyclictest output interpretation:
$ cyclictest -l10000000 -m -Sp99 -i200 -h400 -q
T: 0 ( 7890) P:99 I:200 C:10000000 Min: 3 Act: 6 Avg: 7 Max: 47
Column: T=thread PID Priority Interval(µs) Count Min Act Avg Max
Histogram:
000003 47382 <- 3µs bucket: 47382 occurrences
000004 291847
000005 8731024 <- mode: most samples here
000006 891234
...
000047 1 <- max: 47µs, only 1 occurrence
Rule-of-thumb targets for industrial RT (application-specific, not mandated by IEC 61508 or any other standard):
- Max < 100µs: acceptable for most soft-RT
- Max < 50µs: good for control applications
- Max < 20µs: required for sub-millisecond control loops
- Max > 1ms: indicates tuning issue or non-RT kernel
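The Min/Avg/Max columns and the per-microsecond histogram can be reproduced inside an application with very little code. The following is a sketch of that bookkeeping, not cyclictest's actual implementation; the type and function names (lat_hist, lat_hist_add, and so on) are hypothetical, and the bucket count simply mirrors the -h400 option used above.

```c
#include <stdint.h>

#define HIST_BUCKETS 400   /* one bucket per microsecond, like -h400 */

struct lat_hist {
    uint64_t bucket[HIST_BUCKETS];
    uint64_t overflow;     /* samples of HIST_BUCKETS µs or more */
    uint64_t count;
    uint64_t sum_us;
    uint32_t min_us;
    uint32_t max_us;
};

/* Zero all counters; min starts at the largest value so any sample lowers it. */
void lat_hist_init(struct lat_hist *h)
{
    *h = (struct lat_hist){ .min_us = UINT32_MAX };
}

/* Record one wake-up latency sample, given in microseconds. */
void lat_hist_add(struct lat_hist *h, uint32_t us)
{
    if (us < HIST_BUCKETS)
        h->bucket[us]++;
    else
        h->overflow++;

    h->count++;
    h->sum_us += us;
    if (us < h->min_us)
        h->min_us = us;
    if (us > h->max_us)
        h->max_us = us;
}

/* Integer average in microseconds; 0 when no samples were recorded. */
uint64_t lat_hist_avg(const struct lat_hist *h)
{
    return h->count ? h->sum_us / h->count : 0;
}
```

All updates are constant-time and allocation-free, so the recording step itself should not add page faults or unbounded work to the measured loop.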
Stress Testing
The worst-case latency only appears under realistic system stress:
# Concurrent stress while measuring with cyclictest:
# CPU load
stress-ng --cpu 4 --io 4 &
# Memory pressure
stress-ng --vm 2 --vm-bytes 80% &
# Network load
iperf3 -s &
iperf3 -c localhost -t 60 &
# File I/O
fio --name=stress --rw=randrw --bs=4k --size=1G --ioengine=libaio &
# Run cyclictest with all stress active
cyclictest -l1000000 -p99 -i200 -m
PREEMPT_RT and the printk Problem
One of the last significant sources of latency in PREEMPT_RT was printk. The kernel's print function acquired a console driver lock, which could hold for milliseconds while a slow serial console flushed.
The solution, which reached mainline with Linux 6.12 after an earlier threaded-printk attempt was reverted, moves console output out of the caller's context: log writes go to an in-kernel ring buffer, and dedicated kernel printing threads flush to consoles. RT tasks calling printk no longer block waiting for console I/O.
This change eliminated a class of latency spikes that was difficult to diagnose because it only appeared when kernel code paths hit debug printk calls under load.
Latency Tracing with ftrace
# Enable latency tracer
echo preemptirqsoff > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_on
# Run workload, then capture worst-case trace
cat /sys/kernel/debug/tracing/trace | head -100
# Output shows exact kernel functions executing during worst-case latency:
# preemptirqsoff latency trace v1.1.5 on 5.15.0-rt17
# ------------------------------------------------------------
# latency: 67 us, #4/4, CPU#2 | (M:RT VP:0, KP:0, SP:0 HP:0)
# -----------------
# | task: cyclictest-1234 (uid:0 nice:0 policy:1 rt_prio:99)
# -----------------
# => started at: _raw_spin_lock_irqsave <some_driver_lock>
# => ended at: _raw_spin_unlock_irqrestore
The rtla timerlat tool (Linux 5.17+) provides an automated, structured analysis:
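# rtla timerlat top: shows per-CPU latency breakdown
# -P f:99 runs the timerlat threads as SCHED_FIFO priority 99 (-p sets the period, not the priority)
# -d 60s limits the run to 60 seconds; -T 100 stops tracing if thread latency exceeds 100 µs
rtla timerlat top -P f:99 -d 60s -T 100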
# Output:
# Timer Latency (µs) [0-99µs histogram]
# IRQ Kernel User
# CPU 0: avg 3.1 4.2 7.5 max 12 18 31
# CPU 1: avg 2.9 4.0 7.2 max 11 16 28
# CPU 2: avg 3.0 4.1 7.3 max 10 15 26
IRQ latency = hardware timer fires to IRQ handler entry
Kernel latency = IRQ handler entry to wakeup posted
User latency = wakeup posted to task actually executing
Hardware Considerations
PREEMPT_RT latency is ultimately bounded by hardware behavior:
SMIs (System Management Interrupts)
SMIs are the worst latency villain on x86. The CPU silently enters System Management Mode (SMM) at the BIOS/firmware's request — completely invisible to the OS, uninterruptible, can run for 50-200µs.
# Detect SMI activity (requires MSR access):
rdmsr -p 2 0x34 # Read SMI counter on CPU 2
# Run workload for 60s
rdmsr -p 2 0x34 # If count increased, SMIs occurred
# Latency spike coinciding with SMI is identifiable because:
# - ftrace shows no kernel activity during the gap
# - The CPU appears to "disappear" for the SMI duration
SMIs cannot be disabled on most commercial hardware. Industrial RT platforms (Beckhoff Industrial PCs, ADLINK) specifically disable all non-essential SMI sources in firmware.
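The rdmsr check above can also be done from inside a monitoring program by reading the msr character device directly. This is a sketch under assumptions: it relies on the msr driver being loaded and on root access, the register number 0x34 is taken from the rdmsr example above, and the function names read_msr and smi_delta are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_SMI_COUNT 0x34   /* Intel SMI counter, as used with rdmsr above */

/* Read one MSR on the given CPU via /dev/cpu/N/msr; 0 on success, -1 on failure. */
int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The msr driver uses the file offset as the register number. */
    ssize_t n = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);
    return n == (ssize_t)sizeof(*value) ? 0 : -1;
}

/* SMIs between two samples; the counter is 32 bits wide, so subtract modulo 2^32. */
uint32_t smi_delta(uint64_t before, uint64_t after)
{
    return (uint32_t)after - (uint32_t)before;
}
```

Sampling the counter before and after a test run and logging smi_delta alongside the cyclictest maximum makes it easier to tell firmware-induced spikes from kernel-induced ones.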
APIC Timer Behavior
On x86, TSC-deadline mode (APIC_LVT_TIMER_TSCDEADLINE) programs the local APIC timer to fire at an exact TSC cycle count. This eliminates the timer hardware's quantization error, enabling sub-microsecond timer resolution. The kernel selects this mode automatically when the CPU supports it; check for the tsc_deadline_timer flag in /proc/cpuinfo.
ARM64 Considerations
ARM Cortex-A systems generally have more predictable interrupt latency than x86:
- No SMI equivalent
- GIC (Generic Interrupt Controller) has deterministic interrupt routing
- ARM generic timer (CNTP_TVAL_EL0) provides reliable high-resolution timing
- PREEMPT_RT on ARM64 typically achieves <50µs on modern Cortex-A57/A72/A78 class hardware
Kernel Threads and Priorities
After enabling PREEMPT_RT, kernel threads run at specific priorities. RT application threads must be at higher priority than competing kernel threads:
# View kernel thread priorities (RT threads):
ps -eo pid,rtprio,class,comm | grep -E "FF|RR" | sort -k2 -rn
# Critical kernel threads and typical priorities:
# irq/XX-<name> SCHED_FIFO priority 50 (IRQ threads, adjustable with chrt)
# ktimers/X SCHED_FIFO priority 1 (timer softirq thread)
# ksoftirqd/X SCHED_OTHER (softirq daemon, no RT priority)
# kworker/X SCHED_OTHER (no RT priority)
# Application RT task should be at priority 80-99 to beat kernel threads
# OR: raise critical kernel threads (irq/XX) above application threads if needed
The key insight: on PREEMPT_RT, there is a continuous priority space from userspace to kernel threads. Designing priority assignments requires understanding which kernel threads compete with your RT tasks.
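On the application side, placing a thread in that priority space comes down to one sched_setscheduler call. The sketch below is illustrative; the helper names clamp_fifo_priority and set_fifo_priority are hypothetical, and the call is expected to fail without CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>

/* Clamp a requested priority into the valid SCHED_FIFO range for this system. */
int clamp_fifo_priority(int prio)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (prio < lo)
        return lo;
    if (prio > hi)
        return hi;
    return prio;
}

/*
 * Switch the caller to SCHED_FIFO at the given (clamped) priority.
 * Returns 0 on success or a negative errno value, typically -EPERM
 * when the process lacks the privilege to use RT scheduling.
 */
int set_fifo_priority(int prio)
{
    struct sched_param sp = { .sched_priority = clamp_fifo_priority(prio) };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -errno;
    return 0;
}
```

Calling set_fifo_priority(80) early in the RT thread, and treating a non-zero return as a hard start-up error, avoids silently running a control loop under the default scheduler.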
Production Examples
Beckhoff TwinCAT 3: Industrial automation runtime. Runs on standard x86 hardware with PREEMPT_RT Linux as the real-time core. EtherCAT master with 250µs cycle time, PLC logic, and servo drive control. Ships in CNC machines, packaging lines, and assembly robots worldwide.
KUKA Robot Controllers: KR C4 and successors run PREEMPT_RT Linux. Robot arm trajectory control loop at 4kHz (250µs). The full robot OS is Linux-based — visualization, PLC programming, network communication — with RT threads for servo control.
ABB Robot Controllers: IRC5/OmniCore use PREEMPT_RT for robot motion control. Safety functions (ISO 10218) separated onto a dedicated safety processor; PREEMPT_RT handles non-safety motion control.
Intel FlexRAN: 5G base station software. Runs on Intel Xeon with PREEMPT_RT, DPDK for network I/O, and AVX-512 for DSP processing. L1 (physical layer) processing must complete in <500µs for 5G NR subframe timing.
Xilinx/AMD MPSoC (Zynq UltraScale+): FPGA + quad Cortex-A53 + dual Cortex-R5. PREEMPT_RT on Cortex-A53 for control logic; Cortex-R5 (FreeRTOS, bare-metal) for hard RT. Used in medical imaging, SDR, industrial control.
Debugging Notes
- Identifying latency sources: Always use rtla timerlat first. It decomposes latency into IRQ, kernel, and user contributions. If kernel latency dominates, find which raw_spinlock_t is held longest. If user latency dominates, check mlockall, CPU affinity, and frequency scaling.
- hwlat_detector: Kernel facility (the hwlat tracer in current kernels) that detects SMI-induced latency by running a tight loop monitoring TSC gaps. Spikes >10µs that appear without any software activity indicate SMI or other hardware-induced latency.
- Spurious wake-ups: clock_nanosleep can return early with EINTR when a signal arrives. Applications must check the return value and sleep the remainder; with TIMER_ABSTIME this means calling it again with the same deadline (common RT programming pattern).
- False sharing on RT: If an RT task shares a cache line with a non-RT task, cache invalidation from the non-RT task causes memory latency on the RT task's next access. Use __cacheline_aligned (in kernel code) or explicit alignment and padding to separate hot data.
- Checking RT kernel is active: uname -v shows the kernel version with PREEMPT_RT in the version string. cat /sys/kernel/realtime returns 1 on PREEMPT_RT kernels.
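An application can perform the /sys/kernel/realtime check itself at start-up and refuse to run on a non-RT kernel. A minimal sketch, with a hypothetical function name and the path passed in as a parameter so the logic can be tried against any file:

```c
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if the file at `path` exists and its
 * first character is '1', and 0 otherwise (including when the file is
 * missing, as it is on kernels without PREEMPT_RT).
 */
int realtime_flag_set(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;

    int c = fgetc(f);
    fclose(f);
    return c == '1';
}
```

In practice this would be called as realtime_flag_set("/sys/kernel/realtime") before any RT threads are created.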
Security Implications
- Privilege escalation via RT priority: SCHED_FIFO at high priority can starve all other processes including security services. Restrict with RLIMIT_RTPRIO per user/group. Use ulimit -r 0 for untrusted users. Configure kernel.sched_rt_runtime_us = 950000 (reserve 5% CPU for non-RT work).
- CPU isolation exposure: Isolated CPUs with RT tasks bypass many kernel fairness mechanisms. Malicious RT code on an isolated CPU can perform unrestricted computation. Combine with seccomp and capabilities restriction.
- Timing side-channels: High-resolution timers and low-jitter execution enable more precise timing side-channel attacks (Spectre timing variants). On shared infrastructure, PREEMPT_RT may slightly amplify this.
- Watchdog starvation: Linux's nmi_watchdog relies on NMI firing even when SCHED_FIFO tasks run. Ensure NMI watchdog is enabled to catch hung RT tasks: kernel.nmi_watchdog=1.
Performance Implications
- Throughput vs. latency tradeoff: PREEMPT_RT reduces worst-case latency at some cost to average throughput. Benchmark shows 3-10% reduction in network throughput on loaded x86 servers. Acceptable for RT use cases; not ideal for pure throughput workloads.
- Spinlock conversion overhead: rt_mutex acquisition is slower than spinlock acquisition (involves scheduler data structures). Critical sections that were microseconds with spinlocks may be tens of microseconds with rt_mutex. Profile with ftrace to identify hot rt_mutex paths.
- IRQ thread overhead: Each IRQ thread is a schedulable kernel thread. High-rate interrupts (10kHz NIC) become 10k wakeups/second per CPU. For very high interrupt rates, IRQF_NO_THREAD keeps the handler in atomic context, at the cost of making that handler non-preemptible by RT tasks.
Failure Modes
- Missing mlockall: RT task faults on first access to a stack or heap page, triggering the kernel's page-fault path (which may take many microseconds). Symptoms: occasional large latency spikes only on first execution of code paths.
- IRQ affinity misconfiguration: A high-rate IRQ is affined to the RT CPU, consuming cycles. Symptom: irq/* thread at high CPU utilization on the RT CPU; cyclictest shows regularly spaced latency spikes at the IRQ rate.
- Unthreaded driver IRQ: A legacy driver using IRQF_NO_THREAD runs its full handler in atomic context on the RT CPU, causing latency spikes. Fix: review driver, remove IRQF_NO_THREAD, or pin its IRQ to non-RT CPUs.
- SMI latency spikes: Unpredictable spikes of 50-200µs that appear as gaps in ftrace, with no kernel activity logged while the TSC advances. These indicate firmware SMIs. No software fix; requires hardware/firmware change.
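The missing-mlockall failure mode can be caught in testing by counting page faults around the real-time loop: a correctly locked and pre-faulted task should take none there. The sketch below uses getrusage for this; the function names are hypothetical, and RUSAGE_SELF counts faults for the whole process, so other threads can contribute to the number.

```c
#include <sys/resource.h>

/* Minor plus major page faults taken by this process so far, or -1 on error. */
long page_faults_so_far(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/*
 * Faults taken since an earlier page_faults_so_far() sample, or -1 if
 * either sample failed. A non-zero result measured across the RT loop
 * suggests memory that was not locked or not pre-faulted.
 */
long page_faults_since(long before)
{
    long now = page_faults_so_far();

    if (before < 0 || now < 0)
        return -1;
    return now - before;
}
```

Sampling once after initialization and once after a test run, then asserting that page_faults_since returns 0, turns this failure mode into a check that can run in continuous integration.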
Modern Usage
- Linux 6.12 PREEMPT_RT mainline: No patching required on supported architectures (x86, ARM64, RISC-V). make menuconfig -> General Setup -> Preemption Model -> Fully Preemptible Kernel (Real-Time).
- Red Hat / RHEL for Real Time: Officially supported RHEL kernel with PREEMPT_RT, optimized for telecom (vRAN, DU) and industrial use cases.
- OSADL (Open Source Automation Development Lab): Maintains long-term PREEMPT_RT latency test results across dozens of hardware platforms and kernel versions. The OSADL latency monitor provides continuous RT performance data.
- ROS 2 real-time: The rclcpp executor supports priority-based callback execution on PREEMPT_RT Linux. Used in autonomous vehicle and drone platforms (Dronecode, Autoware).
Future Directions
- RT Linux + eBPF: eBPF programs running in kernel context need to be analyzable for RT impact. Work is ongoing to provide RT-safe eBPF execution semantics.
- Per-CPU kernel thread priorities: More granular control over which kernel threads compete for which CPU resources, reducing the need for manual IRQ affinity management.
- Memory bandwidth partitioning: Intel MBA (Memory Bandwidth Allocation) and similar DRAM bandwidth partitioning to prevent non-RT memory bandwidth consumption from creating latency spikes in RT tasks.
- Formal verification of RT primitives: Ongoing academic and industrial work to formally verify that the PREEMPT_RT rt_mutex and priority inheritance implementation is correct under all scheduler interleavings.
Exercises
- Build a PREEMPT_RT kernel from source for your target architecture. Verify PREEMPT_RT appears in uname -v. Run cyclictest with and without load stress. Compare the latency histograms.
- Identify and isolate two CPUs on a multi-core machine. Run cyclictest on an isolated CPU while running stress-ng --cpu $(nproc) on non-isolated CPUs. Measure the difference in max latency vs. non-isolated configuration.
- Write a POSIX real-time application that: (a) sets SCHED_FIFO priority 80, (b) calls mlockall, (c) pre-faults its stack, (d) sleeps with clock_nanosleep in a 1ms loop, (e) measures and prints a histogram of actual wake latency.
- Use rtla timerlat to identify whether your worst-case latency is dominated by IRQ, kernel, or user-space contribution. Then diagnose and fix the dominant contributor (e.g., move IRQs away from the RT CPU, disable C-states, set the performance cpufreq governor).
- Deliberately create a priority inversion scenario in a PREEMPT_RT system using a raw spinlock (simulate a kernel subsystem with a raw_spinlock_t held for 500µs). Measure its impact on a priority-99 cyclictest thread. What is the worst-case latency? Does priority inheritance help here, and why or why not?
References
- Ingo Molnar, Thomas Gleixner, Steven Rostedt — PREEMPT_RT LKML posts, 2004-present
- Steven Rostedt, "RT Linux in the Real World" (LinuxCon 2012)
- Thomas Gleixner, "Realtime Linux — The Long Way" (ELCE 2008)
- OSADL Real-Time Linux QA Farm: https://www.osadl.org/OSADL-QA-Farm-Real-time.linux-real-time.0.html
- Linux Foundation Real-Time Linux Wiki: https://wiki.linuxfoundation.org/realtime/start
- cyclictest source: https://git.kernel.org/pub/scm/utils/rt-tests/rt-tests.git
- rtla documentation: https://www.kernel.org/doc/html/latest/tools/rtla/
- Red Hat Performance Tuning Guide for Real Time: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/
- Carsten Emde, "Using and Understanding the Real-Time Cyclictest Benchmark" (OSADL, 2011)
- Daniel Bristot de Oliveira, "Demystifying the Real-Time Linux Scheduling Latency" (Real-Time Summit 2020)