Real-Time Systems Fundamentals

Overview

A real-time system is one where correctness depends not only on the logical result of a computation but also on the time at which results are produced. An answer that is logically correct but delivered too late is, in the context of real-time systems, a failure. This temporal constraint distinguishes real-time systems from all other software.

Real-time does not mean fast. A geological monitoring system that must process seismic data within 10 seconds is real-time. A nuclear reactor control loop that must respond within 10 milliseconds is also real-time. What both share is a defined deadline by which a response must be produced and a defined consequence for missing it.

Prerequisites

Operating systems fundamentals: scheduling, interrupts, context switching
Basic understanding of embedded systems and MCUs
Familiarity with C/C++ systems programming
Basic probability and statistics for jitter analysis
Calculus-level understanding of rates and periods

Taxonomy of Real-Time Systems

Hard Real-Time

In hard real-time systems, missing a deadline is as catastrophic as an incorrect result. The system is defined to have failed.

Hard RT: Deadline Miss = System Failure
------+------+------+------+---->  time
      |      |      |      |
      D1     D2     D3     D4     (deadlines)

  Value
  ^
  |###
  |   #
  |    #
  0----+---->  0
       deadline

Examples: - Automotive airbag: Must fire within 15-30ms of collision detection. A 31ms response is physically useless. - Fly-by-wire flight control: Control surface actuation must occur within the flight control loop period (typically 25-40ms, or faster for inner loop stability). Late response can make the aircraft uncontrollable. - Cardiac pacemaker: Electrical stimulation must occur within the precise cardiac cycle window. Off-timing causes arrhythmia. - Nuclear reactor SCRAM: Emergency rod insertion must complete within seconds of detecting an excursion. This is the one-second deadline that cannot slip. - Industrial robot end-stop: Motor cutoff must occur within microseconds of a limit switch signal. Late response means physical destruction.

Soft Real-Time

Missing a deadline degrades quality but does not constitute system failure.

Soft RT: Deadline Miss = Degraded Quality, System Continues
  Value
  ^
  |###
  |   ###
  |      ##
  |         #
  0----------+------>  time
             deadline (utility decreasing after deadline)

Examples: - Video playback: Missed frame deadline causes visible stutter or dropped frame. Noticeable but not catastrophic. - Audio streaming: A 20ms late audio buffer causes a pop or glitch. Acceptable occasionally. - Interactive UI: Input response after 100ms feels sluggish. After 200ms feels broken. The application still works. - Network packet forwarding: Late packet delivery causes TCP retransmit. Performance degrades but communication continues.

Firm Real-Time

A deadline miss renders the result completely useless, but the system survives and continues operating.

Examples: - Algorithmic trading: A buy/sell order that misses its execution window (perhaps 10µs in high-frequency trading) must be discarded — executing a stale order may lose money. But the trading system continues operating and can place new orders. - Live sensor telemetry: A temperature sample that arrives 2x its period late must be discarded as stale. The monitoring system continues with subsequent samples. - Teleconferencing: A voice packet older than the jitter buffer depth is useless and dropped. The call continues but has a momentary gap.

Real-Time Task Model

The foundational model for real-time task analysis:

Task Model Parameters:
+-------------------------------------------------------+
|                                                       |
|  Task τ_i is characterized by:                       |
|                                                       |
|  C_i  = Worst-Case Execution Time (WCET)             |
|  T_i  = Period (for periodic tasks)                  |
|  D_i  = Relative deadline (often D_i = T_i)          |
|  R_i  = Response time (actual end-to-end latency)    |
|                                                       |
|  For schedulability: R_i <= D_i  for all i           |
|                                                       |
+-------------------------------------------------------+

Timeline visualization:

  τ_1 [period=20ms, C=3ms, D=20ms]
  ----+========+-------+========+-------+========+-->
      |  exec  | idle  |  exec  | idle  |  exec  |
      0        3ms    20ms     23ms    40ms     43ms

  τ_2 [period=50ms, C=10ms, D=50ms]
  ----+============================+--=============+-->
      |          exec              |   exec        |
      0          10ms              50ms            60ms

Periodic, Sporadic, and Aperiodic Tasks

Periodic: Arrive at fixed intervals T. Motor control loop, sensor sampling.
Sporadic: Arrive at random intervals with a minimum inter-arrival time. External interrupt handlers.
Aperiodic: Arrive at arbitrary times with no minimum separation. User input, network packets.

Aperiodic tasks are the hardest to accommodate in a hard real-time system because they offer no temporal guarantees for schedulability analysis.

Worst-Case Execution Time (WCET)

WCET is the longest time a task can possibly execute, over all inputs and all hardware states. It is the critical parameter for schedulability analysis.

Why WCET Is Hard

Modern hardware features that make systems fast in the average case make WCET extremely difficult to bound:

Instruction caches: A cache miss can cost 100+ cycles vs. 1 cycle for a hit
Branch predictors: Misprediction flushes the pipeline
Out-of-order execution: Execution time depends on data-dependent instruction sequences
Memory buses: Shared-bus contention on multi-core systems
DMA transfers: Competing for memory bandwidth

Measurement-Based WCET Analysis

Instrument the code and measure execution time across many inputs and scenarios:

Measurement approach:
1. Identify all paths through the task
2. Create test inputs that exercise each path
3. Execute N times (N ≥ 10,000) with hardware performance counters
4. Take worst observed time plus safety margin (typically 10-20%)
5. Validate with code coverage (all branches exercised?)

Limitation: cannot guarantee worst case was observed.
Suitable for: soft RT, systems where consequences of underestimate are minor.

Static WCET Analysis

Model all execution paths through static program analysis combined with processor timing models:

Static WCET Tools:
- AbsInt aiT: Commercial; used for DO-178C certified avionics
- OTAWA: Academic; open-source
- TACLeBench: Standard benchmark suite for WCET tools

Process:
1. Value analysis: bound all loop counts, pointer values
2. Control flow analysis: enumerate all feasible execution paths
3. Microarchitecture analysis: model cache states, pipeline fill along each path
4. ILP (Integer Linear Programming): maximize execution time across all paths

Static WCET is required for DO-178C Level A avionics software and IEC 61508 SIL4 safety systems. It produces a verified upper bound rather than an observed maximum.

Schedulability Analysis

Given a set of tasks with known WCET and periods, can all tasks meet their deadlines? This is the schedulability problem.

Utilization Bound Test (Liu & Layland, 1973)

For N periodic tasks scheduled under Rate Monotonic Scheduling (RMS — shorter period = higher priority):

Utilization:  U = sum(C_i / T_i) for i = 1..N

Liu & Layland Bound:
  U_bound = N * (2^(1/N) - 1)

  N=1:  100%  (trivially)
  N=2:  82.8%
  N=3:  78.0%
  N=5:  74.3%
  N→∞: 69.3% (ln 2)

If U <= U_bound: task set is provably schedulable under RMS.
If U > U_bound: need response time analysis to determine schedulability.

Response Time Analysis (Audsley et al., 1993)

More precise than utilization bound. For each task i, compute worst-case response time:

R_i^(0) = C_i

R_i^(n+1) = C_i + sum_{j ∈ hp(i)} ceil(R_i^(n) / T_j) * C_j

where hp(i) = tasks with higher priority than i

Repeat until R_i^(n+1) = R_i^(n) (converged) or R_i > D_i (infeasible).

Example:
  τ_1: C=3ms, T=10ms
  τ_2: C=6ms, T=25ms

  R_2^(0) = 6
  R_2^(1) = 6 + ceil(6/10)*3 = 6 + 3 = 9
  R_2^(2) = 6 + ceil(9/10)*3 = 6 + 3 = 9  (converged)
  R_2 = 9ms <= D_2 = 25ms -> τ_2 is schedulable

Scheduling Algorithms

Rate Monotonic Scheduling (RMS)

Fixed-priority, assign priority inversely proportional to period. Short period → high priority. Optimal among fixed-priority algorithms for independent periodic tasks (Liu & Layland theorem).

Earliest Deadline First (EDF)

Dynamic priority, always schedule the task with the nearest absolute deadline. Optimal among all preemptive schedulers for single-processor systems: can schedule any feasible task set. Can utilize CPU up to 100%.

EDF vs RMS comparison:

  Task Set: τ_1(C=3, T=7), τ_2(C=3, T=12)
  U = 3/7 + 3/12 = 0.429 + 0.250 = 0.679 < ln2 (schedulable under both)

  Timeline (RMS: τ_1 higher priority):
  0  3  6  7  9  12 14 16 19 21 ...
  |τ1|τ2|τ1|τ1|τ2|τ1|τ1|τ2|τ1|τ1|

Priority Inheritance and Ceiling Protocols

Critical for preventing priority inversion with shared resources:

Priority Inheritance Protocol (PIP): Low-priority task holding a mutex temporarily inherits the priority of the highest-priority waiter. Simple to implement (FreeRTOS mutex, POSIX PTHREAD_PRIO_INHERIT). Can cause chained inheritance and is not fully analyzable.

Priority Ceiling Protocol (PCP): Each mutex has a priority ceiling equal to the highest priority of any task that locks it. A task can only lock a mutex if its priority is above the ceiling. Bounds blocking to one lower-priority task's critical section. Analyzable for response time.

Priority Inversion (without protocol):
  H [priority 3]: waiting for mutex
  M [priority 2]: preempted L, running
  L [priority 1]: holds mutex, cannot run due to M

  H is stuck behind M: inversion!

With Priority Inheritance:
  H waits for mutex -> L's priority raised to 3
  M [priority 2] cannot preempt L [now priority 3]
  L completes critical section, releases mutex
  H runs

Jitter

Jitter is the variation in task response time or inter-arrival time from its nominal value. Even if a task always meets its deadline, excessive jitter causes downstream problems:

Sensor sampling jitter: ADC samples taken at irregular intervals corrupt frequency-domain analysis (FFT assumes uniform sampling)
Motor control jitter: Variable PWM update timing causes torque ripple
Network packet timing jitter: Causes TCP throughput degradation, audio/video codec buffer issues

Jitter sources (additive chain):

  External Event
      |
      | Hardware interrupt latency
      | (IRQ line to CPU acknowledge)
      v
  IRQ Acknowledged
      |
      | ISR execution + tail-chaining overhead
      v
  ISR Completes
      |
      | Scheduler latency
      | (time from unblock to task actually running)
      v
  Task Begins Execution
      |
      | Task execution time variation
      | (cache effects, branch prediction)
      v
  Task Completes

  Total jitter = sum of all stage variances

For hard real-time systems, each stage's worst case must be bounded and sum of bounds must satisfy the deadline constraint.

Why Standard Linux Fails Real-Time Requirements

Standard Linux was designed for throughput and fairness, not bounded response time. Key failure points:

Unbounded Kernel Lock Hold Times

Linux kernel code paths hold spinlocks while performing operations that can take unbounded time. Before PREEMPT_RT, the network stack, memory allocator, and filesystem code all had long non-preemptible sections.

Non-Preemptible Kernel Sections

Standard Linux kernel had PREEMPT_VOLUNTARY at best — preemption only occurred at explicit cond_resched() calls. A high-priority userspace task could be blocked for milliseconds while the kernel processed a softirq on behalf of a different process.

Interrupt Handling Model

In standard Linux, hardware interrupt handlers (hardirqs) run with interrupts disabled on the CPU. A long network receive handler blocks all other interrupts, including the timer tick. Ethernet interrupt handlers can run for hundreds of microseconds.

Memory Allocation Non-Determinism

kmalloc(GFP_KERNEL) can block waiting for page reclaim. vmalloc can trigger TLB shootdowns across all CPUs. Neither is suitable for use in a real-time task's critical path.

Measured Latency

On standard x86 Linux under load: - Interrupt latency: 10µs - 1ms typical, up to 10ms under worst case - Scheduling latency: 100µs - 10ms depending on kernel lock contention - Tail latency: effectively unbounded in the presence of long softirq processing

For applications requiring <1ms determinism, this is unacceptable.

Latency Measurement: cyclictest

cyclictest is the standard tool for measuring real-time scheduling latency:

# Measure scheduling latency, 10000 samples, at priority 99
cyclictest --priority=99 --loops=10000 --interval=1000

# Output:
# T: 0 (  3) I:   1000 C:   10000 Min:      5 Act:      6 Avg:      6 Max:      47
#               priority      interval   count  min_µs  actual  average  max_µs

# On standard Linux with load (make -j16):
# Max: 1200-5000µs

# On PREEMPT_RT Linux with proper tuning:
# Max: 20-80µs

The max latency (worst-case scheduling latency) is the critical figure for hard RT systems.

Production Examples

F/A-18 Hornet Flight Control Computer: VxWorks RTOS on PowerPC with 40ms outer loop, 6.25ms inner loop. WCET analysis per DO-178C. Triple-redundant with voting.

Toyota Prius Hybrid ECU: AUTOSAR-compliant RTOS (OSEK-derived) on Renesas RH850. Engine control tasks at 1ms, regenerative braking coordination at 10ms. DO they use RMS? AUTOSAR OS uses fixed-priority preemptive scheduling.

ABS Braking System (Bosch): 5ms cycle time for wheel speed sampling and valve control. Hard real-time. Bare-metal or AUTOSAR OS on Infineon AURIX TriCore.

Medical Ventilator (Philips Respironics, Hamilton Medical): IEC 62304 Class C software. Pressure control loop at 10ms, alarm system guaranteed response within 120ms. Validated RTOS with WCET analysis.

Mars Curiosity Rover (JPL): VxWorks on RAD750. Command sequencing with timed execution, telemetry at fixed schedule. 11-minute one-way light time means the vehicle must operate autonomously with onboard hard real-time responses to hazards.

Debugging Notes

ftrace: Linux kernel tracing framework. trace_event=sched_switch,irq_handler_entry,irq_handler_exit shows task scheduling and IRQ timeline. function_graph tracer shows kernel function call timing.
perf sched: perf sched record; perf sched latency shows per-task scheduling latency statistics over a workload period.
Latency histogram: cyclictest -H 200 > hist.dat then gnuplot the histogram. A tail at >100µs on a claimed RT system indicates a problem path.
rtla (Real-Time Linux Analysis): New toolset in Linux 5.17+. rtla timerlat top identifies the source of latency spikes in the kernel call chain.
Priority misconfiguration: The most common RT system failure. Verify with ps -eo pid,pri,rtprio,class,comm that RT threads are at SCHED_FIFO priority 99 and that competing threads (network, display) are at lower priority.

Security Implications

Denial-of-service via CPU starvation: A malicious process at the same or higher SCHED_FIFO priority can starve all lower-priority RT tasks. Mitigate with RLIMIT_RTPRIO caps per user, CONFIG_RT_GROUP_SCHED for cgroup-based RT budgeting.
WCET analysis and adversarial inputs: If WCET is measured rather than statically analyzed, adversarially crafted inputs may trigger execution paths not observed during testing. Safety-critical systems require static WCET analysis.
Trusted execution: RT systems controlling physical processes are high-value targets. Secure boot, measured boot (TPM), and code signing are essential to ensure the RT task set executing is the one that was analyzed and certified.

Performance Implications

Overhead of real-time scheduling: SCHED_FIFO/SCHED_RR has slightly lower overhead than CFS (no CFS red-black tree operations). Negligible difference vs. the latency benefits.
Cache partitioning: On multi-core RT systems, cache pollution from non-RT tasks degrades WCET predictability. Intel CAT (Cache Allocation Technology) and ARM MPAM can partition LLC between RT and non-RT workloads.
NUMA effects on RT: On multi-socket servers, RT tasks should be pinned to cores on the same NUMA node as their memory allocations to avoid remote DRAM access latency (typically 2-3x local DRAM latency).

Failure Modes

WCET underestimation: The most dangerous failure — system passes schedulability analysis but fails at runtime when a rare code path is triggered under worst-case hardware conditions.
Priority inversion uncaught: Without priority inheritance, high-priority tasks miss deadlines due to priority inversion. Silent — no error is logged, the task simply misses its window.
Thundering herd on sporadic tasks: Multiple sporadic tasks arrive simultaneously (e.g., multiple hardware interrupts at once). The combined load exceeds the spare capacity assumed in the task model, causing cascading deadline misses.
Clock drift on distributed RT: Distributed real-time systems (e.g., CAN networks, EtherCAT) require synchronized clocks. Clock drift causes distributed tasks to lose synchronization. IEEE 1588 PTP (Precision Time Protocol) provides sub-microsecond synchronization.

Modern Usage

AUTOSAR (Automotive Open System Architecture): Standard OS interface for automotive ECUs. AUTOSAR Classic (OSEK-derived, bare-metal RT) and AUTOSAR Adaptive (POSIX-based, PREEMPT_RT Linux). Over 80% of new vehicles use AUTOSAR.
ROS 2 on PREEMPT_RT: Robot Operating System 2 provides real-time executor options and runs on PREEMPT_RT Linux for robot control applications.
Industrial EtherCAT: 1ms cycle time, sub-microsecond synchronization across hundreds of servo drives. Linux-based master (EtherCAT master stack on PREEMPT_RT) or dedicated FPGA/ASIC slave controllers.
5G Radio Access: 5G baseband processing requires sub-millisecond scheduling. Intel FlexRAN reference implementation on DPDK + PREEMPT_RT on Xeon with AVX-512.

Future Directions

RISC-V real-time: RISC-V's open ISA enables formally verified hardware for RT systems. RISC-V time CSRs provide standardized high-resolution timers.
Mixed-criticality systems (MCS): Consolidating safety-critical and non-critical software on the same hardware with hardware-enforced isolation (hypervisor, MPU). Reduces cost by eliminating separate ECUs. Active research area.
Formal methods for RT: Formal verification of RT scheduling algorithms and WCET bounds using proof assistants (Coq, Isabelle). Some industrial toolchains (AbsInt) already produce machine-checked WCET proofs.
Time-Sensitive Networking (TSN): IEEE 802.1Qbv and related standards bring deterministic Ethernet to industrial and automotive networks, replacing proprietary fieldbus protocols with standard hardware.

Exercises

Design a task set for a motor controller: 3 tasks — current sensing (C=50µs, T=500µs), speed control (C=200µs, T=1ms), position control (C=500µs, T=10ms). Compute CPU utilization. Apply the Liu & Layland bound test. Then perform response time analysis to verify each task's worst-case response time.
Implement a rate-monotonic scheduler simulator in Python. Accept a task set as input, simulate N hyperperiods, and output whether all deadlines are met along with a Gantt chart.
Measure worst-case interrupt-to-task latency on a Raspberry Pi 4 running standard Raspberry Pi OS and then with PREEMPT_RT kernel. Use cyclictest. Observe the latency histogram difference.
Construct a priority inversion scenario in C using POSIX threads: high-priority thread, medium-priority CPU-bound thread, and low-priority thread holding a mutex. Measure the inversion duration. Then replace with pthread_mutexattr_setprotocol(PTHREAD_PRIO_INHERIT) and measure again.
Research the Boeing 737 MAX MCAS failure in the context of real-time system design. What assumptions in the system model were violated? What changes were made to the control loop timing and redundancy as part of the return-to-service modifications?

References

C.L. Liu and J.W. Layland, "Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment," JACM 20(1), 1973
N.C. Audsley et al., "Applying New Scheduling Theory to Static Priority Preemptive Scheduling," Software Engineering Journal, 1993
John A. Stankovic, "Misconceptions About Real-Time Computing," IEEE Computer, 1988
Giorgio Buttazzo, Hard Real-Time Computing Systems (3rd ed., Springer, 2011)
Hermann Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications (2nd ed., Springer, 2011)
Jane Liu, Real-Time Systems (Prentice Hall, 2000)
AUTOSAR Foundation: https://www.autosar.org/
cyclictest documentation: https://wiki.linuxfoundation.org/realtime/documentation/howto/tools/cyclictest/start