04 — Interrupt Handling

Technical Overview

An interrupt is the hardware mechanism by which a device signals the CPU that it requires attention: a packet has arrived, a DMA transfer has completed, a keypress has occurred. Without interrupts, the CPU would have to poll every device continuously — a catastrophic waste of cycles for slow peripheral I/O. Interrupts allow the CPU to execute useful work until the hardware requires service.

The Linux interrupt handling architecture is a carefully balanced design between two conflicting requirements: interrupt handlers must be fast (to avoid blocking other interrupts and disrupting scheduler latency) and they must complete the work the interrupt represents (which may be substantial). The top-half / bottom-half split resolves this tension by making the hardware-facing handler minimal and deferring the bulk of work to a lower-priority context.

Prerequisites

x86 or ARM CPU architecture basics (privilege levels, registers)
Linux scheduler concepts (process context vs interrupt context)
Kernel concurrency primitives (spinlocks, atomic operations)
PCIe/PCI bus architecture (for MSI/MSI-X sections)

Hardware Interrupt Architecture (x86)

On x86, when a device asserts an interrupt, the following sequence occurs at the hardware level:

Device asserts IRQ line
         │
         ▼
   APIC (Advanced Programmable Interrupt Controller)
   ├── Local APIC: per-CPU, receives interrupt from I/O APIC
   ├── I/O APIC: routes hardware IRQs to CPU Local APICs
   └── Determines interrupt vector number (0-255)
         │
         ▼
   CPU: suspends current execution
   ├── Pushes SS, RSP, RFLAGS, CS, RIP onto stack (hardware frame)
   ├── Clears IF flag (disables maskable interrupts)
   └── Loads RIP from IDT[vector].offset
         │
         ▼
   IDT (Interrupt Descriptor Table): 256 entries
   Each entry: handler address + privilege level + gate type
         │
         ▼
   Linux's common interrupt entry (arch/x86/entry/entry_64.S)
   ├── Save all registers (push rax, rbx, ... full pt_regs)
   ├── call common_interrupt() (do_IRQ() before 5.8) → __handle_irq_event_percpu()
   │       └── Iterate irq_desc[n].action list
   │               └── Call handler functions
   └── Send EOI (End of Interrupt) to APIC
         │
         ▼
   IRET: restore registers, return to interrupted context

The IDT is a 256-entry table in memory, pointed to by the IDTR register. Each entry is a gate descriptor that specifies the handler's address, privilege level, and whether it's a trap gate (does not clear IF) or an interrupt gate (clears IF). Linux uses interrupt gates for hardware interrupts.
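The gate descriptor layout described above can be sketched in user-space C. This is a minimal illustration of the 16-byte x86-64 gate format, not the kernel's own definition; the names `idt_gate`, `make_gate`, `gate_handler`, and `gate_clears_if` are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* One 16-byte x86-64 IDT gate descriptor (illustrative layout). */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector */
    uint8_t  ist;          /* bits 0..2: Interrupt Stack Table index */
    uint8_t  type_attr;    /* bit 7: present, bits 5..6: DPL, bits 0..3: type */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
};

enum { GATE_INTERRUPT = 0xE, GATE_TRAP = 0xF };

/* Build a present gate for the given handler address. */
static struct idt_gate make_gate(uint64_t handler, uint16_t selector,
                                 unsigned type, unsigned dpl)
{
    struct idt_gate g = {0};
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    g.selector    = selector;
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5) | (type & 0xF));
    return g;
}

/* Reassemble the 64-bit handler address from its three pieces. */
static uint64_t gate_handler(const struct idt_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}

/* Interrupt gates clear IF on entry; trap gates leave it unchanged. */
static int gate_clears_if(const struct idt_gate *g)
{
    return (g->type_attr & 0xF) == GATE_INTERRUPT;
}
```

A gate built with `make_gate(addr, 0x10, GATE_INTERRUPT, 0)` has `type_attr` 0x8E, the value commonly seen for kernel-only interrupt gates.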

Top-Half / Bottom-Half Model Diagram

Hardware Interrupt
        │
        ▼
┌────────────────────────────────────────────────────────────┐
│ TOP HALF (interrupt context)                               │
│                                                            │
│ • Runs with interrupts disabled (or restricted)            │
│ • Must complete in microseconds                            │
│ • Cannot sleep, cannot schedule, no blocking              │
│ • Acknowledge device (clear interrupt pending bit)         │
│ • Read status registers                                    │
│ • Enqueue work for bottom half                             │
│ • Return IRQ_HANDLED or IRQ_NONE                           │
└────────────────────────┬───────────────────────────────────┘
                         │ schedule_work() / tasklet_schedule()
                         │ / raise_softirq() / return IRQ_WAKE_THREAD
                         ▼
┌────────────────────────────────────────────────────────────┐
│ BOTTOM HALF (various contexts)                             │
│                                                            │
│ Softirqs:        static, per-CPU, parallel across CPUs     │
│ ├── NET_TX_SOFTIRQ  — transmit packet completions          │
│ ├── NET_RX_SOFTIRQ  — receive packet processing           │
│ ├── BLOCK_SOFTIRQ   — block I/O completion                 │
│ ├── TIMER_SOFTIRQ   — timer expiry callbacks               │
│ ├── HI_SOFTIRQ      — tasklets (high priority)             │
│ └── TASKLET_SOFTIRQ — tasklets (normal priority)           │
│                                                            │
│ Tasklets:        built on softirqs, serialized per-tasklet │
│ Workqueues:      kernel threads, can sleep                 │
│ Threaded IRQs:   kernel thread per IRQ, can sleep (RT)     │
└────────────────────────────────────────────────────────────┘
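The split in the diagram can be modelled in plain user-space C: the top half only records an event in a small ring buffer, and the bottom half drains it later. This is a single-threaded sketch with hypothetical names (`event_ring`, `top_half`, `bottom_half`); a real driver would need locking or memory barriers between the two halves.

```c
#include <stdint.h>
#include <stddef.h>

#define RING_SIZE 8  /* power of two so indices can be masked */

struct event_ring {
    uint32_t slots[RING_SIZE];
    size_t head;  /* next slot the top half writes */
    size_t tail;  /* next slot the bottom half reads */
};

/* Top half: store the status word and return at once.
 * Returns 0 if the ring is full and the event was dropped. */
static int top_half(struct event_ring *r, uint32_t status)
{
    if (r->head - r->tail == RING_SIZE)
        return 0;
    r->slots[r->head & (RING_SIZE - 1)] = status;
    r->head++;
    return 1;
}

/* Bottom half: drain everything queued so far. Summing the status
 * words stands in for the real (slow) processing. */
static size_t bottom_half(struct event_ring *r, uint64_t *sum)
{
    size_t n = 0;
    while (r->tail != r->head) {
        *sum += r->slots[r->tail & (RING_SIZE - 1)];
        r->tail++;
        n++;
    }
    return n;
}
```

If events arrive faster than the bottom half drains them, the ring fills and `top_half` starts dropping, which is the same back-pressure problem a real driver has to size its queues for.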

Registering an Interrupt Handler

#include <linux/interrupt.h>

/* Handler function — must be fast and non-blocking */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    /* Read device status to confirm our interrupt */
    u32 status = readl(dev->base + STATUS_REG);
    if (!(status & IRQ_PENDING))
        return IRQ_NONE;   /* Not our interrupt — shared IRQ line */

    /* Acknowledge the interrupt (clears IRQ_PENDING bit) */
    writel(IRQ_ACK, dev->base + CONTROL_REG);

    /* Schedule bottom half work */
    tasklet_schedule(&dev->rx_tasklet);

    return IRQ_HANDLED;
}

/* Register during driver probe */
static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct my_device *dev;
    int irq, ret;

    /* ... allocation, ioremap ... */

    irq = pdev->irq;   /* or pci_irq_vector(pdev, 0) for MSI */

    ret = request_irq(irq,
                      my_irq_handler,
                      IRQF_SHARED,     /* shared IRQ line with other devices */
                      "my_device",     /* name shown in /proc/interrupts */
                      dev);            /* dev_id: passed back to handler, used for shared IRQ identification */
    if (ret) {
        dev_err(&pdev->dev, "failed to request IRQ %d: %d\n", irq, ret);
        return ret;
    }

    /* ... */
    return 0;
}

/* Unregister during driver remove */
static void my_remove(struct pci_dev *pdev)
{
    struct my_device *dev = pci_get_drvdata(pdev);
    free_irq(pdev->irq, dev);
    /* ... */
}
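To show why the handler above must return IRQ_NONE when the interrupt is not its own, here is a user-space sketch of how a shared line is dispatched: every registered action is called, and each decides for itself. The types and the `dispatch_shared_irq` helper are hypothetical stand-ins that mirror kernel naming; they are not the kernel implementation.

```c
#include <stddef.h>

typedef enum { IRQ_NONE = 0, IRQ_HANDLED = 1 } irqreturn_t; /* mirrors kernel naming */

struct irq_action {
    irqreturn_t (*handler)(int irq, void *dev_id);
    void *dev_id;                 /* identifies the device on a shared line */
    struct irq_action *next;
};

struct fake_device {
    int pending;    /* models the device's IRQ_PENDING status bit */
    int serviced;   /* how many interrupts this device's handler claimed */
};

/* Handler: claim the interrupt only if our device raised it. */
static irqreturn_t fake_handler(int irq, void *dev_id)
{
    struct fake_device *dev = dev_id;
    (void)irq;
    if (!dev->pending)
        return IRQ_NONE;
    dev->pending = 0;   /* acknowledge */
    dev->serviced++;
    return IRQ_HANDLED;
}

/* Call every action on the line; return how many claimed the interrupt. */
static int dispatch_shared_irq(int irq, struct irq_action *list)
{
    int handled = 0;
    for (struct irq_action *a = list; a; a = a->next)
        if (a->handler(irq, a->dev_id) == IRQ_HANDLED)
            handled++;
    return handled;
}
```

A dispatch in which no action claims the interrupt returns 0, which is the situation the kernel's spurious-interrupt accounting counts.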

request_irq flags

Flag	Meaning
`IRQF_SHARED`	Allow other handlers on same IRQ number
`IRQF_TRIGGER_RISING`	Edge-triggered (rising edge)
`IRQF_TRIGGER_HIGH`	Level-triggered (high level)
`IRQF_ONESHOT`	Keep IRQ disabled until threaded handler finishes
`IRQF_NO_AUTOEN`	Don't enable IRQ after request; caller enables later

Interrupt Context Rules

Code running in interrupt context operates under severe constraints. The kernel enforces many of these with runtime checks (when CONFIG_DEBUG_ATOMIC_SLEEP is enabled):

Prohibited in interrupt context:
- schedule(), msleep(), wait_event_*(), mutex_lock(), or any other call that can block
- Memory allocation with GFP_KERNEL (may sleep to reclaim memory); use GFP_ATOMIC instead
- Copying data to or from user space (copy_to_user(), copy_from_user()): the access can page-fault and sleep, and the interrupted task is not necessarily the one that owns the buffer
- Anything else that can trigger a page fault

Permitted in interrupt context:
- spin_lock_irqsave() / spin_unlock_irqrestore()
- kzalloc(size, GFP_ATOMIC): a non-sleeping allocation that may dip into reserves and can fail, so check the result
- readl(), writel(): MMIO register access
- tasklet_schedule(), schedule_work(): enqueue deferred work
- wake_up(): wake sleeping processes (but don't wait for them)

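The function in_interrupt() returns true in hardirq or softirq context (and also while bottom halves are disabled), but new driver code should not branch on it. When a lock is shared between process context and an interrupt handler, the process-context side must take it with spin_lock_irqsave(), which disables local interrupts in addition to taking the lock; plain spin_lock() there would deadlock if the interrupt fired on the same CPU while the lock was held.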
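The context checks discussed here are derived from a per-task counter whose bit fields track preemption, softirq, and hardirq nesting. The sketch below models that counter in user space. The field widths are an assumption based on recent include/linux/preempt.h (8 preempt bits, 8 softirq bits, 4 hardirq bits, 4 NMI bits), and the `model_*` helpers are hypothetical simplifications that ignore the separate "interrupts disabled" state.

```c
#include <stdint.h>

#define PREEMPT_BITS 8
#define SOFTIRQ_BITS 8
#define HARDIRQ_BITS 4
#define NMI_BITS     4

#define SOFTIRQ_SHIFT PREEMPT_BITS
#define HARDIRQ_SHIFT (SOFTIRQ_SHIFT + SOFTIRQ_BITS)
#define NMI_SHIFT     (HARDIRQ_SHIFT + HARDIRQ_BITS)

#define FIELD_MASK(bits, shift) (((1u << (bits)) - 1) << (shift))
#define SOFTIRQ_MASK FIELD_MASK(SOFTIRQ_BITS, SOFTIRQ_SHIFT)
#define HARDIRQ_MASK FIELD_MASK(HARDIRQ_BITS, HARDIRQ_SHIFT)
#define NMI_MASK     FIELD_MASK(NMI_BITS, NMI_SHIFT)

/* Non-zero while a hardware interrupt handler is running. */
static int model_in_hardirq(uint32_t count)
{
    return (count & HARDIRQ_MASK) != 0;
}

/* Non-zero in NMI, hardirq, or softirq context, and also while
 * bottom halves are merely disabled (the softirq field is non-zero). */
static int model_in_interrupt(uint32_t count)
{
    return (count & (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK)) != 0;
}

/* Simplified: sleeping is only legal when the whole counter is zero. */
static int model_may_sleep(uint32_t count)
{
    return count == 0;
}
```

A counter value of 1 (one spinlock held in process context) is not interrupt context, yet sleeping is still forbidden, which is why "am I in an interrupt?" is the wrong question for deciding whether blocking is safe.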

Softirqs

Softirqs (software interrupts) are statically allocated at compile time (there are exactly 10 in current kernels). They run after the top-half handler returns, with interrupts re-enabled. A single softirq can run simultaneously on multiple CPUs, making them the most performant bottom-half mechanism but also the most complex to write correctly.

/* Softirq vector IDs (include/linux/interrupt.h) */
enum {
    HI_SOFTIRQ = 0,          /* highest priority, for tasklets */
    TIMER_SOFTIRQ,            /* timer wheel */
    NET_TX_SOFTIRQ,           /* network transmit */
    NET_RX_SOFTIRQ,           /* network receive */
    BLOCK_SOFTIRQ,            /* block I/O completion */
    IRQ_POLL_SOFTIRQ,         /* iopoll */
    TASKLET_SOFTIRQ,          /* normal tasklets */
    SCHED_SOFTIRQ,            /* scheduler (load balancing) */
    HRTIMER_SOFTIRQ,          /* high-resolution timers */
    RCU_SOFTIRQ,              /* RCU callbacks */
    NR_SOFTIRQS
};

Softirqs run either immediately on the interrupt return path (irq_exit() calls into __do_softirq()) or in the per-CPU ksoftirqd/N kernel threads. If softirqs keep being re-raised, __do_softirq() stops after a bounded number of restarts or about 2 ms (MAX_SOFTIRQ_RESTART and MAX_SOFTIRQ_TIME in kernel/softirq.c) and wakes ksoftirqd to handle the rest at normal scheduling priority, so that user processes are not starved. This is why ksoftirqd shows high CPU usage under network-heavy workloads.
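The hand-off to ksoftirqd can be illustrated with a small user-space model of the pending-bitmask loop. It is a sketch under stated assumptions: the restart limit of 10 follows MAX_SOFTIRQ_RESTART, the time limit is left out, and `softirq_cpu`, `raise_softirq_model`, and `do_softirq_model` are hypothetical names rather than kernel functions.

```c
#include <stdint.h>

#define NR_SOFTIRQS 10
#define MAX_SOFTIRQ_RESTART 10

struct softirq_cpu {
    uint32_t pending;              /* bit n set: vector n has been raised */
    unsigned runs[NR_SOFTIRQS];    /* how often each vector's action ran */
    int ksoftirqd_woken;           /* set when work is deferred to the thread */
    /* hypothetical per-vector hook; it may raise further softirqs */
    void (*action)(struct softirq_cpu *cpu, int nr);
};

static void raise_softirq_model(struct softirq_cpu *cpu, int nr)
{
    cpu->pending |= 1u << nr;
}

/* Process pending vectors lowest number first; give up after a bounded
 * number of passes if new work keeps arriving. */
static void do_softirq_model(struct softirq_cpu *cpu)
{
    int restarts = MAX_SOFTIRQ_RESTART;

    while (cpu->pending) {
        uint32_t snapshot = cpu->pending;
        cpu->pending = 0;
        for (int nr = 0; nr < NR_SOFTIRQS; nr++) {
            if (snapshot & (1u << nr)) {
                cpu->runs[nr]++;
                if (cpu->action)
                    cpu->action(cpu, nr);
            }
        }
        if (cpu->pending && --restarts == 0) {
            cpu->ksoftirqd_woken = 1;   /* leave the rest to ksoftirqd */
            return;
        }
    }
}

/* Example action that re-raises itself, like sustained packet RX load. */
static void always_reraise(struct softirq_cpu *cpu, int nr)
{
    raise_softirq_model(cpu, nr);
}
```

With `always_reraise` installed, the loop runs ten passes and then defers, leaving the bit pending for the thread.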

Tasklets

Tasklets have historically been the most common bottom-half mechanism for device drivers. Unlike softirqs, tasklets can be created dynamically, and a given tasklet is guaranteed to run on at most one CPU at a time (serialized), making them easier to use correctly.

/* Tasklet handler (callback signature used since Linux 5.9) */
static void my_rx_handler(struct tasklet_struct *t)
{
    /* Recover the device that embeds this tasklet */
    struct my_device *dev = from_tasklet(dev, t, rx_tasklet);

    /* process received data: spinlocks are allowed here, mutexes are not */
    spin_lock(&dev->rx_lock);
    process_rx_queue(dev);
    spin_unlock(&dev->rx_lock);
}

/* In probe: initialize the tasklet embedded in struct my_device */
tasklet_setup(&dev->rx_tasklet, my_rx_handler);
/* Kernels before 5.9 used tasklet_init(&tasklet, handler, data) with a
 * void handler(unsigned long data) callback instead. */

/* From top-half handler: */
tasklet_schedule(&dev->rx_tasklet);

Note: Tasklets are deprecated for new code. The tasklet section of include/linux/interrupt.h carries a deprecation notice recommending threaded IRQs instead, and the callback signature was reworked in Linux 5.9 as part of that cleanup. New drivers should use threaded IRQs or workqueues.
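One property of tasklets that matters when counting events is that scheduling an already-pending tasklet is a no-op: several interrupts before the tasklet runs produce a single run. The user-space sketch below models that with one atomic flag standing in for the kernel's "scheduled" state bit; `tasklet_model` and its helpers are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct tasklet_model {
    atomic_bool scheduled;  /* plays the role of the "scheduled" state bit */
    int runs;               /* how many times the handler body executed */
};

/* Returns true if this call queued the tasklet, false if it was
 * already pending (the request is coalesced). */
static bool tasklet_schedule_model(struct tasklet_model *t)
{
    return !atomic_exchange(&t->scheduled, true);
}

/* Softirq side: clear the flag, then run the handler body once. */
static void tasklet_run_model(struct tasklet_model *t)
{
    if (atomic_exchange(&t->scheduled, false))
        t->runs++;
}
```

Because of this coalescing, a handler body must drain all outstanding work (for example a whole RX queue) rather than assume one run per interrupt.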

Workqueues

Workqueues execute deferred work in kernel threads (process context), so they can sleep. They are appropriate when bottom-half processing needs to block — e.g., waiting for firmware to respond, acquiring a mutex.

#include <linux/workqueue.h>

struct my_device {
    struct work_struct rx_work;
    struct workqueue_struct *wq;
    // ...
};

static void my_rx_work(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, rx_work);
    /* Can sleep here */
    mutex_lock(&dev->mutex);
    process_rx_data(dev);
    mutex_unlock(&dev->mutex);
}

/* In probe: */
dev->wq = alloc_workqueue("my_device_wq", WQ_UNBOUND | WQ_HIGHPRI, 0);
INIT_WORK(&dev->rx_work, my_rx_work);

/* From top-half: */
queue_work(dev->wq, &dev->rx_work);

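The kernel provides a shared system workqueue (system_wq, used via schedule_work()) for lightweight work. High-priority or latency-sensitive work should use a dedicated workqueue with WQ_HIGHPRI. WQ_UNBOUND is the flag that detaches work items from the submitting CPU so the scheduler can run them anywhere; WQ_CPU_INTENSIVE instead applies to per-CPU workqueues and exempts long-running work items from concurrency management, so they do not hold up other work queued on the same CPU.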
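The work function above receives only a pointer to the embedded work_struct and recovers its device with container_of. The pointer arithmetic behind that can be sketched in user space; `container_of_model`, `work_item`, and `my_device_model` are hypothetical stand-ins for the kernel macro and types.

```c
#include <stddef.h>

/* Step back from a member pointer to the start of the enclosing struct. */
#define container_of_model(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct work_item {
    void (*func)(struct work_item *work);
};

struct my_device_model {
    int id;
    int processed;
    struct work_item rx_work;   /* embedded, like struct work_struct */
};

/* Work function: only the embedded member is passed in. */
static void rx_work_fn(struct work_item *work)
{
    struct my_device_model *dev =
        container_of_model(work, struct my_device_model, rx_work);
    dev->processed++;
}

/* Stand-in for the worker thread picking up a queued item. */
static void run_work(struct work_item *work)
{
    work->func(work);
}
```

Embedding the work item in the device structure avoids a separate allocation and a back-pointer, which is why the pattern is common in driver code.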

Threaded IRQs

Threaded interrupt handling (IRQF_ONESHOT + request_threaded_irq) runs the "slow" handler in a dedicated per-IRQ kernel thread. This is the mechanism used by PREEMPT_RT (real-time Linux) to make interrupt handling preemptible.

/* Two-phase handler: */
static irqreturn_t my_hard_handler(int irq, void *dev_id)
{
    /* Absolute minimum: read status, ack hardware */
    struct my_device *dev = dev_id;
    dev->irq_status = readl(dev->base + STATUS_REG);
    writel(IRQ_ACK, dev->base + CONTROL_REG);
    return IRQ_WAKE_THREAD;   /* wake the thread handler */
}

static irqreturn_t my_thread_handler(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;
    /* Runs in a kernel thread: can sleep, use mutex, etc. */
    handle_device_data(dev, dev->irq_status);
    return IRQ_HANDLED;
}

request_threaded_irq(irq,
                     my_hard_handler,     /* primary handler (top half) */
                     my_thread_handler,   /* thread handler (bottom half) */
                     IRQF_ONESHOT,        /* keep IRQ disabled until thread completes */
                     "my_device",
                     dev);

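IRQF_ONESHOT keeps the IRQ line masked until the thread handler completes. It is mandatory when the primary handler is NULL (the kernel then rejects the request without it unless the interrupt chip is marked oneshot-safe), and it is needed whenever the primary handler does not silence the device itself: otherwise a level-triggered line would re-fire continuously while the thread handler is still running.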
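The re-fire problem can be made concrete with a toy model of a level-triggered line. The sketch assumes a primary handler that does not quiesce the device, so the line stays asserted until the thread handler services it; `line_model`, `deliver`, `thread_done`, and `simulate` are hypothetical names for this illustration only.

```c
#include <stdbool.h>

struct line_model {
    bool asserted;   /* device still holds the level-triggered line active */
    bool masked;     /* interrupt controller mask bit */
    int hard_calls;  /* times the primary handler ran */
};

/* One delivery opportunity: the primary handler runs if the line is
 * asserted and not masked. With oneshot, the line is then kept masked. */
static void deliver(struct line_model *l, bool oneshot)
{
    if (!l->asserted || l->masked)
        return;
    l->hard_calls++;
    if (oneshot)
        l->masked = true;
}

/* Thread handler finished: device serviced (line drops), line unmasked. */
static void thread_done(struct line_model *l)
{
    l->asserted = false;
    l->masked = false;
}

/* `delay` delivery opportunities pass before the thread gets to run. */
static int simulate(bool oneshot, int delay)
{
    struct line_model l = { true, false, 0 };
    for (int i = 0; i < delay; i++)
        deliver(&l, oneshot);
    thread_done(&l);
    deliver(&l, oneshot);   /* line is low now: nothing fires */
    return l.hard_calls;
}
```

In this model `simulate(true, 1000)` runs the primary handler once, while `simulate(false, 1000)` runs it a thousand times before the thread ever makes progress.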

Interrupt Affinity

On most distributions the irqbalance daemon distributes hardware interrupts across CPUs by default, to avoid bottlenecking a single CPU. For latency-critical or throughput-critical drivers, manual affinity control is important.

# Show current IRQ distribution
cat /proc/interrupts

# Show and set IRQ affinity (CPU bitmask in hex)
cat /proc/irq/24/smp_affinity          # e.g., "ff" = all CPUs
echo 4 > /proc/irq/24/smp_affinity     # CPU 2 only (bit 2)

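# For NIC with multiple RX queues: spread queues across CPUs
# ethtool --set-rxfh-indir eth0 equal 8  (RSS hash indirection)
irq_base=24   # assumption: first of 8 consecutive queue IRQs; check /proc/interrupts
for i in $(seq 0 7); do
    # smp_affinity is parsed as hex, so format the mask with %x
    printf '%x\n' $((1 << i)) > /proc/irq/$((irq_base + i))/smp_affinity
done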

NUMA-aware affinity: On multi-socket servers, interrupt affinity should pin each NIC queue's interrupt to a CPU on the same NUMA node as the NIC's PCIe root complex. Cross-NUMA interrupt handling can add on the order of 100 ns of latency per packet, from cache misses and the extra trip across the inter-socket link.
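Since smp_affinity takes a hexadecimal CPU bitmask, it helps to compute the mask rather than work it out by hand. The following sketch builds the mask from a list of CPU ids and formats it as comma-separated 32-bit hex groups, most significant first. The helper names are hypothetical, and the sketch is limited to 64 CPUs as a simplifying assumption.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Set one bit per CPU id; ids outside 0..63 are ignored in this sketch. */
static uint64_t affinity_mask(const int *cpus, size_t n)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++)
        if (cpus[i] >= 0 && cpus[i] < 64)
            mask |= (uint64_t)1 << cpus[i];
    return mask;
}

/* Format as hex; masks wider than 32 bits use "high,low" groups with
 * the low group zero-padded to 8 digits. */
static void format_affinity(uint64_t mask, char *buf, size_t len)
{
    unsigned hi = (unsigned)(mask >> 32);
    unsigned lo = (unsigned)(mask & 0xFFFFFFFFu);

    if (hi)
        snprintf(buf, len, "%x,%08x", hi, lo);
    else
        snprintf(buf, len, "%x", lo);
}
```

For CPU 2 alone this yields "4", matching the echo example above; for CPUs 0 and 33 it yields "2,00000001".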

MSI and MSI-X

Legacy PCI interrupts use shared physical wires (INTA# through INTD#) between multiple devices. The IRQ number is assigned by the BIOS/firmware, and all devices on the same line share one interrupt vector. This is why drivers on a shared line must check STATUS_REG to confirm the interrupt is theirs, and why a handler is often invoked for interrupts it did not cause.

MSI (Message Signaled Interrupts) replaces the physical wire with a memory write. The device writes a specific data value to a specific address (both configured by the OS in the device's MSI capability). On x86 the address falls in the 0xFEExxxxx range, and the write is delivered to the target CPU's Local APIC as an interrupt instead of being stored in memory. Benefits: no shared IRQ line, no interrupts caused by other devices' activity, and typically lower latency (a posted write instead of wire propagation plus a status read).

MSI-X extends MSI (which is limited to 32 vectors per device) to allow up to 2048 interrupt vectors per device, each with its own address and data value, each independently maskable. NVMe SSDs use MSI-X with one vector per completion queue, allowing per-queue interrupt affinity (one queue pinned per CPU core).

/* Enable MSI-X in a PCIe driver, falling back to MSI */
int nvecs = pci_alloc_irq_vectors(pdev,
                                  1,          /* min vectors */
                                  nr_queues,  /* max vectors */
                                  PCI_IRQ_MSIX | PCI_IRQ_MSI);
if (nvecs < 0) {
    /* Fall back to a legacy INTx interrupt
     * (the flag is named PCI_IRQ_INTX in recent kernels) */
    nvecs = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_LEGACY);
    if (nvecs < 0)
        return nvecs;
}

/* Get the Linux IRQ number for vector i and attach the per-queue handler */
int irq = pci_irq_vector(pdev, i);
int ret = request_irq(irq, my_handler_for_queue_i, 0, "my_device-q0", queue_i);
if (ret)
    return ret;
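The memory write that MSI relies on can be illustrated by composing the address and data values by hand. This sketch assumes the classic x86 format without interrupt remapping (xAPIC physical destination, fixed delivery mode, edge-triggered); `msi_msg_model`, `compose_msi`, and the accessor helpers are hypothetical names, not kernel APIs.

```c
#include <stdint.h>

#define MSI_ADDR_BASE 0xFEE00000u   /* x86 MSI address window */

struct msi_msg_model {
    uint32_t address;   /* where the device writes */
    uint32_t data;      /* what the device writes */
};

/* Target one CPU (by APIC id) with one vector. */
static struct msi_msg_model compose_msi(uint8_t dest_apic_id, uint8_t vector)
{
    struct msi_msg_model m;
    m.address = MSI_ADDR_BASE | ((uint32_t)dest_apic_id << 12);
    m.data    = vector;   /* delivery mode 000 (fixed), edge-triggered */
    return m;
}

/* Decode helpers, as the receiving side would interpret the write. */
static uint8_t msi_vector(const struct msi_msg_model *m)
{
    return (uint8_t)(m->data & 0xFF);
}

static uint8_t msi_dest(const struct msi_msg_model *m)
{
    return (uint8_t)((m->address >> 12) & 0xFF);
}
```

Changing an MSI-X vector's affinity amounts to rewriting the destination field of that vector's address, which is why each vector can be steered independently.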

Historical Context

The original PC used a single 8259A PIC (Programmable Interrupt Controller) with 8 IRQ lines; the PC/AT cascaded a second one for 15 usable lines, and this is the arrangement early uniprocessor Linux systems ran on. The IRQ numbers 0-15 were fixed by PC convention (IRQ0=timer, IRQ1=keyboard, IRQ14/15=IDE). This legacy is still visible in /proc/interrupts on modern x86 systems, where IRQ 0 is the timer.

The APIC (Advanced Programmable Interrupt Controller) architecture arrived with the Pentium generation, which integrated the Local APIC into the CPU, and became standard with SMP systems. The I/O APIC receives device interrupts and routes them to CPU Local APICs. A typical I/O APIC provides 24 IRQ inputs, which are mapped onto the 256 interrupt vectors available on each CPU.

The generic IRQ layer (kernel/irq/) was refactored by Ingo Molnár and Thomas Gleixner in Linux 2.6.19 (2006) to create the irq_chip abstraction, allowing a single code path to handle 8259A, APIC, GIC (ARM), and other interrupt controllers.

The PREEMPT_RT patchset (Ingo Molnár, Thomas Gleixner, and others; 2004-present) converts most interrupt handlers to threaded execution, making nearly all kernel code preemptible for real-time workloads. It was merged piecemeal over many releases, and PREEMPT_RT became selectable in mainline with Linux 6.12.

Production Examples

10GbE NIC (Intel X540): Uses MSI-X with one vector per transmit/receive queue pair. A 2-socket server with 16 cores per socket configures 32 RX queues, each pinned to one CPU core. RSS (Receive Side Scaling) distributes incoming flows across queues using a hash of src/dst IP+port. At 10Gbps line rate (14.88 Mpps for 64-byte frames), each CPU handles ~465K pps.

NVMe SSD (Samsung 980 Pro): Supports up to 128 MSI-X vectors. On a 16-core system, the nvme driver allocates 16 I/O queues + 1 admin queue = 17 vectors. Each CPU submits to its own queue (no lock contention) and receives completion interrupts on its own vector. This per-CPU queue design is what allows the drive to reach roughly 1M IOPS.

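Raspberry Pi GPIO interrupt: GPIO pins on the BCM2837 SoC can generate interrupts via the GPIO controller, which is chained to the SoC's top-level interrupt controller. On the BCM2837 (Pi 3) that is Broadcom's own interrupt controller rather than an ARM GIC; the BCM2711 in the Pi 4 uses a GIC-400. A GPIO interrupt for a button press traverses: GPIO controller → SoC interrupt controller → CPU → Linux virtual IRQ lookup → the GPIO driver's chained handler → the handler the user registered with request_irq.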
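The packet-rate figures quoted for the 10GbE example follow from Ethernet framing arithmetic: each frame occupies its own bytes plus 20 bytes of preamble and inter-frame gap on the wire. A small sketch of that calculation, with hypothetical helper names and integer division (so results are rounded down):

```c
#include <stdint.h>

/* Per-frame wire overhead: 8 bytes preamble/SFD + 12 bytes inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20

/* Maximum frames per second at a given link speed and frame size. */
static uint64_t line_rate_pps(uint64_t bits_per_sec, unsigned frame_bytes)
{
    return bits_per_sec / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8);
}

/* Even split of the packet rate across RX queues (ideal RSS spread). */
static uint64_t pps_per_queue(uint64_t total_pps, unsigned queues)
{
    return total_pps / queues;
}
```

For 64-byte frames at 10 Gbps this gives 14,880,952 pps, or about 465K pps per queue across 32 queues, assuming RSS spreads the flows evenly.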

Debugging Notes

# Show interrupt counts per CPU
watch -n1 cat /proc/interrupts

# Show softirq statistics
watch -n1 cat /proc/softirqs

# Trace interrupt handler calls with ftrace
echo 'function' > /sys/kernel/debug/tracing/current_tracer
echo 'my_irq_handler' > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe

# Check for IRQ storms (interrupt count growing too fast)
while true; do grep "eth0" /proc/interrupts; sleep 0.1; done

# Check ksoftirqd CPU usage
top -p "$(pgrep -d, ksoftirqd)"   # top expects a comma-separated PID list (at most 20 PIDs)

# Disable irqbalance and manually control affinity
systemctl stop irqbalance
echo 1 > /proc/irq/24/smp_affinity

IRQ storm detection: If a device's interrupt count in /proc/interrupts is growing at millions per second, the device is probably not being acknowledged properly (the handler returns IRQ_NONE, or the status register clear is not working). The kernel tracks this: if more than 99,900 of the last 100,000 interrupts on a line went unhandled, it logs "irq N: nobody cared" and disables the IRQ.
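The accounting behind this detection can be sketched as a windowed counter. The model below follows the logic of the kernel's spurious-interrupt check as I understand it (a window of 100,000 interrupts and a threshold of 99,900 unhandled), but the structure and function names are hypothetical and the kernel's timing heuristics are omitted.

```c
#define IRQ_WINDOW      100000u  /* interrupts per accounting window */
#define UNHANDLED_LIMIT 99900u   /* unhandled count that trips the check */

struct irq_stats_model {
    unsigned irq_count;       /* interrupts seen in the current window */
    unsigned irqs_unhandled;  /* of those, how many nobody claimed */
    int disabled;             /* set once the line has been shut off */
};

/* Record one interrupt; `handled` is non-zero if some handler
 * returned IRQ_HANDLED for it. */
static void note_interrupt_model(struct irq_stats_model *s, int handled)
{
    if (s->disabled)
        return;
    if (!handled)
        s->irqs_unhandled++;
    if (++s->irq_count < IRQ_WINDOW)
        return;

    /* Window complete: judge it, then start a fresh one. */
    s->irq_count = 0;
    if (s->irqs_unhandled > UNHANDLED_LIMIT)
        s->disabled = 1;   /* "nobody cared": disable the line */
    s->irqs_unhandled = 0;
}
```

A line where even one interrupt in a hundred is claimed stays well under the threshold, so the check only trips on lines that are almost entirely unhandled.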

Security Implications

Interrupt injection: A malicious device (e.g., one attached over Thunderbolt) can forge interrupts by issuing DMA writes to the MSI address range, which the platform treats as interrupt messages. DMA remapping alone does not stop this; the interrupt remapping feature of Intel VT-d (and the AMD-Vi equivalent) validates each message against a per-device table, so that a device can only raise the vectors the OS assigned to it. Without interrupt remapping, any device that can issue memory writes can request any vector on any CPU.

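Interrupt timing attacks: An attacker who can influence interrupt timing (for example via a rowhammer-adjacent technique on DDR4) could in principle cause a vulnerable interrupt handler to execute at a precisely controlled time, exploiting a race condition in the handler. Handlers should therefore protect shared state with proper locking and not rely on interrupts arriving at benign moments.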

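Spectre variant 1 in interrupt handlers: A bounds-checked array access whose index comes from a device register or other untrusted input can be executed speculatively with an out-of-range index, before the bounds check resolves. Handlers that index arrays with such values should clamp the index with array_index_nospec() after the bounds check.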

Performance Implications

Interrupt mitigation (coalescing): High-rate interrupts (e.g., a 1Gbps NIC at about 1.49 Mpps with 64-byte frames) would overwhelm the CPU if each packet generated one interrupt. NAPI (New API) solves this for network drivers: after the first interrupt, the driver masks the device's receive interrupt and polls for more packets in a softirq loop, re-enabling the interrupt only when no more packets are found. This trades latency for throughput.
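The NAPI pattern can be modelled in a few lines of user-space C: the first packet raises an interrupt and masks further ones, and a budgeted poll loop then drains the ring, re-enabling the interrupt only when a poll comes up short of its budget. The structure and function names below are hypothetical, and the budget of 64 used in the note is an assumption for illustration.

```c
struct nic_model {
    unsigned rx_pending;   /* packets waiting in the RX ring */
    int irq_enabled;       /* device RX interrupt unmasked? */
    unsigned interrupts;   /* hardware interrupts taken */
    unsigned polls;        /* poll passes executed */
    unsigned processed;    /* packets handed up the stack */
};

/* Packets arrive; an interrupt is taken only if the IRQ is unmasked,
 * and the top half masks it before scheduling the poll. */
static void packet_arrives(struct nic_model *n, unsigned count)
{
    n->rx_pending += count;
    if (n->irq_enabled) {
        n->interrupts++;
        n->irq_enabled = 0;
    }
}

/* One poll pass: process up to `budget` packets. Finishing under
 * budget means the ring is empty, so go back to interrupt mode. */
static unsigned napi_poll_model(struct nic_model *n, unsigned budget)
{
    unsigned done = n->rx_pending < budget ? n->rx_pending : budget;

    n->rx_pending -= done;
    n->processed += done;
    n->polls++;
    if (done < budget)
        n->irq_enabled = 1;
    return done;
}
```

With a budget of 64, a burst of 200 packets costs one interrupt and four poll passes in this model, where a naive driver would take 200 interrupts.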

Cache effects: Interrupt handlers run on whichever CPU received the interrupt. If that CPU doesn't own the relevant data structures in cache, interrupt handling adds L3 cache misses. Interrupt affinity pinning ensures data locality.

Failure Modes

Interrupt not firing: Check solder, PCI configuration, and IRQ routing. Verify IRQF_SHARED is set if the IRQ is shared. Check that the device's interrupt enable bit is set in its control register.
Handler called but returns IRQ_NONE: Device status register doesn't show interrupt pending. Likely a shared IRQ where a different device on the same line is responsible.
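Kernel BUG on sleeping in interrupt context: With CONFIG_DEBUG_ATOMIC_SLEEP=y, might_sleep() reports "BUG: sleeping function called from invalid context"; an actual call to schedule() in atomic context reports "BUG: scheduling while atomic". The stack trace identifies the offending driver code.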
Soft lockup / RCU stall: Bottom-half processing is taking too long, starving the scheduler. Check for infinite loops or unexpected blocking in softirq/tasklet handlers.

Modern Usage

With PREEMPT_RT selectable in mainline since Linux 6.12 (after years of incremental merging of its core infrastructure), threaded IRQs are the recommended pattern for new drivers that need deferred work. The Linux real-time project (linux-rt-devel) targets sub-100µs worst-case interrupt latency for industrial automation use cases (CNC machines, robotics).

Future Directions

Poll-mode drivers (DPDK/SPDK): For absolute maximum packet processing rates, DPDK bypasses the kernel interrupt model entirely. CPUs spin in a polling loop checking device RX queues without any interrupt involvement. This eliminates interrupt overhead at the cost of dedicating entire CPU cores to polling. At 100Gbps (148Mpps), even the interrupt overhead of NAPI is too high.

Exercises

Write a driver that uses request_threaded_irq with a GPIO interrupt on a Raspberry Pi. Have the thread handler read a temperature sensor via I2C and log the result.
Benchmark the interrupt latency difference between tasklets and threaded IRQs using cyclictest.
Set up a simulated "interrupt storm" by registering a timer that fires at 100kHz and observe the effect on ksoftirqd CPU usage.
Configure RSS on a multi-queue NIC (ethtool -X eth0 equal 8) and verify via /proc/interrupts that interrupts are being distributed across CPUs.
Use perf top to identify which functions are consuming the most CPU time during a high-network-throughput test. Observe the softirq functions in the call graph.

References

include/linux/interrupt.h — request_irq, IRQF flags, tasklet API
kernel/irq/ — generic IRQ layer implementation
Documentation/core-api/irq/ — IRQ documentation
Thomas Gleixner, Ingo Molnár, "Interrupt Handling" — OLS 2006
Linux Device Drivers, 3rd Edition, Chapter 10 — Interrupt Handling
NAPI documentation: Documentation/networking/napi.rst
Intel 64 Architecture Software Developer's Manual, Volume 3A, Chapter 10 (APIC)
PREEMPT_RT documentation: https://wiki.linuxfoundation.org/realtime/