03 — ftrace and tracefs

Technical Overview

ftrace (Function Tracer) is the Linux kernel's built-in tracing framework. Unlike perf (which profiles from outside the kernel) or KGDB (which requires halting the kernel), ftrace instruments the kernel in-place using a combination of static tracepoints and dynamic function hooking. It is accessible through the tracefs filesystem at /sys/kernel/tracing/ (or /sys/kernel/debug/tracing/ on older systems), requiring no additional tooling beyond shell commands and cat. ftrace is the debugging tool of first resort when you have a production kernel issue and limited tools available — every Linux kernel since 2.6.27 includes it.

The framework was originally created by Steven Rostedt (Red Hat) and has grown to include a family of tracers: function tracer, function_graph tracer, irqsoff tracer, preemptoff tracer, wakeup tracer, and more. The key insight is that kernel functions are compiled with a call site at their entry point (the mcount or __fentry__ call, inserted by -pg or -mfentry gcc flags), which ftrace patches dynamically to enable tracing with near-zero overhead when disabled and selective overhead when enabled.

Prerequisites

Root access (tracefs requires root)
Understanding of kernel functions and the Linux call stack
Familiarity with shell scripting
Basic understanding of interrupts and preemption

Core Content

tracefs Layout

TRACEFS STRUCTURE
/sys/kernel/tracing/
├── current_tracer          # active tracer (nop, function, function_graph, etc.)
├── tracing_on              # 0/1: enable/disable tracing
├── trace                   # current trace buffer contents (read to display)
├── trace_pipe              # blocking read: new data as it arrives
├── trace_options           # per-option files (latency-format, funcgraph-proc, etc.)
├── available_tracers       # list of compiled-in tracers
├── available_filter_funcs  # list of all traceable kernel functions
├── set_ftrace_filter       # which functions to trace (empty = all)
├── set_ftrace_notrace      # which functions to NOT trace
├── set_graph_function      # for function_graph: start tracing here
├── buffer_size_kb          # per-CPU ring buffer size
├── tracing_cpumask         # which CPUs to trace
├── events/                 # tracepoint subdirectory
│   ├── syscalls/
│   │   ├── sys_enter_read/
│   │   │   ├── enable        # 0/1 to enable this tracepoint
│   │   │   ├── format        # field descriptions for this tracepoint
│   │   │   └── filter        # apply a filter expression
│   │   └── sys_exit_read/
│   ├── sched/
│   │   ├── sched_switch/
│   │   └── sched_wakeup/
│   ├── block/
│   ├── net/
│   └── ...
├── instances/              # create separate trace buffers per instance
└── options/
    ├── funcgraph-proc      # show process name in function_graph output
    ├── funcgraph-cpu       # show CPU number
    └── func_stack_trace    # show stack trace with function trace

Enabling Function Tracing

# Clear any existing trace data
echo "" > /sys/kernel/tracing/trace

# Enable function tracing (traces ALL kernel function calls — very high volume)
echo function > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/tracing_on

# Take a short capture
sleep 0.001
echo 0 > /sys/kernel/tracing/tracing_on

# Read the trace
cat /sys/kernel/tracing/trace | head -50
# Output format:
#            TASK-PID   CPU#  irqs  timestamp  FUNCTION
#            bash-1234  [003] ....  12345.678901: do_sys_openat2 <-sys_openat
#            bash-1234  [003] ....  12345.678902: getname <-do_sys_openat2
#            bash-1234  [003] ....  12345.678903: getname_flags <-getname

# Filter to specific functions (MUCH less volume)
echo do_sys_openat2 > /sys/kernel/tracing/set_ftrace_filter
echo tcp_connect >> /sys/kernel/tracing/set_ftrace_filter   # >> to append

# Trace a glob pattern
echo 'tcp_*' > /sys/kernel/tracing/set_ftrace_filter

# Trace functions in a specific module
echo ':mod:mydriver' > /sys/kernel/tracing/set_ftrace_filter

# Exclude specific functions
echo rcu_read_lock > /sys/kernel/tracing/set_ftrace_notrace

Function Graph Tracer

The function_graph tracer shows function entry AND exit, with timing and call depth indentation. This is the most powerful tracer for understanding call chains and finding slow functions.

# Enable function_graph tracer
echo function_graph > /sys/kernel/tracing/current_tracer
echo 1 > /sys/kernel/tracing/options/funcgraph-proc    # show process name
echo 1 > /sys/kernel/tracing/options/funcgraph-cpu     # show CPU
echo 1 > /sys/kernel/tracing/options/funcgraph-overhead # show timing

# Optional: start graph from a specific function
echo do_sys_openat2 > /sys/kernel/tracing/set_graph_function

echo 1 > /sys/kernel/tracing/tracing_on
# ... run the operation you want to trace ...
echo 0 > /sys/kernel/tracing/tracing_on

cat /sys/kernel/tracing/trace | head -80

Sample function_graph output (annotated):

# tracer: function_graph
# CPU DURATION    FUNCTION CALLS
# |  |            |   |   |   |

ANNOTATED function_graph trace:

 0)   bash-1234  |               } /* do_sys_openat2 */    /* closing brace = function returned */
 0)   bash-1234  |               do_sys_openat2() {         /* opening brace = function called */
 0)   bash-1234  |                 getname() {
 0)   bash-1234  |                   kmem_cache_alloc() {
 0)   bash-1234  |                     _cond_resched() {
 0)   bash-1234  |                       rcu_all_qs();       /* semicolon = leaf function (no children traced) */
 0)   bash-1234  |                     } /* 0.198 us */      /* duration in microseconds */
 0)   bash-1234  |                   } /* 1.089 us */
 0)   bash-1234  |                   strncpy_from_user() {
 0)   bash-1234  |                     __check_object_size();
 0)   bash-1234  |                   } /* 0.534 us */
 0)   bash-1234  |                 } /* 2.891 us */           /* getname took 2.891 us total */
 0)   bash-1234  |                 path_openat() {
 0)   bash-1234  |                   alloc_empty_file() {
 0)   bash-1234  | $ 123.456 us   |    kmalloc(); ...         /* $ prefix = exceeded threshold */

Key timing annotations: - No prefix: fast call (< default threshold) - + : > 10 microseconds - ! : > 100 microseconds
- # : > 1 millisecond - @ : > 10 milliseconds - * : > 100 milliseconds - $ : > 1 second

Tracepoints

Tracepoints are static, stable instrumentation hooks compiled into the kernel at specific locations. Unlike kprobes (which hook arbitrary functions), tracepoints have a defined API that doesn't change between kernel versions. They are preferred over kprobes for stability.

# List available tracepoints
cat /sys/kernel/tracing/available_events | grep sched
# Output:
# sched:sched_switch
# sched:sched_wakeup
# sched:sched_wakeup_new
# sched:sched_migrate_task
# ...

# Enable a tracepoint
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable

# Check the format of a tracepoint
cat /sys/kernel/tracing/events/sched/sched_switch/format
# Output:
# name: sched_switch
# ID: 315
# format:
#     field:unsigned short common_type;    offset:0;  size:2;
#     field:unsigned char common_flags;    offset:2;  size:1;
#     field:char prev_comm[16];            offset:8;  size:16;
#     field:pid_t prev_pid;                offset:24; size:4;
#     field:int prev_prio;                 offset:28; size:4;
#     field:long prev_state;               offset:32; size:8;
#     field:char next_comm[16];            offset:40; size:16;
#     field:pid_t next_pid;                offset:56; size:4;
#     field:int next_prio;                 offset:60; size:4;

# Apply a filter to the tracepoint (only record events matching the filter)
echo 'prev_pid == 12345' > /sys/kernel/tracing/events/sched/sched_switch/filter

# Enable syscall tracepoints for a specific call
echo 1 > /sys/kernel/tracing/events/syscalls/sys_enter_read/enable
echo 1 > /sys/kernel/tracing/events/syscalls/sys_exit_read/enable

# Enable ALL events in a subsystem
echo 1 > /sys/kernel/tracing/events/block/enable

# Enable ALL tracepoints (very high volume — use carefully)
echo 1 > /sys/kernel/tracing/events/enable

Latency Tracers

irqsoff tracer: finds the maximum time interrupts are disabled on any CPU. Interrupts being disabled prevents preemption and can cause latency for real-time tasks.

echo irqsoff > /sys/kernel/tracing/current_tracer
echo 0 > /sys/kernel/tracing/tracing_max_latency  # reset max latency
echo 1 > /sys/kernel/tracing/tracing_on
# ... run workload for a while ...
echo 0 > /sys/kernel/tracing/tracing_on

# The trace shows the worst-case IRQ-disabled region
cat /sys/kernel/tracing/trace
# Output:
# # irqsoff latency trace v1.1.5 on 5.15.0
# # latency: 342 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# #    |   task: irq/9-eno1-12345 (uid:0 nice:0 policy:0 rt_prio:0)
# #    => started at: ixgbe_msix_clean_rings
# #    => ended at:   ixgbe_msix_clean_rings
# #
# #                  _------=> CPU#
# #  FUNCTION        / _-----=> irqs-off
# #  |              | / _----=> need-resched
# #  |              || / _---=> hardirq/softirq
# #  |              ||| / _--=> preempt-depth
# #  |              |||| +   timestamp  function
# #  |              |||||
#        <idle>-0   [001] d... 12345.678: ixgbe_msix_clean_rings() {
#        <idle>-0   [001] d... 12345.678:   napi_schedule() {
#        <idle>-0   [001] d... 12345.979:   } /* 342 us */   ← 342us with IRQs disabled!

preemptoff tracer: similar to irqsoff but finds the maximum time preemption is disabled.

wakeup tracer: measures the time from a task being woken up (placed on the run queue) to when it actually runs on a CPU. This is the scheduling latency.

echo wakeup > /sys/kernel/tracing/current_tracer
echo 0 > /sys/kernel/tracing/tracing_max_latency
echo 1 > /sys/kernel/tracing/tracing_on
# Run the task whose wakeup latency you want to measure...
echo 0 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace
# Shows the time between sched_wakeup and the task actually running on a CPU

trace-cmd: CLI Frontend for ftrace

trace-cmd provides a more ergonomic interface than direct tracefs file manipulation:

# Install
apt install trace-cmd  # or yum install trace-cmd

# Record a function_graph trace for 5 seconds
trace-cmd record -p function_graph -g do_sys_openat2 sleep 5

# Record specific events
trace-cmd record -e sched:sched_switch -e syscalls:sys_enter_read \
  -P 12345 -- sleep 5   # -P: only trace PID 12345

# Display recorded trace
trace-cmd report | head -50

# Stream trace output in real time (like trace_pipe)
trace-cmd stream -e sched:sched_switch

# Stat: show which functions take the most time
trace-cmd stat

# Convert trace to JSON for custom analysis
trace-cmd report -O json

KernelShark: Graphical Visualization

KernelShark is a Qt-based GUI for viewing ftrace data captured by trace-cmd. It provides: - Timeline view of CPU activity across all CPUs - Per-process event filtering - Latency visualization - Click-to-zoom navigation

# Install
apt install kernelshark

# Open a trace-cmd recording
kernelshark trace.dat

trace_printk: Kernel Debug Printing

For kernel development, trace_printk() writes to the ftrace ring buffer instead of the kernel log (dmesg). This is much faster than printk() (no console output, no printk() locks) and enables timestamped output in the trace buffer:

// In kernel module source:
#include <linux/kernel.h>

static int my_function(struct some_struct *s, int val)
{
    trace_printk("my_function: s=%p val=%d s->count=%d\n", 
                 s, val, s->count);
    // ... function body ...
}

# After loading the module, enable trace and view trace_printk output:
echo 1 > /sys/kernel/tracing/tracing_on
# ... trigger my_function calls ...
cat /sys/kernel/tracing/trace | grep "my_function"

trace_printk() should only be used during development. It leaves a kernel message at boot ("trace_printk() in production code detected") as a reminder.

TRACE_EVENT Macro

The TRACE_EVENT macro allows kernel developers to define custom static tracepoints that integrate with the tracefs infrastructure (appear in available_events, have a format file, support filters):

// In kernel header (e.g., include/trace/events/mysubsystem.h):
#define TRACE_SYSTEM mysubsystem
#include <linux/tracepoint.h>

TRACE_EVENT(mysubsystem_request,
    TP_PROTO(struct request *rq, int len),
    TP_ARGS(rq, len),
    TP_STRUCT__entry(
        __field(int, len)
        __field(unsigned int, flags)
    ),
    TP_fast_assign(
        __entry->len = len;
        __entry->flags = rq->flags;
    ),
    TP_printk("len=%d flags=0x%x", __entry->len, __entry->flags)
);

Once defined, this tracepoint appears at /sys/kernel/tracing/events/mysubsystem/mysubsystem_request/ and can be enabled, filtered, and read like any other tracepoint.

Ftrace Ring Buffer Architecture

ftrace uses per-CPU ring buffers to minimize contention. Each CPU writes to its own buffer; reading merges them with timestamp ordering. Buffer size is configurable:

# Show current buffer size
cat /sys/kernel/tracing/buffer_size_kb
# Output: 1408 (per-CPU)

# Increase buffer size (helps prevent overruns during long captures)
echo 65536 > /sys/kernel/tracing/buffer_size_kb  # 64MB per CPU

# Check if the buffer overflowed (events were dropped)
cat /sys/kernel/tracing/buffer_percent
cat /sys/kernel/tracing/per_cpu/cpu0/stats | grep overrun

ftrace Instances: Concurrent Tracing

ftrace instances allow multiple independent trace sessions simultaneously:

# Create a new instance
mkdir /sys/kernel/tracing/instances/my_trace

# Configure independently from main trace
echo function > /sys/kernel/tracing/instances/my_trace/current_tracer
echo 1 > /sys/kernel/tracing/instances/my_trace/tracing_on

# Read instance trace
cat /sys/kernel/tracing/instances/my_trace/trace

# Clean up
rmdir /sys/kernel/tracing/instances/my_trace

Historical Context

ftrace was created by Steven Rostedt at Red Hat, initially released as the "latency tracer" for the -rt (PREEMPT_RT) real-time Linux patchset around 2005-2006. The goal was measuring and minimizing scheduling latency on real-time kernels. It was merged into the mainline kernel in 2.6.27 (2008).

The function tracer uses gcc's -pg flag instrumentation (originally for gprof profiling), replacing the mcount() call at each function entry with a NOP when tracing is disabled and a tracing function when enabled. Later kernels switched to __fentry__ (Steven Rostedt's "dynamic ftrace" — enabling per-function tracing with minimal overhead when disabled) and then to Live Patching infrastructure.

Tracepoints were designed by Mathieu Desnoyers (Ericsson) as stable, versioned instrumentation points that maintainers commit to preserving across kernel versions, unlike kprobes which hook arbitrary functions that may change.

Production Examples

# Production-safe latency investigation: find the longest time a specific
# system call takes, without capturing all calls (high volume)

# Step 1: Enable a syscall tracepoint with a latency filter
echo 1 > /sys/kernel/tracing/events/syscalls/sys_enter_read/enable
echo 1 > /sys/kernel/tracing/events/syscalls/sys_exit_read/enable

# Step 2: Run trace-cmd to capture for 30 seconds, filtering to one process
trace-cmd record -e syscalls:sys_enter_read -e syscalls:sys_exit_read \
  -P $(pgrep -f myapp) -- sleep 30

# Step 3: Analyze
trace-cmd report | awk '/sys_exit_read/{if($NF > 0.001) print}' | sort -k1 -n | tail -20

# Find slow block I/O (> 10ms)
echo 1 > /sys/kernel/tracing/events/block/block_rq_complete/enable
echo 'rwbs ~ "R" && nr_sector > 0' > \
  /sys/kernel/tracing/events/block/block_rq_complete/filter

Debugging Notes

Trace buffer fills up immediately: Use filters (set_ftrace_filter) to reduce volume. For function_graph, use set_graph_function to trace only a specific call subtree.

Function not appearing in available_filter_funcs: The function may be inlined (not traceable). Use tracepoints or kprobes instead. Some functions in hotpaths are also excluded from tracing to prevent infinite recursion in the tracer itself.

irqsoff tracer shows very long latencies from a driver: Network drivers often disable interrupts while updating DMA ring buffers. If the maximum latency is consistently in the same driver function, it's a driver quality issue — the driver is holding interrupts disabled for too long, impacting real-time latency.

trace_pipe blocks: cat /sys/kernel/tracing/trace_pipe is a blocking read. If no events are being generated, it blocks indefinitely. Use trace (not trace_pipe) for reading snapshots, or use timeout cat /sys/kernel/tracing/trace_pipe.

Security Implications

tracefs requires root or CAP_SYS_ADMIN. Any process with write access to tracefs can inject trace_printk-like messages or trace arbitrary kernel functions, potentially leaking kernel memory addresses.
Function tracing can reveal timing information about kernel operations. This may be useful for timing side-channel attacks against cryptographic operations in the kernel. Disable tracefs access in high-security environments.
tracefs_inode is not namespaced — all containers on a host share the same tracefs. In Kubernetes, a privileged pod has full tracefs access and can trace all kernel operations including those from other pods.

Performance Implications

Function tracer (all functions): ~15-20% CPU overhead. Only for short-duration captures.
Function tracer (filtered to a few functions): <1% overhead. Production-safe for brief captures.
Tracepoints (enabled): overhead proportional to event rate. A tracepoint firing 100k/sec adds ~1-2% CPU.
Tracepoints (disabled): ~2-4ns overhead per probe site (a conditional branch check). Negligible.
function_graph tracer: records both entry and exit, roughly 2x the overhead of function tracer.
latency tracers (irqsoff/preemptoff): track per-CPU max latency. Low overhead — only captures when a new maximum is set.

Failure Modes and Real Incidents

irqsoff trace revealing NIC driver latency issue: A real-time process was experiencing 500µs+ scheduling jitter in production. irqsoff trace revealed that the Intel ixgbe network driver was holding interrupts disabled for 350µs during RSS (Receive Side Scaling) hash table updates. Fix: move to a newer driver version with the fix for this race condition.

trace buffer overrun during production trace: An engineer ran echo function > /sys/kernel/tracing/current_tracer on a production host without applying a filter, expecting to capture just a second of data. The function tracer generated ~500MB/second of trace data, filling the ring buffer in <100ms. The 20% CPU overhead caused a production incident. Fix: always set set_ftrace_filter before enabling the function tracer; never enable tracing of all functions without filters on production hosts.

Tracepoint filter syntax error causing EINVAL: echo 'pid == 1234' > /sys/kernel/tracing/events/sched/sched_switch/filter failed silently (EINVAL). The correct field name is prev_pid or next_pid (check format file). Filter errors appear in dmesg and the write() syscall returns EINVAL.

Modern Usage

trace-cmd 3.x: improved timestamp handling, JSON output, better memory efficiency for large traces.
libtraceevent: C library for parsing trace data programmatically, used by perf and trace-cmd.
Synthetic events: ftrace supports creating synthetic events that can be triggered based on multiple tracepoints (e.g., "fire an event when sched_wakeup is followed by sched_switch within 1ms for the same task"). Useful for latency distributions.
Histogram triggers: new ftrace feature allowing in-kernel histogram computation on tracepoint data (similar to eBPF maps but without eBPF):

echo 'hist:key=common_pid:val=hitcount' > \
  /sys/kernel/tracing/events/sched/sched_switch/trigger
cat /sys/kernel/tracing/events/sched/sched_switch/hist

Future Directions

User Events (uprobes in tracefs): user-space programs can register events via a /dev/user_events API, injecting tracepoints from userspace applications that appear in the ftrace ring buffer alongside kernel events.
Boot-time tracing: trace_boot module allows enabling tracepoints before init runs, capturing early boot events for diagnosing slow boot issues.
OTel ftrace integration: projects are exploring using ftrace data as a source for OpenTelemetry spans (kernel spans linked to application traces via PID correlation).

Exercises

function_graph for a system call: Enable the function_graph tracer with set_graph_function = do_sys_openat2. Open a file from a shell. Capture the trace and identify: the total time for the open, the time spent in VFS path lookup, and which function accounted for the most time.
irqsoff analysis: Enable the irqsoff tracer. Run iperf3 -s &; iperf3 -c localhost -t 30 to generate network traffic. After 30 seconds, read the trace. What is the maximum IRQ-disabled duration? Which function caused it? Look up the function in the kernel source to understand why it disables IRQs.
Tracepoint filter pipeline: Enable syscalls:sys_enter_openat and syscalls:sys_exit_openat. Apply a filter to only capture opens by your shell PID. Generate 100 file opens (for i in $(seq 100); do cat /dev/null; done). Compute the average open() latency by correlating enter and exit events by TID.
Synthetic event for latency distribution: Using ftrace histogram triggers, create a histogram of the latency between sched:sched_wakeup and sched:sched_switch for a specific PID. Run a sleep-intensive workload and read the histogram. What is the median and maximum scheduling latency?
trace-cmd + KernelShark: Record a 10-second trace with trace-cmd record capturing: sched:sched_switch, sched:sched_wakeup, block:block_rq_issue, block:block_rq_complete. Open the trace in KernelShark. Identify any CPUs that are idle while I/O is in flight. Measure block I/O latency graphically.

References

Rostedt, Steven. "Debugging the Linux Kernel Using ftrace." LWN.net, 2010. https://lwn.net/Articles/365835/
Linux Kernel Documentation: Documentation/trace/ftrace.rst
Linux Kernel Documentation: Documentation/trace/tracepoints.rst
trace-cmd Documentation: https://www.trace-cmd.org/
Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019. Chapter 14 (Ftrace).
Gregg, Brendan. "ftrace: Tracing the Linux Kernel." http://brendangregg.com/blog/2014-09-11/linux-ftrace-tcp-retransmit-tracing.html
Rostedt, Steven. "Using the TRACE_EVENT() macro." https://lwn.net/Articles/379903/
KernelShark Documentation: https://kernelshark.org/Documentation.html