Linux CFS: The Completely Fair Scheduler

Technical Overview

The Completely Fair Scheduler (CFS) replaced the O(1) scheduler in Linux 2.6.23 (October 2007), authored by Ingo Molnar. Its central insight is radical in its simplicity: rather than maintaining complex priority queues and heuristics, model an ideal multi-tasking CPU — one where every task runs simultaneously and each gets an exactly equal fraction of CPU time — and then approximate that ideal on real hardware.

CFS tracks how much CPU time each task has received (its "virtual runtime"), always runs the task that has received the least, and uses a red-black tree to find that task in O(log n). No priority bitmaps, no active/expired arrays, no heuristics about interactivity. Nice values affect how fast virtual time accumulates, not queue position. The result is a scheduler that is simultaneously simple to understand, provably fair, and practically competitive with far more complex predecessors.

Fifteen years later, CFS remains the default scheduler for SCHED_NORMAL and SCHED_BATCH tasks on Linux, running inside every Android phone, AWS EC2 instance, and supercomputer node. It is also the most studied and benchmarked scheduler in existence.

Prerequisites

01-scheduling-fundamentals.md (scheduling goals, preemption, Linux policy hierarchy)
02-classic-scheduling-algorithms.md (MLFQ, RR concepts)
Basic understanding of red-black trees (self-balancing BST, O(log n) operations)
Linux task_struct basics (process descriptor structure)

Design Goal: The Ideal Fair CPU

Consider an ideal processor that can run n tasks simultaneously, giving each exactly 1/n of its capacity. Each task advances at the same rate. No starvation, perfect fairness. This is physically impossible with a single core, but it is the target CFS approximates.

CFS measures each task's "virtual runtime" (vruntime): the amount of CPU time the task would have consumed on an ideal, perfectly fair CPU. CFS always runs the task with the minimum vruntime — the task that is most "behind" in the fair model. After running, the task's vruntime advances, and if it becomes the maximum, another task is selected.

The elegance: there is no separate notion of priority queues, timeslices, or fairness enforcement. It all falls out of maintaining and minimizing vruntime.

Virtual Runtime: The Core Abstraction

Virtual runtime grows as a function of physical runtime and task weight. A task with higher weight (lower nice value) runs more on the ideal CPU — its vruntime grows more slowly relative to physical time.

vruntime_delta = physical_runtime_delta × (NICE_0_WEIGHT / task_weight)

Where NICE_0_WEIGHT = 1024 (the weight of a nice=0 task)

A task with weight 2048 (nice=-5) running for 1ms accumulates only 0.5ms of vruntime. A task with weight 512 (nice=+5) running for 1ms accumulates 2ms of vruntime. So the high-weight task runs more physical time per unit of virtual time — it gets more CPU.

Nice-to-Weight Mapping

Linux uses a precomputed weight table with a geometric ratio of approximately 1.25 between adjacent nice levels:

/* kernel/sched/core.c — sched_prio_to_weight[] */
static const int sched_prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

Key values: nice 0 = 1024, nice -1 = 1277, nice +1 = 820, nice -20 = 88761.

The ratio between adjacent levels is 1024/820 ≈ 1.25. This means: changing nice by 1 gives approximately 10% more/less CPU than an equal-nice neighbor. Changing nice by 10 gives approximately 10x more/less CPU.

The inverse table sched_prio_to_wmult[] precomputes 2^32 / weight to allow fast fixed-point multiplication instead of division in the hot path.

CFS Runqueue: The Red-Black Tree

CFS maintains all runnable tasks in a red-black tree sorted by vruntime. The leftmost node (minimum vruntime) is the next task to run.

CFS Runqueue (rb-tree, sorted by vruntime):

                    [Task D: vrt=150]
                   /                 \
          [Task B: vrt=120]     [Task F: vrt=200]
          /           \
  [Task A: vrt=100] [Task C: vrt=130]

  ← leftmost (min vruntime) = Task A → runs next

After Task A runs 10ms (weight=1024, so vruntime += 10ms):
  Task A's vruntime = 110ms → reinserted somewhere in the middle
  New leftmost = Task B (vrt=120) → runs next

Why a red-black tree instead of a simple sorted list?

Insert, delete, find-min: O(log n) for red-black tree vs O(n) for sorted list
CFS caches the leftmost node pointer (rb_leftmost), making pick_next_task O(1) for the common case
Insertions happen on wakeup and context switch — relatively infrequent compared to the scheduling decision itself
Red-black trees are well-understood, have bounded rebalancing cost, and are already used elsewhere in Linux

The cache of the leftmost node means the hot path — "what runs next?" — is just dereferencing a pointer.

Key Data Structures

struct sched_entity {              /* embedded in task_struct as 'se' */
    struct load_weight  load;      /* weight (from nice value) */
    struct rb_node      run_node;  /* node in CFS rb-tree */
    struct list_head    group_node;/* in task group list */
    unsigned int        on_rq;     /* is this entity in the runqueue? */

    u64   exec_start;              /* last time entity started running */
    u64   sum_exec_runtime;        /* total physical CPU time consumed */
    u64   vruntime;                /* virtual runtime: THE key value */
    u64   prev_sum_exec_runtime;   /* for preemption check */

    u64   nr_migrations;           /* times migrated between CPUs */
    struct sched_statistics statistics; /* detailed stats */

    struct sched_entity *parent;   /* for group scheduling */
    struct cfs_rq       *cfs_rq;   /* CFS runqueue this entity is on */
    struct cfs_rq       *my_q;     /* if group: the group's CFS runqueue */
};

struct cfs_rq {
    struct load_weight  load;      /* total weight of all tasks */
    unsigned int        nr_running; /* count of runnable tasks */

    u64   min_vruntime;            /* minimum vruntime in tree (monotone) */
    u64   exec_clock;              /* clock for this cfs_rq */

    struct rb_root_cached tasks_timeline; /* rb-tree + leftmost cache */
    struct sched_entity *curr;     /* currently running entity */
    struct sched_entity *next;     /* entity to run next (for wakeup) */
    struct sched_entity *last;     /* last entity that ran */
    struct sched_entity *skip;     /* entity to skip (for buddy system) */
};

Scheduling Code Path

The main scheduling path, triggered by voluntary sleep, preemption, or wakeup:

__schedule(bool preempt)
  │
  ├─ deactivate_task(rq, prev)   if prev is no longer runnable
  │    └─ dequeue_task(rq, prev, flags)
  │         └─ dequeue_entity(cfs_rq, prev_se)
  │              └─ __dequeue_entity(cfs_rq, se)  ← rb_erase
  │
  ├─ pick_next_task(rq, prev, rf)
  │    └─ walks sched_class hierarchy: stop → dl → rt → fair → idle
  │         └─ pick_next_task_fair(rq, prev, rf)
  │              ├─ simple_pick: leftmost rb node if no groups
  │              └─ __pick_next_entity(cfs_rq)
  │                   └─ rb_first_cached(&cfs_rq->tasks_timeline)
  │                        ← O(1): returns cached leftmost node
  │
  └─ context_switch(rq, prev, next)
       ├─ switch_mm()            ← switch page tables (TLB flush if needed)
       └─ switch_to(prev, next)  ← save/restore registers, switch stack

The key function __pick_next_entity() is:

static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
    if (!left)
        return NULL;
    return rb_entry(left, struct sched_entity, run_node);
}

This is the heart of CFS: a single pointer dereference.

Preemption Mechanism

CFS checks for preemption at two points:

Timer-driven preemption (scheduler_tick() → task_tick_fair() → entity_tick() → check_preempt_tick()):

Every HZ timer tick (1ms at HZ=1000), scheduler_tick() is called. For CFS tasks: 1. Update current task's vruntime 2. Check if current task has run longer than its ideal timeslice 3. If yes: set TIF_NEED_RESCHED flag on the task 4. On return from interrupt, preempt_schedule_irq() calls __schedule()

The "ideal timeslice" is computed as:

sched_slice = sched_latency_ns × (task_weight / total_cfs_rq_weight)

With defaults:
  sched_latency_ns = 6,000,000 (6ms)

If only one task:
  sched_slice = 6ms × (1024/1024) = 6ms

If four equal-weight tasks:
  sched_slice = 6ms × (1024/4096) = 1.5ms each
  (but bounded by sched_min_granularity_ns = 0.75ms)

The check_preempt_tick() logic:

static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime = sched_slice(cfs_rq, curr);
    u64 delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    if (delta_exec > ideal_runtime) {
        resched_curr(rq_of(cfs_rq));  /* set TIF_NEED_RESCHED */
        return;
    }

    /* Also preempt if the leftmost task's vruntime is significantly
     * less than ours (handles the case of a newly woken task) */
    if (delta_exec < sched_min_granularity_ns)
        return;

    se = __pick_first_entity(cfs_rq);
    delta = curr->vruntime - se->vruntime;
    if (delta > ideal_runtime)
        resched_curr(rq_of(cfs_rq));
}

Wakeup preemption (try_to_wake_up() → check_preempt_wakeup()):

When a sleeping task wakes up, CFS checks whether the woken task should preempt the current task. A woken task that has been sleeping will have a low vruntime (it accumulated no vruntime while sleeping) — it will likely be the leftmost node. CFS uses a "wakeup granularity" (sched_wakeup_granularity_ns, default 1ms) to avoid excessive preemption: the woken task only preempts if its vruntime is more than sched_wakeup_granularity_ns behind the current task's vruntime.

Sleeper Fairness: The Gentle Fair Sleepers Feature

When a task sleeps for a long time (e.g., 10 seconds), its vruntime stays at its last value while min_vruntime advances. When it wakes, its vruntime could be 10 seconds behind min_vruntime, meaning it would preempt everything and run exclusively until it "catches up" — punishing all other tasks.

CFS solves this with "sleeper fairness": on wakeup, the task's vruntime is set to:

vruntime = max(vruntime, min_vruntime - sched_latency_ns/2)

This caps how far behind a sleeper can be, preventing it from hoarding CPU after a long sleep. Controlled by the GENTLE_FAIR_SLEEPERS scheduler feature (enabled by default).

Tunables

CFS exposes several tunables under /proc/sys/kernel/:

sched_latency_ns         = 6000000    (6ms)
  Target scheduling latency: ideally every runnable task runs once per
  this period. Actual timeslice = latency × (task_weight / total_weight).

sched_min_granularity_ns = 750000     (0.75ms)
  Minimum timeslice to prevent excessive context switches with many tasks.
  When nr_running > latency/min_granularity, effective latency increases.

sched_wakeup_granularity_ns = 1000000 (1ms)
  Minimum vruntime advantage for a woken task to preempt the current task.
  Higher values reduce wakeup preemption (better throughput, higher latency).
  Lower values increase wakeup preemption (better response, more switches).

sched_migration_cost_ns  = 500000     (0.5ms)
  Estimated cost of migrating a task between CPUs (cache warming cost).
  Used to decide whether to migrate or let imbalance persist briefly.

sched_nr_migrate         = 32
  Maximum tasks moved in one load balancing pass.

CFS Runqueue Architecture with Groups

When cgroup CPU group scheduling is enabled, CFS uses hierarchical scheduling entities. Each task group has a sched_entity that represents the entire group in its parent's runqueue:

System-level CFS runqueue (rb-tree of group entities):
  ├─ group_A_entity (vruntime=100) → group A's internal CFS runqueue
  │    ├─ task_1 (vruntime=90)
  │    └─ task_2 (vruntime=110)
  └─ group_B_entity (vruntime=120) → group B's internal CFS runqueue
       ├─ task_3 (vruntime=100)
       └─ task_4 (vruntime=140)

Scheduling: pick leftmost at top level (group_A_entity),
            then pick leftmost within group A (task_1).
            task_1 runs; both task_1's vruntime and group_A_entity's
            vruntime advance proportionally.

This two-level selection provides group-level fairness (groups compete for CPU proportional to their weights) and within-group fairness (tasks within a group compete fairly among themselves).

The vruntime Monotone Invariant

An important implementation detail: min_vruntime on a CFS runqueue is monotonically non-decreasing. New tasks joining the runqueue (from fork() or wakeup) have their vruntime set to at least min_vruntime so they don't immediately dominate the tree. When a task migrates between CPUs, its vruntime is adjusted to account for the difference in min_vruntime values between the source and destination runqueues (using offset encoding in the rb-tree).

Historical Context

Linux had three CPU schedulers before CFS:

Linux 2.4 scheduler (Linus Torvalds): Simple priority array. O(n) per scheduling decision — looped through all runnable tasks to find the highest priority. On a server with 1000 processes, noticeably slow. Also had epoch-based recalculation: at end of each epoch, recalculate all priorities, O(n).

O(1) scheduler (Ingo Molnar, Linux 2.6.0, 2003): Two priority arrays per CPU (active, expired), each a bitmask of 140 priority levels. pick_next_task was O(1) via bitmask find-first-set. Expired array swapped to active when active empties — O(1) amortized. Complex interactivity heuristics (sleep_avg, interactive credit) attempted to distinguish interactive from CPU-bound tasks. The heuristics were fragile: gaming workloads and audio applications could confuse them, causing latency spikes. Ultimately untunable: every fix broke something else.

CFS (Ingo Molnar, 2007): Replaced O(1)'s complex heuristics with the vruntime accounting. The key insight from Con Kolivas's "SD" (Staircase Deadline) scheduler (an influential patchset that Molnar studied): you don't need heuristics if you track fairness directly. CFS eliminated ~800 lines of interactivity code and replaced it with ~200 lines of vruntime math.

Production Examples

Google's Borg scheduler interaction with CFS: Borg assigns CPU requests and limits to containers. The CPU request maps to cpu.shares (CFS weight), determining relative scheduling priority when the machine is contended. Google's production experience found that setting tasks to identical nice values but using cgroup weights gives better isolation than process nice values.

Desktop responsiveness with CFS: The motivation for CFS was partly Con Kolivas's work on desktop scheduling. A common complaint with O(1) was audio dropouts: audio threads were misclassified as non-interactive and starved. CFS's vruntime naturally gives recently-woken audio threads priority (their vruntime is low after blocking on audio hardware). GNOME Shell, PulseAudio, and other desktop services benefit from this.

Database workloads: PostgreSQL backends are effectively I/O-bound during query execution (waiting for disk, locks, or network). CFS treats them as I/O-bound and gives fast wakeup response. CPU-intensive queries (large sorts, hash joins) temporarily become CPU-bound; CFS demotes them as their vruntime accumulates. This self-adjusting behavior is well-suited to mixed OLTP/OLAP workloads.

Debugging Notes

# View per-task CFS statistics
cat /proc/[pid]/sched
# Key fields: vruntime, nr_voluntary_switches, nr_involuntary_switches,
#             se.load.weight, policy, prio

# View CFS runqueue state per CPU
cat /sys/kernel/debug/sched/debug  # requires CONFIG_SCHED_DEBUG

# Tune CFS latency (reduce for more responsiveness, careful: more context switches)
sysctl -w kernel.sched_latency_ns=3000000   # 3ms
sysctl -w kernel.sched_min_granularity_ns=300000  # 0.3ms

# Check scheduler features
cat /sys/kernel/debug/sched/features
# Toggle a feature (e.g., disable wakeup preemption):
echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

# Measure scheduling latency with ftrace
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
cat /sys/kernel/debug/tracing/trace_pipe
# Look for sched_wakeup → sched_switch delay (wakeup latency)

# perf sched for scheduling analysis
perf sched record -- sleep 5
perf sched latency --sort max  # show per-task latency statistics

# Verify CFS is computing correct weights
cat /proc/[pid]/status | grep -E "voluntary|non-voluntary"
# High non-voluntary = being preempted = CPU-bound or contended

Security Implications

vruntime manipulation: A process cannot directly manipulate its vruntime (it is a kernel value). However, a process can influence its scheduling weight via nice() or setpriority() (within its limits). A process that continuously forks and exits can create many short-lived tasks with low vruntime, but the overhead of fork/exec typically outweighs any scheduling advantage.

Side channels via CFS scheduling: CFS's scheduling decisions are deterministic given the runqueue state. An attacker who can observe when their process is scheduled (via timing) can infer the vruntime of co-located processes, leaking information about their execution patterns. This is an information leak but not typically exploitable for data theft.

CAP_SYS_NICE: Required to raise a task's priority (lower nice value). Without it, a process can only lower its own priority (increase nice value). The RLIMIT_NICE resource limit can restrict even this.

Performance Implications

CFS overhead is O(log n) for enqueue/dequeue where n is the number of runnable tasks per CPU. On a typical production system with 10-100 tasks per CPU, this is negligible (log2(100) = 7 comparisons).
The leftmost cache makes pick_next_task O(1) in the common case.
update_curr() (updating vruntime on every scheduling event) adds ~100ns overhead per tick — visible in perf top as update_curr under scheduler load.
CFS scales to thousands of tasks per CPU. The O(1) scheduler's constant-time operations had large constants; CFS's O(log n) with small constants often wins in practice.
Group scheduling adds a level of rb-tree lookup per scheduling level, but this is still O(log n × depth).

Failure Modes

vruntime overflow: vruntime is a u64 nanosecond counter. At 100% CPU utilization, a task's vruntime grows at ~1ns/ns. Overflow at 2^64 ns = ~585 years. Not a practical concern.
min_vruntime staleness: If all tasks sleep simultaneously, min_vruntime stops advancing. When they wake, they all have approximately equal vruntime and compete fairly. This is correct behavior.
Thundering herd: Hundreds of tasks sleeping on the same condition variable all wake simultaneously. CFS inserts all into the rb-tree, context-switches rapidly through them. Each task probably blocks immediately, causing excessive context switches. Application-level mitigation: use pthread_cond_signal instead of pthread_cond_broadcast where possible.
Group scheduling granularity: With deep cgroup hierarchies, a task deep in the hierarchy may effectively get scheduled in very short bursts due to multiple levels of time-slicing. Monitor with container_cpu_throttled_seconds_total.

Modern Usage and Future Directions

CFS has been extended substantially since 2007:

CFS Bandwidth Control (2012): CPU quota enforcement via cgroups. See 06-cfs-group-scheduling.md.
Energy-Aware Scheduling (2019): EAS hooks into CFS to pick energy-efficient CPUs on asymmetric platforms.
UTIL_EST (utilization estimation, 2018): Better tracking of task utilization for frequency scaling decisions.
Latency Nice (merged 2022): SCHED_ATTR_SCHED_UTIL_MIN/MAX hints allowing applications to express latency sensitivity without RT privileges.

EEVDF (Earliest Eligible Virtual Deadline First): Merged in Linux 6.6 (2023) as a replacement for the core CFS scheduling algorithm. EEVDF, proposed by Stoica and Abdel-Wahab (1995), adds per-task virtual deadlines and eligible times. It better handles heterogeneous workloads and reduces the need for wakeup heuristics like NEXT_BUDDY. The rb-tree remains; EEVDF modifies which node is selected (not just leftmost, but "leftmost eligible" based on virtual deadline). CFS as a term lives on (the code is still fair_sched_class) but the algorithm has evolved.

sched_ext: Allows replacing the fair_sched_class implementation with a BPF program. Opens the door to custom algorithms (weighted RR, work-conserving FIFO, NUMA-aware placement) without kernel patching.

Exercises

Verify vruntime math: Start two processes: nice -n 0 cpu_hog & and nice -n 5 cpu_hog &. After 10 seconds, read /proc/[pid]/sched for both. Verify that the nice=0 task has approximately 1024/335 ≈ 3.05x the sum_exec_runtime of the nice=5 task.
rb-tree visualization: With CONFIG_SCHED_DEBUG, read /sys/kernel/debug/sched/debug and find a loaded CPU's CFS runqueue. Identify min_vruntime, the current task's vruntime, and the spread of vruntimes in the tree.
Wakeup preemption impact: Run a latency-sensitive task (e.g., cyclictest) alongside a CPU-bound task. Compare latency with sched_wakeup_granularity_ns set to 0 (maximum preemptiveness) vs 10ms (minimal). Record the difference in P99 wakeup latency.
Group scheduling hierarchy: Create two cgroups with cpu.shares=1024 and cpu.shares=256. Place CPU-bound tasks in each. Verify that the first group gets ~80% and the second ~20% of CPU over a 10-second window using cpuacct.usage.
CFS overhead profiling: Use perf stat -e sched:sched_switch to count context switches under different loads. At what number of runnable tasks does CFS overhead become visible in perf top?

References

Ingo Molnar, "CFS Scheduler Design Document", Linux Kernel Documentation (Documentation/scheduler/sched-design-CFS.rst)
Ingo Molnar, "[ANNOUNCE] CFS scheduler", LKML post, April 13, 2007
Peter Zijlstra, "Group Scheduling", LKML patchset, 2007
Con Kolivas, "Staircase Deadline Scheduler" patchset, 2007 (inspiration for CFS)
Stoica, I. and Abdel-Wahab, H., "Earliest Eligible Virtual Deadline First: A Flexible and Accurate Mechanism for Proportional Share Resource Allocation", 1995 (EEVDF basis)
Lozi et al., "The Linux Scheduler: a Decade of Wasted Cores", EuroSys 2016 — finds load balancing bugs in CFS
Bouron et al., "The Battle of the Schedulers: FreeBSD ULE vs. Linux CFS", USENIX ATC 2018
Linux source: kernel/sched/fair.c, kernel/sched/core.c, include/linux/sched.h
Turner, J., "Inside the Linux CFS Scheduler", IBM developerWorks, 2009