Linux Scheduler Internals

Technical Overview

The Linux scheduler is not a single function but a complex event-driven system embedded throughout the kernel. Understanding its internals means understanding not just pick_next_task() but all the paths that lead to and leave from a scheduling decision: how the kernel detects that rescheduling is needed, how it safely performs a context switch, how it maintains per-CPU runqueue state, and how it moves tasks between CPUs when load imbalances occur.

This file traces the scheduler from entry points through core data structures to load balancing. It is the "how it actually works" complement to the "what it does" coverage in the CFS and RT files. The goal is to enable you to read the kernel source (kernel/sched/core.c, kernel/sched/fair.c) and understand every major function's role.

Prerequisites

01-scheduling-fundamentals.md (policy hierarchy, preemption models)
03-linux-cfs.md (CFS vruntime, sched_entity, rb-tree)
04-linux-realtime-scheduling.md (RT and deadline classes)
Familiarity with Linux kernel data structures (lists, spinlocks, per-CPU variables)
Understanding of interrupt context vs process context

struct rq: The Per-CPU Runqueue

The runqueue (struct rq) is the central data structure for scheduling on each CPU. There is exactly one rq per logical CPU, allocated in a per-CPU array:

DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

struct rq {
    /* Core fields: */
    raw_spinlock_t      __lock;          /* runqueue lock */
    unsigned int        nr_running;      /* total runnable tasks */
    unsigned long       nr_uninterruptible; /* sleeping tasks */

    /* Current task: */
    struct task_struct *curr;            /* currently running task */
    struct task_struct *idle;            /* this CPU's idle thread */
    struct task_struct *stop;            /* migration/stop task */

    /* Clock: */
    u64                 clock;           /* rq clock (nanoseconds) */
    u64                 clock_task;      /* task-specific clock */

    /* Sub-runqueues for each scheduling class: */
    struct cfs_rq       cfs;             /* CFS tasks */
    struct rt_rq        rt;              /* RT tasks (SCHED_FIFO/RR) */
    struct dl_rq        dl;              /* Deadline tasks */

    /* Load tracking: */
    struct load_weight  load;            /* sum of weights of runnable tasks */
    unsigned long       nr_load_updates; /* times load was updated */
    u64                 nr_switches;     /* total context switches */

    /* CPU topology: */
    int                 cpu;             /* which CPU this rq belongs to */
    int                 online;          /* CPU is online */

    /* Load balancing: */
    struct list_head    cfs_tasks;       /* for load balance */
    unsigned long       misfit_task_load; /* task too heavy for this CPU (EAS) */

    /* Idle balancing: */
    atomic_t            nohz_flags;      /* NO_HZ idle flags */

    /* Pinned tasks (migration stopped): */
    int                 nr_pinned;

    /* ... many more fields for stats, RT throttling, NUMA, etc. */
};

Access pattern: cpu_rq(cpu) returns &per_cpu(runqueues, cpu). The current CPU's rq is this_rq() = &__get_cpu_var(runqueues).

The runqueue lock: rq->__lock (a raw spinlock) protects the runqueue. Any operation that modifies the runqueue (enqueue, dequeue, task migration) must hold this lock. Lock ordering rules: always lock the lower-numbered CPU's rq first to avoid deadlocks during cross-CPU operations.

__schedule(): The Core Function

__schedule() is the fundamental scheduling function. It: 1. Picks the next task to run 2. Performs the context switch 3. Returns into the new task's context

static void __sched notrace __schedule(unsigned int sched_mode)
{
    struct task_struct *prev, *next;
    struct rq_flags rf;
    struct rq *rq;
    int cpu;

    cpu  = smp_processor_id();
    rq   = cpu_rq(cpu);
    prev = rq->curr;

    /* Deactivate if no longer runnable (sleeping, exiting): */
    if (!(sched_mode & SM_MASK_PREEMPT) && prev->state) {
        if (signal_pending_state(prev->state, prev)) {
            prev->state = TASK_RUNNING;  /* signal woke us */
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
        }
    }

    next = pick_next_task(rq, prev, &rf);  /* THE DECISION */

    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();

    if (likely(prev != next)) {
        ++rq->nr_switches;
        rq->curr = next;
        context_switch(rq, prev, next, &rf);  /* THE SWITCH */
    } else {
        rq_unpin_lock(rq, &rf);
        __balance_callbacks(rq);
        raw_spin_rq_unlock_irq(rq);
    }
}

Notice: context_switch() does NOT return in the normal sense — it switches to a different task's stack. When prev eventually runs again (after another scheduling event), it returns from context_switch() into the next line of __schedule(), then returns up its call stack.

Entry Points to __schedule()

The scheduler is invoked from multiple contexts:

Entry Points:

1. Voluntary sleep/block:
   task calls schedule() or wait_event()
   → schedule()
     → __schedule(SM_NONE)

2. Timer-driven preemption:
   Timer interrupt fires → scheduler_tick() sets TIF_NEED_RESCHED
   Return from interrupt → preempt_schedule_irq()
   → __schedule(SM_PREEMPT)

3. Preemption at preempt_enable():
   lock released → preempt_enable() → preempt_count drops to 0
   if TIF_NEED_RESCHED set → preempt_schedule()
   → __schedule(SM_PREEMPT)

4. Return to user space:
   syscall/interrupt return → check_preempt_pending()
   if TIF_NEED_RESCHED → schedule()

5. Wakeup preemption:
   try_to_wake_up() → ttwu_do_wakeup()
   → check_preempt_curr() sets TIF_NEED_RESCHED on current task
   → preemption occurs at next preempt_enable() or interrupt return

TIF_NEED_RESCHED flag:
  Set by: resched_curr() in CFS tick check, wakeup preemption, RT enqueue
  Cleared by: __schedule() after picking next task
  Checked by: preempt_enable(), interrupt/syscall return path

The TIF_NEED_RESCHED flag (in task_struct's thread_info.flags) is the signal that __schedule() should be called. It is set by various parts of the kernel when they determine a reschedule is needed, and checked at "preemption points" (interrupt returns, preempt_enable).

pick_next_task(): The Scheduling Decision

pick_next_task() implements the sched_class hierarchy:

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    /* Fast path: if all tasks are CFS tasks (most common case): */
    if (likely(!sched_class_above(&fair_sched_class, prev->sched_class) &&
               rq->nr_running == rq->cfs.h_nr_running)) {
        p = pick_next_task_fair(rq, prev, rf);
        if (likely(p))
            return p;
        /* else: no CFS tasks, fall through to full class walk */
    }

    /* Full class walk from highest priority: */
    for_each_class(class) {
        p = class->pick_next_task(rq, prev);
        if (p)
            return p;
    }
    /* Always returns idle task if nothing else runnable */
    BUG();  /* should never reach here */
}

The fast path assumes all tasks are CFS (most systems). The slow path walks the class chain:

sched_class chain (highest to lowest priority):
  stop_sched_class  → dl_sched_class → rt_sched_class → fair_sched_class → idle_sched_class

stop_sched_class:  Per-CPU stop task (for migration, CPU hotplug). Priority above deadline.
dl_sched_class:    SCHED_DEADLINE tasks.
rt_sched_class:    SCHED_FIFO and SCHED_RR tasks.
fair_sched_class:  SCHED_NORMAL, SCHED_BATCH (CFS).
idle_sched_class:  The per-CPU idle thread. Always runnable, lowest possible priority.

sched_class: The Virtual Dispatch Table

Each scheduling class is defined by a sched_class structure — a vtable of function pointers:

struct sched_class {
    /* Task enqueue/dequeue: */
    void (*enqueue_task)    (struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)    (struct rq *rq, struct task_struct *p, int flags);

    /* Voluntary yield: */
    void (*yield_task)      (struct rq *rq);
    bool (*yield_to_task)   (struct rq *rq, struct task_struct *p);

    /* Preemption check: */
    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

    /* Task selection: */
    struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
    void (*put_prev_task)   (struct rq *rq, struct task_struct *p);
    void (*set_next_task)   (struct rq *rq, struct task_struct *p, bool first);

    /* Load balancing: */
    int  (*balance)         (struct rq *rq, struct task_struct *p, struct rq_flags *rf);
    int  (*select_task_rq)  (struct task_struct *p, int cpu, int flags);
    struct task_struct *(*pick_task)(struct rq *rq);
    void (*migrate_task_rq) (struct task_struct *p, int new_cpu);
    void (*task_woken)      (struct rq *this_rq, struct task_struct *task);
    void (*set_cpus_allowed)(struct task_struct *p, const struct cpumask *newmask, u32 flags);
    void (*rq_online)       (struct rq *rq);
    void (*rq_offline)      (struct rq *rq);

    /* Tick/timer callbacks: */
    void (*task_tick)       (struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)       (struct task_struct *p);
    void (*task_dead)       (struct task_struct *p);

    /* Accounting and priority: */
    void (*switched_from)   (struct rq *rq, struct task_struct *p);
    void (*switched_to)     (struct rq *rq, struct task_struct *p);
    void (*prio_changed)    (struct rq *rq, struct task_struct *p, int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq, struct task_struct *p);
    void (*update_curr)     (struct rq *rq);

    /* ... */
};

This is clean object-oriented design in C. Adding a new scheduling class (as sched_ext allows via BPF) requires implementing this interface. The dispatch — class->pick_next_task(rq, prev) — is a standard indirect function call, and with CONFIG_RETPOLINE, adds ~10-20ns overhead vs direct call.

Dispatch chain for fair_sched_class example:

pick_next_task_fair()
  └─ pick_next_entity() for each level of group hierarchy
       └─ __pick_next_entity(cfs_rq)
            └─ rb_first_cached(&cfs_rq->tasks_timeline)  ← O(1)
                 └─ container_of(rb_node, struct sched_entity, run_node)
                      └─ task_of(se)  ← back to task_struct

context_switch(): The Actual CPU Switch

After pick_next_task() selects the next task, context_switch() performs the switch:

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
    /*
     * Switch mm (memory mappings / page tables):
     * If next is a kernel thread, it borrows prev's mm.
     * If next is a user process with a different mm: switch_mm()
     */
    if (!next->mm) {                                /* next is kernel thread */
        enter_lazy_tlb(prev->active_mm, next);
        next->active_mm = prev->active_mm;
    } else {                                        /* next is user process */
        membarrier_switch_mm(rq, prev->active_mm, next->mm);
        switch_mm_irqs_off(prev->active_mm, next->mm, next); /* CR3 load, TLB */
    }

    /* Prepare to switch stacks/registers: */
    prepare_task_switch(rq, prev, next);

    /* Arch-specific register/stack switch: */
    switch_to(prev, next, prev);    /* ← execution continues in 'next' after this */
    barrier();

    /* When 'prev' runs again, it returns here: */
    return finish_task_switch(prev);
}

switch_to(prev, next, prev) is architecture-specific assembly. On x86-64: 1. Saves callee-saved registers of prev onto prev's kernel stack 2. Saves prev's stack pointer into prev->thread.sp 3. Loads next->thread.sp as the stack pointer 4. Jumps to next->thread.ip (which is the return address from switch_to that was saved when next last called schedule())

After switch_to, the CPU is executing on next's kernel stack. The prev variable now points to the previous task (from next's perspective, prev is the task that just ran before it). This naming is intentionally confusing — "prev" in finish_task_switch(prev) means "the task that was running before me" from the new current task's perspective.

scheduler_tick() and timer-driven preemption

scheduler_tick() is called from update_process_times(), which is called from the timer interrupt handler on every tick:

hardware timer interrupt (1000Hz on modern kernels)
  → local_apic_timer_interrupt()
    → hrtimer_interrupt() [or tick_handle_periodic()]
      → update_process_times()
        → scheduler_tick()
          ├─ update_rq_clock(rq)                     /* advance rq->clock */
          ├─ curr->sched_class->task_tick(rq, curr)  /* class-specific tick */
          │    └─ (for fair_sched_class:)
          │         task_tick_fair()
          │           → entity_tick()
          │               → update_curr()            /* update vruntime */
          │               → check_preempt_tick()     /* set TIF_NEED_RESCHED? */
          ├─ trigger_load_balance(rq)                 /* kick load balancer */
          └─ calc_global_load()                       /* update /proc/loadavg */

The tick's effect on scheduling: 1. Vruntime advances (task has used more CPU) 2. If vruntime advanced beyond the ideal timeslice: resched_curr() sets TIF_NEED_RESCHED 3. On return from interrupt, preempt_schedule_irq() calls __schedule(SM_PREEMPT)

Load Balancer Architecture

Load balancing is the process of moving tasks between CPUs to maintain utilization balance. It runs periodically and also in response to specific events (idle CPU, fork, exec):

Load Balancer Trigger Path:

scheduler_tick()
  └─ trigger_load_balance(rq)
       └─ if (time_after_eq(jiffies, rq->next_balance)):
            raise_softirq(SCHED_SOFTIRQ)      ← deferred to softirq

SCHED_SOFTIRQ handler:
  run_rebalance_domains()
    └─ rebalance_domains(this_rq, idle)
         └─ for each sched_domain (innermost to outermost):
              if (sd_overloaded(sd) && time_to_rebalance(sd)):
                load_balance(this_rq, idle, sd, ...)

load_balance() anatomy:

load_balance(this_rq, this_cpu, sd, idle, continue_balancing):
  1. find_busiest_group(sd, ...)
       ← scan all sched_groups in domain
       ← compute load per group (sum of task weights, normalized)
       ← return group with most load, if imbalance > threshold

  2. find_busiest_queue(sd, group, ...)
       ← within busiest group, find CPU with most runnable tasks
       ← return that CPU's rq

  3. detach_tasks(busiest_rq → local_rq)
       ← migrate tasks from busiest to local
       ← constraints: not cache hot, not pinned, respects affinity
       ← returns number of tasks detached

  4. attach_tasks(local_rq, list_of_detached_tasks)
       ← enqueue migrated tasks on local rq

Returns: number of tasks migrated (0 if balanced or migration failed)

Load metric: The load of a runqueue is not just nr_running. CFS uses weighted load: sum(task_weight × on_rq) per CPU, accounting for task weights (nice values). The PELT (Per-Entity Load Tracking) system uses exponentially decaying averages of CPU utilization to track load smoothly, avoiding oscillation from momentary spikes.

Task Migration: Active vs Passive

Passive migration (pull): The idle or lightly-loaded CPU initiates migration by running load_balance() and pulling tasks from a busier CPU. This is the normal load balancing path.

Active migration (push): Triggered by IPI (inter-processor interrupt). Used when: - A newly created task should immediately run on another CPU (wake_up_new_task) - A woken task has better affinity on a specific CPU - CPU hotplug forces task migration

Active migration sends an IPI to the target CPU. The stop_machine infrastructure and the stop task (stop_sched_class) are used for operations that require temporarily stopping a CPU (e.g., CPU hotplug, some RCU operations).

Active migration via stop task:

  CPU 0: wants to move task T to CPU 1
  → queue stop_machine work on CPU 1
  → send IPI to CPU 1
  CPU 1: receives IPI, STOP task runs (highest priority)
     → T is dequeued from CPU 0's rq (requires locking CPU 0's rq)
     → T is enqueued on CPU 1's rq
     → STOP task completes, CPU 1 schedules T

Idle Balancing: newidle_balance

When a CPU becomes idle (no more runnable tasks), newidle_balance() is called before running the idle task. It aggressively tries to steal work from other CPUs:

pick_next_task(rq, idle_thread):
  → class->balance() for fair_sched_class
     → newidle_balance(rq, rf)
       ← walk sched_domains, look for idle candidates
       ← attempt to pull tasks from busiest CPU
       ← if pulled at least one task: return, schedule it
       ← if nothing: return to idle

Why idle balancing is special:
  - Runs in idle context, so can take more time
  - No currently running task to displace
  - Pulling work here keeps CPUs busy and reduces latency
    (task gets CPU immediately rather than waiting for next balance interval)

Scheduler Features and Knobs

/sys/kernel/debug/sched/features exposes scheduler tuning flags:

GENTLE_FAIR_SLEEPERS     # Cap the vruntime debt of sleeping tasks on wakeup
START_DEBIT              # Newly forked tasks start with a vruntime penalty (prevents fork bomb scheduling gaming)
NEXT_BUDDY               # Prefer to run the waker's buddy (who was woken)
LAST_BUDDY               # Prefer to re-run the last-run task (cache warmth)
CACHE_HOT_BUDDY          # Don't migrate cache-hot tasks
WAKEUP_PREEMPTION        # Allow woken tasks to preempt current task
HRTICK                   # Use high-resolution timer for finer-grained preemption
HRTICK_DL                # High-res timer for deadline tasks
DOUBLE_TICK              # Avoid double tick on same CPU
NONTASK_CAPACITY         # Subtract non-task (IRQ, kernel thread) capacity from CPU capacity
TTWU_QUEUE               # Queue wakeups to be processed on target CPU (reduces IPI traffic)
SIS_PROP                 # Search for idle sibling based on load proportion
UTIL_EST                 # Track task utilization estimate for better frequency scaling decisions
UTIL_EST_FASTUP          # Allow utilization estimate to increase faster than it decays
LATENCY_WARN             # Warn in dmesg about scheduling latency > sched_latency_warn_ns

These can be toggled: echo NO_WAKEUP_PREEMPTION > /sys/kernel/debug/sched/features

Wakeup Path: try_to_wake_up()

When a sleeping task becomes runnable (I/O completes, lock acquired, timer fires), try_to_wake_up() is called:

try_to_wake_up(p, state, wake_flags):
  1. Check if task is already runnable (avoid double wakeup)
  2. smp_mb__before_spinlock() ← memory barrier
  3. Acquire p->pi_lock (task state lock)
  4. select_task_rq(p, ...) ← choose best CPU for the task
       ← sched_class->select_task_rq_fair()
       ← considers: current CPU, prev CPU, idle CPUs, NUMA preference, cache affinity
  5. ttwu_queue(p, target_cpu, wake_flags)
       ← if target CPU == current CPU: enqueue directly
       ← else: queue on target CPU via IPI (TTWU_QUEUE feature)
          or migrate immediately (for RT tasks)
  6. check_preempt_curr(rq, p, ...)
       ← does the woken task preempt the current task?
       ← for CFS: compares vruntime with sched_wakeup_granularity check
       ← for RT: always preempt if newly runnable task has higher priority

Debugging Scheduler Internals

# Full scheduler state dump (per-CPU runqueues, vruntime values, etc.):
cat /sys/kernel/debug/sched/debug   # requires CONFIG_SCHED_DEBUG

# Example output section:
# cpu#0, 2342.432014 MHz
#   .nr_running              :      3
#   .load                    : 3072
#   .nr_switches             : 1234567
#   .nr_load_updates         : 987654
# cfs_rq[0]:
#   .exec_clock              : 123456.789ms
#   .MIN_vruntime            : 0.123456ms  
#   .min_vruntime            : 45678.901ms
#   se->exec_start           : 123456789012
#   se->vruntime             : 45679.123ms
#   se->sum_exec_runtime     : 1234.567ms
#   [task: firefox pid: 1234 .vruntime: 45679.123ms .load: 1024]

# Per-task detailed stats:
cat /proc/[pid]/sched

# Trace context switches with ftrace:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo __schedule > /sys/kernel/debug/tracing/set_graph_function
cat /sys/kernel/debug/tracing/trace_pipe | head -100

# Count context switches per second per CPU:
perf stat -e context-switches -a -C 0 -- sleep 5

# Trace scheduling latency with eBPF/bpftrace:
bpftrace -e '
  tracepoint:sched:sched_wakeup { @wakeup[args->pid] = nsecs; }
  tracepoint:sched:sched_switch /args->next_pid != 0/ {
    $delta = nsecs - @wakeup[args->next_pid];
    if ($delta > 0 && $delta < 1000000000) {
      @latency_us = hist($delta / 1000);
    }
    delete(@wakeup[args->next_pid]);
  }
  END { print(@latency_us); }'

# Watch load balancer in action (migrations):
perf stat -e sched:sched_migrate_task -a -- sleep 10

# Measure context switch cost (ns per switch):
perf bench sched messaging -g 10 -l 100

Security Implications

Side channels via scheduler observability: /proc/[pid]/sched exposes vruntime, nr_switches, and exec_time for any process. An attacker can monitor scheduling behavior of target processes, inferring their execution patterns (e.g., when a cryptographic computation is running, affecting timing attacks).

IPI storms: An attacker who can create many threads that continually wake each other across CPUs can generate excessive IPI (inter-processor interrupt) traffic via try_to_wake_up(). IPIs are expensive (~1µs) and can cause measurable latency for legitimate workloads. Rate limiting on wakeup frequency mitigates this.

sched_debug exposure: /sys/kernel/debug/sched/debug is root-only and exposes detailed internal state. In cloud environments, this file should be inaccessible from containers. In practice, it requires CAP_SYS_ADMIN and is not accessible from default container environments.

Retpoline overhead in vtable dispatch: Every class->method() call in the scheduler is an indirect call — a target for Spectre variant 2. With CONFIG_RETPOLINE, each such call has ~10-20ns overhead. The scheduler's hot path (pick_next_task → fair_sched_class → pick_next_task_fair) makes ~5-10 indirect calls. This is ~50-200ns of Retpoline overhead per scheduling decision — observable in perf top on heavily scheduled workloads.

Performance Implications

__schedule() itself takes ~1-5µs end-to-end on modern hardware (register save, rb-tree lookup, TLB flush, register restore)
The CFS rb-tree lookup (__pick_next_entity) is the theoretical bottleneck but in practice is O(1) due to leftmost caching — just a pointer dereference
update_curr() (called on every tick and wakeup) is the most frequently called scheduler function — must be under ~100ns
Load balancing (rebalance_domains) runs asynchronously in softirq context; its overhead is amortized but can cause brief CPU spikes on high-core-count systems
On 256-core systems, the per-CPU runqueue lock is rarely contended (each CPU has its own). Cross-CPU operations (migration, wakeup) require brief locking of remote rqs.
sched_nr_migrate=32 limits tasks moved per load balance pass; higher values improve balance but increase hold time on runqueue locks

Failure Modes

Missing load balance trigger: A bug where trigger_load_balance() is not called (e.g., scheduler_tick() not firing due to NO_HZ tickless idle) can leave runqueue imbalances for seconds. Mitigated by the newidle_balance() path.
Runqueue lock deadlock: Load balancing must lock two runqueues simultaneously. If lock ordering is violated, deadlock results. The kernel enforces ordering: always lock lower CPU number first. Bugs here are historically rare but catastrophic.
select_task_rq() returning a wrong CPU: If the CPU affinity or NUMA preference calculation returns a CPU that doesn't match the task's cpumask, the kernel detects this and falls back to current CPU. The WARN_ON_ONCE(!cpumask_test_cpu(cpu, p->cpus_ptr)) guard catches this.
vruntime skew after migration: When a task migrates from CPU A to CPU B, its vruntime must be translated from CPU A's min_vruntime reference to CPU B's. A bug here causes the task to either monopolize CPU B (too low relative vruntime) or starve (too high). The encoding in migrate_task_rq_fair() handles this.
Idle CPU not getting work: The "wasted cores" paper (Lozi et al., EuroSys 2016) documented cases where Linux's load balancer failed to distribute work due to bugs in find_busiest_group(). These bugs caused CPUs to remain idle while others were overloaded. Several were fixed in Linux 4.x.

Modern Usage and Future Directions

sched_ext (Linux 6.12): Allows the fair_sched_class to be replaced at runtime with a BPF implementation. The BPF program implements the sched_class interface through a set of well-defined hooks. This enables: - Rapid iteration on scheduling algorithms - Workload-specific schedulers (e.g., a scheduler aware of a specific application's thread communication patterns) - Research without kernel patches

Early sched_ext implementations include Userspace Cooperative Scheduler (scx_userspace), FIFO scheduler (scx_fifo), and Lavd (Latency-Aware Virtual Deadline) from Samsung for mobile.

Scheduler profiling improvements: perf sched has become significantly more powerful. BPF-based scheduling probes (via bpftrace or bcc) can capture scheduling decisions at nanosecond granularity, enabling detailed analysis impossible with older tools.

EEVDF (Linux 6.6): The core task selection in fair_sched_class has moved from "pick leftmost vruntime" to "pick earliest eligible virtual deadline first". The rb-tree remains but the selection criterion changed. See 03-linux-cfs.md for details.

Exercises

Read the runqueue: Use cat /sys/kernel/debug/sched/debug and locate a specific CPU's CFS runqueue. Identify the current task, min_vruntime, and the vruntime spread of runnable tasks. How many tasks are in the runqueue?
Trace __schedule invocations: Using perf probe --add __schedule and perf record -e probe:__schedule -a -- sleep 5, count how many times __schedule() is called per second on your system under idle conditions. Compare to under CPU load.
Instrument context_switch: Write a bpftrace script that traces sched:sched_switch events, recording the time between when a task becomes runnable (sched_wakeup) and when it first runs (sched_switch with next_pid matching). Plot the histogram.
Load balance observation: On a multi-CPU system, create a runqueue imbalance by pinning 8 CPU-bound threads to CPU 0 with taskset -c 0. Monitor /sys/kernel/debug/sched/debug to watch other CPUs idle. Then remove the pinning and watch load balance redistribute tasks.
sched_class dispatch overhead: Using perf stat -e r02B1 -- sleep 1 (DSB misses on x86), measure the indirect branch misprediction rate with many context switches. Compare with and without CONFIG_RETPOLINE (if you have access to different kernel builds). Alternatively, use perf stat -e branch-misses and observe the increase during high context-switch workloads.

References

Linux Kernel source: kernel/sched/core.c (main scheduling code)
Linux Kernel source: kernel/sched/fair.c (CFS implementation)
Linux Kernel source: kernel/sched/rt.c (RT scheduler)
Linux Kernel source: include/linux/sched.h (struct task_struct)
Linux Kernel source: kernel/sched/sched.h (struct rq, sched_class)
Love, R., "Linux Kernel Development", 3rd ed., Chapter 4 — Process Scheduling
Corbet, J., "A closer look at the Linux scheduler", LWN.net
Lozi, J-P. et al., "The Linux Scheduler: a Decade of Wasted Cores", EuroSys 2016
Molnar, I., LKML posts on CFS design decisions, 2007-2010
McKenney, P. et al., "Is Parallel Programming Hard, And If So, What Can You Do About It?" — synchronization context for scheduler locks
Peter Zijlstra's LKML posts on load balancer improvements, 2014-2022
Torvalds, L., various LKML scheduler discussions (esp. on scheduler latency tradeoffs)