CFS Group Scheduling and CPU Bandwidth Control

Technical Overview

CFS group scheduling extends the Completely Fair Scheduler's fairness guarantees from individual tasks to groups of tasks — where a "group" can be a container, a user, a service, or any hierarchical collection of processes. It is the kernel mechanism behind Kubernetes CPU requests and limits, systemd CPU quotas, and Docker --cpus flags.

The core innovation: in addition to each task having a scheduling entity, each task group also has a scheduling entity (one per CPU) that is enqueued in its parent's runqueue. Tasks within a group compete for the group's share; groups compete for the system's total CPU. This creates a hierarchical fair scheduler with two or more levels.

The critical operational gotcha, and one of the most common sources of unexpected latency in containerized workloads, is CFS bandwidth throttling. When a group exhausts its CPU quota within a scheduling period, every task in that group is stalled until the next period begins. With the default 100ms period, a service that only needs brief CPU bursts can stall for tens of milliseconds, up to nearly the full period, if it runs out of quota at the wrong moment. Understanding this mechanism is essential for anyone running containers in production.

Prerequisites

03-linux-cfs.md (vruntime, rb-tree runqueue, sched_entity)
Understanding of Linux cgroups (v1 and v2)
Kubernetes Pod scheduling and resource model (helpful but not required)

CPU Cgroups: The Interface

cgroup v1 CPU Controller

/sys/fs/cgroup/cpu/<group>/
  cpu.shares          = 1024       # relative weight (default 1024 = NICE_0_WEIGHT)
  cpu.cfs_quota_us    = -1         # quota in microseconds per period (-1 = unlimited)
  cpu.cfs_period_us   = 100000     # period in microseconds (default 100ms)
  cpu.stat            = [readonly] # throttling statistics
  cpu.rt_runtime_us   = 0          # RT task budget
  cpu.rt_period_us    = 1000000    # RT period

cpu.shares: Relative weight for CFS scheduling. A group with cpu.shares=2048 gets twice the CPU time of a group with cpu.shares=1024 when the system is contended. When idle, a group can use all available CPU regardless of shares. Analogous to nice values but for groups.

Example: 3 groups on a contended CPU:
  Group A: cpu.shares=1024 → 1024/(1024+512+512) = 50% CPU
  Group B: cpu.shares=512  → 512/2048 = 25% CPU
  Group C: cpu.shares=512  → 512/2048 = 25% CPU

If Group B and C are idle:
  Group A gets 100% CPU (no throttling, just relative weights)
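
The arithmetic above can be reproduced with a short sketch. The function name cpu_fractions and its runnable parameter are illustrative, not kernel or cgroup APIs; the model assumes a single fully contended CPU and ignores quotas.

```python
def cpu_fractions(shares, runnable=None):
    """Return each group's expected CPU fraction on a fully contended CPU.

    shares   -- mapping of group name to its cpu.shares value
    runnable -- names of groups that currently have runnable tasks
                (defaults to all groups); idle groups claim nothing
    """
    active = set(shares) if runnable is None else set(runnable)
    total = sum(shares[name] for name in active)
    return {name: (shares[name] / total if name in active else 0.0)
            for name in shares}


groups = {"A": 1024, "B": 512, "C": 512}
print(cpu_fractions(groups))                  # all three groups busy
print(cpu_fractions(groups, runnable=["A"]))  # B and C idle
```

Only the ratio between shares matters: doubling every value leaves the fractions unchanged.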

cpu.cfs_quota_us and cpu.cfs_period_us: Hard cap on CPU usage. If quota=50000 and period=100000, the group may use at most 50ms of CPU per 100ms — 50% CPU cap, regardless of system load.

Setting quota to 2× the period allows 200% CPU usage (e.g., 2 full cores worth of parallel work).

cgroup v2 CPU Controller

/sys/fs/cgroup/<group>/
  cpu.weight          = 100        # weight (1-10000, default 100, replaces cpu.shares)
  cpu.weight.nice     = 0          # nice value interface to cpu.weight
  cpu.max             = "max 100000"  # "quota period" or "max period" (unlimited)
  cpu.max.burst       = 0          # burst above quota (microseconds of extra runtime, Linux 5.14+)
  cpu.stat            = [readonly]
  cpu.pressure        = [PSI stats]

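cpu.weight vs cpu.shares: v2 weight range is 1-10000 (default 100). Inside the kernel the two scales are linearly related, with weight 100 corresponding to shares 1024 (cpu.shares ≈ cpu.weight × 1024/100). The valid ranges differ, however (1-10000 versus 2-262144), so tools that translate v1 shares into v2 weights must rescale, and values near the range boundaries do not round-trip exactly.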
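
As a rough illustration of the linear relationship, the sketch below scales a v2 weight to the v1 shares scale with round-to-nearest. The function name is hypothetical, and the rounding rule is an assumption about how the kernel scales the value internally; container runtimes that translate in the other direction use their own formulas.

```python
CGROUP_WEIGHT_DEFAULT = 100   # default cpu.weight
SHARES_DEFAULT = 1024         # default cpu.shares


def weight_to_shares(weight):
    """Scale a cpu.weight (1-10000) to the cpu.shares scale, rounding to nearest."""
    if not 1 <= weight <= 10000:
        raise ValueError("cpu.weight must be in the range 1-10000")
    return (weight * SHARES_DEFAULT + CGROUP_WEIGHT_DEFAULT // 2) // CGROUP_WEIGHT_DEFAULT


for weight in (1, 50, 100, 10000):
    print(weight, weight_to_shares(weight))
```

The top of the weight range lands at 102400 on the shares scale, well below the v1 maximum of 262144, which is one reason the two ranges cannot be mapped onto each other exactly.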

cpu.max format: "max 100000" means unlimited quota with 100ms period. "50000 100000" means 50ms quota per 100ms period (50% cap). "200000 100000" means 200ms quota per 100ms period (2 CPU cores).
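
A minimal sketch of reading this format, assuming the file contents have already been read into a string. parse_cpu_max is an illustrative helper, not part of any library.

```python
def parse_cpu_max(text):
    """Parse the contents of a cgroup v2 cpu.max file.

    Returns (quota_us, period_us, cpu_limit). quota_us and cpu_limit are
    None when the quota field is "max" (unlimited).
    """
    quota_field, period_field = text.split()
    period_us = int(period_field)
    if quota_field == "max":
        return None, period_us, None
    quota_us = int(quota_field)
    return quota_us, period_us, quota_us / period_us


for contents in ("max 100000", "50000 100000", "200000 100000"):
    print(contents, "->", parse_cpu_max(contents))
```

The third element is the cap expressed in CPUs: 0.5 for "50000 100000" and 2.0 for "200000 100000".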

How CFS Group Scheduling Works

Hierarchical Scheduling Entities

Each task group (cgroup) has a task_group structure containing per-CPU sched_entity objects. These group entities are placed in the parent group's runqueue, just like individual task entities.

System CFS runqueue (root level):
rb-tree:
  ├─ group_A entity (vruntime=100, weight=1024)  ─→ Group A's CFS runqueue:
  │                                                   rb-tree:
  │                                                     ├─ task_1 (vrt=90)
  │                                                     └─ task_2 (vrt=110)
  └─ group_B entity (vruntime=120, weight=512)   ─→ Group B's CFS runqueue:
                                                      rb-tree:
                                                        ├─ task_3 (vrt=100)
                                                        └─ task_4 (vrt=140)

Scheduling a task:
1. Root CFS picks the leftmost group entity: group_A_entity (vrt=100)
2. Group A's CFS picks the leftmost task entity: task_1 (vrt=90)
3. task_1 runs; both task_1->vruntime and group_A_entity->vruntime advance
4. On the next scheduling decision, root CFS may pick group_B if its vruntime is now lower

This propagation ensures: tasks within a group are fairly scheduled among themselves, AND groups are fairly scheduled relative to each other. The fairness is hierarchical.
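
The pick-and-charge cycle can be sketched as a toy model using the example hierarchy above. Entity, pick_next, account, and build_example are illustrative names; a plain list with min() stands in for the rb-tree, and vruntime advances by delta × 1024 / weight as a simplification of the kernel's weighting.

```python
NICE_0_WEIGHT = 1024


class Entity:
    """A scheduling entity: a task, or a group that owns a child runqueue."""

    def __init__(self, name, vruntime, weight=NICE_0_WEIGHT, children=None):
        self.name = name
        self.vruntime = vruntime
        self.weight = weight
        self.children = children or []   # empty for a task


def pick_next(runqueue):
    """Descend the hierarchy, taking the lowest-vruntime entity at each level."""
    path = []
    while runqueue:
        entity = min(runqueue, key=lambda e: e.vruntime)
        path.append(entity)
        runqueue = entity.children
    return path   # [group entity, ..., task]


def account(path, delta):
    """Charge delta units of runtime to every entity on the chosen path."""
    for entity in path:
        entity.vruntime += delta * NICE_0_WEIGHT / entity.weight


def build_example():
    """The two-group hierarchy from the diagram above."""
    group_a = Entity("group_A", 100, 1024,
                     [Entity("task_1", 90), Entity("task_2", 110)])
    group_b = Entity("group_B", 120, 512,
                     [Entity("task_3", 100), Entity("task_4", 140)])
    return [group_a, group_b]


root = build_example()
first = pick_next(root)
print([e.name for e in first])
account(first, 30)
print([e.name for e in pick_next(root)])
```

After task_1 runs for 30 units, group_A's vruntime (130) exceeds group_B's (120), so the next pick descends into group B even though task_2 still has a lower vruntime than task_4.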

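Weight propagation: A group entity's load weight at the parent level comes from the group's configured cpu.shares (or cpu.weight), not from the sum of its tasks' weights. Because each group has one entity per CPU, the kernel splits the group's shares across CPUs in proportion to the load the group's runqueue carries on each CPU, and it propagates load changes up the hierarchy so that parent-level decisions stay accurate.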
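
A simplified sketch of that split, assuming per-CPU load is known exactly. The function name and inputs are illustrative; the kernel's real calculation works on decayed load averages and applies further corrections, so treat this only as the underlying proportion.

```python
MIN_SHARES = 2   # smallest weight a group entity is given in this sketch


def group_entity_weight(tg_shares, cpu_load, cpu):
    """Approximate weight of a group's entity on one CPU.

    tg_shares -- the group's configured cpu.shares
    cpu_load  -- mapping of CPU id to the load of the group's runqueue there
    cpu       -- the CPU whose group entity is being sized
    """
    total = sum(cpu_load.values())
    if total == 0:
        return tg_shares
    weight = tg_shares * cpu_load[cpu] // total
    return max(MIN_SHARES, min(weight, tg_shares))


# A group with cpu.shares=1024 running three tasks on CPU 0 and one on CPU 1.
load = {0: 3 * 1024, 1: 1 * 1024}
print(group_entity_weight(1024, load, 0))
print(group_entity_weight(1024, load, 1))
```

The per-CPU weights sum to roughly the group's shares, so the group as a whole competes with its configured weight no matter how its tasks are spread across CPUs.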

Bandwidth Throttling Mechanism

When cpu.cfs_quota_us is set (not -1), the kernel enforces it using a token-bucket mechanism:

For each cgroup with quota set:
  cfs_bandwidth.quota        = cfs_quota_us      (e.g., 50ms)
  cfs_bandwidth.period       = cfs_period_us     (e.g., 100ms)
  cfs_bandwidth.runtime      = current balance   (starts at quota, decreases as CPU used)
  cfs_bandwidth.period_timer = high-res timer firing at period boundaries

Per-CPU pools (to reduce lock contention on multi-core):
  cfs_rq.runtime_remaining = local slice of quota
                             (slice size: sched_cfs_bandwidth_slice_us, default 5ms)
  When local slice exhausted:
    → try to get more from global pool (cfs_bandwidth.runtime)
    → if global pool empty: throttle this CPU's cfs_rq

Throttling:
  cfs_rq.throttled = 1
  Group entity dequeued from the parent runqueue on this CPU;
    the cfs_rq is placed on the throttled_cfs_rq list
  No more tasks from this group scheduled on this CPU

Period expiry (high-res timer fires):
  cfs_bandwidth.runtime = quota  (replenish)
  unthrottle_cfs_rq() for all throttled CPUs
  Tasks re-enqueued → can run again

The throttle state is binary per-CPU, per-period. A group is either running or frozen. There is no partial rate-limiting within a period.
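
The interaction between the global pool and the per-CPU slices can be sketched as a toy simulation. The class and function names are illustrative, time advances in fixed 1ms ticks, and SLICE_US models the 5ms default of sched_cfs_bandwidth_slice_us; the kernel's accounting is event-driven and handles many cases this model ignores.

```python
SLICE_US = 5000  # runtime handed to a CPU per request (models sched_cfs_bandwidth_slice_us)


class Bandwidth:
    """Global runtime pool for one cgroup (a simplified cfs_bandwidth)."""

    def __init__(self, quota_us):
        self.quota_us = quota_us
        self.runtime_us = quota_us

    def take_slice(self):
        """Hand out up to one slice; returns 0 when the pool is empty."""
        grant = min(SLICE_US, self.runtime_us)
        self.runtime_us -= grant
        return grant

    def refill(self):
        """Period timer: reset the pool to the quota (no burst modeled)."""
        self.runtime_us = self.quota_us


class CpuRunqueue:
    """One CPU's local pool for the group (a simplified cfs_rq)."""

    def __init__(self, bandwidth):
        self.bandwidth = bandwidth
        self.runtime_remaining_us = 0
        self.throttled = False

    def run(self, delta_us):
        """Charge delta_us of execution unless this runqueue is throttled."""
        if self.throttled:
            return
        self.runtime_remaining_us -= delta_us
        if self.runtime_remaining_us <= 0:
            self.runtime_remaining_us += self.bandwidth.take_slice()
        if self.runtime_remaining_us <= 0:
            self.throttled = True


def time_until_throttled(quota_us, period_us, n_cpus, tick_us=1000):
    """Run CPU-bound work on n_cpus; return when all are throttled, or None."""
    bandwidth = Bandwidth(quota_us)
    cpus = [CpuRunqueue(bandwidth) for _ in range(n_cpus)]
    for now in range(0, period_us, tick_us):
        for cpu in cpus:
            cpu.run(tick_us)
        if all(cpu.throttled for cpu in cpus):
            return now + tick_us
    return None


print(time_until_throttled(100_000, 100_000, n_cpus=2))
print(time_until_throttled(50_000, 100_000, n_cpus=1))
```

Both calls report throttling 50ms into the period: two busy CPUs drain a 100ms quota in the same wall time that one busy CPU drains a 50ms quota.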

The 100ms Period Problem

This is the most important practical consequence of CFS bandwidth control, responsible for widespread latency issues in Kubernetes and containerized environments.

The scenario:

Service configuration: cpu.cfs_quota_us=100000 (100ms), cpu.cfs_period_us=100000 (100ms)
  → Allowed: 100ms CPU per 100ms period (1 full CPU)

Timeline:
  t=0ms:    New period starts. runtime = 100ms.
  t=0-50ms: Service handles 10 concurrent requests. Uses 100ms of CPU
            (10 requests × 10ms each = 100ms, spread across 2 cores = 50ms wall time)
  t=50ms:   Quota exhausted. ALL service threads THROTTLED.
  t=50-100ms: Service frozen. Cannot handle new requests. Requests queue.
  t=100ms:  New period. Unthrottled. Threads wake and process queued requests.

Result: up to 50ms of added latency for any request that arrived in the 50-100ms window.
        Tail latency jumps by tens of milliseconds even though the service
        "only needs 100ms/100ms = 1 CPU" on average.

This is not a bug — it is the correct enforcement of the configured limit. The problem is that the 100ms period is too coarse for latency-sensitive services.

The fix: shorter period:

# Kubernetes pod with CPU limit 1.0 = 1 CPU:
# Default: quota=100ms, period=100ms
# Fix: quota=10ms, period=10ms
# Both allow 1 CPU, but throttling events last ≤10ms instead of ≤100ms

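# In Kubernetes (cgroup v1):
cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/<container>/cpu.cfs_quota_us
# If 100000 (100ms), change period and quota proportionally
# (run from that container's cgroup directory):
echo 10000 > cpu.cfs_quota_us   # 10ms quota
echo 10000 > cpu.cfs_period_us  # 10ms period
# Manual writes are for experiments only; the kubelet or container runtime may overwrite them.
# Kubernetes does NOT support per-pod period customization as of K8s 1.29
# Workaround: --cpu-cfs-quota-period flag on kubelet (applies to every pod on the node;
# requires the CustomCPUCFSQuotaPeriod feature gate)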

# Or increase the quota (better long-term fix if the limit is too low):
# For a service that legitimately needs more CPU:
resources:
  limits:
    cpu: "2"  # 2 CPUs → 200ms quota per 100ms period → less throttling
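
The effect of the period length on the stall can be worked through numerically. This is a back-of-the-envelope sketch with a hypothetical helper name; it assumes every busy thread burns CPU continuously from the start of the period, which is the worst case.

```python
def throttle_stall_ms(quota_ms, period_ms, busy_threads):
    """Worst-case stall when busy_threads burn CPU from the start of a period.

    Returns (time_until_throttled_ms, stall_ms). The stall is 0 when the
    quota lasts for the whole period.
    """
    exhausted_at = quota_ms / busy_threads
    if exhausted_at >= period_ms:
        return float(period_ms), 0.0
    return exhausted_at, period_ms - exhausted_at


# Same 1-CPU cap, two busy threads, two different periods.
print(throttle_stall_ms(quota_ms=100, period_ms=100, busy_threads=2))
print(throttle_stall_ms(quota_ms=10, period_ms=10, busy_threads=2))
```

Both settings cap the group at one CPU, but the worst-case stall drops from 50ms to 5ms with the shorter period. The trade-off is more frequent timer activity and refill work per cgroup.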

Why 100ms was chosen: The Linux default cpu.cfs_period_us=100000 (100ms) was set historically. 100ms was considered fine-grained enough for batch workloads. For interactive or latency-sensitive services, it's too coarse.

cpu.max.burst (Linux 5.14, cgroup v2)

cpu.max.burst allows short-term CPU usage above the configured quota:

cpu.max = "50000 100000"   (50ms quota per 100ms period = 50% CPU)
cpu.max.burst = 50000       (50ms burst credit)

Normal operation:
  Period 1: Use 50ms CPU → full quota consumed, no burst
  Period 2: Use 30ms CPU → 20ms unused → accrues as burst credit
  Period 3: Use 70ms CPU → 50ms from quota + 20ms from burst → allowed

Burst credit accumulates up to cpu.max.burst (50ms in this example)
Allows absorbing short spikes without throttling
Essential for services with bursty workloads (web requests, event-driven)

This is analogous to the burst size of a token bucket in network traffic shaping. The long-term average is still bounded by the quota, but short-term spikes are absorbed.
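
The three-period example above can be replayed with a small model. It assumes that unused runtime carries over at each refill and that the balance is capped at quota + burst; simulate_burst is an illustrative name, and demand is taken as given rather than limited by wall-clock time.

```python
def simulate_burst(quota_ms, burst_ms, demand_per_period_ms):
    """Return a list of (granted_ms, deferred_ms), one entry per period.

    granted_ms  -- CPU time the group could use in that period
    deferred_ms -- demand that did not fit and would be throttled
    """
    runtime = quota_ms
    results = []
    for demand in demand_per_period_ms:
        granted = min(demand, runtime)
        results.append((granted, demand - granted))
        # Refill: leftover runtime carries over, capped at quota + burst.
        runtime = min(runtime - granted + quota_ms, quota_ms + burst_ms)
    return results


demand = [50, 30, 70]
print(simulate_burst(50, 50, demand))   # with burst
print(simulate_burst(50, 0, demand))    # without burst
```

With burst enabled, the 70ms period is served in full from the 50ms quota plus the 20ms left over from the previous period. With burst set to 0, the same period is cut off after 50ms and the remaining 20ms is deferred.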

Kubernetes CPU Requests and Limits

Kubernetes maps CPU resources to cgroup controls:

Pod spec:
  resources:
    requests:
      cpu: "250m"   # 250 millicores = 0.25 CPUs
    limits:
      cpu: "1000m"  # 1000 millicores = 1.0 CPUs

Mapping to cgroup v1:
  cpu.shares = requests × (1024/1000) = 250 × 1.024 = 256
    (determines relative weight when CPU is contended)

  cpu.cfs_quota_us  = limits × cfs_period_us = 1.0 × 100000 = 100000
  cpu.cfs_period_us = 100000 (100ms, kubelet default)
    (hard cap: group can use at most 100ms per 100ms period)

Key difference:
  requests (cpu.shares): ONLY matters when CPU is contended
  limits   (cfs_quota):  ALWAYS enforced, even when CPU is idle

Running without limits: Pods without CPU limits have no cfs_quota_us (or it's set to -1). They can use all available CPU but compete fairly with other pods via cpu.shares. For latency-sensitive production services, this is often the right choice on dedicated hardware. On shared clusters, it risks noisy neighbor problems.

Requests without limits: A pod with requests=250m but no limits gets cpu.shares=256 (competition weight) but no throttling. It can burst to use any available CPU. Under contention it still receives its proportional share, so latency is protected by fair scheduling rather than by a hard cap; this is a proportional guarantee, not an absolute one.
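
The mapping can be written out as a short sketch. The function names and the lower bounds (2 shares, 1000µs of quota) are assumptions modeled on the kubelet's behavior, not its actual code; a missing limit is represented as None and maps to -1 (unlimited).

```python
MIN_SHARES = 2        # assumed floor for cpu.shares
MIN_QUOTA_US = 1000   # assumed floor for cpu.cfs_quota_us


def cpu_shares(request_millicores):
    """Map a CPU request in millicores to a cgroup v1 cpu.shares value."""
    return max(MIN_SHARES, request_millicores * 1024 // 1000)


def cfs_quota_us(limit_millicores, period_us=100_000):
    """Map a CPU limit in millicores to cpu.cfs_quota_us (-1 means no limit)."""
    if limit_millicores is None:
        return -1
    return max(MIN_QUOTA_US, limit_millicores * period_us // 1000)


print(cpu_shares(250))                      # requests: 250m
print(cfs_quota_us(1000))                   # limits: 1000m, default period
print(cfs_quota_us(1000, period_us=10_000)) # same limit, 10ms period
print(cfs_quota_us(None))                   # no limit set
```

Changing the period rescales the quota but leaves the cap in CPUs unchanged, which is why a shorter period alters throttling granularity without altering the limit itself.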

Container CPU Throttling Monitoring

# Per-container throttled time (cgroup v1):
cat /sys/fs/cgroup/cpu/kubepods/pod<uid>/<container>/cpu.stat
# nr_periods:      total periods elapsed
# nr_throttled:    periods where group was throttled
# throttled_time:  total nanoseconds spent throttled

# Calculate throttle percentage:
nr_throttled / nr_periods × 100

# Prometheus metrics from cAdvisor:
container_cpu_cfs_throttled_seconds_total{container="my-service"}
container_cpu_cfs_throttled_periods_total
container_cpu_cfs_periods_total

# Throttle ratio query (PromQL):
rate(container_cpu_cfs_throttled_periods_total[5m]) /
rate(container_cpu_cfs_periods_total[5m])

# Alert: throttle ratio > 5% for latency-sensitive services

Interpreting throttle metrics:
- nr_throttled / nr_periods = 0.02 (2%): Low throttling, likely acceptable
- nr_throttled / nr_periods = 0.20 (20%): High throttling, latency impact likely
- nr_throttled / nr_periods = 0.80 (80%): Severe throttling, limit too low

Key insight: A service can have low average CPU utilization (e.g., 30% of limit) but still experience high throttling if it has bursty traffic patterns. Average utilization is not sufficient for diagnosing throttling.
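
Because the cpu.stat counters are cumulative, the ratio for a time window comes from the difference between two readings. A minimal sketch, assuming the file contents have already been read into strings; parse_cpu_stat and throttle_ratio are illustrative helpers, and the sample values are made up.

```python
def parse_cpu_stat(text):
    """Turn cpu.stat contents (one "key value" pair per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttle_ratio(before, after):
    """Fraction of periods throttled between two cpu.stat snapshots."""
    periods = after["nr_periods"] - before["nr_periods"]
    if periods <= 0:
        return 0.0
    return (after["nr_throttled"] - before["nr_throttled"]) / periods


# Hypothetical snapshots taken 10 seconds apart (100 periods of 100ms).
first = parse_cpu_stat("nr_periods 1000\nnr_throttled 40\nthrottled_time 900000000\n")
second = parse_cpu_stat("nr_periods 1100\nnr_throttled 60\nthrottled_time 1500000000\n")
print(throttle_ratio(first, second))
```

The lifetime ratio in the second snapshot is about 5%, but the ratio over the sampled window is 20%, which is the figure that matters for current latency.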

cgroup v2 CPU Stats and PSI

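cgroup v2 reports throttling in cpu.stat as nr_periods, nr_throttled, and throttled_usec (microseconds, where v1 reports throttled_time in nanoseconds). It also exposes CPU Pressure Stall Information (PSI) per cgroup: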

cat /sys/fs/cgroup/mygroup/cpu.pressure
# some avg10=2.34 avg60=1.23 avg300=0.89 total=123456789
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# "some": percentage of time at least one task was stalled waiting for CPU
# "full": percentage of time ALL non-idle tasks in the cgroup were stalled at once
#         (for example, while the group is throttled)
# avg10/60/300: running averages over 10s/60s/5min windows
# total: cumulative microseconds stalled

PSI provides a more nuanced view than throttle time: full captures intervals when no task in the group could make progress (the bad case, which includes throttled intervals), while some captures scheduling delays even without throttling.
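
A small parser for this format, as a sketch. parse_pressure is an illustrative helper that takes the file contents as a string; it returns every field as a float, including total.

```python
def parse_pressure(text):
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {}
        for field in fields:
            key, value = field.split("=")
            result[kind][key] = float(value)
    return result


sample = (
    "some avg10=2.34 avg60=1.23 avg300=0.89 total=123456789\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
pressure = parse_pressure(sample)
print(pressure["some"]["avg10"], pressure["full"]["avg10"])
```

A monitoring script could compare pressure["some"]["avg10"] against a threshold and alert when it stays elevated.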

CFS Hierarchy Example: systemd Integration

systemd uses cgroup CPU limits for system services:

# /etc/systemd/system/myservice.service
[Service]
CPUShares=512          # cgroup v1: cpu.shares (or CPUWeight=50 for v2)
CPUQuota=200%          # cgroup v1: cpu.cfs_quota_us = 200ms per 100ms period
                       # (allows 2 CPU cores worth of work)

# Verify:
systemctl show myservice | grep -E "CPU(Quota|Shares|Weight)"
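# Check cgroup (v1 path shown; on v2, read cpu.max under
# /sys/fs/cgroup/system.slice/myservice.service/ instead):
cat /sys/fs/cgroup/cpu/system.slice/myservice.service/cpu.cfs_quota_us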

systemd's DefaultCPUAccounting=yes enables CPU accounting for all services, enabling systemd-cgtop to show per-service CPU usage.

Debugging Notes

# Find the cgroup for a process:
cat /proc/[pid]/cgroup
# Output: 0::/system.slice/myservice.service  (v2)
# Output: 8:cpu,cpuacct:/kubepods/pod<uid>/<container>  (v1)

# Watch throttle events in real time:
watch -n1 "cat /sys/fs/cgroup/.../cpu.stat"

# Observe scheduler debug for a cgroup:
# (requires CONFIG_SCHED_DEBUG; older kernels expose /proc/sched_debug instead)
cat /sys/kernel/debug/sched/debug | grep -A 20 "cfs_rq"

# Trace throttle and unthrottle events.
# Mainline kernels have no dedicated tracepoints for this, so trace the functions
# (assumes throttle_cfs_rq and unthrottle_cfs_rq appear in available_filter_functions):
cd /sys/kernel/debug/tracing
echo throttle_cfs_rq unthrottle_cfs_rq > set_ftrace_filter
echo function > current_tracer
cat trace_pipe

# Use bpftrace to measure how long each runqueue stays throttled
# (kprobes on the same functions; arg0 is the struct cfs_rq pointer):
bpftrace -e '
  kprobe:throttle_cfs_rq { @start[arg0] = nsecs; }
  kprobe:unthrottle_cfs_rq /@start[arg0]/ {
    $dur = nsecs - @start[arg0];
    printf("throttled for %lld ms\n", $dur / 1000000);
    delete(@start[arg0]);
  }'

Security Implications

CPU quota as a denial-of-service mitigation: CPU limits (cpu.cfs_quota_us) are the primary defense against a misbehaving container consuming all CPU on a shared node. Without limits, a CPU-intensive workload can starve co-located services. In Kubernetes multi-tenant clusters, CPU limits are often mandatory policy.

Throttling as a side channel: The timing of throttle events can be observable by a co-located container. A malicious container can observe when its neighbors are throttled (by noticing that its own throughput increases suddenly) and infer information about neighbor behavior. This is a weak side channel but part of the broader multi-tenant isolation analysis.

Privilege escalation via cgroup manipulation: Writing to cgroup files requires appropriate permissions (root, or specific cgroup delegation). In Kubernetes, bypassing container CPU limits requires escaping the container's cgroup namespace — typically equivalent to container escape.

cpu.shares imbalance as a starvation attack: A container configured with cpu.shares=2 (the v1 minimum) can still run but will get very little CPU under contention. In a multi-tenant environment where one tenant controls their own cpu.shares (within a delegation), they cannot affect other tenants' shares, because shares are only relative among siblings.

Performance Implications

Group scheduling adds one additional rb-tree lookup per scheduling level. For a 2-level hierarchy (system → container → task), this doubles the scheduling lookup cost — still O(log n) per level.
Unthrottling adds latency: it is driven by a high-resolution timer at the period boundary. If the timer fires on a different CPU from the one where the throttled tasks need to run, an IPI is involved. This adds roughly 10-50µs on top of the wait for the period boundary.
With many cgroups, the bandwidth timer management overhead becomes measurable. Systems with thousands of containers (dense Kubernetes nodes) can see 1-3% CPU overhead from cgroup CPU accounting.
The cpu.stat throttled_time counter is cumulative; only the increment between two readings tells you about throttling over that window.

Failure Modes

Undetected throttling: A service has 25% throttle rate but P99 latency looks acceptable in normal traffic. Under burst traffic, throttling rate jumps to 80% and P99 spikes. Monitoring only average CPU usage misses this. Always monitor throttle rate alongside CPU utilization.
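Period-aligned thundering herd: All containers on a node exhaust their quota simultaneously (e.g., if they all receive traffic in bursts aligned with the period). All are throttled simultaneously; at period reset, all wake and compete. This can cause a burst of CPU contention at period boundaries. Each cgroup has its own period timer, so periods are not inherently aligned across containers, but synchronized traffic can still line them up in practice. Note that sched_cfs_bandwidth_slice_us does not randomize period offsets; it only sets how much runtime a CPU pulls from the group's global pool at a time. Mitigation: stagger bursty workloads or spread them across nodes.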
Burst credit exhaustion: With cpu.max.burst configured, a service that continuously bursts to 150% CPU will exhaust burst credit. The first few seconds are fine; subsequent bursts hit the hard quota limit. Misconfigured services rely on burst credit as a crutch; the underlying issue is that the base quota is too low.
cgroup v1/v2 confusion: Mixed v1/v2 hierarchies (cgroups hybrid mode) can cause confusing behavior where some controllers use v1 semantics and others v2. Kubernetes cgroup v2 support became generally available in 1.25; ensure the node OS, container runtime, and kubelet version are aligned.

Modern Usage

Kubernetes CPU management best practices (evolved through operational experience):

Always set CPU requests (affects scheduling and cpu.shares)
Set CPU limits conservatively or not at all for latency-sensitive services
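Use --cpu-cfs-quota-period=10ms on the kubelet (requires the CustomCPUCFSQuotaPeriod feature gate) to reduce throttling granularity
Monitor the throttle ratio (container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total); alert above 5%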
For batch/background jobs: always set CPU limits to prevent runaway jobs
Use CPU Burstable QoS (requests ≠ limits) for services with variable load
Use Guaranteed QoS (requests = limits) for RT-like requirements and predictable scheduling

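cgroup v2 migration: cgroup v2 is the default on recent distributions such as Ubuntu 21.10+ and RHEL 9+; the choice is made by the distribution and systemd rather than by the kernel version. The cpu.weight system (1-10000) provides better granularity than v1's cpu.shares (2-262144 but effective range much narrower). Burst control is exposed as cpu.max.burst in v2 and as cpu.cfs_burst_us in v1, both since Linux 5.14. Most orchestrators (Kubernetes, systemd, Docker) support v2 natively.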

Future Directions

Per-CPU quota accounting: Current CFS bandwidth accounting uses a global pool with per-CPU slices, which requires locking. A fully per-CPU accounting scheme (without global synchronization) could reduce overhead on high-core-count systems.

Deadline-based container scheduling: Rather than quota throttling, containers could use SCHED_DEADLINE semantics: declare period and WCET, get admission control and timing guarantees. This would eliminate the "throttle at period end" problem but requires per-container deadline parameter tuning.

Improved burst accounting: The cpu.max.burst feature is a step toward better handling bursty workloads. Future work might include adaptive burst windows (learn the workload's burst pattern and size the credit accordingly) or integration with load predictors.

Exercises

Throttling demonstration: Create a cgroup with cpu.cfs_quota_us=10000 and cpu.cfs_period_us=100000 (10% CPU). Run a CPU-bound task inside it. Using watch -n0.1 "cat cpu.stat", observe the throttle/unthrottle cycle. Measure the period between unthrottle events.
Period sensitivity: Set quota=50ms, period=100ms. Run a web server that handles 100 req/s, each request using 5ms CPU. At 100 req/s, average CPU = 500ms/s = 50% — exactly the quota. Measure P99 latency. Then set quota=5ms, period=10ms (same 50% cap). Compare P99 latency.
Kubernetes CPU throttling: Deploy a pod with cpu: limits: "500m". Run a load test that uses 400m CPU. Monitor container_cpu_cfs_throttled_periods_total. What throttle rate do you see? Now increase to cpu: limits: "1" and compare.
cgroup shares verification: Create two cgroups with cpu.shares=1024 and cpu.shares=256. Run CPU-bound tasks in each, pinned to the same CPU so that they actually contend. Using cpuacct.usage, verify that the first group gets approximately 4x the CPU time over 10 seconds.
cpu.max.burst effect: Configure a cgroup with cpu.max="50000 100000" (50% CPU) and cpu.max.burst=50000. Run a workload that sleeps for 200ms then runs for 70ms. Without burst, the 70ms run would be throttled after 50ms. With burst credit accumulated during the sleep, verify it completes without throttling.

References

Linux Kernel Documentation: Documentation/scheduler/sched-bwc.rst
Paul Turner et al., "CFS Bandwidth Control", LKML patchset, 2010
Kubernetes documentation: "Managing Compute Resources for Containers"
Grigg, L., "On Kubernetes CPU limits", Zalando Engineering Blog, 2019
Horne, D., "Stop Using CPU Limits", HelloFresh Engineering Blog, 2021
Linux Kernel Documentation: Documentation/admin-guide/cgroup-v2.rst
cgroup v2 cpu controller: kernel/sched/fair.c:sched_cfs_bandwidth_*
Brendan Gregg, "CPU Flame Graphs and Container CPU Throttling", various blog posts
Kubernetes Enhancement Proposal: KEP-1001 CPU Manager
Linux source: kernel/sched/fair.c (bandwidth throttle: search cfs_bandwidth)