Control Groups (cgroups)

Technical Overview

Control Groups (cgroups) are a Linux kernel mechanism for hierarchically organizing processes and applying resource limits, resource accounting, and resource control to those groups. Where namespaces answer "what can a process see?", cgroups answer "what resources can a process use?". Together they form the two-pillar foundation of Linux containers.

A cgroup is simultaneously: a grouping of processes (organized as a hierarchy), a set of resource controllers (each managing a specific resource type), and a set of limits and parameters applied per group. The kernel enforces these limits in the relevant subsystems — the memory controller intercepts page fault and allocation paths, the CPU controller adjusts scheduler weights, the IO controller hooks into the block layer.

There are two generations of cgroups: cgroup v1 (the original design, now considered legacy) and cgroup v2 (unified hierarchy, current standard). Most modern Linux distributions and container runtimes default to cgroup v2.


Prerequisites

  • Process scheduling concepts (CPU time, scheduling weights)
  • Virtual memory management (page faults, reclaim, swap)
  • Block I/O subsystem (block devices, request queues)
  • /proc and /sys filesystem navigation

Historical Context

Control groups were developed at Google by Paul Menage and Rohit Seth, beginning around 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with Linux containers. The first version was merged into Linux 2.6.24 in January 2008.

Google had been running internal container-based infrastructure since the early 2000s (the Borg system). Cgroups were the resource management primitive that made Borg's workload density possible — without cgroups, you cannot safely run hundreds of jobs on a single machine because any one job could exhaust CPU, memory, or IO and degrade all others.

Cgroup v1's architectural problems (multiple independent hierarchies, inconsistent semantics) were well-known. Tejun Heo led the effort to redesign from scratch as cgroup v2, which was merged in Linux 4.5 (2016) but took until Linux 5.x for full controller coverage and production adoption.


Cgroup v1 Architecture

In cgroup v1, each resource controller has its own independent hierarchy mounted under /sys/fs/cgroup/<controller>/:

/sys/fs/cgroup/
├── memory/              ← memory controller hierarchy root
│   ├── tasks
│   ├── memory.limit_in_bytes
│   └── myapp/
│       ├── tasks
│       └── memory.limit_in_bytes
├── cpu/                 ← cpu controller hierarchy root
│   ├── tasks
│   ├── cpu.shares
│   └── myapp/
│       ├── tasks
│       └── cpu.shares
├── cpuacct/
├── blkio/
├── pids/
└── ...

The critical problem: a process can occupy different positions in the memory hierarchy and the CPU hierarchy, because the hierarchies are completely independent. This means:

  • A process must be attached to each relevant hierarchy separately.
  • There is no unified way to express "this group of processes gets 2 CPU shares and 512MB memory".
  • Delegation is broken: a parent can give a child cgroup control of CPU, but the memory limits are managed in a separate tree with separate delegation.

This became untenable as the number of controllers grew.
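The independence of the v1 hierarchies is visible in /proc/<PID>/cgroup, which prints one line per hierarchy in the form hierarchy-ID:controller-list:path. The following is a minimal sketch that parses that text; the sample contents and the paths in them are made up for illustration, and on a v2 system the file holds a single line with an empty controller list.

```python
def parse_proc_cgroup(text):
    """Map each hierarchy's controller list to the cgroup path it reports."""
    placement = {}
    for line in text.strip().splitlines():
        # Each line is "hierarchy-ID:controller-list:cgroup-path".
        _hier_id, controllers, path = line.split(":", 2)
        placement[controllers] = path
    return placement


# Hypothetical v1 contents: the same process sits at different paths
# in the memory and cpu hierarchies.
V1_SAMPLE = """\
5:memory:/myapp
4:cpu,cpuacct:/batch/myapp
3:pids:/
"""

# Hypothetical v2 contents: one line, hierarchy ID 0, no controller list.
V2_SAMPLE = "0::/system.slice/nginx.service\n"

v1 = parse_proc_cgroup(V1_SAMPLE)
v2 = parse_proc_cgroup(V2_SAMPLE)
print(v1)
print(v2)
```

In the v1 sample the process has three unrelated placements to keep track of; in the v2 sample there is exactly one.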


Cgroup v2 Unified Hierarchy

Cgroup v2 uses a single unified hierarchy at /sys/fs/cgroup/. All controllers are managed under one tree:

/sys/fs/cgroup/
├── cgroup.controllers        ← controllers available at root
├── cgroup.subtree_control    ← controllers delegated to children
├── cgroup.procs              ← PIDs in root cgroup
├── cpu.stat                  ← cpu.weight and memory.max exist only below the root
├── memory.stat
│
├── system.slice/
│   ├── cgroup.procs
│   ├── cpu.weight
│   ├── memory.max
│   └── nginx.service/
│       ├── cgroup.procs
│       ├── cpu.weight
│       └── memory.max
│
└── user.slice/
    └── user-1000.slice/
        ├── cgroup.procs
        └── memory.max

Unified Hierarchy Diagram

cgroup v2 tree
/sys/fs/cgroup/   (root cgroup)
│
├── system.slice/
│   ├── docker-<id>.scope/    ← container A
│   │   ├── cpu.weight = 100
│   │   ├── memory.max = 512M
│   │   ├── io.max = ...
│   │   └── pids.max = 100
│   │
│   └── docker-<id>.scope/    ← container B
│       ├── cpu.weight = 200  (gets 2x CPU share relative to A)
│       ├── memory.max = 1G
│       └── pids.max = 50
│
└── user.slice/
    └── (interactive sessions)

Key Rule: No-Internal-Process Constraint

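In cgroup v2, a non-root cgroup can distribute controller resources to its children only when it has no processes of its own. In practice, a cgroup that enables controllers in cgroup.subtree_control holds either processes or child cgroups, but not both, so processes live in the leaves. This keeps hierarchy reasoning clean: children never compete with their parent's own processes, limits are enforced at every level, and usage is aggregated upward.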
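As a rough illustration, the constraint can be checked on a toy in-memory tree. This sketch assumes the rule as the kernel documentation phrases it: a non-root cgroup may enable controllers in cgroup.subtree_control only if it holds no processes itself. The Cgroup class and the cgroup names are hypothetical, and threaded controllers (which are exempt) are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    procs: list = field(default_factory=list)           # PIDs directly in this cgroup
    subtree_control: set = field(default_factory=set)   # controllers enabled for children
    children: list = field(default_factory=list)
    is_root: bool = False


def violations(cg):
    """Names of non-root cgroups that hold processes while enabling controllers for children."""
    bad = []
    if not cg.is_root and cg.procs and cg.subtree_control:
        bad.append(cg.name)
    for child in cg.children:
        bad.extend(violations(child))
    return bad


leaf = Cgroup("leaf.scope", procs=[300])
tree = Cgroup("/", procs=[1], subtree_control={"memory"}, is_root=True, children=[
    Cgroup("system.slice", subtree_control={"memory"}, children=[
        Cgroup("nginx.service", procs=[100]),
        Cgroup("bad.slice", procs=[200], subtree_control={"cpu"}, children=[leaf]),
    ]),
])

print(violations(tree))
```

The root is exempt, so only bad.slice is reported. On a real system the kernel refuses the offending write instead of letting the tree reach this state.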


Cgroup v2 Key Controllers

CPU Controller

The CPU controller in v2 is configured through two files and reports through a third:

cpu.weight (1–10000, default 100): A proportional weight used by the kernel's fair scheduler (CFS). Among sibling cgroups that are all runnable and competing for CPU, a cgroup with weight 200 gets twice as much CPU time as one with weight 100; the share of an idle sibling is redistributed to the others. This is the replacement for v1's cpu.shares.
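The proportional rule is plain arithmetic: under full contention each sibling's share is its weight divided by the sum of the runnable siblings' weights. A small sketch, with made-up cgroup names, and ignoring everything the real scheduler does beyond the ratio:

```python
def cpu_shares(weights):
    """Expected CPU fraction per sibling cgroup when all of them are runnable."""
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"{name}: cpu.weight must be in 1..10000")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = cpu_shares({"container_a": 100, "container_b": 200})
for name, fraction in shares.items():
    print(f"{name}: {fraction:.1%}")
```

With weights 100 and 200 the split is one third to two thirds; adding a third runnable sibling would shrink both shares without changing their ratio.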

cpu.max (format: quota period, both in microseconds): Hard rate limiting. "200000 1000000" means the cgroup can use at most 200ms of CPU time per 1000ms period (20% of one CPU). "max 1000000" means unlimited; the default is "max 100000". This is the replacement for v1's cpu.cfs_quota_us and cpu.cfs_period_us.

# Give container at most 1.5 CPUs
echo "150000 100000" > /sys/fs/cgroup/mycontainer/cpu.max

# Set CPU weight to double the default
echo "200" > /sys/fs/cgroup/mycontainer/cpu.weight
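Converting between a CPU count and the quota/period pair is easy to get wrong by a factor of ten, so a helper can be worthwhile. The sketch below is one way to do it; the function names are illustrative and not part of any library.

```python
def format_cpu_max(cpus, period_us=100_000):
    """Build a cpu.max value; cpus=None means no limit."""
    if cpus is None:
        return f"max {period_us}"
    return f"{round(cpus * period_us)} {period_us}"


def parse_cpu_max(text):
    """Return the limit as a number of CPUs, or None when unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(format_cpu_max(1.5))
print(parse_cpu_max("200000 1000000"))
print(parse_cpu_max("max 100000"))
```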

cpu.stat: Reports usage statistics including usage_usec, user_usec, and system_usec, plus throttling counters (nr_periods, nr_throttled, throttled_usec) when the CPU controller is enabled.
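cpu.stat is a flat "key value" file, so it parses into a dictionary in one line. The sketch below derives the fraction of enforcement periods in which the cgroup was throttled; the sample numbers are invented, and real files may carry additional keys.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines, as found in cpu.stat, into a dict of ints."""
    return {key: int(value)
            for key, value in (line.split() for line in text.strip().splitlines())}


def throttle_ratio(stat):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stat.get("nr_periods", 0)
    return stat.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical cpu.stat contents.
SAMPLE = """\
usage_usec 52000000
user_usec 40000000
system_usec 12000000
nr_periods 1000
nr_throttled 150
throttled_usec 7300000
"""

stat = parse_flat_keyed(SAMPLE)
ratio = throttle_ratio(stat)
print(f"throttled in {ratio:.0%} of periods")
```

Because the counters are cumulative since cgroup creation, a monitoring script would normally take two readings and compute the ratio over the difference.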

Memory Controller

memory.max: Hard memory limit. When usage reaches this limit, the kernel first tries to reclaim pages from the cgroup; if reclaim cannot keep usage under the limit, the OOM killer is invoked within the cgroup to kill a process. Setting it to max means unlimited.

memory.high: Memory throttle limit (a soft limit). When a cgroup exceeds this threshold, its processes are throttled and put under heavy reclaim pressure (swapping out, dropping caches). Processes are not killed but may slow significantly. This is the preferred throttle mechanism for containers: kill only as a last resort.

memory.swap.max: Limits swap usage for the cgroup. Setting to 0 prevents any swap usage (common in containers to make memory pressure explicit).

memory.current: Read-only, current memory usage of the cgroup in bytes.

memory.events: Statistics about memory events:

low 0
high 42
max 3
oom 0
oom_kill 0

The high counter indicates how many times the high threshold was exceeded — useful for detecting memory pressure before OOM kills.
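Since the counters in memory.events only ever increase, the useful signal is the change between two readings. A minimal sketch, assuming an earlier snapshot with invented values and the listing above as the later one:

```python
def parse_events(text):
    """Parse memory.events ('key value' per line) into a dict of ints."""
    return {key: int(value)
            for key, value in (line.split() for line in text.strip().splitlines())}


def events_delta(before, after):
    """Counters that changed between two snapshots, with their increase."""
    return {key: after[key] - before.get(key, 0)
            for key in after if after[key] != before.get(key, 0)}


EARLIER = "low 0\nhigh 40\nmax 2\noom 0\noom_kill 0\n"   # hypothetical earlier reading
LATER = "low 0\nhigh 42\nmax 3\noom 0\noom_kill 0\n"     # the listing shown above

delta = events_delta(parse_events(EARLIER), parse_events(LATER))
print(delta)
```

A rising high count with oom_kill still at zero is the early-warning case: the cgroup is being throttled but nothing has been killed yet.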

memory.stat: Detailed breakdown including anon, file, kernel, slab, sock memory, swap, and more.

IO Controller

io.max (format: MAJOR:MINOR rbps=N wbps=N riops=N wiops=N): Hard rate limit per block device.

# Limit /dev/sda (8:0) to 100MB/s read, 50MB/s write
echo "8:0 rbps=104857600 wbps=52428800" > /sys/fs/cgroup/mycontainer/io.max
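The byte counts in that line are 100 and 50 multiplied by 1024 × 1024. A small helper can build the string and reject misspelled keys before anything is written; this is only a sketch and the function name is made up.

```python
MIB = 1024 * 1024
KEYS = ("rbps", "wbps", "riops", "wiops")


def io_max_line(major, minor, **limits):
    """Build one io.max line for a block device; values are ints or 'max'."""
    unknown = set(limits) - set(KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    parts = [f"{key}={limits[key]}" for key in KEYS if key in limits]
    return " ".join([f"{major}:{minor}", *parts])


line = io_max_line(8, 0, rbps=100 * MIB, wbps=50 * MIB)
print(line)
```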

io.weight (format: default N or MAJOR:MINOR N): Proportional IO weight, analogous to cpu.weight.

io.stat: Per-device counters of bytes and I/O operations read, written, and discarded (rbytes, wbytes, rios, wios, dbytes, dios).

PID Controller

pids.max: Maximum number of processes+threads that can exist in the cgroup. Essential for preventing fork bombs:

echo "100" > /sys/fs/cgroup/mycontainer/pids.max

pids.current: Current number of processes and threads in the cgroup and its descendants.

Pressure Stall Information (PSI)

PSI quantifies resource pressure as the share of time that tasks are stalled waiting for a resource. System-wide figures are available under /proc/pressure/; cgroup v2 adds per-cgroup files cpu.pressure, memory.pressure, and io.pressure:

# cat /sys/fs/cgroup/mycontainer/memory.pressure
some avg10=0.00 avg60=0.23 avg300=1.47 total=8473291
full avg10=0.00 avg60=0.11 avg300=0.72 total=3921847
  • some: Percentage of time at least one task was stalled on this resource
  • full: Percentage of time ALL non-idle tasks were stalled at once (complete resource starvation)
  • avg10/60/300: Running averages over 10s, 60s, and 300s windows
  • total: Cumulative stall time in microseconds

PSI can be used to trigger proactive reclaim or alerting before OOM events. Facebook's oomd (Out-of-Memory Daemon) uses PSI to kill cgroups under memory pressure before the kernel's OOM killer fires, with more context-aware decisions.
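For alerting, the pressure file has to be turned into numbers first. The sketch below parses the two-line format shown above, using that same sample output; the function name is illustrative.

```python
def parse_pressure(text):
    """Parse a PSI file into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=") for field in fields)
        # avg* values are percentages; total is cumulative microseconds.
        result[kind] = {key: int(value) if key == "total" else float(value)
                        for key, value in values.items()}
    return result


SAMPLE = """\
some avg10=0.00 avg60=0.23 avg300=1.47 total=8473291
full avg10=0.00 avg60=0.11 avg300=0.72 total=3921847
"""

pressure = parse_pressure(SAMPLE)
print(pressure["some"]["avg300"], pressure["full"]["total"])
```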


Cgroup v2 Hierarchy Diagram

Resource flow through cgroup v2 hierarchy:

Process makes allocation request
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ Kernel resource subsystem (e.g., mm for memory)        │
│                                                        │
│  Walk cgroup hierarchy from leaf to root               │
│  for each ancestor: check controller limits            │
│  if any limit exceeded: apply controller action        │
│    - memory.max exceeded → OOM kill                    │
│    - memory.high exceeded → synchronous reclaim        │
│    - cpu.max exceeded → throttle (task placed to sleep)│
│    - pids.max exceeded → EAGAIN on fork()              │
└────────────────────────────────────────────────────────┘
         │
         ▼
Request granted (or denied/throttled)

Accounting: usage flows UP the hierarchy
┌─────────────────────┐
│   root cgroup       │  usage = sum of all below (the root itself
│   total usage       │  exposes no memory.current file)
│   = 8192 MB         │
│                     │
│  ┌───────────────┐  │
│  │ system.slice  │  │  memory.current = 4096 MB
│  │               │  │
│  │ ┌───────────┐ │  │
│  │ │ container │ │  │  memory.current = 2048 MB
│  │ └───────────┘ │  │
│  └───────────────┘  │
└─────────────────────┘
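The two diagrams describe one walk: a charge is checked against every ancestor's limit, and if it is accepted, usage is recorded at every level. The toy model below shows only that shape. The class and names are hypothetical, and where it simply refuses the charge the kernel would attempt reclaim and then invoke the OOM killer.

```python
class Cgroup:
    def __init__(self, name, parent=None, memory_max=None):
        self.name = name
        self.parent = parent
        self.memory_max = memory_max   # None means "max" (unlimited)
        self.usage = 0

    def ancestors(self):
        """Yield this cgroup, then each parent up to the root."""
        cg = self
        while cg is not None:
            yield cg
            cg = cg.parent


def try_charge(leaf, amount):
    """Charge amount to leaf and all ancestors; return the name of the blocking level, if any."""
    for cg in leaf.ancestors():
        if cg.memory_max is not None and cg.usage + amount > cg.memory_max:
            return cg.name
    for cg in leaf.ancestors():
        cg.usage += amount
    return None


root = Cgroup("root")
system = Cgroup("system.slice", parent=root, memory_max=100)
a = Cgroup("container-a", parent=system, memory_max=80)
b = Cgroup("container-b", parent=system)

first = try_charge(a, 60)    # fits under both container-a and system.slice
second = try_charge(b, 50)   # container-b is unlimited, but system.slice would overflow
print(first, second, system.usage, root.usage)
```

The second charge fails at system.slice even though container-b has no limit of its own, which is the practical meaning of limits being enforced at every level.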

cgroup v1 vs v2 Comparison

Aspect              cgroup v1                    cgroup v2
Hierarchy           One per controller           Single unified
Process placement   Can differ per controller    Same position for all
Delegation          Complex, per-hierarchy       Clean, unified
Kernel version      2.6.24 (2008)                4.5+ (2016), mature 5.x
Docker default      v1 on hosts booted with v1   v2 on hosts booted with v2 (Docker Engine 20.10+)
Kubernetes default  v1 historically              v2 support stable since k8s 1.25
Per-cgroup PSI      No                           Yes

Production Examples

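Kubernetes QoS classes map onto cgroup v2 memory files. The kubelet derives memory.max from container limits; with the MemoryQoS feature gate (introduced as alpha in 1.22, cgroup v2 only) it also sets memory.min from requests:

  • Guaranteed pods (requests == limits): memory.min = requests, so memory.max == memory.min
  • Burstable pods: memory.min = requests, memory.max = limits
  • BestEffort pods: no memory.min, no memory.max (use whatever is free)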

Docker resource constraints:

docker run --cpus="1.5" --memory="512m" --memory-swap="512m" nginx
# --cpus="1.5"    → cpu.max = "150000 100000"
# --memory="512m" → memory.max = 536870912
# --memory-swap equals --memory → memory.swap.max = 0 (no swap)
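The mapping in those comments is simple arithmetic and can be reproduced as a sketch. This mirrors the comments above and makes no claim about Docker's internal code; the function names and the size parser are illustrative.

```python
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Parse sizes such as '512m' or '1g' into bytes (binary units)."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def docker_to_cgroup_v2(cpus, memory, memory_swap, period_us=100_000):
    """Translate --cpus, --memory and --memory-swap into cgroup v2 file values."""
    mem = parse_size(memory)
    total = parse_size(memory_swap)   # --memory-swap counts memory plus swap
    return {
        "cpu.max": f"{round(cpus * period_us)} {period_us}",
        "memory.max": str(mem),
        "memory.swap.max": str(total - mem),
    }


values = docker_to_cgroup_v2(1.5, "512m", "512m")
print(values)
```

The subtraction is the part worth remembering: --memory-swap is a combined total, while memory.swap.max in v2 counts swap alone.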

Systemd unit file cgroup limits:

[Service]
MemoryMax=2G
MemoryHigh=1.5G
CPUWeight=200
IOWeight=100
TasksMax=512

Debugging Notes

  • Check which cgroup version is in use: stat -fc %T /sys/fs/cgroup/ — returns tmpfs for v1, cgroup2fs for v2.
  • Find a process's cgroup: cat /proc/<PID>/cgroup — v2 shows a single line starting with 0::.
  • Memory limit not working: Verify the memory controller is enabled at every level of the hierarchy. In v2, controllers must be listed in cgroup.subtree_control at each level.
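  • OOM kills hard to find in dmesg: The kernel logs cgroup OOM kills with a "Memory cgroup out of memory" message, but the ring buffer may have rotated, and kills made by a userspace agent such as oomd appear only in that agent's log or the systemd journal. The oom_kill counter in memory.events is the reliable per-cgroup record of kernel OOM kills.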
  • Throttled CPU time: Check cpu.stat for throttled_usec — high values indicate the container is hitting cpu.max frequently, degrading latency.
  • PSI monitoring: Register a trigger by writing a threshold such as "some 150000 1000000" (150ms of stall within any 1s window) to a pressure file, then poll() the same file descriptor for POLLPRI to be notified when pressure crosses it.
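PSI notifications work through triggers: a process writes a stall threshold and window to the pressure file and then waits for POLLPRI on the same descriptor. The following is a minimal sketch of that sequence using Python's select.poll. The function names are made up, the path is supplied by the caller, and registering a trigger may require elevated privileges depending on the kernel version.

```python
import os
import select


def psi_trigger(kind, stall_ms, window_ms):
    """Build a trigger string: '<some|full> <stall us> <window us>'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    return f"{kind} {stall_ms * 1000} {window_ms * 1000}"


def wait_for_pressure(path, trigger, timeout_ms=None):
    """Register trigger on a PSI file and block until it fires; True if it fired."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The kernel documentation's example writes the string with its trailing NUL.
        os.write(fd, trigger.encode() + b"\0")
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _fd, mask in poller.poll(timeout_ms):
            if mask & select.POLLERR:
                raise OSError("PSI file went away (cgroup removed?)")
            if mask & select.POLLPRI:
                return True
        return False   # timed out without crossing the threshold
    finally:
        os.close(fd)


trigger = psi_trigger("some", 150, 1000)
print(trigger)
```

A caller would pass something like /sys/fs/cgroup/mycontainer/memory.pressure as path and loop on wait_for_pressure, logging each time it returns True.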

Security Implications

  • pids.max is critical: Without a PID limit, a compromised container can fork-bomb the host, exhausting the kernel PID table.
  • memory.swap.max=0: Containers should generally disable swap to prevent memory pressure from a noisy neighbor silently degrading performance via swap activity.
  • cgroup escape via delegation bugs: Historically, delegation of cgroup subtree control had privilege escalation bugs, where a container admin could escape resource limits by manipulating their delegated subtree. v2's stricter delegation model, which confines process migration to the delegated subtree, is designed to close off this class of problem.
  • Detecting cgroup limits from inside container: A process can read its own limits from /sys/fs/cgroup/ (if mounted). Some runtimes (the JVM, for example) read cgroup limits to auto-configure heap size instead of using total host memory.

Performance Implications

  • CPU throttling latency: When a container exhausts its cpu.max quota, its tasks are throttled until the next period begins, adding latency spikes. The period length (default 100ms) bounds how long a single stall can last. For latency-sensitive services, set a smaller period: "25000 50000" (50% of one CPU, 50ms period).
  • Memory reclaim on memory.high: Hitting the soft limit causes synchronous reclaim in the allocation path, adding latency. Watch memory.events high counter.
  • Shared page accounting: Memory is charged to the cgroup that first faults it in. For file-backed pages shared between containers, such as shared libraries, the first container to touch a page gets charged and the others use it without being charged. This can cause unexpected memory accounting discrepancies.
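  • OOM kill selection: Under v2, an OOM kill within a cgroup picks the process with the highest badness score, which is based mainly on memory usage and shifted by oom_score_adj. Containers should set oom_score_adj to reflect process priority, or set memory.oom.group to 1 so that the whole cgroup is killed as a unit.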

Failure Modes

Failure             Symptom                                    Diagnosis
OOM kill loop       Container repeatedly exits with code 137   cat memory.events shows oom_kill incrementing; increase memory.max
CPU starvation      High latency, low throughput               cpu.stat shows high throttled_usec; raise cpu.max or cpu.weight
Fork bomb           Host becomes unresponsive                  pids.max not set; current count in pids.current near system limit
IO starvation       Application slow, IO wait high             io.stat shows low throughput vs queue depth; check io.max and io.weight
cgroup leak         Zombie cgroup directories remain           Container runtime bug; clean up with rmdir after all processes exit
Missing controller  Limits silently ignored                    Controller not in cgroup.subtree_control at parent level

Modern Usage

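  • systemd as cgroup manager: On modern Linux, systemd manages the cgroup hierarchy. Container runtimes should cooperate with systemd (asking it over D-Bus to create transient scope units, or working inside a subtree delegated with Delegate=yes) rather than creating cgroups independently.
  • cgroup-aware applications: The JVM (since JDK 10, with cgroup v2 support added in later releases) and Node.js read cgroup memory limits to set default heap sizes instead of reading total physical memory. The Go runtime does not read the limit on its own, but since Go 1.19 a soft memory limit can be supplied through GOMEMLIMIT.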
  • cgroupv2 and eBPF: eBPF programs can be attached to cgroup hooks for per-container network policy, socket options, and device access control without iptables.
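  • Nested virtualization: A VM whose hypervisor process runs inside a container is subject to that container's cgroup v2 limits like any other process tree, which gives nested resource management.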

Future Directions

  • Memory protection tiers: More granular memory.min (guaranteed) and memory.low (soft protection) integration with kernel reclaim pressure propagation
  • CPU isolation improvements: cpuset controller integration with v2 hierarchy for NUMA-aware container placement
  • Cross-cgroup memory sharing: Proposals for charging shared anonymous memory to a designated cgroup rather than the first-toucher
  • io.latency controller: Latency-based IO control (already partially implemented) to guarantee IO latency rather than just rate-limiting bandwidth

Exercises

  1. Create a cgroup v2 group manually under /sys/fs/cgroup/test/. Enable the memory controller. Set memory.max to 50MB. Run a process in it using echo $$ > cgroup.procs. Then try to allocate more than 50MB and observe the OOM kill.
  2. Write a script that reads cpu.stat for a container's cgroup, computes the throttle percentage, and alerts if it exceeds 10%.
  3. Set up PSI monitoring: write a program that registers a trigger on memory.pressure (for example 100ms of "some" stall per 1s window, i.e. 10%), uses poll() to wait for the notification, and logs a warning.
  4. Compare memory accounting: run two containers that use the same base image. Inspect memory.current for each. Explain why the sum might exceed physical memory used.
  5. Use systemd-cgls to visualize the cgroup tree on a running Kubernetes node. Identify which cgroup represents a specific running pod.
  6. Experiment with cpu.weight: run two CPU-bound processes in different cgroups with weights 100 and 400. Measure their actual CPU shares using cpu.stat usage_usec over 10 seconds.

References

  • cgroups(7) — Linux man page
  • Tejun Heo, "Control Group v2" — kernel documentation: Documentation/admin-guide/cgroup-v2.rst
  • Paul Menage's original cgroup design documentation
  • Facebook oomd: github.com/facebookincubator/oomd
  • systemd.resource-control(5) — systemd resource control man page
  • Linux kernel source: kernel/cgroup/, mm/memcontrol.c, kernel/sched/fair.c
  • Brendan Gregg, "Linux Performance" — cgroup observability chapters