Control Groups (cgroups)
Technical Overview
Control Groups (cgroups) are a Linux kernel mechanism for hierarchically organizing processes and applying resource limits, resource accounting, and resource control to those groups. Where namespaces answer "what can a process see?", cgroups answer "what resources can a process use?". Together they form the two-pillar foundation of Linux containers.
A cgroup is simultaneously: a grouping of processes (organized as a hierarchy), a set of resource controllers (each managing a specific resource type), and a set of limits and parameters applied per group. The kernel enforces these limits in the relevant subsystems — the memory controller intercepts page fault and allocation paths, the CPU controller adjusts scheduler weights, the IO controller hooks into the block layer.
There are two generations of cgroups: cgroup v1 (the original design, now considered legacy) and cgroup v2 (unified hierarchy, current standard). Most modern Linux distributions and container runtimes default to cgroup v2.
Prerequisites
- Process scheduling concepts (CPU time, scheduling weights)
- Virtual memory management (page faults, reclaim, swap)
- Block I/O subsystem (block devices, request queues)
- /proc and /sys filesystem navigation
Historical Context
Control groups were developed at Google by Paul Menage and Rohit Seth, beginning around 2006 under the name "process containers." The name was changed to "control groups" to avoid confusion with Linux containers. The first version was merged into Linux 2.6.24 in January 2008.
Google had been running internal container-based infrastructure since the early 2000s (the Borg system). Cgroups were the resource management primitive that made Borg's workload density possible — without cgroups, you cannot safely run hundreds of jobs on a single machine because any one job could exhaust CPU, memory, or IO and degrade all others.
Cgroup v1's architectural problems (multiple independent hierarchies, inconsistent semantics) were well-known. Tejun Heo led the effort to redesign from scratch as cgroup v2, which was merged in Linux 4.5 (2016) but took until Linux 5.x for full controller coverage and production adoption.
Cgroup v1 Architecture
In cgroup v1, each resource controller has its own independent hierarchy mounted under /sys/fs/cgroup/<controller>/:
/sys/fs/cgroup/
├── memory/                      ← memory controller hierarchy root
│   ├── tasks
│   ├── memory.limit_in_bytes
│   └── myapp/
│       ├── tasks
│       └── memory.limit_in_bytes
├── cpu/                         ← cpu controller hierarchy root
│   ├── tasks
│   ├── cpu.shares
│   └── myapp/
│       ├── tasks
│       └── cpu.shares
├── cpuacct/
├── blkio/
├── pids/
└── ...
The critical problem: a process can be in different positions in the memory hierarchy and the CPU hierarchy. The hierarchies are completely independent. This means:
- A process must be attached to each relevant hierarchy separately
- There is no unified way to express "this group of processes gets 2 CPU shares and 512MB memory"
- The delegation model is broken: a parent can give a child cgroup control of CPU, but the memory limits are managed in a separate tree with separate delegation
This became untenable as the number of controllers grew.
Cgroup v2 Unified Hierarchy
Cgroup v2 uses a single unified hierarchy at /sys/fs/cgroup/. All controllers are managed under one tree:
/sys/fs/cgroup/
├── cgroup.controllers           ← controllers available at root
├── cgroup.subtree_control       ← controllers delegated to children
├── cgroup.procs                 ← PIDs in root cgroup
│                                  (cpu.weight and memory.max exist only on non-root cgroups)
├── system.slice/
│   ├── cgroup.procs
│   ├── cpu.weight
│   ├── memory.max
│   └── nginx.service/
│       ├── cgroup.procs
│       ├── cpu.weight
│       └── memory.max
│
└── user.slice/
    └── user-1000.slice/
        ├── cgroup.procs
        └── memory.max
Unified Hierarchy Diagram
cgroup v2 tree
/sys/fs/cgroup/ (root cgroup)
│
├── system.slice/
│   ├── docker-<id>.scope/       ← container A
│   │   ├── cpu.weight = 100
│   │   ├── memory.max = 512M
│   │   ├── io.max = ...
│   │   └── pids.max = 100
│   │
│   └── docker-<id>.scope/       ← container B
│       ├── cpu.weight = 200     (gets 2x CPU share relative to A)
│       ├── memory.max = 1G
│       └── pids.max = 50
│
└── user.slice/
    └── (interactive sessions)
Key Rule: No-Internal-Process Constraint
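In cgroup v2, a non-root cgroup can distribute controller resources to its children only if it has no processes of its own. In practice, once controllers are enabled in a cgroup's cgroup.subtree_control, processes live in the leaves and interior cgroups hold only children. This keeps hierarchy reasoning clean: processes never compete with their parent's child cgroups, limits can be set at any level, and usage is aggregated upward.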
Cgroup v2 Key Controllers
CPU Controller
The CPU controller in v2 uses two files:
cpu.weight (1–10000, default 100): A proportional weight used by the CFS (Completely Fair Scheduler). A cgroup with weight 200 gets twice as much CPU time as one with weight 100, when both are runnable. This is the replacement for v1's cpu.shares.
cpu.max (format: quota period): Hard rate limiting. "200000 1000000" means the cgroup can use at most 200ms of CPU per 1000ms period (20% of one CPU). "max 1000000" means unlimited. This is the replacement for v1's cpu.cfs_quota_us and cpu.cfs_period_us.
# Give container at most 1.5 CPUs
echo "150000 100000" > /sys/fs/cgroup/mycontainer/cpu.max
# Set CPU weight to double the default
echo "200" > /sys/fs/cgroup/mycontainer/cpu.weight
cpu.stat: Reports usage statistics including usage_usec, user_usec, system_usec, and throttling information.
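The quota/period arithmetic behind cpu.max can be checked with a small helper. This is a minimal sketch, not part of any cgroup tooling: the function name cpu_max_to_cpus is hypothetical, and it assumes the value has the "QUOTA PERIOD" form described above.

```shell
#!/bin/sh
# cpu_max_to_cpus VALUE: convert a cpu.max value ("QUOTA PERIOD") into a CPU count.
# Hypothetical helper for illustration. Prints "unlimited" when the quota is "max".
cpu_max_to_cpus() {
    # Split the single argument into quota and period on whitespace.
    set -- $1
    quota=$1
    period=${2:-100000}   # assumption: fall back to a 100000 us period if absent
    if [ "$quota" = "max" ]; then
        echo "unlimited"
    else
        # Quota and period are both in microseconds, so the ratio is a CPU count.
        awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
    fi
}
```

For a live cgroup, the helper could be fed the file content directly, for example cpu_max_to_cpus "$(cat /sys/fs/cgroup/mycontainer/cpu.max)".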
Memory Controller
memory.max: Hard memory limit. When a cgroup's usage reaches this limit and cannot be brought back down by reclaim, the OOM killer is invoked within the cgroup to kill a process. Setting it to max means unlimited.
memory.high: Throttling (soft) memory limit. When a cgroup exceeds this threshold, its processes are throttled and put under heavy reclaim pressure (swapping out, dropping caches). Processes are not killed but may slow significantly. This is the preferred throttle mechanism for containers: kill only as a last resort.
memory.swap.max: Limits swap usage for the cgroup. Setting to 0 prevents any swap usage (common in containers to make memory pressure explicit).
memory.current: Read-only, current memory usage of the cgroup in bytes.
memory.events: Statistics about memory events:
low 0
high 42
max 3
oom 0
oom_kill 0
The high counter indicates how many times the high threshold was exceeded — useful for detecting memory pressure before OOM kills.
memory.stat: Detailed breakdown including anon, file, kernel, slab, sock memory, swap, and more.
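The memory.events counters shown above are plain "key value" lines, so a monitoring script can pull out a single counter with awk. The sketch below is illustrative only; memory_event_count is a hypothetical name, and the file path is passed in so it can be tried on a copy of the file.

```shell
#!/bin/sh
# memory_event_count FILE KEY: print the counter for KEY (low, high, max, oom,
# oom_kill) from a memory.events-style file. Hypothetical helper.
# Returns non-zero if the key is not present.
memory_event_count() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1 }
        END { if (!found) exit 1 }
    ' "$1"
}
```

The counters are cumulative, so an alert would normally compare two samples taken some time apart rather than a single reading.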
IO Controller
io.max (format: MAJOR:MINOR rbps=N wbps=N riops=N wiops=N): Hard rate limit per block device.
# Limit /dev/sda (8:0) to 100MB/s read, 50MB/s write
echo "8:0 rbps=104857600 wbps=52428800" > /sys/fs/cgroup/mycontainer/io.max
io.weight (format: default N or MAJOR:MINOR N): Proportional IO weight, analogous to cpu.weight.
io.stat: Per-device read/write bytes and IOPS counters.
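Because io.max takes raw bytes per second, it is easy to mistype a limit. A small helper can build the line from MiB/s figures; this is a sketch under the assumption that only rbps and wbps are being set, and io_max_line is a hypothetical name.

```shell
#!/bin/sh
# io_max_line DEV READ_MIB WRITE_MIB: print an io.max line that limits the
# device DEV (MAJOR:MINOR) to the given read and write rates in MiB/s.
# Hypothetical helper; riops and wiops are left untouched.
io_max_line() {
    dev=$1
    rbps=$(( $2 * 1024 * 1024 ))   # MiB/s to bytes/s
    wbps=$(( $3 * 1024 * 1024 ))
    printf '%s rbps=%s wbps=%s\n' "$dev" "$rbps" "$wbps"
}
```

Usage would look like io_max_line 8:0 100 50 > /sys/fs/cgroup/mycontainer/io.max, which reproduces the 100MB/s read, 50MB/s write example above. The MAJOR:MINOR pair for a device can be looked up with lsblk -dno MAJ:MIN /dev/sda.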
PID Controller
pids.max: Maximum number of processes+threads that can exist in the cgroup. Essential for preventing fork bombs:
echo "100" > /sys/fs/cgroup/mycontainer/pids.max
pids.current: Current number of processes and threads in the cgroup and its descendants.
Pressure Stall Information (PSI)
PSI is a v2 feature that quantifies resource pressure as a fraction of time that tasks are stalled waiting for a resource. Each resource has a pressure file in cgroup v2:
# cat /sys/fs/cgroup/mycontainer/memory.pressure
some avg10=0.00 avg60=0.23 avg300=1.47 total=8473291
full avg10=0.00 avg60=0.11 avg300=0.72 total=3921847
- some: Share of time at least one task was stalled on this resource
- full: Share of time ALL tasks were stalled (complete resource starvation)
- avg10/60/300: Exponential moving averages over 10s, 60s, 300s windows, reported as percentages
PSI can be used to trigger proactive reclaim or alerting before OOM events. Facebook's oomd (Out-of-Memory Daemon) uses PSI to kill cgroups under memory pressure before the kernel's OOM killer fires, with more context-aware decisions.
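For simple alerting, the pressure file format shown above can be parsed directly. The following is a rough sketch; psi_value and psi_exceeds are hypothetical names, and the approach is plain periodic sampling rather than what oomd itself does.

```shell
#!/bin/sh
# psi_value FILE KIND FIELD: print FIELD (avg10, avg60, avg300 or total) from
# the "some" or "full" line of a PSI pressure file. Hypothetical helper.
psi_value() {
    awk -v kind="$2" -v field="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == field) { print kv[2]; found = 1 }
            }
        }
        END { if (!found) exit 1 }
    ' "$1"
}

# psi_exceeds FILE KIND FIELD THRESHOLD: succeed (exit 0) if the value is
# strictly above THRESHOLD; exit 2 if the field cannot be read.
psi_exceeds() {
    value=$(psi_value "$1" "$2" "$3") || return 2
    awk -v v="$value" -v t="$4" 'BEGIN { exit !(v > t) }'
}
```

A cron job or loop could then run psi_exceeds /sys/fs/cgroup/mycontainer/memory.pressure some avg60 5 and log a warning when it succeeds.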
Cgroup v2 Hierarchy Diagram
Resource flow through cgroup v2 hierarchy:
Process makes allocation request
│
▼
┌────────────────────────────────────────────────────────┐
│ Kernel resource subsystem (e.g., mm for memory) │
│ │
│ Walk cgroup hierarchy from leaf to root │
│ for each ancestor: check controller limits │
│ if any limit exceeded: apply controller action │
│ - memory.max exceeded → OOM kill │
│ - memory.high exceeded → synchronous reclaim │
│ - cpu.max exceeded → throttle (task placed to sleep)│
│ - pids.max exceeded → EAGAIN on fork() │
└────────────────────────────────────────────────────────┘
│
▼
Request granted (or denied/throttled)
Accounting: usage flows UP the hierarchy
┌─────────────────────┐
│ root cgroup │ memory.current = sum of all below
│ memory.current │
│ = 8192 MB │
│ │
│ ┌───────────────┐ │
│ │ system.slice │ │ memory.current = 4096 MB
│ │ │ │
│ │ ┌───────────┐ │ │
│ │ │ container │ │ │ memory.current = 2048 MB
│ │ └───────────┘ │ │
│ └───────────────┘ │
└─────────────────────┘
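Since every ancestor's limit is checked on the walk from leaf to root, the tightest memory.max on that path is the one a process effectively runs under. The sketch below illustrates that walk in user space; effective_memory_max is a hypothetical helper, and the root directory is passed in explicitly so the function can be tried on a scratch directory tree.

```shell
#!/bin/sh
# effective_memory_max LEAF ROOT: print the smallest memory.max value found on
# the path from LEAF up to ROOT (inclusive). Prints "max" if no level sets a
# limit. Hypothetical helper; directories without a memory.max file are skipped.
effective_memory_max() {
    dir=$1
    root=$2
    best="max"
    while :; do
        if [ -r "$dir/memory.max" ]; then
            value=$(cat "$dir/memory.max")
            if [ "$value" != "max" ]; then
                if [ "$best" = "max" ] || [ "$value" -lt "$best" ]; then
                    best=$value
                fi
            fi
        fi
        # Stop at the hierarchy root, or at "/" or "." as a safety net.
        [ "$dir" = "$root" ] && break
        [ "$dir" = "/" ] && break
        [ "$dir" = "." ] && break
        dir=$(dirname "$dir")
    done
    echo "$best"
}
```

On a real system this might be called as effective_memory_max /sys/fs/cgroup/system.slice/nginx.service /sys/fs/cgroup.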
cgroup v1 vs v2 Comparison
| Aspect | cgroup v1 | cgroup v2 |
|---|---|---|
| Hierarchy | One per controller | Single unified |
| Process placement | Can differ per controller | Same position for all |
| Delegation | Complex, per-hierarchy | Clean, unified |
| Kernel version | 2.6.24 (2008) | 4.5+ (2016), mature 5.x |
| Docker default | v1 on hosts that mount v1 | v2 on hosts that mount v2 (Docker 20.10+; kernel 5.2+ recommended) |
| Kubernetes default | v1 historically | v2 since k8s 1.25 stable |
| PSI support | No | Yes |
Production Examples
Kubernetes QoS classes on cgroup v2 (memory.min is only written when the alpha MemoryQoS feature gate is enabled):
- Guaranteed pods (requests == limits): memory.min and memory.max set to the same value, so the full limit is protected
- Burstable pods: memory.min = requests, memory.max = limits
- BestEffort pods: no memory.min, no memory.max (use whatever is free)
Docker resource constraints:
docker run --cpus="1.5" --memory="512m" --memory-swap="512m" nginx
# --cpus="1.5" → cpu.max = "150000 100000"
# --memory="512m" → memory.max = 536870912
# --memory-swap equals --memory → memory.swap.max = 0 (no swap)
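The arithmetic behind the mapping above can be reproduced in a few lines. This is only an illustration of the conversions shown in the comments, not Docker's implementation; the function names are hypothetical, sizes are assumed to be whole numbers with an optional k, m, or g suffix, and the period is assumed to be 100000 microseconds unless given.

```shell
#!/bin/sh
# cpus_to_cpu_max CPUS [PERIOD_USEC]: print the cpu.max value for a CPU count.
cpus_to_cpu_max() {
    awk -v c="$1" -v p="${2:-100000}" \
        'BEGIN { printf "%d %d\n", int(c * p + 0.5), p }'
}

# size_to_bytes SIZE: convert a size such as 512m or 1g into bytes.
size_to_bytes() {
    num=${1%[bkmgBKMG]}            # strip a single trailing unit letter
    case $1 in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$num" ;;
    esac
}

# swap_max_bytes MEMORY MEMORY_SWAP: swap allowance is the combined
# memory+swap limit minus the memory limit.
swap_max_bytes() {
    echo $(( $(size_to_bytes "$2") - $(size_to_bytes "$1") ))
}
```

For the docker run line above, cpus_to_cpu_max 1.5, size_to_bytes 512m, and swap_max_bytes 512m 512m give the three values listed in its comments.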
Systemd unit file cgroup limits:
[Service]
MemoryMax=2G
MemoryHigh=1.5G
CPUWeight=200
IOWeight=100
TasksMax=512
Debugging Notes
- Check which cgroup version is in use: stat -fc %T /sys/fs/cgroup/ returns tmpfs for v1, cgroup2fs for v2.
- Find a process's cgroup: cat /proc/<PID>/cgroup shows a single line starting with 0:: on v2.
- Memory limit not working: Verify the memory controller is enabled at every level of the hierarchy. In v2, controllers must be listed in cgroup.subtree_control at each level.
- OOM kills not in dmesg: Under v2, cgroup-level OOM kills are counted in memory.events (oom_kill) and may appear in the systemd journal rather than in dmesg output.
- Throttled CPU time: Check cpu.stat for throttled_usec. High values indicate the container is hitting cpu.max frequently, degrading latency.
- PSI monitoring: Register a trigger by writing a stall threshold and window (for example, some 150000 1000000) to the PSI file, then poll() the same file descriptor for POLLPRI to get a callback when pressure exceeds the threshold.
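The first two checks in the list above lend themselves to scripting. As a minimal sketch, the helper below extracts the v2 path from a /proc/<PID>/cgroup-style file; cgroup_v2_path is a hypothetical name, and the file is passed as an argument so the function can be exercised on saved copies.

```shell
#!/bin/sh
# cgroup_v2_path FILE: print the cgroup v2 path from a /proc/<PID>/cgroup-style
# file. The v2 entry is the line with hierarchy ID 0 and an empty controller
# list ("0::<path>"). Returns non-zero if no such line exists (pure v1 host).
cgroup_v2_path() {
    awk -F: '
        $1 == "0" && $2 == "" { sub(/^0::/, ""); print; found = 1 }
        END { if (!found) exit 1 }
    ' "$1"
}
```

On a v2 host, ls "/sys/fs/cgroup$(cgroup_v2_path /proc/self/cgroup)" would then list the interface files for the current shell's cgroup. Inside a container with its own cgroup namespace the reported path is relative to that namespace's root.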
Security Implications
- pids.max is critical: Without a PID limit, a compromised container can fork-bomb the host, exhausting the kernel PID table.
- memory.swap.max=0: Containers should generally disable swap to prevent memory pressure from a noisy neighbor silently degrading performance via swap activity.
- cgroup escape via delegation bugs: Historically, delegation of cgroup subtree control had privilege escalation bugs: a container admin could escape resource limits by manipulating their delegated subtree. v2's stricter delegation model is designed to close off this class of bug.
- Detecting cgroup limits from inside container: A process can read its own limits from /sys/fs/cgroup/ (if mounted). Some runtimes (the JVM, for example) now read cgroup limits to auto-configure heap size instead of using total host memory.
Performance Implications
- CPU throttling latency: When a container hits cpu.max, tasks sleep until the next period, adding latency spikes. Period length (default 100ms) determines the maximum latency added. For latency-sensitive services, set a smaller period while keeping the same ratio: "25000 50000" (50% of one CPU, 50ms period).
- Memory reclaim on memory.high: Hitting the soft limit causes synchronous reclaim in the allocation path, adding latency. Watch the high counter in memory.events.
- Copy-on-write accounting: Memory is charged to the cgroup that faults it in. For shared libraries, the first container to use a page gets charged. This can cause unexpected memory accounting discrepancies.
- OOM kill selection: Under v2, an OOM kill within a cgroup selects the process with the highest badness score (oom_score), which is driven mainly by memory usage and shifted by oom_score_adj. Containers should set oom_score_adj appropriately for process priority.
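To put a number on CPU throttling, the share of enforcement periods that ended in throttling can be computed from cpu.stat. The sketch below relies on the nr_periods and nr_throttled fields, which the kernel documents for cpu.stat but which are not listed earlier in this text; throttle_percent is a hypothetical name.

```shell
#!/bin/sh
# throttle_percent FILE: print the percentage of enforcement periods in which
# the cgroup was throttled, based on nr_throttled / nr_periods from a
# cpu.stat-style file. Prints 0.0 when no periods have been recorded.
throttle_percent() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "0.0"
        }
    ' "$1"
}
```

Both counters are cumulative since the cgroup was created, so a monitoring script would usually sample twice and compute the ratio over the difference.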
Failure Modes
| Failure | Symptom | Diagnosis |
|---|---|---|
| OOM kill loop | Container repeatedly exits with code 137 | cat memory.events shows oom_kill incrementing; increase memory.max |
| CPU starvation | High latency, low throughput | cpu.stat shows high throttled_usec; raise cpu.max or cpu.weight |
| Fork bomb | Host becomes unresponsive | pids.max not set; current count in pids.current near system limit |
| IO starvation | Application slow, IO wait high | io.stat shows low throughput vs queue depth; check io.max and io.weight |
| cgroup leak | Zombie cgroup directories remain | Container runtime bug; clean up with rmdir after all processes exit |
| Missing controller | Limits silently ignored | Controller not in cgroup.subtree_control at parent level |
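The "Missing controller" row is the easiest one to check mechanically: a controller is usable in a cgroup only if it appears in that cgroup's cgroup.controllers file, which mirrors the parent's cgroup.subtree_control. A minimal sketch, with controller_enabled as a hypothetical name:

```shell
#!/bin/sh
# controller_enabled CGROUP_DIR CONTROLLER: succeed if CONTROLLER is listed in
# CGROUP_DIR/cgroup.controllers. Exit 1 if it is missing, 2 if the file cannot
# be read. Matches whole words only, so "cpu" does not match "cpuset".
controller_enabled() {
    [ -r "$1/cgroup.controllers" ] || return 2
    for c in $(cat "$1/cgroup.controllers"); do
        [ "$c" = "$2" ] && return 0
    done
    return 1
}
```

If the check fails for, say, memory, the controller can be enabled from the parent with echo "+memory" > cgroup.subtree_control, repeated at each level above the affected cgroup where it is missing.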
Modern Usage
- systemd as cgroup manager: On modern Linux, systemd manages the cgroup hierarchy. Container runtimes should cooperate with systemd (for example through the systemd cgroup driver, which asks systemd to create transient scope units) rather than creating cgroups independently.
- cgroup-aware applications: The JVM (since JDK 10) and Node.js read cgroup memory limits to set default heap sizes instead of reading total physical memory. The Go runtime does not read the memory limit on its own; since Go 1.19 a soft limit can be set explicitly through GOMEMLIMIT.
- cgroup v2 and eBPF: eBPF programs can be attached to cgroup hooks for per-container network policy, socket options, and device access control without iptables.
- Nested virtualization: VMs inside containers use cgroup v2 for nested resource management.
Future Directions
- Memory protection tiers: More granular memory.min (guaranteed) and memory.low (soft protection) integration with kernel reclaim pressure propagation
- CPU isolation improvements: cpuset controller integration with v2 hierarchy for NUMA-aware container placement
- Cross-cgroup memory sharing: Proposals for charging shared anonymous memory to a designated cgroup rather than the first-toucher
- io.latency controller: Latency-based IO control (already partially implemented) to guarantee IO latency rather than just rate-limiting bandwidth
Exercises
- Create a cgroup v2 group manually under /sys/fs/cgroup/test/. Enable the memory controller. Set memory.max to 50MB. Run a process in it using echo $$ > cgroup.procs. Then try to allocate more than 50MB and observe the OOM kill.
- Write a script that reads cpu.stat for a container's cgroup, computes the throttle percentage, and alerts if it exceeds 10%.
- Set up PSI monitoring: write a program that uses poll() on memory.pressure to receive a notification when some avg10 exceeds 10%, and logs a warning.
- Compare memory accounting: run two containers that use the same base image. Inspect memory.current for each. Explain why the sum might exceed physical memory used.
- Use systemd-cgls to visualize the cgroup tree on a running Kubernetes node. Identify which cgroup represents a specific running pod.
- Experiment with cpu.weight: run two CPU-bound processes in different cgroups with weights 100 and 400. Measure their actual CPU shares using usage_usec from cpu.stat over 10 seconds.
References
- cgroups(7): Linux man page
- Tejun Heo, "Control Group v2", kernel documentation: Documentation/admin-guide/cgroup-v2.rst
- Paul Menage's original cgroup design documentation
- Facebook oomd: github.com/facebookincubator/oomd
- systemd.resource-control(5): systemd resource control man page
- Linux kernel source: kernel/cgroup/, mm/memcontrol.c, kernel/sched/fair.c
- Brendan Gregg, "Linux Performance": cgroup observability chapters