Control Groups (cgroups)

Technical Overview

Control Groups (cgroups) are a Linux kernel mechanism for hierarchically organizing processes and applying resource limits, resource accounting, and resource control to those groups. Where namespaces answer "what can a process see?", cgroups answer "what resources can a process use?". Together they form the two-pillar foundation of Linux containers.

A cgroup is simultaneously: a grouping of processes (organized as a hierarchy), a set of resource controllers (each managing a specific resource type), and a set of limits and parameters applied per group. The kernel enforces these limits in the relevant subsystems — the memory controller intercepts page fault and allocation paths, the CPU controller adjusts scheduler weights, the IO controller hooks into the block layer.

There are two generations of cgroups: cgroup v1 (the original design, now considered legacy) and cgroup v2 (unified hierarchy, current standard). Most modern Linux distributions and container runtimes default to cgroup v2.

Prerequisites

Process scheduling concepts (CPU time, scheduling weights)
Virtual memory management (page faults, reclaim, swap)
Block I/O subsystem (block devices, request queues)
/proc and /sys filesystem navigation

Historical Context

Control groups were developed at Google by Paul Menage and Rohit Seth, beginning around 2006 under the name "process containers." In late 2007 the name was changed to "control groups" to avoid confusion with the other meanings of "container" in the kernel. The first version was merged into Linux 2.6.24 in January 2008.

Google had been running internal container-based infrastructure since the early 2000s (the Borg system). Cgroups were the resource management primitive that made Borg's workload density possible — without cgroups, you cannot safely run hundreds of jobs on a single machine because any one job could exhaust CPU, memory, or IO and degrade all others.

Cgroup v1's architectural problems (multiple independent hierarchies, inconsistent semantics across controllers) were well known. Tejun Heo led the redesign as cgroup v2, whose interface was declared stable in Linux 4.5 (2016). Controller coverage arrived gradually (the CPU controller in 4.15, cpuset in 5.0), so broad production adoption had to wait for the 5.x kernels.

Cgroup v1 Architecture

In cgroup v1, each resource controller has its own independent hierarchy mounted under /sys/fs/cgroup/<controller>/:

/sys/fs/cgroup/
├── memory/              ← memory controller hierarchy root
│   ├── tasks
│   ├── memory.limit_in_bytes
│   └── myapp/
│       ├── tasks
│       └── memory.limit_in_bytes
├── cpu/                 ← cpu controller hierarchy root
│   ├── tasks
│   ├── cpu.shares
│   └── myapp/
│       ├── tasks
│       └── cpu.shares
├── cpuacct/
├── blkio/
├── pids/
└── ...

The critical problem: a process can be in different positions in the memory hierarchy and the CPU hierarchy. The hierarchies are completely independent. This means:
- A process must be attached to each relevant hierarchy separately
- There is no unified way to express "this group of processes gets 2 CPU shares and 512MB memory"
- The delegation model is broken: a parent can give a child cgroup control of CPU, but the memory limits are managed in a separate tree with separate delegation

This became untenable as the number of controllers grew.

Cgroup v2 Unified Hierarchy

Cgroup v2 uses a single unified hierarchy at /sys/fs/cgroup/. All controllers are managed under one tree:

/sys/fs/cgroup/
├── cgroup.controllers        ← controllers available at root
├── cgroup.subtree_control    ← controllers delegated to children
├── cgroup.procs              ← PIDs in root cgroup
├── cpu.stat                  ← root exposes statistics; cpu.weight and memory.max exist only in non-root cgroups
│
├── system.slice/
│   ├── cgroup.procs
│   ├── cpu.weight
│   ├── memory.max
│   └── nginx.service/
│       ├── cgroup.procs
│       ├── cpu.weight
│       └── memory.max
│
└── user.slice/
    └── user-1000.slice/
        ├── cgroup.procs
        └── memory.max
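
Enabling a controller for a leaf cgroup means listing it in cgroup.subtree_control of every ancestor, from the root down. A minimal sketch in Python of that rule, assuming the standard /sys/fs/cgroup mount point. The helper name subtree_control_writes is hypothetical, and it only computes the writes rather than performing them:

```python
from pathlib import PurePosixPath


def subtree_control_writes(leaf, controller, root="/sys/fs/cgroup"):
    """Return (file, value) pairs needed to make `controller` usable in `leaf`.

    Every ancestor of the leaf, starting at the root, must list the
    controller in its cgroup.subtree_control. The leaf itself does not.
    """
    root_path = PurePosixPath(root)
    rel = PurePosixPath(leaf).relative_to(root_path)
    if not rel.parts:
        raise ValueError("leaf must be below the cgroup root")

    # Collect the ancestors: the root, then each directory above the leaf.
    ancestors = [root_path]
    for part in rel.parts[:-1]:
        ancestors.append(ancestors[-1] / part)

    return [(str(a / "cgroup.subtree_control"), f"+{controller}") for a in ancestors]


if __name__ == "__main__":
    leaf = "/sys/fs/cgroup/system.slice/nginx.service"
    for path, value in subtree_control_writes(leaf, "memory"):
        print(f"echo '{value}' > {path}")
```

Keeping the function side-effect free makes it easy to review the planned writes before applying them as root.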

Unified Hierarchy Diagram

cgroup v2 tree
/sys/fs/cgroup/   (root cgroup)
│
├── system.slice/
│   ├── docker-<id>.scope/    ← container A
│   │   ├── cpu.weight = 100
│   │   ├── memory.max = 512M
│   │   ├── io.max = ...
│   │   └── pids.max = 100
│   │
│   └── docker-<id>.scope/    ← container B
│       ├── cpu.weight = 200  (gets 2x CPU share relative to A)
│       ├── memory.max = 1G
│       └── pids.max = 50
│
└── user.slice/
    └── (interactive sessions)

Key Rule: No-Internal-Process Constraint

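In cgroup v2, a non-root cgroup that distributes domain controllers to its children (that is, has controllers listed in cgroup.subtree_control) cannot contain processes itself; processes live in the leaves. The root cgroup and threaded subtrees are exempt. This keeps hierarchy reasoning clean: children compete only with each other, never with processes attached to their parent. Limits are enforced at every level of the tree, and usage is aggregated upward.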
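
The constraint can be stated as two small predicates. This is a simplified sketch that ignores threaded cgroups; the function names are hypothetical, and the subtree_control argument stands for the text read from a cgroup's cgroup.subtree_control file:

```python
def may_hold_processes(is_root, subtree_control):
    """A non-root cgroup may hold processes only while it enables no
    controllers for its children (empty cgroup.subtree_control)."""
    return is_root or not subtree_control.split()


def may_enable_child_controllers(is_root, has_processes):
    """A non-root cgroup may enable controllers for its children only
    while it holds no processes of its own."""
    return is_root or not has_processes


if __name__ == "__main__":
    # An inner node that delegates cpu and memory cannot take processes.
    print(may_hold_processes(is_root=False, subtree_control="cpu memory"))
    # A leaf with an empty subtree_control can.
    print(may_hold_processes(is_root=False, subtree_control=""))
```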

Cgroup v2 Key Controllers

CPU Controller

The CPU controller in v2 uses two files:

cpu.weight (1–10000, default 100): A proportional weight used by the fair-class scheduler (CFS, replaced by EEVDF in Linux 6.6). A cgroup with weight 200 gets twice as much CPU time as a sibling with weight 100 when both are runnable and the CPU is contended. This is the replacement for v1's cpu.shares.
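
The proportional rule is simple arithmetic: under contention, each runnable sibling receives its weight divided by the sum of the runnable siblings' weights. A short sketch (the function name is hypothetical, and it models only the fully contended case):

```python
def expected_cpu_fractions(weights):
    """Map sibling cgroup name -> fraction of CPU time under full contention.

    `weights` maps each runnable sibling to its cpu.weight value.
    """
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"cpu.weight for {name} must be in 1..10000")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    for name, share in expected_cpu_fractions({"a": 100, "b": 200}).items():
        print(f"{name}: {share:.1%}")
```

When a sibling goes idle it drops out of the sum, so the remaining cgroups absorb its share; weights never cap usage the way cpu.max does.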

cpu.max (format: quota period, both in microseconds): Hard rate limiting. "200000 1000000" means the cgroup can use at most 200ms of CPU time per 1000ms period (20% of one CPU). "max 100000", the default, means no limit. This is the replacement for v1's cpu.cfs_quota_us and cpu.cfs_period_us.

# Give container at most 1.5 CPUs
echo "150000 100000" > /sys/fs/cgroup/mycontainer/cpu.max

# Set CPU weight to double the default
echo "200" > /sys/fs/cgroup/mycontainer/cpu.weight
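
Converting between a CPU count and the cpu.max string is a common source of off-by-ten mistakes. The following sketch does the conversion in both directions; the helper names are hypothetical and the 100000-microsecond default period is an assumption matching the example above:

```python
def format_cpu_max(cpus, period_us=100_000):
    """Build a cpu.max value; cpus=None means no limit."""
    if cpus is None:
        return f"max {period_us}"
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


def parse_cpu_max(text):
    """Return the limit as a number of CPUs, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


if __name__ == "__main__":
    print(format_cpu_max(1.5))
    print(parse_cpu_max("200000 1000000"))
```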

cpu.stat: Reports usage statistics including usage_usec, user_usec, system_usec, and throttling information.
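
cpu.stat is a flat key/value file, so the throttling information is easy to turn into a ratio. A minimal sketch, assuming the nr_periods and nr_throttled fields that cpu.stat reports when the CPU controller is enabled; the sample text is illustrative rather than captured from a real system:

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines into a dict of ints."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            result[key] = int(value)
    return result


def throttle_ratio(cpu_stat_text):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    stat = parse_flat_keyed(cpu_stat_text)
    periods = stat.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stat.get("nr_throttled", 0) / periods


SAMPLE = """usage_usec 5000000
user_usec 4000000
system_usec 1000000
nr_periods 1000
nr_throttled 150
throttled_usec 2500000
"""

if __name__ == "__main__":
    print(f"throttled in {throttle_ratio(SAMPLE):.0%} of periods")
```

Because the counters are cumulative, a monitoring script would normally take two readings and compute the ratio over the difference.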

Memory Controller

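memory.max: Hard memory limit. When usage reaches this limit, the kernel first tries to reclaim memory from the cgroup; if usage cannot be brought back under the limit, the OOM killer is invoked within the cgroup to kill a process. Setting it to max means unlimited.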

memory.high: Throttle limit, often described as a soft limit. When a cgroup exceeds this threshold, its processes are throttled and the kernel reclaims aggressively (swapping out, dropping caches) in the allocation path. Processes are not killed but may slow significantly. This is the preferred throttle mechanism for containers: kill only as a last resort.

memory.swap.max: Limits swap usage for the cgroup. Setting to 0 prevents any swap usage (common in containers to make memory pressure explicit).

memory.current: Read-only, current memory usage of the cgroup in bytes.

memory.events: Statistics about memory events:

low 0
high 42
max 3
oom 0
oom_kill 0

The high counter indicates how many times the high threshold was exceeded — useful for detecting memory pressure before OOM kills.
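
Since the counters in memory.events only ever increase, what matters for alerting is the change between two readings. A small sketch of that comparison, using the sample above as the second reading; the function names and the earlier reading are made up for illustration:

```python
def parse_memory_events(text):
    """Parse memory.events ('name count' per line) into a dict of ints."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            name, count = line.split()
            events[name] = int(count)
    return events


def events_delta(before, after):
    """Return only the counters that increased between two readings."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


EARLIER = "low 0\nhigh 40\nmax 3\noom 0\noom_kill 0\n"
LATER = "low 0\nhigh 42\nmax 3\noom 0\noom_kill 0\n"

if __name__ == "__main__":
    print(events_delta(parse_memory_events(EARLIER), parse_memory_events(LATER)))
```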

memory.stat: Detailed breakdown including anon, file, kernel, slab, sock memory, swap, and more.

IO Controller

io.max (format: MAJOR:MINOR rbps=N wbps=N riops=N wiops=N): Hard rate limit per block device.

# Limit /dev/sda (8:0) to 100 MiB/s read, 50 MiB/s write
echo "8:0 rbps=104857600 wbps=52428800" > /sys/fs/cgroup/mycontainer/io.max
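
Writing byte counts by hand is error-prone, so it can help to generate the io.max line. A sketch under the assumption that only the four keys named above are used; format_io_max is a hypothetical helper that builds the string and does not write it:

```python
MIB = 1024 * 1024
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def format_io_max(major, minor, **limits):
    """Build an io.max line such as '8:0 rbps=104857600 wbps=52428800'.

    Each limit is an integer (bytes/s or IO/s) or the string 'max'.
    """
    unknown = set(limits) - set(IO_MAX_KEYS)
    if unknown:
        raise ValueError(f"unknown io.max keys: {sorted(unknown)}")
    parts = [f"{major}:{minor}"]
    for key in IO_MAX_KEYS:  # fixed order keeps the output predictable
        if key in limits:
            parts.append(f"{key}={limits[key]}")
    return " ".join(parts)


if __name__ == "__main__":
    print(format_io_max(8, 0, rbps=100 * MIB, wbps=50 * MIB))
```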

io.weight (format: default N or MAJOR:MINOR N): Proportional IO weight, analogous to cpu.weight.

io.stat: Per-device read/write bytes and IOPS counters.

PID Controller

pids.max: Maximum number of processes+threads that can exist in the cgroup. Essential for preventing fork bombs:

echo "100" > /sys/fs/cgroup/mycontainer/pids.max

pids.current: Current count of processes and threads in the cgroup and its descendants.

Pressure Stall Information (PSI)

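PSI quantifies resource pressure as the share of wall-clock time that tasks spend stalled waiting for a resource. System-wide figures are available under /proc/pressure/ (since Linux 4.20); cgroup v2 additionally exposes per-cgroup cpu.pressure, memory.pressure, and io.pressure files: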

# cat /sys/fs/cgroup/mycontainer/memory.pressure
some avg10=0.00 avg60=0.23 avg300=1.47 total=8473291
full avg10=0.00 avg60=0.11 avg300=0.72 total=3921847

some: Share of time at least one task was stalled on this resource
full: Share of time all non-idle tasks were stalled simultaneously (complete resource starvation)
avg10/60/300: Running averages over 10s, 60s, and 300s windows, expressed as percentages
total: Cumulative stall time in microseconds
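
The pressure file format is regular enough to parse in a few lines. The sketch below turns the sample output shown earlier into nested dictionaries; parse_pressure is a hypothetical name, and the code assumes the some/full line layout shown in this section:

```python
def parse_pressure(text):
    """Parse a PSI pressure file into {'some': {...}, 'full': {...}}.

    The avg* fields become floats (percentages); total becomes an int.
    """
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, raw = field.split("=", 1)
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


SAMPLE = """some avg10=0.00 avg60=0.23 avg300=1.47 total=8473291
full avg10=0.00 avg60=0.11 avg300=0.72 total=3921847
"""

if __name__ == "__main__":
    psi = parse_pressure(SAMPLE)
    if psi["some"]["avg60"] > 0.2:
        print("memory pressure rising over the last minute")
```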

PSI can be used to trigger proactive reclaim or alerting before OOM events. Facebook's oomd (Out-of-Memory Daemon) uses PSI to kill cgroups under memory pressure before the kernel's OOM killer fires, with more context-aware decisions.

Cgroup v2 Hierarchy Diagram

Resource flow through cgroup v2 hierarchy:

Process makes allocation request
         │
         ▼
┌────────────────────────────────────────────────────────┐
│ Kernel resource subsystem (e.g., mm for memory)        │
│                                                        │
│  Walk cgroup hierarchy from leaf to root               │
│  for each ancestor: check controller limits            │
│  if any limit exceeded: apply controller action        │
│    - memory.max exceeded → reclaim, then OOM kill      │
│    - memory.high exceeded → synchronous reclaim        │
│    - cpu.max exceeded → throttle (task placed to sleep)│
│    - pids.max exceeded → EAGAIN on fork()              │
└────────────────────────────────────────────────────────┘
         │
         ▼
Request granted (or denied/throttled)

Accounting: usage flows UP the hierarchy (conceptual; the root cgroup itself has no memory.current file)
┌─────────────────────┐
│   root cgroup       │  memory.current = sum of all below
│   memory.current    │
│   = 8192 MB         │
│                     │
│  ┌───────────────┐  │
│  │ system.slice  │  │  memory.current = 4096 MB
│  │               │  │
│  │ ┌───────────┐ │  │
│  │ │ container │ │  │  memory.current = 2048 MB
│  │ └───────────┘ │  │
│  └───────────────┘  │
└─────────────────────┘
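
The leaf-to-root walk and the upward accounting can be modelled with a toy class. This is a deliberately simplified sketch, not how the kernel is implemented: it has no reclaim step, and the class and method names are invented for illustration.

```python
class Cgroup:
    """Toy model of hierarchical charging: a charge must fit every ancestor."""

    def __init__(self, name, parent=None, limit=None):
        self.name = name
        self.parent = parent
        self.limit = limit  # None means unlimited ("max")
        self.usage = 0

    def self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def try_charge(self, amount):
        """Charge `amount` to this cgroup and all ancestors.

        Returns None on success, or the cgroup whose limit blocked the
        charge (in which case no usage counter is changed).
        """
        for node in self.self_and_ancestors():
            if node.limit is not None and node.usage + amount > node.limit:
                return node
        for node in self.self_and_ancestors():
            node.usage += amount
        return None


if __name__ == "__main__":
    root = Cgroup("root")
    system = Cgroup("system.slice", parent=root, limit=4096)
    container = Cgroup("container", parent=system, limit=2048)
    print(container.try_charge(1500), root.usage)
    blocker = container.try_charge(600)
    print(blocker.name if blocker else None)
```

In the real memory controller a blocked charge triggers reclaim in the offending cgroup before any OOM decision; the model only shows which level of the tree says no.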

cgroup v1 vs v2 Comparison

Aspect	cgroup v1	cgroup v2
Hierarchy	One per controller	Single unified
Process placement	Can differ per controller	Same position for all
Delegation	Complex, per-hierarchy	Clean, unified
Kernel version	2.6.24 (2008)	4.5+ (2016), mature 5.x
Docker default	v1 on older hosts	Follows the host; v2 supported since Docker 20.10
Kubernetes default	v1 historically	Follows the node; v2 support GA in 1.25
PSI support	System-wide only (/proc/pressure)	Per-cgroup pressure files

Production Examples

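Kubernetes maps pod QoS classes onto cgroup v2 memory settings. memory.max is derived from the memory limit; memory.min is set only when the kubelet's Memory QoS feature (introduced as alpha in 1.22) is enabled:
- Guaranteed pods (requests == limits): memory.min = requests, memory.max = limits, so the two are equal
- Burstable pods: memory.min = requests, memory.max = limits
- BestEffort pods: no memory.min, no memory.max (use whatever is free)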

Docker resource constraints:

docker run --cpus="1.5" --memory="512m" --memory-swap="512m" nginx
# --cpus="1.5"    → cpu.max = "150000 100000"
# --memory="512m" → memory.max = 536870912
# --memory-swap equals --memory → memory.swap.max = 0 (no swap)
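
The mapping in the comments above is plain arithmetic and can be reproduced in a few lines. The sketch assumes Docker's convention that --memory-swap is the combined memory-plus-swap total and a 100000-microsecond period; the function names are hypothetical, and special values such as -1 (unlimited swap) are not handled:

```python
UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text):
    """Convert strings like '512m' or '1g' to bytes."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1]])
    return int(text)


def docker_flags_to_cgroup_v2(cpus, memory, memory_swap, period_us=100_000):
    """Return the cgroup v2 file values implied by the three flags."""
    mem_bytes = parse_size(memory)
    total_bytes = parse_size(memory_swap)  # memory + swap combined
    return {
        "cpu.max": f"{int(round(cpus * period_us))} {period_us}",
        "memory.max": str(mem_bytes),
        "memory.swap.max": str(total_bytes - mem_bytes),
    }


if __name__ == "__main__":
    for name, value in docker_flags_to_cgroup_v2(1.5, "512m", "512m").items():
        print(f"{name} = {value}")
```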

Systemd unit file cgroup limits:

[Service]
MemoryMax=2G
MemoryHigh=1.5G
CPUWeight=200
IOWeight=100
TasksMax=512

Debugging Notes

Check which cgroup version is in use: stat -fc %T /sys/fs/cgroup/ — returns tmpfs for v1, cgroup2fs for v2.
Find a process's cgroup: cat /proc/<PID>/cgroup — v2 shows a single line starting with 0::.
Memory limit not working: Verify the memory controller is enabled at every level of the hierarchy. In v2, controllers must be listed in cgroup.subtree_control at each level.
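OOM kills hard to attribute: The kernel log records cgroup OOM kills (look for "Memory cgroup out of memory" in dmesg or the journal), but the per-cgroup record is the oom_kill counter in memory.events, which is also readable where dmesg is restricted.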
Throttled CPU time: Check cpu.stat for throttled_usec — high values indicate the container is hitting cpu.max frequently, degrading latency.
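PSI monitoring: Register a trigger by writing a line such as "some 150000 1000000" (150ms of stall within any 1s window) to the pressure file, then wait on the same file descriptor with poll() for POLLPRI to be notified whenever the threshold is crossed.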
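
The /proc/<PID>/cgroup check described above is easy to script. The following sketch parses the hierarchy-ID:controllers:path format and returns the v2 path if one is present; the function names are hypothetical and the sample strings are illustrative:

```python
def parse_proc_cgroup(text):
    """Parse /proc/<PID>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def unified_path(text):
    """Return the cgroup v2 path (the '0::' entry), or None if absent."""
    for hierarchy_id, controllers, path in parse_proc_cgroup(text):
        if hierarchy_id == 0 and not controllers:
            return path
    return None


V2_SAMPLE = "0::/system.slice/nginx.service\n"
V1_SAMPLE = "12:memory:/myapp\n11:cpu,cpuacct:/myapp\n"

if __name__ == "__main__":
    print(unified_path(V2_SAMPLE))
    print(unified_path(V1_SAMPLE))
```

On a live system the input would come from reading /proc/self/cgroup or /proc/<PID>/cgroup; joining the returned path onto /sys/fs/cgroup gives the directory holding that process's limit files.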

Security Implications

pids.max is critical: Without a PID limit, a compromised container can fork-bomb the host, exhausting the kernel PID table.
memory.swap.max=0: Containers should generally disable swap to prevent memory pressure from a noisy neighbor silently degrading performance via swap activity.
cgroup escape via delegation bugs: Delegating control of a cgroup subtree has historically been a source of privilege escalation bugs, where a container admin could escape resource limits by manipulating the delegated subtree. v2's stricter delegation and containment rules narrow this attack surface.
Detecting cgroup limits from inside a container: A process can read its own limits from /sys/fs/cgroup/ (if mounted). Some runtimes (the JVM, for example) read cgroup limits to auto-configure heap size instead of using total host memory.

Performance Implications

CPU throttling latency: When a container exhausts its cpu.max quota, its tasks are throttled until the next period begins, adding latency spikes. The period length (default 100ms) bounds the stall added per period. For latency-sensitive services, set a smaller period while keeping the same ratio: "25000 50000" is 50% of one CPU with a 50ms period.
Memory reclaim on memory.high: Hitting the soft limit causes synchronous reclaim in the allocation path, adding latency. Watch memory.events high counter.
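Shared page accounting: Memory is charged to the cgroup that first faults it in. For page cache shared between containers (shared libraries, for example), the first container to touch a page is charged for it. This can cause unexpected memory accounting discrepancies.
OOM kill selection: Under v2, an OOM kill within a cgroup picks the process with the highest badness score, which is based on memory usage and shifted by oom_score_adj. Containers should set oom_score_adj to reflect process priority; setting memory.oom.group to 1 makes the kernel kill the whole cgroup as a unit instead.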
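
The effect of the period length on throttling latency can be estimated with a back-of-the-envelope model. The sketch assumes the cgroup's threads are continuously runnable and burn the quota as fast as possible, which gives the longest stall; real workloads are usually burstier, and the function names are hypothetical:

```python
def quota_us(cpu_fraction, period_us):
    """Quota that grants `cpu_fraction` CPUs at the given period."""
    return int(round(cpu_fraction * period_us))


def worst_case_stall_us(quota, period, threads=1):
    """Longest time per period the cgroup sits throttled.

    With `threads` busy threads the quota is consumed in quota/threads
    microseconds of wall time; the rest of the period is spent throttled.
    """
    burn_time = quota / threads
    return max(period - burn_time, 0.0)


if __name__ == "__main__":
    for period in (100_000, 50_000, 10_000):
        quota = quota_us(0.5, period)
        stall_ms = worst_case_stall_us(quota, period) / 1000
        print(f"cpu.max '{quota} {period}': up to {stall_ms:.0f} ms throttled per period")
```

Shorter periods shrink the individual stalls at the cost of more frequent quota bookkeeping, so the right value depends on the latency target of the service.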

Failure Modes

Failure	Symptom	Diagnosis
OOM kill loop	Container repeatedly exits with code 137	`cat memory.events` shows `oom_kill` incrementing; increase `memory.max`
CPU starvation	High latency, low throughput	`cpu.stat` shows high `throttled_usec`; raise `cpu.max` or `cpu.weight`
Fork bomb	Host becomes unresponsive	`pids.max` not set; current count in `pids.current` near system limit
IO starvation	Application slow, IO wait high	`io.stat` counters grow slowly while `io.pressure` is high; check `io.max` and `io.weight`
cgroup leak	Zombie cgroup directories remain	Container runtime bug; clean up with `rmdir` after all processes exit
Missing controller	Limits silently ignored	Controller not in `cgroup.subtree_control` at parent level

Modern Usage

systemd as cgroup manager: On modern Linux, systemd manages the cgroup hierarchy. Container runtimes should cooperate with it (for example by asking systemd over D-Bus to create transient scope units, or by working inside a subtree delegated with Delegate=yes) rather than creating cgroups independently.
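cgroup-aware applications: The JVM (since JDK 10) and recent Node.js versions read cgroup memory limits to set default heap sizes instead of reading total physical memory. The Go runtime does not read the memory limit on its own; Go 1.19 added GOMEMLIMIT, which has to be set explicitly, typically from the container's limit.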
cgroupv2 and eBPF: eBPF programs can be attached to cgroup hooks for per-container network policy, socket options, and device access control without iptables.
Nested workloads: Delegated v2 subtrees let a container manage resource limits for workloads it launches itself, such as nested containers or VMs.

Future Directions

Memory protection tiers: More granular memory.min (guaranteed) and memory.low (soft protection) integration with kernel reclaim pressure propagation
CPU isolation improvements: The cpuset controller has been part of the v2 hierarchy since Linux 5.0; ongoing work centers on cpuset partitions and NUMA-aware container placement
Cross-cgroup memory sharing: Proposals for charging shared anonymous memory to a designated cgroup rather than the first-toucher
io.latency controller: Latency-based IO control (available since Linux 4.19) to protect IO latency targets rather than just rate-limiting bandwidth

Exercises

Create a cgroup v2 group manually under /sys/fs/cgroup/test/. Enable the memory controller. Set memory.max to 50MB. Run a process in it using echo $$ > cgroup.procs. Then try to allocate more than 50MB and observe the OOM kill.
Write a script that reads cpu.stat for a container's cgroup, computes the throttle percentage, and alerts if it exceeds 10%.
Set up PSI monitoring: write a program that registers a trigger on memory.pressure for 10% "some" stall (for example "some 100000 1000000", 100ms within a 1s window), waits with poll() for POLLPRI, and logs a warning on each notification.
Compare memory accounting: run two containers that use the same base image. Inspect memory.current and the file line of memory.stat for each. Explain why the container started first may be charged for page cache that both use.
Use systemd-cgls to visualize the cgroup tree on a running Kubernetes node. Identify which cgroup represents a specific running pod.
Experiment with cpu.weight: run two CPU-bound processes in different cgroups with weights 100 and 400. Measure their actual CPU shares using cpu.stat usage_usec over 10 seconds.

References

cgroups(7) — Linux man page
Tejun Heo, "Control Group v2" — kernel documentation: Documentation/admin-guide/cgroup-v2.rst
Paul Menage's original cgroup design documentation
Facebook oomd: github.com/facebookincubator/oomd
systemd.resource-control(5) — systemd resource control man page
Linux kernel source: kernel/cgroup/, mm/memcontrol.c, kernel/sched/fair.c
Brendan Gregg, "Linux Performance" — cgroup observability chapters