OOM Killer

Technical Overview

The OOM (Out of Memory) Killer is the kernel's last resort when memory allocation fails and cannot be satisfied by reclaiming pages. Rather than deadlocking the system (where every process is waiting for memory that no one can free), the kernel selects one or more processes to kill, forcibly freeing their memory to allow the system to continue operating.

The OOM killer is controversial: it can kill the wrong process (killing a database rather than the process that triggered OOM), it can be too slow to respond, and its heuristics may not match the administrator's intent. Understanding it is essential for operating systems at scale — particularly in containerized environments where OOM kills are common and must be predictable.

The OOM killer is triggered by two paths:
1. __alloc_pages() exhausts all reclaim paths and still cannot allocate (global OOM)
2. A memory cgroup exceeds its memory limit and reclaim within the cgroup fails; the charge path then ends in mem_cgroup_out_of_memory() (memcg OOM)

Prerequisites

Virtual memory and process memory accounting
Buddy allocator and GFP flags (06-buddy-allocator.md)
Memory pressure and reclaim (12-memory-pressure-and-reclaim.md)
cgroup v2 memory controller basics
Kubernetes pod lifecycle

Core Content

When OOM Killer Triggers

Memory Allocation Path to OOM
================================

alloc_pages(GFP_KERNEL, order)
  │
  ├── [FAST PATH] get_page_from_freelist()
  │     Zone has free pages → SUCCESS
  │
  └── [SLOW PATH] __alloc_pages_slowpath()
        │
        ├── 1. Wake kswapd (async page reclaim daemon)
        │      Retry allocation
        │
        ├── 2. Memory compaction (for high-order allocations)
        │      Retry allocation
        │
        ├── 3. Direct reclaim (synchronous: call try_to_free_pages())
        │      Reclaim LRU pages, write dirty pages to disk
        │      Retry allocation
        │
        ├── 4. More aggressive reclaim + compaction
        │      Retry allocation up to MAX_RECLAIM_RETRIES times
        │
        └── 5. All reclaim failed → out_of_memory()
               │
               ├── Should we panic? (vm.panic_on_oom=1 or 2)
               │     YES: kernel panic
               │
               ├── Is a previous victim still exiting? (found during the task scan)
               │     YES: wait for it to free memory
               │
               ├── Select victim: select_bad_process()
               │
               └── Kill victim: oom_kill_process() sends SIGKILL
                   Retry allocation (victim should free memory soon)

OOM Score Calculation

Each process has an oom_score (0–1000) that determines its likelihood of being killed. Higher score = more likely to be killed.

The score is computed by oom_badness() in mm/oom_kill.c:

oom_badness(task, totalpages):
  If oom_score_adj == -1000: return LONG_MIN (never kill)

  points = get_mm_rss(task->mm)        # RSS: pages in RAM
           + get_mm_counter(task->mm, MM_SWAPENTS)  # + swap usage
           + mm_pgtables_bytes(task->mm) / PAGE_SIZE # + page table pages

  Normalize to [0, 1000]:
  points = points * 1000 / totalpages

  Apply adj (oom_score_adj, already on the 0..1000 scale):
  points += oom_score_adj
  # Current kernels keep points in pages and instead add
  # oom_score_adj * totalpages / 1000, which is the same shift

  If points <= 0: return 1 (minimum, but killable)
  Return points

Key properties:
- Memory hog gets higher score: A process using 50% of RAM gets ~500 points
- Root processes no longer get a reduction: kernels before Linux 4.17 subtracted a hard-coded 3% for root-owned processes (it was never a kernel config option)
- oom_score_adj is the tuning knob: Range -1000 (never kill) to +1000 (kill first)
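The scoring model above can be tried out with a few lines of shell arithmetic. This is a minimal sketch of the simplified model only, not the kernel's code: the function name oom_points and its argument order are hypothetical, and it assumes oom_score_adj is added on the same 0..1000 scale as the normalized footprint.

```shell
#!/bin/sh
# Simplified model of the badness score described above (NOT kernel code).
# Usage: oom_points RSS_PAGES SWAP_PAGES PGTABLE_PAGES TOTAL_PAGES OOM_SCORE_ADJ
oom_points() {
    op_rss=$1; op_swap=$2; op_pgt=$3; op_total=$4; op_adj=$5
    # oom_score_adj of -1000 exempts the task entirely
    if [ "$op_adj" -eq -1000 ]; then
        echo "unkillable"
        return 0
    fi
    # Footprint normalized to 0..1000, then shifted by oom_score_adj
    op_points=$(( (op_rss + op_swap + op_pgt) * 1000 / op_total + op_adj ))
    # The model clamps non-positive results to 1 (still killable)
    if [ "$op_points" -le 0 ]; then
        op_points=1
    fi
    echo "$op_points"
}

# A 4 GiB machine with 4 KiB pages has 1048576 pages
oom_points 524288 0 0 1048576 0       # half of RAM, no adjustment: prints 500
oom_points 524288 0 0 1048576 -500    # same footprint, protected: prints 1
oom_points 1024 0 0 1048576 -1000     # exempt: prints unkillable
```

The second call shows why a large negative adjustment is effective: it cancels the whole footprint of a process using half of RAM.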

# View OOM score for all processes (highest first)
for p in /proc/[0-9]*/; do
    pid=$(basename "$p")
    # Processes can exit mid-scan; skip entries that have vanished
    score=$(cat "$p/oom_score" 2>/dev/null) || continue
    adj=$(cat "$p/oom_score_adj" 2>/dev/null)
    comm=$(cat "$p/comm" 2>/dev/null)
    echo "$score $adj $pid $comm"
done | sort -rn | head -20

/proc/PID/oom_score_adj Tuning

oom_score_adj values and use cases:

  -1000   Never kill this process
          Use for: init (PID 1), monitoring agents, recovery daemons
          Risk: if this process leaks memory, it can cause system-wide OOM

  -500    Significantly protect this process
          Use for: database servers, critical services

    0     Default: score based purely on memory usage

  +500    Prefer to kill this process before average processes
          Use for: batch jobs, test processes

  +1000   Kill this process first (score always highest)
          Use for: processes whose data can be safely discarded

Example tuning:
  # Protect PostgreSQL from OOM kill (pidof returns one PID per postgres process)
  for pid in $(pidof postgres); do echo -200 > /proc/$pid/oom_score_adj; done

  # Systemd service file
  [Service]
  OOMScoreAdjust=-500

  # /proc/PID/oom_score_adj is per-process under both cgroup v1 and v2
  # cgroup v2 adds memory.oom.group, which makes all processes in a cgroup die together

OOM Killer Selection Algorithm

out_of_memory() [mm/oom_kill.c]
  │
  ├── Call oom_notify_list (blocking notifier chain)
  │     Drivers/memory subsystems can register callbacks that free memory;
  │     if they free anything, no task is killed (global OOM only)
  │
  ├── select_bad_process():
  │     For each task in task_list:
  │       score = oom_badness(task, totalpages)
  │       if score > max_score: victim = task
  │     Return victim (highest badness)
  │
  └── oom_kill_process(victim):
        Mark victim with TIF_MEMDIE flag
        Send SIGKILL to victim and all its threads
        Also kill any other processes sharing the victim's mm (e.g. created with CLONE_VM)

OOM victim selection skips:
- Kernel threads (no mm)
- Processes with oom_score_adj == -1000
- Processes already being killed (TIF_MEMDIE set)
- init (PID 1), since killing it would be catastrophic
- Processes currently exiting

OOM Killer Logs

When the OOM killer fires, it emits a detailed log entry (values below are illustrative; the kernel reports sizes in kB):

[1234567.890] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
              cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/myapp.service,
              task=myapp,pid=12345,uid=1000
[1234567.891] Out of memory: Killed process 12345 (myapp) total-vm:8388608kB, anon-rss:4194304kB, file-rss:524288kB, shmem-rss:0kB, UID:1000
[1234567.892] oom_reaper: reaped process 12345 (myapp), now anon-rss:0kB, file-rss:524288kB, shmem-rss:0kB

Fields to analyze:
- total-vm: Virtual address space size (may be much larger than RAM used)
- anon-rss: Anonymous RAM (heap, stack, anonymous mmap); usually the most interesting field
- file-rss: File-backed RAM (code, mmap'd files)
- shmem-rss: Shared memory (tmpfs, IPC)

Also printed: per-process table of all processes with their memory usage, OOM score, and command name. This is invaluable for post-mortem analysis.
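For post-mortem scripts, the victim's PID, command name, and anonymous RSS can be pulled out of the "Killed process" line. The sketch below assumes the kernel's kB-based format ("Killed process PID (comm) ... anon-rss:NkB"); the function name parse_oom_kill is hypothetical, and a command name containing ")" would not parse correctly.

```shell
#!/bin/sh
# Read kernel log lines on stdin; print "PID COMM ANON_RSS_KB" for each kill.
# Lines that do not describe an OOM kill produce no output.
parse_oom_kill() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p'
}

# Illustrative input line (values are made up)
printf '%s\n' \
  '[1234567.891] Out of memory: Killed process 12345 (myapp) total-vm:8388608kB, anon-rss:4194304kB, file-rss:524288kB, shmem-rss:0kB' \
  | parse_oom_kill
# prints: 12345 myapp 4194304
```

On a live system the same function can be fed from the kernel log, for example with dmesg | parse_oom_kill.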

cgroup-Level OOM (cgroup v2)

In cgroup v2, each cgroup can have a memory limit (memory.max). When the cgroup's usage would exceed the limit and reclaim cannot bring it back under:

Cgroup OOM flow:
  Charging a new page would push the cgroup over memory.max
    │
    └── memcg charge path: try_charge() [mm/memcontrol.c]
          │
          ├── Try to reclaim from cgroup's LRU lists
          │
          └── If reclaim fails:
                mem_cgroup_out_of_memory()
                  → select_bad_process() within the cgroup
                  → kill highest oom_score process IN THE CGROUP ONLY
                    (does not kill processes outside the cgroup)

cgroup v2 memory.oom.group:
  echo 1 > /sys/fs/cgroup/myapp/memory.oom.group
  → When any process in this cgroup OOMs, ALL processes in the cgroup
    are killed together (atomic group kill)
  → Useful for containers: kill the entire pod, not just one process

Cgroup v2 files for OOM:

# Set memory limit
echo "4G" > /sys/fs/cgroup/myapp/memory.max

# Set swap limit
echo "0" > /sys/fs/cgroup/myapp/memory.swap.max

# Enable group kill on OOM
echo 1 > /sys/fs/cgroup/myapp/memory.oom.group

# Monitor OOM events
cat /sys/fs/cgroup/myapp/memory.events
# low 0
# high 1234
# max 5
# oom 2        ← 2 OOM events
# oom_kill 3   ← 3 processes killed
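Monitoring scripts usually need one counter out of memory.events rather than the whole file. A minimal sketch, assuming the "key value" line format shown above; the helper name memcg_event is hypothetical.

```shell
#!/bin/sh
# Print the value of one key from memory.events-style text on stdin.
# Prints 0 if the key is absent.
memcg_event() {
    awk -v key="$1" '$1 == key { print $2; found = 1 } END { if (!found) print 0 }'
}

# Same sample content as the listing above
printf 'low 0\nhigh 1234\nmax 5\noom 2\noom_kill 3\n' | memcg_event oom_kill
# prints 3
```

Against a real cgroup the input would come from the file itself, for example memcg_event oom_kill < /sys/fs/cgroup/myapp/memory.events, and an alert would fire when the value grows between two polls.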

Memory Overcommit (vm.overcommit_memory)

vm.overcommit_memory modes:

0 = Heuristic overcommit (default):
  Allow overcommit unless the allocation is clearly too large.
  Check: a single request is refused only if it exceeds total RAM + swap.
  vm.overcommit_ratio is not consulted in this mode.

1 = Always overcommit:
  Never refuse mmap/malloc regardless of available memory.
  OOM killer is the only limit.
  Used by: workloads with a known maximum footprint (e.g. scientific computing), where the application is trusted to know its needs.
  Risk: random processes killed if workload exceeds RAM.

2 = No overcommit:
  Refuse allocations when committed memory exceeds:
  CommitLimit = (vm.overcommit_ratio/100 * physical RAM) + swap
  grep CommitLimit /proc/meminfo
  Risk: mmap/malloc returns ENOMEM; applications must handle failure.
  Used by: databases that check mmap return values, security-conscious systems.

vm.overcommit_ratio (default 50):
  Percentage of physical RAM counted toward CommitLimit in mode 2.

The distinction between memory that has been promised (committed virtual address space, reported as Committed_AS) and memory that is actually backed by physical pages (resident) is the core of overcommit. Linux defaults to optimistic overcommit (mode 0) based on the assumption that programs rarely use their full allocation.
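The mode 2 limit is plain arithmetic and can be checked by hand. The sketch below applies the CommitLimit formula given above; the function names commit_limit_kb and would_refuse are hypothetical, and the calculation ignores huge page reservations, which the kernel subtracts from RAM first.

```shell
#!/bin/sh
# commit_limit_kb RAM_KB SWAP_KB RATIO: CommitLimit in kB for mode 2,
# i.e. RAM * ratio / 100 + swap (huge page reservations ignored).
commit_limit_kb() {
    echo $(( $1 * $3 / 100 + $2 ))
}

# would_refuse COMMITTED_KB REQUEST_KB LIMIT_KB: succeeds (exit 0) if the
# request would push committed memory past the limit.
would_refuse() {
    [ $(( $1 + $2 )) -gt "$3" ]
}

# 16 GiB RAM, 4 GiB swap, default ratio of 50
commit_limit_kb 16777216 4194304 50     # prints 12582912

if would_refuse 12000000 1000000 12582912; then
    echo "ENOMEM"                       # this branch runs: 13000000 > 12582912
else
    echo "ok"
fi
```

With the default ratio, a machine with no swap can commit only half of its RAM in mode 2, which is why the ratio is usually raised when strict accounting is enabled.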

Kubernetes OOMKilled

In Kubernetes, OOM kills show up as a container termination reason of OOMKilled (exit code 137 = 128 + 9, where 9 is SIGKILL):

Kubernetes memory management:
  Request: 512Mi  → used for scheduling; the node sets this much aside for the pod
  Limit:   1Gi    → hard limit; the container is OOMKilled if it exceeds this

The "limit" is implemented as a cgroup v1 memory.limit_in_bytes
(or cgroup v2 memory.max).

When the container's memory.max is hit:
  → cgroup OOM → kernel sends SIGKILL to a process in the container
  → Container exits with code 137
  → kubelet restarts the container (if restartPolicy=Always)

Common causes:
  1. Memory leak (RSS grows unbounded)
  2. Burst allocation (bulk load, large query, temporary spike)
  3. Limit set too low for actual workload
  4. JVM heap + off-heap (native memory) exceeds limit:
     -Xmx4g set but JVM also uses 1GB off-heap → total 5GB > 4GB limit

Debugging:
  kubectl describe pod <name> | grep -A5 OOMKilled
  kubectl top pod <name>  # live memory usage
  # Historical: metrics server, Prometheus kube-state-metrics
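Two small calculations from this section are easy to script: decoding an exit code into a signal number, and checking whether JVM heap plus off-heap memory fits under a container limit. This is a sketch with hypothetical helper names (exit_signal, fits_limit); the off-heap figure is something you have to measure or estimate yourself.

```shell
#!/bin/sh
# exit_signal CODE: print the signal number encoded in a shell-style exit
# code (128 + N), or "none" if the code is 128 or below.
exit_signal() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo "none"
    fi
}

# fits_limit HEAP_MB OFFHEAP_MB LIMIT_MB: compare heap + off-heap to the limit.
fits_limit() {
    fl_total=$(( $1 + $2 ))
    if [ "$fl_total" -gt "$3" ]; then
        echo "over by $(( fl_total - $3 )) MB"
    else
        echo "ok"
    fi
}

exit_signal 137              # prints 9 (SIGKILL)
fits_limit 4096 1024 4096    # -Xmx4g + 1 GB off-heap, 4 GB limit: prints "over by 1024 MB"
fits_limit 4096 1024 6144    # same JVM, 6 GB limit: prints "ok"
```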

Disabling the OOM Killer (Dangerous)

# System-wide: replace OOM kills with a kernel panic
echo 1 > /proc/sys/vm/overcommit_memory  # mode 1 = never refuse allocations (optional)
sysctl vm.panic_on_oom=1  # kernel panic instead of OOM kill
# (use 2 to panic even when the OOM is confined to a cpuset, mempolicy, or memcg)

# Use case: hard real-time systems where a partially-killed system is
# worse than a full crash (triggers watchdog + restart)

# Per-process: never kill THIS process (-s returns a single PID)
echo -1000 > /proc/$(pidof -s myapp)/oom_score_adj

Never disable OOM killer in production without a watchdog mechanism. A system under OOM with no killer will hang — all processes blocked waiting for memory, none able to make progress.

Historical Context

The OOM killer was introduced in Linux 2.0 (mid-1990s) as a basic necessity. The original algorithm was primitive: find the largest process and kill it. Over subsequent releases, the badness heuristic was refined to consider process age, nice value, memory usage type, and cgroup membership. The oom_score_adj interface (replacing the older oom_adj) was standardized in Linux 2.6.36. Cgroup-level OOM handling was added with the memory cgroup controller (Linux 2.6.25). The oom_reaper thread (Linux 4.6) was added to asynchronously free the OOM victim's memory without blocking the allocation path.

Andries Brouwer's famous essay "Oom killer" (2000) documented the fundamental problem: since mmap always succeeds (due to overcommit), the kernel must have a mechanism to reclaim that memory. The OOM killer is the consequence of the overcommit design choice.

Production Examples

Redis OOM due to fork(): Redis forks for BGSAVE/BGREWRITEAOF. After fork(), parent and child share all pages copy-on-write (CoW). Under write load, the parent dirties many pages, and each first write to a shared page creates a private copy. If the total memory in use (parent + CoW copies) exceeds RAM, the OOM killer kills the Redis parent or child. Mitigation: vm.overcommit_memory=1 for Redis servers; use RDB with rdbcompression yes; give adequate swap.

Java application OOM kill with adequate heap: JVM with -Xmx8g on a container with memory.limit=8g → OOMKilled. The JVM heap is 8 GB, but metaspace, JIT code cache, native libraries, and OS overhead add another 1–2 GB. The container limit must cover heap plus non-heap memory; a common rule of thumb is at least 25% headroom over -Xmx. Example: -Xmx4g with memory.limit=6g.

Kubernetes cascade OOM: A node experiences memory pressure. kswapd reclaims page cache but cannot keep up with allocation rate. The OOM killer fires, selecting a pod with the highest oom_score. That pod was a shared cache service used by 50 other pods. Those 50 pods now experience cache misses, generate more allocations, and the OOM killer fires again in a cascade. The node experiences rolling OOM kills of 20 pods before stabilizing. Prevention: set oom_score_adj=-200 for cache services, and use PriorityClass in Kubernetes to protect critical services.

Debugging Notes

# Check if OOM killer has fired recently
dmesg | grep -iE "oom|killed process|out of memory"
journalctl -k | grep -i oom

# Extract OOM events with timestamp
dmesg -T | grep -B5 -A30 "Out of memory"

# Monitor OOM in real time
dmesg -w | grep -i oom

# Check current memory pressure
vmstat 1
# Columns: swpd (swap used), free, buff, cache
# si/so: swap in/out (non-zero = under pressure)

# Overcommit stats
cat /proc/meminfo | grep -E "CommitLimit|Committed_AS"
# Committed_AS > CommitLimit → overcommit active (with mode 0 or 1)

# Per-cgroup OOM stats
find /sys/fs/cgroup -name memory.events | while read f; do
    echo "--- $f ---"; cat "$f"; done

# Find the process that would be killed next (highest oom_score)
sort -k1 -rn <(for p in /proc/[0-9]*/; do
    echo "$(cat "$p/oom_score" 2>/dev/null) $(cat "$p/comm" 2>/dev/null)"
done) | head -5

# Check oom_score_adj for all processes in a cgroup (cgroup path is an example)
while read -r pid; do
    echo "$pid $(cat "/proc/$pid/oom_score_adj" 2>/dev/null)"
done < /sys/fs/cgroup/myapp/cgroup.procs

Security Implications

OOM killer as DoS vector: An unprivileged process can allocate large amounts of memory (via overcommit), then touch it to commit physical pages, exhausting RAM and triggering OOM kills on other processes. Setting vm.overcommit_memory=2 with appropriate overcommit_ratio limits this. Cgroup memory limits are the proper containment mechanism.

Security bypass via OOM kill: If an attacker can deterministically cause the OOM killer to fire and select a specific victim (e.g., a security daemon left at oom_score_adj=0 with a large RSS, while the attacker spreads its allocations across many small processes that each score low), they can cause that daemon to be killed, bypassing security enforcement. Mitigation: set oom_score_adj=-1000 for security-critical processes.

Information leak via OOM log: The OOM killer's kernel log output includes process names, PIDs, memory usage, and sometimes mapping details. On a multi-user system, this reveals information about other users' processes.

Kernel bugs around the OOM reaper: The oom_reaper thread in the kernel asynchronously walks the OOM victim's VMA list to free memory. Races between the OOM reaper and the exiting victim have historically caused kernel memory-safety bugs (e.g., CVE-2018-1000200, a NULL pointer dereference when the reaper raced with exit_mmap() on a large mlocked process).

Performance Implications

OOM kill latency: From the time an allocation fails to when the victim is killed and memory is freed: typically 100 ms to several seconds. Depends on victim's clean vs dirty pages, swap activity, and oom_reaper wakeup latency.
Surprising victim selection: A process with large mappings but small RSS (due to mmap overcommit) scores low, because the badness score counts resident pages, swap entries, and page tables rather than virtual size. Its virtual size still counts against the commit limit in mode 2.
Memory accounting overhead: Per-cgroup RSS accounting adds ~5–10 ns per page fault due to atomic counter updates in mem_cgroup_charge(). At 100k faults/s, this is ~1 ms/s per core. Acceptable for most workloads.
OOM in GFP_ATOMIC context: If GFP_ATOMIC allocation fails (used in interrupt handlers), the OOM killer is NOT invoked (cannot sleep). The allocation simply fails, and the caller must handle it. Network packet drops during memory pressure are a result of this.

Failure Modes and Real Incidents

Kubernetes OOM cascade (Google SRE, 2017 analog): A production Kubernetes cluster experienced a "thrashing" event where OOM kills + pod restarts consumed memory on restart (container image loading, fresh heap allocation), causing more OOM kills. The system oscillated for 20 minutes before human intervention. Fix: PodDisruptionBudgets + memory requests/limits enforcement + OOMScoreAdj for critical pods.

MySQL OOM kill on buffer pool flush: A backup running a large SELECT * triggered a flush of MySQL InnoDB's buffer pool, causing a spike in physical memory usage (dirty pages + temporary buffers). The OOM killer selected mysqld (highest RSS). With the database gone, the backup failed as well. Recovery took 90 minutes. Fix: oom_score_adj=-500 for mysqld, and size the buffer pool for peak usage.

Redis OOM loop: Redis with vm.overcommit_memory=0 and an active BGSAVE. The fork for BGSAVE doubled the virtual commit. Linux's heuristic overcommit check rejected the fork with ENOMEM. Redis logged "Can't save in background: fork: Cannot allocate memory". The database appeared to work but durability was lost. Fix: vm.overcommit_memory=1 for Redis deployment.

Modern Usage

cgroup v2 memory.oom.group: Since Kubernetes 1.28, the kubelet sets this on container cgroups when the node runs cgroup v2. All processes in a container are killed together rather than individually, simplifying restart semantics.
Kubernetes resource limits as cgroup limits: resources.limits.memory directly maps to memory.max in cgroup v2. Best practice is to set both requests and limits to the same value for guaranteed QoS class.
oom_reaper (Linux 4.6): Asynchronously reaps OOM victim memory without blocking the OOM path. Reduces OOM kill latency from seconds (waiting for victim to receive SIGKILL, exit, and free pages) to milliseconds.
Memcg proactive reclaim: cgroup v2 memory.reclaim (Linux 5.19) allows user space to request that the kernel reclaim memory from a cgroup proactively, before hitting the limit and triggering OOM. It is aimed at user-space memory managers that trim cold memory ahead of pressure.

Future Directions

systemd-oomd: A user-space OOM daemon that monitors cgroup memory pressure via PSI (Pressure Stall Information) and kills memory-heavy cgroups before the kernel OOM killer fires. More policy-aware than the kernel OOM killer. Part of systemd since version 247.
Android LMKD (Low Memory Killer Daemon): Android's OOM killer equivalent. User-space daemon that monitors memory pressure and kills background apps by importance order. User-space lmkd took over from the in-kernel LMK driver starting with Android 9 (the driver was removed from upstream Linux in 4.12), and Android 10 added PSI-based monitoring.
Memory pressure events (PSI): Linux 4.20 introduced PSI (Pressure Stall Information), exposed at /proc/pressure/memory. The "some" line reports the share of time at least one task was stalled waiting for memory; the "full" line reports the share of time all non-idle tasks were stalled at once. Used by systemd-oomd and Android LMKD for proactive intervention before the OOM kill point.
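A user-space policy built on PSI starts by reading the averaged stall percentages. The sketch below assumes the documented /proc/pressure/memory line format ("some avg10=... avg60=... avg300=... total=..."); the helper names psi_avg10 and psi_over and the threshold of 10 are hypothetical choices for illustration.

```shell
#!/bin/sh
# psi_avg10 KIND: print the avg10 value of the "some" or "full" line
# from PSI text on stdin.
psi_avg10() {
    awk -v kind="$1" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }'
}

# psi_over KIND THRESHOLD: succeed (exit 0) if avg10 exceeds THRESHOLD.
psi_over() {
    awk -v kind="$1" -v limit="$2" '
        $1 == kind { sub(/^avg10=/, "", $2); if ($2 + 0 > limit + 0) hit = 1 }
        END { exit (hit ? 0 : 1) }'
}

# Illustrative sample (values are made up)
sample='some avg10=12.50 avg60=4.00 avg300=1.00 total=123456
full avg10=3.25 avg60=1.00 avg300=0.25 total=45678'

printf '%s\n' "$sample" | psi_avg10 some      # prints 12.50
if printf '%s\n' "$sample" | psi_over some 10; then
    echo "memory pressure high"               # this branch runs: 12.50 > 10
fi
```

On a live system the input would be /proc/pressure/memory itself, or a cgroup's memory.pressure file for per-cgroup decisions.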

Exercises

Write a program that allocates memory in 10 MB increments until OOM is triggered. Observe the OOM killer's log output and identify which process was killed and why.
Set oom_score_adj=-1000 on a process and then trigger OOM (by running a memory hog in parallel). Confirm the protected process survives. Identify which process gets killed instead.
Configure a cgroup v2 with memory.max=100M and memory.oom.group=1. Run 3 processes in the cgroup. Trigger OOM by having one process allocate beyond the limit. Confirm all 3 are killed.
Parse the OOM killer log output from a real system (or a VM you OOM on purpose) and reconstruct the memory pressure state at the time of the kill.
Implement a simple OOM monitor that reads /dev/kmsg (a blocking read or poll(); inotify does not report new kernel messages), alerts when OOM is triggered, and logs the victim's details.
Compare the OOM kill latency (time from allocation failure to victim's memory being freed) between oom_reaper (Linux 4.6+) and older kernels. Use a VM for the test.

References

mm/oom_kill.c — out_of_memory(), oom_badness(), select_bad_process(), oom_kill_process()
mm/memcontrol.c — mem_cgroup_out_of_memory(), cgroup OOM handling
mm/page_alloc.c — __alloc_pages_slowpath(), reclaim trigger point
kernel/sysctl.c — vm.overcommit_memory, vm.panic_on_oom registration
/proc/sys/vm/ — overcommit_memory, overcommit_ratio, oom_kill_allocating_task
Linux man pages: proc(5) (oom_score, oom_score_adj fields)
Andries Brouwer, "Oom killer" (2000): https://www.win.tue.nl/~aeb/linux/lk/lk-9.html
LWN: "OOM killer rework" — https://lwn.net/Articles/391222/
LWN: "Toward better OOM handling" — https://lwn.net/Articles/668126/
Kubernetes documentation: "Configure Out of Resource Handling"
PSI (Pressure Stall Information): https://www.kernel.org/doc/html/latest/accounting/psi.html