Memory Pressure and Reclaim
Technical Overview
Linux manages a finite pool of physical RAM among competing users: the page cache (file data and filesystem metadata blocks), anonymous pages (heap, stack, private mmap), kernel data structures (slab caches, including dentries and inodes), and hardware (DMA, device mappings). As the system fills up, the kernel must reclaim memory: it frees pages that can simply be dropped (clean file pages), or writes pages out and then frees them (dirty file pages, swap-backed anonymous pages).
The reclaim system is designed to be transparent: applications should ideally never see allocation failures. The kernel maintains watermarks and proactively reclaims before the system runs out. Two mechanisms handle reclaim:
1. kswapd: A per-node kernel thread that reclaims asynchronously when free pages fall below the low watermark.
2. Direct reclaim: The allocating task itself reclaims when free pages fall below the min watermark.
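As a rough illustration of how the two mechanisms divide the work, the following sketch classifies a zone by its free page count. The function name and the watermark numbers are illustrative assumptions, not kernel identifiers, and the real allocator also takes allocation order and reserves into account.

```python
def reclaim_mode(free_pages: int, wmark_min: int, wmark_low: int) -> str:
    """Classify which reclaim path a given free-page count would trigger."""
    if free_pages < wmark_min:
        # The allocating task reclaims synchronously (kswapd is running too).
        return "direct"
    if free_pages < wmark_low:
        # Background reclaim only; allocations still succeed immediately.
        return "kswapd"
    # Above the low watermark: no reclaim needed.
    return "none"


# Hypothetical watermarks, in pages.
for free in (30000, 20000, 10000):
    print(free, reclaim_mode(free, wmark_min=16896, wmark_low=21120))
```

The ordering of the checks matters: a zone below min is also below low, so the stricter condition is tested first.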
The largest single consumer of RAM on a typical Linux server is usually not the running processes but the page cache: the OS caches recently read and written file data in RAM for as long as the memory is not needed elsewhere. On a database server with 256 GB RAM and 32 GB of process RSS, most of the remaining 224 GB typically ends up as page cache. Understanding when and how this is reclaimed is critical for capacity planning and performance tuning.
Prerequisites
- Virtual memory and page cache concepts
- Buddy allocator and GFP flags (06-buddy-allocator.md)
- OOM killer (10-oom-killer.md)
- NUMA nodes (11-numa-memory.md)
- LRU list basics
Core Content
Page Cache: The Dominant Memory Consumer
The page cache stores copies of file data in RAM. Every buffered read(), write(), and file-backed mmap() goes through the page cache (O_DIRECT I/O bypasses it):
Page Cache Architecture
========================
struct address_space { /* one per file inode */
struct xarray i_pages; /* radix tree / xarray of cached pages */
unsigned long nrpages; /* number of pages cached */
const struct address_space_operations *a_ops; /* readpage, writepage, ... */
struct inode *host;
spinlock_t private_lock;
struct list_head private_list;
};
Each cached page: struct page with page->mapping = address_space
page->index = offset in file (in pages)
File read path:
read() → file->f_op->read_iter() → generic_file_read_iter()
→ find_get_page(mapping, pgoff): already cached? → return it
→ page_cache_alloc() + mapping->a_ops->readpage() → schedule I/O
→ wait_on_page_locked() → return cached page content
Result: Same file page is shared by all processes accessing it.
Multiple processes reading /lib/x86_64-linux-gnu/libc.so.6 share its pages.
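The lookup-then-fill behaviour of the read path above can be modelled in a few lines. This is a toy user-space model, not kernel code: ToyPageCache, its dictionary keyed by (inode, page index), and the in-memory "disk" are all assumptions made for illustration.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Maps (inode, page index) -> page bytes, loosely like address_space.i_pages."""

    def __init__(self, backing_store):
        self.backing_store = backing_store  # inode number -> file bytes "on disk"
        self.pages = {}                     # (inode, index) -> cached page bytes
        self.hits = 0
        self.misses = 0

    def get_page(self, inode, index):
        key = (inode, index)
        if key in self.pages:
            self.hits += 1                  # already cached: no I/O
        else:
            self.misses += 1                # not cached: "read from disk"
            start = index * PAGE_SIZE
            self.pages[key] = self.backing_store[inode][start:start + PAGE_SIZE]
        return self.pages[key]

    def read(self, inode, offset, length):
        """Copy `length` bytes starting at `offset`, one page at a time."""
        end = min(offset + length, len(self.backing_store[inode]))
        out = bytearray()
        pos = offset
        while pos < end:
            index, in_page = divmod(pos, PAGE_SIZE)
            page = self.get_page(inode, index)
            chunk = page[in_page:in_page + (end - pos)]
            out += chunk
            pos += len(chunk)
        return bytes(out)


disk = {1: bytes(range(256)) * 40}          # one 10240-byte file, inode 1
cache = ToyPageCache(disk)
cache.read(1, 0, len(disk[1]))              # first reader: three page misses
data = cache.read(1, 4000, 200)             # second reader: served from cache
print(cache.misses, cache.hits)
```

The second read spans two pages and touches no backing storage, which is the sharing effect described above.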
Page cache on a typical production server:
free -h
# total used free shared buff/cache available
# Mem: 376G 48G 12G 2.1G 315G 324G
# Swap: 31G 1.2G 30G
# 315G is buff/cache (mostly page cache)
# 324G is "available" = free + reclaimable cache
LRU Page Lists
The kernel maintains four evictable LRU (Least Recently Used) lists, plus a fifth list for unevictable pages, per NUMA node and memory cgroup (per zone before Linux 4.8):
LRU Lists (per-node, simplified)
==================================
Active Anonymous (lruvec->lists[LRU_ACTIVE_ANON])
Pages with Accessed bit set recently; recently faulted-in anonymous pages
[ hot anonymous page ... warm anon page ... cooldown → to inactive ]
Inactive Anonymous (lruvec->lists[LRU_INACTIVE_ANON])
Anonymous pages not recently accessed; candidates for swap-out
[ old anonymous page ... very old → candidate for swap-out ]
Active File (lruvec->lists[LRU_ACTIVE_FILE])
File-backed pages accessed recently (Accessed bit = 1)
[ hot file page ... warm file page ... cooldown → to inactive ]
Inactive File (lruvec->lists[LRU_INACTIVE_FILE])
File-backed pages not recently accessed; candidates for eviction
If page is clean: can be freed immediately (re-read from disk later)
If page is dirty: must write to disk first, then free
Page promotion (inactive → active):
When a page in the inactive list is accessed again,
mark_page_accessed() promotes it to the active list.
Page demotion (active → inactive):
During reclaim, pages in the active list are moved to inactive.
The Accessed bit is cleared; if it was not set → demote.
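The promotion and demotion rules can be sketched with two ordered sets. This is a deliberately simplified model: the class name and methods are hypothetical, a page is promoted on its first re-access, and demotion happens only when the inactive list is empty, whereas the kernel balances the two lists continuously.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy active/inactive LRU; the first item in each dict is the oldest."""

    def __init__(self):
        self.active = OrderedDict()
        self.inactive = OrderedDict()

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)   # still hot: refresh its position
        elif page in self.inactive:
            del self.inactive[page]         # accessed again: promote
            self.active[page] = True
        else:
            self.inactive[page] = True      # new pages start on the inactive list

    def reclaim(self, count):
        """Evict up to `count` pages, oldest inactive pages first."""
        evicted = []
        while len(evicted) < count and (self.inactive or self.active):
            if not self.inactive:
                # Nothing left to evict: demote the oldest active page.
                page, _ = self.active.popitem(last=False)
                self.inactive[page] = True
                continue
            page, _ = self.inactive.popitem(last=False)
            evicted.append(page)
        return evicted


lru = TwoListLRU()
for name in "abcd":
    lru.access(name)        # a, b, c, d all start inactive
lru.access("a")             # a and b are touched again and become active
lru.access("b")
evicted = lru.reclaim(3)
print(evicted, list(lru.active))
```

Pages c and d are evicted first because they were never re-accessed; a is reclaimed only after being demoted.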
Since Linux 6.1, the Multi-Generational LRU (MGLRU, CONFIG_LRU_GEN) can replace the two-tier active/inactive scheme with a generational approach, providing more accurate age estimation with lower overhead. It is optional and is switched on at runtime through /sys/kernel/mm/lru_gen/enabled.
kswapd: Asynchronous Reclaim Daemon
Memory Watermarks (per-zone)
==============================
free pages
High ─────────────────── kswapd wakes at "low", stops at "high"
(kswapd reclaims when free < low)
Low ─────────────────── kswapd wakeup threshold
(alloc waits for async reclaim to catch up)
Min ─────────────────── direct reclaim begins
(allocating task reclaims synchronously)
─────────────────── OOM killer threshold (all reclaim failed)
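Watermark values (default):
vm.min_free_kbytes = 67584 KB (example value from a 16GB RAM system)
min = min_free_kbytes, split across zones in proportion to zone size
gap = max(min/4, zone_pages * vm.watermark_scale_factor / 10000)
low = min + gap
high = min + 2 * gap
(exact calculation in __setup_per_zone_wmarks())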
Modify with:
sysctl -w vm.min_free_kbytes=262144 # raise for large-page workloads
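A sketch of the watermark arithmetic, following the shape of the calculation in recent kernels' __setup_per_zone_wmarks(). The function name is hypothetical, vm.watermark_scale_factor is assumed to be at its default of 10, and details such as highmem handling and boost watermarks are left out.

```python
def zone_watermarks(min_free_kbytes, zone_pages, total_pages,
                    watermark_scale_factor=10, page_kb=4):
    """Return (min, low, high) watermarks in pages for one zone."""
    # The global reserve is shared out in proportion to zone size.
    wmark_min = (min_free_kbytes // page_kb) * zone_pages // total_pages
    # Distance between watermarks: at least min/4, or a fraction of the zone.
    gap = max(wmark_min // 4, zone_pages * watermark_scale_factor // 10000)
    return wmark_min, wmark_min + gap, wmark_min + 2 * gap


# Hypothetical machine: a single 16 GiB zone of 4 KiB pages.
pages_16g = 16 * 1024 * 1024 // 4
print(zone_watermarks(67584, pages_16g, pages_16g))  # (16896, 21120, 25344)
```

In this example the min/4 term dominates, so low and high come out as 5/4 and 3/2 of min. On much larger zones the watermark_scale_factor term takes over and widens the gap.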
kswapd operation (mm/vmscan.c: kswapd()):
1. Sleeps until woken by wakeup_kswapd() (called by alloc_pages())
2. Calls balance_pgdat() → kswapd_shrink_node() → shrink_node() (shrink_zone() before Linux 4.8)
3. Shrinks LRU lists: reclaim inactive file pages first, then anon pages if vm.swappiness allows
4. Stops when free pages reach high watermark
# kswapd activity
ps aux | grep kswapd # one per NUMA node
grep kswapd /proc/vmstat # pgsteal_kswapd = pages reclaimed by kswapd
watch -n1 "grep -E 'kswapd|pgsteal|pgscan' /proc/vmstat"
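The counters above are cumulative, so what matters is the difference between two samples. The sketch below parses /proc/vmstat-style text and computes reclaim efficiency (pages reclaimed per page scanned). The helper names are hypothetical, and it assumes the aggregated pgscan_kswapd / pgsteal_kswapd counter names used by current kernels; older kernels split them per zone.

```python
def parse_vmstat(text):
    """Parse 'name value' lines as found in /proc/vmstat into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def reclaim_efficiency(before, after, source="kswapd"):
    """Fraction of scanned pages that were actually reclaimed between samples."""
    scanned = after[f"pgscan_{source}"] - before[f"pgscan_{source}"]
    stolen = after[f"pgsteal_{source}"] - before[f"pgsteal_{source}"]
    return stolen / scanned if scanned else None


# Two made-up samples; on a live system read /proc/vmstat a second apart.
sample_1 = parse_vmstat("pgscan_kswapd 1000\npgsteal_kswapd 900\n")
sample_2 = parse_vmstat("pgscan_kswapd 3000\npgsteal_kswapd 1900\n")
print(reclaim_efficiency(sample_1, sample_2))  # 0.5
```

An efficiency far below 1.0 means kswapd is scanning many pages it cannot free, which usually shows up as high kswapd CPU usage.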
Direct Reclaim
When free pages drop below the min watermark, the allocating task itself must reclaim pages synchronously. This is the most impactful event for application latency:
Direct Reclaim Flow
====================
alloc_pages(GFP_KERNEL, order) [kswapd hasn't kept up]
│
├── free < low: wake kswapd, retry
│
├── free < min: direct_reclaim = 1
│ │
│ └── try_to_free_pages() [mm/vmscan.c]
│ │
│ ├── shrink_zones(): scan LRU, reclaim pages
│ ├── If __GFP_IO allowed: writeback dirty pages (very slow!)
│ ├── If __GFP_FS allowed: drop dentries, shrink slab
│ └── Retry allocation
│
└── Still fails after retries: → OOM killer (out_of_memory())
Direct reclaim latency:
Clean file pages: ~1-10 µs (just clear PTE, add to buddy)
Dirty file pages: ~10-100 ms (writeback to disk first)
Anonymous pages (swap): ~1-100 ms (write to swap device)
vm.swappiness: Reclaim Preference
vm.swappiness (0–200 since Linux 5.8, 0–100 before; default 60) controls the balance between reclaiming anonymous pages (via swap) and reclaiming file-backed pages (dropping page cache):
vm.swappiness effect on reclaim:
0: Strongly prefer reclaiming file cache; almost never swap
→ page cache is evicted aggressively; anonymous pages stay
→ Risk: if working set is large anonymous, reclaim fails
10: Low swap preference (recommended for latency-sensitive servers)
60: Default: balanced
100: Treat anonymous and file pages equally
200: Strongly prefer swapping anonymous pages
Behind the scenes (mm/vmscan.c: get_scan_count()):
The scanner determines how many pages to scan from each LRU list.
swappiness sets the anon:file scan ratio, with file_weight = 200 - swappiness:
scan_anon = total * swappiness / (swappiness + file_weight)
scan_file = total * file_weight / (swappiness + file_weight)
(simplified: the kernel also weights each side by its recent reclaim cost)
Tuning recommendations:
Databases (PostgreSQL, MySQL): vm.swappiness=10 or 1
General servers: vm.swappiness=10-20
Containers with cgroup limits: Per-cgroup memory.swappiness (cgroup v1 only)
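To make the ratio concrete, here is a minimal sketch of the split. It assumes the file weight is 200 - swappiness, as in get_scan_count() on kernels with the 0–200 range, and it ignores the cost-based adjustment the kernel applies on top. The function name is hypothetical.

```python
def scan_split(total_pages, swappiness):
    """Split a scan budget into (anon, file) pages for a given swappiness."""
    if not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be between 0 and 200")
    file_weight = 200 - swappiness            # assumption: simplified weights
    anon = total_pages * swappiness // (swappiness + file_weight)
    return anon, total_pages - anon


for value in (0, 10, 60, 100, 200):
    print(value, scan_split(1000, value))
```

With the default of 60, 30% of the scan budget goes to anonymous pages; 100 gives an even split, matching the table above.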
Page Writeback
Dirty pages (pages modified through write() or a writable mmap and not yet written back, tracked with the PG_dirty flag) must be written to disk before they can be freed. The writeback subsystem manages this:
Writeback Control
==================
dirty_background_ratio (default 10% of available memory):
When dirty pages > 10%, background writeback starts (kernel flusher threads; pdflush before 2.6.32)
dirty_ratio (default 20% of available memory):
When dirty pages > 20%, new writes are throttled (blocked until writeback catches up)
dirty_background_bytes / dirty_bytes:
Absolute byte limits (takes precedence over ratio if non-zero)
dirty_expire_centisecs (default 3000 = 30 seconds):
Pages older than 30s are written back regardless of dirty ratio
dirty_writeback_centisecs (default 500 = 5 seconds):
Writeback daemon wakes up every 5 seconds to check for expired pages
Writeback worker threads:
Per-device writeback: flusher work runs in kworker threads, one writeback context per backing device (struct bdi_writeback)
mm/page-writeback.c: balance_dirty_pages(), wb_writeback()
block/blk-wbt.c: writeback throttling (WBT)
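The interaction between the ratio and byte tunables can be sketched as follows. The function name is hypothetical and the model is simplified: it takes "available memory" as an input rather than deriving it, and ignores per-cgroup and per-device limits.

```python
def dirty_thresholds_kb(available_kb, background_ratio=10, ratio=20,
                        background_bytes=0, dirty_bytes=0):
    """Return (background, throttle) dirty thresholds in kB."""
    # A non-zero *_bytes value overrides the corresponding ratio.
    if dirty_bytes:
        limit = dirty_bytes // 1024
    else:
        limit = available_kb * ratio // 100
    if background_bytes:
        background = background_bytes // 1024
    else:
        background = available_kb * background_ratio // 100
    # The kernel keeps the background threshold below the throttle limit.
    if background >= limit:
        background = limit // 2
    return background, limit


print(dirty_thresholds_kb(8_000_000))                         # (800000, 1600000)
print(dirty_thresholds_kb(8_000_000, dirty_bytes=524288000))  # (256000, 512000)
```

The second call shows a common surprise: setting only dirty_bytes can push the limit below the ratio-derived background threshold, in which case the background threshold is halved relative to the new limit.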
Memory Compaction
When a high-order allocation is needed (for huge pages, DMA), the kernel may compact memory by migrating movable pages to create contiguous free blocks:
Memory Compaction
==================
compact_zone() [mm/compaction.c]:
|
├── Migration scanner (from bottom of zone):
│ Find movable pages (MIGRATE_MOVABLE)
│ Add to migration list
│
├── Free scanner (from top of zone):
│ Find free pages
│
└── Migrate movable pages to free pages at top of zone
→ Creates contiguous free blocks at bottom for huge page allocation
Triggers:
1. Allocation failure for order > 0 (alloc_pages slowpath)
2. khugepaged trying to collapse 4KB pages to 2MB THP
3. Explicit: echo 1 > /proc/sys/vm/compact_memory
4. Periodic: kcompactd daemon
Cost: 100ms – several seconds for full zone compaction
Causes latency spikes visible in application P99 latency
Monitor:
grep -E "compact|migration" /proc/vmstat
# compact_stall = times allocation blocked waiting for compaction
# compact_success = successful compactions
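The two-scanner idea can be shown on a toy zone. In this sketch each list element stands for one page: "M" movable, "F" free, "U" unmovable. The function names and the single-page granularity are assumptions for illustration; the kernel works on pageblocks and isolates pages in batches.

```python
def compact(zone):
    """Move movable pages from the bottom of the zone into free slots at the top."""
    zone = list(zone)
    migrate = 0                    # migration scanner starts at the bottom
    free = len(zone) - 1           # free scanner starts at the top
    while True:
        while migrate < free and zone[migrate] != "M":
            migrate += 1
        while free > migrate and zone[free] != "F":
            free -= 1
        if migrate >= free:        # scanners met: compaction is finished
            break
        zone[free], zone[migrate] = "M", "F"
    return zone


def largest_free_run(zone):
    """Length of the longest run of contiguous free pages."""
    best = run = 0
    for slot in zone:
        run = run + 1 if slot == "F" else 0
        best = max(best, run)
    return best


before = list("MFMFFM")
after = compact(before)
print("".join(after), largest_free_run(before), largest_free_run(after))
pinned = compact(list("MUFMFF"))
print("".join(pinned), largest_free_run(pinned))
```

The second example shows why unmovable pages matter: a single "U" in the middle of the zone caps the size of the contiguous block that compaction can produce.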
CMA (Contiguous Memory Allocator)
CMA reserves a region of RAM for large contiguous allocations (DMA, GPU, camera ISP):
CMA Operation:
At boot: reserve CMA_SIZE (e.g., 256 MB) as MIGRATE_CMA pages
Normal operation: pages in CMA region are used as MOVABLE (user pages)
On demand: cma_alloc() migrates MOVABLE pages out, returns contiguous range
Usage in drivers:
struct page *pages = dma_alloc_contiguous(dev, size, gfp);
→ calls cma_alloc(dev->cma_area, ...) if device has CMA
Configuration:
Kernel cmdline: cma=256M
Per-device: dma_declare_contiguous(), dma_contiguous_reserve()
Balloon Driver (VM Memory Management)
In virtualized environments, the balloon driver allows the hypervisor to reclaim guest RAM:
Memory Balloon Operation (KVM/VMware/Hyper-V):
Hypervisor wants to reclaim 1GB from a guest VM:
│
├── Signals balloon driver in guest (virtio-balloon)
├── Guest's balloon driver: alloc_pages() in a tight loop (1GB worth)
│ These pages are "inflated" (pinned by balloon driver)
│ Guest OS sees less free memory
├── Hypervisor maps those pinned pages to other VMs or frees them
│
When hypervisor releases memory back:
├── Signals balloon driver to deflate
├── Balloon driver: free_pages() → pages returned to guest buddy allocator
└── Guest sees more free memory
virtio-balloon driver: drivers/virtio/virtio_balloon.c
zswap and zram
zswap: Compressed Swap Cache (kernel 3.11+)
Intercepts swap-out requests
Compresses pages with LZO/LZ4/ZSTD
Stores compressed pages in a kernel memory pool (zbud/z3fold/zsmalloc)
If hit on swap-in: decompress from pool (much faster than disk swap)
If pool full: evict oldest compressed page to real swap device
Enable:
echo 1 > /sys/module/zswap/parameters/enabled
echo lz4 > /sys/module/zswap/parameters/compressor
echo 20 > /sys/module/zswap/parameters/max_pool_percent # 20% of RAM
zram: RAM-backed compressed swap device
Creates a block device backed by compressed RAM
Used as swap on Android, ChromeOS, low-RAM embedded systems
Compression: LZO/LZ4/ZSTD
Typical compression ratio: 2-4x for typical memory content
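modprobe zram # creates /dev/zram0
echo 4G > /sys/block/zram0/disksize # example size; must be set before mkswap
mkswap /dev/zram0
swapon /dev/zram0 -p 100 # high priority (use before disk swap)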
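A back-of-the-envelope model of what compression buys: if a fraction of RAM is given over to a compressed pool, that part holds compression_ratio times its size in page content. The function name is hypothetical and the model is optimistic: it assumes the pool is full of pages that all compress at the same ratio, and it ignores allocator overhead.

```python
def effective_capacity_gb(ram_gb, pool_percent, compression_ratio):
    """Memory content that fits when pool_percent of RAM holds compressed pages."""
    pool_gb = ram_gb * pool_percent / 100
    uncompressed_gb = ram_gb - pool_gb
    return uncompressed_gb + pool_gb * compression_ratio


# 64 GB machine, 20% pool (the max_pool_percent example above), 2.5x ratio.
print(effective_capacity_gb(64, 20, 2.5))
```

Under these assumptions a 20% pool at 2.5x yields about 30% more capacity. Larger gains need a larger pool or a better ratio.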
Memory cgroups and Reclaim
# cgroup v2 memory control
/sys/fs/cgroup/<name>/memory.max # hard limit (OOM on exceed)
/sys/fs/cgroup/<name>/memory.high # soft limit (throttle + reclaim)
/sys/fs/cgroup/<name>/memory.low # protection floor (reclaim avoids this cgroup)
/sys/fs/cgroup/<name>/memory.min # absolute protection (never reclaim below this)
# Reclaim from a specific cgroup (memory.reclaim, Linux 5.19+)
echo 100M > /sys/fs/cgroup/myapp/memory.reclaim
# cgroup reclaim priority:
# memory.min > memory.low > no protection (default)
# Reclaim first from cgroups without memory.low or memory.min protection
# PSI (Pressure Stall Information) per cgroup
cat /sys/fs/cgroup/myapp/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.05 total=123456
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# "full" = all tasks stalled waiting for memory = severe pressure
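PSI files are easy to consume programmatically. The sketch below parses the two-line format shown above into nested dictionaries. The function names and the 10% threshold are illustrative assumptions, not part of any kernel or systemd interface.

```python
def parse_psi(text):
    """Parse PSI output ('some ...' / 'full ...' lines) into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, value = field.split("=")
            # 'total' is cumulative stall time in microseconds; the rest are percentages.
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


def under_pressure(psi, threshold=10.0):
    """True if some task was stalled more than `threshold` percent of the last 10 s."""
    return psi["some"]["avg10"] > threshold


sample = (
    "some avg10=0.00 avg60=0.00 avg300=0.05 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg300"], under_pressure(psi))
```

On a live system the same function can be fed the contents of /sys/fs/cgroup/<name>/memory.pressure or /proc/pressure/memory.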
/proc/meminfo Interpretation
cat /proc/meminfo
Critical fields explained:
MemTotal: 387939392 kB # Total usable RAM
MemFree: 12388608 kB # Completely free pages (not used for anything)
MemAvailable: 324198400 kB # Estimate: can be freed without swapping
# = MemFree + reclaimable_cache + reclaimable_slab
# THIS is the number to watch, not MemFree
Buffers: 2097152 kB # Temporary buffers for raw block I/O (metadata)
Cached: 312573952 kB # Page cache (file data) — largest consumer
SwapCached: 1048576 kB # Pages in swap that are also in RAM (swapped back in)
Active: 98304000 kB # Active LRU: recently used pages
Inactive: 215040000 kB # Inactive LRU: candidates for reclaim
Active(anon): 32768000 kB # Active anonymous (heap, stack)
Inactive(anon): 1048576 kB # Inactive anonymous → swap candidates
Active(file): 65536000 kB # Active file pages → page cache
Inactive(file): 214040000 kB # Inactive file pages → eviction candidates
Unevictable: 2097152 kB # Pages that cannot be reclaimed (mlock, ramfs, SHM_LOCK)
Mlocked: 1048576 kB # Pages locked via mlock()
SwapTotal: 33554432 kB # Total swap space
SwapFree: 32505856 kB # Unused swap
Dirty: 131072 kB # Pages written but not yet flushed to disk
Writeback: 0 kB # Pages currently being written to disk
AnonPages: 32505856 kB # Anonymous pages (non-file-backed)
Mapped: 8388608 kB # Pages actively mapped into processes
Shmem: 2097152 kB # Shared memory (tmpfs, /dev/shm)
KReclaimable: 16777216 kB # Reclaimable kernel memory (slab+misc)
Slab: 20971520 kB # Total slab memory
SReclaimable: 16777216 kB # Reclaimable slab (dentries, inodes)
SUnreclaim: 4194304 kB # Unreclaimable slab (kmalloc for kernel structs)
KernelStack: 131072 kB # Kernel stacks (16KB per thread on x86-64)
PageTables: 262144 kB # Page table pages
AnonHugePages: 6291456 kB # THP anonymous pages
HugePages_Total: 1024 # HugeTLB pool
HugePages_Free: 512 # Free HugeTLB pages
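Key relationship: MemAvailable ≈ MemFree + Active(file) + Inactive(file) + SReclaimable - reserves (the kernel holds back roughly the low watermark from both the file cache and the reclaimable slab)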
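The estimate can be reproduced from the fields above. This sketch follows the general shape of the kernel's si_mem_available() calculation. The function names are hypothetical, and the low watermark and reserve totals are passed in as assumptions because /proc/meminfo does not expose them.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value} (kB or page counts)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def estimate_available_kb(info, wmark_low_kb, totalreserve_kb):
    """Approximate MemAvailable from its components."""
    available = info["MemFree"] - totalreserve_kb
    # File cache counts, minus the part the system needs to avoid thrashing.
    pagecache = info["Active(file)"] + info["Inactive(file)"]
    available += pagecache - min(pagecache // 2, wmark_low_kb)
    # Reclaimable kernel memory gets the same treatment.
    reclaimable = info.get("KReclaimable", info.get("SReclaimable", 0))
    available += reclaimable - min(reclaimable // 2, wmark_low_kb)
    return max(available, 0)


SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         1000000 kB
Active(file):    2000000 kB
Inactive(file):  4000000 kB
SReclaimable:     800000 kB
"""
info = parse_meminfo(SAMPLE)
print(estimate_available_kb(info, wmark_low_kb=100000, totalreserve_kb=200000))
```

Note that both active and inactive file pages contribute, which is why MemAvailable is usually far larger than MemFree on a server with a warm cache.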
Historical Context
Early Linux kernels kept file data in a page cache and block data in a separate buffer cache; the two were unified in Linux 2.4. The kswapd daemon, which handles background page reclaim, predates Linux 2.4, while dirty-buffer writeback moved from bdflush to pdflush in Linux 2.6 and then to per-device flusher threads in 2.6.32. LRU-style page aging (a clock algorithm approximation) has been in Linux since at least 2.2, and separate anonymous and file lists followed in 2.6.28. The vm.swappiness tunable was added in Linux 2.6. PSI (Pressure Stall Information) was added by Johannes Weiner (Facebook) in Linux 4.20 (2018), providing a direct quantitative measure of memory pressure. The Multi-Generational LRU (MGLRU) by Yu Zhao (Google) was merged in Linux 6.1 (2022), providing significantly better page reclaim decisions with lower overhead.
Production Examples
Linux page cache as a database buffer pool: A PostgreSQL server with 8 GB of shared_buffers on a 256 GB machine uses 8 GB for its managed buffer pool. Most of the remaining 248 GB is available to the OS page cache, which automatically caches hot table and index pages. Setting PostgreSQL's effective_cache_size=200GB tells the query planner how much memory (shared_buffers plus OS page cache) it can assume is available for caching data; it allocates nothing itself.
zswap at Google: Google has described running zswap-based memory compression across its server fleet. Figures quoted for this deployment: a typical compression ratio of about 2.5x, which effectively adds about 40% more usable RAM at the cost of CPU cycles for compression. During CPU-idle periods (waiting on network), compression is effectively free. The reported result is a 10–15% reduction in total server count for the same workload.
ChromeOS zram: ChromeOS uses zram as the primary swap device. With 4–8 GB of RAM, ChromeOS maintains dozens of browser tabs by swapping cold tabs to zram. Typical compression ratio: 3x, so 12 GB of swapped-out memory content occupies about 4 GB of RAM. This enables the system to maintain far more concurrent processes than physical RAM would suggest.
Debugging Notes
# Memory pressure snapshot
cat /proc/meminfo | grep -E "MemTotal|MemFree|MemAvailable|Cached|Active|Inactive|Dirty|Mapped|Slab"
# Reclaim rate (pages per second)
watch -n1 "cat /proc/vmstat | grep -E 'pgsteal|pgscan|pgactivate|pgdeactivate'"
# pgsteal_kswapd = pages freed by kswapd (want this > 0 under pressure)
# pgscan_kswapd = pages scanned by kswapd (scanned >> stolen = bad efficiency)
# Swap activity
vmstat -n 1 | awk 'NR>2 {print "swap-in:", $7, "swap-out:", $8}'
# si/so non-zero = swapping is happening
# Check writeback pressure
watch -n1 "grep -E '^Dirty|^Writeback' /proc/meminfo"
# Dirty should stay below dirty_background_bytes/dirty_ratio
# Writeback growing = slow disks, writeback falling behind
# Direct reclaim stall events (causes application latency spikes)
grep allocstall /proc/vmstat
# allocstall_normal growing = direct reclaim active (bad for latency)
# Compaction stalls
grep compact_stall /proc/vmstat
# Drop caches (for testing, dangerous on production)
echo 1 > /proc/sys/vm/drop_caches # drop page cache
echo 2 > /proc/sys/vm/drop_caches # drop slab (dentries, inodes)
echo 3 > /proc/sys/vm/drop_caches # drop both
# PSI memory pressure
cat /proc/pressure/memory
# some avg10=0.50 avg60=0.10 avg300=0.05 total=1234567
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# "some" = at least one task stalled, "full" = all tasks stalled
# avg10 > 5-10% = moderate pressure; > 50% = severe pressure
# systemd-oomd (user-space OOM using PSI)
systemctl status systemd-oomd
journalctl -u systemd-oomd | tail -20
Security Implications
Page cache poisoning: An attacker who can write to a file that is mmap'd by a victim can poison the victim's page cache. This is prevented by MAP_PRIVATE (CoW) and filesystem permissions. Shared writable mmap requires explicit consent.
Timing side channel via reclaim: An attacker can infer whether a victim's file pages are cached (fast access) or evicted (slow access) by timing their own access to the same file. This reveals usage patterns without reading the content.
Swap hibernation data: On systems with disk swap, hibernation writes all RAM to the swap partition. If the swap partition is unencrypted, an attacker with physical access can read entire memory contents from disk. Mitigation: encrypted swap (dm-crypt).
Dirty page information leak: If a process writes sensitive data to a file and exits, the dirty pages remain in the page cache until writeback completes. Any process with read permission on the file sees that data immediately through the page cache, whether or not it has reached disk. This is intentional behavior (coherence). Explicit sync()/fsync() only makes the data durable; keeping it private depends on restrictive file permissions.
Performance Implications
- Page cache hit rate: The most important metric for I/O-bound workloads. If the hot working set fits in page cache, throughput is memory-speed; if it doesn't, it's disk-speed. vmtouch shows how much of a file is resident in the page cache, and major-fault counts (for example perf stat -e major-faults) show how often accesses miss it. Note that perf record -e cache-misses measures CPU cache misses, not page cache misses.
- Reclaim cost: Clean page eviction: ~1–10 µs. Dirty page writeback: ~10–100 ms per page (disk-speed). Swap-out of an anonymous page: ~1–100 ms per page. Applications that hold large dirty files (video editing, databases) and then allocate memory cause expensive direct reclaim.
- vm.min_free_kbytes too low: If min_free_kbytes is too low, the low and high watermarks are barely above zero. kswapd wakes up too late, allowing direct reclaim to fire frequently. Recommendation: keep it at or above the kernel default, min_free_kbytes = sqrt(16 * lowmem_in_KB), which equals 4 * sqrt(lowmem_in_KB).
- Slab reclaim overhead: Dropping millions of dentries/inodes from slab with echo 2 > /proc/sys/vm/drop_caches can take seconds on a busy system and causes inode cache invalidation.
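The kernel's default for min_free_kbytes grows with the square root of memory size. The sketch below reproduces that calculation. The function name is hypothetical, and the clamp bounds (128 KB to 262144 KB) are an assumption based on recent kernels; older kernels used a lower upper bound. Transparent huge pages can raise the value further at boot.

```python
import math


def default_min_free_kbytes(lowmem_kb):
    """Kernel-style default: sqrt(16 * lowmem in KB), clamped to a sane range."""
    value = math.isqrt(lowmem_kb * 16)
    return max(128, min(value, 262144))   # assumption: bounds from recent kernels


for gib in (1, 16, 256):
    print(gib, "GiB ->", default_min_free_kbytes(gib * 1024 * 1024), "KB")
```

Because of the square root, a machine with 16 times more RAM only gets 4 times the reserve, which is one reason large-memory servers often raise the value by hand.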
Failure Modes and Real Incidents
Dirty write throttle under burst load: A log aggregation service batched log writes in 10 MB chunks. Each flush marked about 2,560 pages (4 KB each) dirty. On a system with dirty_ratio=10% (roughly 800 MB on 8 GB of RAM), the service occasionally burst to 1.5 GB of writes, triggering the write throttle. Application throughput dropped 90% as threads were blocked in balance_dirty_pages(). Fix: reduce batch size, set vm.dirty_bytes=524288000 (500 MB; the sysctl takes a plain byte count).
kswapd CPU spike during backup: A nightly backup process read a 500 GB dataset, filling the page cache. By the end of the backup window, the normal workload's pages had been evicted, leaving its cache cold. The workload re-warmed the cache, generating a burst of major page faults. kswapd ran at 100% of one CPU for 10 minutes. Production query latency spiked 5x. Fix: ionice -c 3 on the backup + posix_fadvise(POSIX_FADV_DONTNEED) after reading each file to avoid polluting the page cache.
Slab leak on filesystem metadata storm: A bug in ext4 (Linux 3.x, since fixed) caused dentry objects to not be reclaimed when vfs_cache_pressure was low. On a server with a large directory tree (100M files), the dcache consumed 40 GB. Normal processes could not allocate memory. The Slab total in /proc/meminfo was the clue: it was growing unboundedly (dentries are accounted under SReclaimable even when reclaim fails to free them). Fix: sysctl -w vm.vfs_cache_pressure=200 + kernel patch.
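For the backup incident above, the cache-friendly read pattern can be sketched as follows. The function name is hypothetical, and the code assumes a Unix system where os.posix_fadvise is available; it falls back to a plain read otherwise. POSIX_FADV_DONTNEED is only a hint and drops clean pages only, which is sufficient for a read-only backup pass.

```python
import os
import tempfile


def read_without_polluting_cache(path, chunk_size=1 << 20):
    """Read a file sequentially, then ask the kernel to drop its cached pages."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)      # a real backup would copy `chunk` somewhere
        if hasattr(os, "posix_fadvise"):
            # offset=0, len=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Demo on a throwaway 3 MiB file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * (3 << 20))
    tmp.flush()
    bytes_read = read_without_polluting_cache(tmp.name)
print(bytes_read)
```

For very large files it is usually better to issue the hint per chunk range rather than once at the end, so the backup never holds more than a small window of cache.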
Modern Usage
- PSI-aware OOM handling: Facebook's oomd uses the PSI full memory pressure metric to detect when a service group is about to OOM and preemptively terminates the lowest-priority container.
- Multi-generational LRU (MGLRU): Linux 6.1+. Replaces the 2-list LRU with a generational scan that better identifies truly cold pages. Google's MGLRU patch series reported 18% fewer low-memory kills on Android.
- Proactive compaction: Linux 5.9+ runs kcompactd proactively in the background to maintain a pool of high-order free pages. Reduces THP allocation failures without synchronous compaction latency spikes.
Future Directions
- Memory tiering reclaim: With CXL memory and PMEM as NUMA nodes, reclaim can "demote" cold pages to the slower tier rather than swapping to disk. Implemented via the demotion work (Linux 5.15+) and explicit memory tiers (memory_tier, Linux 6.1+).
- Predictive prefault: Using access patterns learned from PSI and perf data to prefault pages before they're needed, hiding reclaim latency.
- Container-aware writeback: cgroup v2 writeback (CONFIG_CGROUP_WRITEBACK) already assigns dirty pages to their cgroup for accounting. Future work extends this to provide per-cgroup writeback bandwidth control.
Exercises
- Fill the page cache by reading a large file (dd if=/large/file of=/dev/null bs=1M). Monitor MemAvailable in /proc/meminfo. Observe kswapd waking up via /proc/vmstat.
- Set vm.swappiness=1 on a system with a swap device. Trigger memory pressure by running a large process. Compare the proportion of file vs anonymous reclaim via the pgsteal counters in /proc/vmstat.
- Set vm.dirty_bytes=10485760 (10 MB, very aggressive) and run a write-heavy workload. Observe write throttle by measuring write bandwidth vs time. Plot the throttle-induced pauses.
- Enable zswap with zstd compression. Monitor compression ratio via /sys/kernel/debug/zswap/. Run a workload that exhausts RAM and compare swap-in latency vs disk swap.
- Write a tool that parses /proc/meminfo every second and alerts when MemAvailable < 10% of MemTotal, with a breakdown of what is consuming the memory.
- Simulate the "backup cache pollution" incident: warm the page cache with a working set, run a large sequential read (simulating backup), then measure the time to re-warm the working set. Compare with and without posix_fadvise(POSIX_FADV_DONTNEED) on the backup read.
References
- mm/vmscan.c: kswapd(), shrink_node() (shrink_zone() before Linux 4.8), try_to_free_pages(), get_scan_count()
- mm/page-writeback.c: writeback control, balance_dirty_pages()
- mm/compaction.c: compact_zone(), kcompactd()
- mm/zswap.c: zswap implementation
- mm/swap.c: LRU list manipulation, mark_page_accessed()
- include/linux/mmzone.h: watermark definitions (WMARK_MIN, WMARK_LOW, WMARK_HIGH)
- mm/page_alloc.c: watermark check (zone_watermark_ok())
- /proc/meminfo documentation: Documentation/filesystems/proc.rst
- vm sysctl documentation (/proc/sys/vm): Documentation/admin-guide/sysctl/vm.rst
- Johannes Weiner, "PSI: Pressure Stall Information", LWN 2018
- Yu Zhao, "Multi-Generational LRU Framework", LWN 2022: https://lwn.net/Articles/894859/
- LWN: "Toward a more reliable OOM killer" — https://lwn.net/Articles/668126/
- Mel Gorman, "Understanding the Linux Virtual Memory Manager", Chapter 10 (Page Reclaim)