Copy-on-Write (CoW)

Technical Overview

Copy-on-Write (CoW) is a resource management strategy where multiple processes share the same physical memory pages until one of them modifies a page. At that point, the modifying process receives a private copy; other processes continue to use the original. The copy is deferred until it is actually needed — "lazy copying."

CoW is used throughout the Linux kernel and filesystem ecosystem:

  • fork(): Parent and child share all pages until either writes
  • mmap(MAP_PRIVATE): File pages are shared until written
  • Kernel samepage merging (KSM): Identical anonymous pages across processes are merged, with CoW on write
  • Filesystems (Btrfs, ZFS): Data blocks are never overwritten; writes create new blocks
  • Container layers (overlay filesystem): Lower layers are read-only; writes go to an upper layer

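The result: fork() copies no data pages at the moment of the call. Its cost scales with the size of the parent's page tables, not with the amount of data they map, and additional physical memory is consumed only as pages are written afterwards. This is essential for the Unix process model, where fork() followed by exec() is the standard way to start a program.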

Prerequisites

  • Virtual memory and VMA structure (01-virtual-memory.md)
  • Paging and PTE flag bits (02-paging.md, 03-page-tables.md)
  • Page fault handling flow (02-paging.md)
  • mmap system call (09-mmap.md)

Core Content

CoW in fork(): Mechanism

fork() CoW Implementation
===========================

Before fork():
  Parent address space:
    VMA [0x400000-0x500000] VM_READ|VM_WRITE (private: VM_SHARED not set)
    PTE → Frame X (R/W=1, dirty)

fork() call:
  1. Kernel calls dup_mmap() to copy parent's VMA list
  2. For each private writable VMA, copy_page_range() is called
  3. For each present PTE in writable private VMAs:
       a. Clear R/W bit in BOTH parent and child PTE (write-protect)
       b. Increment page refcount (get_page())
       c. Child PTE points to SAME physical frame

After fork():
  Parent PTE → Frame X  (R/W=0, write-protected)
  Child  PTE → Frame X  (R/W=0, write-protected)
  Frame X refcount = 2

When child writes to the page:
  1. CPU: write to virtual address → PTE has R/W=0 → #PF (protection fault)
  2. Kernel: exc_page_fault() → handle_mm_fault() → do_wp_page()
  3. do_wp_page() checks page refcount:
     - refcount == 1: page is only owned by us → make PTE writable in-place
     - refcount > 1:  page is shared → COPY THE PAGE
  4. COPY path:
     a. Alloc new frame Y (alloc_page())
     b. Copy Frame X → Frame Y (copy_user_highpage())
     c. Install PTE → Frame Y (R/W=1)
     d. Decrement Frame X refcount (put_page())
  5. Fault returns; child resumes write to Frame Y (its private copy)
  6. Parent's PTE → Frame X remains write-protected, but Frame X refcount is now 1
     When the parent writes: #PF → do_wp_page() sees refcount == 1 → reuse in place (no copy)

After both write:
  Parent PTE → Frame X (R/W=1, parent is now the sole owner)
  Child  PTE → Frame Y (R/W=1, child's private copy)
  Frame X refcount = 1; it is freed only when the parent unmaps it or exits
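
The refcount rule in step 3 can be sketched as a toy model. This is an illustrative simulation only: Frame, PTE, fork_pte() and write_fault() are hypothetical names, not kernel structures, and real kernels track exclusivity with more state than a single counter.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    data: bytes
    refcount: int = 1

@dataclass
class PTE:
    frame: Frame
    writable: bool

def fork_pte(parent: PTE) -> PTE:
    """Share the frame: write-protect both PTEs and bump the refcount."""
    parent.writable = False
    parent.frame.refcount += 1
    return PTE(parent.frame, writable=False)

def write_fault(pte: PTE, data: bytes) -> str:
    """Apply a write and report which path the toy fault handler took."""
    if pte.writable:
        path = "no-fault"
    elif pte.frame.refcount == 1:
        pte.writable = True              # sole owner: reuse the frame in place
        path = "reuse"
    else:
        old = pte.frame
        pte.frame = Frame(old.data)      # shared: allocate and copy
        old.refcount -= 1
        pte.writable = True
        path = "copy"
    pte.frame.data = data
    return path

parent = PTE(Frame(b"X"), writable=True)
child = fork_pte(parent)
print(write_fault(child, b"Y"))    # child takes the copy path
print(write_fault(parent, b"Z"))   # parent is now sole owner of the old frame
```

Under this rule the last remaining owner of a frame never pays for a copy; it only pays for the fault that makes its PTE writable again.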

The code implementing CoW on write faults is do_wp_page() in mm/memory.c. It handles several sub-cases:

  • wp_page_copy(): standard copy path
  • wp_page_reuse(): page can be reused in-place (refcount=1, no other mappers)
  • wp_pfn_shared(): pfn-mapped pages (non-page-struct frames)
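
The effect of this machinery is visible from user space: after fork(), a write in the child does not change what the parent sees. The sketch below shows only the semantics; fork_isolation_demo() is a hypothetical helper name, and a Python process dirties many pages on its own, so this is not a way to measure CoW.

```python
import os

def fork_isolation_demo():
    """Return (parent_view, child_view) of a buffer after the child writes to it."""
    buf = bytearray(b"original")
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the write lands on the child's private copy of the page.
        os.close(r)
        buf[:] = b"modified"
        os.write(w, bytes(buf))
        os._exit(0)
    # Parent: read what the child saw, then inspect our own copy.
    os.close(w)
    child_view = os.read(r, 64)
    os.close(r)
    os.waitpid(pid, 0)
    return bytes(buf), child_view

if __name__ == "__main__":
    print(fork_isolation_demo())
```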

vm_area_struct Flags for CoW

/* VMA flags controlling CoW behavior */
VM_SHARED    (0x00000008)  /* shared mapping — no CoW, writes visible to others */
VM_MAYWRITE  (0x00000020)  /* can be made writable */
VM_WRITE     (0x00000002)  /* currently writable (by user) */

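/* A mapping is a CoW mapping when:
   VM_MAYWRITE is set but VM_SHARED is NOT set
   (this is the test in is_cow_mapping(); VM_WRITE alone is not enough,
   because mprotect() can add or remove it later)
   → Private, potentially writable mapping → pages are CoW-protected on fork */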

/* In do_wp_page():
   if (vma->vm_flags & VM_SHARED)
       → shared mapping; make PTE writable (no copy)
   else
       → private mapping; copy if refcount > 1 */
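
The flag test can be tried out in a few lines. This is a sketch that mirrors the kernel's is_cow_mapping() helper using the flag values listed above; it is a user-space illustration, not kernel code.

```python
VM_WRITE = 0x00000002
VM_SHARED = 0x00000008
VM_MAYWRITE = 0x00000020

def is_cow_mapping(vm_flags: int) -> bool:
    """True for private mappings that may become writable."""
    return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE

# Private writable mapping (e.g. heap): CoW applies.
print(is_cow_mapping(VM_WRITE | VM_MAYWRITE))
# Shared writable mapping: writes are visible to others, no CoW.
print(is_cow_mapping(VM_WRITE | VM_MAYWRITE | VM_SHARED))
# Private mapping currently read-only but mprotect()-able to writable: still CoW.
print(is_cow_mapping(VM_MAYWRITE))
```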

CoW and mmap(MAP_PRIVATE)

mmap(file, MAP_PRIVATE) creates a private mapping. The pages are shared with the page cache until written:

MAP_PRIVATE file mapping + write:
  Initial: PTE → page cache Frame (R/W=0)
  Write:   #PF → do_wp_page() → copy to anonymous frame → PTE → new Frame (R/W=1)
  Now:     Private dirty page (anonymous). No longer connected to file.
  On msync: nothing is written back. On munmap: the private copy is DISCARDED. File is not updated.

MAP_SHARED file mapping + write:
  Initial: PTE → page cache Frame (R/W=1 if mapping writable)
  Write:   No copy. At most a minor fault so the filesystem can mark the page dirty in the page cache.
  On msync(MS_SYNC): page written back to file.
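
The difference between the two mapping types can be observed with Python's mmap module. This is a minimal sketch assuming a Linux system; private_vs_shared() is a hypothetical helper name.

```python
import mmap
import os
import tempfile

def private_vs_shared():
    """Return the file contents after a MAP_PRIVATE write and after a MAP_SHARED write."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"AAAA")
        f.flush()
        fd = f.fileno()
        prot = mmap.PROT_READ | mmap.PROT_WRITE

        # Private mapping: the write goes to a private copy of the page.
        priv = mmap.mmap(fd, 4, flags=mmap.MAP_PRIVATE, prot=prot)
        priv[0:4] = b"BBBB"
        after_private = os.pread(fd, 4, 0)
        priv.close()

        # Shared mapping: the write dirties the page-cache page itself.
        shared = mmap.mmap(fd, 4, flags=mmap.MAP_SHARED, prot=prot)
        shared[0:4] = b"CCCC"
        shared.flush()                      # msync()
        after_shared = os.pread(fd, 4, 0)
        shared.close()
    return after_private, after_shared

if __name__ == "__main__":
    print(private_vs_shared())
```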

Dirty COW: CVE-2016-5195

Dirty COW is one of the most famous Linux kernel privilege escalation vulnerabilities, present in the kernel since 2007 and discovered/exploited in 2016. It exploits a race condition in the CoW write fault handler.

The vulnerability:

Dirty COW Race Condition
=========================

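Thread 1 (attacker):                    Thread 2 (attacker):
  while (1):                              while (1):
    write(proc_self_mem_fd, ...)            madvise(mmap_addr, len, MADV_DONTNEED)

Thread 1 writes through /proc/self/mem at offset mmap_addr, so the kernel
reaches the page via get_user_pages() with FOLL_WRITE|FOLL_FORCE.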

Goal: write to a read-only memory-mapped file (e.g., /etc/passwd)

Normal flow:
  write() → get_user_pages(FOLL_WRITE)
    → GUP sees read-only PTE
    → Triggers CoW: copies page, installs writable PTE
    → get_user_pages returns writable page
    → write() writes to the copied (anonymous) page
    → File not modified (correct behavior)

Race window (pre-fix kernels):
  write() → get_user_pages(FOLL_WRITE)
    Step 1: Lookup sees a read-only PTE; GUP triggers a write fault, which breaks CoW
    Step 2: To avoid looping, GUP drops FOLL_WRITE from its flags before retrying
  [THREAD 2 calls madvise(MADV_DONTNEED), which discards the CoW copy]
    Step 3: Retry finds no PTE and faults again, now as a READ fault (FOLL_WRITE is gone)
    Step 4: The read fault maps the file's page cache page
    Step 5: GUP returns the ORIGINAL page cache page (the file's page)
    Step 6: write() writes directly to the file's page cache page
    Step 7: File is modified (PRIVILEGE ESCALATION)

Practical exploit: map /etc/passwd read-only (MAP_PRIVATE), race to write "root::0:0:root:/root:/bin/sh"
over the root entry. Works because the write lands on whatever page GUP returned;
file permissions were only checked when the file was opened read-only.

Fix: Added a new GUP flag FOLL_COW that records "CoW has already been broken for this lookup". The retry keeps FOLL_WRITE, and a non-writable PTE is accepted only if FOLL_COW is set and the PTE is dirty, meaning it is the private copy. If MADV_DONTNEED zaps that copy, the retry triggers a fresh CoW fault instead of silently downgrading to a reference to the file page. Commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()").
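
The retry logic before and after the fix can be compared in a deterministic toy model. This is a simplified sketch, not kernel code and not an exploit: gup_write() and the page labels are hypothetical, the flag values are arbitrary, and the race is reduced to a single forced interleaving.

```python
FOLL_WRITE, FOLL_COW = 0x1, 0x2   # toy values, not the kernel's

def gup_write(fixed: bool, madvise_races: bool) -> str:
    """Return which page a forced write lands on: 'anon_copy' or 'file_page'."""
    pte = "file_page"             # read-only private file mapping, page present
    flags = FOLL_WRITE
    raced = False
    while True:
        # Lookup: may the current PTE be used with these flags?
        if pte is not None:
            if not flags & FOLL_WRITE:
                return pte        # read lookup: any present page will do
            if fixed and flags & FOLL_COW and pte == "anon_copy":
                return pte        # fixed kernel: only the CoW copy is accepted
        # Fault in the page, then loop and look it up again.
        if flags & FOLL_WRITE:
            pte = "anon_copy"     # write fault breaks CoW
            if fixed:
                flags |= FOLL_COW       # remember it; keep FOLL_WRITE
            else:
                flags &= ~FOLL_WRITE    # old kernel: drop the write requirement
        else:
            pte = "file_page"     # read fault maps the page-cache page
        if madvise_races and not raced:
            pte = None            # MADV_DONTNEED zaps the private copy
            raced = True

for fixed in (False, True):
    for races in (False, True):
        print(f"fixed={fixed} madvise_races={races} -> {gup_write(fixed, races)}")
```

Only the combination of the old logic and a racing madvise() ends on the file's page; in every other case the write lands on the private copy.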

CVE-2016-5195 has a CVSS v3 base score of 7.8. Working exploits (e.g. dirtycow.c) were published publicly, and the bug was reported after an exploit had already been captured in use in the wild.

CoW for Filesystems: Btrfs and ZFS

Btrfs CoW semantics:

Write to file block:
  1. Allocate new disk block B'
  2. Copy existing block B → B'
  3. Apply the modification to B'
  4. Update B-tree reference to point to B'
  5. Old block B becomes unreferenced → freed (unless a snapshot still references it)

Benefits:
  - Crash safety: file is never partially overwritten
  - Snapshots: O(1) — just refcount the B-tree root
  - Deduplication: multiple extents share the same block data

Drawbacks:
  - Write amplification (every write copies the block)
  - Fragmentation (data never overwrites in-place)
  - Crash recovery at mount: Btrfs has no conventional journal; it replays its log tree (used for fsync) instead of running fsck
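
The write path and the snapshot behavior can be sketched with a toy block store. This is an illustration only: CowStore is a hypothetical class, a "file" here is a single block pointer, and nothing in it reflects the actual Btrfs on-disk format.

```python
class CowStore:
    """Toy block store: a block is never modified after it is written."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.refs = {}     # block id -> number of pointers to it
        self.next_id = 0

    def _alloc(self, data: bytes) -> int:
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def create(self, data: bytes) -> int:
        return self._alloc(data)

    def snapshot(self, bid: int) -> int:
        """O(1): just take another reference to the same block."""
        self.refs[bid] += 1
        return bid

    def write(self, bid: int, data: bytes) -> int:
        """Write to a new block, then drop the reference to the old one."""
        new = self._alloc(data)
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            del self.blocks[bid], self.refs[bid]   # old block is freed
        return new

store = CowStore()
live = store.create(b"v1")
snap = store.snapshot(live)
live = store.write(live, b"v2")
print(store.blocks[snap], store.blocks[live])   # snapshot still sees the old data
```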

CoW for Containers: Overlay Filesystem

OverlayFS (container layers)
==============================

Container image:
  Layer 3 (overlay upper, read-write): [modified config.json]
  Layer 2 (overlay lower, read-only):  [base config.json, /bin/bash, /lib/]
  Layer 1 (overlay lower, read-only):  [/usr/, /etc/]

Read /bin/bash:
  OverlayFS checks upper layer → not found
  Falls through to lower layers → found in layer 2
  Returns lower layer page (no copy)

Write to /etc/hosts:
  Copy-up: copy /etc/hosts from lower layer to upper layer
  Then write to the copy in the upper layer
  Future reads get the upper layer version (shadows lower layer)

This is CoW at the filesystem level, not page-table level.
Each container gets its own upper layer; lower layers are shared read-only
across all containers using the same image.
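
The lookup and copy-up rules can be modeled with dictionaries standing in for layers. This is a sketch under that simplification; Overlay is a hypothetical class, and real OverlayFS also handles whiteouts, directories, and metadata.

```python
class Overlay:
    """Toy overlay: one private upper layer over shared read-only lower layers."""

    def __init__(self, lowers):
        self.lowers = lowers   # list of dicts, topmost lower layer first
        self.upper = {}        # per-container writable layer

    def read(self, path: str) -> bytes:
        if path in self.upper:
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]     # served from the shared layer, no copy
        raise FileNotFoundError(path)

    def append(self, path: str, data: bytes) -> None:
        if path not in self.upper:
            self.upper[path] = self.read(path)   # copy-up on first write
        self.upper[path] += data

image = [{"/bin/bash": b"ELF"}, {"/etc/hosts": b"127.0.0.1 localhost\n"}]
a, b = Overlay(image), Overlay(image)
a.append("/etc/hosts", b"10.0.0.1 db\n")
print(a.read("/etc/hosts"))   # container a sees its modified copy
print(b.read("/etc/hosts"))   # container b still sees the image's version
```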

KSM: Kernel Samepage Merging

KSM (mm/ksm.c) is the kernel daemon that finds anonymous pages with identical content across different processes and merges them into a single CoW-protected page:

KSM Operation
==============

1. Application calls madvise(addr, len, MADV_MERGEABLE)
   (or, since Linux 6.4, prctl(PR_SET_MEMORY_MERGE) for the whole process)

2. ksmd daemon periodically scans MADV_MERGEABLE VMAs:
   a. Checksum each candidate page; skip pages whose checksum changed
      since the last scan (they are too volatile to merge)
   b. Look the page up in the stable tree (already-merged KSM pages),
      then in the unstable tree (candidates); both are red-black trees
      ordered by a byte-by-byte comparison of page contents
   c. On a match: replace both PTEs with read-only PTEs
      pointing to a single "KSM page"

3. When either process writes to the merged page:
   Normal CoW: #PF → allocate new page → copy → install writable PTE
   The KSM page remains shared for other processes

Benefits: significant RAM savings for VMs running identical OS images
  100 VMs with 2GB each, sharing 1.5GB base OS = saves up to 99 × 1.5GB ≈ 148.5GB RAM
  (one shared copy of the 1.5GB must remain)

Drawbacks:
  - ksmd CPU overhead (tunable via pages_to_scan and sleep_millisecs)
  - CoW latency on first write to a merged page
  - Timing side channel (see security section)
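
The best-case saving in the VM example is simple arithmetic, sketched below. The helper name ksm_ideal_savings_gb() is hypothetical, and the result is an upper bound that assumes every identical page is actually found and merged.

```python
def ksm_ideal_savings_gb(n_vms: int, identical_gb_per_vm: float) -> float:
    """Upper bound on RAM saved when n_vms each hold the same identical_gb_per_vm."""
    # One merged copy has to stay resident, so only n_vms - 1 copies are saved.
    return (n_vms - 1) * identical_gb_per_vm

print(ksm_ideal_savings_gb(100, 1.5))
```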

Historical Context

Early BSD took the opposite route first: vfork() was added in 3BSD (1979) precisely because fork() then copied the entire address space, and vfork() avoids the copy by letting the child borrow the parent's address space until it calls exec() or exits. CoW fork() became common in later Unix virtual memory designs such as SunOS 4 (1988). The concept of CoW for filesystem semantics appears in the Locus distributed filesystem (1984) and was later popularized by ZFS (2004) and Btrfs (2007). The Linux kernel has had CoW fork since at least version 1.0. The Dirty COW vulnerability (CVE-2016-5195), discovered by Phil Oester, was in the kernel since 2007, a 9-year latent bug.

Production Examples

Process pool servers: Apache's prefork MPM and Unicorn (Ruby web server) use fork() to create worker processes from a pre-loaded parent. Workers share all the parent's code and data pages via CoW until they start serving requests. A Rails application with 200 MB of loaded code, forked into 20 workers, uses ~200 MB of actual RAM (shared) instead of 4,000 MB (if fully copied). Only the pages each worker actually writes to are duplicated.

Container startup time: Kubernetes container startup with OverlayFS copies only modified files ("copy-up"), not the entire image. Once the image layers are present on the node, a 1 GB image can start in a fraction of a second because only the files the container writes to are copied.

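KSM on virtualization hosts: KVM hosts can run KSM on guest memory. VMs running the same guest OS have many identical pages (kernel text, shared library text, zero pages), so savings can be substantial in homogeneous VM fleets (figures in the 20–40% range are sometimes cited), though the amount depends heavily on the workload. Because of the timing side channel described under Security Implications, merging pages across mutually untrusted tenants is a risk operators have to weigh.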
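
The arithmetic behind the process-pool example can be sketched as follows. The function and its dirtied_mb_per_worker parameter are hypothetical additions for illustration; real workers dirty pages gradually, so the shared figure is a lower bound that holds right after fork().

```python
def pool_memory_mb(shared_mb: float, workers: int, dirtied_mb_per_worker: float = 0.0):
    """Return (with_cow, fully_copied) memory estimates for a forked worker pool."""
    # With CoW: one shared copy plus whatever each worker has written to.
    with_cow = shared_mb + workers * dirtied_mb_per_worker
    # Without CoW: every worker carries its own full copy.
    fully_copied = workers * shared_mb
    return with_cow, fully_copied

print(pool_memory_mb(200, 20))        # right after fork
print(pool_memory_mb(200, 20, 10))    # after each worker dirtied 10 MB
```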

Debugging Notes

# Most kernels expose no general CoW-fault counter in /proc/vmstat; pgfault counts all faults
grep -E "^(pgfault|pgmajfault) " /proc/vmstat

# Count page faults for one program (all kinds, including CoW write faults)
perf stat -e 'page-faults' ./myprogram

# KSM statistics
cat /sys/kernel/mm/ksm/pages_shared     # KSM pages in use (each is one shared physical page)
cat /sys/kernel/mm/ksm/pages_sharing    # additional mappings sharing them, i.e. pages saved
cat /sys/kernel/mm/ksm/pages_unshared   # unique pages, scanned repeatedly without a match
cat /sys/kernel/mm/ksm/pages_volatile   # changing too fast to merge
cat /sys/kernel/mm/ksm/full_scans       # times daemon has scanned all mergeable regions

# Memory saved by KSM (approximate):
# pages_sharing * page size (4KB) = bytes saved

# Enable KSM for a process (requires MADV_MERGEABLE)
# In code: madvise(addr, len, MADV_MERGEABLE)
# Globally: echo 1 > /sys/kernel/mm/ksm/run

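# Check if CoW is occurring (fork + write pattern)
# strace cannot see CoW (it happens in page faults, not syscalls);
# watch a forked worker's Private_Dirty grow relative to its shared pages instead
grep -E '^(Shared_Clean|Shared_Dirty|Private_Dirty)' /proc/$(pidof -s unicorn)/smaps_rollup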

# Dirty COW check (vulnerable upstream kernels: < 4.8.3, < 4.7.9, < 4.4.26)
# Distribution kernels backport the fix, so the version alone is not conclusive
uname -r  # check kernel version
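
The version comparison can be automated with a short helper. This is a rough sketch that knows only the three stable branches named above plus the fact that 4.9 and later contain the fix; dirty_cow_status() is a hypothetical name, and since vendors backport patches, a "vulnerable" result for a distribution kernel is only a prompt to check the vendor advisory.

```python
import re

# First fixed release per stable branch, as listed above.
FIXED_IN = {(4, 8): 3, (4, 7): 9, (4, 4): 26}

def dirty_cow_status(release: str) -> str:
    """Classify an upstream kernel release string as patched/vulnerable/unknown."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not m:
        return "unknown"
    major, minor = int(m.group(1)), int(m.group(2))
    patch = int(m.group(3) or 0)
    if (major, minor) >= (4, 9):
        return "patched"
    fixed = FIXED_IN.get((major, minor))
    if fixed is None:
        return "unknown"          # branch not covered by the list above
    return "patched" if patch >= fixed else "vulnerable"

if __name__ == "__main__":
    import platform
    print(platform.release(), dirty_cow_status(platform.release()))
```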

Security Implications

Dirty COW (CVE-2016-5195): Full privilege escalation exploit. Any unprivileged local user can modify any file they are able to read, including /etc/passwd and SUID binaries. Patched in Linux 4.8.3, 4.7.9, 4.4.26 (stable branches). Working exploits were publicly available almost immediately after disclosure.

KSM timing side channel: A write to a merged KSM page takes a CoW fault (page copy + TLB flush), so it is measurably slower than a write to an unmerged page. An attacker fills a page with guessed content, waits for ksmd to scan, then times a write to it: a slow write means the page was merged, so some other process or VM holds a page with exactly that content. This has been demonstrated for cross-VM information disclosure. Mitigation: keep KSM off for security-sensitive workloads (do not mark memory MADV_MERGEABLE, or undo it with MADV_UNMERGEABLE).

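Speculative execution: CoW relies on page-table permission bits, and Meltdown-class attacks showed that transient execution can read data despite a failing permission check. A write-protected CoW page is already readable by the process that maps it, however, so CoW adds no read barrier to bypass; the general Spectre/Meltdown mitigations apply unchanged.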

OverlayFS security: Privilege escalation bugs have exploited copy-up in OverlayFS (e.g., CVE-2021-3493, an Ubuntu-specific OverlayFS flaw involving file capabilities set via setxattr from inside a user namespace). The copy-up operation must carefully preserve and check file capabilities.

Performance Implications

  • fork() time: proportional to the number of page-table entries that must be copied and write-protected, not to the amount of data. For a process with 1 GB of anonymous memory and 4 KB pages, fork() touches 262,144 PTEs, on the order of 1–5 ms.
  • First write after fork: Each first write to a shared CoW page costs one page fault plus a 4 KB copy (roughly 1 µs or more, depending on hardware). A process that modifies its entire 1 GB working set after fork pays for 262,144 CoW copies, about 262 ms at 1 µs each.
  • Read-only after fork: If the child immediately calls exec() (the typical fork()+exec() pattern), no CoW copies occur. The page tables are replaced atomically by exec. This is why fork()+exec() is fast even for large processes.
  • CoW and huge pages: On older kernels, a write to a CoW-protected THP (2 MB) copied the entire 2 MB page, not just 4 KB. Since Linux 5.8, the write fault instead splits the 2 MB mapping into 512 × 4 KB PTEs and copies only the 4 KB page that was touched.
  • Btrfs CoW overhead: Random writes on Btrfs are slower than on ext4 because each write triggers a block copy + B-tree update. For database workloads, nodatacow mount option or chattr +C on database files disables CoW semantics for specific files.
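
The page counts behind these estimates are easy to reproduce. In the sketch below the 1 µs per-fault cost is the assumption stated in the list above, not a measurement, and cow_estimate() is a hypothetical helper name.

```python
PAGE_SIZE = 4 * 1024            # 4 KB base page
THP_SIZE = 2 * 1024 * 1024      # 2 MB transparent huge page

def cow_estimate(bytes_mapped: int, per_fault_us: float = 1.0):
    """Return (pages, total_ms) for touching every page once after fork."""
    pages = bytes_mapped // PAGE_SIZE
    return pages, pages * per_fault_us / 1000.0

pages, total_ms = cow_estimate(1 << 30)       # 1 GB working set
print(pages, total_ms)                        # PTE count and total CoW cost in ms
print(THP_SIZE // PAGE_SIZE)                  # base pages per THP
```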

Failure Modes and Real Incidents

CoW storm after fork: A production incident at GitHub (2012): Resque background job system forked workers from a pre-loaded Rails master. After fork, each worker loaded job-specific data (say, 100 MB) and wrote over pages inherited from the master, causing about 100 MB of CoW copies per worker. With 50 workers, 5 GB of CoW copies occurred in a burst, exhausting the page allocator and triggering OOM. Fix: load data that can be shared read-only in the master before fork(), keep per-worker writes small, and use vfork()+exec() or posix_spawn() when the child only needs to start another program.

Dirty COW in container escapes: Dirty COW and related CoW race conditions have been used to escape Docker containers. Containers isolate processes with namespaces and cgroups but share the host kernel, so a kernel CoW bug is reachable from inside a container. Public proof-of-concept escapes used Dirty COW to overwrite the vDSO, a read-only page that is shared with processes on the host.

Btrfs data corruption on power failure: Early Btrfs versions had bugs that could corrupt data on unexpected power loss. Btrfs's CoW model is crash-safe in theory, because a new tree root becomes visible only when a transaction commits, but the implementation had ordering bugs between CoW allocation and transaction commit. Fixed in Linux 3.x series.

Modern Usage

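  • io_uring and CoW: Zero-copy send paths (the MSG_ZEROCOPY socket flag, and IORING_OP_SEND_ZC in io_uring) keep references to user pages while the network stack transmits from them. If the process forks or writes to those pages while a transfer is in flight, the kernel's CoW handling has to account for the extra references.
  • userfaultfd for CoW interception: userfaultfd's write-protect mode (UFFDIO_WRITEPROTECT, Linux 5.7+) delivers write faults on protected ranges to user space, which lets tools implement application-level CoW and snapshotting. CRIU also uses userfaultfd for lazy restore.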
  • Windows WSL2 and CoW: WSL2 (Windows Subsystem for Linux 2) runs a Linux kernel in a Hyper-V VM. fork() uses the same CoW mechanism as native Linux. The Hyper-V balloon driver may interfere with CoW pages if it tries to reclaim shared pages.
  • eBPF and CoW: bpf_probe_write_user() is an eBPF helper that writes to user-space memory from a kernel probe. It performs a non-faulting copy, so it cannot break CoW itself: a write to a page that is currently write-protected fails with an error instead of triggering a copy.

Future Directions

  • Hardware CoW: Future CPU extensions (e.g., RISC-V pointer masking + hardware CoW bits) could implement CoW tracking in hardware, eliminating the page-fault overhead.
  • Lazy CoW for THP: Instead of splitting a 2 MB THP on CoW write fault, track which 4 KB sub-pages have been written (using a bitmap in struct page). Only copy 4 KB. The 2 MB page remains a huge page but is partially independent. Work in progress.
  • CoW-aware GC: Garbage collectors that move objects (copying GC) interact badly with CoW — moving an object dirties its source page, preventing CoW sharing. Future GC designs may use write barriers that are CoW-aware, avoiding unnecessary dirty page creation.

Exercises

  1. Write a program that fork()s and immediately measures the child's memory via /proc/self/smaps_rollup (VmRSS in /proc/self/status does not distinguish shared from private pages). Confirm that the child starts with nearly zero Private_Dirty pages (CoW sharing). Then have the child write to all pages and measure again.
  2. Reproduce a simplified Dirty COW scenario on an old kernel (or in a VM): map a read-only file, race two threads (one writing via GUP, one calling MADV_DONTNEED). Observe whether the file changes.
  3. Enable KSM (echo 1 > /sys/kernel/mm/ksm/run). Launch 10 identical processes that each allocate and fill 100 MB of identical data and mark it with madvise(MADV_MERGEABLE). Measure the memory savings via /sys/kernel/mm/ksm/pages_sharing.
  4. Create a Btrfs filesystem, write a 1 GB file, take a snapshot, then modify the file. Use btrfs inspect-internal dump-tree (formerly btrfs-debug-tree) to observe the CoW blocks created for the modified extents.
  5. Measure the fork time for processes of different sizes (10 MB, 100 MB, 1 GB) using CLOCK_MONOTONIC wrapping the fork() call. Plot fork time vs address space size. Confirm O(page_tables) relationship.
  6. Implement user-space CoW using mprotect(PROT_READ) and a SIGSEGV handler: on first write, the signal handler copies the page to a private buffer and remaps the virtual address. Measure the overhead vs kernel CoW.
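
As a possible starting point for exercise 1, the sketch below parses the "Name: N kB" lines of /proc/<pid>/smaps_rollup into a dictionary. It assumes a kernel that provides smaps_rollup (Linux 4.14 or later), and parse_smaps_rollup() and the sample text are illustrative names and values only.

```python
def parse_smaps_rollup(text: str) -> dict:
    """Map field names such as 'Private_Dirty' to their value in kB."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name:  <digits> kB" lines; the header line is skipped.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields

def read_self() -> dict:
    with open("/proc/self/smaps_rollup") as f:
        return parse_smaps_rollup(f.read())

sample = (
    "00400000-7ffd1c5f1000 ---p 00000000 00:00 0    [rollup]\n"
    "Rss:                1520 kB\n"
    "Shared_Clean:       1400 kB\n"
    "Private_Dirty:       120 kB\n"
)
print(parse_smaps_rollup(sample)["Private_Dirty"])
```

Calling read_self() in the child right after fork() and again after it has written to its pages gives the two measurements the exercise asks for.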

References

  • mm/memory.c: do_wp_page(), wp_page_copy(), copy_page_range()
  • kernel/fork.c: dup_mmap(), copy_mm()
  • mm/ksm.c — Kernel Samepage Merging
  • fs/overlayfs/copy_up.c — OverlayFS copy-up implementation
  • mm/gup.c: get_user_pages(), FOLL_WRITE, FOLL_COW
  • CVE-2016-5195 patch: git log --oneline v4.8.2..v4.8.3 -- mm/ (Linux stable)
  • Phil Oester, "Dirty COW" disclosure: https://dirtycow.ninja/
  • Andrea Arcangeli, "GUP fast: get_user_pages without mmap_sem" — LWN
  • LWN: "The copy-on-write cow" — https://lwn.net/Articles/849638/
  • Btrfs design documentation: https://btrfs.wiki.kernel.org/index.php/Btrfs_design
  • Linux man pages: fork(2), madvise(2) (MADV_MERGEABLE, MADV_DONTNEED)