Overlay Filesystems

Technical Overview

OverlayFS is a Linux union filesystem that presents a merged view of multiple directory layers to processes. It is the primary mechanism by which container images achieve both storage efficiency (layers shared across many containers) and writability (each container gets its own writable layer on top of a shared read-only image). Understanding OverlayFS is essential for debugging container storage issues, understanding image build performance, and reasoning about copy-on-write semantics.

The core concept: OverlayFS presents a merged directory that combines a read-only lower layer (or multiple lower layers) with a read-write upper layer. Reads that find a file in the upper layer return the upper version; reads that don't find a file in upper fall through to lower layers. All writes go to the upper layer. Deletions are represented by special "whiteout" files in the upper layer.

Prerequisites

Linux VFS (Virtual Filesystem Switch) concepts
Inode fundamentals (inode numbers, hard links, file identity)
Understanding of container image layers
Basic filesystem operations (stat, rename, link)

Historical Context

Union filesystems have a long history in Linux. The original motivation was bootable LiveCD systems: mount a read-only squashfs from the CD, overlay a tmpfs for writes, and the user gets a fully writable system without modifying the CD.

Several union filesystem implementations preceded OverlayFS:

aufs (2006): "Another Union FileSystem," an out-of-tree patch set that was never merged into mainline. Docker used it as its initial storage driver because of its maturity, but it required a patched kernel.
UnionFS: An earlier out-of-tree implementation that saw limited adoption.
overlayfs (Miklos Szeredi): Submitted multiple times to the kernel and merged into Linux 3.18 in December 2014 after years of review cycles. This was a major event for the container ecosystem because it brought union filesystem support to the mainline kernel without patches.
Multiple lower layers (Linux 4.0, 2015): The initial OverlayFS supported only one lower layer. Linux 4.0 added support for multiple lower layers (up to a depth limit), enabling full container image layer stacking.

Docker migrated from aufs to overlay2 as the default storage driver once kernel support was widespread (circa 2017).

OverlayFS Core Operations

Mount Syntax

mount -t overlay overlay \
  -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
  /merged

lowerdir: Colon-separated list of read-only lower directories (rightmost is bottom, leftmost is top of lower stack)
upperdir: Read-write upper directory (must be on same filesystem as workdir)
workdir: Scratch directory used internally by OverlayFS during operations like rename and copy-up (must be empty, same filesystem as upper)
/merged: The merged view mountpoint
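
The ordering rule for lowerdir is easy to get backwards, so here is a minimal sketch that builds the option string from a list of layers given in build order (base layer first). The function name and its arguments are hypothetical helpers, not part of any tool, and the sketch simply rejects paths containing ':' or ',' rather than attempting to escape them.

```python
def overlay_mount_options(layers_bottom_to_top, upperdir, workdir):
    """Return the -o option string for an overlay mount.

    layers_bottom_to_top lists lower directories in build order
    (base layer first); lowerdir wants the opposite order, top first.
    """
    if not layers_bottom_to_top:
        raise ValueError("at least one lower directory is required")
    for path in layers_bottom_to_top:
        # Assumption of this sketch: refuse separators instead of escaping them.
        if ":" in path or "," in path:
            raise ValueError(f"unsupported character in path: {path!r}")
    lowerdir = ":".join(reversed(layers_bottom_to_top))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


print(overlay_mount_options(["/lower1", "/lower2"], "/upper", "/work"))
# lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work
```

The printed string matches the -o argument in the mount command above: /lower1 was added first, so it ends up rightmost (bottom of the stack).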

Read Operation

Process reads /merged/etc/nginx.conf

OverlayFS lookup:
1. Check upper (/upper/etc/nginx.conf)   → not found
2. Check lower2 (/lower2/etc/nginx.conf) → not found
3. Check lower1 (/lower1/etc/nginx.conf) → FOUND
   Return content from lower1

The lookup proceeds from upper down through the lower stack, so a file that exists only in the deepest layer costs O(N) checks in the number of layers. This is why very deep image layer stacks have measurable read overhead for cold-cache accesses (Docker's overlay2 driver caps a mount at 128 lower layers).
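
The lookup order can be modelled in user space with ordinary directories. The sketch below is only an illustration of the ordering, not how the kernel implements it: overlay_lookup is a hypothetical helper, it ignores whiteouts and opaque directories, and it follows the lowerdir=/lower2:/lower1 ordering from the mount example, where /lower1 is the bottom layer.

```python
import os
import tempfile


def overlay_lookup(relpath, upper, lowers_top_to_bottom):
    """Return (layer_dir, layers_checked) for the first layer holding relpath.

    Returns (None, layers_checked) if no layer has it. Whiteouts and
    opaque directories are deliberately ignored in this sketch.
    """
    checked = 0
    for layer in [upper, *lowers_top_to_bottom]:
        checked += 1
        if os.path.lexists(os.path.join(layer, relpath)):
            return layer, checked
    return None, checked


with tempfile.TemporaryDirectory() as root:
    upper, lower2, lower1 = (os.path.join(root, n) for n in ("upper", "lower2", "lower1"))
    for d in (upper, lower2, lower1):
        os.makedirs(os.path.join(d, "etc"))
    # The file exists only in the bottom layer, the worst case for lookup.
    with open(os.path.join(lower1, "etc", "nginx.conf"), "w") as f:
        f.write("worker_processes 1;\n")

    found, checked = overlay_lookup("etc/nginx.conf", upper, [lower2, lower1])
    print(os.path.basename(found), checked)  # lower1 3
```

The second return value makes the O(N) behaviour visible: a file in the bottom layer is found only after every layer above it has been checked.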

Write Operation

Process writes to /merged/var/log/nginx/access.log

If file exists in upper:
  → Write directly to upper file (fast path)

If file exists only in lower:
  → COPY-UP: copy entire file from lower to upper first
  → Then write to the upper copy
  (future reads see the upper version)

If file does not exist anywhere:
  → Create new file directly in upper

Delete Operation (Whiteout Files)

When a file that exists in a lower layer is deleted through the merged view, OverlayFS cannot remove the lower-layer file (it is read-only). Instead, it creates a whiteout file in the upper layer:

Lower layer has: /lower1/etc/hosts
Process deletes: /merged/etc/hosts

OverlayFS creates: /upper/etc/hosts  (with device number 0,0 — character device with major/minor both 0)
This is a whiteout file.

Subsequent lookup of /merged/etc/hosts:
1. Check upper: find whiteout file → stop, return ENOENT
   (lower layers not consulted)

Opaque directories: When a directory that exists in a lower layer is deleted and then recreated under the same name, OverlayFS marks the new upper directory as opaque by setting the trusted.overlay.opaque=y xattr. Lookups and directory listings then stop at the upper directory instead of merging in the same-named directory from lower layers.
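
When inspecting an upper directory from the host, a whiteout can be recognised from its stat fields alone. The following sketch assumes a Linux host; is_whiteout and path_is_whiteout are hypothetical helper names. The demonstration uses synthetic stat values so that it runs without root (creating a real character device requires privileges), and it does not cover the opaque xattr, which lives in the trusted namespace and normally needs privileges to read.

```python
import os
import stat


def is_whiteout(st_mode, st_rdev):
    """True if the stat fields describe a 0/0 character device."""
    return (
        stat.S_ISCHR(st_mode)
        and os.major(st_rdev) == 0
        and os.minor(st_rdev) == 0
    )


def path_is_whiteout(path):
    """Check a real path in an upper directory without following symlinks."""
    st = os.lstat(path)
    return is_whiteout(st.st_mode, st.st_rdev)


# Synthetic stat values, so the sketch runs without root privileges.
print(is_whiteout(stat.S_IFCHR, os.makedev(0, 0)))           # True
print(is_whiteout(stat.S_IFCHR | 0o666, os.makedev(1, 3)))   # False (a 1/3 device)
print(is_whiteout(stat.S_IFREG | 0o644, 0))                  # False (regular file)
```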

Copy-Up

Copy-up is the most expensive OverlayFS operation. It is triggered the first time a lower-layer file is opened for writing, or has its metadata changed, through the merged view:

File: /lower1/var/lib/mysql/ibdata1  (2GB InnoDB tablespace)
First write to this file via merged view:

1. Stat the lower file
2. Create parent directories in upper if they don't exist
3. Create a temporary file in workdir with the same attributes (mode, owner, times, xattrs)
4. COPY entire file content from lower to the temporary file (2GB read + 2GB write!)
5. Atomically rename the temporary file to its final path in upper
6. Now writes go to the upper copy

Copy-up is atomic (rename-based) but it is not incremental — the entire file is copied. For large files, this causes significant latency on the first write. This is why containers that use large database files perform poorly when the DB data files are in the container's overlay layer — they incur a multi-gigabyte copy-up on first modification.
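
The copy-then-rename pattern can be sketched in user space as follows. This is an illustration of the technique, not the kernel's code path: copy_up is a hypothetical helper, shutil.copy2 preserves mode and timestamps but not ownership, and parent directories are created with default attributes rather than copied from lower.

```python
import os
import shutil
import tempfile


def copy_up(relpath, lower, upper, workdir):
    """Copy lower/relpath to upper/relpath via a temp file in workdir."""
    src = os.path.join(lower, relpath)
    dst = os.path.join(upper, relpath)
    os.makedirs(os.path.dirname(dst), exist_ok=True)   # parent dirs in upper
    fd, tmp = tempfile.mkstemp(dir=workdir)             # workdir shares a filesystem with upper
    os.close(fd)
    try:
        shutil.copy2(src, tmp)    # whole-file copy: cost grows with file size
        os.rename(tmp, dst)       # atomic publish: no partial file is ever visible
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return dst


with tempfile.TemporaryDirectory() as root:
    lower, upper, work = (os.path.join(root, n) for n in ("lower", "upper", "work"))
    os.makedirs(os.path.join(lower, "var", "lib"))
    os.makedirs(upper)
    os.makedirs(work)
    with open(os.path.join(lower, "var", "lib", "data"), "w") as f:
        f.write("original\n")

    target = copy_up("var/lib/data", lower, upper, work)
    with open(target, "a") as f:
        f.write("modified\n")

    with open(os.path.join(lower, "var", "lib", "data")) as f:
        print(repr(f.read()))   # 'original\n' (lower is untouched)
    with open(target) as f:
        print(repr(f.read()))   # 'original\nmodified\n'
```

Because the rename is the only step that makes the file visible in upper, an interrupted copy leaves at most a stray temporary file in workdir.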

Mitigation: Keep database data directories on volumes or bind mounts (bypassing OverlayFS entirely). This is standard practice for stateful containers.

OverlayFS in Docker: overlay2 Storage Driver

Docker's overlay2 storage driver organizes layers under /var/lib/docker/overlay2/:

/var/lib/docker/overlay2/
├── <layer-sha256-1>/
│   ├── diff/          ← layer contents (the actual filesystem changes)
│   ├── link           ← file containing the layer's short ID (the short-ID symlinks live under overlay2/l/)
│   └── work/          ← OverlayFS work directory
│
├── <layer-sha256-2>/
│   ├── diff/
│   ├── link
│   ├── lower          ← text file listing parent layer short IDs
│   └── work/
│
└── <container-id>/
    ├── diff/          ← container's writable upper layer
    ├── link
    ├── lower          ← all image layers listed (colon-separated short IDs)
    ├── merged/        ← the live mountpoint (when container is running)
    └── work/

OverlayFS Directory Structure Diagram

Docker Image: nginx (3 layers)

Layer 3 (top image layer: nginx config)
/var/lib/docker/overlay2/<sha3>/diff/
├── etc/nginx/nginx.conf
└── etc/nginx/conf.d/default.conf

Layer 2 (nginx binaries layer)
/var/lib/docker/overlay2/<sha2>/diff/
└── usr/sbin/nginx

Layer 1 (base OS layer)
/var/lib/docker/overlay2/<sha1>/diff/
├── bin/
├── etc/
│   └── hosts
├── lib/
└── usr/

Container writable layer (upper)
/var/lib/docker/overlay2/<container-id>/diff/
└── (empty at start; writes accumulate here)

OverlayFS mount (when container runs):
lowerdir=<sha3>/diff:<sha2>/diff:<sha1>/diff
upperdir=<container-id>/diff
workdir=<container-id>/work
merged=<container-id>/merged   ← container's rootfs

Container sees at merged/:
├── bin/             ← from layer 1
├── etc/
│   ├── hosts        ← from layer 1
│   └── nginx/       ← from layer 3
│       ├── nginx.conf
│       └── conf.d/
├── lib/             ← from layer 1
└── usr/
    ├── lib/         ← from layer 1
    └── sbin/
        └── nginx    ← from layer 2

The power of the layered model is that image layers are shared across all containers that use the same image:

100 containers running nginx:latest

/var/lib/docker/overlay2/<sha1>  ← base layer: 1 copy on disk, referenced by all 100
/var/lib/docker/overlay2/<sha2>  ← nginx layer: 1 copy on disk, referenced by all 100
/var/lib/docker/overlay2/<sha3>  ← config layer: 1 copy on disk, referenced by all 100

Per-container storage (100 separate writable layers):
/var/lib/docker/overlay2/<container1-id>/diff  ← empty (or small diffs)
/var/lib/docker/overlay2/<container2-id>/diff  ← empty (or small diffs)
...
/var/lib/docker/overlay2/<container100-id>/diff

Disk usage: 3 shared layers + 100 tiny writable layers
           (not: 100 × full image size)
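
The saving is simple arithmetic. The sketch below compares the two cases; the layer sizes and per-container write volume are hypothetical numbers chosen for illustration, not measurements.

```python
def disk_usage_mb(layer_sizes_mb, containers, writable_mb_per_container):
    """Return (shared, unshared) disk usage in MB."""
    image_mb = sum(layer_sizes_mb)
    # Shared: one copy of the image plus one writable layer per container.
    shared = image_mb + containers * writable_mb_per_container
    # Unshared: every container carries its own full copy of the image.
    unshared = containers * (image_mb + writable_mb_per_container)
    return shared, unshared


# Hypothetical sizes: three layers of 80, 50 and 1 MB, 2 MB written per container.
shared, unshared = disk_usage_mb([80, 50, 1], containers=100, writable_mb_per_container=2)
print(shared, unshared)  # 331 13300
```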

OverlayFS Limitations and Quirks

Inode Renumbering

Files in different layers get different inode numbers when viewed through the merge, even for the same logical file. More critically: when a file is copied up from lower to upper, it gets a new inode number. Applications that cache inode numbers (some build tools, some database recovery mechanisms) can be confused by this.

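redirect_dir mount option (Linux 4.10+): Improves handling of directory renames. Without it, renaming a directory that exists in a lower layer fails with EXDEV, and tools such as mv fall back to copying the whole subtree and deleting the original. With it, OverlayFS records the original directory path as an xattr on the renamed upper directory, so the rename succeeds without a full directory copy-up.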

Hard Link Semantics

Hard links between files in lower layers do not survive copy-up. When one name is modified (triggering copy-up), only that path is copied to upper; the other names still refer to the unmodified lower file, so the paths no longer share an inode or content. This breaks hard-link-based atomic file replacement patterns used by some package managers (rpm, dpkg with hard-link deduplication).
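
The effect can be reproduced with plain directories. This is a user-space model of the default behaviour described above, assuming a file-level copy-up of a single path; merged_path and hard_link_demo are hypothetical helpers, and no real overlay mount is involved.

```python
import os
import shutil
import tempfile


def merged_path(name, upper, lower):
    """Resolve name the way the merged view would: upper wins over lower."""
    candidate = os.path.join(upper, name)
    return candidate if os.path.lexists(candidate) else os.path.join(lower, name)


def hard_link_demo(root):
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.makedirs(lower)
    os.makedirs(upper)

    # Two names for one inode in the lower layer.
    with open(os.path.join(lower, "a"), "w") as f:
        f.write("v1\n")
    os.link(os.path.join(lower, "a"), os.path.join(lower, "b"))

    # Modify "a" through the merged view: copy it up, then write to the copy.
    shutil.copy2(os.path.join(lower, "a"), os.path.join(upper, "a"))
    with open(os.path.join(upper, "a"), "a") as f:
        f.write("v2\n")

    contents = {}
    for name in ("a", "b"):
        with open(merged_path(name, upper, lower)) as f:
            contents[name] = f.read()
    same_inode = (
        os.stat(merged_path("a", upper, lower)).st_ino
        == os.stat(merged_path("b", upper, lower)).st_ino
    )
    return contents, same_inode


with tempfile.TemporaryDirectory() as root:
    contents, same_inode = hard_link_demo(root)
    print(contents)     # {'a': 'v1\nv2\n', 'b': 'v1\n'}
    print(same_inode)   # False
```

After the write, "a" and "b" have diverged: an application that expected both names to show the new content sees stale data under "b".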

`d_type` Requirement

OverlayFS requires the underlying filesystem to support d_type (directory entry type in readdir()). XFS formatted without ftype support, and some network filesystems, do not support d_type. Docker detects this and falls back to the slower vfs storage driver.

NFS and Remote Filesystems

OverlayFS upper and work directories cannot be on NFS or other network filesystems — they must be local filesystems. Only lower layers can be on remote filesystems.

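Layer Depth Limit

Docker's overlay2 driver supports at most 128 lower layers in a single OverlayFS mount. The kernel's own cap on lower layers (OVL_MAX_STACK in fs/overlayfs/, 500 in recent kernels) is higher, so the Docker limit is the one image builds run into first. Docker image builds should stay well under this. A docker history with 128+ layers hits this limit.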

Layer Model Performance Analysis

Performance characteristics by operation:

Read (cached):    O(1) — page cache, same as regular fs
Read (cold):      O(N) layers traversed in worst case (file in deepest layer)
Write (new file): O(1) — direct write to upper
Write (modified): O(1) if file already in upper
                  O(file_size) COPY-UP if file only in lower
Delete:           O(1) — whiteout file creation
Rename (dir):     O(subtree_size) without redirect_dir
                  O(1) with redirect_dir xattr optimization

Image build layer analysis:
- Each Dockerfile RUN, COPY, and ADD instruction creates a new layer
- Many small layers: each layer adds lookup overhead
- Fewer, larger layers: better read performance, worse build caching

Container Image Deduplication

Content-addressable storage means layers are identified by their SHA256 digest. If two different images both have a layer with the same content (same digest), the layer is stored once on disk and referenced by both image manifests.

In practice, base image layers (OS packages, common libraries) are heavily shared across many images. The space saving on a production node running dozens of different services from similar base images can be substantial — often 60-80% space reduction compared to storing each image independently.
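
A toy content-addressable store shows why identical layers cost nothing extra. This is a sketch under simplifying assumptions: LayerStore is a hypothetical class, layers are plain byte strings, and the digest is computed directly over those bytes, whereas real image stores hash layer tarballs.

```python
import hashlib


class LayerStore:
    """Toy content-addressable store: one blob per distinct digest."""

    def __init__(self):
        self.blobs = {}    # digest -> layer bytes
        self.images = {}   # image name -> ordered list of digests

    def add_image(self, name, layers):
        digests = []
        for data in layers:
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)   # stored once per digest
            digests.append(digest)
        self.images[name] = digests

    def stored_bytes(self):
        """Bytes actually held by the store."""
        return sum(len(blob) for blob in self.blobs.values())

    def naive_bytes(self):
        """Bytes needed if every image kept private copies of its layers."""
        return sum(len(self.blobs[d]) for ds in self.images.values() for d in ds)


store = LayerStore()
base = b"B" * 1000                           # stand-in for a shared base OS layer
store.add_image("service-a", [base, b"A" * 10])
store.add_image("service-b", [base, b"C" * 10])
print(len(store.blobs), store.stored_bytes(), store.naive_bytes())  # 3 1020 2020
```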

OverlayFS vs Other Storage Drivers

Driver	Kernel support	CoW mechanism	Performance	Notes
overlay2	mainline 4.0+	File-level copy-up	Good	Default; recommended
aufs	Out-of-tree	File-level copy-up	Comparable	Deprecated, requires patched kernel
devicemapper	mainline	Block-level CoW	Worse for metadata	Complex setup, largely replaced
btrfs	mainline	Subvolume snapshots	Excellent for metadata	Requires btrfs-formatted storage
zfs	Out-of-tree	ZFS snapshots	Excellent	Requires ZFS kernel module
vfs	None	Full copies (no CoW!)	Poor	Fallback only; copies entire image for each container

Production Examples

Inspect layer structure of a running container:

# Find container's OverlayFS mount
docker inspect <container> | jq '.[0].GraphDriver'
# Shows LowerDir, UpperDir, MergedDir, WorkDir paths

# Examine the actual mount
grep overlay /proc/mounts | grep <container-id-prefix>

# See upper layer changes
ls /var/lib/docker/overlay2/<container-id>/diff/
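
For scripted checks, the overlay line from /proc/mounts can be split into its layers. The parser below is a minimal sketch: parse_overlay_options is a hypothetical helper, the sample line is made up for illustration, and paths containing commas or escaped characters are not handled.

```python
def parse_overlay_options(mounts_line):
    """Parse one /proc/mounts line; return None if it is not an overlay mount."""
    fields = mounts_line.split()
    if len(fields) < 4 or fields[2] != "overlay":
        return None
    opts = {}
    for item in fields[3].split(","):
        key, _, value = item.partition("=")
        opts[key] = value
    return {
        "mountpoint": fields[1],
        # lowerdir is listed top-first, exactly as passed to mount.
        "lowerdirs": [p for p in opts.get("lowerdir", "").split(":") if p],
        "upperdir": opts.get("upperdir"),
        "workdir": opts.get("workdir"),
    }


# Made-up sample line in /proc/mounts format.
sample = (
    "overlay /var/lib/docker/overlay2/abc/merged overlay "
    "rw,relatime,lowerdir=/var/lib/docker/overlay2/l/AAA:/var/lib/docker/overlay2/l/BBB,"
    "upperdir=/var/lib/docker/overlay2/abc/diff,"
    "workdir=/var/lib/docker/overlay2/abc/work 0 0"
)
info = parse_overlay_options(sample)
print(len(info["lowerdirs"]), info["upperdir"])
# 2 /var/lib/docker/overlay2/abc/diff
```

Counting the lowerdirs entries this way gives the layer depth of a running container.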

Find where disk space is consumed:

# Total overlay2 usage
du -sh /var/lib/docker/overlay2/

# Find largest layers
du -sh /var/lib/docker/overlay2/*/diff | sort -rh | head -20

# Dangling layers (not referenced by any image or container)
docker system df -v
docker image prune    # remove dangling images and their layers

Understand why a docker build is slow:

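# Build with --no-cache to measure fresh build time
time docker build --no-cache -t myimage .

# Count filesystem layers in the final image
docker image inspect --format '{{len .RootFS.Layers}}' myimage

# Watch which files a container process opens for writing
# (the first such open of a lower-layer file triggers a copy-up;
# paths are shown as the container sees them)
strace -f -e trace=openat -p <container-pid> 2>&1 | grep -E 'O_WRONLY|O_RDWR'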

Debugging Notes

"too many levels of symbolic links": OverlayFS with redirect_dir and many layers can generate deep symlink chains that hit the kernel limit. Reduce layer count or disable redirect_dir.
"no space left on device" but df shows space: Could be inode exhaustion. OverlayFS on ext4 can exhaust inodes when many small files are created across many layers. Check df -i.
Unexpected file permissions: After copy-up, the upper copy keeps the ownership and permissions of the lower original. If the owning UID has no mapping in the container's user namespace, the file appears to be owned by nobody.
Database refusing to start: Check if the DB data directory is on OverlayFS (it should be on a volume or bind mount). Databases (MySQL, PostgreSQL) often fail or perform terribly on OverlayFS due to copy-up of large data files.
stale NFS handle errors: Using NFS as the Docker storage root. Not supported for overlay2 upper layers. Use local storage.
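
The inode check from the notes above (df -i) can also be done programmatically. A minimal sketch using os.statvfs follows; inode_usage is a hypothetical helper, and it returns None for filesystems that do not report a fixed inode count.

```python
import os


def inode_usage(path):
    """Return the fraction of inodes in use on the filesystem holding path.

    Returns None when the filesystem does not report a fixed inode count
    (f_files == 0), as some filesystems do.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return (st.f_files - st.f_ffree) / st.f_files


usage = inode_usage("/")
print("inode usage unavailable" if usage is None else f"{usage:.1%} of inodes in use")
```

On a Docker host the interesting path is the storage root (for example /var/lib/docker), since that is the filesystem the upper layers consume inodes from.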

Security Implications

Layer content inspection: The lower layers of a running container are readable from the host by root. Sensitive secrets baked into image layers are accessible from the host. This is why secrets should not be in image layers.
Upper layer writable by host root: The container's writable upper layer (diff/) is accessible from the host. A host root process can modify it, injecting code into the running container. Containers do not protect their filesystem from host root access.
Layer sharing and side channels: Because containers that share lower layers also share page cache entries for those files, cache-timing side channels could in principle leak access patterns between containers. This is mainly a concern for high-security, multi-tenant environments.
Whiteout file leakage: If an image layer contains a whiteout for a lower layer file, the original file content is still visible on the lower layer in the content store. Do not rely on "deleting" sensitive files in a Dockerfile layer to hide them.
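
To audit an image for files that were "deleted" in a later layer, the layer tarballs can be scanned for whiteout entries. In image layer tarballs a deletion is recorded as an empty file named .wh.<name>, and an opaque directory as .wh..wh..opq (this differs from the 0/0 character device used on disk). The sketch below builds a tiny in-memory layer for demonstration; deleted_paths and make_layer are hypothetical helpers and the file names are made up.

```python
import io
import posixpath
import tarfile

WHITEOUT_PREFIX = ".wh."
OPAQUE_MARKER = ".wh..wh..opq"


def deleted_paths(layer_tar):
    """Return the paths that a layer tarball marks as deleted."""
    deleted = []
    for member in layer_tar.getmembers():
        dirname, base = posixpath.split(member.name)
        if base == OPAQUE_MARKER:
            continue   # opaque-directory marker, not a single-file deletion
        if base.startswith(WHITEOUT_PREFIX):
            deleted.append(posixpath.join(dirname, base[len(WHITEOUT_PREFIX):]))
    return deleted


def make_layer(names):
    """Build an in-memory tarball with zero-length entries (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


layer = make_layer(["etc/app.conf", "etc/.wh.secret.key", "tmp/.wh..wh..opq"])
print(deleted_paths(layer))  # ['etc/secret.key']
```

Any path reported here still exists, with its full content, in an earlier layer of the same image.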

Performance Implications

Copy-up latency: The first write to any lower-layer file adds the full file copy overhead. For large files, this can take seconds. Design containers to keep large mutable data on volumes or bind mounts.
Deep layer stacks: Each cold read traverses up to N layers. Merge common layers in Dockerfile (RUN apt-get install ... && apt-get clean && ... in one step rather than separate RUN instructions).
Page cache sharing: Lower layers (read-only) are backed by the page cache normally. Multiple containers using the same lower layer share the same page cache entries — good for memory efficiency.
Write-heavy workloads: OverlayFS is appropriate for mostly-read workloads. Write-heavy container workloads (CI/CD build containers writing many files) should use ephemeral volumes on local SSD rather than the container's overlay layer.

Failure Modes

Failure	Symptom	Cause
Copy-up on large file	Container hangs on first write	Large file being copied from lower to upper; use a volume or bind mount
Inode exhaustion	`ENOSPC` despite free blocks	Too many small files; check `df -i`, resize or tune ext4 inode ratio
Layer depth exceeded	Container fails to start	More lower layers than the overlay2 driver supports (128); flatten image
d_type missing	Docker falls back to vfs driver	Underlying fs doesn't support d_type; reformat with ftype=1
Upper dir not on local fs	Mount fails	NFS/CIFS upper dir; use local storage for Docker root
Rename across layers broken	Application errors on file rename	Directory rename returns EXDEV without redirect_dir; enable redirect_dir or use a volume or bind mount

Modern Usage

BuildKit layer optimization: Docker BuildKit analyzes Dockerfile instructions to maximize layer reuse and minimize copy-up during builds. Tools such as kaniko and ko build images without a Docker daemon.
OCI layer streaming: Projects like estargz and nydus implement lazy-loading of container image layers — only pull the file blocks that are actually accessed, rather than pulling entire layers before starting
OverlayFS for build caches: CI systems mount read-only base layers as overlayfs lower dirs and writable upper dirs per build step, then cache and reuse upper dirs as new lower layers for subsequent steps
User namespace + OverlayFS: Linux 5.11+ supports OverlayFS mounting inside user namespaces, enabling rootless overlayfs (previously required root or fuse-overlayfs fallback)

Future Directions

Composefs: A project designed for container images that provides immutable, content-addressed storage with per-file fs-verity integrity verification. It was first proposed as a standalone kernel filesystem; the current design instead combines an EROFS metadata image with OverlayFS, relying on OverlayFS features added in the 6.x kernel series. It is intended as a better alternative to squashfs for container image lower layers.
OverlayFS volatile upper: The volatile mount option (available since Linux 5.10) omits sync calls to the upper filesystem for containers that do not need a crash-consistent upper layer (disposable containers), which can significantly improve write performance.
Chunk-level deduplication: Moving beyond file-level sharing to block/chunk-level deduplication in container image stores, reducing storage further for images with similar but not identical layers.

Exercises

Mount an OverlayFS manually: create lower, upper, work, and merged directories. Mount the overlay. Create files in lower, then modify one through merged. Observe the copy-up in the upper directory. Check the whiteout file after deleting a lower-layer file.
Measure copy-up cost: create a 500MB file in the lower layer. Time the first write to this file through the merged view. Compare to a direct write to upper.
Use docker history nginx:latest to see all layers. Find the corresponding directories under /var/lib/docker/overlay2/. Verify the lower file in each layer directory points to the correct parent layers.
Build a Dockerfile that intentionally creates 5 layers. Run a container from it and observe the OverlayFS mount with cat /proc/mounts. Count the lowerdir entries.
Create a container that writes many small files. Monitor inode usage with df -i. At what point does inode exhaustion become a concern?
Compare disk space usage: pull the same image for 10 containers vs. having 10 independent copies of the filesystem. Measure the actual disk savings from layer sharing.

References

OverlayFS kernel documentation: Documentation/filesystems/overlayfs.rst
Miklos Szeredi's OverlayFS design documents (kernel mailing list archives)
Docker overlay2 storage driver documentation: docs.docker.com/storage/storagedriver/overlayfs-driver/
Linux kernel source: fs/overlayfs/
mount(8) man page, overlayfs section
Composefs design document: lkml archives, 2022
estargz (Stargz Snapshotter): github.com/containerd/stargz-snapshotter
Container storage interface (CSI) specification for volumes bypassing OverlayFS