06 — Process Isolation and Linux Namespaces
Technical Overview
Linux namespaces are the kernel mechanism that makes containers possible. Each namespace type wraps one category of global kernel resource (the process tree, the network stack, the filesystem mount table, the hostname, IPC objects, user/group IDs, cgroup views, or the monotonic and boot-time clocks) and provides a private instance of that resource to a set of processes. Containers are fundamentally just processes running in a coordinated set of namespaces, combined with cgroup resource limits and a private root filesystem (set up with pivot_root or chroot).
There are currently 8 namespace types in the mainline kernel. Understanding each — what it isolates, how it is created, and where it interacts with security boundaries — is essential for container runtime authors, platform engineers, and anyone doing advanced Linux systems work.
Prerequisites
- 01-process-concept.md: task_struct, /proc/PID/ns/
- 02-fork-and-exec.md: clone(), CLONE_NEW* flags
- Basic networking (network interfaces, routing tables, iptables)
- Mount/filesystem concepts (VFS, mount points)
Core Content
Namespace Overview
Linux Kernel Global Resources
| Resource | Namespace | clone flag | unshare(1) option |
|---|---|---|---|
| Process IDs | PID ns | CLONE_NEWPID | unshare --pid |
| Network stack | Net ns | CLONE_NEWNET | unshare --net |
| Mount table | Mount ns | CLONE_NEWNS | unshare --mount |
| Hostname | UTS ns | CLONE_NEWUTS | unshare --uts |
| SysV IPC / POSIX message queues | IPC ns | CLONE_NEWIPC | unshare --ipc |
| UID/GID mapping | User ns | CLONE_NEWUSER | unshare --user |
| Cgroup view | Cgroup ns | CLONE_NEWCGROUP | unshare --cgroup |
| System clocks | Time ns | CLONE_NEWTIME | unshare --time |
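A process belongs to exactly one namespace of each type. Namespace membership changes via three system calls:
- clone(CLONE_NEW*): the new child process starts in a new namespace (CLONE_NEWTIME is the exception: it is accepted by clone3() and unshare(), but not by the legacy clone())
- unshare(CLONE_NEW*): the calling process moves itself to a new namespace; for PID and time namespaces the caller stays where it is and only the children it creates afterwards enter the new namespace
- setns(fd, nstype): the calling process joins an existing namespace identified by fd (for a PID namespace, again only its future children)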
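As a small illustration of how the CLONE_NEW* flags combine, the sketch below ORs flag values together and checks whether the result can be passed to the legacy clone() call. The flag values are the ones defined in the kernel's include/uapi/linux/sched.h; the helper names (`combine_flags`, `usable_with_legacy_clone`) are hypothetical and exist only for this example. CLONE_NEWTIME shares its bit with the exit-signal byte of clone(), which is why it only works with clone3() and unshare().

```python
# Namespace flag values from the kernel's include/uapi/linux/sched.h.
NS_FLAGS = {
    "time":   0x00000080,  # CLONE_NEWTIME
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}

# The low byte of the legacy clone() flags argument carries the exit signal.
CSIGNAL = 0x000000FF


def combine_flags(kinds):
    """OR together the CLONE_NEW* flags for the given namespace kinds."""
    flags = 0
    for kind in kinds:
        flags |= NS_FLAGS[kind]
    return flags


def usable_with_legacy_clone(flags):
    """True if no flag collides with the exit-signal byte of clone()."""
    return (flags & CSIGNAL) == 0


if __name__ == "__main__":
    container_flags = combine_flags(["pid", "net", "mnt", "uts", "ipc"])
    print(hex(container_flags))
    print(usable_with_legacy_clone(container_flags))
    print(usable_with_legacy_clone(NS_FLAGS["time"]))
```

The same integer could be handed to unshare(2), for example through os.unshare() on Python 3.12 or later, given sufficient privileges.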
Namespace identity is tracked via inode numbers in /proc/PID/ns/:
ls -la /proc/self/ns/
# lrwxrwxrwx 1 user user 0 cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 ipc -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 user user 0 mnt -> 'mnt:[4026531841]'
# lrwxrwxrwx 1 user user 0 net -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 pid -> 'pid:[4026531836]'
# lrwxrwxrwx 1 user user 0 time -> 'time:[4026531834]'
# lrwxrwxrwx 1 user user 0 user -> 'user:[4026531837]'
# lrwxrwxrwx 1 user user 0 uts -> 'uts:[4026531838]'
Two processes with identical inode numbers for a given namespace type are in the same
namespace. A namespace normally disappears when its last member process exits; to keep
it alive without any processes, hold an fd open to /proc/PID/ns/TYPE or bind-mount that
file somewhere.
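The inode comparison described above can be sketched in a few lines. This is a minimal example, assuming a Linux /proc layout as shown in the listing; the function names are hypothetical. Reading another process's ns links may require the same permissions as ptrace access to it.

```python
import os
import re

# Link targets look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/PID/ns/* link target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_ns_inodes(pid="self"):
    """Map each entry in /proc/PID/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    inodes = {}
    for name in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        inodes[name] = inode
    return inodes


def shared_namespaces(a, b):
    """Return the entry names whose inode numbers match in both mappings."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])


if __name__ == "__main__":
    if os.path.isdir("/proc/self/ns"):
        mine = read_ns_inodes("self")
        print(shared_namespaces(mine, read_ns_inodes(os.getppid())))
```

Comparing a shell on the host with a containerized process this way would typically show no shared entries, or only `user` when the container does not use a user namespace.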
PID Namespace
Isolates the process ID number space. Processes inside the namespace see a private PID tree starting at PID 1. The host kernel still assigns unique global PIDs.
Host (initial namespace)              Container PID namespace
─────────────────────────             ──────────────────────────
PID 1: systemd                        PID 1: container-init (= host PID 47823)
PID 2: kthreadd                       PID 2: nginx worker   (= host PID 47824)
...                                   PID 3: nginx worker   (= host PID 47825)
PID 47822: containerd
PID 47823: container-init ──────► appears as PID 1 inside namespace
PID 47824: nginx worker   ──────► appears as PID 2 inside namespace
PID 47825: nginx worker   ──────► appears as PID 3 inside namespace
Key properties:
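- PID 1 semantics inside the namespace: if PID 1 in the namespace exits, the kernel
sends SIGKILL to all other processes in that namespace. PID 1 also inherits orphaned
processes and must reap them, and signals sent from inside the namespace are ignored
unless PID 1 has installed a handler for them. This is why containers benefit from a
proper init (e.g., tini) rather than running the application directly as PID 1.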
- /proc visibility: inside a PID namespace, /proc shows only the processes in
that namespace (if a new mount namespace is used with a private /proc mount).
- Nested namespaces: PID namespaces are hierarchical. A process in an outer namespace
can see processes in inner namespaces (with their outer PIDs). Inner namespaces cannot
see outer processes.
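- getpid() inside the container returns the namespace-local PID (1, 2, 3...).
The NSpid: line of /proc/PID/status lists the process's PID in each PID namespace it
belongs to, outermost first, so reading it on the host shows both the host-global and
the container-local PID.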
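To make the role of a container init concrete, here is a minimal sketch of what a tini-like PID 1 does: start the real workload as a child, forward termination signals to it, reap every child that exits (including orphans that get re-parented to PID 1), and return the workload's exit status. The function name `run_as_init` is hypothetical, and the sketch ignores details a real init handles, such as signals arriving before the handlers are installed.

```python
import os
import signal


def run_as_init(argv):
    """Run argv as a child, forward signals, reap children, return its exit code."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        # Pass the signal on to the workload; it may already be gone.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # wait() returns for any child, so adopted orphans are reaped too.
            pid, status = os.wait()
            if pid == child:
                return os.waitstatus_to_exitcode(status)
    finally:
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)


if __name__ == "__main__":
    print(run_as_init(["sh", "-c", "exit 3"]))
```

When the function returns and the init process exits, the kernel kills whatever is left in the PID namespace, which matches the SIGKILL behavior described in the list above.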
# Create a new PID namespace (requires root or user ns with mapping):
unshare --pid --fork --mount-proc /bin/sh
# Inside: 'echo $$' → 1
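The NSpid: field mentioned in the key properties can be read programmatically. The sketch below parses it from the text of a status file; the sample text is made up to match the diagram above (host PID 47824, container PID 2), and `parse_nspid` is a hypothetical helper name.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid: line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


if __name__ == "__main__":
    # Illustrative excerpt of /proc/47824/status as read on the host.
    sample = "Name:\tnginx\nPid:\t47824\nNSpid:\t47824\t2\n"
    pids = parse_nspid(sample)
    print("host PID:", pids[0], "container PID:", pids[-1])
```

A process that is not in any nested PID namespace has a single entry, so the first and last elements coincide.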
Network Namespace
Each network namespace has a complete, independent network stack:
- Network interfaces (including lo)
- Routing tables
- Firewall rules (iptables / nftables)
- Network sockets
- /proc/net/ files
Host network namespace                Container network namespace
─────────────────────────             ──────────────────────────
eth0: 192.168.1.10/24                 eth0: 172.17.0.2/16
lo: 127.0.0.1/8                       lo: 127.0.0.1/8
docker0: 172.17.0.1/16                (veth0 ↔ host's veth1)
iptables rules: host policies         iptables rules: container policies
veth pairs: the standard way to connect a network namespace to the host or another namespace. A veth pair is two virtual NICs linked together; a packet entering one end emerges from the other.
# Create a new network namespace:
ip netns add myns
# Create a veth pair:
ip link add veth0 type veth peer name veth1
# Move one end into the namespace:
ip link set veth1 netns myns
# Configure addresses:
ip addr add 192.168.100.1/24 dev veth0
ip link set veth0 up
ip netns exec myns ip addr add 192.168.100.2/24 dev veth1
ip netns exec myns ip link set veth1 up
ip netns exec myns ip link set lo up
# Test:
ip netns exec myns ping 192.168.100.1
Network namespaces are also the mechanism behind:
- Kubernetes pod networking: each pod has its own network namespace; all containers
in the pod share it (joined via setns).
- Network policy enforcement: iptables/eBPF rules applied per-namespace.
- Service mesh sidecars: the sidecar container shares the pod's network namespace
via --net=container:ID in Docker, or by joining the pod's netns in Kubernetes.
Mount Namespace
Isolates the filesystem mount table. Processes in different mount namespaces can have entirely different views of the filesystem.
unshare --mount /bin/sh
# Now in a private mount namespace
mount --bind /tmp/myfs /mnt/container
# This mount is invisible to the host
Propagation types (controlled via mount --make-*):
- shared: mounts propagate between namespaces bidirectionally
- slave: mounts propagate from master to slave, not reverse
- private: no propagation
- unbindable: cannot be bind-mounted
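The propagation type of each mount is visible in the optional fields of /proc/PID/mountinfo (`shared:N`, `master:N`, `unbindable`). The following sketch classifies a single mountinfo line; the sample lines are illustrative, and `propagation_of` is a hypothetical helper. A mount can be both shared and slave at once, so the function returns a list.

```python
def propagation_of(mountinfo_line):
    """Classify the propagation type(s) of one /proc/PID/mountinfo line."""
    fields = mountinfo_line.split()
    # Optional fields sit between the mount options (index 5) and the "-" separator.
    separator = fields.index("-", 6)
    tags = {field.split(":")[0] for field in fields[6:separator]}
    kinds = []
    if "shared" in tags:
        kinds.append("shared")
    if "master" in tags:
        kinds.append("slave")
    if "unbindable" in tags:
        kinds.append("unbindable")
    return kinds or ["private"]


if __name__ == "__main__":
    samples = [
        "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw",
        "36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw",
        "40 22 0:35 / /mnt rw - tmpfs tmpfs rw",
    ]
    for line in samples:
        print(line.split()[4], propagation_of(line))
```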
Container runtimes such as runc typically use unshare(CLONE_NEWNS) followed by:
1. Mounting /proc, /sys, /dev under the new root filesystem
2. pivot_root() to switch the root filesystem to the container image
3. Unmounting the old root
overlayfs (used by Docker, containerd) layers a writable upperdir over a read-only
lowerdir (the image), presenting a unified writable filesystem without modifying the
original image.
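As a rough sketch of the layering just described, the helper below builds the option string and mount(8) command for an overlay mount. All paths are hypothetical placeholders. The overlay filesystem requires a `workdir` in addition to `upperdir`; it must be an empty directory on the same filesystem as the upper directory. When several lower directories are given, the leftmost one is the topmost layer.

```python
def overlay_mount_options(lowerdirs, upperdir, workdir):
    """Build the -o option string for an overlay mount."""
    if not lowerdirs:
        raise ValueError("at least one lowerdir is required")
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lowerdirs), upperdir, workdir
    )


def overlay_mount_command(lowerdirs, upperdir, workdir, target):
    """Return the mount(8) argv that would assemble the layers at target."""
    options = overlay_mount_options(lowerdirs, upperdir, workdir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]


if __name__ == "__main__":
    # Hypothetical layer paths, for illustration only.
    print(" ".join(overlay_mount_command(
        ["/layers/app", "/layers/base"], "/ctr/upper", "/ctr/work", "/ctr/rootfs"
    )))
```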
UTS Namespace
Isolates the hostname and NIS domainname. Each container can have its own hostname without affecting the host:
unshare --uts /bin/sh
hostname container-01
# host hostname is unchanged
Used by every container runtime so that hostname inside the container returns the
container name, not the host's hostname.
IPC Namespace
Isolates System V IPC objects (message queues, semaphore sets, shared memory segments) and POSIX message queues. Processes in different IPC namespaces cannot share SysV IPC or POSIX MQ objects.
Without IPC namespace isolation, container processes could accidentally (or maliciously) interact with host SysV shared memory segments. Most container runtimes create a new IPC namespace by default.
User Namespace
The most powerful and most complex namespace type. User namespaces map UID/GID ranges between the namespace and the host:
Container user namespace              Host
─────────────────────────             ──────────────────────
UID 0 (root)       ─────────►         UID 1000 (regular user)
UID 1              ─────────►         UID 1001
...
UID 65535          ─────────►         UID 66535
This mapping is written to /proc/PID/uid_map and /proc/PID/gid_map:
# /proc/PID/uid_map format: [ns_start] [host_start] [count]
0 1000 65536
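The three-column format can be applied mechanically. The sketch below parses uid_map text and translates a namespace UID to the corresponding UID on the other side of the mapping; the function names are hypothetical. Note that the second column is relative to the user namespace of the process reading the file, which is the host only when read from the initial namespace.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (ns_start, host_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            ns_start, host_start, count = (int(field) for field in line.split())
            entries.append((ns_start, host_start, count))
    return entries


def to_host_id(entries, ns_id):
    """Translate an ID inside the namespace to the outer ID, or None if unmapped."""
    for ns_start, host_start, count in entries:
        if ns_start <= ns_id < ns_start + count:
            return host_start + (ns_id - ns_start)
    return None


if __name__ == "__main__":
    mapping = parse_id_map("0 1000 65536\n")
    for uid in (0, 1, 65535, 65536):
        print(uid, "->", to_host_id(mapping, uid))
```

IDs without a mapping show up inside the namespace as the overflow ID (usually 65534, "nobody"), which is a common cause of unexpected permission errors.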
Rootless containers: because user namespaces allow a non-root user to create a namespace where they appear as UID 0, it is possible to run full container workloads without any host-level root privileges. Podman, rootless Docker, and rootless containerd all use this.
Capability semantics: inside a user namespace, a process can hold the full set of
capabilities (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.) — but these capabilities are
only honored for operations within that namespace (or namespaces nested inside it).
# Create a user namespace mapping current user to UID 0 inside:
unshare --user --map-root-user /bin/sh
# Inside: id → uid=0(root) gid=0(root)
# But on the host: we are still our original UID
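One way to observe the capability semantics described above is to decode the CapEff: mask in /proc/self/status from inside such a shell: it would typically show a full capability set even though the process is unprivileged on the host. The sketch below decodes a few bits from status text; the bit numbers come from include/uapi/linux/capability.h, the sample masks are illustrative, and the helper name is hypothetical.

```python
# Bit positions from include/uapi/linux/capability.h (a small subset).
CAP_BITS = {
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
}


def effective_caps(status_text):
    """Return the names (from CAP_BITS) set in the CapEff: hex mask."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return {name for name, bit in CAP_BITS.items() if (mask >> bit) & 1}
    return set()


if __name__ == "__main__":
    # Illustrative masks: a full set versus an empty set.
    print(sorted(effective_caps("CapEff:\t000001ffffffffff\n")))
    print(sorted(effective_caps("CapEff:\t0000000000000000\n")))
```

The mask only says which capabilities the process holds; whether the kernel honors one for a given operation still depends on which user namespace owns the target resource.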
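Security constraint: user namespaces are powerful and have historically been a source
of privilege escalation CVEs. Many hardened deployments restrict unprivileged creation:
Debian and Ubuntu kernels provide the kernel.unprivileged_userns_clone=0 sysctl, which
makes user namespace creation require CAP_SYS_ADMIN, and mainline kernels offer
user.max_user_namespaces=0, which disables creation of new user namespaces altogether.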
Cgroup Namespace
Isolates the view of /proc/self/cgroup. Without a cgroup namespace, a process inside
a container can see its full cgroup path on the host (e.g.,
/system.slice/docker-abc123.scope/), leaking information about the container runtime
and orchestrator.
With a cgroup namespace, the process sees only a relative path from the namespace root:
Without cgroup ns: /system.slice/docker-abc123.scope/
With cgroup ns: /
Most container runtimes create a cgroup namespace by default (Docker does so on cgroup v2 hosts) to prevent cgroup path leakage.
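The relative-path view can be modelled as a path computation against the namespace root. This is a simplified sketch of the displayed result, not of the kernel's implementation; `cgroup_view` is a hypothetical name and the paths reuse the example above. Cgroups outside the namespace root show up with leading `..` components.

```python
import posixpath


def cgroup_view(host_path, ns_root):
    """Model the cgroup path a process sees when its cgroup namespace is rooted at ns_root."""
    relative = posixpath.relpath(host_path, ns_root)
    return "/" if relative == "." else "/" + relative


if __name__ == "__main__":
    root = "/system.slice/docker-abc123.scope"
    print(cgroup_view(root, root))
    print(cgroup_view(root + "/worker", root))
    print(cgroup_view("/system.slice/other.scope", root))
```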
Time Namespace
Introduced in Linux 5.6 (2020). Allows a process to have a different view of
CLOCK_MONOTONIC and CLOCK_BOOTTIME. Does not affect wall clock (CLOCK_REALTIME).
Primary use case: CRIU (Checkpoint/Restore in Userspace). When a process is checkpointed and restored on a different host, its monotonic clock readings would otherwise be wrong (the new host has been up for a different duration). A time namespace allows adjusting the monotonic clock offset so the restored process sees a consistent time view.
# Adjust monotonic clock by -100 seconds for processes in a new time namespace:
unshare --time --monotonic=-100 /bin/sh
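The offsets of a time namespace are exposed in /proc/PID/timens_offsets, one line per clock with a seconds and a nanoseconds column. The sketch below parses that text and applies an offset to a host clock reading; the sample text mirrors the -100 second example above, and the function names are hypothetical.

```python
def parse_timens_offsets(text):
    """Parse /proc/PID/timens_offsets text into {clock_name: offset_seconds}."""
    offsets = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            clock, seconds, nanoseconds = fields
            offsets[clock] = int(seconds) + int(nanoseconds) / 1e9
    return offsets


def namespace_reading(host_reading, offset):
    """Clock value seen inside the namespace for a given host clock value."""
    return host_reading + offset


if __name__ == "__main__":
    sample = "monotonic        -100         0\nboottime            0         0\n"
    offsets = parse_timens_offsets(sample)
    # With a host CLOCK_MONOTONIC of 5000 s, the namespace would read 4900 s.
    print(namespace_reading(5000.0, offsets["monotonic"]))
```

For a CRIU-style restore, the offset would be chosen so that the restored process sees the monotonic value it had at checkpoint time.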
Namespace Operations: unshare, setns, /proc/PID/ns/
Creating namespaces:
──────────────────────────────────────────────────────────────────────
clone(CLONE_NEWPID | CLONE_NEWNET | ...) — on fork, child in new ns
unshare(CLONE_NEWNS | CLONE_NEWUTS | ...) — calling process moves
unshare(1) shell command — wraps unshare() syscall
Joining namespaces:
──────────────────────────────────────────────────────────────────────
setns(fd, nstype) — join namespace referenced by fd
nsenter(1) — shell command: join namespaces of a running process
nsenter --target PID --mount --pid --net /bin/sh
# Now in the same mount, PID, and net namespaces as PID
Persisting namespaces (keep alive without processes):
──────────────────────────────────────────────────────────────────────
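touch /var/run/netns/my_net_ns   # the bind-mount target must already exist
mount --bind /proc/PID/ns/net /var/run/netns/my_net_ns
# or just: ip netns add NAME (creates a new namespace and a bind mount in /run/netns/)
setns(open("/var/run/netns/my_net_ns", O_RDONLY), CLONE_NEWNET)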
Full Container Namespace Isolation Diagram
Host kernel
┌────────────────────────────────────────────────────────────────┐
│ PID ns:  host                                                  │
│ Net ns:  host (eth0, lo, docker0, veth*)                       │
│ Mnt ns:  host (/)                                              │
│ UTS ns:  host ("prod-worker-01")                               │
│ IPC ns:  host                                                  │
│ User ns: host (UID 0 = real root)                              │
│                                                                │
│  containerd ───────────────────────────────────────────────┐   │
│                                                            │   │
│  Container A:                 Container B:                 │   │
│  ┌─────────────────────────┐  ┌─────────────────────────┐  │   │
│  │ PID ns: A (PID 1=nginx) │  │ PID ns: B (PID 1=redis) │  │   │
│  │ Net ns: A (172.17.0.2)  │  │ Net ns: B (172.17.0.3)  │  │   │
│  │ Mnt ns: A (overlay/)    │  │ Mnt ns: B (overlay/)    │  │   │
│  │ UTS ns: A ("nginx-pod") │  │ UTS ns: B ("redis-pod") │  │   │
│  │ IPC ns: A               │  │ IPC ns: B               │  │   │
│  │ User ns: A (0→1000)     │  │ User ns: B (0→1001)     │  │   │
│  │ Cgroup ns: A            │  │ Cgroup ns: B            │  │   │
│  └─────────────────────────┘  └─────────────────────────┘  │   │
│                                                            │   │
│  Shared: host kernel, cgroups hierarchy, devices         ──┘   │
└────────────────────────────────────────────────────────────────┘
Historical Context
Mount namespaces (called "filesystem namespaces") were the first namespace type, merged in Linux 2.4.19 (2002) by Al Viro. They were originally proposed to allow bind mounts and private mount spaces for individual processes.
The broader namespace framework was designed by Eric Biederman and merged incrementally:
- 2.6.19 (2006): UTS and IPC namespaces
- 2.6.24 (2008): PID namespaces
- 2.6.29 (2009): Network namespaces
- 3.8 (2013): User namespaces (full, unprivileged creation)
- 4.6 (2016): Cgroup namespaces
- 5.6 (2020): Time namespaces
Docker (2013) popularized container technology built primarily on namespaces + cgroups + a union filesystem (AUFS at first, overlayfs later). The OCI (Open Container Initiative), founded in 2015, standardized how container runtimes should use these primitives in its runtime specification.
Production Examples
Inspect a container's namespaces:
# Find container PID on host:
docker inspect <container> --format '{{.State.Pid}}'
# List its namespaces:
ls -la /proc/<pid>/ns/
# Join its network namespace:
nsenter --target <pid> --net ip addr
# Join all namespaces:
nsenter --target <pid> --all /bin/sh
Network namespace for service isolation (production pattern):
# Create isolated network namespace for a service:
ip netns add svc-payments
ip link add veth-pay type veth peer name veth-pay-host
ip link set veth-pay netns svc-payments
ip -n svc-payments addr add 10.10.0.2/24 dev veth-pay
ip -n svc-payments link set veth-pay up
ip -n svc-payments link set lo up
# Bring up the host end too (10.10.0.1 is an assumed host-side address):
ip addr add 10.10.0.1/24 dev veth-pay-host
ip link set veth-pay-host up
# Run service in namespace:
ip netns exec svc-payments ./payments-service
Rootless container with user namespace:
# Podman running without root:
podman run --rm -it alpine /bin/sh
# Inside: id → uid=0(root) gid=0(root)
# On host: running as uid=1000
Debugging Notes
- nsenter for live debugging: nsenter --target PID --all /bin/sh gives a shell inside all of the target process's namespaces. Essential for debugging container network or filesystem issues without restarting the container.
- Namespace leaks: if a namespace is kept alive by a bind-mount or an open fd after all processes have left it, it continues to consume kernel memory. Check with the lsns command (util-linux >= 2.28).
- PID namespace PID 1 death: if the container's PID 1 exits, SIGKILL is sent to all other processes in the namespace. This appears as "container exited 0" even if worker processes were running. Always run a proper init.
- Network namespace connectivity: ip netns exec NS ping 8.8.8.8 failing while ping 172.17.0.1 works indicates a missing default route in the namespace or missing NAT rules on the host. Check ip -n NS route and host iptables MASQUERADE rules.
- User namespace UID mapping failure: unshare --user may fail with EPERM: Operation not permitted if kernel.unprivileged_userns_clone=0. Check with sysctl kernel.unprivileged_userns_clone.
Security Implications
- Namespace != sandbox: namespaces isolate resources but do not restrict syscalls. A container process can still call ptrace, perf_event_open, and other sensitive syscalls. Combine namespaces with seccomp (syscall filtering) and AppArmor/SELinux for defense in depth.
- User namespace CVEs: enabling unprivileged user namespace creation has repeatedly led to privilege escalation CVEs (CVE-2016-8655, CVE-2022-0185, CVE-2023-32233, etc.) because it exposes kernel code paths that were never designed to be reached by unprivileged users. Distros and enterprises often disable it with kernel.unprivileged_userns_clone=0 or user.max_user_namespaces=0.
- PID namespace escape: a container with sufficient capabilities inside its user namespace can potentially manipulate the host's process tree via /proc references that span namespace boundaries (for example, a host /proc mounted into the container, or a container sharing the host PID namespace). Restrict with seccomp + no-new-privileges.
- Network namespace and host exposure: a container with CAP_NET_ADMIN inside its network namespace can manipulate that namespace's networking, but not the host's. However, a misconfigured shared network namespace (--net=host in Docker) gives the container full access to the host network stack.
- Mount namespace and filesystem leakage: bind mounts can propagate between namespaces if the propagation type is shared. Container runtimes explicitly set mount propagation to slave or private for the container root to prevent host filesystem modifications from the container.
Performance Implications
- Namespace creation cost: each clone(CLONE_NEW*) flag adds overhead in copy_process() for allocating and initializing the new namespace object. Creating the seven namespace types that clone() accepts at once costs roughly 50–200 µs on modern hardware.
- Network namespace overhead: each network namespace has its own routing tables, iptables rule sets, and socket tables. On hosts with 10,000+ containers, the aggregate memory for all net namespace objects can be significant (each empty net namespace is roughly 50 KB of kernel memory).
- setns and TLB: switching between network namespaces with setns(fd, CLONE_NEWNET) is fast (~1 µs). setns() does not change the address space (mm_struct), so no TLB flush is involved; joining a mount namespace costs somewhat more because the kernel also resets the caller's root and working directory.
- iptables performance at scale: each container typically gets iptables rules for NAT and filtering. At 10,000+ containers, iptables becomes a bottleneck (linear scan of rules). Large Kubernetes deployments move to kube-proxy in ipvs mode or to eBPF-based networking (Cilium) to solve this.
Failure Modes
| Failure | Symptom | Root cause |
|---|---|---|
| PID 1 exit kills container | Container exits unexpectedly | Application ran as PID 1 without zombie reaping |
| Network namespace routing black hole | Container has IP but can't reach host | Missing veth routing or NAT rules |
| Mount propagation leak | Host sees container mounts | Mount namespace not set to private/slave |
| User namespace UID mismatch | Permission denied on files | uid_map not configured correctly |
| Namespace leak | Namespace counts growing toward the /proc/sys/user/max_*_namespaces limits | fd or bind-mount holding namespace alive |
| Time namespace clock drift | Process sees wrong monotonic time | Time namespace offset misconfigured for CRIU restore |
Modern Usage
Kubernetes: each pod gets a unique network namespace (shared by all containers in the
pod). The CNI plugin (Flannel, Calico, Cilium) configures veth pairs and routes. The IPC
namespace is shared within the pod as well; the PID namespace is shared only when the
pod sets shareProcessNamespace: true.
Rootless containers (Podman, rootless Docker): use user namespaces to map the calling
user to UID 0 inside the container, enabling container operations without host root.
newuidmap/newgidmap (setuid helpers) configure the full uid range mappings.
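A rough sketch of how a rootless runtime might derive its UID mapping: the calling user becomes UID 0, and the subordinate range from /etc/subuid fills in UIDs 1 and up. The user name `alice`, the sample subuid line, and the function names are hypothetical; real runtimes handle multiple ranges and error cases that this example skips.

```python
def rootless_uid_map(user, uid, subuid_text):
    """Return (ns_start, host_start, count) entries for a rootless container."""
    entries = [(0, uid, 1)]  # the calling user appears as root inside
    for line in subuid_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) == 3 and fields[0] in (user, str(uid)):
            start, count = int(fields[1]), int(fields[2])
            entries.append((1, start, count))
            break
    return entries


def newuidmap_argv(pid, entries):
    """Build the newuidmap command line that would install the mapping for pid."""
    argv = ["newuidmap", str(pid)]
    for ns_start, host_start, count in entries:
        argv += [str(ns_start), str(host_start), str(count)]
    return argv


if __name__ == "__main__":
    entries = rootless_uid_map("alice", 1000, "alice:100000:65536\n")
    print(" ".join(newuidmap_argv(4242, entries)))
```

Without a matching /etc/subuid entry only the single-UID mapping remains, which is also what an unprivileged process can write to uid_map on its own without the setuid helper.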
Firecracker microVMs: while Firecracker uses KVM (hardware virtualization) for stronger isolation, its VMM process itself runs in a network namespace and mount namespace to restrict its access to host resources.
Future Directions
- Landlock LSM + namespaces: Landlock (merged 5.13) provides programmatic filesystem access control that complements mount namespaces. Integrating Landlock with namespace transitions would allow per-namespace access policies.
- Network namespace fast path: BPF-based networking (XDP, TC-BPF) inside network namespaces, avoiding iptables entirely. This is the direction of Cilium and similar CNI plugins.
- Anonymous namespaces: proposal to create namespaces that are not associated with any /proc fd, useful for short-lived isolated contexts without the risk of namespace leaks via fd retention.
- Namespace-aware seccomp: current seccomp policies are per-process. Proposals exist for namespace-scoped seccomp policies that apply to all processes entering a namespace, simplifying policy management.
Exercises
- Manual container: using only unshare, mount, ip, and chroot, create a minimal container from a rootfs tarball. It should have its own PID, network, mount, and UTS namespaces, a working lo interface, and isolated hostname. Document every command. Verify with ps aux and ip addr inside vs. outside.
- Network namespace routing: create two network namespaces connected by a veth pair. Write a TCP server in one namespace and a client in the other. Use strace -e trace=network to watch the socket calls on each side, and tcpdump on the host's interfaces to verify no packets escape to the host network stack.
- Rootless container exploration: install Podman. Run podman run -it alpine sh. From the host, find the container's PID. Read /proc/PID/status and note the NSpid: field. Read /proc/PID/uid_map and explain the UID mapping. Find where the container filesystem is mounted on the host.
- Namespace leak detection: write a C program that forks a child which calls unshare(CLONE_NEWNET) and then exits, while the parent opens an fd to /proc/<child-pid>/ns/net before the child exits and keeps it open. (An fd held only by the exiting process would be closed by the kernel at exit, so the holder must be a different process.) Observe that the namespace persists although it has no member processes, for example by calling setns() on the fd. Close the fd and verify cleanup.
- PID 1 SIGKILL behavior: create a PID namespace with unshare --pid --fork. Inside, start a background sleep 100. Kill PID 1 (the shell) from the host. Observe that the sleep process also dies immediately. Now install tini as PID 1 and repeat; sleep should survive until tini receives SIGTERM.
References
- kernel/nsproxy.c: nsproxy structure, namespace switching
- kernel/pid_namespace.c, net/core/net_namespace.c, fs/namespace.c
- include/linux/user_namespace.h: uid/gid mapping structures
- tools/testing/selftests/namespaces/: kernel namespace self-tests
- Kerrisk, The Linux Programming Interface, Chapter 28 (process creation in more detail, including clone())
- "Namespaces in operation": LWN.net series by Michael Kerrisk (2013–2016)
- OCI Runtime Specification: https://github.com/opencontainers/runtime-spec
- man 7 namespaces, man 2 unshare, man 2 setns, man 1 nsenter, man 1 lsns
- Rootless containers: https://rootlesscontaine.rs/
- Container Security, Liz Rice (O'Reilly, 2020): practical namespace security