06 — Process Isolation and Linux Namespaces

Technical Overview

Linux namespaces are the kernel mechanism that makes containers possible. Each namespace type wraps one category of global kernel resource (the process tree, the network stack, the filesystem mount table, the hostname, IPC objects, user/group IDs, cgroup views, or the monotonic and boot-time clocks) and provides a private instance of that resource to a set of processes. A container is essentially a group of processes running in a coordinated set of namespaces, combined with cgroup resource limits and its own root filesystem.

There are currently 8 namespace types in the mainline kernel. Understanding each — what it isolates, how it is created, and where it interacts with security boundaries — is essential for container runtime authors, platform engineers, and anyone doing advanced Linux systems work.


Prerequisites

  • 01-process-concept.md: task_struct, /proc/PID/ns/
  • 02-fork-and-exec.md: clone(), CLONE_NEW* flags
  • Basic networking (network interfaces, routing tables, iptables)
  • Mount/filesystem concepts (VFS, mount points)

Core Content

Namespace Overview

Linux Kernel Global Resources
──────────────────────────────────────────────────────────────────────
Resource           Namespace      clone flag         unshare(1) option
─────────────────  ─────────────  ─────────────────  ─────────────────
Process IDs        PID ns         CLONE_NEWPID       unshare --pid
Network stack      Net ns         CLONE_NEWNET       unshare --net
Mount table        Mount ns       CLONE_NEWNS        unshare --mount
Hostname           UTS ns         CLONE_NEWUTS       unshare --uts
SysV IPC / POSIX   IPC ns         CLONE_NEWIPC       unshare --ipc
  message queues
UID/GID mapping    User ns        CLONE_NEWUSER      unshare --user
Cgroup view        Cgroup ns      CLONE_NEWCGROUP    unshare --cgroup
System clocks      Time ns        CLONE_NEWTIME      unshare --time
──────────────────────────────────────────────────────────────────────

A process belongs to exactly one namespace of each type. Three system calls create namespaces or change membership:

  • clone(CLONE_NEW*): the new child process starts in a new namespace
  • unshare(CLONE_NEW*): the calling process moves itself to a new namespace (for PID and time namespaces, only its subsequently created children do)
  • setns(fd, nstype): the calling process joins an existing namespace identified by fd
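As a small illustration of how the CLONE_NEW* flags combine into the bitmask passed to clone() or unshare(), the sketch below ORs together the flag values from <linux/sched.h>. The dictionary keys reuse the /proc/PID/ns/ entry names; the helper names namespace_flags and describe are made up for this example.

```python
# Values of the CLONE_NEW* constants as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "time":   0x00000080,  # CLONE_NEWTIME
    "mnt":    0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* bits for the given namespace type names."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]  # KeyError on an unknown namespace type
    return flags


def describe(flags):
    """Return the sorted namespace type names whose bit is set in flags."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if flags & bit)


mask = namespace_flags(["pid", "net", "mnt"])
print(hex(mask))       # 0x60020000
print(describe(mask))  # ['mnt', 'net', 'pid']
```

The resulting integer is what a C caller would pass as the flags argument; this sketch only builds the value and does not call into the kernel.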

Namespace identity is tracked via inode numbers in /proc/PID/ns/:

ls -la /proc/self/ns/
# lrwxrwxrwx 1 user user 0 cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 ipc    -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 user user 0 mnt    -> 'mnt:[4026531841]'
# lrwxrwxrwx 1 user user 0 net    -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 pid    -> 'pid:[4026531836]'
# lrwxrwxrwx 1 user user 0 time   -> 'time:[4026531834]'
# lrwxrwxrwx 1 user user 0 user   -> 'user:[4026531837]'
# lrwxrwxrwx 1 user user 0 uts    -> 'uts:[4026531838]'

Two processes with identical inode numbers for a given namespace type are in the same namespace. To keep a namespace alive after its last process has exited, hold an open fd to /proc/PID/ns/TYPE or bind-mount that file somewhere.
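The inode comparison described above can be sketched in a few lines of Python. The function names (parse_ns_link, namespaces_of, shared_namespaces) are hypothetical helpers for this example, and reading another user's /proc/PID/ns/ entries may require extra privileges.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/PID/ns symlink target such as 'net:[4026531992]'."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: inode number} for one process."""
    base = f"/proc/{pid}/ns"
    result = {}
    for entry in os.listdir(base):
        _, inode = parse_ns_link(os.readlink(os.path.join(base, entry)))
        result[entry] = inode
    return result


def shared_namespaces(pid_a, pid_b):
    """List the namespace entries for which both processes show the same inode."""
    a, b = namespaces_of(pid_a), namespaces_of(pid_b)
    return sorted(name for name in a if a[name] == b.get(name))


if os.path.isdir("/proc/self/ns"):  # only meaningful on Linux
    print(namespaces_of())
    print(shared_namespaces("self", os.getppid()))
```

Run from an ordinary shell, the second line would normally list every namespace type, since a child shares all namespaces with its parent unless one of them was unshared.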


PID Namespace

Isolates the process ID number space. Processes inside the namespace see a private PID tree starting at PID 1. The host kernel still assigns unique global PIDs.

Host (initial namespace)          Container PID namespace
─────────────────────────         ──────────────────────────
PID 1: systemd                    PID 1: container-init (= host PID 47823)
PID 2: kthreadd                   PID 2: nginx worker   (= host PID 47824)
...                                PID 3: nginx worker   (= host PID 47825)
PID 47822: containerd
PID 47823: container-init  ──────►  appears as PID 1 inside namespace
PID 47824: nginx worker    ──────►  appears as PID 2 inside namespace
PID 47825: nginx worker    ──────►  appears as PID 3 inside namespace

Key properties:

  • PID 1 semantics inside the namespace: if PID 1 in the namespace exits, the kernel sends SIGKILL to all other processes in that namespace. PID 1 also inherits orphaned processes and ignores signals sent from inside the namespace unless it has installed a handler for them, which is why containers often run a small init (e.g., tini) as PID 1 rather than the application itself.
  • /proc visibility: inside a PID namespace, /proc shows only the processes in that namespace (if a new mount namespace is used with a private /proc mount).
  • Nested namespaces: PID namespaces are hierarchical. A process in an outer namespace can see processes in inner namespaces (with their outer PIDs). Inner namespaces cannot see outer processes.
  • getpid() inside the container returns the namespace-local PID (1, 2, 3...). The NSpid: field of /proc/PID/status lists the process's PID in each nested namespace, starting with the namespace of the /proc mount being read.

# Create a new PID namespace (requires root or user ns with mapping):
unshare --pid --fork --mount-proc /bin/sh
# Inside: 'echo $$' → 1
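To relate a host PID to its PID inside a container, the NSpid: field mentioned above can be parsed as in this minimal sketch. It works on the text of /proc/PID/status rather than reading the file directly, and the sample text is invented to match the diagram (host PID 47824, container PID 2).

```python
def parse_nspid(status_text):
    """Return the NSpid values from /proc/PID/status text, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field absent, e.g. on kernels that predate NSpid


# Invented sample matching the diagram: one nginx worker seen from the host.
sample = "Name:\tnginx\nPid:\t47824\nNSpid:\t47824\t2\n"
print(parse_nspid(sample))  # [47824, 2]
```

A process that is not in a nested PID namespace has a single entry, so a list longer than one is a quick hint that the process lives in a container.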

Network Namespace

Each network namespace has a complete, independent network stack:

  • Network interfaces (including lo)
  • Routing tables
  • Firewall rules (iptables / nftables)
  • Network sockets
  • /proc/net/ files

Host network namespace             Container network namespace
─────────────────────────         ──────────────────────────
eth0: 192.168.1.10/24             eth0: 172.17.0.2/16
lo:   127.0.0.1/8                 lo:   127.0.0.1/8
docker0: 172.17.0.1/16            (veth0 ↔ host's veth1)

iptables rules: host policies     iptables rules: container policies

veth pairs: the standard way to connect a network namespace to the host or another namespace. A veth pair is two virtual NICs linked together; a packet entering one end emerges from the other.

# Create a new network namespace:
ip netns add myns

# Create a veth pair:
ip link add veth0 type veth peer name veth1

# Move one end into the namespace:
ip link set veth1 netns myns

# Configure addresses:
ip addr add 192.168.100.1/24 dev veth0
ip link set veth0 up
ip netns exec myns ip addr add 192.168.100.2/24 dev veth1
ip netns exec myns ip link set veth1 up
ip netns exec myns ip link set lo up

# Test:
ip netns exec myns ping 192.168.100.1
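A common mistake with veth pairs is assigning the two ends addresses from different subnets, in which case the ping above fails without an obvious error. The sketch below builds the same sequence of ip commands as argument lists and checks the subnet first; it only prints the commands and does not execute them. The function name veth_setup_commands is an assumption of this example.

```python
import ipaddress


def veth_setup_commands(ns, host_if, ns_if, host_cidr, ns_cidr):
    """Return the ip(8) argv lists that connect namespace ns to the host via a veth pair."""
    host = ipaddress.ip_interface(host_cidr)
    peer = ipaddress.ip_interface(ns_cidr)
    if host.network != peer.network:
        raise ValueError(f"{host_cidr} and {ns_cidr} are not on the same subnet")
    if host.ip == peer.ip:
        raise ValueError("both ends have the same address")
    return [
        ["ip", "netns", "add", ns],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        ["ip", "link", "set", ns_if, "netns", ns],
        ["ip", "addr", "add", host_cidr, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        ["ip", "-n", ns, "addr", "add", ns_cidr, "dev", ns_if],
        ["ip", "-n", ns, "link", "set", ns_if, "up"],
        ["ip", "-n", ns, "link", "set", "lo", "up"],
    ]


for argv in veth_setup_commands(
    "myns", "veth0", "veth1", "192.168.100.1/24", "192.168.100.2/24"
):
    print(" ".join(argv))
```

Keeping the commands as lists makes it straightforward to pass each one to subprocess.run later, under root, without shell quoting issues.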

Network namespaces are also the mechanism behind:

  • Kubernetes pod networking: each pod has its own network namespace; all containers in the pod share it (joined via setns).
  • Network policy enforcement: iptables/eBPF rules applied per-namespace.
  • Service mesh sidecars: the sidecar container shares the pod's network namespace via --net=container:ID in Docker, or by joining the pod's netns in Kubernetes.


Mount Namespace

Isolates the filesystem mount table. Processes in different mount namespaces can have entirely different views of the filesystem.

unshare --mount /bin/sh
# Now in a private mount namespace
mount --bind /tmp/myfs /mnt/container
# This mount is invisible to the host

Propagation types (controlled via mount --make-*):

  • shared: mount and unmount events propagate in both directions between the members of a peer group
  • slave: events propagate from the master to the slave, not in reverse
  • private: no propagation
  • unbindable: private, and additionally cannot be used as the source of a bind mount
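The propagation type of each mount is visible in the optional fields of /proc/PID/mountinfo (shared:N, master:N, unbindable, or nothing for private). A minimal sketch of reading it back, assuming the field layout documented in proc(5); the function name and sample lines are illustrative only.

```python
def propagation_of(mountinfo_line):
    """Classify one /proc/PID/mountinfo line by its propagation type."""
    fields = mountinfo_line.split()
    separator = fields.index("-", 6)   # optional fields sit between field 6 and the lone "-"
    tags = {field.split(":")[0] for field in fields[6:separator]}
    if "unbindable" in tags:
        return "unbindable"
    kinds = []
    if "shared" in tags:
        kinds.append("shared")
    if "master" in tags:               # "master:N" marks this mount as a slave of peer group N
        kinds.append("slave")
    return "+".join(kinds) if kinds else "private"


# Illustrative lines in mountinfo format (IDs and devices are made up).
print(propagation_of("36 35 98:0 / /mnt rw,noatime shared:1 - ext4 /dev/sda1 rw"))
print(propagation_of("37 35 98:0 / /ctr rw master:1 - ext4 /dev/sda1 rw"))
print(propagation_of("38 35 0:40 / /priv rw - tmpfs tmpfs rw"))
```

A mount can be a slave of one peer group and shared within another, which this sketch reports as shared+slave.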

Container runtimes such as runc create a new mount namespace (CLONE_NEWNS) and then typically:

  1. Mount /proc, /sys, and /dev inside the container's root filesystem
  2. Call pivot_root() to switch the root filesystem to the container image
  3. Unmount the old root

overlayfs (used by Docker, containerd) layers a writable upperdir over a read-only lowerdir (the image), presenting a unified writable filesystem without modifying the original image.
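To make the layering concrete, here is a sketch that assembles the option string for an overlay mount. In addition to the lowerdir and upperdir named above, overlayfs requires a workdir on the same filesystem as the upperdir. All paths are placeholders, and the helper name is an assumption of this example; nothing is mounted.

```python
def overlay_mount_options(lowerdirs, upperdir, workdir):
    """Build the -o string for an overlay mount; the first lowerdir is the topmost layer."""
    if not lowerdirs:
        raise ValueError("at least one lowerdir is required")
    for path in [*lowerdirs, upperdir, workdir]:
        if ":" in path or "," in path:
            raise ValueError(f"path needs escaping, not handled here: {path!r}")
    return f"lowerdir={':'.join(lowerdirs)},upperdir={upperdir},workdir={workdir}"


# Placeholder paths: two read-only image layers plus a per-container writable layer.
options = overlay_mount_options(["/img/app", "/img/base"], "/ctr/upper", "/ctr/work")
print(["mount", "-t", "overlay", "overlay", "-o", options, "/ctr/merged"])
```

Writes made under the merged directory land in the upperdir, which is why deleting that directory is enough to discard a container's changes.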


UTS Namespace

Isolates the hostname and NIS domainname. Each container can have its own hostname without affecting the host:

unshare --uts /bin/sh
hostname container-01
# host hostname is unchanged

Container runtimes rely on this so that hostname inside the container returns the container's name, not the host's hostname.


IPC Namespace

Isolates System V IPC objects (message queues, semaphore sets, shared memory segments) and POSIX message queues. Processes in different IPC namespaces cannot share SysV IPC or POSIX MQ objects.

Without IPC namespace isolation, container processes could accidentally (or maliciously) interact with host SysV shared memory segments. Most container runtimes create a new IPC namespace by default.


User Namespace

The most powerful and most complex namespace type. User namespaces map UID/GID ranges between the namespace and the host:

Container user namespace          Host
─────────────────────────         ──────────────────────
UID 0 (root)          ─────────►  UID 1000 (regular user)
UID 1                 ─────────►  UID 1001
...
UID 65535             ─────────►  UID 66535

This mapping is written to /proc/PID/uid_map and /proc/PID/gid_map:

# /proc/PID/uid_map format: [ns_start] [host_start] [count]
0  1000  65536
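The translation that the kernel performs with this table is simple range arithmetic. The following sketch parses uid_map text and maps a namespace UID to the corresponding outer UID; the function names are hypothetical, and the input reuses the example line above.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (ns_start, host_start, count) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            ns_start, host_start, count = (int(value) for value in line.split())
            entries.append((ns_start, host_start, count))
    return entries


def to_host_id(entries, ns_id):
    """Translate an ID inside the namespace to the outer ID, or None if unmapped."""
    for ns_start, host_start, count in entries:
        if ns_start <= ns_id < ns_start + count:
            return host_start + (ns_id - ns_start)
    return None


id_map = parse_id_map("0  1000  65536\n")
print(to_host_id(id_map, 0))      # 1000
print(to_host_id(id_map, 65535))  # 66535
print(to_host_id(id_map, 65536))  # None
```

An ID that falls outside every range has no counterpart on the other side, which is a frequent cause of unexpected permission errors on bind-mounted files.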

Rootless containers: because user namespaces allow a non-root user to create a namespace where they appear as UID 0, it is possible to run full container workloads without any host-level root privileges. Podman, rootless Docker, and rootless containerd all use this.

Capability semantics: inside a user namespace, a process can hold the full set of capabilities (CAP_NET_ADMIN, CAP_SYS_ADMIN, etc.) — but these capabilities are only honored for operations within that namespace (or namespaces nested inside it).

# Create a user namespace mapping current user to UID 0 inside:
unshare --user --map-root-user /bin/sh
# Inside: id → uid=0(root) gid=0(root)
# But on the host: we are still our original UID
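Inside such a shell, the CapEff: line of /proc/self/status shows a full capability mask even though the process has no extra privilege on the host, because the mask is interpreted relative to the process's user namespace. A minimal sketch of decoding that mask follows; the bit numbers come from <linux/capability.h>, and the sample status text is invented.

```python
# Bit positions from <linux/capability.h>; only the two capabilities named above.
CAP_BITS = {"CAP_NET_ADMIN": 12, "CAP_SYS_ADMIN": 21}


def effective_caps(status_text):
    """Return the effective capability mask from /proc/PID/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask, name):
    """True if the named capability bit is set in mask."""
    return bool(mask & (1 << CAP_BITS[name]))


# Invented samples: a process with many capabilities, and one with none.
inside = effective_caps("Name:\tsh\nCapEff:\t000001ffffffffff\n")
outside = effective_caps("Name:\tsh\nCapEff:\t0000000000000000\n")
print(has_cap(inside, "CAP_SYS_ADMIN"), has_cap(outside, "CAP_SYS_ADMIN"))  # True False
```

The mask alone therefore says nothing about host-level privilege; it has to be read together with the user namespace the process belongs to.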

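Security constraint: user namespaces are powerful and have historically been a source of privilege escalation CVEs. Many hardened deployments restrict them, either with the distribution-specific sysctl kernel.unprivileged_userns_clone=0 (a Debian/Ubuntu patch that makes user namespace creation require CAP_SYS_ADMIN) or with the mainline sysctl user.max_user_namespaces=0 (which prevents any user namespace from being created).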
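A quick way to see which of these restrictions applies on a given host is to read the sysctls from /proc/sys. The sketch below treats a missing file as "knob not present"; the helper names are made up, and other mechanisms (LSM policy, seccomp filters) can still block user namespaces even when this check passes.

```python
from pathlib import Path


def read_sysctl(path):
    """Return the integer value of a sysctl file, or None if it is missing or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def unprivileged_userns_allowed(max_user_namespaces, unprivileged_userns_clone):
    """Evaluate the two sysctls; None means the knob does not exist on this kernel."""
    if max_user_namespaces == 0:
        return False
    if unprivileged_userns_clone == 0:
        return False
    return True


print(unprivileged_userns_allowed(
    read_sysctl("/proc/sys/user/max_user_namespaces"),
    read_sysctl("/proc/sys/kernel/unprivileged_userns_clone"),
))
```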


Cgroup Namespace

Isolates the view of /proc/self/cgroup. Without a cgroup namespace, a process inside a container can see its full cgroup path on the host (e.g., /system.slice/docker-abc123.scope/), leaking information about the container runtime and orchestrator.

With a cgroup namespace, the process sees only a relative path from the namespace root:

Without cgroup ns: /system.slice/docker-abc123.scope/
With cgroup ns:    /
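The difference can be checked from inside a container by parsing /proc/self/cgroup, whose lines have the form hierarchy-ID:controller-list:path. A minimal sketch, assuming that format; the function name is hypothetical, and the sample strings mirror the two cases above on a cgroup v2 host.

```python
def parse_proc_cgroup(text):
    """Map controller list (or 'unified' for cgroup v2) to the cgroup path."""
    result = {}
    for line in text.splitlines():
        if not line:
            continue
        _, controllers, path = line.split(":", 2)
        result[controllers or "unified"] = path
    return result


without_ns = parse_proc_cgroup("0::/system.slice/docker-abc123.scope\n")
with_ns = parse_proc_cgroup("0::/\n")
print(without_ns["unified"])  # /system.slice/docker-abc123.scope
print(with_ns["unified"])     # /
```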

Container runtimes typically create one by default (Docker does so on cgroup v2 hosts) to prevent cgroup path leakage.


Time Namespace

Introduced in Linux 5.6 (2020). Allows a process to have a different view of CLOCK_MONOTONIC and CLOCK_BOOTTIME. Does not affect wall clock (CLOCK_REALTIME).

Primary use case: CRIU (Checkpoint/Restore in Userspace). When a process is checkpointed and restored on a different host, its monotonic clock readings would otherwise be wrong (the new host has been up for a different duration). A time namespace allows adjusting the monotonic clock offset so the restored process sees a consistent time view.

# Adjust monotonic clock by -100 seconds for processes in a new time namespace:
unshare --time --monotonic=-100 /bin/sh
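For the CRIU case, the offset is the difference between the clock value saved at checkpoint time and the new host's current reading. The sketch below computes that offset and formats it as a clock name followed by seconds and nanoseconds, the layout that time_namespaces(7) describes for /proc/PID/timens_offsets. The function names and the sample numbers are assumptions of this example; nothing is written to the kernel.

```python
NSEC_PER_SEC = 1_000_000_000


def restore_offset_ns(checkpoint_monotonic_ns, new_host_monotonic_ns):
    """Offset that lets CLOCK_MONOTONIC continue from its checkpointed value."""
    return checkpoint_monotonic_ns - new_host_monotonic_ns


def timens_offsets_line(clock, offset_ns):
    """Format an offset as '<clock> <secs> <nanosecs>' with nanosecs kept non-negative."""
    if clock not in ("monotonic", "boottime"):
        raise ValueError("only monotonic and boottime can be offset")
    secs, nsecs = divmod(offset_ns, NSEC_PER_SEC)  # floor division keeps nsecs in [0, 1e9)
    return f"{clock} {secs} {nsecs}"


# Assumed numbers: checkpointed at 5000 s of uptime, restored on a host up for 120.5 s.
offset = restore_offset_ns(5000 * NSEC_PER_SEC, 120_500_000_000)
print(timens_offsets_line("monotonic", offset))                # monotonic 4879 500000000
print(timens_offsets_line("monotonic", -100 * NSEC_PER_SEC))   # monotonic -100 0
```

The second call corresponds to the -100 second adjustment shown above. Offsets have to be set before the first process enters the new time namespace.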

Namespace Operations: unshare, setns, /proc/PID/ns/

Creating namespaces:
──────────────────────────────────────────────────────────────────────
clone(CLONE_NEWPID | CLONE_NEWNET | ...)    — on fork, child in new ns
unshare(CLONE_NEWNS | CLONE_NEWUTS | ...)   — calling process moves
unshare(1) shell command                    — wraps unshare() syscall

Joining namespaces:
──────────────────────────────────────────────────────────────────────
setns(fd, nstype)    — join namespace referenced by fd
nsenter(1)           — shell command: join namespaces of a running process
  nsenter --target PID --mount --pid --net /bin/sh
  # Now in the same mount, PID, and net namespaces as PID

Persisting namespaces (keep alive without processes):
──────────────────────────────────────────────────────────────────────
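touch /var/run/netns/my_net_ns    # the bind-mount target must already exist
mount --bind /proc/PID/ns/net /var/run/netns/my_net_ns
# or just: ip netns add NAME (creates a new namespace and bind-mounts it in /run/netns/)
setns(open("/var/run/netns/my_net_ns", O_RDONLY), CLONE_NEWNET)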

Full Container Namespace Isolation Diagram

Host kernel
┌─────────────────────────────────────────────────────────────────┐
│  PID ns: host                                                    │
│  Net ns: host (eth0, lo, docker0, veth*)                        │
│  Mnt ns: host (/)                                               │
│  UTS ns: host ("prod-worker-01")                                │
│  IPC ns: host                                                   │
│  User ns: host (UID 0 = real root)                              │
│                                                                  │
│   containerd ────────────────────────────────────────────────┐  │
│                                                               │  │
│   Container A:                    Container B:               │  │
│   ┌─────────────────────────┐     ┌─────────────────────────┐│  │
│   │ PID ns: A (PID 1=nginx) │     │ PID ns: B (PID 1=redis) ││  │
│   │ Net ns: A (172.17.0.2)  │     │ Net ns: B (172.17.0.3)  ││  │
│   │ Mnt ns: A (overlay/)    │     │ Mnt ns: B (overlay/)    ││  │
│   │ UTS ns: A ("nginx-pod") │     │ UTS ns: B ("redis-pod") ││  │
│   │ IPC ns: A               │     │ IPC ns: B               ││  │
│   │ User ns: A (0→1000)     │     │ User ns: B (0→1001)     ││  │
│   │ Cgroup ns: A            │     │ Cgroup ns: B            ││  │
│   └─────────────────────────┘     └─────────────────────────┘│  │
│                                                               │  │
│   Shared: host kernel, cgroups hierarchy, devices            └──┘│
└─────────────────────────────────────────────────────────────────┘

Historical Context

Mount namespaces (called "filesystem namespaces") were the first namespace type, merged in Linux 2.4.19 (2002) by Al Viro. They were originally proposed to allow bind mounts and private mount spaces for individual processes.

The broader namespace framework was designed by Eric Biederman and merged incrementally:

  • 2.6.19 (2006): UTS and IPC namespaces
  • 2.6.24 (2008): PID namespaces
  • 2.6.29 (2009): Network namespaces
  • 3.8 (2013): User namespaces (full, unprivileged creation)
  • 4.6 (2016): Cgroup namespaces
  • 5.6 (2020): Time namespaces

Docker (2013) popularized container technology built primarily on namespaces, cgroups, and a union filesystem (AUFS at first, overlayfs later). The OCI (Open Container Initiative), founded in 2015, standardized how container runtimes should use these primitives in its runtime specification.


Production Examples

Inspect a container's namespaces:

# Find container PID on host:
docker inspect <container> --format '{{.State.Pid}}'
# List its namespaces:
ls -la /proc/<pid>/ns/
# Join its network namespace:
nsenter --target <pid> --net ip addr
# Join all namespaces:
nsenter --target <pid> --all /bin/sh

Network namespace for service isolation (production pattern):

# Create isolated network namespace for a service:
ip netns add svc-payments
ip link add veth-pay type veth peer name veth-pay-host
ip link set veth-pay netns svc-payments
ip -n svc-payments addr add 10.10.0.2/24 dev veth-pay
ip -n svc-payments link set veth-pay up
ip -n svc-payments link set lo up
# Configure the host end too (10.10.0.1 is an assumed host-side address):
ip addr add 10.10.0.1/24 dev veth-pay-host
ip link set veth-pay-host up
# Run service in namespace:
ip netns exec svc-payments ./payments-service

Rootless container with user namespace:

# Podman running without root:
podman run --rm -it alpine /bin/sh
# Inside: id → uid=0(root) gid=0(root)
# On host: running as uid=1000

Debugging Notes

  • nsenter for live debugging: nsenter --target PID --all /bin/sh gives a shell inside all of the target process's namespaces. Essential for debugging container network or filesystem issues without restarting the container.
  • Namespace leaks: if a namespace is kept alive by a bind-mount or an open fd after all processes have left it, it continues to consume kernel memory. Check with the lsns command (util-linux >= 2.28); findmnt -t nsfs lists namespaces pinned by bind mounts.
  • PID namespace PID 1 death: if the container's PID 1 exits, SIGKILL is sent to all other processes in the namespace. This appears as "container exited 0" even if worker processes were running. Always run a proper init.
  • Network namespace connectivity: ip netns exec NS ping 8.8.8.8 failing while ping 172.17.0.1 works indicates a missing default route inside the namespace, disabled IP forwarding on the host, or missing NAT rules on the host. Check ip -n NS route, sysctl net.ipv4.ip_forward, and the host's iptables MASQUERADE rules.
  • User namespace creation failure: unshare --user may fail with EPERM (Operation not permitted) where the distribution-specific sysctl kernel.unprivileged_userns_clone is 0, or with ENOSPC (No space left on device) where user.max_user_namespaces is 0. Check both with sysctl.
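When lsns is not available on a host, a rough equivalent can be put together by grouping PIDs by the inode of their /proc/PID/ns/TYPE entry. The sketch below is a simplified stand-in, not a reimplementation of lsns; the function names are hypothetical. Like any process-based scan, it cannot see namespaces that are pinned only by a bind mount or an open fd.

```python
import os
from collections import defaultdict


def group_by_namespace(pid_to_inode):
    """Turn {pid: namespace inode} into {namespace inode: sorted list of pids}."""
    groups = defaultdict(list)
    for pid, inode in pid_to_inode.items():
        groups[inode].append(pid)
    return {inode: sorted(pids) for inode, pids in groups.items()}


def scan(ns_type="net", proc="/proc"):
    """Collect the namespace inode of every process we are allowed to inspect."""
    found = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            found[int(entry)] = os.stat(f"{proc}/{entry}/ns/{ns_type}").st_ino
        except OSError:
            continue  # process exited, or we lack permission to inspect it
    return found


if os.path.isdir("/proc/self/ns"):  # only meaningful on Linux
    for inode, pids in group_by_namespace(scan("net")).items():
        print(f"net:[{inode}] {len(pids)} process(es), first PID {pids[0]}")
```

Running it as root gives the most complete picture, since unprivileged users can only stat the namespace links of their own processes.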

Security Implications

  • Namespace != sandbox: namespaces isolate resources but do not restrict syscalls. A container process can still call ptrace, perf_event_open, and other sensitive syscalls. Combine namespaces with seccomp (syscall filtering) and AppArmor/SELinux for defense in depth.
  • User namespace CVEs: enabling unprivileged user namespace creation has repeatedly led to privilege escalation CVEs (CVE-2016-8655, CVE-2022-0185, CVE-2023-32233, etc.) because it exposes kernel code paths that were never designed to be reached by unprivileged users. Distros and enterprises often disable it with kernel.unprivileged_userns_clone=0 (a Debian/Ubuntu-specific sysctl) or user.max_user_namespaces=0.
  • PID namespace escape: a container with sufficient capabilities inside its user namespace can potentially manipulate the host's process tree via /proc references that span namespace boundaries. Restrict with seccomp + no-new-privileges.
  • Network namespace and host exposure: a container with CAP_NET_ADMIN inside its network namespace can manipulate that namespace's networking, but not the host's. However, a misconfigured shared network namespace (--net=host in Docker) gives the container full access to the host network stack.
  • Mount namespace and filesystem leakage: bind mounts can propagate between namespaces if the propagation type is shared. Container runtimes explicitly set mount propagation to slave or private for the container root so that mounts made inside the container do not propagate back to the host.

Performance Implications

  • Namespace creation cost: each clone(CLONE_NEW*) flag adds overhead in copy_process() for allocating and initializing the new namespace object. Creating all 8 namespace types at once costs roughly 50–200 µs on modern hardware.
  • Network namespace overhead: each network namespace has its own routing tables, iptables rule evaluation, and socket tables. On hosts with 10,000+ containers, the aggregate memory for all net namespace objects can be significant (each empty net namespace is ~50 KB of kernel memory).
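  • setns cost: switching between network namespaces with setns(fd, CLONE_NEWNET) is fast (~1 µs) because it does not touch the process's address space (mm_struct), so no TLB flush is involved. Joining a mount namespace costs somewhat more, since the kernel also resets the caller's root and working directory to the root of the target namespace.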
  • iptables performance at scale: each container typically gets iptables rules for NAT and filtering. At 10,000+ containers, iptables becomes a bottleneck (linear scan of rules). Large Kubernetes clusters commonly address this by running kube-proxy in IPVS mode or by using eBPF-based networking (Cilium).
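The aggregate memory figure for network namespaces follows from simple multiplication. Taking the numbers quoted above as given (they are not measured here) and reading "KB" as KiB, a back-of-the-envelope estimate looks like this:

```python
def netns_memory_mib(namespaces, kib_per_namespace=50):
    """Estimate total kernel memory for empty network namespaces, in MiB."""
    return namespaces * kib_per_namespace / 1024


print(f"{netns_memory_mib(10_000):.0f} MiB")  # 488 MiB
```

Real namespaces hold interfaces, routes, sockets, and firewall rules, so the actual footprint per container would be higher than this lower bound.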

Failure Modes

Failure                               Symptom                                        Root cause
────────────────────────────────────  ─────────────────────────────────────────────  ────────────────────────────────────────────────────
PID 1 exit kills container            Container exits unexpectedly                   Application ran as PID 1 without zombie reaping
Network namespace routing black hole  Container has IP but can't reach host          Missing veth routing or NAT rules
Mount propagation leak                Host sees container mounts                     Mount namespace not set to private/slave
User namespace UID mismatch           Permission denied on files                     uid_map not configured correctly
Namespace leak                        Growing /proc/sys/user/max_*_namespaces usage  fd or bind-mount holding namespace alive
Time namespace clock drift            Process sees wrong monotonic time              Time namespace offset misconfigured for CRIU restore

Modern Usage

Kubernetes: each pod gets a unique network namespace (shared by all containers in the pod). The CNI plugin (Flannel, Calico, Cilium) configures veth pairs and routes. The IPC namespace is also shared within the pod by default, while the PID namespace is shared only when shareProcessNamespace: true is set.

Rootless containers (Podman, rootless Docker): use user namespaces to map the calling user to UID 0 inside the container, enabling container operations without host root. newuidmap/newgidmap (setuid helpers) configure the full uid range mappings.
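The mappings that newuidmap writes are usually derived from the user's entry in /etc/subuid (format name:start:count). The sketch below shows the common two-range layout under that assumption: container UID 0 maps to the invoking user, and container UIDs from 1 upward map to the subordinate range. The user name alice, the numbers, and the function names are illustrative.

```python
def parse_subid(text, user):
    """Find (start, count) for user in /etc/subuid-style text, or None if absent."""
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) == 3 and parts[0] == user:
            return int(parts[1]), int(parts[2])
    return None


def rootless_uid_map(host_uid, sub_start, sub_count):
    """Return uid_map entries as (ns_start, host_start, count) tuples."""
    return [
        (0, host_uid, 1),           # container root is the invoking user
        (1, sub_start, sub_count),  # remaining container UIDs come from the subordinate range
    ]


# Illustrative input: user alice (UID 1000) with a 65536-wide subordinate range.
start, count = parse_subid("alice:100000:65536\n", "alice")
for entry in rootless_uid_map(1000, start, count):
    print(*entry)
```

With this layout a file owned by container UID 33 would appear on the host as UID 100032, which no login user owns, so a compromised container process gains nothing from it outside the namespace.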

Firecracker microVMs: while Firecracker uses KVM (hardware virtualization) for stronger isolation, its VMM process itself runs in a network namespace and mount namespace to restrict its access to host resources.


Future Directions

  • Landlock LSM + namespaces: Landlock (merged 5.13) provides programmatic filesystem access control that complements mount namespaces. Integrating Landlock with namespace transitions would allow per-namespace access policies.
  • Network namespace fast path: BPF-based networking (XDP, TC-BPF) inside network namespaces, avoiding iptables entirely. This is the direction of Cilium and similar CNI plugins.
  • Anonymous namespaces: proposal to create namespaces that are not associated with any /proc fd, useful for short-lived isolated contexts without the risk of namespace leaks via fd retention.
  • Namespace-aware seccomp: current seccomp policies are per-process. Proposals exist for namespace-scoped seccomp policies that apply to all processes entering a namespace, simplifying policy management.

Exercises

  1. Manual container: using only unshare, mount, ip, and chroot, create a minimal container from a rootfs tarball. It should have its own PID, network, mount, and UTS namespaces, a working lo interface, and isolated hostname. Document every command. Verify with ps aux and ip addr inside vs. outside.

  2. Network namespace routing: create two network namespaces connected by a veth pair. Write a TCP server in one namespace and a client in the other. Run tcpdump on the host's interfaces to verify that no packets escape to the host network stack; strace -e trace=network on the client shows the socket calls themselves.

  3. Rootless container exploration: install Podman. Run podman run -it alpine sh. From the host, find the container's PID. Read /proc/PID/status and note the NSpid: field. Read /proc/PID/uid_map and explain the UID mapping. Find where the container filesystem is mounted on the host.

  4. Namespace leak detection: write a C program that creates a new network namespace with unshare(CLONE_NEWNET), bind-mounts /proc/self/ns/net onto a regular file, and exits. (An open fd alone is not enough here, because the kernel closes all of a process's fds when it exits.) Observe with findmnt -t nsfs that the namespace persists. Unmount the file and verify cleanup.

  5. PID 1 SIGKILL behavior: create a PID namespace with unshare --pid --fork. Inside, start a background sleep 100. Kill PID 1 (the shell) from the host. Observe that the sleep process also dies immediately. Now install tini as PID 1 and repeat — sleep should survive until tini receives SIGTERM.


References

  • kernel/nsproxy.c: nsproxy structure, namespace switching
  • kernel/pid_namespace.c, net/core/net_namespace.c, fs/namespace.c
  • include/linux/user_namespace.h — uid/gid mapping structures
  • tools/testing/selftests/namespaces/ — kernel namespace self-tests
  • Kerrisk, The Linux Programming Interface: Chapter 28 (process creation in more detail, including clone()). The book (2010) predates user namespaces; see man 7 namespaces for later features
  • "Namespaces in operation", LWN.net series by Michael Kerrisk (2013–2016)
  • OCI Runtime Specification: https://github.com/opencontainers/runtime-spec
  • man 7 namespaces, man 2 unshare, man 2 setns, man 1 nsenter, man 8 lsns
  • Rootless containers: https://rootlesscontaine.rs/
  • Container Security, Liz Rice (O'Reilly, 2020) — practical namespace security