Kubelet Internals

Overview

The kubelet is the primary node agent in Kubernetes. It runs on every node in the cluster — both worker nodes and (in some configurations) control plane nodes — and is responsible for ensuring that containers described in PodSpecs are running and healthy. While the API server is the brain of the cluster, the kubelet is the hands: it translates the declarative intent stored in etcd into actual running processes on the underlying Linux machine.

The kubelet is a node component, not a control-plane component, and unlike the control-plane components in many clusters (which often run as static Pods) it does not itself run as a Pod. It is a system daemon (typically managed by systemd) because it must bootstrap the environment in which Pods run. It communicates upward to the API server and downward to the container runtime and volume plugins, and it manages Linux kernel cgroups for Pods while the runtime sets up their namespaces.

Understanding kubelet internals is essential for debugging node-level issues: why a Pod is stuck in ContainerCreating, why a node shows NotReady, why a container keeps getting OOM-killed, or why liveness probes are not behaving as expected.

Prerequisites

Understanding of the Kubernetes Pod model and PodSpec
Basic familiarity with Linux cgroups (v1 or v2) and namespaces
Knowledge of container runtimes (containerd, CRI-O)
Familiarity with gRPC and protocol buffers
Understanding of Kubernetes node objects and conditions

Historical Context

The kubelet predates the formal Kubernetes 1.0 release. In the early Google Borg system (which inspired Kubernetes), the equivalent agent was called the Borglet. When Kubernetes was open-sourced in 2014, the kubelet was one of the first components written and has evolved significantly:

Pre-1.5 (2016): Kubelet used Docker directly via the Docker remote API. No CRI abstraction existed.
1.5 (2016): The Container Runtime Interface (CRI) was introduced as an alpha feature, allowing pluggable runtimes.
1.20 (2020): Docker support deprecated in favor of CRI-compliant runtimes. Dockershim (the CRI adapter for Docker) began its removal path.
1.24 (2022): Dockershim removed. containerd and CRI-O are the standard runtimes.
1.25 (2022): cgroup v2 support matured; kubelet began fully supporting the unified cgroup hierarchy.
1.27 (2023): In-place Pod vertical scaling introduced as alpha, allowing CPU/memory changes without Pod restart.

The Pod Lifecycle Event Generator (PLEG) was introduced to replace per-Pod polling of the container runtime with a single periodic relist that emits events only when container state changes, significantly reducing kubelet CPU usage and container runtime load on large nodes.

Kubelet Responsibilities

The kubelet has a wide surface area of responsibilities:

1. Pod Realization: The kubelet's primary job is to take a PodSpec (whether from the API server, a static Pod manifest, or an HTTP endpoint) and make it real on the node. This means asking the runtime to create the Pod sandbox and its network namespace, pulling images, creating containers, mounting volumes, and starting the container processes.

2. Node Status Reporting: The kubelet continuously reports the node's status back to the API server. This includes hardware capacity (CPU, memory, storage), allocatable resources (capacity minus system and kubelet reservations), and condition signals (Ready, MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable).

3. Health Management: The kubelet runs liveness, readiness, and startup probes for every container that defines them. Based on probe outcomes, it kills and restarts containers (liveness), gates traffic routing (readiness), and delays liveness checks during startup (startup probe).

4. Eviction: When the node runs low on resources (memory, disk space, inodes, or process IDs), the kubelet proactively evicts Pods to protect node stability, ranking candidates by whether their usage exceeds their requests, by Pod priority, and by resource consumption.

5. Volume Management: The kubelet waits for volumes to be attached to the node, then works with CSI drivers to stage and publish them to the correct paths before container start, and to unpublish and unstage them after container stop.

The Kubelet Sync Loop and PLEG

The heart of the kubelet is its sync loop. Here is how the key components interact:

  API Server / Static Pods
         |
         v
  +--------------+
  |  Pod Manager  |  <-- desired state (PodSpecs)
  +--------------+
         |
         v
  +------------------+      polls CRI every 1s
  |  PLEG             |  --------------------------->  Container Runtime
  |  (Pod Lifecycle   |  <-- container state events --  (containerd/CRI-O)
  |   Event Generator)|
  +------------------+
         |  events: ContainerStarted, ContainerDied,
         |          ContainerRemoved, ContainerChanged
         v
  +------------------+
  |  Sync Loop        |  <-- syncPod() per Pod
  |  (pod workers)    |
  +------------------+
         |
         +----> CRI gRPC calls (RunPodSandbox, CreateContainer, ...)
         +----> Volume Manager (mount/unmount CSI volumes)
         +----> cgroup Manager (create/update cgroup hierarchy)
         +----> Probe Manager (run liveness/readiness probes)
         +----> Status Manager (update Pod status in API server)
         |
         v
  Node: actual running containers + cgroups + volumes

PLEG in Detail

The Pod Lifecycle Event Generator (PLEG) is a polling loop that runs every second. It calls ListPodSandbox and ListContainers on the CRI to get the current state of all sandboxes and containers on the node. It then compares this list to its internal cache and generates lifecycle events for any differences:

ContainerStarted: container was not running, now is
ContainerDied: container was running, now has exited
ContainerRemoved: container no longer exists
ContainerChanged: container moved into an unknown state

These events are put on a channel that the kubelet's main sync loop consumes. The sync loop hands each affected Pod to its pod worker, which then runs syncPod() for that Pod.
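The relist comparison can be sketched as a diff between two snapshots. This is a simplified illustration, not the kubelet's code: the state names, the `generate_events` function, and the dictionary snapshot format are all hypothetical, and it assumes that ContainerChanged is emitted when a container enters an unknown state.

```python
from typing import Dict, List, Tuple

# Simplified container states (hypothetical names for this sketch).
RUNNING, EXITED, UNKNOWN, NON_EXISTENT = "running", "exited", "unknown", "non-existent"


def generate_events(old: Dict[str, str], new: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare two relist snapshots (container ID -> state) and emit lifecycle events."""
    events: List[Tuple[str, str]] = []
    for cid in sorted(set(old) | set(new)):
        before = old.get(cid, NON_EXISTENT)
        after = new.get(cid, NON_EXISTENT)
        if before == after:
            continue  # no state change, so no event
        if after == RUNNING:
            events.append((cid, "ContainerStarted"))
        elif after == EXITED:
            events.append((cid, "ContainerDied"))
        elif after == UNKNOWN:
            events.append((cid, "ContainerChanged"))
        else:  # the container disappeared between relists
            if before != EXITED:
                # It was never observed as exited, so report the death first.
                events.append((cid, "ContainerDied"))
            events.append((cid, "ContainerRemoved"))
    return events


old_snapshot = {"a": RUNNING, "b": RUNNING}
new_snapshot = {"a": EXITED, "c": RUNNING}
for container_id, event in generate_events(old_snapshot, new_snapshot):
    print(container_id, event)
```

Only containers whose state differs between snapshots produce events, which is why a quiet node generates almost no sync work even though the relist itself runs every second.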

A critical failure mode: if a relist does not complete for more than 3 minutes (most often because a CRI call hangs), the kubelet's runtime health check fails and the node transitions to NotReady with a message containing "PLEG is not healthy". This is one of the most common node failure modes and is monitored via the kubelet_pleg_relist_duration_seconds metric.
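The health check behind that message amounts to comparing the time since the last completed relist against a threshold. A minimal sketch, assuming the 3-minute threshold listed in the failure-mode table later in this document; the function and constant names are hypothetical:

```python
# Assumed threshold: 3 minutes, matching the ">3min" failure mode described in this document.
RELIST_THRESHOLD_SECONDS = 3 * 60


def pleg_is_healthy(last_relist_ts: float, now_ts: float,
                    threshold: float = RELIST_THRESHOLD_SECONDS) -> bool:
    """Return False once the last completed relist is older than the threshold."""
    return (now_ts - last_relist_ts) <= threshold


print(pleg_is_healthy(last_relist_ts=0.0, now_ts=60.0))   # relisted a minute ago
print(pleg_is_healthy(last_relist_ts=0.0, now_ts=200.0))  # relist stalled
```

Because the check is based on the last completed relist, a single hung CRI call is enough to trip it even if the runtime process itself is still alive.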

CRI: Container Runtime Interface

The kubelet does not use runtime-specific APIs to talk to containerd or CRI-O. It speaks a gRPC interface called CRI, defined in k8s.io/cri-api, which each runtime implements. This abstraction enables runtime pluggability.

  kubelet
    |
    |  gRPC (CRI)
    v
  +----------------------------+
  |  RuntimeService            |
  |  - RunPodSandbox()         |  Create network namespace + pause container
  |  - StopPodSandbox()        |
  |  - RemovePodSandbox()      |
  |  - ListPodSandbox()        |
  |  - CreateContainer()       |  Create container within sandbox
  |  - StartContainer()        |
  |  - StopContainer()         |
  |  - RemoveContainer()       |
  |  - ListContainers()        |
  |  - ContainerStatus()       |
  |  - ExecSync()              |  exec probes (synchronous)
  |  - Exec()                  |  kubectl exec (streaming)
  |  - Attach()                |  kubectl attach
  |  - PortForward()           |
  +----------------------------+
  |  ImageService              |
  |  - PullImage()             |
  |  - ListImages()            |
  |  - RemoveImage()           |
  |  - ImageStatus()           |
  +----------------------------+
         |
         v
    containerd / CRI-O
         |
         v
    runc / kata / gVisor

Sandbox concept: A "sandbox" in CRI terminology is the shared environment for a Pod — the network namespace, IPC namespace, and the pause container (a minimal container that holds namespaces open). All application containers in the Pod share this sandbox. The pause container is sometimes called the "infra container."

Sequence for Pod startup:
1. RunPodSandbox: creates network namespace, calls CNI plugin, starts pause container
2. PullImage: pulls each container image if not cached
3. CreateContainer: creates container spec inside sandbox
4. StartContainer: executes the container process
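The call order above can be illustrated with a fake runtime that only records what it is asked to do. Everything here is hypothetical (`RecordingRuntime`, `start_pod`, the snake_case method names); real CRI calls are gRPC requests with much richer messages, and the sketch assumes an "if not present" image pull policy.

```python
class RecordingRuntime:
    """Hypothetical stand-in for a CRI client. It only records the calls made."""

    def __init__(self, cached_images=()):
        self.calls = []
        self.images = set(cached_images)

    def run_pod_sandbox(self, pod_name):
        self.calls.append("RunPodSandbox")
        return f"sandbox-{pod_name}"

    def pull_image(self, image):
        self.calls.append("PullImage")
        self.images.add(image)

    def create_container(self, sandbox_id, name, image):
        self.calls.append("CreateContainer")
        return f"{sandbox_id}/{name}"

    def start_container(self, container_id):
        self.calls.append("StartContainer")


def start_pod(runtime, pod_name, containers):
    """containers is a list of (name, image) pairs, started in order."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)  # one sandbox per Pod
    started = []
    for name, image in containers:
        if image not in runtime.images:  # pull only when the image is not cached
            runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, name, image)
        runtime.start_container(container_id)
        started.append(container_id)
    return started


runtime = RecordingRuntime(cached_images={"nginx:1.25"})
start_pod(runtime, "web", [("nginx", "nginx:1.25"), ("proxy", "envoy:v1")])
print(runtime.calls)
```

The sandbox is created once, while the pull, create, and start steps repeat per container, so a slow image pull for one container delays only the containers after it in the list.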

cgroup Management

The kubelet creates a cgroup hierarchy for resource enforcement. Under cgroup v1:

  /sys/fs/cgroup/cpu/
  └── kubepods/
      ├── besteffort/
      │   └── pod<UID>/
      │       ├── <container-ID>/    <-- app container
      │       └── <container-ID>/    <-- pause container
      ├── burstable/
      │   └── pod<UID>/
      │       └── <container-ID>/
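      └── pod<UID>/                  <-- Guaranteed pods sit directly under kubepods/
          └── <container-ID>/

The same hierarchy is mirrored across the cgroup v1 controllers the kubelet manages, such as cpu, memory, and pids. With the systemd cgroup driver the same tree appears as nested slices (kubepods.slice, kubepods-burstable.slice, and so on).

QoS tiers and cgroup placement:
- Guaranteed: every container sets CPU and memory requests equal to its limits. Pod cgroups are placed directly under /kubepods/. Gets the most predictable performance.
- Burstable: at least one container has a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria. Placed in /kubepods/burstable/.
- BestEffort: no requests or limits on any container. Placed in /kubepods/besteffort/. Typically among the first evicted under memory pressure.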
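The classification rules can be sketched as a small function. This is an illustration under stated assumptions, not the kubelet's implementation: `qos_class` is a hypothetical name, each container is a plain dict, and quantities are pre-normalized integers (millicores and bytes) so that equality comparison is meaningful.

```python
def qos_class(containers):
    """Classify a Pod from its containers' resources.

    Each container is a dict with optional "requests" and "limits" dicts
    keyed by "cpu" and "memory" (hypothetical format for this sketch).
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": 100}}]))
print(qos_class([{"limits": {"cpu": 500, "memory": 256 * 1024 * 1024}}]))
```

Note that a single container without limits is enough to drop a whole Pod out of the Guaranteed class, which is a frequent surprise when a sidecar is injected without resource settings.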

The kubelet sets cpu.shares proportional to CPU requests and memory.limit_in_bytes equal to memory limits. For CPU limits, it sets cpu.cfs_quota_us and cpu.cfs_period_us.
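The arithmetic behind those cgroup v1 values can be sketched as follows. The constants (1024 shares per CPU, a minimum of 2 shares, a 100ms default period, a 1ms minimum quota) reflect commonly documented kubelet defaults but should be treated as assumptions here, and the function names are hypothetical.

```python
SHARES_PER_CPU = 1024        # assumed: cpu.shares granted per full CPU requested
MILLI_CPU_PER_CPU = 1000
MIN_SHARES = 2               # assumed lower bound on cpu.shares
MAX_SHARES = 262144          # assumed upper bound on cpu.shares
DEFAULT_PERIOD_US = 100_000  # assumed cpu.cfs_period_us default (100ms)
MIN_QUOTA_US = 1_000         # assumed lower bound on cpu.cfs_quota_us (1ms)


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cpu.shares value."""
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(MIN_SHARES, min(shares, MAX_SHARES))


def milli_cpu_to_quota(milli_cpu: int, period_us: int = DEFAULT_PERIOD_US) -> int:
    """Convert a CPU limit in millicores to cpu.cfs_quota_us (0 means no limit set)."""
    if milli_cpu == 0:
        return 0
    quota = milli_cpu * period_us // MILLI_CPU_PER_CPU
    return max(quota, MIN_QUOTA_US)


print(milli_cpu_to_shares(250))  # shares for a 250m request
print(milli_cpu_to_quota(100))   # quota in microseconds for a 100m limit
```

Under these assumptions a 100m limit allows 10ms of CPU time in every 100ms period, which explains why a briefly busy container with a small limit can be throttled even on an otherwise idle node.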

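Under cgroup v2, the hierarchy is unified under /sys/fs/cgroup/ with a single directory tree. It uses cpu.weight and cpu.max in place of cpu.shares and the CFS quota/period pair, and memory.max in place of memory.limit_in_bytes; memory.min is set only when the MemoryQoS feature gate is enabled. cgroup v2 support has been generally available since Kubernetes 1.25.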

Volume Management

The kubelet's volume manager runs a reconciliation loop to attach/detach and mount/unmount volumes:

  Volume Manager Reconcile Loop

  Desired state (from PodSpecs):
    Pod A needs PVC "data-vol"

  Actual state (from node):
    PVC "data-vol" not mounted

  Actions:
    1. Wait for PV to be attached to node
       (done by attach-detach controller + CSI external-attacher)
    2. CSI NodeStageVolume(volumeID, stagingPath)
       -- formats filesystem if needed
       -- mounts at staging path (shared across pods using same PV)
    3. CSI NodePublishVolume(volumeID, stagingPath, targetPath)
       -- bind-mounts staging path to pod-specific path:
          /var/lib/kubelet/pods/<podUID>/volumes/<volPlugin>/<volName>
    4. Container starts with volume bind-mounted into namespace

On Pod deletion, the reverse happens: NodeUnpublishVolume, then NodeUnstageVolume once no other Pod on the node uses the volume. If a Pod is force-deleted or the node crashes, volumes may remain staged or mounted and can require manual cleanup, a common source of "volume stuck in terminating" issues.
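The reconcile idea, comparing desired state to actual state and emitting the CSI calls needed to converge, can be sketched like this. The `reconcile` function, the set-based state model, and the "WaitForAttach" action label are hypothetical simplifications; the real volume manager tracks far more state and runs these steps asynchronously.

```python
def reconcile(desired, attached, staged, published):
    """Return the ordered actions needed to converge actual state on desired state.

    desired, published: sets of (pod, volume) pairs
    attached, staged:   sets of volume names
    """
    actions = []

    def add(action):
        if action not in actions:  # avoid repeating per-volume steps
            actions.append(action)

    # Mount path: publish every desired volume that is not yet published.
    for pod, volume in sorted(desired - published):
        if volume not in attached:
            add(("WaitForAttach", volume))  # attachment happens outside the kubelet
            continue
        if volume not in staged:
            add(("NodeStageVolume", volume))  # once per volume per node
        add(("NodePublishVolume", pod, volume))  # once per Pod

    # Unmount path: unpublish what is no longer desired, then unstage unused volumes.
    for pod, volume in sorted(published - desired):
        add(("NodeUnpublishVolume", pod, volume))
    still_needed = {volume for _, volume in desired}
    for volume in sorted(staged - still_needed):
        add(("NodeUnstageVolume", volume))
    return actions


print(reconcile({("pod-a", "data-vol")}, {"data-vol"}, set(), set()))
print(reconcile(set(), {"data-vol"}, {"data-vol"}, {("pod-a", "data-vol")}))
```

Staging is per volume while publishing is per Pod, so two Pods sharing one volume on the same node produce one NodeStageVolume and two NodePublishVolume calls.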

Node Status Reporting

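The kubelet computes node status on a configurable interval (--node-status-update-frequency, default 10s) and writes it to the Node object in the API server when something has changed or a longer report interval has elapsed. Between those writes, a lightweight Lease object in the kube-node-lease namespace serves as the heartbeat. The node status includes: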

Capacity and Allocatable:

Capacity:
  cpu:     4
  memory:  16Gi
  pods:    110

Allocatable:
  cpu:     3800m        # minus kubelet + system reserved
  memory:  14Gi         # minus kubelet + system reserved
  pods:    110
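The relationship between capacity and allocatable can be written as simple arithmetic. In this sketch the reservation values are hypothetical numbers chosen to reproduce the example figures above, and the subtraction of a hard eviction threshold for memory is an assumption about how the node is configured.

```python
def allocatable(capacity, kube_reserved=0, system_reserved=0, eviction_hard=0):
    """Allocatable = capacity minus reservations and the hard eviction threshold."""
    return max(capacity - kube_reserved - system_reserved - eviction_hard, 0)


# CPU in millicores: 4 cores with 100m reserved each for kubelet and system daemons.
cpu_allocatable_m = allocatable(4000, kube_reserved=100, system_reserved=100)

# Memory in Mi: 16Gi with hypothetical reservations and a 512Mi eviction threshold.
memory_allocatable_mi = allocatable(
    16 * 1024, kube_reserved=1024, system_reserved=512, eviction_hard=512
)

print(f"cpu: {cpu_allocatable_m}m")
print(f"memory: {memory_allocatable_mi // 1024}Gi")
```

The scheduler places Pods against allocatable rather than capacity, so raising a reservation directly reduces how many Pod requests fit on the node.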

Node Conditions:

Conditions:
  Type              Status   Reason
  MemoryPressure    False    KubeletHasSufficientMemory
  DiskPressure      False    KubeletHasNoDiskPressure
  PIDPressure       False    KubeletHasSufficientPID
  Ready             True     KubeletReady

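If the kubelet's heartbeats stop for longer than --node-monitor-grace-period (a kube-controller-manager flag, 40s by default in the releases discussed here), the node lifecycle controller sets the node's Ready condition to Unknown, which kubectl displays as NotReady, and the Pod eviction countdown begins.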

Pod Probes

Probes are health checks executed by the kubelet (not the container runtime or the API server):

Liveness Probe: Determines if the container is alive. On failure, the container is killed and restarted (subject to restartPolicy). Useful for processes that can hang or deadlock without exiting.

Readiness Probe: Determines if the container is ready to serve traffic. On failure, the Pod is marked not ready and removed from the ready endpoints (Endpoints/EndpointSlice) of all matching Services. The container is NOT restarted. Useful for warm-up time and dependency checks.

Startup Probe: Runs first and gates the liveness and readiness probes. Until the startup probe succeeds, the other probes do not run. Once it succeeds, it stops running. If it exhausts its failureThreshold, the container is killed. Designed for slow-starting containers that would otherwise be killed by liveness before they finish initializing.

Probe mechanisms:
- exec: run command inside container; exit 0 = success
- httpGet: HTTP GET; status code from 200 to 399 = success
- tcpSocket: TCP connection; established = success
- grpc: gRPC health check protocol (enabled by default from 1.24)
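How individual probe results turn into a container verdict can be sketched with consecutive-result counting. `ProbeState` and `http_probe_ok` are hypothetical names; the defaults of failureThreshold=3 and successThreshold=1 match the documented probe defaults, and the rest is a simplification of the kubelet's probe workers.

```python
def http_probe_ok(status_code: int) -> bool:
    """An httpGet probe succeeds for any status code from 200 up to 399."""
    return 200 <= status_code < 400


class ProbeState:
    """Tracks consecutive probe results and flips state at the configured thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1, healthy=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._last_result = None
        self._run_length = 0

    def record(self, success: bool) -> bool:
        # Count how many times in a row the same result has been seen.
        if success == self._last_result:
            self._run_length += 1
        else:
            self._last_result = success
            self._run_length = 1
        if success and self._run_length >= self.success_threshold:
            self.healthy = True
        elif not success and self._run_length >= self.failure_threshold:
            self.healthy = False
        return self.healthy


state = ProbeState()
for code in (200, 503, 503, 503, 200):
    print(code, state.record(http_probe_ok(code)))
```

With the default thresholds, a single failed probe changes nothing; only three failures in a row flip the state, and one success flips it back.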

Eviction

When the node is under resource pressure, the kubelet evicts Pods before the kernel OOM killer fires (which would be uncontrolled). Eviction is triggered by:

- Soft eviction thresholds (e.g., memory.available < 1Gi for 2 minutes): give Pods a grace period.
- Hard eviction thresholds (e.g., memory.available < 500Mi): immediate eviction, no grace period.

Eviction order: the kubelet does not rank by QoS class directly. It sorts candidates by:
1. Whether the Pod's usage of the starved resource exceeds its request (Pods over their requests go first)
2. Pod priority (lower priority first)
3. How far usage exceeds the request (largest overage first)

In practice this resembles a QoS ordering. BestEffort Pods (request 0) and Burstable Pods running above their requests are evicted first. Guaranteed Pods and Burstable Pods within their requests are evicted last, and only if system or kubelet processes still need resources.

  Memory Pressure Eviction Example (all Pods at the same priority):

  Available memory: 450Mi (below hard threshold 500Mi)

  Eviction candidates sorted:
  +------------------+-------+---------+-----------+---------+
  | Pod              | QoS   | Request | Usage     | Over    |
  +------------------+-------+---------+-----------+---------+
  | worker-burst     | Burst |  256Mi  |   900Mi   |  644Mi  | <-- evict first
  | nginx-dev        | BestE |    0Mi  |   512Mi   |  512Mi  | <-- evict second
  | app-prod         | Guar  |  512Mi  |   512Mi   |    0Mi  | <-- last resort
  +------------------+-------+---------+-----------+---------+
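A ranking function makes the sort order concrete. This sketch assumes the ordering described in the Kubernetes node-pressure eviction documentation (usage above request first, then lower Pod priority, then larger overage); the `PodUsage` type, the function name, and the sample Pods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    request_mi: int
    usage_mi: int


def rank_for_memory_eviction(pods):
    """Return pods ordered from first-evicted to last-evicted."""
    def sort_key(pod):
        exceeds_request = pod.usage_mi > pod.request_mi
        overage = pod.usage_mi - pod.request_mi
        # False sorts before True, so pods over their request come first;
        # then lower priority; then larger overage (hence the negation).
        return (not exceeds_request, pod.priority, -overage)
    return sorted(pods, key=sort_key)


pods = [
    PodUsage("db-guaranteed", priority=0, request_mi=512, usage_mi=512),
    PodUsage("critical-burst", priority=1000, request_mi=100, usage_mi=900),
    PodUsage("api-burstable", priority=0, request_mi=256, usage_mi=400),
    PodUsage("batch-besteffort", priority=0, request_mi=0, usage_mi=700),
]
for pod in rank_for_memory_eviction(pods):
    print(pod.name)
```

In this sample, critical-burst has the largest overage but is ranked after both low-priority Pods, showing that priority is considered before the size of the overage.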

Debugging Notes

Node stuck NotReady:

# Check kubelet service status
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Check PLEG health
kubectl describe node <node> | grep -A5 "Conditions"

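# Key metric: PLEG relist duration (kubectl top nodes does not show it)
# Read it from the kubelet metrics endpoint through the API server proxy:
kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics" | grep kubelet_pleg_relist_duration_seconds
# Or query kubelet_pleg_relist_duration_seconds_bucket in Prometheus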

Pod stuck in ContainerCreating:

kubectl describe pod <pod> -n <ns>
# Look at Events section — common causes:
# - "Failed to pull image" — network/auth issue
# - "MountVolume.SetUp failed" — CSI/NFS issue
# - "failed to create containerd task" — runtime issue
# - "NetworkPlugin cni failed" — CNI plugin issue

Container being OOM killed:

kubectl describe pod <pod> | grep -A3 "OOMKilled"
# Check dmesg on the node:
dmesg | grep -i "oom\|kill"
# Check the container_oom_events_total metric (exposed by cAdvisor through the kubelet)

Liveness probe killing healthy containers:

# Check if probe is correctly configured
kubectl get pod <pod> -o yaml | grep -A10 livenessProbe
# Check initialDelaySeconds — too short for slow-starting apps
# Check timeoutSeconds — network latency causing false failures

Security Implications

The kubelet API (port 10250) should never be exposed publicly. It can exec commands in any container on the node. Use network policies and firewall rules.
Kubelet uses client certificates for API server authentication. Rotate these certificates regularly (--rotate-certificates).
Static Pod manifests on disk (typically /etc/kubernetes/manifests/) are trusted absolutely — anyone who can write to this directory can run privileged containers.
The kubelet can be started with --protect-kernel-defaults so that it exits with an error when kernel tunables differ from the values it expects, instead of silently modifying them.
Node authorization mode restricts kubelet to reading only Secrets/ConfigMaps that are bound to Pods on its own node — preventing a compromised kubelet from reading all cluster secrets.

Performance Implications

Large nodes (high Pod count) make each PLEG relist slower. Relists that overrun the 1s period delay event delivery, and a relist that stalls for more than 3 minutes marks the node NotReady. Tune --max-pods (default 110) based on runtime performance.
CPU manager (--cpu-manager-policy=static) allocates exclusive CPU cores to containers in Guaranteed QoS Pods that request whole (integer) CPUs, reducing context switching and noisy-neighbor interference for latency-sensitive workloads.
Memory manager (--memory-manager-policy=Static) provides NUMA-aware memory allocation for performance-sensitive pods.
--serialize-image-pulls=false allows parallel image pulls, reducing Pod startup time when multiple containers need different images.

Failure Modes

Failure	Symptom	Root Cause
PLEG unhealthy	Node NotReady	CRI calls hanging (>3min)
Disk pressure	Pods evicted	Logs/images filling disk
containerd crash	All pods stuck	Runtime process died
CNI misconfiguration	ContainerCreating forever	Pod network not set up
Volume stuck	Pod terminating forever	CSI NodeUnpublish hanging
CPU throttling	High p99 latency	CFS quota exhausted before the period ends

Modern Usage

In Kubernetes 1.29+, the kubelet supports:
- In-place Pod vertical scaling (alpha): update CPU/memory resources without a Pod restart
- Sidecar containers (beta in 1.29, enabled by default): native sidecar lifecycle distinct from regular containers; sidecars start before app containers and stop after them
- Swap memory support (beta in 1.28): controlled swap usage, limited to Burstable Pods under the LimitedSwap behavior
- cgroup v2 as the default on many distributions (Ubuntu 22.04+, RHEL 9)

Future Directions

Node Resource Topology: Fine-grained NUMA, CPU, and device topology-aware allocation to reduce memory access latency for HPC workloads.
DRA (Dynamic Resource Allocation): A more flexible API than the device plugin framework for requesting GPUs, FPGAs, and custom hardware.
Kubelet credential provider: Standardized plugin interface for pulling images from private registries without baking credentials into nodes.
Graceful node shutdown improvements: Better coordination between kubelet and systemd for clean Pod termination during node shutdown.

Exercises

SSH into a Kubernetes node that uses cgroup v1 and examine the cgroup hierarchy under /sys/fs/cgroup/memory/kubepods/. Find the memory.limit_in_bytes file for a running container and verify it matches the container's resource limit. On a cgroup v2 node, the equivalent file is memory.max under the unified hierarchy.
Intentionally cause PLEG to become unhealthy by stopping the container runtime (systemctl stop containerd). Observe how long before the node transitions to NotReady. Restore the runtime and observe recovery.
Create a Pod with a liveness probe that fails after 30 seconds (e.g., deletes a file that the probe checks). Observe the restart cycle in kubectl get pod -w and review the event log.
Deploy a BestEffort Pod and a Guaranteed Pod on the same node. Use a stress tool to exhaust node memory and observe eviction order.
Configure a container with a CPU limit of 100m and run a CPU-intensive workload. Use cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<UID>/<containerID>/cpu.stat to observe throttled_time increasing.

References

Kubernetes kubelet source code: pkg/kubelet/ in kubernetes/kubernetes
CRI API definition: k8s.io/cri-api/pkg/apis/runtime/v1/
"PLEG Is Not Healthy" — Kubernetes blog post on debugging PLEG
Container Runtime Interface design document: kubernetes/design-proposals-archive
cgroup v2 Linux kernel documentation: Documentation/admin-guide/cgroup-v2.rst
Brendan Gregg, "Linux Performance Analysis in 60 seconds" (node-level debugging applicable to kubelet issues)
Kubernetes documentation: "Node-pressure Eviction" (kubelet eviction signals, thresholds, and Pod selection)
KEP-1287: In-place Update of Pod Resources