Section 22: Kubernetes Internals — Overview
Section Purpose and Scope
This section dissects Kubernetes from the inside out. Rather than covering kubectl commands and YAML syntax, it examines the architectural decisions underlying the control plane, the reconciliation loops that drive the system toward declared state, the network and storage plugin interfaces, and the security mechanisms governing multi-tenant clusters. The focus is on why Kubernetes works the way it does: its design choices, failure modes, and the invariants the system maintains under partial failure.
Prerequisites
- Section 15: Networking (TCP/IP, DNS, load balancing)
- Section 17: Distributed Systems (consensus, leader election, watch mechanisms)
- Section 18: Database Internals (B-trees, WAL — for etcd)
- Section 20: Containers (OCI spec, CRI, container runtimes)
- Section 21: Cloud Infrastructure (cloud provider integration context)
Learning Objectives
- Trace the complete lifecycle of a Pod from `kubectl apply` to a running container.
- Explain etcd's role and why it is the sole source of truth in a Kubernetes cluster.
- Describe the API server's request handling pipeline including admission controllers.
- Articulate the scheduler's filtering and scoring algorithm.
- Explain how controllers implement reconciliation loops and handle leader election.
- Describe how kube-proxy implements the Service abstraction using iptables or IPVS, and how eBPF-based replacements differ.
- Understand the CNI and CSI plugin interfaces at the level of what calls are made and when.
- Analyze RBAC policy evaluation and the relationship between ServiceAccounts and pod identity.
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ CONTROL PLANE │
│ │
│ ┌─────────────┐ ┌──────────────────────────────────────┐ │
│ │ etcd │◄───│ kube-apiserver │ │
│ │ (Raft │ │ - Authentication (certs, OIDC) │ │
│ │ consensus) │ │ - Authorization (RBAC, ABAC, webhook)│ │
│ │ 3 or 5 │ │ - Admission Controllers │ │
│ │ members │ │ - Serialization / validation │ │
│ │ │ │ - Watch/List serving │ │
│ └─────────────┘ └────────────────┬─────────────────────┘ │
│ │ watches │
│ ┌─────────────────────────────────── ▼──────────────────────┐ │
│ │ kube-controller-manager (leader-elected) │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ ReplicaSet │ │ Deployment │ │ Node │ │ │
│ │ │ Controller │ │ Controller │ │ Controller │ │ │
│ │ └───────────────┘ └───────────────┘ └────────────────┘ │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌────────────────┐ │ │
│ │ │ Endpoint │ │ ServiceAccount│ │ Job │ │ │
│ │ │ Controller │ │ Controller │ │ Controller │ │ │
│ │ └───────────────┘ └───────────────┘ └────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ kube-scheduler (leader-elected) │ │
│ │ Filter plugins → Score plugins → Bind │ │
│ │ (NodeAffinity, Taint/Toleration, ResourceFit, ...) │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ API calls
┌─────────────────────────── ▼──────────────────────────────────────┐
│ DATA PLANE (per node) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ - PodSpec reconciliation loop │ │
│ │ - CRI calls → containerd/CRI-O │ │
│ │ - CSI volume mounting │ │
│ │ - CNI plugin invocation at pod network setup │ │
│ │ - Liveness/readiness probe execution │ │
│ │ - Node status reporting │ │
│ └────────────────────┬────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼────────────────────────────────────────┐ │
│ │ Container Runtime (containerd/CRI-O) │ │
│ │ → runc / kata / gVisor (OCI runtime) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ kube-proxy (or Cilium/eBPF replacement) │ │
│ │ Watches Services/Endpoints → programs iptables / ipvs │ │
│ │ or eBPF maps for ClusterIP, NodePort, LoadBalancer │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ Pod Network (CNI plugin): │
│ ┌──────────┐ veth ┌──────────┐ ┌──────────────────────────┐ │
│ │ Pod A ├────────┤ cbr0 / │ │ Overlay / BGP │ │
│ │ eth0 │ │ bridge ├──┤ (VXLAN, WireGuard, │ │
│ └──────────┘ └──────────┘ │ BGP with Calico) │ │
│ └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘
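The scheduler stage in the diagram (Filter plugins → Score plugins → Bind) can be illustrated with a minimal sketch. This is not the kube-scheduler implementation: the `Node` and `Pod` classes, the two filter functions, and `headroom_score` are hypothetical stand-ins for the real plugins, and ties are broken by node name only to keep the result deterministic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    taints: frozenset = frozenset()


@dataclass
class Pod:
    name: str
    cpu_millis: int
    mem_mib: int
    tolerations: frozenset = frozenset()


# --- Filter phase: each predicate can veto a node outright. ---
def fits_resources(pod, node):
    return pod.cpu_millis <= node.free_cpu_millis and pod.mem_mib <= node.free_mem_mib


def tolerates_taints(pod, node):
    # Every taint on the node must be tolerated by the pod.
    return node.taints <= pod.tolerations


FILTERS = [fits_resources, tolerates_taints]


# --- Score phase: rank the surviving nodes (hypothetical 0-100 score). ---
def headroom_score(pod, node):
    # Prefer nodes that keep the largest fraction of free CPU and memory.
    cpu_left = (node.free_cpu_millis - pod.cpu_millis) / max(node.free_cpu_millis, 1)
    mem_left = (node.free_mem_mib - pod.mem_mib) / max(node.free_mem_mib, 1)
    return round(50 * (cpu_left + mem_left))


def schedule(pod, nodes):
    """Return the chosen node name, or None if no node passes filtering."""
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None  # the pod would stay Pending
    # Highest score wins; node name is an arbitrary deterministic tie-break.
    best = max(feasible, key=lambda n: (headroom_score(pod, n), n.name))
    return best.name  # the real scheduler would now write a Binding object


nodes = [
    Node("a", 1000, 1024),
    Node("b", 4000, 8192, taints=frozenset({"gpu"})),
    Node("c", 2000, 4096),
]
print(schedule(Pod("web", 500, 512), nodes))
```

Filtering is a hard constraint and scoring is a preference. Node `b` has the most headroom, but it is never scored for a pod that does not tolerate its taint.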
Key Concepts
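- etcd: Distributed key-value store built on the Raft consensus algorithm. Cluster state is stored under the `/registry/` prefix, with built-in resources serialized as protobuf (custom resources are stored as JSON by default). The API server consumes etcd's watch mechanism to feed its watch cache, which in turn serves client informers. Because Raft needs a majority of members to commit a write, a 3-member cluster tolerates one failure and a 5-member cluster tolerates two.
- API Server: Stateless front end to etcd and the only component that talks to it directly. Every object mutation passes through authentication, authorization, and admission before being written. Serves Watch streams to controllers, the scheduler, and kubelets.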
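- Informers: Client-side watch + local cache abstraction. Controllers read from the informer's cache instead of querying the API server on every reconcile, and shared informers let many controllers reuse a single watch per resource type, which avoids thundering-herd load on the API server. Informers use List/Watch semantics with resource version tracking.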
- Reconciliation Loop: Every controller runs: observe current state → compute delta → apply action → re-observe. This level-triggered (not edge-triggered) approach makes the system self-healing.
- Scheduler: Two-phase: filtering (eliminates unfit nodes) then scoring (ranks fit nodes). Extensible via scheduler framework plugins. Binds a pod to a node by writing a Binding object to the API server.
- kubelet: Node agent. Watches pod specs assigned to its node. Drives the container runtime via CRI gRPC calls. Reports node capacity, allocatable resources, and pod status.
- kube-proxy: Implements the Service virtual IP (ClusterIP) by programming iptables NAT rules or IPVS virtual servers on each node. eBPF-based data planes such as Cilium can replace kube-proxy entirely, handling Service load balancing in eBPF programs attached at the socket, tc, and XDP hooks.
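- CNI (Container Network Interface): Plugin specification invoked at pod sandbox creation and deletion. In current Kubernetes the container runtime (containerd, CRI-O) executes the plugin on the kubelet's behalf while handling RunPodSandbox. The plugin receives the container's network namespace path and returns the assigned IP addresses. Calico, Flannel, Cilium, and Weave Net are implementations.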
- CSI (Container Storage Interface): Plugin specification for provisioning, attaching, and mounting volumes. Three gRPC services: Identity, Controller (CreateVolume, ControllerPublishVolume), Node (NodeStageVolume, NodePublishVolume).
- CRI (Container Runtime Interface): gRPC API kubelet uses to manage pods and containers. RuntimeService (RunPodSandbox, CreateContainer, StartContainer) and ImageService.
- Admission Controllers: Webhooks and built-in plugins that intercept API server write requests. Mutating admission runs first, then validating admission. Used for policy enforcement (OPA Gatekeeper, Kyverno), sidecar injection, resource defaulting.
- RBAC: Role-Based Access Control. Subjects (User, Group, ServiceAccount) bound to Roles (namespace-scoped) or ClusterRoles via RoleBinding/ClusterRoleBinding. Evaluated by the API server authorizer.
- Operator Pattern: Custom controller + Custom Resource Definition. Encodes operational knowledge as code. Operator reconciles custom resource state against application state. Used for databases, ML workloads, etc.
- HPA / VPA / KEDA: Horizontal Pod Autoscaler scales replicas on CPU/custom metrics. Vertical Pod Autoscaler adjusts resource requests. KEDA enables event-driven autoscaling (queue depth, etc.).
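The reconciliation loop described in the list above (observe current state → compute delta → apply action → re-observe) can be sketched with an in-memory stand-in for a ReplicaSet. `FakeCluster` and `reconcile` are hypothetical names introduced for illustration. A real controller would read from an informer cache and act through the API server.

```python
import itertools


class FakeCluster:
    """In-memory stand-in for the observed state of one ReplicaSet (hypothetical)."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas
        self.pods = set()
        self._ids = itertools.count(1)

    def create_pod(self):
        name = f"pod-{next(self._ids)}"
        self.pods.add(name)
        return name

    def delete_pod(self, name):
        self.pods.discard(name)


def reconcile(cluster):
    """One pass: observe, compute the delta, act. Returns the actions taken."""
    actions = []
    delta = cluster.desired_replicas - len(cluster.pods)
    if delta > 0:
        # Too few pods: create the missing ones.
        for _ in range(delta):
            actions.append(("create", cluster.create_pod()))
    elif delta < 0:
        # Too many pods: delete the surplus (lowest names first, arbitrarily).
        surplus = -delta
        for name in sorted(cluster.pods)[:surplus]:
            cluster.delete_pod(name)
            actions.append(("delete", name))
    return actions


cluster = FakeCluster(desired_replicas=3)
print(reconcile(cluster))   # initial convergence: three creates
print(reconcile(cluster))   # already converged: no actions

cluster.delete_pod("pod-2")  # simulate a pod dying behind the controller's back
print(reconcile(cluster))   # the loop notices the gap and heals it
```

The function never asks what event happened. It only compares desired and observed state, which is what level-triggered means here: a missed or duplicated notification cannot leave the system permanently wrong, because the next pass recomputes the delta from scratch.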
Major Historical Milestones
| Year | Event |
|---|---|
| 2014 | Kubernetes open-sourced by Google (based on internal Borg/Omega) |
| 2015 | Kubernetes v1.0; CNCF founded; CNI specification created |
| 2016 | StatefulSets, DaemonSets, RBAC alpha; Helm 1.0 |
| 2017 | CRI finalized; containerd donated to CNCF; RBAC GA |
| 2018 | CSI GA; CoreDNS replaces kube-dns as the default cluster DNS |
| 2019 | Kubernetes passes 1M cluster milestone; Server-Side Apply alpha; CustomResourceDefinitions GA (v1.16) |
| 2019 | Cilium eBPF-based networking matures; kube-proxy replacement viable |
| 2020 | Dockershim deprecation announced; direct CRI adoption |
| 2021 | Structured logging; Server-Side Apply GA; ephemeral containers beta |
| 2022 | Dockershim removed (v1.24); Gateway API beta; OpenTelemetry integration |
| 2023 | Sidecar containers KEP; KubeVirt maturity; AI/GPU workload scheduling KEPs |
| 2024 | Dynamic Resource Allocation beta (v1.32); topology-aware scheduling improvements |
Modern Relevance
Kubernetes has become the de facto standard platform for running containerized workloads in the cloud, and most new cloud-native workloads run on or alongside it. Understanding its internals is essential for: diagnosing scheduling failures and resource starvation, designing secure multi-tenant configurations, building operators for complex stateful applications, tuning network policy performance, and reasoning about the failure modes of the control plane itself.
eBPF-based networking (Cilium, Hubble) is displacing iptables-based kube-proxy in many clusters, enabling per-flow observability and policy enforcement without traversing long iptables rule chains. The Gateway API is superseding Ingress with a richer, role-oriented networking model. AI/ML workload scheduling (GPUs, fractional resources, topology awareness) is driving significant new scheduler extensions.
File Map
22-kubernetes-internals/
├── 00-overview.md ← this file
├── 01-architecture.md ← control plane vs data plane, component responsibilities
├── 02-etcd-internals.md ← Raft, watch mechanism, compaction, defrag
├── 03-api-server.md ← request pipeline, admission, watch serving
├── 04-scheduler.md ← filter/score framework, affinity, bin packing
├── 05-controller-manager.md ← reconciliation loops, leader election, informers
├── 06-kubelet.md ← pod lifecycle, CRI calls, probes, eviction
├── 07-kube-proxy.md ← iptables/IPVS service implementation
├── 08-cni-plugins.md ← CNI spec, Calico, Cilium, Flannel internals
├── 09-csi-plugins.md ← CSI spec, volume lifecycle, topology
├── 10-cri.md ← CRI gRPC API, shim design
├── 11-pod-lifecycle.md ← from pending to running, init containers, sidecars
├── 12-service-mesh.md ← Envoy, Istio, Linkerd data plane integration
├── 13-admission-controllers.md ← webhook chain, OPA/Kyverno, sidecar injection
├── 14-rbac-internals.md ← RBAC evaluation, ServiceAccount projection, IRSA
├── 15-resource-management.md ← requests/limits, QoS classes, eviction
└── 16-autoscaling-operators.md ← HPA/VPA/KEDA, operator pattern, CRD design
Cross-References
- Section 17 (Distributed Systems): Raft consensus in etcd; leader election patterns
- Section 18 (Database Internals): etcd's bbolt storage engine; watch as event log
- Section 20 (Containers): CRI, OCI runtime spec, CNI plumbing
- Section 21 (Cloud Infrastructure): EKS/GKE/AKS control plane managed service design
- Section 23 (Observability): Prometheus scraping in Kubernetes; OpenTelemetry operator
- Section 25 (Performance Engineering): CPU/memory limits, network latency in pod communication
- Section 26 (Security): RBAC, admission policies, network policies, pod security standards
- Section 28 (Reliability Engineering): Pod disruption budgets, topology spread, multi-region