Kubernetes Architecture: Control Plane, Data Plane, and the Declarative Model
Overview
Kubernetes is the dominant container orchestration platform for containerized production workloads. It grew out of Google's decade of experience with its internal Borg and Omega systems, was open-sourced in 2014, and became the first project hosted by the Cloud Native Computing Foundation in 2016. It is simultaneously a container scheduler, a distributed systems framework, and a platform for building higher-level abstractions.
Understanding Kubernetes at an architectural level — not just how to write YAML, but how the control plane actually works, what components are responsible for what, and why the system is designed as it is — is essential for operating it reliably at scale, debugging unexpected behavior, and extending it with custom controllers.
Prerequisites
- Familiarity with containers and Docker concepts
- Understanding of distributed systems fundamentals (consensus, CAP theorem)
- Basic Linux networking (routing, iptables)
- Familiarity with REST APIs and declarative configuration
- Understanding of key-value stores and event-driven architectures
Historical Context
Google has been running container workloads in production since 2003 with Borg, its internal cluster management system. Borg runs hundreds of thousands of jobs across multiple clusters, managing resource allocation, scheduling, and service discovery at a scale few organizations approach. A decade of operating Borg informed virtually every design decision in Kubernetes.
Key lessons from Borg that shaped Kubernetes:
- Declarative configuration over imperative commands: tell the system what you want, not how to achieve it
- Controllers and reconciliation loops: self-healing systems that continuously drive actual state toward desired state
- Ubiquitous labels: flexible metadata rather than rigid hierarchies
- Level-triggered notifications: react to current state, not just state changes (idempotent reconciliation)
Kubernetes launched as open source in June 2014. The Cloud Native Computing Foundation graduated it in 2018. As of 2024, Kubernetes is the de facto standard for container orchestration, with managed offerings from every major cloud provider (EKS, GKE, AKS).
System Architecture
╔══════════════════════════════════════════════════════════════════╗
║                          CONTROL PLANE                           ║
║                                                                  ║
║  ┌───────────────┐  ┌──────────────┐  ┌──────────────────────┐   ║
║  │ kube-         │  │ kube-        │  │ kube-controller-     │   ║
║  │ apiserver     │  │ scheduler    │  │ manager              │   ║
║  │               │  │              │  │                      │   ║
║  │ REST API      │  │ Watches for  │  │ Replication ctrl     │   ║
║  │ Auth/authz    │  │ unscheduled  │  │ Endpoints ctrl       │   ║
║  │ Validation    │  │ pods; picks  │  │ Node ctrl            │   ║
║  │ State in etcd │  │ node; writes │  │ Job ctrl             │   ║
║  │ Admission     │  │ nodeName     │  │ Namespace ctrl       │   ║
║  └───────┬───────┘  └──────────────┘  └──────────────────────┘   ║
║          │                                                       ║
║  ┌───────┴───────┐  ┌────────────────────────────────────────┐   ║
║  │ etcd          │  │ cloud-controller-manager               │   ║
║  │               │  │                                        │   ║
║  │ Distributed   │  │ Node lifecycle (cloud VM management)   │   ║
║  │ KV store      │  │ LoadBalancer provisioning              │   ║
║  │ All cluster   │  │ Route management                       │   ║
║  │ state         │  └────────────────────────────────────────┘   ║
║  └───────────────┘                                               ║
╚═════════════════════════════════╤════════════════════════════════╝
                                  │  kube-apiserver is the only endpoint
          ┌───────────────────────┼───────────────────────┐
          │                       │                       │
          ▼                       ▼                       ▼
╔══════════════════╗    ╔══════════════════╗    ╔══════════════════╗
║ NODE 1           ║    ║ NODE 2           ║    ║ NODE 3           ║
║                  ║    ║                  ║    ║                  ║
║ kubelet          ║    ║ kubelet          ║    ║ kubelet          ║
║  Reads Pod spec; ║    ║  Reads Pod spec; ║    ║  Reads Pod spec; ║
║  drives CRI to   ║    ║  drives CRI to   ║    ║  drives CRI to   ║
║  start containers║    ║  start containers║    ║  start containers║
║                  ║    ║                  ║    ║                  ║
║ kube-proxy       ║    ║ kube-proxy       ║    ║ kube-proxy       ║
║  Service         ║    ║  Service         ║    ║  Service         ║
║  networking      ║    ║  networking      ║    ║  networking      ║
║                  ║    ║                  ║    ║                  ║
║ Pods:            ║    ║ Pods:            ║    ║ Pods:            ║
║  [pod-x]         ║    ║  [pod-a]         ║    ║  [pod-d]         ║
║  [pod-y]         ║    ║  [pod-b]         ║    ║  [pod-e]         ║
╚══════════════════╝    ╚══════════════════╝    ╚══════════════════╝
                             DATA PLANE
Control Plane Components
kube-apiserver
The API server is the single entry point for all Kubernetes operations. Every component — kubelet, scheduler, controller-manager, and end-user clients — communicates exclusively through the API server. No component communicates with any other component directly (with one exception: kubelet can communicate directly with the container runtime).
Key responsibilities:
- REST API: serves the Kubernetes API over HTTPS on port 6443. All resources (Pods, Deployments, Services, etc.) are accessed via this API.
- Authentication: validates the identity of callers (client certificates, bearer tokens, OIDC tokens, service account tokens).
- Authorization: checks whether the authenticated identity has permission to perform the requested action on the requested resource (RBAC, ABAC, Node authorizer, Webhook authorizer).
- Admission control: validates and mutates requests before persisting. Admission webhooks allow custom validation and mutation logic (policy engines like OPA/Gatekeeper, security scanners, secret injection).
- Persistence: translates accepted API requests to etcd read/write operations. The API server does not maintain in-memory state that cannot be reconstructed from etcd.
- Watch: maintains long-lived watch connections to clients, delivering change events as objects in etcd are modified.
The API server is horizontally scalable — multiple replicas can run simultaneously. All state is in etcd. Load balancers in front of multiple API server replicas are standard in production deployments.
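The request path described above (authenticate, authorize, admit, then persist) can be sketched as a chain of stages. This is a toy model under stated assumptions, not the API server's actual code: the `Request` type, the `TOKENS` and `RBAC_RULES` tables, the label added during admission, and the `store` dict standing in for etcd are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    token: str      # bearer token presented by the caller
    verb: str       # e.g. "create"
    resource: str   # e.g. "pods"
    name: str
    obj: dict = field(default_factory=dict)


# Hypothetical lookup tables standing in for real authenticators and RBAC.
TOKENS = {"alice-token": "alice", "bob-token": "bob"}
RBAC_RULES = {("alice", "create", "pods"), ("alice", "get", "pods")}

store = {}  # stands in for etcd: key -> stored object


def authenticate(req):
    user = TOKENS.get(req.token)
    if user is None:
        raise PermissionError("401 Unauthorized")
    return user


def authorize(user, req):
    if (user, req.verb, req.resource) not in RBAC_RULES:
        raise PermissionError("403 Forbidden")


def admit(req):
    # Mutate first, then validate, mirroring mutating and validating admission.
    req.obj.setdefault("labels", {})["admitted"] = "true"
    if not req.obj.get("image"):
        raise ValueError("admission denied: image is required")


def handle(req):
    user = authenticate(req)
    authorize(user, req)
    admit(req)
    key = f"/registry/{req.resource}/default/{req.name}"  # illustrative key layout
    store[key] = dict(req.obj)  # persisted only after every check has passed
    return key
```

A rejected request raises before `store` is touched, which is the property the ordering is meant to guarantee: nothing reaches persistence without passing every earlier stage.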
etcd
etcd is the single source of truth for all cluster state. Every object in the Kubernetes cluster — every Pod, every Service, every ConfigMap, every Secret — is stored in etcd as a serialized key-value entry.
etcd uses the Raft consensus algorithm to ensure consistency across a cluster of typically 3 or 5 nodes. All writes must reach a quorum (majority) before being committed. This means an n-node etcd cluster can tolerate the failure of ⌊(n-1)/2⌋ nodes while continuing to accept writes (1 failure with 3 nodes, 2 failures with 5 nodes).
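The quorum arithmetic can be checked with a few lines of Python. The function names here are illustrative, not part of any etcd API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority still remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd cluster sizes are the usual choice.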
etcd is the most operationally critical component of a Kubernetes cluster. If etcd loses quorum (e.g., 2 of 3 nodes fail), existing workloads keep serving traffic (the data plane runs independently), but the API server can no longer persist changes: no new pods, no scaling, no config changes.
kube-scheduler
The scheduler watches for Pods with spec.nodeName unset (unscheduled pods). For each unscheduled pod, it:
- Runs filter plugins to eliminate nodes that cannot run the pod (insufficient resources, taint mismatch, affinity violations)
- Runs score plugins to rank remaining nodes by priority
- Selects the highest-scoring node
- Writes pod.spec.nodeName to the API server
The scheduler does not start containers. It only writes a nodeName to the pod spec. The kubelet on the targeted node detects this change and starts the containers. This separation of concerns is intentional — the scheduler is a pure decision-making component.
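The filter, score, and select steps above can be sketched as follows. This is a simplified model, not the scheduler's plugin framework: the `Node` and `Pod` types, their fields, and the "most free CPU" scoring rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    free_mem_mib: int
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    name: str
    cpu_millis: int
    mem_mib: int
    tolerations: set = field(default_factory=set)
    node_name: Optional[str] = None


def fits(pod, node):
    """Filter phase: can this node run the pod at all?"""
    enough = (node.free_cpu_millis >= pod.cpu_millis
              and node.free_mem_mib >= pod.mem_mib)
    tolerated = node.taints <= pod.tolerations  # every taint must be tolerated
    return enough and tolerated


def score(pod, node):
    """Score phase: prefer the node left with the most free CPU."""
    return node.free_cpu_millis - pod.cpu_millis


def schedule(pod, nodes):
    feasible = [n for n in nodes if fits(pod, n)]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    best = max(feasible, key=lambda n: score(pod, n))
    pod.node_name = best.name  # the only side effect is recording the decision
    return best.name
```

The function never starts anything; like the real scheduler, its output is just a node name written onto the pod.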
The scheduler can be replaced with a custom scheduler implementation. Multiple schedulers can run simultaneously; each Pod specifies which scheduler should schedule it via spec.schedulerName.
kube-controller-manager
The controller manager runs all built-in Kubernetes controllers as goroutines in a single binary (for efficiency). Each controller implements a reconciliation loop:
loop forever:
    desired_state = read from API server (etcd)
    actual_state = observe reality (API server, external systems)
    if desired_state != actual_state:
        take action to move actual → desired
    sleep(resync_period)
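A single pass of this loop can be made concrete for the replica-count case. The sketch below is an illustration, not controller-manager code: a Python set of pod names stands in for observed state, and mutating that set stands in for "taking action" through the API server.

```python
import itertools

_ids = itertools.count(1)  # source of unique pod name suffixes


def reconcile(desired_replicas, pods):
    """Run one reconciliation pass and return the actions taken.

    `pods` is a set of pod names representing observed state; it is
    mutated in place to model the effect of create/delete calls.
    """
    actions = []
    while len(pods) < desired_replicas:
        name = f"pod-{next(_ids)}"
        pods.add(name)
        actions.append(("create", name))
    while len(pods) > desired_replicas:
        name = sorted(pods)[-1]
        pods.remove(name)
        actions.append(("delete", name))
    return actions
```

Calling `reconcile` again when nothing has changed returns an empty list. That idempotence is what makes it safe to run the loop on a timer or after a restart.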
Key built-in controllers:
- ReplicaSet controller: ensures the correct number of Pod replicas are running. If a Pod dies, creates a replacement. If excess Pods exist, deletes them.
- Deployment controller: manages rolling updates by creating/scaling ReplicaSets.
- Node controller: watches Node objects, marks nodes as NotReady when they stop heartbeating, evicts pods from NotReady nodes after a configurable timeout.
- Endpoints/EndpointSlice controller: maintains the mapping from Service selectors to Pod IP:port lists. When a pod becomes Ready, adds it to the endpoint slice; when it terminates, removes it.
- Job/CronJob controller: creates Pods for batch jobs, tracks completion, and re-creates failed Pods up to the configured retry limit (backoffLimit).
- Namespace controller: when a namespace is deleted, cleans up all resources within it.
- ServiceAccount controller: creates default service accounts in each namespace.
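As an illustration of one of these controllers, the Endpoints/EndpointSlice mapping from a Service selector to Ready pod addresses can be sketched as a pure function. The dict shapes used here (`selector`, `target_port`, `labels`, `ip`, `ready`) are simplified assumptions, not the real API object schemas.

```python
def matches(selector, labels):
    """Equality-based selector: every key/value pair must be present."""
    return all(labels.get(k) == v for k, v in selector.items())


def endpoints_for(service, pods):
    """Return sorted 'ip:port' strings for Ready pods the Service selects."""
    return sorted(
        f"{p['ip']}:{service['target_port']}"
        for p in pods
        if p["ready"] and matches(service["selector"], p["labels"])
    )
```

A pod that matches the selector but is not Ready is left out, which is how readiness probes take a pod out of rotation without deleting it.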
cloud-controller-manager
The cloud-controller-manager runs cloud-specific controller logic. It is separated from the core kube-controller-manager so that cloud-specific code does not need to live in the core Kubernetes codebase.
- Node lifecycle: when a new Node joins the cluster, the cloud controller annotates it with cloud-provider-specific metadata (instance type, zone, region). When a node is deleted from the cloud provider (VM terminated), the cloud controller removes the corresponding Node object.
- Load balancer provisioning: when a Service of type LoadBalancer is created, the cloud controller calls the cloud provider API to provision a load balancer and obtain an external IP.
- Route management: in some CNI configurations, the cloud controller adds routes in the VPC routing table to enable pod-to-pod networking across nodes.
Data Plane Components
kubelet
The kubelet is an agent running on every worker node. It is the component that bridges the Kubernetes API with the local container runtime.
Kubelet watches the API server for Pods assigned to its node (those with spec.nodeName equal to its node name). When a new Pod appears:
- Pulls the container images (if not cached locally)
- Has the container runtime create the pod sandbox, whose network namespace is configured by the CNI plugin
- Mounts volumes (via the CSI plugin for persistent volumes)
- Calls the CRI (Container Runtime Interface) to start each container
- Monitors container health (liveness probes, readiness probes, startup probes)
- Reports pod status back to the API server
The kubelet also manages node-level concerns: it evicts pods when disk or memory pressure is detected, and it runs static pods (pods defined in files on the local filesystem, used to run control plane components on control plane nodes).
kube-proxy
kube-proxy implements Service networking on each node. It watches the API server for Service and EndpointSlice changes and translates them into local network rules (iptables, IPVS, or nftables, depending on the configured mode).
kube-proxy can be omitted in deployments where a CNI plugin such as Cilium implements Service networking entirely in eBPF.
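To make "translating endpoints into local rules" concrete: in iptables mode, a Service with n backends is typically expressed as a chain of n rules evaluated in order, each matching at random with a fixed probability. The sketch below only works through the arithmetic that makes such a chain uniform. The function names are hypothetical, and no real iptables rules are generated.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Per-rule match probability for a chain of n backends.

    Rule i is reached only if rules 0..i-1 did not match, so it must
    match with probability 1/(n-i) for the overall split to be even.
    """
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(probs):
    """Overall probability that each rule ends up handling a connection."""
    shares, remaining = [], Fraction(1)
    for p in probs:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares
```

For three backends the per-rule probabilities are 1/3, 1/2, and 1, and each backend ends up with an equal one-third share of new connections.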
Container Runtime Interface (CRI)
The CRI is an API that decouples Kubernetes from specific container runtimes. Kubernetes does not call Docker directly — it calls a CRI-compliant runtime. Common runtimes:
- containerd (default in most distributions): extracted from Docker, maintained by CNCF
- CRI-O: lightweight OCI-compliant runtime built specifically for Kubernetes
- gVisor (runsc): Google's sandboxed OCI runtime, which intercepts system calls in a user-space kernel (optional, for high-security isolation; typically selected per pod via a RuntimeClass)
Container Network Interface (CNI)
The CNI is a plugin interface for pod networking. When a pod is started, the configured CNI plugin is invoked (by the container runtime, on the kubelet's behalf) to:
- Set up networking inside the pod's network namespace
- Connect the pod's namespace to the node's network
- Assign an IP address to the pod
Common CNI plugins: Calico (eBPF or iptables policies), Cilium (eBPF), Flannel (simple overlay, suitable for smaller clusters), AWS VPC CNI (assigns real VPC IPs to pods, no overlay), Weave.
Container Storage Interface (CSI)
The CSI standardizes how storage providers (AWS EBS, GCP PD, NFS, Ceph, etc.) integrate with Kubernetes. CSI drivers handle provisioning, attaching, mounting, and lifecycle management of persistent volumes. The kubelet calls CSI to mount volumes into pod filesystems at container start time.
The Declarative Model and Level-Triggered Reconciliation
The Kubernetes design philosophy centers on declarative desired state. Users express what they want ("I want 3 replicas of this pod running") rather than how to achieve it ("create a pod on this node"). The system continuously reconciles actual state toward desired state.
This is described as level-triggered (reacting to the current state) rather than edge-triggered (reacting to state changes). The distinction is critical for reliability:
Edge-triggered (fragile):
"A pod died" → create a replacement
Problem: if the event is missed (crash, network partition),
the replacement is never created. One missed event = permanent divergence.
Level-triggered (robust):
"Desired replicas: 3, actual replicas: 2" → create a replacement
The controller checks state periodically. Even if it missed the event
when the pod died, the next reconciliation loop corrects the divergence.
Missed events cause delay, not permanent failure.
This principle makes Kubernetes self-healing. Controllers don't need to catch every event — they just need to run their reconciliation loop often enough. If a controller crashes and restarts, it re-reads the current state from etcd and reconciles from there. No events are "lost" in a way that causes permanent divergence.
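The difference can be demonstrated with a missed event. In this sketch (function names and the event string are illustrative), two pods die but only one "pod_died" event is delivered.

```python
def edge_triggered(replicas, delivered_events):
    """Reacts only to the events it actually receives."""
    for event in delivered_events:
        if event == "pod_died":
            replicas += 1  # one replacement per event seen
    return replicas


def level_triggered(desired, replicas):
    """Ignores history and compares desired state with what is observed now."""
    return replicas + max(0, desired - replicas)


desired = 3
actual_after_failures = 1      # two pods died
delivered = ["pod_died"]       # but one event was lost in transit

edge_result = edge_triggered(actual_after_failures, delivered)
level_result = level_triggered(desired, actual_after_failures)
```

The edge-triggered version ends at 2 replicas and stays there, while the level-triggered version returns to 3 regardless of how many events were lost.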
Operator Pattern
The Kubernetes architecture enables the Operator pattern: packaging domain-specific operational knowledge as custom controllers. An Operator:
1. Defines one or more Custom Resource Definitions (CRDs), which add new API types
2. Implements a controller that reconciles the custom resources to manage an application
Example: a PostgreSQL Operator defines a PostgresCluster CRD. When you create a PostgresCluster object, the controller creates the appropriate Pods, ConfigMaps, Services, and PersistentVolumeClaims, handles failover, manages backups, and performs rolling upgrades — encoding all the operational knowledge a DBA would normally apply manually.
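The core of such a controller is a function from one custom resource to the child objects it implies, plus a diff against what exists. The sketch below assumes a hypothetical `PostgresCluster` shape with `replicas` and `storage` fields and invents its own naming scheme. A real operator would define its own schema and would usually treat PersistentVolumeClaims more conservatively than deleting them on scale-down.

```python
def desired_children(cluster):
    """Expand one custom resource into the child objects it implies."""
    name = cluster["name"]
    spec = cluster["spec"]
    children = {("Service", name): {"selector": {"cluster": name}}}
    for i in range(spec["replicas"]):
        role = "primary" if i == 0 else "replica"
        children[("Pod", f"{name}-{i}")] = {"role": role}
        children[("PersistentVolumeClaim", f"data-{name}-{i}")] = {
            "size": spec["storage"]
        }
    return children


def plan(cluster, actual):
    """Diff desired children against existing ones: (to_create, to_delete)."""
    desired = desired_children(cluster)
    to_create = sorted(k for k in desired if k not in actual)
    to_delete = sorted(k for k in actual if k not in desired)
    return to_create, to_delete
```

Applying the plan and calling `plan` again yields two empty lists, the same idempotent reconciliation shape the built-in controllers follow.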
Debugging Notes
- API server connectivity: many control plane issues first show up as API server unavailability. Test with kubectl get nodes --request-timeout=5s. If this hangs or times out, the API server is unreachable (check load balancer, certificates, network rules).
- Stuck pods in Pending: use kubectl describe pod <name>; the Events section shows why the scheduler cannot place the pod (resource limits, taint mismatch, PVC not found).
- Controller not reconciling: check controller-manager logs for the specific controller. Each controller logs its reconciliation actions. Use kubectl logs -n kube-system kube-controller-manager-<node>.
- etcd health: etcdctl endpoint health --endpoints=... and etcdctl endpoint status show which etcd nodes are up and which is the leader. High disk latency (etcd_disk_wal_fsync_duration_seconds, etcd_disk_backend_commit_duration_seconds) is often the first sign of etcd problems.
- Node NotReady: kubectl describe node <name> shows conditions. Check for DiskPressure, MemoryPressure, NetworkUnavailable. Check kubelet logs on the node: journalctl -u kubelet.
Security Implications
- The API server is the most security-critical component. Expose it only to authorized networks; never expose port 6443 to the public internet without authentication.
- RBAC (Role-Based Access Control) has been stable since Kubernetes 1.8 and is the default authorization mode in most distributions. Audit RBAC bindings regularly; overprivileged service accounts are a common privilege escalation vector.
- By default, Secrets are stored in etcd unencrypted; the base64 encoding seen in manifests is not encryption. Enable encryption at rest for Secrets. etcd should never be directly accessible from outside the control plane network.
- Admission webhooks execute code for every API request they are registered for. A compromised or malicious webhook can modify any object it receives. Audit all deployed webhooks and use failurePolicy: Ignore only where appropriate.
- Node-to-API-server communication is authenticated via client certificates. Rotate node certificates regularly (kubelet client certificate rotation, available since 1.8, automates this when enabled).
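The point about Secrets is easy to verify: base64, the encoding used for Secret values in manifests and API responses, is reversible by anyone without a key. The secret value below is made up for the demonstration.

```python
import base64

# What a Secret's data field holds for a given value.
stored = base64.b64encode(b"s3cr3t-password").decode()

# Anyone who can read the stored value can reverse it; no key is involved.
recovered = base64.b64decode(stored)

if __name__ == "__main__":
    print(stored)
    print(recovered)
```

This is why read access to Secrets, or to etcd itself, has to be treated as access to the plaintext unless encryption at rest is configured.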
Performance Implications
- API server performance scales with etcd performance. Keep etcd write latency under 10ms p99. Use local NVMe SSDs for etcd — never network-attached storage.
- Large clusters (5000+ nodes) require careful API server tuning: watch cache sizing, etcd connection pool limits, API priority and fairness (APF) to prevent one client from starving others.
- Controller manager uses informers (cached watches) rather than repeated direct reads; this is essential for scalability. Never write controllers that call list directly against the API server in a tight loop.
- The scheduler processes on the order of 100 pods per second at 5000 nodes with default settings. Scheduling throughput can be increased by lowering percentageOfNodesToScore so that fewer feasible nodes are scored per pod, at some cost in placement quality.
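The informer idea mentioned in this list can be sketched as a local cache that is seeded by one list call and then kept current by watch events. This is a minimal model, not the client-go or controller-runtime implementation: the `Informer` class, its `api_calls` counter, and the event tuples are assumptions for illustration.

```python
class Informer:
    """Minimal local cache kept current by watch events."""

    def __init__(self, initial_list):
        # A single LIST at startup seeds the cache.
        self.cache = {obj["name"]: obj for obj in initial_list}
        self.api_calls = 1

    def on_event(self, kind, obj):
        # Watch events update the cache without further LIST calls.
        if kind in ("ADDED", "MODIFIED"):
            self.cache[obj["name"]] = obj
        elif kind == "DELETED":
            self.cache.pop(obj["name"], None)

    def get(self, name):
        # Reads are served from memory; the API server is not contacted.
        return self.cache.get(name)
```

However many times a controller calls `get`, `api_calls` stays at 1. A tight loop of direct list calls would instead put load on the API server proportional to the read rate.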
Failure Modes
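- etcd quorum loss: if etcd loses quorum, the API server can no longer commit writes and is at best read-only. Existing workloads continue running (the data plane is independent), but no changes can be made. Recover by bringing failed members back or by restoring the cluster from a snapshot; new members cannot be added while quorum is lost.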
- Scheduler failure: if the scheduler stops running, new pods are never scheduled. They sit in Pending state indefinitely. Existing running pods are unaffected.
- Controller manager failure: if the controller manager stops, reconciliation stops. If a node fails, pods are never evicted. If a Deployment is updated, no rolling update occurs. But all currently running pods continue running.
- Network partition: if the network partitions, only the side holding an etcd majority can commit writes, and API servers on the minority side become effectively read-only. Because Raft requires a majority for every write, the two sides cannot diverge into a split brain.
Modern Usage
Kubernetes has become the substrate for higher-level platforms (OpenShift, Rancher, Anthos, Tanzu). The Operator pattern has produced a rich ecosystem of operators for stateful applications. GitOps tools (ArgoCD, Flux) use Kubernetes controllers to continuously reconcile cluster state from Git repositories.
As of Kubernetes 1.29, the core scheduling framework and CSI integration are stable, and CNI is a stable, separately versioned specification. Gateway API (the successor to Ingress) is developed outside the core release cycle and reached v1.0 in October 2023. The focus has shifted from stabilizing core APIs to improving performance, security (Pod Security Admission replaced PodSecurityPolicy, which was removed in 1.25), and observability.
Future Directions
- Cluster API: manage Kubernetes cluster infrastructure itself via Kubernetes-style declarative objects (provision and manage entire clusters declaratively).
- WasmEdge integration: WebAssembly workloads alongside container workloads, using WASM's sub-millisecond startup for ephemeral tasks.
- Topology-aware routing improvements: smarter EndpointSlice topology hints to keep traffic within AZ boundaries automatically.
- Structured configuration language: cue, KCL, and Timoni exploring safer typed alternatives to raw YAML for Kubernetes manifests.
Exercises
- Start a local Kubernetes cluster (kind or minikube). Use kubectl get componentstatuses and review the state of each control plane component (this command has been deprecated since 1.19; kubectl get --raw='/readyz?verbose' is a current alternative). Inspect etcd using etcdctl to see the raw keys for a Pod you create.
- Stop the kube-scheduler in a local cluster (on kubeadm-based clusters it runs as a static pod, so move its manifest out of /etc/kubernetes/manifests rather than only deleting the pod). Create a new Deployment and observe that Pods remain in Pending state. Restart the scheduler and observe pods being scheduled.
- Write a custom controller using the controller-runtime library (Go) or kopf (Python) that watches ConfigMap objects and logs a message whenever a ConfigMap in the default namespace is created or modified.
- Trace a kubectl apply command from the command line through authentication, authorization, admission, persistence in etcd, and notification to the scheduler and controller-manager. Use kubectl verbose mode (-v=9) to see the actual API calls.
- Benchmark the API server's watch latency using kubectl get pods --watch and measuring the time from when you create a pod to when the watch event arrives. Compare with the time for the pod to be scheduled and running.
References
- Kubernetes Documentation: https://kubernetes.io/docs/concepts/
- Google's Borg, Omega, and Kubernetes (ACM Queue, 2016) — Burns, Grant, Oppenheimer, Brewer, Wilkes
- "Kubernetes: Up and Running" — Burns, Beda, Hightower (O'Reilly)
- Kubernetes Architecture: https://kubernetes.io/docs/concepts/architecture/
- Controller design principles: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-api-machinery/controllers.md
- "Production Kubernetes" — Josh Rosso et al. (O'Reilly, 2021)
- CNCF Kubernetes Operator White Paper (2021)