Kubernetes Resource Management

Overview

Resource management in Kubernetes is the mechanism by which compute resources (CPU and memory, primarily) are requested, allocated, enforced, and scaled. It sits at the intersection of the Linux kernel's cgroup subsystem, the Kubernetes scheduler's bin-packing algorithms, and the controller ecosystem (HPA, VPA, KEDA) that adjusts workload size over time.

Getting resource management right is one of the hardest operational challenges in Kubernetes. Under-provisioning causes CPU throttling and OOM kills. Over-provisioning wastes money and reduces cluster density. The defaults (no requests, no limits) are almost always wrong for production systems. This document explains the full stack from kernel enforcement to autoscaling.

Prerequisites

  • Linux cgroup v1 and v2 fundamentals (CFS scheduler, memory controller)
  • Kubernetes scheduler concepts (node scoring, bin-packing)
  • Understanding of QoS classes (Guaranteed, Burstable, BestEffort)
  • Familiarity with Kubernetes Pods, Deployments, and Namespaces
  • Basic Prometheus/metrics familiarity for autoscaling

Historical Context

Kubernetes inherited resource enforcement from Linux cgroups, which originated at Google in 2006 (then called "process containers") as a way to isolate Borg tasks and were merged into the mainline kernel in 2.6.24 (2008). The CFS (Completely Fair Scheduler) CPU bandwidth controller, the mechanism behind CPU limits, was merged in Linux 3.2 (released January 2012).

The Horizontal Pod Autoscaler was introduced in Kubernetes 1.1 (2015), scaling on CPU utilization only. Custom metrics support came in 1.6 (2017) via the metrics API. Vertical Pod Autoscaler was released by Google as an open-source project in 2018.

KEDA (Kubernetes Event-Driven Autoscaling) was created by Microsoft and Red Hat in 2019 and donated to CNCF in 2020, filling the gap between HPA's metric limitations and real-world queue/event-driven workloads.

The CPU throttling problem (CFS bandwidth) became widely publicized around 2019 when engineers at Zalando, Shopify, and others published analyses showing that CPU limits caused severe latency spikes even when a container's average CPU usage was far below its limit.

Resource Model

  Resource units:

  CPU:
    1 CPU = 1000 millicores (m)
    500m = 0.5 CPU core
    100m = 0.1 CPU (1m is the finest granularity Kubernetes accepts)
    Fractional cores are shared via CFS scheduling.

  Memory:
    Specified in bytes:
    128Mi = 128 mebibytes (binary, 1Mi = 1,048,576 bytes)
    1Gi = 1 gibibyte
    1G  = 1 gigabyte (decimal, avoid — use Mi/Gi)

  Other resources:
    hugepages-2Mi: memory backed by 2Mi huge pages (a byte quantity, not a page count)
    nvidia.com/gpu: GPU count (integer only, from device plugin)
    ephemeral-storage: local disk space
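
The unit rules above can be made concrete with a small parser. This is a minimal sketch that handles only the suffixes listed in this section plus a few neighbours; the function names are hypothetical, and it is not the Kubernetes quantity parser (which also accepts exponent notation).

```python
from decimal import Decimal

# Multipliers for a subset of Kubernetes quantity suffixes.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(1) / 1000,  # milli, used for CPU
}


def parse_quantity(text: str) -> Decimal:
    """Return the numeric value of a quantity such as '500m', '128Mi' or '2'."""
    # Check two-character suffixes first so 'Mi' is not read as 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def cpu_millicores(text: str) -> int:
    """Convert a CPU quantity to whole millicores."""
    return int(parse_quantity(text) * 1000)


def memory_bytes(text: str) -> int:
    """Convert a memory quantity to bytes."""
    return int(parse_quantity(text))
```

For example, memory_bytes("1Gi") - memory_bytes("1G") is 73,741,824 bytes, which is why mixing decimal and binary suffixes in the same manifest is easy to get wrong.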

Requests vs Limits

This is the most important distinction in Kubernetes resource management:

  requests: amount the scheduler reserves + guaranteed share under contention
  limits:   hard enforcement ceiling

  +-------------------+----------------------------------+----------------------------------+
  | Property          | requests                         | limits                           |
  +-------------------+----------------------------------+----------------------------------+
  | Used by scheduler | YES — bin-packs based on         | NO — scheduler ignores limits    |
  |                   | sum(requests) on node            |                                  |
  +-------------------+----------------------------------+----------------------------------+
  | Kernel enforcement| CPU: sets cpu.shares proportion  | CPU: sets CFS quota (hard cap)   |
  |                   | Memory: none by default (cgroup  | Memory: sets memory.max (OOM)    |
  |                   | v2 memory.min needs MemoryQoS)   |                                  |
  +-------------------+----------------------------------+----------------------------------+
  | Can exceed?       | YES — container can use more     | NO (memory), throttled (CPU)     |
  |                   | if node has spare capacity       |                                  |
  +-------------------+----------------------------------+----------------------------------+

  Example:

  Node has 4 CPU cores and 8Gi memory.

  Pod A: requests 1 CPU / 2Gi,  limits 2 CPU / 4Gi
  Pod B: requests 1 CPU / 2Gi,  limits 2 CPU / 4Gi
  Pod C: requests 1 CPU / 2Gi,  limits 2 CPU / 4Gi

  Scheduler sees: 3 CPU requested of 4 available → all three fit.
  Scheduler does NOT see: 6 CPU max limits of 4 available → would be overcommitted.

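  At runtime: if all three pods spike to 2 CPU simultaneously, they demand
  6 CPU from a 4-core node. No pod reaches its 2 CPU quota; CFS splits the
  4 cores in proportion to cpu.shares, so with equal requests each pod gets
  about 1.33 CPU.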
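
The scheduler's view of this example can be sketched in a few lines. This is a simplification that looks at CPU only and treats the node's 4 cores as fully allocatable (real nodes reserve some capacity for system daemons); the Pod class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request_m: int  # millicores
    cpu_limit_m: int    # millicores


def fits(node_allocatable_m: int, scheduled: list[Pod], candidate: Pod) -> bool:
    """Requests-only fit check: limits play no part in the decision."""
    requested = sum(p.cpu_request_m for p in scheduled) + candidate.cpu_request_m
    return requested <= node_allocatable_m


def limit_overcommit(node_allocatable_m: int, pods: list[Pod]) -> float:
    """Ratio of summed limits to node capacity; above 1.0 means overcommitted."""
    return sum(p.cpu_limit_m for p in pods) / node_allocatable_m
```

With pods A, B, and C already placed (1000m request, 2000m limit each), a fourth identical pod still fits on the 4000m node, while the limit overcommit ratio for the first three is already 1.5.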

CPU Throttling: The Silent Latency Killer

CPU throttling is poorly understood because it does not show up in CPU usage metrics — a container can be at 20% average CPU utilization and still be heavily throttled.

  CFS Bandwidth Control (cgroup v1):

  /sys/fs/cgroup/cpu/<pod>/<container>/
    cpu.cfs_period_us = 100000   (100ms period — default)
    cpu.cfs_quota_us  = 10000    (10ms = 100m CPU = 0.1 core limit)

  How it works:
  - Every 100ms period, container is given 10ms of CPU time budget.
  - If container uses all 10ms, it is THROTTLED for the remaining 90ms.
  - No work happens while throttled, even if other CPUs are idle.
  - Budget resets at next period.

  The problem:
  - Budget is global across all CPUs in the period.
  - A container with 500m limit (50ms budget) running on 4 CPUs
    can burn 50ms in 12.5ms of wall-clock time if using all 4 cores.
  - Result: throttled for ~87ms → 87ms pause every 100ms period.
  - Observed as high p99/p999 latency even at low average CPU.

  Mitigation options:
  1. Remove CPU limits entirely (watch for noisy neighbors)
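  2. Tune the CFS period (kernel range 1ms to 1s). The period has always
     been a per-cgroup setting; in Kubernetes it is set node-wide with the
     kubelet flag --cpu-cfs-quota-period (behind the CustomCPUCFSQuotaPeriod
     feature gate). A longer period gives a larger budget per period but a
     longer worst-case stall once the budget is spent.
     -- Linux 5.14+ adds CFS burst (cpu.cfs_burst_us), which lets a cgroup
        carry unused quota into later periods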
  3. Use CPU Manager with static policy for Guaranteed pods
     (pins container to exclusive CPUs — no CFS sharing)
  4. Set limits = requests (Guaranteed QoS) to avoid over-subscription
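
The budget arithmetic from the throttling example above can be checked with a short calculation. It is a sketch under a strong assumption: every busy thread is runnable for the whole period and has its own CPU. The function names are hypothetical.

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time budget per period implied by a CPU limit."""
    return limit_millicores * period_us // 1000


def throttled_us_per_period(limit_millicores: int, busy_threads: int,
                            period_us: int = 100_000) -> float:
    """Wall-clock time per period spent throttled, assuming all threads stay busy."""
    quota = cfs_quota_us(limit_millicores, period_us)
    # N threads running in parallel drain the shared budget N times faster.
    wall_to_exhaust = quota / busy_threads
    if wall_to_exhaust >= period_us:
        return 0.0  # budget outlasts the period, so no throttling
    return period_us - wall_to_exhaust
```

A 500m limit with 4 busy threads gives a 50,000 us budget that is gone after 12,500 us of wall-clock time, leaving 87,500 us throttled in each 100,000 us period.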

Checking throttling:

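# On the node, for a specific container (cgroup v1, cgroupfs driver):
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod<UID>/<containerID>/cpu.stat
# Look for: nr_periods, nr_throttled, and throttled_time (nanoseconds throttled)
# On cgroup v2 the file reports throttled_usec (microseconds) instead

# Via Prometheus (both are counters, so compare their rates):
container_cpu_cfs_throttled_periods_total
container_cpu_cfs_periods_total
# Throttle ratio = throttled_periods / total_periods, for example:
#   rate(container_cpu_cfs_throttled_periods_total[5m])
#     / rate(container_cpu_cfs_periods_total[5m])
# A sustained ratio above 25% is concerning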
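
The same ratio can be computed directly from the contents of cpu.stat. The sketch below assumes the cgroup v1 field names mentioned above; the sample numbers used in the usage note are invented for illustration, and the function names are hypothetical.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse 'key value' lines from a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no quota configured or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods
```

Feeding it "nr_periods 1200", "nr_throttled 360", and "throttled_time 27000000000" on separate lines yields 0.3, above the 25% threshold mentioned above. Because the counters are cumulative since container start, sampling twice and comparing the deltas gives a more current picture.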

Memory OOM Behavior

Memory limits are enforced via the kernel's memory.max (cgroup v2) or memory.limit_in_bytes (cgroup v1). When a container exceeds its limit:

  Memory Limit Exceeded → OOM Kill Sequence:

  1. Kernel's OOM killer selects a process inside the cgroup to kill.
  2. Usually the process with the highest oom_score (memory usage
     adjusted by oom_score_adj).
  3. The selected process is killed with SIGKILL; if it is the container's
     main process, the container exits with code 137.
  4. Container runtime detects process death.
  5. kubelet PLEG detects ContainerDied event.
  6. kubelet restarts container (if restartPolicy: Always/OnFailure).
  7. Container status shows terminated reason: OOMKilled.

  kubectl get pod <pod> \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
  # Output: OOMKilled

  Limits vs Requests for memory:

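  memory request:
    Used for scheduling and for eviction ordering under node memory pressure.
    By default no cgroup setting enforces it. With the MemoryQoS feature gate
    (alpha, cgroup v2 only) kubelet sets memory.min to the request, and the
    kernel then protects that much memory from reclaim.
    Container can use more if available.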

  memory limit (memory.max):
    Hard ceiling. Process killed if exceeded.

Resource Enforcement Diagram

  Pod Scheduling and Runtime Enforcement

  +-----------+     requests       +--------------+
  | Scheduler |  <--------------   | Pod Spec     |
  |           |                    | requests:    |
  | Node A:   |                    |   cpu: 500m  |
  |  sum(req) |                    |   mem: 256Mi |
  |  3.5 CPU  |  fits? YES         | limits:      |
  |  7Gi mem  |  -------------->   |   cpu: 1000m |
  +-----------+  schedule to       |   mem: 512Mi |
                 Node A            +--------------+

  Runtime (on Node A):

  cgroup /kubepods/burstable/podUID/containerID

  cpu subsystem (cgroup v1 names; cgroup v2 uses cpu.weight and cpu.max):
    cpu.shares = 512           (proportional to 500m request)
    cpu.cfs_quota_us = 100000  (100ms = 1 CPU limit per period)
    cpu.cfs_period_us = 100000 (100ms period)

  memory subsystem (cgroup v2 names; cgroup v1 uses memory.limit_in_bytes):
    memory.min = 268435456     (256Mi request; set only with the MemoryQoS feature gate)
    memory.max = 536870912     (512Mi limit, hard cap, OOM if exceeded)

  Scheduler sees: 3.5 + 0.5 = 4.0 CPU used (at capacity)
  Runtime allows: container can burst to 1.0 CPU (CFS quota)
                  container OOMed if exceeds 512Mi
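
The cgroup values in the diagram follow from the pod spec by simple arithmetic. The sketch below reproduces them for cgroup v1; the floors of 2 shares and 1000 us quota reflect my understanding of kubelet's minimums and should be treated as assumptions, and the function names are hypothetical.

```python
def cgroup_v1_cpu(request_m: int, limit_m: int,
                  period_us: int = 100_000) -> dict[str, int]:
    """Map CPU request and limit (millicores) to cgroup v1 CPU settings."""
    return {
        # 1 CPU corresponds to 1024 shares; assumed floor of 2.
        "cpu.shares": max(2, request_m * 1024 // 1000),
        # Quota is the limit's share of each period; assumed floor of 1000 us.
        "cpu.cfs_quota_us": max(1000, limit_m * period_us // 1000),
        "cpu.cfs_period_us": period_us,
    }


def mib(n: int) -> int:
    """Mebibytes to bytes."""
    return n * 1024 * 1024
```

Calling cgroup_v1_cpu(500, 1000) returns 512 shares and a 100000 us quota, and mib(256) and mib(512) give the two memory values shown in the diagram.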

LimitRange

LimitRange sets default, minimum, and maximum resource values per object (Container, Pod, PersistentVolumeClaim) within a namespace:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: development
spec:
  limits:
  - type: Container
    default:           # applied if container has no limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:    # applied if container has no requests
      cpu: 100m
      memory: 128Mi
    max:               # cannot exceed these
      cpu: 2
      memory: 4Gi
    min:               # must specify at least these
      cpu: 50m
      memory: 64Mi
  - type: Pod
    max:
      cpu: 4
      memory: 8Gi
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi

LimitRange is enforced by the LimitRanger admission controller at Pod creation time. It does not retroactively affect running pods.
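
The defaulting and validation that the Container entry above describes can be sketched as follows. This is a simplified model, not the LimitRanger implementation: CPU is in millicores, memory in Mi, only the Container type is handled, and the names are hypothetical.

```python
# Values copied from the Container entry of the LimitRange above.
LIMIT_RANGE = {
    "default": {"cpu": 500, "memory": 512},
    "defaultRequest": {"cpu": 100, "memory": 128},
    "max": {"cpu": 2000, "memory": 4096},
    "min": {"cpu": 50, "memory": 64},
}


def admit(container: dict) -> dict:
    """Fill in missing requests/limits, then validate against min and max."""
    requests = {**LIMIT_RANGE["defaultRequest"], **container.get("requests", {})}
    limits = {**LIMIT_RANGE["default"], **container.get("limits", {})}
    for res in ("cpu", "memory"):
        if requests[res] < LIMIT_RANGE["min"][res]:
            raise ValueError(f"{res} request below LimitRange min")
        if limits[res] > LIMIT_RANGE["max"][res]:
            raise ValueError(f"{res} limit above LimitRange max")
        if requests[res] > limits[res]:
            raise ValueError(f"{res} request exceeds limit")
    return {"requests": requests, "limits": limits}
```

A container with no resources comes back with 100m/128Mi requests and 500m/512Mi limits. One consequence worth noticing in this model: a container that sets only a CPU limit of 80m is rejected, because the injected default request of 100m exceeds it.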

ResourceQuota

ResourceQuota enforces aggregate limits across all resources in a namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "50"
    requests.storage: 500Gi
    count/deployments.apps: "20"
    count/jobs.batch: "50"

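When a ResourceQuota constrains a compute resource (for example requests.cpu or limits.memory), every new Pod in the namespace must set that request or limit on each container, or a LimitRange must supply a default. Pods that omit it are rejected at admission.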

kubectl describe resourcequota -n production
# Shows: Used vs Hard for each resource
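
The admission check behind a quota can be sketched as a comparison of used plus requested against the hard cap. The model below covers only CPU entries in millicores; the "used" figures in the usage note are invented, and the function name is hypothetical.

```python
def quota_violations(hard: dict[str, int], used: dict[str, int],
                     new_pod: dict[str, int]) -> list[str]:
    """Return the reasons a new pod would be rejected, or an empty list."""
    violations = []
    for resource, cap in hard.items():
        if resource not in new_pod:
            # A quota on a resource makes that field mandatory.
            violations.append(f"{resource}: must be specified")
            continue
        total = used.get(resource, 0) + new_pod[resource]
        if total > cap:
            violations.append(f"{resource}: {total} exceeds hard limit {cap}")
    return violations
```

With hard = {"requests.cpu": 20000, "limits.cpu": 40000} and used = {"requests.cpu": 19500, "limits.cpu": 30000}, a pod requesting 1000m with a 2000m limit is rejected on requests.cpu alone, and a pod that omits its limit is rejected for the missing field.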

Vertical Pod Autoscaler (VPA)

VPA analyzes historical CPU and memory usage and recommends (or automatically applies) right-sized requests and limits:

  VPA Components:

  Recommender: watches pod metrics, computes recommendations
    → stores in VPA object: status.recommendation

  Admission Controller (mutating webhook): injects recommended requests into new pods
    → mutates pod spec at admission time

  Updater: evicts pods with out-of-date recommendations
    → forces recreation with new resource spec

  VPA modes:
  - "Off": recommendations only (read-only, safe to try first)
  - "Initial": apply recommendation at pod creation only
  - "Recreate": evict and recreate pods as recommendations change
            (causes pod restarts, disruptive)
  - "Auto": currently behaves like Recreate; intended to switch to
            in-place updates once those are available

  Warning: do not run VPA alongside an HPA that scales the same workload on
  CPU or memory; both controllers react to the same signal and fight.
  If both are needed, have HPA scale on custom or external metrics.

Horizontal Pod Autoscaler (HPA)

HPA scales the number of Pod replicas based on metrics:

  HPA Control Loop (runs every 15s):

  1. Query metrics API:
     GET /apis/metrics.k8s.io/v1beta1/namespaces/<ns>/pods
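     (resource metrics from metrics-server; custom and external metrics come
     from custom.metrics.k8s.io and external.metrics.k8s.io, served by an
     adapter such as the Prometheus adapter)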

  2. Calculate desired replicas:
     desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))

     Example:
       currentReplicas = 3
       currentCPU = 80% of request
       targetCPU = 50% of request
       desiredReplicas = ceil(3 × (80/50)) = ceil(4.8) = 5

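  3. Apply scale-up/scale-down constraints:
     - Scale up: immediate by default
       (behavior.scaleUp.stabilizationWindowSeconds defaults to 0)
     - Scale down: stabilization window 300s by default (prevent thrashing);
       the cluster-wide default is set by
       --horizontal-pod-autoscaler-downscale-stabilization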

  4. Update Deployment/StatefulSet replicas field.
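
Step 2 of the loop above can be written as a small function. This sketch adds the controller's default 10% tolerance, inside which no scaling happens; it leaves out min/max replica bounds, unready pods, and stabilization. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA replica formula with a dead band around the target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, leave as is
    return math.ceil(current_replicas * ratio)
```

The worked example above is desired_replicas(3, 80, 50), which returns 5. A reading of 52% against a 50% target stays at 3 replicas because the ratio of 1.04 falls inside the tolerance.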

  HPA v2 supports multiple metrics and behavior tuning:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10       # scale down max 10% of pods per period
        periodSeconds: 60
    scaleUp:
      policies:
      - type: Pods
        value: 4        # add at most 4 pods per period
        periodSeconds: 60
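
The effect of the two policies above on a single period can be sketched as a clamp. This is an approximation under stated assumptions: one policy per direction, the replica count at the start of the period equals the current count, the stabilization window is ignored, and the percent policy rounds the number of removable pods down. The real controller's rounding may differ. The function name is hypothetical.

```python
def clamp_to_policies(current: int, desired: int,
                      max_down_percent: int = 10, max_up_pods: int = 4) -> int:
    """Limit a desired replica count to what the policies allow in one period."""
    # Percent policy: at most max_down_percent of the current pods may go.
    floor = current - (current * max_down_percent) // 100
    # Pods policy: at most max_up_pods may be added.
    ceiling = current + max_up_pods
    return max(floor, min(desired, ceiling))
```

From 50 replicas with a desired count of 20, one period moves only to 45; from 10 replicas with a desired count of 30, one period moves to 14. Under the rounding assumed here, a 10% policy cannot remove any pod from a 5-replica workload, which is one reason a Pods policy is often combined with a Percent policy.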

KEDA: Event-Driven Autoscaling

KEDA extends HPA to scale on external event sources:

  KEDA Architecture:

  ScaledObject (CRD) → defines: deployment, scalers, min/max replicas

  Scalers (built-in, 50+):
    - Kafka: consumer group lag
    - AWS SQS: queue depth
    - RabbitMQ: queue length
    - Prometheus: any PromQL query
    - PostgreSQL: row count in query
    - Azure Service Bus: message count
    - Cron: schedule-based scaling

  Scale to zero: KEDA can scale deployments to 0 replicas
    when there are no events (cost savings for batch workloads).
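    HPA's minReplicas is 1 unless the alpha HPAScaleToZero feature gate is
    enabled; KEDA handles the step between 0 and 1 itself and leaves
    scaling between 1 and N to the HPA it creates.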

  Example:

  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: kafka-consumer-scaler
  spec:
    scaleTargetRef:
      name: order-processor
    minReplicaCount: 0     # scale to zero when no messages
    maxReplicaCount: 50
    triggers:
    - type: kafka
      metadata:
        topic: orders
        bootstrapServers: kafka:9092
        consumerGroup: order-processor-group
        lagThreshold: "100"    # 1 replica per 100 messages of lag
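
The replica count implied by the trigger above can be estimated with a short calculation. This is a sketch of the arithmetic only, not KEDA's code: the partition cap models the Kafka scaler's default of not running more consumers than partitions, and the parameter names are hypothetical.

```python
import math
from typing import Optional


def replicas_for_lag(total_lag: int, lag_threshold: int = 100,
                     min_replicas: int = 0, max_replicas: int = 50,
                     partitions: Optional[int] = None) -> int:
    """Estimate replicas for a consumer group with the given total lag."""
    if total_lag <= 0:
        return min_replicas  # nothing queued, fall back to the minimum
    wanted = math.ceil(total_lag / lag_threshold)
    if partitions is not None:
        # Extra consumers beyond the partition count would sit idle.
        wanted = min(wanted, partitions)
    return max(min_replicas, min(wanted, max_replicas))
```

With the values from the ScaledObject above, a lag of 250 messages asks for 3 replicas, a lag of 10,000 is capped at 50 by maxReplicaCount, and the same lag on a 12-partition topic is capped at 12.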

Cluster Autoscaler

While HPA/KEDA scale pods, the Cluster Autoscaler (CA) scales nodes:

  CA Decision Logic:

  Scale UP:
    - Any pod in Pending state due to "Insufficient CPU/memory"
    - CA simulates scheduling on existing nodes → fails
    - CA selects node group to expand, calls cloud provider API
    - New node joins cluster, pending pods scheduled
    - Delay: ~3-5 minutes (cloud instance boot + kubelet ready)

  Scale DOWN:
    - Node utilization < 50% for 10 minutes (default); utilization here
      means sum(requests) / allocatable, not measured usage
    - All pods on node can be moved to other nodes
    - No PodDisruptionBudget violations
    - No pods using local storage (by default) and no pods annotated
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    - CA drains node → pods rescheduled → node deleted

  Common CA issues:
  - Pods with PodAntiAffinity that require spread can prevent scale-down
  - PodDisruptionBudgets with minAvailable: 100% prevent any drain
  - DaemonSets excluded from scale-down calculations (they run everywhere)
  - Custom node labels/taints must be preserved in node group templates
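
The first scale-down condition can be sketched as below. It assumes utilization is the larger of the CPU and memory request ratios, which is my reading of the Cluster Autoscaler's behavior, and it ignores the remaining conditions (movable pods, PodDisruptionBudgets, annotations). The function names are hypothetical.

```python
def node_utilization(cpu_requested_m: int, cpu_allocatable_m: int,
                     mem_requested: int, mem_allocatable: int) -> float:
    """Request-based utilization: the larger of the CPU and memory ratios."""
    return max(cpu_requested_m / cpu_allocatable_m,
               mem_requested / mem_allocatable)


def scale_down_candidate(utilization: float, unneeded_minutes: float,
                         threshold: float = 0.5,
                         unneeded_time: float = 10) -> bool:
    """True when the node has been under the threshold for long enough."""
    return utilization < threshold and unneeded_minutes >= unneeded_time
```

A node with 1000m of 4000m CPU requested but 6Gi of 8Gi memory requested has a utilization of 0.75 and is not a candidate, even though its CPU is mostly idle.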

Debugging Notes

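# Find pods with at least one container that sets no resource requests (bad practice)
kubectl get pods -A -o json | jq '.items[] |
  select(any(.spec.containers[]; .resources.requests == null)) |
  {name: .metadata.name, ns: .metadata.namespace}'

# Check CPU throttling for a pod (the image must include cat)
kubectl exec <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat   # cgroup v1
kubectl exec <pod> -- cat /sys/fs/cgroup/cpu.stat       # cgroup v2
# Look for nr_throttled and throttled_time (ns, v1) or throttled_usec (us, v2)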

# Check OOM kills on a node
kubectl get events -A --field-selector reason=OOMKilling

# Examine VPA recommendations (without Auto mode)
kubectl describe vpa <vpa-name> -n <ns>
# Shows: Lower Bound / Target / Upper Bound per container

# Check HPA current state
kubectl describe hpa <hpa-name> -n <ns>
# Shows: current metrics, desired vs current replicas, recent events

# Check ResourceQuota usage
kubectl describe resourcequota -n <ns>

# Check KEDA ScaledObject state
kubectl describe scaledobject <name> -n <ns>
kubectl get hpa -n <ns>  # KEDA creates an HPA under the hood
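
Where jq is not available, the first check above can be done with a few lines of Python over the same JSON. This is a sketch that assumes the input is the output of kubectl get pods -A -o json; the function name is hypothetical.

```python
def pods_missing_requests(pod_list: dict) -> list[tuple[str, str]]:
    """Return (namespace, name) for pods with a container lacking requests."""
    missing = []
    for pod in pod_list.get("items", []):
        containers = pod["spec"]["containers"]
        # Treat an absent or empty requests map as "no requests".
        if any(not (c.get("resources") or {}).get("requests") for c in containers):
            missing.append((pod["metadata"]["namespace"], pod["metadata"]["name"]))
    return missing
```

To use it against a live cluster, pipe the kubectl output into a script that calls pods_missing_requests(json.load(sys.stdin)) and prints the result.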

Security Implications

  • Setting no resource limits on containers is a security risk: a container can consume all node resources, causing eviction of other workloads (denial of service within the cluster).
  • Memory limits prevent container memory leaks from taking down the node.
  • ephemeral-storage limits prevent log spam from filling node disk.
  • ResourceQuota at the namespace level caps a tenant's aggregate consumption, which limits blast radius; it is not an isolation boundary by itself.
  • CPU limits, while causing throttling, provide isolation guarantees that prevent one team's CPU-heavy workload from starving others.

Performance Implications

  • The CPU throttling problem means that strict CPU limits can severely hurt p99/p999 latency for latency-sensitive services. Many production teams (Netflix, Shopify, Zalando) have moved to running without CPU limits, relying on HPA for horizontal scaling and ResourceQuota for blast radius control.
  • Memory requests set memory.min in cgroup v2 only when the MemoryQoS feature gate is enabled; memory.min then instructs the kernel not to reclaim that memory under pressure, which protects the working set of latency-sensitive workloads.
  • Huge pages (2Mi or 1Gi) can dramatically improve TLB hit rates for memory-intensive applications (databases, ML inference). Request via hugepages-2Mi: "4Gi" in resources.

Failure Modes

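  +-----------------------------------+--------------------------------------+--------------------------------------+
  | Symptom                           | Cause                                | Fix                                  |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | Pod Pending "Insufficient CPU"    | No node with enough allocatable CPU  | Add nodes or reduce requests         |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | Pod OOMKilled repeatedly          | Memory limit too low                 | Increase limit or fix memory leak    |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | High p99 latency at low CPU %     | CPU throttling from CFS quota        | Remove CPU limits or increase period |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | HPA not scaling up                | Metrics not reaching API server      | Check metrics-server/adapter         |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | KEDA at 0 replicas, events queued | Scale-from-zero delay                | Set minReplicaCount: 1 for           |
  |                                   |                                      | latency-sensitive                    |
  +-----------------------------------+--------------------------------------+--------------------------------------+
  | VPA evicting pods frequently      | VPA in Auto mode, unstable           | Switch to VPA Off mode initially     |
  |                                   | recommendations                      |                                      |
  +-----------------------------------+--------------------------------------+--------------------------------------+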

Modern Usage

  • Karpenter (AWS, CNCF): Next-generation cluster autoscaler that provisions nodes based on actual pod requirements rather than node group templates. Faster scale-up and better bin-packing.
  • cgroup v2: Removes many CFS quirks; cpu.weight replaces cpu.shares with more predictable behavior. Memory pressure handling improved with memory.high (soft limit, triggers reclaim before OOM).
  • In-place Pod vertical scaling (1.27 alpha): Change CPU/memory requests/limits without pod restart — finally decoupling VPA from disruption.

Future Directions

  • Proportional autoscaling: HPA that accounts for startup time of new replicas to avoid over-provisioning during spike.
  • Resource Fit Admission: Gate pod admission based on resource availability forecasts, not just current state.
  • Power-aware scheduling: Schedule pods to minimize energy consumption, integrating with Intel Running Average Power Limit (RAPL) data.
  • eBPF-based resource accounting: More accurate CPU accounting that attributes network and syscall time correctly to workloads.

Exercises

  1. Create a pod with a 100m CPU limit. Run a tight CPU loop inside it (while true; do :; done). After 30 seconds, read cpu.stat from the cgroup directory on the node. Calculate the throttle percentage and correlate with any latency you observe.

  2. Configure a LimitRange with default requests/limits in a test namespace. Deploy a pod without specifying any resources. Verify via kubectl get pod -o yaml that the defaults were injected by the LimitRanger.

  3. Deploy the VPA in "Off" mode on a namespace. Run a workload for 10 minutes with varying load. Read the VPA recommendation and compare it to what you manually specified.

  4. Set up KEDA with the Prometheus scaler. Write a ScaledObject that scales a deployment based on a custom Prometheus metric (e.g., HTTP request rate). Simulate load and observe the scaling behavior.

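  5. Create a ResourceQuota in a namespace capping total CPU requests at 2 cores. Deploy pods until the quota is exhausted. Observe the rejection message. Then increase the quota and verify that the rejected pod can now be created (quota violations are rejected at admission, so the pod never reaches Pending; a Deployment's ReplicaSet retries the creation, while a bare pod must be re-applied).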

References

  • Linux CFS Bandwidth Control: kernel.org/doc/html/latest/scheduler/sched-bwc.html
  • "Stop Using CPU Limits" — Zalando Engineering Blog, 2019 (widely cited analysis of throttling)
  • Kubernetes Resource Management docs: kubernetes.io/docs/concepts/configuration/manage-resources-containers/
  • VPA design document: github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
  • KEDA documentation: keda.sh/docs/
  • Cluster Autoscaler FAQ: github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
  • cgroup v2 memory controller: kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
  • Karpenter project: karpenter.sh
  • "Kubernetes Resource Management Deep Dive" — KubeCon EU 2023 talk by Ian Lewis