Kubernetes Scheduler Internals: Filters, Scores, and Preemption

Overview

The Kubernetes scheduler is responsible for one decision: given an unscheduled pod, which node should run it? This sounds simple, but at scale it becomes a sophisticated constraint-satisfaction and optimization problem. The scheduler must enforce hard constraints (the pod won't fit unless this node has sufficient CPU, memory, and the right labels), rank remaining options by soft preferences (prefer nodes with the most free resources, prefer the same zone as other pod replicas), and handle priority-based preemption when no node can satisfy the constraints of a high-priority pod without evicting lower-priority ones.

The scheduler sustains on the order of 100 pods per second on a 5000-node cluster, which requires careful algorithmic design. The scheduling framework, a plugin architecture introduced in Kubernetes 1.15 and stabilized in 1.19, defines well-separated extension points that allow both the core scheduling logic and custom extensions to fit into a coherent pipeline.

Prerequisites

Kubernetes architecture fundamentals (API server, etcd, kubelet)
Understanding of Linux resource isolation (cgroups: CPU and memory limits/requests)
Familiarity with Kubernetes resource model (requests, limits, QoS classes)
Basic graph theory (nodes, constraints, optimization)
Understanding of Pod affinity/anti-affinity semantics

Historical Context

Kubernetes's original scheduler (kube-scheduler v1, 2014) was a relatively simple algorithm: filter nodes by resources and affinity, then pick the least-loaded one. As Kubernetes adoption grew, users needed more sophisticated scheduling: co-location constraints for latency-sensitive microservices, topology spread for resilience, GPU/accelerator allocation, gang scheduling for ML training jobs.

The original scheduler extension mechanism (Extenders — webhook calls in the middle of the scheduling loop) was too slow and operationally complex. The Scheduling Framework (Kubernetes Enhancement Proposal KEP-624) redesigned the scheduler as a set of plugin interfaces, allowing arbitrary scheduling logic to be compiled into the scheduler binary without the latency and operational overhead of external webhooks.

Scheduler Workflow Overview

kube-apiserver/etcd
       │
       │ WATCH: Pods with spec.nodeName == ""
       ▼
┌─────────────────────────────────────────────────────┐
│              Scheduling Queue                        │
│  ActiveQ: PriorityQueue (highest-priority pod first) │
│  BackoffQ: Pods that failed scheduling (backoff)     │
│  UnschedulableQ: Pods that can't be scheduled yet    │
└──────────────────────┬──────────────────────────────┘
                       │ Pop highest-priority pod
                       ▼
┌─────────────────────────────────────────────────────┐
│            Scheduling Cycle                          │
│                                                      │
│  1. PreFilter    (precompute data, validate pod)     │
│  2. Filter       (eliminate unfit nodes)             │
│     → ResourceFit, NodeAffinity, Taints, etc.        │
│  3. PostFilter   (preemption if Filter found 0 nodes)│
│  4. PreScore     (precompute data for scoring)       │
│  5. Score        (rank remaining nodes 0-100)        │
│     → LeastAllocated, NodeAffinity weight, etc.      │
│  6. NormalizeScore (normalize each plugin's scores)  │
│  7. Reserve      (mark resources as reserved)        │
│  8. Permit       (allow/wait/deny binding)           │
│                                                      │
└──────────────────────┬──────────────────────────────┘
                       │ Async: goroutine per pod
                       ▼
┌─────────────────────────────────────────────────────┐
│            Binding Cycle                             │
│  9. PreBind    (setup before binding: volume binding)│
│  10. Bind      (write spec.nodeName to API server)   │
│  11. PostBind  (cleanup, metrics, events)            │
└─────────────────────────────────────────────────────┘

Scheduling and binding cycles are separated because binding (writing to the API server) is slow (network I/O). The scheduler runs the scheduling cycle synchronously for one pod at a time (to avoid race conditions on the shared node state snapshot), but multiple binding cycles run concurrently.
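
The ActiveQ ordering shown in the diagram can be sketched with a binary heap. This is a minimal illustration, not the scheduler's implementation: the class name ActiveQueue and its methods are hypothetical, and it assumes the default ordering of priority first, then time of enqueue.

```python
import heapq
import itertools


class ActiveQueue:
    """Pops the highest-priority pod first; ties go to the pod queued earliest."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing enqueue order

    def push(self, pod_name, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), pod_name))

    def pop(self):
        _, _, pod_name = heapq.heappop(self._heap)
        return pod_name

    def __len__(self):
        return len(self._heap)


queue = ActiveQueue()
queue.push("batch-job", 100)
queue.push("web-1", 1000)
queue.push("web-2", 1000)
order = [queue.pop() for _ in range(len(queue))]
```

Because only one pod is popped per scheduling cycle, this ordering is what lets a high-priority pod jump ahead of a long backlog of batch work.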

Scheduling Framework Plugin Points

PreFilter

Purpose: validate the pod and precompute data that will be used by Filter plugins. If PreFilter returns an error, the pod is rejected immediately (skips all Filter/Score logic).

Examples:
- NodeResourcesFit PreFilter: calculates total resource requirements of the pod (including initContainers)
- PodTopologySpread PreFilter: builds a map of how existing pods are distributed across topology domains
- InterPodAffinity PreFilter: precomputes which nodes are eligible based on existing pod anti-affinity rules
- VolumeBinding PreFilter: checks if all PersistentVolumeClaims are bound or can be bound

Filter (Elimination Phase)

Filter plugins eliminate nodes that cannot run the pod. A node must pass ALL Filter plugins to remain a candidate.

NodeResourcesFit:

For each node:
  available_cpu = node.allocatable.cpu - sum(pod.requests.cpu for pods assigned to the node)
  available_mem = node.allocatable.memory - sum(pod.requests.memory for pods assigned to the node)

  If pod.requests.cpu > available_cpu → filter out
  If pod.requests.memory > available_mem → filter out

Note: "allocatable" is less than "capacity":
  allocatable = capacity - kube-reserved - system-reserved - eviction-threshold
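
The resource check above can be written as a small, runnable sketch. The Pod and Node types here are hypothetical stand-ins (not the Kubernetes API types), and only CPU and memory are considered.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu_millis: int  # requested CPU in millicores
    mem_bytes: int   # requested memory in bytes


@dataclass
class Node:
    name: str
    alloc_cpu_millis: int  # allocatable CPU, not raw capacity
    alloc_mem_bytes: int   # allocatable memory, not raw capacity
    pods: list = field(default_factory=list)  # pods already assigned to the node


def fits(node, pod):
    """True if the pod's requests fit in what the node has left."""
    used_cpu = sum(p.cpu_millis for p in node.pods)
    used_mem = sum(p.mem_bytes for p in node.pods)
    return (pod.cpu_millis <= node.alloc_cpu_millis - used_cpu
            and pod.mem_bytes <= node.alloc_mem_bytes - used_mem)


def filter_nodes(nodes, pod):
    """Keep only the nodes that pass the resource-fit check."""
    return [n for n in nodes if fits(n, pod)]


GiB = 1024 ** 3
busy = Node("busy", 4000, 8 * GiB, [Pod("existing", 3500, 1 * GiB)])
idle = Node("idle", 4000, 8 * GiB)
candidates = [n.name for n in filter_nodes([busy, idle], Pod("new", 1000, 2 * GiB))]
```

Note that the check uses requests, not observed usage: a node whose pods are idle but have large requests still counts as full.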

NodeSelector / NodeAffinity:

NodeSelector (simple):
  pod.spec.nodeSelector: {disk: ssd}
  → Filter: only nodes with label disk=ssd pass

NodeAffinity (complex):
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values: [us-east-1a, us-east-1b]
  → Filter: only nodes in us-east-1a or us-east-1b pass
  (IgnoredDuringExecution: pods already running are NOT evicted if node label changes)

TaintToleration:

Taint on node: node.kubernetes.io/disk-pressure:NoSchedule
Pod without toleration → filtered out

Pod with toleration:
  tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
→ passes filter

Taints with effect NoExecute cause existing pods to be evicted from the node (not just prevent new scheduling). Used by the Node controller to evict pods from NotReady nodes.
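
A minimal sketch of the matching rule follows. The Taint and Toleration classes are hypothetical simplifications of the API objects, and the sketch assumes that only NoSchedule and NoExecute taints act as hard filters while PreferNoSchedule is left to scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator Exists matches every key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol, taint):
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    # Equal needs an explicit key and an identical value.
    return bool(tol.key) and tol.value == taint.value


def passes_taint_filter(taints, tolerations):
    """A node passes if every hard taint is tolerated by some toleration."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


pressure = Taint("node.kubernetes.io/disk-pressure", "NoSchedule")
tol = Toleration(key="node.kubernetes.io/disk-pressure",
                 operator="Exists", effect="NoSchedule")
```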

InterPodAffinity (pod affinity and anti-affinity):

Pod A running on node1 with label app=frontend

New Pod B has podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    labelSelector: {app: frontend}
    topologyKey: kubernetes.io/hostname
→ Filter: node1 is eliminated (Pod A running there matches the selector)
→ Pod B cannot be co-located with any frontend pod on the same node

Topology key topology.kubernetes.io/zone spreads pods across zones; kubernetes.io/hostname spreads across nodes.
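
The effect of the topology key can be sketched as follows. The function name and data shapes are hypothetical, label selectors are reduced to exact key/value matches, and the sketch assumes a node without the topology label cannot conflict.

```python
def violates_anti_affinity(candidate, nodes, running, selector, topology_key):
    """Check a required anti-affinity term against one candidate node.

    nodes:   {node_name: {label: value}}
    running: list of (node_name, pod_labels) for pods already placed
    """
    domain = nodes[candidate].get(topology_key)
    if domain is None:
        return False  # assumption: no topology label means no domain to conflict in
    for node_name, pod_labels in running:
        same_domain = nodes[node_name].get(topology_key) == domain
        matches = all(pod_labels.get(k) == v for k, v in selector.items())
        if same_domain and matches:
            return True
    return False


HOST = "kubernetes.io/hostname"
ZONE = "topology.kubernetes.io/zone"
nodes = {
    "node1": {HOST: "node1", ZONE: "a"},
    "node2": {HOST: "node2", ZONE: "a"},
    "node3": {HOST: "node3", ZONE: "b"},
}
running = [("node1", {"app": "frontend"})]
selector = {"app": "frontend"}
```

With the hostname key only node1 is excluded; with the zone key, node2 is excluded as well because it shares zone "a" with the existing frontend pod.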

VolumeBinding:
- Filters out nodes that don't have the required StorageClass available in their zone
- For volumes with local affinity (local StorageClass), only the node where the volume exists passes

NodeUnschedulable:
- Filters out nodes with spec.unschedulable = true (cordoned nodes)
- Allows pods with a toleration for node.kubernetes.io/unschedulable:NoSchedule (DaemonSet pods, critical system pods)

PostFilter (Preemption)

If Filter returns zero candidates, PostFilter is called. The default implementation is the preemption plugin: it searches for victim pods to evict so that the pending pod can be scheduled.

Preemption algorithm:
  For each node that was filtered out:
    Find the set of pods that, if evicted, would allow the pending pod to fit
    This set must:
      1. Include only pods with lower priority than the pending pod
      2. After eviction, the node must pass all Filter plugins for the pending pod
      3. Minimize the number/priority of evicted pods

  Select the node with the "best" preemption (fewest/lowest-priority victims)
  Set status.nominatedNodeName on the pending pod
  Delete the victim pods (PodDisruptionBudgets are honored on a best-effort basis)
  Pending pod waits for victims to terminate, then is scheduled on the vacated node

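Preemption honors PodDisruptionBudgets only on a best-effort basis: the scheduler prefers victims whose eviction would not violate a PDB, but if no such victims exist, preemption still proceeds and lower-priority pods are removed despite the PDB.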
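
The victim search can be sketched as below. This is a simplified illustration under stated assumptions: CPU is the only resource, PodDisruptionBudgets are left out for brevity, and the function names and tuple layout are hypothetical. It tentatively removes every lower-priority pod, then reprieves pods from the highest priority down while the pending pod still fits.

```python
def select_victims(alloc_cpu, pods, pending_cpu, pending_prio):
    """pods: list of (name, priority, cpu). Returns victims, or None if hopeless."""
    keep = [p for p in pods if p[1] >= pending_prio]
    candidates = sorted((p for p in pods if p[1] < pending_prio),
                        key=lambda p: p[1], reverse=True)
    used = sum(p[2] for p in keep)
    if used + pending_cpu > alloc_cpu:
        return None  # even evicting every lower-priority pod is not enough
    victims = []
    for pod in candidates:  # reprieve the highest-priority candidates first
        if used + pod[2] + pending_cpu <= alloc_cpu:
            used += pod[2]
        else:
            victims.append(pod)
    return victims


def pick_node(nodes, pending_cpu, pending_prio):
    """nodes: {name: (alloc_cpu, pods)}. Prefer the lowest top victim priority,
    then the fewest victims."""
    best = None
    for name, (alloc_cpu, pods) in nodes.items():
        victims = select_victims(alloc_cpu, pods, pending_cpu, pending_prio)
        if victims is None:
            continue
        key = (max((v[1] for v in victims), default=float("-inf")), len(victims))
        if best is None or key < best[0]:
            best = (key, name, [v[0] for v in victims])
    return None if best is None else (best[1], best[2])


cluster = {
    "node-a": (4000, [("pod-y", 100, 2000), ("pod-z", 100, 2000)]),
    "node-b": (4000, [("pod-w", 500, 4000)]),
}
choice = pick_node(cluster, pending_cpu=4000, pending_prio=1000)
```

Here node-a wins even though it costs two victims, because its victims have lower priority than the single victim on node-b.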

Score (Ranking Phase)

Score plugins assign a score (0-100) to each remaining node. The scheduler aggregates scores from all plugins (with configurable weights) and selects the highest-scoring node.
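
The aggregation step can be sketched as follows. This is a simplified model, not the framework's code: the plugin names and weights are hypothetical, and it assumes every plugin normalizes by scaling its best node to 100, which real plugins are free to do differently.

```python
def normalize(raw):
    """Scale one plugin's raw scores so the best node gets 100."""
    top = max(raw.values(), default=0)
    if top == 0:
        return {node: 0 for node in raw}
    return {node: score * 100 // top for node, score in raw.items()}


def pick_node(plugin_scores, weights):
    """plugin_scores: {plugin: {node: raw_score}}; weights: {plugin: int}."""
    totals = {}
    for plugin, raw in plugin_scores.items():
        for node, score in normalize(raw).items():
            totals[node] = totals.get(node, 0) + weights[plugin] * score
    return max(totals, key=totals.get), totals


scores = {
    "resources": {"node-a": 50, "node-b": 100},
    "affinity": {"node-a": 80, "node-b": 0},
}
best, totals = pick_node(scores, {"resources": 1, "affinity": 2})
```

In this example node-a wins despite having less free capacity, because the affinity plugin carries twice the weight.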

LeastAllocated (default scoring strategy of NodeResourcesFit):

score = (cpu_unused/cpu_total + memory_unused/memory_total) / 2 * 100

Prefers nodes with the most remaining capacity.
Creates "headroom" for future pods and burst.

MostAllocated (alternative, for bin-packing):

score = (cpu_used/cpu_total + memory_used/memory_total) / 2 * 100

Prefers nodes that are already most utilized.
Packs pods tightly onto fewer nodes → can power down underutilized nodes.
Used in cost-optimization scenarios.
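
The two formulas are mirror images of each other, which a direct transcription makes visible. The function names are hypothetical, and CPU and memory are weighted equally as in the formulas above.

```python
def least_allocated(cpu_used, cpu_total, mem_used, mem_total):
    """Higher score for emptier nodes (spreading)."""
    cpu_free = (cpu_total - cpu_used) / cpu_total
    mem_free = (mem_total - mem_used) / mem_total
    return (cpu_free + mem_free) / 2 * 100


def most_allocated(cpu_used, cpu_total, mem_used, mem_total):
    """Higher score for fuller nodes (bin-packing)."""
    return (cpu_used / cpu_total + mem_used / mem_total) / 2 * 100


# A node with 1 of 4 CPUs and 2 of 8 GiB requested.
spread_score = least_allocated(1, 4, 2, 8)
pack_score = most_allocated(1, 4, 2, 8)
```

For any node the two scores sum to 100, so switching strategy simply reverses the ranking produced by this plugin.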

BalancedAllocation:

Penalizes nodes where CPU and memory utilization are severely imbalanced.
A node with 90% CPU utilization but 10% memory utilization gets a low score.
Prevents "CPU-rich but memory-starved" or vice versa, which wastes resources.
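
One way to quantify imbalance is sketched below. It assumes the score is 100 times one minus the standard deviation of the per-resource utilization fractions, which for two resources reduces to half their absolute difference; treat the exact formula as an assumption, since it has changed between Kubernetes releases.

```python
def balanced_allocation(cpu_fraction, mem_fraction):
    """Score in [0, 100]; 100 means CPU and memory are equally utilized."""
    # For exactly two values, the standard deviation is half the absolute gap.
    std = abs(cpu_fraction - mem_fraction) / 2
    return (1 - std) * 100


skewed = balanced_allocation(0.9, 0.1)
even = balanced_allocation(0.5, 0.5)
```

Under this formula the 90%/10% node scores well below the evenly loaded one, so the plugin acts as a penalty that nudges placement rather than a hard rule.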

NodeAffinity score (for preferred affinity):

preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 80
    preference:
      matchExpressions:
      - key: disk-type
        operator: In
        values: [ssd]
  - weight: 20
    preference: ...

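Nodes matching the SSD preference add 80 to their raw NodeAffinity score; raw scores are then normalized to the 0-100 range before the plugin weight is applied.
"Preferred", unlike "required", means pods still schedule on non-SSD nodes if needed.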

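The per-node sum of matching preference weights can be sketched as below. The function name is hypothetical, expressions are reduced to a single In match per term, and the second term (a gpu label) is invented for illustration because the original leaves it elided.

```python
def preferred_affinity_raw(node_labels, terms):
    """terms: list of (weight, label_key, allowed_values)."""
    return sum(weight for weight, key, values in terms
               if node_labels.get(key) in values)


terms = [
    (80, "disk-type", {"ssd"}),
    (20, "gpu", {"true"}),  # hypothetical second preference
]
raw = {
    "ssd-node": preferred_affinity_raw({"disk-type": "ssd"}, terms),
    "ssd-gpu-node": preferred_affinity_raw({"disk-type": "ssd", "gpu": "true"}, terms),
    "plain-node": preferred_affinity_raw({"disk-type": "hdd"}, terms),
}
```

A node matching no preference still receives a raw score of 0 and remains a candidate, which is the practical difference from a required term.
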
ImageLocality:

score = sum(image_sizes_present_on_node) / total_image_sizes * 100

Prefers nodes that already have the container image pulled.
Reduces cold start time (no image pull) and reduces registry bandwidth.
Weight is low — this is a tiebreaker, not a primary placement driver.

PodTopologySpread (score variant):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule  ← Filter behavior
    labelSelector: ...
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway  ← Score behavior

whenUnsatisfiable: ScheduleAnyway adds a score penalty for nodes
that would increase skew, without hard-filtering them.
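
The skew arithmetic behind both behaviors can be sketched as follows. The function names are hypothetical; skew is computed as the number of matching pods in the target domain after placement minus the current minimum across domains, and node eligibility details are ignored.

```python
def skew_if_placed(counts, domain):
    """counts: matching pods per topology domain, before placing the new pod."""
    return counts.get(domain, 0) + 1 - min(counts.values())


def allowed_domains(counts, max_skew):
    """DoNotSchedule: hard-filter domains that would exceed max_skew."""
    return sorted(d for d in counts if skew_if_placed(counts, d) <= max_skew)


def rank_domains(counts):
    """ScheduleAnyway: keep every domain, but prefer the lowest resulting skew."""
    return sorted(counts, key=lambda d: (skew_if_placed(counts, d), d))


zones = {"zone-a": 2, "zone-b": 1}
strict = allowed_domains(zones, max_skew=1)
relaxed = allowed_domains(zones, max_skew=2)
preference = rank_domains(zones)
```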

Reserve

Reserve plugins mark resources as reserved before the binding cycle begins. This prevents a race where the scheduler assigns the same resources to two pods concurrently (since scheduling cycles run one at a time but binding is async).

The VolumeBinding plugin uses Reserve to provisionally bind PersistentVolumeClaims to the selected node. If binding fails, the Unreserve extension point rolls back.

Permit

Permit plugins can hold a pod at the Permit phase before allowing it to proceed to binding. The pod is marked as "waiting" and held in memory until a plugin approves or rejects it, or until its wait timeout expires.

Primary use case: gang scheduling. A machine learning training job may require 8 pods (one per GPU) to start simultaneously. The gang scheduler holds all 8 pods at Permit until all 8 have been scheduled to a node. If after a timeout not all 8 are schedulable, the entire group is rejected.
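
The hold-then-release behavior can be sketched as a small state machine. GangPermit is a hypothetical class, not an API from any gang-scheduling plugin, and time is passed in explicitly so the example stays deterministic.

```python
class GangPermit:
    """Holds pods of a group until min_members of them have reached Permit."""

    def __init__(self, min_members, timeout_s):
        self.min_members = min_members
        self.timeout_s = timeout_s
        self.waiting = {}  # group -> {pod_name: deadline}

    def permit(self, group, pod, now):
        members = self.waiting.setdefault(group, {})
        members[pod] = now + self.timeout_s
        if len(members) >= self.min_members:
            released = sorted(members)
            del self.waiting[group]
            return "allow", released  # the whole gang proceeds to binding
        return "wait", []

    def expire(self, now):
        """Reject every group that has a member past its deadline."""
        rejected = []
        for group in list(self.waiting):
            members = self.waiting[group]
            if any(deadline <= now for deadline in members.values()):
                rejected.extend(sorted(members))
                del self.waiting[group]
        return rejected


gang = GangPermit(min_members=3, timeout_s=10)
first = gang.permit("train", "worker-0", now=0)
second = gang.permit("train", "worker-1", now=1)
third = gang.permit("train", "worker-2", now=2)
gang.permit("other", "solo", now=0)
```

Rejecting the whole group on timeout matters because held pods keep their reserved resources; without it a partial gang could starve other workloads indefinitely.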

Bind

The Bind plugin creates a Binding object through the pod's binding subresource, which causes the API server to set pod.spec.nodeName. This is the moment the scheduling decision becomes visible to the kubelet on the target node. The kubelet watches for pods with nodeName equal to its own node name and starts them.

Scheduler Performance

The scheduler maintains a snapshot of node and pod state rather than querying the API server for each scheduling decision. The snapshot is updated at the beginning of each scheduling cycle. This means the scheduler is working with a slightly stale view of the cluster, but avoids overwhelming the API server with list requests.

Scheduling throughput at 5000 nodes (illustrative back-of-envelope figures, not measurements):

  Naive approach (query API for each filter):
    5000 nodes × 1ms API call × 100 pods/sec = 500,000 ms of API time per second of wall clock = impossible

  Snapshot approach (serial scoring):
    Snapshot refresh: incremental, so a full ~100ms copy of 5000 nodes is not paid on every cycle
    Filter 5000 nodes in memory: ~10ms
    Score remaining 500 nodes × 100μs per node: ~50ms
    Total scheduling cycle: ~60ms → ~16 pods/sec

  With parallel scoring (8 goroutines):
    Score 500 nodes × 100μs per node / 8 goroutines: ~6ms
    Total: ~10ms filter + ~6ms score = ~16ms → ~60 pods/sec

Real benchmarks from Kubernetes 1.29 show approximately 120-150 pods/second at 5000 nodes with default configuration.

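Optimization knobs:
- percentageOfNodesToScore: the scheduler stops searching for feasible nodes once it has found enough candidates. When the field is unset, the percentage is adaptive: roughly 50% for a 100-node cluster, decreasing linearly to 10% at 5000 nodes, with a floor of 5%. At 5000 nodes this means scoring about 500 nodes rather than all 5000 while still finding a good placement. Raise the value when placement quality matters more than throughput.
- parallelism: number of goroutines used for filtering and scoring nodes, defaulting to 16.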

Preemption in Detail

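Priority classes (the two system classes are built in; the rest are example user-defined classes):
  system-node-critical    (2000001000) ← kube-proxy and other per-node agents
  system-cluster-critical (2000000000) ← cluster-wide add-ons such as CoreDNS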
  high (1000)                          ← production services
  medium (500)                         ← staging services
  low (100)                            ← batch jobs
  0 (default)                          ← unprioritized

Preemption scenario:
  Node A: 4 CPU allocatable
  Pod X (priority: high, 4 CPU request) → pending, no node fits

  Node A running pods:
    Pod Y (priority: low, 2 CPU) + Pod Z (priority: low, 2 CPU) = 4 CPU used

  Preemption: scheduler identifies Pod Y + Pod Z as eviction candidates
  → marks Pod X with nominatedNodeName: node-a
  → deletes Pod Y and Pod Z (graceful termination, 30s default)
  → after termination, Node A has 4 CPU free
  → Pod X is scheduled on Node A

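PodDisruptionBudget constraint (best effort):
  If Pod Y is covered by a PDB with minAvailable: 1 and only 1 replica exists,
  the scheduler first looks for alternate victims whose eviction violates no PDB.
  If none exist, Pod Y is preempted anyway; a PDB does not guarantee protection from preemption.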

Descheduler

The Kubernetes descheduler is a separate component (not part of kube-scheduler) that periodically evicts pods that, while currently running, are no longer optimally placed. This addresses "scheduling drift": over time, as nodes come and go, the cluster's pod distribution becomes suboptimal.

Descheduler strategies:
- RemoveDuplicates: evict pods so no two replicas of the same deployment run on the same node
- LowNodeUtilization: evict pods from overloaded nodes (migration to underutilized nodes)
- RemovePodsViolatingNodeAffinity: evict pods whose node affinity is no longer satisfied (node labels changed after scheduling)
- RemovePodsViolatingTopologySpreadConstraint: evict pods to restore topologySpreadConstraint compliance

Descheduled pods are re-created by their owning controllers (for example, a ReplicaSet) and re-scheduled, hopefully to better nodes.

Debugging Notes

Pod stuck in Pending: kubectl describe pod <name> shows scheduler events. Common reasons: Insufficient cpu/memory (increase node size or add nodes), node(s) didn't match Pod's node affinity/selector (check nodeSelector and node labels), pod has unbound immediate PersistentVolumeClaims (PVC not bound — check StorageClass and available PVs).
Scheduler not scheduling pods: check if the scheduler is running (kubectl get pods -n kube-system -l component=kube-scheduler). Check scheduler logs for errors.
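Unexpected placement: use kubectl describe pod to see which node was chosen. Use kubectl get events --field-selector reason=Scheduled to list recent successful assignments (pod and target node). These events do not explain why a node was preferred; for that, inspect the scheduler logs at a high verbosity level.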
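Preemption debugging: check whether the pending pod has status.nominatedNodeName set (kubectl get pod <name> -o wide shows a NOMINATED NODE column). Look for preemption events on the victims (kubectl get events | grep Preempted). If preemption seems stuck, check the victims' termination grace periods and whether PodDisruptionBudgets are steering the scheduler toward different victims.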
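Scheduling latency tracking: the scheduler_scheduling_attempt_duration_seconds Prometheus metric tracks the latency of each scheduling attempt, and scheduler_pod_scheduling_duration_seconds tracks end-to-end latency per pod across attempts. scheduler_framework_extension_point_duration_seconds gives the per-phase breakdown. Metric names have changed between releases, so confirm them against your scheduler's /metrics endpoint. Alert if p99 exceeds 1 second.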

Security Implications

Custom scheduler plugins execute inside the kube-scheduler process. A malicious or buggy plugin can cause the scheduler to crash or make incorrect scheduling decisions affecting cluster availability.
The scheduler reads pod specs from etcd (via API server watch). If an attacker can modify pod specs (requires elevated RBAC), they can inject pod affinity rules that control placement — potentially co-locating sensitive pods with attacker-controlled pods.
PriorityClass is a cluster-scoped, RBAC-controlled resource. Do not allow untrusted users to create PriorityClass objects, and restrict which classes each namespace may use (for example, with a ResourceQuota scoped by priority class); otherwise they could trigger preemption of other tenants' workloads.
Topology spread constraints that require zone distribution can be used defensively: ensure that no single AZ failure takes down all replicas of a critical service.

Performance Implications

Setting resource requests accurately is critical for scheduler correctness. Under-specifying requests causes pods to be placed on nodes without sufficient actual capacity, leading to OOM kills and CPU throttling.
Over-specifying requests causes fragmentation: nodes appear "full" to the scheduler but are actually underutilized. The cluster must scale out to accommodate pods that could have fit on existing nodes.
Large clusters with many small pods should consider fractional CPU requests (e.g., 100m = 0.1 CPU). The scheduler handles millicores natively.
Gang scheduling (Permit plugin) introduces head-of-line blocking: if a large gang cannot be scheduled, it can hold up the scheduling queue. Configure reasonable timeouts.

Failure Modes

Scheduler single-point-of-failure: only one scheduler instance is active at a time (leader election using a Lease object in etcd ensures this). If the leader crashes, a standby instance is elected within the lease duration (15 seconds default). During this window, no new pods are scheduled.
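Snapshot staleness: the scheduler operates on a snapshot that may be up to one scheduling cycle old. In rapid node failure scenarios, the scheduler may bind pods to a node that has just failed. A bound pod is not rescheduled: it stays Pending on that node until the node recovers or the NoExecute taints added for NotReady nodes cause it to be evicted, after which its controller creates a replacement that goes through scheduling again.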
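Preemption limits: PodDisruptionBudgets are honored only on a best-effort basis during preemption, so a PDB does not guarantee that a low-priority workload survives the arrival of a high-priority pod. Conversely, a high-priority pod can still stay Pending when every candidate victim has equal or higher priority, or while victims take a long time to terminate. Audit PDB configurations for interactions with priority classes.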
Scheduler thrashing: if no placement is possible for a pod, the scheduler moves it to the UnschedulableQ with exponential backoff. If cluster conditions change (node added, other pod deleted), the scheduler re-evaluates. If the backoff period is long, newly schedulable pods may wait. The scheduler_unschedulable_pods metric tracks this.
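
The backoff growth can be sketched as below. The defaults of 1 second initial and 10 seconds maximum are believed to match the scheduler's podInitialBackoffSeconds and podMaxBackoffSeconds settings, but treat them as assumptions and check your own configuration; the function name is hypothetical.

```python
def backoff_seconds(attempts, initial=1.0, maximum=10.0):
    """Delay before retry: doubles with each failed attempt, capped at maximum."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return min(initial * 2 ** (attempts - 1), maximum)


delays = [backoff_seconds(n) for n in range(1, 7)]
```

The cap bounds how long a pod that has just become schedulable can sit idle waiting for its next retry.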

Modern Usage

The Volcano project extends Kubernetes scheduling for batch and ML workloads, adding gang scheduling, queue management, and heterogeneous resource scheduling (GPUs, NPUs). Related batch scheduling APIs are under discussion in the Kubernetes Batch Working Group.

Karpenter (AWS, 2021) inverts the traditional autoscaler model: instead of scaling nodes based on current utilization, Karpenter watches for unschedulable pods and provisions nodes with the right shape to schedule them. It does not replace kube-scheduler: it reacts to pods the scheduler has marked unschedulable, and the scheduler then places them on the new nodes.

Future Directions

Dynamic Resource Allocation (DRA): replaces device plugin API with a richer API for heterogeneous hardware (GPUs, FPGAs, network interfaces) that integrates cleanly with the scheduling framework's Filter/Score phases.
Structured Parameters for DRA: enables the scheduler to reason about resource allocation during scheduling rather than deferring to device plugins at bind time.
Scheduler simulator: Kubernetes SIG-Scheduling is developing a scheduler simulator for testing scheduling policies against real cluster traces without requiring a live cluster.

Exercises

Create three nodes with different resource capacities and labels (using kubectl label node). Deploy a single-replica Deployment whose pod template has nodeAffinity requiring a specific label. Verify placement. Then taint the matching node with effect NoExecute and observe that the running pod is evicted and its replacement stays Pending.
Create a PriorityClass with priority value 1000 and a default PriorityClass with value 0. Fill a node with low-priority pods. Deploy a high-priority pod that requires more resources than are free. Observe preemption events using kubectl get events.
Deploy a StatefulSet with podAntiAffinity using topologyKey kubernetes.io/hostname ensuring all replicas are on different nodes. Scale to more replicas than nodes and observe the behavior.
Implement a custom scheduler plugin (Go) that uses the Score extension point to prefer nodes whose hostname starts with a specific prefix. Compile it into a custom scheduler binary and deploy it alongside the default scheduler.
Use kubectl describe pod --all-namespaces | grep -A5 "Events:" to audit recent scheduling events in a real cluster. Identify any pods with repeated scheduling failures and diagnose the root cause.

References

Kubernetes Scheduling Framework: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
KEP-624: Scheduling Framework (https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/624-scheduling-framework)
Kubernetes Scheduler Plugins repository: https://github.com/kubernetes-sigs/scheduler-plugins
"A Deep Dive into Kubernetes Scheduling" — Bobby Salamat, Jonathan Basseri (KubeCon NA 2019)
Volcano Scheduler: https://github.com/volcano-sh/volcano
Karpenter: https://karpenter.sh/docs/concepts/
"Borg, Omega, and Kubernetes" — Burns et al. (ACM Queue 2016) — scheduler design philosophy
Kubernetes Descheduler: https://github.com/kubernetes-sigs/descheduler