Karpenter

TL;DR

Karpenter is an open-source Kubernetes node autoscaler launched by AWS in 2021; its vendor-neutral core was donated in 2023 to the Kubernetes project (SIG Autoscaling), which is hosted by the CNCF.
Replaces Cluster Autoscaler with a faster, more flexible model: instead of scaling fixed node groups, Karpenter picks the best instance type for pending pods on the fly.
Originally AWS-only; now has providers for Azure (AKS Karpenter), Google Cloud, and Alibaba Cloud, plus a KWOK-based provider that simulates nodes for testing (it does not provision bare metal).
Provisioning latency typically 30-60 seconds vs Cluster Autoscaler's 2-4 minutes; consolidation continuously rightsizes the cluster as pods come and go.

Why Karpenter Replaced Cluster Autoscaler

Cluster Autoscaler (CA), the original Kubernetes node autoscaler, scales pre-defined node groups — Auto Scaling Groups on AWS, MIGs on GCP, VMSS on Azure. To support multiple instance types, you create a node group per type and CA picks which to scale up. This works but produces two practical problems: scale-up is slow because CA must call the cloud API to extend an ASG and wait for the new instance, and the cluster cannot adapt to workload shapes it was not pre-configured for.

Karpenter takes a different approach. There are no node groups. Pending pods are matched against a NodePool (formerly Provisioner) that lists acceptable instance categories, architectures, zones, and capacity types. Karpenter then calls the cloud provider's instance-launch API directly (EC2 CreateFleet on AWS) with the cheapest instance that fits, sometimes a c6i.large, sometimes a p5.48xlarge, and the pod schedules on it within roughly a minute.
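To make the matching step concrete, here is a minimal sketch of a workload whose scheduling constraints Karpenter would read when choosing an instance. The Deployment name, image, and sizes are hypothetical, and the example assumes a NodePool exists that permits arm64 and spot capacity.

```yaml
# Hypothetical workload: name, image, and sizes are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      # Well-known node labels narrow the instance types Karpenter may launch.
      nodeSelector:
        kubernetes.io/arch: arm64
        karpenter.sh/capacity-type: spot
      containers:
        - name: server
          image: registry.example.com/inference-api:1.0
          resources:
            # Requests, not limits, determine how large a node is needed.
            requests:
              cpu: "2"
              memory: 4Gi
```

If no existing node can hold these pods, they stay Pending and Karpenter sizes a new node from the summed requests and the intersection of the pod's selectors with the NodePool's requirements.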

Core Concepts

NodePool: declares constraints (instance families, architectures, zones, on-demand vs spot) and limits (max cpu/memory/gpu in this pool).
EC2NodeClass / AKSNodeClass: cloud-specific node configuration (AMI, security groups, subnets, IAM role, user-data).
NodeClaim: the object Karpenter creates for each node it requests; it tracks that node from launch through registration to termination.
Disruption: coordinated draining and replacement of nodes for consolidation, expiry, drift, and emptiness.
Consolidation: continuously repacks pods onto fewer or cheaper nodes when possible.
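As an illustration of the cloud-specific half of this split, the following is a minimal EC2NodeClass sketch for the AWS provider. The role name and the `karpenter.sh/discovery` tag value (`my-cluster`) are assumptions; substitute whatever your cluster actually uses.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default                          # referenced by NodePools via nodeClassRef
spec:
  role: KarpenterNodeRole-my-cluster     # placeholder IAM role for the nodes
  amiSelectorTerms:
    - alias: al2023@latest               # consider pinning a version in production
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder tag value
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # placeholder tag value
```

Keeping AMI, network, and IAM settings here lets several NodePools share one node configuration while differing only in scheduling constraints.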

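NodePool Example for GPU Workloads

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-h100
spec:
  template:
    spec:
      # nodeClassRef is required in karpenter.sh/v1; "default" is a
      # placeholder name for an EC2NodeClass in the same cluster.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: [amd64]
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [p5, p5e]   # H100 / H200
        - key: karpenter.sh/capacity-type
          operator: In
          values: [on-demand]
      # Keep non-GPU pods off these nodes; GPU pods must tolerate this taint.
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  # Cap the pool at 64 GPUs in total.
  limits:
    nvidia.com/gpu: 64
  # Only remove nodes that have been empty for 30 seconds.
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```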
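A pod only lands on this pool if it tolerates the `nvidia.com/gpu` taint and requests GPUs. The sketch below shows one way to write that; the pod name and image are hypothetical, and it assumes the NVIDIA device plugin is running so that `nvidia.com/gpu` is an allocatable resource.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: h100-smoke-test            # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu          # matches the NodePool taint
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: registry.example.com/cuda-test:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8        # all 8 GPUs of a p5.48xlarge
```

Requesting all eight GPUs means one pod per node, which also counts eight against the pool's `nvidia.com/gpu: 64` limit.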

Consolidation and Disruption

Karpenter does not just scale up; it continuously asks whether the cluster could be cheaper. If a node has been empty for the `consolidateAfter` duration, Karpenter drains and deletes it. If pods could be repacked onto fewer or cheaper nodes, Karpenter cordons the source nodes, evicts pods (respecting PDBs), and lets them reschedule.

For GPU workloads this is double-edged. Disruption is essential to control cost, since an idle H100 burns money, but mid-training pod evictions are catastrophic. Use the `karpenter.sh/do-not-disrupt` annotation on long-running training pods, set conservative PodDisruptionBudgets, and configure GPU NodePools with `consolidationPolicy: WhenEmpty` rather than `WhenEmptyOrUnderutilized` (called `WhenUnderutilized` before the v1 API).
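A conservative PodDisruptionBudget for training pods might look like the sketch below. The name and the `app: trainer` label are hypothetical; the selector has to match whatever labels your training pods carry.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trainer-pdb          # hypothetical name
spec:
  maxUnavailable: 0          # no voluntary evictions of matching pods
  selector:
    matchLabels:
      app: trainer           # placeholder label
```

With `maxUnavailable: 0`, any eviction-based drain of a node running these pods is blocked, including manual `kubectl drain`, so this setting suits pools where nodes are expected to empty only when jobs finish.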

Karpenter consolidation can evict training pods. Always set `karpenter.sh/do-not-disrupt: "true"` on long-running GPU jobs or use Volcano/Kueue gang scheduling that Karpenter respects.
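The annotation goes on the pod, so for a Job it belongs in the pod template's metadata rather than on the Job object itself. A minimal sketch, with a hypothetical job name and image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-finetune                       # hypothetical name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Asks Karpenter not to voluntarily disrupt the node while this pod runs.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 8
```

The annotation covers voluntary disruption such as consolidation; it cannot protect against involuntary events like a spot interruption or hardware failure, so checkpointing is still worth having.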

Spot, Capacity Reservations, and GPUs

Karpenter has first-class spot support: it picks a low-cost spot pool from those the NodePool allows, handles interruption notices through an SQS queue (set on AWS with the `settings.interruptionQueue` Helm value), and falls back to on-demand when spot is unavailable, provided the NodePool permits both capacity types. For GPU capacity that is hard to acquire, such as H100s and H200s on AWS in 2025-26, Karpenter integrates with EC2 Capacity Reservations and Capacity Blocks for ML, so reserved capacity is preferred when available.
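Fallback from spot to on-demand depends on the NodePool allowing both capacity types. The following sketch shows such a pool; the pool name, the instance categories, the CPU limit, and the `default` EC2NodeClass it references are all assumptions for illustration.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inference-mixed              # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # placeholder EC2NodeClass
      requirements:
        # Listing both types lets Karpenter prefer spot and fall back to on-demand.
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
        # A wide set of instance categories gives spot more pools to draw from.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [c, m, r]
  limits:
    cpu: "500"
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```

A pool restricted to `[spot]` alone would leave pods Pending when no spot capacity is available, which is why the broader list matters for workloads that must keep running.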

Karpenter on Non-AWS Clouds

Since the donation, Karpenter has a pluggable cloud-provider interface. The Azure AKS team maintains `karpenter-provider-azure` (GA on AKS since 2024). Google Cloud and Alibaba Cloud providers are in active development. For on-prem and bare-metal, Karpenter has been used with Cluster API providers (Karpenter on Equinix Metal, Karpenter on vSphere), but production maturity varies by provider.

When Karpenter Is Right (and Wrong)

Karpenter is the right choice for elastic, cost-sensitive clusters with heterogeneous workloads — typical of inference fleets, CI/CD, and data-science platforms. It is the wrong choice for clusters where node identity matters (Cassandra, Kafka, Elasticsearch with local SSD), for hard-pinned bare-metal fleets, and for environments where the cloud provider's API rate limits make per-pod provisioning impractical.

