NVIDIA GPU Operator

TL;DR

Helm-installed operator from NVIDIA (Apache 2.0, first GA in 2020) that bundles the driver container, NVIDIA Container Toolkit, k8s-device-plugin, DCGM Exporter, GPU Feature Discovery, MIG Manager, Node Feature Discovery and Sandbox Workloads into one reconciled stack.
Eliminates the historically painful step of hand-installing kernel modules and runtime hooks on every GPU node — a single Helm release makes a fresh Kubernetes node GPU-ready in 3-7 minutes.
Supports bare-metal kubeadm, EKS / GKE / AKS (driver mode opt-in), OpenShift, Rancher / RKE2, air-gapped clusters, signed driver containers for Secure Boot, vGPU on virtualised hypervisors, MIG partitioning on A100 / H100 / H200 / B200 and confidential-computing modes on Hopper and Blackwell.
Surfaces `nvidia.com/gpu`, `nvidia.com/mig-*g.*gb`, `nvidia.com/gpu.shared` (MPS / time-slicing) and vGPU resource names to the scheduler, plus >150 DCGM metrics on `:9400/metrics` for Prometheus.
Hard prerequisite for KServe, KubeRay, Kubeflow Training Operator, Volcano, Kueue, NVIDIA Dynamo, the Run:ai stack and every supported GPU path on the Yobitel sovereign tenancies (Yobibyte runs on top of the operator, not in place of it).

Overview

The NVIDIA GPU Operator is the canonical mechanism for turning a vanilla Kubernetes node into a GPU-schedulable node. Before the operator landed in 2020, exposing a single GPU to a pod required four hand-managed components on every host: a kernel driver matched to the exact Linux kernel, the NVIDIA Container Toolkit so the OCI runtime could inject device nodes and user-space libraries, the Kubernetes device plugin DaemonSet to advertise nvidia.com/gpu, and DCGM for telemetry. Each piece had its own upgrade cadence, its own packaging conventions and its own way of breaking after an unattended OS update. Fleets of more than a handful of nodes inevitably built bespoke configuration-management to keep the pieces in lockstep.

The operator replaces that mess with one Helm release. It runs every component as a containerised DaemonSet, watches node labels emitted by Node Feature Discovery (NFD) so it only touches GPU hosts, and orchestrates restart order when drivers change. A previously CPU-only node becomes a fully-schedulable GPU node within minutes of being labelled nvidia.com/gpu.present=true. The operator is the install path NVIDIA recommends, supports and validates against in the NGC compatibility matrix; running anything else in production is now firmly outside the supported envelope.

By mid-2026 the operator is at v25.x, tracks CUDA 12.6 / 13.0 drivers (R565 / R570 series) and Kubernetes 1.27-1.33, and supports the full Hopper / Hopper-X / Blackwell range plus the Ampere and Ada generations still in service. It is not a CNCF project (NVIDIA owns and ships it), but the source is open under Apache 2.0 and the bug tracker is public. Yobibyte runs the GPU Operator under the hood across every Yobitel NeoCloud region (UK London-1, EU Frankfurt-1, US-East), so customers consuming Yobibyte never install, version-pin or operate it themselves. This entry documents the production surface for teams that do own that responsibility on their own clusters, and helps Yobibyte customers recognise what the managed service does on their behalf.

Quick start

The fastest sane path on a bare-metal or cloud node group is the upstream Helm chart with MIG disabled and DCGM Exporter on. The four-step sequence below installs the operator, watches the DaemonSets come up, confirms the device plugin advertises GPUs to the scheduler and exercises a CUDA workload. Run this against a fresh node group. Do not attempt to install on a host that already has the proprietary driver bound to kernel modules; uninstall the host-side driver first.

# 1. Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Install the operator into its own namespace
helm install --wait gpu-operator \
    nvidia/gpu-operator \
    --version "v25.3.0" \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=true \
    --set toolkit.enabled=true \
    --set devicePlugin.enabled=true \
    --set dcgmExporter.enabled=true \
    --set gfd.enabled=true \
    --set mig.strategy=none

# 3. Watch the per-node validation pods turn Ready (~3-7 minutes)
kubectl -n gpu-operator get pods -w

# 4. Verify a node advertises GPUs and run a CUDA workload
kubectl describe node <gpu-node> | grep -E "nvidia.com/gpu|nvidia.com/mig"

cat <<'YAML' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata: { name: cuda-smoke-test }
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }
YAML

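# Wait for the pod to finish before reading its output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke-test --timeout=5m
kubectl logs cuda-smoke-test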

Tip: On EKS, GKE and AKS the cluster's node image may already ship a vendor-managed driver. Set driver.enabled=false to keep the host driver and let the operator manage only the toolkit, device plugin, DCGM Exporter and GFD. This is the recommended path on every managed Kubernetes service.

How it works

Internally the operator is a single Go controller built with operator-sdk that reconciles a cluster-scoped ClusterPolicy CR (or, since v23.6, settings installed straight from the Helm values). On reconcile it computes the desired DaemonSet, ConfigMap and RBAC objects for every sub-component, applies them only on nodes carrying the expected NFD labels (feature.node.kubernetes.io/pci-10de.present=true, nvidia.com/gpu.present=true), and walks restarts in a deterministic order: NFD first, then driver, then toolkit, then plugin / DCGM / GFD / MIG Manager. Validator pods between each step block the rollout if a stage fails. For example, the toolkit validator runs a CUDA hello-world inside a sample pod, and if it cannot allocate a device the device plugin DaemonSet is not started.

The driver container is the most distinctive piece. NVIDIA publishes a per-OS / per-kernel image (Ubuntu 22.04 / 24.04, RHEL 8 / 9, SUSE SLES 15, Rocky 9, Flatcar) which builds or loads the kernel module on first run, mounts the user-space libraries under /run/nvidia/driver, and exposes them to other DaemonSets via a hostPath volume. The Container Toolkit (containerd / CRI-O / Docker shim) is then configured to call into that hostPath at OCI hook time. This means a node never needs a hand-installed .run file or dkms build: every kernel module compilation happens inside a privileged container with the right toolchain pinned.

Device exposure to pods is done by the Kubernetes device plugin. On each node it discovers physical GPUs, MIG instances or vGPU slices, advertises them under their respective resource names, and on pod admission writes the right NVIDIA_VISIBLE_DEVICES value and bind-mounts the user-space libraries into the container filesystem. GPU Feature Discovery layers richer labels (architecture, memory size, compute capability, MIG capable, NVLink topology) so workloads can target the precise hardware they need via nodeSelector or nodeAffinity.
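
The label-based targeting described above can be sketched as a pod manifest. This is a minimal sketch under stated assumptions, not the operator's own tooling: `render_gpu_pod` is a hypothetical helper, the pod name is invented for illustration, and the default value `NVIDIA-H100-80GB-HBM3` is an assumed example of what GPU Feature Discovery writes to the `nvidia.com/gpu.product` label. Check `kubectl get node <gpu-node> --show-labels` for the real values on your hardware.

```shell
# Hypothetical helper: print a pod manifest that targets a GPU model through
# a GPU Feature Discovery label. The default product string is an assumption.
render_gpu_pod() {
  product="${1:-NVIDIA-H100-80GB-HBM3}"
  cat <<YAML
apiVersion: v1
kind: Pod
metadata:
  name: gfd-targeted-pod
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: "${product}"
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1
YAML
}

# Print the manifest for inspection; pipe to `kubectl apply -f -` to submit it.
render_gpu_pod
```

If no node carries the requested label value the pod stays Pending, which is usually the desired failure mode: the workload waits for the right hardware rather than landing on the wrong one.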

Reconciliation loop — operator pod watches ClusterPolicy, NFD-labelled nodes and component DaemonSet status; no etcd state outside Kubernetes.
Driver mode auto — drops the driver container if nvidia-smi already works on the host; the recommended default since v24.6.
Pre-compiled driver — driver.usePrecompiled=true skips the in-container DKMS build, cutting first-boot time from several minutes to roughly 60-90 seconds on stable kernels.
Sandbox Workloads — opt-in support for KubeVirt VMs needing PCI passthrough; uses the vfio-pci driver path instead of the standard kernel module.
Confidential Computing — sandboxWorkloads.defaultWorkload=vm-passthrough + Hopper / Blackwell CC mode encrypts PCIe traffic between CPU and GPU.
Helm subchart for NFD — installs Node Feature Discovery if the cluster does not already run it; can be disabled with nfd.enabled=false.

Note: The operator does not own the kernel. On nodes where another agent (cloud-init, Ansible, an immutable OS image) installs a different driver out of band, the operator's validator will fail. Either run the operator in driver.enabled=false mode and let it manage only the upper layers, or strip the alternative install path from your provisioning pipeline.

Reference: Helm values

The Helm chart exposes ~200 values; the table below covers the ones that matter on every install. Defaults are taken from chart v25.3.0. Every value can also be set via a ClusterPolicy CR if you prefer to drive the operator declaratively from Argo CD or Flux.

Helm key	Type	Default	Purpose
driver.enabled	bool	true	Run the driver DaemonSet. Set false on managed K8s with vendor driver.
driver.version	string	(chart-pinned)	Pin a specific driver, e.g. `570.124.06`. Must match CUDA matrix.
driver.usePrecompiled	bool	false	Use NVIDIA's prebuilt driver images; skips in-container DKMS.
driver.repository / driver.image	string	nvcr.io/nvidia/driver	Image source — switch for air-gapped mirrors.
driver.startupProbe.initialDelaySeconds	int	60	Raise on slow storage / IB-only mgmt networks.
driver.rdma.enabled	bool	false	Install nvidia-peermem for GPUDirect RDMA over InfiniBand.
toolkit.enabled	bool	true	Install / configure containerd or CRI-O nvidia runtime.
toolkit.version	string	(chart-pinned)	Container Toolkit version pin, e.g. `1.16.2-ubuntu20.04`.
devicePlugin.enabled	bool	true	Run k8s-device-plugin DaemonSet.
devicePlugin.config.name	string	(none)	Reference a ConfigMap with MPS / time-slicing config.
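mig.strategy	string	none	MIG exposure strategy: `none`, `single` (one uniform profile per node) or `mixed` (heterogeneous profiles per GPU).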
migManager.enabled	bool	true	Reconcile MIG profiles from `nvidia.com/mig.config` node label.
dcgmExporter.enabled	bool	true	Run DCGM Exporter on :9400, surfacing >150 GPU metrics.
dcgmExporter.config.name	string	(default csv)	Override the metrics CSV for custom Prometheus emit.
dcgm.enabled	bool	true	Run nv-hostengine inside the cluster instead of in driver container.
gfd.enabled	bool	true	GPU Feature Discovery — emits per-node hardware labels.
nfd.enabled	bool	true	Install Node Feature Discovery subchart.
operator.runtimeClass	string	nvidia	RuntimeClass exposed to workloads for explicit selection.
operator.defaultRuntime	string	containerd	Container runtime the operator assumes on the node, e.g. `containerd` or `crio`.
validator.image / validator.repository	string	nvcr.io/nvidia/cloud-native/gpu-operator-validator	Validator image source.
sandboxWorkloads.enabled	bool	false	KubeVirt passthrough mode; install vfio-pci-manager.
sandboxWorkloads.defaultWorkload	string	container	Default workload type for GPU nodes: `container`, or `vm-passthrough` for KubeVirt PCI passthrough.
vfioManager.enabled	bool	false	Manage vfio-pci bindings for passthrough GPUs.
vgpuManager.enabled	bool	false	Install NVIDIA vGPU host driver + licence client.
vgpuDeviceManager.enabled	bool	false	Reconcile vGPU per-node profile from `nvidia.com/vgpu.config`.
mps.enabled (via devicePlugin sharing)	bool	false	Enable Multi-Process Service for fractional GPUs.
timeSlicing.replicas	int	1	Time-slice a GPU into N logical resources; software isolation only.
psp.enabled	bool	false	Generate PodSecurityPolicies — deprecated; use Pod Security Standards.
cdi.enabled	bool	true	Container Device Interface (CDI) generation — the future of OCI device wiring.
nodeSelector / affinity / tolerations	map	{}	Constrain operator + DaemonSets to specific node pools.
validator.driver.env.WITH_WORKLOAD	bool	true	Run CUDA hello-world as the driver validator (recommended).
dcgmExporter.serviceMonitor.enabled	bool	false	Create Prometheus Operator ServiceMonitor automatically.
operator.upgradeCRD	bool	true	Allow helm upgrade to update ClusterPolicy CRD.

Warning: driver.enabled=true plus a vendor-managed driver already on the host (EKS Bottlerocket, GKE Container-Optimized OS) will deadlock — the driver DaemonSet cannot unload the in-use module. Pick one or the other before the first install; switching after the fact requires draining and reimaging the node.
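
The `devicePlugin.config.name` row in the table above points at a ConfigMap that carries the sharing configuration. The sketch below shows one plausible shape for a time-slicing config. `render_time_slicing_config` is a hypothetical helper, and the ConfigMap name `time-slicing-config` and the data key `any` are illustrative choices, not chart defaults.

```shell
# Hypothetical helper: print a device-plugin sharing ConfigMap that time-slices
# each physical GPU into N replicas. The ConfigMap name and the "any" key are
# illustrative choices.
render_time_slicing_config() {
  replicas="${1:-4}"
  cat <<YAML
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        resources:
          - name: nvidia.com/gpu
            replicas: ${replicas}
YAML
}

# Print the ConfigMap; pipe to `kubectl apply -f -` to create it.
render_time_slicing_config 4
```

Under these assumptions, applying the output and then setting `devicePlugin.config.name=time-slicing-config` together with `devicePlugin.config.default=any` on the Helm release should make each physical GPU appear as four `nvidia.com/gpu.shared` resources. Time-slicing provides no memory or fault isolation between the pods that share a GPU.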

Workload patterns

Three deployment patterns cover the bulk of production installs. The first is bare-metal or self-managed Kubernetes (kubeadm, RKE2, Talos, OpenShift) where the operator owns the entire driver + toolkit + plugin stack. The second is cloud-managed Kubernetes (EKS, GKE, AKS) where the vendor ships a node image with the driver already baked in and the operator manages only the upper layers. The third is multi-tenant MIG, where the operator's MIG Manager reconciles hardware partitioning from a per-node label so each tenant sees only the slice it has been allocated.

Pattern A — bare-metal cluster bootstrap. The operator brings up everything. Set driver.enabled=true, pick a driver.version matched to your Linux kernel, and use driver.usePrecompiled=true if your kernel is on a stable LTS line. This is the canonical path on Yobitel-operated sovereign clusters and on most on-premises NVIDIA-Certified Systems.

Pattern B — cloud-managed K8s with vendor driver. EKS, GKE and AKS each ship optimised node images. Set driver.enabled=false; the operator then runs only toolkit, plugin, DCGM Exporter and GFD. Crucially, on EKS you must still match the vendor driver to the CUDA runtime your workload expects — a 535-series driver will refuse CUDA 13 workloads.

Pattern C — MIG-partitioned multi-tenant. Operator installs in mig.strategy=single (every GPU on the node carries the same uniform profile) or mixed (heterogeneous profiles per GPU). The cluster operator labels each node with nvidia.com/mig.config=all-1g.10gb or similar and the MIG Manager applies the partition. Under mixed, tenants request slices through the precise resource name (nvidia.com/mig-1g.10gb); under single, the slices are advertised as plain nvidia.com/gpu.

# A — bare-metal kubeadm cluster, operator owns the driver
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=true \
    --set driver.version="570.124.06" \
    --set driver.usePrecompiled=true \
    --set mig.strategy=single \
    --set dcgmExporter.enabled=true

# B — EKS / GKE / AKS with vendor-managed driver
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=true \
    --set devicePlugin.enabled=true \
    --set dcgmExporter.enabled=true \
    --set mig.strategy=single

# C — MIG-partitioned multi-tenant cluster (apply on each H100 node)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
kubectl label node h100-node-02 nvidia.com/mig.config=all-2g.20gb --overwrite
# "mixed" is a strategy, not a profile name; all-balanced is a heterogeneous layout
kubectl label node h100-node-03 nvidia.com/mig.config=all-balanced --overwrite

# Verify the reconciled state
kubectl get node h100-node-01 -o jsonpath='{.status.allocatable}' | jq

Tip: For Pattern C, group MIG nodes into separate node pools (e.g. one pool per profile) and use taints + tolerations to keep tenant workloads on the right slice. Mixing MIG profiles within a single pool produces hours of pending-pod investigation later.
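
On the tenant side, Pattern C comes down to the resource name in the pod spec. The sketch below assumes mig.strategy=mixed, where each profile is advertised under its own resource name. `render_mig_pod` is a hypothetical helper and the pod name is invented for illustration.

```shell
# Hypothetical helper: print a pod manifest that requests one MIG slice of the
# given profile. Assumes mig.strategy=mixed so the profile-specific resource
# name exists on the node.
render_mig_pod() {
  profile="${1:-1g.10gb}"
  cat <<YAML
apiVersion: v1
kind: Pod
metadata:
  name: mig-tenant-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-${profile}: 1
YAML
}

# Print the manifest; pipe to `kubectl apply -f -` to submit it.
render_mig_pod 1g.10gb
```

With mig.strategy=single the same slice is requested as `nvidia.com/gpu: 1` instead, so the request has to match the strategy the node pool runs.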

Sizing and capacity planning

The operator itself is cheap. The operator pod runs once per cluster (~200 mCPU, 256 MiB at idle). Per node the DaemonSets cost roughly 0.5-1.0 vCPU and 2-3 GiB of memory, dominated by the driver container holding the kernel module and the user-space libraries in shared memory. DCGM Exporter scrapes are CPU-light but emit ~30-60 KiB per node per scrape; at a 15s interval that is 2-4 KiB/s per node, or roughly 2-4 MiB/s per 1,000 nodes into Prometheus, well inside any reasonable retention budget. Where sizing matters is the host filesystem (driver container caches kernel artefacts) and /dev/shm budget for downstream training workloads using NCCL.

Plan /dev/shm ≥ 8 GiB on every GPU node — NCCL multi-rank training and tensor-parallel inference (vLLM, TensorRT-LLM) fail without it.
Driver image is ~3 GiB; pre-pull to a local registry mirror in air-gapped or restricted-bandwidth environments.
driver.usePrecompiled=true cuts first-boot time from 4-6 minutes to ~60-90 seconds, at the cost of needing NVIDIA precompiled images for your kernel.
DCGM Exporter scrape cost scales linearly with GPU count per node; on 8x H100 nodes expect 80-120 KiB per scrape.
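
The /dev/shm guidance above is typically met per pod with a memory-backed emptyDir. The following is a minimal sketch, assuming the workload runs as a plain Pod; `render_shm_pod` is a hypothetical helper, and the pod, container and volume names are invented for illustration.

```shell
# Hypothetical helper: print a pod manifest that mounts a memory-backed
# emptyDir at /dev/shm, sized by the first argument (default 8Gi).
render_shm_pod() {
  shm_size="${1:-8Gi}"
  cat <<YAML
apiVersion: v1
kind: Pod
metadata:
  name: nccl-shm-pod
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["df", "-h", "/dev/shm"]
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: ${shm_size}
YAML
}

# Print the manifest; pipe to `kubectl apply -f -` to submit it.
render_shm_pod 8Gi
```

Pages written to a memory-backed emptyDir count against the container's memory limit, so the limit needs headroom for the shared-memory segment on top of the process's own usage.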

Component	CPU / node	Memory / node	Disk	Notes
nvidia-driver-daemonset	200-400 mCPU	2.0-2.5 GiB	1.5-2.5 GiB /run/nvidia	Higher with usePrecompiled=false during DKMS build.
nvidia-container-toolkit-daemonset	50-100 mCPU	128-256 MiB	<100 MiB	Configures containerd / CRI-O once then idles.
nvidia-device-plugin-daemonset	50-100 mCPU	128-256 MiB	<50 MiB	Light gRPC server speaking kubelet device plugin API.
nvidia-dcgm-exporter	100-300 mCPU	256-512 MiB	<50 MiB	Scrapes every 15s; metrics surface on :9400.
gpu-feature-discovery	50 mCPU	128 MiB	<50 MiB	Emits labels once per node startup + on driver change.
nvidia-mig-manager (if MIG)	50 mCPU	128 MiB	<50 MiB	Reacts to `nvidia.com/mig.config` label changes.
nfd (cluster-wide)	100 mCPU master + 50 mCPU/node	256 MiB + 128 MiB/node	<50 MiB	Skip if NFD already installed cluster-wide.
gpu-operator (controller)	100-300 mCPU	256-512 MiB	n/a	Single pod per cluster; spikes during reconcile.

Observability

DCGM Exporter is the operator's eyes on the GPU. It exposes a Prometheus endpoint on :9400/metrics covering utilisation, memory, power, ECC errors, thermals, NVLink throughput, NVSwitch counters, MIG per-instance utilisation and a long tail of vendor metrics. The default config emits the high-signal subset (~30 metrics); a dcgmExporter.config.name ConfigMap can switch on the full ~150-metric CSV. The operator itself exports controller health on :8080/healthz and reconciliation metrics on :8080/metrics (Prometheus format). The validator pods log a structured one-line outcome that is the easiest first signal when a node fails to come Ready.

The metrics worth alerting on are GPU utilisation, GPU memory usage, ECC counter rate, thermal slowdown bits, and the operator's reconcile-error counter. The rules below are the minimum production set; refine per-tenant once you know your normal floor.

DCGM_FI_DEV_GPU_UTIL — coarse SM occupancy proxy; low utilisation on a decode-heavy workload points at Python overhead, and high utilisation alongside an idle queue means the SLO is under threat.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — frame buffer (VRAM) usage; alert at 95% sustained.
DCGM_FI_DEV_POWER_USAGE — wattage; sustained near TDP cap means the node is throttling.
DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP — temperatures; alert on thermal slowdown trip count.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL — uncorrected (double-bit) ECC errors; any increase is grounds to retire the GPU.
DCGM_FI_PROF_NVLINK_TX_BYTES / RX_BYTES — NVLink throughput; critical for multi-GPU collectives.
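DCGM_FI_DEV_MIG_MODE — whether MIG mode is enabled on the GPU; per-instance utilisation, the chargeback signal for multi-tenant clusters, comes from profiling fields such as DCGM_FI_PROF_GR_ENGINE_ACTIVE reported per MIG instance.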
gpu_operator_reconcile_total / _errors_total — operator-side health; non-zero error rate means a sub-component is failing.

# Prometheus alerts for a GPU Operator deployment
groups:
  - name: gpu-operator-sla
    interval: 30s
    rules:
      - alert: GPUMemoryNearFull
        expr: avg by (node, gpu) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.node }} above 95% VRAM"

      - alert: GPUEccUncorrectableRising
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Uncorrectable ECC error on {{ $labels.gpu }} — retire and RMA"

      - alert: GPUThermalSlowdown
        expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[10m]) > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Thermal slowdown on {{ $labels.gpu }} — check airflow / inlet temp"

      - alert: GPUOperatorReconcileFailing
        expr: rate(gpu_operator_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "GPU Operator reconcile errors on cluster — investigate operator pod"

      - alert: NvidiaDevicePluginDown
        expr: kube_daemonset_status_number_unavailable{daemonset="nvidia-device-plugin-daemonset"} > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Device plugin DaemonSet has unavailable replicas — GPUs may not schedule"
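
The GPUMemoryNearFull expression above can be spot-checked against a raw scrape before any rule is loaded into Prometheus. The sketch below computes the same used / (used + free) ratio per GPU from exposition-format text on stdin. `fb_used_ratio` is a hypothetical helper, and the sample lines are illustrative values, not captured output; the label layout on a real exporter may differ.

```shell
# Hypothetical helper: read Prometheus exposition text on stdin and print the
# frame-buffer used ratio for every GPU, keyed by the gpu="N" label.
fb_used_ratio() {
  awk '
    /^DCGM_FI_DEV_FB_(USED|FREE)[{]/ {
      # Metric name is everything before "{"; the sample value is the last field.
      name = substr($0, 1, index($0, "{") - 1)
      if (match($0, /gpu="[^"]*"/)) {
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
        val[name, gpu] = $NF
        seen[gpu] = 1
      }
    }
    END {
      for (g in seen) {
        used = val["DCGM_FI_DEV_FB_USED", g]
        free = val["DCGM_FI_DEV_FB_FREE", g]
        if (used + free > 0)
          printf "gpu=%s used_ratio=%.2f\n", g, used / (used + free)
      }
    }' | sort
}

# Illustrative sample input; on a live node, feed it the :9400/metrics output.
printf '%s\n' \
  'DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-a"} 76000' \
  'DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="node-a"} 4000' \
  'DCGM_FI_DEV_FB_USED{gpu="1",Hostname="node-a"} 20000' \
  'DCGM_FI_DEV_FB_FREE{gpu="1",Hostname="node-a"} 60000' |
  fb_used_ratio
# prints:
# gpu=0 used_ratio=0.95
# gpu=1 used_ratio=0.25
```

A GPU that shows a ratio above 0.95 here is one the alert would flag once the condition holds for the full 10 minutes.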

Tip: Ship the NVIDIA-published Grafana dashboards (IDs 12239 and 19725 on grafana.com) on day one. They give a complete GPU fleet view out of the box and are the de facto baseline every SRE expects to find when triaging an incident.

Cost and FinOps

The operator software is free (Apache 2.0). The cost surface is operational: pre-pull bandwidth for driver images, control-plane overhead for DCGM scrapes, and the cluster-management labour the operator either saves or adds. In practice the operator is a clear net positive — replacing a six-step Ansible playbook plus a five-page runbook with one Helm release pays back within the first kernel upgrade incident the team avoids.

Image pull bandwidth — driver images ~3 GiB, toolkit ~600 MiB, DCGM Exporter ~200 MiB. On 100 nodes that is ~380 GiB per major version bump. Mirror to a private registry to keep $0.05/GB egress charges contained.
Prometheus retention — DCGM Exporter at 15s interval costs ~50 MB/day per 8x H100 node retained 30 days. For a 100-node H100 fleet that is roughly 150 GB of TSDB, $3-6/month on object-storage backends like Mimir or Thanos.
Per-node CPU/RAM tax — ~0.5-1.0 vCPU + 2-3 GiB. On a 96-vCPU node this is roughly 1% overhead or less; on a smaller utility node it can be 5-8%. Plan node pools accordingly.
Operational savings — every avoided kernel-mismatch incident saves 2-8 engineering hours; in a 100-node fleet that compounds quickly. The break-even point against hand-rolled provisioning is typically within the first 90 days.
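
The image-pull and retention figures above are simple products, and it can help to keep them as a small calculator so they are easy to re-run for a different fleet size. This is a sketch using the sizes quoted in the text; the function names are hypothetical and integer division truncates the results.

```shell
# Image-pull volume per major version bump, in GiB.
# Per-node sizes in MiB are the ones quoted above: driver ~3 GiB,
# toolkit ~600 MiB, DCGM Exporter ~200 MiB.
image_pull_gib() {
  nodes="$1"
  per_node_mib=$((3 * 1024 + 600 + 200))
  echo $((nodes * per_node_mib / 1024))
}

# Retained TSDB volume in GB: MB/day per node x retention days x nodes.
tsdb_gb() {
  nodes="$1"
  mb_per_day="$2"
  days="$3"
  echo $((nodes * mb_per_day * days / 1000))
}

echo "image pull: $(image_pull_gib 100) GiB"   # prints "image pull: 378 GiB"
echo "tsdb: $(tsdb_gb 100 50 30) GB"           # prints "tsdb: 150 GB"
```

The 378 GiB result is what the text rounds to ~380 GiB, and the 150 GB result matches the retention estimate for a 100-node fleet.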

Security and compliance

Every component the operator deploys runs privileged or near-privileged — kernel module loading, /dev mounts and /sys access are mandatory for the driver and toolkit DaemonSets. This is not negotiable: GPUs require kernel privileges to load. The compensating controls are: pin every image to a SHA digest (not :latest), constrain the operator namespace with Pod Security Standards Restricted on everything except the named operator-owned DaemonSets, and use admission policies (Kyverno / OPA Gatekeeper) to block pod requests that bypass the device plugin and try to mount /dev/nvidia* directly.
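
The digest-pinning control above is easy to check mechanically. The sketch below classifies an image reference as pinned only when it ends in a full sha256 digest. `pin_status` is a hypothetical helper, and the all-zero digest is a placeholder, not a real image digest.

```shell
# Hypothetical helper: report whether an image reference is pinned to a
# sha256 digest (64 lowercase hex characters at the end of the reference).
pin_status() {
  if printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'; then
    echo pinned
  else
    echo unpinned
  fi
}

# Placeholder digest made of 64 zeros, for illustration only.
placeholder_digest="sha256:$(printf '%064d' 0)"

for image in \
  "nvcr.io/nvidia/driver:latest" \
  "nvcr.io/nvidia/driver@${placeholder_digest}"; do
  echo "$(pin_status "$image") $image"
done
```

On a live cluster the image list could come from `kubectl -n gpu-operator get pods -o jsonpath='{range .items[*].spec.containers[*]}{.image}{"\n"}{end}'`, with each line passed through `pin_status`.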

For Secure Boot environments, NVIDIA publishes signed driver images. For UK central-government OFFICIAL workloads, the operator can be configured with a customer-controlled Machine Owner Key (MOK) for kernel module signing. Confidential computing on Hopper / Blackwell encrypts PCIe DMA traffic and is supported via sandboxWorkloads.enabled=true plus the appropriate firmware mode — a hard requirement for some sovereign workloads.

Regulatory implications are mostly indirect: the operator is infrastructure, not a data plane. For NCSC Cloud Security Principles, the relevant principles are 2 (Asset protection and resilience — encrypted PCIe under CC mode), 5 (Operational security — operator's reconciliation provides drift detection and patch baseline), and 9 (Secure user management — operator scopes itself to a single namespace with explicit RBAC). For GDPR Article 32, the operator processes no personal data. For SOC 2 / ISO 27001, the operator's GitOps-friendly install path is the evidence trail (every version bump is a Git commit).

Warning: Do not run the operator's driver DaemonSet on the same nodes as workloads that mount hostPath: /lib/modules for their own driver build. The two will race during kernel updates and one will leave the node in a half-installed state. Segregate by node pool.

Migration and alternatives

Most production migrations to the operator come from one of three origins: hand-rolled Ansible / Puppet / Chef installing driver + toolkit + plugin separately, the deprecated standalone NVIDIA Kubernetes device plugin manifest, or a cloud-vendor's bundled GPU AMI. The migration effort is shallow but the rollout sequence matters — get it wrong on a live cluster and you will lose all GPU scheduling for the duration of the cutover.

The canonical playbook is: roll the operator into the cluster with driver.enabled=false, drain a single canary node, uninstall the host-side driver from that node, relabel it nvidia.com/gpu.deploy.driver=true, watch the operator's driver DaemonSet come up and the validator pass, then iterate across the fleet. Reverse the sequence to roll back. The table below summarises the path from each common starting point.

From	Effort	Risk	Notes
Hand-rolled Ansible + standalone device plugin	Low	Medium	Reversible per node. Strip Ansible's driver tasks first.
Standalone k8s-device-plugin manifest	Trivial	Low	Operator's plugin DaemonSet replaces it; remove old manifest.
NVIDIA AI Enterprise installer	Low	Low	Same operator under the hood; just a chart-source switch.
EKS GPU AMI (bundled driver)	Low	Low	Set `driver.enabled=false`; operator manages upper layers.
GKE / AKS with vendor driver	Low	Low	Same as EKS — operator runs in driver-disabled mode.
Bottlerocket / Container-Optimized OS image	Low	Low	Driver baked into OS; operator owns toolkit / plugin / DCGM.
KubeVirt + manual vfio-pci	Medium	Medium	Enable `sandboxWorkloads.enabled=true`; reuse PCI bindings.
Run:ai pre-acquisition installer	Trivial	Low	Run:ai now ships on top of the operator post-NVIDIA acquisition.
vs Yobibyte managed alternative	n/a	n/a	If you would rather not own the operator at all, Yobibyte consumes GPUs through Yobitel-operated tenancies where the operator, driver baseline and DCGM scrape are already configured per region — customers see only the workspace surface.

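#!/usr/bin/env bash
# Migration script: per-node cutover from standalone driver to operator
# Usage: ./migrate-node.sh <node-name>
set -euo pipefail
NODE="${1:?usage: $0 <node-name>}"

# 1. Cordon and drain
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force

# 2. Remove old host driver (Ubuntu example). Each remote command is quoted as
#    one string so the package globs reach apt-get on the node unexpanded.
ssh "$NODE" "sudo apt-get purge -y 'nvidia-*' 'libnvidia-*' 'cuda-drivers*'"
ssh "$NODE" "sudo rm -rf /usr/local/cuda* /var/lib/dkms/nvidia*"
# The SSH session drops when the node reboots, so tolerate a non-zero exit.
ssh "$NODE" "sudo reboot" || true

# 3. Label the node for the operator once it is back
# Give the reboot time to start so the Ready check does not pass against
# the pre-reboot state.
sleep 60
kubectl wait --for=condition=Ready node/"$NODE" --timeout=10m
kubectl label node "$NODE" nvidia.com/gpu.deploy.driver=true --overwrite

# 4. Wait for the operator's validator pod on this node to become Ready
#    (retries while the pod has not been created yet)
until kubectl -n gpu-operator wait pod -l app=nvidia-operator-validator \
    --field-selector spec.nodeName="$NODE" --for=condition=Ready --timeout=60s; do
  sleep 10
done

# 5. Uncordon once the validator has passed
kubectl uncordon "$NODE"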

Troubleshooting

The error table below covers the failure modes that account for roughly 85% of production GPU Operator incidents observed on Yobitel-operated fleets and the upstream issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
Driver pod CrashLoopBackOff after OS update	Kernel version no longer matches driver build target.	Pin kernel package; or `driver.usePrecompiled=true`; or pin driver to an image tag matching the new kernel.
`nvidia.com/gpu` resource not appearing on node	Device plugin DaemonSet not scheduled, or validator failed.	Check NFD labels on node; check `nvidia-operator-validator` pod logs; ensure `feature.node.kubernetes.io/pci-10de.present=true`.
Pods schedule but `nvidia-smi` fails inside container	containerd not configured with `nvidia` runtime class.	Restart toolkit DaemonSet; verify `/etc/containerd/config.toml` has `nvidia` runtime; restart containerd.
MIG strategy mismatch — pods pending forever	`mig.strategy=single` but pod requests `nvidia.com/mig-1g.10gb`.	Either switch the cluster to `mixed` strategy or change pod to request `nvidia.com/gpu: 1`.
MIG repartition leaves node NotReady	Driver reload failed mid-partition.	`kubectl logs -n gpu-operator -l app=nvidia-mig-manager`; manually run `nvidia-smi mig -dgi -dci` and re-label.
NCCL hang on first multi-GPU job	/dev/shm too small or `nvidia-peermem` not loaded.	Mount `/dev/shm >= 8Gi`; enable `driver.rdma.enabled=true` for GPUDirect.
Operator pod CrashLoopBackOff after Helm upgrade	ClusterPolicy CRD schema drift.	Set `operator.upgradeCRD=true`; re-run upgrade; if still failing, delete CR then reapply.
DCGM Exporter returns empty metrics	DCGM hostengine not running, or NVML mismatch.	Restart `nvidia-dcgm-exporter` pod; check `dcgm.enabled=true`; verify driver version matches DCGM build.
Secure Boot host rejects driver module	Driver image not signed with MOK enrolled on host.	Either disable Secure Boot, or use signed driver image, or enrol MOK with operator's signing key.
Node drained but pods never reschedule elsewhere	GPU resource request not satisfied by any other node.	Check `kubectl describe pod`; verify other nodes have GPUs in matching profile.
Validator pod stuck Pending	Tolerations / nodeSelector mismatch.	Check `nvidia-operator-validator` SA tolerations; ensure node taints are tolerated.
Driver upgrade hangs on running training job	Driver DaemonSet cannot evict a privileged process holding the device.	Drain node manually with `--force --delete-emptydir-data`; never upgrade driver during a training run.

Where this fits in the Yobitel stack

The NVIDIA GPU Operator is the foundation under every GPU-bearing Kubernetes node in the Yobitel estate. Whether the workload is a Yobibyte-managed inference endpoint, a Yobitel GPU Cloud bare-metal tenant cluster, an Edge AI node at a customer site, or a sovereign Yobitel UK London-1 region, the operator is what makes nvidia.com/gpu schedulable. Yobitel does not maintain a forked or replacement driver layer — the value is added above the operator, in the Yobibyte control plane and Omniscient Compute scoring, not in re-implementing what NVIDIA already ships and supports.

On Yobitel-managed clusters the operator is installed via GitOps from the platform's standard Argo CD root, with values templated per region (driver version pinned to the regional kernel baseline, DCGM Exporter wired into the regional Prometheus / Mimir stack, MIG strategy chosen per tenant SKU). On customer-managed clusters where Yobitel provides Managed Operations, the operator is the first thing installed and the last thing touched — incidents almost always resolve to a layer above, and stability of the operator stack is treated as a precondition for SLA computation.

For UK and EU sovereign workloads, the operator runs on Yobitel tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. Signed driver containers under customer-controlled MOK, confidential computing modes on Hopper and Blackwell, and air-gapped install paths are all supported. The combination of an open-source operator, sovereign hardware, and transparent benchmarking is what lets Yobitel customers run production GPU workloads on Kubernetes without ceding their kernel baseline to a hosted vendor.

References

NVIDIA GPU Operator Documentation · NVIDIA Docs
gpu-operator on GitHub · GitHub (NVIDIA)
k8s-device-plugin · GitHub (NVIDIA)
DCGM Exporter · GitHub (NVIDIA)
NVIDIA Container Toolkit · GitHub (NVIDIA)
Node Feature Discovery · Kubernetes SIGs
Container Device Interface (CDI) · CNCF

Quick start

# 1. Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Install the operator into its own namespace
helm install --wait gpu-operator \
    nvidia/gpu-operator \
    --version "v25.3.0" \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=true \
    --set toolkit.enabled=true \
    --set devicePlugin.enabled=true \
    --set dcgmExporter.enabled=true \
    --set gfd.enabled=true \
    --set mig.strategy=none

# 3. Watch the per-node validation pods turn Ready (~3-7 minutes)
kubectl -n gpu-operator get pods -w

# 4. Verify a node advertises GPUs and run a CUDA workload
kubectl describe node <gpu-node> | grep -E "nvidia.com/gpu|nvidia.com/mig"

cat <<'YAML' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata: { name: cuda-smoke-test }
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }
YAML

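# Wait for the pod to finish before reading its output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke-test --timeout=5m
kubectl logs cuda-smoke-test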

Tip: On EKS, GKE and AKS the cluster's node image may already ship a vendor-managed driver. Set driver.enabled=false to keep the host driver and let the operator manage only the toolkit, device plugin, DCGM Exporter and GFD. This is the recommended path on every managed Kubernetes service.

How it works

Reconciliation loop — operator pod watches ClusterPolicy, NFD-labelled nodes and component DaemonSet status; no etcd state outside Kubernetes.
Driver mode auto — drops the driver container if nvidia-smi already works on the host; the recommended default since v24.6.
Pre-compiled driver — driver.usePrecompiled=true skips in-container DKMS build, halves first-boot time on stable kernels.
Sandbox Workloads — opt-in support for KubeVirt VMs needing PCI passthrough; uses the vfio-pci driver path instead of the standard kernel module.
Confidential Computing — sandboxWorkloads.defaultWorkload=vm-passthrough + Hopper / Blackwell CC mode encrypts PCIe traffic between CPU and GPU.
Helm subchart for NFD — installs Node Feature Discovery if the cluster does not already run it; can be disabled with nfd.enabled=false.

Note: The operator does not own the kernel. On nodes where another agent (cloud-init, Ansible, an immutable OS image) installs a different driver out of band, the operator's validator will fail. Either run the operator in driver.enabled=false mode and let it manage only the upper layers, or strip the alternative install path from your provisioning pipeline.
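
The choice described in the note above can be made explicit at provisioning time. The following is a minimal sketch, assuming it runs on the GPU node itself (for example over SSH); `driver_enabled_for_host` is a hypothetical helper written for illustration, not part of the operator, and it is a simplified stand-in for whatever detection the operator performs internally.

```shell
# Hypothetical helper: print the driver.enabled value that suits this host.
# $1 is the probe command; it defaults to nvidia-smi.
driver_enabled_for_host() {
    probe="${1:-nvidia-smi}"
    if command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1; then
        echo false   # a working host driver exists: leave it alone
    else
        echo true    # no usable driver: let the operator's driver DaemonSet install one
    fi
}

echo "driver.enabled=$(driver_enabled_for_host)"
```

The printed value can then be passed to Helm as `--set driver.enabled=...` for the node pool that the host belongs to.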

Reference: Helm values

Helm key	Type	Default	Purpose
driver.enabled	bool	true	Run the driver DaemonSet. Set false on managed K8s with vendor driver.
driver.version	string	(chart-pinned)	Pin a specific driver, e.g. `570.124.06`. Must match CUDA matrix.
driver.usePrecompiled	bool	false	Use NVIDIA's prebuilt driver images; skips in-container DKMS.
driver.repository / driver.image	string	nvcr.io/nvidia/driver	Image source — switch for air-gapped mirrors.
driver.startupProbe.initialDelaySeconds	int	60	Raise on slow storage / IB-only mgmt networks.
driver.rdma.enabled	bool	false	Install nvidia-peermem for GPUDirect RDMA over InfiniBand.
toolkit.enabled	bool	true	Install / configure containerd or CRI-O nvidia runtime.
toolkit.version	string	(chart-pinned)	Container Toolkit version pin, e.g. `1.16.2-ubuntu20.04`.
devicePlugin.enabled	bool	true	Run k8s-device-plugin DaemonSet.
devicePlugin.config.name	string	(none)	Reference a ConfigMap with MPS / time-slicing config.
mig.strategy	string	single	MIG exposure strategy: `none`, `single` or `mixed`.
migManager.enabled	bool	true	Reconcile MIG profiles from `nvidia.com/mig.config` node label.
dcgmExporter.enabled	bool	true	Run DCGM Exporter on :9400, surfacing >150 GPU metrics.
dcgmExporter.config.name	string	(default csv)	Override the metrics CSV for custom Prometheus emit.
dcgm.enabled	bool	true	Run nv-hostengine inside the cluster instead of in driver container.
gfd.enabled	bool	true	GPU Feature Discovery — emits per-node hardware labels.
nfd.enabled	bool	true	Install Node Feature Discovery subchart.
operator.runtimeClass	string	nvidia	RuntimeClass exposed to workloads for explicit selection.
operator.defaultRuntime	string	containerd	Container runtime assumed when it cannot be detected from the node: `docker`, `containerd` or `crio`.
validator.image / validator.repository	string	nvcr.io/nvidia/cloud-native/gpu-operator-validator	Validator image source.
sandboxWorkloads.enabled	bool	false	KubeVirt passthrough mode; install vfio-pci-manager.
sandboxWorkloads.defaultWorkload	string	container	Default workload type for GPU nodes: `container`, `vm-passthrough` or `vm-vgpu`.
vfioManager.enabled	bool	false	Manage vfio-pci bindings for passthrough GPUs.
vgpuManager.enabled	bool	false	Install NVIDIA vGPU host driver + licence client.
vgpuDeviceManager.enabled	bool	false	Reconcile vGPU per-node profile from `nvidia.com/vgpu.config`.
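mps.enabled (via devicePlugin sharing)	bool	false	Enable Multi-Process Service for fractional GPUs. Set in the ConfigMap named by `devicePlugin.config.name`, not as a top-level chart value.
timeSlicing.replicas (via devicePlugin sharing)	int	1	Time-slice a GPU into N logical resources; software isolation only. Also set in the device plugin ConfigMap.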
psp.enabled	bool	false	Generate PodSecurityPolicies — deprecated; use Pod Security Standards.
cdi.enabled	bool	true	Container Device Interface (CDI) generation — the future of OCI device wiring.
nodeSelector / affinity / tolerations	map	{}	Constrain operator + DaemonSets to specific node pools.
validator.driver.env.WITH_WORKLOAD	bool	true	Run CUDA hello-world as the driver validator (recommended).
dcgmExporter.serviceMonitor.enabled	bool	false	Create Prometheus Operator ServiceMonitor automatically.
operator.upgradeCRD	bool	true	Allow helm upgrade to update ClusterPolicy CRD.
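
The `devicePlugin.config.name` row points at a ConfigMap rather than carrying the sharing settings itself. Below is a minimal sketch of such a ConfigMap for time-slicing, assuming the device plugin's `version: v1` config format; the ConfigMap name `time-slicing-config`, the data key `any` and the replica count of 4 are illustrative choices, not values from this article.

```shell
# Write a device-plugin sharing config; the ConfigMap name and the "any" key are illustrative.
cat > time-slicing-config.yaml <<'YAML'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
YAML

# Apply it, then point the operator at it (commented out so the snippet has no side effects):
# kubectl apply -f time-slicing-config.yaml
# helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
#     --set devicePlugin.config.name=time-slicing-config \
#     --set devicePlugin.config.default=any
```

With this in place each physical GPU should be advertised as four schedulable replicas; as the table notes, the isolation between them is software-only.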

Warning: driver.enabled=true plus a vendor-managed driver already on the host (EKS Bottlerocket, GKE Container-Optimized OS) will deadlock — the driver DaemonSet cannot unload the in-use module. Pick one or the other before the first install; switching after the fact requires draining and reimaging the node.

Workload patterns

# A: bare-metal kubeadm cluster, operator owns the driver
# With driver.usePrecompiled=true, driver.version takes the driver branch (e.g. 570), not a full
# version; a precompiled image must exist for that branch and your kernel.
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=true \
    --set driver.version="570" \
    --set driver.usePrecompiled=true \
    --set mig.strategy=single \
    --set dcgmExporter.enabled=true

# B: EKS / GKE / AKS with vendor-managed driver
helm install --wait gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=true \
    --set devicePlugin.enabled=true \
    --set dcgmExporter.enabled=true \
    --set mig.strategy=single

# C: MIG-partitioned multi-tenant cluster (apply on each H100 node)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
kubectl label node h100-node-02 nvidia.com/mig.config=all-2g.20gb --overwrite
# all-balanced (a profile in the default MIG Manager config) puts several slice sizes
# on one GPU, so it requires mig.strategy=mixed
kubectl label node h100-node-03 nvidia.com/mig.config=all-balanced --overwrite

# Verify the reconciled state
kubectl get node h100-node-01 -o jsonpath='{.status.allocatable}' | jq

Tip: For Pattern C, group MIG nodes into separate node pools (e.g. one pool per profile) and use taints + tolerations to keep tenant workloads on the right slice. Mixing MIG profiles within a single pool produces hours of pending-pod investigation later.
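
The pool-per-profile advice above can be sketched as a tenant pod that both selects the profile label and tolerates the pool's taint. This is a minimal illustration, assuming `mig.strategy=mixed` (under `single` the pod would request `nvidia.com/gpu` instead); the pod name and the taint value `mig-1g.10gb` are hypothetical, and the taint key `nvidia.com/gpu` is chosen on the assumption that the operator's own DaemonSets tolerate that key by default, which is worth verifying against your chart values.

```shell
# Write a tenant pod manifest pinned to the 1g.10gb pool from Pattern C.
cat > mig-tenant-pod.yaml <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a-inference            # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/mig.config: all-1g.10gb
  tolerations:
    - key: nvidia.com/gpu             # assumed pool taint key
      operator: Equal
      value: mig-1g.10gb              # hypothetical taint value
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # resource name under mig.strategy=mixed
YAML

# Taint the pool once, then submit the pod (commented out so the snippet has no side effects):
# kubectl taint node h100-node-01 nvidia.com/gpu=mig-1g.10gb:NoSchedule
# kubectl apply -f mig-tenant-pod.yaml
```

Pods without the toleration stay off the pool, and pods with it land only on nodes carrying the matching `nvidia.com/mig.config` label.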

Sizing and capacity planning

Plan /dev/shm ≥ 8 GiB on every GPU node — NCCL multi-rank training and tensor-parallel inference (vLLM, TensorRT-LLM) fail without it.
Driver image is ~3 GiB; pre-pull to a local registry mirror in air-gapped or restricted-bandwidth environments.
driver.usePrecompiled=true cuts first-boot time from 4-6 minutes to ~60-90 seconds, at the cost of needing NVIDIA precompiled images for your kernel.
DCGM Exporter scrape cost scales linearly with GPU count per node; on 8x H100 nodes expect 80-120 KiB per scrape.
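
At the pod level, the /dev/shm requirement above is usually met with a memory-backed emptyDir. The following is a minimal sketch under that assumption; the pod name is hypothetical, and the container only prints the mounted size so the result can be checked from the logs.

```shell
# Write a pod manifest that mounts an 8 GiB memory-backed volume at /dev/shm.
cat > shm-pod.yaml <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: nccl-worker                  # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["df", "-h", "/dev/shm"]
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory
        sizeLimit: 8Gi
YAML

# kubectl apply -f shm-pod.yaml && kubectl logs nccl-worker
```

Memory-backed emptyDir usage counts against the pod's memory, so size container memory limits with the extra 8 GiB in mind.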

Component	CPU / node	Memory / node	Disk	Notes
nvidia-driver-daemonset	200-400 mCPU	2.0-2.5 GiB	1.5-2.5 GiB /run/nvidia	Higher with usePrecompiled=false during DKMS build.
nvidia-container-toolkit-daemonset	50-100 mCPU	128-256 MiB	<100 MiB	Configures containerd / CRI-O once then idles.
nvidia-device-plugin-daemonset	50-100 mCPU	128-256 MiB	<50 MiB	Light gRPC server speaking kubelet device plugin API.
nvidia-dcgm-exporter	100-300 mCPU	256-512 MiB	<50 MiB	Scrapes every 15s; metrics surface on :9400.
gpu-feature-discovery	50 mCPU	128 MiB	<50 MiB	Emits labels once per node startup + on driver change.
nvidia-mig-manager (if MIG)	50 mCPU	128 MiB	<50 MiB	Reacts to `nvidia.com/mig.config` label changes.
nfd (cluster-wide)	100 mCPU master + 50 mCPU/node	256 MiB + 128 MiB/node	<50 MiB	Skip if NFD already installed cluster-wide.
gpu-operator (controller)	100-300 mCPU	256-512 MiB	n/a	Single pod per cluster; spikes during reconcile.
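
As a quick check on node sizing, the per-node rows of the table above can be summed. This sketch uses the table's lower and upper bounds converted to mCPU and MiB; it includes the optional MIG manager and one NFD worker, and leaves out the cluster-wide controller and NFD master. `sum_list` is a small helper defined here for illustration.

```shell
# Per-node DaemonSet overhead, summed from the sizing table (mCPU and MiB).
# Order: driver, toolkit, device plugin, DCGM Exporter, GFD, MIG manager, NFD worker
cpu_lo="200 50 50 100 50 50 50"
cpu_hi="400 100 100 300 50 50 50"
mem_lo="2048 128 128 256 128 128 128"
mem_hi="2560 256 256 512 128 128 128"

# Add up a space-separated list of integers (runs in a subshell, so no variables leak).
sum_list() ( t=0; for v in $1; do t=$((t + v)); done; echo "$t" )

echo "CPU:    $(sum_list "$cpu_lo")-$(sum_list "$cpu_hi") mCPU"
echo "Memory: $(sum_list "$mem_lo")-$(sum_list "$mem_hi") MiB"
```

That works out to roughly 0.55-1.05 vCPU and 2.9-3.9 GiB per GPU node, with the driver DaemonSet accounting for most of the memory.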

Observability

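DCGM_FI_DEV_GPU_UTIL: coarse utilisation signal (share of time at least one kernel was executing), not true SM occupancy. Low utilisation on a decode-heavy service usually points to host-side (Python) overhead; high utilisation with an idle queue signals an SLO under threat.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE: frame buffer (VRAM) usage; alert at 95% sustained.
DCGM_FI_DEV_POWER_USAGE: power draw in watts; sustained readings near the TDP cap suggest the GPU is being power-throttled.
DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP: temperatures; alert on thermal slowdown trip count.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL: uncorrectable (double-bit) ECC errors since the last driver reload; treat any increase as grounds to retire the GPU.
DCGM_FI_PROF_NVLINK_TX_BYTES / DCGM_FI_PROF_NVLINK_RX_BYTES: NVLink throughput; critical for multi-GPU collectives.
DCGM_FI_DEV_MIG_MODE: whether MIG mode is enabled on the GPU. Per-instance utilisation, the chargeback signal for multi-tenant clusters, comes from the DCGM_FI_PROF_* metrics (for example DCGM_FI_PROF_GR_ENGINE_ACTIVE) broken down by the GPU_I_ID label.
gpu_operator_reconcile_total / gpu_operator_reconcile_errors_total: operator-side health; a non-zero error rate means a sub-component is failing.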

# Prometheus alerts for a GPU Operator deployment
groups:
  - name: gpu-operator-sla
    interval: 30s
    rules:
      - alert: GPUMemoryNearFull
        expr: avg by (node, gpu) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.node }} above 95% VRAM"

      - alert: GPUEccUncorrectableRising
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Uncorrectable ECC error on {{ $labels.gpu }} — retire and RMA"

      - alert: GPUThermalSlowdown
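        # Cumulative throttling-time counter: alert on growth, otherwise the alert never clears
        expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[10m]) > 0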
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Thermal slowdown on {{ $labels.gpu }} — check airflow / inlet temp"

      - alert: GPUOperatorReconcileFailing
        expr: rate(gpu_operator_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "GPU Operator reconcile errors on cluster — investigate operator pod"

      - alert: NvidiaDevicePluginDown
        expr: kube_daemonset_status_number_unavailable{daemonset="nvidia-device-plugin-daemonset"} > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Device plugin DaemonSet has unavailable replicas — GPUs may not schedule"
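
The ratio behind the GPUMemoryNearFull rule can also be checked by hand against a single exporter. The sketch below parses the Prometheus text exposition from standard input and prints the used fraction per GPU; `vram_ratio` is a hypothetical helper, and it assumes each sample line carries a `gpu="..."` label and no trailing timestamp.

```shell
# vram_ratio: read DCGM Exporter text exposition on stdin,
# print "<gpu> <used fraction>" for every GPU that reports FB_USED.
vram_ratio() {
    awk '
        /^DCGM_FI_DEV_FB_(USED|FREE)[{]/ {
            if (match($0, /gpu="[^"]*"/) == 0) next
            gpu = substr($0, RSTART + 5, RLENGTH - 6)   # strip gpu=" and the closing quote
            if ($0 ~ /^DCGM_FI_DEV_FB_USED/) used[gpu] = $NF
            else free[gpu] = $NF
        }
        END {
            for (g in used)
                if (used[g] + free[g] > 0)
                    printf "%s %.2f\n", g, used[g] / (used[g] + free[g])
        }'
}

# Usage against a live exporter (commented out; needs access to the pod or node):
# curl -s http://localhost:9400/metrics | vram_ratio
```

Any GPU printing a value above 0.95 for ten minutes or more is what the alert would flag.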

Tip: Ship the NVIDIA-published Grafana dashboards (IDs 12239 and 19725 on grafana.com) on day one. They give a complete GPU fleet view out of the box and are the de facto baseline every SRE expects to find when triaging an incident.

Cost and FinOps

Image pull bandwidth: driver images ~3 GiB, toolkit ~600 MiB, DCGM Exporter ~200 MiB. On 100 nodes that is ~380 GiB per major version bump. Mirror to a private registry to keep $0.05/GB egress charges contained.
Prometheus retention: DCGM Exporter at a 15s scrape interval adds ~50 MB/day of TSDB per 8x H100 node. With 30-day retention, a 100-node H100 fleet holds roughly 150 GB, or $3-6/month on object-storage backends like Mimir or Thanos.
Per-node CPU/RAM tax: ~0.5-1.0 vCPU + 3-4 GiB (the sum of the per-node rows in the sizing table). On a 96-vCPU node this is about 1% CPU overhead or less; on a smaller utility node it can be 5-8%. Plan node pools accordingly.
Operational savings: every avoided kernel-mismatch incident saves 2-8 engineering hours; in a 100-node fleet that compounds quickly. The break-even point against hand-rolled provisioning is typically within the first 90 days.
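
The two volume figures above follow from simple arithmetic, reproduced here as a sketch so the inputs can be swapped for your own fleet. The image sizes, per-node TSDB rate and retention period are the approximate values quoted in the list; integer division rounds down.

```shell
nodes=100

# Image pull volume per version bump: driver ~3 GiB + toolkit ~600 MiB + DCGM Exporter ~200 MiB
per_node_mib=$((3 * 1024 + 600 + 200))
pull_gib=$((per_node_mib * nodes / 1024))
echo "image pulls: ~${pull_gib} GiB per version bump"

# Prometheus TSDB held at steady state: ~50 MB/day per node, 30 days of retention
tsdb_gb=$((50 * 30 * nodes / 1000))
echo "TSDB size:   ~${tsdb_gb} GB"
```

For 100 nodes this gives about 378 GiB of pulls and 150 GB of TSDB, in line with the rounded figures in the list.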

Security and compliance

Warning: Do not run the operator's driver DaemonSet on the same nodes as workloads that mount hostPath: /lib/modules for their own driver build. The two will race during kernel updates and one will leave the node in a half-installed state. Segregate by node pool.

Migration and alternatives

From	Effort	Risk	Notes
Hand-rolled Ansible + standalone device plugin	Low	Medium	Reversible per node. Strip Ansible's driver tasks first.
Standalone k8s-device-plugin manifest	Trivial	Low	Operator's plugin DaemonSet replaces it; remove old manifest.
NVIDIA AI Enterprise installer	Low	Low	Same operator under the hood; just a chart-source switch.
EKS GPU AMI (bundled driver)	Low	Low	Set `driver.enabled=false`; operator manages upper layers.
GKE / AKS with vendor driver	Low	Low	Same as EKS — operator runs in driver-disabled mode.
Bottlerocket / Container-Optimized OS image	Low	Low	Driver baked into OS; operator owns toolkit / plugin / DCGM.
KubeVirt + manual vfio-pci	Medium	Medium	Enable `sandboxWorkloads.enabled=true`; reuse PCI bindings.
Run:ai pre-acquisition installer	Trivial	Low	Run:ai now ships on top of the operator post-NVIDIA acquisition.
vs Yobibyte managed alternative	n/a	n/a	If you would rather not own the operator at all, Yobibyte consumes GPUs through Yobitel-operated tenancies where the operator, driver baseline and DCGM scrape are already configured per region — customers see only the workspace surface.

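#!/usr/bin/env bash
# Migration script: per-node cutover from standalone driver to operator
# Usage: ./cutover.sh <node-name>   (assumes the node name is also its SSH hostname)
set -euo pipefail
NODE="${1:?usage: $0 <node-name>}"

# 1. Cordon and drain
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force

# 2. Remove old host driver (Ubuntu example)
#    The whole remote command is quoted so the package globs reach apt-get intact.
ssh "$NODE" "sudo apt-get purge -y 'nvidia-*' 'libnvidia-*' 'cuda-drivers*'"
ssh "$NODE" "sudo rm -rf /usr/local/cuda* /var/lib/dkms/nvidia*"
ssh "$NODE" "sudo reboot" || true   # the SSH session drops as the host goes down

# 3. Give the reboot time to start, then label the node for the operator
sleep 60
kubectl wait --for=condition=Ready node/"$NODE" --timeout=10m
kubectl label node "$NODE" nvidia.com/gpu.deploy.driver=true --overwrite

# 4. Block until the operator's validator pod on this node is Ready
#    (retries while the pod does not exist yet; inspect progress from another
#    terminal with: kubectl -n gpu-operator get pods --field-selector spec.nodeName=<node> -w)
until kubectl -n gpu-operator wait pod -l app=nvidia-operator-validator \
        --field-selector spec.nodeName="$NODE" \
        --for=condition=Ready --timeout=60s; do
    sleep 15
done

# 5. Uncordon once the validator has passed
kubectl uncordon "$NODE"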

Troubleshooting

Symptom	Cause	Fix
Driver pod CrashLoopBackOff after OS update	Kernel version no longer matches driver build target.	Pin kernel package; or `driver.usePrecompiled=true`; or pin driver to an image tag matching the new kernel.
`nvidia.com/gpu` resource not appearing on node	Device plugin DaemonSet not scheduled, or validator failed.	Check NFD labels on node; check `nvidia-operator-validator` pod logs; ensure `feature.node.kubernetes.io/pci-10de.present=true`.
Pods schedule but `nvidia-smi` fails inside container	containerd not configured with `nvidia` runtime class.	Restart toolkit DaemonSet; verify `/etc/containerd/config.toml` has `nvidia` runtime; restart containerd.
MIG strategy mismatch — pods pending forever	`mig.strategy=single` but pod requests `nvidia.com/mig-1g.10gb`.	Either switch the cluster to `mixed` strategy or change pod to request `nvidia.com/gpu: 1`.
MIG repartition leaves node NotReady	Driver reload failed mid-partition.	`kubectl logs -n gpu-operator -l app=nvidia-mig-manager`; manually run `nvidia-smi mig -dci` then `nvidia-smi mig -dgi` (compute instances must be destroyed before GPU instances) and re-label.
NCCL hang on first multi-GPU job	/dev/shm too small or `nvidia-peermem` not loaded.	Mount `/dev/shm >= 8Gi`; enable `driver.rdma.enabled=true` for GPUDirect.
Operator pod CrashLoopBackOff after Helm upgrade	ClusterPolicy CRD schema drift.	Set `operator.upgradeCRD=true`; re-run upgrade; if still failing, delete CR then reapply.
DCGM Exporter returns empty metrics	DCGM hostengine not running, or NVML mismatch.	Restart `nvidia-dcgm-exporter` pod; check `dcgm.enabled=true`; verify driver version matches DCGM build.
Secure Boot host rejects driver module	Driver image not signed with MOK enrolled on host.	Either disable Secure Boot, or use signed driver image, or enrol MOK with operator's signing key.
Node drained but pods never reschedule elsewhere	GPU resource request not satisfied by any other node.	Check `kubectl describe pod`; verify other nodes have GPUs in matching profile.
Validator pod stuck Pending	Tolerations / nodeSelector mismatch.	Check the `nvidia-operator-validator` DaemonSet tolerations and nodeSelector; ensure node taints are tolerated.
Driver upgrade hangs on running training job	Driver DaemonSet cannot evict a privileged process holding the device.	Drain node manually with `--force --delete-emptydir-data`; never upgrade driver during a training run.
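
The MIG strategy mismatch row comes down to which resource name the device plugin advertises under each strategy. The sketch below encodes that mapping; `mig_resource_name` is a hypothetical helper for illustration, not a tool shipped with the operator.

```shell
# Hypothetical helper: resource name a pod should request for a MIG profile
# under the cluster's mig.strategy.
mig_resource_name() {
    strategy="$1"; profile="$2"
    case "$strategy" in
        mixed)       echo "nvidia.com/mig-${profile}" ;;   # one resource per slice size
        single|none) echo "nvidia.com/gpu" ;;              # slices (or whole GPUs) appear as plain GPUs
        *)           echo "unknown mig.strategy: $strategy" >&2; return 1 ;;
    esac
}

mig_resource_name single 1g.10gb
mig_resource_name mixed 1g.10gb
```

If a pending pod requests a name that differs from what this mapping gives for the cluster's strategy, no node will ever satisfy it.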
