TL;DR
- Helm-installed operator from NVIDIA (Apache 2.0, first GA in 2020) that bundles the driver container, NVIDIA Container Toolkit, k8s-device-plugin, DCGM Exporter, GPU Feature Discovery, MIG Manager, Node Feature Discovery and Sandbox Workloads into one reconciled stack.
- Eliminates the historically painful step of hand-installing kernel modules and runtime hooks on every GPU node — a single Helm release makes a fresh Kubernetes node GPU-ready in 3-7 minutes.
- Supports bare-metal kubeadm, EKS / GKE / AKS (driver mode opt-in), OpenShift, Rancher / RKE2, air-gapped clusters, signed driver containers for Secure Boot, vGPU on virtualised hypervisors, MIG partitioning on A100 / H100 / H200 / B200 and confidential-computing modes on Hopper and Blackwell.
- Surfaces `nvidia.com/gpu`, `nvidia.com/mig-*g.*gb`, `nvidia.com/gpu.shared` (MPS / time-slicing) and vGPU resource names to the scheduler, plus >150 DCGM metrics on `:9400/metrics` for Prometheus.
- Hard prerequisite for KServe, KubeRay, Kubeflow Training Operator, Volcano, Kueue, NVIDIA Dynamo, the Run:ai stack and every supported GPU path on the Yobitel sovereign tenancies (Yobibyte runs on top of the operator, not in place of it).
Overview
The NVIDIA GPU Operator is the canonical mechanism for turning a vanilla Kubernetes node into a GPU-schedulable node. Before the operator landed in 2020, exposing a single GPU to a pod required four hand-managed components on every host: a kernel driver matched to the exact Linux kernel, the NVIDIA Container Toolkit so the OCI runtime could inject device nodes and user-space libraries, the Kubernetes device plugin DaemonSet to advertise `nvidia.com/gpu`, and DCGM for telemetry. Each piece had its own upgrade cadence, its own packaging conventions and its own way of breaking after an unattended OS update. Fleets of more than a handful of nodes inevitably built bespoke configuration-management to keep the pieces in lockstep.
The operator replaces that mess with one Helm release. It runs every component as a containerised DaemonSet, watches node labels emitted by Node Feature Discovery (NFD) so it only touches GPU hosts, and orchestrates restart order when drivers change. A previously CPU-only node becomes a fully-schedulable GPU node within minutes of being labelled `nvidia.com/gpu.present=true`. The operator is the install path NVIDIA recommends, supports, and validates against in the NGC compatibility matrix; running anything else in production is now firmly outside the supported envelope.
By mid-2026 the operator is at v25.x, tracking CUDA 12.6 / 13.0 drivers (R565 / R570 series) and Kubernetes 1.27-1.33, and it supports the full Hopper / Hopper-X / Blackwell range plus the Ampere and Ada generations still in service. It is not a CNCF project (NVIDIA owns and ships it), but the source is open under Apache 2.0 and the bug tracker is public. Yobibyte runs the GPU Operator under the hood across every Yobitel NeoCloud region (UK London-1, EU Frankfurt-1, US-East), so customers consuming Yobibyte never install, version-pin or operate it themselves. This entry documents the production surface for teams that do own that responsibility on their own clusters, and helps Yobibyte customers recognise what the managed service does on their behalf.
Quick start
The fastest sane path on a bare-metal or cloud node group is the upstream Helm chart with MIG disabled and DCGM Exporter on. The four-command sequence below installs the operator, watches the daemonsets come up, exercises a CUDA workload and confirms the device plugin advertises GPUs to the scheduler. Run this against a fresh node group; do not attempt to install on a host that already has the proprietary driver bound to kernel modules — uninstall the host-side driver first.
```shell
# 1. Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# 2. Install the operator into its own namespace
helm install --wait gpu-operator \
  nvidia/gpu-operator \
  --version "v25.3.0" \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set gfd.enabled=true \
  --set mig.strategy=none

# 3. Watch the per-node validation pods turn Ready (~3-7 minutes)
kubectl -n gpu-operator get pods -w

# 4. Verify a node advertises GPUs and run a CUDA workload
kubectl describe node <gpu-node> | grep -E "nvidia.com/gpu|nvidia.com/mig"
cat <<'YAML' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata: { name: cuda-smoke-test }
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi"]
      resources: { limits: { nvidia.com/gpu: 1 } }
YAML
# Wait for the pod to finish before reading its output
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-smoke-test --timeout=120s
kubectl logs cuda-smoke-test
```

On EKS, GKE and AKS the cluster's node image may already ship a vendor-managed driver. Set `driver.enabled=false` to keep the host driver and let the operator manage only the toolkit, device plugin, DCGM Exporter and GFD. This is the recommended path on every managed Kubernetes service.
How it works
Internally the operator is a single Go controller built with operator-sdk that reconciles a cluster-scoped `ClusterPolicy` CR (or, since v23.6, settings installed straight from the Helm values). On reconcile it computes the desired DaemonSet, ConfigMap and RBAC objects for every sub-component, applies them only on nodes carrying the expected NFD labels (`feature.node.kubernetes.io/pci-10de.present=true`, `nvidia.com/gpu.present=true`), and walks restarts in a deterministic order: NFD first, then driver, then toolkit, then plugin / DCGM / GFD / MIG Manager. Validator pods between each step block the rollout if a stage fails. For example, the toolkit validator runs a CUDA hello-world inside a sample pod, and if it cannot allocate a device, the device plugin DaemonSet is not started.
The driver container is the most distinctive piece. NVIDIA publishes a per-OS / per-kernel image (Ubuntu 22.04 / 24.04, RHEL 8 / 9, SUSE SLES 15, Rocky 9, Flatcar) which builds or loads the kernel module on first run, mounts the user-space libraries under `/run/nvidia/driver`, and exposes them to other DaemonSets via host path. The Container Toolkit (containerd / CRI-O / Docker shim) is then configured to call into that hostPath at OCI hook time. This means a node never needs a hand-installed `.run` file or `dkms` build — every kernel module compilation happens inside a privileged container with the right toolchain pinned.
Device exposure to pods is done by the Kubernetes device plugin. On each node it discovers physical GPUs, MIG instances or vGPU slices, advertises them under their respective resource names, and on pod admission writes the right `NVIDIA_VISIBLE_DEVICES` value and bind-mounts the user-space libraries into the container filesystem. GPU Feature Discovery layers richer labels (architecture, memory size, compute capability, MIG capable, NVLink topology) so workloads can target the precise hardware they need via `nodeSelector` or `nodeAffinity`.
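As a sketch of the hardware targeting described above, the snippet below writes a Pod manifest that pins a workload to one GPU model through a GFD label. `nvidia.com/gpu.product` is a label GFD emits; the value `NVIDIA-H100-80GB-HBM3`, the pod name and the file name are assumptions for illustration, so read the real value from `kubectl get node <gpu-node> --show-labels` before relying on it.

```shell
# Write a Pod manifest that selects nodes by a GFD-emitted label.
# Assumption: the product value below is illustrative; use the value your
# own nodes actually carry.
cat > gfd-selector-pod.yaml <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: gfd-selector-demo          # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1
YAML
# Apply with: kubectl apply -f gfd-selector-pod.yaml
```

The GPU limit is what triggers device allocation; the `nodeSelector` only narrows which nodes are eligible.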
- Reconciliation loop — operator pod watches `ClusterPolicy`, NFD-labelled nodes and component DaemonSet status; no etcd state outside Kubernetes.
- Driver mode `auto` — drops the driver container if `nvidia-smi` already works on the host; the recommended default since v24.6.
- Pre-compiled driver (`driver.usePrecompiled=true`) skips the in-container DKMS build and cuts first-boot time on stable kernels from 4-6 minutes to roughly 60-90 seconds.
- Sandbox Workloads — opt-in support for KubeVirt VMs needing PCI passthrough; uses the `vfio-pci` driver path instead of the standard kernel module.
- Confidential Computing — `sandboxWorkloads.defaultWorkload=vm-passthrough` + Hopper / Blackwell CC mode encrypts PCIe traffic between CPU and GPU.
- Helm subchart for NFD — installs Node Feature Discovery if the cluster does not already run it; can be disabled with `nfd.enabled=false`.
The operator does not own the kernel. On nodes where another agent (cloud-init, Ansible, an immutable OS image) installs a different driver out of band, the operator's validator will fail. Either run the operator in `driver.enabled=false` mode and let it manage only the upper layers, or strip the alternative install path from your provisioning pipeline.
Reference: Helm values
The Helm chart exposes ~200 values; the table below covers the ones that matter on every install. Defaults are taken from chart `v25.3.0`. Every value can also be set via a `ClusterPolicy` CR if you prefer to drive the operator declaratively from Argo CD or Flux.
| Helm key | Type | Default | Purpose |
|---|---|---|---|
| driver.enabled | bool | true | Run the driver DaemonSet. Set false on managed K8s with vendor driver. |
| driver.version | string | (chart-pinned) | Pin a specific driver, e.g. `570.124.06`. Must match CUDA matrix. |
| driver.usePrecompiled | bool | false | Use NVIDIA's prebuilt driver images; skips in-container DKMS. |
| driver.repository / driver.image | string | nvcr.io/nvidia/driver | Image source — switch for air-gapped mirrors. |
| driver.startupProbe.initialDelaySeconds | int | 60 | Raise on slow storage / IB-only mgmt networks. |
| driver.rdma.enabled | bool | false | Install nvidia-peermem for GPUDirect RDMA over InfiniBand. |
| toolkit.enabled | bool | true | Install / configure containerd or CRI-O nvidia runtime. |
| toolkit.version | string | (chart-pinned) | Container Toolkit version pin, e.g. `1.16.2-ubuntu20.04`. |
| devicePlugin.enabled | bool | true | Run k8s-device-plugin DaemonSet. |
| devicePlugin.config.name | string | (none) | Reference a ConfigMap with MPS / time-slicing config. |
| mig.strategy | string | none | `none`, `single` or `mixed`. See MIG section. |
| migManager.enabled | bool | true | Reconcile MIG profiles from `nvidia.com/mig.config` node label. |
| dcgmExporter.enabled | bool | true | Run DCGM Exporter on :9400, surfacing >150 GPU metrics. |
| dcgmExporter.config.name | string | (default csv) | Override the metrics CSV for custom Prometheus emit. |
| dcgm.enabled | bool | true | Run nv-hostengine inside the cluster instead of in driver container. |
| gfd.enabled | bool | true | GPU Feature Discovery — emits per-node hardware labels. |
| nfd.enabled | bool | true | Install Node Feature Discovery subchart. |
| operator.runtimeClass | string | nvidia | RuntimeClass exposed to workloads for explicit selection. |
| operator.defaultRuntime | string | containerd | `containerd`, `crio` or `docker`. |
| validator.image / validator.repository | string | nvcr.io/nvidia/cloud-native/gpu-operator-validator | Validator image source. |
| sandboxWorkloads.enabled | bool | false | KubeVirt passthrough mode; install vfio-pci-manager. |
| sandboxWorkloads.defaultWorkload | string | container | `container`, `vm-passthrough` or `vm-vgpu`. |
| vfioManager.enabled | bool | false | Manage vfio-pci bindings for passthrough GPUs. |
| vgpuManager.enabled | bool | false | Install NVIDIA vGPU host driver + licence client. |
| vgpuDeviceManager.enabled | bool | false | Reconcile vGPU per-node profile from `nvidia.com/vgpu.config`. |
| mps.enabled (via devicePlugin sharing) | bool | false | Enable Multi-Process Service for fractional GPUs. |
| timeSlicing.replicas | int | 1 | Time-slice a GPU into N logical resources; software isolation only. |
| psp.enabled | bool | false | Generate PodSecurityPolicies — deprecated; use Pod Security Standards. |
| cdi.enabled | bool | true | Container Device Interface (CDI) generation — the future of OCI device wiring. |
| nodeSelector / affinity / tolerations | map | {} | Constrain operator + DaemonSets to specific node pools. |
| validator.driver.env.WITH_WORKLOAD | bool | true | Run CUDA hello-world as the driver validator (recommended). |
| dcgmExporter.serviceMonitor.enabled | bool | false | Create Prometheus Operator ServiceMonitor automatically. |
| operator.upgradeCRD | bool | true | Allow helm upgrade to update ClusterPolicy CRD. |
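For GitOps-style installs, the same keys can live in a values file instead of a chain of `--set` flags. The sketch below mirrors the quick-start flags one for one; the file name is an assumption, and the keys are limited to ones listed in the table above.

```shell
# Write a values file equivalent to the quick-start --set flags.
cat > gpu-operator-values.yaml <<'YAML'
driver:
  enabled: true
toolkit:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
gfd:
  enabled: true
mig:
  strategy: none
YAML
# Install or upgrade with the same chart and version as the quick start:
#   helm upgrade --install gpu-operator nvidia/gpu-operator \
#     --version "v25.3.0" -n gpu-operator --create-namespace \
#     -f gpu-operator-values.yaml
```

Keeping the file in Git gives each change to the operator configuration a reviewable diff.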
`driver.enabled=true` plus a vendor-managed driver already on the host (EKS Bottlerocket, GKE Container-Optimized OS) will deadlock — the driver DaemonSet cannot unload the in-use module. Pick one or the other before the first install; switching after the fact requires draining and reimaging the node.
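The `devicePlugin.config.name` row in the Helm values table points at a ConfigMap holding the sharing configuration. A minimal time-slicing sketch follows, modelled on the format in NVIDIA's GPU sharing documentation; the ConfigMap name, the `any` data key and the replica count of 4 are assumptions to adapt, and time-slicing gives software isolation only.

```shell
# Write a device-plugin sharing config that advertises each GPU 4 times.
# Assumptions: ConfigMap name, the "any" key and replicas are illustrative.
cat > time-slicing-config.yaml <<'YAML'
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
YAML
# Apply, then point the device plugin at it:
#   kubectl apply -f time-slicing-config.yaml
#   helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
#     --set devicePlugin.config.name=time-slicing-config \
#     --set devicePlugin.config.default=any
```

With this in place a node with one physical GPU would be expected to advertise four schedulable units, all sharing the same memory and fault domain.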
Workload patterns
Three deployment patterns cover the bulk of production installs. The first is bare-metal or self-managed Kubernetes (kubeadm, RKE2, Talos, OpenShift) where the operator owns the entire driver + toolkit + plugin stack. The second is cloud-managed Kubernetes (EKS, GKE, AKS) where the vendor ships a node image with the driver already baked in and the operator manages only the upper layers. The third is multi-tenant MIG, where the operator's MIG Manager reconciles hardware partitioning from a per-node label so each tenant sees only the slice it has been allocated.
Pattern A: bare-metal cluster bootstrap. The operator brings up everything. Set `driver.enabled=true`, pick a `driver.version` matched to your Linux kernel, and use `driver.usePrecompiled=true` if your kernel is on a stable LTS line. This is the canonical path on Yobitel-operated sovereign clusters and on most on-premises NVIDIA-Certified Systems.
Pattern B: cloud-managed K8s with vendor driver. EKS, GKE and AKS each ship optimised node images. Set `driver.enabled=false`; the operator then runs only toolkit, plugin, DCGM Exporter and GFD. Crucially, on EKS you must still match the vendor driver to the CUDA runtime your workload expects: a CUDA 13 workload will fail to initialise on a 535-series driver.
Pattern C: MIG-partitioned multi-tenant. The operator installs with `mig.strategy=single` (every GPU on the node carries the same uniform profile) or `mixed` (heterogeneous profiles per GPU). The cluster operator labels each node with `nvidia.com/mig.config=all-1g.10gb` or similar and the MIG Manager applies the partition. Under `mixed`, tenants request slices through the precise resource name (`nvidia.com/mig-1g.10gb`); under `single`, MIG devices are advertised as plain `nvidia.com/gpu`.
```shell
# A: bare-metal kubeadm cluster, operator owns the driver
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set driver.version="570.124.06" \
  --set driver.usePrecompiled=true \
  --set mig.strategy=single \
  --set dcgmExporter.enabled=true

# B: EKS / GKE / AKS with vendor-managed driver
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set mig.strategy=single

# C: MIG-partitioned multi-tenant cluster (apply on each H100 node)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
kubectl label node h100-node-02 nvidia.com/mig.config=all-2g.20gb --overwrite
kubectl label node h100-node-03 nvidia.com/mig.config=mixed --overwrite

# Verify the reconciled state
kubectl get node h100-node-01 -o jsonpath='{.status.allocatable}' | jq .
```

For Pattern C, group MIG nodes into separate node pools (e.g. one pool per profile) and use taints + tolerations to keep tenant workloads on the right slice. Mixing MIG profiles within a single pool produces hours of pending-pod investigation later.
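A tenant workload for Pattern C might look like the sketch below: it requests one MIG slice by resource name and tolerates the taint on its pool. The taint key `mig-profile`, its value, the pod name and the file name are hypothetical; only the resource name comes from this section. On a cluster running `mig.strategy=single`, the troubleshooting table's advice is to request `nvidia.com/gpu: 1` instead.

```shell
# Write a Pod manifest that requests a single 1g.10gb MIG slice.
# Assumption: the pool is tainted mig-profile=1g.10gb:NoSchedule (hypothetical).
cat > mig-tenant-pod.yaml <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: mig-tenant-demo            # hypothetical name
spec:
  restartPolicy: Never
  tolerations:
    - key: mig-profile             # hypothetical taint key
      operator: Equal
      value: 1g.10gb
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
YAML
# Apply with: kubectl apply -f mig-tenant-pod.yaml
```

If the pod stays Pending, compare the requested resource name against the node's `.status.allocatable` output from the verify step.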
Sizing and capacity planning
The operator itself is cheap. The operator pod runs once per cluster (~200 mCPU, 256 MiB at idle). Per node the DaemonSets cost roughly 0.5-1.0 vCPU and 2-3 GiB of memory, dominated by the driver container holding the kernel module and the user-space libraries in shared memory. DCGM Exporter scrapes are CPU-light but emit ~30-60 KiB per node per scrape. At a 15s interval that is ~2-4 KiB/s per node, or roughly 2-4 MiB/s per 1,000 nodes into Prometheus, still well inside any reasonable retention budget. Where sizing matters is the host filesystem (the driver container caches kernel artefacts) and the `/dev/shm` budget for downstream training workloads using NCCL.
- Plan `/dev/shm` ≥ 8 GiB on every GPU node — NCCL multi-rank training and tensor-parallel inference (vLLM, TensorRT-LLM) fail without it.
- Driver image is ~3 GiB; pre-pull to a local registry mirror in air-gapped or restricted-bandwidth environments.
- `driver.usePrecompiled=true` cuts first-boot time from 4-6 minutes to ~60-90 seconds, at the cost of needing NVIDIA precompiled images for your kernel.
- DCGM Exporter scrape cost scales linearly with GPU count per node; on 8x H100 nodes expect 80-120 KiB per scrape.
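At the pod level, the usual way to give a workload a larger `/dev/shm` is a memory-backed `emptyDir` mounted over it. The sketch below applies the 8 GiB figure from the list above; the pod, volume and file names are hypothetical, and the container simply prints the mount size so the result can be checked.

```shell
# Write a Pod manifest that mounts an 8 GiB memory-backed volume at /dev/shm.
cat > shm-pod.yaml <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: nccl-shm-demo              # hypothetical name
spec:
  restartPolicy: Never
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory             # tmpfs, counted against pod memory
        sizeLimit: 8Gi
  containers:
    - name: trainer
      image: nvcr.io/nvidia/cuda:12.6.0-base-ubuntu24.04
      command: ["df", "-h", "/dev/shm"]
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
      resources:
        limits:
          nvidia.com/gpu: 1
YAML
# Apply with: kubectl apply -f shm-pod.yaml
```

Because the volume is tmpfs, whatever the job writes there counts towards the container's memory usage, so size memory limits with that in mind.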
| Component | CPU / node | Memory / node | Disk | Notes |
|---|---|---|---|---|
| nvidia-driver-daemonset | 200-400 mCPU | 2.0-2.5 GiB | 1.5-2.5 GiB /run/nvidia | Higher with usePrecompiled=false during DKMS build. |
| nvidia-container-toolkit-daemonset | 50-100 mCPU | 128-256 MiB | <100 MiB | Configures containerd / CRI-O once then idles. |
| nvidia-device-plugin-daemonset | 50-100 mCPU | 128-256 MiB | <50 MiB | Light gRPC server speaking kubelet device plugin API. |
| nvidia-dcgm-exporter | 100-300 mCPU | 256-512 MiB | <50 MiB | Scrapes every 15s; metrics surface on :9400. |
| gpu-feature-discovery | 50 mCPU | 128 MiB | <50 MiB | Emits labels once per node startup + on driver change. |
| nvidia-mig-manager (if MIG) | 50 mCPU | 128 MiB | <50 MiB | Reacts to `nvidia.com/mig.config` label changes. |
| nfd (cluster-wide) | 100 mCPU master + 50 mCPU/node | 256 MiB + 128 MiB/node | <50 MiB | Skip if NFD already installed cluster-wide. |
| gpu-operator (controller) | 100-300 mCPU | 256-512 MiB | n/a | Single pod per cluster; spikes during reconcile. |
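The scrape-traffic estimate in this section is plain arithmetic: payload per node, times node count, divided by the scrape interval. The sketch below works it through with the 30-60 KiB and 15s figures quoted above; the variable names are illustrative.

```shell
# Estimate aggregate DCGM Exporter scrape traffic into Prometheus.
nodes=1000        # nodes being scraped
interval_s=15     # scrape interval in seconds
kib_low=30        # payload per node per scrape, low estimate (KiB)
kib_high=60       # payload per node per scrape, high estimate (KiB)

low_kib_s=$(( nodes * kib_low / interval_s ))
high_kib_s=$(( nodes * kib_high / interval_s ))
echo "aggregate scrape traffic: ${low_kib_s}-${high_kib_s} KiB/s"
# prints "aggregate scrape traffic: 2000-4000 KiB/s"
```

Change `nodes` or `interval_s` to see how a shorter interval or a larger fleet moves the total.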
Observability
DCGM Exporter is the operator's eyes on the GPU. It exposes a Prometheus endpoint on `:9400/metrics` covering utilisation, memory, power, ECC errors, thermals, NVLink throughput, NVSwitch counters, MIG per-instance utilisation and a long tail of vendor metrics. The default config emits the high-signal subset (~30 metrics); a `dcgmExporter.config.name` ConfigMap can switch on the full ~150-metric CSV. The operator itself exports controller health on `:8080/healthz` and reconciliation metrics on `:8080/metrics` (Prometheus format). The validator pods log a structured one-line outcome that is the easiest first signal when a node fails to come Ready.
The metrics worth alerting on are GPU utilisation, GPU memory usage, ECC counter rate, thermal slowdown bits, and the operator's reconcile-error counter. The rules below are the minimum production set; refine per-tenant once you know your normal floor.
- DCGM_FI_DEV_GPU_UTIL — coarse SM occupancy proxy; low + decode-heavy = Python overhead, high + idle queue = SLO under threat.
- DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — frame buffer (VRAM) usage; alert at 95% sustained.
- DCGM_FI_DEV_POWER_USAGE — wattage; sustained near TDP cap means the node is throttling.
- DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP — temperatures; alert on thermal slowdown trip count.
- DCGM_FI_DEV_ECC_DBE_VOL_TOTAL — uncorrected ECC errors; a non-zero rate retires the GPU.
- DCGM_FI_PROF_NVLINK_TX_BYTES / RX_BYTES — NVLink throughput; critical for multi-GPU collectives.
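- DCGM_FI_DEV_MIG_MODE reports whether MIG mode is enabled on the GPU; per-instance utilisation, the chargeback signal for multi-tenant clusters, comes from the profiling metrics (for example DCGM_FI_PROF_GR_ENGINE_ACTIVE) broken out by the exporter's GPU-instance labels.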
- gpu_operator_reconcile_total / _errors_total — operator-side health; non-zero error rate means a sub-component is failing.
```yaml
# Prometheus alerts for a GPU Operator deployment
groups:
  - name: gpu-operator-sla
    interval: 30s
    rules:
      - alert: GPUMemoryNearFull
        expr: avg by (node, gpu) (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.node }} above 95% VRAM"
      - alert: GPUEccUncorrectableRising
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Uncorrectable ECC error on {{ $labels.gpu }} - retire and RMA"
      - alert: GPUThermalSlowdown
        expr: DCGM_FI_DEV_THERMAL_VIOLATION > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Thermal slowdown on {{ $labels.gpu }} - check airflow / inlet temp"
      - alert: GPUOperatorReconcileFailing
        expr: rate(gpu_operator_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }
        annotations:
          summary: "GPU Operator reconcile errors on cluster - investigate operator pod"
      - alert: NvidiaDevicePluginDown
        expr: kube_daemonset_status_number_unavailable{daemonset="nvidia-device-plugin-daemonset"} > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Device plugin DaemonSet has unavailable replicas - GPUs may not schedule"
```

Ship the NVIDIA-published Grafana dashboards (IDs `12239` and `19725` on grafana.com) on day one. They give a complete GPU fleet view out of the box and are the de facto baseline every SRE expects to find when triaging an incident.
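The metrics override mentioned earlier in this section (`dcgmExporter.config.name`) expects a ConfigMap containing a counters CSV. The sketch below writes a small CSV in the DCGM Exporter format (field, Prometheus type, help text) using only fields named in this section; the file name and the ConfigMap name `metrics-config` are assumptions, and depending on the chart version the exporter may also need to be pointed at the file, so check NVIDIA's documentation for your release.

```shell
# Write a custom DCGM Exporter counters file.
# Lines starting with '#' are comments in this format.
cat > dcgm-metrics.csv <<'CSV'
# DCGM field, Prometheus metric type, help text
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total double-bit volatile ECC errors.
CSV
# Publish it and reference it from the chart (names are assumptions):
#   kubectl create configmap metrics-config -n gpu-operator \
#     --from-file=dcgm-metrics.csv
#   helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values \
#     --set dcgmExporter.config.name=metrics-config
```

Any metric used in an alert rule has to be present in this file, otherwise the rule evaluates against an empty series and never fires.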
Cost and FinOps
The operator software is free (Apache 2.0). The cost surface is operational: pre-pull bandwidth for driver images, control-plane overhead for DCGM scrapes, and the cluster-management labour the operator either saves or adds. In practice the operator is a clear net positive — replacing a six-step Ansible playbook plus a five-page runbook with one Helm release pays back within the first kernel upgrade incident the team avoids.
- Image pull bandwidth — driver images ~3 GiB, toolkit ~600 MiB, DCGM Exporter ~200 MiB. On 100 nodes that is ~380 GiB per major version bump. Mirror to a private registry to keep $0.05/GB egress charges contained.
- Prometheus retention — DCGM Exporter at 15s interval costs ~50 MB/day per 8x H100 node retained 30 days. For a 100-node H100 fleet that is roughly 150 GB of TSDB, $3-6/month on object-storage backends like Mimir or Thanos.
- Per-node CPU/RAM tax — ~0.5-1.0 vCPU + 2-3 GiB. On a 96-vCPU node this is <1% overhead; on a smaller utility node it can be 5-8%. Plan node pools accordingly.
- Operational savings — every avoided kernel-mismatch incident saves 2-8 engineering hours; in a 100-node fleet that compounds quickly. The break-even point against hand-rolled provisioning is typically within the first 90 days.
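The image-pull and retention figures in the list above can be reproduced with a few lines of integer arithmetic. The sketch below uses the sizes quoted in the bullets (3 GiB taken as 3072 MiB); variable names are illustrative and the results are rounded down.

```shell
# Reproduce the fleet-level image-pull and TSDB estimates.
nodes=100
driver_mib=3072       # ~3 GiB driver image
toolkit_mib=600       # ~600 MiB toolkit image
dcgm_mib=200          # ~200 MiB DCGM Exporter image

per_node_mib=$(( driver_mib + toolkit_mib + dcgm_mib ))
fleet_gib=$(( nodes * per_node_mib / 1024 ))

mb_per_node_day=50    # DCGM Exporter TSDB cost per 8x H100 node per day
retention_days=30
tsdb_gb=$(( nodes * mb_per_node_day * retention_days / 1000 ))

echo "image pull per version bump: ~${fleet_gib} GiB"
echo "TSDB at ${retention_days}d retention: ~${tsdb_gb} GB"
```

The first line comes out at 378 GiB, in line with the ~380 GiB quoted above, and the second at 150 GB.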
Security and compliance
Every component the operator deploys runs privileged or near-privileged: kernel module loading, /dev mounts and /sys access are mandatory for the driver and toolkit DaemonSets. This is not negotiable, because the driver's kernel modules cannot be loaded without those privileges. The compensating controls are: pin every image to a SHA digest (not `:latest`), constrain the operator namespace with Pod Security Standards Restricted on everything except the named operator-owned DaemonSets, and use admission policies (Kyverno / OPA Gatekeeper) to block pod requests that bypass the device plugin and try to mount `/dev/nvidia*` directly.
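One way to express the last control is a Kyverno validation policy that rejects `hostPath` volumes under `/dev/nvidia*` outside the operator namespace. The sketch below follows the pattern Kyverno's sample policies use for blocking specific host paths; the policy and rule names are hypothetical, and it covers `hostPath` volumes only, not privileged containers. Note that this is Kyverno's `ClusterPolicy` kind (`kyverno.io/v1`), unrelated to the operator's CR of the same name.

```shell
# Write a Kyverno policy that blocks direct /dev/nvidia* hostPath mounts.
# Assumption: the operator runs in the gpu-operator namespace, which is exempt.
cat > block-nvidia-hostpath.yaml <<'YAML'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: block-nvidia-hostpath      # hypothetical policy name
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: deny-dev-nvidia-hostpath
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - gpu-operator
      validate:
        message: "Request GPUs via nvidia.com/gpu limits; hostPath mounts of /dev/nvidia* are blocked."
        pattern:
          spec:
            =(volumes):
              - =(hostPath):
                  path: "!/dev/nvidia*"
YAML
# Apply with: kubectl apply -f block-nvidia-hostpath.yaml
```

Running the policy in audit mode first is a sensible way to find existing workloads that would be rejected before enforcing it.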
For Secure Boot environments, NVIDIA publishes signed driver images. For UK central-government OFFICIAL workloads, the operator can be configured with a customer-controlled Machine Owner Key (MOK) for kernel module signing. Confidential computing on Hopper / Blackwell encrypts PCIe DMA traffic and is supported via `sandboxWorkloads.enabled=true` plus the appropriate firmware mode — a hard requirement for some sovereign workloads.
Regulatory implications are mostly indirect: the operator is infrastructure, not a data plane. For NCSC Cloud Security Principles, the relevant principles are 2 (Asset protection and resilience — encrypted PCIe under CC mode), 5 (Operational security — operator's reconciliation provides drift detection and patch baseline), and 9 (Secure user management — operator scopes itself to a single namespace with explicit RBAC). For GDPR Article 32, the operator processes no personal data. For SOC 2 / ISO 27001, the operator's GitOps-friendly install path is the evidence trail (every version bump is a Git commit).
Do not run the operator's driver DaemonSet on the same nodes as workloads that mount `hostPath: /lib/modules` for their own driver build. The two will race during kernel updates and one will leave the node in a half-installed state. Segregate by node pool.
Migration and alternatives
Most production migrations to the operator come from one of three origins: hand-rolled Ansible / Puppet / Chef installing driver + toolkit + plugin separately, the deprecated standalone NVIDIA Kubernetes device plugin manifest, or a cloud-vendor's bundled GPU AMI. The migration effort is shallow but the rollout sequence matters — get it wrong on a live cluster and you will lose all GPU scheduling for the duration of the cutover.
The canonical playbook is: roll the operator into the cluster with `driver.enabled=false`, drain a single canary node, uninstall the host-side driver from that node, relabel it `nvidia.com/gpu.deploy.driver=true`, watch the operator's driver DaemonSet come up and the validator pass, then iterate across the fleet. Reverse the sequence to roll back. The table below summarises the path from each common starting point.
| From | Effort | Risk | Notes |
|---|---|---|---|
| Hand-rolled Ansible + standalone device plugin | Low | Medium | Reversible per node. Strip Ansible's driver tasks first. |
| Standalone k8s-device-plugin manifest | Trivial | Low | Operator's plugin DaemonSet replaces it; remove old manifest. |
| NVIDIA AI Enterprise installer | Low | Low | Same operator under the hood; just a chart-source switch. |
| EKS GPU AMI (bundled driver) | Low | Low | Set `driver.enabled=false`; operator manages upper layers. |
| GKE / AKS with vendor driver | Low | Low | Same as EKS — operator runs in driver-disabled mode. |
| Bottlerocket / Container-Optimized OS image | Low | Low | Driver baked into OS; operator owns toolkit / plugin / DCGM. |
| KubeVirt + manual vfio-pci | Medium | Medium | Enable `sandboxWorkloads.enabled=true`; reuse PCI bindings. |
| Run:ai pre-acquisition installer | Trivial | Low | Run:ai now ships on top of the operator post-NVIDIA acquisition. |
| vs Yobibyte managed alternative | n/a | n/a | If you would rather not own the operator at all, Yobibyte consumes GPUs through Yobitel-operated tenancies where the operator, driver baseline and DCGM scrape are already configured per region — customers see only the workspace surface. |
```shell
#!/usr/bin/env bash
# Migration script: per-node cutover from standalone driver to operator
NODE="${1:?usage: $0 <node-name>}"

# 1. Cordon and drain
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --force

# 2. Remove old host driver (Ubuntu example)
# The remote commands are quoted as a whole so globs expand on the node,
# not on the machine running this script.
ssh "$NODE" "sudo apt-get purge -y 'nvidia-*' 'libnvidia-*' 'cuda-drivers*'"
ssh "$NODE" 'sudo rm -rf /usr/local/cuda* /var/lib/dkms/nvidia*'
ssh "$NODE" 'sudo reboot' || true   # the SSH session drops as the host goes down

# 3. Label the node for the operator
kubectl wait --for=condition=Ready node/"$NODE" --timeout=10m
kubectl label node "$NODE" nvidia.com/gpu.deploy.driver=true --overwrite

# 4. Watch the operator's DaemonSets come up (Ctrl-C to continue)
kubectl -n gpu-operator get pods --field-selector spec.nodeName="$NODE" -w

# 5. Uncordon once validator pod is Complete
kubectl uncordon "$NODE"
```

Troubleshooting
The error table below covers the failure modes that account for roughly 85% of production GPU Operator incidents observed on Yobitel-operated fleets and the upstream issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom | Cause | Fix |
|---|---|---|
| Driver pod CrashLoopBackOff after OS update | Kernel version no longer matches driver build target. | Pin kernel package; or `driver.usePrecompiled=true`; or pin driver to an image tag matching the new kernel. |
| `nvidia.com/gpu` resource not appearing on node | Device plugin DaemonSet not scheduled, or validator failed. | Check NFD labels on node; check `nvidia-operator-validator` pod logs; ensure `feature.node.kubernetes.io/pci-10de.present=true`. |
| Pods schedule but `nvidia-smi` fails inside container | containerd not configured with `nvidia` runtime class. | Restart toolkit DaemonSet; verify `/etc/containerd/config.toml` has `nvidia` runtime; restart containerd. |
| MIG strategy mismatch — pods pending forever | `mig.strategy=single` but pod requests `nvidia.com/mig-1g.10gb`. | Either switch the cluster to `mixed` strategy or change pod to request `nvidia.com/gpu: 1`. |
| MIG repartition leaves node NotReady | Driver reload failed mid-partition. | `kubectl logs -n gpu-operator -l app=nvidia-mig-manager`; manually run `nvidia-smi mig -dci` then `nvidia-smi mig -dgi` (compute instances must be destroyed before GPU instances) and re-label. |
| NCCL hang on first multi-GPU job | /dev/shm too small or `nvidia-peermem` not loaded. | Mount `/dev/shm >= 8Gi`; enable `driver.rdma.enabled=true` for GPUDirect. |
| Operator pod CrashLoopBackOff after Helm upgrade | ClusterPolicy CRD schema drift. | Set `operator.upgradeCRD=true`; re-run upgrade; if still failing, delete CR then reapply. |
| DCGM Exporter returns empty metrics | DCGM hostengine not running, or NVML mismatch. | Restart `nvidia-dcgm-exporter` pod; check `dcgm.enabled=true`; verify driver version matches DCGM build. |
| Secure Boot host rejects driver module | Driver image not signed with MOK enrolled on host. | Either disable Secure Boot, or use signed driver image, or enrol MOK with operator's signing key. |
| Node drained but pods never reschedule elsewhere | GPU resource request not satisfied by any other node. | Check `kubectl describe pod`; verify other nodes have GPUs in matching profile. |
| Validator pod stuck Pending | Tolerations / nodeSelector mismatch. | Check the `nvidia-operator-validator` DaemonSet tolerations; ensure node taints are tolerated. |
| Driver upgrade hangs on running training job | Driver DaemonSet cannot evict a privileged process holding the device. | Drain node manually with `--force --delete-emptydir-data`; never upgrade driver during a training run. |
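The first checks in the table above can be bundled into a small helper. The sketch below only gathers information and changes nothing; the function name is hypothetical, and the `app=nvidia-operator-validator` label selector is an assumption to confirm with `kubectl -n gpu-operator get pods --show-labels`.

```shell
# Collect first-line triage data for one GPU node (read-only).
# Assumptions: operator namespace is gpu-operator; validator pods carry the
# label app=nvidia-operator-validator.
gpu_triage() {
  node="$1"
  # 1. NFD / operator labels that gate DaemonSet scheduling
  kubectl get node "$node" --show-labels | tr ',' '\n' | grep -E 'pci-10de|nvidia.com/gpu'
  # 2. Resources the device plugin has advertised to the scheduler
  kubectl get node "$node" -o jsonpath='{.status.allocatable}'; echo
  # 3. Operator-managed pods running on this node
  kubectl -n gpu-operator get pods -o wide --field-selector spec.nodeName="$node"
  # 4. Recent validator output (cluster-wide selector, not filtered per node)
  kubectl -n gpu-operator logs -l app=nvidia-operator-validator --all-containers --tail=50
}
# Usage: gpu_triage <gpu-node>
```

An empty result from step 1 points at NFD labelling; labels present but no `nvidia.com/gpu` in step 2 points at the device plugin or a failed validator.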
Where this fits in the Yobitel stack
The NVIDIA GPU Operator is the foundation under every GPU-bearing Kubernetes node in the Yobitel estate. Whether the workload is a Yobibyte-managed inference endpoint, a Yobitel GPU Cloud bare-metal tenant cluster, an Edge AI node at a customer site, or a sovereign Yobitel UK London-1 region, the operator is what makes `nvidia.com/gpu` schedulable. Yobitel does not maintain a forked or replacement driver layer — the value is added above the operator, in the Yobibyte control plane and Omniscient Compute scoring, not in re-implementing what NVIDIA already ships and supports.
On Yobitel-managed clusters the operator is installed via GitOps from the platform's standard Argo CD root, with values templated per region (driver version pinned to the regional kernel baseline, DCGM Exporter wired into the regional Prometheus / Mimir stack, MIG strategy chosen per tenant SKU). On customer-managed clusters where Yobitel provides Managed Operations, the operator is the first thing installed and the last thing touched — incidents almost always resolve to a layer above, and stability of the operator stack is treated as a precondition for SLA computation.
For UK and EU sovereign workloads, the operator runs on Yobitel tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. Signed driver containers under customer-controlled MOK, confidential computing modes on Hopper and Blackwell, and air-gapped install paths are all supported. The combination of an open-source operator, sovereign hardware, and transparent benchmarking is what lets Yobitel customers run production GPU workloads on Kubernetes without ceding their kernel baseline to a hosted vendor.