TL;DR
- Open-source (Apache 2.0) Prometheus exporter from NVIDIA that wraps the Data Center GPU Manager (DCGM) library and exposes a configurable set of GPU telemetry fields as `DCGM_FI_*` metrics on TCP port 9400.
- Ships as a Go binary and the `nvcr.io/nvidia/k8s/dcgm-exporter` container image; the NVIDIA GPU Operator installs it as a DaemonSet on every node labelled `nvidia.com/gpu.present=true`, with a ServiceMonitor wired into Prometheus.
- Default counter set covers SM occupancy, Tensor Core pipe activity, framebuffer use, power, temperature, ECC error counters, NVLink and PCIe throughput, and per-MIG-instance breakdowns when MIG is enabled.
- Joins with `gpu-feature-discovery` and cAdvisor labels to produce per-pod and per-namespace GPU attribution — the basis of every GPU FinOps dashboard, capacity plan, and noisy-neighbour incident analysis.
- The metrics backbone behind Yobitel's GPU fleet observability, InferenceBench scoring runs, and the per-tenant utilisation breakdown surfaced inside the Yobibyte console.
Overview#
DCGM Exporter is the thin Prometheus adapter NVIDIA ships on top of Data Center GPU Manager (DCGM), the official daemon and library for monitoring and managing data-centre GPUs. Where DCGM itself is a C/Python/Go API plus the `nv-hostengine` daemon that aggregates telemetry from NVML, the GPU driver and the GPU firmware, DCGM Exporter is the small Go binary that selects a subset of DCGM field IDs, polls them on a configurable cadence, and serves them at `/metrics` in the standard Prometheus text exposition format.
The exporter does not invent metrics. Every series it emits carries a `DCGM_FI_*` name corresponding to a published DCGM field, and the values come from the same counters that `dcgmi dmon` and any other DCGM-aware tool read (`nvidia-smi dmon` reads the overlapping device-level values through NVML rather than DCGM). That stability is why it has become the default: the field IDs do not move between driver releases, the namespace is consistent across H100, H200, B200, L40S and A100 (with partial coverage on consumer cards), and the same dashboards work across those SKUs wherever the underlying field is supported.
On Kubernetes it is the standardised telemetry source for the NVIDIA GPU Operator, kube-prometheus-stack, KServe, vLLM, NVIDIA Run:ai, and every Helm chart that emits a Grafana dashboard for GPU workloads. On bare metal and Slurm it runs as a systemd unit on each GPU host. Yobibyte ships DCGM Exporter on every region by default; customer Prometheus instances scrape directly through a tenant-scoped federation endpoint, and Yobitel NeoCloud worker nodes carry the same DaemonSet feeding the central observability stack that powers customer dashboards.
This entry documents the production surface: container deployment via the GPU Operator, the counter set you should run in steady state versus benchmarking, MIG and per-pod attribution, the alerting rules that actually catch incidents, cardinality and overhead trade-offs, migration from `nvidia-smi` polling, and the troubleshooting matrix for the failure modes that account for almost all real outages. Use it to wire up GPU telemetry on your own cluster, or to read what Yobibyte exposes to your Prometheus by default.
Quick start#
The example below installs DCGM Exporter through the NVIDIA GPU Operator on a Kubernetes cluster, exposes a ServiceMonitor for kube-prometheus-stack, and verifies the metrics endpoint from a debug pod. The second block is the standalone Helm chart for clusters that do not run the full GPU Operator. The third block is the equivalent Docker invocation on a bare-metal host without Kubernetes.
# 1. Install via the NVIDIA GPU Operator (recommended on Kubernetes)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set dcgmExporter.enabled=true \
  --set dcgmExporter.serviceMonitor.enabled=true \
  --set dcgmExporter.serviceMonitor.interval=30s \
  --set toolkit.enabled=true \
  --set driver.enabled=true

# Verify the DaemonSet is on every GPU node and scrape works
kubectl -n gpu-operator get ds nvidia-dcgm-exporter
kubectl -n gpu-operator run curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://nvidia-dcgm-exporter:9400/metrics | head -40

# 2. Standalone Helm chart (no GPU Operator, but driver already installed)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace monitoring --create-namespace \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.interval=30s

# 3. Bare-metal Docker (host has NVIDIA driver + nvidia-container-toolkit)
docker run -d --gpus all --rm --cap-add SYS_ADMIN \
  -p 9400:9400 --name dcgm-exporter \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04
# Metric lines carry a label set, so match the opening brace after the name
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED|POWER_USAGE|GPU_TEMP)\{'

Always install via the NVIDIA GPU Operator unless you have an explicit reason not to. The operator handles driver, runtime, MIG manager, gpu-feature-discovery, the device plugin, and DCGM Exporter in a single, version-pinned bundle; running them piecemeal is the most common source of `metric missing` issues.
How it works#
Three components sit between an SM and a `DCGM_FI_*` series in Prometheus. The GPU firmware and driver expose raw counters through NVML and CUPTI. The DCGM library, running inside a daemon called `nv-hostengine` that is either embedded in the exporter or run as a host service, polls those counters on a fixed cadence, performs derived calculations (rates, ratios, percentages), and exposes the result under a stable field-ID scheme. DCGM Exporter is the final Go process that selects a subset of fields, polls DCGM over its `libdcgm.so` API, and serves the result at `/metrics` in the Prometheus text exposition format.
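To make the last hop concrete, here is a minimal Python sketch that parses the text served at `/metrics` into `(name, labels, value)` tuples. It is not the exporter's or Prometheus's own parser; the regexes cover only the simple sample-line shape, escaped characters inside label values are left as-is, and the `EXAMPLE` payload (UUID, host name, values) is made up for illustration.

```python
import re

# One sample line looks like: NAME{label="value",...} 42
SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, float_value); comments are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or # HELP / # TYPE metadata
        match = SAMPLE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hypothetical payload in the shape the exporter serves
EXAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 61
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 312.5
'''
```

Calling `parse_metrics(EXAMPLE)` yields two samples, each carrying the `gpu`, `UUID` and `Hostname` labels that the rest of this entry joins on.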
Two of DCGM's field categories matter operationally. `DCGM_FI_DEV_*` fields are device telemetry — utilisation, memory, power, temperature, clocks, ECC counters. They are cheap to read and safe to scrape continuously. `DCGM_FI_PROF_*` fields are profiling counters — SM occupancy, Tensor Core pipe activity, NVLink and PCIe bytes — that require the GPU's profiling unit to be active. Profiling counters cost roughly 1-2 percent of SM time when sampled at sub-second cadence; at the 15-30 second scrape interval recommended for production the overhead disappears into the noise.
On Kubernetes, the exporter joins its metrics with two adjacent data sources to produce per-workload attribution. `gpu-feature-discovery` writes node labels describing every GPU's UUID, model, MIG geometry and PCIe topology. The `nvidia-container-toolkit` writes per-pod GPU device-id metadata that the kubelet exposes through cAdvisor. DCGM Exporter labels its metrics with `Hostname`, `UUID`, `device`, and (when MIG is on) `GPU_I_ID` and `GPU_I_PROFILE`. A PromQL vector-matching join (`* on (...) group_left (...)`, not `label_replace`, which only rewrites labels on a single series) against kube-state-metrics adds `pod` and `namespace` labels, so every metric can be scoped from a physical GPU all the way up to a tenant.
On MIG-enabled hosts the exporter emits a separate series for each MIG instance: when a single H100 is sliced into seven 1g.12gb instances, the metrics endpoint returns seven `DCGM_FI_PROF_GR_ENGINE_ACTIVE` series for that GPU, distinguished by the `GPU_I_ID` and `GPU_I_PROFILE` labels. Some device-level fields (board power, fan speed) remain per-physical-device because there is only one device to measure; profiling fields are partitioned per MIG instance because the MIG hardware boundary isolates the SMs.
- `nv-hostengine` daemon: polls NVML and CUPTI; can run embedded in the exporter pod or as a host service shared with `dcgmi`.
- `DCGM_FI_DEV_*` fields: cheap device telemetry — utilisation, memory, power, temperature, clocks, ECC, link state. Safe at any scrape interval.
- `DCGM_FI_PROF_*` fields: profiling counters — SM occupancy, Tensor Core activity, NVLink/PCIe bytes. Cost ~1-2 percent SM at high sampling rates.
- MIG awareness: per-instance series with `GPU_I_ID` and `GPU_I_PROFILE` labels; board-level metrics remain per physical device.
- Per-pod attribution: join DCGM metrics with cAdvisor's `container_accelerator_*` labels or gpu-feature-discovery node labels via PromQL relabel rules.
- Counter selection: driven by a CSV config file mounted into the pod; default set is sensible, custom sets are common at scale.
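The counter file and the `DEV`/`PROF` split summarised above can be sketched in a few lines of Python. This is an illustrative validator, not the exporter's own loader: `parse_counters` and `split_by_cost` are hypothetical helper names, and the check accepts only the two Prometheus types this entry names (`gauge` and `counter`), which may be stricter than the real parser.

```python
def parse_counters(csv_text):
    """Parse 'FIELD, type, help' rows into a list of (field, type, help)."""
    rows = []
    for lineno, line in enumerate(csv_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first two commas only, so help text may contain commas
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 comma-separated columns")
        field, ptype, help_text = parts
        if ptype not in ("gauge", "counter"):
            raise ValueError(f"line {lineno}: unknown type {ptype!r}")
        rows.append((field, ptype, help_text))
    return rows


def split_by_cost(rows):
    """Separate cheap device fields from profiling fields by name prefix."""
    device = [row[0] for row in rows if row[0].startswith("DCGM_FI_DEV_")]
    profiling = [row[0] for row in rows if row[0].startswith("DCGM_FI_PROF_")]
    return device, profiling
```

Running `split_by_cost(parse_counters(open("counters.csv").read()))` on a candidate file gives a quick view of how many profiling fields a change would add before it is rolled out (the file name here is only an example).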
Reference and metric catalogue#
The exporter is configured by a CSV file (`/etc/dcgm-exporter/default-counters.csv` in the upstream image) that lists the DCGM field IDs to scrape, their Prometheus type (`gauge` or `counter`), and a human-readable description. The table below documents the canonical production counter set as of DCGM 3.3 / exporter 3.4 (mid-2026). The selection covers compute, memory, power and thermals, reliability, and fabric — everything a Grafana dashboard or alerting rule needs without inflating series cardinality.
| DCGM field | Prometheus type | Unit | What it measures |
|---|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | gauge | percent (0-100) | Fraction of sample window with at least one active kernel — coarse |
| DCGM_FI_DEV_MEM_COPY_UTIL | gauge | percent | Memory copy engine utilisation — H2D, D2H, D2D traffic |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | gauge | ratio 0-1 | Fraction of time the graphics/compute engine was active |
| DCGM_FI_PROF_SM_ACTIVE | gauge | ratio 0-1 | Fraction of SMs with at least one warp resident |
| DCGM_FI_PROF_SM_OCCUPANCY | gauge | ratio 0-1 | Average warp-slot occupancy across active SMs |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | gauge | ratio 0-1 | Tensor Core pipe activity — the real GPU saturation signal for LLMs |
| DCGM_FI_PROF_DRAM_ACTIVE | gauge | ratio 0-1 | HBM memory channels active — bandwidth-bound workload signal |
| DCGM_FI_PROF_PCIE_TX_BYTES | counter | bytes/s | PCIe bytes transmitted by the GPU (device to host); typed `counter` in the counters file, but DCGM documents the value as a rate |
| DCGM_FI_PROF_PCIE_RX_BYTES | counter | bytes/s | PCIe bytes received by the GPU (host to device); same rate semantics as the TX field |
| DCGM_FI_PROF_NVLINK_TX_BYTES | counter | bytes/s | NVLink bytes transmitted across all links; DCGM documents the value as a rate |
| DCGM_FI_PROF_NVLINK_RX_BYTES | counter | bytes/s | NVLink bytes received across all links; DCGM documents the value as a rate |
| DCGM_FI_DEV_FB_USED | gauge | MiB | Framebuffer (VRAM) reserved by all CUDA contexts |
| DCGM_FI_DEV_FB_FREE | gauge | MiB | Framebuffer free for allocation |
| DCGM_FI_DEV_FB_TOTAL | gauge | MiB | Total physical framebuffer on the device |
| DCGM_FI_DEV_POWER_USAGE | gauge | watts | Instantaneous board power draw |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | counter | millijoules | Cumulative energy used since boot — basis of $/Wh accounting |
| DCGM_FI_DEV_GPU_TEMP | gauge | celsius | GPU die temperature |
| DCGM_FI_DEV_MEMORY_TEMP | gauge | celsius | HBM stack temperature (Hopper/Blackwell) |
| DCGM_FI_DEV_SM_CLOCK | gauge | MHz | Current SM clock speed |
| DCGM_FI_DEV_MEM_CLOCK | gauge | MHz | Current memory clock speed |
| DCGM_FI_DEV_ECC_SBE_VOL_TOTAL | counter | errors | Single-bit ECC errors corrected since boot |
| DCGM_FI_DEV_ECC_DBE_VOL_TOTAL | counter | errors | Double-bit ECC errors since boot — RMA threshold |
| DCGM_FI_DEV_RETIRED_PENDING | gauge | pages | Memory pages pending retirement (DRAM failing soon) |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | gauge | MB/s | Aggregate NVLink bandwidth across all links |
| DCGM_FI_DEV_XID_ERRORS | gauge | XID code | Most recent driver-reported XID error code; a non-zero value is a crash/hang signal |
| DCGM_FI_DEV_GPU_UTIL_SAMPLES | gauge | samples | Number of underlying samples in the GPU util window |
`DCGM_FI_DEV_GPU_UTIL` is the most misread metric in the field. It reports the fraction of sample windows where at least one kernel was running — a single tiny kernel hogging one SM reports 100 percent. For real saturation watch `DCGM_FI_PROF_SM_OCCUPANCY` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` together. Many production dashboards still alert on the wrong metric.
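The difference is easy to see with a small classifier. The Python sketch below is only an illustration of reading the three signals together; the function name and every threshold (90 percent, 0.5, 0.5) are assumptions chosen for the example, not values recommended by NVIDIA or taken from DCGM.

```python
def classify(gpu_util_pct, sm_occupancy, tensor_active,
             busy_util=90.0, busy_occupancy=0.5, busy_tensor=0.5):
    """Label one GPU sample.

    gpu_util_pct is DCGM_FI_DEV_GPU_UTIL on a 0-100 scale; sm_occupancy and
    tensor_active are the DCGM_FI_PROF_* ratios on a 0-1 scale.
    All thresholds are illustrative assumptions.
    """
    if gpu_util_pct < busy_util:
        return "not busy"
    if sm_occupancy >= busy_occupancy and tensor_active >= busy_tensor:
        return "saturated"
    # A kernel was always resident, but most of the GPU sat idle
    return "busy but underused"
```

A sample such as `classify(100, 0.03, 0.0)` is exactly the tiny-kernel case described above: the coarse utilisation gauge reads 100 while occupancy and Tensor Core activity show the GPU is mostly idle.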
Workload patterns#
Three workload shapes cover the bulk of DCGM Exporter deployments: a node-level saturation dashboard for capacity planning, per-pod attribution for multi-tenant cost and noisy-neighbour analysis, and a MIG-aware view for clusters that slice H100/H200 GPUs into smaller tenancies. Each pattern uses a slightly different counter selection and label-join strategy.
Pattern A: node-level saturation for capacity planning. Run with the default counter set plus the `PROF_PIPE_TENSOR_ACTIVE` and `PROF_DRAM_ACTIVE` fields. The PromQL questions you want answered are: which nodes are sustained above 70 percent Tensor Core activity, which nodes are bandwidth-bound at high DRAM activity but low Tensor activity (likely batched inference with small models), and which nodes are mostly idle and candidates for workload consolidation.
Pattern B: per-pod attribution. Use the default counters but add the `Hostname`, `UUID` and pod-attribution relabels in the exporter config. Pair with the `nvidia.com/gpu` resource reported by kube-state-metrics so the namespace and pod-name labels join in Prometheus. The PromQL question is: which tenant's pods are responsible for the GPU load on each node, broken down by namespace, deployment and container.
Pattern C: MIG-aware monitoring. Enable MIG mode on the host, configure the GPU Operator's MIG manager to create instance profiles (e.g. `all-1g.12gb` for seven small slices, or `all-balanced` for mixed sizes), then scrape the exporter as normal. Per-instance metrics are now emitted seven times per H100 with distinct `GPU_I_ID` labels, while board-level fields stay per physical device. Dashboards that roll up to the physical GPU must aggregate explicitly, for example with `avg by (Hostname, gpu) (...)` or `sum by (Hostname, gpu) (...)`, rather than treating the seven series as seven GPUs, and alerts can use `topk by (Hostname, gpu) (1, ...)` to identify the hottest slice.
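As a sketch of the Pattern C roll-up, the Python below groups per-instance samples of one profiling metric by physical GPU and reports both the mean across instances and the hottest slice. It assumes equally sized slices (so a plain mean is a fair whole-GPU figure) and that each sample carries `Hostname`, `gpu` and `GPU_I_ID` labels; the function name is hypothetical.

```python
from collections import defaultdict


def per_gpu_and_hottest(samples):
    """Roll MIG instance samples up to their physical GPU.

    samples: iterable of (labels_dict, value) for a single profiling metric.
    Returns {(Hostname, gpu): (mean_over_instances, hottest_GPU_I_ID)}.
    """
    by_gpu = defaultdict(list)
    for labels, value in samples:
        key = (labels["Hostname"], labels["gpu"])
        by_gpu[key].append((value, labels.get("GPU_I_ID", "")))
    result = {}
    for key, values in by_gpu.items():
        mean = sum(value for value, _ in values) / len(values)
        hottest = max(values)[1]  # instance ID attached to the largest value
        result[key] = (mean, hottest)
    return result
```

Grouping on `(Hostname, gpu)` rather than on `GPU_I_ID` alone matters because instance IDs repeat on every GPU, so aggregating by `GPU_I_ID` by itself would merge unrelated slices from different devices.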
# Custom counter file mounted into the exporter for per-pod attribution
# saved as default-counters.csv, mounted at /etc/dcgm-exporter/counters.csv
#
# format: DCGM_FIELD_ID, prometheus_type, help
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory copy engine utilization
DCGM_FI_DEV_FB_USED, gauge, Framebuffer used (MiB)
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free (MiB)
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (W)
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Energy (mJ)
DCGM_FI_DEV_GPU_TEMP, gauge, GPU die temperature (C)
DCGM_FI_DEV_MEMORY_TEMP, gauge, HBM stack temperature (C)
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock (MHz)
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock (MHz)
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Double-bit ECC errors
DCGM_FI_DEV_XID_ERRORS, gauge, Last XID error code
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics/compute engine active
DCGM_FI_PROF_SM_ACTIVE, gauge, SMs with at least one warp
DCGM_FI_PROF_SM_OCCUPANCY, gauge, Warp slot occupancy
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor Core pipe active
DCGM_FI_PROF_DRAM_ACTIVE, gauge, HBM channels active
DCGM_FI_PROF_PCIE_TX_BYTES, counter, PCIe transmit bytes
DCGM_FI_PROF_PCIE_RX_BYTES, counter, PCIe receive bytes
DCGM_FI_PROF_NVLINK_TX_BYTES, counter, NVLink transmit bytes
DCGM_FI_PROF_NVLINK_RX_BYTES, counter, NVLink receive bytes

Pattern B (per-pod attribution) requires `gpu-feature-discovery` and the device plugin to be running, and the exporter to be configured with Kubernetes pod resolution (`--kubernetes=true --kubernetes-gpu-id-type=device-name`). The GPU Operator sets both automatically. Missing pod labels almost always mean one of those two prerequisites is not in place.
Sizing and capacity planning#
DCGM Exporter sizing is governed by scrape interval, counter cardinality, and number of GPUs per node. The exporter itself uses negligible CPU and memory (typically under 50 MB resident), but the Prometheus side scales with `nodes x GPUs-per-node x metrics x retention`. The table below shows the steady-state series count and scrape cost for typical fleet sizes, assuming the canonical 26-field set from the reference section and the standard pod-attribution labels.
The two numbers that matter for Prometheus capacity planning are active series and ingest rate. As a planning anchor, budget roughly 90-100 active series per physical GPU for the canonical counter set with pod-attribution labels (each counter yields one series per distinct label combination, so pod churn on a GPU adds series beyond the raw counter count), and up to ~7x that on MIG-enabled hosts running the 1g profile. A 256-GPU H100 cluster produces around 25,000 active series from DCGM; a 1,024-GPU fleet around 100,000. A well-provisioned Prometheus server handles several million active series, so DCGM is rarely the cardinality bottleneck; application metrics usually are.
- Default scrape interval: 30 s. Drop to 15 s for SLA-critical inference fleets; raise to 60 s for batch-only training clusters.
- Counter selection: a 12-counter minimal set (utilisation, memory, power, temperature, ECC, XID) halves series count at modest visibility cost.
- MIG inflation: slicing a GPU into seven 1g.12gb instances multiplies its per-instance series by 7. Account for this when planning Prometheus retention for MIG-heavy clusters.
- Profiling counters: include `PROF_PIPE_TENSOR_ACTIVE` and `PROF_DRAM_ACTIVE` in steady state; reserve `PROF_PCIE_*` and `PROF_NVLINK_*` byte counters for clusters where fabric is the focus.
- Remote write: forward DCGM metrics to long-term storage (Thanos, Mimir, VictoriaMetrics) — local Prometheus retention of 7-14 days is sufficient for live queries.
| Fleet | Nodes | GPUs | MIG | Active series | Prometheus ingest | Retention storage (30d) |
|---|---|---|---|---|---|---|
| Single dev node | 1 | 8 | No | ~800 | ~25 samples/s | ~150 MB |
| Small cluster | 8 | 64 | No | ~6,400 | ~210 samples/s | ~1.2 GB |
| Production tenancy | 32 | 256 | No | ~25,000 | ~830 samples/s | ~5 GB |
| Production tenancy + MIG | 32 | 256 | Yes (1g.12gb x7) | ~175,000 | ~5,800 samples/s | ~35 GB |
| Yobitel London-1 region | 128 | 1,024 | Mixed | ~120,000 | ~4,000 samples/s | ~24 GB |
| Yobitel multi-region fleet | 512 | 4,096 | Mixed | ~480,000 | ~16,000 samples/s | ~96 GB |
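The active-series and ingest columns above follow from simple arithmetic, sketched here in Python. The defaults (100 series per GPU, 30 s scrape interval) are this section's planning anchors, the function name is hypothetical, and applying the MIG multiplier to every series is an upper bound because board-level fields stay per physical device.

```python
def estimate(gpus, series_per_gpu=100, mig_slices=1, scrape_interval_s=30):
    """Return (active_series, samples_per_second) for a DCGM Exporter fleet.

    series_per_gpu: planning anchor, not a measured value.
    mig_slices: 1 for whole GPUs, 7 for the 1g profile (upper bound).
    """
    active_series = gpus * series_per_gpu * mig_slices
    samples_per_s = active_series / scrape_interval_s
    return active_series, samples_per_s
```

For example, `estimate(256)` and `estimate(256, mig_slices=7)` land close to the two production-tenancy rows, and halving `scrape_interval_s` doubles the ingest rate while leaving the series count unchanged.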
Limits and quotas#
DCGM Exporter has very few hard limits. The constraints that matter in practice are driver and DCGM library version compatibility, MIG mode detection, profiling counter availability per architecture, and the cost of high-cardinality label combinations. The table below documents each ceiling and the operational lever for raising it.
| Limit | Default | Ceiling | How to raise / work around |
|---|---|---|---|
| Scrape interval (Prometheus) | 30 s | 1 s (impractical) | Lower in `ServiceMonitor.interval`; watch GPU profiling cost. |
| Profiling fields per scrape | all enabled | GPU profiling unit shared with Nsight | Disable when Nsight Systems is running concurrently. |
| MIG instances per H100 | 1 (no MIG) | 7 instances | Configure MIG manager via GPU Operator; exporter auto-detects. |
| Driver version | R535+ | R570 / R580 for B200 | Upgrade driver via GPU Operator; older drivers omit Blackwell fields. |
| DCGM library version | 3.3+ | 3.3.x recommended for H200/B200 | Pin via `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04` tag. |
| Pods per node label join | unlimited | Prometheus cardinality budget | Join via `kube-state-metrics`, not exporter-side relabel. |
| Container privilege | host driver caps required | n/a | Run with `SYS_ADMIN` and host `/dev/nvidia*`; the GPU Operator does this. |
| Shared memory for nv-hostengine | default | Container-defined | Mount `/dev/shm` at 64 MB minimum (default in stock image). |
| Volatile error counters (ECC) and XID history | since driver load | Reset on driver reload | Use `increase(...[1h])` on the ECC counters rather than absolute values; `DCGM_FI_DEV_XID_ERRORS` holds only the most recent XID code. |
| Concurrency with `dcgmi` | shared daemon | n/a | Use either embedded or host `nv-hostengine`, not both. |
| Concurrency with Nsight profiler | exclusive | n/a | Profiling counters disabled while Nsight Compute holds the perf unit. |
Running Nsight Compute on a host where DCGM Exporter is also reading profiling counters causes DCGM to silently drop the `PROF_*` fields until Nsight releases the profiling unit. If your Tensor Core activity series shows a flat-line gap, check whether a performance engineer is profiling on that host.
Observability#
DCGM Exporter is itself an observability component, but its own health is worth alerting on. The exporter exposes a small set of meta-metrics: `dcgm_exporter_field_collection_errors_total` (counter — field reads that failed against DCGM), `dcgm_exporter_scrape_duration_seconds` (gauge — per-scrape cost), and standard Prometheus `up` (whether the scrape succeeded). The alerts below cover the high-value GPU-side incidents the exporter exists to surface, plus the meta-alerts for the exporter itself.
- Thermal (`DCGM_FI_DEV_GPU_TEMP > 85` for 5 min): cooling failure, hot-aisle thermal runaway, or fan failure on the chassis.
- Memory pressure (`DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95` for 10 min): workload one batch from CUDA OOM.
- Reliability (`increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0`): double-bit ECC error; cordon the node and schedule an RMA.
- Reliability (`DCGM_FI_DEV_RETIRED_PENDING > 0`): DRAM cells failing; rebalance off this GPU before retirement. The field is a gauge, so compare it directly rather than wrapping it in `increase()`.
- Fabric (`DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0` on an active workload): NVLink down, expect 4-10x training slowdown on TP/DP jobs.
- Driver crash (`DCGM_FI_DEV_XID_ERRORS > 0`): the field holds the most recent XID code, so any non-zero value is worth a page; XID 79 (GPU fallen off the bus) is a hardware reset condition.
- Underutilisation (`avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10`): a paid-for GPU not earning its keep; investigate the workload.
- Exporter health (`up{job="dcgm-exporter"} == 0`): scrape failed; alert and check the DaemonSet for the affected node.
# Prometheus alerting rules for DCGM Exporter on GPU clusters
groups:
  - name: gpu-hardware
    interval: 30s
    rules:
      - alert: GPUThermalWarning
        expr: max by (Hostname, gpu) (DCGM_FI_DEV_GPU_TEMP) > 85
        for: 5m
        labels: { severity: warning, team: infra }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} at {{ $value }}C"
      - alert: GPUMemoryPressure
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM >95%, OOM imminent"
      - alert: GPUDoubleBitECC
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels: { severity: critical, team: infra }
        annotations:
          summary: "Double-bit ECC on {{ $labels.Hostname }} GPU {{ $labels.gpu }}: cordon and RMA"
      - alert: GPUXIDError
        # The field is a gauge holding the last XID code, so compare it directly
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels: { severity: critical }
        annotations:
          summary: "XID {{ $value }} on {{ $labels.Hostname }} GPU {{ $labels.gpu }}: investigate driver"
      - alert: NVLinkDown
        expr: >-
          DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0
          and on (Hostname, gpu) DCGM_FI_PROF_PIPE_TENSOR_ACTIVE > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "NVLink down on {{ $labels.Hostname }} GPU {{ $labels.gpu }} during active workload"
      - alert: GPUUnderutilised
        expr: >-
          avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10
          and on (Hostname, gpu) DCGM_FI_DEV_POWER_USAGE > 100
        for: 1h
        labels: { severity: info, team: finops }
        annotations:
          summary: "{{ $labels.Hostname }} GPU {{ $labels.gpu }} <10% Tensor Core utilisation for 1h"
      - alert: DCGMExporterDown
        expr: up{job="dcgm-exporter"} == 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "DCGM Exporter scrape failed on {{ $labels.instance }}"

Alert on Tensor Core activity (`DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`), not `DCGM_FI_DEV_GPU_UTIL`. The former tells you whether the workload is actually using the GPU for matrix multiply, which is the foundation of LLM and CV inference economics. The latter conflates a fully-saturated SM with a single launched kernel.
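The `increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0` rule above depends on counter-reset handling, because the volatile counters restart from zero when the driver reloads. The Python sketch below shows the core idea only; it is a simplification that ignores the extrapolation Prometheus applies at window edges, and the function name is hypothetical.

```python
def counter_increase(values):
    """Total increase across an ordered list of counter samples.

    A drop between consecutive samples is treated as a reset to zero, so the
    post-reset value is counted in full instead of producing a negative delta.
    """
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total
```

With samples `[3, 3, 0, 0, 2]`, the drop from 3 to 0 is read as a driver reload and the two errors logged afterwards still count, which is why alerting on the increase is safer than alerting on the absolute counter value.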
Cost and FinOps#
DCGM Exporter is free under Apache 2.0 — there is no licence cost. The operational cost is Prometheus storage for the metrics it produces, and the small SM-time overhead of profiling counters. The table below puts both in USD terms for typical fleet sizes, using mid-2026 pricing anchors for managed Prometheus (Grafana Cloud, AMP) and self-hosted Thanos on cheap object storage.
- Profiling overhead: ~1-2 percent SM time when sampling all `PROF_*` fields at 1 s. At 30 s scrape it is unmeasurable.
- Storage: assume ~1.3 bytes per sample compressed in Prometheus TSDB. Self-hosted Thanos on object storage drops effective cost to ~$0.025/GB-month.
- Cardinality drivers: MIG mode (7x per GPU on 1g profile), per-pod attribution labels, and excessive `gpu` label values are the three knobs to watch.
- FinOps integration: `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` is a counter in millijoules — join with cAdvisor pod labels to attribute energy and $/Wh to tenants.
- Yobitel customers see DCGM-derived per-tenant utilisation in the Yobibyte console at no extra cost; the underlying telemetry is included in the GPU rate.
| Fleet | GPUs | Active series | Self-hosted Prom + Thanos (30d) | Managed Prom (Grafana Cloud, 30d) | Notes |
|---|---|---|---|---|---|
| Single dev node | 8 | ~800 | $0 (existing) | ~$2 | Negligible at this scale. |
| Production tenancy | 256 | ~25,000 | ~$15/month (S3) | ~$75/month | DCGM dominates GPU-side metrics. |
| Production tenancy + MIG | 256 | ~175,000 | ~$80/month | ~$520/month | MIG inflates 7x on 1g profiles. |
| Yobitel London-1 region | 1,024 | ~120,000 | ~$60/month | ~$360/month | Mixed MIG and full-GPU tenancies. |
| Yobitel multi-region fleet | 4,096 | ~480,000 | ~$240/month | ~$1,440/month | Federate via Thanos sidecar per cluster. |
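For the energy-attribution bullet above, the unit conversion from the millijoule counter to kWh and cost can be sketched as follows. The helper names and the example electricity price are assumptions for illustration; the only fixed fact is that 1 kWh equals 3.6e9 mJ.

```python
MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


def energy_kwh(counter_start_mj, counter_end_mj):
    """Energy used between two readings of DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION."""
    if counter_end_mj < counter_start_mj:
        # The counter restarted inside the window; bill the two halves separately
        raise ValueError("counter went backwards; split the window at the reset")
    return (counter_end_mj - counter_start_mj) / MJ_PER_KWH


def energy_cost(counter_start_mj, counter_end_mj, price_per_kwh):
    """Cost of that energy at a caller-supplied price per kWh."""
    return energy_kwh(counter_start_mj, counter_end_mj) * price_per_kwh
```

A GPU drawing a steady 700 W for one hour advances the counter by 2.52e9 mJ, which `energy_kwh` converts to 0.7 kWh; multiplying by the tenant's share of the GPU is left to the attribution join.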
Security and compliance#
DCGM Exporter requires privileged access to the host GPU device files (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`) and the `SYS_ADMIN` Linux capability to read profiling counters. The GPU Operator handles this transparently by deploying the pod with the right `securityContext`; if you deploy the exporter outside the operator you must mirror those settings. The exporter does not need network egress and should be locked down with a NetworkPolicy that only permits ingress from the Prometheus scrape endpoint.
The exporter does not authenticate scrape requests. On shared clusters this is fine because the Service is internal and the pod selector is locked down, but if the metrics endpoint is exposed beyond the cluster boundary (federation across regions, external Grafana) place a reverse proxy with mTLS or bearer-token auth in front of it. Prometheus operator's `bearerTokenSecret` field on the `ServiceMonitor` is the standard pattern.
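From the client side, scraping through such a proxy only requires an `Authorization` header. The Python sketch below uses the standard library; the URL in the usage note is a placeholder, and it assumes the reverse proxy (not the exporter) validates the bearer token.

```python
import urllib.request


def build_scrape_request(url, token):
    """Build a GET request that carries a bearer token for the proxy to check."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    return request


def scrape(url, token, timeout=5.0):
    """Fetch the metrics payload as text; raises on HTTP or network errors."""
    request = build_scrape_request(url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call such as `scrape("https://metrics.example.internal/metrics", token)` would return the same text a direct in-cluster scrape sees, provided the proxy forwards to port 9400 unchanged.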
Regulatory posture is straightforward because DCGM metrics are telemetry counters with no customer payload. They contain GPU UUIDs, host names, namespace and pod names (when attribution is enabled), and numeric counters — no PII, no model weights, no inference payloads. For UK public-sector workloads (NCSC Cloud Security Principles, G-Cloud 14) this means DCGM metrics flow freely within the sovereign tenancy and to the central monitoring stack without additional control. For GDPR purposes the metrics are operational data, not personal data. The one caveat is the namespace and pod-name labels — on tenancies where the namespace name itself reveals a customer identity, scrub or rewrite those labels at the Prometheus relabel stage before federating to a multi-tenant store.
Never expose the DCGM Exporter `/metrics` endpoint to the public internet without an authenticating reverse proxy. The metrics themselves are non-sensitive but the GPU UUIDs, host names, and pod labels leak operational topology that helps an attacker target the cluster.
Migration and alternatives#
Most production migrations to DCGM Exporter come from one of three origins: shell scripts polling `nvidia-smi --query-gpu`, the legacy Prometheus exporter from `nvidia/gpu_exporter` (community fork that predated the official one), or cloud-provider GPU telemetry (CloudWatch Container Insights, GCP Cloud Monitoring's GPU plugin, Azure Monitor). The table below documents the trade-offs of each migration path.
If you are currently polling `nvidia-smi` from a shell script and writing to a TSDB, the migration is largely a deletion: install the GPU Operator, point Prometheus at the exporter, retire the script. Field names change (`utilization.gpu` becomes `DCGM_FI_DEV_GPU_UTIL`) and you lose Tegra/Jetson-specific fields that the embedded `nvidia-smi` reports on edge devices but DCGM does not implement.
| Migration source | Effort | What you gain | What you lose |
|---|---|---|---|
| nvidia-smi polling shell script | Low | Native Prometheus, MIG awareness, no fork overhead | Tegra/Jetson fields, custom parsing logic |
| Legacy nvidia/gpu_exporter | Low — drop in | Active NVIDIA support, full profiling counter set | Some custom labels — re-derive via relabel rules |
| AWS CloudWatch Container Insights | Medium | Open-source standard, portable across clouds | AWS-native alarms; re-implement in Prometheus rules |
| GCP Cloud Monitoring GPU plugin | Medium | Same as above | GKE Autopilot integration; re-wire dashboards |
| Azure Monitor GPU agent | Medium | Same as above | Azure-native portal integration |
| Datadog DCGM integration | Low | Self-hosted control, no per-host licence | Datadog's automated topology view |
| No GPU monitoring at all | Trivial via GPU Operator | Every benefit | n/a — this is the right migration |
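For teams that want to compare old and new data during a migration, the field renaming can be sketched as a small translation table in Python. Only the `utilization.gpu` mapping is stated in this entry; the other three pairs are the closest equivalents and should be treated as assumptions, as should the helper name. Values that `nvidia-smi` reports as `[N/A]` are not handled.

```python
# nvidia-smi --query-gpu field -> closest DCGM field (partly assumed, see text)
FIELD_MAP = {
    "utilization.gpu": "DCGM_FI_DEV_GPU_UTIL",
    "memory.used": "DCGM_FI_DEV_FB_USED",
    "memory.total": "DCGM_FI_DEV_FB_TOTAL",
    "power.draw": "DCGM_FI_DEV_POWER_USAGE",
}


def translate(query_fields, csv_row):
    """Map one 'csv,noheader,nounits' row onto DCGM metric names."""
    values = [value.strip() for value in csv_row.split(",")]
    if len(values) != len(query_fields):
        raise ValueError("row does not match the queried field list")
    translated = {}
    for field, value in zip(query_fields, values):
        if field in FIELD_MAP:  # fields without a mapping (e.g. index) are skipped
            translated[FIELD_MAP[field]] = float(value)
    return translated
```

Feeding it the field list used by the legacy script and one output row produces a dictionary keyed by `DCGM_FI_DEV_*` names, which can be diffed against a scrape of the exporter taken at the same moment.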
# Equivalent invocations: nvidia-smi shell script vs DCGM Exporter PromQL

# Old: nvidia-smi script pushing to a Pushgateway
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,power.draw \
  --format=csv,noheader,nounits | \
while IFS=', ' read -r idx util mem_used mem_total power; do
  # Pushgateway rejects samples that carry timestamps, so none are emitted
  echo "gpu_util{gpu=\"$idx\"} $util"
  echo "gpu_mem_used{gpu=\"$idx\"} $mem_used"
  echo "gpu_power{gpu=\"$idx\"} $power"
done | curl --data-binary @- http://pushgateway:9091/metrics/job/gpu

# New: DCGM Exporter, same data with native PromQL queries
# Per-GPU utilisation, last 5 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))
# Memory pressure, sorted hot
topk(10, DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL)
# Power draw aggregated to node level for capacity planning
sum by (Hostname) (DCGM_FI_DEV_POWER_USAGE)
# Energy attribution per namespace. Assumes the exporter runs with
# --kubernetes=true so series carry namespace and pod labels; they may appear
# as exported_namespace / exported_pod if the scrape config does not honor labels.
sum by (namespace) (
  rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{namespace!=""}[5m])
) / 1000  # mJ/s to J/s, i.e. watts

There are no serious open-source alternatives to DCGM Exporter on NVIDIA hardware. AMD ships `rocm-smi-exporter` for ROCm, Intel ships `xpum-exporter` for Gaudi and Max GPUs; both follow the same DaemonSet + Prometheus pattern but with vendor-specific field names. The pattern is universal; the implementation is per vendor.
Troubleshooting#
The error table below covers the failure modes that account for almost all real DCGM Exporter incidents. Each row maps an observable symptom to the underlying cause and the minimum-viable fix. Most issues trace back to one of three root causes: driver version mismatch, MIG configuration drift, or missing label-join prerequisites.
| Symptom | Cause | Fix |
|---|---|---|
| No metrics scraped, pod CrashLoopBackOff | Driver missing or version mismatch with DCGM library | Verify `nvidia-smi` works on the host; align driver to DCGM's supported matrix; reinstall via GPU Operator. |
| Metrics endpoint up but `PROF_*` fields missing | Profiling unit not available (older driver) or held by Nsight | Upgrade driver to R535+; confirm no Nsight Compute session is active on the host. |
| MIG mode enabled but no per-instance series | MIG manager not configured or instances not created | Apply a MIG profile via the GPU Operator's `migManager.config` ConfigMap; reboot or `nvidia-smi mig -cgi` manually. |
| Per-pod labels (`pod`, `namespace`) missing | gpu-feature-discovery not deployed, or kubernetes pod resolution flag missing | Install GFD via GPU Operator; set `--kubernetes=true` on the exporter. |
| Cardinality explosion in Prometheus | MIG enabled with 1g profile across many H100s (7x multiplier) | Filter unused profile labels; aggregate at scrape time; raise Prometheus series budget. |
| `DCGM_FI_DEV_GPU_UTIL` always 0 or absent on H100 | Most often MIG mode: the coarse utilisation value is not reported per MIG instance | Use `DCGM_FI_PROF_GR_ENGINE_ACTIVE` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`, which are reported per instance on Hopper/Blackwell. |
| NVLink bandwidth always 0 | NVLink topology not detected, or P2P disabled | Run `nvidia-smi topo -m`; confirm NVLink visible; check `NVLINK_BANDWIDTH_TOTAL` field is in counter list. |
| Exporter pod evicted under memory pressure | `nv-hostengine` leaks memory on long-running deployments | Set pod memory limit at 512 MB; restart DaemonSet weekly via a CronJob. |
| Scrape duration spikes when MIG profile changes | Exporter re-discovers MIG geometry on every scrape after change | Expected; settles after 1-2 scrape cycles; do not alert on transient spikes. |
| Metrics report 0 W power on consumer GPUs | Power reporting unsupported on RTX consumer cards in some BIOS modes | DCGM is designed for data-centre SKUs; switch to data-centre GPUs or accept the gap. |
| `up == 0` on one node only | kubelet evicted DaemonSet pod due to node memory pressure | Increase pod priorityClass to `system-node-critical`; check node memory limits. |
| Conflict with `dcgmi` CLI on same host | Two `nv-hostengine` instances competing for the GPU's DCGM channel | Use embedded `nv-hostengine` (default) OR host service, not both; check `systemctl status nvidia-dcgm`. |
| Tensor Core activity flat-lines mid-day | Performance engineer running Nsight Compute on the host | Coordinate profiler sessions; warn before profiling production GPUs. |
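Several rows of the table can be checked mechanically from a single scrape. The Python sketch below inspects a `/metrics` payload for three of the symptoms; it is a hypothetical triage helper rather than part of the exporter, and it only reports what is missing, not why.

```python
import re

NAME = re.compile(r'[A-Za-z_:][A-Za-z0-9_:]*')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def diagnose(metrics_text):
    """Return a list of findings for one scrape of the exporter."""
    names = set()
    pod_seen = False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE metadata
        match = NAME.match(line)
        if match is None:
            continue
        names.add(match.group(0))
        if dict(LABEL.findall(line)).get("pod"):
            pod_seen = True  # at least one series has a non-empty pod label
    if not names:
        return ["no series at all: check driver and exporter pod status"]
    findings = []
    if not any(name.startswith("DCGM_FI_PROF_") for name in names):
        findings.append("PROF_* fields missing: check driver version and Nsight sessions")
    if not pod_seen:
        findings.append("pod labels missing: check --kubernetes=true and gpu-feature-discovery")
    return findings
```

An empty list means none of the three symptoms is present; on an idle node with no GPU pods scheduled, the pod-label finding is expected and not a fault.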
Where this fits in the Yobitel stack#
DCGM Exporter is the metrics backbone of every GPU host Yobitel operates. Every NeoCloud worker node, every Yobitel Edge AI appliance, and every Yobibyte tenancy worker runs the exporter as part of the standard NVIDIA GPU Operator install. The metrics feed three downstream consumers: the regional Prometheus stack that powers customer-facing dashboards in the Yobibyte console, the Thanos federation that retains 12 months of telemetry for capacity planning, and InferenceBench's scoring pipeline that records GPU-side energy and utilisation alongside model throughput numbers.
On Yobibyte tenancies, the metrics produced by DCGM Exporter are exposed back to the customer through two surfaces. The console's per-workspace dashboard shows real-time GPU utilisation, framebuffer, power, and thermal headroom for every workload the tenant is running. The Prometheus federation endpoint lets customers scrape their own slice of the fleet metrics into their own observability stack — a `Prometheus federate` URL with bearer-token auth, scoped to the tenant's GPU UUIDs. The FOCUS-conformant billing export joins DCGM energy counters with cAdvisor pod labels to produce per-workload $/Wh accounting.
On UK and EU sovereign tenancies (NCSC Cloud Security Principles, G-Cloud 14, OFFICIAL-handling), DCGM metrics remain inside the sovereign region's Prometheus and never federate to the global multi-tenant store. The metrics themselves contain no customer payload, but Yobitel applies the same data-residency boundary to operational telemetry as to inference payloads — sovereign customers see a one-region observability stack with no cross-region replication.