DCGM Exporter

TL;DR

Open-source (Apache 2.0) Prometheus exporter from NVIDIA that wraps the Data Center GPU Manager (DCGM) library and exposes every meaningful GPU telemetry field as `DCGM_FI_*` metrics on TCP port 9400.
Ships as a Go binary and the `nvcr.io/nvidia/k8s/dcgm-exporter` container image; the NVIDIA GPU Operator installs it as a DaemonSet on every node labelled `nvidia.com/gpu.present=true`, with a ServiceMonitor wired into Prometheus.
Default counter set covers SM occupancy, Tensor Core pipe activity, framebuffer use, power, temperature, ECC error counters, NVLink and PCIe throughput, and per-MIG-instance breakdowns when MIG is enabled.
Joins with `gpu-feature-discovery` and cAdvisor labels to produce per-pod and per-namespace GPU attribution — the basis of every GPU FinOps dashboard, capacity plan, and noisy-neighbour incident analysis.
The metrics backbone behind Yobitel's GPU fleet observability, InferenceBench scoring runs, and the per-tenant utilisation breakdown surfaced inside the Yobibyte console.

Overview

DCGM Exporter is the thin Prometheus adapter NVIDIA ships on top of Data Center GPU Manager (DCGM), the official daemon and library for monitoring and managing data-centre GPUs. Where DCGM itself is a C/Python/Go API plus the nv-hostengine daemon that aggregates telemetry from NVML, the GPU driver and the GPU firmware, DCGM Exporter is the small Go binary that selects a subset of DCGM field IDs, polls them on a configurable cadence, and serves them at /metrics in the standard Prometheus text exposition format.

The exporter does not invent metrics. Every series it emits carries a DCGM_FI_* name corresponding to a published DCGM field, and the values come straight from the same counters that dcgmi dmon, nvidia-smi dmon, and any other DCGM-aware tool would read. That stability is why it has become the default — the field IDs do not move between driver releases, the namespace is consistent across H100, H200, B200, L40S, A100 and Ampere consumer cards, and the same dashboards work for every NVIDIA SKU.

On Kubernetes it is the standardised telemetry source for the NVIDIA GPU Operator, kube-prometheus-stack, KServe, vLLM, NVIDIA Run:ai, and every Helm chart that emits a Grafana dashboard for GPU workloads. On bare metal and Slurm it runs as a systemd unit on each GPU host. Yobibyte ships DCGM Exporter on every region by default; customer Prometheus instances scrape directly through a tenant-scoped federation endpoint, and Yobitel NeoCloud worker nodes carry the same DaemonSet feeding the central observability stack that powers customer dashboards.

This entry documents the production surface: container deployment via the GPU Operator, the counter set you should run in steady state versus benchmarking, MIG and per-pod attribution, the alerting rules that actually catch incidents, cardinality and overhead trade-offs, migration from nvidia-smi polling, and a troubleshooting matrix for the most common failure modes. Use it to wire up GPU telemetry on your own cluster, or to read what Yobibyte exposes to your Prometheus by default.

Quick start

The example below installs DCGM Exporter through the NVIDIA GPU Operator on a Kubernetes cluster, exposes a ServiceMonitor for kube-prometheus-stack, and verifies the metrics endpoint from a debug pod. The second block is the standalone Helm chart for clusters that do not run the full GPU Operator. The third block is the equivalent Docker invocation on a bare-metal host without Kubernetes.

# 1. Install via the NVIDIA GPU Operator (recommended on Kubernetes)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set dcgmExporter.enabled=true \
    --set dcgmExporter.serviceMonitor.enabled=true \
    --set dcgmExporter.serviceMonitor.interval=30s \
    --set toolkit.enabled=true \
    --set driver.enabled=true

# Verify the DaemonSet is on every GPU node and scrape works
kubectl -n gpu-operator get ds nvidia-dcgm-exporter
kubectl -n gpu-operator run curl-test --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -s http://nvidia-dcgm-exporter:9400/metrics | head -40

# 2. Standalone Helm chart (no GPU Operator, but driver already installed)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
    --namespace monitoring --create-namespace \
    --set serviceMonitor.enabled=true \
    --set serviceMonitor.interval=30s

# 3. Bare-metal Docker (host has CUDA driver + nvidia-container-toolkit)
docker run -d --gpus all --rm --cap-add SYS_ADMIN \
    -p 9400:9400 --name dcgm-exporter \
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04

# Metric names are followed by a '{' label block, so match on that rather than a space
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED|POWER_USAGE|GPU_TEMP)\{'

Tip: Always install via the NVIDIA GPU Operator unless you have an explicit reason not to. The operator handles driver, runtime, MIG manager, gpu-feature-discovery, the device plugin, and DCGM Exporter in a single, version-pinned bundle; running them piecemeal is the most common source of missing-metric issues.

How it works

Three components sit between an SM and a DCGM_FI_* series in Prometheus. The GPU firmware and driver expose raw counters through NVML and CUPTI. The DCGM library (running inside a daemon called nv-hostengine, either embedded in the exporter or running as a host service) polls those counters on a fixed cadence, performs derived calculations (rates, ratios, percentages), and exposes the result under a stable field-ID scheme. DCGM Exporter is the final Go process that selects a subset of fields, polls DCGM over its libdcgm.so API, and serves the result at /metrics in the Prometheus text exposition format.
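
To make the last hop concrete, here is a minimal sketch of reading the text served at /metrics. The sample payload, the label values (GPU-aaaa, node-1) and the helper name parse_metrics are illustrative assumptions, not output captured from a real exporter, and the parser ignores optional timestamps.

```python
import re

# Hypothetical exporter output; label values are invented for illustration.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 61
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",Hostname="node-1"} 312.5
'''

# name, optional {label block}, then a single value (no timestamp handling)
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, float_value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simple sketch does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels["gpu"], value)
```

In practice Prometheus does this parsing for you; the sketch is only useful for ad-hoc checks against a curl of the endpoint.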

Two of DCGM's field categories matter operationally. DCGM_FI_DEV_* fields are device telemetry — utilisation, memory, power, temperature, clocks, ECC counters. They are cheap to read and safe to scrape continuously. DCGM_FI_PROF_* fields are profiling counters — SM occupancy, Tensor Core pipe activity, NVLink and PCIe bytes — that require the GPU's profiling unit to be active. Profiling counters cost roughly 1-2 percent of SM time when sampled at sub-second cadence; at the 15-30 second scrape interval recommended for production the overhead disappears into the noise.

On Kubernetes, the exporter joins its metrics with two adjacent data sources to produce per-workload attribution. gpu-feature-discovery writes node labels describing every GPU's UUID, model, MIG geometry and PCIe topology. The nvidia-container-toolkit writes per-pod GPU device-id metadata that the kubelet exposes through cAdvisor. DCGM Exporter labels its metrics with Hostname, UUID, device, and (when MIG is on) GPU_I_ID and GPU_I_PROFILE. A PromQL vector-matching join against kube-state-metrics (on (...) group_left, with label_replace where label names need aligning) then adds pod and namespace labels, so every metric is scoped from a physical GPU all the way up to a tenant.

On MIG-enabled hosts the exporter emits a separate series for each MIG instance — when a single H100 is sliced into seven 1g.12gb instances the metrics endpoint returns seven sets of DCGM_FI_DEV_GPU_UTIL rows, distinguished by the GPU_I_ID and GPU_I_PROFILE labels. Some device-level fields (board power, fan speed) remain per-physical-device because there is only one device to measure; profiling fields are partitioned per MIG instance because the MIG hardware boundary isolates the SMs.
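
The per-instance labelling described above matters when aggregating. The sketch below assumes a handful of made-up DCGM_FI_PROF_SM_ACTIVE samples and equal-sized slices; it rolls instances up to their physical GPU and picks the hottest slices. The function names and sample values are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-MIG-instance samples: (labels, DCGM_FI_PROF_SM_ACTIVE value)
SAMPLES = [
    ({"Hostname": "node-1", "gpu": "0", "GPU_I_ID": "1", "GPU_I_PROFILE": "1g.12gb"}, 0.75),
    ({"Hostname": "node-1", "gpu": "0", "GPU_I_ID": "2", "GPU_I_PROFILE": "1g.12gb"}, 0.25),
    ({"Hostname": "node-1", "gpu": "1", "GPU_I_ID": "1", "GPU_I_PROFILE": "1g.12gb"}, 0.5),
]


def mean_per_physical_gpu(samples):
    """Average instance values per (Hostname, gpu); assumes equal-sized slices."""
    grouped = defaultdict(list)
    for labels, value in samples:
        # GPU_I_ID repeats across GPUs, so it must not be the grouping key.
        grouped[(labels["Hostname"], labels["gpu"])].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


def hottest_instances(samples, k):
    """Return the k samples with the highest value, hottest first."""
    return sorted(samples, key=lambda sample: sample[1], reverse=True)[:k]


print(mean_per_physical_gpu(SAMPLES))
print(hottest_instances(SAMPLES, 1))
```

Grouping on GPU_I_ID alone would merge instance 1 of GPU 0 with instance 1 of GPU 1, which is why the key includes the host and GPU index.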

nv-hostengine daemon: polls NVML and CUPTI; can run embedded in the exporter pod or as a host service shared with dcgmi.
DCGM_FI_DEV_* fields: cheap device telemetry — utilisation, memory, power, temperature, clocks, ECC, link state. Safe at any scrape interval.
DCGM_FI_PROF_* fields: profiling counters — SM occupancy, Tensor Core activity, NVLink/PCIe bytes. Cost ~1-2 percent SM at high sampling rates.
MIG awareness: per-instance series with GPU_I_ID and GPU_I_PROFILE labels; board-level metrics remain per physical device.
Per-pod attribution: join DCGM metrics with cAdvisor's container_accelerator_* labels or gpu-feature-discovery node labels via PromQL vector-matching joins.
Counter selection: driven by a CSV config file mounted into the pod; default set is sensible, custom sets are common at scale.

Reference and metric catalogue

The exporter is configured by a CSV file (/etc/dcgm-exporter/default-counters.csv in the upstream image) that lists the DCGM field IDs to scrape, their Prometheus type (gauge or counter), and a human-readable description. The table below documents the canonical production counter set as of DCGM 3.3 / exporter 3.4 (mid-2026). The selection covers compute, memory, power and thermals, reliability, and fabric — everything a Grafana dashboard or alerting rule needs without inflating series cardinality.

DCGM field	Prometheus type	Unit	What it measures
DCGM_FI_DEV_GPU_UTIL	gauge	percent (0-100)	Fraction of sample window with at least one active kernel — coarse
DCGM_FI_DEV_MEM_COPY_UTIL	gauge	percent	Memory copy engine utilisation — H2D, D2H, D2D traffic
DCGM_FI_PROF_GR_ENGINE_ACTIVE	gauge	ratio 0-1	Fraction of time the graphics/compute engine was active
DCGM_FI_PROF_SM_ACTIVE	gauge	ratio 0-1	Fraction of SMs with at least one warp resident
DCGM_FI_PROF_SM_OCCUPANCY	gauge	ratio 0-1	Average warp-slot occupancy across active SMs
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	gauge	ratio 0-1	Tensor Core pipe activity — the real GPU saturation signal for LLMs
DCGM_FI_PROF_DRAM_ACTIVE	gauge	ratio 0-1	HBM memory channels active — bandwidth-bound workload signal
DCGM_FI_PROF_PCIE_TX_BYTES	counter	bytes	Cumulative PCIe bytes transmitted by the GPU (device to host)
DCGM_FI_PROF_PCIE_RX_BYTES	counter	bytes	Cumulative PCIe bytes received by the GPU (host to device)
DCGM_FI_PROF_NVLINK_TX_BYTES	counter	bytes	Cumulative NVLink transmit bytes across all links
DCGM_FI_PROF_NVLINK_RX_BYTES	counter	bytes	Cumulative NVLink receive bytes across all links
DCGM_FI_DEV_FB_USED	gauge	MiB	Framebuffer (VRAM) reserved by all CUDA contexts
DCGM_FI_DEV_FB_FREE	gauge	MiB	Framebuffer free for allocation
DCGM_FI_DEV_FB_TOTAL	gauge	MiB	Total physical framebuffer on the device
DCGM_FI_DEV_POWER_USAGE	gauge	watts	Instantaneous board power draw
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	counter	millijoules	Cumulative energy used since boot — basis of $/Wh accounting
DCGM_FI_DEV_GPU_TEMP	gauge	celsius	GPU die temperature
DCGM_FI_DEV_MEMORY_TEMP	gauge	celsius	HBM stack temperature (Hopper/Blackwell)
DCGM_FI_DEV_SM_CLOCK	gauge	MHz	Current SM clock speed
DCGM_FI_DEV_MEM_CLOCK	gauge	MHz	Current memory clock speed
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL	counter	errors	Single-bit ECC errors corrected since boot
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL	counter	errors	Double-bit ECC errors since boot — RMA threshold
DCGM_FI_DEV_RETIRED_PENDING	gauge	pages	Memory pages pending retirement (DRAM failing soon)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL	gauge	MB/s	Aggregate NVLink bandwidth across all links
DCGM_FI_DEV_XID_ERRORS	gauge	XID code	Most recent driver-reported XID error code; crash/hang signal
DCGM_FI_DEV_GPU_UTIL_SAMPLES	gauge	samples	Number of underlying samples in the GPU util window

Warning: DCGM_FI_DEV_GPU_UTIL is the most misread metric in the field. It reports the fraction of sample windows where at least one kernel was running — a single tiny kernel hogging one SM reports 100 percent. For real saturation watch DCGM_FI_PROF_SM_OCCUPANCY and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE together. Many production dashboards still alert on the wrong metric.

Workload patterns

Three workload shapes cover the bulk of DCGM Exporter deployments: a node-level saturation dashboard for capacity planning, per-pod attribution for multi-tenant cost and noisy-neighbour analysis, and a MIG-aware view for clusters that slice H100/H200 GPUs into smaller tenancies. Each pattern uses a slightly different counter selection and label-join strategy.

Pattern A — node-level saturation for capacity planning. Run with the default counter set plus the PROF_PIPE_TENSOR_ACTIVE and PROF_DRAM_ACTIVE fields. The PromQL questions you want answered are: which nodes are sustained above 70 percent Tensor Core activity, which nodes are bandwidth-bound at high DRAM activity but low Tensor activity (likely batched inference with small models), and which nodes are mostly idle and candidates for workload consolidation.
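
The three questions in Pattern A amount to a small classification over two ratios. The sketch below is one way to express it; only the 70 percent Tensor Core threshold comes from the text, while the DRAM and idle thresholds and the category names are assumptions you would tune for your fleet.

```python
TENSOR_BUSY = 0.70  # threshold named in the text
DRAM_BUSY = 0.60    # assumption: what counts as "high DRAM activity"
IDLE = 0.10         # assumption: what counts as "mostly idle"


def classify_node(tensor_active, dram_active):
    """Classify a node from averaged PROF_PIPE_TENSOR_ACTIVE and PROF_DRAM_ACTIVE ratios."""
    if tensor_active >= TENSOR_BUSY:
        return "compute-saturated"
    if dram_active >= DRAM_BUSY:
        return "bandwidth-bound"
    if tensor_active < IDLE and dram_active < IDLE:
        return "consolidation-candidate"
    return "moderate"


# Hypothetical hourly averages per node
for node, (tensor, dram) in {"node-1": (0.82, 0.40), "node-2": (0.15, 0.75), "node-3": (0.02, 0.03)}.items():
    print(node, classify_node(tensor, dram))
```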

Pattern B — per-pod attribution. Use the default counters but add the Hostname, UUID and pod-attribution relabels in the exporter config. Pair with the nvidia.com/gpu resource reported by kube-state-metrics so the namespace and pod-name labels join in Prometheus. The PromQL question is: which tenant's pods are responsible for the GPU load on each node, broken down by namespace, deployment and container.

Pattern C - MIG-aware monitoring. Enable MIG mode on the host, configure the GPU Operator's MIG manager to create instance profiles (e.g. all-1g.12gb for seven small slices, or all-balanced for mixed sizes), then scrape the exporter as normal. Per-instance metrics are now emitted seven times per H100 with distinct GPU_I_ID labels, while board-level fields stay per physical device. GPU_I_ID values repeat across GPUs, so always keep Hostname and gpu in the grouping: roll instances up to the physical GPU by aggregating by (Hostname, gpu), and find hot slices with topk over series grouped by (Hostname, gpu, GPU_I_ID).

# Custom counter file mounted into the exporter for per-pod attribution
# saved as default-counters.csv, mounted at /etc/dcgm-exporter/counters.csv
#
# format: DCGM_FIELD_ID, prometheus_type, help
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory copy engine utilization
DCGM_FI_DEV_FB_USED, gauge, Framebuffer used (MiB)
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free (MiB)
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (W)
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Energy (mJ)
DCGM_FI_DEV_GPU_TEMP, gauge, GPU die temperature (C)
DCGM_FI_DEV_MEMORY_TEMP, gauge, HBM stack temperature (C)
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock (MHz)
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock (MHz)
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Double-bit ECC errors
DCGM_FI_DEV_XID_ERRORS, gauge, Last XID error code
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics/compute engine active
DCGM_FI_PROF_SM_ACTIVE, gauge, SMs with at least one warp
DCGM_FI_PROF_SM_OCCUPANCY, gauge, Warp slot occupancy
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor Core pipe active
DCGM_FI_PROF_DRAM_ACTIVE, gauge, HBM channels active
DCGM_FI_PROF_PCIE_TX_BYTES, counter, PCIe transmit bytes
DCGM_FI_PROF_PCIE_RX_BYTES, counter, PCIe receive bytes
DCGM_FI_PROF_NVLINK_TX_BYTES, counter, NVLink transmit bytes
DCGM_FI_PROF_NVLINK_RX_BYTES, counter, NVLink receive bytes
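
Before mounting a custom counter file it can help to lint it. The following is a minimal sketch of such a check, assuming the three-column layout shown above and only the two Prometheus types this entry names; parse_counters is a hypothetical helper, not part of the exporter.

```python
ALLOWED_TYPES = {"gauge", "counter"}  # the two types named in this entry


def parse_counters(text):
    """Parse 'FIELD, type, help' lines; skip blanks and '#' comments."""
    counters = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at most twice so commas inside the help text survive.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 comma-separated fields")
        field, prom_type, help_text = parts
        if not field.startswith("DCGM_FI_"):
            raise ValueError(f"line {lineno}: unexpected field name {field!r}")
        if prom_type not in ALLOWED_TYPES:
            raise ValueError(f"line {lineno}: unexpected type {prom_type!r}")
        counters.append((field, prom_type, help_text))
    return counters


example = """# format: DCGM_FIELD_ID, prometheus_type, help
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Double-bit ECC errors
"""
print(parse_counters(example))
```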

Tip: Pattern B (per-pod attribution) requires gpu-feature-discovery and the device plugin to be running, AND the exporter to be configured with kubernetes pod resolution (--kubernetes=true --kubernetes-gpu-id-type=device-name). Both are set automatically by the GPU Operator. Missing pod labels almost always means one of those two prerequisites is not in place.

Sizing and capacity planning

DCGM Exporter sizing is governed by scrape interval, counter cardinality, and number of GPUs per node. The exporter itself uses negligible CPU and memory, typically under 50 MB resident, but the Prometheus side scales with nodes x GPUs-per-node x metrics x retention. The table below shows the steady-state series count and scrape cost for typical fleet sizes, assuming the canonical 26-counter set from the reference section and the standard pod-attribution labels.

The two numbers that matter for Prometheus capacity planning are active series and ingest rate. As a planning anchor, the canonical counter set produces roughly 90-100 series per physical GPU (26 counters x 3-4 distinct label combinations each), and about 7x that on MIG-enabled hosts running the 1g profile. A 256-GPU H100 cluster produces around 25,000 active series from DCGM; a 1,024-GPU fleet around 100,000. A well-provisioned Prometheus server handles series counts in the millions, so DCGM is rarely the cardinality bottleneck; application metrics usually are.

Default scrape interval: 30 s. Drop to 15 s for SLA-critical inference fleets; raise to 60 s for batch-only training clusters.
Counter selection: a 12-counter minimal set (utilisation, memory, power, temperature, ECC, XID) halves series count at modest visibility cost.
MIG inflation: slicing a GPU into seven 1g.12gb instances multiplies its per-GPU series count by up to 7. Factor this in when planning Prometheus retention for MIG-heavy clusters.
Profiling counters: include PROF_PIPE_TENSOR_ACTIVE and PROF_DRAM_ACTIVE in steady state; reserve PROF_PCIE_* and PROF_NVLINK_* byte counters for clusters where fabric is the focus.
Remote write: forward DCGM metrics to long-term storage (Thanos, Mimir, VictoriaMetrics) — local Prometheus retention of 7-14 days is sufficient for live queries.

Fleet	Nodes	GPUs	MIG	Active series	Prometheus ingest	Retention storage (30d)
Single dev node	1	8	No	~800	~25 samples/s	~150 MB
Small cluster	8	64	No	~6,400	~210 samples/s	~1.2 GB
Production tenancy	32	256	No	~25,000	~830 samples/s	~5 GB
Production tenancy + MIG	32	256	Yes (1g.12gb x7)	~175,000	~5,800 samples/s	~35 GB
Yobitel London-1 region	128	1,024	Mixed	~120,000	~4,000 samples/s	~24 GB
Yobitel multi-region fleet	512	4,096	Mixed	~480,000	~16,000 samples/s	~96 GB
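
The series and ingest columns above follow from simple arithmetic. The sketch below reproduces it under the planning anchors stated in this section (about 100 series per GPU, a 7x multiplier for the 1g MIG profile, a 30 s scrape interval); the function name and defaults are assumptions, and the table rounds the results.

```python
def estimate(gpus, series_per_gpu=100, mig_multiplier=1, scrape_interval_s=30):
    """Return (active_series, samples_per_second) for a fleet."""
    active_series = gpus * series_per_gpu * mig_multiplier
    samples_per_second = active_series / scrape_interval_s
    return active_series, samples_per_second


for label, kwargs in {
    "256 GPUs, no MIG": {"gpus": 256},
    "256 GPUs, 1g MIG": {"gpus": 256, "mig_multiplier": 7},
}.items():
    series, rate = estimate(**kwargs)
    print(f"{label}: {series} series, {rate:.0f} samples/s")
```

Halving the scrape interval doubles the ingest rate but leaves the active series count unchanged, which is why interval and counter selection are separate levers.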

Limits and quotas

DCGM Exporter has very few hard limits. The constraints that matter in practice are driver and DCGM library version compatibility, MIG mode detection, profiling counter availability per architecture, and the cost of high-cardinality label combinations. The table below documents each ceiling and the operational lever for raising it.

Limit	Default	Ceiling	How to raise / work around
Scrape interval (Prometheus)	30 s	1 s (impractical)	Lower in `ServiceMonitor.interval`; watch GPU profiling cost.
Profiling fields per scrape	all enabled	GPU profiling unit shared with Nsight	Disable when Nsight Systems is running concurrently.
MIG instances per H100	1 (no MIG)	7 instances	Configure MIG manager via GPU Operator; exporter auto-detects.
Driver version	R535+	R570 / R580 for B200	Upgrade driver via GPU Operator; older drivers omit Blackwell fields.
DCGM library version	3.3+	3.3.x recommended for H200/B200	Pin via `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04` tag.
Pods per node label join	unlimited	Prometheus cardinality budget	Join via `kube-state-metrics`, not exporter-side relabel.
Container privilege	host driver caps required	n/a	Run with `SYS_ADMIN` and host `/dev/nvidia*`; the GPU Operator does this.
Shared memory for nv-hostengine	default	Container-defined	Mount `/dev/shm` at 64 MB minimum (default in stock image).
XID error history retained	since boot	Reset on driver reload	Use `rate(...[1h])` on counters, do not query absolute values.
Concurrency with `dcgmi`	shared daemon	n/a	Use either embedded or host `nv-hostengine`, not both.
Concurrency with Nsight profiler	exclusive	n/a	Profiling counters disabled while Nsight Compute holds the perf unit.

Warning: Running Nsight Compute on a host where DCGM Exporter is also reading profiling counters causes DCGM to silently drop the PROF_* fields until Nsight releases the profiling unit. If your Tensor Core activity series shows a flat-line gap, check whether a performance engineer is profiling on that host.
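
The limits table notes that since-boot counters reset on driver reload, which is why absolute values should not be queried. The sketch below shows the reset-aware arithmetic in simplified form (it omits the window extrapolation that Prometheus applies in increase()); the sample values are invented.

```python
def counter_increase(values):
    """Total increase across ordered samples, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(values, values[1:]):
        if current >= previous:
            total += current - previous
        else:
            # Counter restarted (e.g. driver reload): count from zero again.
            total += current
    return total


# Hypothetical DCGM_FI_DEV_ECC_SBE_VOL_TOTAL samples with a driver reload in the middle
print(counter_increase([3, 3, 5, 0, 1]))
```

Reading the last absolute value here (1) would hide the two errors that occurred before the reload.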

Observability

DCGM Exporter is itself an observability component, but its own health is worth alerting on. The exporter exposes a small set of meta-metrics: dcgm_exporter_field_collection_errors_total (counter — field reads that failed against DCGM), dcgm_exporter_scrape_duration_seconds (gauge — per-scrape cost), and standard Prometheus up (whether the scrape succeeded). The alerts below cover the high-value GPU-side incidents the exporter exists to surface, plus the meta-alerts for the exporter itself.

Thermal — DCGM_FI_DEV_GPU_TEMP > 85 for 5 min: cooling failure, hot-aisle thermal runaway, or fan failure on the chassis.
Memory pressure — DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95 for 10 min: workload one batch from CUDA OOM.
Reliability — increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0: double-bit ECC error → cordon node, schedule RMA.
Reliability - DCGM_FI_DEV_RETIRED_PENDING > 0: the field is a gauge, so compare it directly rather than wrapping it in increase(); DRAM cells are failing, so rebalance off this GPU before retirement.
Fabric — DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0 on an active workload: NVLink down, expect 4-10x training slowdown on TP/DP jobs.
Driver crash - DCGM_FI_DEV_XID_ERRORS > 0: the field is a gauge holding the most recent XID code, so any non-zero value is worth a page; XID 79 (GPU fallen off the bus) is a hardware reset condition.
Underutilisation — avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10: a paid-for GPU not earning its keep; investigate workload.
Exporter health — up{job="dcgm-exporter"} == 0: scrape failed, alert and check the DaemonSet for the affected node.

# Prometheus alerting rules — DCGM Exporter on GPU clusters
groups:
  - name: gpu-hardware
    interval: 30s
    rules:
      - alert: GPUThermalWarning
        expr: max by (Hostname, gpu) (DCGM_FI_DEV_GPU_TEMP) > 85
        for: 5m
        labels: { severity: warning, team: infra }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} at {{ $value }}C"

      - alert: GPUMemoryPressure
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM >95% — OOM imminent"

      - alert: GPUDoubleBitECC
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels: { severity: critical, team: infra }
        annotations:
          summary: "Double-bit ECC on {{ $labels.Hostname }} GPU {{ $labels.gpu }} — cordon & RMA"

      - alert: GPUXIDError
        # XID_ERRORS is a gauge holding the last XID code, so compare it directly
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels: { severity: critical }
        annotations:
          summary: "XID error on {{ $labels.Hostname }} GPU {{ $labels.gpu }} — investigate driver"

      - alert: NVLinkDown
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0
              and on (Hostname, gpu) DCGM_FI_PROF_PIPE_TENSOR_ACTIVE > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "NVLink down on {{ $labels.Hostname }} GPU {{ $labels.gpu }} during active workload"

      - alert: GPUUnderutilised
        expr: avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10
              and on (Hostname, gpu) DCGM_FI_DEV_POWER_USAGE > 100
        for: 1h
        labels: { severity: info, team: finops }
        annotations:
          summary: "{{ $labels.Hostname }} GPU {{ $labels.gpu }} <10% Tensor Core utilisation for 1h"

      - alert: DCGMExporterDown
        expr: up{job="dcgm-exporter"} == 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "DCGM Exporter scrape failed on {{ $labels.instance }}"

Tip: Alert on Tensor Core activity (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE), not DCGM_FI_DEV_GPU_UTIL. The former tells you whether the workload is actually using the GPU for matrix multiply — the foundation of LLM and CV inference economics. The latter conflates a fully-saturated SM with a single launched kernel.

Cost and FinOps

DCGM Exporter is free under Apache 2.0 — there is no licence cost. The operational cost is Prometheus storage for the metrics it produces, and the small SM-time overhead of profiling counters. The table below puts both in USD terms for typical fleet sizes, using mid-2026 pricing anchors for managed Prometheus (Grafana Cloud, AMP) and self-hosted Thanos on cheap object storage.

Profiling overhead: ~1-2 percent SM time when sampling all PROF_* fields at 1 s. At 30 s scrape it is unmeasurable.
Storage: assume ~1.3 bytes per sample compressed in Prometheus TSDB. Self-hosted Thanos on object storage drops effective cost to ~$0.025/GB-month.
Cardinality drivers: MIG mode (7x per GPU on 1g profile), per-pod attribution labels, and excessive gpu label values are the three knobs to watch.
FinOps integration: DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules — join with cAdvisor pod labels to attribute energy and $/Wh to tenants.
Yobitel customers see DCGM-derived per-tenant utilisation in the Yobibyte console at no extra cost; the underlying telemetry is included in the GPU rate.

Fleet	GPUs	Active series	Self-hosted Prom + Thanos (30d)	Managed Prom (Grafana Cloud, 30d)	Notes
Single dev node	8	~800	$0 (existing)	~$2	Negligible at this scale.
Production tenancy	256	~25,000	~$15/month (S3)	~$75/month	DCGM dominates GPU-side metrics.
Production tenancy + MIG	256	~175,000	~$80/month	~$520/month	MIG inflates 7x on 1g profiles.
Yobitel London-1 region	1,024	~120,000	~$60/month	~$360/month	Mixed MIG and full-GPU tenancies.
Yobitel multi-region fleet	4,096	~480,000	~$240/month	~$1,440/month	Federate via Thanos sidecar per cluster.
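
For the energy side of FinOps, the conversion from the millijoule counter to a billable figure is a unit exercise. The sketch below assumes two readings of DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION for one GPU and a hypothetical electricity price; neither the price nor the function name comes from this entry.

```python
MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


def energy_cost(start_mj, end_mj, price_per_kwh):
    """Return (kWh consumed, cost) between two counter readings in millijoules."""
    kwh = (end_mj - start_mj) / MJ_PER_KWH
    return kwh, kwh * price_per_kwh


# Hypothetical readings and price (0.25 currency units per kWh)
kwh, cost = energy_cost(0, 7.2e9, 0.25)
print(f"{kwh:.2f} kWh, cost {cost:.2f}")
```

The readings must come from the same boot session; a counter reset between them would make the difference negative.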

Security and compliance

DCGM Exporter requires privileged access to the host GPU device files (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm) and the SYS_ADMIN Linux capability to read profiling counters. The GPU Operator handles this transparently by deploying the pod with the right securityContext; if you deploy the exporter outside the operator you must mirror those settings. The exporter does not need network egress and should be locked down with a NetworkPolicy that only permits ingress from the Prometheus server pods that scrape it.

The exporter does not authenticate scrape requests. On shared clusters this is fine because the Service is internal and the pod selector is locked down, but if the metrics endpoint is exposed beyond the cluster boundary (federation across regions, external Grafana) place a reverse proxy with mTLS or bearer-token auth in front of it. Prometheus operator's bearerTokenSecret field on the ServiceMonitor is the standard pattern.

Regulatory posture is straightforward because DCGM metrics are telemetry counters with no customer payload. They contain GPU UUIDs, host names, namespace and pod names (when attribution is enabled), and numeric counters — no PII, no model weights, no inference payloads. For UK public-sector workloads (NCSC Cloud Security Principles, G-Cloud 14) this means DCGM metrics flow freely within the sovereign tenancy and to the central monitoring stack without additional control. For GDPR purposes the metrics are operational data, not personal data. The one caveat is the namespace and pod-name labels — on tenancies where the namespace name itself reveals a customer identity, scrub or rewrite those labels at the Prometheus relabel stage before federating to a multi-tenant store.
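
The label-scrubbing caveat can be illustrated outside Prometheus. The sketch below shows one possible transformation, replacing the namespace with a salted hash and dropping the pod name; in a real deployment the equivalent would live in relabel configuration, and the salt, prefix and function name here are all assumptions.

```python
import hashlib


def scrub_namespace(labels, salt):
    """Return a copy of labels with the namespace pseudonymised and the pod name removed."""
    scrubbed = dict(labels)  # never mutate the caller's dict
    if "namespace" in scrubbed:
        digest = hashlib.sha256((salt + scrubbed["namespace"]).encode()).hexdigest()
        scrubbed["namespace"] = "tenant-" + digest[:12]
    scrubbed.pop("pod", None)
    return scrubbed


# Hypothetical series labels
print(scrub_namespace({"namespace": "acme-prod", "pod": "vllm-0", "gpu": "0"}, salt="example-salt"))
```

A stable salt keeps the pseudonym consistent across scrapes, so per-tenant aggregation still works in the multi-tenant store.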

Warning: Never expose the DCGM Exporter /metrics endpoint to the public internet without an authenticating reverse proxy. The metrics themselves are non-sensitive but the GPU UUIDs, host names, and pod labels leak operational topology that helps an attacker target the cluster.

Migration and alternatives

Most production migrations to DCGM Exporter come from one of three origins: shell scripts polling nvidia-smi --query-gpu, the legacy Prometheus exporter from nvidia/gpu_exporter (community fork that predated the official one), or cloud-provider GPU telemetry (CloudWatch Container Insights, GCP Cloud Monitoring's GPU plugin, Azure Monitor). The table below documents the trade-offs of each migration path.

If you are currently polling nvidia-smi from a shell script and writing to a TSDB, the migration is largely a deletion: install the GPU Operator, point Prometheus at the exporter, retire the script. Field names change (utilization.gpu becomes DCGM_FI_DEV_GPU_UTIL) and you lose Tegra/Jetson-specific fields that the embedded nvidia-smi reports on edge devices but DCGM does not implement.

Migration source	Effort	What you gain	What you lose
nvidia-smi polling shell script	Low	Native Prometheus, MIG awareness, no fork overhead	Tegra/Jetson fields, custom parsing logic
Legacy nvidia/gpu_exporter	Low — drop in	Active NVIDIA support, full profiling counter set	Some custom labels — re-derive via relabel rules
AWS CloudWatch Container Insights	Medium	Open-source standard, portable across clouds	AWS-native alarms; re-implement in Prometheus rules
GCP Cloud Monitoring GPU plugin	Medium	Same as above	GKE Autopilot integration; re-wire dashboards
Azure Monitor GPU agent	Medium	Same as above	Azure-native portal integration
Datadog DCGM integration	Low	Self-hosted control, no per-host licence	Datadog's automated topology view
No GPU monitoring at all	Trivial via GPU Operator	Every benefit	n/a — this is the right migration
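
When migrating dashboards from an nvidia-smi script, a field-name map is usually the first artefact you need. The sketch below is illustrative only: the utilization.gpu mapping is stated in this entry, while the other three are inferred from the old and new snippets that follow and should be checked against your own queries.

```python
# nvidia-smi --query-gpu column -> DCGM Exporter metric name
FIELD_MAP = {
    "utilization.gpu": "DCGM_FI_DEV_GPU_UTIL",   # stated in this entry
    "memory.used": "DCGM_FI_DEV_FB_USED",        # inferred
    "memory.total": "DCGM_FI_DEV_FB_TOTAL",      # inferred
    "power.draw": "DCGM_FI_DEV_POWER_USAGE",     # inferred
}


def convert_row(columns, row):
    """Map one 'csv,noheader,nounits' row to {dcgm_metric: (gpu_index, value)}."""
    values = [value.strip() for value in row.split(",")]
    record = dict(zip(columns, values))
    gpu = record["index"]
    return {
        FIELD_MAP[column]: (gpu, float(record[column]))
        for column in columns
        if column in FIELD_MAP
    }


columns = ["index", "utilization.gpu", "memory.used", "memory.total", "power.draw"]
print(convert_row(columns, "0, 45, 1234, 81920, 250.5"))
```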

# Equivalent invocations: nvidia-smi shell script vs DCGM Exporter PromQL

# Old: nvidia-smi script writing to a TSDB via a Pushgateway.
# IFS=', ' splits on the comma and drops the space nvidia-smi prints after it.
# No timestamps are written: the Pushgateway rejects samples that carry them.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,power.draw \
    --format=csv,noheader,nounits | \
    while IFS=', ' read -r idx util mem_used mem_total power; do
        echo "gpu_util{gpu=\"$idx\"} $util"
        echo "gpu_mem_used{gpu=\"$idx\"} $mem_used"
        echo "gpu_power{gpu=\"$idx\"} $power"
    done | curl --data-binary @- http://pushgateway:9091/metrics/job/gpu

# New: DCGM Exporter — same data, native PromQL queries
# Per-GPU utilisation, last 5 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Memory pressure, sorted hot
topk(10, DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL)

# Power draw aggregated to node level for capacity planning
sum by (Hostname) (DCGM_FI_DEV_POWER_USAGE)

# Energy attribution per namespace, using the pod labels the exporter attaches
# when Kubernetes pod resolution (--kubernetes=true) is enabled. If your scrape
# config does not honour exporter labels, the label is exported_namespace instead.
sum by (namespace) (
  rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{namespace!=""}[5m])
) / 1000   # millijoules per second -> joules per second (watts)

Note: There are no serious open-source alternatives to DCGM Exporter on NVIDIA hardware. AMD ships rocm-smi-exporter for ROCm, Intel ships xpum-exporter for Gaudi and Max GPUs — both follow the same DaemonSet + Prometheus pattern but with vendor-specific field names. The pattern is universal; the implementation is per vendor.

Troubleshooting

The error table below covers the failure modes that account for almost all real DCGM Exporter incidents. Each row maps an observable symptom to the underlying cause and the minimum-viable fix. Most issues trace back to one of three root causes: driver version mismatch, MIG configuration drift, or missing label-join prerequisites.

Symptom	Cause	Fix
No metrics scraped, pod CrashLoopBackOff	Driver missing or version mismatch with DCGM library	Verify `nvidia-smi` works on the host; align driver to DCGM's supported matrix; reinstall via GPU Operator.
Metrics endpoint up but `PROF_*` fields missing	Profiling unit not available (older driver) or held by Nsight	Upgrade driver to R535+; confirm no Nsight Compute session is active on the host.
MIG mode enabled but no per-instance series	MIG manager not configured or instances not created	Apply a MIG profile via the GPU Operator's `migManager.config` ConfigMap; reboot or `nvidia-smi mig -cgi` manually.
Per-pod labels (`pod`, `namespace`) missing	gpu-feature-discovery not deployed, or kubernetes pod resolution flag missing	Install GFD via GPU Operator; set `--kubernetes=true` on the exporter.
Cardinality explosion in Prometheus	MIG enabled with 1g profile across many H100s (7x multiplier)	Filter unused profile labels; aggregate at scrape time; raise Prometheus series budget.
`DCGM_FI_DEV_GPU_UTIL` always 0 on H100	Hopper reports activity differently; check `PROF_GR_ENGINE_ACTIVE` instead	Use `DCGM_FI_PROF_GR_ENGINE_ACTIVE` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` on Hopper/Blackwell.
NVLink bandwidth always 0	NVLink topology not detected, or P2P disabled	Run `nvidia-smi topo -m`; confirm NVLink visible; check `NVLINK_BANDWIDTH_TOTAL` field is in counter list.
Exporter pod evicted under memory pressure	`nv-hostengine` leaks memory on long-running deployments	Set pod memory limit at 512 MB; restart DaemonSet weekly via a CronJob.
Scrape duration spikes when MIG profile changes	Exporter re-discovers MIG geometry on every scrape after change	Expected; settles after 1-2 scrape cycles; do not alert on transient spikes.
Metrics report 0 W power on consumer GPUs	Power reporting unsupported on RTX consumer cards in some BIOS modes	DCGM is designed for data-centre SKUs; switch to data-centre GPUs or accept the gap.
`up == 0` on one node only	kubelet evicted DaemonSet pod due to node memory pressure	Increase pod priorityClass to `system-node-critical`; check node memory limits.
Conflict with `dcgmi` CLI on same host	Two `nv-hostengine` instances competing for the GPU's DCGM channel	Use embedded `nv-hostengine` (default) OR host service, not both; check `systemctl status nvidia-dcgm`.
Tensor Core activity flat-lines mid-day	Performance engineer running Nsight Compute on the host	Coordinate profiler sessions; warn before profiling production GPUs.
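Several rows of the table can be checked mechanically from a single scrape body. The following is a minimal sketch, not part of the exporter: it parses the Prometheus text format and reports which of three symptoms (no `PROF_*` fields, no pod labels, no MIG instance labels) the scrape shows. The function names and the wording of the findings are illustrative assumptions; the label names `pod` and `GPU_I_ID` are the ones used elsewhere in this entry.

```python
import re

# One sample line of the Prometheus text format: name, optional {labels}, value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict) for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, raw_labels, _value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels or ""))))
    return samples


def triage(text):
    """Map a scrape body to the troubleshooting rows it most likely matches."""
    dcgm = [(n, l) for n, l in parse_samples(text) if n.startswith("DCGM_FI_")]
    if not dcgm:
        return ["no DCGM series: check driver, DCGM version and pod status"]
    findings = []
    if not any(n.startswith("DCGM_FI_PROF_") for n, _ in dcgm):
        findings.append("PROF_* fields missing: check driver version and Nsight sessions")
    if not any(l.get("pod") for _, l in dcgm):
        findings.append("pod labels missing: check Kubernetes pod resolution")
    if not any("GPU_I_ID" in l for _, l in dcgm):
        findings.append("no MIG instance labels: expected only if MIG is enabled")
    return findings
```

Feed it the body returned by `curl -s http://<node>:9400/metrics`; an empty list means none of the three symptoms is present. The MIG finding is informational on nodes where MIG is intentionally off.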

Where this fits in the Yobitel stack

DCGM Exporter is the metrics backbone of every GPU host Yobitel operates. Every NeoCloud worker node, every Yobitel Edge AI appliance, and every Yobibyte tenancy worker runs the exporter as part of the standard NVIDIA GPU Operator install. The metrics feed three downstream consumers: the regional Prometheus stack that powers customer-facing dashboards in the Yobibyte console, the Thanos federation that retains 12 months of telemetry for capacity planning, and InferenceBench's scoring pipeline that records GPU-side energy and utilisation alongside model throughput numbers.

On Yobibyte tenancies, the metrics produced by DCGM Exporter are exposed back to the customer through three surfaces. The console's per-workspace dashboard shows real-time GPU utilisation, framebuffer, power, and thermal headroom for every workload the tenant is running. The Prometheus federation endpoint lets customers scrape their own slice of the fleet metrics into their own observability stack: a Prometheus federate URL with bearer-token auth, scoped to the tenant's GPU UUIDs. The FOCUS-conformant billing export joins DCGM energy counters with cAdvisor pod labels to produce per-workload $/Wh accounting.
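To make the federation surface concrete, here is a hedged sketch of what a request scoped to a tenant's GPUs could look like. Only the `/federate` path and the `match[]` parameter are standard Prometheus; the host name, the token, and the assumption that scoping is done on the exporter's `UUID` label are hypothetical and not confirmed by this entry.

```python
from urllib.parse import urlencode
from urllib.request import Request


def tenant_selector(gpu_uuids):
    """Series selector limited to DCGM metrics for the given GPU UUIDs."""
    uuid_regex = "|".join(sorted(gpu_uuids))
    return '{__name__=~"DCGM_FI_.+",UUID=~"%s"}' % uuid_regex


def build_federate_request(base_url, token, gpu_uuids):
    """Build (without sending) a GET request for Prometheus' /federate endpoint."""
    query = urlencode({"match[]": tenant_selector(gpu_uuids)})
    request = Request(f"{base_url.rstrip('/')}/federate?{query}")
    # Bearer-token auth, as described above; the token value is a placeholder.
    request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical endpoint and credentials, for illustration only.
req = build_federate_request(
    "https://metrics.example.invalid", "REPLACE_ME", ["GPU-1111", "GPU-2222"]
)
print(req.full_url)
```

In practice the consumer is usually another Prometheus with a scrape job pointing at the same path and selector; the Python form is only meant to show the URL shape.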

On UK and EU sovereign tenancies (NCSC Cloud Security Principles, G-Cloud 14, OFFICIAL-handling), DCGM metrics remain inside the sovereign region's Prometheus and never federate to the global multi-tenant store. The metrics themselves contain no customer payload, but Yobitel applies the same data-residency boundary to operational telemetry as to inference payloads — sovereign customers see a one-region observability stack with no cross-region replication.

Quick start

# 1. Install via the NVIDIA GPU Operator (recommended on Kubernetes)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set dcgmExporter.enabled=true \
    --set dcgmExporter.serviceMonitor.enabled=true \
    --set dcgmExporter.serviceMonitor.interval=30s \
    --set toolkit.enabled=true \
    --set driver.enabled=true

# Verify the DaemonSet is on every GPU node and scrape works
kubectl -n gpu-operator get ds nvidia-dcgm-exporter
kubectl -n gpu-operator run curl-test --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -s http://nvidia-dcgm-exporter:9400/metrics | head -40

# 2. Standalone Helm chart (no GPU Operator, but driver already installed)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
    --namespace monitoring --create-namespace \
    --set serviceMonitor.enabled=true \
    --set serviceMonitor.interval=30s

# 3. Bare-metal Docker (host has CUDA driver + nvidia-container-toolkit)
docker run -d --gpus all --rm --cap-add SYS_ADMIN \
    -p 9400:9400 --name dcgm-exporter \
    nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04

# Sample lines look like NAME{labels} value, so anchor the match on the opening brace
curl -s http://localhost:9400/metrics | grep -E '^DCGM_FI_DEV_(GPU_UTIL|FB_USED|POWER_USAGE|GPU_TEMP)[{]'

Tip: Always install via the NVIDIA GPU Operator unless you have an explicit reason not to. The operator handles driver, runtime, MIG manager, gpu-feature-discovery, the device plugin, and DCGM Exporter in a single, version-pinned bundle; running them piecemeal is the most common source of missing-metric issues.

How it works

nv-hostengine daemon: polls NVML and CUPTI; can run embedded in the exporter pod or as a host service shared with dcgmi.
DCGM_FI_DEV_* fields: cheap device telemetry — utilisation, memory, power, temperature, clocks, ECC, link state. Safe at any scrape interval.
DCGM_FI_PROF_* fields: profiling counters — SM occupancy, Tensor Core activity, NVLink/PCIe bytes. Cost ~1-2 percent SM at high sampling rates.
MIG awareness: per-instance series with GPU_I_ID and GPU_I_PROFILE labels; board-level metrics remain per physical device.
Per-pod attribution: join DCGM metrics with cAdvisor's container_accelerator_* labels or gpu-feature-discovery node labels, either with PromQL joins at query time or with Prometheus relabel rules at scrape time.
Counter selection: driven by a CSV config file mounted into the pod; default set is sensible, custom sets are common at scale.

Reference and metric catalogue

DCGM field	Prometheus type	Unit	What it measures
DCGM_FI_DEV_GPU_UTIL	gauge	percent (0-100)	Fraction of sample window with at least one active kernel — coarse
DCGM_FI_DEV_MEM_COPY_UTIL	gauge	percent	Memory copy engine utilisation — H2D, D2H, D2D traffic
DCGM_FI_PROF_GR_ENGINE_ACTIVE	gauge	ratio 0-1	Fraction of time the graphics/compute engine was active
DCGM_FI_PROF_SM_ACTIVE	gauge	ratio 0-1	Fraction of SMs with at least one warp resident
DCGM_FI_PROF_SM_OCCUPANCY	gauge	ratio 0-1	Average warp-slot occupancy across active SMs
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE	gauge	ratio 0-1	Tensor Core pipe activity — the real GPU saturation signal for LLMs
DCGM_FI_PROF_DRAM_ACTIVE	gauge	ratio 0-1	HBM memory channels active — bandwidth-bound workload signal
DCGM_FI_PROF_PCIE_TX_BYTES	counter	bytes	Cumulative PCIe transmit bytes, from the GPU's perspective (device to host)
DCGM_FI_PROF_PCIE_RX_BYTES	counter	bytes	Cumulative PCIe receive bytes, from the GPU's perspective (host to device)
DCGM_FI_PROF_NVLINK_TX_BYTES	counter	bytes	Cumulative NVLink transmit bytes across all links
DCGM_FI_PROF_NVLINK_RX_BYTES	counter	bytes	Cumulative NVLink receive bytes across all links
DCGM_FI_DEV_FB_USED	gauge	MiB	Framebuffer (VRAM) reserved by all CUDA contexts
DCGM_FI_DEV_FB_FREE	gauge	MiB	Framebuffer free for allocation
DCGM_FI_DEV_FB_TOTAL	gauge	MiB	Total physical framebuffer on the device
DCGM_FI_DEV_POWER_USAGE	gauge	watts	Instantaneous board power draw
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION	counter	millijoules	Cumulative energy used since boot — basis of $/Wh accounting
DCGM_FI_DEV_GPU_TEMP	gauge	celsius	GPU die temperature
DCGM_FI_DEV_MEMORY_TEMP	gauge	celsius	HBM stack temperature (Hopper/Blackwell)
DCGM_FI_DEV_SM_CLOCK	gauge	MHz	Current SM clock speed
DCGM_FI_DEV_MEM_CLOCK	gauge	MHz	Current memory clock speed
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL	counter	errors	Single-bit ECC errors corrected since boot
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL	counter	errors	Double-bit ECC errors since boot — RMA threshold
DCGM_FI_DEV_RETIRED_PENDING	gauge	pages	Memory pages pending retirement (DRAM failing soon)
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL	gauge	MB/s	Aggregate NVLink bandwidth across all links
DCGM_FI_DEV_XID_ERRORS	gauge	XID code	Most recent driver-reported XID error code; a non-zero value is a crash/hang signal
DCGM_FI_DEV_GPU_UTIL_SAMPLES	gauge	samples	Number of underlying samples in the GPU util window

Warning: DCGM_FI_DEV_GPU_UTIL is the most misread metric in the field. It reports the fraction of sample windows where at least one kernel was running — a single tiny kernel hogging one SM reports 100 percent. For real saturation watch DCGM_FI_PROF_SM_OCCUPANCY and DCGM_FI_PROF_PIPE_TENSOR_ACTIVE together. Many production dashboards still alert on the wrong metric.

Workload patterns

# Custom counter file mounted into the exporter for per-pod attribution
# saved as default-counters.csv, mounted at /etc/dcgm-exporter/counters.csv
#
# format: DCGM_FIELD_ID, prometheus_type, help
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory copy engine utilization
DCGM_FI_DEV_FB_USED, gauge, Framebuffer used (MiB)
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer free (MiB)
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (W)
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Energy (mJ)
DCGM_FI_DEV_GPU_TEMP, gauge, GPU die temperature (C)
DCGM_FI_DEV_MEMORY_TEMP, gauge, HBM stack temperature (C)
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock (MHz)
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock (MHz)
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Single-bit ECC errors
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Double-bit ECC errors
DCGM_FI_DEV_XID_ERRORS, gauge, Last XID error code
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Graphics/compute engine active
DCGM_FI_PROF_SM_ACTIVE, gauge, SMs with at least one warp
DCGM_FI_PROF_SM_OCCUPANCY, gauge, Warp slot occupancy
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Tensor Core pipe active
DCGM_FI_PROF_DRAM_ACTIVE, gauge, HBM channels active
DCGM_FI_PROF_PCIE_TX_BYTES, counter, PCIe transmit bytes
DCGM_FI_PROF_PCIE_RX_BYTES, counter, PCIe receive bytes
DCGM_FI_PROF_NVLINK_TX_BYTES, counter, NVLink transmit bytes
DCGM_FI_PROF_NVLINK_RX_BYTES, counter, NVLink receive bytes
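A typo in the counters file tends to surface only as a missing series, so it can help to lint the file before mounting it. The sketch below assumes the three-column layout shown above (field, type, help) and assumes that `gauge`, `counter` and `label` are the accepted type values; both the helper names and that allowed set are assumptions to verify against the exporter version you run.

```python
# Assumed set of accepted type values; adjust to your exporter version.
ALLOWED_TYPES = {"gauge", "counter", "label"}


def parse_counters(csv_text):
    """Parse a counters file into (field, prom_type, help_text) tuples."""
    rows = []
    for number, raw in enumerate(csv_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no field
        # Split on the first two commas only, so help text may contain commas.
        parts = [part.strip() for part in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"line {number}: expected 3 comma-separated columns")
        field, prom_type, help_text = parts
        if not field.startswith("DCGM_"):
            raise ValueError(f"line {number}: unexpected field name {field!r}")
        if prom_type not in ALLOWED_TYPES:
            raise ValueError(f"line {number}: unknown type {prom_type!r}")
        rows.append((field, prom_type, help_text))
    return rows


def profiling_fields(rows):
    """Fields that use the GPU profiling unit and carry sampling overhead."""
    return [field for field, _type, _help in rows if field.startswith("DCGM_FI_PROF_")]
```

Running `profiling_fields(parse_counters(open("default-counters.csv").read()))` lists the counters that compete with Nsight for the profiling unit, which is a quick way to review a custom set before rollout.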

Tip: Pattern B (per-pod attribution) requires gpu-feature-discovery and the device plugin to be running, AND the exporter to be configured with kubernetes pod resolution (--kubernetes=true --kubernetes-gpu-id-type=device-name). Both are set automatically by the GPU Operator. Missing pod labels almost always means one of those two prerequisites is not in place.

Sizing and capacity planning

Default scrape interval: 30 s. Drop to 15 s for SLA-critical inference fleets; raise to 60 s for batch-only training clusters.
Counter selection: a 12-counter minimal set (utilisation, memory, power, temperature, ECC, XID) halves series count at modest visibility cost.
MIG inflation: a GPU partitioned into seven 1g.12gb instances carries roughly 7x the per-GPU series. Use this multiplier when planning Prometheus retention for MIG-heavy clusters.
Profiling counters: include PROF_PIPE_TENSOR_ACTIVE and PROF_DRAM_ACTIVE in steady state; reserve PROF_PCIE_* and PROF_NVLINK_* byte counters for clusters where fabric is the focus.
Remote write: forward DCGM metrics to long-term storage (Thanos, Mimir, VictoriaMetrics) — local Prometheus retention of 7-14 days is sufficient for live queries.

Fleet	Nodes	GPUs	MIG	Active series	Prometheus ingest	Retention storage (30d)
Single dev node	1	8	No	~800	~25 samples/s	~150 MB
Small cluster	8	64	No	~6,400	~210 samples/s	~1.2 GB
Production tenancy	32	256	No	~25,000	~830 samples/s	~5 GB
Production tenancy + MIG	32	256	Yes (1g.12gb x7)	~175,000	~5,800 samples/s	~35 GB
Yobitel London-1 region	128	1,024	Mixed	~120,000	~4,000 samples/s	~24 GB
Yobitel multi-region fleet	512	4,096	Mixed	~480,000	~16,000 samples/s	~96 GB
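The arithmetic behind the table can be sketched as below. The figure of roughly 100 series per GPU is inferred from the first row (8 GPUs, about 800 series), and the default of 2 bytes per stored sample is an assumption chosen to land near the table's storage column; neither is an official number, so treat both as parameters to tune.

```python
from dataclasses import dataclass


@dataclass
class FleetEstimate:
    active_series: int
    samples_per_second: float
    storage_bytes: float


def estimate_fleet(gpus, series_per_gpu=100, mig_multiplier=1,
                   scrape_interval_s=30, retention_days=30, bytes_per_sample=2.0):
    """Rough Prometheus load estimate for a DCGM Exporter fleet."""
    # Every active series yields one sample per scrape.
    active_series = gpus * series_per_gpu * mig_multiplier
    samples_per_second = active_series / scrape_interval_s
    total_samples = samples_per_second * retention_days * 86_400
    return FleetEstimate(active_series, samples_per_second,
                         total_samples * bytes_per_sample)


single_node = estimate_fleet(gpus=8)
mig_tenancy = estimate_fleet(gpus=256, mig_multiplier=7)
print(single_node.active_series, round(single_node.samples_per_second, 1))
print(mig_tenancy.active_series, round(mig_tenancy.storage_bytes / 1e9, 1), "GB")
```

The results only approximate the table rows, which are themselves rounded; the point is to see how the MIG multiplier and the scrape interval move the numbers.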

Limits and quotas

Limit	Default	Ceiling	How to raise / work around
Scrape interval (Prometheus)	30 s	1 s (impractical)	Lower in `ServiceMonitor.interval`; watch GPU profiling cost.
Profiling fields per scrape	all enabled	GPU profiling unit shared with Nsight	Disable when Nsight Systems is running concurrently.
MIG instances per H100	1 (no MIG)	7 instances	Configure MIG manager via GPU Operator; exporter auto-detects.
Driver version	R535+	R570 / R580 for B200	Upgrade driver via GPU Operator; older drivers omit Blackwell fields.
DCGM library version	3.3+	3.3.x recommended for H200/B200	Pin via `nvcr.io/nvidia/k8s/dcgm-exporter:3.3.7-3.4.2-ubuntu22.04` tag.
Pods per node label join	unlimited	Prometheus cardinality budget	Join via `kube-state-metrics`, not exporter-side relabel.
Container privilege	host driver caps required	n/a	Run with `SYS_ADMIN` and host `/dev/nvidia*`; the GPU Operator does this.
Shared memory for nv-hostengine	default	Container-defined	Mount `/dev/shm` at 64 MB minimum (default in stock image).
XID error history retained	since boot	Reset on driver reload	The field holds the last XID code rather than a count; alert on non-zero values or `changes(...[1h])`, not `rate()`.
Concurrency with `dcgmi`	shared daemon	n/a	Use either embedded or host `nv-hostengine`, not both.
Concurrency with Nsight profiler	exclusive	n/a	Profiling counters disabled while Nsight Compute holds the perf unit.

Warning: Running Nsight Compute on a host where DCGM Exporter is also reading profiling counters causes DCGM to silently drop the PROF_* fields until Nsight releases the profiling unit. If your Tensor Core activity series shows a flat-line gap, check whether a performance engineer is profiling on that host.

Observability

Thermal: DCGM_FI_DEV_GPU_TEMP > 85 for 5 min indicates cooling failure, hot-aisle thermal runaway, or fan failure on the chassis.
Memory pressure: DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL > 0.95 for 10 min means the workload is one batch away from CUDA OOM.
Reliability: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0 is a double-bit ECC error; cordon the node and schedule an RMA.
Reliability: DCGM_FI_DEV_RETIRED_PENDING > 0 means DRAM cells are failing; rebalance off this GPU before retirement. The field is a gauge, so test its value directly rather than with increase().
Fabric: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0 on an active workload means NVLink is down; expect a 4-10x training slowdown on TP/DP jobs.
Driver crash: DCGM_FI_DEV_XID_ERRORS > 0 means the driver reported an XID; any XID event is worth a page, and XID 79 (GPU fallen off the bus) is a hardware reset condition. The field is a gauge holding the last XID code, so increase() does not apply.
Underutilisation: avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10 is a paid-for GPU not earning its keep; investigate the workload.
Exporter health: up{job="dcgm-exporter"} == 0 means the scrape failed; alert and check the DaemonSet on the affected node.

# Prometheus alerting rules — DCGM Exporter on GPU clusters
groups:
  - name: gpu-hardware
    interval: 30s
    rules:
      - alert: GPUThermalWarning
        expr: max by (Hostname, gpu) (DCGM_FI_DEV_GPU_TEMP) > 85
        for: 5m
        labels: { severity: warning, team: infra }
        annotations:
          summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} at {{ $value }}C"

      - alert: GPUMemoryPressure
        expr: (DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL) > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "GPU {{ $labels.gpu }} VRAM >95% — OOM imminent"

      - alert: GPUDoubleBitECC
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels: { severity: critical, team: infra }
        annotations:
          summary: "Double-bit ECC on {{ $labels.Hostname }} GPU {{ $labels.gpu }} — cordon & RMA"

      - alert: GPUXIDError
        # DCGM_FI_DEV_XID_ERRORS is a gauge holding the last XID code, so test it directly.
        expr: DCGM_FI_DEV_XID_ERRORS > 0
        labels: { severity: critical }
        annotations:
          summary: "XID error on {{ $labels.Hostname }} GPU {{ $labels.gpu }} — investigate driver"

      - alert: NVLinkDown
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL == 0
              and on (Hostname, gpu) DCGM_FI_PROF_PIPE_TENSOR_ACTIVE > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "NVLink down on {{ $labels.Hostname }} GPU {{ $labels.gpu }} during active workload"

      - alert: GPUUnderutilised
        expr: avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[1h]) < 0.10
              and on (Hostname, gpu) DCGM_FI_DEV_POWER_USAGE > 100
        for: 1h
        labels: { severity: info, team: finops }
        annotations:
          summary: "{{ $labels.Hostname }} GPU {{ $labels.gpu }} <10% Tensor Core utilisation for 1h"

      - alert: DCGMExporterDown
        expr: up{job="dcgm-exporter"} == 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "DCGM Exporter scrape failed on {{ $labels.instance }}"

Tip: Alert on Tensor Core activity (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE), not DCGM_FI_DEV_GPU_UTIL. The former tells you whether the workload is actually using the GPU for matrix multiply — the foundation of LLM and CV inference economics. The latter conflates a fully-saturated SM with a single launched kernel.

Cost and FinOps

Profiling overhead: ~1-2 percent SM time when sampling all PROF_* fields at 1 s. At 30 s scrape it is unmeasurable.
Storage: assume ~1.3 bytes per sample compressed in Prometheus TSDB. Self-hosted Thanos on object storage drops effective cost to ~$0.025/GB-month.
Cardinality drivers: MIG mode (7x per GPU on 1g profile), per-pod attribution labels, and excessive gpu label values are the three knobs to watch.
FinOps integration: DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION is a counter in millijoules — join with cAdvisor pod labels to attribute energy and $/Wh to tenants.
Yobitel customers see DCGM-derived per-tenant utilisation in the Yobibyte console at no extra cost; the underlying telemetry is included in the GPU rate.

Fleet	GPUs	Active series	Self-hosted Prom + Thanos (30d)	Managed Prom (Grafana Cloud, 30d)	Notes
Single dev node	8	~800	$0 (existing)	~$2	Negligible at this scale.
Production tenancy	256	~25,000	~$15/month (S3)	~$75/month	DCGM dominates GPU-side metrics.
Production tenancy + MIG	256	~175,000	~$80/month	~$520/month	MIG inflates 7x on 1g profiles.
Yobitel London-1 region	1,024	~120,000	~$60/month	~$360/month	Mixed MIG and full-GPU tenancies.
Yobitel multi-region fleet	4,096	~480,000	~$240/month	~$1,440/month	Federate via Thanos sidecar per cluster.
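The energy-attribution bullet above reduces to a unit conversion on two readings of DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION. The sketch below assumes the counter restarts from zero when the driver reloads and uses a hypothetical electricity price; the function names are illustrative and not part of any Yobitel or NVIDIA API.

```python
def energy_wh(start_mj, end_mj):
    """Convert two readings of the millijoule energy counter into watt-hours."""
    if end_mj < start_mj:
        # Assumed reset behaviour: the counter restarted from zero in between,
        # so the later reading is the best available delta.
        delta_mj = end_mj
    else:
        delta_mj = end_mj - start_mj
    joules = delta_mj / 1000.0
    return joules / 3600.0  # 1 Wh = 3600 J


def energy_cost(start_mj, end_mj, price_per_kwh):
    """Cost of the energy between two readings at a given price per kWh."""
    return energy_wh(start_mj, end_mj) / 1000.0 * price_per_kwh


# A GPU drawing a steady 700 W for one hour adds 2,520,000,000 mJ to the counter.
wh = energy_wh(0, 2_520_000_000)
print(wh, energy_cost(0, 2_520_000_000, price_per_kwh=0.20))  # price is hypothetical
```

In a live system the same conversion is usually done in PromQL with `increase()`, which already handles counter resets; the Python form is useful for billing exports that work from raw samples.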

Security and compliance

Warning: Never expose the DCGM Exporter /metrics endpoint to the public internet without an authenticating reverse proxy. The metrics themselves are non-sensitive but the GPU UUIDs, host names, and pod labels leak operational topology that helps an attacker target the cluster.

Migration and alternatives

Migration source	Effort	What you gain	What you lose
nvidia-smi polling shell script	Low	Native Prometheus, MIG awareness, no fork overhead	Tegra/Jetson fields, custom parsing logic
Legacy nvidia/gpu_exporter	Low — drop in	Active NVIDIA support, full profiling counter set	Some custom labels — re-derive via relabel rules
AWS CloudWatch Container Insights	Medium	Open-source standard, portable across clouds	AWS-native alarms; re-implement in Prometheus rules
GCP Cloud Monitoring GPU plugin	Medium	Same as above	GKE Autopilot integration; re-wire dashboards
Azure Monitor GPU agent	Medium	Same as above	Azure-native portal integration
Datadog DCGM integration	Low	Self-hosted control, no per-host licence	Datadog's automated topology view
No GPU monitoring at all	Trivial via GPU Operator	Every benefit	n/a — this is the right migration

# Equivalent invocations: nvidia-smi shell script vs DCGM Exporter PromQL

# Old: nvidia-smi script pushing to a Pushgateway in front of the TSDB
# (samples carry no timestamps: the Pushgateway rejects pushes that include them)
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,power.draw \
    --format=csv,noheader,nounits | \
    while IFS=, read -r idx util mem_used mem_total power; do
        echo "gpu_util{gpu=\"$idx\"} $util"
        echo "gpu_mem_used{gpu=\"$idx\"} $mem_used"
        echo "gpu_power{gpu=\"$idx\"} $power"
    done | curl --data-binary @- http://pushgateway:9091/metrics/job/gpu

# New: DCGM Exporter — same data, native PromQL queries
# Per-GPU utilisation, last 5 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))

# Memory pressure, sorted hot
topk(10, DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL)

# Power draw aggregated to node level for capacity planning
sum by (Hostname) (DCGM_FI_DEV_POWER_USAGE)

# Energy attribution per namespace
# Assumes the exporter runs with Kubernetes pod resolution, so each series already
# carries namespace and pod labels. If the scrape config does not honour exporter
# labels, they surface as exported_namespace / exported_pod instead.
sum by (namespace) (
  rate(DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION{namespace!=""}[5m])
) / 1000   # mJ/s to J/s, i.e. watts
