TL;DR
- CNCF Graduated (2018), Apache 2.0. Single Go binary combining a scrape engine, an embedded time-series database (`tsdb`), a recording-and-alerting rule evaluator and a PromQL HTTP query API — the second CNCF project ever to graduate, after Kubernetes.
- Pull-based scrape model: targets expose `/metrics` in OpenMetrics text format, the server discovers them via `scrape_configs` (static, Kubernetes, Consul, EC2, GCE, Azure or any file-SD plugin) and ingests samples on a 15-60 s cadence.
- Configured through five top-level blocks (`global`, `scrape_configs`, `rule_files`, `alerting` and `remote_write`); the files listed under `rule_files` hold the recording rules that pre-compute expensive PromQL and the alerting rules that fire alerts via Alertmanager.
- Scales to ~10-15 M active series per node; horizontal scale, multi-tenancy and 12-month retention come from `remote_write` to Thanos, Cortex, Grafana Mimir or VictoriaMetrics. Exemplars link metrics to OpenTelemetry trace IDs.
- Default metrics fabric on every Yobitel NeoCloud region and the scrape format Yobibyte emits for customer-side observability — point your own Prometheus at a Yobibyte tenant-scoped federation endpoint and your existing dashboards keep working.
Overview
Prometheus is a single Go binary that does five things in one process: it discovers targets through `scrape_configs`, it pulls metrics from each target's `/metrics` endpoint on a fixed cadence, it stores the resulting samples in an embedded time-series database (`tsdb`), it evaluates recording and alerting rules against that database, and it answers PromQL queries over an HTTP API. There is no agent, no message bus and no separate ingestion service. That single-binary shape — plus the OpenMetrics text exposition format and PromQL — is the source of its dominance across cloud-native infrastructure.
Originally written at SoundCloud in 2012, donated to the CNCF in 2016 and graduated in 2018, Prometheus has become the de facto metrics backbone for every Kubernetes platform, every GPU runtime and every AI-infrastructure stack that ships dashboards. DCGM Exporter, KServe, vLLM, TensorRT-LLM, NVIDIA GPU Operator, kube-state-metrics, node-exporter and the NVIDIA Network Operator all emit Prometheus-shaped metrics by default. If you operate AI compute, you operate Prometheus — or you operate a managed system that speaks PromQL.
Yobitel ships Prometheus as the metrics fabric of every NeoCloud region. Each region runs a regional Prometheus cluster scraping DCGM Exporter, node-exporter, vLLM/TensorRT-LLM inference replicas, the NVIDIA Network Operator and the regional control plane; samples remote-write into a long-term Thanos store backed by S3-compatible object storage. Yobibyte exposes the same shape back to customers: every tenant has a federation endpoint that returns Prometheus text for that tenant's scope, so customers can scrape Yobibyte directly from their own Prometheus, Grafana Cloud or Datadog agent without any Yobitel-specific adapter.
This entry helps you stand up a production-grade Prometheus for an AI fleet, write the recording and alerting rules that catch the incidents that actually happen, scope it correctly for GPU-side cardinality and federate it into long-term storage — or, equivalently, configure your existing Prometheus to scrape Yobibyte and Yobitel NeoCloud regions through the public federation surface.
Quick start
The example below installs the kube-prometheus-stack Helm chart — Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter and kube-state-metrics in one bundle — onto a Kubernetes cluster, wires a ServiceMonitor for DCGM Exporter, and verifies the scrape with a PromQL query. The second block is the equivalent standalone Prometheus running on a bare-metal host. The third block points a customer-owned Prometheus at a Yobibyte tenant federation endpoint.
```bash
# 1. Install kube-prometheus-stack with sensible defaults for an AI cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install kps prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=14d \
  --set prometheus.prometheusSpec.retentionSize=200GiB \
  --set prometheus.prometheusSpec.replicas=2 \
  --set prometheus.prometheusSpec.scrapeInterval=30s \
  --set prometheus.prometheusSpec.evaluationInterval=30s \
  --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true \
  --set prometheus.prometheusSpec.enableFeatures="{exemplar-storage,native-histograms}" \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

# Verify Prometheus is up and scraping
kubectl -n monitoring get prometheus,statefulset,service
kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &

# Query: total tokens-per-second across a vLLM fleet.
# -G with --data-urlencode stops curl from treating [1m] as a URL glob range.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(vllm:generation_tokens_total[1m]))' | jq

# 2. Standalone Prometheus on bare-metal
# (the lifecycle and admin-API flags below belong on trusted networks only)
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=14d \
  --storage.tsdb.retention.size=200GB \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --enable-feature=exemplar-storage,native-histograms

# 3. Scrape a Yobibyte tenant federation endpoint from your own Prometheus
# (illustrative scrape_config; paste into your prometheus.yml)
# scrape_configs:
#   - job_name: yobibyte-tenant
#     scrape_interval: 30s
#     honor_labels: true
#     metrics_path: /federate
#     params:
#       'match[]':
#         - '{__name__=~"DCGM_FI_.*|vllm_.*|yobibyte_.*"}'
#     scheme: https
#     authorization:
#       type: Bearer
#       credentials_file: /etc/prometheus/yobibyte-token
#     static_configs:
#       - targets: ['observability.london-1.yobitel.com']
```

Always run Prometheus with `--web.enable-lifecycle` and `--web.enable-admin-api` disabled in untrusted environments: the first lets any caller reload the config or shut the server down, the second lets any caller delete series. The kube-prometheus-stack defaults are safe; bare-metal operators sometimes leave both on.
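The instant-query call in step 1 of the quick start returns a JSON envelope rather than a bare number. As a minimal sketch, assuming the standard `/api/v1/query` response shape (`status`, `data.resultType`, `data.result`), the Python below extracts the value from a vector result. The sample payload and the `first_value` helper are illustrative assumptions, not output captured from a real cluster.

```python
import json

# Illustrative payload in the shape /api/v1/query returns for an instant
# vector; the timestamp and value are made up for this example.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {}, "value": [1700000000.0, "48211.5"]}
    ]
  }
}
"""


def first_value(body: str) -> float:
    """Return the first sample value of an instant-vector response."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError(f"query failed: {doc.get('error', 'unknown error')}")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    if not data["result"]:
        raise LookupError("query returned no series")
    # Each sample is [unix_timestamp, "value-as-string"].
    _timestamp, raw = data["result"][0]["value"]
    return float(raw)


tokens_per_second = first_value(SAMPLE_RESPONSE)
print(tokens_per_second)
```

Sample values arrive as strings so that `NaN` and `Inf` survive JSON encoding, which is why the helper converts explicitly.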
How it works
Prometheus's data model is a name-and-labels addressable time series. Each metric is identified by a name plus an arbitrary set of key-value labels — for example `DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="gpu-07.london-1",cluster="prod-london"}`. The combination of name and labels defines a unique series; each series is an append-only stream of (timestamp, float64) samples. Internally `tsdb` partitions series into 2-hour blocks on disk, compacts them periodically into multi-hour blocks and serves queries by reading the in-memory head plus the on-disk blocks intersecting the query range.
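The identity rule above (name plus label set defines the series) can be made concrete with a toy store. This is a sketch for intuition only: `ToyStore` and `series_key` are hypothetical names, and the real `tsdb` uses a very different on-disk layout.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a series: metric name plus its label set, order-independent."""
    return (name, frozenset(labels.items()))


class ToyStore:
    """In-memory stand-in: one append-only sample list per unique series."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self._series[series_key(name, labels)].append((timestamp, float(value)))

    def series_count(self):
        return len(self._series)


store = ToyStore()
# Same name and labels (in a different order) land in the same series.
store.append("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0", "cluster": "prod-london"}, 1, 70)
store.append("DCGM_FI_DEV_GPU_TEMP", {"cluster": "prod-london", "gpu": "0"}, 2, 71)
# Changing any label value creates a new series.
store.append("DCGM_FI_DEV_GPU_TEMP", {"gpu": "1", "cluster": "prod-london"}, 2, 65)
print(store.series_count())
```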
Four metric types matter operationally. Counters are monotonically increasing values (`vllm:generation_tokens_total`, `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION`) that you query with `rate()` or `increase()`. Gauges are point-in-time values (`DCGM_FI_DEV_GPU_TEMP`, `kube_pod_status_ready`) queried directly. Histograms record observation distributions across pre-configured buckets and expose `_sum`, `_count` and `_bucket` series, queried with `histogram_quantile()` to estimate percentiles. Summaries pre-compute quantiles at the client; those quantiles cannot be meaningfully aggregated across instances, so modern code prefers histograms.
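To show why `histogram_quantile()` yields an estimate rather than an exact percentile, here is a simplified Python sketch of the classic-bucket calculation: find the bucket containing the target rank, then interpolate linearly inside it. It assumes cumulative `le` buckets ending in `+Inf` and leaves out edge cases the real function handles; the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (le, cumulative_count) pairs."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls past the last finite bucket: report its bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation within the bucket that contains the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical latency histogram: 100 observations in four buckets.
latency_buckets = [(0.5, 50), (1.0, 90), (2.5, 99), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

Because the result is interpolated between bucket bounds, its accuracy depends on how closely the bucket layout brackets the latencies you care about.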
Prometheus pulls. The server periodically scrapes each target's `/metrics` endpoint and ingests whatever it finds. The pull model has three operational benefits over push: targets are stateless and require no configuration to be monitored, dead targets are detected automatically (a failed scrape is itself a signal via the synthetic `up` metric), and the same endpoint can be scraped by multiple Prometheus servers in different regions for HA without coordination. For workloads that genuinely cannot be scraped — short-lived batch jobs, cron tasks, edge devices behind NAT — the Pushgateway sits in front of Prometheus and accepts pushes which are then scraped normally. Treat it as an exception.
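What a target serves on `/metrics` is plain text. The sketch below renders one metric family in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one line per series). The `render_metric` helper and the help string are assumptions for illustration; production code would normally use an official client library instead.

```python
def _escape(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{_escape(str(val))}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "DCGM_FI_DEV_GPU_TEMP",
    "gauge",
    "GPU temperature in degrees Celsius (illustrative help text).",
    [({"gpu": "0", "Hostname": "gpu-07.london-1"}, 71)],
)
print(text, end="")
```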
Service discovery turns the static target list into a live one. The Kubernetes SD plugin watches the Kubernetes API for `Pod`, `Service` and `Endpoints` changes and rewrites the scrape target list on every change; the EC2, GCE and Azure SD plugins do the same for cloud instances; the Consul SD plugin reads a Consul catalog. The Prometheus Operator adds higher-level CRDs (`ServiceMonitor`, `PodMonitor`, `Probe`) that codify scrape config in Kubernetes-native YAML.
Recording rules pre-compute expensive PromQL on the evaluation cadence and write the result as a new series. Alerting rules evaluate a PromQL expression and, when the result is non-empty for the `for:` duration, emit an alert to Alertmanager. Both run inside the same Prometheus process; both are reloaded on SIGHUP or `/-/reload`. On Yobitel NeoCloud, recording rules drive the per-tenant rollups that customers see in the Yobibyte console; alerting rules drive the regional SRE rota.
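The `for:` clause is easiest to read as a small state machine: an alert sits in a pending state while the expression keeps returning results, and fires only after that has held for the full duration. The sketch below models this for a single alert instance; `AlertState` is a hypothetical name, and the model ignores refinements such as `keep_firing_for`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations."""

    for_seconds: float
    active_since: Optional[float] = None

    def evaluate(self, now: float, expr_has_result: bool) -> str:
        if not expr_has_result:
            # Any empty evaluation resets the pending timer.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# A 10-minute `for:` evaluated every 30 s while the expression stays non-empty.
state = AlertState(for_seconds=600)
history = [state.evaluate(t, True) for t in range(0, 631, 30)]
print(history[0], history[-2], history[-1])
```

A single empty evaluation returns the alert to inactive, which is why flapping expressions often never fire under a long `for:` window.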
- Scrape loop: per-target goroutine; configurable `scrape_interval` (global default), `scrape_timeout` and per-job overrides via `scrape_configs`.
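- Storage: `tsdb` with a 2-hour in-memory head block; background compaction merges persisted blocks into progressively larger ones (by default up to 10% of the retention window, capped at 31 days); samples are stored as Gorilla-style delta-of-delta and XOR-encoded chunks, while Snappy compression applies to the WAL and to `remote_write` payloads.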
- Service discovery: 20+ SD mechanisms — `kubernetes_sd_configs`, `ec2_sd_configs`, `consul_sd_configs`, `file_sd_configs`, `http_sd_configs`.
- Recording rules: pre-aggregate `sum`, `rate`, `histogram_quantile` so dashboards query a cheap rollup instead of recomputing 100k series every refresh.
- Alerting rules: PromQL expression + `for:` window + labels; routed via Alertmanager to PagerDuty, Slack, OpsGenie, webhooks, email.
- Remote write: WAL-tailing exporter to Thanos / Mimir / Cortex / VictoriaMetrics for long-term retention and global query.
- Exemplars: trace-ID sample attached to a metric data point — the bridge from a PromQL panel to an OpenTelemetry trace in Tempo or Jaeger.
- Native histograms (stable in v3.x): high-resolution exponential-bucket histograms that replace the classic fixed-bucket form at a fraction of the series count.
Reference and specifications
The reference below documents the configuration sections and PromQL operators that an AI-infrastructure operator touches most. The full Prometheus configuration schema is much larger — TLS, OAuth2, file_sd, blackbox probing, agent mode — but the table here covers the production surface for a GPU fleet plus the PromQL features that drive every recording rule, alerting rule and Grafana dashboard you will write.
| Section / operator | Type | Purpose |
|---|---|---|
| `global.scrape_interval` | config | Default scrape cadence; 15-60 s typical, 30 s on GPU fleets. |
| `global.evaluation_interval` | config | Rule evaluation cadence; usually equal to `scrape_interval`. |
| `global.external_labels` | config | Labels stamped on every sample sent via `remote_write` — `cluster`, `region`, `env`. |
| `scrape_configs[].job_name` | config | Logical grouping; surfaces as the `job` label on every series. |
| `scrape_configs[].kubernetes_sd_configs` | config | Watch the Kubernetes API for pod / service / endpoints / node targets. |
| `scrape_configs[].relabel_configs` | config | Rewrite target metadata before the scrape (filter, rename, route). |
| `scrape_configs[].metric_relabel_configs` | config | Rewrite metric labels after the scrape (drop high-cardinality labels, rename). |
| `rule_files[]` | config | Glob list of recording and alerting rule files; reloaded on SIGHUP. |
| `alerting.alertmanagers[]` | config | Targets that receive alert payloads — typically a 3-node Alertmanager cluster. |
| `remote_write[].url` | config | WAL-tailed forwarder to Thanos Receive, Mimir, Cortex or VictoriaMetrics. |
| `remote_write[].queue_config` | config | Per-shard buffer, batch size, retry backoff — critical for downstream pushback. |
| `recording_rule.record / .expr` | rule | Pre-compute a PromQL expression; result written as a new series. |
| `alerting_rule.alert / .expr / .for` | rule | Fire alert when expression non-empty for `for:` duration. |
| `rate(counter[1m])` | PromQL | Per-second average rate of increase of a counter over the window. |
| `irate(counter[1m])` | PromQL | Instantaneous rate from the last two samples — for fast-moving counters. |
| `sum by (label) (...)` | PromQL | Aggregate keeping the named labels, dropping the rest. |
| `histogram_quantile(0.99, ...)` | PromQL | Estimate the q-th quantile from `_bucket` series. |
| `avg_over_time(gauge[5m])` | PromQL | Average gauge value over the range; common for utilisation alerts. |
| `max_over_time(gauge[5m]) > THRESHOLD` | PromQL | Threshold-with-debounce; pair with `for:` for cleaner alerts. |
| `label_replace(...)` | PromQL | Rewrite a label inline — the join glue between metrics from different exporters. |
| `group_left / group_right` | PromQL | Many-to-one vector matching — the cAdvisor `pod` × DCGM `gpu` join pattern. |
| `exemplar-storage` feature | config | Persist trace-ID exemplars attached to histogram observations. |
| `native-histograms` feature | config | Enable exponential-bucket histograms — lower cardinality than classic buckets. |
| `/api/v1/query` HTTP | API | Instant query — single timestamp evaluation. |
| `/api/v1/query_range` | API | Range query — fills a Grafana panel. |
| `/federate?match[]=...` | API | Scrape a subset of series from another Prometheus — basis of Yobibyte tenant federation. |
| `/-/reload` | API | Hot-reload config + rules; requires `--web.enable-lifecycle`. |
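The `rate()` row in the table above hides one detail worth seeing in code: counter resets. The simplified Python sketch below sums increases across a window and treats any drop in value as a restart from zero. It deliberately omits the extrapolation to window boundaries that Prometheus applies, so results will differ slightly from the real function; `simple_rate` is a hypothetical helper.

```python
def simple_rate(samples):
    """Per-second rate over (timestamp_seconds, counter_value) pairs.

    Returns None when fewer than two samples fall in the window, mirroring
    the fact that a rate needs at least two points.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: the process restarted, so count up from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# Steady counter: 120 tokens over 60 s.
print(simple_rate([(0, 100), (30, 160), (60, 220)]))
# Same workload with a restart between the second and third scrape.
print(simple_rate([(0, 100), (30, 160), (60, 20), (90, 80)]))
```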
Cardinality is the one operational pitfall everyone hits. Every unique label combination is a new series; encoding a request ID, trace ID, user ID, prompt hash or full path string as a label will explode storage and slow queries. Keep labels low-cardinality (model, node, cluster, region, tenant) and put unique identifiers in trace systems instead — exemplars link the two without the cardinality cost.
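The arithmetic behind that warning is a plain product: the worst-case series count for one metric is the number of distinct values of each label multiplied together. The label counts below are hypothetical and chosen only to show the scale of the jump when a unique identifier becomes a label.

```python
import math


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct label values."""
    return math.prod(label_cardinalities.values())


# Hypothetical low-cardinality label set for one inference metric.
safe = {"model": 5, "node": 128, "tenant": 40}
# The same metric after someone adds a per-request label.
leaky = {**safe, "request_id": 1_000_000}

print(f"{worst_case_series(safe):,}")
print(f"{worst_case_series(leaky):,}")
```

Real counts are usually below the bound because not every combination occurs, but an unbounded label removes the bound altogether.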
Workload patterns
Three workload shapes cover the bulk of Prometheus deployments on AI infrastructure: a single-cluster metrics stack for a development or small-production GPU tenancy, a federated multi-cluster stack for a NeoCloud-scale operator with long-term retention, and a customer-side scrape of a Yobibyte managed tenancy. Each pattern uses a slightly different scrape topology, retention budget and rule set.
Pattern A: single GPU cluster, kube-prometheus-stack. One HA Prometheus pair (2 replicas) per cluster scraping DCGM Exporter, node-exporter, kube-state-metrics, vLLM/TensorRT-LLM endpoints and the NVIDIA Network Operator. Local retention of 14-30 days, no remote write, Alertmanager 3-replica cluster routing to PagerDuty and Slack. The single-Prometheus footprint absorbs roughly 256 H100s of telemetry comfortably; above that, shard by namespace or move to Pattern B.
Pattern B: federated multi-cluster + Thanos. One regional Prometheus per cluster (HA pair) writing to a Thanos Receive cluster backed by an S3-compatible object store. Thanos Querier serves global PromQL across all regions; Thanos Compactor downsamples to 5-minute and 1-hour resolutions for long-range queries. This is the shape Yobitel NeoCloud regions run: every London-1, Frankfurt-1 and Virginia-1 cluster pushes into the regional long-term store, and customer dashboards in the Yobibyte console query Thanos for cross-region or 90-day views.
Pattern C: Yobibyte tenant federation into customer-owned Prometheus. The customer runs their own Prometheus (any version, any vendor, any cloud) and adds a `scrape_configs` block pointing at `/federate?match[]=...` on the tenant's Yobibyte observability endpoint. The federation returns a curated subset of metrics (GPU utilisation, inference latency, tokens served, spend) scoped to that tenant's workloads. The customer keeps full control of retention, alerting and dashboard tooling; Yobitel keeps full control of multi-tenant isolation. This is the recommended integration for customers who already have an internal observability platform and want Yobibyte as one more scrape target.
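When debugging Pattern C it helps to see the exact URL Prometheus will request, because each `match[]` selector must be percent-encoded as a repeated query parameter. The sketch below only builds and round-trips the URL and makes no network call. The host and selector are taken from the quick-start example, and `federate_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(host, matchers, scheme="https"):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{scheme}://{host}/federate?{query}"


matchers = ['{__name__=~"DCGM_FI_.*|vllm_.*|yobibyte_.*"}']
url = federate_url("observability.london-1.yobitel.com", matchers)
print(url)

# Round-trip check: decoding the query recovers the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
print(decoded == matchers)
```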
```yaml
# Recording and alerting rules for an AI cluster
groups:
  - name: ai-cluster-recording
    interval: 30s
    rules:
      # Per-namespace token throughput, pre-aggregated for dashboards
      - record: namespace:vllm_tokens_per_second:rate1m
        expr: sum by (namespace) (rate(vllm:generation_tokens_total[1m]))
      # Per-node Tensor Core saturation, smoothed
      - record: node:tensor_core_active:avg5m
        expr: avg by (Hostname) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[5m]))
      # Per-tenant inference p99 latency
      - record: tenant:vllm_e2e_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (tenant, le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
  - name: ai-cluster-alerting
    interval: 30s
    rules:
      # SLO: p99 inference latency above the regional budget
      - alert: InferenceP99LatencyHigh
        expr: tenant:vllm_e2e_p99:5m > 2.5
        for: 10m
        labels: { severity: warning, slo: latency }
        annotations:
          summary: "Tenant {{ $labels.tenant }} p99 inference latency at {{ $value | humanizeDuration }}"
      # SLO: tokens-per-second below the floor when traffic present
      - alert: InferenceThroughputCollapse
        expr: |
          namespace:vllm_tokens_per_second:rate1m
            < 0.5 * avg_over_time(namespace:vllm_tokens_per_second:rate1m[6h])
          and namespace:vllm_tokens_per_second:rate1m > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Throughput in {{ $labels.namespace }} dropped >50% vs 6h baseline"
      # Capacity: region-wide Tensor Core saturation sustained high
      - alert: RegionTensorCoreSaturation
        expr: avg(node:tensor_core_active:avg5m) > 0.85
        for: 30m
        labels: { severity: info, team: capacity }
        annotations:
          # avg() drops every label, so the region name comes from external_labels
          summary: "Region {{ $externalLabels.cluster }} averaging >85% Tensor Core activity; plan capacity"
```

Write recording rules for any PromQL expression your dashboards or alerts evaluate more than once an interval. Grafana panels that recompute a 100k-series histogram quantile on every refresh are the most common cause of Prometheus query-side overload; pre-aggregate to a `tenant:vllm_e2e_p99:5m`-style series and your dashboards become near-free to render.
Sizing and capacity planning
Prometheus sizing is governed by active series count, ingest rate and retention window. As a planning anchor, a healthy AI cluster produces roughly 100-200 series per GPU from DCGM Exporter plus 500-1,500 series per inference replica from vLLM/TensorRT-LLM histograms plus 30-50 series per node from node-exporter plus 5-15k cluster-level series from kube-state-metrics. The table below maps representative fleet sizes onto Prometheus footprint, ingest rate and retention storage at a 30 s scrape interval and 14-day local retention.
Prometheus comfortably handles 10-15 million active series and 1-2 million samples per second per server on modern hardware (8 vCPU, 64 GB RAM, NVMe SSD). Above that, shard by namespace, move to one Prometheus per cluster federated through Thanos, or switch to a horizontally scalable backend (Mimir, Cortex, VictoriaMetrics cluster). On Yobitel NeoCloud the operating point is one HA Prometheus pair per region with 14-day local retention and Thanos Receive for the 12-month long-term store; this comfortably absorbs 1,024 GPUs of telemetry per region with headroom.
- Default scrape interval: 30 s on production GPU clusters; drop to 15 s for SLA-critical inference; raise to 60 s for batch-only training clusters.
- RAM rule of thumb: ~3 KB working set per active series (head block + index). 1 M series ≈ 3 GB RAM before query/headroom.
- Disk rule of thumb: ~1.3 bytes per compressed sample. 200 k samples/s × 86,400 s × 14 d × 1.3 B ≈ 315 GB before WAL and index overhead.
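- WAL: lives in the `wal/` directory under `--storage.tsdb.path`, so keep that volume on fast NVMe (`--storage.tsdb.wal-segment-size` changes only the segment file size, not the location). WAL corruption is the most painful Prometheus failure mode.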
- Remote write: 1.3-1.6x the local ingest rate due to WAL re-reading; budget downstream Thanos/Mimir capacity accordingly.
- HA pairs: two replicas with identical scrape config; downstream dedup at Thanos Querier or Grafana datasource layer.
- Yobitel NeoCloud anchor: regional Prometheus pair on 16 vCPU / 128 GB RAM / 2 TB NVMe hosts per 1,024-GPU region (a memory-optimised shape; an AWS c6id.4xlarge itself carries only 32 GiB of RAM).
| Fleet | GPUs | Active series | Samples/s | RAM (working) | Disk (14d local) | Yobitel footprint |
|---|---|---|---|---|---|---|
| Single dev cluster | 8 | ~150k | ~5k | 4 GB | ~25 GB | n/a |
| Small production tenancy | 64 | ~600k | ~20k | 8 GB | ~100 GB | Pattern A |
| Production tenancy + MIG | 256 | ~3.5 M | ~120k | 32 GB | ~600 GB | Pattern A → B at the high end |
| Yobitel London-1 region | 1,024 | ~6 M | ~200k | 48 GB | ~1.0 TB | Pattern B — HA pair + Thanos |
| Yobitel multi-region fleet | 4,096 | ~24 M | ~800k | n/a (sharded) | n/a | Pattern B — per-region pairs into central Thanos |
| Customer scraping Yobibyte tenant | varies | ~10-50k filtered | ~0.5-2k | 1-2 GB | ~5-20 GB | Pattern C — `/federate` endpoint |
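The rules of thumb above reduce to three multiplications, sketched below so you can plug in your own series count. The constants (about 3 KB of RAM per series, about 1.3 bytes per sample) come from this section; the `estimate` function and `Footprint` type are illustrative names.

```python
from dataclasses import dataclass

RAM_BYTES_PER_SERIES = 3_000   # ~3 KB working set per active series
BYTES_PER_SAMPLE = 1.3         # compressed on-disk size per sample
SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Footprint:
    samples_per_second: float
    ram_gb: float
    disk_gb: float


def estimate(active_series, scrape_interval_s=30, retention_days=14):
    """Raw footprint before query headroom, WAL and index overhead."""
    samples_per_second = active_series / scrape_interval_s
    ram_gb = active_series * RAM_BYTES_PER_SERIES / 1e9
    disk_gb = (
        samples_per_second * SECONDS_PER_DAY * retention_days * BYTES_PER_SAMPLE / 1e9
    )
    return Footprint(samples_per_second, ram_gb, disk_gb)


# The 1,024-GPU London-1 row: ~6 M active series at a 30 s interval.
region = estimate(6_000_000)
print(f"{region.samples_per_second:,.0f} samples/s")
print(f"{region.ram_gb:.1f} GB RAM, {region.disk_gb:.1f} GB disk")
```

The RAM and disk columns in the table sit well above these raw figures, which suggests they include headroom for queries, WAL, index and compaction; that reading is an inference from the numbers rather than something the table states.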
Limits and quotas
Prometheus has very few configuration ceilings. The constraints that matter operationally are cardinality budgets, scrape latency, WAL replay time and downstream `remote_write` backpressure. The table below documents each limit, the operational symptom when you hit it and the lever for raising or working around it.
| Limit | Default | Operational ceiling | How to raise / work around |
|---|---|---|---|
| Active series per server | unlimited | ~10-15 M before head-block memory blows up | Shard by namespace; move to Mimir/Thanos; native histograms. |
| Samples per second per server | unlimited | ~1-2 M before scrape duration exceeds interval | Increase scrape interval; shard scrape config; recording rules. |
| Scrape body size | unlimited (`body_size_limit: 0`) | Memory in head block | Set `body_size_limit` and `sample_limit` per scrape job; drop high-cardinality labels via `metric_relabel_configs`. |
| `scrape_timeout` | 10 s | Equal to `scrape_interval` | Increase per-job; investigate slow targets before bumping globally. |
| WAL replay on restart | n/a | Linear in series × hours of WAL | Keep `--storage.tsdb.wal-compression` on (the default since v2.20); restart during low-write windows. |
| Retention time | 15 d | Disk-bound | `--storage.tsdb.retention.time=Nd`; pair with `retention.size` cap. |
| Retention size | unlimited | Disk-bound | `--storage.tsdb.retention.size=200GB`; whichever of the time and size limits is reached first applies. |
| `remote_write` queue | 10,000 samples/shard (`capacity`) | Downstream ingest rate | Tune `queue_config.max_shards`, `capacity`, `batch_send_deadline`. |
| Federation series count | unlimited | Scrape body size + scrape duration | Use `match[]` aggressively; prefer remote_write for high-volume. |
| Query samples per request | 50 M samples (`--query.max-samples`) | Server RAM | Lower `--query.max-samples` on memory-constrained servers; prefer recording rules. |
| Query timeout | 2 m | n/a | `--query.timeout=2m`; long queries should be recording rules. |
| Concurrent queries | 20 | CPU-bound | `--query.max-concurrency=20`; Grafana panel bursts are the usual cause. |
| Rule evaluation latency | must be < interval | n/a | Split heavy groups; pre-aggregate; increase the group `interval`. |
| Alertmanager grouping interval | 5 m (`group_interval`) | n/a | `group_wait`, `group_interval`, `repeat_interval` in route config. |
WAL replay time is the silent operational scar. A Prometheus with 5 M series and 6 hours of WAL can take 10-15 minutes to come up, long enough for monitoring to be down during an incident. Keep `--storage.tsdb.wal-compression` enabled (the default since v2.20), keep the WAL on fast NVMe, and consider a Prometheus Agent fronting your scrape config so the long-term store keeps receiving samples even while the query Prometheus restarts.
Observability
Prometheus is itself an observability component, but its own health is worth alerting on — a silently-failing Prometheus is the worst possible failure mode because nothing else fires. Prometheus exposes its own metrics on `/metrics`, the most important of which cover scrape success, ingest rate, head-block size, WAL state, rule evaluation duration and `remote_write` queue depth. The alert rules below cover the failure modes that account for almost all production incidents.
- Scrape — `up == 0` for a target: scrape failed; investigate target, network, TLS, auth.
- Scrape duration — `scrape_duration_seconds > scrape_interval * 0.8`: scrape close to overrunning; target is producing too many series.
- Cardinality — `prometheus_tsdb_symbol_table_size_bytes` growing without bound: a label has gone high-cardinality.
- Head block — `prometheus_tsdb_head_series` near series-budget ceiling: shard or move to long-term store.
- WAL — `prometheus_tsdb_wal_corruptions_total > 0`: WAL corruption; replay will fail on restart.
- Remote write — `prometheus_remote_storage_samples_pending` growing unbounded: downstream backpressure; investigate Thanos/Mimir.
- Rule evaluation — `prometheus_rule_evaluation_duration_seconds` exceeds interval: a recording rule is slower than its `interval:`.
- Notifications — `prometheus_notifications_dropped_total > 0`: Alertmanager unreachable or overloaded.
```yaml
# Self-monitoring rules: Prometheus watching Prometheus
groups:
  - name: prometheus-self
    interval: 30s
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels: { severity: warning, team: observability }
        annotations:
          summary: "Scrape target {{ $labels.job }}/{{ $labels.instance }} down"
      - alert: PrometheusScrapeSlow
        # 24 s is 0.8 x the 30 s scrape interval; adjust for jobs that override it
        expr: scrape_duration_seconds > 24
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.job }} scrape duration near interval; investigate target"
      - alert: PrometheusHighCardinality
        expr: rate(prometheus_tsdb_head_series_created_total[10m]) > 1000
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "High series creation rate; cardinality leak suspected"
      - alert: PrometheusRemoteWriteBacklog
        expr: prometheus_remote_storage_samples_pending > 100000
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "remote_write backlog growing; downstream store struggling"
      - alert: PrometheusWALCorruption
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "WAL corruption on {{ $labels.instance }}; investigate disk"
      - alert: PrometheusRuleEvaluationSlow
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 25
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Rule evaluation p99 {{ $value }}s near 30s interval"
      - alert: PrometheusNotificationsDropped
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Alertmanager notifications dropped; investigate alerting path"
```

Always run two Prometheus replicas with identical scrape config, and have each replica scrape the *other* one's `/metrics`. That cross-scrape is the most direct way to alert on a Prometheus that has stopped scraping, because a stalled server cannot raise an alert about itself.
Cost and FinOps
Prometheus itself is free under Apache 2.0 — there is no licence cost. The operational cost is the compute, RAM and storage to run it, plus the downstream long-term store. The table below puts both in USD terms for representative AI-cluster sizes, using mid-2026 anchors for self-hosted Prometheus on cloud VMs, managed Prometheus services (Grafana Cloud, AWS AMP, Chronosphere) and the Yobibyte-included observability surface.
- Self-hosted on cloud VM: typical 16 vCPU / 64 GB RAM / 2 TB NVMe instance is ~$400/month on-demand, ~$200/month on 1-year reserved.
- Object-store Thanos: ~$0.023/GB-month on S3 standard, ~$0.0125/GB-month on S3 IA — long retention is dominated by storage, not compute.
- Managed Prometheus: priced per active series per month (~$0.30-0.90/1k series at the time of writing); cost scales linearly with cardinality.
- Yobitel NeoCloud: Prometheus + Thanos + Grafana are part of the GPU rate; customers pay only the per-GPU/hr published price.
- Yobibyte managed observability: the tenant federation endpoint is included; customers who federate into their own Prometheus pay only their own infrastructure cost.
- FinOps wedge: high-cardinality labels are the #1 cost driver on managed Prometheus. Use `metric_relabel_configs` (or `write_relabel_configs` on `remote_write`) to drop request-ID, prompt-hash and full-path labels before they leave the cluster.
| Fleet | Active series | Self-hosted Prom + S3 Thanos (monthly USD) | Grafana Cloud (monthly USD) | AWS AMP (monthly USD) | Yobitel NeoCloud |
|---|---|---|---|---|---|
| Single dev cluster (8 GPU) | ~150k | ~$50 (m6i.large + 50 GB EBS) | ~$80 | ~$60 | Included in GPU rate |
| Production tenancy (256 GPU) | ~3.5 M | ~$400 (c6id.2xlarge + 1 TB EBS + S3) | ~$1,800 | ~$1,400 | Included in GPU rate |
| Yobitel London-1 region (1,024 GPU) | ~6 M | ~$700 (c6id.4xlarge HA + Thanos) | ~$3,200 | ~$2,500 | Yobitel-operated; surfaced to customers |
| Customer scraping Yobibyte tenant | ~10-50k | ~$25 (existing Prom + tiny disk) | ~$40 | ~$30 | Federation included, no extra fee |
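The per-series and per-GB anchors above are linear, so a rough budget takes only a few lines of Python. The prices are the ones quoted in this section; the 10 TB Thanos bucket is a hypothetical size chosen for the example, and real bills add request, egress and query charges that this sketch ignores.

```python
def managed_cost_usd(active_series, usd_per_1k_series):
    """Monthly managed-Prometheus cost under per-active-series pricing."""
    return active_series / 1_000 * usd_per_1k_series


def object_store_cost_usd(stored_gb, usd_per_gb_month=0.023):
    """Monthly object-store cost for long-term blocks (S3 standard anchor)."""
    return stored_gb * usd_per_gb_month


# 256-GPU production tenancy: ~3.5 M active series at the quoted price band.
low = managed_cost_usd(3_500_000, 0.30)
high = managed_cost_usd(3_500_000, 0.90)
print(f"managed: ${low:,.0f} to ${high:,.0f} per month")

# Hypothetical 10 TB Thanos bucket on S3 standard.
print(f"object store: ${object_store_cost_usd(10_000):,.0f} per month")
```

The managed figures in the table fall inside the band this produces, which is a useful sanity check when a vendor quote arrives.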
Security and compliance
Prometheus does not authenticate scrape requests by default and does not encrypt data on disk. On shared clusters the standard pattern is to keep Prometheus internal (ClusterIP-only Service, NetworkPolicy permitting only the namespaces that need access), expose Grafana as the user-facing surface, and put TLS + bearer-token auth in front of any federation or remote_write endpoint that crosses a trust boundary. The Prometheus Operator's `ServiceMonitor` resource supports `bearerTokenSecret` and `tlsConfig` blocks natively; bare-metal operators put nginx or Envoy in front.
For UK public-sector workloads (NCSC Cloud Security Principles, G-Cloud 14, OFFICIAL-handling), Prometheus telemetry remains inside the sovereign tenancy and never federates to a multi-region store. Yobitel NeoCloud's London-1 region runs an independent Prometheus + Thanos stack with no cross-region replication; sovereign customers see a one-region observability surface. The metrics themselves are operational data — GPU UUIDs, host names, namespace and pod names, request rates, latencies — and contain no customer payload. For GDPR purposes they are not personal data.
Yobibyte's customer-facing observability surface enforces three controls. Per-tenant API tokens limit federation to that tenant's scope. Series-level label filtering blocks cross-tenant leakage even when a token is misused. All federation responses are signed and rate-limited per token. Customers receive enough operational telemetry to run their own SRE rota — utilisation, latency, throughput, spend — without seeing Yobitel-internal metrics from other tenants or the regional control plane. This recipe-protected boundary is documented in the [yobibyte](/knowledge-base/yobibyte) entry.
Never expose Prometheus's `/api/v1/admin/tsdb/delete_series` or `/-/reload` HTTP endpoints to a network you do not control. `--web.enable-admin-api` and `--web.enable-lifecycle` are off by default for a reason — re-enabling them without an authenticating reverse proxy lets any caller delete production series or reload arbitrary config.
Migration and alternatives
Most migrations to Prometheus on AI infrastructure come from one of four origins: Graphite or StatsD push pipelines, cloud-native metrics (CloudWatch, Azure Monitor, Cloud Monitoring), a legacy InfluxDB + Telegraf stack, or no metrics at all. The table below documents the trade-offs of each migration path and the Prometheus-ecosystem alternatives if Prometheus itself is not the right fit.
For green-field AI clusters, install kube-prometheus-stack and skip the alternatives — every GPU runtime and Kubernetes component already speaks Prometheus natively. For existing organisations with significant Datadog or New Relic investment, the typical pattern is to keep Prometheus inside the cluster for GPU and inference telemetry and forward via `remote_write` to the existing vendor for unified dashboards.
| Migration source / alternative | Effort | What you gain | What you lose |
|---|---|---|---|
| Graphite / StatsD push | Medium | Pull model, service discovery, PromQL, AI ecosystem default | Statsd_exporter bridges push clients; some statsd semantics differ |
| InfluxDB + Telegraf | Medium | Active CNCF community, GPU-runtime native integrations | InfluxQL → PromQL retraining; Flux drop-ins exist |
| AWS CloudWatch | Medium | Open source, portable, AI ecosystem default | CloudWatch alarms re-implemented in Prometheus + Alertmanager |
| Azure Monitor / GCP Cloud Monitoring | Medium | Same | Same |
| Datadog (keep as backend) | Low — remote_write | Cluster-local Prometheus + Datadog UI | Datadog APM tie-in remains separate |
| No metrics at all | Trivial — kube-prometheus-stack | Every benefit | n/a — this is the right migration |
| Thanos | Pair with Prometheus | Object-store long-term retention, global query, downsampling | Operational complexity; another component to run |
| Cortex | Pair with Prometheus | Horizontally scalable multi-tenant, blocks storage | Operational complexity |
| Grafana Mimir | Pair with Prometheus | Cortex fork with better operability, native multi-tenancy | AGPLv3; vendor positioning |
| VictoriaMetrics | Drop-in replacement | Higher compression, lower RAM, simpler operations | Smaller community; some PromQL differences |
| Grafana Cloud / AWS AMP / Chronosphere | Hosted | No ops; usage-based pricing | Per-series cost can outrun self-hosted at scale |
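The "keep Prometheus in the cluster and forward via `remote_write`" pattern described above can be sketched as a `remote_write` block that forwards only GPU and inference series. This is an illustration under stated assumptions: the endpoint URL and token path are hypothetical, the metric-name prefixes are taken from names used elsewhere in this entry, and it presumes the backend exposes a remote-write-compatible ingest endpoint.

```yaml
remote_write:
  - url: https://metrics.vendor.example.invalid/api/v1/write   # hypothetical endpoint
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/vendor-token   # hypothetical path
    write_relabel_configs:
      # Forward only GPU and inference series; everything else stays local.
      - source_labels: [__name__]
        regex: "(DCGM_FI_.*|vllm_.*|tensorrt_llm_.*)"
        action: keep
```

Filtering with `write_relabel_configs` keeps the cluster-local Prometheus complete while limiting what the external backend ingests and bills for.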
```yaml
# Long-term storage via Thanos sidecar: production NeoCloud pattern
# Prometheus runs with a Thanos sidecar that uploads finalised blocks to S3
# and answers Querier requests for recent samples.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: regional
  namespace: monitoring
spec:
  replicas: 2
  retention: 14d
  retentionSize: 1TiB
  scrapeInterval: 30s
  externalLabels:
    cluster: london-1
    region: uk-london
    env: prod
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  ruleSelector: {}
  thanos:
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-secret
    image: quay.io/thanos/thanos:v0.36.1
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-nvme
        resources:
          requests:
            storage: 1Ti
---
# Thanos object-storage config (referenced above)
type: S3
config:
  bucket: yobitel-london-1-thanos
  endpoint: s3.eu-west-2.amazonaws.com
  region: eu-west-2
  sse_config:
    type: SSE-KMS
    kms_key_id: arn:aws:kms:eu-west-2:000000000000:key/abc
```

Picking between Thanos, Mimir, Cortex and VictoriaMetrics is a 10x easier decision than picking Prometheus over a non-Prometheus stack. All four speak PromQL, all four ingest Prometheus `remote_write` and all four are reversible: try one, switch later if the operational profile does not suit. The bigger choice is leaving Prometheus's data model behind.
Troubleshooting#
The error table below covers the failure modes that account for almost all real Prometheus incidents on AI infrastructure. Each row maps an observable symptom to the underlying cause and the minimum-viable fix. Most issues trace back to one of three root causes: cardinality leakage, downstream backpressure on `remote_write`, or a misconfigured scrape (TLS, auth, or service-discovery filter).
| Symptom | Cause | Fix |
|---|---|---|
| Prometheus pod OOMKilled on restart | WAL replay loaded too many series into head block | Reduce series via `metric_relabel_configs`; raise pod memory; consider native histograms. |
| Scrape duration approaches `scrape_interval` | Target producing too many series or slow `/metrics` handler | Filter labels at scrape; raise `scrape_timeout`; sample the target. |
| `up == 0` for one target only | TLS, auth, or NetworkPolicy | Curl `/metrics` from a debug pod in the same namespace; check operator logs. |
| Cardinality explosion overnight | A label became free-form (request ID, prompt hash) | Identify via `topk(20, count by (__name__)({__name__=~".+"}))`; drop the label. |
| `remote_write` pending samples growing | Downstream Thanos/Mimir backpressured | Tune `queue_config`; investigate downstream; consider Prometheus Agent. |
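| Grafana panel timeouts | PromQL evaluating millions of series per query | Pre-aggregate via recording rules; narrow selectors and time ranges; raise `--query.timeout` or `--query.max-samples` only for queries that are legitimately large. |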
| Alerts firing but no notifications | Alertmanager unreachable, mis-routed, or silenced | Check `prometheus_notifications_dropped_total`; inspect Alertmanager `/api/v2/silences`. |
| Duplicate alerts from HA pair | Alertmanager dedup not configured | Use Alertmanager cluster mode (`--cluster.peer`); identical alert labels deduplicate. |
| Recording rule lagging behind | Rule evaluation slower than `interval` | Split rule group; pre-aggregate inputs; raise interval; profile via `/api/v1/rules`. |
| Federation scrape times out | `match[]` returns too many series in one body | Tighten `match[]`; switch to `remote_write` for high volume; shard by job. |
| `prometheus_tsdb_compactions_failed_total` increasing | Disk full, wrong permissions, or two Prometheus processes sharing one data directory | Check disk and `--storage.tsdb.path` ownership; ensure single Prometheus per data dir. |
| External labels missing on remote-written samples | `externalLabels` not set on the Prometheus spec, so `global.external_labels` is empty | Add `cluster`, `region` and `env` under `spec.externalLabels` on the Prometheus CRD. |
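| Queries return stale samples after target down | Staleness markers not written (target flapped or exposes explicit timestamps), so queries fall back to the 5-minute lookback window | Check `up` for a flapping pattern around the outage; confirm whether the target sets its own sample timestamps. |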
| Yobibyte federation returns 401 | Tenant API token expired or scoped to wrong tenant | Rotate the token from the Yobibyte console; verify `Authorization: Bearer` header. |
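Two of the fixes in the table, dropping a free-form label at scrape time and tuning `queue_config`, can be sketched in `prometheus.yml` as follows. The job name, target, label names, URL and all numeric values are illustrative assumptions rather than recommendations; tune them against your own `remote_write` metrics.

```yaml
scrape_configs:
  - job_name: inference                    # hypothetical job
    scrape_interval: 30s
    scrape_timeout: 25s                    # must not exceed scrape_interval
    static_configs:
      - targets: ["inference.example.invalid:8000"]   # hypothetical target
    metric_relabel_configs:
      # Remove free-form labels before samples reach the TSDB.
      # The regex is matched against label names, not values.
      - regex: "request_id|prompt_hash"    # assumed label names
        action: labeldrop

remote_write:
  - url: https://thanos-receive.example.invalid/api/v1/receive   # hypothetical
    queue_config:
      capacity: 10000             # samples buffered per shard
      max_shards: 50              # upper bound on parallel senders
      max_samples_per_send: 2000  # batch size per request
      batch_send_deadline: 5s     # flush a partial batch after this long
      min_backoff: 30ms
      max_backoff: 5s
```

`labeldrop` is only safe when the remaining labels still identify each series uniquely. If two series collapse into one, the scrape reports duplicate samples, and the better fix is to remove the label at the source.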
Where this fits in the Yobitel stack#
Prometheus is the metrics layer Yobitel operates and the metrics layer Yobitel publishes. On the operator side, every NeoCloud region (London-1, Frankfurt-1, Virginia-1) runs a Prometheus Operator-managed HA pair scraping DCGM Exporter, node-exporter, kube-state-metrics, the NVIDIA GPU Operator, the NVIDIA Network Operator, the inference replicas Yobibyte runs on behalf of tenants and the regional control plane. Samples are sent via `remote_write` to a regional Thanos cluster backed by an S3-compatible object store for 12-month retention, with cross-region federation through Thanos Querier for capacity-planning workloads.
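With the Prometheus Operator, each scrape target in the list above is typically declared as a `ServiceMonitor`. The manifest below is a sketch for DCGM Exporter only; the namespace, Service label and port name are assumptions that depend on how the GPU Operator was installed, so check them against the actual Service before applying.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - gpu-operator               # assumed namespace of the exporter Service
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # assumed Service label
  endpoints:
    - port: gpu-metrics            # assumed Service port name
      path: /metrics
      interval: 30s
```

An empty `serviceMonitorSelector: {}` on the Prometheus resource selects every ServiceMonitor in the namespaces it watches, so no extra label is needed on this object.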
On the publisher side, Yobibyte exposes the Prometheus federation surface back to every customer. Each tenant has a federation endpoint scoped to that tenant's GPU UUIDs, namespaces and inference replicas: a Prometheus `/federate` URL with bearer-token auth that returns a curated subset of `DCGM_FI_*`, `vllm_*`, `tensorrt_llm_*` and `yobibyte_*` metrics. Customers point their own Prometheus, Grafana Cloud, Datadog agent or any OpenMetrics-compatible scraper at that endpoint and the metrics flow into their existing dashboards. The same federation surface powers the InferenceBench scoring pipeline that compares Yobitel NeoCloud throughput against public managed-inference vendors, with every metric traceable back to a Prometheus query.
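On the customer side, consuming such an endpoint is an ordinary Prometheus federation job. The sketch below assumes the standard `/federate` path, a hypothetical hostname and a token mounted from a file; the real tenant URL and token handling come from the Yobibyte console and may differ.

```yaml
scrape_configs:
  - job_name: yobibyte-federation
    scheme: https
    metrics_path: /federate            # standard federation path, assumed here
    honor_labels: true                 # keep the labels set by the source server
    scrape_interval: 60s
    scrape_timeout: 30s
    params:
      "match[]":
        - '{__name__=~"DCGM_FI_.*"}'
        - '{__name__=~"vllm_.*|tensorrt_llm_.*|yobibyte_.*"}'
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/secrets/yobibyte-token   # hypothetical path
    static_configs:
      - targets:
          - federation.example.invalid   # hypothetical tenant endpoint
```

`honor_labels: true` matters for federation: without it, the scraping server renames conflicting labels such as `job` and `instance` from the source to `exported_job` and `exported_instance` and substitutes its own values.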
On UK and EU sovereign tenancies the Prometheus + Thanos stack stays inside the sovereign region. Federation tokens are scoped to that region; cross-region replication is disabled; the Yobibyte console queries only the in-region Thanos. Customers running under NCSC Cloud Security Principles, G-Cloud 14 OFFICIAL-handling or EU Data Boundary commitments see a one-region observability surface and a documented control boundary. The recipe-protection rule applies: the Yobibyte console exposes what the customer needs (utilisation, latency, throughput, spend) without disclosing Yobitel's internal scheduling, admission or routing metrics — see the [yobibyte](/knowledge-base/yobibyte) entry for the customer-facing API shape.