Prometheus

TL;DR

CNCF Graduated (2018), Apache 2.0. Single Go binary combining a scrape engine, an embedded time-series database (`tsdb`), a recording-and-alerting rule evaluator and a PromQL HTTP query API. It was the second CNCF project to graduate, after Kubernetes.
Pull-based scrape model: targets expose `/metrics` in OpenMetrics text format, the server discovers them via `scrape_configs` (static, Kubernetes, Consul, EC2, GCE, Azure or any file-SD plugin) and ingests samples on a 15-60 s cadence.
Configured mainly through five top-level blocks: `global`, `scrape_configs`, `rule_files`, `alerting` and `remote_write`. The files listed under `rule_files` hold the recording rules that pre-compute expensive PromQL and the alerting rules that fire via Alertmanager.
Scales to ~10-15 M active series per node; horizontal scale, multi-tenancy and 12-month retention come from `remote_write` to Thanos, Cortex, Grafana Mimir or VictoriaMetrics. Exemplars link metrics to OpenTelemetry trace IDs.
Default metrics fabric on every Yobitel NeoCloud region and the scrape format Yobibyte emits for customer-side observability — point your own Prometheus at a Yobibyte tenant-scoped federation endpoint and your existing dashboards keep working.

Overview

Prometheus is a single Go binary that does five things in one process: it discovers targets through scrape_configs, it pulls metrics from each target's /metrics endpoint on a fixed cadence, it stores the resulting samples in an embedded time-series database (tsdb), it evaluates recording and alerting rules against that database, and it answers PromQL queries over an HTTP API. There is no agent, no message bus and no separate ingestion service. That single-binary shape — plus the OpenMetrics text exposition format and PromQL — is the source of its dominance across cloud-native infrastructure.

Originally written at SoundCloud in 2012, donated to the CNCF in 2016 and graduated in 2018, Prometheus has become the de facto metrics backbone for Kubernetes platforms, GPU runtimes and AI-infrastructure stacks that ship dashboards. DCGM Exporter, KServe, vLLM, TensorRT-LLM, NVIDIA GPU Operator, kube-state-metrics, node-exporter and the NVIDIA Network Operator all emit Prometheus-shaped metrics by default. If you operate AI compute, you operate Prometheus, or you operate a managed system that speaks PromQL.

Yobitel ships Prometheus as the metrics fabric of every NeoCloud region. Each region runs a regional Prometheus cluster scraping DCGM Exporter, node-exporter, vLLM/TensorRT-LLM inference replicas, the NVIDIA Network Operator and the regional control plane; samples remote-write into a long-term Thanos store backed by S3-compatible object storage. Yobibyte exposes the same shape back to customers: every tenant has a federation endpoint that returns Prometheus text for that tenant's scope, so customers can scrape Yobibyte directly from their own Prometheus, Grafana Cloud or Datadog agent without any Yobitel-specific adapter.

This entry helps you stand up a production-grade Prometheus for an AI fleet, write the recording and alerting rules that catch the incidents that actually happen, scope it correctly for GPU-side cardinality and federate it into long-term storage — or, equivalently, configure your existing Prometheus to scrape Yobibyte and Yobitel NeoCloud regions through the public federation surface.

Quick start

The example below installs the kube-prometheus-stack Helm chart (Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter and kube-state-metrics in one bundle) onto a Kubernetes cluster, sets serviceMonitorSelectorNilUsesHelmValues=false so that ServiceMonitors from other releases (such as the one shipped with DCGM Exporter) are picked up, and verifies the scrape with a PromQL query. The second block is the equivalent standalone Prometheus running on a bare-metal host. The third block points a customer-owned Prometheus at a Yobibyte tenant federation endpoint.

# 1. Install kube-prometheus-stack with sensible defaults for an AI cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kps prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --set prometheus.prometheusSpec.retention=14d \
    --set prometheus.prometheusSpec.retentionSize=200GiB \
    --set prometheus.prometheusSpec.replicas=2 \
    --set prometheus.prometheusSpec.scrapeInterval=30s \
    --set prometheus.prometheusSpec.evaluationInterval=30s \
    --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true \
    --set prometheus.prometheusSpec.enableFeatures="{exemplar-storage,native-histograms}" \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

# Verify Prometheus is up and scraping
kubectl -n monitoring get prometheus,statefulset,service
kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &

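# Query: total tokens-per-second across a vLLM fleet
# (-G with --data-urlencode stops curl from treating [1m] as a URL glob range)
curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=sum(rate(vllm:generation_tokens_total[1m]))' | jq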

# 2. Standalone Prometheus on bare-metal
# --web.enable-lifecycle and --web.enable-admin-api are off by default; enable
# them only behind an authenticating reverse proxy on a trusted network.
prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=14d \
    --storage.tsdb.retention.size=200GB \
    --web.enable-lifecycle \
    --web.enable-admin-api \
    --enable-feature=exemplar-storage,native-histograms

# 3. Scrape a Yobibyte tenant federation endpoint from your own Prometheus
# (illustrative scrape_config — paste into your prometheus.yml)
# scrape_configs:
#   - job_name: yobibyte-tenant
#     scrape_interval: 30s
#     honor_labels: true
#     metrics_path: /federate
#     params:
#       'match[]':
#         - '{__name__=~"DCGM_FI_.*|vllm:.*|yobibyte_.*"}'
#     scheme: https
#     authorization:
#       type: Bearer
#       credentials_file: /etc/prometheus/yobibyte-token
#     static_configs:
#       - targets: ['observability.london-1.yobitel.com']

Tip: In untrusted environments, leave --web.enable-lifecycle and --web.enable-admin-api disabled, which is their default. The first lets any caller reload the config or shut the server down; the second lets any caller delete series. The kube-prometheus-stack defaults are safe; bare-metal operators sometimes leave both on.

How it works

Prometheus's data model is a name-and-labels addressable time series. Each metric is identified by a name plus an arbitrary set of key-value labels — for example DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="gpu-07.london-1",cluster="prod-london"}. The combination of name and labels defines a unique series; each series is an append-only stream of (timestamp, float64) samples. Internally tsdb partitions series into 2-hour blocks on disk, compacts them periodically into multi-hour blocks and serves queries by reading the in-memory head plus the on-disk blocks intersecting the query range.
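The identity rule above (metric name plus the complete label set) can be sketched as a toy in-memory store. This is an illustration only, not how tsdb is implemented; `series_key` and `TinyStore` are hypothetical names introduced for the example.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a series: metric name plus the full, order-independent label set."""
    return (name, tuple(sorted(labels.items())))


class TinyStore:
    """Toy store: one append-only list of (timestamp, float) samples per series."""

    def __init__(self):
        self.series = defaultdict(list)

    def append(self, name, labels, timestamp, value):
        self.series[series_key(name, labels)].append((timestamp, float(value)))


store = TinyStore()
# Same name and labels (in any order) land in the same series.
store.append("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0", "Hostname": "gpu-07.london-1"}, 1000, 61)
store.append("DCGM_FI_DEV_GPU_TEMP", {"Hostname": "gpu-07.london-1", "gpu": "0"}, 1030, 62)
# Changing a single label value creates a new series.
store.append("DCGM_FI_DEV_GPU_TEMP", {"gpu": "1", "Hostname": "gpu-07.london-1"}, 1000, 58)
```

After these three appends the store holds two series, one with two samples and one with a single sample, which is the behaviour that makes label values the unit of cardinality.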

Four metric types matter operationally. Counters are monotonically increasing values (vllm:generation_tokens_total, DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION) that you query with rate() or increase(). Gauges are point-in-time values (DCGM_FI_DEV_GPU_TEMP, kube_pod_status_ready) queried directly. Histograms record observation distributions across pre-configured buckets and expose _sum, _count and _bucket series, queried with histogram_quantile() to estimate percentiles. Summaries pre-compute quantiles at the client; they are harder to aggregate and modern code prefers histograms.
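To make the "estimate" in histogram_quantile() concrete, here is a minimal sketch of the classic-bucket calculation: find the bucket that contains the target rank, then interpolate linearly inside it. It assumes cumulative `le` buckets with a lower bound of zero for the first bucket and ignores native histograms; the bucket values are made up for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), including (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical latency histogram: 100 requests, bounds in seconds.
e2e = [(0.5, 60), (1.0, 90), (2.5, 99), (5.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.5, e2e)
p99 = histogram_quantile(0.99, e2e)
```

The interpolation is why bucket boundaries matter: a p99 that lands in a wide bucket can only be as precise as that bucket's width.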

Prometheus pulls. The server periodically scrapes each target's /metrics endpoint and ingests whatever it finds. The pull model has three operational benefits over push: targets are stateless and require no configuration to be monitored, dead targets are detected automatically (a failed scrape is itself a signal via the synthetic up metric), and the same endpoint can be scraped by multiple Prometheus servers in different regions for HA without coordination. For workloads that genuinely cannot be scraped — short-lived batch jobs, cron tasks, edge devices behind NAT — the Pushgateway sits in front of Prometheus and accepts pushes which are then scraped normally. Treat it as an exception.
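What a scrape target has to provide is small. The sketch below, using only the Python standard library, renders one counter in the Prometheus text exposition format and serves it on /metrics. The metric name `demo_requests_total`, the `outcome` label and port 8000 are assumptions for illustration; production code would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application counter, keyed by an "outcome" label value.
REQUESTS_TOTAL = {"ok": 0, "error": 0}


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Requests handled, by outcome.",
        "# TYPE demo_requests_total counter",
    ]
    for outcome, value in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'demo_requests_total{{outcome="{outcome}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus then scrapes http://<host>:<port>/metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. The target holds no knowledge of who scrapes it or how often, which is the statelessness the pull model relies on.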

Service discovery turns the static target list into a live one. The Kubernetes SD plugin watches the Kubernetes API for Pod, Service and Endpoints changes and rewrites the scrape target list on every change; the EC2, GCE and Azure SD plugins do the same for cloud instances; the Consul SD plugin reads a Consul catalog. The Prometheus Operator adds higher-level CRDs (ServiceMonitor, PodMonitor, Probe) that codify scrape config in Kubernetes-native YAML.

Recording rules pre-compute expensive PromQL on the evaluation cadence and write the result as a new series. Alerting rules evaluate a PromQL expression and, when the result is non-empty for the for: duration, emit an alert to Alertmanager. Both run inside the same Prometheus process; both are reloaded on SIGHUP or /-/reload. On Yobitel NeoCloud, recording rules drive the per-tenant rollups that customers see in the Yobibyte console; alerting rules drive the regional SRE rota.
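The effect of the for: duration is easiest to see as a small state machine. The sketch below is a simplification under stated assumptions: one alert instance, evenly spaced evaluations, and no keep_firing_for or resolve handling. `alert_states` is a hypothetical helper, not a Prometheus API.

```python
def alert_states(results, interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    results: one bool per evaluation, True when the expression returned a sample.
    """
    states = []
    active_since = None
    for i, matched in enumerate(results):
        now = i * interval_s
        if not matched:
            # Any empty evaluation resets the pending clock.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# 30 s evaluation interval, for: 60s, with one gap in the condition.
timeline = alert_states([True, True, False, True, True, True], interval_s=30, for_s=60)
```

In this timeline the single empty evaluation pushes the alert back to the start of its pending window, so it only fires at the sixth evaluation. That is why a flapping expression with a long for: may never page.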

Scrape loop: per-target goroutine; configurable scrape_interval (global default), scrape_timeout and per-job overrides via scrape_configs.
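Storage: tsdb with a 2-hour in-memory head block; compaction merges persisted blocks into progressively larger ones (up to 10% of the retention window, capped at 31 days); chunks store timestamps as delta-of-delta and values XOR-encoded (Gorilla-style), with Snappy used for WAL compression rather than for chunks.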
Service discovery: 20+ SD mechanisms — kubernetes_sd_configs, ec2_sd_configs, consul_sd_configs, file_sd_configs, http_sd_configs.
Recording rules: pre-aggregate sum, rate, histogram_quantile so dashboards query a cheap rollup instead of recomputing 100k series every refresh.
Alerting rules: PromQL expression + for: window + labels; routed via Alertmanager to PagerDuty, Slack, OpsGenie, webhooks, email.
Remote write: WAL-tailing exporter to Thanos / Mimir / Cortex / VictoriaMetrics for long-term retention and global query.
Exemplars: trace-ID sample attached to a metric data point — the bridge from a PromQL panel to an OpenTelemetry trace in Tempo or Jaeger.
Native histograms (stable in v3.x): high-resolution exponential-bucket histograms that replace the classic fixed-bucket form at a fraction of the series count.

Reference and specifications

The reference below documents the configuration sections and PromQL operators that an AI-infrastructure operator touches most. The full Prometheus configuration schema is much larger — TLS, OAuth2, file_sd, blackbox probing, agent mode — but the table here covers the production surface for a GPU fleet plus the PromQL features that drive every recording rule, alerting rule and Grafana dashboard you will write.

Section / operator	Type	Purpose
`global.scrape_interval`	config	Default scrape cadence; 15-60 s typical, 30 s on GPU fleets.
`global.evaluation_interval`	config	Rule evaluation cadence; usually equal to `scrape_interval`.
`global.external_labels`	config	Labels stamped on every sample sent via `remote_write` — `cluster`, `region`, `env`.
`scrape_configs[].job_name`	config	Logical grouping; surfaces as the `job` label on every series.
`scrape_configs[].kubernetes_sd_configs`	config	Watch the Kubernetes API for pod / service / endpoints / node targets.
`scrape_configs[].relabel_configs`	config	Rewrite target metadata before the scrape (filter, rename, route).
`scrape_configs[].metric_relabel_configs`	config	Rewrite metric labels after the scrape (drop high-cardinality labels, rename).
`rule_files[]`	config	Glob list of recording and alerting rule files; reloaded on SIGHUP.
`alerting.alertmanagers[]`	config	Targets that receive alert payloads — typically a 3-node Alertmanager cluster.
`remote_write[].url`	config	WAL-tailed forwarder to Thanos Receive, Mimir, Cortex or VictoriaMetrics.
`remote_write[].queue_config`	config	Per-shard buffer, batch size, retry backoff — critical for downstream pushback.
`recording_rule.record / .expr`	rule	Pre-compute a PromQL expression; result written as a new series.
`alerting_rule.alert / .expr / .for`	rule	Fire alert when expression non-empty for `for:` duration.
`rate(counter[1m])`	PromQL	Per-second average rate of increase of a counter over the window.
`irate(counter[1m])`	PromQL	Instantaneous rate from the last two samples — for fast-moving counters.
`sum by (label) (...)`	PromQL	Aggregate keeping the named labels, dropping the rest.
`histogram_quantile(0.99, ...)`	PromQL	Estimate the q-th quantile from `_bucket` series.
`avg_over_time(gauge[5m])`	PromQL	Average gauge value over the range; common for utilisation alerts.
`max_over_time(gauge[5m]) > THRESHOLD`	PromQL	Threshold-with-debounce; pair with `for:` for cleaner alerts.
`label_replace(...)`	PromQL	Rewrite a label inline — the join glue between metrics from different exporters.
`group_left / group_right`	PromQL	Many-to-one vector matching — the cAdvisor `pod` × DCGM `gpu` join pattern.
`exemplar-storage` feature	config	Persist trace-ID exemplars attached to histogram observations.
`native-histograms` feature	config	Enable exponential-bucket histograms — lower cardinality than classic buckets.
`/api/v1/query` HTTP	API	Instant query — single timestamp evaluation.
`/api/v1/query_range`	API	Range query — fills a Grafana panel.
`/federate?match[]=...`	API	Scrape a subset of series from another Prometheus — basis of Yobibyte tenant federation.
`/-/reload`	API	Hot-reload config + rules; requires `--web.enable-lifecycle`.
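The difference between the rate() and irate() rows in the table can be sketched in a few lines. This is a deliberately simplified model: it handles counter resets but skips the extrapolation to the window boundaries that the real rate() performs, so the numbers will not match Prometheus exactly. The function names are hypothetical.

```python
def simple_rate(samples):
    """Per-second average increase over the samples in a range window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter reset (for example a restart),
        # so the new value is counted as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def simple_irate(samples):
    """Rate from only the last two samples in the window."""
    return simple_rate(samples[-2:])


# 30 s scrapes; the counter resets between t=30 and t=60.
window = [(0, 100), (30, 160), (60, 20), (90, 80)]
avg_rate = simple_rate(window)
last_rate = simple_irate(window)
```

rate() averages over the whole window and so smooths spikes, which suits alerts and recording rules; irate() reacts to the most recent pair of samples and suits zoomed-in graphs.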

Warning: Cardinality is the one operational pitfall everyone hits. Every unique label combination is a new series; encoding a request ID, trace ID, user ID, prompt hash or full path string as a label will explode storage and slow queries. Keep labels low-cardinality (model, node, cluster, region, tenant) and put unique identifiers in trace systems instead — exemplars link the two without the cardinality cost.
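A quick worst-case calculation shows why one unbounded label is enough to cause trouble. The sketch multiplies the number of distinct values per label, which is an upper bound (real series counts depend on which combinations actually occur); the label cardinalities below are invented for illustration.

```python
from math import prod


def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single metric name.
bounded = {"model": 5, "node": 128, "tenant": 40}
leaky = dict(bounded, request_id=1_000_000)

bounded_series = worst_case_series(bounded)
leaky_series = worst_case_series(leaky)
```

With the bounded labels the ceiling is 25,600 series; adding a request_id label with a million values multiplies that ceiling by a million, far beyond what any single server can hold.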

Workload patterns

Three workload shapes cover the bulk of Prometheus deployments on AI infrastructure: a single-cluster metrics stack for a development or small-production GPU tenancy, a federated multi-cluster stack for a NeoCloud-scale operator with long-term retention, and a customer-side scrape of a Yobibyte managed tenancy. Each pattern uses a slightly different scrape topology, retention budget and rule set.

Pattern A — single GPU cluster, kube-prometheus-stack. One HA Prometheus pair (2 replicas) per cluster scraping DCGM Exporter, node-exporter, kube-state-metrics, vLLM/TensorRT-LLM endpoints and the NVIDIA Network Operator. Local retention of 14-30 days, no remote write, Alertmanager 3-replica cluster routing to PagerDuty and Slack. The single-Prometheus footprint absorbs roughly 256 H100s of telemetry comfortably; above that, shard by namespace or move to Pattern B.

Pattern B — federated multi-cluster + Thanos. One regional Prometheus per cluster (HA pair) writing to a Thanos Receive cluster backed by an S3-compatible object store. Thanos Querier serves global PromQL across all regions; Thanos Compactor downsamples to 5-minute and 1-hour resolutions for long-range queries. This is the shape Yobitel NeoCloud regions run: every London-1, Frankfurt-1 and Virginia-1 cluster pushes into the regional long-term store, and customer dashboards in the Yobibyte console query Thanos for cross-region or 90-day views.

Pattern C — Yobibyte tenant federation into customer-owned Prometheus. The customer runs their own Prometheus (any version, any vendor, any cloud) and adds a scrape_configs block pointing at /federate?match[]=... on the tenant's Yobibyte observability endpoint. The federation returns a curated subset of metrics — GPU utilisation, inference latency, tokens served, spend — scoped to that tenant's workloads. The customer keeps full control of retention, alerting and dashboard tooling; Yobitel keeps full control of multi-tenant isolation. This is the recommended integration for customers who already have an internal observability platform and want Yobibyte as one more scrape target.
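When testing a federation endpoint by hand, the match[] selectors have to be URL-encoded and may be repeated. The sketch below builds such a URL with the standard library; the host name is the illustrative one used in the quick start, and `federate_url` is a hypothetical helper. Authentication is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(host, selectors, scheme="https"):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{scheme}://{host}/federate?{query}"


selectors = ['{__name__=~"DCGM_FI_.*"}', '{__name__=~"yobibyte_.*"}']
url = federate_url("observability.london-1.yobitel.com", selectors)

# Round-trip check: decoding the query string returns the original selectors.
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

Inside prometheus.yml the same selectors go under params, and Prometheus performs the encoding itself, so this helper is only useful for curl-style debugging.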

# Recording and alerting rules for an AI cluster
groups:
  - name: ai-cluster-recording
    interval: 30s
    rules:
      # Per-namespace token throughput, pre-aggregated for dashboards
      - record: namespace:vllm_tokens_per_second:rate1m
        expr: sum by (namespace) (rate(vllm:generation_tokens_total[1m]))

      # Per-node Tensor Core saturation, smoothed
      - record: node:tensor_core_active:avg5m
        expr: avg by (Hostname) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[5m]))

      # Per-tenant inference p99 latency
      - record: tenant:vllm_e2e_p99:5m
        expr: histogram_quantile(0.99,
                sum by (tenant, le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))

  - name: ai-cluster-alerting
    interval: 30s
    rules:
      # SLO — p99 inference latency above the regional budget
      - alert: InferenceP99LatencyHigh
        expr: tenant:vllm_e2e_p99:5m > 2.5
        for: 10m
        labels: { severity: warning, slo: latency }
        annotations:
          summary: "Tenant {{ $labels.tenant }} p99 inference latency at {{ $value | humanizeDuration }}"

      # SLO — tokens-per-second below the floor when traffic present
      - alert: InferenceThroughputCollapse
        expr: namespace:vllm_tokens_per_second:rate1m
              < 0.5 * avg_over_time(namespace:vllm_tokens_per_second:rate1m[6h])
              and namespace:vllm_tokens_per_second:rate1m > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Throughput in {{ $labels.namespace }} dropped >50% vs 6h baseline"

      # Capacity — region-wide Tensor Core saturation sustained high
      - alert: RegionTensorCoreSaturation
        expr: avg(node:tensor_core_active:avg5m) > 0.85
        for: 30m
        labels: { severity: info, team: capacity }
        annotations:
          summary: "Region {{ $externalLabels.cluster }} averaging >85% Tensor Core activity, plan capacity"

Tip: Write recording rules for any PromQL expression your dashboards or alerts evaluate more than once an interval. Grafana panels that recompute a 100k-series histogram quantile on every refresh are the most common cause of Prometheus query-side overload — pre-aggregate to a tenant:vllm_e2e_p99:5m-style series and your dashboards become near-free to render.

Sizing and capacity planning

Prometheus sizing is governed by active series count, ingest rate and retention window. As a planning anchor, a healthy AI cluster produces roughly 100-200 series per GPU from DCGM Exporter plus 500-1,500 series per inference replica from vLLM/TensorRT-LLM histograms plus 30-50 series per node from node-exporter plus 5-15k cluster-level series from kube-state-metrics. The table below maps representative fleet sizes onto Prometheus footprint, ingest rate and retention storage at a 30 s scrape interval and 14-day local retention.

Prometheus comfortably handles 10-15 million active series and 1-2 million samples per second per server on modern hardware (8 vCPU, 64 GB RAM, NVMe SSD). Above that, shard by namespace, move to one Prometheus per cluster federated through Thanos, or switch to a horizontally scalable backend (Mimir, Cortex, VictoriaMetrics cluster). On Yobitel NeoCloud the operating point is one HA Prometheus pair per region with 14-day local retention and Thanos Receive for the 12-month long-term store; this comfortably absorbs 1,024 GPUs of telemetry per region with headroom.

Default scrape interval: 30 s on production GPU clusters; drop to 15 s for SLA-critical inference; raise to 60 s for batch-only training clusters.
RAM rule of thumb: ~3 KB working set per active series (head block + index). 1 M series ≈ 3 GB RAM before query/headroom.
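Disk rule of thumb: ~1.3 bytes per compressed sample. 200 k samples/s × 86,400 s × 14 d × 1.3 B ≈ 315 GB before WAL and index overhead; the disk column in the table below is roughly three times this raw figure, consistent with provisioning headroom for WAL, index and compaction.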
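WAL: lives in the wal/ subdirectory of --storage.tsdb.path, so put the whole data directory on fast NVMe (--storage.tsdb.wal-segment-size only sets the segment file size). WAL corruption is the most painful Prometheus failure mode.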
Remote write: 1.3-1.6x the local ingest rate due to WAL re-reading; budget downstream Thanos/Mimir capacity accordingly.
HA pairs: two replicas with identical scrape config; deduplicate downstream at Thanos Querier (replica-label dedup) or at ingest via the Mimir/Cortex HA tracker.
Yobitel NeoCloud anchor: regional Prometheus pair on memory-optimised nodes (16 vCPU, 128 GB RAM, 2 TB NVMe; comparable to an r6id.4xlarge with extra disk) per 1,024-GPU region.
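The rules of thumb above combine into a small estimator. This is a planning sketch that uses only the constants quoted in this section (about 3 KB of RAM per active series and about 1.3 bytes per sample on disk); it gives raw figures before query headroom, WAL and index overhead. `estimate` is a hypothetical helper.

```python
RAM_BYTES_PER_SERIES = 3_000   # ~3 KB working set per active series
DISK_BYTES_PER_SAMPLE = 1.3    # ~1.3 bytes per compressed sample
SECONDS_PER_DAY = 86_400


def estimate(active_series, scrape_interval_s=30, retention_days=14):
    """Raw RAM and disk estimate, before headroom and overhead."""
    samples_per_s = active_series / scrape_interval_s
    ram_gb = active_series * RAM_BYTES_PER_SERIES / 1e9
    disk_gb = samples_per_s * SECONDS_PER_DAY * retention_days * DISK_BYTES_PER_SAMPLE / 1e9
    return {"samples_per_s": samples_per_s, "ram_gb": ram_gb, "disk_gb": disk_gb}


# A 1,024-GPU region at ~6 M active series, 30 s scrape, 14-day retention.
region = estimate(6_000_000)
```

For the 6 M-series case this reproduces the 200 k samples/s and roughly 315 GB worked out in the disk rule of thumb, plus about 18 GB of raw head memory; provisioned figures should sit well above these raw numbers.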

Fleet	GPUs	Active series	Samples/s	RAM (working)	Disk (14d local)	Yobitel footprint
Single dev cluster	8	~150k	~5k	4 GB	~25 GB	n/a
Small production tenancy	64	~600k	~20k	8 GB	~100 GB	Pattern A
Production tenancy + MIG	256	~3.5 M	~120k	32 GB	~600 GB	Pattern A → B at the high end
Yobitel London-1 region	1,024	~6 M	~200k	48 GB	~1.0 TB	Pattern B — HA pair + Thanos
Yobitel multi-region fleet	4,096	~24 M	~800k	n/a (sharded)	n/a	Pattern B — per-region pairs into central Thanos
Customer scraping Yobibyte tenant	varies	~10-50k filtered	~0.5-2k	1-2 GB	~5-20 GB	Pattern C — `/federate` endpoint

Limits and quotas

Prometheus has very few configuration ceilings. The constraints that matter operationally are cardinality budgets, scrape latency, WAL replay time and downstream remote_write backpressure. The table below documents each limit, the operational symptom when you hit it and the lever for raising or working around it.

Limit	Default	Operational ceiling	How to raise / work around
Active series per server	unlimited	~10-15 M before head-block memory blows up	Shard by namespace; move to Mimir/Thanos; native histograms.
Samples per second per server	unlimited	~1-2 M before scrape duration exceeds interval	Increase scrape interval; shard scrape config; recording rules.
Scrape body size	unlimited (warns at 100 MB)	Memory in head block	Set `body_size_limit` and `sample_limit` per scrape; drop high-cardinality labels via `metric_relabel_configs`.
`scrape_timeout`	10 s	Equal to `scrape_interval`	Increase per-job; investigate slow targets before bumping globally.
WAL replay on restart	n/a	Linear in series × hours of WAL	Keep `--storage.tsdb.wal-compression` on (default since v2.20); restart during low-write windows.
Retention time	15 d	Disk-bound	`--storage.tsdb.retention.time=Nd`; pair with `retention.size` cap.
Retention size	unlimited	Disk-bound	`--storage.tsdb.retention.size=200GB`; whichever of the time and size limits is reached first applies.
`remote_write` queue	10,000 samples/shard (`capacity`)	Downstream ingest rate	Tune `queue_config.max_shards`, `capacity`, `batch_send_deadline`.
Federation series count	unlimited	Scrape body size + scrape duration	Use `match[]` aggressively; prefer remote_write for high-volume.
Query memory per request	50 M samples (`--query.max-samples`)	Server RAM	Lower `--query.max-samples` on small servers; prefer recording rules.
Query timeout	2 m	n/a	`--query.timeout=2m`; long queries should be recording rules.
Concurrent queries	20	CPU-bound	`--query.max-concurrency=20`; Grafana panel bursts are the usual cause.
Rule evaluation latency	must be < interval	n/a	Split heavy groups; pre-aggregate; lengthen the group `interval`.
Alertmanager dedup window	5 m	n/a	`group_wait`, `group_interval`, `repeat_interval` in route config.

Warning: WAL replay time is the silent operational scar. A Prometheus with 5 M series and 6 hours of WAL can take 10-15 minutes to come up, long enough for monitoring to be down during an incident. Keep --storage.tsdb.wal-compression on (the default since v2.20), keep WAL on fast NVMe, and consider a Prometheus Agent fronting your scrape config so the long-term store keeps receiving samples even while the query Prometheus restarts.

Observability

Prometheus is itself an observability component, but its own health is worth alerting on — a silently-failing Prometheus is the worst possible failure mode because nothing else fires. Prometheus exposes its own metrics on /metrics, the most important of which cover scrape success, ingest rate, head-block size, WAL state, rule evaluation duration and remote_write queue depth. The alert rules below cover the failure modes that account for almost all production incidents.

Scrape — up == 0 for a target: scrape failed; investigate target, network, TLS, auth.
Scrape duration — scrape_duration_seconds > scrape_interval * 0.8: scrape close to overrunning; target is producing too many series.
Cardinality — prometheus_tsdb_symbol_table_size_bytes growing without bound: a label has gone high-cardinality.
Head block — prometheus_tsdb_head_series near series-budget ceiling: shard or move to long-term store.
WAL — prometheus_tsdb_wal_corruptions_total > 0: WAL corruption; replay will fail on restart.
Remote write — prometheus_remote_storage_samples_pending growing unbounded: downstream backpressure; investigate Thanos/Mimir.
Rule evaluation — prometheus_rule_evaluation_duration_seconds exceeds interval: a recording rule is slower than its interval:.
Notifications — prometheus_notifications_dropped_total > 0: Alertmanager unreachable or overloaded.

# Self-monitoring rules — Prometheus watching Prometheus
groups:
  - name: prometheus-self
    interval: 30s
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels: { severity: warning, team: observability }
        annotations:
          summary: "Scrape target {{ $labels.job }}/{{ $labels.instance }} down"

      - alert: PrometheusScrapeSlow
        # 24 s is 80% of the 30 s scrape_interval used in this entry. The alert can
        # only trigger if scrape_timeout is raised above that (default 10 s).
        expr: scrape_duration_seconds > 24
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.job }} scrape duration near interval — investigate target"

      - alert: PrometheusHighCardinality
        expr: rate(prometheus_tsdb_head_series_created_total[10m]) > 1000
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "High series creation rate — cardinality leak suspected"

      - alert: PrometheusRemoteWriteBacklog
        expr: prometheus_remote_storage_samples_pending > 100000
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "remote_write backlog growing — downstream store struggling"

      - alert: PrometheusWALCorruption
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "WAL corruption on {{ $labels.instance }} — investigate disk"

      - alert: PrometheusRuleEvaluationSlow
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 25
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Rule evaluation p99 {{ $value }}s near 30s interval"

      - alert: PrometheusNotificationsDropped
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Alertmanager notifications dropped — investigate alerting path"

Tip: Always run two Prometheus replicas with identical scrape config — and have each replica scrape the other one's /metrics. That cross-scrape is the only way to alert on a Prometheus that has stopped scraping itself.

Cost and FinOps

Prometheus itself is free under Apache 2.0 — there is no licence cost. The operational cost is the compute, RAM and storage to run it, plus the downstream long-term store. The table below puts both in USD terms for representative AI-cluster sizes, using mid-2026 anchors for self-hosted Prometheus on cloud VMs, managed Prometheus services (Grafana Cloud, AWS AMP, Chronosphere) and the Yobibyte-included observability surface.

Self-hosted on cloud VM: typical 16 vCPU / 64 GB RAM / 2 TB NVMe instance is ~$400/month on-demand, ~$200/month on 1-year reserved.
Object-store Thanos: ~$0.023/GB-month on S3 standard, ~$0.0125/GB-month on S3 IA — long retention is dominated by storage, not compute.
Managed Prometheus: priced per active series or per ingested sample, depending on the vendor (~$0.30-0.90/1k series at the time of writing); cost scales roughly linearly with cardinality.
Yobitel NeoCloud: Prometheus + Thanos + Grafana are part of the GPU rate; customers pay only the per-GPU/hr published price.
Yobibyte managed observability: the tenant federation endpoint is included; customers who federate into their own Prometheus pay only their own infrastructure cost.
FinOps wedge: high-cardinality labels are the #1 cost driver on managed Prometheus — relabel_configs to drop request-ID, prompt-hash, full-path labels before they leave the cluster.

Fleet	Active series	Self-hosted Prom + S3 Thanos (monthly USD)	Grafana Cloud (monthly USD)	AWS AMP (monthly USD)	Yobitel NeoCloud
Single dev cluster (8 GPU)	~150k	~$50 (m6i.large + 50 GB EBS)	~$80	~$60	Included in GPU rate
Production tenancy (256 GPU)	~3.5 M	~$400 (c6id.2xlarge + 1 TB EBS + S3)	~$1,800	~$1,400	Included in GPU rate
Yobitel London-1 region (1,024 GPU)	~6 M	~$700 (c6id.4xlarge HA + Thanos)	~$3,200	~$2,500	Yobitel-operated; surfaced to customers
Customer scraping Yobibyte tenant	~10-50k	~$25 (existing Prom + tiny disk)	~$40	~$30	Federation included, no extra fee

Security and compliance

By default Prometheus serves its HTTP endpoints without authentication or TLS and does not encrypt data on disk. On shared clusters the standard pattern is to keep Prometheus internal (ClusterIP-only Service, NetworkPolicy permitting only the namespaces that need access), expose Grafana as the user-facing surface, and put TLS + bearer-token auth in front of any federation or remote_write endpoint that crosses a trust boundary. The Prometheus Operator's ServiceMonitor resource supports bearerTokenSecret and tlsConfig blocks natively; bare-metal operators either use Prometheus's own --web.config.file (TLS and basic auth) or put nginx or Envoy in front.

For UK public-sector workloads (NCSC Cloud Security Principles, G-Cloud 14, OFFICIAL-handling), Prometheus telemetry remains inside the sovereign tenancy and never federates to a multi-region store. Yobitel NeoCloud's London-1 region runs an independent Prometheus + Thanos stack with no cross-region replication; sovereign customers see a one-region observability surface. The metrics themselves are operational data — GPU UUIDs, host names, namespace and pod names, request rates, latencies — and contain no customer payload. For GDPR purposes they are not personal data.

Yobibyte's customer-facing observability surface enforces three controls. Per-tenant API tokens limit federation to that tenant's scope. Series-level label filtering blocks cross-tenant leakage even when a token is misused. All federation responses are signed and rate-limited per token. Customers receive enough operational telemetry to run their own SRE rota — utilisation, latency, throughput, spend — without seeing Yobitel-internal metrics from other tenants or the regional control plane. This recipe-protected boundary is documented in the yobibyte entry.

Warning: Never expose Prometheus's /api/v1/admin/tsdb/delete_series or /-/reload HTTP endpoints to a network you do not control. --web.enable-admin-api and --web.enable-lifecycle are off by default for a reason — re-enabling them without an authenticating reverse proxy lets any caller delete production series or reload arbitrary config.

Migration and alternatives

Most migrations to Prometheus on AI infrastructure come from one of four origins: Graphite or StatsD push pipelines, cloud-native metrics (CloudWatch, Azure Monitor, Cloud Monitoring), a legacy InfluxDB + Telegraf stack, or no metrics at all. The table below documents the trade-offs of each migration path and the Prometheus-ecosystem alternatives if Prometheus itself is not the right fit.

For green-field AI clusters, install kube-prometheus-stack and skip the alternatives: the GPU runtimes and Kubernetes components listed earlier already speak Prometheus natively. For existing organisations with significant Datadog or New Relic investment, the typical pattern is to keep Prometheus inside the cluster for GPU and inference telemetry and forward to the existing vendor for unified dashboards, via remote_write where the vendor accepts it or by letting the vendor's agent scrape the same endpoints.

Migration source / alternative	Effort	What you gain	What you lose
Graphite / StatsD push	Medium	Pull model, service discovery, PromQL, AI ecosystem default	Native push; statsd_exporter bridges push clients, but some StatsD semantics differ
InfluxDB + Telegraf	Medium	Active CNCF community, GPU-runtime native integrations	InfluxQL → PromQL retraining; Flux drop-ins exist
AWS CloudWatch	Medium	Open source, portable, AI ecosystem default	CloudWatch alarms re-implemented in Prometheus + Alertmanager
Azure Monitor / GCP Cloud Monitoring	Medium	Same	Same
Datadog (keep as backend)	Low — remote_write	Cluster-local Prometheus + Datadog UI	Datadog APM tie-in remains separate
No metrics at all	Trivial — kube-prometheus-stack	Every benefit	n/a — this is the right migration
Thanos	Pair with Prometheus	Object-store long-term retention, global query, downsampling	Operational complexity; another component to run
Cortex	Pair with Prometheus	Horizontally scalable multi-tenant, blocks storage	Operational complexity
Grafana Mimir	Pair with Prometheus	Cortex fork with better operability, native multi-tenancy	AGPLv3; vendor positioning
VictoriaMetrics	Drop-in replacement	Higher compression, lower RAM, simpler operations	Smaller community; some PromQL differences
Grafana Cloud / AWS AMP / Chronosphere	Hosted	No ops; usage-based pricing	Per-series cost can outrun self-hosted at scale

# Long-term storage via Thanos sidecar — production NeoCloud pattern
# Prometheus runs with a Thanos sidecar that uploads finalised blocks to S3
# and answers Querier requests for recent samples.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: regional
  namespace: monitoring
spec:
  replicas: 2
  retention: 14d
  retentionSize: 800GiB   # keep below the 1Ti volume so WAL and compaction have headroom
  scrapeInterval: 30s
  externalLabels:
    cluster: london-1
    region: uk-london
    env: prod
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  ruleSelector: {}
  thanos:
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-secret
    image: quay.io/thanos/thanos:v0.36.1
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-nvme
        resources:
          requests:
            storage: 1Ti
---
# Thanos object-storage config (referenced above)
type: S3
config:
  bucket: yobitel-london-1-thanos
  endpoint: s3.eu-west-2.amazonaws.com
  region: eu-west-2
  sse_config:
    type: SSE-KMS
    kms_key_id: arn:aws:kms:eu-west-2:000000000000:key/abc

Note: Picking between Thanos, Mimir, Cortex and VictoriaMetrics is a 10x easier decision than picking Prometheus over a non-Prometheus stack. All four speak PromQL, all four ingest Prometheus remote_write and all four are reversible — try one, switch later if the operational profile does not suit. The bigger choice is leaving Prometheus's data model behind.

Troubleshooting

The error table below covers the failure modes that account for almost all real Prometheus incidents on AI infrastructure. Each row maps an observable symptom to the underlying cause and the minimum-viable fix. Most issues trace back to one of three root causes: cardinality leakage, downstream backpressure on remote_write, or a misconfigured scrape (TLS, auth, or service-discovery filter).

Symptom	Cause	Fix
Prometheus pod OOMKilled on restart	WAL replay loaded too many series into head block	Reduce series via `metric_relabel_configs`; raise pod memory; consider native histograms.
Scrape duration approaches scrape_interval	Target producing too many series or slow `/metrics` handler	Filter labels at scrape; raise `scrape_timeout`; sample the target.
`up == 0` for one target only	TLS, auth, or NetworkPolicy	Curl `/metrics` from a debug pod in the same namespace; check operator logs.
Cardinality explosion overnight	A label became free-form (request ID, prompt hash)	Identify via `topk(20, count by (__name__)({__name__=~".+"}))`; drop the label.
`remote_write` pending samples growing	Downstream Thanos/Mimir backpressured	Tune `queue_config`; investigate downstream; consider Prometheus Agent.
Grafana panel timeouts	PromQL evaluating millions of series per query	Pre-aggregate via recording rules; raise `--query.max-samples`.
Alerts firing but no notifications	Alertmanager unreachable, mis-routed, or silenced	Check `prometheus_notifications_dropped_total`; inspect Alertmanager `/api/v2/silences`.
Duplicate alerts from HA pair	Alertmanager dedup not configured	Use Alertmanager cluster mode (`--cluster.peer`); identical alert labels deduplicate.
Recording rule lagging behind	Rule evaluation slower than `interval`	Split rule group; pre-aggregate inputs; raise interval; profile via `/api/v1/rules`.
Federation scrape times out	`match[]` returns too many series in one body	Tighten `match[]`; switch to `remote_write` for high volume; shard by job.
`prometheus_tsdb_compactions_failed_total` increasing	Disk full, wrong permissions, or two Prometheus processes sharing one data directory	Check disk and `--storage.tsdb.path` ownership; ensure a single Prometheus per data dir.
External labels missing on remote-written samples	`global.external_labels` not set on the Prometheus spec	Add `externalLabels.cluster`, `region`, `env` to the Prometheus CRD.
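Queries return stale samples after target down	No staleness marker was written (the target flapped or Prometheus restarted), so queries keep returning the last sample for the 5-minute lookback window	Check `up` for a flapping pattern around the outage; expect the series to disappear once the lookback window (`--query.lookback-delta`, default 5m) expires.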
Yobibyte federation returns 401	Tenant API token expired or scoped to wrong tenant	Rotate the token from the Yobibyte console; verify `Authorization: Bearer` header.
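
The cardinality row above can also be checked offline against a single scrape body. The following is a minimal sketch, assuming the target serves the plain text exposition format; the `cardinality_report` helper and the sample payload are hypothetical and not part of any Prometheus tooling.

```python
import re
from collections import Counter, defaultdict

# One sample line of the text exposition format: name{label="value",...} 1.0
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+\S+')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def cardinality_report(exposition):
    """Return (series per metric name, distinct values per label name)."""
    series_per_metric = Counter()
    values_per_label = defaultdict(set)
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels = match.group(1), match.group(2) or ""
        series_per_metric[name] += 1  # each sample line is one series in a scrape
        for label, value in LABEL_RE.findall(raw_labels):
            values_per_label[label].add(value)
    distinct = {label: len(values) for label, values in values_per_label.items()}
    return series_per_metric, distinct


# Hypothetical scrape body: request_id is the free-form label to look for.
SAMPLE = '''
# TYPE http_requests_total counter
http_requests_total{model="llama",request_id="a1"} 3
http_requests_total{model="llama",request_id="b2"} 1
http_requests_total{model="llama",request_id="c3"} 7
up 1
'''

metrics, labels = cardinality_report(SAMPLE)
print(metrics.most_common(5))
print(sorted(labels.items(), key=lambda kv: kv[1], reverse=True))
```

A label whose distinct-value count tracks the series count of its metric is the usual candidate to drop with `metric_relabel_configs`.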

Where this fits in the Yobitel stack

Prometheus is the metrics layer Yobitel operates and the metrics layer Yobitel publishes. On the operator side, every NeoCloud region — London-1, Frankfurt-1, Virginia-1 — runs a Prometheus Operator-managed HA pair scraping DCGM Exporter, node-exporter, kube-state-metrics, the NVIDIA GPU Operator, the NVIDIA Network Operator, the inference replicas Yobibyte runs on behalf of tenants and the regional control plane. Samples remote_write into a regional Thanos cluster backed by an S3-compatible object store for 12-month retention, with cross-region federation through Thanos Querier for capacity-planning workloads.

On the publisher side, Yobibyte exposes the Prometheus federation surface back to every customer. Each tenant has a federation endpoint scoped to that tenant's GPU UUIDs, namespaces and inference replicas — a Prometheus federate URL with bearer-token auth that returns a curated subset of DCGM_FI_*, vllm_*, tensorrt_llm_* and yobibyte_* metrics. Customers point their own Prometheus, Grafana Cloud, Datadog agent or any OpenMetrics-compatible scraper at that endpoint and the metrics flow into their existing dashboards. The same federation surface powers the InferenceBench scoring pipeline that compares Yobitel NeoCloud throughput against public managed-inference vendors, with every metric traceable back to a Prometheus query.

On UK and EU sovereign tenancies the Prometheus + Thanos stack stays inside the sovereign region. Federation tokens are scoped to that region; cross-region replication is disabled; the Yobibyte console queries only the in-region Thanos. Customers running under NCSC Cloud Security Principles, G-Cloud 14 OFFICIAL-handling or EU Data Boundary commitments see a one-region observability surface and a documented control boundary. The recipe-protection rule applies: the Yobibyte console exposes what the customer needs (utilisation, latency, throughput, spend) without disclosing Yobitel's internal scheduling, admission or routing metrics — see the yobibyte entry for the customer-facing API shape.

References

Prometheus Documentation · Prometheus Project
Prometheus on GitHub · GitHub
Prometheus at the CNCF · Cloud Native Computing Foundation
kube-prometheus-stack Helm Chart · GitHub
Prometheus Operator · Prometheus Operator
Thanos Documentation · Thanos
OpenMetrics Specification · OpenMetrics
PromQL Functions Reference · Prometheus

Quick start

# 1. Install kube-prometheus-stack with sensible defaults for an AI cluster
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm upgrade --install kps prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --set prometheus.prometheusSpec.retention=14d \
    --set prometheus.prometheusSpec.retentionSize=200GiB \
    --set prometheus.prometheusSpec.replicas=2 \
    --set prometheus.prometheusSpec.scrapeInterval=30s \
    --set prometheus.prometheusSpec.evaluationInterval=30s \
    --set prometheus.prometheusSpec.enableRemoteWriteReceiver=true \
    --set prometheus.prometheusSpec.enableFeatures="{exemplar-storage,native-histograms}" \
    --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false

# Verify Prometheus is up and scraping
kubectl -n monitoring get prometheus,statefulset,service
kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 &

# Query — total tokens-per-second across a vLLM fleet
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(rate(vllm:generation_tokens_total[1m]))' | jq

# 2. Standalone Prometheus on bare-metal
# --web.enable-lifecycle and --web.enable-admin-api are deliberately omitted:
# add them only behind an authenticating reverse proxy.
prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus \
    --storage.tsdb.retention.time=14d \
    --storage.tsdb.retention.size=200GB \
    --enable-feature=exemplar-storage,native-histograms

# 3. Scrape a Yobibyte tenant federation endpoint from your own Prometheus
# (illustrative scrape_config — paste into your prometheus.yml)
# scrape_configs:
#   - job_name: yobibyte-tenant
#     scrape_interval: 30s
#     honor_labels: true
#     metrics_path: /federate
#     params:
#       'match[]':
#         - '{__name__=~"DCGM_FI_.*|vllm_.*|yobibyte_.*"}'
#     scheme: https
#     authorization:
#       type: Bearer
#       credentials_file: /etc/prometheus/yobibyte-token
#     static_configs:
#       - targets: ['observability.london-1.yobitel.com']

Tip: Always run Prometheus with --web.enable-lifecycle and --web.enable-admin-api disabled in untrusted environments — both let any caller reload the config or delete series. The kube-prometheus-stack defaults are safe; bare-metal operators sometimes leave both on.
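
To see what the commented scrape_config in step 3 turns into on the wire, the sketch below builds (without sending) the HTTP request for a federation scrape. It assumes the endpoint accepts the standard `/federate` path with repeated `match[]` parameters and a bearer token; the `federate_request` helper and the `example-token` value are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request


def federate_request(host, matchers, token):
    """Build (but do not send) the GET request a federation scrape issues."""
    # Each selector becomes its own match[] parameter, percent-encoded.
    query = urlencode([("match[]", m) for m in matchers])
    url = f"https://{host}/federate?{query}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


SELECTOR = '{__name__=~"DCGM_FI_.*|vllm_.*|yobibyte_.*"}'
req = federate_request(
    "observability.london-1.yobitel.com", [SELECTOR], "example-token"
)
print(req.full_url)
# Decoding the query string recovers the original selector.
print(parse_qs(urlsplit(req.full_url).query)["match[]"])
```

Printing the URL is a quick way to confirm that the selector survives encoding before debugging a 401 or an empty federation response.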

How it works

Scrape loop: per-target goroutine; configurable scrape_interval (global default), scrape_timeout and per-job overrides via scrape_configs.
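Storage: tsdb with a 2-hour in-memory head block backed by a WAL; background compaction merges persisted blocks into progressively larger ones (capped at 10% of the retention window or 31 days, whichever is smaller); chunks use Gorilla-style delta-of-delta timestamps and XOR-encoded values, while Snappy is used for WAL and remote_write payload compression.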
Service discovery: 20+ SD mechanisms — kubernetes_sd_configs, ec2_sd_configs, consul_sd_configs, file_sd_configs, http_sd_configs.
Recording rules: pre-aggregate sum, rate, histogram_quantile so dashboards query a cheap rollup instead of recomputing 100k series every refresh.
Alerting rules: PromQL expression + for: window + labels; routed via Alertmanager to PagerDuty, Slack, OpsGenie, webhooks, email.
Remote write: WAL-tailing exporter to Thanos / Mimir / Cortex / VictoriaMetrics for long-term retention and global query.
Exemplars: trace-ID sample attached to a metric data point — the bridge from a PromQL panel to an OpenTelemetry trace in Tempo or Jaeger.
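Native histograms (behind the native-histograms feature flag since v2.40 and stable only in later v3.x releases, so check the release notes for your version): high-resolution exponential-bucket histograms that replace the classic fixed-bucket form at a fraction of the series count.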
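
The XOR encoding named in the Storage item works because consecutive samples of a slowly changing gauge share most of their bits. The sketch below is a simplified illustration of that idea only: the real tsdb chunk format additionally packs leading and trailing zero counts and encodes timestamps as delta-of-delta. All function names are hypothetical.

```python
import struct


def float_bits(value):
    """Reinterpret a float64 as its 64-bit integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def bits_float(bits):
    """Inverse of float_bits."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def xor_encode(values):
    """Keep the first value raw; store every later value XORed with its predecessor."""
    out, prev = [], None
    for value in values:
        bits = float_bits(value)
        out.append(bits if prev is None else bits ^ prev)
        prev = bits
    return out


def xor_decode(encoded):
    """Undo xor_encode by XORing each word with the previously decoded value."""
    out, prev = [], None
    for word in encoded:
        bits = word if prev is None else word ^ prev
        out.append(bits_float(bits))
        prev = bits
    return out


def meaningful_bits(word):
    """Width of the span between the first and last set bit (0 for an all-zero word)."""
    if word == 0:
        return 0
    trailing_zeros = (word & -word).bit_length() - 1
    return word.bit_length() - trailing_zeros


gauge = [0.85, 0.85, 0.85, 0.86]
encoded = xor_encode(gauge)
print([meaningful_bits(word) for word in encoded[1:]])
```

Unchanged samples encode to an all-zero word, which a bit-packed format can store in a single bit; only the changed sample carries a non-trivial payload.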

Reference and specifications

Section / operator	Type	Purpose
`global.scrape_interval`	config	Default scrape cadence; 15-60 s typical, 30 s on GPU fleets.
`global.evaluation_interval`	config	Rule evaluation cadence; usually equal to `scrape_interval`.
`global.external_labels`	config	Labels stamped on every sample sent via `remote_write` — `cluster`, `region`, `env`.
`scrape_configs[].job_name`	config	Logical grouping; surfaces as the `job` label on every series.
`scrape_configs[].kubernetes_sd_configs`	config	Watch the Kubernetes API for pod / service / endpoints / node targets.
`scrape_configs[].relabel_configs`	config	Rewrite target metadata before the scrape (filter, rename, route).
`scrape_configs[].metric_relabel_configs`	config	Rewrite metric labels after the scrape (drop high-cardinality labels, rename).
`rule_files[]`	config	Glob list of recording and alerting rule files; reloaded on SIGHUP.
`alerting.alertmanagers[]`	config	Targets that receive alert payloads — typically a 3-node Alertmanager cluster.
`remote_write[].url`	config	WAL-tailed forwarder to Thanos Receive, Mimir, Cortex or VictoriaMetrics.
`remote_write[].queue_config`	config	Per-shard buffer, batch size, retry backoff — critical for downstream pushback.
`recording_rule.record / .expr`	rule	Pre-compute a PromQL expression; result written as a new series.
`alerting_rule.alert / .expr / .for`	rule	Fire alert when expression non-empty for `for:` duration.
`rate(counter[1m])`	PromQL	Per-second average rate of increase of a counter over the window.
`irate(counter[1m])`	PromQL	Instantaneous rate from the last two samples — for fast-moving counters.
`sum by (label) (...)`	PromQL	Aggregate keeping the named labels, dropping the rest.
`histogram_quantile(0.99, ...)`	PromQL	Estimate the q-th quantile from `_bucket` series.
`avg_over_time(gauge[5m])`	PromQL	Average gauge value over the range; common for utilisation alerts.
`max_over_time(gauge[5m]) > THRESHOLD`	PromQL	Threshold-with-debounce; pair with `for:` for cleaner alerts.
`label_replace(...)`	PromQL	Rewrite a label inline — the join glue between metrics from different exporters.
`group_left / group_right`	PromQL	Many-to-one vector matching — the cAdvisor `pod` × DCGM `gpu` join pattern.
`exemplar-storage` feature	config	Persist trace-ID exemplars attached to histogram observations.
`native-histograms` feature	config	Enable exponential-bucket histograms — lower cardinality than classic buckets.
`/api/v1/query` HTTP	API	Instant query — single timestamp evaluation.
`/api/v1/query_range`	API	Range query — fills a Grafana panel.
`/federate?match[]=...`	API	Scrape a subset of series from another Prometheus — basis of Yobibyte tenant federation.
`/-/reload`	API	Hot-reload config + rules; requires `--web.enable-lifecycle`.
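
The `histogram_quantile` row in the table estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that idea for classic buckets, assuming non-negative bucket bounds and cumulative counts; it is an illustration, not the server's implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from classic histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, including the +Inf bucket.
    Assumes non-negative bucket bounds and 0 < q <= 1.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if len(buckets) < 2 or total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical latency histogram: le=0.5, 1.0, 2.5 and +Inf.
LATENCY = [(0.5, 50), (1.0, 90), (2.5, 99), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(q, histogram_quantile(q, LATENCY))
```

Because the result is interpolated, its accuracy depends on bucket boundaries sitting close to the latencies you alert on.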

Warning: Cardinality is the one operational pitfall everyone hits. Every unique label combination is a new series; encoding a request ID, trace ID, user ID, prompt hash or full path string as a label will explode storage and slow queries. Keep labels low-cardinality (model, node, cluster, region, tenant) and put unique identifiers in trace systems instead — exemplars link the two without the cardinality cost.

Workload patterns

# Recording and alerting rules for an AI cluster
groups:
  - name: ai-cluster-recording
    interval: 30s
    rules:
      # Per-namespace token throughput, pre-aggregated for dashboards
      - record: namespace:vllm_tokens_per_second:rate1m
        expr: sum by (namespace) (rate(vllm:generation_tokens_total[1m]))

      # Per-node Tensor Core saturation, smoothed
      - record: node:tensor_core_active:avg5m
        expr: avg by (Hostname) (avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[5m]))

      # Per-tenant inference p99 latency
      - record: tenant:vllm_e2e_p99:5m
        expr: histogram_quantile(0.99,
                sum by (tenant, le) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))

  - name: ai-cluster-alerting
    interval: 30s
    rules:
      # SLO — p99 inference latency above the regional budget
      - alert: InferenceP99LatencyHigh
        expr: tenant:vllm_e2e_p99:5m > 2.5
        for: 10m
        labels: { severity: warning, slo: latency }
        annotations:
          summary: "Tenant {{ $labels.tenant }} p99 inference latency at {{ $value | humanizeDuration }}"

      # SLO — tokens-per-second below the floor when traffic present
      - alert: InferenceThroughputCollapse
        expr: namespace:vllm_tokens_per_second:rate1m
              < 0.5 * avg_over_time(namespace:vllm_tokens_per_second:rate1m[6h])
              and namespace:vllm_tokens_per_second:rate1m > 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "Throughput in {{ $labels.namespace }} dropped >50% vs 6h baseline"

      # Capacity — region-wide Tensor Core saturation sustained high
      - alert: RegionTensorCoreSaturation
        expr: avg(node:tensor_core_active:avg5m) > 0.85
        for: 30m
        labels: { severity: info, team: capacity }
        annotations:
          summary: "Region {{ $externalLabels.cluster }} averaging >85% Tensor Core activity; plan capacity"

Tip: Write recording rules for any PromQL expression your dashboards or alerts evaluate more than once an interval. Grafana panels that recompute a 100k-series histogram quantile on every refresh are the most common cause of Prometheus query-side overload — pre-aggregate to a tenant:vllm_e2e_p99:5m-style series and your dashboards become near-free to render.
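
The `for:` clauses in the alerting rules above delay firing until the expression has been continuously true for the stated window. A minimal sketch of that state machine follows, assuming evenly spaced evaluations; the `alert_states` helper is hypothetical and ignores details such as `keep_firing_for` and state restoration after a restart.

```python
def alert_states(active_by_eval, interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    active_by_eval: one boolean per evaluation, True when the expression
    returned a result for this alert's label set.
    """
    states = []
    active_since = None
    for i, active in enumerate(active_by_eval):
        now = i * interval_s
        if not active:
            active_since = None  # any gap resets the for: clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# 30 s evaluation interval, for: 60s; the fourth evaluation is a gap.
print(alert_states([True, True, True, False, True], interval_s=30, for_s=60))
```

A single evaluation with no result resets the timer, which is why flapping expressions pair better with `max_over_time` or a recording rule than with a very long `for:`.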

Sizing and capacity planning

Default scrape interval: 30 s on production GPU clusters; drop to 15 s for SLA-critical inference; raise to 60 s for batch-only training clusters.
RAM rule of thumb: ~3 KB working set per active series (head block + index). 1 M series ≈ 3 GB RAM before query/headroom.
Disk rule of thumb: ~1.3 bytes per compressed sample. 200 k samples/s × 86,400 s × 14 d × 1.3 B ≈ 315 GB before WAL and index overhead.
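WAL: keep the data directory (--storage.tsdb.path, which contains wal/) on fast NVMe; --storage.tsdb.wal-segment-size only sets the segment size, not the location. WAL corruption is the most painful Prometheus failure mode.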
Remote write: 1.3-1.6x the local ingest rate due to WAL re-reading; budget downstream Thanos/Mimir capacity accordingly.
HA pairs: two replicas with identical scrape config; downstream dedup at Thanos Querier or Grafana datasource layer.
Yobitel NeoCloud anchor: regional Prometheus pair on memory-optimised r6id.4xlarge-class nodes (16 vCPU, 128 GB RAM, 2 TB NVMe) per 1,024-GPU region.
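
The RAM and disk rules of thumb above are simple enough to script. The sketch below only restates the arithmetic from this list (3 KB per active series, 1.3 bytes per compressed sample); the constants are the document's estimates rather than measured values, and the function names are hypothetical.

```python
BYTES_PER_SERIES_RAM = 3_000     # ~3 KB working set per active series
BYTES_PER_SAMPLE_DISK = 1.3      # ~1.3 bytes per compressed sample
SECONDS_PER_DAY = 86_400


def ram_bytes(active_series):
    """Head-block working set, before query headroom."""
    return active_series * BYTES_PER_SERIES_RAM


def samples_per_second(active_series, scrape_interval_s):
    """Ingest rate when every series is scraped once per interval."""
    return active_series / scrape_interval_s


def disk_bytes(samples_per_s, retention_days):
    """Compressed block size, before WAL and index overhead."""
    return samples_per_s * SECONDS_PER_DAY * retention_days * BYTES_PER_SAMPLE_DISK


series = 6_000_000                      # a London-1 sized region
rate = samples_per_second(series, 30)   # 30 s scrape interval
print(f"ingest: {rate:,.0f} samples/s")
print(f"RAM:    {ram_bytes(series) / 1e9:.1f} GB")
print(f"disk:   {disk_bytes(rate, 14) / 1e9:.1f} GB for 14 d")
```

Treat the output as a floor: the fleet table that follows budgets noticeably more RAM and disk than the raw formula to leave room for queries, WAL and index overhead.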

Fleet	GPUs	Active series	Samples/s	RAM (working)	Disk (14d local)	Yobitel footprint
Single dev cluster	8	~150k	~5k	4 GB	~25 GB	n/a
Small production tenancy	64	~600k	~20k	8 GB	~100 GB	Pattern A
Production tenancy + MIG	256	~3.5 M	~120k	32 GB	~600 GB	Pattern A → B at the high end
Yobitel London-1 region	1,024	~6 M	~200k	48 GB	~1.0 TB	Pattern B — HA pair + Thanos
Yobitel multi-region fleet	4,096	~24 M	~800k	n/a (sharded)	n/a	Pattern B — per-region pairs into central Thanos
Customer scraping Yobibyte tenant	varies	~10-50k filtered	~0.5-2k	1-2 GB	~5-20 GB	Pattern C — `/federate` endpoint

Limits and quotas

Limit	Default	Operational ceiling	How to raise / work around
Active series per server	unlimited	~10-15 M before head-block memory blows up	Shard by namespace; move to Mimir/Thanos; native histograms.
Samples per second per server	unlimited	~1-2 M before scrape duration exceeds interval	Increase scrape interval; shard scrape config; recording rules.
Scrape body size	unlimited (warns at 100 MB)	Memory in head block	Set `sample_limit` per scrape; drop high-cardinality labels via `metric_relabel_configs`.
`scrape_timeout`	10 s	Equal to `scrape_interval`	Increase per-job; investigate slow targets before bumping globally.
WAL replay on restart	n/a	Linear in series × hours of WAL	Use `--storage.tsdb.wal-compression`; restart during low-write windows.
Retention time	15 d	Disk-bound	`--storage.tsdb.retention.time=Nd`; pair with `retention.size` cap.
Retention size	unlimited	Disk-bound	`--storage.tsdb.retention.size=200GB`; first-applied limit wins.
`remote_write` queue	10,000 samples/shard (`capacity`; 2,500 in older releases)	Downstream ingest rate	Tune `queue_config.max_shards`, `capacity`, `batch_send_deadline`.
Federation series count	unlimited	Scrape body size + scrape duration	Use `match[]` aggressively; prefer remote_write for high-volume.
Query memory per request	50 M samples (`--query.max-samples`)	Server RAM	Lower `--query.max-samples` to fail expensive queries early; prefer recording rules.
Query timeout	2 m	n/a	`--query.timeout=2m`; long queries should be recording rules.
Concurrent queries	20	CPU-bound	`--query.max-concurrency=20`; Grafana panel bursts are the usual cause.
Rule evaluation latency	must be < interval	n/a	Split heavy groups; pre-aggregate; raise `interval`.
Alertmanager dedup window	5 m	n/a	`group_wait`, `group_interval`, `repeat_interval` in route config.

Warning: WAL replay time is the silent operational scar. A Prometheus with 5 M series and 6 hours of WAL can take 10-15 minutes to come up — long enough for monitoring to be down during an incident. Enable --storage.tsdb.wal-compression, keep WAL on fast NVMe, and consider a Prometheus Agent fronting your scrape config so the long-term store keeps receiving samples even while the query Prometheus restarts.

Observability

Scrape — up == 0 for a target: scrape failed; investigate target, network, TLS, auth.
Scrape duration — scrape_duration_seconds > scrape_interval * 0.8: scrape close to overrunning; target is producing too many series.
Cardinality — prometheus_tsdb_symbol_table_size_bytes growing without bound: a label has gone high-cardinality.
Head block — prometheus_tsdb_head_series near series-budget ceiling: shard or move to long-term store.
WAL — prometheus_tsdb_wal_corruptions_total > 0: WAL corruption; replay will fail on restart.
Remote write — prometheus_remote_storage_samples_pending growing unbounded: downstream backpressure; investigate Thanos/Mimir.
Rule evaluation — prometheus_rule_evaluation_duration_seconds exceeds interval: a recording rule is slower than its interval:.
Notifications — prometheus_notifications_dropped_total > 0: Alertmanager unreachable or overloaded.

# Self-monitoring rules — Prometheus watching Prometheus
groups:
  - name: prometheus-self
    interval: 30s
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels: { severity: warning, team: observability }
        annotations:
          summary: "Scrape target {{ $labels.job }}/{{ $labels.instance }} down"

      - alert: PrometheusScrapeSlow
        expr: scrape_duration_seconds > 0.8 * 30   # 80% of an assumed 30s scrape_interval
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "{{ $labels.job }} scrape duration near interval — investigate target"

      - alert: PrometheusHighCardinality
        expr: rate(prometheus_tsdb_head_series_created_total[10m]) > 1000
        for: 30m
        labels: { severity: warning }
        annotations:
          summary: "High series creation rate — cardinality leak suspected"

      - alert: PrometheusRemoteWriteBacklog
        expr: prometheus_remote_storage_samples_pending > 100000
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "remote_write backlog growing — downstream store struggling"

      - alert: PrometheusWALCorruption
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "WAL corruption on {{ $labels.instance }} — investigate disk"

      - alert: PrometheusRuleEvaluationSlow
        expr: prometheus_rule_evaluation_duration_seconds{quantile="0.99"} > 25
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Rule evaluation p99 {{ $value }}s near 30s interval"

      - alert: PrometheusNotificationsDropped
        expr: increase(prometheus_notifications_dropped_total[5m]) > 0
        labels: { severity: critical }
        annotations:
          summary: "Alertmanager notifications dropped — investigate alerting path"

Tip: Always run two Prometheus replicas with identical scrape config — and have each replica scrape the other one's /metrics. That cross-scrape is the only way to alert on a Prometheus that has stopped scraping itself.

Cost and FinOps

Self-hosted on cloud VM: typical 16 vCPU / 64 GB RAM / 2 TB NVMe instance is ~$400/month on-demand, ~$200/month on 1-year reserved.
Object-store Thanos: ~$0.023/GB-month on S3 standard, ~$0.0125/GB-month on S3 IA — long retention is dominated by storage, not compute.
Managed Prometheus: priced per active series per month (~$0.30-0.90/1k series at the time of writing); cost scales linearly with cardinality.
Yobitel NeoCloud: Prometheus + Thanos + Grafana are part of the GPU rate; customers pay only the per-GPU/hr published price.
Yobibyte managed observability: the tenant federation endpoint is included; customers who federate into their own Prometheus pay only their own infrastructure cost.
FinOps wedge: high-cardinality labels are the #1 cost driver on managed Prometheus. Use metric_relabel_configs (or write_relabel_configs on the remote_write path) to drop request-ID, prompt-hash and full-path labels before they leave the cluster.
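
Because managed Prometheus is priced per active series, the cost of a cardinality change can be estimated before it ships. The sketch below applies the $0.30-0.90 per 1k series range quoted in this list; the prices are the document's approximate figures, not vendor quotes, and `managed_monthly_usd` is a hypothetical helper.

```python
def managed_monthly_usd(active_series, usd_per_1k_low=0.30, usd_per_1k_high=0.90):
    """Return the (low, high) monthly cost range for a given active-series count."""
    thousands = active_series / 1_000
    return thousands * usd_per_1k_low, thousands * usd_per_1k_high


for fleet, series in [("dev cluster", 150_000),
                      ("production tenancy", 3_500_000),
                      ("London-1 region", 6_000_000)]:
    low, high = managed_monthly_usd(series)
    print(f"{fleet}: ${low:,.0f} to ${high:,.0f} per month")
```

The same function shows the saving from dropping a label: halve the series count and both ends of the range halve with it.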

Fleet	Active series	Self-hosted Prom + S3 Thanos (monthly USD)	Grafana Cloud (monthly USD)	AWS AMP (monthly USD)	Yobitel NeoCloud
Single dev cluster (8 GPU)	~150k	~$50 (m6i.large + 50 GB EBS)	~$80	~$60	Included in GPU rate
Production tenancy (256 GPU)	~3.5 M	~$400 (c6id.2xlarge + 1 TB EBS + S3)	~$1,800	~$1,400	Included in GPU rate
Yobitel London-1 region (1,024 GPU)	~6 M	~$700 (c6id.4xlarge HA + Thanos)	~$3,200	~$2,500	Yobitel-operated; surfaced to customers
Customer scraping Yobibyte tenant	~10-50k	~$25 (existing Prom + tiny disk)	~$40	~$30	Federation included, no extra fee
