TL;DR
- InferenceBench is Yobitel's open inference economics platform — a public, vendor-neutral leaderboard at inferencebench.io that tracks 338 models, 60 GPU SKUs, and 19 inference providers in one continuously updated view.
- Rankings are driven by a composite Value score that fuses benchmark quality (MMLU, HumanEval, GSM8K, MT-Bench), measured throughput (tokens per second, TTFT, inter-token latency), and live cost in USD per million tokens across providers.
- Pricing is re-pulled every six hours via automated provider-API ingestion; throughput is predicted from a roofline model layered with CUDA kernel-level profiling and validated against HuggingFace LLM Perf and provider-reported numbers.
- Vendor-neutral by design — InferenceBench is not affiliated with any GPU vendor or cloud, accepts no paid placement, and indexes Yobitel-operated capacity alongside every other provider without preferential ranking. Community submissions are accepted with full configuration disclosure and verified before inclusion.
- Use it free at inferencebench.io: rank models by Value, compare GPUs by dollar-per-token, filter to open-weights models served by NCSC-eligible providers, and export the data for offline analysis. The same numbers power Yobibyte's marketplace and the performance side of Omniscient Compute's ranking.
Overview
Choosing how to serve a large language model is a multi-dimensional procurement problem. The model itself matters: a 7B model on a fully utilised GPU often beats a poorly batched 70B model on better hardware. The accelerator matters: an H100 SXM5 at FP8 with FlashAttention-3 looks nothing like an H100 PCIe at FP16 without paged attention. The provider matters: the same H100 can vary by 4x in dollar-per-token between a hyperscaler on-demand quote and a neocloud reservation. The shape of the request matters: 2K input / 256 output is prefill-heavy and tends to be compute-bound, 256 input / 2K output is decode-heavy and tends to be memory-bandwidth-bound, and the leaderboard has to surface both.
InferenceBench is the public answer to that procurement problem, operated by Yobitel and published openly at inferencebench.io. It indexes 338 models across chat, code, math, reasoning, vision, multimodal, embeddings, and speech categories; profiles each model across 60 GPU and accelerator SKUs spanning NVIDIA, AMD, Intel, Google, and AWS parts; and watches 19 inference providers for live price, latency, and capacity. Every leaderboard row carries a single composite Value score so a procurement reviewer can answer 'what should I actually run?' without rebuilding the analysis from raw provider quotes.
Yobitel Communications, the UK-headquartered AI infrastructure company that operates InferenceBench, treats neutrality as the product. The leaderboard regularly returns providers and SKUs that compete directly with Yobitel NeoCloud, Yobibyte, and the rest of the Yobitel stack — and that is the point. The same data feeds Yobibyte's marketplace ranking and the performance side of Omniscient Compute, so a customer who later adopts Yobitel's managed surface is consuming the same public methodology they could verify independently. Paid placement is not accepted, vendor-supplied benchmarks are clearly tagged as vendor-supplied, and the scoring formula is published in the open.
InferenceBench has three audiences. AI engineers use it to pick a model for a workload. FinOps and procurement teams use it to validate provider quotes. Researchers and journalists use it as a primary source for cross-vendor inference economics. Everything described in this entry is browsable today at inferencebench.io; no sign-in, no contract, no quota.
How to use the leaderboard
InferenceBench is a public website. Open https://inferencebench.io and the full ranked table renders by default, sorted by the composite Value score. The interface is a three-pane layout: a filter sidebar on the left, the ranked leaderboard in the centre, and a detail drawer that opens when you select a row.
Filtering is the primary motion. The sidebar offers six filter axes — model category (chat, code, math, reasoning, vision, multimodal, embeddings, speech), licence (open-weights vs proprietary), accelerator family (NVIDIA, AMD, Intel, Google, AWS), provider (any of the 19 indexed), sovereignty tag (UK NCSC OFFICIAL, EU Data Boundary, US FedRAMP-equivalent, HIPAA, ISO 27001, SOC 2, DORA, unattested), and input/output shape (nine standard shapes from 256/64 short-chat to 32K/2K long-context). Every filter encodes into the URL, so a saved view ("chat models, open-weights, NCSC-eligible, sorted by Value") is a bookmark you can share.
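Because every filter is encoded in the URL, a saved view can in principle be built and parsed programmatically. The sketch below shows the idea with Python's standard library; the query parameter names (`category`, `licence`, `sovereignty`, `sort`) and their values are hypothetical, since the site's actual query keys are not documented in this entry.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://inferencebench.io/"  # leaderboard root named in this entry


def build_view_url(filters: dict[str, str]) -> str:
    """Encode a filter selection as a shareable URL.

    The keys passed in are hypothetical parameter names, not the site's
    documented query schema.
    """
    # Sort keys so the same selection always yields the same bookmark.
    query = urlencode(sorted(filters.items()))
    return f"{BASE_URL}?{query}" if query else BASE_URL


def parse_view_url(url: str) -> dict[str, str]:
    """Recover the filter selection from a shared URL."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical saved view: chat models, open-weights, NCSC-eligible, sorted by Value.
view = {
    "category": "chat",
    "licence": "open-weights",
    "sovereignty": "ncsc-official",
    "sort": "value",
}
url = build_view_url(view)
```

Sorting the keys keeps bookmarks stable, so two people who pick the same filters in a different order end up with the same link.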
Reading a row: the Value column is the headline composite (higher is better), Quality is the normalised benchmark score, Throughput is tokens per second at the selected shape, and Price is the blended input-and-output price in USD per million tokens. Click any row to open the detail drawer, which shows per-direction price, per-benchmark quality, throughput history, and the provider-feed freshness window. Export the current filtered view as CSV via the toolbar; bulk historical snapshots are published quarterly to a public S3 bucket linked from the methodology page.
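The entry does not say how the input and output prices are blended into the single Price column. One plausible reading, shown here purely as an assumption, is a token-weighted average over the selected input/output shape. The function name and the example prices are illustrative.

```python
def blended_price_per_million(
    input_price: float,   # USD per million input tokens
    output_price: float,  # USD per million output tokens
    input_tokens: int,    # input side of the request shape
    output_tokens: int,   # output side of the request shape
) -> float:
    """Token-weighted blend of input and output price for one request shape.

    Assumption: the blend is weighted by the token counts of the shape.
    The published methodology may weight differently.
    """
    total = input_tokens + output_tokens
    if total <= 0:
        raise ValueError("shape must contain at least one token")
    return (input_price * input_tokens + output_price * output_tokens) / total


# Illustrative prices: the same rate card blends differently per shape.
prefill_heavy = blended_price_per_million(0.50, 1.50, 2048, 256)
decode_heavy = blended_price_per_million(0.50, 1.50, 256, 2048)
```

Under this assumption an output-heavy shape is pulled toward the (usually higher) output price, which is why the per-direction prices in the detail drawer are worth checking alongside the blended figure.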
If sovereignty is your gating constraint, set the sovereignty filter first (e.g. NCSC OFFICIAL) before any other axis. The leaderboard will then never show non-eligible providers regardless of price, which avoids accidentally fixating on a cheap option you cannot actually deploy on.
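The same gate-then-rank order applies to exported data. The sketch below is a minimal illustration, assuming each exported row carries a set of sovereignty tags and a Value score; the `Row` fields, provider names, and numbers are hypothetical and not the real export schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    provider: str
    value: float                 # composite Value score, higher is better
    sovereignty: frozenset[str]  # tags attested for this provider region


def eligible_ranking(rows: list[Row], required_tag: str) -> list[Row]:
    """Drop non-eligible rows first, then rank what remains by Value."""
    eligible = [row for row in rows if required_tag in row.sovereignty]
    return sorted(eligible, key=lambda row: row.value, reverse=True)


# Illustrative rows only; the highest Value row is not NCSC-eligible.
rows = [
    Row("model-a", "provider-x", 9.1, frozenset({"SOC 2"})),
    Row("model-a", "provider-y", 7.4, frozenset({"UK NCSC OFFICIAL", "ISO 27001"})),
    Row("model-b", "provider-z", 6.8, frozenset({"UK NCSC OFFICIAL"})),
]
ranked = eligible_ranking(rows, "UK NCSC OFFICIAL")
```

Filtering before sorting means the attractive but undeployable row never appears in the ranked output at all.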
Methodology
InferenceBench is built around three independent pipelines that converge on the leaderboard. The price pipeline polls each indexed provider's public pricing API every six hours, normalises units (per-token vs per-hour, input vs output, prompt vs completion) into a common dollar-per-million-tokens shape, and writes the result to a time-series store so readers can plot drift. The throughput pipeline predicts tokens-per-second for every (model, accelerator, quantisation, batch size) combination using a roofline model anchored by compute, memory bandwidth, and communication roofs, then layers kernel-level profiling for the attention path and validates against measured runs from HuggingFace LLM Perf and provider-reported numbers. The quality pipeline pulls third-party evaluations (MMLU, HumanEval, GSM8K, MT-Bench, multilingual sets) from the model's published evaluation card and verifies the headline numbers against community re-runs where available.
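To make the roofline idea concrete, here is a deliberately simplified two-roof sketch for the decode phase. It is not InferenceBench's model: it assumes one decode step reads every weight once and costs about 2 FLOPs per parameter per sequence, and it ignores KV-cache traffic, the communication roof, and kernel-level effects. The hardware figures are round illustrative numbers, not vendor specifications.

```python
def decode_step_times(params, bytes_per_param, batch_size, peak_flops, mem_bandwidth):
    """Return (compute_time, memory_time) in seconds for one decode step."""
    flops = 2.0 * params * batch_size        # ~2 FLOPs per parameter per sequence
    weight_bytes = params * bytes_per_param  # weights streamed once per step
    return flops / peak_flops, weight_bytes / mem_bandwidth


def decode_tokens_per_second(params, bytes_per_param, batch_size, peak_flops, mem_bandwidth):
    """Upper-bound throughput: the slower of the two roofs sets the step time."""
    compute_t, memory_t = decode_step_times(
        params, bytes_per_param, batch_size, peak_flops, mem_bandwidth
    )
    return batch_size / max(compute_t, memory_t)


def limiting_roof(params, bytes_per_param, batch_size, peak_flops, mem_bandwidth):
    """Name the roof that bounds this configuration."""
    compute_t, memory_t = decode_step_times(
        params, bytes_per_param, batch_size, peak_flops, mem_bandwidth
    )
    return "compute" if compute_t > memory_t else "memory"


# Illustrative accelerator: 1e15 FLOP/s compute roof, 3e12 bytes/s memory roof.
HW = dict(peak_flops=1.0e15, mem_bandwidth=3.0e12)

# Hypothetical 7B-parameter model in FP16 (2 bytes per parameter).
single = decode_tokens_per_second(7e9, 2.0, 1, **HW)
batched = decode_tokens_per_second(7e9, 2.0, 512, **HW)
```

In this toy setup a batch of one sits under the memory roof while a batch of 512 crosses over to the compute roof, which is the kind of regime change the prediction has to track across batch sizes and quantisations.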
Model selection is open. New models are ingested within seven days of public release from the model publisher's evaluation card; provider-served entries land as providers list the model. Provider selection follows quarterly cycles after capacity and pricing-API verification, with new providers nominated through the public submission form. Community-submitted benchmarks are accepted with full configuration disclosure — quantisation, batch size, sequence length, KV cache size, system prompt, and the underlying serving stack must all be declared — and are verified before inclusion.
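A submitter could pre-check a benchmark against the disclosure list above before sending it in. This is a minimal sketch: the field names are hypothetical spellings of the six items named in the paragraph, not the schema of the real submission form.

```python
# Hypothetical field names for the six disclosures listed above.
REQUIRED_FIELDS = (
    "quantisation",
    "batch_size",
    "sequence_length",
    "kv_cache_size",
    "system_prompt",
    "serving_stack",
)


def missing_disclosures(submission: dict) -> list[str]:
    """Return the required fields that are absent or set to None.

    An empty string counts as declared, so "no system prompt" can be
    stated explicitly as "".
    """
    return [name for name in REQUIRED_FIELDS if submission.get(name) is None]


# Illustrative submission that forgot to declare the KV cache size.
draft = {
    "quantisation": "fp8",
    "batch_size": 16,
    "sequence_length": 2048,
    "system_prompt": "",
    "serving_stack": "example-serving-stack",
}
gaps = missing_disclosures(draft)
```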
Neutrality is enforced procedurally. No GPU vendor or cloud provider has editorial influence over the leaderboard. Vendor-supplied benchmarks are tagged with a vendor-supplied badge and excluded from the headline ranking until an independent re-run validates them. Yobitel-operated capacity (NeoCloud) is indexed under the same rules as every other provider and routinely ranks below competitors on individual rows.
- Three pipelines: price (six-hour refresh), throughput (roofline + kernel profiling, validated against measured runs), quality (third-party evaluation cards plus community re-runs).
- New models ingested within seven days of public release; new providers added on quarterly cycles after verification.
- Vendor-supplied benchmarks are tagged and excluded from the headline ranking until independently re-run.
- Yobitel-operated capacity is indexed under the same rules as every other provider; neutrality is the product.
- Historical price and quality drift is retained for trend analysis and forecasting.
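The unit normalisation in the price pipeline comes down to simple arithmetic. The sketch below shows two conversions into USD per million tokens, assuming a per-hour quote is paired with a sustained tokens-per-second figure; the function names and example numbers are illustrative and not taken from the pipeline itself.

```python
def per_hour_to_per_million(hourly_usd: float, tokens_per_second: float) -> float:
    """Convert an hourly accelerator price into USD per million tokens.

    Assumes the throughput figure is sustained for the whole hour.
    """
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600.0
    return hourly_usd / tokens_per_hour * 1_000_000.0


def per_thousand_to_per_million(usd_per_thousand_tokens: float) -> float:
    """Convert a per-1K-token quote into USD per million tokens."""
    return usd_per_thousand_tokens * 1000.0


# Illustrative quotes: 3.60 USD/hour at 1,000 tokens/s, and 0.002 USD per 1K tokens.
hourly_quote = per_hour_to_per_million(3.60, 1000.0)
token_quote = per_thousand_to_per_million(0.002)
```

Once both quote styles are in the same unit they can be compared directly, which is what lets an hourly GPU rental sit on the same leaderboard as a per-token API.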
The Composite Value Score
The Value score is the headline number on every leaderboard row. It is a published composite that fuses three streams: benchmark quality, measured throughput, and live cost per token. At a high level the formula is quality-weighted tokens-per-dollar, normalised within each model category so a code model's HumanEval score and a chat model's MMLU score are not directly compared. The detailed weighting per category is documented openly at inferencebench.io/methodology, and the input numbers (price, throughput, quality) are surfaced as separate columns so the composite can be deconstructed at a glance.
The intent of the Value score is to answer 'what should I actually run?' rather than 'which model has the highest MMLU?' or 'which provider is cheapest?'. A high-quality model on a saturated provider can score below a slightly weaker model on a well-priced provider, and that is the right answer for most production workloads.
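The effect described above can be illustrated with a toy version of quality-weighted tokens-per-dollar. This is a sketch under stated assumptions, not the published formula: it multiplies a quality score by tokens per dollar, rescales so the best row in each category scores 100, and leaves out how throughput enters the real per-category weighting. Row names and numbers are invented.

```python
from collections import defaultdict


def value_scores(rows: list[dict]) -> dict[str, float]:
    """Toy Value score: quality x tokens-per-dollar, scaled per category.

    Assumption: normalisation divides by the best raw score in the same
    category. The real weighting is documented on the methodology page.
    """
    raw: dict[str, float] = {}
    best: dict[str, float] = defaultdict(float)
    for row in rows:
        tokens_per_dollar = 1_000_000.0 / row["price_per_million"]
        raw[row["name"]] = row["quality"] * tokens_per_dollar
        best[row["category"]] = max(best[row["category"]], raw[row["name"]])
    return {
        row["name"]: 100.0 * raw[row["name"]] / best[row["category"]] for row in rows
    }


# Illustrative rows: model-a is stronger but priced more than three times higher.
rows = [
    {"name": "model-a", "category": "chat", "quality": 0.80, "price_per_million": 2.00},
    {"name": "model-b", "category": "chat", "quality": 0.72, "price_per_million": 0.60},
    {"name": "model-c", "category": "code", "quality": 0.90, "price_per_million": 1.00},
]
scores = value_scores(rows)
```

Here the slightly weaker `model-b` tops the chat category because its price advantage outweighs the quality gap, while `model-c` is scored only against other code rows.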
The Value score does not capture everything, and readers should still apply judgement in these areas:
- Workload-specific behaviour outside the nine standard input/output shapes (long-tail prompts, agentic loops with high tool-call density, multi-turn state).
- Reliability and incident history of the provider.
- Contractual terms (rate-card vs reserved vs committed-use pricing).
- Data-residency and connectivity constraints not encoded in the eight sovereignty tags.
- Ecosystem fit such as SDK support, framework integration, and operational tooling.
Procurement reviewers should use Value as the starting filter and overlay these qualitative factors before signing a contract.
Coverage
InferenceBench's job is to be the most complete cross-vendor inference index that exists. The current coverage envelope is summarised below; counts are refreshed at every leaderboard build.
| Dimension | Coverage | Refresh cadence | Notes |
|---|---|---|---|
| Models tracked | 338 | New models added within 7 days of public release | Chat, code, math, reasoning, vision, multimodal, embeddings, speech, image generation. |
| GPU and accelerator SKUs | 60 | New SKUs added at vendor announcement plus measured-run availability | NVIDIA B300/B200/H200/H100/L40S/L4/A100/A10G/T4, AMD MI300X/MI250X, Intel Gaudi 3, Google TPU v5e/v5p, AWS Trainium2 and Inferentia2 where third-party benchmarks exist. |
| Inference providers indexed | 19 | Live price every 6 hours | Hyperscaler, neocloud, regional, sovereign, and community tier coverage. |
| Quality benchmarks | 12 headline suites | Per model release | MMLU, HumanEval, GSM8K, MT-Bench, BBH, MATH, HellaSwag, ARC-Challenge, IFEval, MMLU-Pro, multilingual sets, vision and multimodal sets where relevant. |
| Sovereignty tags | 8 classes | Per provider region | UK NCSC OFFICIAL, EU Data Boundary, US FedRAMP-equivalent, HIPAA-eligible, ISO 27001, SOC 2 Type II, DORA-aligned, and an explicit 'no compliance posture published' tag. |
| Historical price retention | 24 months rolling | Continuous | Per-SKU, per-provider, per-region price history downloadable as CSV. |
| Input/output shape profiles | 9 shapes per model | Per profiling run | From 256/64 tokens (low-latency chat) to 32K/2K (long-context summarisation). |
| Languages benchmarked | 12+ headline | Per quality refresh | English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, plus language-specific suites where published. |
How it differs from peer leaderboards
InferenceBench is one cross-vendor inference index among several. Most peer leaderboards specialise in one slice of the procurement question — quality only (OpenLLM Leaderboard, MMLU-focused single-benchmark indices), preference only (LMSYS Chatbot Arena), or quality plus a curated subset of providers (Artificial Analysis). InferenceBench's design choice is to combine quality, measured throughput, and live multi-provider pricing in a single composite, and to make sovereignty-tag filtering a first-class axis so regulated workloads can be filtered before any other consideration.
The honest comparison is below — the right choice depends on whether you need cross-provider price coverage, a particular quality methodology, or the most-cited public preference score.
| Concern | InferenceBench | Artificial Analysis | OpenLLM Leaderboard | LMSYS Chatbot Arena | MMLU-only / MT-Bench-only indices |
|---|---|---|---|---|---|
| Provider price coverage | 19 providers across hyperscaler, neocloud, regional, sovereign, community | Strong on US neoclouds and hyperscalers | Quality only — no provider price coverage | None — preference only | None |
| GPU and accelerator SKU coverage | 60 SKUs across NVIDIA, AMD, Intel, Google, AWS | Provider-rate-card driven | N/A | N/A | N/A |
| Price refresh cadence | Every 6 hours | Frequent but cadence not published | N/A | N/A | N/A |
| Sovereignty filtering | 8 tag classes, first-class filter | Limited | N/A | N/A | N/A |
| Composite Value score | Quality-weighted tokens-per-dollar across 12 benchmark suites | Performance Index | Quality only | Elo from human preference | Single benchmark |
| Throughput methodology | Roofline + kernel profiling + measured validation | Measured runs | N/A | N/A | N/A |
| Vendor neutrality | No paid placement; Yobitel-operated capacity indexed under same rules | Independent | Independent | Independent academic | Independent |
| Cost to reader | Free | Free tier + paid | Free | Free | Free |
Open data and citation
InferenceBench publishes its dataset openly. The interactive leaderboard at inferencebench.io is free with no sign-in. A public REST API returns the same data as JSON or CSV at a default rate limit of 60 requests per minute and 10,000 per day per IP; higher limits are available on request without commercial pricing. Bulk historical snapshots — per-SKU, per-provider, per-region price history covering the rolling 24-month window — are published quarterly to a public S3 bucket linked from the methodology page.
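A client that polls the API can stay under the default 60 requests per minute with a small client-side throttle. The sketch below is a generic sliding-window limiter and makes no assumptions about the API's endpoints or response headers; the class name and the injectable clock are illustrative choices.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side throttle: at most `limit` calls per `window` seconds."""

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock      # injectable so the limiter can be tested
        self._stamps = deque()  # timestamps of calls inside the window

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        # Forget calls that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) < self.limit:
            return 0.0
        return self.window - (now - self._stamps[0])

    def record(self) -> None:
        """Note that a call was just made."""
        self._stamps.append(self.clock())


# Typical loop: sleep for wait_time(), make the request, then record().
limiter = SlidingWindowLimiter(limit=60, window=60.0)
```

A second instance with `limit=10_000` and `window=86_400.0` could be checked alongside the first to respect the daily cap as well.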
What is published: the full ranked leaderboard, per-row quality breakdown by benchmark suite, per-row throughput by input/output shape, per-row price history with the source URL the price was collected from, and the per-category weighting that the Value composite uses. What is not published: provider-private commercial terms (reserved-instance discounts, committed-use agreements, negotiated rates). Only the public rate card is indexed, with the rate-card-vs-reserved gap noted on rows where the provider publishes both.
Citation policy is permissive: cite as 'InferenceBench, Yobitel Communications, [URL], accessed [date]' for academic and journalistic use, and link the leaderboard URL with the active filters preserved so readers can reproduce the view. Commercial republication (a competing leaderboard, a SaaS product) requires written permission; the contact form at inferencebench.io/contact is the route.
Roadmap
InferenceBench's roadmap is published openly and tracks three workstreams: broader coverage, deeper methodology, and richer compliance metadata. Shipped items are listed most-recent-first in the changelog on the methodology page; the items below are the publicly committed in-flight investments.
- Benchmark suite expansion — agentic-loop benchmarks (tool-use density, plan-and-execute success rate), long-context retrieval suites beyond 32K, code-execution-correctness suites alongside HumanEval pass@1.
- Provider expansion — adding regional sovereign clouds across APAC and LATAM to bring full price coverage on workloads pinned to those regions; nominated providers are visible on the public submission queue.
- Compliance tag expansion: adding EU AI Act conformance, India MeitY empanelment, and Australia IRAP tags alongside the existing eight classes (which already include DORA-aligned) as those frameworks publish stable attestation criteria.
- Throughput methodology — moving the kernel-profiling validation from a sampled subset to full per-row coverage as profiling capacity scales; widening from nine standard shapes to twelve including a long-context-summarisation shape and an agentic short-turn shape.
- Public dataset cadence — moving bulk snapshots from quarterly to monthly, with a stable JSONL schema versioned for downstream consumers.
Where InferenceBench fits in the Yobitel stack
InferenceBench is the benchmark and economics layer of Yobitel's three-platform stack. It sits alongside Yobibyte and Omniscient Compute rather than beneath them: where Yobibyte is the managed inference surface and Omniscient Compute is the vendor-neutral capacity search engine, InferenceBench is the open methodology that grounds the rankings used inside both. Yobibyte's marketplace pulls model and accelerator scoring directly from InferenceBench, so 'which model is best' inside a Yobibyte workspace is the same public number any reader can verify on the leaderboard. Omniscient Compute uses InferenceBench throughput data to populate the performance dimension of its composite Value score.
The deliberate split is that InferenceBench is public, neutral, free, and read-only. A FinOps team can use it to validate a provider quote without ever talking to Yobitel. A research team can cite it as a primary source for a paper. A procurement reviewer can hand the URL to a CFO and the methodology page answers most questions. That neutrality is what makes the data trustworthy when it feeds Yobibyte's marketplace one layer up — Yobitel cannot rig a ranking that everyone can already see.
Practically, a customer can adopt the stack at any layer. A team that just wants a model decision uses InferenceBench standalone. A team that wants the managed inference surface adopts Yobibyte and lets it consume InferenceBench internally. A team that wants vendor-neutral capacity search adopts Omniscient Compute and lets it consume InferenceBench's throughput data. The boundaries are deliberate, the APIs are stable, and InferenceBench stays free and open regardless of what else the customer adopts.
References
- InferenceBench leaderboard · InferenceBench
- InferenceBench methodology · InferenceBench
- InferenceBench product page · Yobitel
- Yobibyte platform · Yobitel
- Omniscient Compute · Yobitel
- HuggingFace LLM Performance Leaderboard · HuggingFace
- Artificial Analysis · Artificial Analysis
- FOCUS — FinOps Open Cost and Usage Specification · FinOps Foundation