InferenceBench

TL;DR

InferenceBench is Yobitel's open inference economics platform — a public, vendor-neutral leaderboard at inferencebench.io that tracks 338 models, 60 GPU SKUs, and 19 inference providers in one continuously updated view.
Rankings are driven by a composite Value score that fuses benchmark quality (MMLU, HumanEval, GSM8K, MT-Bench), measured throughput (tokens per second, TTFT, inter-token latency), and live cost in USD per million tokens across providers.
Pricing is re-pulled every six hours via automated provider-API ingestion; throughput is predicted from a roofline model layered with CUDA kernel-level profiling and validated against HuggingFace LLM Perf and provider-reported numbers.
Vendor-neutral by design — InferenceBench is not affiliated with any GPU vendor or cloud, accepts no paid placement, and indexes Yobitel-operated capacity alongside every other provider without preferential ranking. Community submissions are accepted with full configuration disclosure and verified before inclusion.
Use it free at inferencebench.io: rank models by Value, compare GPUs by dollar-per-token, filter to open-weights with NCSC-eligible providers, and export the data for offline analysis. The same numbers power Yobibyte's marketplace and the performance side of Omniscient Compute's ranking.

Overview

Choosing how to serve a large language model is a multi-dimensional procurement problem. The model itself matters: a 7B model on a well-utilised GPU often beats a poorly batched 70B model on better hardware. The accelerator matters: an H100 SXM5 at FP8 with FlashAttention-3 looks nothing like an H100 PCIe at FP16 without paged attention. The provider matters: the same H100 can vary by 4x in dollar-per-token between a hyperscaler on-demand quote and a neocloud reservation. The shape of the request matters: 2K input / 256 output is dominated by prefill and tends to be compute-bound, 256 input / 2K output is dominated by decode and tends to be memory-bandwidth-bound, and the leaderboard has to surface both.

InferenceBench is the public answer to that procurement problem, operated by Yobitel and published openly at inferencebench.io. It indexes 338 models across chat, code, math, reasoning, vision, multimodal, embeddings, and speech categories; profiles each model across 60 GPU and accelerator SKUs spanning NVIDIA, AMD, Intel, Google, and AWS parts; and watches 19 inference providers for live price, latency, and capacity. Every leaderboard row carries a single composite Value score so a procurement reviewer can answer 'what should I actually run?' without rebuilding the analysis from raw provider quotes.

Yobitel Communications, the UK-headquartered AI infrastructure company that operates InferenceBench, treats neutrality as the product. The leaderboard regularly returns providers and SKUs that compete directly with Yobitel NeoCloud, Yobibyte, and the rest of the Yobitel stack — and that is the point. The same data feeds Yobibyte's marketplace ranking and the performance side of Omniscient Compute, so a customer who later adopts Yobitel's managed surface is consuming the same public methodology they could verify independently. Paid placement is not accepted, vendor-supplied benchmarks are clearly tagged as vendor-supplied, and the scoring formula is published in the open.

InferenceBench has three audiences. AI engineers use it to pick a model for a workload. FinOps and procurement teams use it to validate provider quotes. Researchers and journalists use it as a primary source for cross-vendor inference economics. Everything described in this entry is browsable today at inferencebench.io; no sign-in, no contract, no quota.

How to use the leaderboard

InferenceBench is a public website. Open https://inferencebench.io and the full ranked table renders by default, sorted by the composite Value score. The interface is a three-pane layout: a filter sidebar on the left, the ranked leaderboard in the centre, and a detail drawer that opens when you select a row.

Filtering is the primary motion. The sidebar offers six filter axes — model category (chat, code, math, reasoning, vision, multimodal, embeddings, speech), licence (open-weights vs proprietary), accelerator family (NVIDIA, AMD, Intel, Google, AWS), provider (any of the 19 indexed), sovereignty tag (UK NCSC OFFICIAL, EU Data Boundary, US FedRAMP-equivalent, HIPAA, ISO 27001, SOC 2, DORA, unattested), and input/output shape (nine standard shapes from 256/64 short-chat to 32K/2K long-context). Every filter encodes into the URL, so a saved view ("chat models, open-weights, NCSC-eligible, sorted by Value") is a bookmark you can share.
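Because every filter is encoded in the URL, a saved view can in principle be generated from a script as well as bookmarked. The following is a minimal sketch only: the query-parameter names (`category`, `licence`, `sovereignty`, `sort`) are hypothetical, since the site's actual URL scheme is not described here, and should be checked against a URL copied from the live leaderboard.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://inferencebench.io/"  # site root, as given in the text


def build_view_url(filters: dict[str, str], sort: str = "value") -> str:
    """Encode filters as a query string (parameter names are assumptions)."""
    # Sort keys so the same filter set always yields the same shareable URL.
    params = {key: val for key, val in sorted(filters.items()) if val}
    params["sort"] = sort
    return f"{BASE_URL}?{urlencode(params)}"


def parse_view_url(url: str) -> dict[str, str]:
    """Recover the filter set from a shared URL."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


url = build_view_url({
    "category": "chat",
    "licence": "open-weights",
    "sovereignty": "UK NCSC OFFICIAL",
})
print(url)
```

Round-tripping the URL through `parse_view_url` returns the original filters plus the sort key, which is the property that makes a view reproducible by someone else.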

Reading a row: the Value column is the headline composite (higher is better), Quality is the normalised benchmark score, Throughput is tokens per second at the selected shape, and Price is USD per million tokens blended input + output. Click any row to open the detail drawer, which shows per-direction price, per-benchmark quality, throughput history, and the provider-feed freshness window. Export the current filtered view as CSV via the toolbar; bulk historical snapshots are published quarterly to a public S3 bucket linked from the methodology page.
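One plausible reading of "blended" is a token-weighted average of the input and output prices at the selected shape. The sketch below assumes that definition; the function name and the example prices are illustrative, and the leaderboard's actual blend may differ.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              input_tokens: int, output_tokens: int) -> float:
    """Token-weighted blend of input and output prices (USD per million tokens).

    Assumption: each direction is weighted by its share of the tokens in
    the selected input/output shape.
    """
    total = input_tokens + output_tokens
    if total <= 0:
        raise ValueError("shape must contain at least one token")
    return (input_price * input_tokens + output_price * output_tokens) / total


# Illustrative prices: $1.00 per million input tokens, $3.00 per million output.
prefill_heavy = blended_price_per_million(1.0, 3.0, 2048, 256)
decode_heavy = blended_price_per_million(1.0, 3.0, 256, 2048)
print(round(prefill_heavy, 2), round(decode_heavy, 2))
```

Under this assumption the same rate card gives a noticeably different blended price for a prefill-heavy shape than for a decode-heavy one, which is why the Price column depends on the shape filter.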

Tip: If sovereignty is your gating constraint, set the sovereignty filter first (e.g. NCSC OFFICIAL) before any other axis. The leaderboard will then never show non-eligible providers regardless of price, which avoids accidentally fixating on a cheap option you cannot actually deploy on.
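The same gate-first ordering can be applied to an exported view. This is a rough sketch assuming each exported row has been loaded as a dictionary with hypothetical `provider`, `value`, and `sovereignty_tags` fields; the real CSV column names may differ.

```python
def eligible_ranked(rows: list[dict], required_tag: str) -> list[dict]:
    """Drop non-eligible providers first, then rank what is left by Value."""
    eligible = [row for row in rows if required_tag in row["sovereignty_tags"]]
    return sorted(eligible, key=lambda row: row["value"], reverse=True)


# Made-up rows for illustration only.
rows = [
    {"provider": "a", "value": 90, "sovereignty_tags": {"SOC 2"}},
    {"provider": "b", "value": 70, "sovereignty_tags": {"UK NCSC OFFICIAL", "ISO 27001"}},
    {"provider": "c", "value": 80, "sovereignty_tags": {"UK NCSC OFFICIAL"}},
]
for row in eligible_ranked(rows, "UK NCSC OFFICIAL"):
    print(row["provider"], row["value"])
```

Provider "a" has the highest Value in this toy data but never appears in the result, which is the behaviour the tip describes.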

Methodology

InferenceBench is built around three independent pipelines that converge on the leaderboard. The price pipeline polls each indexed provider's public pricing API every six hours, normalises units (per-token vs per-hour, input vs output, prompt vs completion) into a common dollar-per-million-tokens shape, and writes the result to a time-series store so readers can plot drift. The throughput pipeline predicts tokens-per-second for every (model, accelerator, quantisation, batch size) combination using a roofline model anchored by compute, memory bandwidth, and communication roofs, then layers kernel-level profiling for the attention path and validates against measured runs from HuggingFace LLM Perf and provider-reported numbers. The quality pipeline pulls third-party evaluations (MMLU, HumanEval, GSM8K, MT-Bench, multilingual sets) from the model's published evaluation card and verifies the headline numbers against community re-runs where available.
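To make the roofline idea concrete, here is a deliberately simplified sketch of a decode-phase estimate with only two roofs, compute and memory bandwidth. It is not the InferenceBench predictor: it ignores the communication roof, KV-cache traffic, and kernel efficiency, and the hardware numbers are illustrative round figures rather than vendor specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    peak_flops: float      # FLOP/s at the serving precision (assumed figure)
    mem_bandwidth: float   # memory bandwidth in bytes/s (assumed figure)


def decode_tokens_per_second(params: float, bytes_per_param: float,
                             batch_size: int, hw: Accelerator) -> float:
    """Upper-bound decode throughput for one generation step over a batch."""
    # Compute roof: roughly 2 FLOPs per parameter per generated token.
    compute_time = batch_size * 2.0 * params / hw.peak_flops
    # Memory roof: the weights are read once per step, shared by the batch.
    memory_time = params * bytes_per_param / hw.mem_bandwidth
    # The slower of the two roofs sets the step time.
    step_time = max(compute_time, memory_time)
    return batch_size / step_time


hw = Accelerator(peak_flops=1.0e15, mem_bandwidth=3.0e12)  # illustrative only
small_batch = decode_tokens_per_second(7e9, 2.0, 1, hw)     # memory-bound
large_batch = decode_tokens_per_second(7e9, 2.0, 1000, hw)  # compute-bound
print(round(small_batch), round(large_batch))
```

With these assumed numbers a 7B model at two bytes per parameter stays under the memory roof until the batch reaches roughly 333 sequences, after which the compute roof takes over. That crossover is the kind of behaviour the kernel-level profiling and measured runs are then used to correct.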

Model selection is open. New models are ingested within seven days of public release from the model publisher's evaluation card; provider-served entries land as providers list the model. Provider selection follows quarterly cycles after capacity and pricing-API verification, with new providers nominated through the public submission form. Community-submitted benchmarks are accepted with full configuration disclosure — quantisation, batch size, sequence length, KV cache size, system prompt, and the underlying serving stack must all be declared — and are verified before inclusion.
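The disclosure requirement lends itself to a simple completeness check before a submission is sent. The sketch below is an assumption about how a submitter might self-check; the field names are hypothetical renderings of the six items listed above, not the schema of the public submission form.

```python
# Hypothetical field names for the six disclosures listed in the text.
REQUIRED_DISCLOSURES = (
    "quantisation",
    "batch_size",
    "sequence_length",
    "kv_cache_size",
    "system_prompt",
    "serving_stack",
)


def missing_disclosures(submission: dict) -> list[str]:
    """Return the required fields that are absent or explicitly None."""
    # An empty system prompt still counts as declared; only None or a
    # missing key is treated as undisclosed.
    return [name for name in REQUIRED_DISCLOSURES if submission.get(name) is None]


draft = {
    "quantisation": "fp8",
    "batch_size": 32,
    "sequence_length": 2048,
    "system_prompt": "",
    "serving_stack": "example-stack",  # placeholder value
}
print(missing_disclosures(draft))
```

Here the draft omits the KV cache size, so the check reports that one field as missing.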

Neutrality is enforced procedurally. No GPU vendor or cloud provider has editorial influence over the leaderboard. Vendor-supplied benchmarks are tagged with a vendor-supplied badge and excluded from the headline ranking until an independent re-run validates them. Yobitel-operated capacity (NeoCloud) is indexed under the same rules as every other provider and routinely ranks below competitors on individual rows.

Three pipelines: price (six-hour refresh), throughput (roofline + kernel profiling, validated against measured runs), quality (third-party evaluation cards plus community re-runs).
New models ingested within seven days of public release; new providers added on quarterly cycles after verification.
Vendor-supplied benchmarks are tagged and excluded from the headline ranking until independently re-run.
Yobitel-operated capacity is indexed under the same rules as every other provider; neutrality is the product.
Historical price and quality drift is retained for trend analysis and forecasting.
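The unit normalisation in the price pipeline comes down to a few conversions into dollars per million tokens. The helpers below are a minimal sketch under stated assumptions: the function names are illustrative, and the per-hour conversion assumes a sustained throughput figure is available for the same model and accelerator.

```python
TOKENS_PER_MILLION = 1_000_000
SECONDS_PER_HOUR = 3600


def per_token_to_per_million(usd_per_token: float) -> float:
    """Convert a per-token price to USD per million tokens."""
    return usd_per_token * TOKENS_PER_MILLION


def per_thousand_to_per_million(usd_per_thousand: float) -> float:
    """Convert a per-1K-token price to USD per million tokens."""
    return usd_per_thousand * 1000


def per_hour_to_per_million(usd_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly instance price to USD per million tokens.

    Assumes the instance sustains tokens_per_second for the whole hour.
    """
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return usd_per_hour / tokens_per_hour * TOKENS_PER_MILLION


# Illustrative figures: $3.60 per hour at a sustained 1,000 tokens per second.
print(round(per_hour_to_per_million(3.60, 1000.0), 4))
```

The per-hour case shows why price and throughput cannot be normalised independently: halving the sustained throughput doubles the effective dollar-per-million-tokens figure for the same hourly rate.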

The Composite Value Score

The Value score is the headline number on every leaderboard row. It is a published composite that fuses three streams: benchmark quality, measured throughput, and live cost per token. At a high level the formula is quality-weighted tokens-per-dollar, normalised within each model category so a code model's HumanEval score and a chat model's MMLU score are not directly compared. The detailed weighting per category is documented openly at inferencebench.io/methodology, and the input numbers (price, throughput, quality) are surfaced as separate columns so the composite can be deconstructed at a glance.
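As a rough illustration of "quality-weighted tokens-per-dollar, normalised within each model category", the sketch below multiplies a normalised quality score by tokens per dollar and rescales so the best row in each category scores 100. This is an assumption-laden simplification, not the published formula: the field names are hypothetical, the per-category weights are omitted, and throughput enters only through whatever price it already implies.

```python
from collections import defaultdict


def value_scores(rows: list[dict]) -> dict[str, float]:
    """Toy Value score: quality x tokens-per-dollar, scaled per category.

    Each row needs 'model', 'category', 'quality' (0 to 1) and
    'usd_per_million_tokens'; these names are assumptions.
    """
    raw: dict[str, tuple[str, float]] = {}
    best: dict[str, float] = defaultdict(float)
    for row in rows:
        if row["usd_per_million_tokens"] <= 0:
            raise ValueError("price must be positive")
        tokens_per_dollar = 1_000_000 / row["usd_per_million_tokens"]
        score = row["quality"] * tokens_per_dollar
        raw[row["model"]] = (row["category"], score)
        best[row["category"]] = max(best[row["category"]], score)
    # Normalise within category so chat and code rows are never compared.
    return {model: 100.0 * score / best[cat] for model, (cat, score) in raw.items()}


# Made-up rows for illustration only.
rows = [
    {"model": "chat-strong", "category": "chat", "quality": 0.8, "usd_per_million_tokens": 2.0},
    {"model": "chat-cheap", "category": "chat", "quality": 0.6, "usd_per_million_tokens": 1.0},
    {"model": "code-only", "category": "code", "quality": 0.5, "usd_per_million_tokens": 5.0},
]
for model, score in value_scores(rows).items():
    print(model, round(score, 1))
```

In this toy data the weaker but cheaper chat model outranks the stronger one, and the code model tops its own category regardless of the chat rows.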

The intent of the Value score is to answer 'what should I actually run?' rather than 'which model has the highest MMLU?' or 'which provider is cheapest?'. A high-quality model on a saturated provider can score below a slightly weaker model on a well-priced provider, and that is the right answer for most production workloads.

What the Value score does not capture — and where readers should still apply judgement — is workload-specific behaviour outside the nine standard input/output shapes (long-tail prompts, agentic loops with high tool-call density, multi-turn state), reliability and incident history of the provider, contractual terms (rate-card vs reserved vs committed-use pricing), data-residency and connectivity constraints not encoded in the eight sovereignty tags, and ecosystem fit such as SDK support, framework integration, and operational tooling. Procurement reviewers should use Value as the starting filter and overlay these qualitative factors before signing a contract.

Coverage

InferenceBench's job is to be the most complete cross-vendor inference index that exists. The current coverage envelope is summarised below; counts are refreshed at every leaderboard build.

Dimension	Coverage	Refresh cadence	Notes
Models tracked	338	New models added within 7 days of public release	Chat, code, math, reasoning, vision, multimodal, embeddings, speech, image generation.
GPU and accelerator SKUs	60	New SKUs added at vendor announcement plus measured-run availability	NVIDIA B300/B200/H200/H100/L40S/L4/A100/A10G/T4, AMD MI300X/MI250X, Intel Gaudi 3, Google TPU v5e/v5p, AWS Trainium2 and Inferentia2 where third-party benchmarks exist.
Inference providers indexed	19	Live price every 6 hours	Hyperscaler, neocloud, regional, sovereign, and community tier coverage.
Quality benchmarks	12 headline suites	Per model release	MMLU, HumanEval, GSM8K, MT-Bench, BBH, MATH, HellaSwag, ARC-Challenge, IFEval, MMLU-Pro, multilingual sets, vision and multimodal sets where relevant.
Sovereignty tags	8 classes	Per provider region	UK NCSC OFFICIAL, EU Data Boundary, US FedRAMP-equivalent, HIPAA-eligible, ISO 27001, SOC 2 Type II, DORA-aligned, and an explicit 'no compliance posture published' tag.
Historical price retention	24 months rolling	Continuous	Per-SKU, per-provider, per-region price history downloadable as CSV.
Input/output shape profiles	9 shapes per model	Per profiling run	From 256/64 tokens (low-latency chat) to 32K/2K (long-context summarisation).
Languages benchmarked	12+ headline	Per quality refresh	English, French, German, Spanish, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, plus language-specific suites where published.

How it differs from peer leaderboards

InferenceBench is one cross-vendor inference index among several. Most peer leaderboards specialise in one slice of the procurement question — quality only (OpenLLM Leaderboard, MMLU-focused single-benchmark indices), preference only (LMSYS Chatbot Arena), or quality plus a curated subset of providers (Artificial Analysis). InferenceBench's design choice is to combine quality, measured throughput, and live multi-provider pricing in a single composite, and to make sovereignty-tag filtering a first-class axis so regulated workloads can be filtered before any other consideration.

The honest comparison is below — the right choice depends on whether you need cross-provider price coverage, a particular quality methodology, or the most-cited public preference score.

Concern	InferenceBench	Artificial Analysis	OpenLLM Leaderboard	LMSYS Chatbot Arena	MMLU-only / MT-Bench-only indices
Provider price coverage	19 providers across hyperscaler, neocloud, regional, sovereign, community	Strong on US neoclouds and hyperscalers	Quality only — no provider price coverage	None — preference only	None
GPU and accelerator SKU coverage	60 SKUs across NVIDIA, AMD, Intel, Google, AWS	Provider-rate-card driven	N/A	N/A	N/A
Price refresh cadence	Every 6 hours	Frequent but cadence not published	N/A	N/A	N/A
Sovereignty filtering	8 tag classes, first-class filter	Limited	N/A	N/A	N/A
Composite Value score	Quality-weighted tokens-per-dollar across 12 benchmark suites	Performance Index	Quality only	Elo from human preference	Single benchmark
Throughput methodology	Roofline + kernel profiling + measured validation	Measured runs	N/A	N/A	N/A
Vendor neutrality	No paid placement; Yobitel-operated capacity indexed under same rules	Independent	Independent	Independent academic	Independent
Cost to reader	Free	Free tier + paid	Free	Free	Free

Open data and citation

InferenceBench publishes its dataset openly. The interactive leaderboard at inferencebench.io is free with no sign-in. A public REST API returns the same data as JSON or CSV at a default rate limit of 60 requests per minute and 10,000 per day per IP; higher limits are available on request without commercial pricing. Bulk historical snapshots — per-SKU, per-provider, per-region price history covering the rolling 24-month window — are published quarterly to a public S3 bucket linked from the methodology page.
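A client that pulls data in bulk can stay under the default per-minute limit by throttling itself. The sketch below is one way to do that with a sliding window; the class name is illustrative, it does not track the 10,000-per-day cap, and it makes no assumption about the API's endpoints, which are not listed here.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Blocks so that at most max_requests are sent in any window."""

    def __init__(self, max_requests: int = 60, window_seconds: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock      # injectable so the limiter can be tested
        self.sleep = sleep
        self.sent = deque()     # timestamps of requests inside the window

    def acquire(self) -> None:
        """Wait until a request slot is free, then record the request."""
        while True:
            now = self.clock()
            # Forget requests that have aged out of the window.
            while self.sent and now - self.sent[0] >= self.window:
                self.sent.popleft()
            if len(self.sent) < self.max_requests:
                self.sent.append(now)
                return
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(self.window - (now - self.sent[0]))


limiter = SlidingWindowLimiter()  # defaults mirror the 60-per-minute limit
limiter.acquire()                 # call once before each API request
```

Calling `acquire()` before every request keeps the client within 60 requests per rolling minute; a daily counter would be a natural addition for long-running jobs.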

What is published: the full ranked leaderboard, per-row quality breakdown by benchmark suite, per-row throughput by input/output shape, per-row price history with the source URL the price was scraped from, and the per-category weighting that the Value composite uses. What is not published: provider-private commercial terms (reserved-instance discounts, committed-use commitments, negotiated rates) — only the public rate card is indexed, with the rate-card-vs-reserved gap noted on rows where the provider publishes both.

Citation policy is permissive: cite as 'InferenceBench, Yobitel Communications, [URL], accessed [date]' for academic and journalistic use, and link the leaderboard URL with the active filters preserved so readers can reproduce the view. Commercial republication (a competing leaderboard, a SaaS product) requires written permission; the contact form at inferencebench.io/contact is the route.

Roadmap

InferenceBench's roadmap is published openly and tracks three workstreams: broader coverage, deeper methodology, and richer compliance metadata. Items shipped most-recent-first appear on the methodology page changelog; the items below are the publicly-committed in-flight investments.

Benchmark suite expansion — agentic-loop benchmarks (tool-use density, plan-and-execute success rate), long-context retrieval suites beyond 32K, code-execution-correctness suites alongside HumanEval pass@1.
Provider expansion — adding regional sovereign clouds across APAC and LATAM to bring full price coverage on workloads pinned to those regions; nominated providers are visible on the public submission queue.
Compliance tag expansion: adding EU AI Act conformance, India MeitY empanelment, and Australia IRAP tags as those frameworks publish stable attestation criteria.
Throughput methodology — moving the kernel-profiling validation from a sampled subset to full per-row coverage as profiling capacity scales; widening from nine standard shapes to twelve including a long-context-summarisation shape and an agentic short-turn shape.
Public dataset cadence — moving bulk snapshots from quarterly to monthly, with a stable JSONL schema versioned for downstream consumers.

Where InferenceBench fits in the Yobitel stack

InferenceBench is the benchmark and economics layer of Yobitel's three-platform stack. It sits alongside Yobibyte and Omniscient Compute rather than beneath them: where Yobibyte is the managed inference surface and Omniscient Compute is the vendor-neutral capacity search engine, InferenceBench is the open methodology that grounds the rankings used inside both. Yobibyte's marketplace pulls model and accelerator scoring directly from InferenceBench, so 'which model is best' inside a Yobibyte workspace is the same public number any reader can verify on the leaderboard. Omniscient Compute uses InferenceBench throughput data to populate the performance dimension of its composite Value score.

The deliberate split is that InferenceBench is public, neutral, free, and read-only. A FinOps team can use it to validate a provider quote without ever talking to Yobitel. A research team can cite it as a primary source for a paper. A procurement reviewer can hand the URL to a CFO and the methodology page answers most questions. That neutrality is what makes the data trustworthy when it feeds Yobibyte's marketplace one layer up — Yobitel cannot rig a ranking that everyone can already see.

Practically, a customer can adopt the stack at any layer. A team that just wants a model decision uses InferenceBench standalone. A team that wants the managed inference surface adopts Yobibyte and lets it consume InferenceBench internally. A team that wants vendor-neutral capacity search adopts Omniscient Compute and lets it consume InferenceBench's throughput data. The boundaries are deliberate, the APIs are stable, and InferenceBench stays free and open regardless of what else the customer adopts.

References

InferenceBench leaderboard · InferenceBench
InferenceBench methodology · InferenceBench
InferenceBench product page · Yobitel
Yobibyte platform · Yobitel
Omniscient Compute · Yobitel
HuggingFace LLM Performance Leaderboard · HuggingFace
Artificial Analysis · Artificial Analysis
FOCUS — FinOps Open Cost and Usage Specification · FinOps Foundation

