AI Model Performance Leaderboard

Compare 338 AI models by quality, cost & value

Name: InferenceBench
Brand: Yobitel Communications

The definitive platform for GPU inference and training economics. Vendor-neutral rankings across 60 GPUs and 19 providers, with a composite Value score that combines benchmark quality, throughput, and dollar-per-token. Pricing refreshed every 6 hours.

Models tracked: 338

GPUs monitored: 60

Providers indexed: 19

Price refresh: every 6 hours

Live Leaderboard

Top models, ranked by Value

Snapshot of the top entries per category. Full leaderboard with 338 models, filters, and per-row ROI calculators lives at inferencebench.io.

#	Model	Family	Params	Quality	Input $/M	Output $/M	Speed	Tokens/$	Context	Providers	Value
1	Qwen 2.5 7B (Most Popular)	Qwen	7.6B	70	$0.2	$0.2	142/s	350	128K	8	350.0
2	Qwen 3 8B (Best Value)	Qwen	8.2B	70	$0.2	$0.2	138/s	350	128K	6	350.0
3	Qwen 2.5 1.5B	Qwen	1.5B	—	$0.027	$0.027	220/s	1,862	32K	4	1862.0
4	Llama 3.1 8B	Llama	8B	68	$0.18	$0.18	156/s	389	128K	12	320.4
5	Mistral 7B v0.3	Mistral	7.2B	65	$0.2	$0.2	168/s	350	32K	9	310.8
6	Gemma 2 9B	Gemma	9B	71	$0.22	$0.22	124/s	318	8K	5	298.5
7	Phi-3 Mini	Phi	3.8B	69	$0.1	$0.1	245/s	700	128K	4	287.2
8	Qwen 2.5 14B	Qwen	14.7B	76	$0.35	$0.35	98/s	200	128K	7	268.9
9	Llama 3.1 70B (Best Quality)	Llama	70B	85	$0.6	$0.6	58/s	117	128K	14	254.3
10	Mixtral 8x7B	Mistral	46.7B MoE	79	$0.4	$0.4	92/s	175	32K	11	246.7

Quality: MMLU/HumanEval/GSM8K composite. Input $/M and Output $/M: dollars per million tokens. Speed: tokens per second. Value: InferenceBench composite (quality × tokens/$ × latency).

The Full Toolbox

Eight tools, one inference economics platform

InferenceBench is more than a leaderboard. It's the toolbox we use internally when sizing GPU clusters for Yobitel customers.

Leaderboard

Ranked comparison across 338 models with composite Value scoring.

Calculator

ROI and inference cost analysis for any model-GPU-provider mix.

Models Directory

Browse 338 catalogued models. Quality, pricing, context, and license.

GPU Comparison

20+ datacenter GPU SKUs from H100 SXM and B200 down to L40S and A10G.

Provider Analysis

19 inference providers tracked. Uptime, throughput, and $/M tokens.

Playground

Interactive testing. Send prompts to any model from one place.

Workload Matcher

Describe your workload; get a ranked shortlist of model-GPU stacks.

Training Leaderboard

Companion ranking focused on training throughput per dollar.
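The per-token pricing that the Calculator works with can be illustrated with a small sketch. This is not InferenceBench's calculator; the names `TokenPricing`, `request_cost`, and `monthly_cost` are hypothetical, and the only assumption is that prices are quoted in dollars per million tokens, as in the leaderboard's Input $/M and Output $/M columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical container: prices in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, given its input and output token counts."""
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def monthly_cost(pricing: TokenPricing, requests_per_day: int,
                 input_tokens: int, output_tokens: int, days: int = 30) -> float:
    """Dollar cost of a steady workload over `days` days."""
    return request_cost(pricing, input_tokens, output_tokens) * requests_per_day * days


if __name__ == "__main__":
    # Prices taken from the Llama 3.1 70B row of the snapshot table.
    llama_70b = TokenPricing(input_per_million=0.6, output_per_million=0.6)
    print(request_cost(llama_70b, input_tokens=1000, output_tokens=500))
    print(monthly_cost(llama_70b, requests_per_day=10_000,
                       input_tokens=1000, output_tokens=500))
```

At $0.6 per million tokens for both input and output, a request with 1,000 input and 500 output tokens costs about $0.0009, so 10,000 such requests a day come to roughly $270 over 30 days.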

Methodology

How we measure performance

Vendor-neutral, reproducible, and open. Every number on the leaderboard is traceable to a configuration we publish.

Composite Value Score

Rankings combine benchmark quality (MMLU, HumanEval, GSM8K), token throughput, and cost-efficiency into a single Value number: tokens-per-dollar weighted by quality and latency percentiles.
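To make the idea of "tokens-per-dollar weighted by quality and latency" concrete, here is a minimal sketch of one possible weighting. The published InferenceBench formula is not reproduced on this page, so everything below is an assumption: the function name `value_score`, the choice of dividing quality by 100, and the latency penalty relative to a 1-second reference. It will not reproduce the Value column in the table above.

```python
from typing import Optional


def value_score(quality: Optional[float],
                tokens_per_dollar: float,
                p95_latency_s: float,
                quality_ref: float = 100.0,
                latency_ref_s: float = 1.0) -> Optional[float]:
    """Illustrative composite: tokens/$ scaled by quality and latency weights.

    All weights here are hypothetical, not the published methodology.
    """
    if quality is None:
        # No quality score available, so no composite can be formed.
        return None
    # Quality weight in [0, 1] when quality is on a 0-100 scale.
    quality_weight = quality / quality_ref
    # Latency weight is 1.0 at or below the reference, and shrinks above it.
    latency_weight = latency_ref_s / max(p95_latency_s, latency_ref_s)
    return tokens_per_dollar * quality_weight * latency_weight


if __name__ == "__main__":
    print(value_score(70, 350, p95_latency_s=0.5))   # within the latency budget
    print(value_score(70, 350, p95_latency_s=2.0))   # penalized for slow tail latency
    print(value_score(None, 1862, p95_latency_s=0.5))
```

The design choice worth noting is that latency only ever penalizes: a model faster than the reference gets no bonus, which keeps the score from being dominated by very small, very fast models.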

Roofline + Kernel Profiling

Predicted throughput comes from a roofline model layered with CUDA kernel-level profiling (FlashAttention, PagedAttention, fused kernels), validated against HuggingFace LLM Perf data and provider-reported numbers.
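A roofline estimate for decode throughput can be sketched in a few lines. This is a simplified illustration, not InferenceBench's model: it assumes roughly 2 FLOPs per parameter per generated token, assumes the weights are read from memory once per decode step, and ignores KV cache traffic and kernel-level effects entirely. The function name and the hardware figures in the example are hypothetical round numbers, not specifications of any real GPU.

```python
def roofline_decode_tokens_per_s(params: float,
                                 bytes_per_param: float,
                                 batch_size: int,
                                 peak_flops: float,
                                 mem_bandwidth: float) -> float:
    """Upper-bound decode throughput (tokens/s) from a two-term roofline.

    params          -- number of model parameters
    bytes_per_param -- e.g. 2.0 for 16-bit weights
    batch_size      -- sequences decoded together in one step
    peak_flops      -- peak compute in FLOP/s
    mem_bandwidth   -- memory bandwidth in bytes/s
    """
    # Compute cost grows with batch size: about 2 FLOPs per parameter per token.
    compute_time = (2.0 * params * batch_size) / peak_flops
    # Weight traffic is shared by the whole batch (KV cache ignored here).
    memory_time = (params * bytes_per_param) / mem_bandwidth
    # The slower of the two limits bounds the step time.
    step_time = max(compute_time, memory_time)
    return batch_size / step_time


if __name__ == "__main__":
    # Hypothetical accelerator: 1e15 FLOP/s, 2e12 bytes/s.
    for batch in (1, 64, 1000):
        tps = roofline_decode_tokens_per_s(8e9, 2.0, batch, 1e15, 2e12)
        print(batch, round(tps))
```

With these numbers, small batches are memory-bound and throughput scales linearly with batch size; past a batch of about 500 the estimate becomes compute-bound and flattens out.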

Pricing Refreshed Every 6 Hours

Pricing across all 19 providers is re-pulled every 6 hours via automated API ingestion. Historical price trends are kept so cost forecasts reflect actual market drift.

Vendor-Neutral by Design

InferenceBench is not affiliated with any GPU vendor or cloud provider. Community-submitted benchmarks are verified before inclusion. Methodology and weighting formulas are published in the open.

FAQ

Frequently asked questions

What is an AI inference benchmark?

A measurement of how fast and cheaply a given model can run on a given GPU and serving framework. Core metrics are throughput (tokens/sec), TTFT (time to first token), ITL (inter-token latency), and dollar-cost per million tokens. InferenceBench rolls these into a single Value score.
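The core metrics above can be derived from per-token arrival timestamps. The sketch below shows one common way to compute them; the function names are hypothetical, and the cost helper assumes a single GPU billed by the hour serving at a steady aggregate throughput.

```python
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT, mean ITL, and throughput from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    total = token_times[-1] - request_start
    if total <= 0:
        raise ValueError("token timestamps must come after request_start")
    # Time to first token: delay before the first token arrives.
    ttft = token_times[0] - request_start
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "tokens_per_s": len(token_times) / total,
    }


def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost per million tokens for a GPU billed hourly."""
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25]))
    print(cost_per_million_tokens(gpu_dollars_per_hour=3.6, tokens_per_second=1000))
```

In the example, the first token arrives after 0.5 s (TTFT), later tokens arrive 0.25 s apart (ITL), and four tokens over 1.25 s give 3.2 tokens/s for that single request.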

Which GPU is fastest for LLM inference?

It depends on the model. For 70B+ models, H100 SXM and B200 dominate raw throughput. For 7B–14B models, H100 PCIe and L40S offer better price-performance. For embeddings and small models (≤3B), A10G often has the lowest dollar-per-token cost. InferenceBench's Workload Matcher returns a ranked shortlist for your specific scenario.

How often is benchmark data updated?

Pricing is refreshed every 6 hours via provider APIs. Benchmark results are updated when new GPUs, models, or serving framework versions ship. Community submissions are verified before inclusion.

Can I submit my own benchmarks?

Yes. InferenceBench accepts community submissions with full configuration (model quantization, batch size, sequence length, KV cache size, system prompt). Verified runs land on the public leaderboard.
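As an illustration of what a "full configuration" might look like, here is a hypothetical record covering the fields listed above. The actual submission schema is not given on this page; the class name, field names, units (for example `kv_cache_gb`), and validation rules below are all assumptions.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class BenchmarkSubmission:
    """Hypothetical submission record; not the real InferenceBench schema."""
    model: str
    quantization: str        # e.g. "fp16", "int8"
    batch_size: int
    sequence_length: int     # tokens
    kv_cache_gb: float       # assumed unit: gigabytes
    system_prompt: str = ""

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record looks complete."""
        problems = []
        if not self.model:
            problems.append("model is required")
        if not self.quantization:
            problems.append("quantization is required")
        if self.batch_size < 1:
            problems.append("batch_size must be at least 1")
        if self.sequence_length < 1:
            problems.append("sequence_length must be at least 1")
        if self.kv_cache_gb <= 0:
            problems.append("kv_cache_gb must be positive")
        return problems


if __name__ == "__main__":
    run = BenchmarkSubmission(model="Llama 3.1 8B", quantization="fp16",
                              batch_size=16, sequence_length=4096, kv_cache_gb=8.0)
    print(run.validate())
    print(asdict(run))
```

Recording every field explicitly is what makes a run reproducible: two throughput numbers for the same model are only comparable if quantization, batch size, and sequence length match.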

How does Yobitel relate to InferenceBench?

Yobitel builds and maintains InferenceBench as part of our open inference economics work. The platform stays neutral and free; Yobitel uses the same data internally to right-size GPU clusters for customers running on our GPU Cloud and Yobibyte platform.

Ready to right-size your inference stack?

Use InferenceBench to pick the model, GPU, and provider, then run it on Yobitel's GPU Cloud or Yobibyte platform.

Not affiliated with any GPU vendor. Methodology and weighting formulas published in the open at inferencebench.io.
