Compare 338 AI models by quality, cost & value
The definitive platform for GPU inference and training economics. Vendor-neutral rankings across 60 GPUs and 19 providers, with a composite Value score that combines benchmark quality, throughput, and dollar-per-token. Pricing refreshed every 6 hours.
338 Models tracked
60 GPUs monitored
19 Providers indexed
6hr Price refresh
Live Leaderboard
Top models, ranked by Value
Snapshot of the top entries per category. Full leaderboard with 338 models, filters, and per-row ROI calculators lives at inferencebench.io.
| # | Model | Family | Params | Quality | Input $/M | Output $/M | Speed | Tokens/$ | Context | Providers | Value |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen 2.5 7B (Most Popular) | Qwen | 7.6B | 70 | $0.2 | $0.2 | 142/s | 350 | 128K | 8 | 350.0 |
| 2 | Qwen 3 8B (Best Value) | Qwen | 8.2B | 70 | $0.2 | $0.2 | 138/s | 350 | 128K | 6 | 350.0 |
| 3 | Qwen 2.5 1.5B | Qwen | 1.5B | — | $0.027 | $0.027 | 220/s | 1,862 | 32K | 4 | 1862.0 |
| 4 | Llama 3.1 8B | Llama | 8B | 68 | $0.18 | $0.18 | 156/s | 389 | 128K | 12 | 320.4 |
| 5 | Mistral 7B v0.3 | Mistral | 7.2B | 65 | $0.2 | $0.2 | 168/s | 350 | 32K | 9 | 310.8 |
| 6 | Gemma 2 9B | Gemma | 9B | 71 | $0.22 | $0.22 | 124/s | 318 | 8K | 5 | 298.5 |
| 7 | Phi-3 Mini | Phi | 3.8B | 69 | $0.1 | $0.1 | 245/s | 700 | 128K | 4 | 287.2 |
| 8 | Qwen 2.5 14B | Qwen | 14.7B | 76 | $0.35 | $0.35 | 98/s | 200 | 128K | 7 | 268.9 |
| 9 | Llama 3.1 70B (Best Quality) | Llama | 70B | 85 | $0.6 | $0.6 | 58/s | 117 | 128K | 14 | 254.3 |
| 10 | Mixtral 8x7B | Mistral | 46.7B MoE | 79 | $0.4 | $0.4 | 92/s | 175 | 32K | 11 | 246.7 |
Quality: MMLU/HumanEval/GSM8K composite. Speed: tokens per second. Value: InferenceBench composite (quality × tokens/$ × latency).
All 338 models on inferencebench.io
The Full Toolbox
Eight tools, one inference economics platform
InferenceBench is more than a leaderboard. It's the toolbox we use internally when sizing GPU clusters for Yobitel customers.
Leaderboard
Ranked comparison across 338 models with composite Value scoring.
Calculator
ROI and inference cost analysis for any model-GPU-provider mix.
Models Directory
Browse 338 catalogued models. Quality, pricing, context, and license.
GPU Comparison
20+ datacenter GPU SKUs from H100 SXM and B200 down to L40S and A10G.
Provider Analysis
19 inference providers tracked. Uptime, throughput, and $/M tokens.
Playground
Interactive testing. Send prompts to any model from one place.
Workload Matcher
Describe your workload; get a ranked shortlist of model-GPU stacks.
Training Leaderboard
Companion ranking focused on training throughput per dollar.
Methodology
How we measure performance
Vendor-neutral, reproducible, and open. Every number on the leaderboard is traceable to a configuration we publish.
Composite Value Score
Rankings combine benchmark quality (MMLU, HumanEval, GSM8K), token throughput, and cost-efficiency into a single Value number: tokens-per-dollar weighted by quality and latency percentiles.
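As a rough illustration of how such a composite could be assembled, here is a minimal Python sketch. It is not the published InferenceBench formula: the `ModelStats` fields, the 0-100 quality scale, the `ref_latency_s` reference point, and the multiplicative weighting are all assumptions made for the example, so its output will not reproduce the leaderboard's Value or Tokens/$ columns.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-model inputs; field names are illustrative only."""
    name: str
    quality: float        # composite benchmark score on a 0-100 scale (assumed)
    price_per_m: float    # blended USD per million tokens
    p95_latency_s: float  # assumed p95 end-to-end latency in seconds


def tokens_per_dollar(price_per_m: float) -> float:
    """Convert a $/M-token price into raw tokens bought per dollar."""
    return 1_000_000 / price_per_m


def value_score(m: ModelStats, ref_latency_s: float = 1.0) -> float:
    """Tokens per dollar, scaled down by quality and by a latency penalty.

    The latency weight is 1.0 at or below the reference latency and
    shrinks proportionally as p95 latency grows past it (an assumption).
    """
    quality_weight = m.quality / 100
    latency_weight = min(1.0, ref_latency_s / m.p95_latency_s)
    return tokens_per_dollar(m.price_per_m) * quality_weight * latency_weight


if __name__ == "__main__":
    fast = ModelStats("example-fast", quality=70, price_per_m=0.2, p95_latency_s=1.0)
    slow = ModelStats("example-slow", quality=70, price_per_m=0.2, p95_latency_s=2.0)
    for m in (fast, slow):
        print(m.name, round(value_score(m)))
```

A multiplicative form means a model cannot compensate for a near-zero quality score with a very low price; an additive weighting would behave differently, and the choice between them is a design decision for whoever defines the score.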
Roofline + Kernel Profiling
Predicted throughput comes from a roofline model layered with CUDA kernel-level profiling (FlashAttention, PagedAttention, fused kernels), validated against HuggingFace LLM Perf data and provider-reported numbers.
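The roofline part of that prediction can be sketched in a few lines of Python. This is a simplified textbook roofline, not InferenceBench's model: it assumes roughly 2 FLOPs per parameter per generated token, assumes the weights are read from memory once per decode step and shared across the batch, and ignores KV cache traffic and kernel-level effects. The function name and the hardware figures in the usage example are illustrative assumptions (approximate H100 SXM numbers).

```python
def roofline_decode_tokens_per_s(
    params_b: float,         # model size in billions of parameters
    bytes_per_param: float,  # 2.0 for FP16/BF16, 1.0 for INT8, etc.
    mem_bw_gbs: float,       # memory bandwidth in GB/s
    peak_tflops: float,      # peak dense compute in TFLOPS
    batch_size: int = 1,
) -> float:
    """Upper bound on decode throughput: min(compute roof, memory roof)."""
    params = params_b * 1e9

    # Compute roof: about 2 FLOPs per parameter for each generated token.
    compute_tps = peak_tflops * 1e12 / (2 * params)

    # Memory roof: every decode step streams all weights once, and that
    # single pass yields one token for each sequence in the batch.
    steps_per_s = mem_bw_gbs * 1e9 / (params * bytes_per_param)
    memory_tps = steps_per_s * batch_size

    return min(compute_tps, memory_tps)


if __name__ == "__main__":
    # Assumed, approximate figures: ~3350 GB/s HBM and ~989 dense BF16 TFLOPS.
    for batch in (1, 64, 512):
        tps = roofline_decode_tokens_per_s(8, 2.0, 3350, 989, batch_size=batch)
        print(f"batch={batch:>3}: at most {tps:,.1f} tokens/s")
```

At small batch sizes the memory roof is the limit, which is why single-stream decode speed tracks memory bandwidth rather than peak FLOPS; the bound only becomes compute-limited once the batch is large enough to amortize the weight reads.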
Pricing Refreshed Every 6 Hours
Pricing across all 19 providers is re-pulled every 6 hours via automated API ingestion. Historical price trends are kept so cost forecasts reflect actual market drift.
Vendor-Neutral by Design
InferenceBench is not affiliated with any GPU vendor or cloud provider. Community-submitted benchmarks are verified before inclusion. Methodology and weighting formulas are published in the open.
FAQ
Frequently asked questions
What is an AI inference benchmark?
A measurement of how fast and cheaply a given model can run on a given GPU and serving framework. Core metrics are throughput (tokens/sec), TTFT (time to first token), ITL (inter-token latency), and dollar-cost per million tokens. InferenceBench rolls these into a single Value score.
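To make these metrics concrete, the sketch below derives TTFT, mean ITL, and throughput from per-token arrival timestamps, and computes the dollar cost of one request from $/M-token prices. The function names and the timestamp-list input are assumptions for illustration; real harnesses may define the metrics slightly differently (for example, excluding TTFT from throughput).

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean ITL, and throughput from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("at least one generated token is required")

    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_start

    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0

    # Throughput: generated tokens divided by total wall-clock time.
    total = token_times[-1] - request_start
    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_s": len(token_times) / total}


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one request given prices in USD per million tokens."""
    return (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )


if __name__ == "__main__":
    print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))
    print(cost_per_request(1_000, 500, 0.2, 0.2))
```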
Which GPU is fastest for LLM inference?
It depends on the model. For 70B+ models, H100 SXM and B200 dominate raw throughput. For 7B–14B models, H100 PCIe and L40S offer better price-performance. For embeddings and small models (≤3B), A10G often offers the best dollar-per-token. InferenceBench's Workload Matcher returns a ranked shortlist of model-GPU stacks for your specific scenario.
How often is benchmark data updated?
Pricing is refreshed every 6 hours via provider APIs. Benchmark results are updated when new GPUs, models, or serving framework versions ship. Community submissions are verified before inclusion.
Can I submit my own benchmarks?
Yes. InferenceBench accepts community submissions with full configuration (model quantization, batch size, sequence length, KV cache size, system prompt). Verified runs land on the public leaderboard.
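For a sense of what a "full configuration" record might contain, here is a hypothetical Python sketch. InferenceBench's actual submission schema is not described here, so the `BenchmarkSubmission` class, its field names and units, and the validation rules are assumptions that simply mirror the items listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSubmission:
    """Hypothetical submission record; not the real InferenceBench schema."""
    model: str
    quantization: str     # e.g. "fp16", "int8" (free-form in this sketch)
    batch_size: int
    sequence_length: int  # tokens per sequence
    kv_cache_gb: float    # KV cache size in GB (unit is an assumption)
    system_prompt: str


def validation_errors(sub: BenchmarkSubmission) -> list[str]:
    """Return human-readable problems; an empty list means the record looks complete."""
    errors = []
    if not sub.model.strip():
        errors.append("model is required")
    if not sub.quantization.strip():
        errors.append("quantization is required")
    if sub.batch_size < 1:
        errors.append("batch_size must be at least 1")
    if sub.sequence_length < 1:
        errors.append("sequence_length must be at least 1")
    if sub.kv_cache_gb <= 0:
        errors.append("kv_cache_gb must be positive")
    return errors


if __name__ == "__main__":
    run = BenchmarkSubmission(
        model="Llama 3.1 8B",
        quantization="fp16",
        batch_size=16,
        sequence_length=2048,
        kv_cache_gb=8.0,
        system_prompt="You are a helpful assistant.",
    )
    print(validation_errors(run))
```

Recording every one of these fields matters because throughput numbers are only comparable between runs that share quantization, batch size, and sequence length.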
How does Yobitel relate to InferenceBench?
Yobitel builds and maintains InferenceBench as part of our open inference economics work. The platform stays neutral and free; Yobitel uses the same data internally to right-size GPU clusters for customers running on our GPU Cloud and Yobibyte platform.
Ready to right-size your inference stack?
Use InferenceBench to pick the model, GPU, and provider, then run it on Yobitel's GPU Cloud or Yobibyte platform.
Not affiliated with any GPU vendor. Methodology and weighting formulas published in the open at inferencebench.io.