TL;DR
- Inference-time latency-reduction algorithm where a small fast draft model proposes N tokens and the large target model verifies them in a single parallel forward pass. Introduced concurrently by Leviathan et al. (arXiv:2211.17192, Google, November 2022) and Chen et al. (arXiv:2302.01318, DeepMind, February 2023).
- Distribution-preserving: the rejection-sampling verification step ensures that accepted tokens are statistically identical to direct sampling from the target model. Zero quality loss at any temperature.
- Expected speedup ≈ 1 + (average accepted draft length); typical chat workloads see 1.5-3x end-to-end latency reduction with a well-matched draft, climbing to 3-5x with EAGLE-2 self-draft heads on Vicuna and Llama families on H100.
- Variants in production by 2026: external draft model, Medusa heads, EAGLE-1 / EAGLE-2 feature-level draft, n-gram lookahead, tree-attention verification with multiple draft branches.
- Supported natively by vLLM (`--speculative-model`), TensorRT-LLM (`--speculative_decoding_mode`, set at engine build time), SGLang (`--speculative-algorithm`) and TGI. Used in production at Anthropic, OpenAI and inside the Yobitel Yobibyte platform for latency-sensitive interactive endpoints.
Overview
Speculative decoding is an inference-time latency-reduction algorithm that exploits a structural feature of autoregressive LLM decoding: during decode, the GPU's compute units sit mostly idle waiting on memory. The forward pass for a single decode step is memory-bandwidth-bound on every modern accelerator: the model weights have to stream through the chip to produce one token, and the compute per token is small compared to that memory cost. Hopper, Blackwell and MI300X spend the bulk of each low-batch decode step waiting on HBM. At small batch sizes the arithmetic intensity of a decode step (FLOPs per byte of weights read) falls short of what the hardware needs to keep its compute units busy by an order of magnitude or more.
Leviathan et al. (Google, arXiv:2211.17192, November 2022) and Chen et al. (DeepMind, arXiv:2302.01318, February 2023) independently observed that the right way to recover that idle compute is to issue more work per forward pass. A small fast draft model produces N candidate tokens autoregressively (cheaply, because the small model has small weights); the large target model then verifies all N candidates in a single parallel forward pass (which is essentially free compared to one decode step because the bottleneck is memory bandwidth, not compute). A rejection-sampling verification step accepts the longest prefix that the target model agrees with, samples one replacement token for the first disagreement, and the cycle repeats.
The end-to-end behaviour is that the system advances by 1 + (average accepted draft length) tokens per target forward pass instead of 1, and the wall-clock per-token latency drops by roughly the same factor, less the time spent running the draft. The verification step is provably distribution-preserving: accepted tokens are statistically identical to those that would have been sampled directly from the target at the same temperature. No quality is lost; the only cost is the engineering work to train and ship a draft, and the small constant overhead of the draft model itself.
By mid-2026 speculative decoding is a production standard for latency-sensitive endpoints. Anthropic, OpenAI and Google all report using proprietary variants in their hosted APIs; the open-source serving stacks (vLLM, TensorRT-LLM, SGLang, TGI) all implement at least three of the four major variants. The variant that wins in any given deployment depends on the draft-target alignment, the workload concurrency, and the operational tolerance for shipping per-model auxiliary heads. If you are deploying models on Yobibyte, speculative decoding is exposed as a per-workload flag on the Yobibyte Inference resource — interactive endpoints get it on by default, offline batch workloads get it off, and the platform picks the variant per (target model, batch size) from a continuously-benchmarked set.
This entry covers the algorithm, the verification mathematics, the variants and where each pays off, the runtimes that implement them, the trade-offs and known limitations, and the practical implementation notes that matter when you turn speculative decoding on in production. The aim is to let you pick variants and sizing intelligently, whether you are tuning raw vLLM, TensorRT-LLM or SGLang on your own cluster or choosing the Yobibyte Inference flag that fits your latency SLO.
How it works
The core loop has three stages.
- Stage 1: draft. The draft model, given the current context, autoregressively samples N candidate tokens (N is the draft length, typically 4-8). Because the draft is small, this is cheap: a Llama 3 8B drafting for Llama 3 70B takes roughly 10% of the wall-clock time per token.
- Stage 2: verify. The target model is invoked once with the original context plus all N draft tokens as input; its forward pass produces N + 1 sets of next-token logits (one for the prefix and one for each draft token's position). This is what makes the verification cheap: the target sees all N draft positions in a single parallel pass, using compute capacity that a one-token decode step leaves idle.
- Stage 3: accept / reject / commit. The rejection-sampling rule walks the draft tokens left to right and decides which to keep.
The verification mathematics is the elegant bit. Let q(x) be the draft model's probability of token x at a given position and p(x) be the target model's probability. For each draft token, accept with probability min(1, p(x) / q(x)). If accepted, advance to the next draft position. If rejected, sample one new token from the adjusted distribution max(0, p(x) - q(x)) renormalised, append it to the committed sequence, and discard the remaining draft tokens (they were proposed under the wrong context anyway). If all N draft tokens are accepted, one further token is sampled from the target's logits at position N + 1, so each cycle commits between 1 and N + 1 tokens. The committed sequence is then the new ground truth and the next draft cycle begins.
Why this preserves the target distribution: the rejection-sampling rule is a special case of the standard rejection-sampling lemma — accept-with-probability-min(1, p/q) plus residual-resample-from-max(0, p-q) is mathematically equivalent to sampling directly from p(x). Every accepted token is distributed exactly as if it had been sampled from the target model in the conventional one-at-a-time way. The proof is two lines in the Leviathan paper; the practical implication is that you do not need to validate output quality after enabling speculative decoding because there is no quality change to validate.
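The accept / reject / resample rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not any runtime's implementation: the function name `verify_draft` and the array layout (`p` holding N + 1 rows of target probabilities, `q` holding N rows of draft probabilities) are hypothetical choices made for this example.

```python
import numpy as np

def verify_draft(p, q, draft_tokens, rng):
    """Speculative-sampling verification for one draft sequence.

    p: array of shape (N + 1, V), target probabilities at each draft
       position plus the position after the last draft token.
    q: array of shape (N, V), draft probabilities at each draft position.
    draft_tokens: the N token ids that were sampled from q.
    Returns the committed token ids (between 1 and N + 1 of them).
    """
    committed = []
    for i, x in enumerate(draft_tokens):
        # Accept draft token x with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[i, x] / q[i, x]):
            committed.append(int(x))
            continue
        # Rejected: resample from the residual max(0, p - q), renormalised,
        # then drop the rest of the draft.
        residual = np.maximum(p[i] - q[i], 0.0)
        residual /= residual.sum()
        committed.append(int(rng.choice(len(residual), p=residual)))
        return committed
    # Every draft token was accepted: take one extra token from the
    # target's distribution at position N + 1.
    committed.append(int(rng.choice(p.shape[1], p=p[-1])))
    return committed
```

When the draft and target distributions are identical, every draft token is accepted and the call returns N + 1 tokens; when they share no probability mass at a position, the draft token is always rejected and replaced by a sample from the target.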
Tree verification generalises the idea. Instead of a single draft sequence of length N, the draft proposes a tree of candidate sequences (k branches at each step), and the target verifies the entire tree in one parallel forward pass using tree attention masks. EAGLE-2 in particular generates a dynamic draft tree whose shape adapts per context — confident draft positions narrow the tree, uncertain positions widen it. The expected accepted-token count per verification step rises from ~3-4 (single-path draft) to ~5-7 (tree draft), which directly multiplies the speedup.
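To make the tree-attention idea concrete, here is a small sketch of how a mask over draft-tree nodes could be built. It assumes the tree is given as a list of parent indices (a hypothetical representation chosen for this example, not the layout used by EAGLE-2 or any runtime); each node may attend to itself and its ancestors, and every node additionally attends to the committed context, which is not shown.

```python
import numpy as np

def tree_attention_mask(parents):
    """Build a boolean attention mask over draft-tree nodes.

    parents[i] is the index of node i's parent, or -1 when node i hangs
    directly off the committed context. mask[i, j] is True when node i
    may attend to node j, i.e. j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root of the draft tree
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches from the context (nodes 0 and 1); node 0 has children 2 and 3;
# node 2 has child 4. Node 4 therefore sees nodes 0, 2 and itself.
mask = tree_attention_mask([-1, -1, 0, 0, 2])
```

Because sibling branches never attend to each other, the target can score all candidate paths in one forward pass without one branch contaminating another.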
The expected speedup is approximately 1 + α, before subtracting the time spent drafting, where α is the average number of additional accepted tokens per target forward pass. α depends on draft-target alignment (a draft trained on the same data as the target sees higher α) and on sampling temperature (greedy / low-temp inflates α; high-temp deflates it because the target's distribution is wider and rejection is more likely).
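As a rough worked example of how α relates to a per-token acceptance rate, the sketch below assumes each draft token is accepted independently with the same probability `beta` (a simplifying assumption; real acceptance varies by position and context). Under that assumption the mean number of accepted draft tokens for draft length N is beta + beta^2 + ... + beta^N. The function names are illustrative.

```python
def expected_accepted(beta, n_draft):
    """Mean accepted draft tokens per target pass (the α in the text),
    assuming independent per-token acceptance with probability beta."""
    return sum(beta ** i for i in range(1, n_draft + 1))

def tokens_per_pass(beta, n_draft):
    """Tokens committed per target pass: accepted drafts plus the one
    token the target always contributes."""
    return 1.0 + expected_accepted(beta, n_draft)

# With beta = 0.8 and a draft length of 5:
#   alpha = 0.8 + 0.64 + 0.512 + 0.4096 + 0.32768 = 2.68928
alpha = expected_accepted(0.8, 5)
```

The geometric decay is why long drafts show diminishing returns: each extra draft position only counts if every earlier one was accepted.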
Variants and architectural choices
Four variant families dominate production deployments in 2026, each with different operational profiles. The choice between them is essentially a trade-off between training cost (build a custom draft head vs reuse an existing small model), acceptance rate ceiling (feature-level draft can hit 70-85%, external draft tops out around 55-70%), and VRAM footprint (separate draft model adds GBs; auxiliary heads add MBs).
External draft model. The original variant: a separate small LLM drafts for a larger target. The canonical pairing is Llama 3 8B drafting for Llama 3 70B — same vocabulary, same training distribution, ~10% wall-clock cost per drafted token. Acceptance rates of 50-70% are typical. Strengths: works out of the box, no extra training, easy to debug. Weaknesses: needs ~10% extra VRAM for the draft weights; cannot exceed the draft model's own quality ceiling on the easy cases; alignment to the target degrades when you fine-tune the target.
Medusa heads. Cai et al. (arXiv:2401.10774, January 2024) attached k additional decoder heads to the target model that each predict tokens at distance +1, +2, …, +k from the current position in parallel. No separate draft model; the heads share the target's representation. Training is cheap — a few GPU-hours to add the heads on top of a frozen target. Acceptance rates of 60-75% typical. Strengths: minimal VRAM overhead; no separate draft to maintain; works at any target size. Weaknesses: per-model head training; tree-verification needed to hit the published numbers.
EAGLE-1 / EAGLE-2. Li et al. (arXiv:2401.15077 and arXiv:2406.16858, 2024) predict the next feature (hidden state) rather than the next token, using a small autoregressive head over the target's penultimate-layer features. The predicted feature is fed back through the target's LM head to produce a draft token. Acceptance rates are the highest of any variant — 70-85% on Llama and Vicuna families. EAGLE-2 adds dynamic draft trees that adapt fanout per context. Strengths: highest measured speedup (3-5x end-to-end on chat); modest training cost. Weaknesses: per-(base, fine-tune) head training; falls off at very high batch sizes when the target becomes compute-bound.
N-gram lookahead. Yang et al. (arXiv:2304.04487), Saxena's prompt-lookup decoding, and Fu et al. (arXiv:2402.02057) showed that for predictable text (code, structured outputs, repetitive patterns) speculative decoding does not need a learned draft at all. An n-gram lookup over the recent context can propose candidates with surprisingly high acceptance. Strengths: zero training; no extra VRAM; works on any model. Weaknesses: only works on highly predictable text; acceptance collapses on creative prose.
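The n-gram lookup idea can be illustrated with a short sketch: find the most recent earlier occurrence of the context's trailing n-gram and propose whatever followed it. This is a simplified illustration; the function name `ngram_propose` and its parameters are hypothetical and do not mirror any runtime's internals.

```python
def ngram_propose(tokens, max_ngram=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram of `tokens`
    against earlier context. Tries the longest n-gram first and returns
    up to `num_draft` tokens that followed the most recent match,
    or an empty list when nothing matches."""
    for n in range(max_ngram, 0, -1):
        if len(tokens) <= n:
            continue
        suffix = tokens[-n:]
        # Scan right to left so the most recent earlier match wins.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                continuation = tokens[start + n:start + n + num_draft]
                if continuation:
                    return continuation
    return []

# The context ends in [1, 2, 3], which also appears at the start,
# so the tokens that followed it there are proposed as the draft.
draft = ngram_propose([1, 2, 3, 4, 1, 2, 3], max_ngram=3, num_draft=2)
```

The proposed tokens are then verified by the target exactly like any other draft, so a bad guess costs a rejected draft rather than a wrong output.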
Operationally, the typical 2026 production stack ships EAGLE-2 for the headline interactive chat path (where it delivers the strongest measured speedup), Medusa as a simpler fallback for models without an EAGLE head, an external draft (Llama 3 8B for 70B) for the cold-start case before any custom heads are trained, and n-gram lookahead automatically enabled on code-generation endpoints.
Where it is used today
The table below summarises which serving runtime supports which speculative-decoding variant by mid-2026, with the per-runtime flag to enable it. vLLM, TensorRT-LLM and SGLang support external draft, Medusa and EAGLE-2; TGI supports external draft and Medusa but only limited EAGLE-2; MLC-LLM offers external draft and n-gram lookahead only. N-gram lookahead is currently best supported in vLLM and SGLang.
| Runtime | External draft | Medusa | EAGLE-2 | N-gram lookahead | Flag |
|---|---|---|---|---|---|
| vLLM | Yes | Yes | Yes | Yes | `--speculative-model` / `--num-speculative-tokens` |
| TensorRT-LLM | Yes | Yes | Yes | No | `--speculative_decoding_mode` draft_tokens_external / medusa / eagle |
| SGLang | Yes | Yes | Yes | Yes | `--speculative-algorithm` eagle2 / medusa / ngram |
| TGI (Hugging Face) | Yes | Yes | Limited | No | `--speculate` / `--medusa-id` |
| NVIDIA NIM | Inherited from TRT-LLM | Yes | Yes | No | Inherited |
| MLC-LLM | Yes | No | No | Yes | Built-in n-gram |
Two production realities deserve emphasis beyond the table. First, the runtimes auto-gate speculation by batch size: when the running batch is large enough that the target's forward pass becomes compute-bound rather than memory-bound, the verification step is no longer free, and the engines automatically disable speculation. vLLM, TensorRT-LLM and SGLang all do this with a configurable threshold. This means speculative decoding is most valuable on interactive low-concurrency endpoints (chat, voice agents, real-time RAG) and least valuable on offline batch jobs.
Second, the hosted-API providers (Anthropic Claude, OpenAI GPT, Google Gemini) all confirm using proprietary speculative-decoding variants in their inference paths, with reported speedups in the 1.5-3x range. The proprietary variants tend to combine multiple draft strategies (EAGLE-style feature draft + n-gram fallback + retrieval-of-prior-completions) but the published open-source variants now cover the bulk of the achievable gain. Yobibyte sits in the open-source camp: speculative decoding is exposed via a Yobibyte Inference resource flag and the platform picks EAGLE-2, Medusa, an external draft pair, or n-gram lookahead per workload based on the live InferenceBench measurements for the target model on Yobitel NeoCloud capacity.
Trade-offs and known limitations
Speculative decoding's economics depend on three factors that interact non-trivially: draft cost, acceptance rate and target batch size. Draft cost has to be a small fraction of target cost (rule of thumb: draft <10% of target wall-clock per token) or the overhead eats the win; this is automatic for external drafts at ~10x size ratio and trivially true for auxiliary heads. Acceptance rate has to be high enough that the expected accepted draft length exceeds the draft overhead; rule of thumb, acceptance above 50% almost always pays off, below 30% almost never does.
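The interaction between draft cost and acceptance rate can be checked with a back-of-the-envelope model. The sketch below assumes independent per-token acceptance with probability `beta`, a verification pass that costs the same as one ordinary decode step (true only while the target is memory-bound), and a draft whose per-token cost is a fixed fraction of a target pass. All names are illustrative.

```python
def wall_clock_speedup(beta, n_draft, draft_cost_ratio):
    """Estimated speedup of one speculative cycle over plain decoding.

    beta: per-token acceptance probability (assumed independent).
    n_draft: number of draft tokens proposed per cycle.
    draft_cost_ratio: cost of one draft token relative to one target pass.
    """
    # Tokens committed per cycle: 1 + beta + beta^2 + ... + beta^n_draft.
    tokens = sum(beta ** i for i in range(n_draft + 1))
    # Cycle cost in units of one target pass: one verify plus n_draft draft steps.
    cycle_cost = 1.0 + n_draft * draft_cost_ratio
    return tokens / cycle_cost

fast = wall_clock_speedup(0.8, 5, 0.1)   # well-aligned draft
slow = wall_clock_speedup(0.3, 5, 0.1)   # poorly aligned draft
```

Under these assumptions a draft at 10% relative cost and 80% acceptance gives roughly a 2.5x speedup, while 30% acceptance comes out below 1.0, which is consistent with the rules of thumb above.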
Sampling temperature interacts directly. At temperature 0 (greedy decoding) the target's choice at each position is deterministic (the argmax token), so a well-aligned draft can match it exactly and acceptance rates approach 70-85%. At temperature 1.0 the target distribution is wide, the draft is unlikely to land on the same sample, and acceptance drops to 40-55%. Production deployments running creative generation at high temperatures see materially less benefit than chat-style deployments at temperature 0.3-0.7.
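The temperature effect can be made concrete. For a single position, the probability that a token drawn from the draft q passes the min(1, p/q) test is the sum over the vocabulary of min(p(x), q(x)), i.e. the overlap between the two distributions. The sketch below computes that overlap for a toy three-token vocabulary; the logits are made-up illustrative values, and the helper names are hypothetical.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def acceptance_probability(p, q):
    """Probability that a token sampled from q is accepted against p:
    sum_x q(x) * min(1, p(x) / q(x)) = sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

target_logits = [2.0, 1.0, 0.0]     # illustrative values only
draft_logits = [2.0, 0.5, 0.5]      # same argmax, different tail

low_t = acceptance_probability(softmax(target_logits, 0.3), softmax(draft_logits, 0.3))
high_t = acceptance_probability(softmax(target_logits, 1.0), softmax(draft_logits, 1.0))
```

In this toy case the two models agree on the most likely token, so sharpening both distributions at low temperature concentrates mass on the shared token and raises the overlap. If the draft's top token differed from the target's, low temperature would instead drive acceptance toward zero.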
Batch size is the harder constraint. At batch size 1-4 the target's per-step compute utilisation is low and the parallel verification is essentially free; speedup is maximal (2-5x). At batch size 32+ the target's forward pass is approaching the compute-bound regime, so verifying several extra positions per sequence is no longer free, and the speedup shrinks toward 1.0-1.5x. At batch size 128+ on a well-utilised engine, speculation typically loses outright because the verification overhead exceeds the saving. This is why runtimes auto-gate speculation by batch size, and why offline batch workloads should generally disable it.
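A simple roofline estimate shows why verification stops being free as the batch grows. The sketch below models one target forward pass as the larger of the time to stream the weights once and the time to do roughly 2 FLOPs per parameter per token; it ignores KV-cache traffic, attention FLOPs and kernel overheads. The hardware and model figures in `HW` are round illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def decode_step_time(batch_size, tokens_per_seq, params, bytes_per_param,
                     mem_bw, peak_flops):
    """Roofline estimate (seconds) for one target forward pass."""
    memory_time = params * bytes_per_param / mem_bw
    compute_time = 2 * params * batch_size * tokens_per_seq / peak_flops
    return max(memory_time, compute_time)

def verification_overhead(batch_size, n_draft, **hw):
    """Cost of verifying n_draft + 1 positions per sequence, relative to
    a plain one-token decode step at the same batch size."""
    plain = decode_step_time(batch_size, 1, **hw)
    verify = decode_step_time(batch_size, n_draft + 1, **hw)
    return verify / plain

# Assumed figures: 70B parameters at 1 byte each (FP8), 3.35 TB/s of
# memory bandwidth, 1e15 FLOP/s of usable compute.
HW = dict(params=70e9, bytes_per_param=1, mem_bw=3.35e12, peak_flops=1.0e15)

small = verification_overhead(4, 5, **HW)     # memory-bound: verify costs the same
medium = verification_overhead(32, 5, **HW)   # verify pass crosses into compute-bound
large = verification_overhead(128, 5, **HW)   # verify costs several plain steps
```

With these assumed numbers the verify pass is free at batch size 4, costs noticeably more than a plain step at 32, and costs several plain steps at 128, matching the qualitative picture above. The exact crossover depends on the real hardware, precision and sequence lengths.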
Operationally, the auxiliary-head variants (Medusa, EAGLE-2) require per-(base, fine-tune) training. If your production fleet serves five fine-tunes of the same base, you need five EAGLE-2 heads — one per fine-tune, retrained when the fine-tune changes. This is a cost in MLOps complexity that some teams under-estimate. External-draft variants sidestep this at the cost of lower acceptance.
Quality is not a trade-off — the distribution-preserving property is mathematically exact. Operators do not need to A/B speculation vs no-speculation for output quality; the only thing worth measuring is latency, throughput and acceptance rate.
Practical implementation notes
The flags below show the canonical configuration for each of the four major variants on vLLM. TensorRT-LLM and SGLang have equivalent flags with renamed parameters; the recipes are operationally the same. Acceptance rate is the single metric to watch — if it stays above 60% the speedup is real; if it drops below 40% in steady state, the draft is misaligned and you should retrain or switch variants.
- Match the draft to the target at the same vocabulary and ideally same fine-tune distribution; a draft from a different fine-tune is the most common cause of low acceptance rates.
- Set --num-speculative-tokens / --speculative-num-steps in the 4-6 range; above 6 the per-draft-pass cost rises faster than the acceptance benefit.
- Auto-gate by batch size in production: above ~32 concurrent sequences, speculation usually loses; the runtimes do this automatically but verify with your workload.
- Cold-start strategy: ship external-draft speculation on day one (Llama 3 8B for 70B), train EAGLE-2 heads in the background, switch to EAGLE-2 once the heads are validated.
- Per-fine-tune EAGLE heads must be retrained when the fine-tune changes; budget the GPU-hours into the fine-tune pipeline.
- Quality is mathematically unchanged — do not bother with output-quality A/B tests, only with latency and acceptance-rate measurements.
- Pair with prefix caching on shared system prompts; the two optimisations stack multiplicatively on interactive chat workloads.
```bash
# Flag names follow vLLM releases that accept --speculative-model; newer
# releases fold these into a single --speculative-config JSON argument,
# so check `vllm serve --help` for the version you run.

# 1. External draft: Llama 3 8B drafts for Llama 3 70B (cold start, no custom training)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --quantization fp8 \
  --enable-prefix-caching

# 2. EAGLE-2: strongest measured speedup, needs a per-model draft head
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
  --speculative-draft-tensor-parallel-size 1 \
  --num-speculative-tokens 5 \
  --use-eagle \
  --quantization fp8

# 3. Medusa: auxiliary heads, simpler training than EAGLE
vllm serve lmsys/vicuna-7b-v1.5 \
  --speculative-model lmsys/medusa-vicuna-7b-v1.5 \
  --num-speculative-tokens 5

# 4. N-gram lookahead: zero training, works best on code / repetitive text
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --speculative-model "[ngram]" \
  --ngram-prompt-lookup-max 4 \
  --num-speculative-tokens 5

# Equivalent on SGLang for EAGLE-2:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tp 4 \
  --speculative-algorithm eagle2 \
  --speculative-draft-model-path yuhuili/EAGLE-LLaMA3.1-Instruct-70B \
  --speculative-num-steps 5

# Equivalent on TensorRT-LLM (set at engine build time):
trtllm-build \
  --checkpoint_dir ./ckpt/llama3-70b-fp8-tp4 \
  --output_dir ./engines/llama3-70b-eagle \
  --speculative_decoding_mode eagle \
  --max_draft_len 5
```
Always watch the acceptance metric in production. On vLLM:
- `vllm:spec_decode_num_accepted_tokens_total`: running count of accepted draft tokens
- `vllm:spec_decode_num_draft_tokens_total`: running count of proposed draft tokens

The ratio accepted / draft is the per-token acceptance rate. Multiplying it by the draft length gives α, the mean number of accepted tokens per target pass (expected speedup minus 1). If the acceptance rate drops below about 0.4 in steady state, the draft is misaligned.

Greedy or low-temperature workloads (temperature ≤ 0.3) see the strongest speedups (often 3-5x with EAGLE-2). High-temperature creative-writing workloads (temperature 0.9+) see only 1.2-1.5x because the target distribution is wide and rejection is frequent. Set expectations for your workload's temperature profile before promising 3x.
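The acceptance ratio from the two counters named above can be computed from a scrape of the metrics endpoint. The sketch below parses Prometheus text-format output; it assumes samples carry no trailing timestamp and that label sets should simply be summed. The function name and the sample text are illustrative.

```python
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"
DRAFTED = "vllm:spec_decode_num_draft_tokens_total"

def acceptance_rate(metrics_text):
    """Return accepted / drafted from Prometheus text-format counters,
    or None when no draft tokens have been recorded yet."""
    totals = {ACCEPTED: 0.0, DRAFTED: 0.0}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip HELP / TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]   # drop the {label="..."} part
        if name in totals:
            totals[name] += float(value)
    if totals[DRAFTED] == 0:
        return None
    return totals[ACCEPTED] / totals[DRAFTED]

sample = """
# TYPE vllm:spec_decode_num_accepted_tokens_total counter
vllm:spec_decode_num_accepted_tokens_total{model_name="m"} 640.0
vllm:spec_decode_num_draft_tokens_total{model_name="m"} 1000.0
"""
rate = acceptance_rate(sample)
```

Because both metrics are cumulative counters, a steady-state figure is better taken from the difference between two scrapes than from the lifetime totals.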
Where this fits in the Yobitel stack
Speculative decoding is enabled by default on Yobibyte interactive endpoints — chat, voice agents, real-time RAG, code completion. The platform picks a variant per workload: EAGLE-2 where a head is available, Medusa as a fallback, an external-draft pair (Llama 3 8B for 70B, Llama 3 70B for 405B) for cold starts, and n-gram lookahead automatically on code-generation endpoints. Offline batch jobs run with speculation disabled because their concurrency profile pushes the target into compute-bound territory where verification is no longer free.
The Omniscient Compute scoring layer benchmarks every supported speculative variant on InferenceBench across H100 SXM5, H200 and B200 tenancies at fixed input/output token mixes, with acceptance rates and end-to-end latency surfaced per (variant, target model, batch size). Customers see the achievable speedup for their specific workload shape rather than the published headline numbers, and the Yobibyte console recommends the variant that maximises latency reduction at their concurrency level.
References
- Fast Inference from Transformers via Speculative Decoding · arXiv (Leviathan et al., Google, 2022)
- Accelerating Large Language Model Decoding with Speculative Sampling · arXiv (Chen et al., DeepMind, 2023)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads · arXiv (Cai et al., 2024)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty · arXiv (Li et al., 2024)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
- Lookahead Decoding: Break the Sequential Dependency of LLM Inference · arXiv (Fu et al., 2024)
- vLLM Speculative Decoding Documentation · vLLM