TL;DR
- Runtime feature that reuses KV-cache blocks when two or more requests share a token-identical prefix.
- Built on top of PagedAttention: physical blocks are content-addressed by their token sequence, so duplicates can share storage.
- Common production gain: 20-50 percent throughput improvement on workloads with shared system prompts or few-shot examples.
- Supported in vLLM, SGLang (as RadixAttention), TensorRT-LLM, TGI and most modern LLM runtimes.
Overview
Most production LLM traffic carries a long shared prefix: a system prompt, a tool-use scaffold, a set of few-shot examples, or retrieved documents. Without prefix caching, every request recomputes the KV entries for that prefix during prefill, paying for work whose result is identical each time.
Prefix caching short-circuits this. When a new request arrives, the runtime compares its prompt against the cached KV blocks and finds the longest cached prefix. Prefill skips that prefix, because its KV blocks are already in memory, and computes only the remaining suffix.
How It Works
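Implementation builds directly on PagedAttention. Each physical KV block is identified by a hash of its own tokens combined with the hash of its parent block, so a block's hash encodes the entire prefix up to and including that block. A lookup table maps `(parent_hash, block_tokens)` to the physical block ID. As the runtime walks a new prompt block by block, it checks the table and reuses every block that already exists; the first miss marks where prefill has to start. Typically only full blocks are shared, so a match is rounded down to a block boundary.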
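The lookup described above can be sketched in a few lines. This is a toy illustration, not any runtime's actual code: the names `block_hashes` and `PrefixCache`, the block size of 4, and the use of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical tokens per KV block; real runtimes use larger blocks


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of the prompt."""
    hashes: List[str] = []
    parent = ""  # the first block has no parent
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest  # chaining makes each hash depend on the whole prefix
    return hashes


class PrefixCache:
    """Toy lookup table from chained block hash to physical block ID."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}
        self.next_block_id = 0

    def allocate(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (physical block IDs, prompt tokens served from the cache)."""
        block_ids: List[int] = []
        cached_tokens = 0
        hit_run = True  # True while every block so far was a cache hit
        for h in block_hashes(tokens):
            if h not in self.table:
                hit_run = False
                self.table[h] = self.next_block_id  # "allocate" a new block
                self.next_block_id += 1
            elif hit_run:
                cached_tokens += BLOCK_SIZE
            block_ids.append(self.table[h])
        return block_ids, cached_tokens


system_prompt = list(range(100, 108))  # 8 tokens, i.e. two full blocks
cache = PrefixCache()
ids_a, hit_a = cache.allocate(system_prompt + [1, 2, 3, 4, 5])
ids_b, hit_b = cache.allocate(system_prompt + [9, 9, 9, 9])
print(hit_a, hit_b)            # tokens reused by the first and second request
print(ids_a[:2] == ids_b[:2])  # both requests point at the same physical blocks
```

The second request reuses the two system-prompt blocks and only needs a new block for its own suffix. The toy never evicts, so the table only grows; a real cache also has to release blocks.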
Eviction policy matters. Least-recently-used (LRU) is the default; more sophisticated runtimes weight by hit count or by branch popularity in the radix tree. Hot prefixes, such as frequently used system prompts, should stay resident through normal use.
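A minimal sketch of LRU eviction follows, assuming that blocks pinned by running requests are never evicted and that an unpinned block becomes the newest eviction candidate when it is released. The class `LRUBlockPool` and its methods are hypothetical names for this example.

```python
from collections import OrderedDict
from typing import Dict, Optional


class LRUBlockPool:
    """Toy pool of cached KV blocks with reference counts and LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.ref_count: Dict[str, int] = {}
        # Blocks with zero references, ordered from oldest release to newest.
        self.evictable: "OrderedDict[str, None]" = OrderedDict()

    def acquire(self, block_hash: str) -> bool:
        """Pin a block for a running request. Returns True on a cache hit."""
        if block_hash in self.ref_count:
            self.ref_count[block_hash] += 1
            self.evictable.pop(block_hash, None)  # in use, so not evictable
            return True
        if len(self.ref_count) >= self.capacity and self.evict_one() is None:
            raise MemoryError("all blocks are pinned by running requests")
        self.ref_count[block_hash] = 1
        return False

    def release(self, block_hash: str) -> None:
        """Unpin a block; at zero references it can be evicted later."""
        self.ref_count[block_hash] -= 1
        if self.ref_count[block_hash] == 0:
            self.evictable[block_hash] = None

    def evict_one(self) -> Optional[str]:
        """Drop the least recently released block, if any is unpinned."""
        if not self.evictable:
            return None
        victim, _ = self.evictable.popitem(last=False)
        del self.ref_count[victim]
        return victim


pool = LRUBlockPool(capacity=2)
for name in ["sys", "a", "sys"]:  # the system-prompt block is touched twice
    pool.acquire(name)
    pool.release(name)
pool.acquire("b")  # pool is full, so the coldest unpinned block is evicted
print(sorted(pool.ref_count))
```

Because the system-prompt block was reused just before the pool filled up, the one-off block `a` is evicted instead and the hot prefix stays resident.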
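For maximum hit rate, keep system prompts token-identical across requests, which in practice means byte-identical text. Matching stops at the first token that differs, and everything from that point on is recomputed: a timestamp variable at the top of the prompt forces a full re-prefill, and stray whitespace in the middle invalidates every block after it. Put per-request content at the end of the prompt.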
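The effect of prompt layout can be checked with a small helper. This is a sketch under loud assumptions: whitespace splitting stands in for a real tokenizer, the block size of 4 is hypothetical, and `reusable_tokens` is a name invented for the example.

```python
from typing import List

BLOCK_SIZE = 4  # hypothetical tokens per KV block


def reusable_tokens(cached: List[str], new: List[str],
                    block_size: int = BLOCK_SIZE) -> int:
    """Tokens of `new` that could be served from blocks cached for `cached`."""
    common = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        common += 1
    return common - common % block_size  # only full blocks are shared


# Whitespace splitting is a stand-in for a real tokenizer (assumption).
system = "You are a helpful assistant . Answer briefly and cite sources .".split()

# Layout 1: stable system prompt first, per-request content at the end.
first = system + "Question : what is paged attention ?".split()
stable = system + "Time : 10:42 Question : what is a radix tree ?".split()

# Layout 2: a timestamp at the very top of the prompt.
stamped_a = "Time : 10:41".split() + system + "Question : what is paged attention ?".split()
stamped_b = "Time : 10:42".split() + system + "Question : what is a radix tree ?".split()

print(reusable_tokens(first, stable))         # whole system prompt is reusable
print(reusable_tokens(stamped_a, stamped_b))  # the match breaks at the timestamp
```

With the timestamp first, the two prompts diverge at the third token, before the first block is complete, so nothing can be reused even though the system prompt text is identical.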
RadixAttention
SGLang generalises prefix caching with RadixAttention: rather than a flat hash table, cached prefixes are organised in a radix tree whose edges are labelled with token sequences. Requests that share a prefix share the path from the root and branch where they diverge, so the runtime finds the longest cached prefix of a new request with a single walk down the tree, even when many cached sequences overlap with it to different depths. The technique is particularly effective for agent workloads with deep tool-use trees.
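To make the data structure concrete, here is a minimal radix tree over token IDs with insert and longest-prefix match. It is an illustrative sketch, not SGLang's implementation: `RadixNode` and `RadixCache` are hypothetical names, and the tree stores only token sequences, with no KV tensors, reference counts, or eviction.

```python
from typing import Dict, List, Tuple


class RadixNode:
    def __init__(self) -> None:
        # first token of an edge -> (edge token sequence, child node)
        self.children: Dict[int, Tuple[List[int], "RadixNode"]] = {}


def _common_len(a: List[int], b: List[int]) -> int:
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        while matched < len(tokens):
            entry = node.children.get(tokens[matched])
            if entry is None:
                break
            edge, child = entry
            n = _common_len(edge, tokens[matched:])
            matched += n
            if n < len(edge):
                break  # diverged part-way along an edge
            node = child
        return matched

    def insert(self, tokens: List[int]) -> None:
        node, i = self.root, 0
        while i < len(tokens):
            entry = node.children.get(tokens[i])
            if entry is None:
                node.children[tokens[i]] = (tokens[i:], RadixNode())
                return
            edge, child = entry
            n = _common_len(edge, tokens[i:])
            if n < len(edge):
                # Split the edge where the new sequence diverges from it.
                mid = RadixNode()
                mid.children[edge[n]] = (edge[n:], child)
                node.children[tokens[i]] = (edge[:n], mid)
                child = mid
            node = child
            i += n


tree = RadixCache()
tree.insert([1, 2, 3, 4, 5])   # e.g. system prompt + tool call A
tree.insert([1, 2, 3, 9, 9])   # same system prompt + tool call B
print(tree.match_prefix([1, 2, 3, 4, 7]))
print(tree.match_prefix([1, 2, 3, 9, 9, 8]))
```

The second insert splits the original edge after the shared tokens `[1, 2, 3]`, so both branches hang off one common node. A new request then matches as far as possible along whichever branch it follows.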
Measured Impact
- Chat apps with a 1000-token system prompt: 30-40 percent throughput gain.
- Agent platforms with shared tool scaffolds: 60-90 percent prefill compute saved.
- Batch evaluation with shared few-shot examples: near-complete prefill amortisation.
- RAG endpoints where retrieved chunks change per request: small gain; only the shared system prompt that precedes the retrieved chunks can be reused.
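A back-of-the-envelope way to see why these cases differ is to compare the shared prefix length with the total prompt length. The helper below is a rough sketch with hypothetical token counts: it counts tokens rather than FLOPs, and it says nothing about decode time, which is why end-to-end throughput gains are smaller than the prefill fraction saved.

```python
def prefill_fraction_saved(shared_prefix_tokens: int, unique_suffix_tokens: int) -> float:
    """Fraction of prompt tokens that skip prefill on a full prefix hit."""
    total = shared_prefix_tokens + unique_suffix_tokens
    return shared_prefix_tokens / total if total else 0.0


# Hypothetical chat request: 1000-token system prompt, 250-token user turn.
chat = prefill_fraction_saved(1000, 250)

# Hypothetical RAG request: 200-token system prompt, then 3000 tokens of
# retrieved chunks and a 50-token question that differ on every request.
rag = prefill_fraction_saved(200, 3050)

print(f"chat: {chat:.0%} of prompt tokens skipped")
print(f"rag:  {rag:.0%} of prompt tokens skipped")
```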
Operational Considerations
Prefix caching adds memory pressure: cached prefixes that no request is currently using still hold physical blocks until they are evicted. Most runtimes expose tunables for the maximum cached-prefix budget and the eviction policy.
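To size the cached-prefix budget, it helps to estimate how much memory one cached token costs. The sketch below uses the standard KV-cache size formula for a transformer with multi-head or grouped-query attention; the model dimensions are hypothetical and chosen only for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache held by one token (dtype_bytes=2 for fp16/bf16)."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128, fp16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
prefix_mib = 1000 * per_token / 2**20  # a 1000-token cached prefix

print(f"{per_token / 1024:.0f} KiB per token, {prefix_mib:.0f} MiB per 1000-token prefix")
```

Under these assumptions a single idle 1000-token prefix pins about 125 MiB of KV memory, so a few hundred distinct prefixes can consume a large share of an accelerator's cache budget.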
Prefix-aware routing in multi-replica deployments amplifies the win: route requests sharing a prefix to the same replica so the cache hits land where the prefix already lives. vLLM's Production Stack and KServe both support prefix-aware load balancing.
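The routing idea can be illustrated with a hash of the leading prompt tokens. This is a deliberately simple sketch, not how any particular router works: `pick_replica`, `routing_prefix_len`, and the replica names are hypothetical, and a production router would also weigh replica load and use consistent hashing so that adding a replica does not reshuffle every prefix.

```python
import hashlib
from typing import List, Sequence


def pick_replica(tokens: Sequence[int], replicas: List[str],
                 routing_prefix_len: int = 256) -> str:
    """Choose a replica from a hash of the first `routing_prefix_len` tokens."""
    key = ",".join(map(str, tokens[:routing_prefix_len])).encode()
    digest = hashlib.sha256(key).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


replicas = ["replica-0", "replica-1", "replica-2"]  # hypothetical names
system_prompt = list(range(300))                    # shared 300-token prefix
request_a = system_prompt + [1, 2]
request_b = system_prompt + [7, 8, 9]

# Both requests share their first 256 tokens, so they land on the same replica.
print(pick_replica(request_a, replicas) == pick_replica(request_b, replicas))
```

The routing prefix has to be no longer than the shared prefix; otherwise per-request tokens leak into the hash key and requests with the same system prompt scatter across replicas.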
When to Enable
Always, unless you have a specific reason not to. The downside is a small bookkeeping overhead and increased memory pressure; the upside is a free throughput win on any workload with prompt overlap. Most modern runtimes default it to on.
References
- vLLM Automatic Prefix Caching Documentation · vLLM
- SGLang RadixAttention · arXiv (Zheng et al., 2023)
- Efficient Memory Management for Large Language Model Serving with PagedAttention · arXiv (Kwon et al., 2023)