Prefix Caching

TL;DR

Runtime feature that reuses KV-cache blocks when two or more requests share a token-identical prefix.
Built on top of PagedAttention: physical blocks are content-addressed by their token sequence, so duplicates can share storage.
Common production gain: 20-50 percent throughput improvement on workloads with shared system prompts or few-shot examples.
Supported in vLLM, SGLang (as RadixAttention), TensorRT-LLM, TGI and most modern LLM runtimes.

Overview

Most production LLM traffic carries a long shared prefix: a system prompt, a tool-use scaffold, a set of few-shot examples, or retrieved documents. Without prefix caching, every request prefills that prefix again, spending compute on KV entries that are identical from one request to the next.

Prefix caching short-circuits this. When a new request arrives, the runtime compares its prompt tokens against the cached KV blocks and identifies the longest cached prefix. Prefill skips the matching prefix, because its KV blocks are already in memory, and computes only the suffix.

How It Works

The implementation builds directly on PagedAttention. Each physical KV block is hashed by its token sequence together with its parent block's hash, so a block's hash identifies the entire prefix up to and including that block. A lookup table maps `(parent_hash, block_tokens)` to the physical block ID. As the runtime walks a request's prompt one block at a time, it checks the table at each block boundary and reuses any block that already exists.
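The lookup described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any runtime's actual code: `PrefixCache`, `block_hash`, `match_and_insert` and the 4-token `BLOCK_SIZE` are hypothetical names and values chosen for illustration, and eviction is left out entirely.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

BLOCK_SIZE = 4  # assumed tokens per KV block; real runtimes use larger blocks


def block_hash(parent_hash: Optional[str], tokens: Tuple[int, ...]) -> str:
    """Hash a full block from its parent's hash plus its own token IDs."""
    payload = f"{parent_hash}|{','.join(map(str, tokens))}".encode()
    return hashlib.sha256(payload).hexdigest()


class PrefixCache:
    """Toy lookup table: chained block hash -> physical block ID."""

    def __init__(self) -> None:
        self.table: Dict[str, int] = {}
        self.next_block_id = 0

    def match_and_insert(self, prompt: List[int]) -> Tuple[int, List[int]]:
        """Return (prompt tokens served from cache, block IDs backing the prompt)."""
        parent: Optional[str] = None
        cached_tokens = 0
        block_ids: List[int] = []
        # Only full blocks are hashed; a trailing partial block is ignored here.
        for start in range(0, len(prompt) - BLOCK_SIZE + 1, BLOCK_SIZE):
            tokens = tuple(prompt[start:start + BLOCK_SIZE])
            h = block_hash(parent, tokens)
            if h in self.table:
                # Hit: prefill can skip this block. Because each hash chains in
                # its parent, and nothing is evicted, hits always form a prefix.
                cached_tokens += BLOCK_SIZE
            else:
                # Miss: allocate a new physical block for this content.
                self.table[h] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.table[h])
            parent = h
        return cached_tokens, block_ids


system_prompt = list(range(100, 108))  # 8 tokens = 2 full blocks
cache = PrefixCache()
first = cache.match_and_insert(system_prompt + [1, 2, 3, 4, 5])
second = cache.match_and_insert(system_prompt + [9, 8, 7, 6, 5])
print(first)   # (0, [0, 1, 2])
print(second)  # (8, [0, 1, 3])
```

The second request reuses physical blocks 0 and 1 for the shared system prompt and allocates a new block only for its own suffix.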

Eviction policy matters. Least-recently-used (LRU) is the usual default; more sophisticated runtimes weight by hit count or by branch popularity in the radix tree. Hot prefixes, such as frequently used system prompts, are touched often enough that they should stay resident under normal load.
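A minimal sketch of LRU eviction over cached blocks shows why a hot prefix tends to survive. The `LRUBlockPool` class and its `touch` method are hypothetical, and the sketch ignores details a real runtime needs, such as reference counts for blocks that in-flight requests are still using.

```python
from collections import OrderedDict


class LRUBlockPool:
    """Toy LRU over cached block hashes (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks: "OrderedDict[str, int]" = OrderedDict()  # hash -> block ID
        self.next_id = 0

    def touch(self, block_hash: str) -> bool:
        """Record a use of block_hash; return True on a cache hit."""
        if block_hash in self.blocks:
            self.blocks.move_to_end(block_hash)  # mark as most recently used
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[block_hash] = self.next_id
        self.next_id += 1
        return False


pool = LRUBlockPool(capacity=2)
# "sys" stands for a hot system-prompt block; "a", "b", "c" are one-off blocks.
hits = [pool.touch(h) for h in ["sys", "a", "sys", "b", "sys", "c", "sys"]]
print(hits)  # [False, False, True, False, True, False, True]
```

Even with room for only two blocks, the block that is touched between every one-off request is never the least recently used, so it is never evicted.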

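For maximum hit rate, keep system prompts byte-identical across requests and place them ahead of any per-request content. A stray trailing space or a timestamp variable ends the prefix match at that point, and everything after it is prefilled again; if the difference sits at the start of the prompt, the whole prompt is recomputed.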
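The effect of prompt layout can be illustrated with a small sketch. It assumes a whitespace-splitting `toy_tokenize` as a stand-in for a real tokenizer (so it cannot show the trailing-whitespace case), and the prompts and helper names are made up for the example.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def toy_tokenize(text: str) -> List[str]:
    """Stand-in for a real tokenizer: splits on whitespace."""
    return text.split()


SYSTEM = "You are a helpful assistant. Answer briefly."

# Variable content first: the timestamp differs, so the match ends almost at once.
bad_1 = toy_tokenize(f"Time: 10:01 {SYSTEM} Question: what is a KV cache?")
bad_2 = toy_tokenize(f"Time: 10:02 {SYSTEM} Question: what is paging?")

# Stable content first: the whole system prompt is a shared prefix.
good_1 = toy_tokenize(f"{SYSTEM} Time: 10:01 Question: what is a KV cache?")
good_2 = toy_tokenize(f"{SYSTEM} Time: 10:02 Question: what is paging?")

print(shared_prefix_len(bad_1, bad_2))    # 1
print(shared_prefix_len(good_1, good_2))  # 8
```

Moving the timestamp behind the system prompt changes nothing about the information in the prompt, but it turns the system prompt into a reusable prefix.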

RadixAttention

SGLang generalises prefix caching with RadixAttention: rather than a flat hash table, cached prefixes are organised in a radix tree. Requests that share a prefix share a path from the root and branch where their tokens diverge, so the runtime can find the longest matching prefix for a new request even when it partly overlaps several cached sequences. The technique is particularly effective for agent workloads with deep tool-use trees.
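The tree lookup can be sketched with a plain token trie. This is a simplification and not SGLang's implementation: a radix tree would compress chains of single-child nodes into one edge and attach KV storage and eviction metadata to the nodes. `PrefixTrie` and its methods are hypothetical names.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class PrefixTrie:
    """Uncompressed token trie used to find the longest cached prefix."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int]) -> None:
        """Add a token sequence, sharing nodes with any existing common prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def longest_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already in the tree."""
        node = self.root
        matched = 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                break
            node = child
            matched += 1
        return matched


tree = PrefixTrie()
tree.insert([1, 2, 3, 4, 5])  # shared scaffold [1, 2, 3] followed by branch A
tree.insert([1, 2, 3, 9, 9])  # same scaffold followed by branch B

print(tree.longest_prefix([1, 2, 3, 9, 7]))     # 4
print(tree.longest_prefix([1, 2, 3, 4, 5, 6]))  # 5
print(tree.longest_prefix([7, 1, 2, 3]))        # 0
```

Both cached sequences share the nodes for `[1, 2, 3]`, and a new request follows whichever branch matches it furthest.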

Measured Impact

Chat apps with a 1000-token system prompt: 30-40 percent throughput gain.
Agent platforms with shared tool scaffolds: 60-90 percent prefill compute saved.
Batch evaluation with shared few-shot examples: near-complete prefill amortisation.
RAG endpoints where retrieved chunks change per request: small gain; prefix caching helps only with the shared system prompt that precedes the chunks.
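A back-of-the-envelope calculation helps explain why these workloads differ so much. The sketch below estimates the fraction of prompt tokens whose prefill is skipped on a cache hit; it is a rough proxy, not a throughput prediction, since prefill cost is not strictly linear in token count and decode time is unaffected. The function name, the 16-token block size, and the token counts other than the 1000-token system prompt are illustrative assumptions.

```python
def prefill_saved_fraction(shared_tokens: int, unique_tokens: int,
                           block_size: int = 16) -> float:
    """Fraction of prompt tokens whose prefill is skipped on a cache hit."""
    # Only whole blocks of the shared prefix can be reused.
    reusable = (shared_tokens // block_size) * block_size
    return reusable / (shared_tokens + unique_tokens)


# Chat: 1000-token system prompt plus an assumed 200-token user turn.
chat = prefill_saved_fraction(shared_tokens=1000, unique_tokens=200)

# RAG: assumed 200-token system prompt plus 3000 tokens of per-request chunks.
rag = prefill_saved_fraction(shared_tokens=200, unique_tokens=3000)

print(f"chat: {chat:.1%}")  # chat: 82.7%
print(f"rag:  {rag:.1%}")   # rag:  6.0%
```

The shared share of the prompt, not the absolute prompt length, determines how much prefill work the cache can remove.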

Operational Considerations

Prefix caching adds memory pressure: cached prefixes that no request is currently using still hold physical blocks until they are evicted. Most runtimes expose tunables for the maximum cached-prefix budget and the eviction policy.

Prefix-aware routing in multi-replica deployments amplifies the win: route requests sharing a prefix to the same replica so the cache hits land where the prefix already lives. vLLM's Production Stack and KServe both support prefix-aware load balancing.
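One simple way to approximate prefix-aware routing is to hash the leading tokens of each prompt and use the hash to pick a replica. The sketch below is an illustration of the idea only, not how vLLM's Production Stack or KServe implement it; `pick_replica`, the replica names, and the 64-token `prefix_len` are hypothetical.

```python
import hashlib
from typing import List, Sequence


def pick_replica(prompt_tokens: Sequence[int], replicas: List[str],
                 prefix_len: int = 64) -> str:
    """Map the first prefix_len tokens to a replica so equal prefixes land together."""
    head = ",".join(str(t) for t in prompt_tokens[:prefix_len]).encode()
    digest = hashlib.sha256(head).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-0", "replica-1", "replica-2"]
system_prompt = list(range(64))  # assumed 64-token shared system prompt

a = pick_replica(system_prompt + [1, 2, 3], replicas)
b = pick_replica(system_prompt + [7, 8], replicas)
print(a == b)  # True: both requests share the first 64 tokens
```

Plain modulo hashing reshuffles most prefixes whenever the replica count changes and ignores load, so a production router would more likely combine consistent hashing or a prefix index with load-aware fallbacks.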

When to Enable

Always, unless you have a specific reason not to. The downside is a small bookkeeping overhead and increased memory pressure; the upside is a throughput win at almost no cost on any workload with prompt overlap. Most modern runtimes enable it by default.

References

vLLM Automatic Prefix Caching Documentation · vLLM
SGLang RadixAttention · arXiv (Zheng et al., 2023)
Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)
