TL;DR
- Per-sequence cache of the key and value projections produced by each attention layer at every previously generated token.
- Lets each decoding step attend over the full context while computing Q, K, V only for the new token, cutting attention compute per generated token from O(n^2) to O(n).
- Dominates memory consumption during inference; for a 70B model at long context the KV cache can exceed the weights in size.
- Targeted by every major LLM optimisation: paged attention for memory, prefix caching for reuse, INT8 / FP8 / INT4 quantisation for footprint, GQA / MLA for structural compression.
Why a KV Cache Exists
Decoder-only transformers generate autoregressively: each new token attends over every previous token in the sequence. Without a cache, every step would re-run the model over the whole prefix and recompute the keys and values of all prior tokens; the attention in that pass alone costs O(n^2) compute per output token, where n is the sequence length. For long-context generation this is prohibitive.
The KV cache exploits a structural property of causal decoder attention: at every layer, a token's key and value vectors depend only on that token and the tokens before it, never on later tokens. So once a token has been processed, its K and V projections are final and can be stored. From then on, each decoding step computes Q, K, V only for the new token, appends the new K and V to the cache, and attends over the full cache.
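The decoding loop described above can be sketched with a toy single-head attention layer. This is a minimal illustration, not how any particular inference engine implements it: the class `KVCache`, the helpers `decode_step` and `recompute_step`, and the random weights are all hypothetical names and values chosen for the example. It checks that attending over cached K and V gives the same output as re-projecting every prior token at each step.

```python
import math
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def attend(q, keys, values):
    """Softmax attention of one query over all given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    """Toy per-sequence store of K and V for a single attention head."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x, wq, wk, wv, cache):
    """Project only the new token, append its K/V, attend over the cache."""
    q, k, v = matvec(wq, x), matvec(wk, x), matvec(wv, x)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def recompute_step(xs, wq, wk, wv):
    """Baseline without a cache: re-project K/V for every token so far."""
    keys = [matvec(wk, x) for x in xs]
    values = [matvec(wv, x) for x in xs]
    return attend(matvec(wq, xs[-1]), keys, values)

def rand_matrix(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 4
wq, wk, wv = rand_matrix(d), rand_matrix(d), rand_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
cached_outputs = [decode_step(x, wq, wk, wv, cache) for x in tokens]
naive_outputs = [recompute_step(tokens[: i + 1], wq, wk, wv)
                 for i in range(len(tokens))]
```

The cached path performs one K and one V projection per step, while the baseline performs one per token seen so far, which is where the saving comes from.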
Size and Memory Pressure
Per-token KV-cache size is `2 × n_layers × n_kv_heads × head_dim × dtype_bytes`, where the factor 2 counts keys and values. For Llama 3 70B at FP16 with grouped-query attention (80 layers, 8 KV heads, head dimension 128), that works out to roughly 320 KB per token. A single 32k-context sequence then carries ~10 GB of KV cache alongside the ~140 GB of weights.
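The formula above is easy to evaluate directly. The sketch below plugs in shape values for Llama 3 70B (80 layers, 8 KV heads, head dimension 128); these come from the published model configuration and are an assumption of this example. The function name `kv_bytes_per_token` is hypothetical.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies; the factor 2 is K plus V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed Llama 3 70B shape, stored at FP16 (2 bytes per element).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8,
                               head_dim=128, dtype_bytes=2)
per_sequence = per_token * 32 * 1024  # one 32k-token sequence

print(per_token / 1024)        # KiB per token
print(per_sequence / 1024**3)  # GiB per 32k-token sequence
```

With these inputs the result is 320 KiB per token and 10 GiB for a 32k-token sequence.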
With many concurrent sequences the cache becomes the dominant memory consumer, and the practical batch size on a given GPU is bounded by KV-cache budget rather than compute. This is the core reason every modern LLM optimisation eventually targets the KV cache.
| Optimisation | What it changes | Headline effect |
|---|---|---|
| PagedAttention | Block-based allocation | Memory utilisation rises from ~40 % to ~95 % |
| Prefix caching | Reuse across sequences | Free shared prefixes |
| GQA / MLA | Fewer KV heads (GQA) or a low-rank KV latent (MLA) | 2-8x smaller cache |
| INT8 / FP8 KV | Lower precision | 0.5x cache size vs FP16 |
| INT4 KV | Lowest precision | 0.25x cache size vs FP16, small quality cost |
| StreamingLLM | Sink + sliding window | Bounded cache, lossy |
Quantisation
KV-cache quantisation lowers the dtype of stored K and V tensors. FP8 KV cache is now standard on Hopper and Blackwell; INT8 KV is widely supported. INT4 KV is more aggressive, typically at a small quality cost on long contexts, and is a useful knob when memory is tightest, for example when fitting a weight-quantised 70B-class model on a single 80 GB GPU (at FP16 the weights alone would not fit).
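To make the idea of lowering the stored dtype concrete, here is a minimal sketch of symmetric absmax INT8 quantisation applied to one key vector. The scheme (one scale per vector) and the function names `quantize_int8` and `dequantize_int8` are assumptions chosen for illustration; production engines may use per-channel or per-group scales, or FP8 formats, and nothing here reflects a specific library's implementation.

```python
def quantize_int8(vec):
    """Symmetric absmax quantisation: map floats to integers in [-127, 127]."""
    absmax = max(abs(x) for x in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats before the attention computation."""
    return [qi * scale for qi in q]

k = [0.12, -1.5, 0.75, 0.003]      # example key vector (made-up values)
q, scale = quantize_int8(k)        # stored: 1 byte per element plus one scale
k_hat = dequantize_int8(q, scale)  # used: reconstructed at attention time
max_err = max(abs(a - b) for a, b in zip(k, k_hat))
```

The rounding error per element is bounded by half the scale, so vectors with a single large outlier lose the most precision on their small entries. That is one reason finer-grained scales are often preferred in practice.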
Reuse and Sharing
Once the KV cache is stored in fixed-size blocks (as in PagedAttention) and those blocks are addressed by the token sequence they cover, identical prefixes can share physical blocks. System prompts repeated across thousands of requests cost a one-time prefill and are then reused at no extra compute. SGLang's RadixAttention generalises this further by maintaining a radix tree of cached prefixes.
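The block-sharing idea can be sketched as a table keyed by a hash of each block's tokens chained with the hash of everything before it. Chaining matters because a token's K and V depend on its whole prefix, so two blocks with the same tokens but different histories must not be shared. The class `PrefixBlockCache`, the block size of 4, and the token ids are hypothetical; this is not the data structure of vLLM or SGLang.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)

class PrefixBlockCache:
    """Maps a chained prefix hash to a physical KV block id (toy model)."""
    def __init__(self):
        self.block_ids = {}      # prefix hash -> physical block id
        self.ref_counts = {}     # physical block id -> sequences using it
        self.prefill_blocks = 0  # blocks whose KV actually had to be computed

    @staticmethod
    def _hash(parent_hash, block_tokens):
        payload = parent_hash + "|" + ",".join(map(str, block_tokens))
        return hashlib.sha256(payload.encode()).hexdigest()

    def allocate(self, token_ids):
        """Return block ids for the full blocks of token_ids, reusing matches."""
        blocks, parent = [], ""
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            parent = self._hash(parent, token_ids[start:start + BLOCK_SIZE])
            if parent not in self.block_ids:
                # Cache miss: a new block is needed and must be prefilled.
                self.block_ids[parent] = len(self.block_ids)
                self.prefill_blocks += 1
            block_id = self.block_ids[parent]
            self.ref_counts[block_id] = self.ref_counts.get(block_id, 0) + 1
            blocks.append(block_id)
        return blocks

cache = PrefixBlockCache()
system_prompt = list(range(100, 108))  # 8 tokens = 2 full blocks
blocks_a = cache.allocate(system_prompt + [1, 2, 3, 4])
blocks_b = cache.allocate(system_prompt + [5, 6, 7, 8])
```

Here the two requests share the two system-prompt blocks and differ only in their final block, so four blocks are prefilled instead of six. Eviction and partial trailing blocks are left out of the sketch.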
Structural Compression
Architectural changes shrink the cache by design. Grouped-Query Attention (Llama 3, Mistral) shares each KV head across a group of query heads, so fewer KV heads are stored. Multi-head Latent Attention (DeepSeek-V2 and V3) projects K and V through a low-rank latent and caches that latent instead, further reducing the cache. Sliding window attention (Mistral 7B) caps the cache at a fixed length, at the cost of losing direct attention to tokens outside the window.
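The effect of two of these changes on cache size can be worked through numerically. The sketch below compares full multi-head attention, GQA, and GQA with a sliding window for one illustrative shape (32 layers, 32 query heads, 8 KV heads, head dimension 128, 4096-token window). These numbers and all function names are assumptions for the example; MLA is omitted because its saving depends on the latent dimension.

```python
def kv_cache_elements(n_layers, n_kv_heads, head_dim, cached_tokens):
    """Number of stored K and V elements for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens

def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """GQA mapping: consecutive query heads share one KV head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def cached_tokens_sliding(seq_len, window):
    """A sliding window never keeps more than `window` tokens."""
    return min(seq_len, window)

n_layers, n_q_heads, n_kv_heads, head_dim = 32, 32, 8, 128
seq_len, window = 16384, 4096

mha = kv_cache_elements(n_layers, n_q_heads, head_dim, seq_len)
gqa = kv_cache_elements(n_layers, n_kv_heads, head_dim, seq_len)
gqa_sliding = kv_cache_elements(
    n_layers, n_kv_heads, head_dim, cached_tokens_sliding(seq_len, window))

print(mha // gqa)          # reduction from sharing KV heads
print(gqa // gqa_sliding)  # further reduction from the window at this length
```

For this shape each step cuts the cache by a factor of 4. The GQA factor is fixed by the head ratio, whereas the sliding-window factor grows with sequence length.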
Practical Implications
- Estimating maximum batch size: subtract the weights footprint from GPU memory, divide by per-token KV size, divide by max sequence length.
- Long context expands KV linearly; doubling context doubles cache, and at some point cache exceeds weights.
- Speculative decoding adds a small extra cache for the draft model — usually negligible.
- Cache management overhead matters: paged allocation, eviction policies and prefix matching all consume CPU and should be benchmarked alongside throughput.
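The batch-size estimate from the first bullet can be written as a short function. This is a rough sketch under stated assumptions: the `reserve_bytes` parameter (headroom for activations and workspace) is an addition not in the recipe above, and the example numbers (an 80 GB GPU, 16 GB of weights, 128 KiB of KV per token, 8192-token sequences) are hypothetical.

```python
def max_batch_size(gpu_bytes, weight_bytes, kv_bytes_per_token,
                   max_seq_len, reserve_bytes=0):
    """Upper bound on concurrent sequences that fit in the KV-cache budget."""
    budget = gpu_bytes - weight_bytes - reserve_bytes
    if budget <= 0:
        return 0  # the weights alone do not fit
    return budget // (kv_bytes_per_token * max_seq_len)

GB = 10**9
kv_per_token = 2 * 32 * 8 * 128 * 2  # assumed shape at FP16: 128 KiB per token

no_reserve = max_batch_size(80 * GB, 16 * GB, kv_per_token, 8192)
with_reserve = max_batch_size(80 * GB, 16 * GB, kv_per_token, 8192,
                              reserve_bytes=8 * GB)
print(no_reserve, with_reserve)
```

This assumes every sequence reaches the maximum length, so it is a worst-case bound. With paged allocation, sequences only occupy the blocks they actually use, and the achievable batch size is usually higher.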
References
- Attention Is All You Need · arXiv (Vaswani et al., 2017)
- Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · arXiv (Ainslie et al., 2023)