KV Cache

TL;DR

Per-sequence cache of the key and value projections produced by each attention layer for every token processed so far, prompt and generated.
Lets each decoding step attend over the full context while computing Q, K and V only for the new token, cutting per-token compute from O(n^2) to O(n).
Dominates memory consumption during inference; for a 70B model serving long contexts at high concurrency, the total KV cache can exceed the weights in size.
Targeted by most major LLM inference optimisations: paged attention for memory, prefix caching for reuse, INT8 / FP8 / INT4 quantisation for footprint, GQA / MLA for structural compression.

Why a KV Cache Exists

Decoder-only transformers generate autoregressively: each new token attends over every previous token in the sequence. Naively re-running the model over the whole prefix at every step, recomputing the keys and values of every prior token, costs O(n^2) attention compute per output token, where n is the current sequence length. For long-context generation this is prohibitive.

The KV cache exploits a structural property of causal decoder attention: a token's key and value vectors at each layer depend only on that token and the tokens before it, never on later tokens, so they do not change as generation proceeds. The first time a token is processed, its K and V projections are computed and stored. From then on, each decoding step computes Q, K and V only for the new token, appends the new K and V to the cache, and attends over the full cache.
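
The decoding loop described above can be sketched for a single attention head in a single layer. This is a minimal illustration, not any library's implementation: `KVCache`, `decode_step`, `recompute_step` and the helper functions are hypothetical names, and the inputs stand in for the per-token hidden states that a real model would feed into the layer. It shows that attending over cached K and V gives the same output as recomputing them for the whole prefix.

```python
import math
import random


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of small random weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all stored keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    """Per-sequence store of the K and V vectors of one layer and head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x, wq, wk, wv, cache):
    """Cached step: project only the new token, then attend over the cache."""
    q = matvec(wq, x)
    cache.append(matvec(wk, x), matvec(wv, x))
    return attend(q, cache.keys, cache.values)


def recompute_step(xs, wq, wk, wv):
    """Baseline without a cache: re-project K and V for the whole prefix."""
    keys = [matvec(wk, x) for x in xs]
    values = [matvec(wv, x) for x in xs]
    return attend(matvec(wq, xs[-1]), keys, values)
```

The cached step does a constant amount of projection work per token and one pass over the cache, while the baseline repeats the projections for every earlier token at every step.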

Size and Memory Pressure

Per-token KV-cache size is `2 × n_layers × n_kv_heads × head_dim × dtype_bytes`, where the factor 2 covers keys and values. For Llama 3 70B at FP16 with grouped-query attention (80 layers, 8 KV heads, head dimension 128), that works out to 2 × 80 × 8 × 128 × 2 = 327,680 bytes, or 320 KiB per token. A single 32k-context sequence then carries ~10 GiB of KV cache alongside the ~140 GB of weights.
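
The formula is easy to check numerically. The sketch below uses hypothetical function names and plugs in the publicly documented Llama 3 70B configuration (80 layers, 8 KV heads, head dimension 128) at 2 bytes per element; the 32,768-token context is simply the example length used here.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    """
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def kv_bytes_per_sequence(n_tokens, n_layers, n_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache for one sequence of n_tokens tokens."""
    return n_tokens * kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes)


# Llama 3 70B at FP16 (2 bytes per element).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2)
per_sequence = kv_bytes_per_sequence(32_768, 80, 8, 128, 2)

print(per_token / 1024, "KiB per token")
print(per_sequence / 1024**3, "GiB per 32k-token sequence")
```

Passing `dtype_bytes=1` or `dtype_bytes=0.5` gives the FP8 / INT8 and INT4 footprints for the same configuration.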

With many concurrent sequences the cache becomes the dominant memory consumer, and the practical batch size on a given GPU is bounded by KV-cache budget rather than compute. This is the core reason so many LLM serving optimisations target the KV cache.

Optimisation	What it changes	Headline effect
PagedAttention	Block-based allocation	Memory utilisation ~40 % to ~95 %
Prefix caching	Reuse across sequences	Free shared prefixes
GQA / MLA	Fewer KV heads (GQA) or a low-rank latent (MLA)	2-8x smaller cache
INT8 / FP8 KV	Lower precision	0.5x cache size vs FP16
INT4 KV	Lowest precision	0.25x cache size vs FP16, small quality cost
StreamingLLM	Sink + sliding window	Bounded cache, lossy

Quantisation

KV-cache quantisation lowers the dtype of the stored K and V tensors. FP8 KV cache is now commonly used on Hopper and Blackwell GPUs, and INT8 KV is widely supported. INT4 KV is more aggressive, typically at a small quality cost on long contexts, and is a useful knob when a 70B-class model, with its weights also quantised, has to fit on a single 80 GB GPU; at FP16 the ~140 GB of weights alone exceed that budget.
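
As an illustration of what lowering the stored dtype means, here is a minimal sketch of symmetric absmax INT8 quantisation applied to one K or V vector. The per-vector scale is an assumption made for simplicity; real serving stacks choose their own scale granularity and storage layout, and FP8 uses a floating-point format rather than integer codes. The function names are hypothetical.

```python
def quantize_int8(vec):
    """Map a float vector to INT8 codes in [-127, 127] plus one float scale."""
    absmax = max(abs(x) for x in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate float values before they are used in attention."""
    return [c * scale for c in codes]


key = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code bounds the per-element error by half a step.
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes, scale, max_error)
```

Each stored element shrinks from 2 bytes to 1, at the price of one scale per vector and a rounding error of at most half a quantisation step per element.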

Structural Compression

Architectural changes shrink the cache by design. Grouped-Query Attention (Llama 3, Mistral) shares each KV head across a group of query heads, so each layer caches n_kv_heads rather than n_heads sets of keys and values. Multi-head Latent Attention (DeepSeek-V2 and V3) compresses K and V into a low-rank latent and caches that latent instead, reducing the cache further. Sliding window attention (Mistral 7B) caps the cache at a fixed length at the cost of forgetting long-range tokens.
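
Two of these ideas are simple enough to sketch. The code below is illustrative only and all names are hypothetical: the first part maps query heads to shared KV heads under the assumption that heads are grouped contiguously, and the second models a sliding-window cache as a fixed-length buffer that drops the oldest entry once full.

```python
from collections import deque


def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from under GQA."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def gqa_cache_reduction(n_q_heads, n_kv_heads):
    """How many times smaller the cache is than with one KV head per query head."""
    return n_q_heads / n_kv_heads


class SlidingWindowCache:
    """Keeps K and V for only the most recent `window` tokens."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # A full deque with maxlen set discards its oldest item on append.
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)


# Llama 3 70B uses 64 query heads and 8 KV heads.
print(gqa_cache_reduction(64, 8))
print([kv_head_for_query_head(h, 64, 8) for h in (0, 7, 8, 63)])
```

With 64 query heads and 8 KV heads, every group of 8 query heads shares one cached key/value pair, which is where the cache reduction comes from.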

Practical Implications

Estimating maximum batch size: subtract the weights footprint (plus headroom for activations and runtime overhead) from GPU memory, divide by per-token KV size, divide by max sequence length.
Long context expands KV linearly; doubling context doubles the cache, and at some point the cache exceeds the weights.
Speculative decoding adds a small extra cache for the draft model, usually negligible next to the target model's cache.
Cache management overhead matters: paged allocation, eviction policies and prefix matching all consume CPU and should be benchmarked alongside throughput.
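
The batch-size estimate in the first point above can be written as a small helper. This is a back-of-the-envelope sketch with hypothetical names: it assumes every sequence reserves cache for the full maximum length, treats the memory of several GPUs as one pool, and exposes `reserved_bytes` as a stand-in for activations and runtime overhead. The example figures are illustrative: eight 80 GB GPUs, ~140 GB of FP16 weights, and 327,680 bytes of KV cache per token (the formula's result for 80 layers, 8 KV heads, head dimension 128 at FP16).

```python
def max_concurrent_sequences(gpu_bytes, weight_bytes, kv_bytes_per_token,
                             max_seq_len, reserved_bytes=0):
    """Upper bound on sequences whose full-length KV cache fits in memory."""
    kv_budget = gpu_bytes - weight_bytes - reserved_bytes
    if kv_budget <= 0:
        return 0  # the weights alone do not fit, so no room for any cache
    return int(kv_budget // (kv_bytes_per_token * max_seq_len))


GB = 10**9

eight_gpus = max_concurrent_sequences(
    gpu_bytes=8 * 80 * GB,
    weight_bytes=140 * GB,
    kv_bytes_per_token=327_680,
    max_seq_len=32_768,
)
single_gpu = max_concurrent_sequences(80 * GB, 140 * GB, 327_680, 32_768)

print(eight_gpus, single_gpu)
```

Paged allocation lets a server exceed this bound in practice when most sequences are shorter than the maximum, since blocks are claimed only as tokens arrive.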

References

Attention Is All You Need · arXiv (Vaswani et al., 2017)
Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints · arXiv (Ainslie et al., 2023)
