Flash Attention 2

TL;DR

Tri Dao's 2023 follow-up to Flash Attention 1 (Dao, arXiv:2307.08691). Targets Ampere and early Hopper; ~2× faster than FA1 on A100.
Reworked the parallelisation strategy and reduced non-matmul FLOPs, raising utilisation from roughly 25-40 % of A100 peak (FA1) to ~50-72 %.
Still the default attention kernel on Ampere (A100, A6000) hardware; superseded by FA3 on Hopper.

Overview

Flash Attention 1 (Dao et al., 2022, arXiv:2205.14135) introduced the tiled, streaming-softmax attention algorithm and showed that attention's memory access pattern, not its FLOPs, was the bottleneck. FA1 worked but left throughput on the table, for two reasons. It parallelised only over batch and heads, which under-fills the GPU when sequences are long and batches are small. And its non-matmul instructions (online-softmax bookkeeping) run on units far slower than the tensor cores, so they took a disproportionate share of the runtime.

Flash Attention 2 fixed both. It parallelises over the sequence dimension in addition to batch and heads, and restructures the inner loops so that non-matmul FLOPs drop sharply. The result is ~2× the throughput of FA1 on A100, reaching ~50-72 % of the GPU's theoretical FP16/BF16 peak.
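As a rough worked example of what that utilisation range means in absolute terms, the sketch below converts a fraction of peak into TFLOPs/s. It assumes the A100's dense FP16/BF16 tensor-core peak of 312 TFLOPs/s; the name `achieved_tflops` is illustrative and not part of any library.

```python
# Assumption: 312 TFLOPs/s is the A100 dense FP16/BF16 tensor-core peak.
A100_PEAK_TFLOPS = 312.0


def achieved_tflops(utilisation, peak=A100_PEAK_TFLOPS):
    """Convert a utilisation fraction (0..1) into achieved TFLOPs/s."""
    return utilisation * peak


# The ~50-72 % range quoted for FA2 on A100.
for label, fraction in [("FA2 low end", 0.50), ("FA2 high end", 0.72)]:
    print(f"{label}: ~{achieved_tflops(fraction):.0f} TFLOPs/s")
```

Under that assumption, the quoted range corresponds to roughly 156 to 225 TFLOPs/s per GPU.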

Mechanism

FA2 splits work across batch elements, heads, and blocks of query rows: each thread block handles one block of query rows, and the matching rows of the output, for one head. The block streams K and V tiles through fast shared memory while accumulating the running softmax statistics (row maximum and row sum). The kernel never materialises the full N×N attention matrix in HBM; only a small tile of scores exists on-chip at any time, and each thread block writes back just its own rows of the N×d output.

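The crucial micro-optimisations: the online softmax defers its division by the row sum to the end of the K/V loop instead of rescaling the output at every step, only the per-row log-sum-exp is stored for the backward pass, and work inside a thread block is split across warps by query rows so that warps do not exchange partial results through shared memory. Together with better register reuse, this keeps the tensor cores fed continuously.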
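The streaming scheme can be illustrated on the CPU. The following is a minimal pure-Python sketch of tiled attention with an online softmax for a single head, not the CUDA kernel itself: the function names, block sizes, and sample data are all invented for illustration. It keeps a running maximum and running sum per query row, rescales the accumulator whenever the maximum changes, and divides only once at the end.

```python
import math


def naive_attention(Q, K, V):
    """Reference: builds each full row of softmax(QK^T / sqrt(d)) before applying V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_q=2, block_k=2):
    """Streaming version: each block of query rows plays the role of one thread block."""
    d, dv = len(Q[0]), len(V[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q0 in range(0, len(Q), block_q):       # independent blocks (parallel on a GPU)
        q_rows = Q[q0:q0 + block_q]
        m = [-math.inf] * len(q_rows)          # running row maxima
        l = [0.0] * len(q_rows)                # running softmax denominators
        acc = [[0.0] * dv for _ in q_rows]     # unnormalised output accumulator
        for k0 in range(0, len(K), block_k):   # stream K/V tiles
            k_blk, v_blk = K[k0:k0 + block_k], V[k0:k0 + block_k]
            for i, q in enumerate(q_rows):
                s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
                m_new = max(m[i], max(s))
                corr = math.exp(m[i] - m_new)  # rescales old state; 0.0 on the first tile
                p = [math.exp(x - m_new) for x in s]
                l[i] = l[i] * corr + sum(p)
                acc[i] = [acc[i][c] * corr
                          + sum(p[j] * v_blk[j][c] for j in range(len(v_blk)))
                          for c in range(dv)]
                m[i] = m_new
        # Deferred normalisation: one division per output element, after the loop.
        out.extend([[x / l[i] for x in acc[i]] for i in range(len(q_rows))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]

ref = naive_attention(Q, K, V)
out = tiled_attention(Q, K, V)
max_err = max(abs(a - b) for r, o in zip(ref, out) for a, b in zip(r, o))
print(f"max abs difference vs. naive attention: {max_err:.2e}")
```

The tiled version only ever holds a `block_q` × `block_k` tile of scores, yet it matches the naive result to floating-point precision, which is the sense in which the algorithm is exact rather than approximate.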

Performance Characteristics

A100 throughput: ~50-72 % of peak BF16/FP16 — roughly 2× FA1.
Memory: O(N) in sequence length beyond the inputs and output; the N×N attention matrix is never materialised.
Supports causal masking, ALiBi, multi-query and grouped-query attention.
On Hopper, FA2 runs well but FA3 is faster.
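To put the O(N) memory claim in concrete terms, the sketch below compares the size of one head's full N×N score matrix with the size of a single per-row statistic. The byte widths are assumptions made for illustration (2 bytes per FP16/BF16 score, 4 bytes per FP32 row statistic), and the function names are hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_elem=2):
    """Bytes for one head's full N x N score matrix (assumed FP16/BF16)."""
    return seq_len * seq_len * bytes_per_elem


def row_stats_bytes(seq_len, bytes_per_elem=4):
    """Bytes for one per-row statistic such as the log-sum-exp (assumed FP32)."""
    return seq_len * bytes_per_elem


for n in (2048, 8192, 32768):
    full_mib = attention_matrix_bytes(n) / 2**20
    stats_kib = row_stats_bytes(n) / 2**10
    print(f"N={n:>6}: full matrix {full_mib:8.0f} MiB per head, "
          f"row statistics {stats_kib:5.0f} KiB per head")
```

Quadrupling the sequence length multiplies the full-matrix cost by sixteen but the per-row statistics by only four, which is why the tiled kernel scales to long contexts.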

When to Use

Use FA2 on Ampere hardware (A100, A40, A6000, etc.) and as a fallback on Hopper when FA3 is unavailable or the workload's attention shape is outside FA3's optimised range. Most LLM training stacks already select the right variant automatically — explicit choice is rarely necessary.
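The selection rule above can be sketched as a small function keyed on CUDA compute capability. This is a hypothetical helper, not how any particular framework decides: it assumes Ampere and Ada report major version 8, Hopper reports 9.0, and that the caller already knows whether FA3 is installed and suits the workload.

```python
def pick_attention_kernel(cc_major, cc_minor, fa3_usable=False):
    """Hypothetical selector: return "fa3", "fa2", or "fallback".

    cc_major/cc_minor: CUDA compute capability, e.g. (8, 0) for A100.
    fa3_usable: caller-supplied flag meaning FA3 is installed and the
                attention shape is within the range it handles well.
    """
    if cc_major == 9 and fa3_usable:
        return "fa3"       # Hopper with FA3 available
    if cc_major in (8, 9):
        return "fa2"       # Ampere/Ada, or Hopper without a usable FA3
    return "fallback"      # anything else: some other attention implementation


print(pick_attention_kernel(8, 0))                   # A100
print(pick_attention_kernel(9, 0, fa3_usable=True))  # H100 with FA3
print(pick_attention_kernel(9, 0))                   # H100 without FA3
```

In a PyTorch program the capability pair could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.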

Pitfalls

Building from source needs a recent CUDA and PyTorch — use pre-built wheels where possible.
Variable-length sequences require the `flash_attn_varlen_func` API; uniform-length sequences use the simpler `flash_attn_func`.
Not all attention masks are equally optimised: custom biases and non-causal sliding-window masks, for example, may run on slower paths than plain causal attention.
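For the variable-length case, `flash_attn_varlen_func` takes sequences packed along one token dimension plus cumulative offsets that mark where each sequence starts. The sketch below shows one way to build those offsets. The helper names `build_cu_seqlens` and `varlen_self_attention` are hypothetical, and the wrapper assumes self-attention on packed FP16/BF16 CUDA tensors of shape (total_tokens, nheads, headdim) with the `flash-attn` and `torch` packages installed.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence offsets with a leading zero, one entry per boundary."""
    return [0] + list(accumulate(seq_lens))


def varlen_self_attention(q, k, v, seq_lens, causal=True):
    """Hypothetical wrapper around flash_attn_varlen_func for packed self-attention.

    Imports are deferred so build_cu_seqlens stays usable without a GPU stack.
    """
    import torch
    from flash_attn import flash_attn_varlen_func

    cu = torch.tensor(build_cu_seqlens(seq_lens), dtype=torch.int32, device=q.device)
    longest = max(seq_lens)
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu, cu_seqlens_k=cu,
        max_seqlen_q=longest, max_seqlen_k=longest,
        causal=causal,
    )


# Three sequences of lengths 3, 5 and 2 packed into 10 tokens.
offsets = build_cu_seqlens([3, 5, 2])
print(offsets)
```

When every sequence in the batch has the same length, none of this bookkeeping is needed and `flash_attn_func` can be called directly on (batch, seqlen, nheads, headdim) tensors.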

Software

github.com/Dao-AILab/flash-attention — the same repository as FA3, version selected per hardware.
PyTorch SDPA dispatches to FA2 on Ampere when applicable.
All major training frameworks (Megatron, NeMo, FSDP-based) integrate it.

References

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning · arXiv:2307.08691 (Dao, 2023)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness · arXiv:2205.14135 (Dao et al., 2022)
FlashAttention on GitHub · github.com/Dao-AILab/flash-attention
