TL;DR
- Tri Dao's 2023 follow-up to Flash Attention 1 (Dao, arXiv:2307.08691). Targets Ampere and early Hopper; ~2× faster than FA1 on A100.
- Reworked the parallelisation strategy and reduced non-matmul FLOPs, raising utilisation from roughly 25-40 % of A100 peak (FA1) to roughly 50-73 %.
- Still the default attention kernel on Ampere (A100, A6000) hardware; superseded by FA3 on Hopper.
Overview
Flash Attention 1 (Dao et al., 2022, arXiv:2205.14135) introduced the streaming-softmax tiled attention algorithm and showed that attention's memory access pattern, not its FLOPs, was the bottleneck. FA1 worked but left throughput on the table, partly because it parallelised only over batch and heads (which leaves streaming multiprocessors idle when sequences are long and batches are small) and partly because non-matmul instructions (online-softmax bookkeeping) competed with the tensor cores.
Flash Attention 2 fixes both. It parallelises over the sequence dimension in addition to batch and heads, and it restructures the inner loops so that non-matmul FLOPs drop sharply. The result is ~2× the throughput of FA1 on A100, which corresponds to roughly 50-73 % of the GPU's theoretical FP16/BF16 peak.
Mechanism
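FA2 splits work along two axes: heads (and batch elements) and blocks of query rows. Each thread block owns one block of query rows for one head and produces the matching rows of the output. It streams blocks of K and V through fast shared memory, computes one tile of Q×K^T scores at a time, and folds that tile into the running softmax statistics (row maximum and row sum). The kernel never materialises the full N×N attention matrix in HBM; only a block-sized tile of scores exists on chip at any moment.
The crucial micro-optimisations: fewer divisions in the online softmax (the output is rescaled by the softmax denominator once at the end of the loop rather than after every block), better register reuse, and a work partitioning that splits query rows across warps so that warps rarely exchange data through shared memory and the tensor cores stay fed continuously.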
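The loop structure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CUDA kernel: the function names `naive_attention` and `tiled_attention` and the block sizes are hypothetical, matrices are lists of lists, and there is no masking or dropout. It shows the outer loop over query blocks (the unit of parallelism), the streaming of key/value blocks, and the single normalisation at the end of each row.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds every full score row, then applies softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block_q=2, block_k=2):
    """Streams K/V blocks past each Q block, keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    # Outer loop over query blocks. The iterations are independent, which is
    # what would let a GPU assign one thread block per query block.
    for qs in range(0, len(q), block_q):
        q_blk = q[qs:qs + block_q]
        m = [-math.inf] * len(q_blk)        # running row maximum
        l = [0.0] * len(q_blk)              # running softmax denominator
        acc = [[0.0] * dv for _ in q_blk]   # unnormalised output accumulator
        # Inner loop: stream one key/value block at a time.
        for ks in range(0, len(k), block_k):
            k_blk = k[ks:ks + block_k]
            v_blk = v[ks:ks + block_k]
            for r, qi in enumerate(q_blk):
                s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
                m_new = max(m[r], max(s))
                corr = math.exp(m[r] - m_new)   # rescales the old statistics
                p = [math.exp(x - m_new) for x in s]
                l[r] = l[r] * corr + sum(p)
                acc[r] = [a * corr + sum(pj * vj[c] for pj, vj in zip(p, v_blk))
                          for c, a in enumerate(acc[r])]
                m[r] = m_new
        # One division per output element, after all key blocks have been seen.
        out.extend([[a / l[r] for a in acc[r]] for r in range(len(q_blk))])
    return out

q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
k = [[0.5, 0.1], [-0.2, 0.3], [0.7, 0.7], [0.0, -1.0], [0.2, 0.2]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -2.0], [-1.0, 3.0]]
max_err = max(abs(a - b)
              for ra, rb in zip(naive_attention(q, k, v), tiled_attention(q, k, v))
              for a, b in zip(ra, rb))
print("max abs difference vs. reference:", max_err)
```

Only `m`, `l`, `acc` and one block-sized list of scores are live at any point, which is the property that keeps the full score matrix out of slow memory.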
Performance Characteristics
- A100 throughput: roughly 50-73 % of peak BF16/FP16, about 2× FA1.
- Memory: O(N) extra memory in the sequence length N; the N×N attention matrix is never materialised.
- Supports causal masking, ALiBi, multi-query and grouped-query attention.
- On Hopper, FA2 runs well but FA3 is faster.
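To make the throughput and memory figures above concrete, the following sketch does the arithmetic. It assumes the usual matmul-only FLOP count for attention (two matmuls, QK^T and PV, at 2 FLOPs per multiply-add, so 4·N²·d per head) and an A100 dense FP16/BF16 peak of 312 TFLOPs/s; the function names and the example shape are hypothetical and chosen only for illustration.

```python
A100_PEAK_FLOPS = 312e12  # assumed dense FP16/BF16 tensor-core peak, FLOPs/s

def attention_forward_flops(seq_len, head_dim, num_heads, batch=1, causal=False):
    """Matmul FLOPs for one forward pass: QK^T plus PV, 2 FLOPs per multiply-add."""
    flops = 4 * batch * num_heads * seq_len ** 2 * head_dim
    # A causal mask skips roughly half of the score tiles.
    return flops // 2 if causal else flops

def score_matrix_bytes(seq_len, num_heads, batch=1, bytes_per_elem=2):
    """Size of the N x N score matrix a non-fused kernel would write to HBM."""
    return batch * num_heads * seq_len ** 2 * bytes_per_elem

def utilisation(flops, seconds, peak_flops_per_s=A100_PEAK_FLOPS):
    """Fraction of peak achieved by a kernel that did `flops` work in `seconds`."""
    return flops / (seconds * peak_flops_per_s)

# Example shape (an assumption, not a benchmark): batch 4, 16 heads, N = 4096, d = 128.
flops = attention_forward_flops(4096, 128, 16, batch=4)
print("forward FLOPs:", flops)
print("score matrix if materialised (GiB):", score_matrix_bytes(4096, 16, batch=4) / 2 ** 30)
print("TFLOPs/s at half of peak:", 0.5 * A100_PEAK_FLOPS / 1e12)
```

Because the score matrix grows with N² while the running statistics grow with N, the memory gap widens quickly as sequences get longer.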
When to Use
Use FA2 on Ampere hardware (A100, A40, A6000, etc.) and as a fallback on Hopper when FA3 is unavailable or the workload's attention shape is outside FA3's optimised range. Most LLM training stacks already select the right variant automatically — explicit choice is rarely necessary.
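The selection rule above can be written as a small decision function. This is a sketch under assumptions: `pick_attention_kernel`, its parameters and its return strings are hypothetical and do not correspond to any framework's API. It keys on the CUDA compute capability, where major version 8 covers Ampere and major version 9 is Hopper; in PyTorch that tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def pick_attention_kernel(compute_capability, fa3_available=False, shape_ok_for_fa3=True):
    """Return a label for the attention kernel to prefer (hypothetical helper).

    compute_capability: (major, minor) tuple, e.g. (8, 0) for A100.
    fa3_available:      whether an FA3 build is installed (assumed to be known).
    shape_ok_for_fa3:   whether the workload's shape is one FA3 handles well.
    """
    major, _minor = compute_capability
    if major == 9:
        # Hopper: prefer FA3, fall back to FA2 when FA3 cannot be used.
        if fa3_available and shape_ok_for_fa3:
            return "fa3"
        return "fa2"
    if major == 8:
        # Ampere-generation parts such as A100 (8.0) and A40/A6000 (8.6).
        return "fa2"
    # Anything else: leave the choice to the framework's own dispatcher.
    return "framework_default"

print(pick_attention_kernel((8, 0)))
print(pick_attention_kernel((9, 0), fa3_available=True))
```

In practice this logic usually lives inside the training stack, so a helper like this is mainly useful for logging or for asserting that the expected kernel is in use.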
Pitfalls
- Building from source needs a recent CUDA toolkit and a compatible PyTorch build; use pre-built wheels where possible.
- Variable-length sequences require the `flash_attn_varlen_func` API; uniform-length sequences use the simpler `flash_attn_func`.
- Not every masking pattern is equally well optimised; custom biases and non-causal sliding windows, for example, may not run as fast as plain causal or full attention.
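For the variable-length API mentioned above, the caller packs all sequences into one unpadded batch and describes the boundaries with cumulative sequence lengths. The sketch below shows only that bookkeeping in plain Python; `build_cu_seqlens` and `pack_sequences` are hypothetical helper names. To the best of my knowledge `flash_attn_varlen_func` expects these offsets as int32 tensors (`cu_seqlens_q`, `cu_seqlens_k`) of length batch + 1, together with `max_seqlen_q` and `max_seqlen_k`, but the docstring of the installed version should be treated as the authority.

```python
from itertools import accumulate

def build_cu_seqlens(seq_lens):
    """Cumulative offsets: sequence i occupies rows cu[i]:cu[i + 1] of the packed batch."""
    return [0, *accumulate(seq_lens)]

def pack_sequences(batch):
    """Concatenate variable-length sequences into one list with no padding.

    Returns the packed list, the cumulative offsets, and the longest length.
    """
    packed = [tok for seq in batch for tok in seq]
    lens = [len(seq) for seq in batch]
    return packed, build_cu_seqlens(lens), max(lens)

# Three sequences of lengths 3, 5 and 2 (token ids are placeholders).
packed, cu_seqlens, max_seqlen = pack_sequences([[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]])
print(cu_seqlens, max_seqlen)
# The second sequence can be recovered by slicing with consecutive offsets.
print(packed[cu_seqlens[1]:cu_seqlens[2]])
```

With real tensors, the packed list would become a tensor of shape (total_tokens, num_heads, head_dim) and the offsets a `torch.int32` tensor on the same device.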
Software
- github.com/Dao-AILab/flash-attention — the same repository as FA3, version selected per hardware.
- PyTorch SDPA dispatches to FA2 on Ampere when applicable.
- All major training frameworks (Megatron, NeMo, FSDP-based) integrate it.
References
- Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention. arXiv:2205.14135
- FlashAttention on GitHub: github.com/Dao-AILab/flash-attention