TL;DR
- Eight-bit floating-point format jointly proposed by NVIDIA, Arm and Intel in 2022; two variants: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits).
- Hardware support: NVIDIA H100/H200/B200 Transformer Engine, AMD MI300 series, Intel Gaudi 3.
- Roughly halves memory footprint and doubles peak throughput versus BF16, with minimal quality loss when used with per-tensor scaling.
- Now a common default activation and weight dtype for LLM serving on Hopper and Blackwell; supported across vLLM, TensorRT-LLM, TGI, SGLang and MLC-LLM.
Why FP8
INT8 quantisation has been the workhorse for accelerator deployments for a decade, but it struggles with the wide dynamic range needed for Transformer activations. INT4 weight-only quantisation works well for weights but does nothing for activations. FP8 fills the gap: an eight-bit floating-point format keeps a usable exponent range and can be applied to both weights and activations.
The Transformer Engine on H100 was designed around FP8. Tensor cores accept FP8 inputs, accumulate in FP32 and emit FP16/BF16/FP32 outputs. With dynamic per-tensor scaling, this gives close-to-BF16 accuracy at roughly twice the throughput.
Two Variants
FP8 ships in two flavours. E4M3 uses four exponent bits and three mantissa bits — narrower dynamic range, more precision, used for weights and forward activations. E5M2 uses five exponent bits and two mantissa bits — wider dynamic range, less precision, used for gradients in training. For inference, E4M3 is the dominant variant.
| Variant | Exponent | Mantissa | Max finite value | Use |
|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | 448 | Inference weights, activations |
| E5M2 | 5 bits | 2 bits | 57344 | Training gradients |
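The maxima in the table follow directly from the bit layouts. As a minimal sketch, assuming the conventions of the Micikevicius et al. paper (E4M3 uses exponent bias 7, has no infinities and reserves a single NaN pattern; E5M2 uses bias 15 with IEEE-style infinities and NaNs), the following decodes every bit pattern and reports the largest finite value. The function names are illustrative and do not come from any library.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Decode one FP8 bit pattern (0..255) into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1          # 7 for E4M3, 15 for E5M2
    exp_all_ones = (1 << exp_bits) - 1
    man_all_ones = (1 << man_bits) - 1

    if ieee_specials and exp == exp_all_ones:
        # E5M2: top exponent is reserved for infinity and NaN
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_all_ones and man == man_all_ones:
        # E4M3: only the all-ones pattern is NaN, the rest are ordinary numbers
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def max_finite(exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Largest finite value over all 256 bit patterns."""
    values = (decode_fp8(b, exp_bits, man_bits, ieee_specials) for b in range(256))
    return max(v for v in values if math.isfinite(v))


E4M3_MAX = max_finite(4, 3, ieee_specials=False)
E5M2_MAX = max_finite(5, 2, ieee_specials=True)
print(E4M3_MAX, E5M2_MAX)  # 448.0 57344.0
```

E4M3 gives up infinities and all but one NaN pattern so that the top exponent can hold ordinary values, which is how a 4-bit exponent reaches 448.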
Scaling
FP8 tensors carry an associated FP32 scaling factor that maps the FP8 representable range onto the actual tensor range. Per-tensor scaling is the simplest and most common; per-channel and per-block scaling improve accuracy at the cost of more bookkeeping.
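To make the role of the scale concrete, here is a rough pure-Python simulation of per-tensor E4M3 quantisation. It assumes the scale is defined as `amax / 448`, so that `x ≈ x_fp8 * scale`; some libraries store the reciprocal instead. It rounds by searching a table of representable values, with ties going to the smaller magnitude, whereas real hardware rounds to nearest-even. All names are hypothetical.

```python
import math

E4M3_MAX = 448.0


def e4m3_grid():
    """All non-negative finite E4M3 values (bias 7, no infinities)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # the single NaN pattern
            if exp == 0:
                vals.add((man / 8) * 2.0 ** -6)          # subnormals and zero
            else:
                vals.add((1 + man / 8) * 2.0 ** (exp - 7))
    return sorted(vals)


GRID = e4m3_grid()


def round_to_e4m3(x: float) -> float:
    """Saturating round to the nearest representable E4M3 value."""
    mag = min(abs(x), E4M3_MAX)                           # clamp instead of overflowing
    nearest = min(GRID, key=lambda v: abs(v - mag))
    return math.copysign(nearest, x)


def quantize(tensor, scale):
    """Divide by the scale, then snap each element onto the E4M3 grid."""
    return [round_to_e4m3(x / scale) for x in tensor]


def dequantize(q, scale):
    """Multiply back by the scale to recover approximate original values."""
    return [v * scale for v in q]


tensor = [0.02, -1.3, 7.9, -12.5, 0.0004]
scale = max(abs(x) for x in tensor) / E4M3_MAX           # one FP32 scale for the whole tensor
q = quantize(tensor, scale)
restored = dequantize(q, scale)
for x, r in zip(tensor, restored):
    print(f"{x:>10.4f} -> {r:>10.4f}")
```

The largest element lands exactly on 448 and is recovered almost exactly. Smaller elements pick up a relative error of at most about 6% (half of one 3-bit mantissa step) while they stay in the normal range, and more once they fall into the subnormals.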
Calibration is straightforward: run a small calibration set, record the maximum absolute activation value (amax) for each tensor, and set the FP8 scale so that amax lands near the top of the FP8 range (for E4M3, scale ≈ amax / 448). Most runtimes automate this in a single command.
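The calibration loop can be sketched in a few lines. This is an illustration under stated assumptions, not any runtime's actual implementation: activations arrive as non-empty lists of floats, the scale is defined as `amax / fp8_max`, and the `margin` parameter is a hypothetical knob for leaving headroom above the observed maximum.

```python
E4M3_MAX = 448.0


def calibrate_scale(batches, fp8_max: float = E4M3_MAX, margin: float = 1.0) -> float:
    """Per-tensor scale from the largest magnitude seen across calibration batches."""
    amax = 0.0
    for batch in batches:
        # Track the running maximum absolute value over the whole calibration set
        amax = max(amax, max(abs(x) for x in batch))
    if amax == 0.0:
        return 1.0  # all-zero tensor: any scale works, pick a harmless one
    # margin > 1 maps amax below fp8_max, trading resolution for headroom
    return amax * margin / fp8_max


# Hypothetical activations of one tensor recorded over three calibration batches
calibration_batches = [
    [0.1, -2.5, 3.0],
    [5.5, -0.2, 1.1],
    [-4.0, 0.7, 2.2],
]
scale = calibrate_scale(calibration_batches)
print(f"amax-based scale: {scale:.6f}")
```

With `margin=1.0` the largest calibration value (5.5 here) maps to 448. Values larger than that at serving time would saturate, which is the usual argument for a representative calibration set or a small margin.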
Throughput and Quality
- Memory: weights, and the KV cache when it is also stored in FP8, shrink ~2x versus BF16.
- Throughput on H100: ~2x peak tensor-core TFLOPS versus BF16.
- Throughput on B200: FP8 throughput pushed further; FP4 takes over as the headline dtype.
- Accuracy: typically <0.5 percentage point drop on standard LLM benchmarks for E4M3 with per-tensor scaling.
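The memory figure is plain byte counting, which the following back-of-the-envelope sketch spells out. The model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128) is a hypothetical example chosen for illustration, and the per-tensor scale factors are ignored because they are negligible next to the weights.

```python
def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Weight storage in GiB, ignoring the per-tensor scale factors."""
    return num_params * bytes_per_param / 2**30


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """KV-cache bytes per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape, for illustration only
NUM_PARAMS = 70e9
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

bf16_w, fp8_w = weight_gib(NUM_PARAMS, 2), weight_gib(NUM_PARAMS, 1)
bf16_kv = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_kv = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

print(f"weights:  BF16 {bf16_w:.1f} GiB, FP8 {fp8_w:.1f} GiB")
print(f"KV/token: BF16 {bf16_kv / 1024:.0f} KiB, FP8 {fp8_kv / 1024:.0f} KiB")
```

The 2x ratio applies to whichever tensors are actually stored in FP8; anything the runtime keeps in higher precision stays at its original size.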
Tooling
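vLLM accepts `--quantization fp8`, which quantises a BF16/FP16 checkpoint to FP8 at load time, and also loads pre-quantised FP8 checkpoints directly. TensorRT-LLM exposes `--gemm_plugin fp8` and `--use_fp8_context_fmha enable` at build time. `llm-compressor` from the vLLM project (the successor to Neural Magic's AutoFP8) handles HF-to-FP8 conversion with calibration. SGLang and TGI can load FP8 checkpoints in the same Hugging Face format; MLC-LLM converts weights into its own format and applies FP8 quantisation during that conversion.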
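For offline use, the same vLLM option is available through its Python API. The sketch below assumes a recent vLLM release and an FP8-capable GPU. The helper names are hypothetical, and the imports are deferred so the module can be loaded on a machine without vLLM installed.

```python
def build_fp8_engine(model_name: str):
    """Create a vLLM engine that quantises the checkpoint to FP8 at load time."""
    from vllm import LLM  # deferred import: requires vLLM and a supported GPU

    return LLM(model=model_name, quantization="fp8")


def generate(llm, prompt: str, max_tokens: int = 64) -> str:
    """Run one prompt through the engine and return the generated text."""
    from vllm import SamplingParams

    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return outputs[0].outputs[0].text
```

A typical call would be `llm = build_fp8_engine("<hf-model-id>")` followed by `generate(llm, "FP8 is")`, where the model id is a placeholder for whichever checkpoint is being served.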
When to Use
Default to FP8 on any Hopper or Blackwell deployment serving LLMs. Reach for INT4 (AWQ or GPTQ) when memory pressure is the binding constraint and you need the extra footprint savings. Stick with BF16 only when the target hardware lacks FP8 tensor cores (Ampere and earlier, some non-NVIDIA accelerators).
References
- FP8 Formats for Deep Learning · arXiv (Micikevicius et al., 2022)
- NVIDIA Transformer Engine · NVIDIA
- vLLM FP8 Documentation · vLLM