TL;DR
- Transformer Engine (TE) is the software + hardware pair NVIDIA introduced with Hopper to make FP8 / FP4 training and inference safe — fourth-gen Tensor Core silicon (Hopper) plus a PyTorch library that wraps modules with automatic per-tensor scaling factor management.
- First generation (Hopper, 2022): FP8 with E4M3 (forward activations/weights) and E5M2 (backward gradients). Second generation (Blackwell, 2024): adds FP4 (E2M1) and the OCP-standard MX microscaling formats (MXFP8, MXFP6, MXFP4).
- Throughput uplift on H100 SXM5 measured on real Llama 3 / Mixtral training runs: 1.5-1.9x vs BF16 baseline at iso-accuracy (loss curves within 0.5 % through 1T tokens). Inference uplift on vLLM / TensorRT-LLM: 1.6-2.1x decode TPS.
- Integrated as a first-class option in Megatron-Core / Megatron-LM, NeMo, DeepSpeed, PyTorch FSDP, HuggingFace transformers (via accelerate), vLLM, TensorRT-LLM, SGLang and torchtitan. Most modern training stacks ship TE on by default for Hopper-and-newer.
- Default in Yobibyte's managed fine-tune recipes for H100 / H200 / B200 workspaces: customers opt out, not in. The managed recipe layer selects formats per architecture so accuracy stays within typical noise bands.
Overview
Transformer Engine (TE) is the software-and-hardware combination NVIDIA introduced with Hopper (2022) to make 8-bit (and now 4-bit) floating-point training and inference safe to use in production. The hardware half is the fourth-generation Tensor Core with native FP8 multiply-accumulate (E4M3 and E5M2 formats) on Hopper, extended to FP4 (E2M1) and OCP microscaling (MX) formats on Blackwell. The software half is an open-source PyTorch library (`transformer_engine.pytorch`) that wraps standard transformer modules — Linear, LayerNorm, attention, MoE routing — with automatic scaling-factor management, recipe selection and FP8/FP4 autocast.
Both halves matter because low-precision floats have very limited dynamic range. E4M3 represents magnitudes from roughly 2^-9 (its smallest subnormal) up to an absolute max of 448; E5M2 reaches 57,344 by trading one mantissa bit for an extra exponent bit; FP4 (E2M1) has only seven representable non-zero magnitudes (0.5, 1, 1.5, 2, 3, 4 and 6). Naively casting BF16 tensors to FP8 silently underflows the small-magnitude tail and overflows the large-magnitude tail, producing 1-2 point evaluation regressions that look indistinguishable from flaky training runs. TE exists to maintain per-tensor (or per-channel, or per-block in MX) scaling factors and a short history (amax tracking) so the cast loses no more than the format's precision budget allows.
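The underflow and overflow behaviour described above can be reproduced without a GPU. The sketch below is a pure-Python model of an E4M3 cast, not TE code: it enumerates the finite E4M3 values from the bit layout, rounds to the nearest one with saturation, and compares a naive cast against a cast that first scales the tensor so its amax lands on the format max. The helper names (`e4m3_grid`, `cast_e4m3`, `cast_scaled`) and the sample values are illustrative assumptions.

```python
import math

def e4m3_grid():
    """Every non-negative finite E4M3 value (4 exponent bits, 3 mantissa bits, bias 7)."""
    values = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # the only NaN encoding; E4M3 has no infinity
            if exp == 0:
                values.add(man / 8 * 2.0 ** -6)              # zero and subnormals
            else:
                values.add((1 + man / 8) * 2.0 ** (exp - 7))  # normal values
    return sorted(values)

GRID = e4m3_grid()
E4M3_MAX = GRID[-1]

def cast_e4m3(x):
    """Nearest-value cast with saturation (ties go to the smaller magnitude in this model)."""
    mag = min(abs(x), E4M3_MAX)
    nearest = min(GRID, key=lambda v: abs(v - mag))
    return math.copysign(nearest, x)

def cast_scaled(x, amax):
    """Scale so that amax maps to the format max, cast, then undo the scale."""
    scale = E4M3_MAX / amax
    return cast_e4m3(x * scale) / scale

small = [1e-4, 5e-4, 2e-3]                           # gradient-sized magnitudes
print([cast_e4m3(v) for v in small])                 # naive cast: the small tail flushes to zero
print([cast_scaled(v, amax=2e-3) for v in small])    # scaled cast keeps all three values
print(cast_e4m3(900.0), cast_scaled(900.0, amax=900.0))  # saturation vs. scaled cast
```

Real hardware rounds ties to nearest-even rather than toward the smaller magnitude, so individual values can differ by one grid step from this model; the underflow and saturation pattern is the same.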
This entry is the reference for teams enabling FP8 / FP4 on Hopper / Blackwell across training and inference: the formats and their dynamic ranges, the TE recipe surface, how it integrates with Megatron-Core / NeMo / vLLM / TensorRT-LLM, sizing guidance for the throughput uplift, the calibration pitfalls that cause silent regressions, and the migration paths from BF16 baselines. Yobibyte's default training and inference recipes enable Transformer Engine FP8 on Yobitel NeoCloud H100 / H200 / B200 capacity so customers inherit the uplift without re-deriving the format choice per architecture. This entry helps you decide when FP8 (or FP4) is safe for your model and what the throughput / accuracy trade-off looks like on Hopper / Blackwell.
Specifications: FP8, FP4 and MX number formats
Authoritative format table. Sign / exponent / mantissa widths define the dynamic range and precision; the 'absolute max' column is what your scaling factor must keep under to avoid overflow.
- E4M3 is asymmetric: it represents 0 and ±NaN but no infinity, so the entire exponent range is usable for finite values (cost: no overflow trap).
- E5M2 is IEEE-754-style: ±inf, ±NaN, subnormals — the natural drop-in replacement for FP16 gradient flow.
- MX formats: 32-element blocks share an 8-bit (E8M0) scale exponent, giving finer-grained calibration than per-tensor scaling. Standardised by OCP (Open Compute Project) so AMD MI355X, Intel Gaudi 3 and others can interoperate.
- Hopper supports E4M3 and E5M2 only. Blackwell adds FP4 (E2M1), MXFP8, MXFP6 and MXFP4. Ampere and earlier have no FP8 silicon — TE falls back to BF16/FP16.
| Format | Sign | Exp | Mantissa | Abs max | Min normal | Typical use |
|---|---|---|---|---|---|---|
| FP32 | 1 | 8 | 23 | ~3.4e38 | ~1.2e-38 | Master weights, loss scaling reference |
| BF16 | 1 | 8 | 7 | ~3.4e38 | ~1.2e-38 | Default training compute on H100/A100 |
| FP16 | 1 | 5 | 10 | 65,504 | ~6.1e-5 | Legacy mixed-precision training |
| FP8 E4M3 | 1 | 4 | 3 | 448 (no inf) | 2^-6 ≈ 0.0156 | Forward activations + weights |
| FP8 E5M2 | 1 | 5 | 2 | 57,344 | 2^-14 ≈ 6.1e-5 | Backward gradients |
| FP4 E2M1 | 1 | 2 | 1 | 6 (no inf) | 1.0 (subnormal 0.5) | Blackwell inference; some training |
| MXFP8 (block 32) | 1 | — | — | Per-block scale (E8M0) | — | Microscaling training/inference |
| MXFP6 (block 32) | 1 | — | — | Per-block scale (E8M0) | — | Aggressive Blackwell inference |
| MXFP4 (block 32) | 1 | — | — | Per-block scale (E8M0) | — | Frontier Blackwell inference |
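The "Abs max" and "Min normal" columns follow directly from the exponent and mantissa widths. As a cross-check, here is a small sketch that derives them; the function names and the `special` argument are assumptions made for this example, with `special` describing how each format reserves encodings for inf/NaN.

```python
def minifloat_stats(exp_bits, man_bits, special):
    """Return (abs_max, min_normal, min_subnormal) for a sign/exponent/mantissa format.

    special: "ieee"     -> top exponent reserved for inf/NaN (FP16, E5M2)
             "nan_only" -> only the all-ones exponent+mantissa pattern is NaN (E4M3)
             "none"     -> every encoding is a finite number (E2M1)
    """
    bias = 2 ** (exp_bits - 1) - 1
    top_exp = 2 ** exp_bits - 1
    top_man = 2 ** man_bits - 1
    if special == "ieee":
        max_exp, max_man = top_exp - 1, top_man
    elif special == "nan_only":
        max_exp, max_man = top_exp, top_man - 1
    else:
        max_exp, max_man = top_exp, top_man
    abs_max = (1 + max_man / 2 ** man_bits) * 2.0 ** (max_exp - bias)
    min_normal = 2.0 ** (1 - bias)
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return abs_max, min_normal, min_subnormal

def e2m1_magnitudes():
    """Enumerate the non-zero magnitudes FP4 E2M1 can represent (bias 1)."""
    mags = set()
    for exp in range(4):
        for man in range(2):
            value = man * 0.5 if exp == 0 else (1 + man / 2) * 2.0 ** (exp - 1)
            if value > 0:
                mags.add(value)
    return sorted(mags)

for name, args in {
    "FP16": (5, 10, "ieee"),
    "E4M3": (4, 3, "nan_only"),
    "E5M2": (5, 2, "ieee"),
    "E2M1": (2, 1, "none"),
}.items():
    print(name, minifloat_stats(*args))
print("E2M1 magnitudes:", e2m1_magnitudes())
```

For E2M1 the smallest normal value is 1.0 and the smallest non-zero magnitude is the subnormal 0.5, which is why the format has so few distinct values to spend on a tensor.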
FP8's small dynamic range is the entire reason scaling matters. Without per-tensor scaling, FP8 training is not viable — Transformer Engine exists to handle this automatically. Hand-rolled FP8 paths that skip TE will look fine for 100-1k steps and silently diverge somewhere in the 10k-100k step range.
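The MX rows in the table apply the same idea at finer granularity: one shared power-of-two scale per 32-element block instead of one scale per tensor. The sketch below models MXFP4 in plain Python. The shared-exponent rule (floor of log2 of the block amax, minus the element format's largest exponent) follows my reading of the OCP MX v1.0 specification and should be treated as an assumption; the function names are hypothetical and the E8M0 exponent range is not clamped here.

```python
import math

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # non-negative FP4 E2M1 values
E2M1_EMAX = 2    # exponent of the largest power of two in E2M1 (4 = 2**2)
BLOCK = 32       # elements sharing one scale

def quantize_block(block):
    """Return (shared_exponent, elements on the E2M1 grid) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    elems = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])            # saturate at 6
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))      # nearest grid value
        elems.append(math.copysign(q, v))
    return shared_exp, elems

def dequantize_block(shared_exp, elems):
    scale = 2.0 ** shared_exp
    return [e * scale for e in elems]

def mx_roundtrip(values):
    """Quantize and dequantize a flat list block by block; also return the block exponents."""
    out, exps = [], []
    for i in range(0, len(values), BLOCK):
        exp, elems = quantize_block(values[i:i + BLOCK])
        exps.append(exp)
        out.extend(dequantize_block(exp, elems))
    return out, exps

# Two blocks whose magnitudes differ by five orders of magnitude.
values = [0.001 * (1 + i % 4) for i in range(32)] + [100.0 * (1 + i % 4) for i in range(32)]
restored, exponents = mx_roundtrip(values)
print(exponents)
print(restored[:4], restored[32:36])
```

With a single per-tensor scale chosen for the large block, every value in the small block would round to zero on a grid this coarse; per-block exponents keep both blocks within one or two grid steps of their original values.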
Architecture: how Transformer Engine handles scaling
TE attaches a scaling-factor history (`amax_history`) to each FP8 tensor. The pattern is the same for forward activations, weights and backward gradients:
1. Compute the absolute maximum (amax) of the BF16/FP32 tensor before casting.
2. Divide the format's representable max (448 for E4M3, 57,344 for E5M2) by the amax to get the scaling factor.
3. Multiply the tensor by the scaling factor and cast the result to FP8.
4. Store the scaling factor alongside the tensor; apply its inverse on every read.
5. Append the new amax to a short rolling window (default `amax_history_len=1024`) and use the window's max to derive the scaling factor for the next step.
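The rolling-window bookkeeping in these steps can be sketched in a few lines. `DelayedScaleTracker` below is a hypothetical helper written for this entry, not a TE class. It uses the convention that the scale is the multiplier applied before the cast (format max divided by the window's amax, with `margin` bits of extra headroom), so multiplying by its inverse recovers the original magnitude.

```python
from collections import deque

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}

class DelayedScaleTracker:
    """Hypothetical helper (not part of TE): amax history for a single tensor."""

    def __init__(self, fmt="e4m3", history_len=1024, margin=0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # oldest amax values fall out automatically
        self.margin = margin

    def scale(self):
        """Multiplier to apply before casting, based only on earlier steps."""
        if not self.history:
            return 1.0                      # nothing observed yet
        amax = max(self.history)
        if amax == 0.0:
            return 1.0                      # avoid dividing by zero on all-zero tensors
        return self.fp8_max / amax / (2 ** self.margin)

    def observe(self, tensor):
        """Record this step's amax so that later steps can use it."""
        self.history.append(max(abs(v) for v in tensor))

tracker = DelayedScaleTracker(history_len=2)
for step, tensor in enumerate([[0.5, -2.0], [8.0, 1.0], [1.0], [1.0]]):
    print(step, tracker.scale())   # the scale used on this step predates this tensor
    tracker.observe(tensor)
print("next", tracker.scale())
```

The short `history_len=2` window is only there to make the eviction visible: once the 8.0 spike leaves the window, the scale rises again. A longer window reacts more slowly but is less likely to clip a recurring spike.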
TE supports three recipes: `HYBRID` (E4M3 forward, E5M2 backward — the default and the one used in almost every production FP8 training run), `E4M3` (E4M3 everywhere, narrower gradient range but more activation precision — for inference and short fine-tunes), and the MX-format recipes on Blackwell (`MXFP8`, `MXFP4`).
Two scaling strategies: 'delayed scaling' (use the previous-step amax — one fewer host-device sync per step, ~2-4 % throughput uplift, negligible accuracy cost) and 'current scaling' (compute amax on the in-flight tensor — slightly more accurate, slightly slower). Delayed is default.
- amax_history_len: rolling window size for amax tracking. Default 1024; raise to 2048-4096 if loss spikes during long-horizon training.
- fp8_format: HYBRID (E4M3 forward + E5M2 backward) is the production default. Pure E4M3 is inference-only.
- fp8_dpa: enables FP8 dot-product attention (DPA), so the attention computation itself runs in FP8; it can be combined with TP/CP parallelism for attention.
- Margin: number of bits of headroom in the scaling factor. Default 0; raise to 1-2 for noisy gradient distributions.
- TE wraps modules in-place: replace `torch.nn.Linear` with `transformer_engine.pytorch.Linear` and the same module accepts FP8 amax tracking with no caller-side changes.
```python
# Minimal TE FP8 fine-tune (illustrative; Yobibyte hides this behind a recipe)
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# Hybrid recipe: E4M3 forward, E5M2 backward (production default)
fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Replace nn.Linear with te.Linear: same signature, FP8-aware.
# Layer widths must chain inside nn.Sequential, so this is a plain (non-gated) MLP.
model = torch.nn.Sequential(
    te.LayerNormLinear(4096, 11008, bias=False),  # pre-norm + up proj
    te.Linear(11008, 4096, bias=False),           # down proj
).cuda()

# Stand-in data so the loop runs; swap in a real DataLoader.
# Shapes are kept at multiples of 16, which the FP8 GEMM paths expect.
dataloader = [torch.randn(32, 4096) for _ in range(8)]

# Train step: wrap the forward pass in te.fp8_autocast
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
for batch in dataloader:
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(batch.cuda())
    loss = out.float().mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Form factor / hardware support
Transformer Engine is silicon-bound: only Ada, Hopper and Blackwell Tensor Cores execute FP8 multiply-accumulate natively, and only Blackwell adds FP4. On Ampere and earlier, TE falls back to BF16/FP16: the library still imports and runs, but the throughput uplift is zero.
- Native FP8 throughput on H100 SXM5: 1,979 TFLOPS dense (3,958 with 2:4 sparsity) — exactly 2x the BF16 figure on the same silicon.
- Native FP4 throughput on B200: roughly 9,000 TFLOPS dense (18,000 with 2:4 sparsity), 2x the FP8 figure of roughly 4,500 TFLOPS dense on the same silicon.
- Power: enabling FP8 / FP4 does not change board TDP, but compute is denser so observed power draw rises closer to TDP (a property to plan rack cooling around).
- Confidential Compute (CC-on) is orthogonal — FP8/FP4 paths work under attestation with a uniform ~3-5 % overhead.
| GPU family | Compute cap | FP8 (E4M3/E5M2) | FP4 (E2M1) | MX formats | Notes |
|---|---|---|---|---|---|
| A100 (Ampere) | sm_80 | No | No | No | TE falls back to BF16; throughput baseline. |
| L4 / L40S / RTX 6000 Ada | sm_89 | Yes (Ada gen-4 Tensor Core) | No | No | Same FP8 paths as Hopper; reduced peak FLOPS. |
| H100 / H200 (Hopper) | sm_90 / sm_90a | Yes (native MMA) | No | No | Production sweet spot for FP8 training and inference. |
| B100 / B200 (Blackwell) | sm_100 | Yes | Yes | MXFP8 / MXFP6 / MXFP4 | Second-generation Transformer Engine. |
| GB200 / GB300 NVL72 | sm_100 | Yes | Yes | MX + per-block scale offload | MX format scale-factor offload to NVSwitch SHARP-v3. |
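The support matrix above reduces to a comparison on the CUDA compute capability tuple. A minimal sketch follows; the function names are made up for this example, and the thresholds (8.9 for FP8, 10.0 for FP4 and MX) are read off the table rather than taken from TE's own checks.

```python
def formats_for_capability(capability):
    """Map a (major, minor) compute capability to the low-precision formats run natively."""
    if capability >= (10, 0):      # Blackwell
        return {"fp8", "fp4", "mx"}
    if capability >= (8, 9):       # Ada (sm_89) and Hopper (sm_90)
        return {"fp8"}
    return set()                   # Ampere and earlier: no FP8 silicon

def pick_precision(capability, requested="fp8"):
    """Return the requested precision if the device supports it, otherwise BF16."""
    return requested if requested in formats_for_capability(capability) else "bf16"

# On a live system the tuple would come from torch.cuda.get_device_capability().
for cap in [(8, 0), (8, 9), (9, 0), (10, 0)]:
    print(cap, pick_precision(cap), pick_precision(cap, requested="fp4"))
```

Logging the value this returns at job start makes a BF16 fallback visible instead of leaving it to show up later as a throughput shortfall.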
Software ecosystem: where TE plugs in
TE is open source (`github.com/NVIDIA/TransformerEngine`) and integrated as a first-class option in every major modern transformer training and inference stack. The integration model is intentionally module-level: you can adopt TE one Linear / LayerNorm at a time, or wrap an entire model, depending on how cautious you want to be.
- Training frameworks: Megatron-Core (TE is the default tensor-parallel building block from v0.5+), NeMo, PyTorch FSDP (with `te.fp8_autocast`), DeepSpeed (FP8-zero stages 1-3), HuggingFace `transformers` + `accelerate`, torchtitan (Meta's PyTorch-native trainer), Axolotl (fine-tune wrapper), Unsloth (fast fine-tune library, FP8 path on Hopper).
- Inference frameworks: vLLM (FP8 kv-cache + FP8 weights via `--quantization fp8 --kv-cache-dtype fp8_e5m2`), TensorRT-LLM (FP8 GEMM plugin, FP4 on Blackwell via `--gemm_plugin fp4`), SGLang (FP8 via TensorRT-LLM backend), Triton Inference Server (TensorRT-LLM backend), TGI (FP8 marlin and machete kernels).
- MoE: TE 1.7+ supports per-expert scaling so MoE routing on Mixtral / DeepSeek-V3 / DBRX stays FP8-stable through expert imbalance.
- MX format support: TE 1.10+ exposes MXFP8 / MXFP4 recipes; Blackwell-only at runtime. AMD MI355X and Intel Gaudi 3 expose MX paths through their own SDKs — the OCP spec gives cross-vendor weight portability.
- Yobibyte's managed inference and fine-tune workspaces select FP8 (Hopper / Hopper-Ada / Blackwell pools) or BF16 (Ampere pools) automatically per-workspace placement. Customers opt out via a `precision: bf16` override; the default is FP8.
Sizing: throughput uplift and accuracy budget
Real-world FP8-vs-BF16 numbers from production training and inference runs. Treat as planning anchors; verify on InferenceBench (or your own eval suite) before locking into a precision choice.
- Rule of thumb: FP8 on Hopper delivers 1.5-1.9x BF16 throughput at iso-accuracy when integrated via TE. Standalone hand-rolled FP8 paths usually land closer to 1.2-1.4x with accuracy regressions.
- KV cache FP8 (`--kv-cache-dtype fp8_e5m2`) is a separate decision from FP8 weights — typically free accuracy-wise and doubles cache capacity (lets you raise `max_model_len` or `max_num_batched_tokens`).
- FP4 inference on Blackwell requires per-channel calibration with a representative dataset (typically 128-512 samples) to stay within the 0.5-1.0 % eval-delta budget. Yobibyte's Blackwell pools handle this in the recipe layer.
- Training FP8 economics: a 1T-token, 70B-parameter pre-train that was 250-350 H100-days on BF16 lands at ~150-200 H100-days with TE FP8 — a meaningful slice off a typical $500K-1M training bill on Yobitel NeoCloud reserved H100 pricing.
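To make the KV-cache point concrete, the arithmetic below estimates cache bytes per token and how many tokens fit in a fixed memory budget. The Llama 3 70B shape used here (80 layers, 8 KV heads, head dimension 128) is taken from the public model config, the 40 GiB cache budget is an arbitrary assumption, and the small overhead of storing FP8 scales is ignored.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: a K and a V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_cached_tokens(budget_bytes, layers, kv_heads, head_dim, bytes_per_elem):
    """How many tokens of KV cache fit in the given budget."""
    return budget_bytes // kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)

LLAMA3_70B = dict(layers=80, kv_heads=8, head_dim=128)   # assumed from the public config
BUDGET = 40 * 2 ** 30                                    # hypothetical 40 GiB left for cache

bf16_tokens = max_cached_tokens(BUDGET, bytes_per_elem=2, **LLAMA3_70B)
fp8_tokens = max_cached_tokens(BUDGET, bytes_per_elem=1, **LLAMA3_70B)
print(kv_bytes_per_token(bytes_per_elem=2, **LLAMA3_70B))  # 327680 bytes (320 KiB) per token
print(bf16_tokens, fp8_tokens)                             # 131072 vs 262144 tokens
```

The doubled token count is what can be spent on a longer `max_model_len` or a larger batch.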
| Workload | Hardware | BF16 throughput | FP8 throughput | Uplift | Loss / eval delta |
|---|---|---|---|---|---|
| Llama 3 70B pre-train (Megatron-Core) | H100 SXM5 x 256 | ~13.8K tokens/s/GPU | ~24.5K tokens/s/GPU | 1.78x | +/- 0.3 % loss through 1T tokens |
| Llama 3 70B QLoRA fine-tune (TRL) | H100 SXM5 x 2 | ~3.2K tokens/s/GPU | ~5.1K tokens/s/GPU | 1.59x | Indistinguishable on eval |
| Llama 3 70B inference (vLLM, 32K ctx) | H100 SXM5 x 2 | ~950 TPS | ~1,650 TPS | 1.74x | <0.5 % eval delta on MMLU/GSM8k |
| Mixtral 8x22B inference (vLLM) | H100 SXM5 x 2 | ~580 TPS | ~1,100 TPS | 1.90x | <0.4 % eval delta |
| Llama 3 8B inference (TensorRT-LLM) | H100 PCIe x 1 | ~6.8K TPS | ~12.4K TPS | 1.82x | Negligible |
| Llama 3 70B inference FP4 | B200 SXM6 x 1 | ~1,950 TPS (FP8) | ~3,400 TPS (FP4) | 1.74x vs FP8 | 0.6-1.0 % eval delta — calibrate |
Always run an eval sweep at 10 %, 25 %, 50 % and 90 % of the planned training horizon when transitioning a recipe from BF16 to FP8. Silent divergence almost always shows up by 50 %; catching it early saves an entire run.
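A sweep like this is easy to encode so it cannot be skipped. The sketch below turns the four fractions into step numbers and applies a relative budget when comparing an FP8 run against its BF16 reference; the function names and the default 0.5 % budget are assumptions chosen for illustration.

```python
def eval_checkpoints(total_steps, fractions=(0.10, 0.25, 0.50, 0.90)):
    """Training steps at which to run the BF16-vs-FP8 eval comparison."""
    return [max(1, round(total_steps * f)) for f in fractions]

def within_budget(bf16_metric, fp8_metric, rel_budget=0.005):
    """True if the FP8 metric is within rel_budget (relative) of the BF16 reference."""
    return abs(fp8_metric - bf16_metric) <= rel_budget * abs(bf16_metric)

print(eval_checkpoints(100_000))      # [10000, 25000, 50000, 90000]
print(within_budget(2.000, 2.008))    # 0.4 % apart: inside a 0.5 % budget
print(within_budget(2.000, 2.020))    # 1.0 % apart: outside the budget
```

Failing the check at the 10 % or 25 % mark is the cheapest point at which to stop and revisit the recipe.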
Cost and TCO
Enabling Transformer Engine FP8 does not change the per-GPU-hour rate; it lowers the cost per token (training or inference) by a factor of 1.5-1.9, which is 33-47 % less per token. On Yobitel NeoCloud H100 SXM5 at $2.00/GPU-hr reserved, the FP8 vs BF16 savings break down as follows.
- Yobibyte exposes precision as a workspace setting; defaulting to FP8 on Hopper-or-newer pools is what makes the published Yobibyte per-token rates competitive against Bedrock and Vertex equivalents.
- Omniscient Compute treats precision as a search dimension when arbitrating capacity — a 'FP8-OK' workload can land on lower-priced Blackwell or Hopper-Ada pools that are blocked for 'BF16 only' jobs.
- Spot/preemptible H100 capacity with FP8 enabled is the cheapest cost-per-token configuration on Yobitel NeoCloud (about 50-60 % of reserved); restricted to fine-tunes only — not inference SLAs.
| Workload | BF16 cost | FP8 cost | Savings | Yobitel NeoCloud anchor |
|---|---|---|---|---|
| Llama 3 70B 1T-token pre-train (256x H100) | ~$1.4M (300 H100-days) | ~$840K (180 H100-days) | ~40 % | H100 SXM5 SuperPOD-256, 3yr reserved |
| 70B QLoRA fine-tune (5 epochs, 1B tokens) | ~$680 | ~$420 | ~38 % | 2x H100 SXM5, on-demand |
| Cost-per-million-output-tokens, Llama 3 70B inference | ~$0.85 | ~$0.50 | ~41 % | 1-2x H100 SXM5, on-demand |
| Cost-per-million-output-tokens, Llama 3 70B inference FP4 | n/a | ~$0.28 | ~67 % vs BF16 | 1x B200 SXM6, on-demand |
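The savings column is the throughput uplift seen from the other side: at a fixed GPU-hour rate, cost per token scales with 1/throughput. A short sketch of that arithmetic follows, using throughput figures from the sizing table; the function names are illustrative and the $2.00 rate is the reserved anchor quoted above, so the absolute dollar values will not match the on-demand rows.

```python
def cost_savings(uplift):
    """Fraction of cost per token saved when throughput rises by `uplift` at a fixed rate."""
    return 1.0 - 1.0 / uplift

def cost_per_million_tokens(gpu_hour_rate, gpus, tokens_per_second):
    """Dollars per million tokens for a deployment running at a steady throughput."""
    return gpu_hour_rate * gpus / (tokens_per_second * 3600) * 1e6

for uplift in (1.5, 1.74, 1.9):
    print(f"{uplift}x uplift -> {cost_savings(uplift):.1%} cheaper per token")

bf16 = cost_per_million_tokens(2.00, gpus=2, tokens_per_second=950)
fp8 = cost_per_million_tokens(2.00, gpus=2, tokens_per_second=1650)
print(round(bf16 / fp8, 2))   # equals the throughput ratio 1650 / 950
```

Because the rate cancels in the ratio, the percentage saving carries over to any pricing tier as long as the measured uplift holds.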
Migration and alternatives
Choices when FP8 is on the table but TE is not a perfect fit, or when targeting non-NVIDIA silicon.
- Migration path BF16 -> TE FP8: replace `nn.Linear` with `te.Linear`, wrap forward in `te.fp8_autocast`, choose `Format.HYBRID`, validate eval at 10/25/50/90 % of training horizon.
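- Migration TE FP8 -> Blackwell MXFP8 / MXFP4: swap the `DelayedScaling` recipe for the MX block-scaling recipe your TE release exposes (for example `MXFP8BlockScaling` in TE 2.x; check the release notes for the exact class names) and run a calibration sweep; the module surface stays the same.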
- Migration NVIDIA TE -> AMD ROCm FP8: the FP8 number format is interoperable but the runtime is different. ROCm uses its own equivalent autocast; weights are bit-compatible but scaling factors are recipe-specific.
| Option | Vendor / scope | When to pick | Trade-off vs TE |
|---|---|---|---|
| NVIDIA Transformer Engine | Hopper / Blackwell | Default for NVIDIA + PyTorch + transformers | None for the supported set |
| PyTorch native FP8 (torch.float8_e4m3fn) | PyTorch 2.4+, Hopper+ | Custom kernels not in TE | You write your own amax tracking |
| TorchAO `Float8Linear` | Hopper+ | torch.compile-first stacks | Smaller surface; less production-vetted than TE |
| AMD ROCm FP8 | MI300X / MI325X / MI355X | All-AMD stacks | Format-compatible with E4M3/E5M2; ROCm ecosystem catching up |
| Intel Gaudi 3 FP8 | Gaudi 3 only | Gaudi-resident workloads | Synapse SDK; smaller stack |
| NVIDIA Modelopt PTQ FP8 | Inference only (TensorRT-LLM) | Post-training quantisation of BF16 checkpoints | Inference path only; no training support |
| MX format ecosystem (OCP standard) | AMD + NVIDIA + Intel | Cross-vendor weight portability | Newer; tooling maturity uneven |
Pitfalls / operational notes
The operational issues we see most often when teams adopt TE FP8 in production, ranked by frequency.
- Calibration skipped: dropping FP8 in without TE's amax tracking produces 1-2 point eval regressions in the 10k-100k-step range. Always integrate via the supported path — there is no 'just cast to FP8' shortcut.
- Format swap: E4M3 forward + E5M2 backward is correct. Swapping them silently degrades training stability — gradients overflow E4M3 routinely.
- Layer-specific sensitivity: input embeddings, output projection / LM head, and the first attention layer tolerate FP8 poorly on some architectures. TE recipes that skip these layers (keep them in BF16) are a standard pattern; Yobibyte recipes do this by default.
- Hardware gating: code paths that assume FP8 silently fall back to BF16 on Ampere — measured throughput will not match Hopper plan. Gate on `torch.cuda.get_device_capability() >= (8, 9)`.
- amax_history_len too small: default 1024 is fine for stable distributions; long-horizon training with mixed-domain data (e.g. code + math + multilingual) benefits from 2048-4096.
- MoE expert imbalance: experts that fire 10x less often than the median have stale amax histories; TE 1.7+ handles per-expert tracking. Older versions silently drift.
- torch.compile interaction: `te.Linear` is fully torch.compile-compatible only on PyTorch 2.4+; older combos require `fullgraph=False`.
- Save / load: FP8 scaling factors must be saved alongside weights. HuggingFace `save_pretrained` from TE-wrapped modules works on `transformers>=4.45`; older versions drop the scales — accuracy silently halves on reload.
- FP4 calibration set size: under-calibrated (32-64 samples) FP4 inference shows 2-3 % eval regressions. Use 128-512 representative samples and verify on a held-out eval before serving.
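The layer-sensitivity item above is usually implemented as a name-based skip list applied when modules are converted. A minimal sketch follows, assuming HuggingFace-style module names; the patterns and helper names are hypothetical and would need adjusting to the actual model implementation.

```python
import fnmatch

# Hypothetical patterns for layers kept in BF16: embeddings, LM head, first attention block.
BF16_PATTERNS = ("*embed_tokens*", "*lm_head*", "*layers.0.self_attn*")

def use_fp8(module_name, skip_patterns=BF16_PATTERNS):
    """True if the module should be converted to an FP8-aware layer."""
    return not any(fnmatch.fnmatchcase(module_name, p) for p in skip_patterns)

def split_by_precision(module_names, skip_patterns=BF16_PATTERNS):
    """Partition module names into (fp8, bf16) lists for logging or review."""
    fp8 = [n for n in module_names if use_fp8(n, skip_patterns)]
    bf16 = [n for n in module_names if not use_fp8(n, skip_patterns)]
    return fp8, bf16

names = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.10.self_attn.q_proj",
    "lm_head",
]
fp8_layers, bf16_layers = split_by_precision(names)
print("FP8:", fp8_layers)
print("BF16:", bf16_layers)
```

Printing the two lists before a long run is a cheap way to confirm the skip list matched what was intended, since a pattern that matches nothing fails silently.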
Where this fits in the Yobitel stack
Transformer Engine FP8 is the default precision for Yobibyte's training and inference workspaces on Hopper-or-newer Yobitel NeoCloud capacity. Customers describe a model and a workspace; Yobibyte selects the precision recipe per placement (FP8 on H100/H200/B200 pools, BF16 on Ampere pools, FP4 with calibration on Blackwell pools) and exposes the result through the standard workspace API. The TE recipe choices in this entry (`HYBRID` format, delayed scaling, layer-skip lists for embeddings and output projections) are the building blocks Yobibyte's managed layer composes — customers see precision = 'fp8' on their workspace, not the underlying amax_history_len.
Yobitel NeoCloud H100, H200 and B200 SKUs are sized in their published throughput tables assuming FP8 enabled, because that is how the silicon is meant to be used in 2026. The cost-per-million-output-tokens numbers in the NeoCloud pricing surface are FP8 numbers; BF16 numbers are available on request and run at roughly 53-67 % of FP8 throughput, the inverse of the 1.5-1.9x uplift.
InferenceBench publishes FP8 and BF16 throughput, latency and cost-per-token numbers for every covered open-weight model on Hopper, Ada and Blackwell SKUs side-by-side, so customers can map their accuracy budget directly to a precision choice and a NeoCloud SKU before committing to a recipe. Omniscient Compute treats precision as a search dimension when arbitrating capacity, so a workspace that opts into FP8 inherits a wider eligible-capacity set across Yobitel NeoCloud and partner clouds.
References
- NVIDIA Transformer Engine GitHub · NVIDIA
- Transformer Engine User Guide · NVIDIA
- FP8 Formats for Deep Learning (Micikevicius et al., 2022) · arXiv
- OCP Microscaling Formats Specification (v1.0) · Open Compute Project
- Megatron-Core FP8 training guide · NVIDIA
- vLLM FP8 quantisation documentation · vLLM
- TensorRT-LLM FP8 / FP4 quantisation · NVIDIA