NVIDIA Transformer Engine & FP8

TL;DR

Transformer Engine (TE) is the software + hardware pair NVIDIA introduced with Hopper to make FP8 / FP4 training and inference safe — fourth-gen Tensor Core silicon (Hopper) plus a PyTorch library that wraps modules with automatic per-tensor scaling factor management.
First generation (Hopper, 2022): FP8 with E4M3 (forward activations/weights) and E5M2 (backward gradients). Second generation (Blackwell, 2024): adds FP4 (E2M1) and the OCP-standard MX microscaling formats (MXFP8, MXFP6, MXFP4).
Throughput uplift on H100 SXM5 measured on real Llama 3 / Mixtral training runs: 1.5-1.9x vs BF16 baseline at iso-accuracy (loss curves within 0.5 % through 1T tokens). Inference uplift on vLLM / TensorRT-LLM: 1.6-2.1x decode TPS.
Integrated as a first-class option in Megatron-Core / Megatron-LM, NeMo, DeepSpeed, PyTorch FSDP, HuggingFace transformers (via accelerate), vLLM, TensorRT-LLM, SGLang and torchtitan. Most modern training stacks ship TE on by default for Hopper-and-newer.
Default in Yobibyte's managed fine-tune recipes for H100 / H200 / B200 workspaces: customers opt out, not in. The managed recipe layer selects formats per architecture so accuracy stays within typical noise bands.

Overview

Transformer Engine (TE) is the software-and-hardware combination NVIDIA introduced with Hopper (2022) to make 8-bit (and now 4-bit) floating-point training and inference safe to use in production. The hardware half is the fourth-generation Tensor Core with native FP8 multiply-accumulate (E4M3 and E5M2 formats) on Hopper, extended to FP4 (E2M1) and OCP microscaling (MX) formats on Blackwell. The software half is an open-source PyTorch library (transformer_engine.pytorch) that wraps standard transformer modules — Linear, LayerNorm, attention, MoE routing — with automatic scaling-factor management, recipe selection and FP8/FP4 autocast.

Both halves matter because low-precision floats have very limited dynamic range. E4M3 represents magnitudes from roughly 2^-9 (its smallest subnormal) up to 448; E5M2 reaches 57,344 by spending one more bit on the exponent and one fewer on the mantissa; FP4 (E2M1) has only seven representable non-zero magnitudes (0.5, 1, 1.5, 2, 3, 4, 6). Naively casting BF16 tensors to FP8 silently underflows the small-magnitude tail and overflows the large-magnitude tail, producing 1-2 point evaluation regressions that look indistinguishable from flaky training runs. TE exists to maintain per-tensor (or per-channel, or per-block in MX) scaling factors and a short amax history, so that each tensor is mapped onto the format's usable range before the cast and the only error left is the format's own rounding.

This entry is the reference for teams deciding when FP8 or FP4 is safe to enable on Hopper / Blackwell across training and inference. It covers the formats and their dynamic ranges, the TE recipe surface, how it integrates with Megatron-Core / NeMo / vLLM / TensorRT-LLM, sizing guidance for the throughput uplift and its accuracy trade-off, the calibration pitfalls that cause silent regressions, and the migration paths from BF16 baselines. Yobibyte's default training and inference recipes enable Transformer Engine FP8 on Yobitel NeoCloud H100 / H200 / B200 capacity so customers inherit the uplift without re-deriving the format choice per architecture.

Specifications: FP8, FP4 and MX number formats

Format reference. Sign / exponent / mantissa widths define the dynamic range and precision; the 'Abs max' column is the ceiling your scaled tensor values must stay under to avoid overflow.

E4M3 departs from IEEE-754 conventions: it has no infinities and reserves a single mantissa pattern per sign for NaN, so the top exponent is usable for finite values and the maximum is 448 (cost: overflow cannot be signalled with inf).
E5M2 is IEEE-754-style: ±inf, ±NaN, subnormals — the natural drop-in replacement for FP16 gradient flow.
MX formats: 32-element blocks share an 8-bit (E8M0) scale exponent, giving finer-grained calibration than per-tensor scaling. Standardised by OCP (Open Compute Project) so AMD MI355X, Intel Gaudi 3 and others can interoperate.
Hopper supports E4M3 and E5M2 only. Blackwell adds FP4 (E2M1), MXFP8, MXFP6 and MXFP4. Ampere and earlier have no FP8 silicon — TE falls back to BF16/FP16.
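To make the MX bullet above concrete, here is a minimal sketch of how one 32-element block could pick its shared power-of-two scale. It follows the rule in the OCP Microscaling specification (shared exponent = floor(log2(block amax)) minus the element format's largest exponent). The helper name `mx_block_scale` and the sample blocks are made up for illustration; real implementations do this inside fused kernels.

```python
import math

# Largest power-of-two exponent of each element format:
# E4M3 max is 448 = 1.75 * 2**8, E2M1 max is 6 = 1.5 * 2**2.
ELEM_EMAX = {"mxfp8_e4m3": 8, "mxfp4_e2m1": 2}
BLOCK = 32  # number of elements that share one E8M0 scale


def mx_block_scale(block, fmt):
    """Return the shared power-of-two scale for one block (hypothetical helper)."""
    if len(block) != BLOCK:
        raise ValueError("MX blocks hold exactly 32 elements")
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0  # any scale works for an all-zero block
    _, exp = math.frexp(amax)  # amax = m * 2**exp with 0.5 <= m < 1
    shared_exp = (exp - 1) - ELEM_EMAX[fmt]  # exp - 1 == floor(log2(amax))
    # E8M0 stores only an exponent, so clamp to its representable range.
    shared_exp = max(-127, min(127, shared_exp))
    return 2.0 ** shared_exp


# Two blocks with very different magnitudes get independent scales.
small = [0.001 * (i + 1) for i in range(BLOCK)]  # amax is about 0.032
large = [10.0 * (i + 1) for i in range(BLOCK)]   # amax is 320
for name, blk in (("small", small), ("large", large)):
    scale = mx_block_scale(blk, "mxfp8_e4m3")
    print(name, "scale:", scale, "scaled amax:", max(abs(v) for v in blk) / scale)
```

With this rule the scaled block maximum always lands between 2^8 and 2^9 for E4M3 elements, so each block uses the top of the element range regardless of its absolute magnitude. A per-tensor scale cannot do that when magnitudes vary across the tensor.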

Format	Sign	Exp	Mantissa	Abs max	Min normal	Typical use
FP32	1	8	23	~3.4e38	~1.2e-38	Master weights, loss scaling reference
BF16	1	8	7	~3.4e38	~1.2e-38	Default training compute on H100/A100
FP16	1	5	10	65,504	~6.1e-5	Legacy mixed-precision training
FP8 E4M3	1	4	3	448 (no inf)	2^-6 ≈ 0.0156	Forward activations + weights
FP8 E5M2	1	5	2	57,344	2^-14 ≈ 6.1e-5	Backward gradients
FP4 E2M1	1	2	1	6 (no inf)	1.0 (subnormal 0.5)	Blackwell inference; some training
MXFP8 (block 32)	1	—	—	Per-block scale (E8M0)	—	Microscaling training/inference
MXFP6 (block 32)	1	—	—	Per-block scale (E8M0)	—	Aggressive Blackwell inference
MXFP4 (block 32)	1	—	—	Per-block scale (E8M0)	—	Frontier Blackwell inference
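The FP8 and FP4 rows of the table can be checked by enumerating every bit pattern. The sketch below does that in plain Python, assuming the usual exponent biases (7 for E4M3, 15 for E5M2, 1 for E2M1). The function name and the `reserved` argument are illustrative, not part of any library.

```python
def positive_values(exp_bits, man_bits, bias, reserved):
    """Enumerate the positive finite values of a small float format.

    reserved = "ieee"     -> the all-ones exponent encodes inf/NaN (E5M2)
    reserved = "nan_only" -> only exponent and mantissa all ones is NaN (E4M3)
    reserved = "none"     -> every encoding is a finite number (E2M1)
    """
    values = set()
    max_exp = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1
    for e in range(max_exp + 1):
        for m in range(max_man + 1):
            if reserved == "ieee" and e == max_exp:
                continue  # inf / NaN encodings
            if reserved == "nan_only" and e == max_exp and m == max_man:
                continue  # the single NaN encoding
            frac = m / (1 << man_bits)
            if e == 0:
                v = frac * 2.0 ** (1 - bias)        # subnormal
            else:
                v = (1 + frac) * 2.0 ** (e - bias)  # normal
            if v > 0:
                values.add(v)
    return sorted(values)


FORMATS = {
    "E4M3": (4, 3, 7, "nan_only"),
    "E5M2": (5, 2, 15, "ieee"),
    "E2M1": (2, 1, 1, "none"),
}
for name, spec in FORMATS.items():
    vals = positive_values(*spec)
    print(name, "count:", len(vals), "min:", vals[0], "max:", vals[-1])
```

The E2M1 list is short enough to read in full, which shows why FP4 needs much finer-grained scaling than FP8.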

Warning: FP8's small dynamic range is the entire reason scaling matters. Without per-tensor scaling, FP8 training is not viable — Transformer Engine exists to handle this automatically. Hand-rolled FP8 paths that skip TE will look fine for 100-1k steps and silently diverge somewhere in the 10k-100k step range.

Architecture: how Transformer Engine handles scaling

TE attaches a scaling-factor history (amax_history) to each FP8 tensor. The pattern is the same for forward activations, weights and backward gradients:

1. Compute the absolute maximum (amax) of the BF16/FP32 tensor before casting.
2. Divide the format's representable max (448 for E4M3, 57,344 for E5M2) by the amax to get the scaling factor.
3. Multiply the tensor by the scaling factor and cast the result to FP8.
4. Store the scaling factor alongside the tensor; apply its inverse on every read.
5. Append the new amax to a short rolling window (default amax_history_len=1024); under delayed scaling, the next step's scaling factor is derived from the window's max.
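The loop above can be sketched in plain Python. This is a toy model, not TE's implementation: it uses the convention scale = format max / amax / 2^margin, rounds onto an enumerated E4M3 grid with a simplified tie rule, and uses a four-entry history. `DelayedScaler` and `to_e4m3` are hypothetical names.

```python
import bisect
from collections import deque

E4M3_MAX = 448.0
# Non-negative E4M3 grid: subnormals (m/8 * 2**-6), then normals
# ((1 + m/8) * 2**(e-7)), skipping the all-ones NaN encoding.
GRID = sorted(
    {(m / 8) * 2.0 ** -6 for m in range(8)}
    | {(1 + m / 8) * 2.0 ** (e - 7)
       for e in range(1, 16) for m in range(8)
       if not (e == 15 and m == 7)}
)


def to_e4m3(x):
    """Round one float to the nearest E4M3 value, saturating at +/-448."""
    mag = min(abs(x), E4M3_MAX)
    i = bisect.bisect_left(GRID, mag)
    lo = GRID[max(i - 1, 0)]
    hi = GRID[min(i, len(GRID) - 1)]
    q = lo if mag - lo <= hi - mag else hi  # ties go down (simplification)
    return -q if x < 0 else q


class DelayedScaler:
    """Toy per-tensor delayed scaling: the scale comes from past amax values."""

    def __init__(self, history_len=4, margin=0):
        self.history = deque(maxlen=history_len)
        self.margin = margin

    def scale(self):
        amax = max(self.history) if self.history else 0.0
        if amax == 0.0:
            return 1.0  # no usable history yet
        return E4M3_MAX / amax / (2 ** self.margin)

    def quantize(self, tensor):
        s = self.scale()                                  # from previous steps
        q = [to_e4m3(v * s) for v in tensor]              # scale, then cast
        self.history.append(max(abs(v) for v in tensor))  # record this amax
        return q, 1.0 / s                                 # FP8 values, inverse scale


scaler = DelayedScaler()
for step, tensor in enumerate([[0.01, -0.02, 0.005], [0.015, -0.03, 0.01]]):
    q, scale_inv = scaler.quantize(tensor)
    print(step, [v * scale_inv for v in q])
```

In the second step the tensor's amax (0.03) is larger than anything in the history (0.02), so its largest element clamps to the format maximum. That is the saturation risk delayed scaling accepts, and it is what the margin and history-length settings are there to manage.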

TE supports three recipes: HYBRID (E4M3 forward, E5M2 backward — the default and the one used in almost every production FP8 training run), E4M3 (E4M3 everywhere, narrower gradient range but more activation precision — for inference and short fine-tunes), and the MX-format recipes on Blackwell (MXFP8, MXFP4).

Two scaling strategies: 'delayed scaling' (use the previous-step amax — one fewer host-device sync per step, ~2-4 % throughput uplift, negligible accuracy cost) and 'current scaling' (compute amax on the in-flight tensor — slightly more accurate, slightly slower). Delayed is default.
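The difference between the two strategies shows up when the amax changes abruptly. The sketch below compares the scale each strategy would pick for a made-up amax sequence with a three-step history. The function name and the sequence are illustrative only.

```python
from collections import deque

E4M3_MAX = 448.0


def compare_scaling(amax_per_step, history_len=3):
    """Return (amax, current_scale, delayed_scale, delayed_saturates) per step."""
    history = deque(maxlen=history_len)
    rows = []
    for amax in amax_per_step:
        current = E4M3_MAX / amax  # computed from the in-flight tensor
        delayed = E4M3_MAX / max(history) if history else current
        saturates = amax * delayed > E4M3_MAX  # largest value exceeds 448
        rows.append((amax, current, delayed, saturates))
        history.append(amax)
    return rows


for row in compare_scaling([1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 1.0]):
    print(row)
```

At the spike (amax 4.0) the delayed scale is still tuned for amax 1.0, so the largest values saturate. For the next three steps the spike stays in the window and the delayed scale is four times smaller than it needs to be. Current scaling avoids both effects at the cost of computing amax before each cast.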

amax_history_len: rolling window size for amax tracking. Default 1024; raise to 2048-4096 if loss spikes during long-horizon training.
fp8_format: HYBRID (E4M3 forward + E5M2 backward) is the production default. Pure E4M3 is intended for inference and short fine-tunes.
fp8_dpa: enable FP8 dot-product attention (DPA), so the attention matmuls run in FP8 as well as the Linear layers.
margin: number of bits of headroom in the scaling factor (the scale is divided by 2^margin). Default 0; raise to 1-2 for noisy gradient distributions.
TE wraps modules in-place: replace torch.nn.Linear with transformer_engine.pytorch.Linear and the same module accepts FP8 amax tracking with no caller-side changes.

# Minimal TE FP8 fine-tune (illustrative; Yobibyte hides this behind a recipe)
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling

# Hybrid recipe: E4M3 forward, E5M2 backward (production default)
fp8_recipe = DelayedScaling(
    margin=0,
    fp8_format=Format.HYBRID,
    amax_history_len=1024,
    amax_compute_algo="max",
)

# Replace nn.Linear with te.Linear: same constructor arguments, FP8-aware.
# Layer widths must chain (4096 -> 11008 -> 4096) for nn.Sequential to run.
model = torch.nn.Sequential(
    te.Linear(4096, 11008, bias=False),           # up proj
    te.LayerNormLinear(11008, 4096, bias=False),  # LayerNorm + down proj
).cuda()

# Placeholder data: random activations standing in for a real dataloader.
# FP8 GEMMs need aligned shapes, so the batch size is a multiple of 16.
dataloader = [torch.randn(32, 4096) for _ in range(10)]

# Train step: wrap the forward pass in te.fp8_autocast, run backward outside it
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
for batch in dataloader:
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        out = model(batch.cuda())
        loss = out.float().mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Form factor / hardware support

Transformer Engine is silicon-bound: only Ada, Hopper and Blackwell Tensor Cores execute FP8 multiply-accumulate natively, and only Blackwell executes FP4. On Ampere and earlier, TE falls back to BF16/FP16: the library still imports and runs, but the throughput uplift is zero.

Native FP8 throughput on H100 SXM5: 1,979 TFLOPS dense (3,958 with 2:4 sparsity) — exactly 2x the BF16 figure on the same silicon.
Native FP4 throughput on B200: roughly 9,000 TFLOPS dense (18,000 with sparsity) per NVIDIA's published figures, which is 2x the FP8 figure on the same silicon.
Power: enabling FP8 / FP4 does not change board TDP, but compute is denser so observed power draw rises closer to TDP (a property to plan rack cooling around).
Confidential Compute (CC-on) is orthogonal — FP8/FP4 paths work under attestation with a uniform ~3-5 % overhead.

GPU family	Compute cap	FP8 (E4M3/E5M2)	FP4 (E2M1)	MX formats	Notes
A100 (Ampere)	sm_80	No	No	No	TE falls back to BF16; throughput baseline.
L4 / L40S / RTX 6000 Ada	sm_89	Yes (Ada gen-4 Tensor Core)	No	No	Same FP8 paths as Hopper; reduced peak FLOPS.
H100 / H200 (Hopper)	sm_90 / sm_90a	Yes (native MMA)	No	No	Production sweet spot for FP8 training and inference.
B100 / B200 (Blackwell)	sm_100	Yes	Yes	MXFP8 / MXFP6 / MXFP4	Second-generation Transformer Engine.
GB200 / GB300 NVL72	sm_100	Yes	Yes	MX + per-block scale offload	MX format scale-factor offload to NVSwitch SHARP-v3.
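The table above reduces to a small capability check. The sketch below encodes it as a pure function so it can be unit-tested without a GPU. In a real job the tuple would come from `torch.cuda.get_device_capability()`. The function names are illustrative.

```python
def supported_low_precision(capability):
    """Map a (major, minor) compute capability to the formats in the table."""
    fp8 = capability >= (8, 9)         # Ada sm_89, Hopper sm_90, Blackwell sm_100
    blackwell = capability >= (10, 0)  # FP4 and MX formats need Blackwell
    return {"fp8": fp8, "fp4": blackwell, "mx": blackwell}


def pick_precision(capability, want="fp8"):
    """Return the requested precision if the GPU supports it, else 'bf16'."""
    return want if supported_low_precision(capability).get(want, False) else "bf16"


for name, cap in (("A100", (8, 0)), ("L40S", (8, 9)), ("H100", (9, 0)), ("B200", (10, 0))):
    print(name, supported_low_precision(cap), "->", pick_precision(cap))
```

Making the decision explicit and logging it is usually safer than relying on a silent fallback, because a job planned for FP8 throughput that lands on Ampere will otherwise just run slower than expected.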

Software ecosystem: where TE plugs in

TE is open source (github.com/NVIDIA/TransformerEngine) and integrated as a first-class option in every major modern transformer training and inference stack. The integration model is intentionally module-level: you can adopt TE one Linear / LayerNorm at a time, or wrap an entire model, depending on how cautious you want to be.

Training frameworks: Megatron-Core (TE is the default tensor-parallel building block from v0.5+), NeMo, PyTorch FSDP (with te.fp8_autocast), DeepSpeed (FP8-zero stages 1-3), HuggingFace transformers + accelerate, torchtitan (Meta's PyTorch-native trainer), Axolotl (fine-tune wrapper), Unsloth (fast fine-tune library, FP8 path on Hopper).
Inference frameworks: vLLM (FP8 kv-cache + FP8 weights via --quantization fp8 --kv-cache-dtype fp8_e5m2), TensorRT-LLM (FP8 GEMM plugin, FP4 on Blackwell via --gemm_plugin fp4), SGLang (FP8 via TensorRT-LLM backend), Triton Inference Server (TensorRT-LLM backend), TGI (FP8 marlin and machete kernels).
MoE: TE 1.7+ supports per-expert scaling so MoE routing on Mixtral / DeepSeek-V3 / DBRX stays FP8-stable through expert imbalance.
MX format support: TE 1.10+ exposes MXFP8 / MXFP4 recipes; Blackwell-only at runtime. AMD MI355X and Intel Gaudi 3 expose MX paths through their own SDKs — the OCP spec gives cross-vendor weight portability.
Yobibyte's managed inference and fine-tune workspaces select FP8 (Hopper / Hopper-Ada / Blackwell pools) or BF16 (Ampere pools) automatically per-workspace placement. Customers opt out via a precision: bf16 override; the default is FP8.

Sizing: throughput uplift and accuracy budget

Real-world FP8-vs-BF16 numbers from production training and inference runs. Treat as planning anchors; verify on InferenceBench (or your own eval suite) before locking into a precision choice.

Rule of thumb: FP8 on Hopper delivers 1.5-1.9x BF16 throughput at iso-accuracy when integrated via TE. Standalone hand-rolled FP8 paths usually land closer to 1.2-1.4x with accuracy regressions.
KV cache FP8 (--kv-cache-dtype fp8_e5m2) is a separate decision from FP8 weights — typically free accuracy-wise and doubles cache capacity (lets you raise max_model_len or max_num_batched_tokens).
FP4 inference on Blackwell requires per-channel calibration with a representative dataset (typically 128-512 samples) to stay within the 0.5-1.0 % eval-delta budget. Yobibyte's Blackwell pools handle this in the recipe layer.
Training FP8 economics: a 1T-token, 70B-parameter pre-train that was 250-350 H100-days on BF16 lands at ~150-200 H100-days with TE FP8 — a meaningful slice off a typical $500K-1M training bill on Yobitel NeoCloud reserved H100 pricing.
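The "doubles cache capacity" claim for FP8 KV cache follows from simple arithmetic. The sketch below works it through for Llama 3 70B (80 layers, 8 KV heads under grouped-query attention, head dimension 128). The 40 GiB cache budget is an arbitrary assumption for illustration, not a measured figure.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(budget_gib, per_token_bytes):
    """How many tokens fit in a cache budget given in GiB."""
    return (budget_gib * 1024 ** 3) // per_token_bytes


# Llama 3 70B: 80 layers, 8 KV heads, head_dim 128
bf16_bytes = kv_cache_bytes_per_token(80, 8, 128, bytes_per_elem=2)
fp8_bytes = kv_cache_bytes_per_token(80, 8, 128, bytes_per_elem=1)

budget_gib = 40  # hypothetical memory left for KV cache after weights
print("BF16:", bf16_bytes, "bytes/token,", max_cached_tokens(budget_gib, bf16_bytes), "tokens")
print("FP8: ", fp8_bytes, "bytes/token,", max_cached_tokens(budget_gib, fp8_bytes), "tokens")
```

The extra token budget can be spent on longer contexts or on more concurrent sequences, which is why the KV cache dtype is worth deciding separately from weight precision.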

Workload	Hardware	BF16 throughput	FP8 throughput	Uplift	Loss / eval delta
Llama 3 70B pre-train (Megatron-Core)	H100 SXM5 x 256	~13.8K tokens/s/GPU	~24.5K tokens/s/GPU	1.78x	+/- 0.3 % loss through 1T tokens
Llama 3 70B QLoRA fine-tune (TRL)	H100 SXM5 x 2	~3.2K tokens/s/GPU	~5.1K tokens/s/GPU	1.59x	Indistinguishable on eval
Llama 3 70B inference (vLLM, 32K ctx)	H100 SXM5 x 2	~950 TPS	~1,650 TPS	1.74x	<0.5 % eval delta on MMLU/GSM8k
Mixtral 8x22B inference (vLLM)	H100 SXM5 x 2	~580 TPS	~1,100 TPS	1.90x	<0.4 % eval delta
Llama 3 8B inference (TensorRT-LLM)	H100 PCIe x 1	~6.8K TPS	~12.4K TPS	1.82x	Negligible
Llama 3 70B inference FP4	B200 SXM6 x 1	~1,950 TPS (FP8)	~3,400 TPS (FP4)	1.74x vs FP8	0.6-1.0 % eval delta — calibrate

Tip: Always run an eval sweep at 10 %, 25 %, 50 % and 90 % of the planned training horizon when transitioning a recipe from BF16 to FP8. Silent divergence almost always shows up by 50 %; catching it early saves an entire run.
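A minimal sketch of the sweep schedule from the tip above, plus a relative-loss check against the BF16 baseline. The 0.5 % tolerance mirrors the loss-curve band quoted in the TL;DR. Both function names and the default tolerance are assumptions to adapt to your own eval suite.

```python
def eval_checkpoints(total_steps, percents=(10, 25, 50, 90)):
    """Training steps at which to run the BF16-vs-FP8 eval comparison."""
    return [max(1, total_steps * p // 100) for p in percents]


def diverged(bf16_loss, fp8_loss, tolerance=0.005):
    """True if the FP8 loss differs from the BF16 loss by more than tolerance."""
    return abs(fp8_loss - bf16_loss) / bf16_loss > tolerance


print(eval_checkpoints(100_000))
print(diverged(2.000, 2.005), diverged(2.000, 2.050))
```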

Cost and TCO

Enabling Transformer Engine FP8 does not change the per-GPU-hour rate; it lowers the cost per token (training or inference) because each GPU-hour produces 1.5-1.9x more tokens, which works out to roughly 33-47 % lower cost per token. On Yobitel NeoCloud H100 SXM5 at $2.00/GPU-hr reserved, the FP8 vs BF16 savings break down as follows.

Yobibyte exposes precision as a workspace setting; defaulting to FP8 on Hopper-or-newer pools is what makes the published Yobibyte per-token rates competitive against Bedrock and Vertex equivalents.
Omniscient Compute treats precision as a search dimension when arbitrating capacity — a 'FP8-OK' workload can land on lower-priced Blackwell or Hopper-Ada pools that are blocked for 'BF16 only' jobs.
Spot/preemptible H100 capacity with FP8 enabled is the cheapest cost-per-token configuration on Yobitel NeoCloud (about 50-60 % of reserved); restricted to fine-tunes only — not inference SLAs.

Workload	BF16 cost	FP8 cost	Savings	Yobitel NeoCloud anchor
Llama 3 70B 1T-token pre-train (256x H100)	~$1.4M (300 H100-days)	~$840K (180 H100-days)	~40 %	H100 SXM5 SuperPOD-256, 3yr reserved
70B QLoRA fine-tune (5 epochs, 1B tokens)	~$680	~$420	~38 %	2x H100 SXM5, on-demand
Cost-per-million-output-tokens, Llama 3 70B inference	~$0.85	~$0.50	~41 %	1-2x H100 SXM5, on-demand
Cost-per-million-output-tokens, Llama 3 70B inference FP4	n/a	~$0.28	~67 % vs BF16	1x B200 SXM6, on-demand
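The savings column follows from the throughput uplift: at a fixed GPU-hour rate, a speedup of u cuts the cost per token by 1 - 1/u. The sketch below shows that relationship and a generic cost-per-million-tokens helper. The inputs in the last line are placeholders, not figures from the table.

```python
def cost_saving(uplift):
    """Fractional cost-per-token saving for a throughput uplift at a fixed rate."""
    return 1.0 - 1.0 / uplift


def cost_per_million_tokens(gpu_hour_rate, num_gpus, tokens_per_second):
    """Dollar cost to produce one million tokens on num_gpus GPUs."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_rate * num_gpus / tokens_per_hour * 1_000_000


for uplift in (1.5, 1.74, 1.9):
    print(f"{uplift:.2f}x uplift -> {cost_saving(uplift):.1%} lower cost per token")

# Placeholder inputs: $2.00/GPU-hr, 1 GPU, 1,000 tokens/s aggregate
print(round(cost_per_million_tokens(2.00, 1, 1000), 4))
```

Because the saving is 1 - 1/u rather than u - 1, a 1.9x uplift saves a little under half the bill, not 90 % of it.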

Migration and alternatives

Choices when FP8 is on the table but TE is not a perfect fit, or when targeting non-NVIDIA silicon.

Migration path BF16 -> TE FP8: replace nn.Linear with te.Linear, wrap forward in te.fp8_autocast, choose Format.HYBRID, validate eval at 10/25/50/90 % of training horizon.
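Migration TE FP8 -> Blackwell MXFP8 / MXFP4: switch the recipe object to the MX block-scaling recipe shipped with your TE release (MXFP8BlockScaling in recent versions; check the TE documentation for the FP4 recipe name) and run a calibration sweep; same module surface.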
Migration NVIDIA TE -> AMD ROCm FP8: the FP8 number format is interoperable but the runtime is different. ROCm uses its own equivalent autocast; weights are bit-compatible but scaling factors are recipe-specific.

Option	Vendor / scope	When to pick	Trade-off vs TE
NVIDIA Transformer Engine	Hopper / Blackwell	Default for NVIDIA + PyTorch + transformers	None for the supported set
PyTorch native FP8 (torch.float8_e4m3fn)	PyTorch 2.4+, Hopper+	Custom kernels not in TE	You write your own amax tracking
TorchAO `Float8Linear`	Hopper+	torch.compile-first stacks	Smaller surface; less production-vetted than TE
AMD ROCm FP8	MI300X / MI325X / MI355X	All-AMD stacks	Format-compatible with E4M3/E5M2; ROCm ecosystem catching up
Intel Gaudi 3 FP8	Gaudi 3 only	Gaudi-resident workloads	Synapse SDK; smaller stack
NVIDIA TensorRT Model Optimizer (ModelOpt) PTQ FP8	Inference only (TensorRT-LLM)	Post-training quantisation of BF16 checkpoints	Inference path only; no training support
MX format ecosystem (OCP standard)	AMD + NVIDIA + Intel	Cross-vendor weight portability	Newer; tooling maturity uneven

Pitfalls / operational notes

The operational issues we see most often when teams adopt TE FP8 in production, ranked by frequency.

Calibration skipped: dropping FP8 in without TE's amax tracking produces 1-2 point eval regressions in the 10k-100k-step range. Always integrate via the supported path — there is no 'just cast to FP8' shortcut.
Format swap: E4M3 forward + E5M2 backward is correct. Swapping them silently degrades training stability — gradients overflow E4M3 routinely.
Layer-specific sensitivity: input embeddings, output projection / LM head, and the first attention layer tolerate FP8 poorly on some architectures. TE recipes that skip these layers (keep them in BF16) are a standard pattern; Yobibyte recipes do this by default.
Hardware gating: code paths that assume FP8 silently fall back to BF16 on Ampere — measured throughput will not match Hopper plan. Gate on torch.cuda.get_device_capability() >= (8, 9).
amax_history_len too small: default 1024 is fine for stable distributions; long-horizon training with mixed-domain data (e.g. code + math + multilingual) benefits from 2048-4096.
MoE expert imbalance: experts that fire 10x less often than the median have stale amax histories; TE 1.7+ handles per-expert tracking. Older versions silently drift.
torch.compile interaction: te.Linear is fully torch.compile-compatible only on PyTorch 2.4+; older combos require fullgraph=False.
Save / load: FP8 scaling factors must be saved alongside weights. HuggingFace save_pretrained from TE-wrapped modules works on transformers>=4.45; older versions drop the scales — accuracy silently halves on reload.
FP4 calibration set size: under-calibrated (32-64 samples) FP4 inference shows 2-3 % eval regressions. Use 128-512 representative samples and verify on a held-out eval before serving.
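The layer-skip pattern from the pitfalls above can be expressed as a per-module precision plan. The sketch below uses glob patterns over module names. The patterns and the module names are hypothetical and need to be matched to your model's actual naming; this only illustrates the selection logic, not how TE applies it.

```python
from fnmatch import fnmatchcase

# Hypothetical name patterns for the FP8-sensitive layers: embeddings,
# the LM head and the first attention block.
KEEP_BF16 = ("*embed*", "*lm_head*", "layers.0.attn*")


def precision_plan(module_names, keep_bf16=KEEP_BF16):
    """Assign 'bf16' to modules matching a skip pattern and 'fp8' to the rest."""
    return {
        name: "bf16" if any(fnmatchcase(name, p) for p in keep_bf16) else "fp8"
        for name in module_names
    }


names = [
    "embed_tokens",
    "layers.0.attn.qkv",
    "layers.0.mlp.up",
    "layers.1.attn.qkv",
    "layers.10.attn.qkv",
    "lm_head",
]
for name, precision in precision_plan(names).items():
    print(f"{name:22s} {precision}")
```

Keeping the plan as data makes it easy to log alongside eval results, so a regression can be traced to a specific layer's precision rather than to the recipe as a whole.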

Where this fits in the Yobitel stack

Transformer Engine FP8 is the default precision for Yobibyte's training and inference workspaces on Hopper-or-newer Yobitel NeoCloud capacity. Customers describe a model and a workspace; Yobibyte selects the precision recipe per placement (FP8 on H100/H200/B200 pools, BF16 on Ampere pools, FP4 with calibration on Blackwell pools) and exposes the result through the standard workspace API. The TE recipe choices in this entry (HYBRID format, delayed scaling, layer-skip lists for embeddings and output projections) are the building blocks Yobibyte's managed layer composes — customers see precision = 'fp8' on their workspace, not the underlying amax_history_len.

Yobitel NeoCloud H100, H200 and B200 SKUs are sized in their published throughput tables assuming FP8 enabled, because that is how the silicon is meant to be used in 2026. The cost-per-million-output-tokens numbers in the NeoCloud pricing surface are FP8 numbers; BF16 numbers are available on request and run at roughly 53-67 % of FP8 throughput, the inverse of the 1.5-1.9x uplift.

InferenceBench publishes FP8 and BF16 throughput, latency and cost-per-token numbers for every covered open-weight model on Hopper, Ada and Blackwell SKUs side-by-side, so customers can map their accuracy budget directly to a precision choice and a NeoCloud SKU before committing to a recipe. Omniscient Compute treats precision as a search dimension when arbitrating capacity, so a workspace that opts into FP8 inherits a wider eligible-capacity set across Yobitel NeoCloud and partner clouds.

References

NVIDIA Transformer Engine GitHub · NVIDIA
Transformer Engine User Guide · NVIDIA
FP8 Formats for Deep Learning (Micikevicius et al., 2022) · arXiv
OCP Microscaling Formats Specification (v1.0) · Open Compute Project
Megatron-Core FP8 training guide · NVIDIA
vLLM FP8 quantisation documentation · vLLM
TensorRT-LLM FP8 / FP4 quantisation · NVIDIA
