TL;DR
- Second-generation Trainium, generally available since late 2024; ~1.3 PFLOPS dense FP8 (~0.67 PFLOPS dense BF16) per chip with 96 GB HBM.
- Trn2 instances carry 16 chips with 1.5 TB of total HBM; the Trn2 UltraServer links four such instances into a single 64-chip NeuronLink domain.
- Headline customer: Anthropic, whose Project Rainier is a Trainium 2 cluster of hundreds of thousands of chips.
- Significant step up from Trainium 1 in both raw throughput (8 NeuronCores per chip versus 2) and software maturity.
Overview
Trainium 2 is AWS's second-generation training accelerator, announced at re:Invent 2023, made generally available at re:Invent 2024, and rolled out through 2025. Per-chip throughput steps up substantially over Trainium 1, HBM grows to 96 GB, and the Trn2 instance family extends to a 64-chip UltraServer that behaves as a single tightly coupled training unit.
The headline deployment is Anthropic's Project Rainier, a Trainium 2 cluster of hundreds of thousands of chips announced in late 2024 as the substrate for future Claude model training. The commitment shifted Trainium 2 from an AWS niche to a credible frontier training option.
Specifications
| Metric | Trainium 2 (per chip) |
|---|---|
| BF16 (dense) | ~0.67 PFLOPS |
| FP8 (dense) | ~1.3 PFLOPS |
| FP8 (sparse) | ~2.6 PFLOPS |
| Memory | 96 GB HBM |
| Memory bandwidth | 2.9 TB/s |
| NeuronCores per chip | 8 (NeuronCore v3) |
| Inter-chip link | NeuronLink v3 |
| Trn2 instance | 16 chips, 1.5 TB HBM |
| Trn2 UltraServer | 64 chips in one NeuronLink domain |
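The instance-level figures follow from the per-chip rows by simple multiplication. The sketch below reproduces that arithmetic for HBM capacity and HBM bandwidth only. The helper name `aggregate` is illustrative, and the 1024 GB-per-TB conversion is an assumption chosen because it makes 16 × 96 GB come out to the table's 1.5 TB.

```python
# Per-chip figures copied from the specification table.
HBM_GB_PER_CHIP = 96
HBM_BW_TBPS_PER_CHIP = 2.9


def aggregate(chips: int) -> dict:
    """Scale per-chip HBM capacity and bandwidth to a group of chips.

    Assumes 1 TB = 1024 GB, which matches the table's 1.5 TB for 16 chips.
    """
    return {
        "chips": chips,
        "hbm_tb": chips * HBM_GB_PER_CHIP / 1024,
        "hbm_bw_tbps": chips * HBM_BW_TBPS_PER_CHIP,
    }


trn2_instance = aggregate(16)
ultraserver = aggregate(64)

for name, agg in (("Trn2 instance", trn2_instance), ("UltraServer", ultraserver)):
    print(f"{name}: {agg['chips']} chips, "
          f"{agg['hbm_tb']:.1f} TB HBM, {agg['hbm_bw_tbps']:.1f} TB/s aggregate HBM bandwidth")
```

Aggregate bandwidth here is a sum of per-chip HBM bandwidth, not a statement about inter-chip NeuronLink bandwidth, which the table does not quantify.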
Architecture Notes
Trainium 2 introduces NeuronCore v3 with eight cores per chip, a substantial jump from Trainium 1's two cores per chip. Dense FP8 runs at roughly twice the BF16 rate at this generation, narrowing the precision gap with H100 / H200.
The UltraServer concept is the architectural standout. Sixty-four Trainium 2 chips, spread across four Trn2 instances, share a NeuronLink v3 domain that AWS exposes as a single tightly coupled training unit, conceptually similar to NVIDIA's NVL72. Project Rainier composes UltraServers into the larger cluster.
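One practical consequence of a 64-chip domain is how much model state can stay inside it. The following is a rough sizing sketch, not an AWS formula. It assumes mixed-precision Adam at 16 bytes per parameter, model state sharded evenly across all chips, and a hypothetical `usable_fraction` of 0.7 to leave room for activations and runtime buffers. The function name and defaults are illustrative.

```python
# Bytes of model state per parameter under mixed-precision Adam (assumption):
# bf16 weights (2) + bf16 gradients (2) + fp32 master weights (4)
# + fp32 Adam first moment (4) + fp32 Adam second moment (4).
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def max_params_billions(chips: int,
                        hbm_gb_per_chip: float = 96.0,
                        usable_fraction: float = 0.7) -> float:
    """Upper bound on parameter count (in billions) whose training state fits in HBM.

    Assumes state is sharded evenly over all chips; usable_fraction is a
    hypothetical allowance for activations and buffers, not a measured value.
    """
    usable_gb = chips * hbm_gb_per_chip * usable_fraction
    # GB divided by bytes-per-parameter yields billions of parameters.
    return usable_gb / BYTES_PER_PARAM


for chips in (16, 64):
    print(f"{chips} chips: up to ~{max_params_billions(chips):.0f}B parameters of training state")
```

Real limits depend on sequence length, batch size, activation checkpointing, and the parallelism scheme, so treat the result as an order-of-magnitude check.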
When to Pick Trainium 2
- Large-scale training on AWS where pricing and integration with the rest of the AWS stack are advantageous.
- Workloads where Neuron SDK has matured to acceptable parity for the model family in question.
- Multi-vendor strategies seeking a non-NVIDIA training option without on-prem buildout.
- Pick H100 / H200 / B200 if CUDA ecosystem features (FlashAttention, TensorRT-LLM) dominate your requirements.
- Pick TPU v5p / Trillium if Google Cloud and JAX are already the standard.
Pitfalls
- Software ecosystem still narrower than CUDA, though significantly improved over Trainium 1.
- FP8 calibration tooling lags NVIDIA Transformer Engine.
- Workload portability across vendors requires explicit architecture-aware code paths.
- AWS-exclusive: there is no on-prem option.
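The portability pitfall above usually shows up as an explicit backend switch somewhere in the training code. A minimal sketch of that pattern follows, assuming the Neuron PyTorch package is importable as `torch_neuronx` and that the training script keys its settings off a backend string. `detect_backend` and `BACKEND_DEFAULTS` are hypothetical names, and the settings values are placeholders.

```python
import importlib.util
from typing import Callable, Optional


def detect_backend(
    find_spec: Callable[[str], Optional[object]] = importlib.util.find_spec,
) -> str:
    """Return "neuron", "cuda", or "cpu" based on which packages are installed.

    find_spec is injectable so the logic can be exercised without the
    accelerator packages present.
    """
    if find_spec("torch_neuronx") is not None:
        return "neuron"
    if find_spec("torch") is not None:
        import torch  # imported lazily so the module loads without PyTorch

        if torch.cuda.is_available():
            return "cuda"
    return "cpu"


# Hypothetical per-backend settings; values are placeholders for illustration.
BACKEND_DEFAULTS = {
    "neuron": {"device": "xla", "dtype": "bfloat16"},
    "cuda": {"device": "cuda", "dtype": "bfloat16"},
    "cpu": {"device": "cpu", "dtype": "float32"},
}

if __name__ == "__main__":
    backend = detect_backend()
    print(backend, BACKEND_DEFAULTS[backend])
```

Keeping the switch in one place makes it easier to see which kernels, precisions, and parallelism settings actually differ per vendor.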
Software Notes
AWS Neuron SDK 2.21 and later, PyTorch/XLA on Neuron (torch-neuronx), and JAX on Neuron all support Trainium 2. Hugging Face Optimum-Neuron has Trainium 2 recipes for Llama, Mixtral, and other common families. NeuronX Distributed (NxD) provides tensor, pipeline, and sharded data parallelism for large-scale training, filling the role that FSDP plays on GPUs.