AWS Trainium 2

TL;DR

Second-generation Trainium, launched in late 2024; ~1.3 PFLOPS dense FP8 (~0.67 PFLOPS dense BF16) per chip with 96 GB HBM.
Trn2 instances combine 16 chips with 1.5 TB of total HBM; a Trn2 UltraServer links four such instances into a 64-chip NeuronLink domain.
Headline customer: Anthropic, whose Project Rainier is a cluster of hundreds of thousands of Trainium 2 chips.
Significant improvement over Trainium 1 on both raw throughput and software maturity.

Overview

Trainium 2 is AWS's second-generation training accelerator, announced at re:Invent 2023, made generally available at re:Invent 2024, and rolled out through 2025. Per-chip throughput steps up substantially over Trainium 1, HBM grows from 32 GB to 96 GB, and the Trn2 instance family extends to a 64-chip UltraServer that behaves as a single tightly coupled training unit.

The headline deployment is Anthropic's Project Rainier, a cluster of hundreds of thousands of Trainium 2 chips announced in late 2024 as the substrate for future Claude model training. The commitment shifted Trainium 2 from 'AWS niche' to 'real frontier training option'.

Specifications

Metric	Trainium 2 (per chip)
BF16 (dense)	~0.67 PFLOPS
FP8 (dense)	~1.3 PFLOPS
Memory	96 GB HBM
Memory bandwidth	2.9 TB/s
NeuronCores per chip	8 (NeuronCore v3)
Inter-chip link	NeuronLink v3
Trn2 instance	16 chips, 1.5 TB HBM
Trn2 UltraServer	64 chips (four Trn2 instances) in one NeuronLink domain
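
The instance-level HBM figure follows from the per-chip rows by simple multiplication. A minimal sketch of that arithmetic, using only the memory and bandwidth values from the table; the function and constant names are illustrative, and 1 TB is taken as 1024 GB so that the result matches the quoted 1.5 TB:

```python
# Per-chip figures copied from the specifications table.
HBM_PER_CHIP_GB = 96
HBM_BW_PER_CHIP_TBPS = 2.9
CHIPS_PER_TRN2_INSTANCE = 16
CHIPS_PER_ULTRASERVER = 64


def aggregate(chips):
    """Return (total HBM in GB, total HBM bandwidth in TB/s) for `chips` chips."""
    return chips * HBM_PER_CHIP_GB, chips * HBM_BW_PER_CHIP_TBPS


for name, chips in [("Trn2 instance", CHIPS_PER_TRN2_INSTANCE),
                    ("Trn2 UltraServer", CHIPS_PER_ULTRASERVER)]:
    hbm_gb, bw_tbps = aggregate(chips)
    # Assumption: 1 TB = 1024 GB, which reproduces the quoted 1.5 TB per instance.
    print(f"{name}: {hbm_gb} GB HBM ({hbm_gb / 1024:.1f} TB), {bw_tbps:.1f} TB/s aggregate HBM bandwidth")
```

The aggregate bandwidth is a plain sum of per-chip HBM bandwidth, not a measure of inter-chip NeuronLink throughput.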

Architecture Notes

Trainium 2 introduces NeuronCore v3 with eight cores per chip, up from two cores per chip on Trainium 1. FP8 support arrives at this generation, narrowing the precision gap with H100 / H200.

The UltraServer concept is the architectural standout. 64 Trainium 2 chips share a NeuronLink v3 domain that AWS exposes as a single tightly coupled training unit, conceptually similar to NVIDIA's NVL72. Project Rainier composes UltraServers into the larger cluster.
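
To get a feel for when a job outgrows a single Trn2 instance and needs an UltraServer, a rough sizing sketch can help. It assumes the common mixed-precision Adam rule of thumb of 16 bytes of training state per parameter (BF16 weights and gradients plus FP32 optimizer state), ignores activations and fragmentation, and treats 1 GB as 10^9 bytes. The function names are hypothetical, and the result is a lower bound rather than a deployment recommendation:

```python
import math

HBM_PER_CHIP_GB = 96          # per-chip HBM from the specifications
CHIPS_PER_INSTANCE = 16       # Trn2 instance
CHIPS_PER_ULTRASERVER = 64    # Trn2 UltraServer
# Assumption: 2 B BF16 weights + 2 B BF16 gradients + 12 B FP32 Adam state.
BYTES_PER_PARAM = 16


def min_chips_for_state(params_billions, bytes_per_param=BYTES_PER_PARAM):
    """Lower bound on chips needed to hold weights, gradients and optimizer state."""
    state_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return math.ceil(state_gb / HBM_PER_CHIP_GB)


def smallest_unit(params_billions):
    """Name the smallest Trn2 unit whose pooled HBM covers the training state."""
    chips = min_chips_for_state(params_billions)
    if chips <= CHIPS_PER_INSTANCE:
        return "Trn2 instance"
    if chips <= CHIPS_PER_ULTRASERVER:
        return "Trn2 UltraServer"
    return "multiple UltraServers"


for size in (70, 300, 500):
    print(f"{size}B parameters: at least {min_chips_for_state(size)} chips, {smallest_unit(size)}")
```

Real jobs need headroom for activations and communication buffers, so the practical threshold sits below these bounds.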

When to Pick Trainium 2

Large-scale training on AWS where pricing and integration with the rest of the AWS stack are advantageous.
Workloads where Neuron SDK has matured to acceptable parity for the model family in question.
Multi-vendor strategies seeking a non-NVIDIA training option without on-prem buildout.
Pick H100 / H200 / B200 if CUDA ecosystem features (Flash Attention, TensorRT-LLM) dominate.
Pick TPU v5p / Trillium if Google Cloud and JAX are already the standard.

Pitfalls

Software ecosystem still narrower than CUDA, though significantly improved over Trainium 1.
FP8 calibration tooling lags NVIDIA Transformer Engine.
Workload portability across vendors requires explicit architecture-aware code paths.
AWS-exclusive — no on-prem option.
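
One way to keep architecture-aware code paths explicit is to choose the backend in a single place and branch on that label everywhere else. The sketch below is an illustration, not a Neuron SDK API: it only checks whether a marker package can be imported. `torch_neuronx` is the PyTorch package shipped with the Neuron SDK; the backend labels, the priority order, and the function names are assumptions made for this example.

```python
from importlib.util import find_spec

# Hypothetical priority list: (importable package, backend label).
# The labels are made up for this sketch; only the package names are real.
BACKEND_MARKERS = [
    ("torch_neuronx", "neuron"),
    ("torch", "cuda-or-cpu"),
]


def installed_packages(names):
    """Return the subset of top-level package names that can be imported here."""
    return {name for name in names if find_spec(name) is not None}


def choose_backend(installed):
    """Pick the first backend in priority order whose marker package is installed."""
    for package, backend in BACKEND_MARKERS:
        if package in installed:
            return backend
    return "none"


if __name__ == "__main__":
    found = installed_packages(package for package, _ in BACKEND_MARKERS)
    print("selected backend:", choose_backend(found))
```

Keeping the decision in one pure function makes the vendor-specific branches easy to test without the hardware present.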

Software Notes

AWS Neuron SDK 2.21 and later, PyTorch/XLA on Neuron, and JAX on Neuron all support Trainium 2. Hugging Face Optimum Neuron has Trainium 2 recipes for Llama, Mixtral and other common model families. NxD (NeuronX Distributed) provides FSDP-style sharded parallelism for large-scale training.

