TL;DR
- Mid-cycle refresh of MI300X with HBM3e instead of HBM3, lifting per-GPU memory from 192 GB to 256 GB and bandwidth to ~6 TB/s.
- Same CDNA3 chiplet silicon as MI300X with refreshed memory stacks and tuned clocks.
- Volume shipping from Q4 2024 — positioned head-to-head against NVIDIA H200.
- Targets inference of large dense and MoE models where the larger HBM pool is the decisive feature.
Overview
MI325X is to MI300X what H200 is to H100: same compute silicon, faster and denser HBM. AMD announced the part at Computex 2024 and shipped in volume from Q4 2024. The CDNA3 XCDs are unchanged; the HBM3e upgrade lifts capacity to 256 GB and bandwidth to roughly 6 TB/s.
Positioning is squarely against NVIDIA H200 in the dense-inference market. Where H200 offers 141 GB at 4.8 TB/s, MI325X offers 256 GB at ~6 TB/s, the largest single-GPU memory pool generally available at its launch.
Specifications vs MI300X
| Metric | MI325X | MI300X |
|---|---|---|
| Architecture | CDNA3 | CDNA3 |
| Memory | 256 GB HBM3e | 192 GB HBM3 |
| Memory bandwidth | ~6 TB/s | 5.3 TB/s |
| FP8 (Matrix, dense) | 2,614 TFLOPS | 2,614 TFLOPS |
| BF16 (Matrix, dense) | 1,307 TFLOPS | 1,307 TFLOPS |
| TDP | ~1,000 W | 750 W |
| Form factor | OAM | OAM |
AMD's initial Computex announcement listed 288 GB; the shipping spec came down to 256 GB, and the figures here reflect production hardware. Treat 256 GB as AMD's largest 2024-era HBM pool, with H200 at 141 GB as the practical comparison point.
Why HBM3e Now
The motivation parallels NVIDIA's H200 refresh. Inference workloads — particularly long-context decode — are memory-bandwidth bound, and HBM density caps practical replica size for large models. HBM3e production from SK hynix and Micron made the upgrade viable in 2024.
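The bandwidth-bound argument can be made concrete with a rough roofline estimate. The sketch below assumes every generated token streams the full set of weights from HBM once and ignores KV-cache reads, kernel overheads, and the gap between peak and achieved bandwidth, so it gives an upper bound rather than a prediction. The function name and the 70B, 16-bit example model are illustrative assumptions; the bandwidth figures are the peak numbers quoted in this article.

```python
# Rough upper bound on single-sequence decode speed when weight reads dominate.
# Assumption: each generated token reads all model weights from HBM exactly once;
# KV-cache traffic, kernel overheads, and achieved-vs-peak bandwidth are ignored.

# Peak memory bandwidth in TB/s, as quoted in this article.
BANDWIDTH_TB_S = {"MI325X": 6.0, "MI300X": 5.3, "H200": 4.8}


def decode_tokens_per_s_upper_bound(params_billion: float,
                                    bytes_per_param: float,
                                    bandwidth_tb_s: float) -> float:
    """Tokens/s ceiling = bytes per second available / bytes read per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


if __name__ == "__main__":
    for gpu, bw in BANDWIDTH_TB_S.items():
        # Hypothetical 70B dense model with 16-bit (2-byte) weights.
        bound = decode_tokens_per_s_upper_bound(70, 2, bw)
        print(f"{gpu}: at most {bound:.1f} tokens/s per sequence")
```

Because the ceiling scales linearly with bandwidth, the move from 5.3 TB/s to ~6 TB/s raises it by about 13% under these assumptions; batching amortizes the weight reads and changes the picture considerably.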
The TDP increase reflects higher HBM power draw and raised clock targets rather than significant silicon changes. ROCm and PyTorch treat MI325X as an MI300X with a larger memory budget and updated clock targets.
When to Pick MI325X
- Inference of 70B+ models where 256 GB allows single-GPU replicas without tensor parallelism.
- Long-context inference where KV-cache pressure dominates.
- MoE inference workloads where expert state and routed activations fit comfortably in a single GPU's HBM.
- Multi-vendor production deployments where AMD supply complements NVIDIA capacity.
- Pick MI355X instead if it is available and FP4 support matters.
- Pick H200 or B200 instead if the CUDA ecosystem dominates or specific TensorRT-LLM features are required.
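The single-GPU replica claim above comes down to simple arithmetic, sketched here as a back-of-envelope check. The flat 10% reserve for runtime and allocator overhead, the use of decimal GB, and the helper names are all assumptions made for illustration; real headroom depends on the serving stack and the model's actual weight layout.

```python
# Back-of-envelope check: do the weights fit on one GPU, and how much HBM is left?
# Assumptions (not from any vendor tool): decimal GB, a flat 10% reserve for
# runtime/allocator overhead, and weights stored at one uniform precision.

# HBM capacities in GB, as quoted in this article.
HBM_GB = {"MI325X": 256, "MI300X": 192, "H200": 141}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: (1e9 params * bytes per param) / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def headroom_gb(params_billion: float, bytes_per_param: float,
                hbm_gb: float, reserve_fraction: float = 0.10) -> float:
    """HBM left for KV cache and activations; negative means the weights do not fit."""
    usable = hbm_gb * (1.0 - reserve_fraction)
    return usable - weight_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for gpu, capacity in HBM_GB.items():
        # Hypothetical 70B dense model with 16-bit (2-byte) weights.
        left = headroom_gb(70, 2, capacity)
        verdict = "fits" if left >= 0 else "needs sharding or lower precision"
        print(f"{gpu}: {left:+.1f} GB headroom ({verdict})")
```

Under these assumptions a 70B model in 16-bit leaves roughly 90 GB free on MI325X and about 33 GB on MI300X, and does not fit on a single H200 without quantization or tensor parallelism.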
Pitfalls
- Higher TDP (~1,000 W) means liquid cooling is effectively required for dense deployments.
- Software stack is the same as MI300X's apart from the minimum ROCm release, so the same kernel gaps carry over.
- Supply through 2025 was constrained by HBM3e availability, similar to NVIDIA H200.
- FP4 is not supported on CDNA3; production FP4 paths require MI355X or later.
Software Notes
ROCm 6.2 and later fully support MI325X. vLLM, SGLang, and the PyTorch ROCm backend all treat the card as an MI300X with more memory. Hot kernels tuned for MI300X transfer directly; the only changes typically needed are memory-budget tuning and KV-cache sizing.
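KV-cache sizing is mostly arithmetic, and a minimal sketch of it follows. It assumes a standard attention layout in which each token stores one key and one value vector per layer per KV head; the function names, the 90 GB budget, and the 80-layer, 8-KV-head, 128-dimension configuration are illustrative placeholders rather than measurements from any particular deployment.

```python
# Estimate how many tokens of KV cache fit in a given HBM budget.
# Assumption: standard (grouped-query) attention where every token stores one
# key and one value vector per layer per KV head at a uniform element size.


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token; the factor 2 covers keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_bytes: int, n_layers: int, n_kv_heads: int,
                      head_dim: int, bytes_per_elem: int = 2) -> int:
    """Whole tokens (summed across all concurrent sequences) that fit in the budget."""
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return kv_budget_bytes // per_token


if __name__ == "__main__":
    # Hypothetical numbers: a 90 GB KV budget and a 70B-class GQA configuration.
    budget = 90 * 10**9
    tokens = max_cached_tokens(budget, n_layers=80, n_kv_heads=8, head_dim=128)
    print(f"KV bytes per token: {kv_bytes_per_token(80, 8, 128)}")
    print(f"Tokens that fit in the budget: {tokens}")
```

The resulting token count is shared across all in-flight sequences, so it bounds the product of batch size and context length. Serving frameworks expose this budget through their own settings (vLLM's `gpu_memory_utilization` is one such knob), and those settings, not this estimate, determine the cache actually allocated.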