TL;DR
- CDNA2-architecture accelerator launched November 2021; two GPU dies presented as a single OAM module with 128 GB HBM2e total.
- Powered the Frontier exascale supercomputer at Oak Ridge — the world's first publicly verified exascale system.
- FP64 was the headline: ~95 TFLOPS peak matrix throughput on MI250X, exceptional for HPC; AI throughput is more modest relative to MI300X.
- Largely superseded by MI300X / MI325X for AI; remains relevant in HPC and amortised HPC+AI deployments.
Overview
MI250 and MI250X were AMD's CDNA2-generation HPC accelerators; MI250X is the fuller configuration, with 220 compute units enabled versus 208 on MI250. Launched in November 2021, each presents two GPU dies as a single OAM module with 128 GB of HBM2e and an aggregate 3.2 TB/s of memory bandwidth. From the software side, a module appears as two GPUs sharing a board rather than one unified device.
Its claim to fame is HPC. Frontier, the first verified exascale supercomputer, deployed at Oak Ridge in 2022, uses MI250X accelerators. MI250X's peak FP64 matrix throughput of 95.7 TFLOPS was roughly 4.9 times that of A100 (19.5 TFLOPS with FP64 Tensor Cores) and remains competitive on FP64-heavy scientific workloads.
Specifications
| Metric | MI250X | MI250 |
|---|---|---|
| Architecture | CDNA2 (dual-die) | CDNA2 (dual-die) |
| FP64 (Matrix) | 95.7 TFLOPS | 90.5 TFLOPS |
| FP32 (Matrix) | 95.7 TFLOPS | 90.5 TFLOPS |
| BF16 / FP16 (Matrix) | 383 TFLOPS | 362 TFLOPS |
| INT8 (Matrix) | 383 TOPS | 362 TOPS |
| Memory | 128 GB HBM2e | 128 GB HBM2e |
| Memory bandwidth | 3.2 TB/s | 3.2 TB/s |
| TDP | 500 W (560 W peak) | 500 W (560 W peak) |
| Infinity Fabric | 100 GB/s per link | 100 GB/s per link |
| Form factor | OAM | OAM |
MI250 dies are not unified — the runtime sees two GPUs per OAM module. This affects how scheduling and collective patterns interact with workloads designed assuming single-device-per-socket.
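
Because the table reports per-module figures while the runtime schedules work per die, it can help to halve the module numbers when sizing work for a single device. The following is a minimal sketch, assuming the two dies split the module's memory, bandwidth, and compute evenly; the dictionary and function names are illustrative and do not come from AMD tooling.

```python
# Per-module figures copied from the specifications table above.
MODULE_SPECS = {
    "MI250X": {"fp64_matrix_tflops": 95.7, "memory_gb": 128, "bandwidth_tbs": 3.2},
    "MI250": {"fp64_matrix_tflops": 90.5, "memory_gb": 128, "bandwidth_tbs": 3.2},
}

DIES_PER_MODULE = 2  # the runtime sees two GPUs per OAM module


def per_die(model: str) -> dict:
    """Return the share of each module-level figure available to one die.

    Assumption: resources are split evenly between the two dies.
    """
    return {key: value / DIES_PER_MODULE for key, value in MODULE_SPECS[model].items()}


if __name__ == "__main__":
    for name in MODULE_SPECS:
        print(name, per_die(name))
```

A job that addresses a single device should therefore plan around the per-die values, not the headline module values.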
Architecture Notes
CDNA2 emphasises FP64 and matrix throughput for traditional HPC over the AI-skewed precision sets of later generations. There is no FP8 and no chiplet IO die: the two dies on the package are linked directly by Infinity Fabric, and each die has its own HBM controllers.
For AI workloads, the lack of FP8 and the dual-die programming model are the main constraints. Throughput on BF16 is healthy, but framework support assumes you'll handle the two-GPU-per-board topology explicitly.
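
One way to handle the two-GPU-per-board topology explicitly is to make the die-to-module mapping part of rank placement. The sketch below assumes that logical device indices are enumerated so that devices 2k and 2k+1 share a module; that ordering is an assumption and should be checked against the node's actual topology report. All helper names are hypothetical.

```python
from collections import defaultdict

DIES_PER_MODULE = 2


def module_of(device_index: int) -> int:
    # Assumption: devices 2k and 2k+1 sit on the same OAM module.
    return device_index // DIES_PER_MODULE


def group_by_module(device_indices):
    """Map module index -> list of device indices on that module."""
    groups = defaultdict(list)
    for idx in device_indices:
        groups[module_of(idx)].append(idx)
    return dict(groups)


def split_pairs(device_indices):
    """Split devices into complete on-package pairs and leftover singles."""
    pairs, singles = [], []
    for devices in group_by_module(device_indices).values():
        if len(devices) == DIES_PER_MODULE:
            pairs.append(tuple(sorted(devices)))
        else:
            singles.extend(devices)
    return pairs, sorted(singles)
```

Complete pairs are natural candidates for communication-heavy groupings, since both dies sit on the same package; leftover singles signal that an allocation has split a module between jobs.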
When MI250 Still Makes Sense
- HPC workloads where FP64 throughput remains the binding constraint.
- Existing Frontier-class or similar deployments running mixed HPC + AI workloads.
- Amortised on-prem clusters where TCO trumps generational throughput.
- Conversely, for AI-first deployments with FP8 inference paths, pick MI300X / MI325X instead.
Pitfalls
- Two-GPU-per-OAM presentation breaks naive multi-tenant assumptions.
- No FP8 — modern quantised inference paths skip MI250.
- ROCm support continues but new optimisations increasingly target CDNA3+.
- HBM2e capacity per die (64 GB) is meaningfully lower than per-GPU memory on modern parts (for example, 192 GB on MI300X).
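
The per-die capacity limit can be turned into a quick lower-bound check. The sketch below is a rough estimate only: it counts weight storage alone, treats 1 GB as 10^9 bytes, and ignores activations, KV cache, optimizer state, and runtime overhead. The function names and the 70-billion-parameter example are hypothetical.

```python
import math

DIE_MEMORY_BYTES = 64 * 10**9  # 64 GB per die, taking 1 GB = 10^9 bytes (assumption)
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_bytes(num_params: int, dtype: str) -> int:
    """Bytes needed to store the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype]


def min_dies_for_weights(num_params: int, dtype: str) -> int:
    """Lower bound on the number of dies needed to hold the weights.

    Ignores activations, KV cache, optimizer state, and runtime overhead,
    so real deployments will usually need more.
    """
    return math.ceil(weight_bytes(num_params, dtype) / DIE_MEMORY_BYTES)


if __name__ == "__main__":
    # Hypothetical 70-billion-parameter model stored in BF16.
    print(min_dies_for_weights(70 * 10**9, "bf16"))
```

Under these assumptions, a 70-billion-parameter model in BF16 needs 140 GB for weights, which already spans at least three dies, or more than one full module.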
Software Notes
ROCm 5.x through 6.x supports MI250. The PyTorch ROCm backend treats each die as a discrete GPU, so a single OAM module shows up as two devices. HPC stacks (OpenMP target offload, HIP, Kokkos) are well supported and tuned.
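
To see how many devices a process is actually given, the visible GPUs can be listed from PyTorch. This is a small sketch, assuming a ROCm build of PyTorch, which exposes AMD GPUs through the `torch.cuda` namespace; the function name `list_visible_gpus` is illustrative. It returns an empty list when PyTorch is missing or no GPU is visible.

```python
def list_visible_gpus():
    """Return (index, name) pairs for the GPUs PyTorch can see.

    Returns an empty list if PyTorch is not installed or no GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (i, torch.cuda.get_device_name(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    print(f"{len(gpus)} device(s) visible")
    for index, name in gpus:
        print(index, name)
```

Since each die is treated as a discrete GPU, the count reported here should be twice the number of MI250 modules made visible to the process.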
References
- AMD Instinct MI250X Datasheet · AMD
- Frontier Supercomputer Overview · Oak Ridge National Laboratory