AMD Instinct MI250 / MI250X

TL;DR

CDNA2-architecture accelerator launched November 2021; two GPU dies packaged in a single OAM module with 128 GB HBM2e total.
Powered the Frontier exascale supercomputer at Oak Ridge, the world's first publicly verified exascale system.
FP64 was the headline: about 95 TFLOPS peak, exceptional for HPC. AI throughput is more modest relative to MI300X.
Largely superseded by MI300X / MI325X for AI; remains relevant in HPC and amortised HPC+AI deployments.

Overview

MI250 and MI250X were AMD's CDNA2-generation HPC accelerators; the MI250X is the fuller configuration, with more compute units enabled (220 versus 208). Launched in November 2021, each packaged two GPU dies in a single OAM module with 128 GB of HBM2e and an aggregate 3.2 TB/s of memory bandwidth. To software, a module appears as two GPUs sharing a board rather than one unified device.

Its claim to fame is HPC. Frontier, the first verified exascale supercomputer, deployed at Oak Ridge in 2022, uses MI250X accelerators. The MI250X's 95.7 TFLOPS of peak FP64 matrix throughput was substantially ahead of A100 (19.5 TFLOPS), and the part remains competitive on FP64-heavy scientific workloads.
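As a quick check on "substantially ahead", the ratio of the two peak figures quoted above can be computed directly. This is a minimal sketch using vendor peak numbers only; the helper name `peak_ratio` is hypothetical, and measured application speedups will differ.

```python
# Peak FP64 figures quoted in the text, in TFLOPS.
MI250X_FP64_TFLOPS = 95.7  # MI250X, whole module (two dies)
A100_FP64_TFLOPS = 19.5    # A100 figure quoted in the text


def peak_ratio(a_tflops: float, b_tflops: float) -> float:
    """Return how many times larger peak a is than peak b."""
    return a_tflops / b_tflops


ratio = peak_ratio(MI250X_FP64_TFLOPS, A100_FP64_TFLOPS)
print(f"MI250X / A100 peak FP64: {ratio:.1f}x")
```

The MI250X figure is for a whole module, which software sees as two devices, so the per-device advantage is about half of this ratio.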

Specifications

Metric	MI250X	MI250
Architecture	CDNA2 (dual-die)	CDNA2 (dual-die)
FP64 (Matrix)	95.7 TFLOPS	90.5 TFLOPS
FP32 (Matrix)	95.7 TFLOPS	90.5 TFLOPS
BF16 / FP16 (Matrix)	383 TFLOPS	362 TFLOPS
INT8 (Matrix)	383 TOPS	362 TOPS
Memory	128 GB HBM2e	128 GB HBM2e
Memory bandwidth	3.2 TB/s	3.2 TB/s
TDP	560 W	500 W
Infinity Fabric	100 GB/s per link	100 GB/s per link
Form factor	OAM	OAM
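One way to read the table is through machine balance: peak throughput divided by memory bandwidth gives the arithmetic intensity a kernel needs before it stops being bandwidth-bound. The sketch below applies the standard roofline model to the table's per-module peaks. The function names are hypothetical, and real kernels rarely reach either peak.

```python
def machine_balance(peak_tflops: float, bandwidth_tbs: float) -> float:
    """FLOPs per byte needed to saturate compute (TFLOPS / TB/s)."""
    return peak_tflops / bandwidth_tbs


def attainable_tflops(intensity: float, peak_tflops: float,
                      bandwidth_tbs: float) -> float:
    """Roofline bound: the lower of the compute and bandwidth ceilings."""
    return min(peak_tflops, intensity * bandwidth_tbs)


# MI250X peaks from the table: FP64 matrix 95.7 TFLOPS, 3.2 TB/s.
balance = machine_balance(95.7, 3.2)
print(f"FP64 matrix balance: {balance:.1f} FLOP/byte")

# A kernel doing 10 FLOPs per byte moved is still bandwidth-bound.
print(f"Bound at 10 FLOP/byte: {attainable_tflops(10.0, 95.7, 3.2):.1f} TFLOPS")
```

Because both peaks are per-module totals, the ratio is the same for a single die, provided throughput and bandwidth split evenly between the two dies.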

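The two dies in an MI250 module are not unified: the runtime sees two GPUs per OAM module. The figures above are per-module totals, so each visible device has 64 GB of HBM2e and roughly half the listed bandwidth and throughput. This matters for scheduling and collective-communication patterns in workloads written on the assumption of one device per socket.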
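The two-devices-per-module layout can be sketched as a small data model. It assumes that resources split evenly between the dies, and that the runtime numbers the two dies of a module consecutively (indices 2k and 2k+1). The second assumption should be checked against the node's actual topology, for example with `rocm-smi --showtopo`. All names here are hypothetical.

```python
from dataclasses import dataclass

DIES_PER_MODULE = 2


@dataclass(frozen=True)
class Spec:
    memory_gb: float
    bandwidth_tbs: float
    bf16_tflops: float

    def per_die(self) -> "Spec":
        """Per-device view, assuming an even split across the two dies."""
        return Spec(self.memory_gb / DIES_PER_MODULE,
                    self.bandwidth_tbs / DIES_PER_MODULE,
                    self.bf16_tflops / DIES_PER_MODULE)


def module_of(device_index: int) -> int:
    """OAM module a device belongs to, assuming consecutive numbering."""
    return device_index // DIES_PER_MODULE


def same_module(a: int, b: int) -> bool:
    """True if two visible devices are dies on the same OAM module."""
    return module_of(a) == module_of(b)


MI250X_MODULE = Spec(memory_gb=128, bandwidth_tbs=3.2, bf16_tflops=383)
die = MI250X_MODULE.per_die()
print(die.memory_gb, die.bandwidth_tbs, die.bf16_tflops)
print(same_module(0, 1), same_module(1, 2))
```

A scheduler could use a helper like `same_module` to avoid handing the two dies of one module to different tenants, or to keep tightly coupled ranks on the same package.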

Architecture Notes

CDNA2 emphasises FP64 and matrix throughput for traditional HPC over the AI-skewed precision sets of later generations. There is no FP8 and no chiplet IO die: the two dies on the package are linked directly by Infinity Fabric, and each die has its own HBM controllers.

For AI workloads, the lack of FP8 and the dual-die programming model are the main constraints. BF16 throughput is healthy, but frameworks see each die as a separate device, so the two-GPU-per-module topology has to be handled explicitly.
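A common way to handle the topology explicitly is to run one process per die and pin each process to its device with the ROCm environment variable `HIP_VISIBLE_DEVICES`. The sketch below builds such an environment for a local rank. The helper name and the default of eight visible devices per node (four modules) are assumptions for illustration, not something the source specifies.

```python
import os


def env_for_local_rank(local_rank: int, devices_per_node: int = 8) -> dict:
    """Return a copy of the environment that exposes one die to one process.

    devices_per_node is an assumed value (four dual-die modules); adjust
    it to the actual node.
    """
    if not 0 <= local_rank < devices_per_node:
        raise ValueError(f"local_rank {local_rank} out of range")
    env = dict(os.environ)
    # The process sees only this die, which it addresses as device 0.
    env["HIP_VISIBLE_DEVICES"] = str(local_rank)
    return env


env = env_for_local_rank(3)
print(env["HIP_VISIBLE_DEVICES"])
```

The resulting dictionary can be passed as the `env` argument to `subprocess.Popen` when launching each worker. Many launchers and job schedulers offer an equivalent binding option, which is usually preferable when available.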

When MI250 Still Makes Sense

HPC workloads where FP64 throughput remains the binding constraint.
Existing Frontier-class or similar deployments running mixed HPC + AI workloads.
Amortised on-prem clusters where TCO trumps generational throughput.
For AI-first deployments with FP8 inference paths, pick MI300X / MI325X instead.

Pitfalls

The two-GPU-per-OAM presentation breaks naive multi-tenant assumptions: one visible device is one die, not one module.
No FP8, so modern FP8-quantised inference paths are not available on MI250.
ROCm support continues, but new optimisations increasingly target CDNA3 and later.
HBM2e capacity per die (64 GB) is meaningfully lower than per-GPU memory on modern parts such as MI300X (192 GB).
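The per-die capacity limit is easy to quantify for model weights. The sketch below estimates how many dies are needed just to hold a model's weights at two bytes per parameter (BF16 or FP16, since FP8 is not available). The `usable_fraction` headroom, the example model sizes, and the helper names are assumptions; activations, KV cache, and optimiser state are not counted.

```python
import math

DIE_MEMORY_GB = 64  # per-die HBM2e capacity from the text


def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weight footprint in GB (decimal), ignoring activations and caches."""
    return num_params * bytes_per_param / 1e9


def min_dies_for_weights(num_params: float, bytes_per_param: int = 2,
                         usable_fraction: float = 0.9) -> int:
    """Smallest number of dies whose usable memory holds the weights."""
    usable_gb = DIE_MEMORY_GB * usable_fraction
    return math.ceil(weights_gb(num_params, bytes_per_param) / usable_gb)


for params in (13e9, 70e9):
    print(f"{params / 1e9:.0f}B params: {weights_gb(params):.0f} GB, "
          f"{min_dies_for_weights(params)} die(s)")
```

Under these assumptions a 13B-parameter model fits on one die, while a 70B-parameter model needs three dies, so it already spans more than one OAM module.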

Software Notes

ROCm 5.x through 6.x supports MI250. The PyTorch ROCm backend treats each die as a discrete GPU. HPC stacks (OpenMP target offload, HIP, Kokkos) are well supported and tuned.
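To see the per-die enumeration from PyTorch, the devices can be listed through the `torch.cuda` namespace, which ROCm builds of PyTorch reuse for AMD GPUs. This is a minimal sketch: the helper name is hypothetical, and it returns an empty list when PyTorch is not installed or no GPU is visible.

```python
def list_visible_gpus() -> list:
    """Return one dict per device PyTorch can see (empty if none)."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append({
            "index": index,
            "name": props.name,
            "memory_gb": props.total_memory / 1e9,
        })
    return gpus


for gpu in list_visible_gpus():
    print(gpu["index"], gpu["name"], f"{gpu['memory_gb']:.0f} GB")
```

On a node with MI250 or MI250X modules, the expectation from the text is two entries per module, each reporting roughly 64 GB.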

References

AMD Instinct MI250X Datasheet · AMD
Frontier Supercomputer Overview · Oak Ridge National Laboratory
