TL;DR
- CDNA4-architecture accelerator announced for late 2025 / early 2026 — AMD's direct response to NVIDIA Blackwell.
- Adds native FP4 (OCP MX format) and FP6 support, targeting roughly 2× the FP8 throughput of MI300X.
- 288 GB HBM3e per OAM module at ~8 TB/s; positioned for frontier inference and large-batch training.
- Software story rests on ROCm 7 and a maturing kernel library; production readiness is improving but trails the CUDA stack.
Overview
MI355X is the first CDNA4 part — AMD's generational answer to NVIDIA Blackwell. Architecturally it carries forward the chiplet design pioneered in MI300X, with new XCD silicon that adds FP4 (using the OCP-standard MX format) and FP6 alongside FP8 and BF16. Memory steps up to 288 GB of HBM3e per OAM module at roughly 8 TB/s.
Public detail at the time of writing is less complete than for NVIDIA Blackwell. The headline claim — frontier inference and training competitive with B200/B300 — is plausible given the silicon and memory configurations, but the software story remains AMD's principal competitive challenge.
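To make the 288 GB figure concrete, the following is a rough sketch of how weight storage scales with precision. It assumes the OCP MX layout of one shared 8-bit scale per 32-element block (0.25 extra bits per weight), and it ignores KV cache, activations, and runtime overhead. The 405B-parameter model and the function name `weight_footprint_gb` are hypothetical and used only for illustration.

```python
# Effective bits per weight. MX formats add one shared 8-bit scale per
# 32-element block, i.e. 8 / 32 = 0.25 extra bits per element (assumption:
# weights are stored in plain MX blocks with no further packing overhead).
BITS_PER_WEIGHT = {
    "bf16": 16.0,
    "mxfp8": 8.0 + 8 / 32,
    "mxfp6": 6.0 + 8 / 32,
    "mxfp4": 4.0 + 8 / 32,
}

HBM_GB = 288  # per-module capacity quoted in this article


def weight_footprint_gb(num_params: float, fmt: str) -> float:
    """Weight storage in GB (10^9 bytes); excludes KV cache and activations."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


if __name__ == "__main__":
    params = 405e9  # hypothetical 405B-parameter dense model
    for fmt in BITS_PER_WEIGHT:
        gb = weight_footprint_gb(params, fmt)
        print(f"{fmt:>6}: {gb:7.1f} GB  fits in one module: {gb <= HBM_GB}")
```

Under these assumptions only the MXFP4 copy of the hypothetical model fits in a single module; real deployments also need headroom for the KV cache, so the practical limit is lower.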
Specifications
| Metric | MI355X |
|---|---|
| Architecture | CDNA4 |
| Process | TSMC 3 nm (XCD compute chiplets) |
| Memory | 288 GB HBM3e |
| Memory bandwidth | ~8 TB/s |
| FP8 (Matrix, dense) | ~5,000 TFLOPS |
| FP6 (Matrix, dense) | ~10,000 TFLOPS |
| FP4 (Matrix, dense) | ~10,000 TFLOPS |
| TDP | ~1,400 W |
| Form factor | OAM |
Several MI355X figures were preliminary at launch. Memory capacity (288 GB) and the addition of FP4 / FP6 are the load-bearing claims; absolute throughput numbers will be revised as production parts and independent benchmarks land.
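One way to read the table is through a simple roofline estimate: dividing peak throughput by memory bandwidth gives the arithmetic intensity (FLOPs per byte moved) a kernel needs before it stops being memory-bound. The sketch below uses the approximate FP8, FP4, and bandwidth figures from the table. The roofline model is a simplification, and the function names are illustrative rather than part of any AMD tool.

```python
def ridge_point(peak_tflops: float, bandwidth_tbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) where compute and memory limits meet."""
    return (peak_tflops * 1e12) / (bandwidth_tbs * 1e12)


def attainable_tflops(intensity: float, peak_tflops: float,
                      bandwidth_tbs: float) -> float:
    """Roofline bound: min(peak compute, intensity * bandwidth).

    intensity is in FLOP/byte and bandwidth in TB/s, so their product is
    already in TFLOP/s.
    """
    return min(peak_tflops, intensity * bandwidth_tbs)


if __name__ == "__main__":
    bandwidth = 8.0  # TB/s, approximate figure from the table
    for name, peak in [("FP8", 5_000.0), ("FP4", 10_000.0)]:
        print(f"{name}: ridge at {ridge_point(peak, bandwidth):.0f} FLOP/byte")
    # A low-intensity kernel (e.g. small-batch decode) is bandwidth-limited:
    print(attainable_tflops(100.0, 10_000.0, bandwidth), "TFLOPS at 100 FLOP/byte")
```

With these inputs the ridge sits at several hundred FLOPs per byte, which is why low-batch decode tends to be governed by the ~8 TB/s bandwidth figure rather than by the headline FP4 number.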
CDNA4 and the OCP MX Formats
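CDNA4's FP4 / FP6 support is built on the OCP Microscaling (MX) formats, the same standard that underpins NVIDIA's Blackwell MX formats. In MX, each block of 32 elements shares one 8-bit power-of-two scale (E8M0), and the element encodings (E2M1 for FP4; E3M2 or E2M3 for FP6) are fixed by the specification. This is mostly good news for portability: code paths developed for one vendor's MX support translate relatively cleanly to the other, at least at the level of data format and numerics.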
The remaining differences are in kernel optimisation. AMD's ROCm libraries are tuned per generation; treating MI355X as 'MI300X with FP4' and skipping explicit re-tuning of attention and GEMM kernels will leave significant performance on the table.
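To illustrate what the shared format looks like in practice, here is a minimal NumPy sketch of MXFP4 block quantisation. It follows the general shape of the conversion described in the OCP MX specification (one power-of-two scale per 32-element block, E2M1 elements), but it is a simplified reference: ties are not rounded to even, NaN and infinity are not handled, and the function names are hypothetical. It says nothing about how AMD or NVIDIA hardware implements the conversion.

```python
import numpy as np

BLOCK = 32                       # MX block size
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_EMAX = 2                     # exponent of the largest E2M1 binade (4.0..6.0)


def mxfp4_quantize_block(values):
    """Return (shared_exponent, elements) for one 32-element block.

    elements are the decoded E2M1 values (not bit patterns), so that
    block ~= elements * 2**shared_exponent.
    """
    v = np.asarray(values, dtype=np.float64)
    assert v.shape == (BLOCK,), "expects exactly one block of 32 values"
    amax = np.max(np.abs(v))
    if amax == 0.0:
        return 0, np.zeros(BLOCK)
    # Shared scale: align the block maximum with the top E2M1 binade.
    shared_exp = int(np.floor(np.log2(amax))) - FP4_EMAX
    shared_exp = max(-127, min(127, shared_exp))   # E8M0 exponent range
    scaled = v / 2.0 ** shared_exp
    # Clamp to the largest magnitude, then snap to the nearest grid point.
    mag = np.minimum(np.abs(scaled), FP4_GRID[-1])
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return shared_exp, np.sign(scaled) * FP4_GRID[idx]


def mxfp4_dequantize_block(shared_exp, elements):
    """Inverse of mxfp4_quantize_block."""
    return elements * 2.0 ** shared_exp


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.normal(size=BLOCK)
    exp, elems = mxfp4_quantize_block(block)
    err = np.max(np.abs(block - mxfp4_dequantize_block(exp, elems)))
    print(f"shared exponent: {exp}, max abs error: {err:.4f}")
```

Because the block layout and element grid come from the specification rather than from either vendor, a reference like this can serve as a numerics check on both platforms; the performance-critical kernels still need per-architecture tuning.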
When to Pick MI355X
- Frontier inference at scale where FP4 throughput translates to lower $/token.
- Multi-vendor strategies that need an AMD-native FP4 path.
- Workloads already running on ROCm where a CDNA3-to-CDNA4 upgrade preserves software investment.
- Pick B200 / B300 instead when CUDA ecosystem maturity dominates and supply allows.
- Pick MI325X instead if FP4 isn't required and CDNA3 software stability is preferred.
Pitfalls
- Software maturity for CDNA4 FP4 lags Blackwell FP4 — expect a non-trivial integration window.
- TDP step-up to ~1,400 W requires substantial cooling and power provisioning.
- Public benchmarks were limited at launch; vendor-published figures should be treated cautiously.
- MoE-specific kernels (expert routing, token dispatch) may need vendor-side tuning before reaching CUDA parity.
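The power pitfall above is easy to quantify with back-of-envelope arithmetic. The sketch below assumes eight OAM modules per node, which is a common platform layout but is not stated in this article, and a hypothetical overhead factor for CPUs, NICs, fans, and conversion losses. Both values are placeholders to be replaced with figures for the actual platform.

```python
def node_accelerator_power_kw(tdp_w: float = 1400.0,
                              accelerators_per_node: int = 8) -> float:
    """Accelerator-only power per node in kW (assumes 8 OAM modules per node)."""
    return tdp_w * accelerators_per_node / 1000.0


def node_total_power_kw(overhead_factor: float = 1.3, **kwargs) -> float:
    """Accelerator power scaled by a hypothetical host/overhead factor."""
    return node_accelerator_power_kw(**kwargs) * overhead_factor


if __name__ == "__main__":
    print(f"accelerators only: {node_accelerator_power_kw():.1f} kW per node")
    print(f"with assumed 1.3x overhead: {node_total_power_kw():.1f} kW per node")
```

Even before host overhead, eight modules at ~1,400 W draw more than 11 kW per node, which is what drives the cooling and provisioning requirement.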
Software Notes
ROCm 7 is the production target. vLLM and SGLang gained CDNA4 paths in late 2025; PyTorch's ROCm backend includes initial CDNA4 support, but performance tuning is ongoing. AMD's Composable Kernel library underpins most of the high-performance attention and GEMM paths.
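Before benchmarking, it helps to confirm that the PyTorch build in use is a ROCm build and to see which architecture it reports. The sketch below relies on `torch.version.hip` and `torch.cuda.get_device_properties`, which ROCm builds of PyTorch expose through the `torch.cuda` namespace. The `gcnArchName` attribute is read defensively because it may be absent on some builds, and the expectation that MI355X reports a `gfx950` target is an assumption to verify against AMD's documentation. The function name `describe_rocm_device` is illustrative.

```python
def describe_rocm_device() -> dict:
    """Summarise the PyTorch/ROCm environment without failing if torch is absent."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        # None on CUDA or CPU-only builds; a version string on ROCm builds.
        "hip_version": getattr(torch.version, "hip", None),
        "device_available": torch.cuda.is_available(),
    }
    if info["hip_version"] and info["device_available"]:
        props = torch.cuda.get_device_properties(0)
        info["name"] = props.name
        info["total_memory_gb"] = props.total_memory / 1e9
        # Architecture string such as "gfx942"; MI355X is assumed to report
        # a gfx950 target (unverified here).
        info["arch"] = getattr(props, "gcnArchName", None)
    return info


if __name__ == "__main__":
    for key, value in describe_rocm_device().items():
        print(f"{key}: {value}")
```

Logging this output alongside benchmark results makes it easier to tell whether a slow run came from a CDNA3-tuned code path or a missing CDNA4 kernel.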
References
- AMD Instinct MI350 Series Announcement · AMD
- OCP Microscaling Formats Specification · Open Compute Project