AMD Instinct MI355X Accelerator

TL;DR

CDNA4-architecture accelerator announced for late 2025 / early 2026 — AMD's direct response to NVIDIA Blackwell.
Adds native FP4 (OCP MX format) and FP6 support, targeting roughly 2× the FP8 throughput of MI300X.
288 GB HBM3e per OAM module at ~8 TB/s; positioned for frontier inference and large-batch training.
Software story rests on ROCm 7 and a maturing kernel library; production readiness is improving but trails the CUDA stack.

Overview

MI355X is the first CDNA4 part — AMD's generational answer to NVIDIA Blackwell. Architecturally it carries forward the chiplet design pioneered in MI300X, with new XCD silicon that adds FP4 (using the OCP-standard MX format) and FP6 alongside FP8 and BF16. Memory steps up to 288 GB of HBM3e per OAM module at roughly 8 TB/s.

Public detail at the time of writing is less complete than for NVIDIA Blackwell. The headline claim — frontier inference and training competitive with B200/B300 — is plausible given the silicon and memory configurations, but the software story remains AMD's principal competitive challenge.

Specifications

Metric	MI355X
Architecture	CDNA4
Process	TSMC 3 nm (chiplets)
Memory	288 GB HBM3e
Memory bandwidth	~8 TB/s
FP8 (Matrix, sparse)	~5,000 TFLOPS
FP6 (Matrix, sparse)	~5,000 TFLOPS
FP4 (Matrix, sparse)	~10,000 TFLOPS
TDP	~1,400 W
Form factor	OAM

Several MI355X figures were preliminary at launch. Memory capacity (288 GB) and the addition of FP4 / FP6 are the load-bearing claims; absolute throughput numbers will be revised as production parts and independent benchmarks land.
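To put the table in context, the following back-of-envelope sketch derives two roofline-style numbers from the preliminary figures above: the arithmetic intensity at which a kernel stops being bandwidth-bound, and the time needed to stream the full HBM once. The function names are illustrative, and the inputs are the approximate table values, so the outputs are only as good as those peaks.

```python
# Roofline-style arithmetic using the preliminary table values (assumed peaks).
PEAK_TFLOPS = {"fp8": 5_000.0, "fp6": 5_000.0, "fp4": 10_000.0}  # matrix, sparse
MEM_BW_TBPS = 8.0   # ~8 TB/s HBM3e
HBM_GB = 288.0      # capacity per OAM module


def ridge_point(peak_tflops: float, bw_tbps: float) -> float:
    """FLOPs per byte above which a kernel is compute-bound rather than bandwidth-bound."""
    return (peak_tflops * 1e12) / (bw_tbps * 1e12)


def full_sweep_ms(capacity_gb: float, bw_tbps: float) -> float:
    """Lower bound, in milliseconds, on reading the whole memory once at peak bandwidth."""
    return capacity_gb * 1e9 / (bw_tbps * 1e12) * 1e3


if __name__ == "__main__":
    for fmt, tflops in PEAK_TFLOPS.items():
        print(f"{fmt}: ridge point ~{ridge_point(tflops, MEM_BW_TBPS):.0f} FLOPs/byte")
    print(f"full HBM sweep: ~{full_sweep_ms(HBM_GB, MEM_BW_TBPS):.0f} ms")
```

Kernels whose arithmetic intensity falls well below the ridge point (single-stream decode is a typical case) are limited by the ~8 TB/s figure rather than by the headline TFLOPS.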

CDNA4 and the OCP MX Formats

CDNA4's FP4 / FP6 support is built on the OCP Microscaling formats — the same standard that underpins NVIDIA's Blackwell MX formats. This is mostly good news for portability: code paths developed for one vendor's MX support translate relatively cleanly to the other.

The remaining differences are in kernel optimisation. AMD's ROCm libraries are tuned per-generation; treating MI355X as 'MI300X with FP4' and skipping explicit re-tuning of attention and GEMM kernels will leave significant performance on the table.
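As a vendor-neutral illustration of what an MX format stores, here is a minimal pure-Python sketch of MXFP4-style block quantisation: blocks of up to 32 values share one power-of-two scale (E8M0), and each element is rounded to an FP4 E2M1 magnitude. The scale selection follows the conversion outlined in the OCP specification, but tie-breaking is simplified and the helper name `quantize_block` is hypothetical. It is a reference model for reasoning about error, not a description of how ROCm or CDNA4 hardware implements the format.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in the MX formats
E2M1_EMAX = 2     # exponent of the largest E2M1 normal (6.0 = 1.5 * 2**2)


def quantize_block(values):
    """Quantise one block; return (shared_exponent, dequantised_values)."""
    if len(values) > BLOCK_SIZE:
        raise ValueError("an MX block holds at most 32 elements")
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    shared_exp = (math.frexp(amax)[1] - 1) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 exponent range
    scale = 2.0 ** shared_exp
    out = []
    for v in values:
        # Saturate to the largest FP4 magnitude, then round to the nearest one.
        # Simplification: ties go to the smaller magnitude, not round-to-even.
        mag = min(abs(v) / scale, FP4_E2M1_VALUES[-1])
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - mag))
        out.append(math.copysign(nearest * scale, v))
    return shared_exp, out


if __name__ == "__main__":
    exp, deq = quantize_block([12.0, 1.0, -3.3, 0.1])
    print("shared exponent:", exp)
    print("dequantised:", deq)
```

Because the scale is shared, small values in a block dominated by one large value lose most of their precision, which is one reason per-format kernel and calibration tuning still matters even when the storage format is portable.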

When to Pick MI355X

Frontier inference at scale where FP4 throughput translates to lower $/token.
Multi-vendor strategies that need an AMD-native FP4 path.
Workloads already running on ROCm where a CDNA3-to-CDNA4 upgrade preserves software investment.
Pick B200 / B300 when CUDA ecosystem maturity dominates and supply allows.
Pick MI325X if FP4 isn't required and CDNA3 software stability is preferred.
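One concrete input to the choices above is whether a model's weights fit in a single module's 288 GB at a given precision. The sketch below estimates weight footprint only. The 405-billion-parameter model, the 20% reserve for KV cache and activations, and the assumption that every sub-16-bit format carries one 8-bit scale per 32 elements are all illustrative choices, not figures from AMD.

```python
HBM_GB = 288.0  # per-module capacity from the spec table

BITS_PER_WEIGHT = {"bf16": 16.0, "fp8": 8.0, "fp6": 6.0, "fp4": 4.0}


def weight_gb(params_billion: float, fmt: str, mx_scales: bool = True) -> float:
    """Approximate weight footprint in GB (10**9 bytes) for a given element format."""
    bits = BITS_PER_WEIGHT[fmt]
    if mx_scales and fmt != "bf16":
        bits += 8.0 / 32.0  # assumed: one 8-bit shared scale per 32-element block
    return params_billion * bits / 8.0


def fits_one_module(params_billion: float, fmt: str, reserve: float = 0.2) -> bool:
    """True if weights fit after reserving a fraction of HBM for KV cache and activations."""
    return weight_gb(params_billion, fmt) <= HBM_GB * (1.0 - reserve)


if __name__ == "__main__":
    for fmt in BITS_PER_WEIGHT:
        size = weight_gb(405.0, fmt)  # hypothetical 405B-parameter model
        print(f"{fmt}: ~{size:.0f} GB, fits one module: {fits_one_module(405.0, fmt)}")
```

Under these assumptions only the FP4 variant of the hypothetical model fits on one module, which is the kind of result that makes the FP4 path relevant to $/token rather than just peak throughput.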

Pitfalls

Software maturity for CDNA4 FP4 lags Blackwell FP4 — expect a non-trivial integration window.
TDP step-up to ~1,400 W requires substantial cooling and power provisioning.
Public benchmarks were limited at launch; vendor-published figures should be treated cautiously.
MoE-specific kernels (expert routing, token dispatch) may need vendor-side tuning before reaching CUDA parity.
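To make the power pitfall concrete, the following sketch turns the ~1,400 W TDP into a per-node and per-rack estimate. The eight-module node, the 3 kW host overhead, the 94% PSU efficiency, and the 40 kW rack budget are placeholder assumptions for illustration and should be replaced with figures from the actual platform vendor and facility.

```python
TDP_W = 1_400.0  # approximate per-module TDP from the spec table


def node_power_kw(accelerators: int = 8,
                  host_overhead_w: float = 3_000.0,
                  psu_efficiency: float = 0.94) -> float:
    """Estimated wall power for one node in kW (all defaults are assumptions)."""
    if not 0.0 < psu_efficiency <= 1.0:
        raise ValueError("psu_efficiency must be in (0, 1]")
    it_load_w = accelerators * TDP_W + host_overhead_w
    return it_load_w / psu_efficiency / 1_000.0


def nodes_per_rack(rack_budget_kw: float, **node_kwargs) -> int:
    """Whole nodes that fit within a rack power budget."""
    return int(rack_budget_kw // node_power_kw(**node_kwargs))


if __name__ == "__main__":
    print(f"accelerators only: {node_power_kw(8, 0.0, 1.0):.1f} kW")
    print(f"node at the wall:  {node_power_kw():.1f} kW")
    print(f"nodes in a 40 kW rack: {nodes_per_rack(40.0)}")
```

Even before host overhead, eight modules at this TDP draw more than 11 kW, so rack density and cooling method need to be settled before procurement rather than after.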

Software Notes

ROCm 7 is the production target. vLLM and SGLang gained CDNA4 paths in late 2025; PyTorch's ROCm backend includes initial CDNA4 support but profile tuning is ongoing. AMD's Composable Kernel library underpins most of the high-performance attention and GEMM paths.
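Before benchmarking, it helps to confirm which backend and GPU architecture the installed PyTorch build actually sees. The sketch below relies on ROCm builds of PyTorch exposing AMD GPUs through the `torch.cuda` namespace and setting `torch.version.hip`; the `gcnArchName` device property is read defensively because it may be absent on some builds. The helper name `describe_accelerator` is hypothetical, and the function returns empty fields if PyTorch is not installed.

```python
def describe_accelerator() -> dict:
    """Report the PyTorch build, HIP version, device name and arch string if available."""
    info = {"torch": None, "hip": None, "device": None, "arch": None}
    try:
        import torch
    except ImportError:
        return info  # PyTorch not installed; nothing to report
    info["torch"] = torch.__version__
    # torch.version.hip is a version string on ROCm builds and None on CUDA/CPU builds.
    info["hip"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        info["device"] = props.name
        # Present on ROCm builds; read with getattr in case the attribute is missing.
        info["arch"] = getattr(props, "gcnArchName", None)
    return info


if __name__ == "__main__":
    for key, value in describe_accelerator().items():
        print(f"{key}: {value}")
```

Recording this output alongside any benchmark result makes it easier to tell later whether a number came from a tuned CDNA4 path or from a fallback kernel.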

References

AMD Instinct MI350 Series Announcement · AMD
OCP Microscaling Formats Specification · Open Compute Project
