TL;DR
- Spectrum-X is NVIDIA's Ethernet platform engineered specifically for AI workloads. It pairs the Spectrum-4 switch ASIC (shipped in the SN5600: 51.2 Tb/s, 64 x 800 GbE, sub-microsecond hop) with the BlueField-3 SuperNIC at the endpoint; together they implement AI-tuned RoCEv2 extensions.
- Adds three behaviours that vanilla Ethernet does not have: per-packet adaptive routing (packet spraying with end-host re-ordering), per-flow AI-tuned congestion control that reacts faster than stock DCQCN, and tenant performance isolation.
- The 32K-GPU reference design at 800 Gb/s per port competes directly with InfiniBand XDR, achieving comparable AllReduce throughput at 60-70 % of the InfiniBand bill-of-materials cost.
- Announced May 2023 alongside NVIDIA's Israel-1 supercomputer; broadly available in DGX H100/H200/B200/GB200 reference architectures from 2024 onward as the certified Ethernet path.
- Software stack: Cumulus Linux 5.x on switches, NetQ for fabric telemetry, DOCA 2.7+ on the BlueField-3 endpoints. Day-to-day operation draws on the skills a standard Ethernet team already has.
Overview
Spectrum-X is NVIDIA's answer to the question of whether Ethernet can be made to behave well enough for AI training. The platform combines the Spectrum-4 ASIC (51.2 Tb/s aggregate switching capacity at 800G port speed) with the BlueField-3 SuperNIC on the endpoint side. Together they implement a set of Ethernet extensions — packet-spraying adaptive routing, AI-tuned per-flow congestion control, and performance isolation — that mimic InfiniBand's lossless behaviour while preserving Ethernet's operational familiarity, multi-vendor switch sourcing, and lower per-port cost.
The platform was introduced at Computex 2023 alongside the Israel-1 supercomputer, an internal NVIDIA reference build of 2,048 H100s on a pure Spectrum-X fabric. By 2024 Spectrum-X had become an officially supported alternative to InfiniBand in DGX SuperPOD reference architectures, with hyperscalers (Microsoft Azure ND H200 v5 series, Oracle Cloud OCI Supercluster), several US neoclouds (xAI Colossus, Vultr), and a growing number of sovereign / regional AI clouds standardising on it for new builds. By 2026 the platform is on its second generation of endpoint silicon (BlueField-3 in production, BlueField-4 sampling) and remains the canonical Ethernet-for-AI design.
Spectrum-X is one of the AI-Ethernet options Yobitel evaluates for next-generation NeoCloud regions, alongside InfiniBand XDR. The choice per region is driven by in-region skill availability (Cumulus Linux ops vs InfiniBand specialists) and tenant TCO targets. This entry helps you pick the right AI fabric for your training cluster and understand what Yobitel runs on NeoCloud, including the cost and operational differences from a Quantum-3 XDR build at the same port speed.
Specifications
Authoritative figures for the SN5600 (flagship 800G leaf/spine appliance) and the BlueField-3 SuperNIC (the endpoint). Spectrum-X also includes the SN5400 (400G variant for mixed builds) and the older Spectrum-3 SN4000 family that still ships at 200/400G for storage-tier fabrics.
| Property | SN5600 (Spectrum-4) | BlueField-3 SuperNIC |
|---|---|---|
| Role | Switch (leaf or spine) | Endpoint NIC + DPU |
| Aggregate / per-port bandwidth | 51.2 Tb/s; 64 x 800 GbE | Up to 400 Gb/s (1 x 400 GbE or 2 x 200 GbE) |
| Port count | 64 x 800 GbE (or 128 x 400 GbE split) | 1 or 2 ports |
| Connector | OSFP | QSFP112 |
| Switch latency | ~600 ns hop (cut-through) | n/a |
| NIC PCIe | n/a | Gen5 x16 |
| Silicon | Spectrum-4 ASIC | Arm 16-core A78 + ConnectX-7 inline NIC |
| Form factor | 2U appliance | FHHL or HHHL PCIe; OCP NIC 3.0 |
| Power (typical / max) | 1,000 W / 1,500 W | 75-150 W |
| OS / software | Cumulus Linux 5.x or SONiC | DOCA-Host 2.7+ on host kernel |
| AI-tuned RoCE | Packet spraying + AI ECN | End-host re-ordering, AI CC |
| First shipments | 2023 (SN5600), 2024 (volume) | 2023 |
| Latest firmware (2026) | Cumulus 5.10+, NOS-X 1.x | BSP 4.7+ |
Architecture: what makes Spectrum-X different from generic Ethernet
Spectrum-X is, in essence, RoCEv2 with three AI-specific enhancements that close most of the technical gap to InfiniBand. Operators who already understand vanilla RoCEv2 (see the RoCEv2 entry) need to layer these three behaviours on top.
1. Collective-aware adaptive routing (packet spraying). Spectrum-4 routes RoCEv2 elephant flows (training AllReduce, AllToAll) by spraying packets across all available equal-cost paths on a per-packet basis, then relies on the BlueField-3 SuperNIC at the receiver to put packets back in sequence before delivering them to the host's RDMA engine. The switch distinguishes elephant flows (the training collective) from mice flows (control plane, IPMI, monitoring) by inspecting the RoCEv2 Base Transport Header (BTH); mice continue to use static ECMP hashing for stable ordering. The result: a 4,096-GPU AllToAll that would be bound by its most congested path on a static-ECMP RoCE fabric instead spreads nearly uniformly across all spine uplinks.
2. AI-tuned congestion control. Stock DCQCN is tuned well for storage and microservices traffic, but it under-responds to the bursty, synchronised, correlated traffic of training collectives. The Spectrum-X variant uses BlueField-3's hardware telemetry (per-QP one-way delay, instantaneous receive buffer occupancy, CNP rate) to feed a per-flow rate-control loop that reacts in microseconds rather than milliseconds, much closer to InfiniBand's credit-based behaviour. The loop is tunable via DOCA, but the defaults are already AI-workload-aware.
3. Tenant performance isolation. In a multi-tenant pod (multiple training jobs sharing the same fabric), noisy-neighbour effects on tail latency are the main risk. Spectrum-X uses per-tenant priority queues and dedicated buffer pools on Spectrum-4 to bound how much fabric capacity any one tenant can consume, plus per-tenant ECN marking thresholds. Job A's AllReduce burst no longer adds 500 µs to Job B's tail latency.
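The difference between static ECMP and packet spraying in point 1 can be illustrated with a toy model. This is a minimal sketch, not the Spectrum-4 algorithm: the flow names, the CRC32 hash and the round-robin path choice are all assumptions made for illustration, and real adaptive routing picks paths from live congestion state rather than in strict rotation.

```python
import zlib


def ecmp_path(flow_id: str, n_paths: int) -> int:
    """Static ECMP: hash the flow once, so every packet of the flow takes the same path."""
    return zlib.crc32(flow_id.encode()) % n_paths


def link_loads(flows, n_paths, packets_per_flow, spray):
    """Count packets per uplink under static ECMP (spray=False) or per-packet spraying (spray=True)."""
    loads = [0] * n_paths
    for i, flow in enumerate(flows):
        for seq in range(packets_per_flow):
            # Round-robin stands in for the switch's adaptive per-packet choice (assumption).
            path = (i + seq) % n_paths if spray else ecmp_path(flow, n_paths)
            loads[path] += 1
    return loads


flows = [f"qp-{n}" for n in range(4)]  # four hypothetical elephant flows
ecmp = link_loads(flows, n_paths=8, packets_per_flow=8000, spray=False)
sprayed = link_loads(flows, n_paths=8, packets_per_flow=8000, spray=True)
print("static ECMP:", ecmp)
print("spraying   :", sprayed)
```

With four flows and eight uplinks, static hashing can use at most four uplinks and leaves the rest idle, while spraying places 4,000 packets on every uplink. The price of that even spread is out-of-order arrival at the receiver.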
Above the silicon, the BlueField-3 SuperNIC does the heavy lifting on the endpoint: re-ordering sprayed packets, executing the AI CC algorithm in hardware, and emitting per-flow telemetry to the switch for the closed-loop tuning. A Spectrum-X fabric without BlueField-3 endpoints is just RoCEv2 with extra steps; the magic is the joint optimisation.
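The re-ordering role described above can be modelled conceptually in a few lines. This is only a sketch under stated assumptions: packets are represented as hypothetical `(psn, payload)` tuples, and a dictionary stands in for whatever mechanism the SuperNIC hardware actually uses, which the platform documentation summarised here does not detail.

```python
def reorder_stream(arrivals, first_psn=0):
    """Deliver packets in sequence order from an out-of-order arrival stream.

    arrivals:  iterable of (psn, payload) tuples in arrival order
    first_psn: sequence number the receiver expects first
    Returns (delivered, pending): in-order packets handed to the host,
    and packets still held because an earlier one has not arrived.
    """
    pending = {}
    expected = first_psn
    delivered = []
    for psn, payload in arrivals:
        pending[psn] = payload
        # Release every packet that is now contiguous with what was already delivered.
        while expected in pending:
            delivered.append((expected, pending.pop(expected)))
            expected += 1
    return delivered, pending


# Packets sprayed over different paths arrive as 2, 0, 1, 3.
in_order, held = reorder_stream([(2, "c"), (0, "a"), (1, "b"), (3, "d")])
print(in_order, held)
```

If packet 0 never arrives, nothing is released and later packets accumulate in `pending`. That is the conceptual reason a receiver that cannot re-order correctly stalls or corrupts a sprayed flow.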
Form factor and physical deployment
The SN5600 is a 2U appliance with 64 OSFP cages and front-to-back airflow. Power draw at full 800G utilisation across all 64 ports is roughly 1.5 kW; rack PDU planning should assume 2 kW per leaf. The BlueField-3 SuperNIC ships in FHHL and HHHL PCIe Gen5 x16 form factors as well as OCP NIC 3.0; choice depends on the host chassis.
| Cable / optic | Reach | Approx unit cost (USD, 2026) | Typical use |
|---|---|---|---|
| 800G DAC passive copper | 1-2 m | $420-650 | Intra-rack BlueField-3 to leaf |
| 800G AOC active optical | 3-30 m | $2,400-4,000 | Adjacent-rack runs |
| 800G-DR8 single-mode optic | Up to 500 m | $2,800-3,600 each end | Spine uplinks, hall-to-hall |
| 800G-FR8 single-mode optic | Up to 2 km | $4,500-6,500 each end | Campus interconnect |
| 400G DAC (split mode) | 1-3 m | $280-450 | Mixed-speed access tiers |
| Linear pluggable optics (LPO 800G) | Up to 500 m | $2,200-2,800 each end | Power-sensitive deployments (~50% transceiver power savings) |
Software ecosystem
Spectrum-X switches run NVIDIA Cumulus Linux 5.x or community SONiC — both standard Ethernet NOSes operationally familiar to any Ethernet team. Cumulus is the supported path for AI-fabric features (NetQ telemetry integration, NV-API automation); SONiC supports the silicon but Spectrum-X-specific tuning has fewer turnkey defaults.
- Cumulus Linux 5.x — primary supported NOS. Configured via NV CLI (`nv set / nv config apply`) or NV API (REST/gRPC). Ships with AI-fabric default templates.
- NetQ — fabric telemetry, flow monitoring, validation engine. Replaces what UFM does for InfiniBand; integrates with Prometheus, Grafana, ServiceNow.
- DOCA 2.7+ on BlueField-3 endpoints — the host kernel module + userland that exposes the AI CC algorithm, packet-spraying re-order engine, and per-flow telemetry.
- NVIDIA Air — cloud-hosted digital twin of the fabric for validation before bring-up.
- Optional: NVIDIA Mission Control (NV-MC) — top-of-rack-to-job-submission management for DGX SuperPOD-class Spectrum-X clusters.
- Standards-based observability: Prometheus exporters, OpenTelemetry traces, sFlow/IPFIX flow records. All available via NetQ or directly from Cumulus.
```shell
# Bring up a Spectrum-X leaf-spine pair with default AI-fabric template
# Apply on the leaf:
nv set system aaa user nvue-admin role system-admin password '...'
nv set interface swp1-64 link mtu 9216
nv set interface swp1-32 type swp-leaf-server # downlink to BF-3 host
nv set interface swp33-64 type swp-leaf-spine # uplink to spine
nv set qos roce mode lossless # turnkey AI-fabric defaults
nv set qos roce congestion-control algorithm spectrum-x-cc
nv config apply
# Apply on the spine:
nv set interface swp1-64 link mtu 9216
nv set interface swp1-64 type swp-spine-leaf
nv set qos roce mode lossless
nv set qos roce congestion-control algorithm spectrum-x-cc
nv config apply
# Verify Spectrum-X-specific behaviour
nv show qos roce
nv show platform congestion-control
netq show fabric utilisation
netq show events --severity warning
```

Sizing and capacity planning
Spectrum-X scales from a single-rack pod (8 hosts, 64 GPUs) to a 32,000-GPU AI factory in three tiers. The reference cluster sizes below are NVIDIA-published validated designs; intermediate sizes are linear interpolations.
- Rail-optimised cabling: each BlueField-3 SuperNIC port is mapped to one of 8 rails, with spines colour-coded per rail. Identical to InfiniBand fat-tree cabling discipline.
- Oversubscription: 1:1 (non-blocking) is the default for training; 2:1 spine oversubscription halves spine count for inference / batch fabrics where AllReduce isn't the bottleneck.
- Power: 1.5 kW per leaf at full load, 1.0 kW for spine (less optic density). Plan 2 kW PDU headroom per leaf.
- Cooling: front-to-back airflow; SN5600 fits standard cold-aisle hot-aisle racks. No liquid cooling required at 800G in 2026.
- Yobitel's NeoCloud reference design for Ethernet-preferring sovereign tenants lands on the 1,024-GPU two-tier Spectrum-X build below as the standard footprint, scaling to the 4,096-GPU three-tier shape for frontier-training reservations.
| Pod size (GPUs) | Topology | SN5600 leaves | SN5600 spines | Switches total | BlueField-3 NICs | Indicative fabric BOM (USD) |
|---|---|---|---|---|---|---|
| 64 (8 HGX hosts) | Single-tier | 1 | 0 | 1 | 8 | $70-95k |
| 256 | Two-tier (8x8) | 8 | 8 | 16 | 32 | $120-180k |
| 1,024 | Two-tier (32x16) | 32 | 16 | 48 | 128 | $320-450k |
| 4,096 | Three-tier | 128 | 64 + 32 | 224 | 512 | $1.4-1.9M |
| 8,192 | Three-tier non-blocking | 256 | 128 + 64 | 448 | 1,024 | $2.7-3.5M |
| 16,384 | Three-tier | 512 | 256 + 128 | 896 | 2,048 | $5.2-6.8M |
| 32,000 | Three-tier (NV ref) | 1,000+ | 500 + 250 | 1,750+ | 4,000 | $11-14M |
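The leaf and spine arithmetic behind a two-tier design can be sketched as follows. The function name and parameters are hypothetical, and the model assumes one switch-facing port per GPU endpoint and a 64-port radix with the radix split between downlinks and uplinks, as in the 32-down / 32-up leaf configuration shown earlier. It gives a lower bound under those assumptions, not a substitute for the validated designs in the table.

```python
import math


def leaf_spine_counts(endpoints: int, radix: int = 64, oversub: int = 1) -> dict:
    """Minimum leaf and spine counts for a single-tier or two-tier fabric.

    endpoints: switch-facing endpoint ports (assumed: one per GPU)
    radix:     ports per switch (64 on the SN5600 at 800 GbE)
    oversub:   downlink:uplink ratio per leaf; 1 means non-blocking
               (ratios are approximate when the radix does not divide evenly)
    """
    if endpoints <= radix:
        # Everything fits on one switch: single tier, no spines.
        return {"leaves": 1, "spines": 0, "down_per_leaf": endpoints, "up_per_leaf": 0}
    down = radix * oversub // (oversub + 1)   # leaf ports facing endpoints
    up = radix - down                         # leaf ports facing spines
    leaves = math.ceil(endpoints / down)
    if leaves > radix:
        # Every spine needs at least one port per leaf, so two tiers cap at `radix` leaves.
        raise ValueError("too many leaves for two tiers; a third tier is needed")
    spines = math.ceil(leaves * up / radix)
    return {"leaves": leaves, "spines": spines, "down_per_leaf": down, "up_per_leaf": up}


print(leaf_spine_counts(64))
print(leaf_spine_counts(1024))
print(leaf_spine_counts(1024, oversub=2))
```

Under these assumptions the sketch reproduces the 64-GPU row (one switch, no spines) and the 1,024-GPU row (32 leaves, 16 spines). A non-blocking two-tier fabric tops out at 64 x 32 = 2,048 endpoints, which is why the 4,096-GPU row moves to three tiers; the function raises `ValueError` for that size.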
Cost and TCO versus InfiniBand XDR
Spectrum-X's commercial pitch is comparable AI training performance at 60-70 % of the cost of an equivalent InfiniBand XDR fabric. The figures below are indicative USD ranges for new builds in early-to-mid 2026; negotiated pricing varies meaningfully.
At the same fabric scale, Spectrum-X is consistently 30-35 % cheaper than InfiniBand XDR on bill of materials. AllReduce throughput is within ~5 % at large messages, AllToAll within ~10 % — the gap is small enough that the cost saving wins for most operators with existing Ethernet ops capability. The exception is hyperscale-frontier training (50k+ GPUs) where InfiniBand's slightly tighter tail latency still wins.
| Line item | Spectrum-X 800G | InfiniBand XDR (Quantum-3) 800G | Delta |
|---|---|---|---|
| Switch (1U/2U, 64-port 800G) | $65-90k | $95-140k | -30 to -35 % |
| NIC per host (dual-port 400 or 1x800G) | $3,200-4,800 (BlueField-3 SuperNIC) | $3,800-5,500 (ConnectX-8 IB) | -15 to -20 % |
| Optics per end (800G DR8) | $2,800-3,600 | $3,200-4,200 | -15 to -20 % |
| Switch OS / mgmt | $0-2k per port (Cumulus included) | $5-8k per port-year (UFM Enterprise) | Much cheaper |
| Operator skill | Existing Ethernet team | InfiniBand specialist team | Variable; cluster-specific |
| Full 4,096-GPU fabric BOM (incl optics) | $8-11M | $13-16M | Roughly -30 to -35 % |
| Full 16,384-GPU fabric BOM | $32-42M | $50-65M | Roughly -30 to -35 % |
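The headline saving can be checked directly against the full-fabric rows of the table. The sketch below uses only the table's own ranges (in USD millions) and compares like with like: low end against low end, midpoint against midpoint, high end against high end. The function names are illustrative.

```python
def saving(spectrum_x: float, infiniband: float) -> float:
    """Fractional saving of Spectrum-X relative to InfiniBand."""
    return 1 - spectrum_x / infiniband


def saving_summary(sx_range, ib_range) -> dict:
    """Savings for the low ends, midpoints and high ends of two (low, high) price ranges."""
    sx_lo, sx_hi = sx_range
    ib_lo, ib_hi = ib_range
    return {
        "low_vs_low": saving(sx_lo, ib_lo),
        "mid_vs_mid": saving((sx_lo + sx_hi) / 2, (ib_lo + ib_hi) / 2),
        "high_vs_high": saving(sx_hi, ib_hi),
    }


# Full-fabric BOM ranges from the table above, USD millions.
fabric_bom = {
    "4,096 GPUs": ((8, 11), (13, 16)),
    "16,384 GPUs": ((32, 42), (50, 65)),
}
for name, (sx, ib) in fabric_bom.items():
    summary = saving_summary(sx, ib)
    print(name, {k: f"{v:.0%}" for k, v in summary.items()})
```

At the midpoints the saving works out to about 34 % for the 4,096-GPU fabric and about 36 % for the 16,384-GPU fabric. The 4,096-GPU band spans roughly 31 % to 38 % depending on which ends of the ranges are compared, so negotiated pricing can move the result by several points in either direction.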
Migration paths
Spectrum-X is most often deployed in one of two migration shapes: greenfield (new AI fabric, no legacy), or brownfield replacement of generic Ethernet + RoCEv2. Migration from InfiniBand to Spectrum-X is less common but increasing — driven by TCO at scale.
- Brownfield from generic Ethernet: the win is roughly 25-40 % AllReduce throughput uplift at the same wire speed, plus tail-latency reduction in multi-tenant pods. Justifies the BlueField-3 rollout for most large-fabric operators.
- Brownfield from InfiniBand: the win is 30-35 % TCO savings on the next fabric refresh. Run both fabrics in parallel during cut-over (12-16 weeks); pod-by-pod migration; revalidate every training job's NCCL performance before retiring the IB fabric.
- Spectrum-X is forward-compatible with BlueField-4 (sampling 2026, volume 2027) and the SN6000 Spectrum-5 switch (sampling 2026) — buys an extra hardware generation of headroom.
| Migration from | Effort level | Key risk | Typical timeline |
|---|---|---|---|
| Greenfield | Low | First-time AI-fabric ops learning curve | 8-12 weeks design to production |
| Generic Ethernet + RoCEv2 (Tomahawk) | Medium | BlueField-3 NIC rollout across all hosts | 12-20 weeks (NIC swaps + cluster validation) |
| InfiniBand NDR (Quantum-2) | High | Software re-certification, NIC + switch + cabling swap | 16-24 weeks; usually pod-by-pod cutover |
| InfiniBand HDR (legacy) | High | Coupled refresh: HDR endpoints unsupported on Spectrum-X anyway | Treat as greenfield + decommission |
Pitfalls and operational notes
- Spectrum-X-specific congestion control requires BlueField-3 SuperNIC on every endpoint. ConnectX-7 endpoints still get RoCEv2 with adaptive routing, but the full AI-tuned CC loop needs BF-3.
- Packet spraying assumes the receiver re-orders correctly. A misconfigured BlueField-3 (DOCA version drift) will deliver out-of-order packets to the RDMA engine, which collapses throughput silently. Pin DOCA version per cluster.
- NetQ telemetry retention defaults to 7 days. For incident post-mortem capability, raise to 30-60 days and budget the storage.
- Cable bend radius at 800G is tight; survey rack cabling plans before installation. Dirty optics produce slowly-growing symbol-error counters that flap a port days later.
- Mixed Spectrum-4 and older Spectrum-3 in the same fabric works but caps the slower side; segregate where possible.
- PFC watchdog must be enabled. Spectrum-X's congestion control reduces the need for PFC under normal load, but PFC remains the safety net; a stuck pause frame still kills a port without the watchdog.
- BlueField-3 also runs DOCA services (storage offload, security, host management) — coordinate the AI-fabric DOCA versions with the storage / security teams' DOCA expectations.
When evaluating Spectrum-X versus InfiniBand, run `nccl-tests` AllReduce AND AllToAll at the message sizes and rank counts your real workload uses — synthetic 8 GB AllReduce often makes both look identical; a 64 MB tensor-parallel AllReduce or a 32 MB MoE AllToAll reveals the differences. Decide on real-workload numbers, not headline marketing throughput.
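One way to keep such a comparison repeatable is to generate the benchmark command lines from a small table of workload-sized cases. The sketch below only builds the commands; it does not run them. It assumes an MPI-enabled build of `nccl-tests` under `./build/`, one GPU per MPI rank, and a hypothetical rank count of 64. Host lists, environment variables and launcher options are site-specific and left out.

```python
import shlex


def nccl_test_cmd(binary: str, ranks: int, size: str, iters: int = 50) -> list[str]:
    """Build an mpirun command that runs one nccl-tests binary at a single message size."""
    return [
        "mpirun", "-np", str(ranks),
        binary,
        "-b", size, "-e", size,  # min and max bytes identical: measure exactly one size
        "-g", "1",               # one GPU per MPI rank (assumption)
        "-n", str(iters),        # timed iterations
    ]


# Message sizes taken from the guidance above; the rank count is a placeholder.
cases = [
    ("./build/all_reduce_perf", "64M"),  # tensor-parallel AllReduce
    ("./build/alltoall_perf", "32M"),    # MoE AllToAll
]
commands = [shlex.join(nccl_test_cmd(binary, ranks=64, size=size)) for binary, size in cases]
for line in commands:
    print(line)
```

Running the same generated commands on both fabrics, with the rank count and sizes replaced by those of the real job, keeps the comparison tied to the workload rather than to a headline number.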