InfiniBand NDR (Next Data Rate)

TL;DR

InfiniBand Next Data Rate (NDR) is the IBTA-defined 400 Gb/s per-port fabric generation: 4 lanes of 100 Gb/s PAM4 with RS-FEC, ratified in IBTA 1.5 (2020) and shipping in volume from 2022 onward.
Quantum-2 (MQM9700, 64 x 400 Gb/s, 51.2 Tb/s ASIC, ~330 ns cut-through) is the canonical NDR switch; ConnectX-7 is the canonical host channel adapter; OSFP twin-port is the connector.
Native RDMA, hardware-offloaded credit-based flow control, lossless behaviour without operator tuning, and SHARPv3 in-network reduction are what let NCCL AllReduce reach ~95 % of line rate at 1,024-GPU scale.
The default fabric inside DGX SuperPOD reference designs and almost every neocloud H100/H200/GB200 build through 2026; competes with 400G Ethernet + RoCEv2 (cheaper) and InfiniBand XDR (faster).
Sizing rule: a 1,024-GPU NDR fat-tree uses 32 leaves + 16 spines of Quantum-2 (~48 switches, ~$0.55-0.75M switch BOM plus ~$1.2-1.6M of optics) — verify against your bill of materials.

Overview

InfiniBand NDR is the seventh speed generation of the InfiniBand standard (after SDR, DDR, QDR, FDR, EDR and HDR), defined by the IBTA in the 1.5 specification and timed to match the Hopper GPU launch. Each NDR port aggregates four lanes of 100 Gb/s PAM4 serial signalling for a line rate of 400 Gb/s per port; an HCA may present 1, 2, or 4 ports, and an aggregated NDR2 link uses 8 lanes for 800 Gb/s on a single cable.

Its relevance to AI infrastructure is direct: training a frontier-scale model requires synchronising gradients across hundreds or thousands of GPUs every optimiser step, and the time spent in those collectives directly bounds training throughput. NDR's combination of 400 Gb/s per port, sub-microsecond switch latency, hardware RDMA, and adaptive routing without packet loss is what keeps gradient AllReduce overhead in the single-digit percent of step time on H100-scale jobs.
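To make the link between port bandwidth and step time concrete, the sketch below applies the textbook ring-AllReduce transfer estimate, t = 2(n-1)/n x S / B. It is a rough model only: it assumes one fabric port per GPU, counts bandwidth time alone (no latency, no overlap with compute, no SHARP offload), and the helper name and example sizes are illustrative rather than measured.

```shell
# Bandwidth-only ring AllReduce estimate: t = 2*(n-1)/n * S / B
# Hypothetical helper. Args: GPU count, payload in GB, per-GPU link rate in Gb/s.
allreduce_seconds() {
  awk -v n="$1" -v gb="$2" -v gbps="$3" 'BEGIN {
    gbytes_per_s = gbps / 8                     # Gb/s -> GB/s
    printf "%.3f\n", 2 * (n - 1) / n * gb / gbytes_per_s
  }'
}

# 10 GB of gradients across 1,024 GPUs, NDR-rate vs HDR-rate link per GPU
allreduce_seconds 1024 10 400   # prints 0.400
allreduce_seconds 1024 10 200   # prints 0.799
```

Halving the per-GPU link rate doubles the estimate, which is why the per-port rate shows up so directly in collective time.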

By mid-2026, NDR is mature, broadly deployed and reasonably priced. It is the default training fabric in NVIDIA DGX SuperPOD reference architectures from 2023 onward, the dominant choice in NVIDIA-Partner neoclouds (CoreWeave, Lambda, Crusoe, Nebius, Yotta), and the safe pick for any new H100/H200/GB200 build that does not have an existing in-house Ethernet/Spectrum-X operations team. Yobitel NeoCloud uses InfiniBand NDR for tightly-coupled training pods in its UK and EU regions, with NCSC OFFICIAL alignment on the sovereign UK footprint.

This entry helps you pick the right AI fabric for your training cluster and understand what Yobitel runs on NeoCloud — what NDR delivers in practice, where it sits versus 400G Ethernet and the newer XDR generation, and how to size a non-blocking fat tree without surprises in the optics bill.

Specifications

Authoritative figures: IBTA-defined NDR line rates plus the practical numbers an operator cares about when sizing a fabric. Throughput, latency, power and reach assume current Quantum-2 silicon and ConnectX-7 endpoints at the stated DOCA / MLNX_OFED levels.

Property	Value	Notes
Per-lane signalling	100 Gb/s PAM4	Modulation: PAM4 with RS(544,514) KP4 FEC
Lanes per standard port	4	Aggregated NDR2 uses 8 lanes for 800 Gb/s
Per-port line rate	400 Gb/s (NDR)	Per IBTA 1.5
Aggregated link (NDR2)	800 Gb/s	8 lanes on a single OSFP cable
Switch ASIC	NVIDIA Quantum-2	MQM9700 series, 51.2 Tb/s aggregate
Switch radix	64 x 400 Gb/s (or 128 x 200 Gb/s split)	Single ASIC, 1U appliance
Switch port-to-port latency	~330 ns (cut-through)	Endpoint-to-endpoint single hop ~600 ns
MPI half round-trip latency	1.0-1.5 us typical	ConnectX-7 to ConnectX-7 through Quantum-2
Host adapter	ConnectX-7 (CX-7)	1, 2, 4 NDR ports; PCIe Gen5 x16
Connector	OSFP twin-port	Single- and twin-port cages co-exist; mark clearly
Passive copper (DAC) reach	~3 m at NDR	1.5 m typical conservative spec
Active optical (AOC) reach	Up to 30 m	Common intra-row cabling
Optics reach	100 m (SR4 MMF), 500 m / 2 km (DR4/FR4 SMF)	Per pluggable transceiver type
Power per port (switch)	12-18 W	ASIC + optic, varies by transceiver
Subnet management	OpenSM or NVIDIA UFM	Mandatory; one active SM per subnet
In-network reduction	SHARPv3	AllReduce/Reduce/Barrier offload
Routing	Adaptive + SHIELD self-healing	Per-packet path selection
Lossless	Yes (credit-based flow control)	No PFC/ECN tuning required
Year ratified	2020 (IBTA 1.5)	First volume shipments 2022
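The headline figures in the table are related by simple arithmetic, sketched below in integer Gb/s. The sketch assumes the 51.2 Tb/s aggregate is a bidirectional total (ports x rate x 2); the helper names are illustrative.

```shell
# Port and switch arithmetic in integer Gb/s; helper names are illustrative.
port_rate_gbps() { echo $(( $1 * $2 )); }              # lanes * per-lane rate
switch_aggregate_gbps() { echo $(( $1 * $2 * 2 )); }   # ports * rate * 2 directions

port_rate_gbps 4 100            # NDR port: prints 400
port_rate_gbps 8 100            # NDR2 link: prints 800
switch_aggregate_gbps 64 400    # 64 x 400 Gb/s: prints 51200
switch_aggregate_gbps 128 200   # 128 x 200 Gb/s split: prints 51200
```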

Architecture: how NDR achieves lossless RDMA

InfiniBand is fundamentally a credit-based, lossless, RDMA-first fabric — and NDR inherits that architecture rather than reinventing it. Three mechanisms together produce the headline behaviour:

Credit-based link-level flow control. Every link in the fabric maintains a count of receive buffers free at the downstream port. A sender cannot transmit a packet until the receiver has advertised enough credits. Buffers therefore never overflow, packets are never dropped due to congestion, and the retransmission tax that lossy Ethernet pays under load simply does not exist. This is the most important difference between InfiniBand and RoCEv2-on-Ethernet.
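The credit mechanism can be illustrated with a toy single-link model. This is a sketch of the idea, not the IBTA wire protocol: the sender transmits only while it holds a credit, and the receiver returns one credit each time it frees a buffer slot. The function name, tick model and parameters are assumptions made for illustration.

```shell
# Toy model of credit-based link-level flow control (not the real wire protocol).
# Args: packets to send, receiver buffer slots, receiver drains one packet every N ticks (N >= 1).
simulate_credits() {
  to_send=$1; credits=$2; drain_every=$3
  buffered=0; delivered=0; dropped=0; tick=0
  while [ "$delivered" -lt "$1" ]; do
    tick=$(( tick + 1 ))
    # The sender transmits only when it holds a credit, so the buffer cannot overflow.
    if [ "$to_send" -gt 0 ] && [ "$credits" -gt 0 ]; then
      to_send=$(( to_send - 1 )); credits=$(( credits - 1 )); buffered=$(( buffered + 1 ))
      if [ "$buffered" -gt "$2" ]; then dropped=$(( dropped + 1 )); fi
    fi
    # The receiver frees a slot every drain_every ticks and returns the credit.
    if [ $(( tick % drain_every )) -eq 0 ] && [ "$buffered" -gt 0 ]; then
      buffered=$(( buffered - 1 )); delivered=$(( delivered + 1 )); credits=$(( credits + 1 ))
    fi
  done
  echo "ticks=$tick delivered=$delivered dropped=$dropped"
}

# Fast sender, slow receiver, 8 buffer slots
simulate_credits 100 8 3   # prints ticks=300 delivered=100 dropped=0
```

Because credits plus buffered packets always equal the slot count, the sender is throttled to the receiver's drain rate and the drop counter stays at zero.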

Hardware RDMA with GPUDirect. Remote Direct Memory Access lets one node read or write the memory of another node without involving the remote CPU. The HCA executes the transfer directly to or from registered memory pages; in a GPU cluster, GPUDirect RDMA extends this to GPU HBM via the PCIe peer-to-peer path, so an AllReduce across 1,024 GPUs never copies a tensor through host DRAM. Placing each ConnectX-7 and the Hopper or Blackwell GPU it serves under the same PCIe Gen5 switch or root complex keeps that peer-to-peer path short, which is what makes this fast in practice.

Adaptive routing with SHIELD self-healing. Quantum-2 chooses among equal-cost paths on a per-packet basis using switch-local congestion telemetry, with the SHIELD mechanism rerouting around link faults within hundreds of nanoseconds. Combined with credit-based flow control, this keeps tail latency tight even when the fabric is running at 80-90 % offered load — the regime where lossy Ethernet's tail behaviour collapses.

Layered on top of these are partition keys (PKeys, the InfiniBand multi-tenancy primitive), service levels and virtual lanes (QoS), and the SHARPv3 in-network reduction engine that runs collective arithmetic inside the switch (see the SHARP entry for the details).

Form factor, cabling and optics

OSFP twin-port is the NDR-era cage. Each cage accepts a twin-port cable carrying two independent 400 Gb/s links, or a single-port optic carrying one. This doubles port density at the switch — a 32-cage 1U switch presents 64 ports — at the cost of meaningful cabling discipline: a twin-port AOC plugged into a single-port host cage will negotiate at half the expected bandwidth and produce no error message.

For intra-rack runs (under ~3 m), passive copper DAC is the cheapest option. Beyond that, active optical cables (AOC) extend reach to ~30 m without the optic-per-end cost, and pluggable transceivers cover the longer reaches. The transceiver bill of materials is usually the largest single line item in an NDR fabric.

Cable / optic	Reach	Approx unit cost (USD, 2026)	Typical use
NDR DAC passive copper, 1-3 m	1-3 m	$280-450	Intra-rack leaf-to-host
NDR AOC active optical, 3-30 m	3-30 m	$1,500-2,800	Adjacent-rack leaf-to-host
OSFP NDR SR4 multimode optic	Up to 100 m	$1,800-2,400 each end	Within a hall, MMF
OSFP NDR DR4 single-mode optic	Up to 500 m	$2,400-3,200 each end	Hall-to-hall SMF
OSFP NDR FR4 single-mode optic	Up to 2 km	$3,500-5,000 each end	Campus SMF
Twin-port AOC (NDR2 800 Gb/s)	3-30 m	$2,800-4,500	Switch-to-switch spine uplinks

Warning: Single-port and twin-port OSFP cages are mechanically identical. A twin-port AOC plugged into a single-port host will link up at 200 Gb/s with no error — verify each link's negotiated rate via ibstat after cabling, not only after the first benchmark run.
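The per-link rate check in the warning above could be scripted roughly as follows. This sketch parses `ibstat`-formatted text from stdin and lists every port whose reported rate is below the expected value. The helper name and the 400 default are assumptions, and it relies on the `CA '...'`, `Port N:` and `Rate:` lines that `ibstat` prints.

```shell
# Flag ports whose negotiated rate is below the expected value.
# Reads ibstat-formatted text on stdin; helper name and default rate are assumptions.
check_rates() {
  awk -v want="${1:-400}" '
    /^CA /             { ca = $2; gsub(/\047/, "", ca) }      # strip the quotes around the CA name
    /Port [0-9]+:/     { port = $2; sub(/:/, "", port) }
    /^[ \t]*Rate:/     { if ($2 + 0 < want) { printf "%s port %s: %s Gb/s\n", ca, port, $2; bad = 1 } }
    END                { exit bad }                           # non-zero exit if any port is slow
  '
}

# On a live host (requires infiniband-diags): ibstat | check_rates 400
```

The non-zero exit status makes the check usable as a gate in a post-cabling acceptance script.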

Software ecosystem: drivers, subnet manager, tooling

Operating an NDR fabric is a software stack as much as a hardware deployment. The minimum supported set per cluster is below; mixing versions across hosts is the most common silent source of fabric flakiness.

MLNX_OFED or DOCA-Host on every endpoint — mlx5_core driver, libibverbs, rdma-core utilities. DOCA-Host (DOCA 2.7+) is the supported path on RHEL 9 / Ubuntu 22.04+ in 2026.
OpenSM (open source) or NVIDIA UFM Enterprise (commercial) as Subnet Manager. UFM adds topology visualisation, fabric-wide telemetry export, predictive failure detection, and Cyber-AI anomaly alerts; OpenSM is fine at single-pod scale.
MLNX-OS as the switch operating system on MQM9700 appliances.
Quantum-2 firmware aligned across all switches in the fabric. Mixed firmware causes sporadic SHARP CollNet failures and silent SHIELD-route rejections.
Diagnostic utilities: ibstat, ibhosts, iblinkinfo, ibdiagnet, ib_write_bw, ib_read_lat. The nccl-tests AllReduce benchmark is the practical fabric validator.
Optional integrations: Prometheus exporter via UFM Telemetry, Grafana dashboards from NVIDIA's dcgm-exporter companion repo, OpenTelemetry traces from training framework profilers (PyTorch Profiler, Nsight Systems).

# Verify NDR link state and rate on each of the host's HCA ports
ibstat | grep -E "Active|Rate"
# Expect per port:
#   State: Active
#   Rate: 400

# Topology sweep + cable health check (run after every firmware bump)
ibdiagnet --pc \
  --get_cable_info \
  --get_phy_info \
  --counter_freq 5 \
  --out_ibdiagnet2_dir /var/log/ibdiagnet

# Point-to-point bandwidth test between two hosts
# server side:
ib_write_bw -d mlx5_0 -i 1 --report_gbits -F
# client side:
ib_write_bw -d mlx5_0 -i 1 --report_gbits -F <server-ip>
# Expect roughly 385-395 Gb/s, i.e. ~96-99 % of the 400 Gb/s line rate.

# Full-cluster fabric validation
mpirun -np 256 -hostfile hosts \
  --mca pml ucx --mca btl ^vader,tcp,openib \
  -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
  ./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
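When reading the all_reduce_perf output, the figure to compare against line rate is bus bandwidth rather than algorithm bandwidth. nccl-tests defines AllReduce busbw as algbw x 2(n-1)/n. The sketch below applies that conversion and expresses the result as a share of the per-GPU link rate, assuming one NIC per GPU; the helper names and sample numbers are illustrative, not measured results.

```shell
# AllReduce bus bandwidth as defined by nccl-tests: busbw = algbw * 2*(n-1)/n
# Helper names are illustrative; bandwidth inputs are GB/s, as printed by all_reduce_perf.
busbw_gbs() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.1f\n", algbw * 2 * (n - 1) / n }'
}

# Share of per-GPU line rate, assuming one NIC of link_gbps per GPU.
line_rate_pct() {
  awk -v busbw="$1" -v link_gbps="$2" 'BEGIN { printf "%.0f\n", 100 * busbw * 8 / link_gbps }'
}

busbw_gbs 24.0 256       # hypothetical algbw of 24 GB/s over 256 ranks: prints 47.8
line_rate_pct 47.5 400   # 47.5 GB/s against a 400 Gb/s port: prints 95
```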

Sizing and capacity planning

NDR fabrics are sized around the AllReduce collective that dominates data-parallel training. The default topology is a non-blocking fat-tree: every leaf has equal up- and down-link bandwidth, so any GPU can sustain full line-rate to any other GPU through the fabric. The sizing table in this section covers the common pod sizes; larger clusters add a third tier.

Rail-optimised cabling: every GPU's HCA is mapped to a specific rail (1, 2, 3, 4 — matching the 4 HCAs per HGX H100 baseboard). Spines are colour-coded per rail. This avoids cross-rail traffic and simplifies cabling audits.
Optics budget rule of thumb: at 1,024-GPU scale, optics cost roughly $1.2-1.6M and dwarf the switch BOM. At 4,096+ GPUs, optics can hit $5-8M and approach the GPU cost on a per-rack basis.
Power: a fully loaded MQM9700 draws ~750 W typical (~1 kW peak). Plan 1 kW per leaf for PDU sizing.
Cooling: switch fans are loud but standard front-to-back airflow; no liquid cooling required at NDR scale (unlike Quantum-3 which often sits in rear-door heat exchangers).
Oversubscription: 2:1 leaf-to-spine oversubscription roughly halves switch and optics cost at the price of capping AllReduce throughput. Acceptable for inference fleets, rarely acceptable for training.
Yobitel NeoCloud's UK training-pod reference design lands on the 1,024-GPU two-tier fat tree (32 Quantum-2 leaves + 16 spines) as the standard sovereign-region building block; larger sovereign tenants land on the 2,048-GPU design and frontier-training reservations extend into the three-tier 4,096-GPU footprint.

Pod size (GPUs)	Topology	Quantum-2 leaves	Quantum-2 spines	Switches total	Approx switch BOM (USD)
64 (8 HGX hosts)	Single-tier	1	0	1	$70-95k
256	Two-tier fat tree (8x8)	8	8	16	$120-160k
512	Two-tier (16x8)	16	8	24	$180-240k
1,024	Two-tier (32x16)	32	16	48	$360-480k
2,048	Two-tier (64x32) full radix	64	32	96	$720-960k
4,096	Three-tier fat tree	128	64 + 32 core	224	$1.7-2.2M
10,000+	Three-tier non-blocking	Sized per design	Sized per design	300+	$2.8M and up
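The two-tier switch counts can be reproduced with a short calculation. The sketch below assumes one fabric port per GPU and a leaf that splits its radix evenly between hosts and spines, and it returns the minimum counts for full bisection bandwidth. It reproduces the 512, 1,024 and 2,048-GPU rows; a design may provision more spines than this minimum (the 256-GPU row lists 8 where the minimum is 4). Function names are illustrative.

```shell
# Minimum two-tier non-blocking fat-tree sizing, one fabric port per GPU.
# radix = ports per switch; half face hosts, half face spines. Names are illustrative.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

fat_tree_two_tier() {
  gpus=$1; radix=${2:-64}
  down=$(( radix / 2 ))          # host-facing ports per leaf
  max=$(( radix * down ))        # two-tier ceiling: radix leaves x radix/2 hosts each
  if [ "$gpus" -le "$radix" ]; then
    echo "leaves=1 spines=0 switches=1 host_links=$gpus fabric_links=0"; return 0
  fi
  if [ "$gpus" -gt "$max" ]; then
    echo "needs a third tier (two-tier limit is $max GPUs)"; return 1
  fi
  leaves=$(ceil_div "$gpus" "$down")
  uplinks=$(( leaves * down ))   # one uplink per downlink keeps the leaf non-blocking
  spines=$(ceil_div "$uplinks" "$radix")
  echo "leaves=$leaves spines=$spines switches=$(( leaves + spines )) host_links=$gpus fabric_links=$uplinks"
}

fat_tree_two_tier 1024 64
# prints leaves=32 spines=16 switches=48 host_links=1024 fabric_links=1024
```

The link counts are the useful by-product: every optical link needs a transceiver at each end, so the fabric_links figure feeds straight into the optics budget.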

Cost and TCO

Switch silicon prices are negotiated; the figures below are list-adjacent indicative ranges in USD for new builds in early-to-mid 2026. Optic prices are softer (more vendors, more competition) and have dropped roughly 25 % year-on-year since 2024.

For a 1,024-GPU NDR build (128 eight-GPU HGX hosts, i.e. 16 groups of 64 GPUs), the indicative full fabric bill of materials is roughly $2.5-3.5M (switches + HCAs + optics + cabling), about 10-15 % of the underlying GPU spend. Quantum-2 is ~30-40 % more expensive per port than equivalent 400G Ethernet, but typically wins on operational simplicity for training-only fabrics.

Line item	Indicative USD price	Notes
Quantum-2 MQM9700 1U switch	$70,000-95,000 per switch	List, before negotiation
ConnectX-7 dual-port NDR HCA	$2,800-4,200 per card	OEM-rebrand pricing varies
NDR DAC 1-3 m	$280-450 per cable	Cheapest option intra-rack
NDR AOC 3-30 m	$1,500-2,800 per cable	Adjacent-rack runs
NDR optic per end (SR4 / DR4 / FR4)	$1,800-5,000 each	Two ends per link
NVIDIA UFM Enterprise licence	~$5,000-8,000 per switch port-year	Optional commercial SM
DOCA-Host subscription	Included with HCA support contract	Per ConnectX-7 part number
MLNX_OFED community	Free	Use for pre-production, lab

Migration and alternatives

NDR sits between two adjacent fabric choices on the bandwidth axis (HDR below, XDR above) and one alternative on the technology axis (Ethernet + RoCEv2). Choose deliberately.

Migrating from HDR to NDR is mostly a swap of leaf and spine appliances plus HCAs and cables — NDR ConnectX-7 fits Gen5 hosts and the cabling system changes from QSFP56 to OSFP. Plan a coupled refresh, not a piecemeal one.
Migrating from NDR to RoCEv2 means re-platforming the operations stack: PFC tuning, ECN/DCQCN tuning, switch buffer sizing, and per-flow telemetry. Budget 6-12 weeks of network-engineering time before the first production training run.
Mixed NDR/HDR subnets are supported via Quantum-2's HDR-rate-capable ports, but any path that crosses an HDR link is capped at HDR bandwidth, so segregate the two generations by topology level where possible.

Alternative	Per-port bandwidth	Typical cost vs NDR	Best for
InfiniBand HDR (Quantum)	200 Gb/s	~50-60 % cheaper per port	Inference fabrics, smaller training pods, legacy A100 builds
InfiniBand NDR (Quantum-2)	400 Gb/s	Baseline	H100/H200/GB200 training pods up to ~16k GPUs
InfiniBand XDR (Quantum-3)	800 Gb/s	~70-90 % more expensive per port	B200/GB200/GB300 frontier-model training above 4k GPUs
400G Ethernet + RoCEv2 (Tomahawk 4)	400 Gb/s	~30-40 % cheaper per port	Sites with strong Ethernet ops; multi-tenant pods; storage fabrics
800G Spectrum-X (Spectrum-4 + BlueField-3)	800 Gb/s	Comparable to NDR per Gb/s, cheaper at scale	Large new builds where the operator wants Ethernet familiarity at XDR-class bandwidth
Slingshot 11 (HPE Cray)	200 Gb/s	Vendor-specific pricing	DOE/HPC exascale procurements, niche outside that segment

Pitfalls and operational notes

Subnet Manager is mandatory and single-active. Run redundant SMs (one primary, one standby) at production scale; UFM HA is recommended above 256 nodes.
Firmware drift is the most common silent failure: keep ConnectX HCA firmware, host MLNX_OFED / DOCA drivers, Quantum switch firmware, and the SM on one validated release set across the fabric. Pin that set per cluster and document the upgrade window.
Cluster verification benchmarks (ib_write_bw, nccl-tests AllReduce) must be re-run after every firmware update — silent regressions in switch microcode have shipped more than once.
Cable lengths and bend radius matter at NDR signalling rates. A single dirty OSFP optic in a 256-node cluster can manifest as random NCCL timeout errors hours into a multi-week training run.
Carry loss-tolerant management traffic on the same HCA as RDMA only via IPoIB on a dedicated PKey; do not bridge generic Ethernet onto the same OSFP cages.
SHARP is enabled per-job. The fact that the fabric supports it does not mean the job is using it; verify NCCL_COLLNET_ENABLE=1 is set and the NCCL log shows Connected ... using CollNet.
OSFP twin-port and single-port cages co-exist on Quantum-2; label every cable end and audit annually.
Cooling: at high port utilisation, MQM9700 fan speed ramps audibly. Ensure rack-level cold-aisle containment to keep ASIC junction temperature below 95 C.

Warning: Mixing optic vendors on a single link (one end Vendor A, the other Vendor B) frequently links up but is not a supported configuration — link-margin issues surface as occasional ibdiagnet symbol-error counters that grow slowly until a flap takes the job down. Standardise optics per fabric.
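Slow symbol-error growth of this kind can be caught by comparing counter snapshots over time. The sketch below is one possible approach and uses `perfquery` from infiniband-diags rather than ibdiagnet, which is an assumption not made in the text above. It extracts `SymbolErrorCounter` from perfquery-style output and fails when the counter grows by more than a chosen threshold. Helper names and the threshold are illustrative.

```shell
# Extract SymbolErrorCounter from perfquery-style text ("SymbolErrorCounter:......12").
symbol_errors() {
  awk -F'[:.]+' '/^SymbolErrorCounter/ { print $2 + 0 }'
}

# Compare two snapshots; non-zero exit when the counter grew by more than max_delta.
symbol_error_growth() {
  before=$1; after=$2; max_delta=${3:-0}
  delta=$(( after - before ))
  echo "delta=$delta"
  [ "$delta" -le "$max_delta" ]
}

# On a live fabric (assumed invocation; <lid> and <port> are placeholders):
#   before=$(perfquery <lid> <port> | symbol_errors)
#   sleep 3600
#   after=$(perfquery <lid> <port> | symbol_errors)
#   symbol_error_growth "$before" "$after" 10 || echo "link margin suspect"
```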

References

InfiniBand Architecture Specification Volume 1, Release 1.5 · InfiniBand Trade Association
NVIDIA Quantum-2 InfiniBand Platform · NVIDIA
NVIDIA ConnectX-7 InfiniBand Adapter Product Brief · NVIDIA
DGX SuperPOD Reference Architecture (H100 / H200 / GB200) · NVIDIA
NVIDIA UFM Enterprise Documentation · NVIDIA

