TL;DR
- InfiniBand Next Data Rate (NDR) is the IBTA-defined 400 Gb/s per-port fabric generation: 4 lanes of 100 Gb/s PAM4 with RS-FEC, ratified in IBTA 1.5 (2020) and shipping in volume from 2022 onward.
- Quantum-2 (MQM9700, 64 x 400 Gb/s, 51.2 Tb/s ASIC, ~330 ns cut-through) is the canonical NDR switch; ConnectX-7 is the canonical host channel adapter; OSFP twin-port is the connector.
- Native RDMA, hardware-offloaded credit-based flow control, lossless behaviour without operator tuning, SHARPv3 in-network reduction — what lets NCCL AllReduce hit ~95 % of line-rate at 1,024-GPU scale.
- The default fabric inside DGX SuperPOD reference designs and almost every neocloud H100/H200/GB200 build through 2026; competes with 400G Ethernet + RoCEv2 (cheaper) and InfiniBand XDR (faster).
- Sizing rule: a 1,024-GPU NDR fat-tree uses 32 leaves + 16 spines of Quantum-2 (~48 switches, ~$0.55-0.75M switch BOM plus ~$1.2-1.6M of optics) — verify against your bill of materials.
Overview#
InfiniBand NDR is the fifth high-rate generation of the InfiniBand standard (after SDR, DDR, QDR, FDR, EDR, HDR), defined by the IBTA in the 1.5 specification roadmap and timed to match the Hopper GPU launch. Each NDR port aggregates four lanes of 100 Gb/s PAM4 serial signalling for a usable line rate of 400 Gb/s per port; an HCA may present 1, 2, or 4 ports, and an aggregated NDR2 link uses 8 lanes for 800 Gb/s on a single cable.
Its relevance to AI infrastructure is direct: training a frontier-scale model requires synchronising gradients across hundreds or thousands of GPUs every optimiser step, and the time spent in those collectives directly bounds training throughput. NDR's combination of 400 Gb/s per port, sub-microsecond switch latency, hardware RDMA, and adaptive routing without packet loss is what keeps gradient AllReduce overhead in the single-digit percent of step time on H100-scale jobs.
By mid-2026, NDR is mature, broadly deployed and reasonably priced. It is the default training fabric in NVIDIA DGX SuperPOD reference architectures from 2023 onward, the dominant choice in NVIDIA-Partner neoclouds (CoreWeave, Lambda, Crusoe, Nebius, Yotta), and the safe pick for any new H100/H200/GB200 build that does not have an existing in-house Ethernet/Spectrum-X operations team. Yobitel NeoCloud uses InfiniBand NDR for tightly-coupled training pods in its UK and EU regions, with NCSC OFFICIAL alignment on the sovereign UK footprint.
This entry helps you pick the right AI fabric for your training cluster and understand what Yobitel runs on NeoCloud — what NDR delivers in practice, where it sits versus 400G Ethernet and the newer XDR generation, and how to size a non-blocking fat tree without surprises in the optics bill.
Specifications#
Authoritative figures: IBTA-defined NDR line rates plus the practical numbers an operator cares about when sizing a fabric. Throughput, latency, power and reach assume current Quantum-2 silicon and ConnectX-7 endpoints at the stated DOCA / MLNX_OFED levels.
| Property | Value | Notes |
|---|---|---|
| Per-lane signalling | 100 Gb/s PAM4 | Modulation: PAM4 with RS(544,514) KP4 FEC |
| Lanes per standard port | 4 | Aggregated NDR2 uses 8 lanes for 800 Gb/s |
| Per-port line rate | 400 Gb/s (NDR) | Per IBTA 1.5 |
| Aggregated link (NDR2) | 800 Gb/s | 8 lanes on a single OSFP cable |
| Switch ASIC | NVIDIA Quantum-2 | MQM9700 series, 51.2 Tb/s aggregate |
| Switch radix | 64 x 400 Gb/s (or 128 x 200 Gb/s split) | Single ASIC, 1U appliance |
| Switch port-to-port latency | ~330 ns (cut-through) | Endpoint-to-endpoint single hop ~600 ns |
| MPI half round-trip latency | 1.0-1.5 us typical | ConnectX-7 to ConnectX-7 through Quantum-2 |
| Host adapter | ConnectX-7 (CX-7) | 1, 2, 4 NDR ports; PCIe Gen5 x16 |
| Connector | OSFP twin-port | Single- and twin-port cages co-exist; mark clearly |
| Passive copper (DAC) reach | ~3 m at NDR | 1.5 m typical conservative spec |
| Active optical (AOC) reach | Up to 30 m | Common intra-row cabling |
| Optics reach | 100 m (SR4 MMF), 500 m / 2 km (DR4/FR4 SMF) | Per pluggable transceiver type |
| Power per port (switch) | 12-18 W | ASIC + optic, varies by transceiver |
| Subnet management | OpenSM or NVIDIA UFM | Mandatory; one active SM per subnet |
| In-network reduction | SHARPv3 | AllReduce/Reduce/Barrier offload |
| Routing | Adaptive + SHIELD self-healing | Per-packet path selection |
| Lossless | Yes (credit-based flow control) | No PFC/ECN tuning required |
| Year ratified | 2020 (IBTA 1.5) | First volume shipments 2022 |
Architecture: how NDR achieves lossless RDMA#
InfiniBand is fundamentally a credit-based, lossless, RDMA-first fabric — and NDR inherits that architecture rather than reinventing it. Three mechanisms together produce the headline behaviour:
Credit-based link-level flow control. Every link in the fabric maintains a count of receive buffers free at the downstream port. A sender cannot transmit a packet until the receiver has advertised enough credits. Buffers therefore never overflow, packets are never dropped due to congestion, and the retransmission tax that lossy Ethernet pays under load simply does not exist. This is the most important difference between InfiniBand and RoCEv2-on-Ethernet.
Hardware RDMA with GPUDirect. Remote Direct Memory Access lets one node read or write the memory of another node without involving the remote CPU. The HCA executes the transfer directly to or from registered memory pages; in a GPU cluster, GPUDirect RDMA extends this to GPU HBM via the PCIe peer-to-peer path, so an AllReduce across 1,024 GPUs never copies a tensor through host DRAM. The ConnectX-7 plus a Hopper or Blackwell GPU on the same PCIe Gen5 root complex (or NVSwitch-adjacent) is what makes this fast in practice.
Adaptive routing with SHIELD self-healing. Quantum-2 chooses among equal-cost paths on a per-packet basis using switch-local congestion telemetry, with the SHIELD mechanism rerouting around link faults within hundreds of nanoseconds. Combined with credit-based flow control, this keeps tail latency tight even when the fabric is running at 80-90 % offered load — the regime where lossy Ethernet's tail behaviour collapses.
Layered on top of these are partition keys (PKeys, the InfiniBand multi-tenancy primitive), service levels and virtual lanes (QoS), and the SHARPv3 in-network reduction engine that runs collective arithmetic inside the switch (see the SHARP entry for the details).
Form factor, cabling and optics#
OSFP twin-port is the NDR-era cage. Each cage accepts a twin-port cable carrying two independent 400 Gb/s links, or a single-port optic carrying one. This doubles port density at the switch — a 32-cage 1U switch presents 64 ports — at the cost of meaningful cabling discipline: a twin-port AOC plugged into a single-port host cage will negotiate at half the expected bandwidth and produce no error message.
For intra-rack runs (under ~3 m), passive copper DAC is the cheapest option. Beyond that, active optical cables (AOC) extend reach to ~30 m without the optic-per-end cost, and pluggable transceivers cover the longer reaches. The transceiver bill of materials is usually the largest single line item in an NDR fabric.
| Cable / optic | Reach | Approx unit cost (USD, 2026) | Typical use |
|---|---|---|---|
| NDR DAC passive copper, 1-3 m | 1-3 m | $280-450 | Intra-rack leaf-to-host |
| NDR AOC active optical, 3-30 m | 3-30 m | $1,500-2,800 | Adjacent-rack leaf-to-host |
| OSFP NDR SR4 multimode optic | Up to 100 m | $1,800-2,400 each end | Within a hall, MMF |
| OSFP NDR DR4 single-mode optic | Up to 500 m | $2,400-3,200 each end | Hall-to-hall SMF |
| OSFP NDR FR4 single-mode optic | Up to 2 km | $3,500-5,000 each end | Campus SMF |
| Twin-port AOC (NDR2 800 Gb/s) | 3-30 m | $2,800-4,500 | Switch-to-switch spine uplinks |
Single-port and twin-port OSFP cages are mechanically identical. A twin-port AOC plugged into a single-port host will link up at 200 Gb/s with no error — verify each link's negotiated rate via `ibstat` after cabling, not only after the first benchmark run.
Software ecosystem: drivers, subnet manager, tooling#
Operating an NDR fabric is a software stack as much as a hardware deployment. The minimum supported set per cluster is below; mixing versions across hosts is the most common silent source of fabric flakiness.
- MLNX_OFED or DOCA-Host on every endpoint — `mlx5_core` driver, libibverbs, `rdma-core` utilities. DOCA-Host (DOCA 2.7+) is the supported path on RHEL 9 / Ubuntu 22.04+ in 2026.
- OpenSM (open source) or NVIDIA UFM Enterprise (commercial) as Subnet Manager. UFM adds topology visualisation, fabric-wide telemetry export, predictive failure detection, and Cyber-AI anomaly alerts; OpenSM is fine at single-pod scale.
- MLNX-OS as the switch operating system on MQM9700 appliances.
- Quantum-2 firmware aligned across all switches in the fabric. Mixed firmware causes sporadic SHARP CollNet failures and silent SHIELD-route rejections.
- Diagnostic utilities: `ibstat`, `ibhosts`, `iblinkinfo`, `ibdiagnet`, `ib_write_bw`, `ib_read_lat`. The `nccl-tests` AllReduce benchmark is the practical fabric validator.
- Optional integrations: Prometheus exporter via UFM Telemetry, Grafana dashboards from NVIDIA's `dcgm-exporter` companion repo, OpenTelemetry traces from training framework profilers (PyTorch Profiler, Nsight Systems).
# Verify NDR link rates across the host's HCAs
ibstat | grep -E "Active|Rate"
# Rate: 400 (4X NDR)
# Rate: 400 (4X NDR)
# Topology sweep + cable health check (run after every firmware bump)
ibdiagnet --pc \
--get_cable_info \
--get_phy_info \
--counter_freq 5 \
--out_ibdiagnet2_dir /var/log/ibdiagnet
# Point-to-point bandwidth test between two hosts
# server side:
ib_write_bw -d mlx5_0 -i 1 --report_gbits -F
# client side:
ib_write_bw -d mlx5_0 -i 1 --report_gbits -F <server-ip>
# Expect ~395 Gb/s payload, ~96-98 % of line rate.
# Full-cluster fabric validation
mpirun -np 256 -hostfile hosts \
--mca pml ucx --mca btl ^vader,tcp,openib \
-x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1Sizing and capacity planning#
NDR fabrics are sized around the AllReduce collective that dominates data-parallel training. The default topology is a non-blocking fat-tree: every leaf has equal up- and down-link bandwidth, so any GPU can sustain full line-rate to any other GPU through the fabric. The tables below cover the common pod sizes; larger clusters add a third tier.
- Rail-optimised cabling: every GPU's HCA is mapped to a specific rail (1, 2, 3, 4 — matching the 4 HCAs per HGX H100 baseboard). Spines are colour-coded per rail. This avoids cross-rail traffic and simplifies cabling audits.
- Optics budget rule of thumb: at 1,024-GPU scale, optics cost roughly $1.2-1.6M and dwarf the switch BOM. At 4,096+ GPUs, optics can hit $5-8M and approach the GPU cost on a per-rack basis.
- Power: a fully loaded MQM9700 draws ~750 W typical (~1 kW peak). Plan 1 kW per leaf for PDU sizing.
- Cooling: switch fans are loud but standard front-to-back airflow; no liquid cooling required at NDR scale (unlike Quantum-3 which often sits in rear-door heat exchangers).
- Oversubscription: 2:1 leaf-to-spine oversubscription roughly halves switch and optics cost at the price of capping AllReduce throughput. Acceptable for inference fleets, rarely acceptable for training.
- Yobitel NeoCloud's UK training-pod reference design lands on the 1,024-GPU two-tier fat tree above (32 Quantum-2 leaves + 16 spines) as the standard sovereign-region building block; larger sovereign tenants land on the 2,048-GPU design and frontier-training reservations extend into the three-tier 4,096-GPU footprint.
| Pod size (GPUs) | Topology | Quantum-2 leaves | Quantum-2 spines | Switches total | Approx switch BOM (USD) |
|---|---|---|---|---|---|
| 64 (one HGX) | Single-tier | 1 | 0 | 1 | $70-95k |
| 256 | Two-tier fat tree (8x8) | 8 | 8 | 16 | $120-160k |
| 512 | Two-tier (16x8) | 16 | 8 | 24 | $180-240k |
| 1,024 | Two-tier (32x16) | 32 | 16 | 48 | $360-480k |
| 2,048 | Two-tier (64x32) full radix | 64 | 32 | 96 | $720-960k |
| 4,096 | Three-tier fat tree | 128 | 64 + 32 core | 224 | $1.7-2.2M |
| 10,000+ | Three-tier non-blocking | Sized per design | Sized per design | 300+ | $2.8M and up |
Cost and TCO#
Switch silicon prices are negotiated; the figures below are list-adjacent indicative ranges in USD for new builds in early-to-mid 2026. Optic prices are softer (more vendors, more competition) and have dropped roughly 25 % year-on-year since 2024.
For a 1,024-GPU NDR build (single 8-host HGX = 64 GPUs x 16 hosts), the indicative full fabric bill of materials is roughly $2.5-3.5M (switches + HCAs + optics + cabling) — about 10-15 % of the underlying GPU spend. Quantum-2 is ~30-40 % more expensive per port than equivalent 400G Ethernet, but typically wins on operational simplicity for training-only fabrics.
| Line item | Indicative USD price | Notes |
|---|---|---|
| Quantum-2 MQM9700 1U switch | $70,000-95,000 per switch | List, before negotiation |
| ConnectX-7 dual-port NDR HCA | $2,800-4,200 per card | OEM-rebrand pricing varies |
| NDR DAC 1-3 m | $280-450 per cable | Cheapest option intra-rack |
| NDR AOC 3-30 m | $1,500-2,800 per cable | Adjacent-rack runs |
| NDR optic per end (SR4 / DR4 / FR4) | $1,800-5,000 each | Two ends per link |
| NVIDIA UFM Enterprise licence | ~$5,000-8,000 per switch port-year | Optional commercial SM |
| DOCA-Host subscription | Included with HCA support contract | Per ConnectX-7 part number |
| MLNX_OFED community | Free | Use for pre-production, lab |
Migration and alternatives#
NDR sits between two adjacent fabric choices on the bandwidth axis (HDR below, XDR above) and one alternative on the technology axis (Ethernet + RoCEv2). Choose deliberately.
- Migrating from HDR to NDR is mostly a swap of leaf and spine appliances plus HCAs and cables — NDR ConnectX-7 fits Gen5 hosts and the cabling system changes from QSFP56 to OSFP. Plan a coupled refresh, not a piecemeal one.
- Migrating from NDR to RoCEv2 means re-platforming the operations stack: PFC tuning, ECN/DCQCN tuning, switch buffer sizing, and per-flow telemetry. Budget 6-12 weeks of network-engineering time before the first production training run.
- Mixed NDR/HDR subnets are supported via Quantum-2's HDR-rate-capable ports but cap the slowest link's bandwidth across each topology level — segregate where possible.
| Alternative | Per-port bandwidth | Typical cost vs NDR | Best for |
|---|---|---|---|
| InfiniBand HDR (Quantum) | 200 Gb/s | ~50-60 % cheaper per port | Inference fabrics, smaller training pods, legacy A100 builds |
| InfiniBand NDR (Quantum-2) | 400 Gb/s | Baseline | H100/H200/GB200 training pods up to ~16k GPUs |
| InfiniBand XDR (Quantum-3) | 800 Gb/s | ~70-90 % more expensive per port | B200/GB200/GB300 frontier-model training above 4k GPUs |
| 400G Ethernet + RoCEv2 (Tomahawk 4) | 400 Gb/s | ~30-40 % cheaper per port | Sites with strong Ethernet ops; multi-tenant pods; storage fabrics |
| 800G Spectrum-X (Spectrum-4 + BlueField-3) | 800 Gb/s | Comparable to NDR per Gb/s, cheaper at scale | Large new builds where the operator wants Ethernet familiarity at XDR-class bandwidth |
| Slingshot 11 (HPE Cray) | 200 Gb/s | Vendor-specific pricing | DOE/HPC exascale procurements, niche outside that segment |
Pitfalls and operational notes#
- Subnet Manager is mandatory and single-active. Run redundant SMs (one primary, one standby) at production scale; UFM HA is recommended above 256 nodes.
- Firmware drift is the most common silent failure: ConnectX HCAs, Quantum switches, and the SM should be kept on the same MLNX_OFED / DOCA release across the fabric. Pin a release per cluster and document the upgrade window.
- Cluster verification benchmarks (`ib_write_bw`, `nccl-tests` AllReduce) must be re-run after every firmware update — silent regressions in switch microcode have shipped more than once.
- Cable lengths and bend radius matter at NDR signalling rates. A single dirty OSFP optic in a 256-node cluster can manifest as random NCCL timeout errors hours into a multi-week training run.
- Mix lossy management traffic with RDMA on the same HCA only via IPoIB on a dedicated PKey — do not bridge generic Ethernet onto the same QSFP cages.
- SHARP is enabled per-job. The fact that the fabric supports it does not mean the job is using it; verify `NCCL_COLLNET_ENABLE=1` is set and the NCCL log shows `Connected ... using CollNet`.
- OSFP twin-port and single-port cages co-exist on Quantum-2; label every cable end and audit annually.
- Cooling: at high port utilisation, MQM9700 fan speed ramps audibly. Ensure rack-level cold-aisle containment to keep ASIC junction temperature below 95 C.
Mixing optic vendors on a single link (one end Vendor A, the other Vendor B) frequently links up but is not a supported configuration — link-margin issues surface as occasional `ibdiagnet` symbol-error counters that grow slowly until a flap takes the job down. Standardise optics per fabric.
References
- InfiniBand Architecture Specification Volume 1, Release 1.5 · InfiniBand Trade Association
- NVIDIA Quantum-2 InfiniBand Platform · NVIDIA
- NVIDIA ConnectX-7 InfiniBand Adapter Product Brief · NVIDIA
- DGX SuperPOD Reference Architecture (H100 / H200 / GB200) · NVIDIA
- NVIDIA UFM Enterprise Documentation · NVIDIA