TL;DR
- NVIDIA's proprietary GPU-to-GPU interconnect across five generations: NVLink 1.0 (Pascal, 2016, 160 GB/s/GPU) through NVLink 5.0 (Blackwell, 2024, 1.8 TB/s/GPU) — roughly 11x lift over a decade.
- NVSwitch is the crossbar ASIC that turns NVLink from a point-to-point link into an all-to-all fabric — the substrate beneath HGX baseboards (8 GPUs) and NVL72 racks (72 GPUs).
- Per-generation per-GPU bandwidth: P100 160 GB/s, V100 300 GB/s, A100 600 GB/s, H100/H200 900 GB/s, B100/B200/GB200 1.8 TB/s. Each generation roughly doubled bisection bandwidth at iso-radix.
- NVL72 connects 72 GB200 GPUs into a single NVLink coherence domain with 130 TB/s aggregate bandwidth — the largest non-blocking NVLink fabric NVIDIA ships. Above 72 (NVL576 SuperPOD), the topology switches to InfiniBand and collective performance steps down.
- Tensor-parallel training and large-MoE inference are the workloads where NVLink dominates: AllReduce on NVLink runs at 450 GB/s/dir intra-HGX vs ~45 GB/s/dir over InfiniBand NDR — a 10x gap that shows up directly in step time.
Overview#
NVLink and NVSwitch are the silicon that make multi-GPU NVIDIA systems behave like one giant accelerator rather than a cluster of accelerators. NVLink is the per-GPU serial link standard — high-speed lanes that connect a GPU's NVHS PHYs to other GPUs, switches or Grace CPUs. NVSwitch is the crossbar ASIC that turns those point-to-point links into an all-to-all fabric, allowing every GPU on a baseboard or in a rack to DMA into every other GPU's HBM at line rate with no contention.
Why the pair matters: collective operations (AllReduce, AllGather, ReduceScatter, AllToAll) sit on the critical path of every tensor-parallel training step and every multi-GPU inference replica. NVLink 5.0 delivers 1.8 TB/s per GPU and the fifth-generation NVSwitch delivers 130 TB/s aggregate inside NVL72; InfiniBand NDR offers 400 Gb/s (50 GB/s) per port — roughly 30x slower per GPU. That gap is what makes 'one DGX' faster than 'eight standalone GPUs wired with IB' for any workload where collectives are not asymptotically dominated by compute.
This entry is the reference for architects sizing NVLink-bounded systems on Yobitel NeoCloud or partner capacity: full generation table, NVSwitch radix maths, topology shapes (HGX, NVL72, NVL576), what NVLink Sharp / SHARP-v3 offload buys you, the comparison to PCIe + InfiniBand, and the operational signals to watch. Yobitel NeoCloud's DGX H100, HGX H200 and NVL72 pods are sold as full NVLink domains; Yobibyte's scheduler will only place tensor-parallel inference replicas on NVLink-attached GPUs because cross-domain TP is what makes 70B-class chat unhappy. This entry helps you decide which NVLink domain size (8, 32, 72, 256, 576) your workload actually needs and what it costs to oversubscribe it.
Specifications: NVLink generations 1.0 - 5.0#
Authoritative per-generation figures. 'Per-GPU bandwidth' is bidirectional aggregate across all NVLink ports on the GPU. Port counts and per-port speeds matter when comparing partial-bandwidth options (H100 PCIe with the 600 GB/s bridge exposes fewer ports than SXM5) — quote both numbers when sizing.
- Per-port speed doubled at NVLink 5.0 by moving from 100 GT/s to 200 GT/s PAM4 signalling — the port count stayed at 18.
- Bidirectional figures: 900 GB/s on H100 = 450 GB/s/direction. NCCL AllReduce throughput is usually quoted per-direction.
- NVLink-C2C is the Grace-to-Hopper / Grace-to-Blackwell coherent variant at 900 GB/s — same fabric family, different package physics.
- Inside CC-on (Confidential Compute) mode, NVLink traffic is AES-256-GCM encrypted end-to-end; throughput penalty is ~1-2 % at NVLink 4.0+.
| Generation | GPU SKU(s) | Year | Ports / GPU | Per-port speed | Per-GPU BW (bidir) | Signalling |
|---|---|---|---|---|---|---|
| NVLink 1.0 | P100 (SXM2) | 2016 | 4 | 40 GB/s | 160 GB/s | 20 GT/s NRZ |
| NVLink 2.0 | V100 (SXM2/3) | 2017 | 6 | 50 GB/s | 300 GB/s | 25 GT/s NRZ |
| NVLink 3.0 | A100 (SXM4) | 2020 | 12 | 50 GB/s | 600 GB/s | 50 GT/s NRZ |
| NVLink 4.0 | H100 / H200 (SXM5) | 2022 | 18 | 50 GB/s | 900 GB/s | 100 GT/s PAM4 |
| NVLink 5.0 | B100 / B200 / GB200 | 2024 | 18 | 100 GB/s | 1,800 GB/s | 200 GT/s PAM4 |
When a vendor says '900 GB/s NVLink' they almost always mean bidirectional aggregate. For collective bandwidth maths, halve it. NCCL benchmarks report busBW which already accounts for the algorithmic factor — do not also halve busBW.
Architecture: NVSwitch ASIC and the all-to-all fabric#
NVSwitch turned NVLink from a point-to-point cable standard into a switched fabric. The first-generation NVSwitch shipped with DGX-2 (16 V100s, 2018) — a single 18-port crossbar wired six per board to give every GPU a non-blocking line to every other. The second generation (DGX A100, HGX A100) used six 36-port NVSwitches. The third generation (DGX H100, HGX H100/H200) used four 64-port NVSwitches per HGX baseboard. The fourth generation (GB200 NVL72) uses nine NVSwitch trays per rack, each with two switch chips, exposing 72 NVLink ports per switch chip — enough to wire 72 GPUs all-to-all at full radix with cabled NVLink Switch backplane.
An HGX-H100 / H200 baseboard places 8 GPUs alongside 4 NVSwitch chips. Each GPU's 18 NVLink ports are split across the 4 switches; every switch connects to every GPU on the board. Any GPU can DMA into any other GPU's HBM at 450 GB/s/direction with no fabric contention — the foundation of why intra-baseboard NCCL AllReduce hits 380-440 GB/s busBW.
NVL72 (Blackwell era) extends this to 72 GPUs in one NVLink domain via 9 NVSwitch trays and a copper-cable backplane. Aggregate bandwidth: 130 TB/s. Bisection: 65 TB/s. AllReduce at NVL72 scale stays within 8-12 % of intra-HGX line rate — for the first time, 72 GPUs behave like one shared-memory machine. Above 72 GPUs (NVL576 SuperPOD), the topology splits across InfiniBand and collective performance steps down by an order of magnitude — that step is the most important number in any large training plan.
| NVSwitch gen | System | Ports/chip | Per-port speed | GPUs in fabric | Aggregate BW |
|---|---|---|---|---|---|
| 1st (NVSwitch) | DGX-2 | 18 | 50 GB/s | 16 (V100) | 2.4 TB/s |
| 2nd (NVSwitch) | DGX A100 / HGX A100 | 36 | 50 GB/s | 8 (A100) | 4.8 TB/s |
| 3rd (NVSwitch) | DGX H100 / HGX H100 | 64 | 50 GB/s | 8 (H100) intra-baseboard, up to 256 with NVLink Switch System | 7.2 TB/s intra, 57.6 TB/s NVL Switch |
| 4th (NVSwitch) | GB200 NVL72 | 72 | 100 GB/s | 72 (GB200) | 130 TB/s |
| 5th (NVSwitch) | GB300 NVL72 | 72 | 100 GB/s | 72 (GB300) | 130 TB/s (+SHARP-v3) |
Topology shapes: HGX, NVL72, NVL576#
Most NVLink-aware sizing questions reduce to 'which domain size?'. The three shapes that matter in 2026 are: HGX baseboard (8 GPUs, the workhorse), NVL72 (72 GPUs in one coherent rack), and NVL576 (8 NVL72 racks stitched with InfiniBand/Quantum-X800). Knowing which one you need before you procure changes the sticker price by 2-4x.
- HGX (8 GPUs): the default. Tensor parallel up to TP=8 with no penalty; pipeline-parallel across baseboards over InfiniBand. Covers Llama 3 70B, Mixtral 8x22B, most 100B-class training and all inference.
- NVLink Switch System (32-256 GPUs, Hopper era): external NVLink switch trays stitched to HGX baseboards. 57.6 TB/s bisection at 256 GPUs — the DGX SuperPOD H100 substrate.
- NVL72 (72 GPUs, Blackwell era): single rack, copper backplane, 130 TB/s aggregate. TP up to 72; EP (expert-parallel for MoE) at full radix. The substrate for GPT-class frontier training.
- NVL576 (576 GPUs across 8 NVL72 racks): inter-rack stitching via Quantum-X800 InfiniBand. Collectives that span racks pay 5-10x latency uplift; size with this step in mind.
- Above NVL576 (1k+ GPUs): scale-out InfiniBand or Spectrum-X RoCE — the topology is no longer NVLink-coherent.
Yobitel NeoCloud sells NVLink domains by rack size, not by GPU count. Asking for '64x H100' will get you 8 HGX baseboards stitched with NDR (good for data-parallel, bad for TP>=8); asking for 'one NVLink domain, 64 GPUs' will land you on a SuperPOD H100 slice. State the domain shape, not just the count.
Topology + bandwidth math (collective sizing)#
Useful per-collective rules of thumb. These are the back-of-envelope numbers used to decide whether a model 'fits' a given NVLink shape; they assume well-tuned NCCL with NVLS enabled where applicable.
- AllReduce on H100 SXM5 intra-HGX (8 GPUs): 380-440 GB/s busBW with NCCL_ALGO=NVLS. ~3.5-4 % of step time on Llama 3 70B BF16 TP=8.
- AllReduce H100 across NVLink Switch System (256 GPUs): 280-330 GB/s busBW. Pipeline parallelism becomes preferable past 32 ranks.
- AllReduce GB200 NVL72: 750-900 GB/s busBW intra-rack. AllToAll on MoE dispatch saturates at ~600 GB/s busBW per GPU.
- AllReduce over InfiniBand NDR (cross-rack): 40-50 GB/s busBW per GPU — roughly 10x slower than intra-NVLink. Plan PP across the IB boundary.
- AllGather and ReduceScatter inherit AllReduce numbers (they are AllReduce primitives). AllToAll is 30-50 % lower than AllReduce on the same fabric.
- Sizing rule: if a collective consumes >15 % of step time, change topology (move workload inside an NVLink domain) before optimising kernels.
# Sanity-check NVLink topology and per-GPU bandwidth on a node.
# 1) Per-GPU NVLink port status (all 18 ports should be 'Active' on SXM5)
nvidia-smi nvlink --status -i 0
# GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-...)
# Link 0: 26.562 GB/s Link 1: 26.562 GB/s ... Link 17: 26.562 GB/s
# 2) Pairwise topology — confirm NV# (NVLink) not PHB (PCIe-host-bridge) between GPUs
nvidia-smi topo -m
# GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
# GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18
# GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18
# 3) NCCL bandwidth test (run inside a node, then across nodes)
git clone https://github.com/NVIDIA/nccl-tests && cd nccl-tests && make
./build/all_reduce_perf -b 8M -e 8G -f 2 -g 8
# busBW column is the number that matters; expect 380-440 GB/s on intra-HGX H100.Form factor, power and thermal#
NVLink itself adds modest direct power — the per-port PHY draws ~3-5 W and the NVSwitch ASICs ~300-400 W per chip on H100-era HGX, ~700-800 W per chip on NVL72-era 5th-gen NVSwitch. The thermal headline is the rack itself: NVL72 dissipates ~120 kW per rack including the 72 GB200 GPUs, the 9 switch trays and the Grace CPUs, which forces liquid cooling (direct-to-chip DLC at 25-32 C supply temperature).
Practical implications for build-vs-buy: NVL72 is not a chassis you retrofit into an air-cooled cage. NeoCloud's UK and EU NVL72 regions land in purpose-built liquid-cooled halls; partner capacity that quotes 'NVL72 ready' but air-cooled is selling you a derated TDP.
- HGX H100/H200: air-cooled baseboard fits in standard 4U / 6U chassis (DGX H100 is 8U); 4 NVSwitch chips dissipate ~1.2-1.6 kW combined.
- NVL72: liquid-cooled rack at 120 kW; 9 NVSwitch trays dissipate ~14 kW; copper backplane is the cable plant.
- Cable-driven NVLink (between two HGX baseboards via the optional NVLink Switch tray, Hopper era): contributes ~5 W/m active copper drive on the GPU side.
- Confidential Compute (CC-on) mode encrypts NVLink traffic; CPU- and PHY-side overhead adds ~1-2 % thermal load, not a redesign concern.
Software ecosystem: NCCL, MPI and NVLink Sharp#
NCCL (NVIDIA Collective Communications Library) is the canonical NVLink-aware collective library used by PyTorch DDP, Megatron-LM, DeepSpeed, FSDP, vLLM tensor-parallel, TensorRT-LLM multi-GPU and almost every modern training stack. NCCL detects NVLink and NVSwitch automatically and picks the right algorithm (Ring, Tree, NVLS/NVLink Sharp, CollNet) per message size.
NVLink Sharp (NVLS) — exposed in NCCL 2.20+ — offloads AllReduce reductions onto NVSwitch silicon. The switch performs the reduction in-network so each GPU sends 1/N of the buffer and receives a fully-reduced result, halving the total NVLink traffic. SHARP-v3 (NVSwitch 5th gen on GB300 NVL72) extends this to lower-precision reductions (FP8, FP16) for further bandwidth savings.
CUDA-Aware MPI (Open MPI 4+, MVAPICH-GDR) leverages NVLink transparently for GPU-resident sends. NCCL is preferred for collectives; MPI for point-to-point patterns and legacy HPC code paths.
- NCCL_ALGO=NVLS forces NVLink Sharp algorithm — measurable on H100 SXM5 AllReduce above 256 MB.
- NCCL_TOPO_FILE=<path>: explicit topology XML overrides probe failures on mixed NVLink/IB fabrics.
- NCCL_P2P_LEVEL=NVL: fails loudly if a P2P transfer would fall back from NVLink to PCIe — set this in production.
- NCCL_IB_HCA, NCCL_IB_GID_INDEX: required when extending past one NVLink domain over RoCE/IB.
- NCCL_DEBUG=INFO logs the chosen algorithm and topology path per collective — first thing to grep when collectives feel slow.
Mixed-vendor topologies (NVIDIA + AMD or NVIDIA + Intel accelerators in the same job) cannot use NVLink. Plan multi-vendor jobs around RoCE/InfiniBand collectives; expect 10-30x slowdown vs intra-NVLink AllReduce.
Sizing: which NVLink domain size your workload needs#
Decision tree we use internally when scoping NVLink footprints. Match the workload to the smallest domain that holds it — over-buying NVLink is the most expensive over-spec error in 2026 GPU capacity planning.
- Yobitel NeoCloud sells NVLink domains in three SKUs: HGX-8 (single baseboard), SuperPOD-H100 (256 GPUs, Hopper), and NVL72 (Blackwell GB200/GB300). Pricing is per-GPU-hour with a per-rack reservation surcharge for SuperPOD/NVL72.
- Above 256 GPUs on Hopper or 72 GPUs on Blackwell, the InfiniBand step changes the maths — quote your largest collective tensor and the cross-rack RTT before sizing.
- Yobibyte schedules tensor-parallel inference replicas onto NVLink-attached nodes only; cross-domain TP is blocked at admission, which removes the most common silent slowdown.
- For data-parallel-only training (DP=N, TP=1), NVLink is overprovisioned — InfiniBand or RoCE pods are 30-50 % cheaper at iso-throughput. Use the smallest NVLink domain that fits your TP factor, then scale out over IB.
| Workload | Recommended NVLink domain | Why |
|---|---|---|
| LLM inference, model fits on 1 GPU (7B-13B FP8) | Any (no NVLink needed) | Replicas are independent; PCIe is fine. |
| LLM inference 34B-70B FP8, TP=2 | HGX (intra-baseboard NVLink) | AllReduce on attention every layer; NVLink-pinned replica is the only viable shape. |
| LLM inference 70B-405B FP8, TP=4 or TP=8 | HGX (8 GPUs) | Stay inside one NVSwitch crossbar. |
| MoE inference (Mixtral 8x22B, DeepSeek-V3) | HGX or NVL72 | AllToAll dispatch is bandwidth-hungry; NVL72 needed past ~64 experts. |
| LLM fine-tune 70B QLoRA | 1-2 GPUs (no NVLink needed) | Adapter-only updates; DDP over PCIe sufficient. |
| LLM fine-tune 70B full FP16/BF16 | HGX (8 GPUs) with FSDP | ZeRO-3 shards across 8 GPUs; NVLink AllGather every layer. |
| LLM pre-train 70B from scratch | DGX SuperPOD (32-256 GPU NVLink + IB) | TP=8 intra-NVLink + DP across racks over IB. |
| LLM pre-train 405B-1T | NVL72 or NVL576 | TP=8 or TP=72 intra-rack; PP and DP across racks. |
| Frontier MoE pre-train (1T+ active params) | NVL576 SuperPOD | Expert-parallel at 72-rank radix demands NVL72; DP across racks. |
Cost and TCO#
Direct NVLink-only pricing is not a thing customers buy — you buy GPUs and inherit the fabric. But the choice of NVLink domain size is a meaningful TCO lever because larger domains command a per-GPU premium for the switch silicon and copper plant.
- Rule of thumb: NVLink premium pays for itself when TP>=4 collectives consume >10 % of step or decode time; below that, you are buying fabric you do not use.
- Spot/preemptible NVL72 capacity does not exist — the rack is monolithic. SuperPOD-256 occasionally surfaces on hyperscaler spot at ~40 % discount.
- Yobitel Omniscient Compute indexes NVLink domain shape (not just GPU count) when arbitrating capacity across providers — quoting 'NVL72 with TP=72' vs '8 HGX baseboards with TP=8' yields very different placements.
| NVLink domain | Typical $/GPU-hr (on-demand) | Reserved (3yr) | Premium vs PCIe | Yobitel NeoCloud equivalent |
|---|---|---|---|---|
| PCIe (no NVLink) | $1.60-2.20 (H100 PCIe) | $0.80-1.10 | Baseline | NeoCloud GPU Cloud, PCIe SKU |
| HGX-8 (intra-baseboard NVLink) | $2.20-3.00 (H100 SXM5) | $1.00-1.40 | +25-40 % | NeoCloud DGX/HGX-H100 (UK + EU regions) |
| SuperPOD-256 (NVLink Switch System, Hopper) | $2.60-3.40 | $1.20-1.60 | +50-70 % | NeoCloud SuperPOD-H100 (reserved only) |
| NVL72 (GB200, Blackwell) | $5.50-7.50 per GB200 | $3.00-4.20 | +100-200 % | NeoCloud NVL72 (committed only) |
| NVL576 SuperPOD (GB300) | Committed only | Bespoke | +150-300 % | Yobitel-managed frontier capacity |
Migration and alternatives#
NVLink alternatives, in increasing order of distance from the NVIDIA stack: PCIe Gen5 + bridge (still NVIDIA), InfiniBand NDR/XDR (vendor-neutral), AMD Infinity Fabric (AMD-only), Intel Xe Link (Intel-only), and Spectrum-X RoCE (NVIDIA ethernet alternative). None match NVLink intra-domain bandwidth — the choice is about scale-out economics, not scale-up performance.
| Alternative | Per-GPU BW | Scope | When it makes sense | When it does not |
|---|---|---|---|---|
| PCIe Gen5 x16 | 128 GB/s | Inside one host | Inference replicas that fit on 1 GPU; CPU<->GPU staging | Anything multi-GPU collective-heavy |
| NVLink bridge (H100 PCIe) | 600 GB/s pair | Two cards in one chassis | Drop-in 2x H100 inference replica | Beyond 2 GPUs |
| InfiniBand NDR (400 Gb/s) | 50 GB/s per port | Inter-host, multi-vendor | Data parallel, cross-rack stitching | Tensor parallel >=4 |
| InfiniBand XDR (800 Gb/s) | 100 GB/s per port | Inter-host, frontier scale-out | NVL576+ topology | Intra-rack collective hot path |
| Spectrum-X RoCE (800 Gb/s) | 100 GB/s per port | Ethernet-only DC fabric | Multi-vendor or ethernet-mandate sites | Same as IB for collective patterns |
| AMD Infinity Fabric (MI300X) | 896 GB/s bidir | MI300X 8-GPU OAM | All-AMD stacks; ROCm-native code | Mixed NVIDIA/AMD; CUDA-only kernels |
| UALink (open consortium) | TBD (post-2026) | Multi-vendor | Future open-fabric strategies | Production through 2026 |
Pitfalls / operational notes#
Operational issues that show up on NVLink/NVSwitch fabrics in production, ranked by frequency.
- Single NVLink port down: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL drops from ~900 GB/s to ~850 GB/s on H100. Usually a mezzanine reseat; persistent failures need RMA.
- NCCL falls back to PCIe silently: cluster shows full GPU utilisation but collective bandwidth is 1/20th expected. Cause: missing or stale NCCL topology XML on a heterogeneous (NVLink + IB) fabric. Fix: regenerate with `nccl-topo-dump`, set `NCCL_TOPO_FILE`.
- Cross-baseboard tensor parallel: the most common silent slowdown. Two GPUs on different HGX boards behind the same IB fabric — collective hops over PCIe + IB instead of NVLink. Verify with `nvidia-smi topo -m` (expect `NV#`, not `PHB` or `SYS`).
- NVSwitch thermal: NVSwitch chips dissipate 300-800 W each; rack inlet temperature spikes throttle the fabric before the GPUs notice. Monitor switch tray exhaust temps, not just GPU die.
- Mixing NVLink generations: technically supported but the slowest link bounds the binding bandwidth. A V100 added to an H100 job is now a V100 job from the AllReduce perspective.
- CC-on (Confidential Compute) does not span NVL72 in 2026 — the encryption keys are per-host. NVL72 jobs that need attested CC must shard at the HGX boundary.
- NVLink Switch firmware updates are disruptive — the fabric goes dark on flash. Schedule with maintenance windows; do not field-patch on production training runs.
Where this fits in the Yobitel stack#
NVLink topology is the substrate every Yobitel-managed GPU product is sized against. Yobitel NeoCloud sells three NVLink shapes — HGX-8 (single-baseboard, the workhorse), SuperPOD-H100 (256-GPU Hopper NVLink Switch System) and NVL72 (Blackwell GB200/GB300, committed only) — and pricing is published per-GPU-hour with a domain-shape surcharge so customers can model FinOps before booking. The UK NCSC OFFICIAL-aligned and EU sovereign regions both ship HGX-H100/H200 capacity; NVL72 is currently available in the UK Tier-III hub.
Yobibyte's scheduler is NVLink-topology-aware by design: tensor-parallel inference replicas (TP>=2) are admitted only onto nodes that share an NVLink crossbar, which is what eliminates the cross-baseboard silent slowdown that is otherwise the most common production incident on multi-GPU LLM inference. Customers describe replicas in workspace terms (model name, replicas, region pin); Yobibyte translates that into HGX-locality constraints under the hood.
Omniscient Compute treats NVLink domain shape as a first-class search dimension when arbitrating capacity across Yobitel NeoCloud and partner clouds — a request for '8x H100 with TP=8 inside one NVLink domain' resolves to a different SKU than '8x H100 with TP=1 across 2 nodes'. InferenceBench's open-source benchmark harness publishes per-NVLink-shape throughput tables for every covered model so customers can map InferenceBench numbers directly to NeoCloud SKUs without re-deriving the topology premium.
References
- NVIDIA NVLink and NVSwitch Overview · NVIDIA
- GB200 NVL72 Whitepaper · NVIDIA
- DGX H100 Architecture Whitepaper · NVIDIA
- NCCL Documentation · NVIDIA
- NVLink Sharp (NVLS) algorithm overview · NVIDIA Developer Blog
- Hopper Architecture Whitepaper (NVSwitch 3rd gen) · NVIDIA