TL;DR
- GPUDirect RDMA lets a third-party PCIe device — typically an RDMA-capable NIC like ConnectX-7 or BlueField-3 — read or write GPU HBM directly via PCIe BAR1, with no copy through host CPU registers, host DRAM, or the GPU driver's bounce buffer.
- Introduced with CUDA 5.0 and Kepler-class GPUs (2012) and hardened across every GPU generation since. Modern stacks use the `nvidia-peermem` kernel module, bundled with the NVIDIA driver since R470 (CUDA 11.4); the older `nv_peer_mem` out-of-tree module is deprecated.
- Eliminates the canonical four-copy round trip (GPU->host DRAM->NIC, NIC->host DRAM->GPU) and the associated CPU and memory-bandwidth tax; NCCL inter-node AllReduce on InfiniBand or RoCEv2 depends on it.
- Correct operation depends on PCIe topology (NIC and GPU should sit under the same PCIe switch when possible), GPU BAR1 size, ACS disabled on intermediate bridges, and a coherent driver/firmware bundle (NVIDIA driver, MLNX_OFED or DOCA-Host).
- On Yobitel NeoCloud, every H100, H200 and GB200 NVL72 node ships with GPUDirect RDMA enabled by default; Yobibyte multi-node training pods rely on it for NCCL collective throughput.
Overview
GPUDirect RDMA is the PCIe peer-to-peer mechanism that lets a third-party PCIe device — almost always an RDMA-capable NIC such as NVIDIA ConnectX-7, BlueField-3 SuperNIC, or BlueField-4 — read from and write to GPU HBM directly, with no detour through host CPU registers or host DRAM. The DMA engine in the NIC targets a PCIe BAR exposed by the GPU; the GPU memory controller serves the request from HBM; the host operating system is involved only to register the buffer up front, not on the data path.
The mechanism matters because distributed training is dominated by collectives that move tensors between GPUs on different hosts. Without GPUDirect RDMA, every inter-node tensor exchange in NCCL would have to copy the tensor from GPU HBM to host DRAM, then from host DRAM through the NIC to the wire, and the reverse on the receive side — four copies and four trips across PCIe per logical transfer. With GPUDirect RDMA, the same exchange is a single peer-to-peer DMA between the source GPU and the NIC, and on the receive side a single peer-to-peer DMA between the NIC and the destination GPU. The host CPU and host DRAM are not on the critical path.
The throughput consequence is substantial. On a representative ConnectX-7 + H100 pair under NCCL at 400 Gb/s line rate, GPUDirect RDMA delivers ~370-390 Gb/s of useful AllReduce payload bandwidth; the host-staged fallback path delivers ~150-200 Gb/s and saturates the host's memory bandwidth long before it saturates the NIC. Yobitel NeoCloud enables GPUDirect RDMA by default on every H100, H200 and GB200 NVL72 node and validates it during burn-in — Yobibyte's managed multi-node training pods would not be able to publish their advertised collective throughput without it.
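A back-of-the-envelope estimate shows why the host-staged path runs into host memory bandwidth before it reaches NIC line rate. The sketch below assumes that every staged byte is written to host DRAM once and read back once, and that each NIC sends and receives at line rate at the same time. The function name `staged_dram_gbytes` and the NIC counts are illustrative, and the results are estimates, not measurements.

```shell
# Rough estimate of host DRAM traffic (GB/s) generated by the host-staged path.
# Assumptions of this sketch: each staged byte costs one DRAM write plus one
# DRAM read, and every NIC is busy in both directions at once.
staged_dram_gbytes() {
  # $1 = NIC line rate in Gb/s, $2 = number of NICs in the host
  awk -v rate="$1" -v nics="$2" 'BEGIN {
    per_direction = rate / 8                       # Gb/s -> GB/s
    printf "%.0f\n", per_direction * 2 * 2 * nics  # (write+read) x (tx+rx) x NICs
  }'
}

staged_dram_gbytes 400 1   # one 400 Gb/s NIC
staged_dram_gbytes 400 4   # four 400 Gb/s NICs
```

Under these assumptions one NIC generates about 200 GB/s of DRAM traffic and four NICs about 800 GB/s, which is on the order of the theoretical DRAM bandwidth of a current two-socket server. With GPUDirect RDMA that term drops out of the data path.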
This entry helps you understand exactly what GPUDirect RDMA is doing under the hood, what to verify before you trust it, the failure modes that silently halve your fabric throughput, and how Yobitel NeoCloud's standard configuration avoids the common pitfalls so you can either trust the managed substrate or replicate the discipline in your own build.
How it works
Mechanically, GPUDirect RDMA stitches three PCIe-level capabilities together: a GPU that exposes its HBM as a PCIe-addressable memory window (the BAR1 region), an RDMA NIC whose DMA engine can target arbitrary PCIe addresses inside the same root complex, and a kernel-side bridge that translates between the GPU's virtual address space and the NIC's PCIe physical address.
GPU BAR1 window. NVIDIA data-centre GPUs expose a configurable portion of their HBM as a PCIe BAR1 region. For H100/H200 with 80-141 GB of HBM, BAR1 defaults to either 128 GB or the GPU's full HBM size, configurable from the host BIOS. Once mapped, that region is addressable by any other PCIe device on the same root complex.
NIC DMA targeting. The RDMA NIC's verbs interface (libibverbs) lets an application register a memory region with `ibv_reg_mr`. When the registered buffer lives in GPU memory rather than host memory, the kernel bridge module resolves the buffer's GPU virtual address into a PCIe BAR1 physical address and returns a memory region handle (`ibv_mr`) the NIC can use. Subsequent RDMA verbs (RDMA Write, RDMA Read, Send, Recv) that reference that handle cause the NIC to DMA directly to or from the GPU's BAR1 region.
The kernel bridge. Two implementations have existed. The older out-of-tree `nv_peer_mem` module was the standard path from 2013 until the CUDA 11.4 era; the newer `nvidia-peermem` module has been bundled with the NVIDIA driver since the R470 branch (CUDA 11.4, 2021) and is the supported path in CUDA 12.x and 13.x. Both expose the same `ibv_reg_mr` semantics; only `nvidia-peermem` is supported on modern driver branches.
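Which of the two bridge modules is loaded can be read from `lsmod`. The following is a minimal sketch: `peermem_bridge` is a hypothetical helper name, and the sample input is illustrative rather than captured from a real host.

```shell
# Classify the GPUDirect RDMA kernel bridge from lsmod-style input on stdin.
# Prints one of: nvidia_peermem | nv_peer_mem | none
peermem_bridge() {
  awk '
    $1 == "nvidia_peermem" { current = 1 }
    $1 == "nv_peer_mem"    { legacy = 1 }
    END {
      if (current)     print "nvidia_peermem"
      else if (legacy) print "nv_peer_mem"
      else             print "none"
    }'
}

# Illustrative input; on a real host run:  lsmod | peermem_bridge
printf '%s\n' \
  'Module                  Size  Used by' \
  'nvidia_peermem         20480  0' \
  'nvidia              56000000  1 nvidia_peermem' | peermem_bridge
```

If the result is `none` on a driver that ships the module, `sudo modprobe nvidia-peermem` loads it. A result of `nv_peer_mem` means the deprecated module is still in use.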
Once these three pieces are in place, NCCL (and any other RDMA-aware library — MPI over UCX, NIXL, custom Verbs code) becomes the consumer. NCCL's `Net` plugin picks the right NIC per GPU based on PCIe topology and uses Verbs RDMA Write / Send to move ring or tree AllReduce payloads directly between GPU HBM endpoints. The application never sees the underlying mechanism.
PCIe topology and locality
GPUDirect RDMA's correctness is universal, but its performance is local. The fundamental rule is that PCIe peer-to-peer DMA is fast when the source and destination sit close to each other in the PCIe tree and slow (or impossible) when they do not. The reasons are mechanical: peer-to-peer between two devices under the same PCIe switch never leaves the switch; peer-to-peer between two devices under the same CPU root complex but different switches traverses the CPU's PCIe controller; peer-to-peer between devices under different CPU sockets has to cross the inter-socket interconnect (Intel UPI, AMD Infinity Fabric), which often delivers only a fraction of PCIe bandwidth for peer-to-peer traffic and on some platforms does not forward it at all.
- On an HGX H100 or H200 baseboard, four NICs and eight GPUs are organised into four PCIe switches; one NIC and two GPUs per switch. NCCL's topology file pins each GPU to the NIC under its own switch — this is rail-optimised mapping.
- On a GB200 NVL72 compute tray, the BlueField-3 SuperNICs sit on the same PCIe switch as their adjacent Blackwell GPUs by design — the rack-scale architecture made this a board-level guarantee.
- Yobitel NeoCloud's H100/H200/GB200 fleet uses NVIDIA reference baseboards, so PCIe topology is correct by construction; the validation step still runs during pre-production burn-in to catch firmware-level surprises.
- On a vanilla 2P Genoa or Sapphire Rapids host with one or two NICs and add-in GPUs, PCIe topology is the operator's problem. `nvidia-smi topo -m` is the first thing to check; it prints a matrix showing the PCIe relationship between every GPU and every NIC.
| NIC-to-GPU PCIe relationship | Behaviour | Throughput cost |
|---|---|---|
| Same PCIe switch (PIX in nvidia-smi) | Optimal — DMA stays within switch | ~95-98% of line rate |
| Same root complex, different switch (PHB) | Works — traverses CPU PCIe controller | ~85-92% of line rate |
| Cross-socket via UPI / Infinity Fabric (SYS) | Often disabled; if enabled, severe penalty | ~30-50% of line rate, when working |
| No common root complex | Fails; verify with `lspci -tv` | N/A |
| Through PCIe ACS-enabled bridge | Redirected via the root complex, or blocked | Severe penalty; 0 where peer-to-peer is blocked |
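To apply the table above to a real host, the GPU-to-NIC relationship codes can be pulled out of the `nvidia-smi topo -m` matrix. The sketch below is one way to do that with `awk`. The helper name `gpu_nic_pairs` is hypothetical, the sample matrix is illustrative, and the parser assumes NIC columns are labelled `NIC<n>` or `mlx...`, which may vary by driver version.

```shell
# List every GPU-NIC pair and its relationship code from `nvidia-smi topo -m`
# style input on stdin. Output: "<gpu> <nic> <code>", one pair per line.
gpu_nic_pairs() {
  awk '
    { gsub(/\033\[[0-9;]*m/, "") }   # strip ANSI styling if the tool emits it
    # Header row: fields 1 and 2 are both device names (a data row has "X"
    # or a relationship code in field 2).
    !have_header && $1 ~ /^GPU[0-9]+$/ && $2 ~ /^(GPU|NIC|mlx)/ {
      for (i = 1; i <= NF; i++) col[i] = $i
      ncol = NF; have_header = 1; next
    }
    # GPU rows: field 1 is the row label, so header column i is row field i+1.
    have_header && $1 ~ /^GPU[0-9]+$/ {
      for (i = 1; i <= ncol; i++)
        if (col[i] ~ /^(NIC|mlx)/) print $1, col[i], $(i + 1)
    }'
}

# Illustrative matrix; on a real host run:  nvidia-smi topo -m | gpu_nic_pairs
gpu_nic_pairs <<'EOF'
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity
GPU0     X      NV18    PIX     SYS     0-47            0
GPU1    NV18     X      SYS     PIX     48-95           1
NIC0    PIX     SYS      X      SYS
NIC1    SYS     PIX     SYS      X
EOF
```

Piping the result through `awk '$3 == "PIX"'` leaves only the same-switch pairs, which are the ones worth binding together.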
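PCIe ACS (Access Control Services) is enabled by default on many server BIOSes for IOMMU isolation. When it is active on a bridge between the NIC and the GPU, peer-to-peer requests are redirected up to the root complex instead of being forwarded directly, which cuts throughput sharply and on some platforms stops peer-to-peer DMA altogether. If GPUDirect RDMA appears to work but throughput is half of expected, suspect ACS first.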
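A quick way to find where ACS is active is to scan `lspci -vv` output for `ACSCtl` lines with `SrcValid+`. The sketch below does this with a hypothetical helper, `acs_enabled_devices`. The sample text is illustrative, and real `lspci` output contains many more lines per device.

```shell
# Print the PCI address of every device whose ACS control register has source
# validation switched on, reading `lspci -vv` style text from stdin.
acs_enabled_devices() {
  awk '
    /^[0-9a-fA-F]/ { dev = $1 }                             # device header line
    /ACSCtl:/ && index($0, "SrcValid+") > 0 { print dev }   # ACS active here
  '
}

# Illustrative input; on a real host run:  sudo lspci -vv | acs_enabled_devices
printf '%s\n' \
  '40:01.1 PCI bridge: Example bridge A' \
  '        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-' \
  '40:03.1 PCI bridge: Example bridge B' \
  '        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-' \
  | acs_enabled_devices
```

An empty result means no scanned device has ACS source validation enabled. Any address that is printed should be checked against the bridges that sit between the GPU and its NIC.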
Variants and architectural choices
GPUDirect is a family of related capabilities. The RDMA variant is the one this entry focuses on; the others are useful adjacent context because they share infrastructure.
- GPUDirect RDMA (this entry): peer-to-peer DMA between a third-party PCIe device (typically an RDMA NIC) and GPU HBM. The path NCCL uses for inter-node collectives.
- GPUDirect P2P: peer-to-peer DMA between two GPUs in the same host, either via PCIe or NVLink. The intra-node alternative to staging through host DRAM.
- GPUDirect Storage (GDS): peer-to-peer DMA between an NVMe device (or NVMe-oF target) and GPU HBM. Used by DALI, cuFile, and NVIDIA Magnum IO for training-data ingestion that bypasses the page cache.
- GPUDirect Async (NCCL 2.18+): allows the GPU to enqueue RDMA operations directly without a host-side CUDA stream synchronisation. Reduces CPU overhead at high collective rates.
- GPUDirect for Video: a related but distinct PCIe peer-to-peer path used by capture cards into GPU memory, separate from the NIC-side RDMA path.
When to use it (always, with conditions)
GPUDirect RDMA is not optional for modern multi-node training. The question is not whether to enable it but whether to allow the fallback path under specific topology conditions.
- Always enable GPUDirect on training pods. The fallback exists for debug, not production.
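- Verify the path NCCL chose: with `NCCL_DEBUG=INFO`, connections that use GPUDirect RDMA are logged with a `GDRDMA` suffix on the transport (for example `via NET/IB/0/GDRDMA`), while host-staged connections show the transport without it.
- Profile before tuning. Use `nccl-tests` with and without GDR (`NCCL_NET_GDR_LEVEL=LOC` disables it) on the same job to quantify the gap.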
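The with/without comparison can be reduced to a single ratio by extracting the summary line that `nccl-tests` prints at the end of a run. The sketch below assumes that line has the form `# Avg bus bandwidth : <value>`. The helper names `avg_busbw` and `gdr_speedup` are hypothetical, and the two sample figures are made up for illustration.

```shell
# Extract the "Avg bus bandwidth" figure (GB/s) from nccl-tests output on stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Ratio of GDR-enabled to host-staged bus bandwidth, two decimals.
# Prints "n/a" when the staged figure is missing or zero.
gdr_speedup() {
  # $1 = busbw with GPUDirect RDMA, $2 = busbw with it disabled
  awk -v gdr="$1" -v staged="$2" 'BEGIN {
    if (staged + 0 <= 0) { print "n/a"; exit }
    printf "%.2f\n", gdr / staged
  }'
}

# Illustrative values; on a cluster, pipe each all_reduce_perf run into avg_busbw.
with_gdr=$(printf '%s\n' '# Avg bus bandwidth    : 46.50 ' | avg_busbw)
without_gdr=$(printf '%s\n' '# Avg bus bandwidth    : 23.25 ' | avg_busbw)
gdr_speedup "$with_gdr" "$without_gdr"
```

A ratio close to 1.0 on an InfiniBand or RoCE job suggests that the run labelled as GDR-enabled was in fact host-staged.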
| Workload | GPUDirect role | Notes |
|---|---|---|
| Multi-node NCCL AllReduce on IB or RoCE | Mandatory | Fallback halves throughput silently |
| Multi-node training across cross-socket NICs | Conditional | Set NCCL_NET_GDR_LEVEL=SYS only after measurement |
| GPUDirect Storage from NVMe-oF target | Recommended | Eliminates page-cache stalls during data loading |
| Single-node training (no NIC involved) | Not applicable | NVLink + GPUDirect P2P handles intra-node |
| LLM inference behind a gateway | Useful for KV-cache transfer | vLLM and NIXL use it for prefill-decode disaggregation |
| Federated / sovereign edge inference | Optional | Depends on whether multi-node collectives are involved |
Trade-offs and known limitations
- BAR1 size is a hard upper bound on the GPU memory that can be registered for RDMA at any one time. On H100 and H200 the default BAR1 is large enough to cover the GPU's full HBM (80 GB on H100, 141 GB on H200), but BIOS misconfiguration can shrink it. Symptom: `ibv_reg_mr` fails with EINVAL when the application tries to register a large tensor.
- PCIe ACS on the bridge between the NIC and the GPU silently degrades or disables peer-to-peer DMA. Disable it per OEM guidance; verify that `lspci -vv | grep ACSCtl` reports `SrcValid- TransBlk-` on the relevant bridges.
- Driver and firmware drift between the NVIDIA driver branch, the CUDA toolkit, the MLNX_OFED (or DOCA-Host) release, and the NIC firmware is the leading cause of silent fallback. Pin a coherent bundle per cluster.
- Cross-socket NICs work in degraded mode at best. Either provision per-socket NICs (canonical) or accept the cross-socket penalty and document it.
- Cgroups v2 on the host can interact poorly with pinned memory regions. RHEL 9 and Ubuntu 24.04 default cgroups v2 require explicit kernel boot flags for some configurations.
- IOMMU passthrough mode (`iommu=pt`) is required for predictable peer-to-peer behaviour. Without it, the IOMMU translates DMA addresses and can subtly serialise traffic.
- GPUDirect Storage to a remote NVMe-oF target requires both ends to support GPUDirect; a vendor-mismatched target may fall back to host-staged copies without warning.
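The BAR1 limit in the first item above can be checked before a job starts. The sketch below reads the BAR1 total from `nvidia-smi -q` style text and compares it with a planned registration size. The helper names `bar1_total_mib` and `fits_in_bar1` are hypothetical, the sample text is illustrative, and the check ignores BAR1 space already in use by other mappings.

```shell
# Read `nvidia-smi -q` style text on stdin and print the BAR1 total in MiB
# for the first GPU listed.
bar1_total_mib() {
  awk '
    /BAR1 Memory Usage/ { in_bar1 = 1; next }
    in_bar1 && $1 == "Total" { print $3; exit }
  '
}

# Succeed if a planned registration of $1 MiB fits in a BAR1 of $2 MiB.
fits_in_bar1() {
  [ "$1" -le "$2" ]
}

# Illustrative input; on a real host run:  nvidia-smi -q | bar1_total_mib
total=$(printf '%s\n' \
  '    BAR1 Memory Usage' \
  '        Total                             : 131072 MiB' \
  '        Used                              : 1 MiB' | bar1_total_mib)
if fits_in_bar1 65536 "$total"; then echo "fits"; else echo "too large"; fi
```

A stricter variant would compare against the `Free` figure rather than `Total`, since other mappings also consume BAR1 space.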
Implementation notes
What follows is the practical configuration and verification path for a new cluster — Yobitel NeoCloud runs essentially this checklist during pre-production burn-in for every new H100/H200/GB200 NVL72 node, and customers building their own clusters can replicate the discipline.
- `nvidia-peermem` has been part of the NVIDIA driver package since the R470 branch (CUDA 11.4). Never install the deprecated `nv_peer_mem` out-of-tree module on a modern driver.
- `NCCL_NET_GDR_LEVEL` sets the maximum topological distance between a GPU and a NIC at which NCCL will still use GPUDirect RDMA. `PIX` is strict (same switch only), `PHB` allows host-bridge crossing, and `SYS` allows cross-socket paths (rarely worth it).
- After every NVIDIA driver upgrade, MLNX_OFED upgrade, or NIC firmware flash, rerun the two-node `ib_write_bw --use_cuda` validation. Silent regressions have shipped before.
- Yobitel NeoCloud customers consuming the platform directly receive nodes with all of the above already validated; Yobibyte customers consume managed multi-node training pods where the entire substrate is opaque and the topology is guaranteed.
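The effect of `NCCL_NET_GDR_LEVEL` can be reasoned about as a simple distance comparison: GPUDirect RDMA is used for a GPU-NIC pair only when the pair is no further apart than the configured level. The sketch below models that rule. The helper names are hypothetical, and treating the `NODE` code from `nvidia-smi` as equivalent to `PHB` is a simplifying assumption rather than documented NCCL behaviour.

```shell
# Rank a topology distance; smaller means closer. The ordering follows the
# NCCL_NET_GDR_LEVEL values LOC < PIX < PXB < PHB < SYS. Mapping NODE onto
# PHB is an assumption of this sketch. Unknown codes rank as 99.
distance_rank() {
  case "$1" in
    LOC) echo 0 ;;
    PIX) echo 1 ;;
    PXB) echo 2 ;;
    PHB|NODE) echo 3 ;;
    SYS) echo 4 ;;
    *) echo 99 ;;
  esac
}

# gdr_expected LEVEL RELATION
# Prints "gdr" if the pair is within the allowed distance, "staged" if it is
# further away, and "unknown" if either argument is not recognised.
gdr_expected() {
  level=$(distance_rank "$1")
  relation=$(distance_rank "$2")
  if [ "$level" -eq 99 ] || [ "$relation" -eq 99 ]; then
    echo unknown
  elif [ "$relation" -le "$level" ]; then
    echo gdr
  else
    echo staged
  fi
}

gdr_expected PIX PIX   # same switch, strict level
gdr_expected PIX PHB   # pair crosses the host bridge, strict level
gdr_expected SYS PHB   # permissive level
```

With `PIX` as the level, any pair that `nvidia-smi topo -m` reports as `PHB` or `SYS` would be expected to fall back to the host-staged path, which is why the level and the measured topology should be read together.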
# 1. Verify the kernel bridge module is loaded
lsmod | grep nvidia_peermem
# Expect: nvidia_peermem 20480 0
# 2. Verify the GPU exposes BAR1 and its size
nvidia-smi -q | grep -A 1 "BAR1 Memory Usage"
# Expect: Total : 131072 MiB (or full HBM size)
# 3. Print the host PCIe topology between NICs and GPUs
nvidia-smi topo -m
# Look for PIX (best) on each GPU-NIC pair you intend to bind
# 4. Verify HCA link state, rate and link layer, then dump device capabilities
ibstat | grep -E "Active|Rate|Link layer"
ibv_devinfo -v | grep -E "device_cap_flags|hca_core_clock"
# 5. Disable PCIe ACS on intermediate bridges (one-time, OEM-specific)
# Example for a Genoa / SP5 host with a setpci-driven disable script:
sudo /opt/nvidia/disable-pcie-acs.sh
# 6. NCCL environment for rail-optimised binding
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_GID_INDEX=3 # RoCEv2 only
export NCCL_NET_GDR_LEVEL=PIX # Strict: use GDR only for same-switch GPU-NIC pairs
export NCCL_DEBUG=INFO
# 7. Two-node bandwidth validation using GPU memory
# Server side:
ib_write_bw --use_cuda=0 -d mlx5_0 -F --report_gbits
# Client side:
ib_write_bw --use_cuda=0 -d mlx5_0 -F --report_gbits <server-ip>
# Expect ~370-390 Gb/s on a 400 Gb/s NDR / RoCEv2 link
# 8. Cluster-wide AllReduce regression
mpirun -np 256 -hostfile hosts \
-x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
./build/all_reduce_perf -b 8M -e 8G -f 2 -g 1
Where it fits in the Yobitel stack
GPUDirect RDMA is one of the invisible-but-essential capabilities that makes the Yobitel NeoCloud GPU fleet usable for distributed training. Every H100, H200 and GB200 NVL72 node in the UK and EU sovereign regions ships with the kernel bridge loaded, BAR1 sized to the full HBM, PCIe ACS disabled on the relevant bridges, and per-rail NIC binding validated during burn-in. The result is that an NCCL job launched on Yobitel NeoCloud sees same-switch peer-to-peer bandwidth on every rail without operator effort.
Yobibyte's managed multi-node training pods inherit this substrate. When a customer submits a fine-tune that spans more than one node, the platform's placement engine routes the job to a pod where rail-locality is intact and GPUDirect is verified active. Customers do not interact with NCCL flags, BAR1 sizing or ACS configuration — they see a managed endpoint with a published throughput number. InferenceBench's published prefill-decode disaggregation numbers also depend on this path; the leaderboard would not be reproducible without it. For customers building their own clusters, the checklist above is the same discipline Yobitel applies internally.