TL;DR
- RoCEv2 encapsulates InfiniBand transport headers inside UDP/IP packets (UDP destination port 4791), letting RDMA verbs run over routed IP/Ethernet fabrics with InfiniBand semantics.
- Defined by IBTA Annex A17 (2014); requires lossless or near-lossless Ethernet via PFC (IEEE 802.1Qbb), ECN (RFC 3168), and DCQCN (Microsoft/Mellanox, SIGCOMM 2015) to perform at scale.
- Implemented by most modern data-centre NICs: NVIDIA ConnectX-6/7/8, BlueField-2/3/4 DPUs and SuperNICs, AMD Pensando, Intel E810, and Broadcom Thor 2. AWS Nitro/EFA is the notable exception: it runs AWS's own SRD transport rather than RoCEv2.
- Forms the basis of most Ethernet GPU fabrics in production, including NVIDIA Spectrum-X and Broadcom Tomahawk-5-based AI fabrics. AWS SRD and the Ultra Ethernet Consortium (UEC) 1.0 effort target the same RDMA-over-Ethernet problem with transports of their own.
- Typical 400G port reaches 380-395 Gb/s payload at near-zero packet loss when PFC/ECN/DCQCN are correctly tuned; collapses to 50-100 Gb/s with congestion-loss-driven retransmissions when they are not.
Overview
RDMA over Converged Ethernet version 2 (RoCEv2) is how RDMA gets carried over commodity Ethernet at data-centre scale. The first version of RoCE was a pure Layer 2 protocol — RDMA frames encapsulated in Ethernet, non-routable, confined to a single broadcast domain. RoCEv2 added UDP/IP encapsulation so the same RDMA semantics work across routed IP fabrics.
For AI infrastructure, RoCEv2 is the protocol that makes Ethernet a credible competitor to InfiniBand. It carries the same InfiniBand transport headers, supports the same Verbs API, and — given a properly tuned lossless or near-lossless Ethernet underlay — delivers comparable latency and throughput to native InfiniBand. The trade-off is the operational complexity of running lossless Ethernet: PFC, ECN, and DCQCN must all be configured and tuned correctly, and the network team needs the skills to diagnose pause storms, ECN mis-marking, and head-of-line blocking when something goes wrong.
By 2026, RoCEv2 underpins most Ethernet AI fabrics in production. NVIDIA's Spectrum-X (Spectrum-4 ASIC + BlueField-3 SuperNIC) is a RoCEv2 platform with vendor-specific extensions. Broadcom's Tomahawk-5-based AI fabrics from Arista, Cisco, and Juniper are RoCEv2 platforms. AWS EFAv2 takes a different route with AWS's proprietary SRD transport over UDP/IP. The Ultra Ethernet Consortium specification that aims to supplant RoCEv2 for AI workloads (UEC 1.0, released 2025) defines its own transport (UET) over UDP/IP, so moving to it is a migration rather than a wire-compatible upgrade.
RoCEv2 is the lossless-Ethernet alternative Yobitel NeoCloud offers for cost-sensitive multi-node serving and for sovereign UK pods where the in-house network team operates Cumulus-Linux Ethernet rather than InfiniBand. This entry helps you operate RoCEv2 in production and reach the line-rate behaviour you paid for — the PFC/ECN/DCQCN tuning that turns a working RoCE fabric into a non-regressing one.
Quick start: enable RoCEv2 on a ConnectX-7 + Cumulus Linux switch
The minimum config to bring up a working RoCEv2 link. This assumes a ConnectX-7 NIC with the `mlx5_core` driver on a recent Linux distribution (RHEL 9.4+, Ubuntu 22.04 LTS+) and a Spectrum-4 switch running Cumulus Linux 5.x, which targets NVIDIA Spectrum ASICs; Tomahawk-5 platforms need the EOS or SONiC equivalents. Adjust DSCP/PCP values to match your switch fabric's class-of-service plan.
# --- Host side: ConnectX-7 + mlx5 ---
# 1) Confirm the port is an Ethernet (RoCE) port and check the active MTU
ibv_devinfo -d mlx5_0 | grep -E "transport|link_layer|active_mtu"
# transport: InfiniBand
# active_mtu: 4096 (5)
# link_layer: Ethernet
# 2) Find the GID index for RoCEv2 over IPv4 (typically index 3)
# show_gids prints the GID table; the v2 entries are tagged "RoCE v2"
show_gids
# DEV PORT INDEX GID IPv4 TYPE NDEV
# mlx5_0 1 0 fe80::... (link) - RoCE v1 enp1s0f0
# mlx5_0 1 1 fe80::... (link) - RoCE v2 enp1s0f0
# mlx5_0 1 2 ... 10.1.1.10 RoCE v1 enp1s0f0
# mlx5_0 1 3 ... 10.1.1.10 RoCE v2 enp1s0f0 <-- use this
# 3) Trust DSCP on the port and enable PFC on priority 3
# mlnx_qos sets the trust mode and PFC only; the DSCP value itself (26) is chosen
# by the sender, for example through NCCL_IB_TC
mlnx_qos -i enp1s0f0 --trust=dscp
mlnx_qos -i enp1s0f0 --pfc=0,0,0,1,0,0,0,0 # enable PFC on priority 3
# 4) Test bandwidth between two hosts (server then client)
ib_write_bw -d mlx5_0 -F --report_gbits -x 3 # server, GID 3
ib_write_bw -d mlx5_0 -F --report_gbits -x 3 <server-ip> # client
# --- Switch side: NVIDIA Cumulus Linux 5.x (Spectrum-4 example) ---
# Key names follow this entry's layout; NVUE schemas differ between releases,
# so compare against `nv config show` on your switch before applying.
cat <<'EOF' > /etc/nvue.d/roce.yaml
- set:
    interface:
      swp1-64:
        ip:
          neighbor-discovery:
            router-advertisement:
              enable: off
        link:
          mtu: 9216 # jumbo, leaves headroom for VXLAN
    qos:
      pfc:
        switch-priority:
          3:
            enable: on # match host PFC priority
      congestion-control:
        wred-ecn:
          enable: on
          min-threshold: 150000 # bytes; tune per buffer depth
          max-threshold: 1500000
          probability: 100
      mapping:
        dscp-to-switch-priority:
          26: 3 # DSCP 26 -> SP3
EOF
nv config patch /etc/nvue.d/roce.yaml
nv config apply
Bring up the data path first (steps 1-4 host, then switch QoS), and verify with `ib_write_bw` at near-line-rate before enabling production training. Most RoCE bring-up debugging time goes to GID index, DSCP-to-priority mapping, and PFC mismatch between host and switch, not to anything more exotic.
How it works: packet structure
A RoCEv2 packet is built up as follows, outermost to innermost: Ethernet header -> IP header -> UDP header (dst port 4791) -> InfiniBand Base Transport Header (BTH) -> InfiniBand Extended Transport Header (ETH, optional, size depends on the operation) -> payload -> Invariant CRC (ICRC), followed by the usual Ethernet frame check sequence.
Because the BTH is unchanged from native InfiniBand, the NIC's RDMA engine can process inbound RoCEv2 packets through the same hardware path as native IB packets after the UDP/IP envelope is stripped. This is what lets a single ConnectX HCA family support both transports.
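As a rough illustration of what the header stack costs, the sketch below adds up the per-packet bytes described above and derives payload efficiency. The function names are hypothetical. It assumes an untagged Ethernet frame, IPv4 without options, and no extended transport header unless one is passed in; the 4-byte frame check sequence and the 20 bytes of preamble plus inter-frame gap come from standard Ethernet framing rather than from this entry.

```python
# Per-packet byte counts for a RoCEv2 frame carrying one RDMA payload segment.
ETHERNET_HEADER = 14    # dst MAC + src MAC + EtherType (no VLAN tag)
VLAN_TAG = 4            # optional 802.1Q tag
IPV4_HEADER = 20        # no IP options
IPV6_HEADER = 40
UDP_HEADER = 8          # destination port 4791
BTH = 12                # InfiniBand Base Transport Header
ICRC = 4                # invariant CRC
ETHERNET_FCS = 4        # frame check sequence
PREAMBLE_AND_IFG = 20   # 8-byte preamble/SFD + 12-byte inter-frame gap


def wire_bytes(payload, ipv6=False, ext_header=0, vlan=False):
    """Bytes one packet occupies on the wire, including framing overhead."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    tag = VLAN_TAG if vlan else 0
    return (PREAMBLE_AND_IFG + ETHERNET_HEADER + tag + ip_header + UDP_HEADER
            + BTH + ext_header + payload + ICRC + ETHERNET_FCS)


def payload_efficiency(payload, **kwargs):
    """Fraction of wire bytes that is RDMA payload."""
    return payload / wire_bytes(payload, **kwargs)


for mtu in (1024, 4096):
    eff = payload_efficiency(mtu)
    print(f"payload {mtu:>4} B: {eff:.1%} efficient, {400 * eff:.0f} Gb/s of a 400G port")
```

Under these assumptions a 4096-byte payload is roughly 98 % efficient, which is in line with the 380-395 Gb/s payload figure quoted for a 400G port; smaller path MTUs pay proportionally more.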
RoCEv2 packet on the wire:
+----------+--------+---------+------+--------+---------+------+
| Ethernet | IPv4/6 | UDP     | BTH  | ETH    | Payload | ICRC |
| 14 bytes | 20/40  | 8 bytes | 12   | varies | 0..4096 | 4    |
+----------+--------+---------+------+--------+---------+------+
                      dport    ^---- InfiniBand transport ----^
                      4791     (unchanged from native IB)
Why RoCEv2 needs lossless Ethernet
RDMA transport assumes near-zero packet loss. The InfiniBand transport originally recovered from loss with go-back-N: on any loss the sender retransmits the lost packet and everything it sent after it, which is catastrophic for large messages under congestion. Modern ConnectX-7+ NICs implement selective repeat (SR) retransmission and improved loss recovery, but even with SR, high loss still collapses throughput. The whole point of the PFC/ECN/DCQCN apparatus is to make loss a rare event.
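To make the difference between the two recovery schemes concrete, here is a minimal sketch that counts retransmitted packets for one message. It is a deliberately simplified model, not how any NIC implements recovery: the function name and the `in_flight` parameter are hypothetical, every retransmission is assumed to succeed, and each loss is counted independently even if two losses fall inside the same window.

```python
def retransmitted_packets(total, lost, in_flight, selective_repeat):
    """Count packets resent for one message of `total` packets.

    lost:       indices (0-based) dropped on first transmission
    in_flight:  packets the sender has outstanding before it learns of a loss
    """
    resent = 0
    for index in sorted(set(lost)):
        if selective_repeat:
            # Selective repeat: only the missing packet is sent again.
            resent += 1
        else:
            # Go-back-N: the lost packet and everything already sent after it.
            resent += min(in_flight, total - index)
    return resent


# An 8 MB message at a 4096-byte MTU is 2048 packets; assume two drops.
message_packets = 2048
drops = [100, 900]
print("go-back-N:       ", retransmitted_packets(message_packets, drops, 256, False))
print("selective repeat:", retransmitted_packets(message_packets, drops, 256, True))
```

With 256 packets in flight, the two drops cost 512 retransmissions under go-back-N and 2 under selective repeat in this model, which is why loss is so much more expensive on older recovery paths.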
Three mechanisms layered together produce near-lossless behaviour on Ethernet. ECN/DCQCN do the bulk of the work in steady state by throttling senders before congestion becomes severe; PFC is the safety net for transient bursts that ECN-driven backoff cannot react to quickly enough. PFC alone (without ECN) creates pause storms and head-of-line blocking under sustained load; ECN alone (without PFC) loses packets during the gap between mark and rate-cut response.
- PFC (Priority Flow Control, IEEE 802.1Qbb): per-priority pause frames that stop upstream senders when downstream buffers fill. Lowest-level loss prevention. Hop-by-hop, no end-to-end signalling.
- ECN (Explicit Congestion Notification, RFC 3168): switches mark packets in the IP header when queues build, before drops occur. End-to-end feedback signal.
- DCQCN (Data Centre Quantised Congestion Notification, SIGCOMM 2015): the end-host congestion control algorithm that translates ECN marks into per-QP rate cuts. Hardware-implemented on the NIC.
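The interaction between switch-side marking and NIC-side rate control can be sketched in a few lines. The update rules follow the DCQCN paper cited above (marking probability rising linearly between Kmin and Kmax; on a CNP the sender halves its rate scaled by alpha), but this is a simplified sketch: the class and method names are hypothetical, the `rate_ai` step is an illustrative value, and the timers, byte counters, and hyper-increase stage of the real algorithm are left out.

```python
def ecn_mark_probability(queue_bytes, k_min, k_max, p_max):
    """Switch side: RED-style ECN marking curve between k_min and k_max."""
    if queue_bytes <= k_min:
        return 0.0
    if queue_bytes > k_max:
        return 1.0
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


class ReactionPoint:
    """Sender-side (NIC) rate state for one flow, rates in Gb/s."""

    def __init__(self, line_rate, g=1 / 256, rate_ai=0.04):
        self.line_rate = line_rate
        self.g = g                  # alpha gain (1/256 in the paper)
        self.rate_ai = rate_ai      # additive-increase step, illustrative value
        self.current_rate = line_rate
        self.target_rate = line_rate
        self.alpha = 1.0            # congestion estimate, starts at 1

    def on_cnp(self):
        """A Congestion Notification Packet arrived: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No CNP during the alpha timer interval: decay the estimate."""
        self.alpha = (1 - self.g) * self.alpha

    def on_increase(self, additive=False):
        """Recover towards the target; additive mode also raises the target."""
        if additive:
            self.target_rate = min(self.line_rate, self.target_rate + self.rate_ai)
        self.current_rate = (self.target_rate + self.current_rate) / 2


rp = ReactionPoint(line_rate=400.0)
rp.on_cnp()        # first CNP with alpha = 1 halves the rate
rp.on_increase()   # fast recovery moves halfway back to the target
print(rp.current_rate, rp.target_rate)
```

The first CNP is the most severe because alpha starts at 1; as quiet intervals decay alpha, later cuts become gentler, which is the behaviour the tuning parameters control.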
Reference: NIC sysctls, kernel module options, and verbs
Operational reference for the NVIDIA `mlx5` driver — by far the most common RoCEv2 NIC. The relevant sysctls, module parameters, and verbs queries an operator touches in production.
| Knob | Where | Default | Notes / when to change |
|---|---|---|---|
| roce_mode | mlx5_core module | auto | Force `2` to disable RoCEv1; eliminates GID-index ambiguity |
| roce_ecn_marking_enable | mlx5_core module / mlxconfig | 0 (off) | Enable on switch + NIC for ECN feedback path |
| NCCL_IB_GID_INDEX | Process env | Auto | Pin to 3 (RoCEv2/IPv4) explicitly; auto-detection misfires |
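| NCCL_IB_TC | Process env | 0 | Traffic class (IP ToS byte) for outbound RoCE; set to 106 (= DSCP 26 shifted left by 2 bits, i.e. 104, plus ECN bits 0b10) |
| NCCL_IB_TIMEOUT | Process env | 20 | Exponent: timeout = 4.096 us x 2^value. Raise to 22-24 on lossy paths; too low a value fails QPs on transient loss |
| NCCL_IB_RETRY_CNT | Process env | 7 | RDMA retries before failing the QP |
| net.ipv4.tcp_ecn | Linux sysctl | 2 | TCP only, no effect on RoCE. 1 = request and accept ECN; 2 (default) = use ECN only when the peer requests it |
| net.core.rmem_max / wmem_max | Linux sysctl | 212k | Socket buffers only (bootstrap and TCP-fallback traffic), not the RDMA data path; raise to 16-64 MB for high-bandwidth socket flows |
| mtu | Interface | 1500 | Set 9000+ jumbo for RoCE so the RDMA path MTU can reach 4096; fewer packets per message, headroom for VXLAN |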
| mlxconfig CNP_DSCP | Firmware | 48 | DSCP for Congestion Notification Packets (DCQCN feedback) |
| mlxconfig CNP_PRIO | Firmware | 6 | 802.1p priority for CNP |
| mlxconfig ROCE_CC_ALGORITHM_P1 | Firmware | DCQCN | Options: DCQCN, DCTCP, or None; DCQCN is the default |
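Two of the NCCL knobs in the table are encoded values rather than plain numbers, so a small worked calculation may help. The helper names below are hypothetical. The traffic-class layout (6 DSCP bits followed by 2 ECN bits) is the standard IP ToS byte, and the timeout formula is the InfiniBand local ACK timeout that `NCCL_IB_TIMEOUT` feeds.

```python
def traffic_class(dscp, ecn_bits=0b10):
    """Build the 8-bit traffic class: DSCP in the top 6 bits, ECN in the low 2.

    ecn_bits=0b10 marks the packet ECN-capable (ECT(0)) so switches can mark
    it instead of dropping it.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn_bits <= 3:
        raise ValueError("ECN field is 2 bits")
    return (dscp << 2) | ecn_bits


def ib_timeout_seconds(exponent):
    """Local ACK timeout for a given NCCL_IB_TIMEOUT exponent."""
    return 4.096e-6 * (2 ** exponent)


print("NCCL_IB_TC for DSCP 26:", traffic_class(26))
for exponent in (20, 22, 24):
    print(f"NCCL_IB_TIMEOUT={exponent}: {ib_timeout_seconds(exponent):.1f} s per attempt")
```

DSCP 26 with ECT(0) set gives 106, the value used throughout this entry; each step of the timeout exponent doubles the wait, so 24 waits sixteen times longer per attempt than 20.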
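# Inspect current RoCE-relevant firmware config
# (the mst device name depends on the NIC model; list devices with `mst status`)
mlxconfig -d /dev/mst/mt4131_pciconf0 query | grep -E "ROCE|CNP|ECN"
# Persistently set DCQCN-friendly config (takes effect after a firmware reset or host reboot)
# Use the value names printed by the query above; they can differ between firmware releases.
# ROCE_NEXT_PROTOCOL is left at its default: it is an IP next-protocol value, not the UDP port.
mlxconfig -d /dev/mst/mt4131_pciconf0 set \
  ROCE_CC_ALGORITHM_P1=DCQCN \
  CNP_DSCP_P1=48 \
  CNP_802P_PRIO_P1=6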
# Live per-port congestion counters (the canonical RoCE perf signal)
ethtool -S enp1s0f0 | grep -E "rx_pause|tx_pause|rx_congestion|ecn_marked|rx_out_of_buffer"
# Per-port RoCE congestion and retransmission counters (sustained growth is the warning sign)
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/np_cnp_sent
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_handled
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence
Switch configuration: Cumulus Linux, Arista EOS, SONiC
Switch-side config differs by NOS but the concepts are identical: classify RoCE by DSCP, map to a dedicated priority queue, enable PFC on that priority, enable WRED/ECN marking with thresholds appropriate to the buffer depth. Examples below for the three most common NOSes in AI fabrics.
- DSCP 26 / PCP 3 / Switch-Priority 3 is the de facto convention for RoCEv2 on AI fabrics. Pick a different value if your fabric already uses SP3 for something else — but document the mapping cluster-wide.
- WRED/ECN thresholds depend on switch buffer depth. Shallow-buffer Tomahawk: min ~100 KB, max ~1 MB. Deep-buffer Jericho: min ~1 MB, max ~10 MB. Wrong thresholds either fail to mark (loss) or over-mark (throughput collapse).
- PFC headroom (the extra buffer reserved per pause-enabled priority) must be sized for the longest cable round-trip in the fabric. Default headroom assumes ~10 m fibre; ~100 m runs need explicit headroom uplift.
- Enable PFC watchdog. Stuck pause frames take a port out of service in seconds; PFC watchdog detects and breaks the loop within milliseconds.
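The headroom point in the list above can be made concrete with a back-of-envelope estimate. This is a simplified sketch, not a vendor sizing formula: the function name is hypothetical, it assumes roughly 5 ns of propagation delay per metre of fibre, and it allows one maximum-size frame in each direction that cannot be interrupted once transmission has started. Real switch calculators add terms for PHY and MAC latency, which the hypothetical `response_delay_ns` parameter stands in for.

```python
import math

FIBRE_DELAY_NS_PER_M = 5  # assumed propagation delay in fibre, ns per metre


def pfc_headroom_bytes(rate_gbps, cable_m, mtu_bytes, response_delay_ns=0):
    """Estimate the buffer needed to absorb data still arriving after a pause.

    Bytes keep arriving for one cable round trip (pause frame out, last data
    in) plus any processing delay, plus one frame already being sent at each
    end when the pause is triggered.
    """
    round_trip_ns = 2 * cable_m * FIBRE_DELAY_NS_PER_M + response_delay_ns
    in_flight = math.ceil(rate_gbps * round_trip_ns / 8)  # Gb/s * ns = bits
    return in_flight + 2 * mtu_bytes


for metres in (10, 100):
    print(f"{metres:>3} m at 400G: {pfc_headroom_bytes(400, metres, 9216)} bytes")
```

In this model the cable-dependent part grows tenfold between a 10 m and a 100 m run at 400G, which is why long runs need an explicit uplift over the default.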
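# --- NVIDIA Cumulus Linux 5.x (Spectrum-4) ---
# NVUE paths differ between 5.x releases (recent ones scope QoS settings under a
# profile such as default-global and offer `nv set qos roce` as a one-step default);
# confirm the exact syntax with `nv list-commands` before pasting.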
nv set qos congestion-control switch-priority 3 wred-ecn enable on
nv set qos congestion-control switch-priority 3 wred-ecn min-threshold 150KB
nv set qos congestion-control switch-priority 3 wred-ecn max-threshold 1500KB
nv set qos pfc switch-priority 3 enable on
nv set qos mapping dscp-to-switch-priority 26 3
nv config apply
# --- Arista EOS (Tomahawk-5 / Jericho) ---
qos map dscp 26 to traffic-class 3
priority-flow-control on
priority-flow-control priority 3 no-drop
queue 3 random-detect ecn minimum-threshold 150 kbytes maximum-threshold 1500 kbytes max-mark-probability 100
# --- SONiC (community / Enterprise) ---
config qos clear
config qos reload
# QoS templates live in /etc/sonic/qos.json; the AI-fabric template
# ships from the platform vendor with DSCP 26 -> TC 3 + PFC priority 3
# pre-configured. Validate with:
show pfc counters
show queue counters
Workload patterns
RoCEv2 carries three distinct traffic patterns in an AI cluster, each with different congestion characteristics. Knowing which dominates your workload tells you which tuning lever to reach for first.
- Training collectives (AllReduce, AllToAll): bursty, large messages (8 MB - 8 GB), strongly congestion-correlated across QPs. Dominant lever: DCQCN parameters (Kmin, Kmax, Pmax, Rai, Rhai). Default DCQCN tuned for 100G; needs re-tuning at 400G and 800G.
- Storage I/O over RDMA (NVMe-oF, Lustre, GPFS): steady-state large reads/writes, less correlated across QPs. Dominant lever: PFC pause behaviour. Buffers must absorb sustained read bursts without pausing the training plane.
- Inference KV-cache transfers (vLLM disaggregated, prefill-decode separation): small-to-medium messages (1-64 MB), latency-sensitive, low-rate. Dominant lever: ECMP hashing — avoid placing prefill->decode flows on congested links. Adaptive routing helps when available.
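The ECMP point in the last bullet is easy to see with a toy model. Because every RoCEv2 packet has the same UDP destination port, the only per-flow entropy in the hash input is the addresses and the UDP source port. The sketch below uses CRC32 as a stand-in for the switch hash, which is an assumption; real ASICs use their own hash functions and seeds, and the function names are hypothetical.

```python
import zlib
from collections import Counter

ROCEV2_UDP_PORT = 4791  # fixed destination port, so it adds no entropy


def ecmp_link(src_ip, dst_ip, src_port, n_links):
    """Pick an uplink index from the flow tuple (CRC32 is a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{ROCEV2_UDP_PORT}".encode()
    return zlib.crc32(key) % n_links


def link_load(flows, n_links):
    """Count how many flows land on each uplink."""
    counts = Counter(ecmp_link(src, dst, sport, n_links) for src, dst, sport in flows)
    return [counts[i] for i in range(n_links)]


# Eight prefill -> decode flows between the same two hosts, four uplinks.
flows = [("10.1.1.10", "10.1.2.20", 49152 + i) for i in range(8)]
print(link_load(flows, n_links=4))
```

With only a handful of large flows, a static hash rarely spreads them evenly, so some uplinks carry several flows while others sit idle; that imbalance is what adaptive routing is meant to remove.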
Sizing and capacity planning
Real-world line-rate numbers at 400G and 800G, ConnectX-7/8 endpoints, Spectrum-4 or Tomahawk-5 switches, DCQCN tuned, PFC enabled on SP3. Treat as planning anchors; verify with `ib_write_bw` and `nccl-tests`.
- At 400G, achievable single-flow throughput is 95-99 % of line rate (380-395 Gb/s) when PFC/ECN/DCQCN are correctly tuned. Anything less suggests a misconfiguration.
- At 800G, single-flow throughput sits in the same 95-99 % band (760-790 Gb/s) but with less margin for loss-recovery overhead: verify with `ib_write_bw -F` and watch for retransmission counter growth.
- Tail latency under load is the operationally critical number. p99 above 200 us at 70 % offered load almost always indicates DCQCN under-tuning (slow rate recovery after a CNP).
- Plan 10-20 % bandwidth headroom on every link; running RoCEv2 fabrics above 80 % sustained load amplifies tail-latency variance.
- Yobitel NeoCloud's sovereign UK reference design lands on the 400G Spectrum-X (BlueField-3 + Spectrum-4) row of the table below for Ethernet-preferring tenants: the same PFC/ECN/DCQCN profile ships in the cluster image, validated against the headline AllReduce numbers before customer access opens.
| Port speed | NIC + Switch | Single-flow throughput | AllReduce N=64 busBW | Tail latency p99 at 70% load |
|---|---|---|---|---|
| 100G | ConnectX-6 + Spectrum-3 | 94-97 Gb/s | 11-12 GB/s | < 50 us |
| 200G | ConnectX-6 + Spectrum-3 | 188-194 Gb/s | 22-24 GB/s | < 80 us |
| 400G (RoCE) | ConnectX-7 + Spectrum-4 | 380-395 Gb/s | 44-48 GB/s | < 120 us |
| 400G (Spectrum-X) | BlueField-3 + Spectrum-4 | 385-398 Gb/s | 47-50 GB/s | < 100 us |
| 800G (RoCE) | ConnectX-8 + Spectrum-X SN5600 | 760-790 Gb/s | 90-95 GB/s | < 130 us |
| 800G (UEC 1.0 packet spraying) | BlueField-3 SuperNIC + Spectrum-X | 780-798 Gb/s | 95-100 GB/s | < 90 us |
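When comparing your own measurements with the table, it helps to compute the same quantities the same way. The sketch below uses the bus-bandwidth convention from `nccl-tests` for AllReduce (algorithm bandwidth scaled by 2(n-1)/n) and a plain line-rate fraction for `ib_write_bw` results. The function names and the sample measurement are hypothetical.

```python
def allreduce_bus_bw(message_bytes, seconds, ranks):
    """Bus bandwidth in bytes/s, following the nccl-tests AllReduce convention."""
    algorithm_bw = message_bytes / seconds
    return algorithm_bw * 2 * (ranks - 1) / ranks


def line_rate_fraction(measured_gbps, port_gbps):
    """Share of the nominal port speed a single flow achieved."""
    return measured_gbps / port_gbps


# Hypothetical run: an 8 GB AllReduce across 64 ranks finishing in 0.35 s.
bus_bw = allreduce_bus_bw(8e9, 0.35, 64)
print(f"busBW: {bus_bw / 1e9:.1f} GB/s (a 400G port tops out at {400 / 8:.0f} GB/s)")
print(f"single flow: {line_rate_fraction(380, 400):.0%} of line rate")
```

A busBW well below the port's GB/s ceiling with healthy single-flow numbers usually points at the collective path (congestion control or hashing) rather than at the link itself.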
Observability
RoCE health surfaces in three places: NIC hardware counters (Mellanox `hw_counters`), switch port counters (PFC pause counts, ECN marks, drops), and NCCL/MPI job logs. Production fabrics export all three to Prometheus + Grafana via the `dcgm-exporter`, NVIDIA UFM telemetry (works on RoCE too) or SONiC's `gnmi` exporter.
- Per-port: `rx_pause_count`, `tx_pause_count`, `rx_out_of_buffer`, `ecn_marked_packets`. Pause counts > 0 are normal; sustained growth is a problem.
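- Per-port RDMA (`hw_counters`): `out_of_sequence` (out-of-order arrivals, a proxy for loss and retransmission), `np_cnp_sent` (CNPs generated by this node), `rp_cnp_handled` (CNPs received and acted on with a rate cut). These are port-wide rather than per-QP, so correlate them with job placement and alert on rates.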
- Job-level: NCCL log lines `Connected ... using IB` confirm RoCE path active; `falling back to TCP` is the disaster mode — alert on any occurrence.
- Switch-level: PFC pause rate, ECN mark rate, WRED drop count, buffer occupancy per priority. Drops on the RoCE priority are always a red flag.
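Since the guidance above is to alert on growth rather than on absolute values, here is a minimal sketch of turning two counter snapshots into rates. The sysfs path is the one used elsewhere in this entry; the function names and the thresholds are hypothetical and should be replaced with values from your own baseline.

```python
from pathlib import Path


def read_hw_counters(device="mlx5_0", port=1):
    """Snapshot every hw_counter for one port (needs an RDMA-capable host)."""
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}/hw_counters")
    return {p.name: int(p.read_text()) for p in base.iterdir() if p.is_file()}


def counter_rates(before, after, interval_s):
    """Per-second growth for counters present in both snapshots."""
    return {name: (after[name] - before[name]) / interval_s
            for name in after if name in before}


def alerts(rates, thresholds):
    """Names of counters growing faster than their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if rates.get(name, 0.0) > limit)


# Two hypothetical snapshots taken 60 seconds apart.
before = {"np_cnp_sent": 1000, "out_of_sequence": 5}
after = {"np_cnp_sent": 4000, "out_of_sequence": 5}
rates = counter_rates(before, after, interval_s=60)
print(alerts(rates, {"np_cnp_sent": 10.0, "out_of_sequence": 0.0}))
```

On a live host, call `read_hw_counters()` twice with a sleep in between and feed both snapshots to `counter_rates`; the same rate logic applies to values scraped from `ethtool -S`.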
# Per-port RoCE health on a host
ethtool -S enp1s0f0 | grep -E "rx_pause|tx_pause|rx_out_of_buffer|ecn_marked"
# All RDMA hw_counters for one port (repeat per CX-7 port)
for c in $(ls /sys/class/infiniband/mlx5_0/ports/1/hw_counters); do
  printf "%-30s %s\n" "$c" "$(cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/$c)"
done
# Prometheus exporter snippet: UFM emits these for RoCE fabrics
# Sample queries:
# sum by (host, port) (rate(rx_pause_count[5m])) > 100
# sum by (job) (rate(ecn_marked_packets[5m])) / sum by (job) (rate(tx_packets[5m])) > 0.05
# absent_over_time(nccl_job_using_ib[5m]) -- alert on TCP fallback
Cost and FinOps
RoCEv2 itself has zero licence cost; the protocol is implemented in NIC hardware and switch silicon you already own. The cost-versus-InfiniBand argument is about hardware bill of materials and the operator-skill premium.
- The headline TCO win for RoCE is at large fabric scale (1k+ GPUs); below 256 GPUs the per-switch savings rarely justify the operational complexity over plug-and-play InfiniBand.
- Add 6-12 weeks of network-engineering time per new RoCE fabric for PFC/ECN/DCQCN tuning. Skipping this is the most common reason RoCE deployments underperform InfiniBand in the first 6 months.
- Multi-vendor sourcing is the secondary win: RoCE switches can be Tomahawk, Spectrum, Jericho, Silicon One — InfiniBand is single-vendor.
| Cost driver | RoCEv2 + 400G Ethernet | InfiniBand NDR (400 Gb/s) | Delta |
|---|---|---|---|
| Switch (1U, 64-port) | $45-70k (Tomahawk 5 / Spectrum-X SN5600) | $70-95k (Quantum-2 MQM9700) | -30 to -40 % |
| NIC per host (single dual-port 400G) | $2,400-3,500 (ConnectX-7 RoCE-mode) | $2,800-4,200 (CX-7 IB-mode) | -15 to -20 % |
| Optics per end (400G DR4) | $1,800-2,600 | $1,800-2,400 | Comparable |
| Switch OS / mgmt licence | $0-2k per port (Cumulus / SONiC) | $5-8k per port-year (UFM) | Much cheaper |
| Operator skill | Existing Ethernet team | Specialist IB skills | Variable; budget 6-12 wk ramp |
| Full 1,024-GPU fabric BOM | $1.6-2.4M | $2.5-3.5M | Roughly -30-35 % |
Security and compliance
RoCEv2 inherits standard IP and Ethernet security primitives (IPsec, MACsec, VLAN segmentation) that pure InfiniBand does not have. For multi-tenant clusters, this matters: VRF isolation per tenant on the RoCE fabric is straightforward; the equivalent on InfiniBand requires partition-key (PKey) management discipline that fewer operators have.
Confidential RDMA: NVIDIA Hopper+ with Confidential Compute mode (CC-on) encrypts PCIe and HBM traffic; the RoCEv2 wire-format itself is unencrypted, but the data inside (GPU-to-GPU tensors) is protected by the CC envelope. For host-to-host plaintext-on-the-wire concerns, MACsec at the link layer is the operational answer; full end-to-end IPsec adds CPU overhead and is rarely chosen.
Compliance: RoCE fabrics qualify under NCSC Cloud Security Principles (UK), GDPR Article 32 (technical and organisational measures), HIPAA, and SOC 2 reference designs the same way any other Ethernet fabric does. The novel concerns are tenant-isolation evidence (VRF + PKey audit trails) and protection against side-channel attacks on shared infrastructure (CC-on attestation).
Migration and alternatives
RoCEv2 is the default RDMA-over-Ethernet today, but two adjacent technologies bracket it: iWARP (RDMA over TCP/IP) below, and the Ultra Ethernet Consortium 1.0 transport above. Both matter for migration planning.
- Migrating from InfiniBand to RoCEv2: re-platforming exercise, not a swap. Budget 6-12 weeks per fabric for PFC/ECN/DCQCN tuning and operator training. Run both fabrics in parallel during cut-over.
- Migrating from RoCEv2 to UEC 1.0: the UEC transport (UET) also runs over UDP/IP on the same Ethernet underlay, but it defines its own wire format rather than reusing RoCEv2's, and adds packet spraying, selective repeat, and rich per-flow telemetry. Plan for UEC-capable NICs and switch silicon plus a communication-library update; applications that sit on NCCL or MPI should not need source changes.
- Choosing iWARP over RoCEv2: only for WAN-distance RDMA (storage replication across DCs) or environments where the lossless tax is genuinely impractical. Almost never the right choice for new AI training fabrics.
| Alternative | Encapsulation | Loss handling | Best for |
|---|---|---|---|
| RoCEv1 | Pure Ethernet (no IP) | Lossless required | Legacy single-L2 deployments; not for new builds |
| RoCEv2 | UDP/IP (port 4791) | Lossless required (PFC/ECN/DCQCN) | Default 2026 AI Ethernet fabric |
| iWARP (RFC 5040-5045) | TCP/IP | TCP-native, lossy-tolerant | WAN RDMA, storage; rare in AI |
| Ultra Ethernet 1.0 (UEC) | UDP/IP, packet spraying | Selective repeat at scale | New 2026+ AI fabrics; own wire format on the same Ethernet/IP underlay as RoCEv2 |
| AWS SRD (EFAv2) | AWS-proprietary on UDP/IP | Selective repeat | AWS Trn1/P5 instances only |
| InfiniBand NDR/XDR | InfiniBand (own L1-4) | Lossless by design | Single-vendor, simpler ops, premium pricing |
Troubleshooting
The RoCE failure-mode catalogue is large but the high-frequency entries are predictable. Map symptom to cause; verify with the listed action.
| Symptom | Most likely cause | First action |
|---|---|---|
| No traffic flows; ib_write_bw fails to connect | Wrong GID index (RoCEv1 vs v2, IPv4 vs IPv6) | Run `show_gids`; pin NCCL_IB_GID_INDEX to RoCEv2/IPv4 entry |
| Single-flow throughput half of line rate | PFC enabled on wrong priority or DSCP mismatch | Verify DSCP marking on host matches switch SP mapping; check NCCL_IB_TC |
| AllReduce collapses under load | ECN not configured or WRED thresholds wrong | Check `ecn_marked_packets` on host; check WRED config on switch |
| Port goes down under sustained traffic | PFC pause storm (stuck pause frames) | Enable PFC watchdog on switch; investigate upstream congestion source |
| Random NCCL timeouts after hours of running | ECN under-tuning -> brief packet loss -> SR retries | Raise NCCL_IB_TIMEOUT to 24; inspect `out_of_sequence` counter growth |
| TCP fallback active despite RoCE config | GID index detection failed at NCCL init | Pin NCCL_IB_GID_INDEX explicitly; verify show_gids output post-driver-load |
| High pause counts on storage fabric, healthy on training | Storage I/O bursts saturating shared switch buffers | Separate storage and training onto different priorities or fabrics |
| Head-of-line blocking — one slow host stalls others | PFC alone (no ECN) — buffers fill upstream | Enable ECN/WRED to provide rate-control feedback before PFC kicks in |
| Slow degradation over weeks | Optic ageing or marginal cable | Check `symbol_errors` and `fec_corrected_blocks`; swap suspect optics |
The classic failure mode: RoCEv2 enabled, PFC mapped to the wrong priority, ECN disabled at TOR. Under load the fabric drops packets, the RDMA transport collapses, and AllReduce times go from milliseconds to seconds. Always verify all three mechanisms with synthetic congestion (`ib_write_bw` from multiple hosts to one target) before opening to production traffic.
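Several rows in the table above come down to picking the wrong GID index, so it can be worth selecting it programmatically instead of assuming index 3. The sketch below reads the GID table from sysfs (`gids/` and `gid_attrs/types/` under the port directory) and returns the first RoCE v2 entry whose GID is an IPv4-mapped address. The function names are hypothetical, and the sample table mirrors the `show_gids` output shown in the quick start.

```python
from pathlib import Path

IPV4_MAPPED_PREFIX = "0000:0000:0000:0000:0000:ffff:"


def read_gid_table(device="mlx5_0", port=1):
    """Return {index: (gid, gid_type)} from sysfs (needs an RDMA-capable host)."""
    port_dir = Path(f"/sys/class/infiniband/{device}/ports/{port}")
    table = {}
    for gid_file in (port_dir / "gids").iterdir():
        try:
            gid_type = (port_dir / "gid_attrs" / "types" / gid_file.name).read_text()
        except OSError:
            continue  # unpopulated table entries cannot be read
        table[int(gid_file.name)] = (gid_file.read_text().strip(), gid_type.strip())
    return table


def pick_rocev2_ipv4_index(entries):
    """Lowest index that is both RoCE v2 and an IPv4-mapped GID, else None."""
    for index in sorted(entries):
        gid, gid_type = entries[index]
        if gid_type == "RoCE v2" and gid.lower().startswith(IPV4_MAPPED_PREFIX):
            return index
    return None


# Sample table for a port with address 10.1.1.10 (0a01:010a when IPv4-mapped).
sample = {
    0: ("fe80:0000:0000:0000:0000:0000:0000:0001", "IB/RoCE v1"),
    1: ("fe80:0000:0000:0000:0000:0000:0000:0001", "RoCE v2"),
    2: ("0000:0000:0000:0000:0000:ffff:0a01:010a", "IB/RoCE v1"),
    3: ("0000:0000:0000:0000:0000:ffff:0a01:010a", "RoCE v2"),
}
print("NCCL_IB_GID_INDEX =", pick_rocev2_ipv4_index(sample))
```

On a live host, `pick_rocev2_ipv4_index(read_gid_table())` gives the value to export as `NCCL_IB_GID_INDEX` and to pass to `ib_write_bw -x`; a result of `None` means the interface has no IPv4 address or the RoCE v2 entries have not been populated yet.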
Where this fits in the Yobitel stack
Yobitel runs both InfiniBand NDR/XDR and RoCEv2 fabrics in production: InfiniBand is the default on the H100 training pods and the GB200 NVL72 racks where single-vendor predictability matters most; RoCEv2 on Spectrum-X is the default on the H100/H200 sovereign UK pods where Cumulus-Linux-based operations align with the in-house Ethernet skill set. Both fabrics terminate into the same Yobitel GPU Cloud control plane and look identical to a Yobibyte customer — the difference is invisible above the cluster abstraction.
For customers running directly on Yobitel GPU Cloud rather than Yobibyte, the cluster image ships with the appropriate `/etc/profile.d/nccl.sh` profile: RoCE pods include the GID-index pinning, NCCL_IB_TC=106, and PFC-verified switch config; InfiniBand pods include the SHARP-enabling profile. The choice of fabric is exposed at provisioning time so workloads with specific fabric requirements can target accordingly.
References
- InfiniBand Architecture Specification Annex A17 (RoCEv2) · InfiniBand Trade Association
- RFC 3168 — The Addition of Explicit Congestion Notification to IP · IETF
- IEEE 802.1Qbb — Priority-based Flow Control · IEEE
- Congestion Control for Large-Scale RDMA Deployments (DCQCN, SIGCOMM 2015) · Microsoft / Mellanox / SIGCOMM
- NVIDIA RoCEv2 Configuration Best Practices · NVIDIA
- Ultra Ethernet Consortium 1.0 Specification · Ultra Ethernet Consortium