RoCEv2 — RDMA over Converged Ethernet Explained

TL;DR

RoCEv2 encapsulates InfiniBand transport headers inside UDP/IP packets (UDP destination port 4791), letting RDMA verbs run over routed IP/Ethernet fabrics with InfiniBand semantics.
Defined by IBTA Annex A17 (2014); requires lossless or near-lossless Ethernet via PFC (IEEE 802.1Qbb), ECN (RFC 3168), and DCQCN (Microsoft/Mellanox, SIGCOMM 2015) to perform at scale.
Implemented by most modern data-centre RDMA NICs: NVIDIA ConnectX-6/7/8, BlueField-2/3/4 SuperNICs, AMD Pensando, Intel E810, and Broadcom Thor 2. AWS Nitro/EFA fills the same role with AWS's proprietary SRD transport rather than RoCEv2 itself.
Forms the basis of most Ethernet GPU fabrics in production: NVIDIA Spectrum-X and Broadcom Tomahawk-5-based AI fabrics run RoCEv2 directly, while AWS EFAv2 and the Ultra Ethernet Consortium 1.0 effort carry the same RDMA semantics over their own transports.
Typical 400G port reaches 380-395 Gb/s payload at near-zero packet loss when PFC/ECN/DCQCN are correctly tuned; collapses to 50-100 Gb/s with congestion-loss-driven retransmissions when they are not.

Overview

RDMA over Converged Ethernet version 2 (RoCEv2) is how RDMA gets carried over commodity Ethernet at data-centre scale. The first version of RoCE was a pure Layer 2 protocol — RDMA frames encapsulated in Ethernet, non-routable, confined to a single broadcast domain. RoCEv2 added UDP/IP encapsulation so the same RDMA semantics work across routed IP fabrics.

For AI infrastructure, RoCEv2 is the protocol that makes Ethernet a credible competitor to InfiniBand. It carries the same InfiniBand transport headers, supports the same Verbs API, and — given a properly tuned lossless or near-lossless Ethernet underlay — delivers comparable latency and throughput to native InfiniBand. The trade-off is the operational complexity of running lossless Ethernet: PFC, ECN, and DCQCN must all be configured and tuned correctly, and the network team needs the skills to diagnose pause storms, ECN mis-marking, and head-of-line blocking when something goes wrong.

By 2026, RoCEv2 underpins most Ethernet AI fabrics in production. NVIDIA's Spectrum-X (Spectrum-4 ASIC + BlueField-3 SuperNIC) is a RoCEv2 platform with vendor-specific extensions. Broadcom's Tomahawk-5-based AI fabrics from Arista, Cisco, and Juniper are RoCEv2 platforms. AWS EFAv2 targets the same role with AWS's proprietary SRD transport. The Ultra Ethernet Consortium specification that aims to supplant RoCEv2 for AI workloads (UEC 1.0, ratified 2025) runs over the same UDP/IP Ethernet underlay, which keeps a migration path open.

RoCEv2 is the lossless-Ethernet alternative Yobitel NeoCloud offers for cost-sensitive multi-node serving and for sovereign UK pods where the in-house network team operates Cumulus-Linux Ethernet rather than InfiniBand. This entry helps you operate RoCEv2 in production and reach the line-rate behaviour you paid for — the PFC/ECN/DCQCN tuning that turns a working RoCE fabric into a non-regressing one.

Quick start: enable RoCEv2 on a ConnectX-7 + Cumulus Linux switch

The minimum config to bring up a working RoCEv2 link. This assumes a ConnectX-7 NIC with the mlx5_core driver on a recent Linux distribution (RHEL 9.4+, Ubuntu 22.04 LTS+) and a Spectrum-4 or Tomahawk-5 switch running Cumulus Linux 5.x. Adjust DSCP/PCP values to match your switch fabric's class-of-service plan.

# --- Host side: ConnectX-7 + mlx5 ---
# 1) Confirm RoCEv2 capability on the HCA
ibv_devinfo -d mlx5_0 | grep -E "transport|link_layer|active_mtu"
# transport:    InfiniBand (0)
# link_layer:   Ethernet        <-- an Ethernet link layer means the port runs RoCE
# active_mtu:   4096 (5)

# 2) Find the GID index for RoCEv2 over IPv4 (typically index 3)
#    show_gids prints the GID table; the v2 entries are tagged "RoCE v2"
show_gids
# DEV     PORT  INDEX  GID                  IPv4         TYPE        NDEV
# mlx5_0  1     0      fe80::... (link)     -            RoCE v1     enp1s0f0
# mlx5_0  1     1      fe80::... (link)     -            RoCE v2     enp1s0f0
# mlx5_0  1     2      ...                  10.1.1.10    RoCE v1     enp1s0f0
# mlx5_0  1     3      ...                  10.1.1.10    RoCE v2     enp1s0f0  <-- use this

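# 3) Trust DSCP and enable PFC on priority 3
#    With trust=dscp the default mlnx_qos map places DSCP 24-31 (so DSCP 26) in priority 3.
#    The DSCP itself is stamped per application through the traffic class (see NCCL_IB_TC).
mlnx_qos -i enp1s0f0 --trust=dscp
mlnx_qos -i enp1s0f0 --pfc=0,0,0,1,0,0,0,0      # enable PFC on priority 3

# 4) Test bandwidth between two hosts (server then client)
#    --tclass=106 marks the test traffic DSCP 26 + ECT(0) so it lands in the lossless class
ib_write_bw -d mlx5_0 -F --report_gbits -x 3 --tclass=106    # server, GID 3
ib_write_bw -d mlx5_0 -F --report_gbits -x 3 --tclass=106 <server-ip>   # client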

# --- Switch side: NVIDIA Cumulus Linux 5.x (Spectrum-4 example) ---
# Stage the patch from a file, review the diff, then apply
# (key names vary between Cumulus releases; check `nv config diff` before applying)
cat <<'EOF' > /etc/nvue.d/roce.yaml
- set:
    interface:
      swp1-64:
        ip:
          neighbor-discovery:
            router-advertisement:
              enable: off
        link:
          mtu: 9216                          # jumbo, leaves headroom for VXLAN
        qos:
          pfc:
            switch-priority:
              3:
                enable: on                   # match host PFC priority
          congestion-control:
            wred-ecn:
              enable: on
              min-threshold: 150000          # bytes; tune per buffer depth
              max-threshold: 1500000
              probability: 100
    qos:
      mapping:
        dscp-to-switch-priority:
          26: 3                              # DSCP 26 -> SP3
EOF
nv config patch /etc/nvue.d/roce.yaml
nv config diff
nv config apply

Tip: Bring up the data path first (steps 1-4 host, then switch QoS), and verify with ib_write_bw at near-line-rate before enabling production training. Most RoCE bring-up debugging time goes to GID index, DSCP-to-priority mapping, and PFC mismatch between host and switch — not to anything more exotic.
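
Step 2 of the host bring-up can be scripted rather than read by eye. The sketch below is a hypothetical helper (`pick_rocev2_ipv4_gid` is not part of any NVIDIA tool) that scans `show_gids`-style text for the first RoCE v2 entry carrying an IPv4 address. It assumes the column layout of the sample output in the quick start; real `show_gids` output differs slightly between driver releases, so treat the parsing as illustrative.

```python
import re
from typing import Optional

# Matches a dotted-quad token such as 10.1.1.10 (no range validation).
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def pick_rocev2_ipv4_gid(show_gids_text: str, dev: str = "mlx5_0",
                         port: int = 1) -> Optional[int]:
    """Return the GID index of the first RoCE v2 + IPv4 row, or None."""
    for line in show_gids_text.splitlines():
        tokens = line.split()
        # Expect at least DEV, PORT, INDEX, GID; skip headers and other devices.
        if len(tokens) < 4 or tokens[0] != dev or tokens[1] != str(port):
            continue
        is_v2 = "v2" in tokens                      # "RoCE v2" splits into two tokens
        has_ipv4 = any(IPV4.match(t) for t in tokens)
        if is_v2 and has_ipv4 and tokens[2].isdigit():
            return int(tokens[2])
    return None


# Sample text copied from the quick-start output (assumed layout).
SAMPLE = """\
DEV     PORT  INDEX  GID                  IPv4         TYPE        NDEV
mlx5_0  1     0      fe80::... (link)     -            RoCE v1     enp1s0f0
mlx5_0  1     1      fe80::... (link)     -            RoCE v2     enp1s0f0
mlx5_0  1     2      ...                  10.1.1.10    RoCE v1     enp1s0f0
mlx5_0  1     3      ...                  10.1.1.10    RoCE v2     enp1s0f0
"""

print(pick_rocev2_ipv4_gid(SAMPLE))
```

The returned index is the value to pass to `ib_write_bw -x` and to export as NCCL_IB_GID_INDEX.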

How it works: packet structure

A RoCEv2 packet is built up as follows, outermost to innermost: Ethernet header -> IP header -> UDP header (dst port 4791) -> InfiniBand Base Transport Header (BTH) -> InfiniBand Extended Transport Header (ETH, optional) -> payload -> Invariant CRC.

Because the BTH is unchanged from native InfiniBand, the NIC's RDMA engine can process inbound RoCEv2 packets through the same hardware path as native IB packets after the UDP/IP envelope is stripped. This is what lets a single ConnectX HCA family support both transports.

RoCEv2 packet on the wire:

+-----------+--------+--------+--------+-----+---------+------+
| Ethernet  | IPv4/6 |  UDP   |  BTH   | ETH | Payload | ICRC |
| 14 bytes  | 20/40  | 8 bytes| 12     | var | 0..4096 | 4    |
+-----------+--------+--------+--------+-----+---------+------+
                      dport
                      4791
                                ^------- InfiniBand transport
                                         (unchanged from native IB)
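
The header sizes in the diagram set an upper bound on payload throughput. The sketch below is a back-of-envelope calculation, not a vendor formula: it takes the Ethernet, IP, UDP, BTH, and ICRC sizes from the diagram and adds the standard Ethernet preamble, frame check sequence, and inter-frame gap, which the diagram does not show. The function name and the `ext_header` parameter (bytes of extended transport header, which varies by opcode) are illustrative assumptions.

```python
# Fixed per-packet costs in bytes. Ethernet/IP/UDP/BTH/ICRC sizes follow the
# packet diagram; preamble, FCS and inter-frame gap are standard Ethernet
# framing costs that the diagram omits.
PREAMBLE_SFD = 8
INTER_FRAME_GAP = 12
ETHERNET_HEADER = 14
VLAN_TAG = 4            # only present on 802.1Q-tagged links
ETHERNET_FCS = 4
UDP_HEADER = 8
BTH = 12
ICRC = 4


def rocev2_wire_efficiency(payload: int, ipv6: bool = False,
                           ext_header: int = 0, vlan: bool = False) -> float:
    """Fraction of wire time spent on RDMA payload for one packet."""
    ip_header = 40 if ipv6 else 20
    overhead = (PREAMBLE_SFD + INTER_FRAME_GAP + ETHERNET_HEADER
                + (VLAN_TAG if vlan else 0) + ETHERNET_FCS
                + ip_header + UDP_HEADER + BTH + ext_header + ICRC)
    return payload / (payload + overhead)


for payload in (1024, 4096):
    eff = rocev2_wire_efficiency(payload)
    print(f"payload {payload:>4} B: {eff:.2%} efficient, "
          f"{400 * eff:.1f} Gb/s of a 400G port")
```

With a 4096-byte payload over IPv4 the bound works out to roughly 392 Gb/s on a 400G port, which is consistent with the 380-395 Gb/s range quoted in the TL;DR. Smaller path MTUs lower the ceiling noticeably.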

Why RoCEv2 needs lossless Ethernet

RDMA transport assumes near-zero packet loss. The InfiniBand reliable transport originally recovered from loss with go-back-N: the sender rewinds to the first missing packet and resends everything after it, which is catastrophic for large messages under congestion. Modern ConnectX-7+ NICs implement selective repeat retransmission (SR) and improved loss recovery, but even with SR, high loss still collapses throughput. The whole point of the PFC/ECN/DCQCN apparatus is to make loss a rare event.

Three mechanisms layered together produce near-lossless behaviour on Ethernet. ECN/DCQCN do the bulk of the work in steady state by throttling senders before congestion becomes severe; PFC is the safety net for transient bursts that ECN-driven backoff cannot react to quickly enough. PFC alone (without ECN) creates pause storms and head-of-line blocking under sustained load; ECN alone (without PFC) loses packets during the gap between mark and rate-cut response.

PFC (Priority Flow Control, IEEE 802.1Qbb): per-priority pause frames that stop upstream senders when downstream buffers fill. Lowest-level loss prevention. Hop-by-hop, no end-to-end signalling.
ECN (Explicit Congestion Notification, RFC 3168): switches mark packets in the IP header when queues build, before drops occur. End-to-end feedback signal.
DCQCN (Data Centre Quantised Congestion Notification, SIGCOMM 2015): the end-host congestion control algorithm that translates ECN marks into per-QP rate cuts. Hardware-implemented on the NIC.
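
The ECN step can be made concrete with the linear marking ramp used in WRED/ECN configurations: nothing is marked below the minimum threshold, the probability rises linearly to the configured maximum at the maximum threshold, and everything is marked beyond it. The sketch below is a simplified model under those assumptions; the function name is illustrative, and real switch ASICs may work on averaged rather than instantaneous queue depth.

```python
def ecn_mark_probability(queue_bytes: float, k_min: float,
                         k_max: float, p_max: float) -> float:
    """Probability that a packet is ECN-marked at a given queue depth.

    k_min / k_max are the WRED min/max thresholds in bytes and p_max is the
    marking probability reached at k_max (1.0 corresponds to "probability: 100").
    """
    if k_max <= k_min:
        raise ValueError("k_max must be greater than k_min")
    if queue_bytes <= k_min:
        return 0.0                      # queue is short: never mark
    if queue_bytes > k_max:
        return 1.0                      # queue is long: mark everything
    return p_max * (queue_bytes - k_min) / (k_max - k_min)


# Thresholds taken from the quick-start switch config (150000 / 1500000 bytes).
for depth in (100_000, 825_000, 1_500_000, 2_000_000):
    p = ecn_mark_probability(depth, 150_000, 1_500_000, 1.0)
    print(f"queue {depth:>9} B -> mark probability {p:.2f}")
```

Moving the minimum threshold down makes senders back off earlier at the cost of some throughput; moving it up leaves more of the job to PFC.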

Reference: NIC sysctls, kernel module options, and verbs

Operational reference for the NVIDIA mlx5 driver, which serves by far the most common RoCEv2 NICs. It lists the sysctls, module parameters, firmware settings, and environment variables an operator touches in production.

Knob	Where	Default	Notes / when to change
roce_mode	mlx5_core module	auto	Force `2` to disable RoCEv1; eliminates GID-index ambiguity
roce_ecn_marking_enable	mlx5_core module / mlxconfig	0 (off)	Enable on switch + NIC for ECN feedback path
NCCL_IB_GID_INDEX	Process env	Auto	Pin to 3 (RoCEv2/IPv4) explicitly; auto-detection misfires
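NCCL_IB_TC	Process env	0	Traffic class (IP ToS byte) for outbound RoCE; set to 106 (= DSCP 26 x 4 + 2, where the 2 sets the ECT(0) ECN bits)
NCCL_IB_TIMEOUT	Process env	20	Exponent: timeout = 4.096 us x 2^value. Raise to 22-24 on lossy paths; values that are too low abort jobs on transient loss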
NCCL_IB_RETRY_CNT	Process env	7	RDMA retries before failing the QP
net.ipv4.tcp_ecn	Linux sysctl	2	Affects TCP only, not RoCE. 2 = accept ECN when the peer requests it; 1 = also request it on outgoing connections
net.core.rmem_max / wmem_max	Linux sysctl	212k	Raise to 16-64 MB for high-bandwidth flows
mtu	Interface	1500	Set 9000+ jumbo for RoCE; reduces packet count and leaves headroom for VXLAN
mlxconfig CNP_DSCP_P1	Firmware	48	DSCP for Congestion Notification Packets (DCQCN feedback)
mlxconfig CNP_802P_PRIO_P1	Firmware	6	802.1p priority for CNP
mlxconfig ROCE_CC_ALGORITHM_P1	Firmware	ECN	Congestion-control algorithm; mlxconfig labels DCQCN as ECN

# Inspect current RoCE-relevant firmware config
# (device path varies by adapter; list candidates with `mst status`)
mlxconfig -d /dev/mst/mt4131_pciconf0 query | grep -E "ROCE|CNP|ECN"

# Persistently set DCQCN-friendly config (takes effect after a firmware reset or host reboot)
# mlxconfig names the DCQCN algorithm ECN; UDP port 4791 is fixed by the spec and needs no setting
mlxconfig -d /dev/mst/mt4131_pciconf0 set \
  ROCE_CC_ALGORITHM_P1=ECN \
  CNP_DSCP_P1=48 \
  CNP_802P_PRIO_P1=6

# Live per-port congestion counters (the canonical RoCE perf signal)
# (exact counter names vary by driver release; the pattern is deliberately broad)
ethtool -S enp1s0f0 | grep -E "pause|ecn_mark|out_of_buffer|discards"

# Per-port RDMA congestion and retransmission counters (raise concern on sustained growth)
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/np_cnp_sent
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_handled
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence
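
Two rows in the knob table are easier to reason about as arithmetic. The traffic class that NCCL_IB_TC sets is the 8-bit IP ToS byte: the 6-bit DSCP in the upper bits and the 2-bit ECN field in the lower bits, so DSCP 26 with the ECT(0) codepoint from RFC 3168 gives 106. NCCL_IB_TIMEOUT is the InfiniBand local ACK timeout exponent, where the timeout is 4.096 us multiplied by 2 to the power of the value. The helper names below are illustrative, not part of NCCL.

```python
ECT0 = 0b10  # ECN-capable transport codepoint ECT(0), RFC 3168


def traffic_class(dscp: int, ecn: int = ECT0) -> int:
    """Build the IP ToS / traffic-class byte from DSCP and ECN fields."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0-63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN is a 2-bit field (0-3)")
    return (dscp << 2) | ecn


def ib_ack_timeout_seconds(exponent: int) -> float:
    """Local ACK timeout for a given exponent: 4.096 us * 2**exponent."""
    return 4.096e-6 * 2 ** exponent


print("NCCL_IB_TC for DSCP 26 + ECT(0):", traffic_class(26))
for exp in (20, 22, 24):
    print(f"NCCL_IB_TIMEOUT={exp}: {ib_ack_timeout_seconds(exp):.1f} s per attempt")
```

Each step of the exponent doubles the wait, so moving from 20 to 24 raises the per-attempt timeout from about 4.3 s to about 69 s. That is why the value should only be raised as far as the loss behaviour of the path requires.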

Switch configuration: Cumulus Linux, Arista EOS, SONiC

Switch-side config differs by NOS but the concepts are identical: classify RoCE by DSCP, map to a dedicated priority queue, enable PFC on that priority, enable WRED/ECN marking with thresholds appropriate to the buffer depth. Examples below for the three most common NOSes in AI fabrics.

DSCP 26 / PCP 3 / Switch-Priority 3 is the de facto convention for RoCEv2 on AI fabrics. Pick a different value if your fabric already uses SP3 for something else — but document the mapping cluster-wide.
WRED/ECN thresholds depend on switch buffer depth. Shallow-buffer Tomahawk: min ~100 KB, max ~1 MB. Deep-buffer Jericho: min ~1 MB, max ~10 MB. Wrong thresholds either fail to mark (loss) or over-mark (throughput collapse).
PFC headroom (the extra buffer reserved per pause-enabled priority) must be sized for the longest cable round-trip in the fabric. Default headroom assumes ~10 m fibre; ~100 m runs need explicit headroom uplift.
Enable PFC watchdog. Stuck pause frames take a port out of service in seconds; the watchdog detects a queue that stays paused beyond its detection interval (typically a few hundred milliseconds) and breaks the loop.
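
The cable-length dependence of PFC headroom follows from simple arithmetic: after a switch sends a pause frame, data already on the wire keeps arriving for one cable round trip. The sketch below estimates that in-flight volume, assuming roughly 5 ns of propagation delay per metre of fibre and adding two maximum-size frames (one the peer may have just started sending, one that may delay the pause frame locally). It deliberately omits the peer's PFC response time and switch pipeline latency, which vendor sizing guides add on top, so treat the result as a lower bound. Function names are illustrative.

```python
FIBRE_DELAY_NS_PER_M = 5.0   # approximate propagation delay in optical fibre


def pfc_in_flight_bytes(cable_m: float, rate_gbps: float) -> float:
    """Bytes that arrive during one cable round trip at line rate."""
    round_trip_ns = 2 * cable_m * FIBRE_DELAY_NS_PER_M
    # ns * Gb/s gives bits; divide by 8 for bytes.
    return round_trip_ns * rate_gbps / 8


def pfc_headroom_bytes(cable_m: float, rate_gbps: float, mtu: int = 9216) -> float:
    """Lower-bound headroom: in-flight data plus two maximum-size frames."""
    return pfc_in_flight_bytes(cable_m, rate_gbps) + 2 * mtu


for length in (10, 100):
    print(f"{length:>3} m at 400G: "
          f"{pfc_in_flight_bytes(length, 400):,.0f} B in flight, "
          f"{pfc_headroom_bytes(length, 400):,.0f} B headroom (lower bound)")
```

Going from 10 m to 100 m raises the in-flight component tenfold, from about 5 KB to about 50 KB per lossless priority per port at 400G, which is why long runs need an explicit uplift.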

# --- NVIDIA Cumulus Linux 5.x (Spectrum-4) ---
nv set qos congestion-control switch-priority 3 wred-ecn enable on
nv set qos congestion-control switch-priority 3 wred-ecn min-threshold 150KB
nv set qos congestion-control switch-priority 3 wred-ecn max-threshold 1500KB
nv set qos pfc switch-priority 3 enable on
nv set qos mapping dscp-to-switch-priority 26 3
nv config apply

# --- Arista EOS (Tomahawk-5 / Jericho) ---
qos map dscp 26 to traffic-class 3
priority-flow-control on
priority-flow-control priority 3 no-drop
queue 3 random-detect ecn minimum-threshold 150 kbytes maximum-threshold 1500 kbytes max-mark-probability 100

# --- SONiC (community / Enterprise) ---
config qos clear
config qos reload
# QoS templates live in /etc/sonic/qos.json; the AI-fabric template
# ships from the platform vendor with DSCP 26 -> TC 3 + PFC priority 3
# pre-configured. Validate with:
show pfc counters
show queue counters

Workload patterns

RoCEv2 carries three distinct traffic patterns in an AI cluster, each with different congestion characteristics. Knowing which dominates your workload tells you which tuning lever to reach for first.

Training collectives (AllReduce, AllToAll): bursty, large messages (8 MB - 8 GB), strongly congestion-correlated across QPs. Dominant lever: DCQCN parameters (switch-side marking thresholds Kmin, Kmax, Pmax; NIC-side rate-increase steps Rai, Rhai). Default DCQCN tuned for 100G; needs re-tuning at 400G and 800G.
Storage I/O over RDMA (NVMe-oF, Lustre, GPFS): steady-state large reads/writes, less correlated across QPs. Dominant lever: PFC pause behaviour. Buffers must absorb sustained read bursts without pausing the training plane.
Inference KV-cache transfers (vLLM disaggregated, prefill-decode separation): small-to-medium messages (1-64 MB), latency-sensitive, low-rate. Dominant lever: ECMP hashing — avoid placing prefill->decode flows on congested links. Adaptive routing helps when available.
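
To see why the DCQCN parameters named for training collectives matter, it helps to look at the sender-side rate update. The sketch below is a simplified model of the reaction point as described in the SIGCOMM 2015 paper: a CNP cuts the current rate by a factor of alpha/2, fast recovery then halves the gap to the pre-cut target for a fixed number of rounds, and additive increase nudges the target up by Rai. Hyper-increase (Rhai), the byte counter, and timer interaction are left out, and the default values for `g`, `rai_gbps`, and `fast_recovery_rounds` are illustrative assumptions rather than NIC firmware defaults.

```python
class DcqcnSender:
    """Simplified DCQCN reaction point for a single QP (rates in Gb/s)."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 256,
                 rai_gbps: float = 0.04, fast_recovery_rounds: int = 5):
        self.line_rate_gbps = line_rate_gbps
        self.g = g                              # alpha update gain
        self.rai_gbps = rai_gbps                # additive-increase step
        self.fast_recovery_rounds = fast_recovery_rounds
        self.current_gbps = line_rate_gbps      # rate the QP sends at now
        self.target_gbps = line_rate_gbps       # rate to recover towards
        self.alpha = 1.0                        # congestion estimate
        self.rounds_since_cnp = 0

    def on_cnp(self) -> None:
        """A Congestion Notification Packet arrived: cut the rate."""
        self.target_gbps = self.current_gbps
        self.current_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rounds_since_cnp = 0

    def on_alpha_timer(self) -> None:
        """No CNP during the alpha-update interval: decay the estimate."""
        self.alpha *= 1 - self.g

    def on_increase_event(self) -> None:
        """Rate-increase timer fired without a CNP in between."""
        self.rounds_since_cnp += 1
        if self.rounds_since_cnp > self.fast_recovery_rounds:
            # Additive increase: raise the target, capped at line rate.
            self.target_gbps = min(self.target_gbps + self.rai_gbps,
                                   self.line_rate_gbps)
        # Both stages move the current rate halfway to the target.
        self.current_gbps = (self.target_gbps + self.current_gbps) / 2


sender = DcqcnSender(400.0)
sender.on_cnp()
print(f"after CNP: {sender.current_gbps} Gb/s")
for i in range(1, 6):
    sender.on_increase_event()
    print(f"recovery round {i}: {sender.current_gbps} Gb/s")
```

In this model a single CNP at full congestion estimate halves a 400G flow, and recovery speed is governed entirely by how often the increase event fires. That is the mechanism behind slow rate recovery when timers tuned for 100G are carried over to 400G and 800G links.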

Sizing and capacity planning

Real-world line-rate numbers at 400G and 800G, ConnectX-7/8 endpoints, Spectrum-4 or Tomahawk-5 switches, DCQCN tuned, PFC enabled on SP3. Treat as planning anchors; verify with ib_write_bw and nccl-tests.

At 400G, achievable single-flow throughput is 95-99 % of line rate (380-395 Gb/s) when PFC/ECN/DCQCN are correctly tuned. Anything less suggests a misconfiguration.
At 800G, single-flow throughput stays in the same 95-99 % band (760-790 Gb/s) but is more exposed to packet-loss recovery overheads; verify with ib_write_bw -F and watch for retransmission counter growth.
Tail latency under load is the operationally critical number. p99 above 200 us at 70 % offered load almost always indicates DCQCN under-tuning (slow rate recovery after a CNP).
Plan 10-20 % bandwidth headroom on every link; running RoCEv2 fabrics above 80 % sustained load amplifies tail-latency variance.
Yobitel NeoCloud's sovereign UK reference design lands on the 400G Spectrum-X (BlueField-3 + Spectrum-4) row below for Ethernet-preferring tenants; the same PFC/ECN/DCQCN profile ships in the cluster image, validated against the headline AllReduce numbers before customer access opens.

Port speed	NIC + Switch	Single-flow throughput	AllReduce N=64 busBW	Tail latency p99 at 70% load
100G	ConnectX-6 + Spectrum-3	94-97 Gb/s	11-12 GB/s	< 50 us
200G	ConnectX-6 + Spectrum-3	188-194 Gb/s	22-24 GB/s	< 80 us
400G (RoCE)	ConnectX-7 + Spectrum-4	380-395 Gb/s	44-48 GB/s	< 120 us
400G (Spectrum-X)	BlueField-3 + Spectrum-4	385-398 Gb/s	47-50 GB/s	< 100 us
800G (RoCE)	ConnectX-8 + Spectrum-X SN5600	760-790 Gb/s	90-95 GB/s	< 130 us
800G (UEC 1.0 packet spraying)	BlueField-3 SuperNIC + Spectrum-X	780-798 Gb/s	95-100 GB/s	< 90 us
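
When comparing a measured run against the busBW column, it helps to know how nccl-tests derives the figure: for AllReduce it reports bus bandwidth as algorithm bandwidth (message size divided by time) multiplied by 2(n-1)/n, which normalises the result to what each link must carry. The sketch below applies that formula and expresses the result as a fraction of the port's raw capacity. It assumes the table's busBW values were produced the same way; the function names are illustrative.

```python
def allreduce_bus_bw_gbytes(message_bytes: float, seconds: float, ranks: int) -> float:
    """nccl-tests style AllReduce bus bandwidth in GB/s."""
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    alg_bw = message_bytes / seconds / 1e9        # GB/s as seen by the caller
    return alg_bw * 2 * (ranks - 1) / ranks       # traffic each link carries


def link_utilisation(bus_bw_gbytes: float, port_gbps: float) -> float:
    """Bus bandwidth as a fraction of raw port capacity (Gb/s / 8 = GB/s)."""
    return bus_bw_gbytes / (port_gbps / 8)


# Hypothetical measurement: 1 GB AllReduce across 64 ranks in 43 ms.
bus_bw = allreduce_bus_bw_gbytes(1e9, 0.043, 64)
print(f"busBW {bus_bw:.1f} GB/s, {link_utilisation(bus_bw, 400):.0%} of a 400G port")
```

A result that sits well below the table's range for the same port speed points at the fabric rather than the collective algorithm, since the 2(n-1)/n factor has already removed the dependence on rank count.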

Observability

RoCE health surfaces in three places: NIC hardware counters (Mellanox hw_counters), switch port counters (PFC pause counts, ECN marks, drops), and NCCL/MPI job logs. Production fabrics export all three to Prometheus + Grafana via the dcgm-exporter, NVIDIA UFM telemetry (works on RoCE too) or SONiC's gnmi exporter.

Per-port: rx_pause_count, tx_pause_count, rx_out_of_buffer, ecn_marked_packets. Pause counts > 0 are normal; sustained growth is a problem.
RDMA (per port, from hw_counters): out_of_sequence (selective-repeat triggers), np_cnp_sent (CNPs generated by this node), rp_cnp_handled (CNPs received and rate-cut). Track per-job, alert on rates.
Job-level: NCCL log lines Connected ... using IB confirm RoCE path active; falling back to TCP is the disaster mode — alert on any occurrence.
Switch-level: PFC pause rate, ECN mark rate, WRED drop count, buffer occupancy per priority. Drops on the RoCE priority are always a red flag.

# Per-port RoCE health on a host
# (exact counter names vary by driver release; the pattern is deliberately broad)
ethtool -S enp1s0f0 | grep -E "pause|ecn_mark|out_of_buffer|discards"

# RDMA hw_counters (one set per port)
for f in /sys/class/infiniband/mlx5_0/ports/1/hw_counters/*; do
  printf "%-30s %s\n" "$(basename "$f")" "$(cat "$f")"
done

# Prometheus exporter snippet — UFM emits these for RoCE fabrics
# Sample queries:
#   sum by (host, port) (rate(rx_pause_count[5m])) > 100
#   sum by (job) (rate(ecn_marked_packets[5m])) / sum by (job) (rate(tx_packets[5m])) > 0.05
#   absent_over_time(nccl_job_using_ib[5m])  -- alert on TCP fallback
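
Because the counters are cumulative, the useful signal is their growth rate rather than their absolute value. The sketch below takes two snapshots of the hw_counters directory and reports per-second rates for the three counters highlighted in this section. It is a minimal illustration rather than a production exporter: the helper names are hypothetical, and the sysfs path is the one used in the shell examples above.

```python
from pathlib import Path
from typing import Dict, Iterable

HW_COUNTERS = Path("/sys/class/infiniband/mlx5_0/ports/1/hw_counters")
WATCHED = ("out_of_sequence", "np_cnp_sent", "rp_cnp_handled")


def read_counters(directory: Path = HW_COUNTERS) -> Dict[str, int]:
    """Read every integer counter file in the directory into a dict."""
    counters = {}
    for entry in directory.iterdir():
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            continue                     # skip unreadable or non-numeric files
    return counters


def counter_rates(before: Dict[str, int], after: Dict[str, int],
                  interval_s: float) -> Dict[str, float]:
    """Per-second growth for counters present in both snapshots."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return {name: (after[name] - before[name]) / interval_s
            for name in after if name in before}


def growing(rates: Dict[str, float],
            watch: Iterable[str] = WATCHED) -> Dict[str, float]:
    """Watched counters whose value increased during the interval."""
    return {name: rate for name, rate in rates.items()
            if name in watch and rate > 0}


# Made-up snapshots taken 10 s apart; on a host use read_counters() twice.
before = {"out_of_sequence": 120, "np_cnp_sent": 5000, "rp_cnp_handled": 4800}
after = {"out_of_sequence": 120, "np_cnp_sent": 5600, "rp_cnp_handled": 5300}
print(growing(counter_rates(before, after, 10.0)))
```

On a live host, call `read_counters()` before and after a `time.sleep()` and feed the two dicts to `counter_rates`. Steady CNP rates with a flat `out_of_sequence` counter indicate DCQCN is absorbing congestion without loss.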

Cost and FinOps

RoCEv2 itself has zero licence cost; the protocol is implemented in NIC hardware and switch silicon you already own. The cost-versus-InfiniBand argument is about hardware bill of materials and the operator-skill premium.

The headline TCO win for RoCE is at large fabric scale (1k+ GPUs); below 256 GPUs the per-switch savings rarely justify the operational complexity over plug-and-play InfiniBand.
Add 6-12 weeks of network-engineering time per new RoCE fabric for PFC/ECN/DCQCN tuning. Skipping this is the most common reason RoCE deployments underperform InfiniBand in the first 6 months.
Multi-vendor sourcing is the secondary win: RoCE switches can be Tomahawk, Spectrum, Jericho, Silicon One — InfiniBand is single-vendor.

Cost driver	RoCEv2 + 400G Ethernet	InfiniBand NDR (400 Gb/s)	Delta
Switch (1U, 64-port)	$45-70k (Tomahawk 5 / Spectrum-X SN5600)	$70-95k (Quantum-2 MQM9700)	-30 to -40 %
NIC per host (single dual-port 400G)	$2,400-3,500 (ConnectX-7 RoCE-mode)	$2,800-4,200 (CX-7 IB-mode)	-15 to -20 %
Optics per end (400G DR4)	$1,800-2,600	$1,800-2,400	Comparable
Switch OS / mgmt licence	$0-2k per port (Cumulus / SONiC)	$5-8k per port-year (UFM)	Much cheaper
Operator skill	Existing Ethernet team	Specialist IB skills	Variable; budget 6-12 wk ramp
Full 1,024-GPU fabric BOM	$1.6-2.4M	$2.5-3.5M	Roughly -30-35 %

Security and compliance

RoCEv2 inherits IP security primitives — IPsec, MACsec, VLAN segmentation — that pure InfiniBand does not have. For multi-tenant clusters, this matters: VRF isolation per tenant on the RoCE fabric is straightforward; the equivalent on InfiniBand requires partition-key (PKey) management discipline that fewer operators have.

Confidential RDMA: NVIDIA Hopper+ with Confidential Compute mode (CC-on) encrypts PCIe and HBM traffic; the RoCEv2 wire-format itself is unencrypted, but the data inside (GPU-to-GPU tensors) is protected by the CC envelope. For host-to-host plaintext-on-the-wire concerns, MACsec at the link layer is the operational answer; full end-to-end IPsec adds CPU overhead and is rarely chosen.

Compliance: RoCE fabrics qualify under NCSC Cloud Security Principles (UK), GDPR Article 32 (technical and organisational measures), HIPAA, and SOC 2 reference designs the same way any other Ethernet fabric does. The novel concerns are tenant-isolation evidence (VRF + PKey audit trails) and protection against side-channel attacks on shared infrastructure (CC-on attestation).

Migration and alternatives

RoCEv2 is the default RDMA-over-Ethernet today, but two adjacent technologies bracket it: iWARP (RDMA over TCP/IP) below, and the Ultra Ethernet Consortium 1.0 transport above. Both matter for migration planning.

Migrating from InfiniBand to RoCEv2: re-platforming exercise, not a swap. Budget 6-12 weeks per fabric for PFC/ECN/DCQCN tuning and operator training. Run both fabrics in parallel during cut-over.
Migrating from RoCEv2 to UEC 1.0: the UEC transport also runs over UDP/IP, so the routed Ethernet underlay carries over, but it uses its own transport headers and adds packet spraying, selective repeat, and rich per-flow telemetry. Switch silicon upgrade (Spectrum-X SN5600 or Tomahawk 5) plus SuperNIC required; no application changes.
Choosing iWARP over RoCEv2: only for WAN-distance RDMA (storage replication across DCs) or environments where the lossless tax is genuinely impractical. Almost never the right choice for new AI training fabrics.

Alternative	Encapsulation	Loss handling	Best for
RoCEv1	Pure Ethernet (no IP)	Lossless required	Legacy single-L2 deployments; not for new builds
RoCEv2	UDP/IP (port 4791)	Lossless required (PFC/ECN/DCQCN)	Default 2026 AI Ethernet fabric
iWARP (RFC 5040-5045)	TCP/IP	TCP-native, lossy-tolerant	WAN RDMA, storage; rare in AI
Ultra Ethernet 1.0 (UEC)	UDP/IP, packet spraying	Selective repeat at scale	New 2026+ AI fabrics; runs on the same Ethernet/IP underlay as RoCEv2
AWS SRD (EFAv2)	AWS-proprietary on UDP/IP	Selective repeat	AWS Trn1/P5 instances only
InfiniBand NDR/XDR	InfiniBand (own L1-4)	Lossless by design	Single-vendor, simpler ops, premium pricing

Troubleshooting

The RoCE failure-mode catalogue is large but the high-frequency entries are predictable. Map symptom to cause; verify with the listed action.

Symptom	Most likely cause	First action
No traffic flows; ib_write_bw fails to connect	Wrong GID index (RoCEv1 vs v2, IPv4 vs IPv6)	Run `show_gids`; pin NCCL_IB_GID_INDEX to RoCEv2/IPv4 entry
Single-flow throughput half of line rate	PFC enabled on wrong priority or DSCP mismatch	Verify DSCP marking on host matches switch SP mapping; check NCCL_IB_TC
AllReduce collapses under load	ECN not configured or WRED thresholds wrong	Check `ecn_marked_packets` on host; check WRED config on switch
Port goes down under sustained traffic	PFC pause storm (stuck pause frames)	Enable PFC watchdog on switch; investigate upstream congestion source
Random NCCL timeouts after hours of running	ECN under-tuning -> brief packet loss -> SR retries	Raise NCCL_IB_TIMEOUT to 24; inspect `out_of_sequence` counter growth
TCP fallback active despite RoCE config	GID index detection failed at NCCL init	Pin NCCL_IB_GID_INDEX explicitly; verify show_gids output post-driver-load
High pause counts on storage fabric, healthy on training	Storage I/O bursts saturating shared switch buffers	Separate storage and training onto different priorities or fabrics
Head-of-line blocking — one slow host stalls others	PFC alone (no ECN) — buffers fill upstream	Enable ECN/WRED to provide rate-control feedback before PFC kicks in
Slow degradation over weeks	Optic ageing or marginal cable	Check `symbol_errors` and `fec_corrected_blocks`; swap suspect optics

Warning: The classic failure mode: RoCEv2 enabled, PFC mapped to the wrong priority, ECN disabled at TOR. Under load the fabric drops packets, the RDMA transport collapses, and AllReduce times go from milliseconds to seconds. Always verify all three mechanisms with synthetic congestion (ib_write_bw from multiple hosts to one target) before opening to production traffic.

Where this fits in the Yobitel stack

Yobitel runs both InfiniBand NDR/XDR and RoCEv2 fabrics in production: InfiniBand is the default on the H100 training pods and the GB200 NVL72 racks where single-vendor predictability matters most; RoCEv2 on Spectrum-X is the default on the H100/H200 sovereign UK pods where Cumulus-Linux-based operations align with the in-house Ethernet skill set. Both fabrics terminate into the same Yobitel GPU Cloud control plane and look identical to a Yobibyte customer — the difference is invisible above the cluster abstraction.

For customers running directly on Yobitel GPU Cloud rather than Yobibyte, the cluster image ships with the appropriate /etc/profile.d/nccl.sh profile: RoCE pods include the GID-index pinning, NCCL_IB_TC=106, and PFC-verified switch config; InfiniBand pods include the SHARP-enabling profile. The choice of fabric is exposed at provisioning time so workloads with specific fabric requirements can target accordingly.

References

InfiniBand Architecture Specification Annex A17 (RoCEv2) · InfiniBand Trade Association
RFC 3168 — The Addition of Explicit Congestion Notification to IP · IETF
IEEE 802.1Qbb — Priority-based Flow Control · IEEE
Congestion Control for Large-Scale RDMA Deployments (DCQCN, SIGCOMM 2015) · Microsoft / Mellanox / SIGCOMM
NVIDIA RoCEv2 Configuration Best Practices · NVIDIA
Ultra Ethernet Consortium 1.0 Specification · Ultra Ethernet Consortium

TL;DR

RoCEv2 encapsulates InfiniBand transport headers inside UDP/IP packets (UDP destination port 4791), letting RDMA verbs run over routed IP/Ethernet fabrics with InfiniBand semantics.
Defined by IBTA Annex A17 (2014); requires lossless or near-lossless Ethernet via PFC (IEEE 802.1Qbb), ECN (RFC 3168), and DCQCN (Microsoft/Mellanox, SIGCOMM 2015) to perform at scale.
Implemented by every modern data-centre NIC: NVIDIA ConnectX-6/7/8, BlueField-2/3/4 SuperNICs, AMD Pensando, Intel E810, Broadcom Thor 2, and AWS Nitro/EFA.
Forms the basis of every modern Ethernet GPU fabric — NVIDIA Spectrum-X, Broadcom Tomahawk-5-based AI fabrics, AWS EFAv2, and the entire Ultra Ethernet Consortium 2.0 effort all build on RoCEv2 semantics.
Typical 400G port reaches 380-395 Gb/s payload at near-zero packet loss when PFC/ECN/DCQCN are correctly tuned; collapses to 50-100 Gb/s with congestion-loss-driven retransmissions when they are not.

Overview

Quick start: enable RoCEv2 on a ConnectX-7 + Cumulus Linux switch

# --- Host side: ConnectX-7 + mlx5 ---
# 1) Confirm RoCEv2 capability on the HCA
ibv_devinfo -d mlx5_0 | grep -E "transport|rocev2"
# transport:    InfiniBand
# active_mtu:   4096 (5)

# 2) Set the GID index for RoCEv2 over IPv4 (typically index 3)
#    show_gids prints the GID table; the v2 entries are tagged "RoCE v2"
show_gids
# DEV     PORT  INDEX  GID                  IPv4         TYPE        NDEV
# mlx5_0  1     0      fe80::... (link)     -            RoCE v1     enp1s0f0
# mlx5_0  1     1      fe80::... (link)     -            RoCE v2     enp1s0f0
# mlx5_0  1     2      ...                  10.1.1.10    RoCE v1     enp1s0f0
# mlx5_0  1     3      ...                  10.1.1.10    RoCE v2     enp1s0f0  <-- use this

# 3) Set DSCP 26 (PCP 3) for RoCEv2 traffic at the IP layer
#    Map application priority -> RoCE -> DSCP via mlnx_qos
mlnx_qos -i enp1s0f0 --trust=dscp
mlnx_qos -i enp1s0f0 --pfc=0,0,0,1,0,0,0,0      # enable PFC on priority 3

# 4) Test bandwidth between two hosts (server then client)
ib_write_bw -d mlx5_0 -F --report_gbits -x 3    # server, GID 3
ib_write_bw -d mlx5_0 -F --report_gbits -x 3 <server-ip>   # client

# --- Switch side: NVIDIA Cumulus Linux 5.x (Spectrum-4 example) ---
# /etc/nvue.d/roce.yaml
cat <<'EOF' | nv config patch --apply -
- set:
    interface:
      swp1-64:
        ip:
          neighbor-discovery:
            router-advertisement:
              enable: off
        link:
          mtu: 9216                          # jumbo, leaves headroom for VXLAN
        qos:
          pfc:
            switch-priority:
              3:
                enable: on                   # match host PFC priority
          congestion-control:
            wred-ecn:
              enable: on
              min-threshold: 150000          # bytes; tune per buffer depth
              max-threshold: 1500000
              probability: 100
    qos:
      mapping:
        dscp-to-switch-priority:
          26: 3                              # DSCP 26 -> SP3
EOF

Tip: Bring up the data path first (steps 1-4 host, then switch QoS), and verify with ib_write_bw at near-line-rate before enabling production training. Most RoCE bring-up debugging time goes to GID index, DSCP-to-priority mapping, and PFC mismatch between host and switch — not to anything more exotic.

How it works: packet structure

RoCEv2 packet on the wire:

+-----------+--------+--------+--------+-----+---------+------+
| Ethernet  | IPv4/6 |  UDP   |  BTH   | ETH | Payload | ICRC |
| 14 bytes  | 20/40  | 8 bytes| 12     | 4   | 0..4096 | 4    |
+-----------+--------+--------+--------+-----+---------+------+
                      dport
                      4791
                                ^------- InfiniBand transport
                                         (unchanged from native IB)

Why RoCEv2 needs lossless Ethernet

PFC (Priority Flow Control, IEEE 802.1Qbb): per-priority pause frames that stop upstream senders when downstream buffers fill. Lowest-level loss prevention. Hop-by-hop, no end-to-end signalling.
ECN (Explicit Congestion Notification, RFC 3168): switches mark packets in the IP header when queues build, before drops occur. End-to-end feedback signal.
DCQCN (Data Centre Quantised Congestion Notification, SIGCOMM 2015): the end-host congestion control algorithm that translates ECN marks into per-QP rate cuts. Hardware-implemented on the NIC.

Reference: NIC sysctls, kernel module options, and verbs

Operational reference for the NVIDIA mlx5 driver — by far the most common RoCEv2 NIC. The relevant sysctls, module parameters, and verbs queries an operator touches in production.

Knob	Where	Default	Notes / when to change
roce_mode	mlx5_core module	auto	Force `2` to disable RoCEv1; eliminates GID-index ambiguity
roce_ecn_marking_enable	mlx5_core module / mlxconfig	0 (off)	Enable on switch + NIC for ECN feedback path
NCCL_IB_GID_INDEX	Process env	Auto	Pin to 3 (RoCEv2/IPv4) explicitly; auto-detection misfires
NCCL_IB_TC	Process env	0	DSCP class for outbound RoCE; set to 106 (= DSCP 26 x 4)
NCCL_IB_TIMEOUT	Process env	20	Raise to 22-24 on lossy paths; lower hangs jobs
NCCL_IB_RETRY_CNT	Process env	7	RDMA retries before failing the QP
net.ipv4.tcp_ecn	Linux sysctl	2	Enable ECN system-wide (2 = ECN if peer supports)
net.core.rmem_max / wmem_max	Linux sysctl	212k	Raise to 16-64 MB for high-bandwidth flows
mtu	Interface	1500	Set 9000+ jumbo for RoCE; reduce RPC count, headroom for VXLAN
mlxconfig CNP_DSCP	Firmware	48	DSCP for Congestion Notification Packets (DCQCN feedback)
mlxconfig CNP_PRIO	Firmware	6	802.1p priority for CNP
mlxconfig ROCE_CC_ALGO	Firmware	DCQCN	DCQCN

# Inspect current RoCE-relevant firmware config
mlxconfig -d /dev/mst/mt4131_pciconf0 query | grep -E "ROCE|CNP|ECN"

# Persistently set DCQCN-friendly config (requires NIC reboot)
mlxconfig -d /dev/mst/mt4131_pciconf0 set \
  ROCE_CC_ALGORITHM_P1=DCQCN \
  CNP_DSCP_P1=48 \
  CNP_802P_PRIO_P1=6 \
  ROCE_NEXT_PROTOCOL_P1=4791

# Live per-port congestion counters (the canonical RoCE perf signal)
ethtool -S enp1s0f0 | grep -E "rx_pause|tx_pause|rx_congestion|ecn_marked|rx_out_of_buffer"

# Per-QP retransmission counters (raise concern if non-zero growth)
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/np_cnp_sent
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/rp_cnp_handled
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/out_of_sequence

Switch configuration: Cumulus Linux, Arista EOS, SONiC

DSCP 26 / PCP 3 / Switch-Priority 3 is the de facto convention for RoCEv2 on AI fabrics. Pick a different value if your fabric already uses SP3 for something else — but document the mapping cluster-wide.
WRED/ECN thresholds depend on switch buffer depth. Shallow-buffer Tomahawk: min ~100 KB, max ~1 MB. Deep-buffer Jericho: min ~1 MB, max ~10 MB. Wrong thresholds either fail to mark (loss) or over-mark (throughput collapse).
PFC headroom (the extra buffer reserved per pause-enabled priority) must be sized for the longest cable round-trip in the fabric. Default headroom assumes ~10 m fibre; ~100 m runs need explicit headroom uplift.
Enable PFC watchdog. Stuck pause frames take a port out of service in seconds; PFC watchdog detects and breaks the loop within milliseconds.

# --- NVIDIA Cumulus Linux 5.x (Spectrum-4) ---
nv set qos congestion-control switch-priority 3 wred-ecn enable on
nv set qos congestion-control switch-priority 3 wred-ecn min-threshold 150KB
nv set qos congestion-control switch-priority 3 wred-ecn max-threshold 1500KB
nv set qos pfc switch-priority 3 enable on
nv set qos mapping dscp-to-switch-priority 26 3
nv config apply

# --- Arista EOS (Tomahawk-5 / Jericho) ---
qos map dscp 26 to traffic-class 3
priority-flow-control on
priority-flow-control priority 3 no-drop
queue 3 random-detect ecn minimum-threshold 150 kbytes maximum-threshold 1500 kbytes max-mark-probability 100

# --- SONiC (community / Enterprise) ---
config qos clear
config qos reload
# QoS templates live in /etc/sonic/qos.json; the AI-fabric template
# ships from the platform vendor with DSCP 26 -> TC 3 + PFC priority 3
# pre-configured. Validate with:
show pfc counters
show queue counters

Workload patterns

RoCEv2 carries three distinct traffic patterns in an AI cluster, each with different congestion characteristics. Knowing which dominates your workload tells you which tuning lever to reach for first.

Training collectives (AllReduce, AllToAll): bursty, large messages (8 MB - 8 GB), strongly congestion-correlated across QPs. Dominant lever: DCQCN parameters (Kmin, Kmax, Pmax, Rai, Rhai). Default DCQCN tuned for 100G; needs re-tuning at 400G and 800G.
Storage I/O over RDMA (NVMe-oF, Lustre, GPFS): steady-state large reads/writes, less correlated across QPs. Dominant lever: PFC pause behaviour. Buffers must absorb sustained read bursts without pausing the training plane.
Inference KV-cache transfers (vLLM disaggregated, prefill-decode separation): small-to-medium messages (1-64 MB), latency-sensitive, low-rate. Dominant lever: ECMP hashing — avoid placing prefill->decode flows on congested links. Adaptive routing helps when available.
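To make the DCQCN lever for training collectives more concrete, here is a simplified model of the sender-side (reaction point) behaviour described in the DCQCN paper: a multiplicative cut on each CNP, then fast recovery, additive increase, and hyper increase. It is a sketch under stated assumptions, not NIC firmware: the class name is hypothetical, the Rai/Rhai defaults are placeholders rather than recommendations, and the paper's separate timer and byte counters are collapsed into a single step counter.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    line_rate: float                 # Gb/s
    r_ai: float = 5.0                # additive-increase step (Rai), assumed value
    r_hai: float = 50.0              # hyper-increase step (Rhai), assumed value
    g: float = 1 / 256               # gain used to update alpha
    fast_recovery_steps: int = 5     # F in the DCQCN paper
    current: float = 0.0             # current rate, Rc
    target: float = 0.0              # target rate, Rt
    alpha: float = 1.0               # congestion estimate
    steps_since_cnp: int = 0

    def __post_init__(self):
        self.current = self.target = self.line_rate

    def on_cnp(self):
        """A CNP arrived: remember the old rate, cut, raise alpha."""
        self.target = self.current
        self.current *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.steps_since_cnp = 0

    def on_quiet_period(self):
        """No CNP during the alpha timer window: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_event(self):
        """Timer or byte counter expired without a CNP: recover rate."""
        f = self.fast_recovery_steps
        if self.steps_since_cnp >= 2 * f:        # simplification of hyper increase
            self.target += self.r_hai
        elif self.steps_since_cnp >= f:          # additive increase
            self.target += self.r_ai
        self.target = min(self.target, self.line_rate)
        self.current = min((self.target + self.current) / 2, self.line_rate)
        self.steps_since_cnp += 1


if __name__ == "__main__":
    sender = DcqcnSender(line_rate=400.0)
    sender.on_cnp()
    print(f"after CNP: {sender.current:.2f} Gb/s")
    for step in range(1, 8):
        sender.on_increase_event()
        print(f"increase {step}: {sender.current:.2f} Gb/s")
```

The model shows why recovery speed matters for collectives: every CNP halves the rate when alpha is near 1, and the time to climb back is set by how often increase events fire and how large the Rai and Rhai steps are.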

Sizing and capacity planning

At 400G, achievable single-flow throughput is 95-99 % of line rate (380-395 Gb/s) when PFC/ECN/DCQCN are correctly tuned. Anything less suggests a misconfiguration.
At 800G, single-flow throughput sits in a similar 95-99 % band (760-790 Gb/s); shortfalls at this speed usually come from packet-loss recovery overheads, so verify with ib_write_bw -F and watch for retransmission counter growth.
Tail latency under load is the operationally critical number. p99 above 200 us at 70 % offered load almost always indicates DCQCN under-tuning (slow rate recovery after a CNP).
Plan 10-20 % bandwidth headroom on every link; running RoCEv2 fabrics above 80 % sustained load amplifies tail-latency variance.
Yobitel NeoCloud's sovereign UK reference design lands on the 400G Spectrum-X (BlueField-3 + Spectrum-4) row in the table below for Ethernet-preferring tenants. The same PFC/ECN/DCQCN profile ships in the cluster image and is validated against the headline AllReduce numbers before customer access opens.

Port speed	NIC + Switch	Single-flow throughput	AllReduce N=64 busBW	Tail latency p99 at 70% load
100G	ConnectX-6 + Spectrum-3	94-97 Gb/s	11-12 GB/s	< 50 us
200G	ConnectX-6 + Spectrum-3	188-194 Gb/s	22-24 GB/s	< 80 us
400G (RoCE)	ConnectX-7 + Spectrum-4	380-395 Gb/s	44-48 GB/s	< 120 us
400G (Spectrum-X)	BlueField-3 + Spectrum-4	385-398 Gb/s	47-50 GB/s	< 100 us
800G (RoCE)	ConnectX-8 + Spectrum-X SN5600	760-790 Gb/s	90-95 GB/s	< 130 us
800G (UEC 1.0 packet spraying)	BlueField-3 SuperNIC + Spectrum-X	780-798 Gb/s	95-100 GB/s	< 90 us
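The throughput ranges in the table can be converted to a percentage of line rate with simple arithmetic, which is handy when comparing your own ib_write_bw results against it. The sketch below does that for the plain RoCE rows and also prints the per-direction byte-rate ceiling (line rate divided by 8). Comparing the AllReduce busBW column against that ceiling is an assumption about how the column is defined; the helper names are hypothetical.

```python
# (label, line rate Gb/s, measured low Gb/s, measured high Gb/s) from the table
TABLE_ROWS = [
    ("100G", 100, 94, 97),
    ("200G", 200, 188, 194),
    ("400G (RoCE)", 400, 380, 395),
    ("800G (RoCE)", 800, 760, 790),
]


def efficiency_pct(line_gbps, measured_gbps):
    """Measured throughput as a percentage of nominal line rate."""
    return 100.0 * measured_gbps / line_gbps


def line_rate_gbytes(line_gbps):
    """Nominal line rate expressed in GB/s (8 bits per byte)."""
    return line_gbps / 8.0


for label, line, low, high in TABLE_ROWS:
    print(f"{label:<12} {efficiency_pct(line, low):.1f}-"
          f"{efficiency_pct(line, high):.1f} % of line rate, "
          f"ceiling {line_rate_gbytes(line):.1f} GB/s")
```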

Observability

Per-port: rx_pause_count, tx_pause_count, rx_out_of_buffer, ecn_marked_packets. Pause counts > 0 are normal; sustained growth is a problem.
RDMA transport (per-port hw_counters in sysfs): out_of_sequence (out-of-order arrivals, which trigger retransmission), np_cnp_sent (CNPs generated by this node), rp_cnp_handled (CNPs received and rate-cut). Track per-job, alert on rates.
Job-level: NCCL log lines Connected ... using IB confirm RoCE path active; falling back to TCP is the disaster mode — alert on any occurrence.
Switch-level: PFC pause rate, ECN mark rate, WRED drop count, buffer occupancy per priority. Drops on the RoCE priority are always a red flag.

# Per-port RoCE health on a host
ethtool -S enp1s0f0 | grep -E "rx_pause|tx_pause|rx_out_of_buffer|ecn_marked"

# RDMA hardware counters (per port; repeat for each mlx5 device and port)
for f in /sys/class/infiniband/mlx5_0/ports/1/hw_counters/*; do
  printf "%-30s %s\n" "$(basename "$f")" "$(cat "$f")"
done

# Prometheus exporter snippet — UFM emits these for RoCE fabrics
# Sample queries:
#   sum by (host, port) (rate(rx_pause_count[5m])) > 100
#   sum by (job) (rate(ecn_marked_packets[5m])) / sum by (job) (rate(tx_packets[5m])) > 0.05
#   absent_over_time(nccl_job_using_ib[5m])   (alert on TCP fallback)
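Because the advice above is to alert on rates rather than absolute values, a small poller that samples the sysfs counters twice and reports per-second deltas can be useful before a full exporter is in place. The sketch below assumes the same mlx5_0 port 1 path used earlier; the function names and the 5-second interval are illustrative choices, not part of any NVIDIA tool.

```python
import os
import time


def read_counters(counter_dir):
    """Return {counter_name: int_value} for every readable file in counter_dir."""
    values = {}
    for name in sorted(os.listdir(counter_dir)):
        try:
            with open(os.path.join(counter_dir, name)) as fh:
                values[name] = int(fh.read().strip())
        except (OSError, ValueError):
            continue                      # skip unreadable or non-numeric entries
    return values


def counter_rates(before, after, seconds):
    """Per-second increase of each counter present in both samples."""
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}


if __name__ == "__main__":
    hw_dir = "/sys/class/infiniband/mlx5_0/ports/1/hw_counters"
    interval = 5.0                        # assumed sampling interval in seconds
    if os.path.isdir(hw_dir):
        first = read_counters(hw_dir)
        time.sleep(interval)
        second = read_counters(hw_dir)
        for name, rate in counter_rates(first, second, interval).items():
            if rate > 0:
                print(f"{name:<30} {rate:10.1f} /s")
    else:
        print(f"{hw_dir} not found; is the mlx5 driver loaded?")
```

Printing only the counters that moved keeps the output short enough to eyeball during a synthetic congestion test.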

Cost and FinOps

The headline TCO win for RoCE is at large fabric scale (1k+ GPUs); below 256 GPUs the per-switch savings rarely justify the operational complexity over plug-and-play InfiniBand.
Add 6-12 weeks of network-engineering time per new RoCE fabric for PFC/ECN/DCQCN tuning. Skipping this is the most common reason RoCE deployments underperform InfiniBand in the first 6 months.
Multi-vendor sourcing is the secondary win: RoCE switches can be Tomahawk, Spectrum, Jericho, Silicon One — InfiniBand is single-vendor.

Cost driver	RoCEv2 + 400G Ethernet	InfiniBand NDR (400 Gb/s)	Delta
Switch (1U, 64-port)	$45-70k (Tomahawk 5 / Spectrum-X SN5600)	$70-95k (Quantum-2 MQM9700)	-30 to -40 %
NIC per host (single dual-port 400G)	$2,400-3,500 (ConnectX-7 RoCE-mode)	$2,800-4,200 (CX-7 IB-mode)	-15 to -20 %
Optics per end (400G DR4)	$1,800-2,600	$1,800-2,400	Comparable
Switch OS / mgmt licence	$0-2k per port (Cumulus / SONiC)	$5-8k per port-year (UFM)	Much cheaper
Operator skill	Existing Ethernet team	Specialist IB skills	Variable; budget 6-12 wk ramp
Full 1,024-GPU fabric BOM	$1.6-2.4M	$2.5-3.5M	Roughly -30 to -35 %
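The Delta column can be reproduced from the price ranges by comparing like with like (low end against low end, high end against high end, and midpoints). The sketch below does this for the full-fabric BOM row; the pairing of range ends is an assumption about how the deltas were derived, and the helper name is hypothetical.

```python
def delta_pct(roce_cost, ib_cost):
    """Cost difference of RoCE relative to InfiniBand, in percent."""
    return 100.0 * (roce_cost - ib_cost) / ib_cost


# Full 1,024-GPU fabric BOM in millions of dollars, from the table above.
ROCE_LOW, ROCE_HIGH = 1.6, 2.4
IB_LOW, IB_HIGH = 2.5, 3.5

pairs = {
    "low end": (ROCE_LOW, IB_LOW),
    "midpoint": ((ROCE_LOW + ROCE_HIGH) / 2, (IB_LOW + IB_HIGH) / 2),
    "high end": (ROCE_HIGH, IB_HIGH),
}

for label, (roce, ib) in pairs.items():
    print(f"{label:<9} RoCE ${roce:.1f}M vs IB ${ib:.1f}M -> {delta_pct(roce, ib):+.1f} %")
```

The engineering-time line item is not in this arithmetic; add it separately when comparing small fabrics, where it weighs more heavily against the hardware saving.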

Security and compliance

Migration and alternatives

Migrating from InfiniBand to RoCEv2: re-platforming exercise, not a swap. Budget 6-12 weeks per fabric for PFC/ECN/DCQCN tuning and operator training. Run both fabrics in parallel during cut-over.
Migrating from RoCEv2 to UEC 1.0: the UEC transport keeps RoCEv2's UDP/4791 wire format for handshake compatibility, but adds packet spraying, selective repeat, and rich per-flow telemetry. Switch silicon upgrade (Spectrum-X SN5600 or Tomahawk 5) plus SuperNIC required; no application changes.
Choosing iWARP over RoCEv2: only for WAN-distance RDMA (storage replication across DCs) or environments where the lossless tax is genuinely impractical. Almost never the right choice for new AI training fabrics.

Alternative	Encapsulation	Loss handling	Best for
RoCEv1	Pure Ethernet (no IP)	Lossless required	Legacy single-L2 deployments; not for new builds
RoCEv2	UDP/IP (port 4791)	Lossless required (PFC/ECN/DCQCN)	Default 2026 AI Ethernet fabric
iWARP (RFC 5040-5045)	TCP/IP	TCP-native, lossy-tolerant	WAN RDMA, storage; rare in AI
Ultra Ethernet 1.0 (UEC)	UDP/IP, packet spraying	Selective repeat at scale	New 2026+ AI fabrics; backward-compatible wire format with RoCEv2
AWS SRD (EFAv2)	AWS-proprietary on UDP/IP	Selective repeat	AWS Trn1/P5 instances only
InfiniBand NDR/XDR	InfiniBand (own L1-4)	Lossless by design	Single-vendor, simpler ops, premium pricing

Troubleshooting

The RoCE failure-mode catalogue is large but the high-frequency entries are predictable. Map symptom to cause; verify with the listed action.

Symptom	Most likely cause	First action
No traffic flows; ib_write_bw fails to connect	Wrong GID index (RoCEv1 vs v2, IPv4 vs IPv6)	Run `show_gids`; pin NCCL_IB_GID_INDEX to RoCEv2/IPv4 entry
Single-flow throughput half of line rate	PFC enabled on wrong priority or DSCP mismatch	Verify DSCP marking on host matches switch SP mapping; check NCCL_IB_TC
AllReduce collapses under load	ECN not configured or WRED thresholds wrong	Check `ecn_marked_packets` on host; check WRED config on switch
Port goes down under sustained traffic	PFC pause storm (stuck pause frames)	Enable PFC watchdog on switch; investigate upstream congestion source
Random NCCL timeouts after hours of running	ECN under-tuning -> brief packet loss -> SR retries	Raise NCCL_IB_TIMEOUT to 24; inspect `out_of_sequence` counter growth
TCP fallback active despite RoCE config	GID index detection failed at NCCL init	Pin NCCL_IB_GID_INDEX explicitly; verify show_gids output post-driver-load
High pause counts on storage fabric, healthy on training	Storage I/O bursts saturating shared switch buffers	Separate storage and training onto different priorities or fabrics
Head-of-line blocking — one slow host stalls others	PFC alone (no ECN) — buffers fill upstream	Enable ECN/WRED to provide rate-control feedback before PFC kicks in
Slow degradation over weeks	Optic ageing or marginal cable	Check `symbol_errors` and `fec_corrected_blocks`; swap suspect optics
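Several rows above come down to picking the right GID index. As a complement to show_gids, the sketch below walks the kernel's sysfs GID table for one port and returns the first index that is both RoCE v2 and IPv4-mapped, which is the value to pin in NCCL_IB_GID_INDEX. It assumes the standard sysfs layout (gids/N and gid_attrs/types/N under the port directory) and that the type file reads "RoCE v2"; the function name is hypothetical.

```python
import os

# IPv4 addresses appear in the GID table as IPv4-mapped IPv6 (::ffff:a.b.c.d).
IPV4_MAPPED_PREFIX = "0000:0000:0000:0000:0000:ffff:"


def find_rocev2_ipv4_gid(port_dir):
    """Return the lowest GID index that is RoCE v2 + IPv4, or None."""
    gids_dir = os.path.join(port_dir, "gids")
    types_dir = os.path.join(port_dir, "gid_attrs", "types")
    indices = sorted(int(n) for n in os.listdir(gids_dir) if n.isdigit())
    for index in indices:
        try:
            with open(os.path.join(gids_dir, str(index))) as fh:
                gid = fh.read().strip()
            with open(os.path.join(types_dir, str(index))) as fh:
                gid_type = fh.read().strip()
        except OSError:
            continue                      # unpopulated table entries are unreadable
        if gid_type == "RoCE v2" and gid.startswith(IPV4_MAPPED_PREFIX):
            return index
    return None


if __name__ == "__main__":
    port = "/sys/class/infiniband/mlx5_0/ports/1"
    if os.path.isdir(port):
        print("NCCL_IB_GID_INDEX candidate:", find_rocev2_ipv4_gid(port))
    else:
        print(f"{port} not found")
```

Run it after the driver has loaded and the interface has its IPv4 address; the table is populated only once the address is configured.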

Warning: the classic failure mode is RoCEv2 enabled with PFC mapped to the wrong priority and ECN disabled at the ToR. Under load the fabric drops packets, the RDMA transport collapses, and AllReduce times go from milliseconds to seconds. Always verify all three mechanisms (PFC, ECN, DCQCN) with synthetic congestion (ib_write_bw from multiple hosts to one target) before opening to production traffic.
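A frequent source of the priority mismatch described above is confusing the 6-bit DSCP value with the 8-bit traffic-class byte that host-side settings often expect. The sketch below shows the bit arithmetic for DSCP 26. It assumes the host setting (for example NCCL_IB_TC) takes the full 8-bit value with the two ECN bits in the low positions; confirm that against your NCCL and driver documentation. The function name is hypothetical.

```python
def traffic_class(dscp, ecn_bits=0b10):
    """Combine a 6-bit DSCP and 2-bit ECN field into the 8-bit traffic class.

    ecn_bits defaults to 0b10, i.e. ECT(0), meaning "ECN-capable transport".
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    if not 0 <= ecn_bits <= 3:
        raise ValueError("ECN field must be in 0..3")
    return (dscp << 2) | ecn_bits


dscp = 26
print(f"DSCP {dscp} with ECT(0): traffic class {traffic_class(dscp)}")
print(f"DSCP {dscp} without ECN bits: traffic class {traffic_class(dscp, 0)}")
```

For DSCP 26 this yields 106 with ECT(0) set and 104 without, so a host configured with the raw value 26 would be marking a different DSCP from the one the switch maps to the lossless priority.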

Where this fits in the Yobitel stack

References

InfiniBand Architecture Specification Annex A17 (RoCEv2) · InfiniBand Trade Association
RFC 3168 — The Addition of Explicit Congestion Notification to IP · IETF
IEEE 802.1Qbb — Priority-based Flow Control · IEEE
Congestion Control for Large-Scale RDMA Deployments (DCQCN, SIGCOMM 2015) · Microsoft / Mellanox / SIGCOMM
NVIDIA RoCEv2 Configuration Best Practices · NVIDIA
Ultra Ethernet Consortium 1.0 Specification · Ultra Ethernet Consortium
