TL;DR
- Lossless Ethernet is the operational practice of running Ethernet so that packets are essentially never dropped due to congestion. RoCEv2 and other RDMA-over-Ethernet protocols need this to perform well.
- Achieved through Data Centre Bridging (DCB) extensions: PFC (802.1Qbb) for per-priority pause, ETS (802.1Qaz) for bandwidth guarantees, and DCBX (802.1Qaz) for capability negotiation.
- Layered with ECN (RFC 3168) and end-host algorithms (DCQCN, HPCC, Swift) for proactive congestion management.
- Modern AI fabrics (Spectrum-X, Tomahawk-based AI fabrics) extend the baseline DCB recipe with vendor-specific congestion control and adaptive routing.
Overview
Lossless Ethernet is not a single standard. It is a configuration practice and a stack of cooperating mechanisms that together make Ethernet behave well enough for RDMA. The IEEE Data Centre Bridging (DCB) suite is the standardised baseline: PFC handles emergency pauses, ETS provides bandwidth guarantees, and DCBX lets endpoints and switches negotiate capabilities.
On top of DCB, end-to-end congestion control (ECN/DCQCN, more modern HPCC and Swift) prevents queues from filling in the first place. The combination is what makes RoCEv2 viable at scale.
Data Centre Bridging Suite
| Standard | Mechanism | Role |
|---|---|---|
| 802.1Qbb | PFC (Priority-based Flow Control) | Per-priority pause frames |
| 802.1Qaz | ETS (Enhanced Transmission Selection) | Per-priority bandwidth allocation |
| 802.1Qaz | DCBX (DCB Capability Exchange) | Negotiates DCB settings across links |
| 802.1Qau | QCN (Quantised Congestion Notification) | Layer 2 end-to-end congestion notification (rarely used in practice) |
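To make the PFC row concrete: a PFC frame carries one 16-bit timer per priority, expressed in pause quanta, where one quantum is 512 bit times on the link. The sketch below converts a quanta value into wall-clock pause time. The function name and the chosen link speeds are illustrative assumptions, not taken from any vendor tool.

```python
# One pause quantum is 512 bit times on the link (IEEE 802.3 / 802.1Qbb).
BITS_PER_QUANTUM = 512
MAX_QUANTA = 0xFFFF  # each per-priority timer field is 16 bits wide


def pause_duration_us(quanta: int, link_gbps: float) -> float:
    """Return how long one priority stays paused, in microseconds."""
    if not 0 <= quanta <= MAX_QUANTA:
        raise ValueError("quanta must fit in 16 bits")
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    bits = quanta * BITS_PER_QUANTUM
    return bits / (link_gbps * 1e9) * 1e6


if __name__ == "__main__":
    # Illustrative link speeds only.
    for speed in (25, 100, 400):
        longest = pause_duration_us(MAX_QUANTA, speed)
        print(f"{speed} Gb/s: longest single pause is {longest:.1f} us")
```

Because the quantum is defined in bit times, the same quanta value pauses a faster link for less time. A quanta value of 0 resumes the priority immediately.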
End-to-End Congestion Control Algorithms
- DCQCN: the common baseline for RoCEv2; ECN-driven and widely deployed.
- HPCC (High Precision Congestion Control, SIGCOMM 2019): uses in-band network telemetry (INT) from switches to obtain precise link-load information, which allows much tighter rate control.
- Swift (Google, SIGCOMM 2020): RTT-driven congestion control for shared data centre fabrics.
- Vendor extensions: Spectrum-X adaptive routing + congestion management; Cisco Nexus AI-optimised RoCE; Arista CCFC.
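As a rough illustration of how an ECN-driven scheme such as DCQCN reacts at the sender, the sketch below models the rate-cut and fast-recovery rules described in the DCQCN paper. It is a toy model under stated assumptions: the class and method names are hypothetical, timers and byte counters are left to the caller, and the additive and hyper-increase phases are omitted.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Toy model of a DCQCN sender (reaction point). Not a NIC implementation."""

    line_rate_gbps: float
    g: float = 1 / 256         # alpha gain; the value used in the DCQCN paper
    alpha: float = 1.0         # congestion estimate, starts pessimistic
    current_rate: float = 0.0  # R_C, initialised to line rate below
    target_rate: float = 0.0   # R_T, initialised to line rate below

    def __post_init__(self) -> None:
        self.current_rate = self.line_rate_gbps
        self.target_rate = self.line_rate_gbps

    def on_cnp(self) -> None:
        """A Congestion Notification Packet arrived: cut the rate multiplicatively."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP during one alpha-update interval: decay the congestion estimate."""
        self.alpha *= 1 - self.g

    def on_recovery_step(self) -> None:
        """Fast recovery: move halfway back toward the remembered target rate."""
        self.current_rate = (self.current_rate + self.target_rate) / 2


if __name__ == "__main__":
    sender = DcqcnSender(line_rate_gbps=100)
    sender.on_cnp()
    print(f"after first CNP: {sender.current_rate} Gb/s")
    sender.on_recovery_step()
    print(f"after one recovery step: {sender.current_rate} Gb/s")
```

The first CNP halves the rate because alpha starts at 1. As quiet intervals decay alpha, later cuts become gentler, which is what lets the sender settle near the fair rate instead of oscillating between extremes.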
Operational Notes
- Lossless behaviour is per-priority, not per-port: only the RoCE priority is paused; default traffic continues.
- PFC head-of-line blocking remains a real risk — keep RoCE on its own priority and monitor pause counters.
- ECN must be enabled end to end: switches must mark, receivers must echo the marks back to the sender (as CNPs in RoCEv2), and senders must respond by reducing their rate.
- Buffer sizing: deep buffers (Jericho) tolerate misconfiguration; shallow buffers (Tomahawk) demand tight tuning.
- Always benchmark with both quiet-fabric and synthetic-incast tests before declaring production-ready.
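The switch side of the ECN note can be sketched as a RED-style marking curve, the scheme the DCQCN paper assumes: no marking below a lower queue threshold, a linear ramp up to a maximum probability at an upper threshold, and marking of every packet beyond it. The function name, parameter names and threshold values here are illustrative assumptions, not recommended settings.

```python
def ecn_mark_probability(queue_kb: float, k_min_kb: float,
                         k_max_kb: float, p_max: float) -> float:
    """Probability that a packet is ECN-marked at a given egress queue depth."""
    if not 0 <= k_min_kb < k_max_kb:
        raise ValueError("need 0 <= k_min_kb < k_max_kb")
    if not 0 < p_max <= 1:
        raise ValueError("p_max must be in (0, 1]")
    if queue_kb <= k_min_kb:
        return 0.0   # queue is short: never mark
    if queue_kb > k_max_kb:
        return 1.0   # queue is long: mark every packet
    # Linear ramp from 0 at k_min_kb up to p_max at k_max_kb.
    return p_max * (queue_kb - k_min_kb) / (k_max_kb - k_min_kb)


if __name__ == "__main__":
    # Thresholds below are placeholders chosen only to show the curve's shape.
    for depth in (50, 100, 250, 400, 500):
        p = ecn_mark_probability(depth, k_min_kb=100, k_max_kb=400, p_max=0.2)
        print(f"queue {depth} KB: mark probability {p:.3f}")
```

On shallow-buffer switches the gap between the marking thresholds and the PFC trigger point is small, which is one reason they need tighter tuning: ECN should act before PFC has to.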
The single most useful operational tool is a per-port PFC pause counter alarm. Sustained PFC pauses above ~1% of link time indicate a congestion problem that will cascade if untreated.
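A minimal sketch of such an alarm follows. It assumes the platform exposes a cumulative pause-duration counter in microseconds per port and priority. That counter, and the names `PauseSample`, `paused_fraction` and `should_alarm`, are hypothetical. Many devices expose only pause-frame counts, in which case the paused time would have to be estimated instead.

```python
from dataclasses import dataclass

PAUSE_ALARM_THRESHOLD = 0.01  # ~1% of link time, the rule of thumb above


@dataclass(frozen=True)
class PauseSample:
    """One reading of a hypothetical cumulative pause-duration counter."""

    timestamp_s: float  # when the counter was read, in seconds
    paused_us: int      # total microseconds this priority has spent paused


def paused_fraction(prev: PauseSample, curr: PauseSample) -> float:
    """Fraction of the sampling interval during which the priority was paused."""
    interval_us = (curr.timestamp_s - prev.timestamp_s) * 1e6
    if interval_us <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr.paused_us - prev.paused_us
    if delta < 0:
        # Counter was reset (reboot or manual clear): count from zero.
        delta = curr.paused_us
    return min(delta / interval_us, 1.0)


def should_alarm(fractions: list[float],
                 threshold: float = PAUSE_ALARM_THRESHOLD) -> bool:
    """Alarm only if every recent interval exceeded the threshold (sustained)."""
    return bool(fractions) and all(f > threshold for f in fractions)


if __name__ == "__main__":
    a = PauseSample(timestamp_s=0.0, paused_us=0)
    b = PauseSample(timestamp_s=10.0, paused_us=250_000)
    print(f"paused fraction: {paused_fraction(a, b):.3f}")
```

Requiring several consecutive intervals above the threshold keeps short incast bursts from paging anyone while still catching the sustained pausing that tends to spread upstream.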
References
- IEEE 802.1 Data Center Bridging · IEEE
- HPCC: High Precision Congestion Control · SIGCOMM 2019
- Swift: Delay is Simple and Effective for Congestion Control · SIGCOMM 2020
- DCQCN: Congestion Control for Large-Scale RDMA Deployments · SIGCOMM 2015