TL;DR
- iWARP carries RDMA semantics over TCP/IP rather than the InfiniBand transport, relying on TCP's own loss recovery instead of requiring a lossless network.
- Defined by IETF RFCs 5040-5045 (published in 2007), with MPA, DDP, and RDMAP layered above TCP in that order.
- Implemented by Chelsio T6/T7 NICs and Intel E810; much less widely deployed than RoCEv2 in AI training fabrics.
- Practical niche: WAN-distance RDMA, storage protocols over routed IP, and environments where lossless Ethernet tuning is impractical.
Overview
iWARP is an alternative to RoCEv2 for running RDMA over IP networks. Where RoCEv2 carries the InfiniBand transport inside UDP and depends on the network to be lossless, iWARP runs RDMA over TCP and inherits TCP's loss recovery. This makes iWARP tolerant of lossy networks at the cost of higher per-message overhead and worse tail latency.
The protocol is layered. At the bottom is TCP. Above TCP sits MPA (Marker PDU Aligned Framing), which restores record boundaries on top of TCP's byte stream. Above that sits DDP (Direct Data Placement), which steers payloads directly into registered memory regions. At the top sits RDMAP (RDMA Protocol), which defines the RDMA operations (Send, RDMA Write, RDMA Read) that the verbs API exposes to applications.
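To make the MPA framing step more concrete, here is a minimal Python sketch of how one upper-layer PDU (a DDP segment) could be wrapped into an MPA FPDU: a 16-bit length, the payload, padding to a 4-byte boundary, and a CRC-32C trailer. It is an illustration under simplifying assumptions, not a conformant implementation: MPA markers are omitted, the function names `build_fpdu` and `parse_fpdu` are hypothetical, and the byte order used for the CRC field should be checked against RFC 5044 before relying on it.

```python
import struct

CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli). Slow, but has no dependencies."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def build_fpdu(ulpdu: bytes) -> bytes:
    """Frame one ULPDU as: length (2 bytes) | ULPDU | pad (0-3 bytes) | CRC (4 bytes).

    Simplification: no MPA markers are inserted.
    """
    if len(ulpdu) > 0xFFFF:
        raise ValueError("ULPDU does not fit the 16-bit length field")
    body = struct.pack("!H", len(ulpdu)) + ulpdu
    body += b"\x00" * (-len(body) % 4)  # pad so the CRC starts 4-byte aligned
    # Assumption: CRC written in network byte order for this sketch.
    return body + struct.pack("!I", crc32c(body))


def parse_fpdu(fpdu: bytes) -> bytes:
    """Recover the ULPDU from a marker-less FPDU, verifying the CRC."""
    (length,) = struct.unpack_from("!H", fpdu, 0)
    padded = (2 + length + 3) & ~3  # header + payload, rounded up to 4 bytes
    if len(fpdu) < padded + 4:
        raise ValueError("truncated FPDU")
    (crc,) = struct.unpack_from("!I", fpdu, padded)
    if crc32c(fpdu[:padded]) != crc:
        raise ValueError("CRC mismatch")
    return fpdu[2:2 + length]


if __name__ == "__main__":
    frame = build_fpdu(b"hello")
    print(len(frame), parse_fpdu(frame))
```

Because the length field travels with every FPDU, the receiver can find record boundaries in the TCP byte stream and hand each DDP segment upward for placement without waiting for a whole message to be reassembled.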
Specifications
| Property | Value |
|---|---|
| Standards | RFC 5040, 5041, 5042, 5043, 5044, 5045 |
| Transport | TCP |
| Encapsulation layers | RDMAP / DDP / MPA / TCP / IP |
| Loss recovery | TCP-native (does not require lossless underlay) |
| API | Verbs (libibverbs) |
| Vendor support | Chelsio (T6/T7), Intel (E810) |
| Routability | Native IP, WAN-capable |
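Since iWARP and RoCE adapters both sit behind the same verbs API, it can be useful to check which kind a host actually has. The following sketch assumes a Linux host where the RDMA core exposes each device's node type in `/sys/class/infiniband/<device>/node_type`, and that iWARP adapters report a value of the form `4: RNIC`. The helper name `list_iwarp_devices` is hypothetical, and the exact sysfs string should be verified against the running kernel.

```python
from pathlib import Path


def list_iwarp_devices(sysfs_root: str = "/sys/class/infiniband") -> list[str]:
    """Return names of RDMA devices whose node type is RNIC (iWARP).

    Assumes node_type contains text such as "4: RNIC"; devices whose
    node_type file is missing or unreadable are skipped.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return []  # no RDMA subsystem loaded, or not a Linux host
    found = []
    for dev in sorted(root.iterdir()):
        try:
            text = (dev / "node_type").read_text().strip()
        except OSError:
            continue
        # Keep only the label after the numeric prefix, e.g. "RNIC".
        if text.split(":", 1)[-1].strip() == "RNIC":
            found.append(dev.name)
    return found


if __name__ == "__main__":
    print(list_iwarp_devices())
```

On a host with no RDMA devices the function simply returns an empty list, so it is safe to run as a quick inventory check.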
Why It Lost to RoCEv2 in AI
iWARP was designed before lossless Ethernet was practical at scale, and its loss tolerance was its selling point. In the AI era, the GPU fabric is a controlled, well-engineered environment where the operator can guarantee near-lossless behaviour through PFC/ECN, and where the latency tax of TCP's congestion control matters.
RoCEv2's tighter latency, broader NIC support (especially NVIDIA's), and ecosystem inertia made it the default choice for AI training fabrics. iWARP retains a niche in WAN-extended storage protocols and some financial-services trading fabrics where its TCP-friendly nature is valued.
Operational Notes
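- TCP tuning matters: window scaling and SACK-based selective retransmission both affect iWARP throughput, particularly on long, high-bandwidth paths.
- MPA works best when each FPDU stays aligned with a TCP segment; middleboxes that re-segment the stream break that alignment and can hurt performance, so bare-metal end-to-end paths are preferable.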
- Latency is higher and more variable than RoCEv2; iWARP is rarely chosen for tight HPC collectives.
- Mixed iWARP/RoCE deployments are uncommon; most sites pick one and standardise.
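The window-scaling note above can be made concrete with a bandwidth-delay product (BDP) calculation: to keep a path full, the TCP window must cover roughly link rate times round-trip time, and the unscaled 16-bit window field tops out at 65,535 bytes. The sketch below is a rough sizing aid with illustrative numbers (25 Gb/s, 20 ms RTT). The function names are hypothetical, and how a given adapter's offloaded TCP stack actually sizes its window is vendor-specific and not covered here.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit TCP window field
MAX_WINDOW_SHIFT = 14         # largest window scale shift TCP allows


def bdp_bytes(link_bits_per_s: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes, rounded up (integer arithmetic)."""
    return (link_bits_per_s * rtt_us + 7_999_999) // 8_000_000


def min_window_shift(window_bytes: int) -> int:
    """Smallest window scale shift whose maximum window covers window_bytes."""
    for shift in range(MAX_WINDOW_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window exceeds what TCP window scaling can advertise")


if __name__ == "__main__":
    # Illustrative WAN path: 25 Gb/s link, 20 ms round-trip time.
    bdp = bdp_bytes(25_000_000_000, 20_000)
    print(bdp)                    # 62500000 bytes in flight to fill the path
    print(min_window_shift(bdp))  # 10
```

Without window scaling, the same path would be limited to about 64 KiB in flight per round trip, which is why this option dominates iWARP throughput over WAN distances.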