TL;DR
- BlueField-3 (B3220 / B3210 / B3140 SKUs) is NVIDIA's third-generation DPU: a single ASIC combining a 400 Gb/s ConnectX-7-class NIC, 16 ARM Cortex-A78 cores at 2.0 GHz, up to 32 GB DDR5, and on-chip accelerators for crypto, regex, dedupe and storage protocols.
- Presents up to 2 × 200 Gb/s or 1 × 400 Gb/s as Ethernet or InfiniBand NDR, connects to the host over PCIe Gen5 x16, exposes 16 Cortex-A78 cores to the DOCA SDK, and uses OSFP twin-port or QSFP112 connectors depending on SKU.
- Offloads RoCEv2 congestion control, NVMe-oF storage initiator / target, line-rate IPsec / TLS / MACsec, and packet telemetry — keeping host CPU cycles for tenant workloads and creating a hardware-isolated trust boundary between infrastructure and tenant.
- Anchors the endpoint side of NVIDIA Spectrum-X, ships standard in DGX H100 / H200 reference designs, and is the SuperNIC inside every GB200 NVL72 compute node — including the ones underneath Yobitel NeoCloud's UK and EU sovereign regions.
- Street price (early-to-mid 2026) is roughly $1,400-3,200 per card depending on SKU and channel; that is a premium of roughly $500-800 over the equivalent dumb 400 Gb/s NIC, which pays back through host CPU savings, storage offload and multi-tenant isolation.
Overview
BlueField-3 is the third generation of NVIDIA's data-processing-unit family — a programmable infrastructure platform built around a high-bandwidth NIC, a general-purpose ARM CPU complex, and a set of on-die hardware accelerators. It was announced at GTC 2021 and entered volume production in 2023, replacing BlueField-2 at the top of the SuperNIC line and remaining the workhorse DPU through 2026 alongside the newer BlueField-4.
The job description has not changed across generations: keep infrastructure work (networking, storage, security, observability) off the host CPU and provide a hardware trust boundary between tenant workloads on the host and the cloud operator's control plane on the DPU. What changed in BlueField-3 is the budget: PCIe Gen5 x16 to the host, up to 400 Gb/s of network bandwidth, 16 ARM Cortex-A78 cores at 2 GHz with up to 32 GB of DDR5, and hardware accelerators that run IPsec, TLS, MACsec, regex, dedupe and compression at line rate.
In a modern AI cluster, BlueField-3 sits behind every GPU server. NVIDIA's reference designs make it standard equipment: every DGX H100 / H200 baseboard, every GB200 NVL72 compute tray, and the canonical NVIDIA-Partner neocloud HGX template all include BlueField-3 SuperNICs. Yobitel NeoCloud follows the reference design — every H100, H200 and GB200 NVL72 node in the UK and EU regions ships with BlueField-3 SuperNICs in NIC-mode for line-rate RoCEv2 and GPUDirect RDMA, with DPU-mode enabled on the storage tier for NVMe-oF offload and on the gateway tier for multi-tenant isolation.
This entry helps you decide whether BlueField-3 is the right SuperNIC for your AI build — what each SKU does, what the DOCA software stack expects from you, when NIC mode is enough and when you actually need DPU mode, and how the card fits into the Yobitel NeoCloud architecture if you would rather consume the capability as a managed service through Yobibyte.
Specifications
Authoritative figures for the three main SKUs as shipping in 2026. The B3220 is the dual-port 200 Gb/s flagship that sits in DGX H100/H200 hosts; the B3210 is a single-port 100 Gb/s variant for storage and gateway tiers; the B3140 is the 400 Gb/s single-port SKU paired with Spectrum-X SN5600 and Quantum-3 fabrics.
| Property | B3220 (dual 200G) | B3210 (single 100G) | B3140 (single 400G) |
|---|---|---|---|
| Network ports | 2 × 200 Gb/s | 1 × 100 Gb/s | 1 × 400 Gb/s |
| Protocols | Ethernet (RoCEv2) + IB NDR | Ethernet + IB EDR/HDR | Ethernet + IB NDR |
| Connector | QSFP112 / OSFP | QSFP56 | OSFP twin-port |
| ARM cores | 16 × Cortex-A78 @ 2.0 GHz | 16 × Cortex-A78 @ 2.0 GHz | 16 × Cortex-A78 @ 2.0 GHz |
| L2 / L3 cache | 8 MB shared L3 | 8 MB shared L3 | 8 MB shared L3 |
| DRAM | 16-32 GB DDR5 on-card | 16 GB DDR5 on-card | 16-32 GB DDR5 on-card |
| Host interface | PCIe Gen5 x16 | PCIe Gen5 x16 | PCIe Gen5 x16 |
| Hardware accelerators | IPsec, TLS, MACsec, RegEx, dedupe, compression | Same as B3220 | Same as B3220 |
| Crypto throughput | 200 Gb/s line-rate IPsec | 100 Gb/s line-rate IPsec | 400 Gb/s line-rate IPsec |
| RDMA support | RoCEv2 + IB NDR | RoCEv2 + IB HDR | RoCEv2 + IB NDR |
| GPUDirect RDMA | Yes (with NVIDIA GPUs) | Yes | Yes |
| Power (typical) | ~75 W | ~55 W | ~150 W |
| Form factor | PCIe HHHL / FHHL | PCIe HHHL | PCIe FHHL / OCP 3.0 |
| First shipped | Q1 2023 | Q2 2023 | Q3 2023 |
| Process node | TSMC 7 nm | TSMC 7 nm | TSMC 7 nm |
| Software | DOCA 2.x + DOCA-Host | DOCA 2.x + DOCA-Host | DOCA 2.x + DOCA-Host |
| Boot device | eMMC + optional NVMe | eMMC + optional NVMe | eMMC + optional NVMe |
| Out-of-band management | BMC interface + dedicated 1 GbE | BMC interface + dedicated 1 GbE | BMC interface + dedicated 1 GbE |
Architecture: what changed in BlueField-3
BlueField-3 is built around four distinct silicon blocks on one die: a ConnectX-7-class network subsystem, an ARM CPU complex, a memory subsystem, and a fixed-function accelerator complex. Each block evolved meaningfully from BlueField-2.
Network subsystem. BlueField-3 inherits ConnectX-7 silicon, which delivers up to 400 Gb/s on a single port with PAM4 signalling, hardware RoCEv2 with NVIDIA's per-flow congestion control, InfiniBand NDR support, and GPUDirect RDMA over PCIe peer-to-peer. The same packet-processing engine implements Spectrum-X's adaptive routing decisions at the NIC side, complementing the switch-side decisions made by Spectrum-4.
ARM CPU complex. Sixteen Cortex-A78 cores at 2.0 GHz with 8 MB of shared L3 — roughly an order of magnitude more compute than BlueField-2's eight Cortex-A72 cores at 2.75 GHz. The A78 cores run a full Linux distribution (Ubuntu 22.04 or RHEL 9 are the supported targets) and host containerised offload workloads, with hardware-isolation between the DPU OS and the host OS.
Memory subsystem. Up to 32 GB of on-card DDR5 — large enough to hold an entire NVMe-oF target's metadata, a TLS session cache for a high-fan-out reverse proxy, or a Suricata IDS rule set, depending on how the operator chooses to use the card.
Accelerator complex. Hardware engines for IPsec, TLS, MACsec, regex (compatible with Hyperscan), data dedupe and LZ4 compression, and the SHA-2 family. The crypto engines run at line rate: a B3140 can encrypt 400 Gb/s of IPsec traffic while the ARM cores stay idle.
Form factor, power and thermal
BlueField-3 ships in three physical form factors. PCIe full-height half-length (FHHL) is the standard in DGX H100 / H200 hosts. PCIe half-height half-length (HHHL) is the option for 1U servers that cannot accommodate the full-height card. OCP 3.0 is the form factor used inside GB200 NVL72 compute trays and most hyperscale designs.
Power draw varies sharply by SKU. The B3210 (100 Gb/s) draws ~55 W typical; the B3220 (2 × 200 Gb/s) draws ~75 W typical; the B3140 (400 Gb/s) draws ~150 W typical, primarily because the higher-rate SerDes and the larger DDR5 complex push the power budget. The 400 Gb/s variant typically needs active cooling — a forced-air slot near the front of the chassis or, in OCP 3.0, the host-supplied airflow.
Thermal: the ASIC operates safely up to a 95 °C junction temperature. Above 85 °C the firmware throttles the ARM cores first, the crypto engines next, and finally the network throughput. Sustained throttling is visible through the DOCA telemetry endpoint and the temperature reading exposed by the MFT tools (`mlnx-mft`); instrument both before declaring a deployment stable.
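The two thresholds above translate directly into an alerting rule. The following is a minimal sketch, assuming the junction temperature has already been read as a whole number of degrees Celsius from whichever sensor source you trust; the function name `thermal_state` and the sample readings are hypothetical and not part of DOCA or MFT.

```shell
#!/bin/sh
# Classify a junction temperature (whole degrees C) against the thresholds
# quoted in the text: throttling starts above 85 C, the safe limit is 95 C.
# thermal_state is an illustrative helper, not a DOCA or MFT command.
thermal_state() {
    if [ "$1" -gt 95 ]; then
        echo "over-limit"
    elif [ "$1" -gt 85 ]; then
        echo "throttling"
    else
        echo "ok"
    fi
}

# Hypothetical readings, for illustration only.
for t in 72 88 97; do
    printf '%s C -> %s\n' "$t" "$(thermal_state "$t")"
done
```

Alerting on the "throttling" state rather than on the 95 °C limit gives an operator time to react before network throughput is affected, since the ARM cores and crypto engines are throttled first.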
Interconnect: where BlueField-3 sits on PCIe and on the fabric
On the host side, BlueField-3 presents itself as a PCIe Gen5 x16 device. On a Sapphire Rapids or Genoa host, the card typically sits on the same root complex as one or more NVIDIA GPUs, and the placement matters. GPUDirect RDMA between a BlueField-3 and an H100 / H200 / B200 GPU works best when both devices share the same PCIe switch, somewhat worse when they sit on the same socket but traffic has to pass through the CPU's host bridge, and worst when they sit under different sockets and have to traverse UPI / Infinity Fabric.
On the network side, BlueField-3 connects to a Spectrum-X (SN5600), Quantum-2 (MQM9700) or Quantum-3 leaf switch over OSFP (NDR / 400 Gb/s) or QSFP112 (200 Gb/s) cables. RoCEv2 mode requires lossless Ethernet tuning end-to-end (PFC + ECN + DCQCN) but no special operator effort on the DPU side — the per-flow congestion control runs in the ConnectX-7 silicon transparently.
BlueField-3 also exposes an out-of-band management interface. A dedicated 1 GbE link plus a BMC channel let the operator manage the DPU's ARM OS independently of the host OS — important when the DPU is enforcing security policies the host should not be able to bypass or interrupt.
For GPUDirect RDMA performance, prefer hosts where the BlueField-3 and the GPU sit under the same PCIe switch (the PEX-class chip on HGX baseboards). `nvidia-smi topo -m` reveals the path; aim for PIX (same switch) over PHB (host bridge), NODE (across host bridges within a NUMA node) or SYS (cross-socket).
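The placement check can be scripted once the GPU-to-NIC codes have been pulled out of the `nvidia-smi topo -m` matrix. The sketch below assumes the pairs have already been extracted into "GPU NIC CODE" lines, which is a hypothetical intermediate format; the helper names `topo_rank` and `flag_slow_pairs` are illustrative. The code meanings follow the legend that `nvidia-smi topo -m` prints.

```shell
#!/bin/sh
# Rank an nvidia-smi topology code: a lower rank means a shorter PCIe path.
topo_rank() {
    case "$1" in
        PIX)  echo 1 ;;  # at most a single PCIe bridge (same switch)
        PXB)  echo 2 ;;  # multiple PCIe bridges, no host bridge
        PHB)  echo 3 ;;  # traverses a PCIe host bridge
        NODE) echo 4 ;;  # between host bridges within a NUMA node
        SYS)  echo 5 ;;  # crosses the inter-socket interconnect
        *)    echo 9 ;;  # unknown or not applicable
    esac
}

# Read "GPU NIC CODE" lines on stdin and print every pair whose path
# leaves the PCIe switch hierarchy (anything worse than PXB).
flag_slow_pairs() {
    while read -r gpu nic code; do
        [ -n "$code" ] || continue
        if [ "$(topo_rank "$code")" -gt 2 ]; then
            echo "$gpu $nic $code"
        fi
    done
}

# Hypothetical pairs, for illustration only.
printf 'GPU0 NIC0 PIX\nGPU1 NIC0 SYS\nGPU2 NIC1 PHB\n' | flag_slow_pairs
```

Treating PXB as acceptable is a judgment call: it still stays below the host bridge, so peer-to-peer traffic does not touch the CPU.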
Software ecosystem: DOCA, drivers, deployment modes
DOCA is NVIDIA's SDK and runtime for BlueField. It provides libraries and reference applications for flow programming, RDMA acceleration, telemetry export, packet processing, storage protocols, and security inspection. On the host side, DOCA-Host installs the driver stack (`mlx5_core`, `nvidia-peermem`, libibverbs, rdma-core). On the DPU side, DOCA-DPU installs the runtime, container engine and library set on the ARM OS.
BlueField-3 supports three deployment modes; the mode is chosen at deployment time and changes the host's view of the card and the operator's responsibilities.
- NIC mode. The DPU behaves as a fast RDMA NIC. The ARM OS runs minimal services; the card looks like a ConnectX-7 from the host's perspective. This is the default for most AI training clusters — Yobitel NeoCloud's GPU compute tier runs the SuperNICs in NIC mode for line-rate RoCEv2 and GPUDirect RDMA, and lets the host orchestration plane handle infrastructure logic.
- DPU mode. The ARM OS runs a full Linux distribution and hosts offload services, typically containerised. The host sees a NIC; the operator sees a separate addressable Linux machine on every card. Used in NeoCloud's storage tier (NVMe-oF target offload) and in NeoCloud's multi-tenant gateway tier (where tenant traffic is encrypted at the DPU before it crosses the host).
- Zero-trust mode. The DPU enforces policies that the host cannot see or bypass — RBAC, encrypted-tenant-traffic, hardware-attested firewall. Used by Yobibyte's multi-tenant pods for tenant-to-tenant isolation: tenants share physical hosts but their inter-host traffic flows through DPU-enforced encryption and policy that the host kernel has no path to disable.
```shell
# Verify host-side DOCA installation (run as root on the host)
mst start && mst status -v
mlxconfig -d /dev/mst/mt41692_pciconf0 query | head

# Show BlueField-3 network port state from the host
ibstat | grep -E "Active|Rate|Link layer"
ethtool ens6f0np0 | grep -E "Speed|Link"

# Check the operating mode
mlxconfig -d /dev/mst/mt41692_pciconf0 query INTERNAL_CPU_MODEL INTERNAL_CPU_OFFLOAD_ENGINE
# INTERNAL_CPU_MODEL:          0 = SEPARATED_HOST, 1 = EMBEDDED_CPU (DPU mode)
# INTERNAL_CPU_OFFLOAD_ENGINE: 0 = ENABLED (DPU mode), 1 = DISABLED (NIC mode)

# Bring up a DOCA service container on the DPU's ARM OS (not on the host)
ssh ubuntu@<dpu-mgmt-ip> \
  docker run --rm --net=host \
  --privileged \
  nvcr.io/nvidia/doca/doca_telemetry:2.7.0-doca2.7.0
```

Sizing and capacity planning
BlueField-3 is rarely sized in isolation — the question is how many DPUs per GPU server, which SKU per tier, and how much of the work to offload to DOCA versus leave on the host. The table below maps Yobitel NeoCloud's choices to the workload class.
- For training-only fabrics, prefer NIC mode and avoid the DOCA operational tax. The DPU's value is delivered by the network silicon and GPUDirect, not the ARM cores.
- For multi-tenant pods, prefer zero-trust mode and budget the operations cost. The DPU becomes a second managed machine per host; treat its OS image, firmware and DOCA release as first-class lifecycle artefacts.
- DDR5 capacity matters per role. NVMe-oF targets need 32 GB; tenant gateways often run fine on 16 GB; light NIC-mode operation needs the minimum.
- Power-budget check: 8 × B3140 in a single DGX-class chassis adds ~1.2 kW to the host power draw. Verify PSU sizing in the original spec sheet before doubling DPU count.
| Workload tier | SKU per host | Mode | DPU role | Yobitel NeoCloud pattern |
|---|---|---|---|---|
| DGX H100 / H200 training | 4-8 × B3220 or B3140 | NIC mode | Line-rate RoCEv2 + GPUDirect RDMA | Standard NeoCloud training compute |
| GB200 NVL72 training | 8-18 × B3140 | NIC mode | 400G NDR per rail, SHARPv3 + GPUDirect | NeoCloud Blackwell pods, UK & EU |
| Inference / mixed-tenancy | 1-2 × B3220 | DPU mode | Tenant isolation, TLS offload | NeoCloud inference tier; Yobibyte managed endpoints sit on this tier |
| NVMe-oF storage target | 2 × B3220 or B3140 | DPU mode | NVMe-oF target, dedupe, compression | NeoCloud parallel storage cluster |
| Multi-tenant gateway | 2 × B3210 or B3220 | Zero-trust mode | Per-tenant IPsec, hardware-attested policy | NeoCloud tenant-edge; underlies Yobibyte tenant isolation |
| Edge / on-prem (sovereign) | 1-2 × B3210 | DPU mode | Local crypto, observability, light gateway | Optional NeoCloud Edge nodes |
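The power-budget check in this section is simple multiplication, and it is worth scripting so it runs for every proposed SKU mix. This is a minimal sketch using the typical per-card figures from the specifications table; the helper names and the headroom values are hypothetical and should be replaced with the numbers from your chassis spec sheet.

```shell
#!/bin/sh
# Added DPU power for one host, in watts: card count x typical per-card draw.
dpu_power_watts() {
    echo $(( $1 * $2 ))
}

# Succeeds when the added DPU load fits within the PSU headroom (watts).
# Arguments: card count, watts per card, available headroom.
fits_headroom() {
    [ "$(dpu_power_watts "$1" "$2")" -le "$3" ]
}

dpu_power_watts 8 150   # 8 x B3140 at ~150 W typical -> 1200
dpu_power_watts 4 75    # 4 x B3220 at ~75 W typical  -> 300

# Hypothetical 1000 W of spare PSU capacity.
if fits_headroom 8 150 1000; then
    echo "8 x B3140 fits"
else
    echo "8 x B3140 exceeds headroom"
fi
```

These are typical figures, not worst-case draw, so leave margin beyond whatever the check reports.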
Cost and TCO
Card prices are negotiated and depend on SKU mix, channel, support contract and volume. The figures below are indicative USD ranges for new builds in mid-2026; OEM-rebrand variants (Dell, HPE, Supermicro) sit at the higher end.
- Total fabric BOM contribution: in a 128-GPU H100 cluster (16 hosts × 8 GPUs) with 4 × B3220 per host, the DPU spend is roughly $130-180k, about 4-6 % of the GPU spend.
- Yobitel NeoCloud bakes the DPU cost into the per-GPU-hour pricing; customers consuming via Yobibyte never see a separate DPU line item.
- Compared with a dumb 400 Gb/s NIC at ~$1,800-2,400, the DPU premium is ~$500-800 per card. The payback is host CPU savings on RoCEv2 tuning, NVMe-oF state machines, and tenant-side encryption — easy to justify above 4 GPUs per host.
| Line item | Indicative USD price | Notes |
|---|---|---|
| B3210 single-port 100 Gb/s | $1,400-1,900 per card | Storage / gateway tier |
| B3220 dual-port 200 Gb/s | $2,000-2,800 per card | DGX H100/H200 standard |
| B3140 single-port 400 Gb/s | $2,400-3,200 per card | Spectrum-X / Quantum-3 endpoint |
| NVIDIA DOCA-Host subscription | Bundled with card support | Per-card; check the OEM contract |
| DOCA community | Free | Pre-production / lab use |
| Support contract (Bronze) | ~$200-350 per card/year | Updates only |
| Support contract (Gold/Premier) | ~$500-900 per card/year | Updates + RMA + named TAM |
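The DPU spend figure for the 16-host example can be reproduced from the table's per-card range. The sketch below is a rough estimate only: it multiplies host count, cards per host and the indicative B3220 price range, and ignores support contracts and volume discounts. The function name `dpu_spend_range` is hypothetical.

```shell
#!/bin/sh
# DPU spend range in USD: hosts x cards per host x per-card price range.
# Arguments: hosts, cards per host, low price, high price.
# The body runs in a subshell so the working variables do not leak.
dpu_spend_range() (
    total_cards=$(( $1 * $2 ))
    echo "$(( total_cards * $3 ))-$(( total_cards * $4 ))"
)

# 16 hosts x 4 B3220 cards at the indicative $2,000-2,800 per card.
dpu_spend_range 16 4 2000 2800   # prints 128000-179200
```

The result, about $128k to $179k, matches the roughly $130-180k quoted for 16 hosts with four B3220 cards each.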
Migration and alternatives
BlueField-3 competes with two adjacent classes of device: dumb high-speed NICs (cheaper, less capable) and other DPU families (AMD Pensando, Intel IPU, Marvell Octeon, AWS Nitro). The right choice depends on what you actually intend to offload.
- Migrating from BlueField-2 to BlueField-3 is largely mechanical: the OCP/PCIe form factors are the same and DOCA 2.x retains API compatibility, but the firmware lifecycle (BFB images) must be aligned across the fleet.
- Migrating from a dumb NIC to BlueField-3 is straightforward in NIC mode (same `mlx5_core` driver) but operationally heavy in DPU mode (new ARM OS, new container runtime, new attack surface).
- Yobitel NeoCloud's current standard is BlueField-3 across the H100/H200/GB200 fleet; the BlueField-4 transition begins with the GB300 NVL72 pods entering the UK region in 2026.
| Alternative | Strengths | Weaknesses | Best for |
|---|---|---|---|
| ConnectX-7 NIC (dumb) | Cheapest at line rate; same network silicon as B3220 | No ARM CPU, no DOCA, no offload | Pure training fabrics where DPU mode is unused |
| BlueField-3 (this entry) | Mature DOCA, broad ecosystem, GPUDirect | Most expensive per-port; complex operations in DPU mode | AI clusters wanting full NIC + DPU capability |
| BlueField-4 DPU | 800 Gb/s, ~64 ARM cores, Blackwell-era SKU | Newer, smaller install base, costlier | GB300 NVL72 era fabrics, 2026+ new builds |
| AMD Pensando DSC2-400 | Strong P4 pipeline, deployed by HPE/AMD | Smaller DOCA-equivalent ecosystem; weaker GPU integration | AMD MI300X-based clusters; HPE-standard SKUs |
| Intel IPU E2000 (Mount Evans) | Tight Intel Xeon integration, P4 pipeline | Limited InfiniBand; smaller AI footprint | Hyperscale builds standardised on Intel networking |
| AWS Nitro / Microsoft Hololake | Proven at hyperscale | Not for sale | Internal hyperscale only |
Pitfalls and operational notes
- Firmware drift is the silent killer. The DPU runs three coupled firmwares (NIC, ARM bootloader, BFB image); mixing versions across a fleet causes sporadic RoCEv2 connection drops and silent GPUDirect fallback. Pin a DOCA release per pod and document the upgrade window.
- NIC mode versus DPU mode is a one-way migration in practice — moving from NIC mode to DPU mode after the host is in production requires a reboot, new firmware image, and re-cabling of the OOB management network.
- BAR1 sizing on the host GPU affects GPUDirect registration. If BAR1 is small (the default in many BIOS configurations), large RDMA registrations from training frameworks fail with cryptic `ibv_reg_mr` errors. Set BAR1 to the GPU's HBM size in BIOS.
- DPU-mode containers run on ARM, not x86. Building an ARM image, hardware-attesting it, and shipping it to a fleet of DPUs is a different supply chain than the host application supply chain. Treat it as such.
- OOB management network: do not let DPU management share a VLAN with host workloads. The DPU is meant to be a separate trust domain; a shared management VLAN collapses the model.
- PCIe ACS (Access Control Services) enabled on intermediate root-port bridges blocks GPUDirect peer-to-peer. Disable per OEM guidance; verify with `lspci -vv | grep ACSCtl`.
- Power: 8 × B3140 per host adds ~1.2 kW; verify chassis PSU and rack PDU sizing before scale-out.
- DOCA telemetry is opt-in. Enable it before the first production run, not after the first incident.
A BlueField-3 deployed in DPU mode with default settings will silently accept any container that lands on its container runtime — including via SSH from a misconfigured operator. Treat the DPU's ARM OS as a separate hardened host: signed images only, RBAC-controlled SSH, audit logging exported to a separate collector. Yobitel NeoCloud's DPU-mode tier runs hardware-attested signed images only; replicate that discipline before going to production.
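The ACS check from the list above is easy to automate across a fleet. The following is a minimal sketch that counts the PCIe functions whose `ACSCtl` line shows `SrcValid+`, which is the usual sign that ACS is active; the function name `acs_enabled_count` and the sample lines are illustrative, and on a real host the input would come from `lspci -vv` run as root.

```shell
#!/bin/sh
# Count PCIe functions whose ACS control line has source validation on.
# Reads `lspci -vv` output on stdin and always prints a number.
acs_enabled_count() {
    awk 'index($0, "ACSCtl:") && index($0, "SrcValid+") { n++ }
         END { print n + 0 }'
}

# Illustrative sample: one bridge with ACS on, one with ACS off.
printf '%s\n' \
    'ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-' \
    'ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-' \
    | acs_enabled_count

# On a real host (requires root for the full capability dump):
#   lspci -vv | acs_enabled_count
```

A non-zero count on a GPU host is worth investigating before a production run, following the OEM guidance mentioned above.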
Where it fits in the Yobitel stack
BlueField-3 is the SuperNIC inside every Yobitel NeoCloud H100, H200 and GB200 NVL72 node. In the training tier it runs in NIC mode, delivering line-rate RoCEv2 (or InfiniBand NDR) and GPUDirect RDMA to NCCL. In the storage tier it runs in DPU mode, hosting NVMe-oF target offload and dedupe/compression. In the multi-tenant inference and gateway tier it runs in zero-trust mode, providing hardware-attested tenant isolation that lets Yobibyte safely share physical hosts across tenants while preserving the NCSC OFFICIAL classification on the UK sovereign region.
Customers consuming Yobitel NeoCloud directly see BlueField-3's effect as low-latency, line-rate inter-node bandwidth and a low host-CPU footprint for networking. Customers consuming through Yobibyte see it as managed multi-tenant inference endpoints that share hardware without sharing trust. Customers running InferenceBench's published throughput numbers see it as the unspoken substrate that lets the benchmark hit deterministic numbers across pods. The card is invisible in the customer surface; the behaviour it enables is not.