Professional Services · Network Fabrics

Network fabrics for AI clusters, designed to the NVIDIA NCP reference

From 1 SU pilot clusters to 32 SU multi-hall builds. InfiniBand Quantum-2 NDR (QM9700) or Quantum-X800 XDR (Q3400-RA) for compute. NVIDIA Cumulus spine-leaf for storage and access. UFM Enterprise for the control plane. Every layer designed for lossless, non-blocking traffic at the bandwidths the GPUs actually need.

See the scale points

NCP reference-gradeSovereignty-perimeter aware · any regionQuantum-X800 XDR 800G readyCumulus EVPN-VxLAN certified design

Scale points

Sized to the NCP scalable unit (SU)

The NCP reference scales in SU shapes. On Quantum-2 NDR each SU is 32 HGX servers (the H100/H200 envelope, 256 GPUs per SU). On Quantum-X800 XDR each SU is 72 HGX servers (the B200/B300 envelope today and the forward-looking Rubin R100 platform, 576 GPUs per SU). NVL72 rail-optimised designs are a separate topology variant with their own sizing.

XDR InfiniBand at 800 Gbps on Q3400-RA switches (144×800G, marketed as Quantum-X800). The same fabric serves B200 / B300 HGX clusters today and the forward-looking Rubin R100 platform. NCP SU shapes count 8-GPU HGX server nodes: 72 nodes per SU. NVL72 rail-optimised designs are a separate topology variant with their own sizing.

1 SU

72 B200 / B300 / R100 nodes · 576 GPUs

Pilot · first-Blackwell or first-Rubin deployments

Topology: Two-tier leaf-spine, single fabric island
East-west: XDR 800G · single fabric island
UFM: Single UFM Enterprise instance

4 SU

288 B200 / B300 / R100 nodes · 2,304 GPUs

Production starter on Blackwell-class clusters

Topology: Two-tier with rail optimisation
East-west: XDR 800G · leaf-spine
UFM: UFM Enterprise active-standby HA + telemetry

8 SU

Most common

576 B200 / B300 / R100 nodes · 4,608 GPUs

Production scale · most XDR builds land here

Topology: Three-tier: leaf, spine, superspine
East-west: XDR 800G · congestion-aware routing
UFM: UFM HA + dedicated telemetry stack

16 SU

1,152 B200 / B300 / R100 nodes · 9,216 GPUs

Frontier-grade · ~10K-GPU envelope on XDR

Topology: Three-tier with extended superspine fabric
East-west: XDR 800G · multi-island, rail-optimised
UFM: Per-island UFM HA + central correlation

What we design

Six surfaces of a production AI fabric

Compute, storage, management, validation. Every surface designed against the NCP reference and signed off against acceptance criteria.

InfiniBand east-west

Quantum-2 NDR (QM9700, 64×400G) for H100/H200 fleets, Quantum-X800 XDR (Q3400-RA, 144×800G) for B200/B300 and forward-looking Rubin R100. SHARPv3 in-network reductions, rail-optimised topology, and adaptive/congestion-aware routing where the workload calls for it.

Quantum-2 NDR 400GQuantum-X800 XDR 800GSHARPv3 reductionsAdaptive routing

UFM HA + operations→

Unified Fabric Manager Enterprise in active-standby HA (Pacemaker + DRBD replication), plus UFM Telemetry, congestion control, and proactive alerting wired into your existing observability stack. Documented runbooks for routine ops and incidents.

UFM EnterpriseUFM TelemetryActive-standby HARunbooks

Read the deep-dive→

Cumulus Spine-Leaf Ethernet

NVIDIA Cumulus Linux spine-leaf with ECMP and multicast for general access. EVPN-VxLAN where multi-tenant segmentation matters. Tier-2/3 designs sized to the SU plan.

ECMPMulticastEVPN VxLANTier-2/3

Lossless storage fabric

RoCEv2 lossless on Cumulus for NVMe-oF/RDMA and parallel filesystems (Lustre/GPFS RDMA, GPUDirect Storage). NVMe/TCP supported as the non-RDMA transport over standard Ethernet. PFC + ECN tuned per traffic class so storage doesn't head-of-line block the compute fabric.

RoCEv2NVMe-oF / RDMA + TCPPFC / ECNLustre / GPFS / WEKA

Control + management plane

Out-of-band management, secure boot, image lifecycle, day-zero install. NVIDIA NetQ for Ethernet-fabric telemetry; integration with your existing IPAM, syslog, and SIEM.

OOB managementNVIDIA NetQSecure bootIPAM / SIEM

Validation + cutover

Benchmark plan covers NCCL all-reduce sweeps, GPUDirect Storage (GDS) runs, and end-to-end MLPerf-style workloads. Sign-off against pre-agreed acceptance criteria.

NCCL benchmarksGPUDirect StorageMLPerf-style runsAcceptance criteria

What you receive

Eight named exit artefacts

We don't hand you a deck. Every engagement closes with concrete, version- controlled artefacts your team can act on the day after we leave.

Where day-two is your team's, the runbooks and the as-code repo are the handover. Where day-two is ours, the same artefacts back the SLA.

Logical and physical fabric design aligned to NCP standards
Detailed bill of materials with NVIDIA SKU sheet
Rack / row layout + cabling plan
Day-zero install runbook + acceptance test plan
UFM HA operations playbook (or handover to your team)
Cumulus Linux configuration as code (Ansible / NetQ-aware)
Documented traffic-class policy (compute / storage / mgmt)
Post-cutover NCCL + GPUDirect Storage benchmark report

From the field

Writeups from the networking practice

Short reads on the parts of an AI fabric that decide whether a build holds up in production. Written by the engineers who design and commission these clusters.

Operations

Why UFM HA is the part of an AI fabric most teams underestimate

The active-standby design behind a production InfiniBand fabric. What the install guide leaves out, and what we put on paper before we type a command.

8 min read→

Coming next

Cumulus EVPN-VxLAN tuning for storage tenancy

Coming next

Acceptance-testing an NCP fabric before sign-off

Engagement shapes

Three ways we can work together

Yobitel-led when you need delivery speed. Collaborative when you have a strong team that wants outside acceleration. Advisory when you want a second opinion before procurement.

Yobitel-led

We own the design end-to-end

Full NCP-class design, BOM, install runbooks, on-site validation, optional day-2 managed operations handover. Best fit when your team is light on InfiniBand / Cumulus experience or wants delivery speed.

Best fit · Light in-house fabric capacity · production timeline pressure

Collaborative

We design with your team

Joint architecture reviews, BOM sanity-check, focused workshops on topics like SHARP / congestion control / EVPN multi-tenancy. Your team executes the build; we sign off on the design and join the cutover.

Best fit · Strong networking team · wants outside review + acceleration

Advisory

Time-boxed review

Fixed-window engagement to review a design you've already drafted. We spot risk, suggest a small set of focused changes, and deliver a written report.

Best fit · Design already complete · want a second opinion before procurement

GPU SKUs covered

H100 · H200 · B200 · B300 · GB200 · GB300 · Rubin R100

Bandwidths designed

400G NDR · 800G XDR

Sovereignty perimeters

NCSC · GDPR · FedRAMP · MeitY · any framework

Tell us what you're building.

The conversation takes about three minutes. Three short steps (technical, strategic, functional), then your details. Our networking practice lead replies inside one working day with the relevant NCP reference architecture and a first-cut bill of materials.

Prefer email? Contact us

Yobitel is UK-headquartered with engagement teams that deliver into any sovereignty perimeter: NCSC / G-Cloud, EU GDPR, US FedRAMP-equivalent, India MeitY / DPDP, and beyond. No fabric we've designed has ever come back with an unsolved congestion issue post-cutover.

Network fabrics for AI clusters, designed to the NVIDIA NCP reference

NCP reference-gradeSovereignty-perimeter aware · any regionQuantum-X800 XDR 800G readyCumulus EVPN-VxLAN certified design

Sized to the NCP scalable unit (SU)

Tell us what you're building.