Professional Services · Network Fabrics
Network fabrics for AI clusters, designed to the NVIDIA NCP reference
From 1 SU pilot clusters to 32 SU multi-hall builds. InfiniBand Quantum-2 NDR (QM9700) or Quantum-X800 XDR (Q3400-RA) for compute. NVIDIA Cumulus spine-leaf for storage and access. UFM Enterprise for the control plane. Every layer designed for lossless, non-blocking traffic at the bandwidths the GPUs actually need.
Scale points
Sized to the NCP scalable unit (SU)
The NCP reference scales in SU shapes. On Quantum-2 NDR each SU is 32 HGX servers (the H100/H200 envelope, 256 GPUs per SU). On Quantum-X800 XDR each SU is 72 HGX servers (the B200/B300 envelope today and the forward-looking Rubin R100 platform, 576 GPUs per SU). NVL72 rail-optimised designs are a separate topology variant with their own sizing.
XDR InfiniBand at 800 Gbps on Q3400-RA switches (144×800G, marketed as Quantum-X800). The same fabric serves B200 / B300 HGX clusters today and the forward-looking Rubin R100 platform. NCP SU shapes count 8-GPU HGX server nodes: 72 nodes per SU. NVL72 rail-optimised designs are a separate topology variant with their own sizing.
1 SU
72 B200 / B300 / R100 nodes · 576 GPUs
Pilot · first-Blackwell or first-Rubin deployments
- Topology
- Two-tier leaf-spine, single fabric island
- East-west
- XDR 800G · single fabric island
- UFM
- Single UFM Enterprise instance
4 SU
288 B200 / B300 / R100 nodes · 2,304 GPUs
Production starter on Blackwell-class clusters
- Topology
- Two-tier with rail optimisation
- East-west
- XDR 800G · leaf-spine
- UFM
- UFM Enterprise active-standby HA + telemetry
8 SU
Most common576 B200 / B300 / R100 nodes · 4,608 GPUs
Production scale · most XDR builds land here
- Topology
- Three-tier: leaf, spine, superspine
- East-west
- XDR 800G · congestion-aware routing
- UFM
- UFM HA + dedicated telemetry stack
16 SU
1,152 B200 / B300 / R100 nodes · 9,216 GPUs
Frontier-grade · ~10K-GPU envelope on XDR
- Topology
- Three-tier with extended superspine fabric
- East-west
- XDR 800G · multi-island, rail-optimised
- UFM
- Per-island UFM HA + central correlation
What we design
Six surfaces of a production AI fabric
Compute, storage, management, validation. Every surface designed against the NCP reference and signed off against acceptance criteria.
InfiniBand east-west
Quantum-2 NDR (QM9700, 64×400G) for H100/H200 fleets, Quantum-X800 XDR (Q3400-RA, 144×800G) for B200/B300 and forward-looking Rubin R100. SHARPv3 in-network reductions, rail-optimised topology, and adaptive/congestion-aware routing where the workload calls for it.
UFM HA + operations→
Unified Fabric Manager Enterprise in active-standby HA (Pacemaker + DRBD replication), plus UFM Telemetry, congestion control, and proactive alerting wired into your existing observability stack. Documented runbooks for routine ops and incidents.
Read the deep-dive→
Cumulus Spine-Leaf Ethernet
NVIDIA Cumulus Linux spine-leaf with ECMP and multicast for general access. EVPN-VxLAN where multi-tenant segmentation matters. Tier-2/3 designs sized to the SU plan.
Lossless storage fabric
RoCEv2 lossless on Cumulus for NVMe-oF/RDMA and parallel filesystems (Lustre/GPFS RDMA, GPUDirect Storage). NVMe/TCP supported as the non-RDMA transport over standard Ethernet. PFC + ECN tuned per traffic class so storage doesn't head-of-line block the compute fabric.
Control + management plane
Out-of-band management, secure boot, image lifecycle, day-zero install. NVIDIA NetQ for Ethernet-fabric telemetry; integration with your existing IPAM, syslog, and SIEM.
Validation + cutover
Benchmark plan covers NCCL all-reduce sweeps, GPUDirect Storage (GDS) runs, and end-to-end MLPerf-style workloads. Sign-off against pre-agreed acceptance criteria.
What you receive
Eight named exit artefacts
We don't hand you a deck. Every engagement closes with concrete, version- controlled artefacts your team can act on the day after we leave.
Where day-two is your team's, the runbooks and the as-code repo are the handover. Where day-two is ours, the same artefacts back the SLA.
- Logical and physical fabric design aligned to NCP standards
- Detailed bill of materials with NVIDIA SKU sheet
- Rack / row layout + cabling plan
- Day-zero install runbook + acceptance test plan
- UFM HA operations playbook (or handover to your team)
- Cumulus Linux configuration as code (Ansible / NetQ-aware)
- Documented traffic-class policy (compute / storage / mgmt)
- Post-cutover NCCL + GPUDirect Storage benchmark report
From the field
Writeups from the networking practice
Short reads on the parts of an AI fabric that decide whether a build holds up in production. Written by the engineers who design and commission these clusters.
Engagement shapes
Three ways we can work together
Yobitel-led when you need delivery speed. Collaborative when you have a strong team that wants outside acceleration. Advisory when you want a second opinion before procurement.
Yobitel-led
We own the design end-to-end
Full NCP-class design, BOM, install runbooks, on-site validation, optional day-2 managed operations handover. Best fit when your team is light on InfiniBand / Cumulus experience or wants delivery speed.
Best fit · Light in-house fabric capacity · production timeline pressure
Collaborative
We design with your team
Joint architecture reviews, BOM sanity-check, focused workshops on topics like SHARP / congestion control / EVPN multi-tenancy. Your team executes the build; we sign off on the design and join the cutover.
Best fit · Strong networking team · wants outside review + acceleration
Advisory
Time-boxed review
Fixed-window engagement to review a design you've already drafted. We spot risk, suggest a small set of focused changes, and deliver a written report.
Best fit · Design already complete · want a second opinion before procurement
GPU SKUs covered
H100 · H200 · B200 · B300 · GB200 · GB300 · Rubin R100
Bandwidths designed
400G NDR · 800G XDR
Sovereignty perimeters
NCSC · GDPR · FedRAMP · MeitY · any framework
Tell us what you're building.
The conversation takes about three minutes. Three short steps (technical, strategic, functional), then your details. Our networking practice lead replies inside one working day with the relevant NCP reference architecture and a first-cut bill of materials.
Yobitel is UK-headquartered with engagement teams that deliver into any sovereignty perimeter: NCSC / G-Cloud, EU GDPR, US FedRAMP-equivalent, India MeitY / DPDP, and beyond. No fabric we've designed has ever come back with an unsolved congestion issue post-cutover.