TL;DR
- Dual-die Blackwell GPU announced GTC March 2024, volume shipping from Q4 2024. Two reticle-limit dies on CoWoS-L, linked by NV-HBI at 10 TB/s, presented to CUDA as one GPU with one unified HBM pool and one SM cluster.
- 192 GB HBM3e at 8 TB/s (eight 24 GB stacks): roughly 2.4x the H100's bandwidth and 2.4x its capacity in one package. Second-generation Transformer Engine adds native FP4 (E2M1) and microscaling MX formats (MXFP4, MXFP6, MXFP8).
- Headline throughput: 2,250 TFLOPS BF16 dense, 4,500 TFLOPS FP8 dense, 9,000 TFLOPS FP4 dense (2:4 sparse doubles every figure). Roughly 2x H100 SXM5 per card on iso-precision serving; 4-5x at FP4 vs H100 FP8 on calibration-friendly models.
- NVLink 5.0 at 1.8 TB/s per GPU (2x H100); 1,000 W TDP at the standard configuration (configurable to 1,200 W). Direct-to-chip liquid cooling is effectively mandatory at rack scale: eight GPUs alone draw 8 kW, and a full 8-GPU HGX B200 server draws about 14.4 kW.
- Pricing through 2026: roughly $6.00/GPU-hr on-demand, $4.50 one-year reserved, $3.60 three-year reserved. There is no spot tier: capacity remains tight enough that hyperscalers keep B200 reserved for committed customers.
Overview
B200 is the data centre flagship of the Blackwell generation and the part that defined the post-Hopper performance ceiling. NVIDIA announced it at GTC in March 2024 and shipped in volume from Q4 2024. Two reticle-limit Blackwell dies (roughly 800 mm² each, the maximum size TSMC's lithography reticle allows) are packaged side-by-side on a CoWoS-L substrate, linked by NV-HBI (NVIDIA High-Bandwidth Interconnect) at 10 TB/s, and presented to CUDA as a single GPU with one unified HBM pool, one set of SMs and one set of NVLink ports.
The headline numbers tell the throughput story: 2,250 TFLOPS dense BF16 (4,500 TFLOPS at 2:4 sparse), 4,500 TFLOPS dense FP8 (9,000 sparse), 9,000 TFLOPS dense FP4 (18,000 sparse), 192 GB HBM3e at 8 TB/s, NVLink 5.0 at 1.8 TB/s per GPU. Roughly 2x H100 SXM5 across the board, and 4-5x H100 on FP4 inference for calibration-friendly models. The new precision format, FP4 with microscaling MX blocks, is the part that materially reshapes 2026 inference economics: a 70B chat model in MXFP4 sustains roughly 2x the throughput of the same model in FP8 on B200, and roughly 4x H100 SXM5 FP8 per card at iso-quality, which outweighs the roughly 2.4x per-card price.
Two SKUs ship in volume. The HGX B200 SXM module fills 8-GPU baseboards (3.6 TB/s NVSwitch bisection inside one baseboard, same topology as HGX-H100/H200). The GB200 super-chip pairs two B200 GPUs with one Grace CPU on a single board, and 36 GB200 super-chips in 18 compute trays populate the GB200 NVL72 rack: 72 B200 GPUs plus 36 Grace CPUs inside one NVLink domain, 130 TB/s aggregate NVLink bandwidth, the canonical 2025-2026 frontier training topology.
This entry is the reference for teams operating B200 alongside or instead of H100/H200: full spec sheet, the sizing tables we use on InferenceBench, the Blackwell software-stack uplift (TensorRT-LLM 0.13+, vLLM 0.8+, Megatron-LM rebuild), the FP4/MX calibration pipeline, the liquid-cooling and NVLink Switch System operational guardrails, and where the part fits in the Yobibyte and Omniscient Compute stack. Yobitel NeoCloud offers B200 SXM capacity in preview availability across UK and EU regions with NCSC OFFICIAL alignment and direct-to-chip liquid-cooled racks, with GB200 NVL72 frontier-training racks expected to follow in committed capacity blocks. This entry helps you decide when B200 is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.
How it works: dual-die Blackwell and the second-generation Transformer Engine
Two innovations defined Blackwell. The first is the dual-die package: NVIDIA hit the reticle limit on the GH100 die (~814 mm²) and could not scale a single Blackwell die further within one lithography pass. Instead, two reticle-limit dies are placed side-by-side on a CoWoS-L organic substrate, linked by NV-HBI — a 10 TB/s on-package fabric that delivers near-die-local latencies and presents the package to CUDA as a single GPU. From the software side this is invisible: one HBM pool, one SM cluster, one NVLink controller, one CUDA device ID. From the hardware side it is what let NVIDIA nearly double per-package throughput without waiting for the next process node.
The second innovation is the second-generation Transformer Engine and the FP4 / MX format family. Hopper introduced FP8 (E4M3 forward, E5M2 backward) with per-tensor amax tracking. Blackwell extends this with FP4 (E2M1: one sign bit, two exponent bits, one mantissa bit) and microscaling MX formats (MXFP4, MXFP6, MXFP8) defined by the Open Compute Project. MX formats group 32 values into a block sharing a single power-of-two scaling factor, giving finer-grained calibration than per-tensor scaling and materially reducing accuracy loss at FP4 precision. The Transformer Engine runtime selects per-layer precision on the fly, typically FP8 for outliers and FP4 for the bulk of GEMM tiles, and routes through cuBLASLt kernels compiled for each format.
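The block-scaling idea can be illustrated with a minimal pure-Python sketch. It assumes the simplest reading of the MX scheme described above: one shared power-of-two exponent per 32-value block, chosen from the block maximum, and nearest-value rounding onto the E2M1 grid with saturation. The function names are hypothetical, and this is not the Transformer Engine or ModelOpt implementation.

```python
import math

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2   # largest E2M1 magnitude is 1.5 * 2**2 = 6.0
BLOCK = 32      # values per MX block


def shared_exponent(block):
    """Power-of-two exponent shared by the whole block, taken from its max magnitude."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0
    return math.floor(math.log2(amax)) - E2M1_EMAX


def quantize_block(block):
    """Return (shared_exponent, signed E2M1 values) for one 32-value block."""
    if len(block) != BLOCK:
        raise ValueError(f"expected {BLOCK} values, got {len(block)}")
    exp = shared_exponent(block)
    scale = 2.0 ** exp
    elems = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])            # saturate at 6.0
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))      # nearest grid point
        elems.append(math.copysign(q, v))
    return exp, elems


def dequantize_block(exp, elems):
    """Reconstruct approximate values from the shared exponent and E2M1 elements."""
    return [e * 2.0 ** exp for e in elems]
```

Because the scale is chosen per block rather than per tensor, a single large outlier only coarsens the 31 values that share its block, which is the calibration advantage the paragraph above describes.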
The practical consequence: FP4 inference is viable today for most chat-tuned open-weight models when quantised with NVIDIA's ModelOpt toolchain, with accuracy loss bounded at < 1.5 perplexity points for Llama 3.1, Qwen 2.5 and Mistral families. Throughput is roughly 2x FP8 at iso-quality on B200, and 4-5x H100 FP8 at iso-quality — the largest single-generation inference uplift NVIDIA has ever shipped. FP4 training is still experimental in 2026; production training paths use FP8 (compatible with Hopper) or MXFP8 (Blackwell-only, near-production).
Beyond the two headlines, Blackwell adds fifth-generation NVLink at 1.8 TB/s per GPU (2x H100), a fifth-generation NVSwitch ASIC at 3.6 TB/s bisection per baseboard (same 8-GPU baseboard topology as Hopper but with double the bandwidth), the NVLink Switch System extending NVLink domains to 576 GPUs on dedicated NVLink switches, Confidential Compute mode (CC-on) with TEE-style attestation and HBM page encryption identical to Hopper, and a dedicated decompression engine that accelerates dataloader I/O for training.
- Dual die: two reticle-limit Blackwell dies on CoWoS-L; NV-HBI at 10 TB/s die-to-die; presented as one CUDA device.
- Transformer Engine 2.0: native FP4 (E2M1), MXFP4/MXFP6/MXFP8 microscaling formats, per-layer runtime routing.
- NVLink 5.0: 1.8 TB/s per GPU, 18 ports × 100 GB/s; NVSwitch gen-5 ASIC; NVL domain up to 576 GPUs.
- 192 GB HBM3e: eight 24 GB stacks at 8 TB/s aggregate; per-pin signalling 8.0 Gbps.
- Decompression engine: hardware Deflate, Snappy and LZ4 decompression for dataloader pipelines; offloads CPU during training.
- Confidential Compute (CC-on): AES-256-GCM PCIe + HBM page sealing with SPDM attestation, identical model to Hopper.
| Subsystem | H100 SXM5 | B200 SXM | Blackwell delta |
|---|---|---|---|
| Dies per package | 1 | 2 (linked by NV-HBI 10 TB/s) | Dual-die |
| Memory capacity | 80 GB HBM3 | 192 GB HBM3e | +140 % |
| Memory bandwidth | 3.35 TB/s | 8 TB/s | +139 % |
| FP8 dense (Tensor) | 1,979 TFLOPS | 4,500 TFLOPS | +127 % |
| FP4 dense (Tensor) | n/a | 9,000 TFLOPS | New format |
| NVLink per GPU | 900 GB/s | 1.8 TB/s | +100 % |
| NVLink-domain ceiling | 256 GPUs | 576 GPUs | +125 % |
| TDP | 700 W | 1,000 W (up to 1,200 W) | +43 % |
| Min driver | R525 | R570 | New baseline |
| Min CUDA | 12.0 | 12.8 | New baseline |
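
The bandwidth figures and percentage deltas in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes a 1024-bit interface per HBM3e stack, which the entry does not state; the helper names are illustrative only.

```python
def hbm_bandwidth_tbps(stacks, bus_bits_per_stack, gbps_per_pin):
    """Aggregate HBM bandwidth in TB/s (decimal units)."""
    bits_per_second = stacks * bus_bits_per_stack * gbps_per_pin * 1e9
    return bits_per_second / 8 / 1e12


def pct_delta(new, old):
    """Percentage increase of `new` over `old`, rounded to a whole percent."""
    return round((new / old - 1) * 100)


# Assumption: 1024 data pins per HBM3e stack (not stated in this entry).
hbm_tbps = hbm_bandwidth_tbps(stacks=8, bus_bits_per_stack=1024, gbps_per_pin=8.0)

# 18 NVLink ports x 100 GB/s each, expressed in TB/s.
nvlink_tbps = 18 * 100 / 1000

deltas = {
    "memory_capacity": pct_delta(192, 80),
    "memory_bandwidth": pct_delta(8, 3.35),
    "fp8_dense": pct_delta(4500, 1979),
    "nvlink": pct_delta(1.8, 0.9),
    "nvlink_domain": pct_delta(576, 256),
    "tdp": pct_delta(1000, 700),
}
```

The same `pct_delta` helper is a quick way to re-derive the delta column whenever a vendor figure is revised.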
Reference: full specification sheet
Authoritative per-SKU figures for the SXM B200 in its standard 1,000 W configuration. Rows labelled sparse assume 2:4 structured sparsity; each dense figure is half the corresponding sparse figure. FP4 throughput requires the second-generation Transformer Engine and a model quantised with NVIDIA ModelOpt (or MX-compatible quantisation tooling).
| Metric | B200 SXM | B200 NVL (PCIe pair) |
|---|---|---|
| Architecture | Blackwell (dual-die) | Blackwell (dual-die) x2 |
| Process | TSMC 4NP | TSMC 4NP |
| Transistors per package | 208 billion | 416 billion (pair) |
| Dies per package | 2 (NV-HBI 10 TB/s) | 2 per board x 2 boards |
| Compute capability | sm_100 / sm_100a | sm_100 / sm_100a |
| FP64 (Tensor) | 40 TFLOPS | 80 TFLOPS |
| FP32 | 80 TFLOPS | 160 TFLOPS |
| TF32 (Tensor, sparse) | 2,250 TFLOPS | 4,500 TFLOPS |
| BF16 / FP16 (Tensor, dense) | 2,250 TFLOPS | 4,500 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 4,500 TFLOPS | 9,000 TFLOPS |
| FP8 (Tensor, dense) | 4,500 TFLOPS | 9,000 TFLOPS |
| FP8 (Tensor, sparse) | 9,000 TFLOPS | 18,000 TFLOPS |
| FP4 (Tensor, dense) | 9,000 TFLOPS | 18,000 TFLOPS |
| FP4 (Tensor, sparse) | 18,000 TFLOPS | 36,000 TFLOPS |
| INT8 (Tensor, sparse) | 9,000 TOPS | 18,000 TOPS |
| Memory | 192 GB HBM3e (8 stacks) | 384 GB HBM3e (192 GB per board) |
| Memory bandwidth | 8 TB/s | 16 TB/s aggregate |
| NVLink | 1.8 TB/s (5.0, 18 ports) | 900 GB/s (bridge) |
| NV-HBI (die-to-die) | 10 TB/s | 10 TB/s per board |
| PCIe | Gen5 x16 (128 GB/s) | Gen5 x16 per board |
| TDP | 1,000 W default (configurable 700-1,200 W) | 2 x 1,000 W |
| MIG instances | Up to 7 | Up to 7 per board |
| Confidential Compute | Yes (CC-on attested) | Yes |
| Decompression engine | Yes (Deflate / Snappy / LZ4) | Yes |
| Form factor | SXM mezzanine | Dual FHFL PCIe + bridge |
| Cooling | Direct-to-chip liquid (mandatory at rack scale) | Liquid or air |
| Minimum driver | R570+ | R570+ |
| Minimum CUDA | 12.8+ | 12.8+ |
FP4 sparse numbers assume both MXFP4 calibration and 2:4 structured sparsity. Real-world FP4 inference throughput on production chat models typically lands at 60-70 % of the headline sparse figure once KV-cache traffic and per-batch quantisation overhead are included. Quote dense numbers in capacity plans.
Interconnect: NVLink 5.0, NVSwitch gen-5, and the GB200 NVL72 rack
Per-GPU NVLink 5.0 doubles Hopper's bandwidth: 18 ports × 100 GB/s = 1.8 TB/s aggregate per B200. The fifth-generation NVSwitch ASIC follows, delivering 3.6 TB/s bisection inside an 8-GPU HGX-B200 baseboard. The optional NVLink Switch System extends this to 576-GPU NVLink domains via external switches — more than 2x the 256-GPU ceiling of Hopper's NVL Switch.
The canonical Blackwell training topology is the GB200 NVL72 rack. Each rack houses 18 compute trays, each carrying two GB200 super-chips (one Grace CPU + two B200 GPUs per super-chip), for a total of 36 super-chips, 36 Grace CPUs and 72 B200 GPUs (144 Blackwell dies). All 72 GPUs plus 36 Grace CPUs sit inside one NVLink domain with 130 TB/s aggregate NVLink bandwidth (72 x 1.8 TB/s), roughly 30x the inter-server bandwidth of an equivalent 72-H100 InfiniBand cluster. The rack is liquid-cooled by design (120 kW typical, 132 kW peak) and ships pre-assembled.
Beyond NVL72, NVLink Switch System scales to NVL576: eight NVL72 racks linked by external NVLink switches into one 576-GPU NVLink domain. Past 576 GPUs, the topology drops to InfiniBand XDR or Spectrum-X RoCE, with the same 5-10x latency penalty on cross-domain collectives that Hopper saw beyond 256 GPUs.
- Per-GPU NVLink: 1.8 TB/s (18 ports × 100 GB/s bidirectional).
- Per-baseboard NVSwitch bisection: 3.6 TB/s (8 GPUs).
- NVLink-domain ceiling: 576 GPUs (NVL576).
- GB200 NVL72: 72 B200 GPUs + 36 Grace CPUs, 130 TB/s aggregate NVLink bandwidth, 120 kW liquid-cooled rack.
- Above 576 GPUs: InfiniBand XDR (800 Gb/s per port) or Spectrum-X RoCE — plan for 5-10x latency uplift on cross-domain collectives.
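
To see what these link rates mean for collectives, the sketch below applies the textbook ring all-reduce cost model as a bandwidth-only lower bound. It is an illustration, not a description of the algorithm NCCL actually selects on NVLink 5.0, and it assumes each GPU can send at half of the 1.8 TB/s bidirectional figure.

```python
def ring_allreduce_seconds(message_bytes, n_gpus, per_gpu_send_bytes_per_s):
    """Bandwidth-only lower bound for a ring all-reduce (latency ignored)."""
    if n_gpus < 2:
        return 0.0
    # Each GPU sends 2 * (n - 1) / n of the message over the ring.
    bytes_sent = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return bytes_sent / per_gpu_send_bytes_per_s


NVLINK_BIDIR_BPS = 1.8e12                 # per-GPU NVLink 5.0, bytes/s, both directions
NVLINK_SEND_BPS = NVLINK_BIDIR_BPS / 2    # assumption: half available per direction

# Aggregate NVLink bandwidth of one NVL72 rack, in TB/s.
rack_aggregate_tbps = 72 * NVLINK_BIDIR_BPS / 1e12

# 1 GB gradient bucket reduced across the 8 GPUs of one HGX-B200 baseboard.
t_baseboard = ring_allreduce_seconds(1e9, 8, NVLINK_SEND_BPS)
```

Under these assumptions a 1 GB bucket needs about two milliseconds inside a baseboard; the 5-10x penalty quoted for cross-domain collectives applies on top of whatever the slower fabric's bandwidth term works out to.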
Workload pattern A: Llama 3.1 70B at FP4 on a single B200
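The pattern that defined B200's inference value proposition. Llama 3.1 70B quantised to MXFP4 with NVIDIA ModelOpt occupies roughly 35 GB of weight memory. With an FP8 KV cache the model needs about 160 KB per token (80 layers x 8 KV heads x 128 head dimension x 2 for keys and values), so a 50 GB KV budget holds roughly 300K tokens: about two full 128K-context streams, or 32 concurrent streams averaging roughly 9.5K tokens each. On a single B200 with 192 GB HBM3e, weights plus that KV budget fit comfortably with 100+ GB headroom, which can be handed to the KV cache for longer or more numerous streams. Throughput sustains roughly 2x H100 SXM5 FP8 at iso-quality (verified on InferenceBench), making FP4 the dominant 2026 inference precision on Blackwell.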
```bash
# 70B at FP4 on 1x B200 SXM with vLLM 0.8+ and the Blackwell backend
# Pre-quantise to MXFP4 with NVIDIA ModelOpt (one-off, offline):
#   modelopt-quantize \
#     --model meta-llama/Meta-Llama-3.1-70B-Instruct \
#     --output ./Llama-3.1-70B-Instruct-FP4 \
#     --quant-format mxfp4 --awq-block-size 32 --calib-dataset wikitext-2
CUDA_VISIBLE_DEVICES=0 vllm serve ./Llama-3.1-70B-Instruct-FP4 \
  --tensor-parallel-size 1 \
  --quantization modelopt \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Alternative: TensorRT-LLM 0.13+ for the absolute highest throughput
trtllm-build --checkpoint_dir ./Llama-3.1-70B-Instruct-FP4 \
  --output_dir ./engines/llama31-70b-b200-fp4 \
  --gemm_plugin fp4 --gpt_attention_plugin fp8 \
  --max_input_len 131072 --max_seq_len 131072 \
  --tp_size 1 --workers 1
```
On a calibration-friendly chat model, FP4 on B200 typically clears 4,000-5,000 output TPS at 32K context, roughly 4x what H100 SXM5 sustains in FP8 at the same context. Validate accuracy on your eval set before locking in FP4; for some domain-specific fine-tunes the perplexity hit warrants staying on FP8.
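
KV-cache demand for this pattern can be estimated directly from the model shape. The sketch below uses the published Llama 3.1 70B attention shape (80 layers, 8 KV heads, head dimension 128), which is an input not stated in this entry, and one byte per element for an FP8 cache; the helper names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int


# Assumption: published Llama 3.1 70B shape (grouped-query attention, 8 KV heads).
LLAMA31_70B = ModelShape(layers=80, kv_heads=8, head_dim=128)


def kv_bytes_per_token(shape, bytes_per_elem=1):
    """Bytes of KV cache per token; the factor 2 covers keys and values."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_elem


def kv_cache_gb(shape, tokens_in_flight, bytes_per_elem=1):
    """KV cache size in GB for a given number of cached tokens."""
    return kv_bytes_per_token(shape, bytes_per_elem) * tokens_in_flight / 1e9


def tokens_that_fit(shape, budget_gb, bytes_per_elem=1):
    """How many cached tokens fit inside a KV budget given in GB."""
    return int(budget_gb * 1e9 // kv_bytes_per_token(shape, bytes_per_elem))
```

Calling `kv_cache_gb(LLAMA31_70B, 131072)` gives the footprint of one full 128K-context stream, and `tokens_that_fit(LLAMA31_70B, budget_gb)` converts whatever HBM is left after weights into a token budget to divide among concurrent streams.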
Workload pattern B: 405B training on GB200 NVL72
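The canonical Blackwell training topology. Llama 3.1 405B at FP8 (Transformer Engine HYBRID) on one GB200 NVL72 rack (72 B200 GPUs in one NVLink domain, 130 TB/s aggregate NVLink bandwidth) runs on Megatron-LM with the Blackwell rebuild. By the 6 x parameters x tokens rule, one trillion training tokens cost about 2.4e24 FLOPs; against the rack's 324 PFLOPS dense FP8 peak (72 x 4,500 TFLOPS) that is roughly 87 days at 100 % MFU, so plan on about 90 days per trillion tokens at the 95-98 % MFU quoted for this configuration. Pipeline parallelism is unnecessary inside one rack; the entire model trains with TP+DP across the 72-GPU NVLink domain.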
- MFU (model FLOPS utilisation) on GB200 NVL72 routinely clears 95 % for 405B FP8 training — the highest MFU NVIDIA has ever published.
- Pipeline parallelism is unnecessary within one NVL72 rack; tensor-parallel + data-parallel across 72 GPUs is enough.
- Multi-rack training links NVL72 racks via NVLink Switch System (NVL576) or InfiniBand XDR; both are supported by NCCL 2.21+.
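
The TP/PP/DP layout used for the single-rack run in this section can be sanity-checked before launch. The sketch below is a hypothetical helper, not part of Megatron-LM; it derives the data-parallel size and gradient-accumulation steps from the world size and batch settings, and fails fast when they do not divide evenly.

```python
def check_layout(world_size, tp, pp, micro_batch, global_batch):
    """Derive DP size and gradient-accumulation steps, or raise on a bad layout."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(f"world size {world_size} not divisible by TP x PP = {model_parallel}")
    dp = world_size // model_parallel
    samples_per_micro_step = micro_batch * dp
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            f"global batch {global_batch} not divisible by micro-batch x DP = {samples_per_micro_step}"
        )
    return {"dp": dp, "grad_accum_steps": global_batch // samples_per_micro_step}


def tokens_per_iteration(global_batch, seq_length):
    """Training tokens consumed by one optimizer step."""
    return global_batch * seq_length


# Values from the single-rack 405B run in this section: 72 GPUs, TP=8, PP=1,
# micro-batch 1, global batch 1152, sequence length 8192.
layout = check_layout(world_size=72, tp=8, pp=1, micro_batch=1, global_batch=1152)
step_tokens = tokens_per_iteration(global_batch=1152, seq_length=8192)
```

Multiplying `step_tokens` by the planned iteration count gives the total token budget of a run, which is worth comparing against the dataset size before committing rack time.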
```bash
# Megatron-LM on GB200 NVL72: single-rack 405B FP8 training
# Requires Megatron-Core 0.10+ with Blackwell rebuild, Transformer Engine 1.10+
# Launch across 72 B200 GPUs inside one NVL72 rack (TP=8, PP=1, DP=9)
export NCCL_NVLS_ENABLE=1
export NCCL_IB_DISABLE=0 # leave InfiniBand transport enabled for traffic that leaves the rack
export TRANSFORMER_ENGINE_FP8=1
export TE_FP8_FORMAT=HYBRID # E4M3 forward, E5M2 backward
# 18 compute trays x 4 GPUs = 72 ranks. Megatron-LM derives the data-parallel
# size as world size / (TP x PP) = 9, so no data-parallel flag is passed.
torchrun \
  --nnodes=18 --nproc_per_node=4 \
  --rdzv_endpoint=$HEAD_NODE:29500 \
  pretrain_gpt.py \
  --tensor-model-parallel-size 8 \
  --pipeline-model-parallel-size 1 \
  --num-layers 126 --hidden-size 16384 --num-attention-heads 128 \
  --seq-length 8192 --max-position-embeddings 131072 \
  --micro-batch-size 1 --global-batch-size 1152 \
  --train-iters 600000 \
  --lr 1.5e-4 --min-lr 1.5e-5 --lr-decay-style cosine \
  --weight-decay 0.1 --clip-grad 1.0 \
  --transformer-impl transformer_engine \
  --fp8-format hybrid --fp8-amax-history-len 1024 \
  --use-flash-attn --use-distributed-optimizer \
  --data-path /datasets/the-pile --tokenizer-type Llama3Tokenizer \
  --save /checkpoints/llama31-405b --save-interval 1000
```
Workload pattern C: MoE serving with 8x B200
Mixture-of-experts serving on an HGX-B200 baseboard with TP=8 and an MoE-aware scheduler. Activated parameters per token fit in a single B200's HBM; the routed-parameter pool (typically 4-10x the activated parameters) is sharded across 8 GPUs with expert parallelism. This is the topology that runs Mixtral 8x22B, DeepSeek V3 and the larger custom MoE models that filled B200 capacity through 2025-2026.
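
A rough sketch of what sharding the expert pool means for per-GPU memory follows. It assumes an approximate 141B total parameter count for Mixtral 8x22B and a simple round-robin expert placement; both are illustrative assumptions, and production schedulers may place and replicate experts differently.

```python
def moe_weight_gb_per_gpu(total_params_billion, bytes_per_param, n_gpus):
    """Even-split weight footprint per GPU in GB (1e9 params x bytes / 1e9)."""
    return total_params_billion * bytes_per_param / n_gpus


def expert_to_rank(n_experts, ep_size):
    """Round-robin placement: expert i lives on rank i % ep_size."""
    return {expert: expert % ep_size for expert in range(n_experts)}


# Assumption: ~141B total parameters for Mixtral 8x22B, FP8 weights (1 byte each).
per_gpu_gb = moe_weight_gb_per_gpu(total_params_billion=141, bytes_per_param=1, n_gpus=8)
placement = expert_to_rank(n_experts=8, ep_size=8)
```

With 8 experts and 8 ranks each GPU holds exactly one expert per layer, so nearly all of the remaining HBM is available for KV cache and activations.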
```bash
# Mixtral 8x22B FP8 on 8x B200 SXM with vLLM expert parallelism
# --enable-expert-parallel shards experts across the 8 tensor-parallel ranks
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NCCL_P2P_LEVEL=NVL \
NCCL_NVLS_ENABLE=1 \
vllm serve mistralai/Mixtral-8x22B-Instruct-v0.1 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 65536 \
  --max-num-batched-tokens 65536 \
  --max-num-seqs 128 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000
```
Sizing and capacity planning
Sizing tables we use on InferenceBench. All figures assume B200 SXM, FP8 weights (or FP4 where flagged) via the second-generation Transformer Engine, vLLM 0.8+ or TensorRT-LLM 0.13+, and NVLink-local placement. The headline against H100/H200 is that almost every row moves to fewer GPUs per replica at materially higher throughput — and FP4 rows double throughput again where calibration is viable.
- Training rule of thumb: 405B at FP8 on one GB200 NVL72 rack covers roughly 1 trillion training tokens per 90 days at 95 % MFU (6 x 405B x 1T tokens is about 2.4e24 FLOPs, against a 324 PFLOPS dense FP8 peak). The equivalent on H100 SXM5 is 8x larger and 3x slower.
- Memory ceiling for a single B200: weights + KV cache + activations + cuBLAS scratch < 188 GB usable. Above 188 GB expect OOMs even with paged KV.
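- Fleet sizing is demand divided by per-replica throughput. 1,000 RPS at 4K-token output is about 4.1 million output tokens per second, which at roughly 4,000 TPS per replica is on the order of 1,000 Llama 3.1 70B FP4 replicas on B200 SXM; H100 SXM5 FP8 needs 3-4x as many, the typical fleet compression ratio.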
- AllReduce overhead at TP=8 inside one HGX-B200: ~5-7 % of step time for 405B FP8 — slightly better than Hopper because NVLink 5.0 doubles bandwidth.
| Model size | Precision | Context | GPUs per replica | TP / EP | Approx output TPS | Approx VRAM headroom |
|---|---|---|---|---|---|---|
| 7B (Mistral, Qwen) | FP4 | 8K | 1x B200 | 1 / 1 | 11,000-14,000 | 175 GB free |
| 13B | FP4 | 8K | 1x B200 | 1 / 1 | 7,500-9,500 | 170 GB free |
| 34B (Yi, Codestral) | FP4 | 8K | 1x B200 | 1 / 1 | 3,800-4,800 | 150 GB free |
| 70B (Llama 3.1) | FP4 | 8K | 1x B200 | 1 / 1 | 2,800-3,400 | 130 GB free |
| 70B (Llama 3.1) | FP4 | 32K | 1x B200 | 1 / 1 | 3,200-4,000 | 100 GB free |
| 70B (Llama 3.1) | FP4 | 128K | 1x B200 | 1 / 1 | 3,500-4,500 | 50 GB free |
| 70B (Llama 3.1) | FP8 | 32K | 1x B200 | 1 / 1 | 2,200-2,800 | 80 GB free |
| 140B MoE (Mixtral 8x22B) | FP8 | 32K | 2x B200 | 2 / 2 | 2,800-3,500 | 100 GB free per rank |
| 180B (Falcon, Bloom) | FP4 | 8K | 1x B200 | 1 / 1 | 1,400-1,800 | 55 GB free |
| 405B (Llama 3.1) | FP8 | 32K | 4x B200 | 4 / 1 | 1,000-1,300 | 75 GB free per rank |
| 405B (Llama 3.1) | FP4 | 128K | 4x B200 | 4 / 1 | 1,800-2,400 | 50 GB free per rank |
| GB200 NVL72 training (405B FP8) | FP8 | 8K | 72 B200 GPUs / rack | 8 / 9 (TP / DP) | n/a (training MFU 95-98 %) | Rack-scale |
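
The fleet-sizing and memory-ceiling rules in this section reduce to two small calculations, sketched below. The function names are hypothetical, "4K-token output" is taken as 4,096 tokens, and the per-replica TPS values passed in should come from your own benchmark runs rather than from this sketch.

```python
import math


def replicas_needed(requests_per_s, output_tokens_per_request, replica_tps):
    """Replicas required to sustain a given output-token demand."""
    demand_tps = requests_per_s * output_tokens_per_request
    return math.ceil(demand_tps / replica_tps)


def fits_on_one_b200(weights_gb, kv_gb, activations_gb, scratch_gb, usable_gb=188):
    """Apply the single-GPU memory ceiling: the total must stay under usable HBM."""
    return weights_gb + kv_gb + activations_gb + scratch_gb < usable_gb


# Example: 1,000 RPS at 4,096 output tokens, 4,000 TPS sustained per replica.
b200_replicas = replicas_needed(1000, 4096, 4000)
```

Dividing the replica count for one SKU by the count for another, at each SKU's measured TPS, gives the fleet compression ratio for that workload.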
Cost & TCO
Market-clearing B200 pricing through 2026 is roughly 2.4x H100 SXM5 at the same commitment tier — but cost-per-token at FP4 on chat models is typically 35-50 % cheaper than H100 FP8 because per-card throughput rises faster than per-card cost. The honest test: B200 wins on cost-per-token for FP4-friendly inference and for frontier training (NVL72 rack-scale); H100 still wins on cost-per-token for FP8 / BF16 chat serving and for capacity-constrained budgets where supply matters more than peak throughput.
- Cost-per-million-output-tokens on Llama 3.1 70B FP4, 1x B200 at $6.00/GPU-hr and 4,000 TPS sustained: roughly $0.42 per million tokens. That is about 40 % cheaper than the same model on H100 FP8 (about $0.69 at $2.50/GPU-hr, the 2.4x price ratio above, and roughly 1,000 TPS) and about 43 % cheaper than H200 FP8 ($0.74).
- Commitment savings: 1y reserved ~= 25 % off on-demand, 3y reserved ~= 40 % off. Only commit when steady-state utilisation > 70 %.
- FP4 is the default for chat inference where calibration is viable — FP8 leaves roughly 1.8-2x throughput on the table.
- Egress and inter-region data movement at hyperscalers still applies; collocate model artefacts with compute.
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Notes |
|---|---|---|---|---|---|
| Hyperscaler (AWS p6 / GCP a4 / Azure ND B200 v6) | B200 SXM | $6.00 | $4.50 | $3.60 | Capacity-block preferred over on-demand. |
| Hyperscaler | B200 NVL PCIe | $4.80 | $3.60 | $2.90 | Limited regions; air-cooled chassis available. |
| Tier-1 neocloud | B200 SXM | $5.20 | $4.00 | $3.20 | GB200 NVL72 racks where available. |
| Tier-2 neocloud | B200 SXM | $4.60 | $3.60 | $2.95 | Best raw rate; verify NVL Switch System topology. |
| GB200 NVL72 rack (Tier-1 neocloud) | Full rack | $420/hr (per rack) | $320/hr | $255/hr | 72 GPUs + 36 Grace; about $5.83/GPU-hr at on-demand. |
| Spot/preemptible | B200 SXM | n/a | n/a | n/a | No spot tier through 2026. |
| Yobitel NeoCloud (UK + EU, preview) | B200 SXM | $5.40-5.80 | $4.10-4.40 | $3.30-3.55 | Preview capacity; NCSC OFFICIAL-aligned, liquid-cooled racks. |
| Yobitel Omniscient Compute | B200 SXM multi-cloud | Market-clearing | Commit-discounted | Commit-discounted | Cross-provider arbitrage on top of NeoCloud + partner capacity. |
All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.b200.sxm` (or `rack.gb200.nvl72`).
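
The cost-per-token and commitment figures in this section follow from two short formulas. The sketch below is illustrative, with hypothetical helper names; the hourly rate and sustained TPS are inputs to supply from your own contract and benchmark data.

```python
def usd_per_million_tokens(gpu_hour_usd, gpus_per_replica, replica_tps):
    """Serving cost per million output tokens for one replica."""
    tokens_per_hour = replica_tps * 3600
    return gpu_hour_usd * gpus_per_replica / tokens_per_hour * 1e6


def reserved_rate(on_demand_usd, discount):
    """Hourly rate after a commitment discount, rounded to cents."""
    return round(on_demand_usd * (1 - discount), 2)


# 1x B200 at $6.00/GPU-hr sustaining 4,000 output tokens per second.
b200_fp4_cost = usd_per_million_tokens(6.00, 1, 4000)

# One-year (25 % off) and three-year (40 % off) reserved rates.
one_year = reserved_rate(6.00, 0.25)
three_year = reserved_rate(6.00, 0.40)
```

Because cost per token scales with price divided by throughput, a card that costs 2.4x as much per hour only wins when its sustained TPS advantage exceeds 2.4x on the workload in question.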
Migration and alternatives
When B200 is the right choice and when it isn't. The dominant migrations in 2026 are H100/H200 -> B200 for FP4-friendly serving and frontier training, and standalone B200 SXM -> GB200 NVL72 for rack-scale training. Two heuristics: do not migrate to B200 if you cannot absorb the software stack lift (TensorRT-LLM 0.13+, vLLM 0.8+, Megatron rebuild) and the ModelOpt FP4 calibration pipeline; do migrate if you are FP4-viable on chat models or if you need rack-scale frontier training and have the liquid-cooled facility ready.
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| H100 SXM5 -> B200 SXM | FP4-friendly chat serving; rack-scale training | Medium — software lift + FP4 calibration | New software stack; sm_100a kernels |
| H200 SXM -> B200 SXM | FP4 throughput uplift on memory-pressured workloads | Medium — same software lift | New software stack |
| H100/H200 -> GB200 NVL72 | Frontier training at rack scale | Very high — pod-as-unit topology, liquid cooling | New facility envelope; pre-assembled racks |
| B200 SXM -> GB200 NVL72 | 405B+ training scale-out | Medium — same B200 dies, new rack form | Liquid cooling; rack BMC integration |
| B200 SXM -> B200 NVL PCIe | Air-cooled chassis required | Trivial | Lower per-card TDP; lower NVLink bandwidth |
| B200 -> MI300X/MI325X/MI355X | ROCm or NVIDIA-alternative supply | High — CUDA -> ROCm rewrite | CUDA kernels not portable |
| B200 -> H100/H200 (downgrade) | Supply or budget pressure | Trivial — Hopper backend still ships | Lose FP4 throughput; lose NVL576 |
Pitfalls and operational notes
B200 inherits the Hopper operational profile and adds three Blackwell-specific failure surfaces — liquid cooling, the NV-HBI inter-die fabric, and the second-generation Transformer Engine's FP4 calibration path. Most production incidents on early B200 fleets trace back to one of those three.
Liquid-cooling discipline is the new operational baseline. Coolant supply above 25 °C, flow rate below spec, or a CDU pump fault all surface as rack-wide throttling during sustained training. The rack BMC's Redfish endpoint exposes `coolant_supply_temp_c`, `coolant_flow_lpm` and `cdu_pump_state`; bringing supply below 23 °C typically recovers 5-10 % of throughput on a throttled rack. Leak-detection telemetry must be monitored as a critical signal: the alert path should be the same severity tier as an ECC double-bit error.
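
A monitoring rule for these three signals might look like the sketch below. It assumes the telemetry has already been fetched into a dictionary keyed by the field names above; the healthy pump-state value `"ok"` and the flow threshold are placeholders to replace with the values from your rack vendor's documentation.

```python
def coolant_alerts(reading, max_supply_c=25.0, min_flow_lpm=None, healthy_pump_state="ok"):
    """Return the list of coolant alerts raised by one telemetry reading."""
    alerts = []
    if reading["coolant_supply_temp_c"] > max_supply_c:
        alerts.append("supply_temp_high")
    # The flow spec is rack-specific, so the check only runs when a limit is given.
    if min_flow_lpm is not None and reading["coolant_flow_lpm"] < min_flow_lpm:
        alerts.append("flow_below_spec")
    if reading["cdu_pump_state"] != healthy_pump_state:
        alerts.append("cdu_pump_fault")
    return alerts


# Hypothetical reading: warm supply and low flow, pump healthy.
example = coolant_alerts(
    {"coolant_supply_temp_c": 26.5, "coolant_flow_lpm": 30.0, "cdu_pump_state": "ok"},
    min_flow_lpm=40.0,
)
```

Returning a list rather than a boolean lets the alerting layer route each condition to its own severity tier.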
NV-HBI inter-die fabric degradation is the most distinctive Blackwell failure mode. Sustained inter-die bandwidth below ~8 TB/s (against a 10 TB/s nominal) usually means substrate solder joint degradation or thermal stress on the CoWoS-L package; drain the workload, reseat the module, and if the degradation persists the package itself is failing: RMA, not a baseboard swap. NVLink-5.0 port flapping in NVL72 racks is the close cousin: run `nvidia-smi nvlink --status`, check rack switch logs, reseat cassettes, and treat persistent failures as NVL switch ASIC candidates rather than per-GPU faults.
FP4 calibration regressions after ModelOpt quantisation are the dominant software pitfall. The default MXFP4 calibration set is too small for some activation distributions; bumping `--calib-samples` to 2048 or higher, switching `--awq-block-size` to 64, and selectively keeping attention QKV projections at FP8 (`--exclude-layers attention.qkv_proj`) recovers most of the perplexity gap. Always validate accuracy on the production eval set before locking FP4 — for some domain-specific fine-tunes the perplexity hit warrants staying on FP8 even at the throughput cost.
Three quieter pitfalls follow. TensorRT-LLM engine builds fail with an `sm_90` error when a pre-Blackwell installation is targeting `sm_100`; upgrade to TensorRT-LLM 0.13+ and confirm the toolchain resolves `sm_100a`. NCCL AllReduce often runs 30 % slower than expected on a fresh B200 fleet because the NCCL version is falling back to PXN instead of using the NVLink Switch SHARP path on NVLink 5.0; upgrade NCCL to 2.21+, set `NCCL_NVLS_ENABLE=1`, and verify selection with `NCCL_DEBUG=INFO`. Per-die measurement hash drift in CC-on attestation traces to firmware mismatch between the two dies after a partial update; reflash both dies to the same firmware revision and restart the attestation chain.
Early-life ECC double-bit errors run roughly 2x the H100-era infant-mortality rate on early Blackwell HBM3e stacks; quarantine and RMA promptly. If the decompression engine sits idle during training, the dataloader is still on CPU decompression; switch to `nvidia.dali` with the hardware decompression pipeline for 1.3-1.6x dataloader throughput. And if `nvidia-smi` reports 90 GB instead of 192 GB, the driver is older than R570 and does not expose the full Blackwell HBM3e capacity.
Confidential Compute mode on Blackwell binds per-die measurement hashes alongside the package measurement — finer-grained attestation than Hopper offered — but the customer-facing model is identical: SPDM-over-PCIe to NVIDIA NRAS, ~3-7 % throughput penalty when CC-on, and a one-way driver toggle until reboot. MIG slices are larger by capacity than H200 but otherwise identical in semantics.
Where this fits in the Yobitel stack
B200 is the frontier-tier GPU across the Yobitel stack in 2026. Yobibyte — our AI-native managed platform — places FP4-viable serving workloads, frontier training jobs and rack-scale MoE workloads on B200 pools by default, falling back to H200 or H100 when calibration is not viable or when supply is tight in the requested sovereignty region. The vLLM and Megatron-LM commands in this entry are exactly what Yobibyte reconciles under the hood on the customer's behalf; the customer specifies model, region, replica count and spend cap, and the platform selects the SKU and the FP precision.
Omniscient Compute — our cross-cloud capacity broker — indexes B200 SXM and GB200 NVL72 capacity across every connected hyperscaler and Tier-1/Tier-2 neocloud, normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency posture. Because B200 supply is uneven by region in 2026, Omniscient Compute frequently splits a workspace's serving footprint across regions — H100 in one region, B200 in another — and load-balances based on cost-per-token in real time.
InferenceBench — our public, reproducible benchmarking harness — publishes B200 throughput, latency and cost-per-token numbers for every major open-weight model across vLLM, TensorRT-LLM, SGLang and TGI, including FP4 calibration results vs FP8 baselines. The sizing tables above are anchored on InferenceBench runs. If you are sizing a 2026 frontier footprint, start with InferenceBench, lift the platform configuration into the Yobibyte workspace, and let Omniscient Compute pick the region — including whether to deploy as standalone B200 SXM or as a full GB200 NVL72 rack.
References
- NVIDIA Blackwell Architecture Whitepaper · NVIDIA
- NVIDIA B200 Product Page · NVIDIA
- HGX B200 Platform Brief · NVIDIA
- GB200 NVL72 Rack Reference · NVIDIA
- NVIDIA ModelOpt — FP4 Quantisation Toolchain · NVIDIA
- TensorRT-LLM Blackwell Engines · NVIDIA
- Transformer Engine 2.0 (Blackwell) · NVIDIA
- Open Compute Project Microscaling (MX) Formats · Open Compute Project
- FinOps Foundation FOCUS billing specification · FinOps Foundation
- NCSC Cloud Security Principles · UK NCSC