TL;DR
- Mid-cycle Hopper refresh announced November 2023, volume shipping Q2 2024 — same GH100 silicon as H100 (132 SMs, 528 Tensor cores, sm_90a, NVLink 4.0, Transformer Engine) with the HBM stack upgraded from 80 GB HBM3 to 141 GB HBM3e at 4.8 TB/s.
- Compute throughput is identical to H100 by design: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse. Training FLOPS for compute-bound runs are unchanged — the upgrade only earns its premium on memory-bound workloads.
- Headline inference win: Llama-3.1 405B in FP8 fits on 8x H200 SXM5 with TP=8 at 128K context with KV-cache headroom — the smallest topology that runs the full frontier model in one NVLink domain.
- Drop-in upgrade for any existing H100 fleet — same chassis, same baseboard footprint, same NVIDIA driver R535+ branch, same CUDA 12.4+, same vLLM / TensorRT-LLM / Megatron stack. Re-tune batch size and KV budget; everything else carries over.
- Market-clearing pricing through 2026: roughly $4.25/GPU-hr on-demand, $3.20 one-year reserved, $2.55 three-year reserved, $1.75 spot. That is a ~30 % premium over H100 SXM5, usually paid back in fewer GPUs per replica on memory-bound serving.
Overview#
The H200 is the same Hopper silicon as the H100 paired with a faster, denser HBM3e stack, and that change alone was enough to reshape the inference economics of 2024-2026. NVIDIA announced it at SC23 in November 2023 and shipped in volume from Q2 2024. From a compute standpoint nothing changed: the GH100 die, the fourth-generation Tensor Core, NVLink 4.0 at 900 GB/s, the Transformer Engine and the third-generation NVSwitch ASIC are all carried over unchanged from H100 SXM5.
What did change is memory. 141 GB of HBM3e at 4.8 TB/s (six 24 GB stacks, 144 GB raw, of which 141 GB is exposed as usable capacity) replaces H100's 80 GB HBM3 at 3.35 TB/s. That is 76 % more capacity and 43 % more bandwidth at the same FLOPS. Transformer decode is overwhelmingly memory-bandwidth bound, so the bandwidth uplift translates almost linearly into single-stream tokens-per-second on memory-bound shapes.
Practically, the H200 ate three workloads outright in 2024-2026. Long-context inference of 70B-class dense models (8K -> 128K) collapses from a 4-card TP=4 topology on H100 to a single-card replica on H200. Llama-3.1 405B fits on 8x H200 SXM5 with TP=8 at 128K context in FP8 — the only single-NVLink-domain topology that carries the full frontier model. And MoE serving with large routed-parameter pools (Mixtral 8x22B and beyond) stops paging when the routed weights fit alongside the working KV-cache budget.
This entry is the reference for teams operating H200 alongside or instead of H100: full spec sheet, sizing tables we use on InferenceBench, the migration playbook from H100 (which is mostly 'change two flags'), the FinOps levers, the troubleshooting issues that are H200-specific (HBM3e thermal envelope, NVLink-4.0 cabling at higher sustained traffic), and where the part fits in the Yobibyte and Omniscient Compute stack. Yobitel NeoCloud offers H200 SXM capacity in UK and EU regions with NCSC OFFICIAL alignment, and is the default landing zone Yobibyte's managed inference workspaces target when memory-bound 70B and 405B serving justifies the H200 premium. This entry helps you decide when H200's 141 GB HBM3e is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.
How it works: the HBM3e uplift and what is identical#
H200 is best understood as a memory upgrade, not an architecture upgrade. The GH100 die — 80 billion transistors on TSMC 4N, 132 SMs, 528 fourth-generation Tensor cores, 60 MB L2, the Transformer Engine, TMA, Thread Block Clusters, DPX instructions, MIG gen-2 — is identical to H100. Compute capability stays at sm_90 / sm_90a; every Hopper-tuned kernel (Flash Attention 3, CUTLASS 3.x with `wgmma`, Triton's Hopper backend, cuBLAS LT) runs unchanged on H200.
The change is at the memory layer. HBM3e replaces HBM3: H200 runs six active stacks of 24 GB each, where H100 runs five active 16 GB stacks (six sites, one disabled for yield). Six × 24 GB is 144 GB raw, of which 141 GB is exposed as usable after reserved overheads at the controller. Per-pin signalling rises to roughly 6.4 Gbps from H100's 5.2 Gbps, and the sixth active stack widens the memory interface; together these lift aggregate bandwidth to 4.8 TB/s, which is why a +23 % per-pin change yields +43 % in bandwidth. The L2 cache and the on-die fabric are unchanged.
Why HBM3e matters more for decode than prefill: transformer decode is bandwidth-bound — every generated token streams the full weight tensor through the Tensor cores, and the activations are small. A 1.43x bandwidth uplift therefore translates to roughly 1.30-1.40x decode tokens-per-second on memory-bound shapes (70B+ dense, long context). Prefill — where the workload is compute-bound on the GEMM tile — sees almost no uplift; the FLOPS ceiling is identical.
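The bandwidth argument can be made concrete with a back-of-envelope roofline. The sketch below is an illustrative upper bound, not a benchmark: it assumes a single decode stream that reads every weight byte from HBM once per token, and it ignores KV-cache reads, kernel overheads and batching. The function name `decode_tps_ceiling` and the 70 GB weight figure (70B parameters at one byte each in FP8) are assumptions introduced here.

```python
def decode_tps_ceiling(bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode tokens/s when each generated
    token has to stream the full weight set from HBM once."""
    bytes_per_second = bandwidth_tb_s * 1e12
    bytes_per_token = weight_gb * 1e9
    return bytes_per_second / bytes_per_token


WEIGHT_GB = 70.0  # assumption: 70B parameters x 1 byte (FP8)

h100 = decode_tps_ceiling(3.35, WEIGHT_GB)  # H100 SXM5 HBM3 bandwidth
h200 = decode_tps_ceiling(4.8, WEIGHT_GB)   # H200 HBM3e bandwidth

print(f"H100 ceiling: {h100:.1f} tok/s")
print(f"H200 ceiling: {h200:.1f} tok/s")
print(f"ratio: {h200 / h100:.2f}x")
```

The ratio reproduces the 1.43x bandwidth uplift. The realised 1.30-1.40x range sits below it because real decode also spends time on attention over the KV cache and on work that does not scale with HBM bandwidth.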
Why HBM3e matters more for serving than training: training is dominated by gradient accumulation, optimiser state and activation checkpointing — for compute-bound runs (large batch, short-to-mid context) the FLOPS ceiling is the binding constraint and H200 offers no uplift. For activation-heavy training (long sequences, large batches without activation checkpointing) memory pressure improves materially, but the same outcome can usually be re-tuned on H100 with gradient accumulation. The honest summary: H200 is a serving GPU.
- Silicon: GH100 die, sm_90a, 132 SMs, 528 Tensor cores; identical to H100 SXM5.
- Memory: 141 GB HBM3e at 4.8 TB/s (six 24 GB stacks, 144 GB raw, 141 GB usable).
- Per-pin signalling: 6.4 Gbps (vs 5.2 Gbps on H100 HBM3).
- Tensor cores: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse — identical to H100.
- NVLink: 900 GB/s aggregate over 18 NVLink 4.0 ports — identical to H100.
- Confidential Compute (CC-on) mode: AES-256-GCM PCIe + HBM page sealing, SPDM attestation — identical to H100.
| Subsystem | H100 SXM5 | H200 SXM | Delta |
|---|---|---|---|
| Memory capacity | 80 GB HBM3 | 141 GB HBM3e | +76 % |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | +43 % |
| Per-pin signalling | 5.2 Gbps | 6.4 Gbps | +23 % |
| SMs / Tensor cores | 132 / 528 | 132 / 528 | Identical |
| FP8 Tensor (sparse) | 3,958 TFLOPS | 3,958 TFLOPS | Identical |
| BF16 Tensor (sparse) | 1,979 TFLOPS | 1,979 TFLOPS | Identical |
| NVLink aggregate | 900 GB/s | 900 GB/s | Identical |
| TDP | 700 W | 700 W default (up to 1000 W in the H200 'extreme' config) | +0-300 W depending on SKU |
Reference: full specification sheet#
Authoritative per-SKU figures. SXM fills HGX-H200 baseboards and almost every cloud H200 instance; PCIe Gen5 is the drop-in card for retrofit servers; H200 NVL is the dual-card variant for memory-pressured inference in PCIe chassis. Tensor rows labelled 'sparse' assume 2:4 structured sparsity, and dense throughput for those rows is half the sparse figure; the FP64 row is already a dense figure.
| Metric | H200 SXM | H200 PCIe Gen5 | H200 NVL (pair) |
|---|---|---|---|
| Architecture | Hopper GH100 | Hopper GH100 | Hopper GH100 x2 |
| Process | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors | 80 billion | 80 billion | 160 billion (pair) |
| SMs | 132 | 114 | 132 x 2 |
| Tensor cores | 528 | 456 | 528 x 2 |
| L2 cache | 60 MB | 50 MB | 60 MB x 2 |
| Compute capability | sm_90 / sm_90a | sm_90 / sm_90a | sm_90 / sm_90a |
| FP64 (Tensor) | 67 TFLOPS | 51 TFLOPS | 134 TFLOPS |
| TF32 (Tensor, sparse) | 989 TFLOPS | 756 TFLOPS | 1,978 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 1,979 TFLOPS | 1,513 TFLOPS | 3,958 TFLOPS |
| FP8 (Tensor, sparse) | 3,958 TFLOPS | 3,026 TFLOPS | 7,916 TFLOPS |
| INT8 (Tensor, sparse) | 3,958 TOPS | 3,026 TOPS | 7,916 TOPS |
| Memory | 141 GB HBM3e | 141 GB HBM3e | 282 GB HBM3e (141 GB per board) |
| Memory bandwidth | 4.8 TB/s | 4.8 TB/s | 9.6 TB/s aggregate |
| NVLink | 900 GB/s (4.0, 18 ports) | 600 GB/s (bridge) | 900 GB/s board-to-board |
| PCIe | Gen5 x16 (128 GB/s) | Gen5 x16 (128 GB/s) | Gen5 x16 per board |
| TDP | 700 W default (configurable 600-1000 W) | 600 W | 2 x 600 W |
| MIG instances | Up to 7 | Up to 7 | Up to 7 per board |
| Confidential Compute | Yes (CC-on attested) | Yes | Yes |
| Form factor | SXM mezzanine | FHFL dual-slot PCIe | Dual FHFL PCIe + bridge |
| Minimum driver | R535+ (R550+ recommended) | R535+ | R550+ |
| Minimum CUDA | 12.2 (12.4+ for full TE) | 12.2 | 12.4 |
Sparse Tensor numbers assume 2:4 structured sparsity. Real LLM serving rarely sustains this — quote dense numbers in capacity plans (half the sparse figure) and treat sparse as a marketing ceiling.
Workload pattern A: Llama 3.1 405B at 128K context on 8x H200#
The signature H200 topology. Llama 3.1 405B in FP8 needs roughly 410 GB of weight memory; the FP8 KV cache at 128K context for one stream sits around 35 GB. On 8x H100 SXM5 the budget collapses: 8 × 80 GB = 640 GB total, minus weights, leaves 230 GB shared across activations, KV cache and cuBLAS scratch, which is workable only at small batches with aggressive KV trimming. On 8x H200 SXM5 the budget is 8 × 141 GB = 1,128 GB. That leaves roughly 718 GB after weights, and still 500+ GB across the eight ranks once the utilisation cap, activations and scratch are taken out: enough for 12-24 concurrent 128K-context sessions in steady state.
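The budget above can be reproduced with a few lines of arithmetic. This is a minimal sketch under stated assumptions: the layer count, KV-head count and head dimension are the published Llama 3.1 405B architecture values, the KV cache is stored at one byte per element (FP8), and the 0.92 factor mirrors the `--gpu-memory-utilization` setting used in this pattern. The helper name `kv_bytes_per_token` is introduced here for illustration, and activations and cuBLAS scratch are deliberately not subtracted.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """KV-cache bytes for one token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Assumption: Llama 3.1 405B uses 126 layers, 8 KV heads (GQA), head_dim 128.
per_token = kv_bytes_per_token(layers=126, kv_heads=8, head_dim=128)
per_stream_gb = per_token * 131_072 / 1e9  # one full 128K-token stream

WEIGHTS_GB = 410.0             # FP8 weight footprint quoted in the text
total_gb = 8 * 141             # eight H200s
usable_gb = total_gb * 0.92    # assumption: 0.92 utilisation cap
free_gb = usable_gb - WEIGHTS_GB
max_full_streams = int(free_gb // per_stream_gb)

print(f"KV per token:       {per_token:,} bytes")
print(f"KV per 128K stream: {per_stream_gb:.1f} GB")
print(f"Free after weights: {free_gb:.0f} GB")
print(f"Full-length 128K streams (upper bound): {max_full_streams}")
```

The upper bound lands inside the 12-24 session range; real deployments come in lower once activations and scratch space are accounted for, or higher when sessions share prefixes or do not fill the whole context window.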
```bash
# 405B serving on 8x H200 SXM5 (single HGX baseboard, TP=8)
# Requirements: vllm 0.6.3+, CUDA 12.4+, driver R550+, HGX-H200 baseboard
pip install "vllm==0.6.3" "torch==2.4.0"

# The -FP8 checkpoint ships pre-quantised. If vLLM reports a quantization
# mismatch against the checkpoint config, drop --quantization so the scheme
# is read from the checkpoint instead.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NCCL_P2P_LEVEL=NVL \
NCCL_NVLS_ENABLE=1 \
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000
```

```bash
# Smoke-test a 100K-token prefill + 1K-token generation
# (run from a second shell once the server is up)
python -c "
import pathlib
import openai

client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='x')
prompt = pathlib.Path('long_context_100k.txt').read_text()
r = client.chat.completions.create(
    model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8',
    messages=[{'role': 'user', 'content': prompt + '\nSummarise.'}],
    max_tokens=1024,
)
print(r.choices[0].message.content)
"
```

Pattern A gotcha: at TP=8, NCCL AllReduce on every decode step is the dominant inter-GPU traffic. The eight GPUs must share one HGX-H200 baseboard; verify with `nvidia-smi topo -m` that all pairs show `NV#` (NVLink), not `SYS` (PCIe through host). A single rank crossing onto a second baseboard over InfiniBand drops decode TPS by 60-80 %.
Workload pattern B: 70B at 128K on a single H200#
The workload that justified the H200 in the procurement model. On H100, Llama 3.1 70B at 128K context required TP=4 across four H100 SXM5: the FP8 weights fit on two cards, but the KV cache for a meaningful concurrent batch (16+ streams) does not. On a single H200, the weights (~70 GB in FP8, at one byte per parameter) leave roughly 60 GB for KV cache and working activations at `--gpu-memory-utilization 0.92`, so the whole replica fits on one card. That eliminates the AllReduce overhead and, at the hourly rates in the comparison below, cuts the per-replica GPU bill by more than half.
- Single-card replica eliminates inter-GPU collectives — decode tail latency improves 30-50 % at p99.
- Concurrent sessions at 128K: 24-32 on a single H200 vs 16-24 on 4x H100 TP=4.
- Quadruples the number of replicas per HGX baseboard (8 replicas on H200 vs 2 replicas on H100).
```bash
# 70B at 128K on 1x H200 SXM5: the single-card replica
# If your vLLM build rejects fp8_e5m2 alongside FP8 weights,
# use --kv-cache-dtype fp8 instead.
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Compare cost-per-token vs the 4x H100 topology this replaces
# (both figures imply ~1,000 output tokens/s per replica):
#   4x H100 SXM5 @ $2.50/GPU-hr = $10.00/replica-hr -> ~$2.78/M tokens
#   1x H200 SXM  @ $4.25/GPU-hr =  $4.25/replica-hr -> ~$1.18/M tokens
# H200 wins on cost-per-token even at the premium hourly rate.
```

Workload pattern C: 70B QLoRA fine-tune at extended context#
QLoRA fine-tune of a 70B base model on 2x H200 SXM with extended-context sequences (8K-16K) — the workload that filled H200 capacity through 2024-2025 for enterprise customisation work. Same `transformers` + `peft` + `bitsandbytes` + `trl` stack as on H100, with `per_device_train_batch_size` raised because the larger HBM pool absorbs the activation memory.
```python
# train.py: 70B QLoRA on 2x H200 SXM, 16K sequences
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#                   "bitsandbytes>=0.43" "accelerate>=0.34"
# attn_implementation="flash_attention_2" also needs the flash-attn package.
# Written against trl 0.11; later trl releases rename some SFTTrainer and
# SFTConfig arguments, so pin the version or adapt the call.
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from trl import SFTConfig, SFTTrainer

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"

# 4-bit NF4 base weights with BF16 compute (the QLoRA recipe)
bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# device_map="auto" shards the layers across both GPUs inside one process
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto",
    attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules="all-linear", bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora,
    train_dataset=load_dataset("json", data_files="s3://my/data/*.jsonl", split="train"),
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora-h200",
        num_train_epochs=3,
        per_device_train_batch_size=4,   # 2x the H100 setting
        gradient_accumulation_steps=8,   # 4 x 8 = 32 sequences per optimiser step in one
                                         # process; 64 needs 2-rank data parallel
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=16384,  # 4x H100 setting
        logging_steps=10, save_steps=500,
    ),
)
trainer.train()
```

Sizing and capacity planning#
Sizing tables we use on InferenceBench. All figures assume H200 SXM, FP8 weights via the Transformer Engine, vLLM 0.6 with paged KV cache and prefix caching, and NVLink-local placement. The headline against H100 is that almost every memory-pressured row collapses to fewer GPUs per replica at higher throughput — and the H200 row is where many production fleets standardise.
- Training rule of thumb: compute-bound runs (large batch, short context, gradient accumulation) take the same H200-days as H100-days — the FLOPS are identical. Only activation-heavy training (long sequences, large microbatches without checkpointing) sees uplift from H200.
- Memory ceiling for a single H200: weights + KV cache + activations + cuBLAS scratch < 138 GB usable. Above 138 GB expect OOMs even with paged KV.
- For 500 RPS at 4K-token output, Llama 3.1 70B FP8 needs roughly 4-5 H200 SXM replicas vs 6-8 H100 SXM5 replicas — typical fleet compression ratio of 1.5x.
- AllReduce overhead at TP=8 inside one HGX-H200: 6-9 % of step time for 405B FP8 — identical to H100 because NVLink is unchanged.
- Spot/preemptible H200 capacity is viable for fine-tunes but not production inference — eviction rates of 6-12 % per day are typical through 2026.
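The memory-ceiling rule in the list above is easy to encode as a pre-deployment check. The sketch below is illustrative only: the `ReplicaBudget` class and the component figures passed to it are hypothetical, and the 138 GB ceiling is the steady-state limit quoted in this entry rather than a value read from the device.

```python
from dataclasses import dataclass

H200_SAFE_CEILING_GB = 138.0  # steady-state ceiling quoted in the text


@dataclass
class ReplicaBudget:
    """Per-GPU memory plan for one serving replica (all values in GB)."""
    weights_gb: float
    kv_cache_gb: float
    activations_gb: float
    scratch_gb: float  # cuBLAS workspace and similar

    def total_gb(self) -> float:
        return (self.weights_gb + self.kv_cache_gb
                + self.activations_gb + self.scratch_gb)

    def headroom_gb(self, ceiling_gb: float = H200_SAFE_CEILING_GB) -> float:
        return ceiling_gb - self.total_gb()

    def fits(self, ceiling_gb: float = H200_SAFE_CEILING_GB) -> bool:
        return self.total_gb() < ceiling_gb


# Hypothetical plans for a 70B FP8 replica; only the KV budget differs.
safe_plan = ReplicaBudget(weights_gb=70, kv_cache_gb=55, activations_gb=6, scratch_gb=2)
tight_plan = ReplicaBudget(weights_gb=70, kv_cache_gb=64, activations_gb=6, scratch_gb=2)

for name, plan in (("safe", safe_plan), ("tight", tight_plan)):
    print(f"{name}: total {plan.total_gb():.0f} GB, "
          f"headroom {plan.headroom_gb():.0f} GB, fits={plan.fits()}")
```

Running the check against the intended vLLM settings before a rollout catches configurations that would only fail under load, when the KV cache actually fills.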
| Model size | Precision | Context | GPUs per replica | TP / PP | Approx output TPS | Approx VRAM headroom |
|---|---|---|---|---|---|---|
| 7B (Mistral, Qwen) | FP8 | 8K | 1x H200 | 1 / 1 | 5,500-7,000 | 125 GB free |
| 13B | FP8 | 8K | 1x H200 | 1 / 1 | 3,800-4,800 | 115 GB free |
| 34B (Yi, Codestral) | FP8 | 8K | 1x H200 | 1 / 1 | 1,900-2,400 | 90 GB free |
| 70B (Llama 3.1) | FP8 | 8K | 1x H200 | 1 / 1 | 1,150-1,500 | 75 GB free |
| 70B (Llama 3.1) | FP8 | 32K | 1x H200 | 1 / 1 | 1,400-1,800 | 60 GB free |
| 70B (Llama 3.1) | FP8 | 128K | 1x H200 | 1 / 1 | 1,500-1,900 | 30 GB free |
| 140B MoE (Mixtral 8x22B) | FP8 | 32K | 1x H200 | 1 / 1 | 1,500-1,900 | 35 GB free |
| 180B (Falcon, Bloom) | FP8 | 8K | 2x H200 | 2 / 1 | 700-900 | 50 GB free per rank |
| 405B (Llama 3.1) | FP8 | 32K | 8x H200 | 8 / 1 | 400-500 | 85 GB free per rank |
| 405B (Llama 3.1) | FP8 | 128K | 8x H200 | 8 / 1 | 450-550 | 50 GB free per rank |
Cost & TCO#
Market-clearing H200 pricing through 2026 is roughly 30 % higher than H100 SXM5 at the same commitment tier. The premium is almost always paid back on memory-bound serving where H200 reduces the replica count. The honest test: if your workload uses a single H100 today at < 60 % of its 80 GB HBM and decode bandwidth is not the bottleneck, H200 does not pay back. If it uses 2-4x H100 SXM5 in TP today, H200 usually does.
- Cost-per-million-output-tokens on Llama 3.1 70B FP8 at 32K context, 1x H200 SXM at $4.25/GPU-hr and 1,600 TPS sustained: roughly $0.74 per million tokens — 47 % cheaper than the 4x H100 TP=4 baseline ($1.40) it replaces.
- Commitment savings track H100: 1y reserved ~= 25 % off on-demand, 3y reserved ~= 40 % off — only commit when steady-state utilisation exceeds 65 %.
- FP8 is the default — FP16 leaves roughly 1.6x throughput on the table at the same hourly rate.
- Egress and inter-region data movement frequently exceed 10 % of the H200 bill at hyperscalers — collocate model artefacts with compute.
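The cost-per-token figure in the first bullet follows directly from the hourly rate and the sustained throughput. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name `cost_per_million_tokens` is introduced here for illustration):

```python
def cost_per_million_tokens(rate_per_gpu_hr: float, gpus: int,
                            tokens_per_s: float) -> float:
    """Dollars per million output tokens for one replica at steady state."""
    replica_cost_per_hr = rate_per_gpu_hr * gpus
    tokens_per_hr = tokens_per_s * 3600
    return replica_cost_per_hr / tokens_per_hr * 1e6


# 1x H200 SXM at $4.25/GPU-hr, 1,600 TPS sustained (figures from the text)
h200_cost = cost_per_million_tokens(4.25, gpus=1, tokens_per_s=1600)

H100_TP4_BASELINE = 1.40  # $/M tokens for the 4x H100 TP=4 replica, as quoted
saving = 1 - h200_cost / H100_TP4_BASELINE

print(f"H200: ${h200_cost:.2f} per million tokens")
print(f"Saving vs 4x H100 baseline: {saving:.0%}")
```

The same function can be rerun with reserved or spot rates from the table to see how far the commitment tier moves the per-token figure.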
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Notes |
|---|---|---|---|---|---|
| Hyperscaler (AWS p5e / GCP a3-ultra / Azure ND H200 v5) | H200 SXM | $4.25 | $3.20 | $2.55 | Best for hybrid stacks; data-egress costs matter. |
| Hyperscaler | H200 PCIe | $3.40 | $2.55 | $2.05 | Fewer regions; not all instances support NVLink topology. |
| Tier-1 neocloud | H200 SXM | $3.80 | $2.95 | $2.40 | Commonly cheapest at scale; verify NVLink topology. |
| Tier-2 neocloud | H200 SXM | $3.20 | $2.60 | $2.15 | Best raw rate; expect more variance in IB topology. |
| Spot/preemptible | H200 SXM | $1.75-2.40 | n/a | n/a | 6-12 % eviction/day; fine-tunes only. |
| Yobitel NeoCloud (UK + EU) | H200 SXM | $3.60-3.90 | $2.80-3.10 | $2.30-2.55 | NCSC OFFICIAL-aligned regions; FOCUS-conformant billing. |
| Yobitel Omniscient Compute | H200 SXM multi-cloud | Market-clearing | Commit-discounted | Commit-discounted | Cross-provider arbitrage on top of NeoCloud + partner capacity. |
All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.h200.sxm`.
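Because the charges carry those FOCUS column values, H200 spend can be pulled out of a billing export with a simple filter. The sketch below is a hedged illustration: the CSV rows are made-up sample data, `gpu.h100.sxm5` is a hypothetical SKU id used only for contrast, and it assumes the export also carries the standard FOCUS `BilledCost` and `ConsumedQuantity` columns.

```python
import csv
import io

# Made-up sample export; real exports carry many more FOCUS columns.
SAMPLE_FOCUS_CSV = """\
ServiceName,ChargeCategory,SkuId,ConsumedQuantity,BilledCost
AcceleratorCompute,Usage,gpu.h200.sxm,8,30.40
AcceleratorCompute,Usage,gpu.h100.sxm5,4,11.20
AcceleratorCompute,Usage,gpu.h200.sxm,16,60.80
"""


def usage_cost_for_sku(focus_csv: str, sku_id: str = "gpu.h200.sxm") -> float:
    """Sum BilledCost over usage rows that match the given SkuId."""
    total = 0.0
    for row in csv.DictReader(io.StringIO(focus_csv)):
        if row["ChargeCategory"] == "Usage" and row["SkuId"] == sku_id:
            total += float(row["BilledCost"])
    return round(total, 2)


h200_spend = usage_cost_for_sku(SAMPLE_FOCUS_CSV)
print(f"H200 SXM usage spend: ${h200_spend:.2f}")
```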
Migration and alternatives#
When H200 is the right choice and when it isn't. The dominant migration is H100 -> H200, and it is the cheapest GPU upgrade NVIDIA has shipped this decade: identical software stack, identical chassis, identical drivers — just larger batches and larger KV-cache budgets in the serving config.
Two heuristics. First: if the H100 workload is FLOPS-bound (training a 7B-34B model with full activation checkpointing, or prefill-heavy serving with short outputs), H200 offers nothing — stay on H100. Second: if the H100 workload is memory-bound (large dense models, long context, MoE serving) or splits across multiple H100s in TP, H200 typically pays back in 2-3 months on cost-per-token alone.
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| H100 SXM5 -> H200 SXM | Memory-bound serving; long context; MoE; TP=4 -> TP=1 compression | Trivial — same chassis, same software, retune batch | None — same GH100 silicon |
| H100 SXM5 -> H200 SXM (training) | Activation-heavy training only; otherwise no FLOPS uplift | Trivial; retune microbatch | None |
| H100 PCIe -> H200 PCIe | Drop-in retrofit; same memory uplift in PCIe chassis | Trivial | None |
| A100 -> H200 | Want HBM3e + FP8 + TMA in one upgrade | Medium — Hopper kernels, FP8 calibration | sm_80 vs sm_90a |
| H200 -> B200 | Need FP4 throughput or 8 TB/s bandwidth | Medium: CUDA 12.8+, FP4 quant, MX formats | New software stack; rebuild engines |
| H200 -> MI300X | Need ROCm or NVIDIA-alternative supply | High — CUDA -> ROCm rewrite | CUDA kernels not portable; vLLM ROCm gap |
| H200 -> GB200 NVL72 | Frontier training at rack scale | Very high — pod-as-unit topology | Liquid cooling, FP4 software stack |
Pitfalls and operational notes#
Most operational issues specific to H200 fleets are HBM3e-related, with a smaller set of software-tuning surprises that appear when teams lift a configuration straight off an H100 fleet without retuning. The headline pattern is that the silicon is unchanged, but the thermal envelope and memory budget both shifted enough to bite teams that assumed parity.
Thermal envelope is tighter than it looks on paper. Sustained 4.8 TB/s HBM3e traffic generates more localised heat in the stack than HBM3 did, and several early production fleets hit sustained throttling above 700 W on direct-to-chip loops that worked fine on H100. Coolant supply below 25 C is the operational target; if the supply is hot, dropping the TDP cap to 600 W via the OEM BMC usually recovers stable throughput. The HBM-die temperature series (`DCGM_FI_DEV_MEMORY_TEMP`) is the canonical leading indicator; alert at > 92 C.
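The HBM temperature alert can be prototyped against the text output of a Prometheus-style DCGM exporter before it is wired into an alerting stack. This is a sketch under assumptions: the sample lines and label set are illustrative rather than captured from a real node, and the helper `hot_hbm_gpus` is a name introduced here. Only the metric name and the 92 C threshold come from the text.

```python
import re

# Matches one sample line for the HBM temperature metric: labels, then value.
LINE_RE = re.compile(r'^DCGM_FI_DEV_MEMORY_TEMP\{([^}]*)\}\s+(\S+)\s*$')
GPU_LABEL_RE = re.compile(r'gpu="([^"]*)"')

# Illustrative exporter output; the label names are assumptions.
SAMPLE_METRICS = """\
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-aaaa",device="nvidia0"} 84
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-bbbb",device="nvidia1"} 95
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbbb",device="nvidia1"} 71
"""


def hot_hbm_gpus(metrics_text: str, threshold_c: float = 92.0) -> dict[str, float]:
    """Map GPU index -> HBM temperature for GPUs above the threshold."""
    hot: dict[str, float] = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # comment lines and other metrics are ignored
        gpu = GPU_LABEL_RE.search(match.group(1))
        temp_c = float(match.group(2))
        if temp_c > threshold_c:
            hot[gpu.group(1) if gpu else "unknown"] = temp_c
    return hot


print(hot_hbm_gpus(SAMPLE_METRICS))
```

In production the same threshold would normally live in an alerting rule; the parser is mainly useful for ad hoc checks and for validating the rule against captured exporter output.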
Out-of-memory errors despite the 141 GB pool are almost always a configuration carry-over. vLLM defaults sized for the 80 GB H100 leave the larger pool unused — raise `--max-num-batched-tokens`, raise `--max-num-seqs`, push `--gpu-memory-utilization` to 0.92, and confirm the KV-cache dtype is `fp8_e5m2`. The HBM ceiling for safe steady-state is weights + KV cache + activations + cuBLAS scratch below 138 GB usable; above that, paged KV will not save the workload.
NVLink 4.0 cabling behaves slightly differently under H200's higher sustained decode traffic. Mezzanine-connector seating sensitivities that did not surface on H100 do surface on H200; `nvidia-smi nvlink --status` plus a drain-and-reseat resolves most cases, and persistent failures across reseats point at the baseboard, not the GPU. NCCL collectives that suddenly under-perform usually trace to a stale NCCL version not aware of the higher HBM bandwidth — upgrade to NCCL 2.21+, set `NCCL_NVLS_ENABLE=1`, and verify with `NCCL_DEBUG=INFO` that NVLink SHARP is selected.
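Alongside `nvidia-smi nvlink --status`, the GPU-to-GPU matrix printed by `nvidia-smi topo -m` can be checked programmatically for any pair that is not joined by NVLink. The sketch below is a minimal parser under assumptions: the sample matrix is illustrative rather than captured from a real baseboard, the helper `non_nvlink_pairs` is a name introduced here, and it assumes the GPU columns come first and are numbered in order. On a live node the text would come from running the command itself.

```python
import re


def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (gpu_a, gpu_b, link) for every GPU pair not joined by NVLink."""
    rows = []
    for line in topo_text.splitlines():
        tokens = line.split()
        # Data rows start with a GPU label and contain the self-marker "X";
        # the header row and the legend fail one of the two conditions.
        if tokens and re.fullmatch(r"GPU\d+", tokens[0]) and "X" in tokens:
            rows.append(tokens)
    gpu_count = len(rows)
    bad = []
    for row in rows:
        # The first gpu_count columns after the label are the GPU-to-GPU links.
        for col, link in enumerate(row[1:1 + gpu_count]):
            if link != "X" and not link.startswith("NV"):
                bad.append((row[0], f"GPU{col}", link))
    return bad


# Illustrative 3-GPU matrix with one pair routed through the host (SYS).
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    CPU Affinity
GPU0     X      NV18    SYS     0-47
GPU1    NV18     X      NV18    0-47
GPU2    SYS     NV18     X      48-95
"""

for gpu_a, gpu_b, link in non_nvlink_pairs(SAMPLE_TOPO):
    print(f"{gpu_a} <-> {gpu_b}: {link} (not NVLink)")
```

An empty result means every pair reports an `NV#` link, which is the condition a TP=8 replica needs before it is put into service.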
Three quieter surprises round out the list. FP8 calibration regressions after an H100 -> H200 swap usually mean the Transformer Engine amax history is too short for the larger batch sizes H200 now sustains; raise `fp8_amax_history_len` from 1024 to 4096 and keep `fp8_format=HYBRID`. MIG slice creation fails when the H100 profile names are passed through unchanged: the smallest H200 profile is 1g.18gb rather than H100's 1g.10gb, so list profiles with `nvidia-smi mig -lgip` first. And early-life ECC double-bit errors run higher than on H100 fleets; quarantine and RMA the card, since row-remap counts climbing past 50 total predict card failure within weeks. If the headline complaint is 'throughput identical to H100 on a training run', the workload is compute-bound (verify with `DCGM_FI_PROF_DRAM_ACTIVE` sitting below 50 %) and H200 is the wrong SKU, so stay on H100.
Confidential Compute mode on H200 mirrors H100: AES-256-GCM PCIe + HBM3e page sealing with SPDM attestation to NVIDIA NRAS, ~3-7 % throughput penalty when CC-on. MIG slices are larger by capacity (the 1g.18gb minimum replaces H100's 1g.10gb) but otherwise identical in semantics. Both modes are toggled through the driver, and once set neither can be reverted without a reboot.
Where this fits in the Yobitel stack#
H200 is the default 'serving GPU' across the Yobitel stack from 2025 onward. Yobibyte — our AI-native managed platform — places memory-pressured inference workloads (70B+ dense, long context, MoE) on H200 pools by default, falling back to H100 only when the workload is FLOPS-bound or H200 capacity is tight in the requested sovereignty region. The vLLM serving config in this entry is exactly what Yobibyte reconciles under the hood on the customer's behalf; the customer specifies model name, region, replica count and spend cap, and the platform selects the SKU.
Omniscient Compute — our cross-cloud capacity broker — indexes H200 SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud, normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency posture. When you ask Yobitel for 8x H200 SXM in the UK sovereign region, Omniscient Compute is the layer that finds it.
InferenceBench — our public, reproducible benchmarking harness — publishes H200 throughput, latency and cost-per-token numbers for every major open-weight model across vLLM, TensorRT-LLM, SGLang and TGI. The sizing tables above are anchored on InferenceBench runs; production numbers your team will see in steady state are typically within 10 % of the published figures. If you are sizing a 2026 H200 footprint, start with InferenceBench, lift the platform configuration into the Yobibyte workspace, and let Omniscient Compute pick the region.
References
- NVIDIA H200 Tensor Core GPU Datasheet · NVIDIA
- HGX H200 Platform Brief · NVIDIA
- Hopper Architecture Whitepaper · NVIDIA
- Transformer Engine User Guide · NVIDIA
- DCGM Field Identifiers (Prometheus exporter) · NVIDIA
- vLLM FP8 quantisation on Hopper · vLLM
- TensorRT-LLM Hopper engines · NVIDIA
- FinOps Foundation FOCUS billing specification · FinOps Foundation
- NCSC Cloud Security Principles · UK NCSC