NVIDIA H200 — 141GB HBM3e Specs & When to Use

TL;DR

Mid-cycle Hopper refresh announced November 2023, volume shipping Q2 2024 — same GH100 silicon as H100 (132 SMs, 528 Tensor cores, sm_90a, NVLink 4.0, Transformer Engine) with the HBM stack upgraded from 80 GB HBM3 to 141 GB HBM3e at 4.8 TB/s.
Compute throughput is identical to H100 by design: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse. Training FLOPS for compute-bound runs are unchanged — the upgrade only earns its premium on memory-bound workloads.
Headline inference win: Llama-3.1 405B in FP8 fits on 8x H200 SXM5 with TP=8 at 128K context with KV-cache headroom — the smallest topology that runs the full frontier model in one NVLink domain.
Drop-in upgrade for any existing H100 fleet — same chassis, same baseboard footprint, same NVIDIA driver R535+ branch, same CUDA 12.4+, same vLLM / TensorRT-LLM / Megatron stack. Re-tune batch size and KV budget; everything else carries over.
Market-clearing pricing through 2026: roughly $4.25/GPU-hr on-demand, $3.20 one-year reserved, $2.55 three-year reserved, $1.75 spot. That is a ~30 % premium over H100 SXM5, usually paid back in fewer GPUs per replica on memory-bound serving.

Overview

The H200 is the same Hopper silicon as the H100 paired with a faster, denser HBM3e stack, and that alone was enough to reshape the inference economics of 2024-2026. NVIDIA announced it at SC23 in November 2023 and shipped in volume from Q2 2024. From a compute standpoint nothing changed: the GH100 die, the fourth-generation Tensor Core, NVLink 4.0 at 900 GB/s, the Transformer Engine and the third-generation NVSwitch ASIC are all carried over unchanged from H100 SXM5.

What did change is memory. 141 GB of usable HBM3e at 4.8 TB/s (six 24 GB stacks, 144 GB raw, with a small slice held back) replaces H100's 80 GB HBM3 at 3.35 TB/s. That is 76 % more capacity and 43 % more bandwidth at the same FLOPS, and because transformer decode is overwhelmingly memory-bandwidth bound, the bandwidth uplift translates almost linearly into single-stream tokens-per-second on memory-bound shapes.

Practically, the H200 ate three workloads outright in 2024-2026. Long-context inference of 70B-class dense models (8K -> 128K) collapses from a 4-card TP=4 topology on H100 to a single-card replica on H200. Llama-3.1 405B fits on 8x H200 SXM5 with TP=8 at 128K context in FP8 — the only single-NVLink-domain topology that carries the full frontier model. And MoE serving with large routed-parameter pools (Mixtral 8x22B and beyond) stops paging when the routed weights fit alongside the working KV-cache budget.

This entry is the reference for teams operating H200 alongside or instead of H100: full spec sheet, sizing tables we use on InferenceBench, the migration playbook from H100 (which is mostly 'change two flags'), the FinOps levers, the troubleshooting issues that are H200-specific (HBM3e thermal envelope, NVLink-4.0 cabling at higher sustained traffic), and where the part fits in the Yobibyte and Omniscient Compute stack. Yobitel NeoCloud offers H200 SXM capacity in UK and EU regions with NCSC OFFICIAL alignment, and is the default landing zone Yobibyte's managed inference workspaces target when memory-bound 70B and 405B serving justifies the H200 premium. This entry helps you decide when H200's 141 GB HBM3e is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

How it works: the HBM3e uplift and what is identical

H200 is best understood as a memory upgrade, not an architecture upgrade. The GH100 die (80 billion transistors on TSMC 4N, 132 enabled SMs, 528 fourth-generation Tensor cores, 50 MB of enabled L2, the Transformer Engine, TMA, Thread Block Clusters, DPX instructions, MIG gen-2) is identical to H100. Compute capability stays at sm_90 / sm_90a; every Hopper-tuned kernel (Flash Attention 3, CUTLASS 3.x with wgmma, Triton's Hopper backend, cuBLAS LT) runs unchanged on H200.

The change is at the memory layer. HBM3e replaces HBM3: H200 runs six active stacks of 24 GB each, against five active 16 GB stacks on H100 SXM5 (the package has six stack sites, one of which H100 leaves inactive). Six 24 GB stacks give 144 GB raw, of which 141 GB is exposed as usable capacity; a small slice is held back rather than a whole stack being disabled. Per-pin signalling rises to roughly 6.4 Gbps from H100's 5.2 Gbps, and together with the extra active stack this lifts aggregate bandwidth to 4.8 TB/s. The L2 cache and the on-die fabric are unchanged.
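As a rough cross-check of the bandwidth figures, peak HBM bandwidth can be estimated as active stacks times the 1,024-bit per-stack interface times the per-pin rate. The sketch below is illustrative only: the function name is hypothetical, and the stack counts (five active on H100, six on H200) are an assumption based on public Hopper material rather than on the tables in this entry.

```python
def hbm_bandwidth_tb_s(active_stacks: int, pin_rate_gbps: float,
                       bus_bits_per_stack: int = 1024) -> float:
    """Peak HBM bandwidth in TB/s (decimal units)."""
    gbit_per_s = active_stacks * bus_bits_per_stack * pin_rate_gbps
    return gbit_per_s / 8 / 1000  # Gbit/s -> GB/s -> TB/s


# Assumption: 5 active stacks at 5.2 Gbps (H100), 6 active stacks at 6.4 Gbps (H200)
h100_bw = hbm_bandwidth_tb_s(5, 5.2)
h200_bw = hbm_bandwidth_tb_s(6, 6.4)
print(f"H100 ~{h100_bw:.2f} TB/s, H200 ~{h200_bw:.2f} TB/s")
```

The H200 estimate lands slightly above the quoted 4.8 TB/s, which suggests the shipping pin rate sits a little below the nominal 6.4 Gbps.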

Why HBM3e matters more for decode than prefill: transformer decode is bandwidth-bound — every generated token streams the full weight tensor through the Tensor cores, and the activations are small. A 1.43x bandwidth uplift therefore translates to roughly 1.30-1.40x decode tokens-per-second on memory-bound shapes (70B+ dense, long context). Prefill — where the workload is compute-bound on the GEMM tile — sees almost no uplift; the FLOPS ceiling is identical.
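The bandwidth-bound argument can be sketched as a simple ceiling: if every generated token has to stream the full weight set from HBM once, single-stream decode speed cannot exceed bandwidth divided by weight bytes. The function name and the 70 GB weight figure (about 70B parameters at one byte each in FP8) are assumptions for illustration, and the model ignores KV-cache reads, which is one reason measured gains sit below the pure bandwidth ratio.

```python
def decode_tps_ceiling(hbm_bandwidth_tb_s: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode tokens/s when each token reads all weights once."""
    return hbm_bandwidth_tb_s * 1000 / weight_gb


WEIGHT_GB = 70.0  # assumption: ~70B parameters stored in FP8 (1 byte each)

h100_tps = decode_tps_ceiling(3.35, WEIGHT_GB)
h200_tps = decode_tps_ceiling(4.8, WEIGHT_GB)
print(f"H100 ceiling ~{h100_tps:.0f} tok/s, H200 ceiling ~{h200_tps:.0f} tok/s, "
      f"ratio {h200_tps / h100_tps:.2f}x")
```

The ratio is independent of model size, so the same 1.43x ceiling applies to any weight-streaming-bound shape.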

Why HBM3e matters more for serving than training: training is dominated by gradient accumulation, optimiser state and activation checkpointing — for compute-bound runs (large batch, short-to-mid context) the FLOPS ceiling is the binding constraint and H200 offers no uplift. For activation-heavy training (long sequences, large batches without activation checkpointing) memory pressure improves materially, but the same outcome can usually be re-tuned on H100 with gradient accumulation. The honest summary: H200 is a serving GPU.

Silicon: GH100 die, sm_90a, 132 SMs, 528 Tensor cores, unchanged from H100 SXM5.
Memory: 141 GB usable HBM3e at 4.8 TB/s (six 24 GB stacks, 144 GB raw).
Per-pin signalling: 6.4 Gbps (vs 5.2 Gbps on H100 HBM3).
Tensor cores: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse — identical to H100.
NVLink: 900 GB/s aggregate over 18 NVLink 4.0 ports — identical to H100.
Confidential Compute (CC-on) mode: AES-256-GCM PCIe + HBM page sealing, SPDM attestation — identical to H100.

Subsystem	H100 SXM5	H200 SXM	Delta
Memory capacity	80 GB HBM3	141 GB HBM3e	+76 %
Memory bandwidth	3.35 TB/s	4.8 TB/s	+43 %
Per-pin signalling	5.2 Gbps	6.4 Gbps	+23 %
SMs / Tensor cores	132 / 528	132 / 528	Identical
FP8 Tensor (sparse)	3,958 TFLOPS	3,958 TFLOPS	Identical
BF16 Tensor (sparse)	1,979 TFLOPS	1,979 TFLOPS	Identical
NVLink aggregate	900 GB/s	900 GB/s	Identical
TDP	700 W	700 W default (up to 1000 W in the H200 'extreme' configuration)	+0-300 W depending on SKU

Reference: full specification sheet

Authoritative per-SKU figures. SXM fills HGX-H200 baseboards and almost every cloud H200 instance; PCIe Gen5 is the drop-in card for retrofit servers; H200 NVL is the dual-card variant for memory-pressured inference in PCIe chassis. All Tensor figures assume 2:4 structured sparsity; dense throughput is half the sparse figure.

Metric	H200 SXM	H200 PCIe Gen5	H200 NVL (pair)
Architecture	Hopper GH100	Hopper GH100	Hopper GH100 x2
Process	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	80 billion	80 billion	160 billion (pair)
SMs	132	114	132 x 2
Tensor cores	528	456	528 x 2
L2 cache	50 MB	50 MB	50 MB x 2
Compute capability	sm_90 / sm_90a	sm_90 / sm_90a	sm_90 / sm_90a
FP64 (Tensor)	67 TFLOPS	51 TFLOPS	134 TFLOPS
TF32 (Tensor, sparse)	989 TFLOPS	756 TFLOPS	1,978 TFLOPS
BF16 / FP16 (Tensor, sparse)	1,979 TFLOPS	1,513 TFLOPS	3,958 TFLOPS
FP8 (Tensor, sparse)	3,958 TFLOPS	3,026 TFLOPS	7,916 TFLOPS
INT8 (Tensor, sparse)	3,958 TOPS	3,026 TOPS	7,916 TOPS
Memory	141 GB HBM3e	141 GB HBM3e	282 GB HBM3e (141 GB per board)
Memory bandwidth	4.8 TB/s	4.8 TB/s	9.6 TB/s aggregate
NVLink	900 GB/s (4.0, 18 ports)	600 GB/s (bridge)	900 GB/s board-to-board
PCIe	Gen5 x16 (128 GB/s)	Gen5 x16 (128 GB/s)	Gen5 x16 per board
TDP	700 W default (configurable 600-1000 W)	600 W	2 x 600 W
MIG instances	Up to 7	Up to 7	Up to 7 per board
Confidential Compute	Yes (CC-on attested)	Yes	Yes
Form factor	SXM mezzanine	FHFL dual-slot PCIe	Dual FHFL PCIe + bridge
Minimum driver	R535+ (R550+ recommended)	R535+	R550+
Minimum CUDA	12.2 (12.4+ for full TE)	12.2	12.4

Note: Sparse Tensor numbers assume 2:4 structured sparsity. Real LLM serving rarely sustains this — quote dense numbers in capacity plans (half the sparse figure) and treat sparse as a marketing ceiling.

Workload pattern A: Llama 3.1 405B at 128K context on 8x H200

The signature H200 topology. Llama 3.1 405B in FP8 needs roughly 410 GB of weight memory; the KV cache at 128K context for one stream sits around 35 GB. On 8x H100 SXM5 the budget collapses: 8 × 80 GB = 640 GB total minus weights leaves 230 GB shared across activations, KV cache and cuBLAS scratch, which is workable only at small batches with aggressive KV trimming. On 8x H200 SXM5 the budget is 8 × 141 GB = 1,128 GB, which leaves 500+ GB free across the eight ranks after weights, activations and scratch. At ~35 GB per stream that is enough for roughly 14-18 concurrent sessions at the full 128K length in steady state, and more when sessions are shorter or share cached prefixes.
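The KV-cache arithmetic behind this budget can be sketched in a few lines. The model shape used here (126 layers, 8 KV heads under grouped-query attention, head dimension 128) is an assumption taken from the public Llama 3.1 405B configuration, the helper name is hypothetical, and the session count is an upper bound because activations and scratch space are not subtracted.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 1) -> int:
    """Bytes of KV cache for one sequence: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


# Assumed Llama 3.1 405B shape; FP8 KV cache stores 1 byte per element
per_stream_gb = kv_cache_bytes(131072, layers=126, kv_heads=8, head_dim=128) / 1e9

pool_gb = 8 * 141 * 0.92      # eight H200s at --gpu-memory-utilization 0.92
free_gb = pool_gb - 410       # minus ~410 GB of FP8 weights
max_full_sessions = int(free_gb // per_stream_gb)

print(f"{per_stream_gb:.1f} GB per 128K stream, "
      f"{free_gb:.0f} GB free, up to {max_full_sessions} full-length sessions")
```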

# 405B serving on 8x H200 SXM5 (single HGX baseboard, TP=8)
# Requirements: vllm 0.6.3+, CUDA 12.4+, driver R550+, HGX-H200 baseboard
# The checkpoint below is already FP8-quantised; vLLM reads the scheme from the
# model config, so no --quantization flag is passed (a mismatched value is rejected).
pip install "vllm==0.6.3" "torch==2.4.0"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NCCL_P2P_LEVEL=NVL \
NCCL_NVLS_ENABLE=1 \
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000

# Smoke-test a 100K-token prefill + 1K-token generation
python -c "
import pathlib
import openai

# vLLM only checks the API key when the server was started with --api-key
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='x')
prompt = pathlib.Path('long_context_100k.txt').read_text()
r = client.chat.completions.create(
  model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8',
  messages=[{'role':'user','content': prompt + '\nSummarise.'}],
  max_tokens=1024,
)
print(r.choices[0].message.content)
"

Warning: at TP=8, NCCL AllReduce on every decode step is the dominant inter-GPU traffic. The eight GPUs MUST share one HGX-H200 baseboard: verify with nvidia-smi topo -m that all pairs show NV# (NVLink), not SYS (PCIe through host). A single rank crossing onto a second baseboard over InfiniBand drops decode TPS by 60-80 %.
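The topology check can be automated with a small parser over the nvidia-smi topo -m matrix. This is a minimal sketch: the function name and the four-GPU sample matrix are hypothetical, and it assumes the whitespace-separated layout with no terminal colour codes in the captured text.

```python
import re


def non_nvlink_pairs(topo_text: str) -> list[tuple[str, str, str]]:
    """Return (gpu, peer, link) for every GPU pair not connected by NVLink (NV#)."""
    rows = [ln.split() for ln in topo_text.splitlines() if ln.strip()]
    # The first row whose leading token is GPU0 is the column header.
    header_idx = next(i for i, r in enumerate(rows) if r[0] == "GPU0")
    gpus = [c for c in rows[header_idx] if re.fullmatch(r"GPU\d+", c)]
    bad = []
    for cells in rows[header_idx + 1:]:
        if cells[0] not in gpus:
            continue  # skip NIC rows and the legend
        for peer, link in zip(gpus, cells[1:1 + len(gpus)]):
            if link != "X" and not re.fullmatch(r"NV\d+", link):
                bad.append((cells[0], peer, link))
    return bad


# Hypothetical sample: GPU0 and GPU3 only reach each other through the host (SYS)
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV18    NV18    SYS     0-47
GPU1    NV18     X      NV18    NV18    0-47
GPU2    NV18    NV18     X      NV18    0-47
GPU3    SYS     NV18    NV18     X      0-47
"""

print(non_nvlink_pairs(SAMPLE_TOPO))
```

On a real node the input would be the captured stdout of nvidia-smi topo -m; an empty result means every GPU pair is NVLink-connected.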

Workload pattern B: 70B at 128K on a single H200

The workload that justified the H200 in the procurement model. On H100, Llama 3.1 70B at 128K context required TP=4 across four H100 SXM5: the model fits on two cards but the KV cache for a meaningful concurrent batch (16+ streams) does not. On a single H200, weights (~70 GB in FP8) plus a KV-cache budget of roughly 55-60 GB plus working activations all fit on one card, eliminating the AllReduce overhead and more than halving the GPU bill per replica.

Single-card replica eliminates inter-GPU collectives — decode tail latency improves 30-50 % at p99.
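Concurrent sessions with a 128K window: 24-32 on a single H200 vs 16-24 on 4x H100 TP=4, assuming typical sessions fill only a small fraction of the window. A fully populated 128K stream needs about 21 GB of FP8 KV cache (80 layers, 8 KV heads, head dimension 128), so only two or three such streams fit at once.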
Quadruples the number of replicas per HGX baseboard (8 replicas on H200 vs 2 replicas on H100).

# 70B at 128K on 1x H200 SXM5 — the single-card replica
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Compare cost-per-token vs the 4x H100 topology this replaces
# (both rows assume ~1,000 output tokens/s per replica):
#   4x H100 SXM5 @ $2.50/GPU-hr = $10.00/replica-hr -> ~$2.78/M tokens
#   1x H200 SXM  @ $4.25/GPU-hr = $4.25/replica-hr  -> ~$1.18/M tokens
# H200 wins on cost-per-token even at the premium hourly rate.
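The cost-per-token comparison in the comments above reduces to one formula: replica cost per hour divided by tokens generated per hour. The sketch below reproduces both figures; the function name is hypothetical and the 1,000 tokens/s per replica is the throughput implied by those figures, not a measured value.

```python
def usd_per_million_tokens(gpu_hour_usd: float, gpus_per_replica: int,
                           tokens_per_second: float) -> float:
    """Cost of one million output tokens for a replica running flat out."""
    replica_hour_usd = gpu_hour_usd * gpus_per_replica
    return replica_hour_usd / (tokens_per_second * 3600) * 1_000_000


# Assumption: both topologies sustain ~1,000 output tokens/s per replica
h100_cost = usd_per_million_tokens(2.50, gpus_per_replica=4, tokens_per_second=1000)
h200_cost = usd_per_million_tokens(4.25, gpus_per_replica=1, tokens_per_second=1000)
print(f"4x H100: ${h100_cost:.2f}/M tokens, 1x H200: ${h200_cost:.2f}/M tokens")
```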

Workload pattern C: 70B QLoRA fine-tune at extended context

QLoRA fine-tune of a 70B base model on 2x H200 SXM with extended-context sequences (8K-16K) — the workload that filled H200 capacity through 2024-2025 for enterprise customisation work. Same transformers + peft + bitsandbytes + trl stack as on H100, with per_device_train_batch_size raised because the larger HBM pool absorbs the activation memory.

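# train.py: 70B QLoRA on 2x H200 SXM, 16K sequences
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#                   "bitsandbytes>=0.43" "accelerate>=0.34" "datasets" \
#                   "flash-attn" "s3fs"   # flash_attention_2 and s3:// data files
# Written against the trl 0.11 API; newer trl renames tokenizer= to processing_class=.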
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto",
    attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules="all-linear", bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora,
    train_dataset=load_dataset("json", data_files="s3://my/data/*.jsonl", split="train"),
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora-h200",
        num_train_epochs=3,
        per_device_train_batch_size=4,         # 2x the H100 setting
        gradient_accumulation_steps=8,         # 4 x 8 = 32 sequences/step per process; 64 with 2 data-parallel ranks
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=16384,       # 4x H100 setting
        logging_steps=10, save_steps=500,
    ),
)
trainer.train()
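How many sequences and tokens land in each optimiser step depends on how the script is launched. A minimal sketch with a hypothetical helper: with device_map="auto" in a single process the layers are sharded across both GPUs and there is one data-parallel rank, whereas a two-process data-parallel launch doubles the effective batch. The token count is an upper bound, since shorter samples fill less than max_seq_length.

```python
def batch_per_optimizer_step(per_device_batch: int, grad_accum: int,
                             data_parallel_ranks: int, seq_len: int) -> tuple[int, int]:
    """Return (sequences, max tokens) consumed by one optimiser step."""
    sequences = per_device_batch * grad_accum * data_parallel_ranks
    return sequences, sequences * seq_len


# Single process, layers sharded over 2 GPUs: one data-parallel rank
single_process = batch_per_optimizer_step(4, 8, data_parallel_ranks=1, seq_len=16384)
# Two data-parallel ranks, one per GPU
two_ranks = batch_per_optimizer_step(4, 8, data_parallel_ranks=2, seq_len=16384)
print(single_process, two_ranks)
```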

Sizing and capacity planning

Sizing tables we use on InferenceBench. All figures assume H200 SXM, FP8 weights via the Transformer Engine, vLLM 0.6 with paged KV cache and prefix caching, and NVLink-local placement. The headline against H100 is that almost every memory-pressured row collapses to fewer GPUs per replica at higher throughput — and the H200 row is where many production fleets standardise.

Training rule of thumb: compute-bound runs (large batch, short context, gradient accumulation) take the same H200-days as H100-days — the FLOPS are identical. Only activation-heavy training (long sequences, large microbatches without checkpointing) sees uplift from H200.
Memory ceiling for a single H200: weights + KV cache + activations + cuBLAS scratch < 138 GB usable. Above 138 GB expect OOMs even with paged KV.
For an aggregate load of roughly 6,000-7,000 output tokens/s (for example about 100 requests per minute at 4K-token outputs), Llama 3.1 70B FP8 needs roughly 4-5 H200 SXM replicas vs 6-8 H100 SXM5 replicas, a typical fleet compression ratio of 1.5x.
AllReduce overhead at TP=8 inside one HGX-H200: 6-9 % of step time for 405B FP8 — identical to H100 because NVLink is unchanged.
Spot/preemptible H200 capacity is viable for fine-tunes but not production inference — eviction rates of 6-12 % per day are typical through 2026.
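Replica counts follow from dividing aggregate token demand by per-replica throughput. The sketch below works through an illustrative load of 100 requests per minute at 4,096 output tokens each; the function name is hypothetical, the 1,500 tokens/s H200 figure is taken from the top of the 70B range in the table below, and the 1,000 tokens/s H100 figure is an assumption.

```python
import math


def replicas_needed(requests_per_second: float, output_tokens_per_request: int,
                    replica_tps: float, target_utilisation: float = 1.0) -> int:
    """Replicas required to sustain the aggregate output-token demand."""
    demand_tps = requests_per_second * output_tokens_per_request
    return math.ceil(demand_tps / (replica_tps * target_utilisation))


rps = 100 / 60  # 100 requests per minute
h200_replicas = replicas_needed(rps, 4096, replica_tps=1500)
h100_replicas = replicas_needed(rps, 4096, replica_tps=1000)  # assumed H100 replica rate
print(h200_replicas, h100_replicas)
```

Lowering target_utilisation below 1.0 adds headroom for bursts; at 0.8 the same H200 load needs six replicas instead of five.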

Model size	Precision	Context	GPUs per replica	TP / PP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	FP8	8K	1x H200	1 / 1	5,500-7,000	125 GB free
13B	FP8	8K	1x H200	1 / 1	3,800-4,800	115 GB free
34B (Yi, Codestral)	FP8	8K	1x H200	1 / 1	1,900-2,400	90 GB free
70B (Llama 3.1)	FP8	8K	1x H200	1 / 1	1,150-1,500	75 GB free
70B (Llama 3.1)	FP8	32K	1x H200	1 / 1	1,400-1,800	60 GB free
70B (Llama 3.1)	FP8	128K	1x H200	1 / 1	1,500-1,900	30 GB free
140B MoE (Mixtral 8x22B)	FP8	32K	2x H200	2 / 1	1,500-1,900	~60 GB free per rank
180B (Falcon, Bloom)	FP8	8K	2x H200	2 / 1	700-900	50 GB free per rank
405B (Llama 3.1)	FP8	32K	8x H200	8 / 1	400-500	85 GB free per rank
405B (Llama 3.1)	FP8	128K	8x H200	8 / 1	450-550	50 GB free per rank

Cost & TCO

Market-clearing H200 pricing through 2026 is roughly 30 % higher than H100 SXM5 at the same commitment tier. The premium is almost always paid back on memory-bound serving where H200 reduces the replica count. The honest test: if your workload uses a single H100 today at < 60 % of its 80 GB HBM and prefill is not the bottleneck, H200 does not pay back. If it uses 2-4x H100 SXM5 in TP today, H200 usually does.

Cost-per-million-output-tokens on Llama 3.1 70B FP8 at 32K context, 1x H200 SXM at $4.25/GPU-hr and 1,600 TPS sustained: roughly $0.74 per million tokens — 47 % cheaper than the 4x H100 TP=4 baseline ($1.40) it replaces.
Commitment savings track H100: 1y reserved ~= 25 % off on-demand, 3y reserved ~= 40 % off — only commit when steady-state utilisation exceeds 65 %.
FP8 is the default — FP16 leaves roughly 1.6x throughput on the table at the same hourly rate.
Egress and inter-region data movement frequently exceed 10 % of the H200 bill at hyperscalers — collocate model artefacts with compute.
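One way to sanity-check the commitment threshold is a break-even utilisation: reserved capacity is billed around the clock, while on-demand capacity is billed only while it runs, so a reservation wins once utilisation exceeds the ratio of the two rates. This is a simplified sketch with a hypothetical function name; it assumes on-demand GPUs can actually be released when idle and ignores capacity-availability risk.

```python
def break_even_utilisation(reserved_rate: float, on_demand_rate: float) -> float:
    """Utilisation above which a 24/7 reservation is cheaper than on-demand."""
    return reserved_rate / on_demand_rate


one_year = break_even_utilisation(3.20, 4.25)
three_year = break_even_utilisation(2.55, 4.25)
print(f"1y break-even ~{one_year:.0%}, 3y break-even ~{three_year:.0%}")
```

Under these assumptions the 65 % rule of thumb sits between the three-year and one-year break-even points.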

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Hyperscaler (AWS p5e / GCP a3-ultra / Azure ND H200 v5)	H200 SXM	$4.25	$3.20	$2.55	Best for hybrid stacks; data-egress costs matter.
Hyperscaler	H200 PCIe	$3.40	$2.55	$2.05	Fewer regions; not all instances support NVLink topology.
Tier-1 neocloud	H200 SXM	$3.80	$2.95	$2.40	Commonly cheapest at scale; verify NVLink topology.
Tier-2 neocloud	H200 SXM	$3.20	$2.60	$2.15	Best raw rate; expect more variance in IB topology.
Spot/preemptible	H200 SXM	$1.75-2.40	n/a	n/a	6-12 % eviction/day; fine-tunes only.
Yobitel NeoCloud (UK + EU)	H200 SXM	$3.60-3.90	$2.80-3.10	$2.30-2.55	NCSC OFFICIAL-aligned regions; FOCUS-conformant billing.
Yobitel Omniscient Compute	H200 SXM multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

Note: All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=AcceleratorCompute, ChargeCategory=Usage, SkuId=gpu.h200.sxm.

Migration and alternatives

When H200 is the right choice and when it isn't. The dominant migration is H100 -> H200, and it is the cheapest GPU upgrade NVIDIA has shipped this decade: identical software stack, identical chassis, identical drivers — just larger batches and larger KV-cache budgets in the serving config.

Two heuristics. First: if the H100 workload is FLOPS-bound (training a 7B-34B model with full activation checkpointing, or prefill-heavy serving with short outputs), H200 offers nothing — stay on H100. Second: if the H100 workload is memory-bound (large dense models, long context, MoE serving) or splits across multiple H100s in TP, H200 typically pays back in 2-3 months on cost-per-token alone.
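The two heuristics can be written down as a small decision helper. This is a sketch under simplifying assumptions: the function and parameter names are hypothetical, and the 60 % HBM-utilisation mark from the Cost & TCO section is treated as a hard cutoff for single-GPU workloads, which real sizing exercises would soften.

```python
def h200_likely_pays_back(tp_degree: int, hbm_utilisation: float,
                          flops_bound: bool) -> bool:
    """First-pass screen for an H100 -> H200 move; not a substitute for benchmarking."""
    if flops_bound:
        return False          # identical FLOPS, so no uplift
    if tp_degree >= 2:
        return True           # multi-GPU TP replicas can usually be compressed
    return hbm_utilisation >= 0.60


print(h200_likely_pays_back(tp_degree=4, hbm_utilisation=0.9, flops_bound=False))
print(h200_likely_pays_back(tp_degree=1, hbm_utilisation=0.4, flops_bound=False))
print(h200_likely_pays_back(tp_degree=4, hbm_utilisation=0.9, flops_bound=True))
```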

From / to	When it pays	Migration effort	Key incompatibility
H100 SXM5 -> H200 SXM	Memory-bound serving; long context; MoE; TP=4 -> TP=1 compression	Trivial — same chassis, same software, retune batch	None — same GH100 silicon
H100 SXM5 -> H200 SXM (training)	Activation-heavy training only; otherwise no FLOPS uplift	Trivial; retune microbatch	None
H100 PCIe -> H200 PCIe	Drop-in retrofit; same memory uplift in PCIe chassis	Trivial	None
A100 -> H200	Want HBM3e + FP8 + TMA in one upgrade	Medium — Hopper kernels, FP8 calibration	sm_80 vs sm_90a
H200 -> B200	Need FP4 throughput or 8 TB/s bandwidth	Medium (CUDA 12.8+, FP4 quant, MX formats)	New software stack; rebuild engines
H200 -> MI300X	Need ROCm or NVIDIA-alternative supply	High — CUDA -> ROCm rewrite	CUDA kernels not portable; vLLM ROCm gap
H200 -> GB200 NVL72	Frontier training at rack scale	Very high — pod-as-unit topology	Liquid cooling, FP4 software stack

Pitfalls and operational notes

Most operational issues specific to H200 fleets are HBM3e-related, with a smaller set of software-tuning surprises that appear when teams lift a configuration straight off an H100 fleet without retuning. The headline pattern is that the silicon is unchanged, but the thermal envelope and memory budget both shifted enough to bite teams that assumed parity.

Thermal envelope is tighter than it looks on paper. Sustained 4.8 TB/s HBM3e traffic generates more localised heat in the stack than HBM3 did, and several early production fleets hit sustained throttling above 700 W on direct-to-chip loops that worked fine on H100. Coolant supply below 25 C is the operational target; if the supply is hot, dropping the TDP cap to 600 W via the OEM BMC usually recovers stable throughput. The HBM-die temperature series (DCGM_FI_DEV_MEMORY_TEMP) is the canonical leading indicator; alert at > 92 C.
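The 92 C alert can be prototyped against the text exposition that a DCGM Prometheus exporter serves. The sketch below is an assumption-laden example: the sample lines and label set are hypothetical, the function name is illustrative, and a production alert would normally live in the monitoring system's own rule language.

```python
import re

# Hypothetical scrape output; real exporters attach more labels per series
SAMPLE_METRICS = '''\
DCGM_FI_DEV_MEMORY_TEMP{gpu="0",UUID="GPU-aaaa"} 88
DCGM_FI_DEV_MEMORY_TEMP{gpu="1",UUID="GPU-bbbb"} 94
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-bbbb"} 71
'''

_HBM_TEMP = re.compile(
    r'^DCGM_FI_DEV_MEMORY_TEMP\{[^}]*gpu="(\d+)"[^}]*\}\s+([0-9.]+)', re.M)


def hot_hbm_gpus(metrics_text: str, threshold_c: float = 92.0) -> dict[int, float]:
    """Map GPU index -> HBM temperature for every GPU above the threshold."""
    return {int(gpu): float(temp)
            for gpu, temp in _HBM_TEMP.findall(metrics_text)
            if float(temp) > threshold_c}


print(hot_hbm_gpus(SAMPLE_METRICS))
```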

Out-of-memory errors despite the 141 GB pool are almost always a configuration carry-over. vLLM defaults sized for the 80 GB H100 leave the larger pool unused — raise --max-num-batched-tokens, raise --max-num-seqs, push --gpu-memory-utilization to 0.92, and confirm the KV-cache dtype is fp8_e5m2. The HBM ceiling for safe steady-state is weights + KV cache + activations + cuBLAS scratch below 138 GB usable; above that, paged KV will not save the workload.
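The 138 GB ceiling is easy to check before launching a configuration. A minimal sketch: the function name is hypothetical, and the component sizes in the two examples are illustrative placeholders rather than measurements.

```python
def fits_on_h200(weights_gb: float, kv_cache_gb: float, activations_gb: float,
                 scratch_gb: float, ceiling_gb: float = 138.0) -> tuple[float, bool]:
    """Return (total GB, whether the total stays under the usable HBM ceiling)."""
    total = weights_gb + kv_cache_gb + activations_gb + scratch_gb
    return total, total < ceiling_gb


# Illustrative budgets for a 70B FP8 replica
ok_budget = fits_on_h200(weights_gb=70, kv_cache_gb=55, activations_gb=8, scratch_gb=3)
over_budget = fits_on_h200(weights_gb=70, kv_cache_gb=64, activations_gb=8, scratch_gb=3)
print(ok_budget, over_budget)
```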

NVLink 4.0 cabling behaves slightly differently under H200's higher sustained decode traffic. Mezzanine-connector seating sensitivities that did not surface on H100 do surface on H200; nvidia-smi nvlink --status plus a drain-and-reseat resolves most cases, and persistent failures across reseats point at the baseboard, not the GPU. NCCL collectives that suddenly under-perform usually trace to a stale NCCL version not aware of the higher HBM bandwidth — upgrade to NCCL 2.21+, set NCCL_NVLS_ENABLE=1, and verify with NCCL_DEBUG=INFO that NVLink SHARP is selected.

Three quieter surprises round out the list. FP8 calibration regressions after an H100 -> H200 swap usually mean the Transformer Engine amax history is too short for the larger batch sizes H200 now sustains: raise fp8_amax_history_len from 1024 to 4096 and keep fp8_format=HYBRID. MIG slice creation fails when the H100 profile names are passed through unchanged; H200 exposes 1g.18gb as its smallest profile instead of H100's 1g.10gb, so list profiles with nvidia-smi mig -lgip first. And early-life ECC double-bit errors run higher than on H100 fleets: quarantine and RMA affected cards, since row-remap counts climbing past 50 total predict card failure within weeks. If the headline complaint is 'throughput identical to H100 on a training run', the workload is compute-bound (verify with DCGM_FI_PROF_DRAM_ACTIVE sitting below 50 %) and H200 is the wrong SKU, so stay on H100.

Confidential Compute mode on H200 mirrors H100: AES-256-GCM PCIe + HBM3e page sealing with SPDM attestation to NVIDIA NRAS, ~3-7 % throughput penalty when CC-on. MIG slices are larger by capacity (the 1g.18gb minimum replaces H100's 1g.10gb) but otherwise identical in semantics. Both modes are toggled through the driver, and a change takes effect only after a GPU reset or reboot.

Where this fits in the Yobitel stack

H200 is the default 'serving GPU' across the Yobitel stack from 2025 onward. Yobibyte — our AI-native managed platform — places memory-pressured inference workloads (70B+ dense, long context, MoE) on H200 pools by default, falling back to H100 only when the workload is FLOPS-bound or H200 capacity is tight in the requested sovereignty region. The vLLM serving config in this entry is exactly what Yobibyte reconciles under the hood on the customer's behalf; the customer specifies model name, region, replica count and spend cap, and the platform selects the SKU.

Omniscient Compute — our cross-cloud capacity broker — indexes H200 SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud, normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency posture. When you ask Yobitel for 8x H200 SXM in the UK sovereign region, Omniscient Compute is the layer that finds it.

InferenceBench — our public, reproducible benchmarking harness — publishes H200 throughput, latency and cost-per-token numbers for every major open-weight model across vLLM, TensorRT-LLM, SGLang and TGI. The sizing tables above are anchored on InferenceBench runs; production numbers your team will see in steady state are typically within 10 % of the published figures. If you are sizing a 2026 H200 footprint, start with InferenceBench, lift the platform configuration into the Yobibyte workspace, and let Omniscient Compute pick the region.

References

NVIDIA H200 Tensor Core GPU Datasheet · NVIDIA
HGX H200 Platform Brief · NVIDIA
Hopper Architecture Whitepaper · NVIDIA
Transformer Engine User Guide · NVIDIA
DCGM Field Identifiers (Prometheus exporter) · NVIDIA
vLLM FP8 quantisation on Hopper · vLLM
TensorRT-LLM Hopper engines · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC

TL;DR

Mid-cycle Hopper refresh announced November 2023, volume shipping Q2 2024 — same GH100 silicon as H100 (132 SMs, 528 Tensor cores, sm_90a, NVLink 4.0, Transformer Engine) with the HBM stack upgraded from 80 GB HBM3 to 141 GB HBM3e at 4.8 TB/s.
Compute throughput is identical to H100 by design: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse. Training FLOPS for compute-bound runs are unchanged — the upgrade only earns its premium on memory-bound workloads.
Headline inference win: Llama-3.1 405B in FP8 fits on 8x H200 SXM5 with TP=8 at 128K context with KV-cache headroom — the smallest topology that runs the full frontier model in one NVLink domain.
Drop-in upgrade for any existing H100 fleet — same chassis, same baseboard footprint, same NVIDIA driver R535+ branch, same CUDA 12.4+, same vLLM / TensorRT-LLM / Megatron stack. Re-tune batch size and KV budget; everything else carries over.
Market clearing pricing through 2026: roughly $4.25/GPU-hr on-demand, $3.20 one-year reserved, $2.55 three-year reserved, $1.75 spot — a ~30 % premium over H100 SXM5 that is usually paid back in fewer GPUs per replica on memory-bound serving.

Overview

How it works: the HBM3e uplift and what is identical

Silicon: GH100 die, sm_90a, 132 SMs, 528 Tensor cores — byte-for-byte identical to H100 SXM5.
Memory: 141 GB HBM3e at 4.8 TB/s (six 24 GB stacks, one disabled for yield).
Per-pin signalling: 6.4 Gbps (vs 5.2 Gbps on H100 HBM3).
Tensor cores: 989 TFLOPS BF16 dense, 1,979 TFLOPS BF16 sparse, 3,958 TFLOPS FP8 sparse — identical to H100.
NVLink: 900 GB/s aggregate over 18 NVLink 4.0 ports — identical to H100.
Confidential Compute (CC-on) mode: AES-256-GCM PCIe + HBM page sealing, SPDM attestation — identical to H100.

Subsystem	H100 SXM5	H200 SXM	Delta
Memory capacity	80 GB HBM3	141 GB HBM3e	+76 %
Memory bandwidth	3.35 TB/s	4.8 TB/s	+43 %
Per-pin signalling	5.2 Gbps	6.4 Gbps	+23 %
SMs / Tensor cores	132 / 528	132 / 528	Identical
FP8 Tensor (sparse)	3,958 TFLOPS	3,958 TFLOPS	Identical
BF16 Tensor (sparse)	1,979 TFLOPS	1,979 TFLOPS	Identical
NVLink aggregate	900 GB/s	900 GB/s	Identical
TDP	700 W	700 W (700 W default; up to 1000 W H200 'extreme' config)	+0-300 W depending on SKU

Reference: full specification sheet

Metric	H200 SXM	H200 PCIe Gen5	H200 NVL (pair)
Architecture	Hopper GH100	Hopper GH100	Hopper GH100 x2
Process	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	80 billion	80 billion	160 billion (pair)
SMs	132	114	132 x 2
Tensor cores	528	456	528 x 2
L2 cache	60 MB	50 MB	60 MB x 2
Compute capability	sm_90 / sm_90a	sm_90 / sm_90a	sm_90 / sm_90a
FP64 (Tensor)	67 TFLOPS	51 TFLOPS	134 TFLOPS
TF32 (Tensor, sparse)	989 TFLOPS	756 TFLOPS	1,978 TFLOPS
BF16 / FP16 (Tensor, sparse)	1,979 TFLOPS	1,513 TFLOPS	3,958 TFLOPS
FP8 (Tensor, sparse)	3,958 TFLOPS	3,026 TFLOPS	7,916 TFLOPS
INT8 (Tensor, sparse)	3,958 TOPS	3,026 TOPS	7,916 TOPS
Memory	141 GB HBM3e	141 GB HBM3e	282 GB HBM3e (141 GB per board)
Memory bandwidth	4.8 TB/s	4.8 TB/s	9.6 TB/s aggregate
NVLink	900 GB/s (4.0, 18 ports)	600 GB/s (bridge)	900 GB/s board-to-board
PCIe	Gen5 x16 (128 GB/s)	Gen5 x16 (128 GB/s)	Gen5 x16 per board
TDP	700 W default (configurable 600-1000 W)	600 W	2 x 600 W
MIG instances	Up to 7	Up to 7	Up to 7 per board
Confidential Compute	Yes (CC-on attested)	Yes	Yes
Form factor	SXM mezzanine	FHFL dual-slot PCIe	Dual FHFL PCIe + bridge
Minimum driver	R535+ (R550+ recommended)	R535+	R550+
Minimum CUDA	12.2 (12.4+ for full TE)	12.2	12.4

Note: Sparse Tensor numbers assume 2:4 structured sparsity. Real LLM serving rarely sustains this — quote dense numbers in capacity plans (half the sparse figure) and treat sparse as a marketing ceiling.

Workload pattern A: Llama 3.1 405B at 128K context on 8x H200

# 405B serving on 8x H200 SXM5 (single HGX baseboard, TP=8)
# Requirements: vllm 0.6.3+, CUDA 12.4+, driver R550+, HGX-H200 baseboard
pip install "vllm==0.6.3" "torch==2.4.0"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NCCL_P2P_LEVEL=NVL \
NCCL_NVLS_ENABLE=1 \
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000

# Smoke-test a 100K-token prefill + 1K-token generation
python -c "
import openai
client = openai.OpenAI(base_url='http://localhost:8000/v1', api_key='x')
import json, pathlib
prompt = pathlib.Path('long_context_100k.txt').read_text()
r = client.chat.completions.create(
  model='meta-llama/Meta-Llama-3.1-405B-Instruct-FP8',
  messages=[{'role':'user','content': prompt + '\nSummarise.'}],
  max_tokens=1024,
)
print(r.choices[0].message.content)
"

Warning: Pattern A gotcha: at TP=8, NCCL AllReduce on every decode step is the dominant inter-GPU traffic. The eight GPUs MUST share one HGX-H200 baseboard — verify with nvidia-smi topo -m that all pairs show NV# (NVLink), not SYS (PCIe through host). A single rank crossing onto a second baseboard over InfiniBand drops decode TPS by 60-80 %.

Workload pattern B: 70B at 128K on a single H200

Single-card replica eliminates inter-GPU collectives — decode tail latency improves 30-50 % at p99.
Concurrent sessions at 128K: 24-32 on a single H200 vs 16-24 on 4x H100 TP=4.
Quadruples the number of replicas per HGX baseboard (8 replicas on H200 vs 2 replicas on H100).

# 70B at 128K on 1x H200 SXM5 — the single-card replica
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 131072 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --host 0.0.0.0 --port 8000

# Compare cost-per-token vs the 4x H100 topology this replaces:
#   4x H100 SXM5 @ $2.50/GPU-hr = $10.00/replica-hr -> ~$2.78/M tokens
#   1x H200 SXM  @ $4.25/GPU-hr = $4.25/replica-hr  -> ~$1.18/M tokens
# H200 wins on cost-per-token even at the premium hourly rate.

Workload pattern C: 70B QLoRA fine-tune at extended context

# train.py — 70B QLoRA on 2x H200 SXM, 16K sequences
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#                   "bitsandbytes>=0.43" "accelerate>=0.34"
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID); tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto",
    attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                  target_modules="all-linear", bias="none", task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora,
    train_dataset=load_dataset("json", data_files="s3://my/data/*.jsonl", split="train"),
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora-h200",
        num_train_epochs=3,
        per_device_train_batch_size=4,         # 2x the H100 setting
        gradient_accumulation_steps=8,         # global batch 64 on 2 GPUs
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=16384,       # 4x H100 setting
        logging_steps=10, save_steps=500,
    ),
)
trainer.train()
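
The batch settings in the script multiply out as follows. This is a small sanity check rather than part of the training code. It assumes one data-parallel process per GPU, and the helper names are hypothetical.

```python
def global_batch(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * num_gpus


def max_tokens_per_step(global_batch_size: int, max_seq_length: int) -> int:
    """Upper bound on tokens per optimizer step (every sequence at full length)."""
    return global_batch_size * max_seq_length


batch = global_batch(per_device_batch=4, grad_accum_steps=8, num_gpus=2)
tokens = max_tokens_per_step(batch, max_seq_length=16384)

print(batch)   # 64
print(tokens)  # 1048576
```

Real batches are usually shorter than the 16K cap, so the token count is a ceiling. It is the number to keep constant if the per-device batch or the accumulation steps are retuned.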

Sizing and capacity planning

Training rule of thumb: compute-bound runs (large batch, short context, gradient accumulation) take the same H200-days as H100-days — the FLOPS are identical. Only activation-heavy training (long sequences, large microbatches without checkpointing) sees uplift from H200.
Memory ceiling for a single H200: weights + KV cache + activations + cuBLAS scratch < 138 GB usable. Above 138 GB expect OOMs even with paged KV.
For 500 RPS at 4K-token output, Llama 3.1 70B FP8 needs roughly 4-5 H200 SXM replicas vs 6-8 H100 SXM5 replicas — typical fleet compression ratio of 1.5x.
AllReduce overhead at TP=8 inside one HGX-H200: 6-9 % of step time for 405B FP8 — identical to H100 because NVLink is unchanged.
Spot/preemptible H200 capacity is viable for fine-tunes but not production inference — eviction rates of 6-12 % per day are typical through 2026.
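
The memory-ceiling rule above can be checked before launching a server. The sketch below estimates weights plus KV cache for Llama 3.1 70B in FP8 and compares the total against the 138 GB usable figure. The model shape (80 layers, 8 KV heads, head dimension 128) comes from the public model config rather than from this article, the 70 GB weight figure is an approximation of one byte per parameter, and activations and cuBLAS scratch are left as a caller-supplied term.

```python
USABLE_GB = 138.0  # usable capacity of a single H200, per the rule above


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int) -> float:
    """KV cache size in GB (10^9 bytes); the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


def headroom_gb(weights_gb: float, kv_gb: float, other_gb: float = 0.0) -> float:
    """Capacity left after weights, KV cache, and any activations or scratch."""
    return USABLE_GB - (weights_gb + kv_gb + other_gb)


# Assumed shape for Llama 3.1 70B; FP8 weights and FP8 KV cache (1 byte per element).
weights = 70.0  # approximate: ~70B parameters x 1 byte
kv = kv_cache_gb(tokens=131072, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)

print(round(kv, 1))                         # 21.5
print(round(headroom_gb(weights, kv), 1))   # 46.5
```

The KV term here covers a single full-length 128K sequence. Concurrent sequences, activations, and scratch buffers reduce the headroom further, so a negative result is a reliable warning but a positive one is only a first pass.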

Model size	Precision	Context	GPUs per replica	TP / PP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	FP8	8K	1x H200	1 / 1	5,500-7,000	125 GB free
13B	FP8	8K	1x H200	1 / 1	3,800-4,800	115 GB free
34B (Yi, Codestral)	FP8	8K	1x H200	1 / 1	1,900-2,400	90 GB free
70B (Llama 3.1)	FP8	8K	1x H200	1 / 1	1,150-1,500	75 GB free
70B (Llama 3.1)	FP8	32K	1x H200	1 / 1	1,400-1,800	60 GB free
70B (Llama 3.1)	FP8	128K	1x H200	1 / 1	1,500-1,900	30 GB free
140B MoE (Mixtral 8x22B)	FP8	32K	1x H200	1 / 1	1,500-1,900	35 GB free
180B (Falcon, Bloom)	FP8	8K	2x H200	2 / 1	700-900	50 GB free per rank
405B (Llama 3.1)	FP8	32K	8x H200	8 / 1	400-500	85 GB free per rank
405B (Llama 3.1)	FP8	128K	8x H200	8 / 1	450-550	50 GB free per rank

Cost & TCO

Cost-per-million-output-tokens on Llama 3.1 70B FP8 at 32K context, 1x H200 SXM at $4.25/GPU-hr and 1,600 TPS sustained: roughly $0.74 per million tokens — 47 % cheaper than the 4x H100 TP=4 baseline ($1.40) it replaces.
Commitment savings track H100: 1y reserved ~= 25 % off on-demand, 3y reserved ~= 40 % off — only commit when steady-state utilisation exceeds 65 %.
FP8 is the default — FP16 leaves roughly 1.6x throughput on the table at the same hourly rate.
Egress and inter-region data movement frequently exceed 10 % of the H200 bill at hyperscalers — collocate model artefacts with compute.
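
The figures in this section follow from the hourly rate and sustained throughput. The sketch below reproduces the $0.74 per million tokens and the 47 % saving, then applies the same calculation to the reserved rates quoted in this article. The per-commitment costs it prints are derived here, not quoted from the source, and the function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def usd_per_million(replica_hourly_usd: float, tps: float) -> float:
    """USD per million output tokens at a sustained tokens-per-second rate."""
    return replica_hourly_usd / (tps * SECONDS_PER_HOUR) * 1e6


ON_DEMAND = 4.25      # $/GPU-hr, 1x H200 SXM
TPS = 1600            # sustained output tokens/s quoted above
H100_BASELINE = 1.40  # $/M tokens for the 4x H100 TP=4 baseline quoted above

cost = usd_per_million(ON_DEMAND, TPS)
saving = 1 - cost / H100_BASELINE
print(f"${cost:.2f}/M tokens, {saving:.0%} cheaper")  # $0.74/M tokens, 47% cheaper

# Reserved rates from this article; discount and $/M tokens are derived.
rates = {"on-demand": 4.25, "1y reserved": 3.20, "3y reserved": 2.55}
for name, rate in rates.items():
    discount = 1 - rate / ON_DEMAND
    print(f"{name}: {discount:.0%} off, ${usd_per_million(rate, TPS):.2f}/M tokens")
```

The reserved-rate rows only apply if the replica actually sustains the quoted throughput for the whole commitment, which is why the utilisation threshold above matters.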

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Hyperscaler (AWS p5e / GCP a3-ultra / Azure ND H200 v5)	H200 SXM	$4.25	$3.20	$2.55	Best for hybrid stacks; data-egress costs matter.
Hyperscaler	H200 PCIe	$3.40	$2.55	$2.05	Fewer regions; not all instances support NVLink topology.
Tier-1 neocloud	H200 SXM	$3.80	$2.95	$2.40	Commonly cheapest at scale; verify NVLink topology.
Tier-2 neocloud	H200 SXM	$3.20	$2.60	$2.15	Best raw rate; expect more variance in IB topology.
Spot/preemptible	H200 SXM	$1.75-2.40	n/a	n/a	6-12 % eviction/day; fine-tunes only.
Yobitel NeoCloud (UK + EU)	H200 SXM	$3.60-3.90	$2.80-3.10	$2.30-2.55	NCSC OFFICIAL-aligned regions; FOCUS-conformant billing.
Yobitel Omniscient Compute	H200 SXM multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

Note: All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=AcceleratorCompute, ChargeCategory=Usage, SkuId=gpu.h200.sxm.
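
As an illustration of how those FOCUS fields might be used, the sketch below sums H200 usage charges from a FOCUS-style CSV export. The sample rows, the other SKU identifiers, and the amounts are invented for the example. Only the three field values named in the note come from this article, and `BilledCost` is assumed to be the cost column in the export.

```python
import csv
import io

# Illustrative data only: a tiny FOCUS-style export with made-up amounts.
SAMPLE_EXPORT = """ServiceName,ChargeCategory,SkuId,BilledCost
AcceleratorCompute,Usage,gpu.h200.sxm,102.00
AcceleratorCompute,Usage,gpu.h100.sxm,60.00
ObjectStorage,Usage,storage.standard,3.50
AcceleratorCompute,Usage,gpu.h200.sxm,51.00
"""


def h200_usage_cost(rows) -> float:
    """Sum BilledCost over rows matching the H200 SXM usage fields."""
    return sum(
        float(row["BilledCost"])
        for row in rows
        if row["ServiceName"] == "AcceleratorCompute"
        and row["ChargeCategory"] == "Usage"
        and row["SkuId"] == "gpu.h200.sxm"
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_EXPORT)))
total = h200_usage_cost(rows)
print(total)  # 153.0
```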

Migration and alternatives

From / to	When it pays	Migration effort	Key incompatibility
H100 SXM5 -> H200 SXM	Memory-bound serving; long context; MoE; TP=4 -> TP=1 compression	Trivial — same chassis, same software, retune batch	None — same GH100 silicon
H100 SXM5 -> H200 SXM (training)	Activation-heavy training only; otherwise no FLOPS uplift	Trivial; retune microbatch	None
H100 PCIe -> H200 PCIe	Drop-in retrofit; same memory uplift in PCIe chassis	Trivial	None
A100 -> H200	Want HBM3e + FP8 + TMA in one upgrade	Medium — Hopper kernels, FP8 calibration	sm_80 vs sm_90a
H200 -> B200	Need FP4 throughput or 8 TB/s bandwidth	Medium — CUDA 12.8+, FP4 quant, MX formats	New software stack; rebuild engines
H200 -> MI300X	Need ROCm or NVIDIA-alternative supply	High — CUDA -> ROCm rewrite	CUDA kernels not portable; vLLM ROCm gap
H200 -> GB200 NVL72	Frontier training at rack scale	Very high — pod-as-unit topology	Liquid cooling, FP4 software stack

Pitfalls and operational notes

Where this fits in the Yobitel stack

References

NVIDIA H200 Tensor Core GPU Datasheet · NVIDIA
HGX H200 Platform Brief · NVIDIA
Hopper Architecture Whitepaper · NVIDIA
Transformer Engine User Guide · NVIDIA
DCGM Field Identifiers (Prometheus exporter) · NVIDIA
vLLM FP8 quantisation on Hopper · vLLM
TensorRT-LLM Hopper engines · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC
