TL;DR
- Hopper-architecture data centre GPU (GH100) on TSMC 4N, 80 billion transistors, announced March 2022 and the default training accelerator from 2023 onward; still the most widely benchmarked and software-mature AI GPU in production through 2026.
- Two form factors: SXM5 (700 W TDP, 18-port NVLink 4.0 at 900 GB/s) and PCIe Gen5 (350 W TDP, 600 GB/s NVLink bridge, drop-in for retrofit servers). SXM5 is what fills DGX-H100, HGX-H100, AWS p5, GCP a3, Azure ND H100 v5 and almost every neocloud H100 instance.
- 80 GB HBM3 at 3.35 TB/s; fourth-generation Tensor Cores deliver 989 TFLOPS dense BF16 and 3,958 TFLOPS FP8 with 2:4 sparsity (1,979 TFLOPS dense); the Transformer Engine auto-casts layers between BF16 and FP8 (E4M3/E5M2) with runtime amax tracking.
- NVLink 4.0 + third-gen NVSwitch ASIC scales to 256-GPU NVLink-domain pods with 57.6 TB/s bisection — the substrate every multi-billion-parameter training run shipped between 2023 and 2025 ran on.
- Sizing rule of thumb: Llama 3 70B FP8 fits on 1x H100 with 8K context (no TP); 32K context needs 2x H100 with TP=2; QLoRA fine-tune of 70B fits on 2x H100 SXM5 (~70 GB peak per GPU including optimiser).
Overview
The NVIDIA H100 is the data centre GPU that turned large language models from a research curiosity into an industrial product. Announced at GTC 2022 and shipping in volume from Q4 2022, it pairs the Hopper architecture (GH100, 80 billion transistors on TSMC 4N) with HBM3 memory and a dedicated Transformer Engine — the combination that let teams train 70B-parameter models in weeks rather than months and that defined the cost-per-token economics of the first ChatGPT-era serving fleet.
The headline numbers — 989 TFLOPS BF16, 1,979 TFLOPS FP8 dense, 3,958 TFLOPS FP8 with 2:4 sparsity, 80 GB HBM3, 3.35 TB/s memory bandwidth — only matter alongside the interconnect. NVLink 4.0 and the third-generation NVSwitch ASIC give H100 the lowest-latency, highest-bandwidth fabric of any commodity accelerator. That fabric is what made the H100 era distinct from the A100 era: not just more FLOPS per GPU, but a way to make 256 GPUs behave like one. By 2026, H100 capacity is broadly available across every hyperscaler, every NVIDIA-Partner neocloud and most regional sovereign clouds; pricing has compressed from $4-8/GPU-hour in 2023 to $1.10-3.00/GPU-hour, making it frequently the best price-per-training-token GPU NVIDIA ships.
This entry is the reference for teams operating H100 at scale: full spec sheet, the sizing tables we use internally on InferenceBench, the DCGM signals to alert on, the FinOps levers that move the needle, the migration paths to and from neighbouring SKUs, and the troubleshooting playbook for the issues every team eventually hits. Yobitel NeoCloud offers H100 SXM5 capacity in UK and EU regions with NCSC OFFICIAL alignment, NVLink-locality-aware placement, and FOCUS-conformant billing — most teams reading this entry consume H100 either through NeoCloud directly or through Yobibyte's managed inference workspaces. This entry helps you decide when H100 is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.
Quick start
The shortest path from zero to a running H100 today. Three equivalent routes are shown below: an AWS p5.48xlarge (8x H100 SXM5) launched via the EC2 API, a GCP a3-highgpu-8g (8x H100 SXM5) launched via gcloud, and a bare-metal/colo path that exposes existing H100 nodes to Kubernetes via the NVIDIA GPU Operator. Pick whichever matches your fleet, then jump to Workload pattern A to serve Llama 3 70B on the GPUs you just provisioned.
# --- Route 1: AWS p5.48xlarge (8x H100 SXM5, 3.2 Tb/s EFA) ---
# Requires the "Running On-Demand P instances" service quota uplifted from 0.
aws ec2 run-instances \
  --region eu-west-2 \
  --image-id ami-0abcdef1234567890 \
  --instance-type p5.48xlarge \
  --key-name my-key \
  --subnet-id subnet-0abc... \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=h100-train-01}]'
# Verify the 8 H100s are visible once the host boots
aws ssm start-session --target i-0abc... \
  --document-name AWS-StartInteractiveCommand \
  --parameters command="nvidia-smi -L"
# --- Route 2: GCP a3-highgpu-8g (8x H100 SXM5) ---
gcloud compute instances create h100-train-01 \
  --project=my-project \
  --zone=europe-west2-a \
  --machine-type=a3-highgpu-8g \
  --image-family=common-cu124-debian-12 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=500GB --boot-disk-type=pd-ssd \
  --metadata="install-nvidia-driver=True"
gcloud compute ssh h100-train-01 --zone=europe-west2-a --command='nvidia-smi'
# --- Route 3: Bare-metal / colo K8s — expose existing H100 nodes ---
# Adds the NVIDIA GPU Operator, which installs drivers, container-toolkit,
# DCGM exporter, MIG manager and the device plugin.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
kubectl create namespace gpu-operator
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.9.0 \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true
# Verify the operator brought up driver + plugin + DCGM
kubectl get pods -n gpu-operator
kubectl get nodes -L nvidia.com/gpu.product
# NAME STATUS ROLES GPU.PRODUCT
# h100-node-01 Ready worker NVIDIA-H100-80GB-HBM3
kubectl describe node h100-node-01 | grep -E 'nvidia.com/gpu'
# Capacity: nvidia.com/gpu: 8
# Allocatable: nvidia.com/gpu: 8
On AWS and GCP, the default quota for H100 instance families is zero on new accounts; file the uplift ticket weeks ahead. On bare-metal, the GPU Operator handles driver/container-toolkit/DCGM in one chart. Do not install the host driver manually unless you are pinning a specific R535+/R550+ build, in which case set `driver.enabled=false` so the operator does not try to manage it.
How it works: Hopper architecture and the H100 pipeline
Hopper introduced four innovations over Ampere that justified the generational leap, none of which were headline FLOPS in isolation.
First, the fourth-generation Tensor Core added native FP8 support (E4M3 and E5M2) at twice the throughput of FP16. Paired with the Transformer Engine, a library that tracks a rolling history of per-tensor absolute maxima (amax history), derives FP8 scaling factors from it, and keeps numerically sensitive operations in BF16 or FP32, typical LLM training throughput roughly doubled against A100 BF16 baselines at comparable model quality. E4M3 is used for forward activations and weights, E5M2 for gradients where extended range matters more than precision.
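The amax bookkeeping described above can be sketched in plain Python. This is a toy model of delayed scaling under stated assumptions, not the Transformer Engine API: the class name `AmaxTracker`, the `history_len` and `margin` parameters, and the choice of taking the maximum over the history window are illustrative. The only hard facts it relies on are the largest finite values of the two FP8 formats (448 for E4M3, 57,344 for E5M2).

```python
from collections import deque

# Largest finite magnitudes representable in the two FP8 formats.
FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}


class AmaxTracker:
    """Toy delayed-scaling tracker for one tensor (hypothetical, not the TE API)."""

    def __init__(self, fmt, history_len=16, margin=0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # rolling amax window
        self.margin = margin                      # extra headroom, in powers of two

    def observe(self, values):
        """Record the absolute maximum seen in one step."""
        self.history.append(max(abs(v) for v in values))

    def scale(self):
        """Scale factor that maps the recent amax onto the FP8 maximum."""
        amax = max(self.history, default=0.0)
        if amax == 0.0:
            return 1.0  # nothing observed yet: leave values unscaled
        return self.fp8_max / amax / (2 ** self.margin)


if __name__ == "__main__":
    fwd = AmaxTracker("e4m3")
    fwd.observe([0.5, -2.0])
    fwd.observe([1.0])
    print("E4M3 scale:", fwd.scale())
```

For the same amax history, the E5M2 scale is 128 times larger than the E4M3 one (57,344 / 448), which is one way to see why the wider-range format suits gradients.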
Second, Thread Block Clusters grouped multiple Cooperative Thread Arrays (CTAs) under a unified distributed-shared-memory namespace, letting kernels reuse data across SM groups without round-tripping to HBM. Combined with the new Tensor Memory Accelerator (TMA), a dedicated copy engine that asynchronously moves tensor tiles between HBM and SMEM with descriptor-based addressing, this is what made Flash Attention 3 possible in its published form (Flash Attention 2 predates Hopper and does not depend on TMA). TMA is also why hand-tuned cuBLASLt GEMMs on H100 routinely reach 80-90 % of peak.
Third, DPX instructions accelerated dynamic-programming inner loops — Smith-Waterman sequence alignment, route planning, certain reinforcement-learning search workloads — at up to 7x Ampere throughput.
Fourth, second-generation MIG (Multi-Instance GPU) added confidential-compute boundaries between instances and memory-bandwidth partitioning, letting a single H100 host multi-tenant inference with hardware-enforced isolation. MIG slices on H100 expose a fraction of SMs, a fixed share of HBM (10 GB per 1g.10gb slice up to 80 GB for a full 7g.80gb), and a dedicated share of the NVDEC and NVJPG media engines (H100 has no NVENC).
- GH100 as shipped in H100 SXM5: 132 of the die's 144 Streaming Multiprocessors (SMs) enabled, 528 fourth-generation Tensor Cores, 50 MB L2 cache, 256 KB combined L1/SMEM per SM (about 33 MB across the chip).
- Memory: 5 HBM3 stacks x 16 GB = 80 GB total at 3.35 TB/s on SXM5 (HBM2e on PCIe at 2.0 TB/s).
- Compute capability: sm_90 (sm_90a for the architecture-specific TMA and wgmma intrinsics used by CUTLASS, Flash Attention 3 and Triton's Hopper backend).
- Confidential Compute (CC-on) mode: AES-256-GCM encryption of CPU-GPU transfers over PCIe, with HBM contents protected by hardware access controls rather than memory encryption; attested via SPDM and NVIDIA's attestation service.
| Subsystem | Hopper detail | Practical consequence |
|---|---|---|
| Tensor Core (gen 4) | FP8 E4M3/E5M2, BF16, TF32, INT8 | Transformer Engine routing per-layer; ~2x training throughput vs A100 BF16 at comparable model quality. |
| TMA | Async tensor-tile DMA, descriptor-based | Flash Attention 3, CUTLASS 3.x, Triton Hopper kernels reach 80-90 % of peak. |
| Thread Block Cluster | Up to 16 CTAs share distributed SMEM | Persistent kernels, larger working sets, lower HBM pressure. |
| DPX | Hardware DP inner-loop instructions | Genomics, RL search and graph workloads see 4-7x Ampere uplift. |
| MIG gen 2 | 7 slices, isolated HBM/L2/bandwidth/CC | Hard multi-tenant inference on one card. |
Reference: full specification sheet
Authoritative per-SKU figures. SXM5 fills HGX-H100 baseboards and almost every cloud GPU instance; PCIe Gen5 is the drop-in card for retrofit servers; NVL pairs two PCIe boards via a 600 GB/s bridge with 188 GB HBM3 for memory-pressured inference. All Tensor figures assume 2:4 structured sparsity unless noted; dense throughput is half the sparse figure.
| Metric | H100 SXM5 | H100 PCIe Gen5 | H100 NVL (pair) |
|---|---|---|---|
| Architecture | Hopper GH100 | Hopper GH100 | Hopper GH100 x2 |
| Process | TSMC 4N | TSMC 4N | TSMC 4N |
| Transistors | 80 billion | 80 billion | 160 billion (pair) |
| SMs | 132 | 114 | 132 x 2 |
| Tensor cores | 528 | 456 | 528 x 2 |
| L2 cache | 50 MB | 50 MB | 50 MB x 2 |
| Compute capability | sm_90 / sm_90a | sm_90 / sm_90a | sm_90 / sm_90a |
| FP64 (Tensor) | 67 TFLOPS | 51 TFLOPS | 134 TFLOPS |
| FP32 | 67 TFLOPS | 51 TFLOPS | 134 TFLOPS |
| TF32 (Tensor, sparse) | 989 TFLOPS | 756 TFLOPS | 1,978 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 1,979 TFLOPS | 1,513 TFLOPS | 3,958 TFLOPS |
| FP8 (Tensor, sparse) | 3,958 TFLOPS | 3,026 TFLOPS | 7,916 TFLOPS |
| INT8 (Tensor, sparse) | 3,958 TOPS | 3,026 TOPS | 7,916 TOPS |
| Memory | 80 GB HBM3 | 80 GB HBM2e | 188 GB HBM3 (94 GB per board) |
| Memory bandwidth | 3.35 TB/s | 2.0 TB/s | 7.8 TB/s aggregate |
| NVLink | 900 GB/s (NVLink 4.0, 18 ports) | 600 GB/s (bridge, optional) | 600 GB/s board-to-board bridge |
| PCIe | Gen5 x16 (128 GB/s) | Gen5 x16 (128 GB/s) | Gen5 x16 per board |
| TDP | 700 W (configurable 600-700 W) | 350 W | 2 x 350-400 W |
| MIG instances | Up to 7 | Up to 7 | Up to 7 per board |
| Confidential Compute | Yes (CC-on attested) | Yes | Yes |
| Form factor | SXM5 mezzanine | FHFL dual-slot PCIe | Dual FHFL PCIe + bridge |
| Minimum driver | R525 (R535+ recommended) | R525 | R535+ |
| Minimum CUDA | 12.0 (12.4+ for full TE) | 12.0 | 12.2 |
Sparse Tensor numbers assume 2:4 structured sparsity — half the weights pruned in a fixed pattern. Real training and inference workloads rarely sustain this; dense FP8 throughput is roughly half the listed sparse figure. Quote dense numbers in capacity plans and treat sparse figures as marketing ceilings.
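To make the dense figures usable in a capacity plan, a simple roofline check helps decide whether a kernel is limited by Tensor Core throughput or by HBM bandwidth. The sketch below is illustrative only: it uses the H100 SXM5 dense peaks (half the sparse table values) and the 3.35 TB/s bandwidth, and the function names are hypothetical. Real kernels land below both ceilings.

```python
# Dense peak throughput (FLOP/s) for H100 SXM5: half of the sparse table values.
PEAK_DENSE = {"bf16": 989e12, "fp8": 1979e12}
# HBM3 bandwidth on SXM5, in bytes per second.
HBM_BYTES_PER_S = 3.35e12


def ridge_point(precision):
    """FLOPs per byte of HBM traffic needed before compute becomes the limit."""
    return PEAK_DENSE[precision] / HBM_BYTES_PER_S


def attainable_tflops(flops_per_byte, precision):
    """Roofline ceiling: the lower of the compute and bandwidth limits, in TFLOPS."""
    ceiling = min(PEAK_DENSE[precision], flops_per_byte * HBM_BYTES_PER_S)
    return ceiling / 1e12


if __name__ == "__main__":
    for p in PEAK_DENSE:
        print(f"{p}: ridge at {ridge_point(p):.0f} FLOP/byte")
    # Assumption: batch-1 decode with FP8 weights performs about 2 FLOPs
    # for every weight byte read from HBM.
    print(f"batch-1 FP8 decode ceiling: {attainable_tflops(2, 'fp8'):.1f} TFLOPS")
```

Under the batch-1 assumption the bandwidth ceiling sits far below the Tensor Core peak, which is why decode throughput depends on batching rather than on the headline TFLOPS.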
Interconnect: NVLink 4.0 and the NVSwitch fabric
Every H100 SXM5 module exposes 18 NVLink 4.0 ports, each providing 50 GB/s bidirectional — 900 GB/s aggregate per GPU. That figure alone is interesting; the topology around it is what matters.
An HGX-H100 baseboard places 8 GPUs alongside 4 NVSwitch ASICs, wiring every GPU to every switch. The result is a fully non-blocking 8-GPU shared-memory fabric: any GPU can DMA into any other GPU's HBM at full NVLink bandwidth with no fabric contention. Inside one DGX H100, all-to-all collectives like AllReduce hit 450 GB/s per direction — close to the theoretical NVLink ceiling, and roughly 3x the equivalent A100 figure.
Beyond 8 GPUs, the optional NVLink Switch System extends the same topology to 256-GPU pods via external NVLink switches. The pod delivers 57.6 TB/s of bisection bandwidth — meaningfully faster than InfiniBand NDR (400 Gb/s per port x 256 ports ~= 12.8 TB/s) and the reason hyperscale training clusters increasingly look like 'one giant GPU' rather than 'a cluster of GPUs'. Beyond 256 GPUs the topology switches to InfiniBand or RoCE, and collective performance drops by an order of magnitude — sizing past 256 should account for that step.
- Per-GPU NVLink: 900 GB/s bidirectional (18 ports x 50 GB/s).
- Per-baseboard NVSwitch bisection: 3.6 TB/s (8 GPUs x 450 GB/s per direction).
- NVLink-domain ceiling: 256 GPUs, 57.6 TB/s bisection.
- Above 256 GPUs: InfiniBand NDR/XDR or Spectrum-X RoCE — plan for 5-10x latency uplift on cross-pod collectives.
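The bandwidth figures in this section follow from a few multiplications, reproduced below as a quick sanity check. The constants come from the text above; the 57.6 TB/s pod bisection is taken as quoted rather than derived, and the variable and function names are only illustrative.

```python
NVLINK_PORTS = 18          # NVLink 4.0 ports per H100 SXM5
GB_S_PER_PORT = 50         # bidirectional GB/s per port
GPUS_PER_BASEBOARD = 8     # HGX-H100
POD_BISECTION_GB_S = 57_600  # 256-GPU NVLink Switch System, as quoted above

per_gpu_bidir = NVLINK_PORTS * GB_S_PER_PORT               # GB/s, both directions
per_gpu_one_way = per_gpu_bidir // 2                       # GB/s, one direction
baseboard_gb_s = GPUS_PER_BASEBOARD * per_gpu_one_way      # GB/s across 8 GPUs


def infiniband_aggregate_gb_s(ports=256, gbit_per_port=400):
    """Aggregate InfiniBand NDR bandwidth in GB/s (8 bits per byte)."""
    return ports * gbit_per_port / 8


if __name__ == "__main__":
    ib = infiniband_aggregate_gb_s()
    print(f"per GPU: {per_gpu_bidir} GB/s bidirectional")
    print(f"per baseboard: {baseboard_gb_s / 1000:.1f} TB/s")
    print(f"256-port NDR: {ib / 1000:.1f} TB/s")
    print(f"NVLink pod vs NDR: {POD_BISECTION_GB_S / ib:.1f}x")
```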
Workload pattern A: Llama 3 70B inference at 32K context
Single-replica throughput target, latency-sensitive endpoint. We size to two H100 SXM5 with TP=2 to fit the 70B weights plus a meaningful KV cache budget; FP8 reduces weight memory to ~35 GB per rank, leaving headroom for 16-32 concurrent sessions. The shortest path is `vllm serve` directly on the host — bind the OpenAI-compatible HTTP server, pin the replica to a single HGX baseboard via `CUDA_VISIBLE_DEVICES`, and the smoke-test is a single `curl`.
# 1) Install vLLM with FP8 support (Hopper requires CUDA 12.4+, driver R550+)
pip install "vllm==0.6.3" "torch==2.4.0"
# 2) Pin to the first two GPUs on the same HGX baseboard, then serve.
# vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint on :8000.
CUDA_VISIBLE_DEVICES=0,1 \
NCCL_P2P_LEVEL=NVL \
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000
# 3) Smoke-test the endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role":"user","content":"Summarise NVLink 4.0 in one sentence."}],
    "max_tokens": 128
  }' | jq .
Pattern A gotcha: with TP=2 and 32K context, NCCL AllReduce on the attention output is the dominant inter-GPU traffic. If the two GPUs are on different NUMA nodes or connected over PCIe instead of NVLink, decode TPS collapses by 40-60 %. Always pin the replica to a single HGX baseboard and verify with `nvidia-smi topo -m` that the two devices share an `NV#` link. Note that `NCCL_P2P_LEVEL=NVL` only restricts peer-to-peer transfers to NVLink-connected pairs: NCCL falls back to a slower transport rather than failing, so run once with `NCCL_DEBUG=INFO` to confirm which transport was actually selected.
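The session budget for this pattern can be estimated from the KV cache footprint. The sketch below is a rough calculation, not a vLLM internal: it assumes the published Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128), one byte per FP8 cache element, 80e9 bytes of HBM per GPU, the 0.92 memory-utilisation setting from the command above, and about 70 GB of FP8 weights in total. Activations and CUDA graph memory are ignored, so treat the result as an upper bound.

```python
def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """KV cache bytes for one token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_full_sessions(context_len, gpus=2, hbm_bytes=80e9,
                      mem_util=0.92, weight_bytes=70e9):
    """How many sessions of `context_len` tokens fit in the leftover HBM."""
    kv_budget = gpus * hbm_bytes * mem_util - weight_bytes
    return int(kv_budget // (kv_bytes_per_token() * context_len))


if __name__ == "__main__":
    print("KV bytes per token:", kv_bytes_per_token())
    for ctx in (32768, 16384):
        print(f"{ctx} tokens/session -> {max_full_sessions(ctx)} sessions")
```

Under these assumptions about 14 sessions fit at a full 32K tokens each and about 28 at 16K, which is in line with the 16-32 concurrent sessions quoted above for mixed-length traffic.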
Workload pattern B: 70B QLoRA fine-tune
QLoRA fine-tune of a 70B base model on 2x H100 SXM5 using `transformers` + `peft` + `bitsandbytes` + `trl`, launched with `accelerate launch`. NF4 base weights (~35 GB), BF16 LoRA adapters (~1.7 GB at r=64 on all linear layers), paged AdamW optimiser state for adapters only, gradient checkpointing on every transformer block, Flash Attention 2 (Flash Attention 3 can be selected through `attn_implementation` only on `transformers` releases that ship FA3 support, so check your installed version first). Peak working set lands around 70 GB per GPU at batch 2 / seq 4096.
- Launch on a 2x H100 node: `NCCL_P2P_LEVEL=NVL accelerate launch --num_processes 2 --mixed_precision bf16 train.py`.
- For multi-node: `accelerate launch --multi_gpu --num_machines N --machine_rank R --main_process_ip <head> train.py`, or switch to `torchrun --nproc_per_node 8 --nnodes N`.
- Monitor with `watch -n 2 nvidia-smi` and `tensorboard --logdir out/llama3-70b-qlora` (the event files are binary, so tailing them is not useful).
- For higher-throughput Hopper-tuned kernels, swap the model loader for `unsloth` or wrap the same config in `axolotl` — both compile down to the `transformers` + `peft` primitives above.
# train.py: 70B QLoRA on 2x H100 SXM5
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#   "bitsandbytes>=0.43" "accelerate>=0.34" datasets
# Written against trl 0.11.x; newer trl releases rename `tokenizer` to
# `processing_class` and `max_seq_length` to `max_length`.
import os

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
# accelerate launch starts one process per GPU and sets LOCAL_RANK for each.
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", 0))

# 4-bit NF4 quantisation with BF16 compute and double quantisation.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    # One full NF4 copy per process/GPU (data parallel). device_map="auto"
    # would spread a single copy over both GPUs and clash with the
    # two-process accelerate launch used here.
    device_map={"": LOCAL_RANK},
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", bias="none", task_type="CAUSAL_LM",
)

ds = load_dataset("json", data_files="s3://my-bucket/customer-support-v3/*.jsonl",
                  split="train", streaming=False)

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora, train_dataset=ds,
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,  # global batch 64 on 2 GPUs
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=4096,
        logging_steps=10, save_steps=500, report_to="tensorboard",
    ),
)
trainer.train()
trainer.save_model("./out/llama3-70b-qlora/final")
QLoRA on 2x H100 is faster end-to-end than a full FP16 fine-tune on 8x H100 for adapter-style customisation, at roughly 25 % of the GPU-hour cost. Reach for full fine-tune only when you need to update behaviour outside the LoRA rank budget.
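For the r=64, all-linear configuration in `train.py`, the adapter footprint can be counted directly from the layer shapes. The sketch below assumes the published Llama 3 70B dimensions (hidden size 8192, 1024-wide K/V projections, 28672-wide MLP, 80 blocks) and that `all-linear` covers the seven projection matrices in each block but not the output head; the helper names are illustrative.

```python
HIDDEN, KV_DIM, FFN, LAYERS = 8192, 1024, 28672, 80

# (in_features, out_features) of every linear layer in one transformer block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_param_count(rank):
    """LoRA adds rank * (in + out) parameters per adapted linear layer."""
    per_block = sum(rank * (fan_in + fan_out)
                    for fan_in, fan_out in LINEAR_SHAPES.values())
    return per_block * LAYERS


def adapter_gb(rank, bytes_per_param=2):
    """Adapter size in GB; 2 bytes per parameter corresponds to BF16."""
    return lora_param_count(rank) * bytes_per_param / 1e9


if __name__ == "__main__":
    print(f"r=64: {lora_param_count(64):,} params, {adapter_gb(64):.2f} GB in BF16")
    print(f"r=16: {lora_param_count(16):,} params, {adapter_gb(16):.2f} GB in BF16")
```

The adapter and its optimiser state stay small next to the roughly 35 GB of NF4 base weights, so activation memory at the chosen batch size and sequence length is what drives the per-GPU peak.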
Workload pattern C: Stable Diffusion XL serving
Stable Diffusion XL 1.0 base + refiner at 1024x1024 on a single H100 PCIe with MIG disabled. We use `diffusers` with `torch.compile` for the UNet and a BF16 VAE; the workload is compute-bound rather than memory-bound, which makes the PCIe SKU competitive with SXM5 at a meaningfully lower hourly rate. For absolute peak throughput, a separate offline step compiles the UNet to a TensorRT engine (for example ONNX export followed by `trtexec`; `trtllm-build` targets LLMs, not diffusion UNets), but the `diffusers` path below is what most teams run in production.
# sdxl_server.py: SDXL base + refiner on 1x H100 PCIe
# Deps: pip install "diffusers>=0.30" "transformers>=4.46" \
#   "torch==2.4.0" accelerate safetensors fastapi uvicorn
import io

import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
from fastapi import FastAPI
from fastapi.responses import Response

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
REFINER = "stabilityai/stable-diffusion-xl-refiner-1.0"

# Load the fp16 weight variant and run it in BF16.
base = StableDiffusionXLPipeline.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, variant="fp16", use_safetensors=True,
).to("cuda")
# The refiner reuses the base pipeline's second text encoder and VAE.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    REFINER, torch_dtype=torch.bfloat16,
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    variant="fp16", use_safetensors=True,
).to("cuda")

# Hopper-tuned: SDPA attention is FA2 by default on torch>=2.4
base.unet = torch.compile(base.unet, mode="max-autotune", fullgraph=True)
refiner.unet = torch.compile(refiner.unet, mode="max-autotune", fullgraph=True)

app = FastAPI()


@app.post("/generate")
def generate(prompt: str, steps: int = 25, guidance: float = 7.0):
    # The base model handles the first 80 % of denoising and returns latents.
    latent = base(prompt=prompt, num_inference_steps=steps,
                  guidance_scale=guidance, denoising_end=0.8,
                  output_type="latent").images
    # The refiner finishes the last 20 % and decodes to an image.
    image = refiner(prompt=prompt, num_inference_steps=steps,
                    denoising_start=0.8, image=latent).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

# Run with: uvicorn sdxl_server:app --host 0.0.0.0 --port 8000 --workers 1
SDXL on H100 PCIe with `torch.compile` lands around 1.6 images/second at batch 1 (25 steps, 1024^2). H100 SXM5 is roughly 1.15x faster on the UNet but the PCIe SKU's lower hourly rate usually wins on cost-per-image. Verify with InferenceBench rather than assuming. For the last 1.5-2x on top of `torch.compile`, build a TensorRT 10 engine for the UNet (for example via ONNX export and `trtexec`) and swap `base.unet` for the engine wrapper.
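The cost-per-image comparison can be checked with a few lines of arithmetic. The sketch below takes the 1.6 images/second and 1.15x figures from the text and assumes, for illustration only, hourly rates of $2.10 for PCIe and $2.75 for SXM5 (mid-points of the hyperscaler on-demand ranges listed later in this entry). It also assumes the 1.15x UNet speed-up carries through end to end, which is optimistic for SXM5.

```python
def usd_per_1000_images(usd_per_gpu_hour, images_per_second):
    """Cost of generating 1,000 images on one GPU at a steady rate."""
    images_per_hour = images_per_second * 3600
    return usd_per_gpu_hour / images_per_hour * 1000


PCIE_IMAGES_PER_S = 1.6                       # from the text above
SXM5_IMAGES_PER_S = PCIE_IMAGES_PER_S * 1.15  # assumes the UNet gain is end to end
PCIE_USD_PER_HOUR = 2.10                      # assumed mid-point rate
SXM5_USD_PER_HOUR = 2.75                      # assumed mid-point rate

if __name__ == "__main__":
    pcie = usd_per_1000_images(PCIE_USD_PER_HOUR, PCIE_IMAGES_PER_S)
    sxm5 = usd_per_1000_images(SXM5_USD_PER_HOUR, SXM5_IMAGES_PER_S)
    print(f"PCIe: ${pcie:.3f} per 1,000 images")
    print(f"SXM5: ${sxm5:.3f} per 1,000 images")
```

With these inputs the PCIe card comes out cheaper per image; the conclusion flips if the SXM5 rate falls within about 15 % of the PCIe rate, so substitute your own quotes.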
Sizing and capacity planning
Sizing tables we use internally to scope H100 footprints. All figures assume H100 SXM5, FP8 weights via the Transformer Engine, vLLM 0.6 with paged KV cache and prefix caching, and a healthy NVLink-local placement. Throughput is given in output tokens per second per replica at the listed concurrency; treat these as planning anchors, not contractual SLOs — verify on InferenceBench before production rollout.
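- Training rule of thumb: compute is roughly 6 x parameters x tokens, so 70B parameters x 1 trillion tokens is about 4.2e23 FLOPs. At 35-45 % utilisation of the 989 TFLOPS dense BF16 peak that is roughly 11,000-14,000 H100-days, or about 170-220 days of wall-clock time on a 64-GPU NVLink-domain cluster; Megatron-LM with Transformer Engine FP8 shortens this.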
- Memory ceiling for a single H100: weights + KV cache + activations + cuBLAS scratch < 78 GB. Above 78 GB, expect OOMs even with paged KV — drop precision, shrink context, or move to TP=2.
- AllReduce overhead at TP=8 inside one HGX-H100: ~6-9 % of step time for 70B BF16; jumps to 25-40 % the moment a rank crosses to a second NVLink domain over InfiniBand.
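- Replica count is required output tokens per second divided by per-replica TPS. At the 1,000-1,300 TPS per replica in the table below, 6-8 Llama 3 70B FP8 replicas on H100 SXM5 sustain roughly 6,000-10,000 output tokens/s in aggregate; 500 RPS at 4K output tokens each (about 2 million tokens/s) would need well over a thousand such replicas. Size headroom for prefix-cache cold-start on rollouts.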
- Spot/preemptible H100 capacity is viable for fine-tunes but not for production inference SLAs — eviction rates of 8-15 % per day are typical on hyperscaler spot.
| Model size | Precision | Context | GPUs per replica | TP / PP | Approx output TPS | Approx VRAM headroom |
|---|---|---|---|---|---|---|
| 7B (Mistral, Qwen) | FP8 | 8K | 1x H100 | 1 / 1 | 5,500-7,000 | 60 GB free |
| 13B | FP8 | 8K | 1x H100 | 1 / 1 | 3,800-4,800 | 50 GB free |
| 34B (Yi, Codestral) | FP8 | 8K | 1x H100 | 1 / 1 | 1,900-2,400 | 25 GB free |
| 70B (Llama 3) | FP8 | 8K | 1x H100 | 1 / 1 | 1,000-1,300 | 10-15 GB free |
| 70B (Llama 3) | FP8 | 32K | 2x H100 | 2 / 1 | 1,500-1,900 | 20 GB free per rank |
| 70B (Llama 3) | FP8 | 128K | 4x H100 | 4 / 1 | 1,700-2,200 | 12 GB free per rank |
| 140B MoE (Mixtral 8x22B) | FP8 | 32K | 2x H100 | 2 / 1 | 900-1,200 | 8 GB free per rank |
| 180B (Falcon, Bloom) | FP8 | 8K | 4x H100 | 4 / 1 | 600-800 | 15 GB free per rank |
| 405B (Llama 3.1) | FP8 | 32K | 8x H100 | 8 / 1 | 350-450 | 10 GB free per rank |
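Two of the rules of thumb above reduce to short formulas, sketched here as a back-of-envelope check rather than a planning tool. The training estimate uses the common 6 x parameters x tokens approximation for dense transformers; the 40 % utilisation default and both function names are assumptions for illustration.

```python
import math

SECONDS_PER_DAY = 86_400


def training_gpu_days(params, tokens, peak_flops=989e12, utilisation=0.40):
    """GPU-days to train, using total FLOPs ~= 6 * params * tokens."""
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (peak_flops * utilisation)
    return gpu_seconds / SECONDS_PER_DAY


def replicas_needed(requests_per_s, output_tokens, tps_per_replica, headroom=0.0):
    """Replicas required to sustain the aggregate output token rate."""
    required_tps = requests_per_s * output_tokens * (1 + headroom)
    return math.ceil(required_tps / tps_per_replica)


if __name__ == "__main__":
    days = training_gpu_days(70e9, 1e12)
    print(f"70B on 1T tokens: {days:,.0f} H100-days, "
          f"{days / 64:,.0f} days on 64 GPUs")
    print("2 RPS x 4K tokens at 1,100 TPS/replica:",
          replicas_needed(2, 4096, 1100), "replicas")
```

Plugging in the per-replica TPS from the table row that matches your model and context gives a first replica count; add headroom for rollouts and cold prefix caches before committing capacity.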
Limits and quotas
Default per-account caps you will hit. Hyperscaler quotas are vendor-defined and require support tickets to raise. Plan procurement around lead times of 8-26 weeks for committed H100 capacity in 2026.
| Limit | Default | Ceiling | How to raise |
|---|---|---|---|
| AWS p5.48xlarge (8x H100) on-demand | 0 vCPU baseline | Account-negotiated | Service Quotas -> 'Running On-Demand P instances' |
| AWS p5.48xlarge capacity-block reservation | 0 | Region-negotiated | EC2 Capacity Blocks for ML, 1-182 day windows |
| GCP a3-highgpu-8g region quota | 0 | Org-negotiated | Cloud Console -> IAM & Admin -> Quotas (NVIDIA_H100_GPUS) |
| Azure ND H100 v5 cores per region | 0 | Org-negotiated | Azure portal -> Subscriptions -> Usage + quotas |
| Kubernetes nvidia.com/gpu per pod | node-allocatable | Hardware limit | Node selector + `resources.limits[nvidia.com/gpu]` |
| NVLink-domain size (NVL Switch) | 256 GPUs | Hardware limit | Span domains via InfiniBand; expect ~10x collective latency |
| MIG slices per H100 | 7 (max) | Hardware limit | Repartition; partition changes are destructive |
| NCCL message size in-flight | NCCL defaults | Cluster-tuned | `NCCL_MAX_NCHANNELS`, `NCCL_BUFFSIZE` tuning |
| Confidential Compute mode | Off | Per-card | Driver toggle + attestation service; one-way until reboot |
| TensorRT-LLM engine cache size | 8 GB | Disk-bound | `TRTLLM_CACHE_DIR` + larger PV |
On hyperscalers, a quota of 'zero' is the default for new accounts on every H100 SKU. File the quota uplift ticket weeks before you need capacity; in 2026, average grant time is 3-10 business days for the larger clouds.
Observability
Production H100 observability is built around DCGM (Data Center GPU Manager) exporting Prometheus metrics, plus NCCL traces for collective hotspots and ECC counters for hardware health. The metrics below are the ones we alert on in every Yobitel-managed H100 deployment.
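- DCGM_FI_DEV_GPU_UTIL: fraction of time at least one kernel was running (coarse utilisation, not SM occupancy; DCGM_FI_PROF_SM_OCCUPANCY reports that). Sustained < 60 % under load usually means dataloader stall, not under-provisioning.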
- DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — framebuffer (HBM) used and free, in MiB. Alert when free < 4 GB on inference replicas.
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — fraction of cycles the Tensor Cores are busy. The honest 'is the GPU actually doing math' metric.
- DCGM_FI_PROF_DRAM_ACTIVE — HBM bandwidth utilisation; pair with PIPE_TENSOR_ACTIVE to classify compute-bound vs memory-bound regimes.
- DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL — NVLink throughput per GPU; sudden drops correlate with single-port link-down events.
- DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP — die and HBM temperatures. Alert at die > 83 C (throttle threshold) and HBM > 95 C.
- DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DBE_VOL_TOTAL — single- and double-bit ECC error counts. Any non-zero double-bit error means quarantine the card.
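- DCGM_FI_DEV_RETIRED_DBE / RETIRED_SBE: retired-page counters inherited from pre-Ampere GPUs. H100 uses HBM row remapping instead, so also watch DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS and DCGM_FI_DEV_ROW_REMAP_FAILURE; a steady climb predicts card failure within weeks.
- DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_POWER_VIOLATION: watts drawn and cumulative time throttled by the power cap (the time unit depends on the DCGM version, so alert on the rate of change rather than the absolute value).
- DCGM_FI_DEV_THERMAL_VIOLATION: cumulative time throttled by the thermal limit; correlate spikes with rack inlet temperature.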
# Prometheus alert rules: H100 production fleet
groups:
  - name: h100-health
    interval: 30s
    rules:
      - alert: H100ThermalThrottle
        expr: rate(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
        for: 2m
        labels: { severity: warning }
        annotations:
          summary: "H100 {{ $labels.gpu }} thermal throttling"
          runbook: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#thermal-management
      - alert: H100ECCDoubleBit
        expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
        labels: { severity: critical }
        annotations:
          summary: "DBE ECC error on H100 {{ $labels.gpu }}: quarantine card"
      - alert: H100NVLinkDown
        # Throughput-based check: only meaningful while collectives are running,
        # so inhibit or route this alert for idle GPUs.
        expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 700e9
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "H100 {{ $labels.gpu }} NVLink degraded below 700 GB/s"
      - alert: H100HBMNearFull
        expr: DCGM_FI_DEV_FB_FREE < 4096
        for: 10m
        labels: { severity: warning }
      - alert: H100TensorIdle
        # Match on node and GPU index; matching on gpu alone collides across
        # nodes. Hostname is the dcgm-exporter node label; adjust if relabelled.
        expr: >-
          avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[15m]) < 0.30
          and on(Hostname, gpu) DCGM_FI_DEV_GPU_UTIL > 70
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "GPU busy but Tensor Cores idle: dataloader or kernel inefficiency"
PIPE_TENSOR_ACTIVE is the single most useful Hopper signal. A replica showing 90 % DCGM_FI_DEV_GPU_UTIL but 25 % PIPE_TENSOR_ACTIVE is doing memory ops, not math; the usual cause is a dataloader bottleneck or an unfused kernel path. Fix that before adding GPUs.
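Pairing PIPE_TENSOR_ACTIVE with DRAM_ACTIVE, as suggested in the metric list, can be turned into a small triage helper for dashboards or on-call scripts. The sketch below is illustrative only: the 0.5 and 70 % thresholds and the four labels are assumptions to tune per workload, and the function is not part of DCGM.

```python
def classify_regime(tensor_active, dram_active, gpu_util,
                    busy_ratio=0.5, busy_util=70):
    """Label a GPU sample from three DCGM readings.

    tensor_active: DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, a ratio in [0, 1]
    dram_active:   DCGM_FI_PROF_DRAM_ACTIVE, a ratio in [0, 1]
    gpu_util:      DCGM_FI_DEV_GPU_UTIL, a percentage in [0, 100]
    """
    if tensor_active >= busy_ratio:
        return "compute-bound"   # Tensor Cores are the busy resource
    if dram_active >= busy_ratio:
        return "memory-bound"    # HBM bandwidth is the busy resource
    if gpu_util >= busy_util:
        return "stalled"         # kernels resident, neither pipe saturated
    return "idle"


if __name__ == "__main__":
    # The example from the text: 90 % GPU_UTIL but 25 % PIPE_TENSOR_ACTIVE.
    print(classify_regime(tensor_active=0.25, dram_active=0.20, gpu_util=90))
```

A "stalled" label is the case worth investigating first, since it points at input pipelines or small unfused kernels rather than at a shortage of GPUs.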
Cost and FinOps
H100 hourly pricing collapsed by roughly 60-70 % between 2023 and 2026 as supply caught up. In 2026 the public ranges below are typical; private commitments often clear 20-40 % under on-demand list. Pricing levers in order of impact: commitment term (in the table below, 1y reserved is roughly 20-35 % off on-demand and 3y roughly 40-55 % off), neocloud vs hyperscaler (neocloud ~= 30-50 % cheaper at parity), FP8 enablement (~= 1.6x throughput vs an FP16 baseline at the same hourly cost, which is about 38 % lower cost per token), and right-sizing replicas to NVLink-locality (avoiding cross-baseboard placement that doubles GPU count for the same TPS).
- Cost-per-million-output-tokens on Llama 3 70B FP8, 1x H100 SXM5 at $2.00/GPU-hr and 1,100 TPS sustained: roughly $0.50 per million tokens before margin.
- Switching from FP16 to FP8 with the Transformer Engine yields +1.6x throughput on H100 SXM5 at iso-context — a 38 % drop in cost-per-token.
- Reserving 3 years cuts effective $/GPU-hr roughly in half versus on-demand; only commit when steady-state utilisation exceeds 65 %.
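- Idle replicas are the dominant overspend pattern in production inference fleets. Set `minReplicas: 0` on non-critical endpoints with cold-start tolerance; scale-to-zero needs KEDA, Knative/KServe, or the `HPAScaleToZero` feature gate, because a stock HorizontalPodAutoscaler has a floor of 1.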
- Egress and inter-region data movement frequently exceed 10 % of total H100 bill at hyperscalers — collocate model artefacts with compute.
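The cost-per-token figures in this list follow from two small formulas, reproduced below so they can be re-run with your own rates. The function names are illustrative; the inputs in the example are the ones quoted in the bullets above.

```python
def cost_per_million_tokens(usd_per_gpu_hour, tokens_per_second, gpus=1):
    """USD per million output tokens for a replica running flat out."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_gpu_hour * gpus / tokens_per_hour * 1e6


def cost_reduction_from_speedup(speedup):
    """Fractional drop in cost per token when throughput rises by `speedup`x."""
    return 1 - 1 / speedup


if __name__ == "__main__":
    usd = cost_per_million_tokens(2.00, 1100)
    print(f"1x H100 at $2.00/hr, 1,100 TPS: ${usd:.2f} per million tokens")
    print(f"FP8 at 1.6x throughput: {cost_reduction_from_speedup(1.6):.1%} cheaper")
```

These are full-utilisation numbers; dividing by your real duty cycle (for example 0.6 for a replica busy 60 % of the time) gives the effective cost.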
| Provider class | SKU | On-demand $/GPU-hr | 1y reserved | 3y reserved | Notes |
|---|---|---|---|---|---|
| Hyperscaler (AWS/GCP/Azure) | H100 SXM5 | $2.50-3.00 | $1.60-2.10 | $1.10-1.60 | Best for hybrid stacks; data-egress costs matter. |
| Hyperscaler | H100 PCIe | $1.80-2.40 | $1.20-1.60 | $0.85-1.20 | Fewer regions; not all instances support NVLink. |
| Tier-1 neocloud | H100 SXM5 | $1.80-2.40 | $1.40-1.80 | $1.00-1.40 | Commonly cheapest at scale; verify NVLink topology. |
| Tier-2 neocloud | H100 SXM5 | $1.40-1.90 | $1.10-1.50 | $0.85-1.20 | Best raw rate; expect more variance in IB topology. |
| Spot/preemptible | H100 SXM5 | $0.90-1.60 | n/a | n/a | 8-15 % eviction/day; fine-tunes only. |
| Yobitel NeoCloud (UK + EU) | H100 SXM5 | $1.90-2.40 | $1.40-1.80 | $1.00-1.40 | NCSC OFFICIAL-aligned regions; FOCUS-conformant billing. |
| Yobitel Omniscient Compute | H100 SXM5 multi-cloud | Market-clearing | Commit-discounted | Commit-discounted | Cross-provider arbitrage on top of NeoCloud + partner capacity. |
All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=`AcceleratorCompute`, ChargeCategory=`Usage`, SkuId=`gpu.h100.sxm5`. This is what makes cross-provider arbitrage and cost attribution tractable at scale.
Security and compliance
H100 ships with three independent isolation primitives — MIG, Confidential Compute (CC-on), and the standard CUDA process/IPC model — and the combination supports sovereign deployments under UK NCSC guidance, EU GDPR, US HIPAA and FedRAMP Moderate when paired with appropriate host hardening.
MIG provides hardware-enforced spatial partitioning: up to 7 instances per H100, each with isolated HBM, L2, NVDEC/NVJPG and SM allocations. Inter-instance memory bandwidth contention is bounded by the partition. MIG instances are enumerated by the driver as separate devices with their own UUIDs (not as separate PCIe functions), and the NVIDIA Device Plugin advertises them to Kubernetes as schedulable resources, which keeps multi-tenant scheduling straightforward.
Confidential Compute mode (CC-on) encrypts all PCIe traffic between the H100 and the host with AES-256-GCM and seals HBM-resident pages so the host kernel cannot read them. Attestation is performed via SPDM-over-PCIe to NVIDIA's NRAS service; an attested boot binds the firmware version, driver, and the workload's measurement hash. CC-on costs roughly 3-7 % throughput on most inference workloads and is currently the only commercial GPU attestation path with FedRAMP Moderate coverage.
For Yobitel UK sovereign deployments the recommended posture is: MIG-off (full-card workloads only), CC-on, NCSC-aligned host hardening (CIS Ubuntu 22.04 LTS Level 2), NCSC Cloud Security Principles 1-14 evidence in the workspace audit log, and OFFICIAL-classification data segregation by workspace rather than by namespace.
- MIG: spatial isolation, 7 slices max, hardware-enforced memory and bandwidth partitioning.
- CC-on: cryptographic isolation, SPDM attestation, ~3-7 % throughput penalty.
- Per-replica IAM via Yobitel workspaces; encryption-at-rest for model artefacts; signed model provenance via Sigstore.
- Auditable: every Kubernetes admission event (Deployment, Job, InferenceService) lands in the cluster audit log; pair with Falco for runtime detection and Sigstore-verified container images for supply-chain provenance.
- GDPR: model weights and training data residency enforced at workspace level; cross-region inference requires explicit configuration.
Migration and alternatives
When H100 is the right choice and when it isn't. The table below maps the practical migration paths in both directions; the code block below shows the real-world commands you can run today on existing infrastructure as a reference.
Two heuristics: pick H200 when memory pressure dominates (KV cache or weights); pick B200 only when you can absorb a new software stack and need FP4 throughput. Stay on H100 in every other case — the software lead alone usually justifies it through 2026.
| From / to | When it pays | Migration effort | Key incompatibility |
|---|---|---|---|
| A100 -> H100 | Need FP8 throughput or TMA/FA3 | Low (drop-in CUDA upgrade) | FP8 calibration; sm_90 kernels not on Ampere |
| H100 -> H200 | KV cache or weights memory-bound | Trivial (same software stack) | None — same GH100 silicon |
| H100 -> B200 | Need FP4 or 8 TB/s bandwidth | Medium (CUDA 12.8+, FP4 quantisation) | New MX formats; some kernels need rework |
| H100 PCIe -> H100 SXM5 | Workload uses NVLink collectives | Medium (chassis change) | Cooling envelope; thermal redesign |
| H100 -> MI300X | Need 192 GB HBM3 per GPU | High (CUDA -> ROCm rewrite) | CUDA kernels not portable; vLLM ROCm gap |
| H100 -> TPU v5p/Trillium | Already on JAX/XLA | High (full stack change) | PyTorch kernels need XLA path |
| H100 -> Inferentia 2 | Inference-only, AWS-resident | High (Neuron compiler) | Limited model coverage |
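The two heuristics stated before the table can be written down as a small decision function. This is a sketch only: the function name and boolean inputs are hypothetical, and checking memory pressure before FP4 is an assumption of this example, since the text does not say which rule wins when both apply.

```python
def recommend_gpu(memory_bound, needs_fp4, can_absorb_new_stack):
    """Apply the heuristics: H200 for memory pressure, B200 for FP4, else H100."""
    if memory_bound:
        # KV cache or weights dominate: same software stack, more HBM.
        return "H200"
    if needs_fp4 and can_absorb_new_stack:
        # FP4 throughput only pays off if the new stack can be absorbed.
        return "B200"
    # Every other case: stay on H100.
    return "H100"


if __name__ == "__main__":
    print(recommend_gpu(memory_bound=True, needs_fp4=False, can_absorb_new_stack=False))   # H200
    print(recommend_gpu(memory_bound=False, needs_fp4=True, can_absorb_new_stack=True))    # B200
    print(recommend_gpu(memory_bound=False, needs_fp4=True, can_absorb_new_stack=False))   # H100
```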
# --- Equivalents you can run today against existing stacks ---
# 1) AWS: launch an 8x H100 p5 instance (note: requires quota)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type p5.48xlarge \
  --key-name my-key \
  --subnet-id subnet-0abc...
# 2) GCP: a3-highgpu-8g (8x H100 SXM5)
gcloud compute instances create h100-train-01 \
  --machine-type=a3-highgpu-8g \
  --zone=europe-west2-a \
  --image-family=common-cu124 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE
# 3) Kubernetes (GPU Operator + Device Plugin): request 2 H100s
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-70b }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm-70b } }
  template:
    metadata: { labels: { app: vllm-70b } }
    spec:
      nodeSelector: { nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3 }
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        args: ["--model","meta-llama/Meta-Llama-3-70B-Instruct",
               "--tensor-parallel-size","2","--max-model-len","32768",
               "--kv-cache-dtype","fp8_e5m2","--quantization","fp8"]
        resources:
          limits: { nvidia.com/gpu: 2 }
EOF
# 4) Direct vLLM serve (bare metal)
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768
# 5) TensorRT-LLM engine build for H100
trtllm-build --checkpoint_dir ./hf_llama3_70b_fp8 \
  --output_dir ./engines/llama3-70b-h100 \
  --gemm_plugin fp8 --gpt_attention_plugin fp8 \
  --max_input_len 16384 --max_seq_len 32768 \
  --tp_size 2 --workers 2
The kubectl / aws / gcloud commands above are what you run today on the underlying infrastructure. Yobitel operates this stack on customers' behalf; see 'Where this fits in the Yobitel stack' below for the integration boundary.
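To see why the serving commands above pair `--max-model-len 32768` with an FP8 KV cache, it helps to estimate the cache size per sequence. The sketch below assumes the published Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those values are not stated in this entry, and `kv_cache_bytes` is a hypothetical helper, not a vLLM API.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """KV-cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


if __name__ == "__main__":
    # One 32K-token sequence, summed across all tensor-parallel ranks.
    print(kv_cache_bytes(32768, bytes_per_elem=1) / GIB)  # FP8:  5.0
    print(kv_cache_bytes(32768, bytes_per_elem=2) / GIB)  # BF16: 10.0
```

Under these assumptions an FP8 cache halves the per-sequence footprint relative to BF16, which roughly doubles the number of concurrent 32K sequences that fit in whatever HBM remains after weights.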
Troubleshooting#
Operational issues we see most often on H100 fleets, ranked by frequency. Each has a definitive diagnosis and a fix path.
| Error / symptom | Likely cause | Fix |
|---|---|---|
| GPU clocks throttling, die > 83 C | Thermal throttling — inlet > 27 C or coolant under-flow | Verify rack inlet temp, coolant supply temp, secondary loop dT; drop power cap to 600 W if persistent. |
| NCCL AllReduce hangs at job start | Missing or stale NCCL topology file on heterogeneous NVLink+IB cluster | Dump the detected topology by running once with `NCCL_TOPO_DUMP_FILE=/tmp/topo.xml`; install the corrected file and set `NCCL_TOPO_FILE=/etc/nccl/topo.xml`; verify with `NCCL_DEBUG=INFO`. |
| `CUDA_ERROR_OUT_OF_MEMORY` on inference start | Batch x context too large for 80 GB after weights + KV cache + cuBLAS scratch | Reduce `max_model_len`, set `gpu_memory_utilization=0.88`, switch KV cache to `fp8_e5m2`, or move to TP=2. |
| Single NVLink port down — DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL drops to ~850 GB/s | Mezzanine reseat needed or cold solder joint | `nvidia-smi nvlink --status`; drain node; reseat module; if persistent, RMA. |
| MIG misconfiguration: `nvidia-smi mig -lgi` shows partial slices | Previous workload exited without releasing compute instances | `nvidia-smi mig -dci -gi <id>` to destroy leftover compute instances, then `nvidia-smi mig -dgi -gi <id>` to destroy the GPU instance, then re-create. MIG repartitioning is destructive, so drain workloads first. |
| ECC double-bit error in dmesg | HBM defect | Immediately quarantine card. Drain workloads, mark node unschedulable, RMA. Do not redeploy until replaced. |
| First-token latency 5-10x higher than steady state | Cold KV-cache and engine warm-up | Enable prefix caching, pre-warm replicas with synthetic traffic on rollout, and use a warm-up option such as `--num-warmup-requests` where your serving engine provides one. |
| Training step time 2-4x expected | Cross-baseboard tensor-parallel rank — collectives over IB instead of NVLink | Pin replica to single HGX baseboard; on K8s use NVLink-topology aware scheduler; verify with NCCL `PXN` debug. |
| FP8 training loss spikes after 1k-10k steps | Activation amax history saturated; activation scaling miscalibrated | Increase `amax_history_len` in the TE `DelayedScaling` recipe, sanity-check `fp8_format=Format.HYBRID`, reduce learning rate, or fall back to BF16 on the offending layer. |
| `nvidia-smi` shows 100 % util but tokens/sec is flat | Dataloader-bound: kernels are resident but Tensor Cores are mostly idle | Check `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`; increase dataloader workers, switch to prefetched parquet shards, enable pinned host memory. |
| `CUDA_ERROR_ECC_UNCORRECTABLE` on a single rank | Row-remapping resources exhausted by accumulated uncorrectable HBM errors | Run `nvidia-smi -q -d ROW_REMAPPER`; if the remapped-row count exceeds 50, treat the card as end-of-life and RMA. |
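Several rows in the table reduce to simple threshold checks, which makes them easy to automate in a health probe. The sketch below encodes four of them; the thresholds (83 C die temperature, 18 NVLink ports, any double-bit ECC error, more than 50 remapped rows) come from this entry, while the `triage` function and the sample field names are hypothetical and would need mapping onto whatever your DCGM or `nvidia-smi` collector actually emits.

```python
EXPECTED_NVLINK_PORTS = 18   # H100 SXM5 port count
THERMAL_LIMIT_C = 83         # die temperature above which throttling is expected
REMAPPED_ROWS_LIMIT = 50     # end-of-life threshold used in the table


def triage(sample):
    """Return (code, action) findings for one GPU sample given as a dict."""
    findings = []
    if sample.get("die_temp_c", 0) > THERMAL_LIMIT_C:
        findings.append(("thermal", "check inlet and coolant; consider a 600 W power cap"))
    if sample.get("nvlink_ports_up", EXPECTED_NVLINK_PORTS) < EXPECTED_NVLINK_PORTS:
        findings.append(("nvlink", "drain node and reseat module; RMA if persistent"))
    if sample.get("ecc_double_bit_errors", 0) > 0:
        findings.append(("ecc", "quarantine card, mark node unschedulable, RMA"))
    if sample.get("remapped_rows", 0) > REMAPPED_ROWS_LIMIT:
        findings.append(("remap", "card is end-of-life; RMA"))
    return findings


if __name__ == "__main__":
    bad = {"die_temp_c": 85, "nvlink_ports_up": 17, "remapped_rows": 51}
    for code, action in triage(bad):
        print(code, "->", action)
```

Missing fields default to healthy values, so a partial sample never raises a false alarm; whether that is the right default depends on how reliable the collector is.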
Where this fits in the Yobitel stack#
The H100 is the workhorse SKU across the Yobitel stack in 2026. Yobibyte — our AI-native platform — consolidates the open-source primitives shown above (NVIDIA GPU Operator, vLLM, KServe, Volcano, KubeRay, DCGM, Sigstore) into a single managed control plane: inference replicas, fine-tune jobs and notebooks all land on H100 (or H200 / B200) pools with NVLink-aware placement, FP8 enabled by default and DCGM alerts pre-wired. The vLLM and `accelerate` commands in this entry are exactly what Yobibyte reconciles under the hood on the customer's behalf.
Omniscient Compute — our cross-cloud capacity broker — indexes H100 SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud, normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency and compliance posture. When you ask Yobitel for 8x H100 SXM5 in the UK sovereign region, Omniscient Compute is the layer that finds it.
InferenceBench — our public, reproducible benchmarking harness — publishes H100 throughput, latency and cost-per-token numbers for every major open-weight model across vLLM, TensorRT-LLM, SGLang and TGI. The sizing tables in this entry are anchored on InferenceBench runs; the production numbers your team will see in steady state are typically within 10 % of the published figures. If you are sizing a 2026 H100 footprint, start with InferenceBench, lift the platform configuration into a Yobibyte manifest, and let Omniscient Compute pick the region.
References
- NVIDIA H100 Tensor Core GPU Datasheet · NVIDIA
- Hopper Architecture Whitepaper · NVIDIA
- NVLink Switch System Specification · NVIDIA
- Transformer Engine User Guide · NVIDIA
- DCGM Field Identifiers (Prometheus exporter) · NVIDIA
- Confidential Compute on NVIDIA H100 · NVIDIA
- vLLM FP8 quantisation on Hopper · vLLM
- TensorRT-LLM Hopper engines · NVIDIA
- FinOps Foundation FOCUS billing specification · FinOps Foundation
- NCSC Cloud Security Principles · UK NCSC