NVIDIA H100 Tensor Core GPU

TL;DR

Hopper-architecture data centre GPU (GH100) on TSMC 4N, 80 billion transistors, announced in March 2022 with volume shipments from Q4 2022, and the default training accelerator from 2023 onward; still the most widely benchmarked and software-mature AI GPU in production through 2026.
Two form factors: SXM5 (700 W TDP, 18-port NVLink 4.0 at 900 GB/s) and PCIe Gen5 (350 W TDP, 600 GB/s NVLink bridge, drop-in for retrofit servers). SXM5 is what fills DGX-H100, HGX-H100, AWS p5, GCP a3, Azure ND H100 v5 and almost every neocloud H100 instance.
80 GB HBM3 at 3.35 TB/s; fourth-generation Tensor Cores deliver 989 TFLOPS dense BF16 and 1,979 TFLOPS dense FP8 (3,958 TFLOPS FP8 with 2:4 sparsity); the Transformer Engine auto-casts layers between BF16 and FP8 (E4M3/E5M2) with runtime amax tracking.
NVLink 4.0 + third-gen NVSwitch ASIC scales to 256-GPU NVLink-domain pods with 57.6 TB/s bisection, the substrate most large GPU training runs shipped between 2023 and 2025 ran on.
Sizing rule of thumb: Llama 3 70B FP8 fits on 1x H100 with 8K context (no TP); 32K context needs 2x H100 with TP=2; QLoRA fine-tune of 70B fits on 2x H100 SXM5 (~70 GB peak per GPU including optimiser).

Overview

The NVIDIA H100 is the data centre GPU that turned large language models from a research curiosity into an industrial product. Announced at GTC 2022 and shipping in volume from Q4 2022, it pairs the Hopper architecture (GH100, 80 billion transistors on TSMC 4N) with HBM3 memory and a dedicated Transformer Engine — the combination that let teams train 70B-parameter models in weeks rather than months and that defined the cost-per-token economics of the first ChatGPT-era serving fleet.

The headline numbers — 989 TFLOPS BF16, 1,979 TFLOPS FP8 dense, 3,958 TFLOPS FP8 with 2:4 sparsity, 80 GB HBM3, 3.35 TB/s memory bandwidth — only matter alongside the interconnect. NVLink 4.0 and the third-generation NVSwitch ASIC give H100 the lowest-latency, highest-bandwidth fabric of any commodity accelerator. That fabric is what made the H100 era distinct from the A100 era: not just more FLOPS per GPU, but a way to make 256 GPUs behave like one. By 2026, H100 capacity is broadly available across every hyperscaler, every NVIDIA-Partner neocloud and most regional sovereign clouds; pricing has compressed from $4-8/GPU-hour in 2023 to $1.10-3.00/GPU-hour, making it frequently the best price-per-training-token GPU NVIDIA ships.

This entry is the reference for teams operating H100 at scale: full spec sheet, the sizing tables we use internally on InferenceBench, the DCGM signals to alert on, the FinOps levers that move the needle, the migration paths to and from neighbouring SKUs, and the troubleshooting playbook for the issues every team eventually hits. Yobitel NeoCloud offers H100 SXM5 capacity in UK and EU regions with NCSC OFFICIAL alignment, NVLink-locality-aware placement, and FOCUS-conformant billing — most teams reading this entry consume H100 either through NeoCloud directly or through Yobibyte's managed inference workspaces. This entry helps you decide when H100 is the right pick for your workload and how to size and price it on Yobitel NeoCloud or your own cluster.

Quick start

The shortest path from zero to a running H100 today. Three equivalent routes are shown below: an AWS p5.48xlarge (8x H100 SXM5) launched via the EC2 API, a GCP a3-highgpu-8g (8x H100 SXM5) launched via gcloud, and a bare-metal/colo path that exposes existing H100 nodes to Kubernetes via the NVIDIA GPU Operator. Pick whichever matches your fleet, then jump to Workload pattern A to serve Llama 3 70B on the GPUs you just provisioned.

# --- Route 1: AWS p5.48xlarge (8x H100 SXM5, 3.2 Tb/s EFA) ---
# Requires the "Running On-Demand P instances" service quota uplifted from 0.
aws ec2 run-instances \
  --region eu-west-2 \
  --image-id ami-0abcdef1234567890 \
  --instance-type p5.48xlarge \
  --key-name my-key \
  --subnet-id subnet-0abc... \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=h100-train-01}]'

# Verify the 8 H100s are visible once the host boots
aws ssm start-session --target i-0abc... \
  --document-name AWS-StartInteractiveCommand \
  --parameters command="nvidia-smi -L"

# --- Route 2: GCP a3-highgpu-8g (8x H100 SXM5) ---
gcloud compute instances create h100-train-01 \
  --project=my-project \
  --zone=europe-west2-a \
  --machine-type=a3-highgpu-8g \
  --image-family=common-cu124-debian-12 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=500GB --boot-disk-type=pd-ssd \
  --metadata="install-nvidia-driver=True"

gcloud compute ssh h100-train-01 --zone=europe-west2-a --command='nvidia-smi'

# --- Route 3: Bare-metal / colo K8s — expose existing H100 nodes ---
# Adds the NVIDIA GPU Operator, which installs drivers, container-toolkit,
# DCGM exporter, MIG manager and the device plugin.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
kubectl create namespace gpu-operator
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.9.0 \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true

# Verify the operator brought up driver + plugin + DCGM
kubectl get pods -n gpu-operator
kubectl get nodes -L nvidia.com/gpu.product
# NAME            STATUS   ROLES    GPU.PRODUCT
# h100-node-01    Ready    worker   NVIDIA-H100-80GB-HBM3
kubectl describe node h100-node-01 | grep -E 'nvidia.com/gpu'
# Capacity:     nvidia.com/gpu: 8
# Allocatable:  nvidia.com/gpu: 8

Tip: On AWS and GCP, the default quota for H100 instance families is zero on new accounts; file the uplift ticket weeks ahead. On bare-metal, the GPU Operator handles driver/container-toolkit/DCGM in one chart — do not install the host driver manually unless you are pinning a specific R535+/R550+ build.

How it works: Hopper architecture and the H100 pipeline

Hopper introduced four innovations over Ampere that justified the generational leap, none of which were headline FLOPS in isolation.

First, the fourth-generation Tensor Core added native FP8 support (E4M3 and E5M2) at twice the throughput of FP16. Paired with the Transformer Engine, a runtime that tracks a rolling history of per-tensor absolute maxima (the amax history), derives FP8 scaling factors from it, and keeps precision-sensitive layers in BF16 or FP32, typical LLM training throughput roughly doubled against A100 BF16 baselines at comparable model quality. E4M3 is used for forward activations and weights, E5M2 for gradients where extended range matters more than precision.
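The scaling step described above can be sketched in a few lines. This is an illustrative model only, not Transformer Engine's code: the class name `AmaxTracker`, the 16-step window and the max-over-history rule are assumptions chosen for the example. The only hard facts used are the largest finite values of the two FP8 formats (448 for E4M3, 57344 for E5M2).

```python
# Illustrative delayed-scaling sketch; AmaxTracker is a hypothetical helper,
# not part of Transformer Engine.
from collections import deque

FP8_MAX = {"e4m3": 448.0, "e5m2": 57344.0}  # largest finite value per format


class AmaxTracker:
    """Keeps a rolling window of per-tensor absolute maxima (amax)."""

    def __init__(self, fmt: str, history_len: int = 16, margin: int = 0):
        self.fp8_max = FP8_MAX[fmt]
        self.history = deque(maxlen=history_len)  # oldest entries fall off
        self.margin = margin  # extra headroom, in powers of two

    def update(self, values) -> None:
        """Record the amax observed in the current step."""
        self.history.append(max(abs(v) for v in values))

    def scale(self) -> float:
        """Factor that maps the largest recent amax onto the FP8 range."""
        amax = max(self.history, default=0.0)
        if amax == 0.0:
            return 1.0  # nothing observed yet: leave values unscaled
        return self.fp8_max / amax / (2 ** self.margin)


fwd = AmaxTracker("e4m3")
fwd.update([0.5, -3.5, 2.0])  # step 1: amax = 3.5
fwd.update([1.0, -7.0])       # step 2: amax = 7.0
print(fwd.scale())            # 448 / 7 = 64.0
```

Using the maximum over a short window rather than only the latest step trades a little dynamic range for stability when activations spike between iterations.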

Second, Thread Block Clusters grouped multiple Cooperative Thread Arrays (CTAs) under a unified distributed-shared-memory namespace, letting kernels reuse data across SM groups without round-tripping to HBM. Combined with the new Tensor Memory Accelerator (TMA), a dedicated copy engine that asynchronously moves tensor tiles between HBM and SMEM with descriptor-based addressing, this is what made Flash Attention 3 possible in its published form (Flash Attention 2 predates these features and does not depend on them). TMA is also why hand-tuned cuBLASLt GEMMs on H100 routinely reach 80-90 % of peak.

Third, DPX instructions accelerated dynamic-programming inner loops — Smith-Waterman sequence alignment, route planning, certain reinforcement-learning search workloads — at up to 7x Ampere throughput.

Fourth, second-generation MIG (Multi-Instance GPU) added confidential-compute boundaries between instances and memory-bandwidth partitioning, letting a single H100 host multi-tenant inference with hardware-enforced isolation. MIG slices on H100 expose a fraction of SMs, a fixed share of HBM (10 GB per 1g.10gb slice up to 80 GB for a full 7g.80gb), and dedicated NVDEC/NVJPG engines (H100 carries no NVENC encoder).

GH100 as shipped in H100 SXM5: 132 of the die's 144 Streaming Multiprocessors (SMs) enabled, 528 fourth-generation Tensor Cores, 50 MB L2 cache (60 MB on the full die), 256 KB combined L1/SMEM per SM.
Memory: 5 HBM3 stacks x 16 GB = 80 GB total at 3.35 TB/s on SXM5 (HBM2e on PCIe at 2.0 TB/s).
Compute capability: sm_90 (sm_90a for the architecture-specific TMA and wgmma intrinsics used by CUTLASS, Flash Attention 3 and Triton's Hopper backend).
Confidential Compute (CC-on) mode: AES-256-GCM encryption of CPU-GPU traffic over PCIe, with HBM-resident pages sealed off from the host, attested via SPDM and NVIDIA's attestation service.

Subsystem	Hopper detail	Practical consequence
Tensor Core (gen 4)	FP8 E4M3/E5M2, BF16, TF32, INT8	Transformer Engine routing per-layer; ~2x iso-precision training throughput vs A100.
TMA	Async tensor-tile DMA, descriptor-based	Flash Attention 3, CUTLASS 3.x, Triton Hopper kernels reach 80-90 % of peak.
Thread Block Cluster	Up to 16 CTAs share distributed SMEM	Persistent kernels, larger working sets, lower HBM pressure.
DPX	Hardware DP inner-loop instructions	Genomics, RL search and graph workloads see 4-7x Ampere uplift.
MIG gen 2	7 slices, isolated HBM/L2/bandwidth/CC	Hard multi-tenant inference on one card.

Reference: full specification sheet

Authoritative per-SKU figures. SXM5 fills HGX-H100 baseboards and almost every cloud GPU instance; PCIe Gen5 is the drop-in card for retrofit servers; NVL pairs two PCIe boards via a 600 GB/s bridge with 188 GB HBM3 for memory-pressured inference. Tensor rows marked "sparse" assume 2:4 structured sparsity, and dense throughput is half the sparse figure; the FP64 and FP32 rows are dense.

Metric	H100 SXM5	H100 PCIe Gen5	H100 NVL (pair)
Architecture	Hopper GH100	Hopper GH100	Hopper GH100 x2
Process	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	80 billion	80 billion	160 billion (pair)
SMs	132	114	132 x 2
Tensor cores	528	456	528 x 2
L2 cache	50 MB	50 MB	50 MB x 2
Compute capability	sm_90 / sm_90a	sm_90 / sm_90a	sm_90 / sm_90a
FP64 (Tensor)	67 TFLOPS	51 TFLOPS	134 TFLOPS
FP32	67 TFLOPS	51 TFLOPS	134 TFLOPS
TF32 (Tensor, sparse)	989 TFLOPS	756 TFLOPS	1,978 TFLOPS
BF16 / FP16 (Tensor, sparse)	1,979 TFLOPS	1,513 TFLOPS	3,958 TFLOPS
FP8 (Tensor, sparse)	3,958 TFLOPS	3,026 TFLOPS	7,916 TFLOPS
INT8 (Tensor, sparse)	3,958 TOPS	3,026 TOPS	7,916 TOPS
Memory	80 GB HBM3	80 GB HBM2e	188 GB HBM3 (94 GB per board)
Memory bandwidth	3.35 TB/s	2.0 TB/s	7.8 TB/s aggregate
NVLink	900 GB/s (NVLink 4.0, 18 ports)	600 GB/s (bridge, optional)	600 GB/s board-to-board bridge
PCIe	Gen5 x16 (128 GB/s)	Gen5 x16 (128 GB/s)	Gen5 x16 per board
TDP	700 W (configurable 600-700 W)	350 W	2 x 350-400 W
MIG instances	Up to 7	Up to 7	Up to 7 per board
Confidential Compute	Yes (CC-on attested)	Yes	Yes
Form factor	SXM5 mezzanine	FHFL dual-slot PCIe	Dual FHFL PCIe + bridge
Minimum driver	R525 (R535+ recommended)	R525	R535+
Minimum CUDA	12.0 (12.4+ for full TE)	12.0	12.2

Note: Sparse Tensor numbers assume 2:4 structured sparsity — half the weights pruned in a fixed pattern. Real training and inference workloads rarely sustain this; dense FP8 throughput is roughly half the listed sparse figure. Quote dense numbers in capacity plans and treat sparse figures as marketing ceilings.
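As a small aid for capacity plans, the sketch below derives dense figures from the sparse column of the SXM5 spec and adds a roofline-style ridge point (FLOPs per byte of HBM traffic at which a kernel stops being bandwidth-bound). The function names are hypothetical, and the ridge point is a standard back-of-envelope quantity rather than something stated in the spec sheet.

```python
# Dense-throughput and ridge-point helper for H100 SXM5.
# Inputs are the sparse figures and HBM bandwidth from the spec table;
# function names are illustrative.
SPARSE_TFLOPS = {"tf32": 989.0, "bf16": 1979.0, "fp8": 3958.0}
HBM_TB_PER_S = 3.35


def dense_tflops(precision: str) -> float:
    """2:4 sparsity doubles the headline number, so dense is half of it."""
    return SPARSE_TFLOPS[precision] / 2


def ridge_flop_per_byte(precision: str) -> float:
    """Arithmetic intensity above which dense math, not HBM, is the limit."""
    return dense_tflops(precision) / HBM_TB_PER_S


for p in ("bf16", "fp8"):
    print(p, dense_tflops(p), round(ridge_flop_per_byte(p)))
```

Kernels whose arithmetic intensity sits well below the printed ridge value (decode-phase attention is a typical case) are limited by the 3.35 TB/s of HBM bandwidth, so the Tensor Core ceiling matters little for them.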

Interconnect: NVLink 4.0 and the NVSwitch fabric

Every H100 SXM5 module exposes 18 NVLink 4.0 ports, each providing 50 GB/s bidirectional — 900 GB/s aggregate per GPU. That figure alone is interesting; the topology around it is what matters.

An HGX-H100 baseboard places 8 GPUs alongside 4 NVSwitch ASICs, wiring every GPU to every switch. The result is a fully non-blocking 8-GPU shared-memory fabric: any GPU can DMA into any other GPU's HBM at full NVLink bandwidth with no fabric contention. Inside one DGX H100, collectives such as AllReduce hit 450 GB/s per direction, close to the theoretical NVLink ceiling and roughly 3x the equivalent A100 figure.

Beyond 8 GPUs, the optional NVLink Switch System extends the same topology to 256-GPU pods via external NVLink switches. The pod delivers 57.6 TB/s of bisection bandwidth, about 4.5x the aggregate of an InfiniBand NDR fabric of the same size (400 Gb/s = 50 GB/s per port x 256 ports ~= 12.8 TB/s), and it is the reason hyperscale training clusters increasingly look like 'one giant GPU' rather than 'a cluster of GPUs'. Beyond 256 GPUs the topology switches to InfiniBand or RoCE, and cross-pod collective latency rises by roughly 5-10x; sizing past 256 should account for that step.

Per-GPU NVLink: 900 GB/s bidirectional (18 ports x 50 GB/s).
Per-baseboard NVSwitch bisection: 3.6 TB/s (8 GPUs x 450 GB/s per direction).
NVLink-domain ceiling: 256 GPUs, 57.6 TB/s bisection.
Above 256 GPUs: InfiniBand NDR/XDR or Spectrum-X RoCE — plan for 5-10x latency uplift on cross-pod collectives.
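The bandwidth figures in this section follow from simple multiplication. The sketch below reproduces them so they can be re-derived when a port count or link speed changes; the variable names are illustrative, and every input comes from the text above.

```python
# Re-derive the NVLink and InfiniBand figures quoted in this section.
NVLINK_PORTS = 18
PORT_GB_PER_S = 50            # per port, bidirectional
GPUS_PER_BASEBOARD = 8
POD_GPUS = 256
POD_BISECTION_GB_PER_S = 57_600   # 57.6 TB/s
IB_NDR_GBIT_PER_S = 400

per_gpu = NVLINK_PORTS * PORT_GB_PER_S          # 900 GB/s bidirectional
per_direction = per_gpu // 2                    # 450 GB/s each way
baseboard = GPUS_PER_BASEBOARD * per_direction  # 3,600 GB/s = 3.6 TB/s
ib_port = IB_NDR_GBIT_PER_S // 8                # 50 GB/s per NDR port
ib_aggregate = POD_GPUS * ib_port               # 12,800 GB/s = 12.8 TB/s
ratio = POD_BISECTION_GB_PER_S / ib_aggregate   # NVLink pod vs NDR aggregate

print(per_gpu, per_direction, baseboard, ib_aggregate, ratio)
# 900 450 3600 12800 4.5
```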

Workload pattern A: Llama 3 70B inference at 32K context

Single-replica throughput target, latency-sensitive endpoint. We size to two H100 SXM5 with TP=2 to fit the 70B weights plus a meaningful KV cache budget; FP8 reduces weight memory to ~35 GB per rank, leaving headroom for 16-32 concurrent sessions. The shortest path is vllm serve directly on the host — bind the OpenAI-compatible HTTP server, pin the replica to a single HGX baseboard via CUDA_VISIBLE_DEVICES, and the smoke-test is a single curl.

# 1) Install vLLM with FP8 support (pinned versions; match them to your CUDA 12.x stack)
pip install "vllm==0.6.3" "torch==2.4.0"

# 2) Pin to the first two GPUs on the same HGX baseboard, then serve.
#    vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint on :8000.
#    Llama 3 70B ships with an 8K context window, so vLLM rejects
#    --max-model-len 32768 for it; the Llama 3.1 70B checkpoint (128K) is used here.
CUDA_VISIBLE_DEVICES=0,1 \
NCCL_P2P_LEVEL=NVL \
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000

# 3) Smoke-test the endpoint (the model name must match the one being served)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role":"user","content":"Summarise NVLink 4.0 in one sentence."}],
    "max_tokens": 128
  }' | jq .

Warning: with TP=2 and 32K context, NCCL AllReduce on the attention output is the dominant inter-GPU traffic. If the two GPUs are on different NUMA nodes or behind a PCIe switch instead of NVLink, decode TPS collapses by 40-60 %. Always pin the replica to a single HGX baseboard and verify with nvidia-smi topo -m that the two devices share an NV# link. NCCL_P2P_LEVEL=NVL only restricts peer-to-peer transfers to NVLink-connected pairs; it does not raise an error when the link is missing, so also run once with NCCL_DEBUG=INFO and check which transport NCCL reports.
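The KV cache budget for Pattern A can be estimated before launching anything. The sketch below assumes the published Llama 3 70B shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128), one byte per element for an FP8 cache, and a 4 GB per-rank allowance for activations and scratch space; that allowance and the helper names are assumptions for illustration, not measured values.

```python
# Back-of-envelope KV cache sizing for a 70B model at TP=2 with an FP8 cache.
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    layers: int = 80      # Llama 3 70B transformer blocks
    kv_heads: int = 8     # grouped-query attention
    head_dim: int = 128


def kv_bytes_per_token(shape: KVShape, bytes_per_elem: int = 1, tp: int = 1) -> int:
    """Bytes of K plus V stored per token on one tensor-parallel rank."""
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_elem // tp


def full_context_sessions(gpu_gb: float = 80, util: float = 0.92,
                          weight_gb_per_rank: float = 35,
                          overhead_gb: float = 4,  # assumed activations + scratch
                          context: int = 32768, tp: int = 2) -> int:
    """How many sessions can hold a full context window at the same time."""
    budget = (gpu_gb * util - weight_gb_per_rank - overhead_gb) * 1e9
    per_session = kv_bytes_per_token(KVShape(), 1, tp) * context
    return int(budget // per_session)


print(kv_bytes_per_token(KVShape()))   # 163840 bytes per token across all ranks
print(full_context_sessions())         # sessions at a full 32K window
```

Under these assumptions each full 32K session costs about 2.7 GB per rank, so higher concurrency targets depend on most sessions staying well below the maximum length or on prefix caching sharing blocks between them.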

Workload pattern B: 70B QLoRA fine-tune

QLoRA fine-tune of a 70B base model on 2x H100 SXM5 using transformers + peft + bitsandbytes + trl, launched with accelerate launch. NF4 base weights (~35 GB), BF16 LoRA adapters (~0.8B parameters, ~1.7 GB at r=64 on all linear layers), paged AdamW optimiser state for adapters only, gradient checkpointing on every transformer block, Flash Attention 2 (some newer transformers releases also accept attn_implementation="flash_attention_3"; confirm your installed version supports it before switching). Peak working set lands around 70 GB per GPU at batch 2 / seq 4096.

Launch on a 2x H100 node: NCCL_P2P_LEVEL=NVL accelerate launch --num_processes 2 --mixed_precision bf16 train.py.
For multi-node: accelerate launch --multi_gpu --num_machines N --machine_rank R --main_process_ip <head> train.py, or switch to torchrun --nproc_per_node 8 --nnodes N.
Monitor with watch -n 2 nvidia-smi and tensorboard --logdir out/llama3-70b-qlora/runs (the event files are binary, so tailing them directly is not useful).
For higher-throughput Hopper-tuned kernels, swap the model loader for unsloth or wrap the same config in axolotl — both compile down to the transformers + peft primitives above.

# train.py — 70B QLoRA on 2x H100 SXM5
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#                   "bitsandbytes>=0.43" "accelerate>=0.34" datasets
import torch
from accelerate import PartialState
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    # `accelerate launch --num_processes 2` runs data-parallel, so each rank
    # loads one full NF4 copy on its own GPU; device_map="auto" would instead
    # spread every copy across both GPUs and break DDP training.
    device_map={"": PartialState().local_process_index},
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", bias="none", task_type="CAUSAL_LM",
)

ds = load_dataset("json", data_files="s3://my-bucket/customer-support-v3/*.jsonl",
                  split="train", streaming=False)

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora, train_dataset=ds,
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,      # global batch 64 on 2 GPUs
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=4096,
        logging_steps=10, save_steps=500, report_to="tensorboard",
    ),
)
trainer.train()
trainer.save_model("./out/llama3-70b-qlora/final")

Tip: QLoRA on 2x H100 is faster end-to-end than full FP16 fine-tune on 8x H100 for adapter-style customisation, at roughly 25 % of the GPU-hour cost. Reach for full fine-tune only when you need to update behaviour outside the LoRA rank budget.
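To see where the adapter budget in Pattern B comes from, the sketch below counts LoRA parameters for the configuration in train.py (r=64 on every linear layer inside the transformer blocks). It assumes the published Llama 3 70B dimensions (hidden size 8192, MLP size 28672, 80 layers, 8 KV heads of dimension 128); the helper name `lora_params` is hypothetical.

```python
# Count LoRA adapter parameters for Llama 3 70B with target_modules="all-linear".
HIDDEN, INTERMEDIATE, LAYERS = 8192, 28672, 80
KV_DIM = 8 * 128  # 8 KV heads x head_dim 128 (grouped-query attention)

# (in_features, out_features) of each linear layer in one transformer block
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """Each adapted layer adds A (rank x in) and B (out x rank) matrices."""
    per_block = sum(rank * (i + o) for i, o in LINEAR_SHAPES.values())
    return LAYERS * per_block


params = lora_params(64)
print(params, round(params * 2 / 1e9, 2))  # parameter count, GB in BF16

# Effective batch from the SFTConfig above: per-device 2 x accumulation 16 x 2 GPUs
print(2 * 16 * 2)  # 64
```

Halving the rank halves both the adapter size and its optimiser state, which is usually the first knob to turn if the working set does not fit.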

Workload pattern C: Stable Diffusion XL serving

Stable Diffusion XL 1.0 base + refiner at 1024x1024 on a single H100 PCIe with MIG disabled. We use diffusers with torch.compile for the UNet and BF16 VAE; the workload is compute-bound rather than memory-bound, which makes the PCIe SKU competitive with SXM5 at a meaningfully lower hourly rate. For absolute peak throughput, a separate offline step compiles the UNet to a TensorRT engine (for example via ONNX export and trtexec; trtllm-build targets LLMs, not diffusion UNets), but the diffusers path below is what most teams run in production.

# sdxl_server.py — SDXL base + refiner on 1x H100 PCIe
# Deps: pip install "diffusers>=0.30" "transformers>=4.46" \
#                   "torch==2.4.0" accelerate safetensors fastapi uvicorn
import io, torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
from fastapi import FastAPI
from fastapi.responses import Response

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
REFINER = "stabilityai/stable-diffusion-xl-refiner-1.0"

base = StableDiffusionXLPipeline.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    REFINER, torch_dtype=torch.bfloat16,
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    variant="fp16", use_safetensors=True,
).to("cuda")

# Hopper-tuned: SDPA attention is FA2 by default on torch>=2.4
base.unet = torch.compile(base.unet, mode="max-autotune", fullgraph=True)
refiner.unet = torch.compile(refiner.unet, mode="max-autotune", fullgraph=True)

app = FastAPI()

@app.post("/generate")
def generate(prompt: str, steps: int = 25, guidance: float = 7.0):
    latent = base(prompt=prompt, num_inference_steps=steps,
                  guidance_scale=guidance, denoising_end=0.8,
                  output_type="latent").images
    image = refiner(prompt=prompt, num_inference_steps=steps,
                    denoising_start=0.8, image=latent).images[0]
    buf = io.BytesIO(); image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

# Run with: uvicorn sdxl_server:app --host 0.0.0.0 --port 8000 --workers 1

Note: SDXL on H100 PCIe with torch.compile lands around 1.6 images/second at batch 1 (25 steps, 1024^2). H100 SXM5 is roughly 1.15x faster on the UNet but the PCIe SKU's lower hourly rate usually wins on cost-per-image. Verify with InferenceBench rather than assuming. For the last 1.5-2x on top of torch.compile, build a TensorRT 10 engine for the UNet (for example via ONNX export and trtexec) and swap base.unet for the engine wrapper.

Sizing and capacity planning

Sizing tables we use internally to scope H100 footprints. All figures assume H100 SXM5, FP8 weights via the Transformer Engine, vLLM 0.6 with paged KV cache and prefix caching, and a healthy NVLink-local placement. Throughput is given in output tokens per second per replica at the listed concurrency; treat these as planning anchors, not contractual SLOs — verify on InferenceBench before production rollout.

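Training rule of thumb: 1 trillion training tokens x 70B parameters is about 4.2e23 FLOPs (6 x parameters x tokens), which works out to roughly 250-350 days of wall-clock time on a 64-GPU NVLink-domain cluster (about 16,000-22,400 H100-days) with Megatron-LM + Transformer Engine FP8 at roughly 220-300 TFLOPS sustained per GPU.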
Memory ceiling for a single H100: weights + KV cache + activations + cuBLAS scratch < 78 GB. Above 78 GB, expect OOMs even with paged KV — drop precision, shrink context, or move to TP=2.
AllReduce overhead at TP=8 inside one HGX-H100: ~6-9 % of step time for 70B BF16; jumps to 25-40 % the moment a rank crosses to a second NVLink domain over InfiniBand.
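Replica count is aggregate output tokens per second divided by per-replica TPS. At the 1,000-1,300 TPS per replica listed below for Llama 3 70B FP8, 6-8 H100 SXM5 replicas sustain roughly 6,000-10,000 output tokens/s in aggregate (for example, about 2 requests/s at 4K output tokens each); size headroom for prefix-cache cold-start on rollouts.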
Spot/preemptible H100 capacity is viable for fine-tunes but not for production inference SLAs — eviction rates of 8-15 % per day are typical on hyperscaler spot.
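The training rule of thumb can be re-derived with the widely used 6 x parameters x tokens approximation for dense transformer training FLOPs. The sustained throughput of 250 TFLOPS per GPU below is an assumption for illustration (about a quarter of the dense BF16 peak), and the function name is hypothetical.

```python
# Estimate GPU-days for dense transformer training via FLOPs ~= 6 * N * D.
SECONDS_PER_DAY = 86_400


def training_gpu_days(params: float, tokens: float,
                      sustained_tflops_per_gpu: float) -> float:
    """GPU-days needed at a given sustained (not peak) throughput per GPU."""
    total_flops = 6 * params * tokens
    gpu_seconds = total_flops / (sustained_tflops_per_gpu * 1e12)
    return gpu_seconds / SECONDS_PER_DAY


gpu_days = training_gpu_days(params=70e9, tokens=1e12,
                             sustained_tflops_per_gpu=250)  # assumed throughput
print(round(gpu_days))        # total H100-days
print(round(gpu_days / 64))   # wall-clock days on a 64-GPU cluster
```

The result is linear in every input, so doubling sustained throughput or the GPU count halves the wall-clock figure; measure sustained TFLOPS on a short run before trusting any estimate.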

Model size	Precision	Context	GPUs per replica	TP / PP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	FP8	8K	1x H100	1 / 1	5,500-7,000	60 GB free
13B	FP8	8K	1x H100	1 / 1	3,800-4,800	50 GB free
34B (Yi, Codestral)	FP8	8K	1x H100	1 / 1	1,900-2,400	25 GB free
70B (Llama 3)	FP8	8K	1x H100	1 / 1	1,000-1,300	10-15 GB free
70B (Llama 3.1)	FP8	32K	2x H100	2 / 1	1,500-1,900	20 GB free per rank
70B (Llama 3.1)	FP8	128K	4x H100	4 / 1	1,700-2,200	12 GB free per rank
140B MoE (Mixtral 8x22B)	FP8	32K	2x H100	2 / 1	900-1,200	8 GB free per rank
180B (Falcon, Bloom)	FP8	8K	4x H100	4 / 1	600-800	15 GB free per rank
405B (Llama 3.1)	FP8	32K	8x H100	8 / 1	350-450	10 GB free per rank
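Turning a row of the table into a replica count is a single division. The sketch below assumes steady traffic and uses the per-replica TPS from the table as its input; the function name and the optional headroom parameter are illustrative.

```python
# Convert a traffic target into a replica count using per-replica output TPS.
import math


def replicas_needed(requests_per_s: float, output_tokens: int,
                    tps_per_replica: float, headroom: float = 0.0) -> int:
    """Replicas required to sustain the aggregate output-token rate."""
    demand_tps = requests_per_s * output_tokens
    return math.ceil(demand_tps * (1 + headroom) / tps_per_replica)


# 2 requests/s at 4,096 output tokens each, 70B FP8 on 1x H100 (1,100-1,300 TPS)
print(replicas_needed(2, 4096, 1100))   # 8
print(replicas_needed(2, 4096, 1300))   # 7
```

Passing headroom=0.25 adds a quarter more capacity for rollouts and cache cold-starts; the right value depends on how bursty the traffic is.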

Limits and quotas

Default per-account caps you will hit. Hyperscaler quotas are vendor-defined and require support tickets to raise. Plan procurement around lead times of 8-26 weeks for committed H100 capacity in 2026.

Limit	Default	Ceiling	How to raise
AWS p5.48xlarge (8x H100) on-demand	0 vCPU baseline	Account-negotiated	Service Quotas -> 'Running On-Demand P instances'
AWS p5.48xlarge capacity-block reservation	0	Region-negotiated	EC2 Capacity Blocks for ML, 1-182 day windows
GCP a3-highgpu-8g region quota	0	Org-negotiated	Cloud Console -> IAM & Admin -> Quotas (NVIDIA_H100_GPUS)
Azure ND H100 v5 cores per region	0	Org-negotiated	Azure portal -> Subscriptions -> Usage + quotas
Kubernetes nvidia.com/gpu per pod	node-allocatable	Hardware limit	Node selector + `resources.limits[nvidia.com/gpu]`
NVLink-domain size (NVL Switch)	256 GPUs	Hardware limit	Span domains via InfiniBand; expect ~10x collective latency
MIG slices per H100	7 (max)	Hardware limit	Repartition; partition changes are destructive
NCCL message size in-flight	NCCL default	Cluster-tuned	`NCCL_MAX_NCHANNELS`, `NCCL_BUFFSIZE` tuning
Confidential Compute mode	Off	Per-card	Driver toggle + attestation service; one-way until reboot
TensorRT-LLM engine cache size	8 GB	Disk-bound	`TRTLLM_CACHE_DIR` + larger PV

Warning: On hyperscalers, a quota of 'zero' is the default for new accounts on every H100 SKU. File the quota uplift ticket weeks before you need capacity; in 2026, average grant time is 3-10 business days for the larger clouds.

Observability

Production H100 observability is built around DCGM (Data Center GPU Manager) exporting Prometheus metrics, plus NCCL traces for collective hotspots and ECC counters for hardware health. The metrics below are the ones we alert on in every Yobitel-managed H100 deployment.

DCGM_FI_DEV_GPU_UTIL: fraction of time at least one kernel was running on the GPU (not SM occupancy). Sustained < 60 % under load usually means dataloader stall, not under-provisioning.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — framebuffer (HBM) used and free, in MiB. Alert when free < 4 GB on inference replicas.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — fraction of cycles the Tensor Cores are busy. The honest 'is the GPU actually doing math' metric.
DCGM_FI_PROF_DRAM_ACTIVE — HBM bandwidth utilisation; pair with PIPE_TENSOR_ACTIVE to classify compute-bound vs memory-bound regimes.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL — NVLink throughput per GPU; sudden drops correlate with single-port link-down events.
DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP — die and HBM temperatures. Alert at die > 83 C (throttle threshold) and HBM > 95 C.
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DBE_VOL_TOTAL — single- and double-bit ECC error counts. Any non-zero double-bit error means quarantine the card.
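DCGM_FI_DEV_RETIRED_DBE / RETIRED_SBE: retired HBM pages; a steady climb predicts card failure within weeks. Hopper relies on row remapping rather than page retirement, so also watch DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS and DCGM_FI_DEV_ROW_REMAP_FAILURE.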
DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_POWER_VIOLATION — watts drawn and milliseconds throttled by power cap.
DCGM_FI_DEV_THERMAL_VIOLATION — milliseconds throttled by thermal limit; correlate spikes with rack inlet temperature.

# Prometheus alert rules — H100 production fleet
groups:
- name: h100-health
  interval: 30s
  rules:
  - alert: H100ThermalThrottle
    expr: rate(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 2m
    labels: { severity: warning }
    annotations:
      summary: "H100 {{ $labels.gpu }} thermal throttling"
      runbook: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#thermal-management

  - alert: H100ECCDoubleBit
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
    labels: { severity: critical }
    annotations:
      summary: "DBE ECC error on H100 {{ $labels.gpu }} — quarantine card"

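  - alert: H100NVLinkDown
    # NVLink throughput depends on traffic, so gate on load; an idle GPU would
    # otherwise trip this rule. Check the units your dcgm-exporter build
    # reports for this field before relying on the threshold.
    expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 700e9
      and on(gpu) DCGM_FI_DEV_GPU_UTIL > 70
    for: 5m
    labels: { severity: critical }
    annotations:
      summary: "H100 {{ $labels.gpu }} NVLink degraded below 700 GB/s"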

  - alert: H100HBMNearFull
    expr: DCGM_FI_DEV_FB_FREE < 4096
    for: 10m
    labels: { severity: warning }

  - alert: H100TensorIdle
    expr: avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[15m]) < 0.30
      and on(gpu) DCGM_FI_DEV_GPU_UTIL > 70
    for: 15m
    labels: { severity: info }
    annotations:
      summary: "GPU busy but Tensor Cores idle — dataloader or kernel inefficiency"

Tip: PIPE_TENSOR_ACTIVE is the single most useful Hopper signal. A replica showing 90 % DCGM_FI_DEV_GPU_UTIL but 25 % PIPE_TENSOR_ACTIVE is doing memory ops, not math — usually a dataloader bottleneck or an unfused kernel path. Fix that before adding GPUs.
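The classification described in the tip can be written as a small triage function over three DCGM readings. The thresholds below mirror the H100TensorIdle rule (GPU_UTIL above 70, tensor activity below 0.30) and are starting points, not calibrated values; the function name and labels are hypothetical.

```python
# Triage a GPU from three DCGM samples: PIPE_TENSOR_ACTIVE and DRAM_ACTIVE are
# fractions in [0, 1], GPU_UTIL is a percentage in [0, 100].
def classify_gpu(tensor_active: float, dram_active: float, gpu_util: float,
                 busy_util: float = 70.0, active_floor: float = 0.30) -> str:
    if gpu_util < busy_util:
        return "idle-or-starved"        # kernels are not running most of the time
    if tensor_active >= active_floor and tensor_active >= dram_active:
        return "compute-bound"          # Tensor Cores are the busiest resource
    if dram_active >= active_floor:
        return "memory-bound"           # HBM bandwidth is the busiest resource
    return "busy-but-inefficient"       # busy GPU, neither pipe well used


print(classify_gpu(0.25, 0.10, 90))   # busy-but-inefficient
print(classify_gpu(0.65, 0.30, 95))   # compute-bound
print(classify_gpu(0.20, 0.70, 95))   # memory-bound
```

The "busy-but-inefficient" case is the one the tip warns about: adding GPUs there only multiplies the waste, whereas a memory-bound decode replica is behaving as expected.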

Cost and FinOps

H100 hourly pricing collapsed by roughly 60-70 % between 2023 and 2026 as supply caught up. In 2026 the public ranges below are typical; private commitments often clear 20-40 % under on-demand list. Pricing levers in order of impact: commitment term (per the table below, 1y reserved is roughly 25-35 % off on-demand and 3y roughly 45-55 % off), neocloud vs hyperscaler (neocloud is roughly 20-45 % cheaper at parity), FP8 enablement (about 1.6x throughput vs an FP16 baseline at the same hourly cost, which is about a 38 % lower cost per token), and right-sizing replicas to NVLink-locality (avoiding cross-baseboard placement that doubles GPU count for the same TPS).

Cost-per-million-output-tokens on Llama 3 70B FP8, 1x H100 SXM5 at $2.00/GPU-hr and 1,100 TPS sustained: roughly $0.50 per million tokens before margin.
Switching from FP16 to FP8 with the Transformer Engine yields +1.6x throughput on H100 SXM5 at iso-context — a 38 % drop in cost-per-token.
Reserving 3 years cuts effective $/GPU-hr roughly in half versus on-demand; only commit when steady-state utilisation exceeds 65 %.
Idle replicas are the dominant overspend pattern in production inference fleets — set minReplicas: 0 on non-critical endpoints with cold-start tolerance.
Egress and inter-region data movement frequently exceed 10 % of total H100 bill at hyperscalers — collocate model artefacts with compute.

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Hyperscaler (AWS/GCP/Azure)	H100 SXM5	$2.50-3.00	$1.60-2.10	$1.10-1.60	Best for hybrid stacks; data-egress costs matter.
Hyperscaler	H100 PCIe	$1.80-2.40	$1.20-1.60	$0.85-1.20	Fewer regions; not all instances support NVLink.
Tier-1 neocloud	H100 SXM5	$1.80-2.40	$1.40-1.80	$1.00-1.40	Commonly cheapest at scale; verify NVLink topology.
Tier-2 neocloud	H100 SXM5	$1.40-1.90	$1.10-1.50	$0.85-1.20	Best raw rate; expect more variance in IB topology.
Spot/preemptible	H100 SXM5	$0.90-1.60	n/a	n/a	8-15 % eviction/day; fine-tunes only.
Yobitel NeoCloud (UK + EU)	H100 SXM5	$1.90-2.40	$1.40-1.80	$1.00-1.40	NCSC OFFICIAL-aligned regions; FOCUS-conformant billing.
Yobitel Omniscient Compute	H100 SXM5 multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

Note: All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=AcceleratorCompute, ChargeCategory=Usage, SkuId=gpu.h100.sxm5. This is what makes cross-provider arbitrage and cost attribution tractable at scale.
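The cost-per-token figures in this section reduce to two formulas, sketched below with the inputs quoted above ($2.00/GPU-hr, 1,100 TPS, 1.6x FP8 speed-up). The function names are illustrative, and the calculation ignores idle time, egress and margin.

```python
# Cost per million output tokens and the saving from a throughput speed-up.
def usd_per_million_tokens(usd_per_gpu_hour: float, tps: float,
                           gpus: int = 1) -> float:
    """Hourly spend divided by tokens produced per hour, scaled to 1M tokens."""
    tokens_per_hour = tps * 3600
    return usd_per_gpu_hour * gpus / tokens_per_hour * 1e6


def cost_reduction(speedup: float) -> float:
    """Fractional drop in cost per token at the same hourly price."""
    return 1 - 1 / speedup


print(round(usd_per_million_tokens(2.00, 1100), 3))  # 0.505
print(cost_reduction(1.6))                           # 0.375
```

Note the asymmetry: a 1.6x throughput gain cuts cost per token by 37.5 %, not 60 %, because cost scales with the reciprocal of throughput.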

Security and compliance

H100 ships with three independent isolation primitives — MIG, Confidential Compute (CC-on), and the standard CUDA process/IPC model — and the combination supports sovereign deployments under UK NCSC guidance, EU GDPR, US HIPAA and FedRAMP Moderate when paired with appropriate host hardening.

MIG provides hardware-enforced spatial partitioning: up to 7 instances per H100, each with isolated HBM, L2, NVDEC/NVJPG and SM allocations. Inter-instance memory bandwidth contention is bounded by the partition. MIG instances are enumerated by the driver as separate devices with their own UUIDs (visible in nvidia-smi -L) rather than as separate PCIe devices, and the NVIDIA Device Plugin advertises them to Kubernetes as schedulable resources, which keeps multi-tenant scheduling straightforward.

Confidential Compute mode (CC-on) encrypts all PCIe traffic between the H100 and the host with AES-256-GCM and seals HBM-resident pages so the host kernel cannot read them. Attestation is performed via SPDM-over-PCIe to NVIDIA's NRAS service; an attested boot binds the firmware version, driver, and the workload's measurement hash. CC-on costs roughly 3-7 % throughput on most inference workloads and is currently the only commercial GPU attestation path with FedRAMP Moderate coverage.

For Yobitel UK sovereign deployments the recommended posture is: MIG-off (full-card workloads only), CC-on, NCSC-aligned host hardening (CIS Ubuntu 22.04 LTS Level 2), NCSC Cloud Security Principles 1-14 evidence in the workspace audit log, and OFFICIAL-classification data segregation by workspace rather than by namespace.

MIG: spatial isolation, 7 slices max, hardware-enforced memory and bandwidth partitioning.
CC-on: cryptographic isolation, SPDM attestation, ~3-7 % throughput penalty.
Per-replica IAM via Yobitel workspaces; encryption-at-rest for model artefacts; signed model provenance via Sigstore.
Auditable: every Kubernetes admission event (Deployment, Job, InferenceService) lands in the cluster audit log; pair with Falco for runtime detection and Sigstore-verified container images for supply-chain provenance.
GDPR: model weights and training data residency enforced at workspace level; cross-region inference requires explicit configuration.

Migration and alternatives

When H100 is the right choice and when it isn't. The table below maps the practical migration paths in both directions; the code block below shows the real-world commands you can run today on existing infrastructure as a reference.

Two heuristics: pick H200 when memory pressure dominates (KV cache or weights); pick B200 only when you can absorb a new software stack and need FP4 throughput. Stay on H100 in every other case — the software lead alone usually justifies it through 2026.

From / to	When it pays	Migration effort	Key incompatibility
A100 -> H100	Need FP8 throughput or TMA/FA3	Low (drop-in CUDA upgrade)	FP8 calibration; sm_90 kernels not on Ampere
H100 -> H200	KV cache or weights memory-bound	Trivial (same software stack)	None — same GH100 silicon
H100 -> B200	Need FP4 or 8 TB/s bandwidth	Medium (CUDA 12.8+, FP4 quantisation)	New MX formats; some kernels need rework
H100 PCIe -> H100 SXM5	Workload uses NVLink collectives	Medium (chassis change)	Cooling envelope; thermal redesign
H100 -> MI300X	Need 192 GB HBM3 per GPU	High (CUDA -> ROCm rewrite)	CUDA kernels not portable; vLLM ROCm gap
H100 -> TPU v5p/Trillium	Already on JAX/XLA	High (full stack change)	PyTorch kernels need XLA path
H100 -> Inferentia 2	Inference-only, AWS-resident	High (Neuron compiler)	Limited model coverage

# --- Equivalents you can run today against existing stacks ---

# 1) AWS: launch an 8x H100 p5 instance (note: requires quota)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type p5.48xlarge \
  --key-name my-key \
  --subnet-id subnet-0abc...

# 2) GCP: a3-highgpu-8g (8x H100 SXM5)
gcloud compute instances create h100-train-01 \
  --machine-type=a3-highgpu-8g \
  --zone=europe-west2-a \
  --image-family=common-cu124 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE

# 3) Kubernetes (GPU Operator + Device Plugin): request 2 H100s
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-70b }
spec:
  replicas: 2
  selector: { matchLabels: { app: vllm-70b } }
  template:
    metadata: { labels: { app: vllm-70b } }
    spec:
      nodeSelector: { nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3 }
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        args: ["--model","meta-llama/Meta-Llama-3-70B-Instruct",
               "--tensor-parallel-size","2","--max-model-len","32768",
               "--kv-cache-dtype","fp8_e5m2","--quantization","fp8"]
        resources:
          limits: { nvidia.com/gpu: 2 }
EOF

# 4) Direct vLLM serve (bare metal)
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768

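# 5) TensorRT-LLM engine build for H100
#    FP8 is selected through the FP8 checkpoint and the GEMM plugin; the
#    attention plugin takes a compute dtype (or auto), not fp8.
trtllm-build --checkpoint_dir ./hf_llama3_70b_fp8 \
  --output_dir ./engines/llama3-70b-h100 \
  --gemm_plugin fp8 --gpt_attention_plugin auto \
  --max_input_len 16384 --max_seq_len 32768 \
  --tp_size 2 --workers 2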

Note: The kubectl / aws / gcloud commands above are what you run today on the underlying infrastructure. Yobitel operates this stack on customers' behalf — see 'Where this fits in the Yobitel stack' below for the integration boundary.

Troubleshooting

Operational issues we see most often on H100 fleets, ranked by frequency. Each has a definitive diagnosis and a fix path.

Error / symptom	Likely cause	Fix
GPU clocks throttling, die > 83 C	Thermal throttling — inlet > 27 C or coolant under-flow	Verify rack inlet temp, coolant supply temp, secondary loop dT; drop power cap to 600 W if persistent.
NCCL AllReduce hangs at job start	Missing or stale NCCL topology file on heterogeneous NVLink+IB cluster	Dump the topology NCCL detects with `NCCL_TOPO_DUMP_FILE=/tmp/topo.xml`, review it, then set `NCCL_TOPO_FILE=/etc/nccl/topo.xml`; verify with `NCCL_DEBUG=INFO`.
`CUDA_ERROR_OUT_OF_MEMORY` on inference start	Batch x context too large for 80 GB after weights + KV cache + cuBLAS scratch	Reduce `max_model_len`, set `gpu_memory_utilization=0.88`, switch KV cache to `fp8_e5m2`, or move to TP=2.
Single NVLink port down — DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL drops to ~850 GB/s	Mezzanine reseat needed or cold solder joint	`nvidia-smi nvlink --status`; drain node; reseat module; if persistent, RMA.
MIG misconfiguration: `nvidia-smi mig -lgi` shows partial slices	Previous workload exited without releasing compute instances	Destroy leftover compute instances with `nvidia-smi mig -dci`, then `nvidia-smi mig -dgi -gi <id>` to destroy the GPU instance, then re-create. MIG repartition is destructive, so drain workloads first.
ECC double-bit error in dmesg	HBM defect	Immediately quarantine card. Drain workloads, mark node unschedulable, RMA. Do not redeploy until replaced.
First-token latency 5-10x higher than steady state	Cold KV-cache and engine warm-up	Enable prefix caching and pre-warm replicas with synthetic traffic on rollout before they receive production requests.
Training step time 2-4x expected	Cross-baseboard tensor-parallel rank: collectives over IB instead of NVLink	Pin replica to single HGX baseboard; on K8s use an NVLink-topology-aware scheduler; verify with `NCCL_DEBUG=INFO` that intra-replica channels use P2P over NVLink rather than NET/IB.
FP8 training loss spikes after 1k-10k steps	Activation amax history saturated; activation scaling miscalibrated	Increase `amax_history_len` in the TE `DelayedScaling` recipe, sanity-check `fp8_format=Format.HYBRID`, reduce learning rate, or fall back to BF16 on offending layer.
`nvidia-smi` shows 100 % util but tokens/sec is flat	Dataloader bound — Tensor Cores idle	Check `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`; increase dataloader workers, switch to prefetched parquet shards, enable IO pinned memory.
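CUDA_ERROR_ECC_UNCORRECTABLE on a single rank	HBM row-remapping resources running out (Hopper remaps rows rather than retiring pages)	Run `nvidia-smi -q -d ROW_REMAPPER`; if remapped rows exceed 50 or a remapping failure is reported, the card is end-of-life and should be RMA'd.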
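
The FP8 loss-spike row depends on how delayed scaling picks its scale factor from a window of recent amax values. The sketch below is a simplified pure-Python model of that idea, not Transformer Engine code: the class name and the "max over the window" rule are assumptions for illustration, and 448 is the largest finite E4M3 value.

```python
from collections import deque

FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3


class DelayedScale:
    """Toy delayed-scaling tracker: scale = fp8_max / max(recent amax)."""

    def __init__(self, history_len: int, margin: int = 0):
        self.history = deque(maxlen=history_len)  # oldest amax drops off
        self.margin = margin

    def update(self, amax: float) -> float:
        self.history.append(amax)
        window_max = max(self.history)
        if window_max == 0.0:
            return 1.0
        return FP8_E4M3_MAX / window_max / (2 ** self.margin)


if __name__ == "__main__":
    amax_trace = [1.0, 8.0, 1.0, 1.0]  # one activation spike at step 2
    short, long_ = DelayedScale(history_len=2), DelayedScale(history_len=4)
    for step, amax in enumerate(amax_trace, start=1):
        print(step, short.update(amax), long_.update(amax))
```

With the short window the spike is forgotten after two steps and the scale jumps back up, so a repeat spike of the same size would overflow E4M3; the longer window keeps the conservative scale, which is why lengthening the history is the first thing to try.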

Where this fits in the Yobitel stack

The H100 is the workhorse SKU across the Yobitel stack in 2026. Yobibyte — our AI-native platform — consolidates the open-source primitives shown above (NVIDIA GPU Operator, vLLM, KServe, Volcano, KubeRay, DCGM, Sigstore) into a single managed control plane: inference replicas, fine-tune jobs and notebooks all land on H100 (or H200 / B200) pools with NVLink-aware placement, FP8 enabled by default and DCGM alerts pre-wired. The vLLM and accelerate commands in this entry are exactly what Yobibyte reconciles under the hood on the customer's behalf.

Omniscient Compute — our cross-cloud capacity broker — indexes H100 SKUs across every connected hyperscaler and Tier-1/Tier-2 neocloud, normalises pricing onto the FinOps Foundation FOCUS spec, and arbitrages workloads to the cheapest region that meets the workspace's residency and compliance posture. When you ask Yobitel for 8x H100 SXM5 in the UK sovereign region, Omniscient Compute is the layer that finds it.

InferenceBench — our public, reproducible benchmarking harness — publishes H100 throughput, latency and cost-per-token numbers for every major open-weight model across vLLM, TensorRT-LLM, SGLang and TGI. The sizing tables in this entry are anchored on InferenceBench runs; the production numbers your team will see in steady state are typically within 10 % of the published figures. If you are sizing a 2026 H100 footprint, start with InferenceBench, lift the platform configuration into a Yobibyte manifest, and let Omniscient Compute pick the region.

References

NVIDIA H100 Tensor Core GPU Datasheet · NVIDIA
Hopper Architecture Whitepaper · NVIDIA
NVLink Switch System Specification · NVIDIA
Transformer Engine User Guide · NVIDIA
DCGM Field Identifiers (Prometheus exporter) · NVIDIA
Confidential Compute on NVIDIA H100 · NVIDIA
vLLM FP8 quantisation on Hopper · vLLM
TensorRT-LLM Hopper engines · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC

Quick start

# --- Route 1: AWS p5.48xlarge (8x H100 SXM5, 3.2 Tb/s EFA) ---
# Requires the "Running On-Demand P instances" service quota uplifted from 0.
aws ec2 run-instances \
  --region eu-west-2 \
  --image-id ami-0abcdef1234567890 \
  --instance-type p5.48xlarge \
  --key-name my-key \
  --subnet-id subnet-0abc... \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=h100-train-01}]'

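# Verify the 8 H100s are visible once the host boots
# (requires the SSM agent and an instance profile that allows Session Manager)
aws ssm start-session --target i-0abc... \
  --document-name AWS-StartInteractiveCommand \
  --parameters command="nvidia-smi -L"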

# --- Route 2: GCP a3-highgpu-8g (8x H100 SXM5) ---
gcloud compute instances create h100-train-01 \
  --project=my-project \
  --zone=europe-west2-a \
  --machine-type=a3-highgpu-8g \
  --image-family=common-cu124-debian-12 \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=500GB --boot-disk-type=pd-ssd \
  --metadata="install-nvidia-driver=True"

gcloud compute ssh h100-train-01 --zone=europe-west2-a --command='nvidia-smi'

# --- Route 3: Bare-metal / colo K8s — expose existing H100 nodes ---
# Adds the NVIDIA GPU Operator, which installs drivers, container-toolkit,
# DCGM exporter, MIG manager and the device plugin.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
kubectl create namespace gpu-operator
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --version v24.9.0 \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=true

# Verify the operator brought up driver + plugin + DCGM
kubectl get pods -n gpu-operator
kubectl get nodes -L nvidia.com/gpu.product
# NAME            STATUS   ROLES    GPU.PRODUCT
# h100-node-01    Ready    worker   NVIDIA-H100-80GB-HBM3
kubectl describe node h100-node-01 | grep -E 'nvidia.com/gpu'
# Capacity:     nvidia.com/gpu: 8
# Allocatable:  nvidia.com/gpu: 8

Tip: On AWS and GCP, the default quota for H100 instance families is zero on new accounts; file the uplift ticket weeks ahead. On bare-metal, the GPU Operator handles driver/container-toolkit/DCGM in one chart — do not install the host driver manually unless you are pinning a specific R535+/R550+ build.

How it works: Hopper architecture and the H100 pipeline

Hopper introduced four innovations over Ampere that justified the generational leap, none of which were headline FLOPS in isolation.

GH100 die as shipped on H100 SXM5: 132 Streaming Multiprocessors (SMs), 528 fourth-generation Tensor Cores, 50 MB L2 cache, 256 KB of combined L1/shared memory per SM (about 33 MB across the 132 SMs).
Memory: 5 HBM3 stacks x 16 GB = 80 GB total at 3.35 TB/s on SXM5 (HBM2e on PCIe at 2.0 TB/s).
Compute capability: sm_90 (sm_90a for the architecture-specific TMA and wgmma intrinsics used by CUTLASS, Flash Attention 3 and Triton's Hopper backend).
Confidential Compute (CC-on) mode: AES-256-GCM encryption of PCIe traffic between GPU and host, HBM-resident pages sealed from the host, attested via SPDM and NVIDIA's attestation service.

Subsystem	Hopper detail	Practical consequence
Tensor Core (gen 4)	FP8 E4M3/E5M2, BF16, TF32, INT8	Transformer Engine routing per-layer; ~2x iso-precision training throughput vs A100.
TMA	Async tensor-tile DMA, descriptor-based	Flash Attention 3, CUTLASS 3.x, Triton Hopper kernels reach 80-90 % of peak.
Thread Block Cluster	Up to 16 CTAs share distributed SMEM	Persistent kernels, larger working sets, lower HBM pressure.
DPX	Hardware DP inner-loop instructions	Genomics, RL search and graph workloads see 4-7x Ampere uplift.
MIG gen 2	7 slices, isolated HBM/L2/bandwidth/CC	Hard multi-tenant inference on one card.

Reference: full specification sheet

Metric	H100 SXM5	H100 PCIe Gen5	H100 NVL (pair)
Architecture	Hopper GH100	Hopper GH100	Hopper GH100 x2
Process	TSMC 4N	TSMC 4N	TSMC 4N
Transistors	80 billion	80 billion	160 billion (pair)
SMs	132	114	132 x 2
Tensor cores	528	456	528 x 2
L2 cache	50 MB	50 MB	50 MB x 2
Compute capability	sm_90 / sm_90a	sm_90 / sm_90a	sm_90 / sm_90a
FP64 (Tensor)	67 TFLOPS	51 TFLOPS	134 TFLOPS
FP32	67 TFLOPS	51 TFLOPS	134 TFLOPS
TF32 (Tensor, sparse)	989 TFLOPS	756 TFLOPS	1,978 TFLOPS
BF16 / FP16 (Tensor, sparse)	1,979 TFLOPS	1,513 TFLOPS	3,958 TFLOPS
FP8 (Tensor, sparse)	3,958 TFLOPS	3,026 TFLOPS	7,916 TFLOPS
INT8 (Tensor, sparse)	3,958 TOPS	3,026 TOPS	7,916 TOPS
Memory	80 GB HBM3	80 GB HBM2e	188 GB HBM3 (94 GB per board)
Memory bandwidth	3.35 TB/s	2.0 TB/s	7.8 TB/s aggregate
NVLink	900 GB/s (NVLink 4.0, 18 ports)	600 GB/s (bridge, optional)	600 GB/s board-to-board bridge
PCIe	Gen5 x16 (128 GB/s)	Gen5 x16 (128 GB/s)	Gen5 x16 per board
TDP	700 W (configurable 600-700 W)	350 W	2 x 350-400 W
MIG instances	Up to 7	Up to 7	Up to 7 per board
Confidential Compute	Yes (CC-on attested)	Yes	Yes
Form factor	SXM5 mezzanine	FHFL dual-slot PCIe	Dual FHFL PCIe + bridge
Minimum driver	R525 (R535+ recommended)	R525	R535+
Minimum CUDA	12.0 (12.4+ for full TE)	12.0	12.2

Note: Sparse Tensor numbers assume 2:4 structured sparsity — half the weights pruned in a fixed pattern. Real training and inference workloads rarely sustain this; dense FP8 throughput is roughly half the listed sparse figure. Quote dense numbers in capacity plans and treat sparse figures as marketing ceilings.
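
As a quick illustration of the note above, the sketch below halves the sparse Tensor Core figures from the SXM5 column to get the dense numbers a capacity plan should quote. The dictionary and function names are made up for this example.

```python
# Sparse (2:4) Tensor Core peaks for H100 SXM5, taken from the table above.
SPARSE_TFLOPS_SXM5 = {"TF32": 989, "BF16/FP16": 1979, "FP8": 3958}


def dense_tflops(sparse_tflops: float) -> float:
    """2:4 sparsity skips half the multiplies, so dense peak is half of sparse."""
    return sparse_tflops / 2


DENSE_TFLOPS_SXM5 = {k: dense_tflops(v) for k, v in SPARSE_TFLOPS_SXM5.items()}

if __name__ == "__main__":
    for precision, tflops in DENSE_TFLOPS_SXM5.items():
        print(f"{precision}: {tflops:.1f} dense TFLOPS")
```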

Interconnect: NVLink 4.0 and the NVSwitch fabric

Every H100 SXM5 module exposes 18 NVLink 4.0 ports, each providing 50 GB/s bidirectional — 900 GB/s aggregate per GPU. That figure alone is interesting; the topology around it is what matters.

Per-GPU NVLink: 900 GB/s bidirectional (18 ports x 50 GB/s).
Per-baseboard NVSwitch bisection: 3.6 TB/s (8 GPUs x 450 GB/s per direction).
NVLink-domain ceiling: 256 GPUs, 57.6 TB/s bisection.
Above 256 GPUs: InfiniBand NDR/XDR or Spectrum-X RoCE — plan for 5-10x latency uplift on cross-pod collectives.
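
The figures in this list follow from simple arithmetic, sketched below. The ring all-reduce estimate uses the textbook bandwidth-only lower bound 2(n-1)/n x size / bandwidth and ignores latency and protocol overhead, so treat it as an assumption-laden floor rather than an NCCL prediction.

```python
NVLINK_PORTS = 18
GB_S_PER_PORT = 50          # bidirectional, per NVLink 4.0 port
GPUS_PER_BASEBOARD = 8

per_gpu_gb_s = NVLINK_PORTS * GB_S_PER_PORT                 # bidirectional
per_direction_gb_s = per_gpu_gb_s / 2
baseboard_bisection_gb_s = GPUS_PER_BASEBOARD * per_direction_gb_s


def ring_allreduce_seconds(message_bytes: float, n_gpus: int,
                           per_direction_gb_s: float) -> float:
    """Bandwidth-only lower bound for a ring all-reduce."""
    bytes_per_second = per_direction_gb_s * 1e9
    return 2 * (n_gpus - 1) / n_gpus * message_bytes / bytes_per_second


if __name__ == "__main__":
    print(f"per GPU: {per_gpu_gb_s} GB/s bidirectional")
    print(f"baseboard bisection: {baseboard_bisection_gb_s / 1000:.1f} TB/s")
    t = ring_allreduce_seconds(1e9, GPUS_PER_BASEBOARD, per_direction_gb_s)
    print(f"1 GB all-reduce floor on 8 GPUs: {t * 1e3:.2f} ms")
```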

Workload pattern A: Llama 3 70B inference at 32K context

# 1) Install vLLM with FP8 support (Hopper requires CUDA 12.4+, driver R550+)
pip install "vllm==0.6.3" "torch==2.4.0"

# 2) Pin to the first two GPUs on the same HGX baseboard, then serve.
#    vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint on :8000.
CUDA_VISIBLE_DEVICES=0,1 \
NCCL_P2P_LEVEL=NVL \
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e5m2 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --disable-log-requests \
  --host 0.0.0.0 --port 8000

# 3) Smoke-test the endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "messages": [{"role":"user","content":"Summarise NVLink 4.0 in one sentence."}],
    "max_tokens": 128
  }' | jq .

Warning: Pattern A gotcha: with TP=2 and 32K context, NCCL AllReduce on the attention output is the dominant inter-GPU traffic. If the two GPUs are on different NUMA nodes or behind a PCIe switch instead of NVLink, decode TPS collapses by 40-60 %. Always pin the replica to a single HGX baseboard and verify with nvidia-smi topo -m that the two devices share an NV# link. NCCL_P2P_LEVEL=NVL restricts peer-to-peer transfers to NVLink-connected pairs; it does not abort when the link is missing, so also check the NCCL_DEBUG=INFO log for the transport NCCL actually selected.
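
The topology check can be scripted before a replica starts. The sketch below parses text in the general shape of the nvidia-smi topo -m matrix; the sample matrix is hypothetical, and real output may include NIC columns, a legend and terminal escape codes that need stripping first.

```python
def gpu_link(topo_text: str, a: int, b: int) -> str:
    """Return the link label (e.g. NV18, SYS, PIX) between GPU a and GPU b."""
    lines = [ln for ln in topo_text.splitlines() if ln.strip()]
    # GPU columns come first in the header row.
    gpu_cols = [tok for tok in lines[0].split() if tok.startswith("GPU")]
    for ln in lines[1:]:
        tokens = ln.split()
        if tokens[0] == f"GPU{a}":
            row = dict(zip(gpu_cols, tokens[1:1 + len(gpu_cols)]))
            return row[f"GPU{b}"]
    raise ValueError(f"GPU{a} not found in topology matrix")


def shares_nvlink(topo_text: str, a: int, b: int) -> bool:
    return gpu_link(topo_text, a, b).startswith("NV")


# Hypothetical three-GPU matrix for illustration only.
SAMPLE = """\
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity
GPU0     X      NV18    SYS     0-47    0
GPU1    NV18     X      SYS     0-47    0
GPU2    SYS     SYS      X      48-95   1
"""

if __name__ == "__main__":
    print(shares_nvlink(SAMPLE, 0, 1), shares_nvlink(SAMPLE, 0, 2))
```

In practice the text would come from subprocess.run(["nvidia-smi", "topo", "-m"], capture_output=True, text=True).stdout, and the launcher would refuse to start the replica when the check returns False.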

Workload pattern B: 70B QLoRA fine-tune

Launch on a 2x H100 node: NCCL_P2P_LEVEL=NVL accelerate launch --num_processes 2 --mixed_precision bf16 train.py.
For multi-node: accelerate launch --multi_gpu --num_machines N --machine_rank R --main_process_ip <head> train.py, or switch to torchrun --nproc_per_node 8 --nnodes N.
Monitor with watch -n 2 nvidia-smi and tail -f out/llama3-70b-qlora/runs/*/events.out.tfevents.* (TensorBoard).
For higher-throughput Hopper-tuned kernels, swap the model loader for unsloth or wrap the same config in axolotl — both compile down to the transformers + peft primitives above.

# train.py — 70B QLoRA on 2x H100 SXM5
# Deps: pip install "transformers>=4.46" "peft>=0.13" "trl>=0.11" \
#                   "bitsandbytes>=0.43" "accelerate>=0.34" datasets
import os

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

MODEL_ID = "meta-llama/Meta-Llama-3-70B-Instruct"
# accelerate launch / torchrun export LOCAL_RANK for each process.
LOCAL_RANK = int(os.environ.get("LOCAL_RANK", "0"))

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map={"": LOCAL_RANK},             # one NF4 replica per GPU (data parallel)
    attn_implementation="flash_attention_2", # requires the flash-attn package
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules="all-linear", bias="none", task_type="CAUSAL_LM",
)

ds = load_dataset("json", data_files="s3://my-bucket/customer-support-v3/*.jsonl",
                  split="train", streaming=False)

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer, peft_config=lora, train_dataset=ds,
    args=SFTConfig(
        output_dir="./out/llama3-70b-qlora",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,      # global batch 64 on 2 GPUs
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        learning_rate=2e-4, lr_scheduler_type="cosine", warmup_ratio=0.03,
        bf16=True, max_seq_length=4096,
        logging_steps=10, save_steps=500, report_to="tensorboard",
    ),
)
trainer.train()
trainer.save_model("./out/llama3-70b-qlora/final")

Tip: QLoRA on 2x H100 is faster end-to-end than full FP16 fine-tune on 8x H100 for adapter-style customisation, at roughly 25 % of the GPU-hour cost. Reach for full fine-tune only when you need to update behaviour outside the LoRA rank budget.
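
To see what the r=64, all-linear LoRA configuration adds on top of the frozen 4-bit base, the sketch below counts adapter parameters. The layer shapes (hidden size 8192, MLP width 28672, 80 layers, 8 KV heads of dimension 128) are the published Llama 3 70B dimensions, treated here as assumptions; the 0.5 bytes per parameter for NF4 ignores quantisation constants.

```python
HIDDEN, INTERMEDIATE, LAYERS = 8192, 28672, 80
KV_DIM = 8 * 128  # grouped-query attention: 8 KV heads x head_dim 128

# (in_features, out_features) for every linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE), "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (in x r) and B (r x out) to each targeted linear layer."""
    per_layer = sum(rank * (i + o) for i, o in LINEAR_SHAPES.values())
    return per_layer * LAYERS


trainable = lora_trainable_params(64)
base_weight_gb = 70e9 * 0.5 / 1e9       # NF4: roughly half a byte per parameter
global_batch = 2 * 16 * 2               # per-device batch x grad accum x GPUs

if __name__ == "__main__":
    print(f"trainable LoRA params: {trainable / 1e6:.0f} M")
    print(f"frozen NF4 base: ~{base_weight_gb:.0f} GB per replica")
    print(f"global batch: {global_batch}")
```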

Workload pattern C: Stable Diffusion XL serving

# sdxl_server.py — SDXL base + refiner on 1x H100 PCIe
# Deps: pip install "diffusers>=0.30" "transformers>=4.46" \
#                   "torch==2.4.0" accelerate safetensors fastapi uvicorn
import io, torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
from fastapi import FastAPI
from fastapi.responses import Response

BASE = "stabilityai/stable-diffusion-xl-base-1.0"
REFINER = "stabilityai/stable-diffusion-xl-refiner-1.0"

base = StableDiffusionXLPipeline.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    REFINER, torch_dtype=torch.bfloat16,
    text_encoder_2=base.text_encoder_2, vae=base.vae,
    variant="fp16", use_safetensors=True,
).to("cuda")

# Hopper-tuned: SDPA attention is FA2 by default on torch>=2.4
base.unet = torch.compile(base.unet, mode="max-autotune", fullgraph=True)
refiner.unet = torch.compile(refiner.unet, mode="max-autotune", fullgraph=True)

app = FastAPI()

@app.post("/generate")
def generate(prompt: str, steps: int = 25, guidance: float = 7.0):
    latent = base(prompt=prompt, num_inference_steps=steps,
                  guidance_scale=guidance, denoising_end=0.8,
                  output_type="latent").images
    image = refiner(prompt=prompt, num_inference_steps=steps,
                    denoising_start=0.8, image=latent).images[0]
    buf = io.BytesIO(); image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

# Run with: uvicorn sdxl_server:app --host 0.0.0.0 --port 8000 --workers 1

Note: SDXL on H100 PCIe with torch.compile lands around 1.6 images/second at batch 1 (25 steps, 1024^2). H100 SXM5 is roughly 1.15x faster on the UNet but the PCIe SKU's lower hourly rate usually wins on cost-per-image. Verify with InferenceBench rather than assuming. For the last 1.5-2x on top of torch.compile, build a TensorRT 10 engine for the UNet (for example from an ONNX export with trtexec; trtllm-build targets LLM checkpoints, not diffusion UNets) and swap base.unet for the engine wrapper.

Sizing and capacity planning

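Training rule of thumb: 1 trillion training tokens x 70B parameters needs roughly 250-350 days of wall-clock time on a 64-GPU NVLink-domain cluster (about 16,000-22,000 H100-days in total) with Megatron-LM + Transformer Engine; by the 6 x parameters x tokens FLOPs estimate that corresponds to 22-30 % of dense BF16 peak.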
Memory ceiling for a single H100: weights + KV cache + activations + cuBLAS scratch < 78 GB. Above 78 GB, expect OOMs even with paged KV — drop precision, shrink context, or move to TP=2.
AllReduce overhead at TP=8 inside one HGX-H100: ~6-9 % of step time for 70B BF16; jumps to 25-40 % the moment a rank crosses to a second NVLink domain over InfiniBand.
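Replica count is aggregate output rate divided by per-replica TPS: 500 RPS at 4K output tokens per request (mid-context chat traffic) is about 2 million output tokens per second, which at the 1,000-1,300 TPS per Llama 3 70B FP8 replica listed below works out to roughly 1,600-2,000 H100 SXM5 replicas; size headroom for prefix-cache cold-start on rollouts.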
Spot/preemptible H100 capacity is viable for fine-tunes but not for production inference SLAs — eviction rates of 8-15 % per day are typical on hyperscaler spot.
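
The training rule of thumb can be sanity-checked with the widely used approximation that training costs about 6 FLOPs per parameter per token. The sketch below is an estimate under stated assumptions: the 989 TFLOPS dense BF16 peak comes from this entry, while the model FLOPs utilisation (MFU) values are hypothetical inputs.

```python
SECONDS_PER_DAY = 86_400
H100_BF16_DENSE_FLOPS = 989e12


def training_gpu_days(params: float, tokens: float,
                      peak_flops: float, mfu: float) -> float:
    """GPU-days from the 6*N*D approximation at a given utilisation."""
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / SECONDS_PER_DAY


if __name__ == "__main__":
    for mfu in (0.22, 0.30):  # hypothetical utilisation range
        gpu_days = training_gpu_days(70e9, 1e12, H100_BF16_DENSE_FLOPS, mfu)
        print(f"MFU {mfu:.0%}: {gpu_days:,.0f} GPU-days, "
              f"{gpu_days / 64:,.0f} wall-clock days on 64 GPUs")
```

Dividing GPU-days by the cluster size gives wall-clock time only if scaling is perfect, so real runs land above this floor.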

Model size	Precision	Context	GPUs per replica	TP / PP	Approx output TPS	Approx VRAM headroom
7B (Mistral, Qwen)	FP8	8K	1x H100	1 / 1	5,500-7,000	60 GB free
13B	FP8	8K	1x H100	1 / 1	3,800-4,800	50 GB free
34B (Yi, Codestral)	FP8	8K	1x H100	1 / 1	1,900-2,400	25 GB free
70B (Llama 3)	FP8	8K	1x H100	1 / 1	1,000-1,300	10-15 GB free
70B (Llama 3)	FP8	32K	2x H100	2 / 1	1,500-1,900	20 GB free per rank
70B (Llama 3)	FP8	128K	4x H100	4 / 1	1,700-2,200	12 GB free per rank
140B MoE (Mixtral 8x22B)	FP8	32K	2x H100	2 / 1	900-1,200	8 GB free per rank
180B (Falcon, Bloom)	FP8	8K	4x H100	4 / 1	600-800	15 GB free per rank
405B (Llama 3.1)	FP8	32K	8x H100	8 / 1	350-450	10 GB free per rank
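
For the 70B rows, the memory budget can be approximated from weights plus KV cache. This is a rough sketch: the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and the flat 70e9 parameter count are assumptions, the 78 GB usable ceiling is the figure quoted above, and activations and cuBLAS scratch are ignored, so real headroom is smaller.

```python
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
PARAMS = 70e9
USABLE_BYTES_PER_GPU = 78e9


def kv_bytes_per_token(bytes_per_elem: int = 1) -> int:
    """K and V for every layer; 1 byte per element for an FP8 KV cache."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_elem


def max_concurrent_sequences(context_len: int, tp: int = 1,
                             weight_bytes_per_param: int = 1) -> int:
    """Full-length sequences that fit per replica after weights are loaded."""
    weights_per_gpu = PARAMS * weight_bytes_per_param / tp
    kv_per_seq_per_gpu = kv_bytes_per_token() * context_len / tp
    free = USABLE_BYTES_PER_GPU - weights_per_gpu
    return max(0, int(free // kv_per_seq_per_gpu))


if __name__ == "__main__":
    print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    print("8K on 1x H100:", max_concurrent_sequences(8192, tp=1))
    print("32K on 2x H100:", max_concurrent_sequences(32768, tp=2))
```

Paged KV allocation lets an engine serve more sequences than this when most requests are shorter than the maximum context.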

Limits and quotas

Limit	Default	Ceiling	How to raise
AWS p5.48xlarge (8x H100) on-demand	0 vCPU baseline	Account-negotiated	Service Quotas -> 'Running On-Demand P instances'
AWS p5.48xlarge capacity-block reservation	0	Region-negotiated	EC2 Capacity Blocks for ML, 1-182 day windows
GCP a3-highgpu-8g region quota	0	Org-negotiated	Cloud Console -> IAM & Admin -> Quotas (NVIDIA_H100_GPUS)
Azure ND H100 v5 cores per region	0	Org-negotiated	Azure portal -> Subscriptions -> Usage + quotas
Kubernetes nvidia.com/gpu per pod	node-allocatable	Hardware limit	Node selector + `resources.limits[nvidia.com/gpu]`
NVLink-domain size (NVL Switch)	256 GPUs	Hardware limit	Span domains via InfiniBand; expect ~10x collective latency
MIG slices per H100	7 (max)	Hardware limit	Repartition; partition changes are destructive
NCCL message size in-flight	Drivers default	Cluster-tuned	`NCCL_MAX_NCHANNELS`, `NCCL_BUFFSIZE` tuning
Confidential Compute mode	Off	Per-card	Driver toggle + attestation service; one-way until reboot
TensorRT-LLM engine cache size	8 GB	Disk-bound	`TRTLLM_CACHE_DIR` + larger PV

Warning: On hyperscalers, a quota of 'zero' is the default for new accounts on every H100 SKU. File the quota uplift ticket weeks before you need capacity; in 2026, average grant time is 3-10 business days for the larger clouds.

Observability

DCGM_FI_DEV_GPU_UTIL: fraction of time at least one kernel was running on the GPU (not SM occupancy). Sustained < 60 % under load usually means dataloader stall, not under-provisioning.
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE — framebuffer (HBM) used and free, in MiB. Alert when free < 4 GB on inference replicas.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — fraction of cycles the Tensor Cores are busy. The honest 'is the GPU actually doing math' metric.
DCGM_FI_PROF_DRAM_ACTIVE — HBM bandwidth utilisation; pair with PIPE_TENSOR_ACTIVE to classify compute-bound vs memory-bound regimes.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL — NVLink throughput per GPU; sudden drops correlate with single-port link-down events.
DCGM_FI_DEV_GPU_TEMP / DCGM_FI_DEV_MEMORY_TEMP — die and HBM temperatures. Alert at die > 83 C (throttle threshold) and HBM > 95 C.
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL / DBE_VOL_TOTAL — single- and double-bit ECC error counts. Any non-zero double-bit error means quarantine the card.
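DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS / DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS: remapped HBM rows (Hopper uses row remapping in place of the page retirement that DCGM_FI_DEV_RETIRED_DBE / RETIRED_SBE report on older GPUs); a steady climb predicts card failure within weeks.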
DCGM_FI_DEV_POWER_USAGE / DCGM_FI_DEV_POWER_VIOLATION — watts drawn and milliseconds throttled by power cap.
DCGM_FI_DEV_THERMAL_VIOLATION — milliseconds throttled by thermal limit; correlate spikes with rack inlet temperature.

# Prometheus alert rules — H100 production fleet
groups:
- name: h100-health
  interval: 30s
  rules:
  - alert: H100ThermalThrottle
    expr: rate(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 2m
    labels: { severity: warning }
    annotations:
      summary: "H100 {{ $labels.gpu }} thermal throttling"
      runbook: https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#thermal-management

  - alert: H100ECCDoubleBit
    expr: increase(DCGM_FI_DEV_ECC_DBE_VOL_TOTAL[1h]) > 0
    labels: { severity: critical }
    annotations:
      summary: "DBE ECC error on H100 {{ $labels.gpu }} — quarantine card"

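  - alert: H100NVLinkDown
    # This is a throughput metric, so it also reads low when the fabric is idle.
    # Scope the rule to nodes running collective-heavy jobs to avoid false alarms.
    expr: DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL < 700e9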
    for: 5m
    labels: { severity: critical }
    annotations:
      summary: "H100 {{ $labels.gpu }} NVLink degraded below 700 GB/s"

  - alert: H100HBMNearFull
    expr: DCGM_FI_DEV_FB_FREE < 4096
    for: 10m
    labels: { severity: warning }

  - alert: H100TensorIdle
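    # Join on the exporter's UUID label so GPU 0 on one node is never
    # matched against GPU 0 on another node.
    expr: avg_over_time(DCGM_FI_PROF_PIPE_TENSOR_ACTIVE[15m]) < 0.30
      and on(UUID) DCGM_FI_DEV_GPU_UTIL > 70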
    for: 15m
    labels: { severity: info }
    annotations:
      summary: "GPU busy but Tensor Cores idle — dataloader or kernel inefficiency"

Tip: PIPE_TENSOR_ACTIVE is the single most useful Hopper signal. A replica showing 90 % DCGM_FI_DEV_GPU_UTIL but 25 % PIPE_TENSOR_ACTIVE has kernels resident almost all the time yet is spending most cycles on memory operations rather than matrix math, usually because of a dataloader bottleneck or an unfused kernel path. Fix that before adding GPUs.
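
Pairing the three signals gives a coarse regime classifier. The sketch below is illustrative only: the 60 % utilisation cut-off comes from the metric list above, while the 0.5 thresholds on the profiling ratios and the label names are hypothetical and should be tuned per workload.

```python
def classify_regime(gpu_util_pct: float, tensor_active: float,
                    dram_active: float) -> str:
    """Classify a replica from DCGM_FI_DEV_GPU_UTIL (0-100) and the
    PIPE_TENSOR_ACTIVE / DRAM_ACTIVE ratios (0.0-1.0)."""
    if gpu_util_pct < 60:
        return "input-stalled"        # GPU is waiting for work
    if tensor_active >= 0.5:
        return "compute-bound"        # Tensor Cores are the limiter
    if dram_active >= 0.5:
        return "memory-bound"         # HBM bandwidth is the limiter
    return "inefficient-kernels"      # busy, but neither pipe is saturated


if __name__ == "__main__":
    print(classify_regime(90, 0.25, 0.70))
```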

Cost and FinOps

Cost-per-million-output-tokens on Llama 3 70B FP8, 1x H100 SXM5 at $2.00/GPU-hr and 1,100 TPS sustained: roughly $0.50 per million tokens before margin.
Switching from FP16 to FP8 with the Transformer Engine yields about 1.6x throughput on H100 SXM5 at iso-context, a 38 % drop in cost-per-token (1 - 1/1.6).
Reserving 3 years cuts effective $/GPU-hr roughly in half versus on-demand; only commit when steady-state utilisation exceeds 65 %.
Idle replicas are the dominant overspend pattern in production inference fleets — set minReplicas: 0 on non-critical endpoints with cold-start tolerance.
Egress and inter-region data movement frequently exceed 10 % of total H100 bill at hyperscalers — collocate model artefacts with compute.
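
The first two bullets are plain arithmetic, reproduced in the sketch below so the same calculation can be rerun with other prices and throughputs. The function names are made up for this example.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float, gpus: int = 1) -> float:
    """Dollars per one million output tokens for a replica of `gpus` GPUs."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour * gpus / tokens_per_hour * 1_000_000


def cost_drop_from_speedup(speedup: float) -> float:
    """Fractional cost-per-token reduction at the same hourly price."""
    return 1 - 1 / speedup


if __name__ == "__main__":
    print(f"${cost_per_million_tokens(2.00, 1100):.3f} per million tokens")
    print(f"{cost_drop_from_speedup(1.6):.1%} cheaper per token with FP8")
```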

Provider class	SKU	On-demand $/GPU-hr	1y reserved	3y reserved	Notes
Hyperscaler (AWS/GCP/Azure)	H100 SXM5	$2.50-3.00	$1.60-2.10	$1.10-1.60	Best for hybrid stacks; data-egress costs matter.
Hyperscaler	H100 PCIe	$1.80-2.40	$1.20-1.60	$0.85-1.20	Fewer regions; not all instances support NVLink.
Tier-1 neocloud	H100 SXM5	$1.80-2.40	$1.40-1.80	$1.00-1.40	Commonly cheapest at scale; verify NVLink topology.
Tier-2 neocloud	H100 SXM5	$1.40-1.90	$1.10-1.50	$0.85-1.20	Best raw rate; expect more variance in IB topology.
Spot/preemptible	H100 SXM5	$0.90-1.60	n/a	n/a	8-15 % eviction/day; fine-tunes only.
Yobitel NeoCloud (UK + EU)	H100 SXM5	$1.90-2.40	$1.40-1.80	$1.00-1.40	NCSC OFFICIAL-aligned regions; FOCUS-conformant billing.
Yobitel Omniscient Compute	H100 SXM5 multi-cloud	Market-clearing	Commit-discounted	Commit-discounted	Cross-provider arbitrage on top of NeoCloud + partner capacity.

Note: All cost figures land on the FinOps Foundation FOCUS billing spec when consumed via Yobitel: ServiceName=AcceleratorCompute, ChargeCategory=Usage, SkuId=gpu.h100.sxm5. This is what makes cross-provider arbitrage and cost attribution tractable at scale.

Troubleshooting

Operational issues we see most often on H100 fleets, ranked by frequency. Each has a definitive diagnosis and a fix path.

Error / symptom	Likely cause	Fix
GPU clocks throttling, die > 83 C	Thermal throttling — inlet > 27 C or coolant under-flow	Verify rack inlet temp, coolant supply temp, secondary loop dT; drop power cap to 600 W if persistent.
NCCL AllReduce hangs at job start	Missing or stale NCCL topology file on heterogeneous NVLink+IB cluster	Generate with `nccl-topo-dump`; set `NCCL_TOPO_FILE=/etc/nccl/topo.xml`; verify with `NCCL_DEBUG=INFO`.
`CUDA_ERROR_OUT_OF_MEMORY` on inference start	Batch x context too large for 80 GB after weights + KV cache + cuBLAS scratch	Reduce `max_model_len`, set `gpu_memory_utilization=0.88`, switch KV cache to `fp8_e5m2`, or move to TP=2.
Single NVLink port down — DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL drops to ~850 GB/s	Mezzanine reseat needed or cold solder joint	`nvidia-smi nvlink --status`; drain node; reseat module; if persistent, RMA.
MIG misconfiguration: `nvidia-smi mig -lgi` shows partial slices	Previous workload exited without releasing compute instances	Destroy leftover compute instances with `nvidia-smi mig -dci -gi <id>`, then the GPU instance with `nvidia-smi mig -dgi -gi <id>`, then re-create. MIG repartitioning is destructive, so drain workloads first.
ECC double-bit error in dmesg	HBM defect	Immediately quarantine card. Drain workloads, mark node unschedulable, RMA. Do not redeploy until replaced.
First-token latency 5-10x higher than steady state	Cold KV cache and engine warm-up	Enable prefix caching and pre-warm replicas with synthetic traffic on rollout, using a warm-up option such as `--num-warmup-requests` where your serving stack provides one.
Training step time 2-4x expected	Tensor-parallel group spans HGX baseboards, so collectives run over IB instead of NVLink	Pin each replica to a single HGX baseboard; on K8s use an NVLink-topology-aware scheduler; confirm which transport NCCL selected (NVLink/P2P vs. network, PXN) in the `NCCL_DEBUG=INFO` log.
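FP8 training loss spikes after 1k-10k steps	Activation amax history saturated, so the delayed scaling factor is miscalibrated	Increase the TE amax history length (`amax_history_len` in the `DelayedScaling` recipe; exposed as `fp8_amax_history_len` in some frameworks), sanity-check `fp8_format=HYBRID`, reduce the learning rate, or fall back to BF16 on the offending layer.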
`nvidia-smi` shows 100 % util but tokens/sec is flat	Dataloader-bound: `nvidia-smi` utilisation only shows that a kernel was resident, while the Tensor Cores sit idle	Check `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE`; increase dataloader workers, switch to prefetched Parquet shards, and enable pinned (page-locked) host memory.
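`CUDA_ERROR_ECC_UNCORRECTABLE` on a single rank	HBM row-remapping resources running out (Hopper remaps faulty rows rather than retiring pages)	Run `nvidia-smi -q -d ROW_REMAPPER`; if the remapped-row count exceeds 50 or a remapping failure is reported, the card is end-of-life: RMA.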
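
Two rows in the table come down to simple arithmetic, and several others start from the same read-only `nvidia-smi` queries. The sketch below bundles both. It is illustrative rather than a supported tool: the function names and the `./h100-triage` output directory are hypothetical, the NVLink figure assumes 900 GB/s spread evenly over 18 ports, and the KV-cache estimate assumes the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128).

```shell
#!/usr/bin/env bash
# Triage sketch for the table above. All helper names are hypothetical.

# Expected aggregate NVLink bandwidth in GB/s with N healthy ports,
# assuming 900 GB/s over 18 ports (50 GB/s per port).
expected_nvlink_gbps() {
  local healthy_ports=$1
  echo $(( 900 * healthy_ports / 18 ))
}

# KV-cache size in MiB for ONE sequence of Llama 3 70B (assumed shape:
# 80 layers, 8 KV heads, head_dim 128). K and V are both stored, hence 2x.
# $1 = context length in tokens, $2 = bytes per element (1 = FP8, 2 = BF16).
kv_cache_mib() {
  local tokens=$1 bytes_per_elem=$2
  local per_token=$(( 2 * 80 * 8 * 128 * bytes_per_elem ))
  echo $(( per_token * tokens / 1024 / 1024 ))
}

# Save the read-only diagnostics used in the table into one directory.
collect_h100_diagnostics() {
  local out_dir=${1:-./h100-triage}
  mkdir -p "$out_dir"
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; skipping GPU queries" >&2
    return 0
  fi
  nvidia-smi nvlink --status             > "$out_dir/nvlink.txt"       2>&1 || true
  nvidia-smi -q -d ROW_REMAPPER          > "$out_dir/row_remapper.txt" 2>&1 || true
  nvidia-smi -q -d ECC,TEMPERATURE,POWER > "$out_dir/health.txt"       2>&1 || true
  nvidia-smi mig -lgi                    > "$out_dir/mig.txt"          2>&1 || true
}

expected_nvlink_gbps 17   # one port down -> 850
kv_cache_mib 32768 1      # FP8 KV cache, 32K context -> 5120 (MiB per sequence)
kv_cache_mib 32768 2      # BF16 KV cache, 32K context -> 10240
```

The KV-cache figure is per sequence and before tensor parallelism; with TP=2 each GPU holds roughly half. Multiplying it by the expected number of concurrent sequences gives a first estimate of whether a given `max_model_len` fits beside the weights.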

Where this fits in the Yobitel stack

References

NVIDIA H100 Tensor Core GPU Datasheet · NVIDIA
Hopper Architecture Whitepaper · NVIDIA
NVLink Switch System Specification · NVIDIA
Transformer Engine User Guide · NVIDIA
DCGM Field Identifiers (Prometheus exporter) · NVIDIA
Confidential Compute on NVIDIA H100 · NVIDIA
vLLM FP8 quantisation on Hopper · vLLM
TensorRT-LLM Hopper engines · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC

