TL;DR
- Refresh of GB200 that swaps the two B200 GPUs for two B300 (Blackwell Ultra) GPUs while retaining the Grace CPU and NVLink-C2C topology.
- Per-module HBM3e capacity climbs to roughly 576 GB; FP4 throughput uplifts ~1.5× over GB200.
- Designed to slot into NVL72 rack designs with minimal rework — same compute-tray footprint, refreshed liquid-cooling specifications.
- Targets reasoning-model inference and trillion-parameter MoE training where KV-cache pressure dominates step time.
Overview#
GB300 is the Blackwell Ultra equivalent of GB200. Architecturally it is the same Grace + 2× Blackwell module pattern, with the GPUs upgraded to B300 silicon and 12-high HBM3e stacks. NVIDIA positions the resulting rack as 'GB300 NVL72', a drop-in upgrade path for operators already invested in the NVL72 form factor.
The motivation is the same as B300 versus B200 at the GPU level — HBM capacity and FP4 throughput are the binding constraints on reasoning-model economics. Doubling KV-cache headroom per module is what lets a single NVL72 rack host larger, longer-context reasoning replicas without spanning multiple racks over InfiniBand.
Specifications vs GB200#
| Metric | GB300 module | GB200 module |
|---|---|---|
| GPUs | 2× B300 (Blackwell Ultra) | 2× B200 |
| CPU | 1× Grace (72 cores) | 1× Grace (72 cores) |
| GPU memory | ~576 GB HBM3e (12-high) | 384 GB HBM3e |
| FP4 (Tensor, sparse) | ~54,000 TFLOPS | ~36,000 TFLOPS |
| NVLink | 1.8 TB/s (5.0) | 1.8 TB/s (5.0) |
| NVLink-C2C | 900 GB/s coherent | 900 GB/s coherent |
| Module TDP | ~3,600 W | ~2,700 W |
| Rack | GB300 NVL72 | GB200 NVL72 |
Several GB300 specifications were still being finalised as the part ramped in 2026. Memory and FP4 figures are NVIDIA's stated targets; the qualitative story — '1.5× HBM3e and FP4 over GB200 at higher TDP' — is the load-bearing claim.
Why a Refresh Now#
Reasoning models reshaped the inference cost curve through 2025. A reasoning replica spends most of its FLOPs emitting internal chain-of-thought tokens that the user never sees; the KV cache grows linearly with chain length, and per-replica memory becomes the dominant capacity unit. GB300 exists because the GB200 KV-cache budget was the production bottleneck for frontier reasoning deployments.
The CPU side is unchanged because Grace was not the constraint. Dataloading, expert routing, and scheduler overhead all still fit comfortably inside the existing 72-core, 480 GB LPDDR5X envelope.
When to Pick GB300#
- Frontier reasoning-model inference at production scale where KV cache pressure binds.
- Trillion-parameter MoE training where expert KV state at long sequence lengths exceeds GB200's HBM budget.
- Greenfield NVL72 builds where the cooling envelope can be upgraded to absorb higher per-module TDP.
- Pick GB200 NVL72 if the workload comfortably fits 384 GB per module and supply or cost dominate.
- Pick discrete B300 HGX if the unit of work is 8 GPUs, not 72.
Operational Notes#
- Cooling: per-module TDP increases ~30 % over GB200; loop temperatures and flow rates need re-validation.
- Power distribution: rack-level draw approaches 140+ kW depending on configuration.
- Backwards compatibility: most NVL72 racks accept GB300 trays with firmware updates, but vendor confirmation is required.
- Software stack matches GB200 — CUDA 12.4+, NCCL 2.21+, NeMo / Megatron-Core paths, Mission Control rack manager.
Software Ecosystem#
Identical to GB200. The only configuration that needs revisiting is per-replica memory budgets — vLLM, TensorRT-LLM and NeMo defaults sized for 384 GB modules will under-utilise GB300 capacity. Reasoning-model inference recipes benefit most from explicit chain-of-thought batching policies that exploit the larger KV-cache headroom.