TL;DR
- Performance-tier fifth-generation TPU launched late 2023 — 459 TFLOPS BF16, 95 GB HBM per chip.
- Pods scale to 8,960 chips linked by ICI v4 in an OCS-reconfigured 3D torus.
- Used for Gemini training and other frontier Google work.
- Available on Google Cloud; competitive with H100-class GPU pods on cost/performance for very large training runs.
Overview
TPU v5p is the performance-tier fifth-generation TPU. Launched in late 2023, it raised per-chip BF16 throughput from v4's 275 TFLOPS to 459 TFLOPS (about 1.7×), nearly tripled HBM from 32 GB to 95 GB per chip, and scaled pods to 8,960 chips on an OCS-reconfigured 3D-torus fabric.
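To make the 3D-torus fabric concrete, the following sketch computes a chip's neighbours and the hop count between two chips on a torus. It assumes an illustrative 16 × 20 × 28 shape (16 · 20 · 28 = 8,960) with wraparound links on every axis; the slice shapes Google Cloud actually offers may differ, and the function names are hypothetical.

```python
import math

# Illustrative torus shape whose product is 8,960 chips.
# Assumption: real v5p slice shapes are chosen by Google Cloud and may differ.
DIMS = (16, 20, 28)


def torus_neighbours(coord, dims=DIMS):
    """Return the six wraparound neighbours of `coord` in a 3D torus.

    Assumes every ring has at least 3 chips, so the -1 and +1 neighbours differ.
    """
    neighbours = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (coord[axis] + step) % size  # wraparound link
            neighbours.append(tuple(nxt))
    return neighbours


def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between chips a and b, taking the shorter way round each ring."""
    return sum(min(abs(x - y), n - abs(x - y)) for x, y, n in zip(a, b, dims))


def torus_diameter(dims=DIMS):
    """Worst-case hop count: half of each ring, summed over the three axes."""
    return sum(n // 2 for n in dims)


if __name__ == "__main__":
    print(math.prod(DIMS))                       # 8960
    print(len(torus_neighbours((0, 0, 0))))      # 6
    print(torus_hops((0, 0, 0), (15, 19, 27)))   # 3: one wraparound hop per axis
    print(torus_diameter())                      # 32
```

With wraparound, the worst-case path on this shape is 32 hops, against 61 (15 + 19 + 27) on an open 3D mesh of the same size.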
Google has trained Gemini and other frontier models on v5p pods. For external customers, v5p is the TPU for training jobs that justify the largest pod commitments, typically models with tens of billions of parameters or more.
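One rough way to see which jobs need many v5p chips is to count how many are required just to hold fully sharded training state. The sketch assumes mixed-precision Adam at 16 bytes per parameter and a hypothetical 80% usable-HBM fraction, and it ignores activations; both numbers are assumptions to adjust, not v5p measurements.

```python
import math

HBM_PER_CHIP_GB = 95  # TPU v5p per-chip HBM, from the specification table

# Assumed bytes per parameter for mixed-precision Adam training:
# bf16 weights (2) + bf16 gradients (2) + fp32 master weights (4)
# + fp32 Adam first moment (4) + fp32 Adam second moment (4) = 16.
# Activations, which depend on batch size and rematerialisation, are ignored.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def min_chips_for_state(n_params, bytes_per_param=BYTES_PER_PARAM,
                        hbm_gb=HBM_PER_CHIP_GB, usable_fraction=0.8):
    """Lower bound on chips needed just to hold fully sharded training state.

    `usable_fraction` is a hypothetical headroom factor for activations,
    XLA buffers and fragmentation; tune it for a real workload.
    """
    total_bytes = n_params * bytes_per_param
    usable_bytes_per_chip = hbm_gb * 1e9 * usable_fraction
    return math.ceil(total_bytes / usable_bytes_per_chip)


for n in (7e9, 70e9, 400e9):
    print(f"{n / 1e9:>5.0f}B params -> at least {min_chips_for_state(n)} chips")
```

By this estimate, even a 400B-parameter model's state fits in fewer than 100 chips before activations are counted, so the chip count of very large jobs is usually set by the compute budget as much as by memory capacity.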
Specifications
| Metric | TPU v5p |
|---|---|
| Peak BF16 compute (per chip) | 459 TFLOPS |
| Peak INT8 compute (per chip) | 918 TOPS |
| HBM capacity (per chip) | 95 GB |
| HBM bandwidth (per chip) | 2.77 TB/s |
| Inter-chip link | ICI v4 |
| Maximum pod size | 8,960 chips |
| Pod fabric | 3D torus, OCS-reconfigured |
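The per-chip figures in the specification table combine into two useful derived numbers: full-pod peak capacity, and the roofline ridge point (peak FLOPs divided by HBM bandwidth). The sketch below uses only the table's values; the idealised matmul traffic model (each operand read once, the output written once) is a simplifying assumption.

```python
# Per-chip figures from the specification table.
BF16_TFLOPS = 459
HBM_GB = 95
HBM_TBPS = 2.77
POD_CHIPS = 8_960

# Aggregate peak for a full pod (peak, not sustained: real jobs reach a fraction of this).
pod_bf16_eflops = POD_CHIPS * BF16_TFLOPS / 1e6   # TFLOPS -> EFLOPS
pod_hbm_tb = POD_CHIPS * HBM_GB / 1e3             # GB -> TB

# Roofline ridge point: arithmetic intensity (FLOPs per HBM byte) above which
# a kernel can, in principle, be compute-bound rather than bandwidth-bound.
ridge_flops_per_byte = (BF16_TFLOPS * 1e12) / (HBM_TBPS * 1e12)


def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per HBM byte for an (m x k) @ (k x n) matmul under ideal reuse.

    Assumes A and B are each read once and C is written once (bf16 = 2 bytes).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


print(f"pod BF16 peak: {pod_bf16_eflops:.2f} EFLOPS")
print(f"pod HBM:       {pod_hbm_tb:.1f} TB")
print(f"BF16 ridge:    {ridge_flops_per_byte:.0f} FLOPs/byte")
for size in (256, 1024, 4096):
    ai = matmul_intensity(size, size, size)
    bound = "compute" if ai > ridge_flops_per_byte else "bandwidth"
    print(f"{size}^3 matmul: {ai:.0f} FLOPs/byte ({bound}-bound)")
```

For a square BF16 matmul the intensity is about n / 3, so on these numbers dimensions need to reach roughly 500 before the matmul can be compute-bound; smaller ones are limited by HBM bandwidth.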
When to Pick v5p
- Frontier training on Google Cloud where pod scale and HBM per chip dominate.
- Workloads already invested in JAX + MaxText / Pax.
- Production training jobs that have outgrown v4 where Trillium (v6e) capacity is limited.
- Consider Trillium (v6e) instead when per-chip throughput matters more than HBM capacity: it offers higher peak compute per chip but less HBM (32 GB) than v5p.
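Whether pod scale dominates a job can be judged with a back-of-the-envelope wall-clock estimate. This sketch uses the common approximation of about 6 · N · D training FLOPs for a dense transformer with N parameters and D tokens; the 40% model FLOPs utilisation (MFU) and the 70B-parameter, 2T-token job are hypothetical inputs, not v5p benchmarks.

```python
# Rough wall-clock estimate from the ~6 * N * D dense-transformer FLOPs rule.
# Assumption: `mfu` (model FLOPs utilisation) is an input you must supply;
# 0.4 is an illustrative default, not a measured v5p figure.
BF16_PEAK_FLOPS = 459e12  # per v5p chip, from the specification table
SECONDS_PER_DAY = 86_400


def training_days(n_params, n_tokens, n_chips, mfu=0.4,
                  peak_flops=BF16_PEAK_FLOPS):
    """Estimated days to train on n_tokens with n_chips at the given MFU."""
    total_flops = 6 * n_params * n_tokens
    sustained_flops = n_chips * peak_flops * mfu
    return total_flops / sustained_flops / SECONDS_PER_DAY


# Hypothetical job: 70B parameters, 2T tokens.
for chips in (1024, 4096, 8960):
    print(f"{chips:>5} chips: {training_days(70e9, 2e12, chips):.1f} days")
```

Time scales inversely with chip count only under the assumption that MFU holds constant as the slice grows; in practice communication overhead usually lowers MFU at larger scales.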
Pitfalls#
- Google Cloud only.
- JAX-first; PyTorch/XLA support can lag JAX substantially in features and performance.
- Scheduling very large pod slices requires Google Cloud's TPU reservation model.
- XLA compilation overhead can extend iteration cycles relative to PyTorch+CUDA.
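How much compilation overhead hurts depends on how many steps reuse one compiled program. The sketch below assumes a single compile followed by identical steps (with `jax.jit`, a change in input shapes, dtypes or static arguments triggers a recompile, which this model ignores); the 120 s compile and 2 s step times are hypothetical.

```python
def compile_overhead_fraction(compile_s, step_s, n_steps):
    """Fraction of wall-clock time spent compiling.

    Assumes one compile, then n_steps identical steps with no recompiles.
    """
    total_s = compile_s + step_s * n_steps
    return compile_s / total_s


# Hypothetical timings: 120 s to compile, 2 s per training step.
for steps in (10, 1_000, 100_000):
    frac = compile_overhead_fraction(120, 2, steps)
    print(f"{steps:>7} steps: {frac:.2%} of time spent compiling")
```

Short debugging runs are dominated by compile time while long production runs barely notice it, which is why the overhead shows up mainly in iteration speed; JAX's persistent compilation cache can help avoid repeat compiles across runs.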
Software Notes
v5p runs the same JAX + XLA stack as v4 and v5e, with v5p-specific optimisations. Pallas, JAX's kernel-authoring extension, exposes low-level kernel programming. MaxText provides reference Llama and Gemma training pipelines for v5p.
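In JAX, work is spread across chips by sharding arrays over a named device mesh (`jax.sharding.Mesh` with `NamedSharding` and `PartitionSpec`). The pure-Python sketch below reproduces only the shape arithmetic of that scheme, so it runs without JAX or TPUs; `shard_shape` is a hypothetical helper, the 64 × 8 mesh and array shapes are illustrative, and unlike JAX it requires every sharded dimension to divide evenly.

```python
import math

# Hypothetical helper mirroring the shape arithmetic of
# NamedSharding(mesh, PartitionSpec(...)): each array dimension is split
# across the product of the mesh axes named for it; unnamed dims are replicated.
def shard_shape(global_shape, mesh_shape, spec):
    """Per-device shard shape.

    global_shape: tuple of ints.
    mesh_shape: dict mapping mesh axis name -> size.
    spec: per-dimension entries, each None, an axis name, or a tuple of names;
          missing trailing entries are treated as None (replicated).
    """
    if len(spec) > len(global_shape):
        raise ValueError("spec has more entries than the array has dimensions")
    spec = tuple(spec) + (None,) * (len(global_shape) - len(spec))
    out = []
    for dim, axes in zip(global_shape, spec):
        if axes is None:
            out.append(dim)  # replicated along this dimension
            continue
        if isinstance(axes, str):
            axes = (axes,)
        ways = math.prod(mesh_shape[a] for a in axes)
        if dim % ways:
            raise ValueError(f"dimension {dim} is not divisible by {ways}")
        out.append(dim // ways)
    return tuple(out)


mesh = {"data": 64, "model": 8}  # illustrative 512-device mesh
print(shard_shape((8192, 32768), mesh, (None, "model")))           # weight split on its columns
print(shard_shape((1024, 2048, 8192), mesh, ("data",)))            # batch split over data
print(shard_shape((32768, 8192), mesh, (("data", "model"), None)))  # fully sharded rows
```

The same global array can be laid out very differently just by changing the spec, which is the lever frameworks such as MaxText use to trade memory per chip against communication.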