TL;DR
- Volta-generation data centre GPU launched May 2017 — first GPU with dedicated Tensor Cores.
- 16 / 32 GB HBM2 at 900 GB/s with first-generation Tensor Cores delivering 125 TFLOPS FP16.
- Trained RoBERTa, Megatron-LM, GPT-3 and much of the 2018-2020 large-model literature.
- Mainstream support is ending: CUDA 13 drops Volta, and frameworks that still run on it increasingly skip Volta in optimised paths.
Overview
V100 is the GPU that introduced Tensor Cores to data centres and made transformer training practical. Volta added dedicated FP16 matrix-multiply units (Tensor Cores) inside each SM, alongside the conventional CUDA cores. The resulting 125 TFLOPS of peak FP16 throughput was roughly six times the 21.2 TFLOPS FP16 peak of Pascal's P100, and it is why RoBERTa, Megatron-LM, GPT-3 and much of the 2018-2020 transformer literature were trained on V100 clusters.
By 2026 V100 is a legacy platform. CUDA 12.x toolkits still support it (CUDA 13 drops Volta), and many cloud providers still list V100 instances, but new training runs almost never target Volta. The card remains relevant for educational use, cost-sensitive inference of smaller models, and as a baseline for performance comparisons.
Specifications
| Metric | V100 SXM2 32 GB | V100 PCIe 32 GB |
|---|---|---|
| Architecture | Volta (GV100) | Volta (GV100) |
| Process | TSMC 12 nm FFN | TSMC 12 nm FFN |
| FP64 | 7.8 TFLOPS | 7 TFLOPS |
| FP32 | 15.7 TFLOPS | 14 TFLOPS |
| FP16 (Tensor) | 125 TFLOPS | 112 TFLOPS |
| Memory | 32 GB HBM2 | 32 GB HBM2 |
| Memory bandwidth | 900 GB/s | 900 GB/s |
| TDP | 300 W | 250 W |
| NVLink | 300 GB/s (2.0) | Not supported |
| PCIe | Gen3 x16 (32 GB/s) | Gen3 x16 (32 GB/s) |
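The two headline figures in the table, peak Tensor Core throughput and HBM2 bandwidth, can be combined into a rough roofline estimate. The sketch below is an illustration rather than anything from the datasheet: it assumes the simple roofline model (attainable throughput is the lesser of peak compute and arithmetic intensity times bandwidth), and the function names are hypothetical.

```python
# Peak figures taken from the V100 SXM2 column of the table above.
PEAK_FP16_TENSOR_FLOPS = 125e12  # FLOP/s
PEAK_HBM2_BYTES_PER_S = 900e9    # bytes/s


def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOP per byte) at which a kernel stops being memory-bound."""
    return peak_flops / peak_bandwidth


def attainable_flops(
    intensity: float,
    peak_flops: float = PEAK_FP16_TENSOR_FLOPS,
    peak_bandwidth: float = PEAK_HBM2_BYTES_PER_S,
) -> float:
    """Simple roofline bound: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_flops, intensity * peak_bandwidth)


if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_FP16_TENSOR_FLOPS, PEAK_HBM2_BYTES_PER_S):.1f} FLOP/byte")
    for intensity in (10, 100, 1000):
        print(intensity, attainable_flops(intensity) / 1e12, "TFLOPS")
```

Under these assumptions a kernel needs on the order of 139 FLOP per byte of HBM2 traffic before the Tensor Cores, rather than memory bandwidth, become the limit. Real kernels land below both ceilings.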
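Volta Tensor Cores accept only FP16 inputs (accumulating in FP16 or FP32): there is no BF16, INT8 or FP8 Tensor Core path. Because FP16 has a narrow exponent range, mixed-precision training requires loss scaling to keep small gradients from underflowing to zero.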
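To see why loss scaling matters for FP16, the following minimal sketch rounds a small gradient to half precision with and without a scale factor. It uses only the Python standard library (`struct` format `e` is IEEE 754 half precision). The helper names and the static scale of 2^16 are assumptions chosen for illustration, and Python's 64-bit float stands in for the higher-precision master copy. Production frameworks adjust the scale dynamically and skip steps that overflow.

```python
import struct

LOSS_SCALE = 2.0 ** 16  # hypothetical static scale, chosen for illustration


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def grad_without_scaling(grad: float) -> float:
    """Gradient as it would be stored in FP16 with no loss scaling."""
    return to_fp16(grad)


def grad_with_scaling(grad: float, scale: float = LOSS_SCALE) -> float:
    """Scale before the FP16 round trip, then unscale in higher precision."""
    scaled_fp16 = to_fp16(grad * scale)  # value held in FP16 during backward
    return scaled_fp16 / scale           # unscaled before the optimiser step


if __name__ == "__main__":
    tiny_grad = 1e-8  # below the smallest FP16 subnormal (about 6e-8)
    print(grad_without_scaling(tiny_grad))  # underflows to zero
    print(grad_with_scaling(tiny_grad))     # close to the original value
```

The unscaled gradient is lost entirely, while the scaled one survives with a relative error well under 0.1%. Multiplying the loss by the scale factor has the same effect on every gradient in the backward pass.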
Why V100 Mattered
Before Volta, mixed-precision training was a research curiosity. Volta's Tensor Cores plus NVIDIA's Apex library (whose automatic mixed precision was later upstreamed into PyTorch as torch.cuda.amp, with autocast and GradScaler) made FP16 training routine. Combined with HBM2 capacity and NVLink 2.0, V100 was the first data centre GPU that could realistically train models at BERT-large scale in days rather than months.
The resulting 'V100 ecosystem' — DGX-1, Cirrascale, AWS p3, GCP V100 nodes — is the substrate the modern AI infrastructure industry grew out of.
When V100 Still Makes Sense
- Educational clusters and research environments where amortised cost dominates.
- Small-model inference (sub-3B) where Volta tensor throughput is adequate.
- Legacy training pipelines that cannot be re-baselined onto newer hardware.
- Replicable baseline experiments tied to original V100 results.
- Not for new deployments: there, A100 or L40S is almost always the better choice.
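The "sub-3B" guideline for inference can be sanity-checked with back-of-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not a benchmark: weights are stored in FP16 (2 bytes per parameter), batch-1 decoding reads every weight once per generated token, and KV cache, activations and kernel inefficiency are ignored. The function names are hypothetical.

```python
def weight_gb(params: float, bytes_per_param: int = 2) -> float:
    """Size of the model weights in GB (FP16 by default)."""
    return params * bytes_per_param / 1e9


def decode_tokens_per_s_upper_bound(
    params: float,
    bandwidth_gb_s: float = 900.0,  # V100 HBM2 bandwidth from the spec table
    bytes_per_param: int = 2,
) -> float:
    """Bandwidth-bound ceiling for batch-1 decoding: one full weight read per token."""
    return bandwidth_gb_s / weight_gb(params, bytes_per_param)


if __name__ == "__main__":
    params = 3e9
    print(f"{weight_gb(params):.1f} GB of FP16 weights")
    print(f"{decode_tokens_per_s_upper_bound(params):.0f} tokens/s upper bound")
```

Under these assumptions a 3B-parameter model needs about 6 GB for weights, which fits comfortably in either the 16 GB or 32 GB card, and the bandwidth ceiling is about 150 tokens per second at batch size 1. Measured throughput will be lower.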
Pitfalls
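- No BF16: Llama-family and most modern LLM training stacks expect BF16; V100 forces FP16 with loss scaling, which adds operational fragility.
- PCIe Gen3 x16 delivers about 16 GB/s per direction, a quarter of the roughly 64 GB/s of Gen5 x16 on current cards, so host-to-device transfers become a bottleneck sooner.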
- Ageing HBM2 stacks are more failure-prone; second-hand V100s often arrive with degraded memory.
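- Driver and toolkit lifecycle: CUDA 12.x still supports Volta, but CUDA 13 drops it, and newer hardware features (BF16 and FP8 Tensor Cores, Ampere's asynchronous copies, Hopper's Tensor Memory Accelerator) are unavailable.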
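The PCIe limitation in the list above can be put into numbers. This sketch computes the theoretical per-direction bandwidth of an x16 link from the signalling rate and the 128b/130b line encoding used by PCIe Gen3 through Gen5. The 32 GB/s in the specification table is the nominal bidirectional figure. The helper names are hypothetical, and real transfers achieve less than these ceilings.

```python
def pcie_gb_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s with 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8


def transfer_seconds(gigabytes: float, gt_per_s: float, lanes: int = 16) -> float:
    """Best-case time to move a payload across the link in one direction."""
    return gigabytes / pcie_gb_per_s(gt_per_s, lanes)


if __name__ == "__main__":
    gen3, gen5 = 8.0, 32.0  # GT/s per lane
    print(f"Gen3 x16: {pcie_gb_per_s(gen3):.2f} GB/s")
    print(f"Gen5 x16: {pcie_gb_per_s(gen5):.2f} GB/s")
    # Filling a 32 GB V100 from host memory, best case:
    print(f"32 GB over Gen3: {transfer_seconds(32, gen3):.2f} s")
    print(f"32 GB over Gen5: {transfer_seconds(32, gen5):.2f} s")
```

The fourfold gap matters most for workloads that stream data or swap weights from host memory. A model that stays resident in HBM2 is much less affected.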
Software Notes
PyTorch, TensorFlow and JAX all still treat V100 as a supported target. Some inference servers (vLLM, SGLang) gate certain kernels on compute capability: Volta (7.0) misses paths that require Ampere (8.0) or Hopper (9.0), but core inference works.
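Gating on compute capability usually amounts to a tuple comparison. The sketch below is a simplified illustration, not the logic of any particular framework or server: it maps a `(major, minor)` capability to the Tensor Core input types introduced at each generation and picks a training dtype accordingly. Both function names are hypothetical. In PyTorch the tuple could come from `torch.cuda.get_device_capability()`.

```python
from typing import Sequence, Set, Tuple


def tensor_core_dtypes(capability: Tuple[int, int]) -> Set[str]:
    """Tensor Core input types available at a given CUDA compute capability."""
    dtypes: Set[str] = set()
    if capability >= (7, 0):  # Volta: FP16 only
        dtypes.add("fp16")
    if capability >= (7, 5):  # Turing adds integer Tensor Core paths
        dtypes.add("int8")
    if capability >= (8, 0):  # Ampere adds BF16 and TF32
        dtypes |= {"bf16", "tf32"}
    if capability >= (8, 9):  # Ada (8.9) and Hopper (9.0) add FP8
        dtypes.add("fp8")
    return dtypes


def pick_training_dtype(
    capability: Tuple[int, int],
    preferred: Sequence[str] = ("bf16", "fp16"),
) -> str:
    """Return the first preferred dtype the hardware accelerates, else FP32."""
    supported = tensor_core_dtypes(capability)
    for dtype in preferred:
        if dtype in supported:
            return dtype
    return "fp32"


if __name__ == "__main__":
    print(pick_training_dtype((7, 0)))  # V100
    print(pick_training_dtype((8, 0)))  # A100
```

On a V100 this falls back from BF16 to FP16, which is the case where loss scaling becomes necessary.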