NVIDIA V100 Tensor Core GPU

TL;DR

Volta-generation data centre GPU announced in May 2017; the first NVIDIA GPU with dedicated Tensor Cores.
16 / 32 GB HBM2 at 900 GB/s with first-generation Tensor Cores delivering 125 TFLOPS FP16.
Trained RoBERTa, Megatron-LM, GPT-3, and much of the 2018-2020 large-model literature.
Mainstream support is ending: CUDA 13 dropped Volta, so V100 is tied to CUDA 12.x toolchains, and modern frameworks still run but increasingly skip Volta in optimised paths.

Overview#

V100 is the GPU that introduced Tensor Cores to data centres and made transformer training practical. Volta added dedicated matrix-multiply units (Tensor Cores) inside each SM, alongside the conventional CUDA cores; they multiply FP16 inputs and accumulate in FP16 or FP32. The resulting 125 TFLOPS of peak FP16 throughput was roughly six times the 21 TFLOPS FP16 peak of the Pascal P100, and it is why much of the 2018-2020 transformer literature, including RoBERTa, Megatron-LM and GPT-3, was trained on V100 clusters.

By 2026 V100 is a legacy platform. CUDA 12.x still supports it (CUDA 13 dropped Volta) and some cloud providers still list V100 instances, but new training runs almost never target Volta. The card remains relevant for educational use, cost-sensitive inference of smaller models, and as a baseline for performance comparisons.

Specifications#

Metric	V100 SXM2 32 GB	V100 PCIe 32 GB
Architecture	Volta (GV100)	Volta (GV100)
Process	TSMC 12 nm FFN	TSMC 12 nm FFN
FP64	7.8 TFLOPS	7 TFLOPS
FP32	15.7 TFLOPS	14 TFLOPS
FP16 (Tensor)	125 TFLOPS	112 TFLOPS
Memory	32 GB HBM2	32 GB HBM2
Memory bandwidth	900 GB/s	900 GB/s
TDP	300 W	250 W
NVLink	300 GB/s (2.0)	Not supported
PCIe	Gen3 x16 (32 GB/s)	Gen3 x16 (32 GB/s)
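
As a rough reading of the table, dividing peak Tensor Core throughput by memory bandwidth gives the arithmetic intensity a kernel needs before it becomes compute-bound rather than memory-bound. The sketch below is a simple roofline estimate built only from the table's peak figures; the function names are illustrative, and real kernels reach only a fraction of either peak.

```python
def ridge_point(peak_tflops: float, bandwidth_gbs: float) -> float:
    """FLOPs per byte at which a kernel moves from memory-bound to compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gbs * 1e9)


def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_gbs: float) -> float:
    """Roofline ceiling (in TFLOPS) for a kernel with the given FLOPs-per-byte intensity."""
    memory_bound_tflops = intensity * bandwidth_gbs * 1e9 / 1e12
    return min(peak_tflops, memory_bound_tflops)


# Peak figures taken from the specification table above.
V100_SXM2 = {"peak_tflops": 125.0, "bandwidth_gbs": 900.0}
V100_PCIE = {"peak_tflops": 112.0, "bandwidth_gbs": 900.0}

if __name__ == "__main__":
    for name, spec in (("SXM2", V100_SXM2), ("PCIe", V100_PCIE)):
        print(name, "ridge point:", round(ridge_point(**spec), 1), "FLOP/byte")
    # A low-intensity kernel (10 FLOP/byte) is capped by bandwidth, not Tensor Cores.
    print("10 FLOP/byte ceiling:", attainable_tflops(10.0, **V100_SXM2), "TFLOPS")
```

On these peak numbers, a kernel needs on the order of 139 FLOPs per byte moved from HBM2 to saturate the SXM2 Tensor Cores, which is why large matrix multiplications benefit far more than elementwise operations.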

Volta Tensor Cores only support FP16 — no BF16, no INT8 acceleration, no FP8. Mixed-precision training requires explicit loss scaling.
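
The need for loss scaling follows from FP16's narrow range: magnitudes below roughly 6e-8 round to zero. The sketch below uses Python's standard `struct` module to round values to half precision. It illustrates the numeric effect only and is not how a framework implements its gradient scaler; the helper names and the scale factor of 65536 are assumptions chosen for the example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def scaled_roundtrip(grad: float, loss_scale: float) -> float:
    """Store grad * loss_scale in FP16, then unscale in higher precision."""
    stored = to_fp16(grad * loss_scale)
    return stored / loss_scale


if __name__ == "__main__":
    tiny_grad = 1e-8      # hypothetical gradient value, below FP16's smallest subnormal
    loss_scale = 65536.0  # assumed example scale (2**16)

    print("unscaled FP16 value:", to_fp16(tiny_grad))                     # underflows to zero
    print("scaled then unscaled:", scaled_roundtrip(tiny_grad, loss_scale))  # close to 1e-8
```

In PyTorch the same idea is handled by `GradScaler`, which also adjusts the scale dynamically when it detects overflow.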

Why V100 Mattered#

Before Volta, mixed-precision training was a research curiosity. Volta's Tensor Cores plus NVIDIA's Apex library (whose automatic mixed precision was later upstreamed into PyTorch as torch.cuda.amp, with autocast and GradScaler) made FP16 training routine. Combined with HBM2 capacity and NVLink 2.0, this made V100 the first data centre GPU that could realistically train models at BERT-large scale in days rather than months.

The resulting 'V100 ecosystem' — DGX-1, Cirrascale, AWS p3, GCP V100 nodes — is the substrate the modern AI infrastructure industry grew out of.

When V100 Still Makes Sense#

Educational clusters and research environments where amortised cost dominates.
Small-model inference (sub-3B) where Volta tensor throughput is adequate.
Legacy training pipelines that cannot be re-baselined onto newer hardware.
Replicable baseline experiments tied to original V100 results.
For any new deployment, A100 or L40S is almost always a better choice.
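
To make the sub-3B inference guidance above concrete, the sketch below estimates the FP16 weight footprint of a model and a bandwidth-bound ceiling on single-stream decode speed. It assumes every weight is read once per generated token at batch size 1 and ignores the KV cache, activations and kernel overheads, so the result is an optimistic upper bound rather than a benchmark. The function names are illustrative.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (FP16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


def decode_tokens_per_s_ceiling(
    params_billion: float,
    bandwidth_gbs: float = 900.0,  # V100 HBM2 bandwidth from the spec table
    bytes_per_param: int = 2,
) -> float:
    """Upper bound on batch-1 decode speed if each token streams all weights once."""
    return bandwidth_gbs / weight_footprint_gb(params_billion, bytes_per_param)


if __name__ == "__main__":
    for size in (1.0, 3.0, 7.0):
        gb = weight_footprint_gb(size)
        ceiling = decode_tokens_per_s_ceiling(size)
        print(f"{size:.0f}B params: {gb:.0f} GB of weights, <= {ceiling:.0f} tokens/s")
```

Under these assumptions a 3B-parameter model needs about 6 GB for FP16 weights, which leaves headroom on either the 16 GB or the 32 GB card, while a 7B model at 14 GB leaves little room for the KV cache on the 16 GB variant.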

Pitfalls#

No BF16: Llama-family and most modern LLM training stacks expect BF16; V100 forces FP16 with loss scaling and adds operational fragility.
PCIe Gen3 x16 offers about 16 GB/s per direction to the host, a quarter of the PCIe Gen5 links on current cards.
Ageing HBM2 stacks are a wear-out risk; second-hand V100s often arrive with degraded memory.
Driver lifecycle: CUDA 12.x is the last toolkit series that targets Volta (CUDA 13 dropped it), and newer hardware features (FP8, asynchronous tensor copies) are unavailable.

Software Notes#

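PyTorch, TensorFlow and JAX can still target V100 when built against CUDA 12.x, though it is worth checking that a given release still ships Volta kernels. Some inference servers (vLLM, SGLang) gate certain kernels on compute capability: Volta (7.0) misses some Hopper- and Ampere-only paths (FlashAttention-2, for example, requires Ampere or newer), but core inference works.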
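
Gating on compute capability usually amounts to comparing a (major, minor) tuple against a per-feature minimum. The sketch below shows that pattern with a small, hand-written table. The feature names and the `supported_features` helper are hypothetical, and the thresholds are a simplified summary of when each Tensor Core data type or copy engine first appeared; this is not the logic of any particular inference server.

```python
from typing import Dict, List, Tuple

Capability = Tuple[int, int]

# Simplified, illustrative minimums (assumption: mainstream data centre and
# workstation parts only; embedded variants are ignored).
MIN_CAPABILITY: Dict[str, Capability] = {
    "fp16_tensor_cores": (7, 0),          # Volta
    "int8_tensor_cores": (7, 5),          # Turing
    "bf16_tensor_cores": (8, 0),          # Ampere
    "fp8_tensor_cores": (8, 9),           # Ada, then Hopper
    "tensor_memory_accelerator": (9, 0),  # Hopper async tensor copies
}


def supported_features(capability: Capability) -> List[str]:
    """Return the features whose minimum capability the device meets, sorted by name."""
    return sorted(
        name for name, minimum in MIN_CAPABILITY.items() if capability >= minimum
    )


if __name__ == "__main__":
    print("V100 (7.0):", supported_features((7, 0)))
    print("A100 (8.0):", supported_features((8, 0)))
```

In practice the capability tuple would come from the runtime, for example `torch.cuda.get_device_capability()` in PyTorch.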

References

NVIDIA V100 Datasheet · NVIDIA
Volta Architecture Whitepaper · NVIDIA
