TL;DR
- Single-slot PCIe Ampere card aimed at enterprise inference, virtual workstations and modest fine-tuning.
- 24 GB GDDR6 at 600 GB/s in a 150 W passive form factor — drop-in for 1U/2U servers.
- Strong fit for 7B-class LLM inference, image generation, and video/CV pipelines.
- Supersedes T4 in most fleets; widely deployed before L4 took over the low-TDP slot.
Overview
The A10 is NVIDIA's retail counterpart to the A10G: the same Ampere GA102 die with slightly different binning, a 150 W TDP, and a single-slot passive form factor designed to fit standard 1U and 2U servers. From 2022 through 2024 it was the default 'inference card' for enterprise deployments that did not need A100-class memory.
By 2026 the A10 is being displaced in new deployments by the L4 (much lower TDP at similar throughput, so better throughput per watt) and the L40S (substantially more capable, but a dual-slot card with a much higher TDP). It remains heavily deployed in existing fleets.
Specifications
| Metric | A10 |
|---|---|
| Architecture | Ampere (GA102) |
| Process | Samsung 8N |
| FP32 | 31.2 TFLOPS |
| TF32 (Tensor, sparse) | 125 TFLOPS |
| BF16 / FP16 (Tensor, sparse) | 250 TFLOPS |
| INT8 (Tensor, sparse) | 500 TOPS |
| Memory | 24 GB GDDR6 |
| Memory bandwidth | 600 GB/s |
| TDP | 150 W |
| Form factor | PCIe Gen4 x16, single-slot |
| NVLink | Not supported |
| Display outputs | None |
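
The tensor figures in the table are peak values with 2:4 structured sparsity, which on Ampere doubles the dense peak. As a rough sketch, the Python below halves them to recover dense peaks and divides by memory bandwidth to estimate the roofline ridge point, the arithmetic intensity below which a kernel is bandwidth-bound. The function names are illustrative, and the results are theoretical ceilings rather than measured throughput.

```python
# Sparse peak figures as listed in the specifications table (TFLOPS / TOPS).
SPARSE_PEAK = {"TF32": 125.0, "BF16/FP16": 250.0, "INT8": 500.0}
MEM_BW_GBPS = 600.0  # memory bandwidth in GB/s, from the same table


def dense_peak(sparse_peak: float) -> float:
    """2:4 structured sparsity doubles the peak, so the dense peak is half."""
    return sparse_peak / 2.0


def ridge_point(dense_tera_ops: float, bandwidth_gbps: float = MEM_BW_GBPS) -> float:
    """Operations per byte a kernel needs before it stops being bandwidth-bound."""
    return (dense_tera_ops * 1e12) / (bandwidth_gbps * 1e9)


for name, sparse in SPARSE_PEAK.items():
    dense = dense_peak(sparse)
    print(f"{name}: {dense:.1f} dense peak, ridge ~{ridge_point(dense):.0f} ops/byte")
```

Workloads whose arithmetic intensity falls well below the ridge point, such as batch-1 LLM decode, will be limited by the 600 GB/s memory system rather than by the Tensor Cores.
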
Architecture Notes
The A10 belongs to the same Ampere generation as the A100 and shares its third-generation Tensor Cores with TF32, BF16 and INT8 support, but it has no FP8, no HBM and no NVLink. The GA102 die gives more raster and graphics horsepower than GA100 but less tensor density per watt at the high end.
MIG is not supported. Multi-tenant deployments rely on GPU time-slicing, on CUDA MPS (which lets several processes share the GPU concurrently), or on vGPU licensing.
When to Pick A10
- Drop-in inference upgrades over T4 fleets that need more memory or compute.
- Virtual workstation hosts (Citrix, VMware Horizon) where NVENC/NVDEC are valuable.
- Modest fine-tuning of 7B-class models with LoRA or QLoRA — 24 GB is workable.
- Single-card servers where 150 W and single-slot are hard constraints.
- Pick L4 instead for new builds under tight power budgets.
- Pick L40S instead when 48 GB and Ada Tensor Core throughput justify the TDP increase.
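
To make the 24 GB budget concrete, here is a minimal inference-only sizing sketch. It assumes a hypothetical 7B-class decoder shape (32 layers, 32 KV heads, head dimension 128), an FP16 KV cache, and an arbitrary 2 GB reserve for runtime overhead; none of these values come from the text above, so substitute the real configuration of the model in question.

```python
from dataclasses import dataclass

GB = 1e9
A10_MEMORY_GB = 24.0  # capacity quoted for the A10


@dataclass
class ModelShape:
    """Hypothetical 7B-class decoder shape (an assumption for illustration)."""
    params: float = 7e9
    layers: int = 32
    kv_heads: int = 32
    head_dim: int = 128


def weight_gb(shape: ModelShape, bytes_per_param: float) -> float:
    """Memory for the weights alone at a given precision."""
    return shape.params * bytes_per_param / GB


def kv_cache_gb(shape: ModelShape, tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: keys and values (factor 2) per layer, KV head and head dim."""
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value
    return per_token * tokens / GB


def fits(shape: ModelShape, bytes_per_param: float, tokens: int,
         reserve_gb: float = 2.0) -> bool:
    """True if weights, KV cache and an assumed overhead reserve fit in 24 GB."""
    need = weight_gb(shape, bytes_per_param) + kv_cache_gb(shape, tokens) + reserve_gb
    return need <= A10_MEMORY_GB


shape = ModelShape()
print(fits(shape, 2.0, 4096))    # FP16 weights, 4k cached tokens in total
print(fits(shape, 2.0, 32768))   # FP16 weights, 32k cached tokens in total
print(fits(shape, 0.5, 32768))   # 4-bit weights, 32k cached tokens in total
```

The `tokens` argument counts all cached tokens across concurrent sequences. Fine-tuning adds activations, gradients and optimizer state that this sketch does not model, which is why LoRA or QLoRA is the practical route on a 24 GB card.
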
Pitfalls
- Often confused with A10G; software guidance for one usually applies to the other but they are not identical bins.
- GDDR6 bandwidth limits long-context inference and large-batch decode performance.
- No NVLink means multi-A10 setups rely on PCIe — tensor parallel scaling is poor.
- Energy efficiency per token is worse than on L4 in many modern inference workloads.
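
The bandwidth pitfall can be roughly quantified. At batch size 1, each generated token has to stream the full set of weights from memory, so bandwidth divided by weight size gives a ceiling on decode speed. The sketch below assumes a 7B-parameter model and ignores KV-cache reads, kernel overheads and achievable (rather than peak) bandwidth, so real throughput will be lower; the function name is illustrative.

```python
MEM_BW_GBPS = 600.0  # A10 peak memory bandwidth in GB/s


def decode_tokens_per_s_upper_bound(params: float, bytes_per_param: float,
                                    bandwidth_gbps: float = MEM_BW_GBPS) -> float:
    """Batch-1 ceiling: every generated token reads all weights from memory once."""
    weight_bytes = params * bytes_per_param
    return bandwidth_gbps * 1e9 / weight_bytes


# Assumed 7B-parameter model at three common weight precisions.
for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("4-bit", 0.5)]:
    bound = decode_tokens_per_s_upper_bound(7e9, bytes_per_param)
    print(f"7B {label}: at most ~{bound:.0f} tokens/s per sequence")
```

Batching amortises the weight reads across sequences, which is why aggregate throughput can be far higher than this per-sequence ceiling until the KV cache itself becomes the bandwidth bottleneck at long context.
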
Software Notes
CUDA 11.x and 12.x are both supported, and the mainstream inference stack (vLLM, TensorRT-LLM, Triton, TGI, Ollama) treats A10 as a routine Ampere target. Quantisation paths (AWQ, GPTQ, INT8) are well-tuned. CUDA 13 retains support; treat A10 as stable through 2027.
References
- NVIDIA A10 Datasheet · NVIDIA
- Ampere Architecture Whitepaper · NVIDIA