TL;DR
- AWS's first in-house inference chip launched late 2019; powered EC2 inf1 instances.
- Four NeuronCore v1 cores per chip, with on-chip SRAM and 8 GB of DDR4 per chip (no HBM).
- Targeted at small-model inference: CNNs, BERT-base, and smaller transformers.
- Largely superseded by Inferentia 2 for LLM-class workloads.
Overview
Inferentia is AWS's first generation of in-house inference silicon. Launched in 2019, it provided low-cost EC2 inf1 instances optimised for small transformer and CNN inference. The chip was modest by modern standards but offered material cost advantages over GPU instances of the same era for the workloads it could host.
By 2026 Inferentia is largely a legacy platform. Workloads beyond the BERT-base or small-CNN class generally need Inferentia 2 or a GPU.
Specifications
| Metric | Inferentia (per chip) |
|---|---|
| NeuronCores | 4 (NeuronCore v1) |
| FP16 / BF16 | ~64 TFLOPS |
| INT8 | ~128 TOPS |
| Memory | 8 GB DDR4 |
| Instance | inf1 (1-16 chips) |
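To relate the per-chip figures to whole instances, the sketch below multiplies them out across the inf1 sizes. The chip counts are the published inf1 configurations; the names `INF1_CHIPS` and `instance_totals` are illustrative, not part of any AWS API.

```python
# Chips per inf1 instance size (published inf1 configurations).
INF1_CHIPS = {
    "inf1.xlarge": 1,
    "inf1.2xlarge": 1,
    "inf1.6xlarge": 4,
    "inf1.24xlarge": 16,
}

NEURONCORES_PER_CHIP = 4  # NeuronCore v1 cores per Inferentia chip
MEMORY_GB_PER_CHIP = 8    # device DRAM per chip


def instance_totals(instance_type: str) -> dict:
    """Aggregate core count and device memory for one inf1 instance size."""
    chips = INF1_CHIPS[instance_type]
    return {
        "chips": chips,
        "neuroncores": chips * NEURONCORES_PER_CHIP,
        "device_memory_gb": chips * MEMORY_GB_PER_CHIP,
    }


for name in INF1_CHIPS:
    print(name, instance_totals(name))
```

The memory total is an aggregate only: each chip's 8 GB is a separate pool, so a model that exceeds one chip's memory has to be split across chips rather than simply loaded onto a larger instance.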
When Inferentia Still Makes Sense
- Pre-existing inf1 deployments running small-model inference.
- CNN and BERT-base scale workloads where cost per inference dominates.
- Not for new LLM workloads: for those, pick Inferentia 2 or a GPU such as the NVIDIA L4 or L40S instead.
Pitfalls
- No HBM — bandwidth-bound workloads underperform substantially.
- 8 GB per chip cannot host modern LLMs.
- Neuron SDK feature set on Inferentia trails Inferentia 2 in modern model support.
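The memory limit above can be checked with back-of-the-envelope arithmetic. This is a rough sketch that counts weights only, assuming 2 bytes per parameter (FP16/BF16) and ignoring activations and runtime overhead; the parameter counts are approximate and the 7B model is a hypothetical example.

```python
CHIP_MEMORY_GB = 8  # device memory per Inferentia chip


def weight_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (10^9 bytes); excludes activations and overhead."""
    return num_params * bytes_per_param / 1e9


models = {
    "BERT-base (~110M params)": 110e6,
    "hypothetical 7B-parameter LLM": 7e9,
}

for name, params in models.items():
    gb = weight_footprint_gb(params)
    fits = gb < CHIP_MEMORY_GB
    print(f"{name}: {gb:.2f} GB of weights, fits on one chip: {fits}")
```

Under these assumptions BERT-base needs roughly 0.22 GB for weights, while a 7B-parameter model needs about 14 GB, which already exceeds a single chip before any activation memory is counted.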
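Software Notes
The AWS Neuron SDK supports Inferentia through the framework-specific `torch-neuron` (PyTorch) and `tensorflow-neuron` packages. The PyTorch/XLA-based `torch-neuronx` package targets Inferentia 2 and Trainium rather than inf1. Inferentia compilation tooling is mature for the chip's intended workloads.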
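As a rough illustration of the compile step, the sketch below traces a PyTorch model for inf1. It assumes the `torch-neuron` package (the PyTorch integration used for first-generation Inferentia) is installed on the build machine; `compile_for_inf1` and the default output path are hypothetical names chosen for this example.

```python
def compile_for_inf1(model, example_input, output_path="model_neuron.pt"):
    """Compile a PyTorch model for Inferentia (inf1) and save the traced artifact.

    Assumes the torch-neuron package from the AWS Neuron SDK is installed.
    """
    # Imported lazily so this module still loads on machines without the Neuron SDK.
    import torch
    import torch_neuron  # noqa: F401  (registers the torch.neuron namespace)

    model.eval()
    # Compilation is ahead-of-time: input shapes are fixed by the example input.
    neuron_model = torch.neuron.trace(model, example_inputs=[example_input])
    neuron_model.save(output_path)
    return output_path
```

On the inf1 instance, the saved artifact would typically be loaded with `torch.jit.load(output_path)` after importing `torch_neuron`. Because shapes are fixed at trace time, inputs at inference time need to match the example input's shape.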