AWS Inferentia

TL;DR

AWS's first in-house inference chip; it launched in late 2019 and powers EC2 inf1 instances.
Four NeuronCore-v1 cores per chip, with on-chip SRAM and 8 GB of DDR4 per chip (no HBM).
Targeted at small-model inference: CNNs, BERT-base, and smaller transformers.
Largely superseded by Inferentia 2 for LLM-class workloads.

Overview

Inferentia is AWS's first generation of in-house inference silicon. Launched in 2019, it powers the low-cost EC2 inf1 instances, which are optimised for small transformer and CNN inference. The chip is modest by modern standards but offered material cost advantages over GPU instances of the same era for the workloads it could host.

By 2026 Inferentia is largely a legacy platform. Workloads larger than the BERT-base or small-CNN class generally need Inferentia 2 or a GPU.

Specifications

Metric	Inferentia (per chip)
NeuronCores	4 (NeuronCore-v1)
FP16 / BF16	~64 TFLOPS
INT8	~128 TOPS
Memory	8 GB DDR4
Instance family	inf1 (1 to 16 chips per instance)
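Core count and device memory scale linearly with the number of chips, so instance-level totals follow from the table above by multiplication. The sketch below assumes the four inf1 sizes and chip counts as AWS lists them (1, 1, 4 and 16 chips); the helper name `inf1_totals` is illustrative and not part of any AWS tooling.

```python
# Per-chip constants taken from the specification table.
CORES_PER_CHIP = 4
DDR4_GB_PER_CHIP = 8

# Chip count per inf1 size (assumed from AWS's published instance list).
INF1_CHIPS = {
    "inf1.xlarge": 1,
    "inf1.2xlarge": 1,
    "inf1.6xlarge": 4,
    "inf1.24xlarge": 16,
}


def inf1_totals(instance_type: str) -> dict:
    """Return chip count, total NeuronCores and total device memory."""
    chips = INF1_CHIPS[instance_type]
    return {
        "chips": chips,
        "neuron_cores": chips * CORES_PER_CHIP,
        "device_memory_gb": chips * DDR4_GB_PER_CHIP,
    }


if __name__ == "__main__":
    for name in INF1_CHIPS:
        print(name, inf1_totals(name))
```

The memory total is a sum of separate 8 GB pools, one per chip, so a larger instance does not by itself let a single unpartitioned model exceed 8 GB.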

When Inferentia Still Makes Sense

Pre-existing inf1 deployments running small-model inference.
CNN and BERT-base scale workloads where cost per inference dominates.
For any new workload involving LLMs, pick Inferentia 2 or an NVIDIA L4 / L40S GPU instead.
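"Cost per inference" can be made concrete by dividing an instance's hourly price by the number of inferences it sustains in an hour. The following is a minimal sketch of that arithmetic; the function name and every price and throughput figure in it are placeholders chosen for illustration, not quoted AWS prices or measured benchmarks.

```python
def cost_per_million(hourly_price_usd: float, throughput_per_sec: float) -> float:
    """USD per one million inferences at a sustained throughput."""
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: substitute your own price and measured throughput.
    candidates = {
        "inf1 (placeholder)": (0.25, 1000.0),
        "gpu (placeholder)": (0.50, 1500.0),
    }
    for name, (price, throughput) in candidates.items():
        print(f"{name}: ${cost_per_million(price, throughput):.4f} per 1M inferences")
```

The comparison is only meaningful when both throughput figures are measured on the same model, batch size and latency target.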

Pitfalls

No HBM: device memory is DDR4, so bandwidth-bound workloads underperform substantially.
8 GB per chip cannot host modern LLMs.
The Neuron SDK feature set for Inferentia trails that for Inferentia 2 in support for modern models.
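The 8 GB limit can be checked with a quick weights-only estimate: parameter count times bytes per parameter. The sketch below assumes 2-byte (FP16/BF16) weights and ignores activations, KV cache and runtime overhead, so it is an optimistic lower bound; the function names and the example parameter counts (roughly 110M for BERT-base, 7B for a small LLM) are illustrative.

```python
CHIP_MEMORY_GB = 8  # DDR4 per Inferentia chip


def weight_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only size in GB (10**9 bytes); excludes activations and KV cache."""
    return num_params * bytes_per_param / 1e9


def fits_on_one_chip(num_params: float, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit within one chip's device memory."""
    return weight_footprint_gb(num_params, bytes_per_param) <= CHIP_MEMORY_GB


if __name__ == "__main__":
    examples = {"BERT-base (~110M params)": 110e6, "7B-parameter LLM": 7e9}
    for name, n in examples.items():
        size = weight_footprint_gb(n)
        print(f"{name}: {size:.2f} GB of weights, fits on one chip: {fits_on_one_chip(n)}")
```

Under these assumptions BERT-base needs about 0.22 GB and fits easily, while a 7B-parameter model needs about 14 GB for weights alone and does not.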

Software Notes

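The AWS Neuron SDK supports Inferentia through PyTorch Neuron (torch-neuron) and TensorFlow Neuron (tensorflow-neuron); the PyTorch/XLA-based torch-neuronx package targets Inferentia 2 and Trainium rather than inf1. Inferentia compilation tooling is mature for the chip's intended workloads.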
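As a rough illustration of the compile step, the sketch below assumes the `torch-neuron` package (the PyTorch integration that targets inf1) and its `torch.neuron.trace` entry point. The wrapper names `compile_for_inf1` and `load_compiled` and the default file name are hypothetical, and the imports are deferred so the file can be read on a machine without the SDK installed.

```python
def compile_for_inf1(model, example_inputs, out_path="model_neuron.pt"):
    """Trace a PyTorch model for Inferentia and save the compiled artifact."""
    # Deferred imports: both packages are only needed when compiling.
    import torch
    import torch_neuron  # noqa: F401  (registers the torch.neuron namespace)

    model.eval()
    # Ahead-of-time compilation; input shapes are fixed to those of
    # example_inputs, so inference must use tensors of the same shape.
    neuron_model = torch.neuron.trace(model, example_inputs=example_inputs)
    neuron_model.save(out_path)
    return neuron_model


def load_compiled(path="model_neuron.pt"):
    """Load a previously compiled model, typically on an inf1 instance."""
    import torch
    import torch_neuron  # noqa: F401  (needed so the Neuron ops are registered)

    return torch.jit.load(path)
```

A typical flow would compile once, for example with a list of example tensors matching the production input shape, then call `load_compiled` at service start-up on the inf1 host.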

References

AWS Inferentia Product Page · AWS
