Groq LPU (Language Processing Unit)

TL;DR

Groq's custom inference accelerator built around a Tensor Streaming Processor architecture.
Optimises for single-stream token latency — published benchmarks reach 700+ tokens/sec on Llama 70B.
On-chip SRAM (~230 MB per chip) replaces HBM; capacity comes from chaining many chips together.
Sold primarily through Groq Cloud API; on-prem GroqRack systems available for selected customers.

Overview#

Groq's Language Processing Unit (LPU) is a custom inference accelerator built around the Tensor Streaming Processor (TSP) architecture. The design philosophy is deterministic execution: every instruction's timing is known statically, the compiler schedules everything explicitly, and there is no runtime cache or branch prediction to introduce variance.

The visible result is single-stream token latency that GPU systems cannot easily match. Public benchmarks routinely show Groq serving Llama-3 70B at 700+ tokens per second per stream, with sub-millisecond inter-token latency. The trade-off is capacity per chip — Groq uses on-chip SRAM rather than HBM, and total model capacity comes from chaining many LPUs together.

Specifications#

Metric	Groq LPU
Architecture	Tensor Streaming Processor
On-chip SRAM	~230 MB
INT8 throughput	~750 TOPS
FP16 throughput	~188 TFLOPS
External memory	None (chip-to-chip dataflow)
Process	Global Foundries 14 nm (LPU v1)
Form factor	GroqCard PCIe / GroqNode / GroqRack

Groq's headline metric is tokens-per-second per stream, not aggregate throughput. The architecture wins on latency, not on raw FLOPS density per dollar.

Tensor Streaming Processor Architecture#

A TSP organises compute as a deterministic dataflow pipeline. Functional units — vector, matrix, switch, memory — are arranged in a regular spatial pattern; data flows through the pipeline in lockstep, with the compiler placing every operation in space and time.

Eliminating runtime variance has two consequences. First, latency becomes predictable and minimal — no caches to miss, no schedulers to wait on, no branches to mispredict. Second, the compiler bears enormous complexity: it must schedule every operation, route every tensor, and balance every pipeline stage by hand.

Model capacity comes from chaining. A Groq deployment uses many LPUs operating as a single distributed dataflow pipeline, with model weights spread across the SRAM of all chips. Llama-3 70B inference, for instance, requires hundreds of LPUs co-operating to host the full model.

When to Pick Groq#

Latency-critical inference where token-per-second per stream dominates user experience.
Real-time conversational AI, voice-AI front ends and live agent loops.
Workloads where Groq Cloud's hosted API is acceptable.
Pick GPU clusters when throughput per dollar dominates over single-stream latency.
Pick Cerebras / Tenstorrent when wafer-scale or RISC-V dataflow models suit better.

Pitfalls#

Single-stream latency leadership does not always translate to throughput-per-dollar leadership.
Adding a new model architecture is a non-trivial engineering exercise on TSP — Groq supports a curated model catalogue rather than arbitrary fine-tunes.
On-prem deployment requires substantial bespoke planning; Groq Cloud is the typical entry point.
Software ecosystem reach is narrow; framework integration is via Groq's API, not native PyTorch / vLLM.

Software Notes#

Groq's primary interface is the Groq Cloud OpenAI-compatible API. Self-hosted GroqRack systems use Groq's compiler stack; supported models include Llama, Mistral, Mixtral and other open-weight families with Groq-published recipes.

References

Groq LPU Architecture Page · Groq
Tensor Streaming Processor Whitepaper · Groq

Overview#

Specifications#

Metric	Groq LPU
Architecture	Tensor Streaming Processor
On-chip SRAM	~230 MB
INT8 throughput	~750 TOPS
FP16 throughput	~188 TFLOPS
External memory	None (chip-to-chip dataflow)
Process	Global Foundries 14 nm (LPU v1)
Form factor	GroqCard PCIe / GroqNode / GroqRack

Groq's headline metric is tokens-per-second per stream, not aggregate throughput. The architecture wins on latency, not on raw FLOPS density per dollar.

Tensor Streaming Processor Architecture#

When to Pick Groq#

Latency-critical inference where token-per-second per stream dominates user experience.

Real-time conversational AI, voice-AI front ends and live agent loops.

Workloads where Groq Cloud's hosted API is acceptable.

Pick GPU clusters when throughput per dollar dominates over single-stream latency.

Pick Cerebras / Tenstorrent when wafer-scale or RISC-V dataflow models suit better.

Pitfalls#

Single-stream latency leadership does not always translate to throughput-per-dollar leadership.

Adding a new model architecture is a non-trivial engineering exercise on TSP — Groq supports a curated model catalogue rather than arbitrary fine-tunes.

On-prem deployment requires substantial bespoke planning; Groq Cloud is the typical entry point.

Software ecosystem reach is narrow; framework integration is via Groq's API, not native PyTorch / vLLM.

Groq LPU (Language Processing Unit)

Overview#

Specifications#

Tensor Streaming Processor Architecture#

When to Pick Groq#

Pitfalls#

Software Notes#

References

Browse all entries

Deploy on Yobitel

Groq LPU (Language Processing Unit)

Overview#

Specifications#

Tensor Streaming Processor Architecture#

When to Pick Groq#

Pitfalls#

Software Notes#

References

Browse all entries

Deploy on Yobitel