TL;DR
- Groq's custom inference accelerator built around a Tensor Streaming Processor architecture.
- Optimises for single-stream token latency — published benchmarks reach 700+ tokens/sec on Llama 70B.
- On-chip SRAM (~230 MB per chip) replaces HBM; capacity comes from chaining many chips together.
- Sold primarily through Groq Cloud API; on-prem GroqRack systems available for selected customers.
Overview#
Groq's Language Processing Unit (LPU) is a custom inference accelerator built around the Tensor Streaming Processor (TSP) architecture. The design philosophy is deterministic execution: every instruction's timing is known statically, the compiler schedules everything explicitly, and there is no runtime cache or branch prediction to introduce variance.
The visible result is single-stream token latency that GPU systems cannot easily match. Public benchmarks routinely show Groq serving Llama-3 70B at 700+ tokens per second per stream, with sub-millisecond inter-token latency. The trade-off is capacity per chip — Groq uses on-chip SRAM rather than HBM, and total model capacity comes from chaining many LPUs together.
Specifications#
| Metric | Groq LPU |
|---|---|
| Architecture | Tensor Streaming Processor |
| On-chip SRAM | ~230 MB |
| INT8 throughput | ~750 TOPS |
| FP16 throughput | ~188 TFLOPS |
| External memory | None (chip-to-chip dataflow) |
| Process | Global Foundries 14 nm (LPU v1) |
| Form factor | GroqCard PCIe / GroqNode / GroqRack |
Groq's headline metric is tokens-per-second per stream, not aggregate throughput. The architecture wins on latency, not on raw FLOPS density per dollar.
Tensor Streaming Processor Architecture#
A TSP organises compute as a deterministic dataflow pipeline. Functional units — vector, matrix, switch, memory — are arranged in a regular spatial pattern; data flows through the pipeline in lockstep, with the compiler placing every operation in space and time.
Eliminating runtime variance has two consequences. First, latency becomes predictable and minimal — no caches to miss, no schedulers to wait on, no branches to mispredict. Second, the compiler bears enormous complexity: it must schedule every operation, route every tensor, and balance every pipeline stage by hand.
Model capacity comes from chaining. A Groq deployment uses many LPUs operating as a single distributed dataflow pipeline, with model weights spread across the SRAM of all chips. Llama-3 70B inference, for instance, requires hundreds of LPUs co-operating to host the full model.
When to Pick Groq#
- Latency-critical inference where token-per-second per stream dominates user experience.
- Real-time conversational AI, voice-AI front ends and live agent loops.
- Workloads where Groq Cloud's hosted API is acceptable.
- Pick GPU clusters when throughput per dollar dominates over single-stream latency.
- Pick Cerebras / Tenstorrent when wafer-scale or RISC-V dataflow models suit better.
Pitfalls#
- Single-stream latency leadership does not always translate to throughput-per-dollar leadership.
- Adding a new model architecture is a non-trivial engineering exercise on TSP — Groq supports a curated model catalogue rather than arbitrary fine-tunes.
- On-prem deployment requires substantial bespoke planning; Groq Cloud is the typical entry point.
- Software ecosystem reach is narrow; framework integration is via Groq's API, not native PyTorch / vLLM.
Software Notes#
Groq's primary interface is the Groq Cloud OpenAI-compatible API. Self-hosted GroqRack systems use Groq's compiler stack; supported models include Llama, Mistral, Mixtral and other open-weight families with Groq-published recipes.