TL;DR
- Post-training INT4 weight-only quantisation method introduced in Lin et al., 'AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration' (arXiv:2306.00978, 2023, MIT-HAN Lab).
- Core observation: across Transformer layers, roughly the top 1 percent of weight channels (by activation magnitude) carry the bulk of output quality. Apply a per-channel scaling that preserves precision on those channels and the rest can be quantised aggressively to INT4 with near-zero perplexity loss.
- Storage is INT4 weight + FP16 group scale (group size 128 is the de facto default); activations remain in FP16/BF16. Inference is mixed-precision matmul — dequant on the fly against FP16 activations through a fused INT4xFP16 GEMM kernel.
- Generally beats GPTQ on raw inference throughput at the same bit budget (no Hessian-derived ordering, simpler memory layout, kernel-friendly) and ties or slightly wins on perplexity on Llama, Qwen and Mistral.
- Supported as a first-class quantisation backend in vLLM (`--quantization awq` / `awq_marlin`), TensorRT-LLM, Hugging Face TGI, SGLang, MLC-LLM, ExLlamaV2; default INT4 format on Yobitel's Yobibyte Marketplace recipes for Llama, Qwen and Mistral when TTFT and tokens/sec dominate the workspace SLO.
Overview
AWQ (Activation-aware Weight Quantisation) was published by Lin et al. (MIT-HAN Lab) in June 2023 and has since become one of the most widely deployed INT4 methods for open-weights LLM inference. The technique sidesteps the per-column Hessian error correction that defines GPTQ in favour of a cheaper, empirical reframing: it asks which weight channels carry the activation mass that actually matters for output quality, then protects those channels through a per-channel scaling whose inverse is absorbed into the preceding layer.
The reframing matters at a systems level because it produces a quantisation layout that maps cleanly onto fused INT4xFP16 GEMM kernels — no irregular per-column scaling, no Cholesky-derived ordering, just INT4 weights with per-group FP16 scales. That regularity is what lets vLLM, TensorRT-LLM, SGLang and TGI all converge on the same AWQ kernel path, and what makes AWQ INT4 the throughput-favoured default on Ampere (A100) and Hopper (H100/H200) when FP8 hardware is unavailable or memory pressure demands 4-bit.
Through mid-2026, AWQ INT4 stays within roughly 1 percent of BF16 perplexity on standard suites (WikiText-2, C4) for Llama 3.1 70B, Qwen 3 72B and Mistral Large 2, while cutting weight memory to roughly a quarter of BF16. That reduction is what makes single-H100 80GB serving of a 70B-class model practical at all: 140 GB of BF16 weights collapses to roughly 35 GB of INT4 weights (plus a few percent for the group scales), leaving 40+ GB for KV cache and activations.
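The footprint arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch that counts only the weights themselves; the helper name `weight_memory_gb` is illustrative, and group scales, zero points and any layers left unquantised are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes) for a parameter count and bit width."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9   # 70B-class model
HBM_GB = 80       # single H100 80GB

bf16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
headroom_gb = HBM_GB - int4_gb   # left for KV cache, activations, runtime overhead

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"INT4 weights: {int4_gb:.0f} GB")
print(f"Headroom on an 80 GB GPU: {headroom_gb:.0f} GB")
```

The ratio between the two weight figures is 4x, which is where the single-GPU headroom comes from.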
This entry helps you decide when AWQ is the right INT4 method versus GPTQ, FP8 or FP4, how to produce AWQ checkpoints with AutoAWQ for self-hosted serving, and how to size a fleet that consumes the Yobitel Yobibyte Marketplace recipes that default to AWQ INT4 on Yobitel NeoCloud capacity. After reading you should be able to predict the throughput-versus-quality envelope for an AWQ deployment without running a benchmark.
How it works: the activation-aware scaling derivation
Naive round-to-nearest INT4 quantisation of a Transformer linear layer wipes out the small but high-importance subset of weight channels that carry most of the activation magnitude. Lin et al. observed that this subset is concentrated in roughly 1 percent of channels per layer, and that protecting them is sufficient to recover most of the accuracy lost. The naive 'keep those channels in FP16' approach works but breaks the regularity that GEMM kernels need — mixed-format storage is slow.
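For reference, a minimal sketch of what round-to-nearest group quantisation does to one group of weights. It is a simplification: it stores the group minimum as a float offset rather than the packed integer zero point that production kernels use, and the function names are illustrative.

```python
def quantize_group(weights, n_bits=4):
    """Round-to-nearest quantisation of one weight group to unsigned n-bit codes."""
    levels = 2 ** n_bits - 1                     # 15 steps for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0      # guard against a constant group
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize_group(codes, scale, w_min):
    """Map integer codes back to approximate floating-point weights."""
    return [w_min + q * scale for q in codes]

# Deterministic toy group of 128 weights spread over [-0.5, 0.5]
group = [((i * 37) % 101 - 50) / 100 for i in range(128)]
codes, scale, w_min = quantize_group(group)
restored = dequantize_group(codes, scale, w_min)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"step = {scale:.4f}, max abs error = {max_err:.4f}")
```

Every weight in the group can be off by up to half a step. On a channel that sees large activations, that same weight error is multiplied by a large input, which is why those channels dominate the output error.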
AWQ's contribution is a way to protect those channels without mixed storage. For each channel c with activation scale s_c, multiply the weight column by s_c before quantising, and divide the preceding layer's output (the activation feeding into c) by s_c. The mathematical behaviour is unchanged, but the quantisation step now rounds a larger value to INT4, preserving more relative precision on the channels that need it. The scaling factors s are derived from a small calibration set (~128 samples typically) by minimising the activation-weighted L2 error between the original and quantised outputs of each layer.
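The effect can be checked numerically on a toy layer. The sketch below uses synthetic Gaussian weights, a hand-picked salient channel (index 0) and a hand-picked scale of 2; all of these are assumptions for illustration, not values from the paper. It first confirms that the rescaling leaves the full-precision output unchanged, then compares output error with and without scaling.

```python
import random

def rtn_int4(row):
    """Round-to-nearest 4-bit quantise + dequantise of one weight group."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / 15 or 1.0
    return [lo + min(15, max(0, round((w - lo) / step))) * step for w in row]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
N_IN, N_OUT = 64, 200
W = [[rng.gauss(0.0, 1.0) for _ in range(N_IN)] for _ in range(N_OUT)]  # W[out][in]
x = [rng.gauss(0.0, 1.0) for _ in range(N_IN)]
x[0] = 20.0            # input channel 0 carries a much larger activation
s = [1.0] * N_IN
s[0] = 2.0             # scale applied to the salient channel only

# Reparameterise: weights times s, activations divided by s
W_scaled = [[w * sc for w, sc in zip(row, s)] for row in W]
x_scaled = [xi / sc for xi, sc in zip(x, s)]

y_ref = [dot(row, x) for row in W]
y_equiv = [dot(row, x_scaled) for row in W_scaled]           # no quantisation
y_naive = [dot(rtn_int4(row), x) for row in W]               # plain RTN
y_awq = [dot(rtn_int4(row), x_scaled) for row in W_scaled]   # scale, then RTN

print("full-precision mismatch:", mse(y_equiv, y_ref))
print("output MSE, plain RTN:  ", mse(y_naive, y_ref))
print("output MSE, scaled RTN: ", mse(y_awq, y_ref))
```

The scaled variant trades a slightly wider quantisation range in some rows for a smaller effective error on the channel that matters most, which is the trade the alpha search is tuning.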
The calibration is done one layer at a time, in topological order, so each layer's scaling factor sees the quantisation error already introduced upstream. Unlike GPTQ, which propagates error column by column within a layer through the inverse Hessian, AWQ propagates error layer by layer through forward activations. The cost is one forward pass over the calibration set per layer; for a 70B model on a single H100 that takes roughly 20 minutes.
- Step 1: run the calibration set through the model and collect per-channel activation magnitude statistics (mean absolute activation per input channel) for every linear layer.
- Step 2: for each layer's input channels, identify the high-importance subset (top ~1 percent by activation magnitude) — these are the channels that need protection.
- Step 3: search for a per-channel scaling vector s (parameterised by a single scalar alpha in [0, 1] applied as s = activation_scale ** alpha) that minimises the L2 error between the layer's FP16 output and its quantised output.
- Step 4: absorb the inverse scaling 1/s into the previous layer's output projection (or the embedding for layer 0), then quantise the rescaled weights w * s to INT4 with per-group FP16 scales.
- Step 5: serialise the INT4 weights and per-group scales into the checkpoint; the runtime loads them and dispatches the fused INT4xFP16 GEMM kernel at inference time.
The alpha hyperparameter is the only knob you genuinely need to tune. Lin et al. used grid search over alpha in {0.0, 0.1, ..., 1.0}; AutoAWQ does the same automatically. Higher alpha protects channels more aggressively but can hurt the unprotected majority — the sweet spot is typically 0.5-0.6 for Llama-family models.
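Steps 1 to 4 and the alpha grid can be sketched end to end on synthetic data. This is a toy under stated assumptions, not the reference implementation: weights and calibration activations are random, two channels are artificially made salient, and the scale normalisation and weight clipping that real toolchains may apply are omitted.

```python
import random

def rtn_int4(row):
    """Round-to-nearest 4-bit quantise + dequantise of one weight group."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / 15 or 1.0
    return [lo + min(15, max(0, round((w - lo) / step))) * step for w in row]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(0)
N_IN, N_OUT, N_CALIB = 32, 32, 16
W = [[rng.gauss(0.0, 1.0) for _ in range(N_IN)] for _ in range(N_OUT)]
# Synthetic calibration activations: channels 0 and 1 are ~10x larger
gain = [10.0 if c < 2 else 1.0 for c in range(N_IN)]
X = [[rng.gauss(0.0, 1.0) * g for g in gain] for _ in range(N_CALIB)]

# Step 1: mean absolute activation per input channel
act_scale = [sum(abs(x[c]) for x in X) / N_CALIB for c in range(N_IN)]
Y_ref = [[dot(row, x) for row in W] for x in X]

def output_error(alpha):
    """Mean squared output error when weights are scaled by act_scale**alpha."""
    s = [a ** alpha for a in act_scale]
    W_q = [rtn_int4([w * sc for w, sc in zip(row, s)]) for row in W]
    total = 0.0
    for x, y_ref in zip(X, Y_ref):
        x_s = [xi / sc for xi, sc in zip(x, s)]   # inverse scale on activations
        total += sum((dot(row, x_s) - y) ** 2 for row, y in zip(W_q, y_ref))
    return total / (N_CALIB * N_OUT)

# Step 3: grid search over alpha in {0.0, 0.1, ..., 1.0}
grid = [i / 10 for i in range(11)]
errors = {alpha: output_error(alpha) for alpha in grid}
best_alpha = min(errors, key=errors.get)

print(f"alpha=0.0 (plain RTN) error: {errors[0.0]:.4f}")
print(f"best alpha={best_alpha} error: {errors[best_alpha]:.4f}")
```

Because alpha = 0 reduces to plain round-to-nearest, the search can never do worse than the unscaled baseline on the calibration data; whether it generalises depends on how representative that data is.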
Variants and architectural choices
Three knobs define every AWQ deployment: the bit width (almost always INT4, occasionally INT3 on memory-starved edge devices), the group size for the FP16 scales (smaller groups mean more scales and slightly more storage, but lower quantisation error), and the kernel format (GEMM versus GEMV) which determines whether prefill or decode is favoured.
| Knob | Common choice | Effect | When to deviate |
|---|---|---|---|
| Bit width | INT4 (w4) | ~4x smaller than BF16, near-zero perplexity loss | INT3 only for edge with hard memory caps; quality drop becomes visible |
| Group size | 128 | Industry default; balanced storage vs error | Group 64 for tighter quality at +5 percent storage; group 32 only for research |
| Zero point | Asymmetric (zero_point=True) | Per-group zero shift, recovers 0.1-0.3 perplexity points | Symmetric only when targeting a kernel that requires it (rare in 2026) |
| Kernel format | GEMM (Marlin-AWQ in vLLM v0.6+) | Decode + prefill both fast on H100/H200 | GEMV format on llama.cpp / MLC-LLM for very small batch (b=1) edge inference |
| Activation dtype | FP16/BF16 (W4A16) | Standard mixed-precision matmul | W4A8 INT8 activations under research; not yet production-stable in mid-2026 |
| Calibration set | ~128 samples, in-domain text | Cheap, robust | Use 512-1024 samples when domain shift is large (code, multilingual, math) |
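The storage side of the group-size trade-off is simple arithmetic. The sketch below assumes one FP16 scale and one 4-bit zero point per group; actual checkpoint formats pack metadata differently, so treat the output as an estimate rather than an exact file size.

```python
def bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size

baseline = bits_per_weight(group_size=128)
for g in (128, 64, 32):
    bpw = bits_per_weight(group_size=g)
    overhead = (bpw / baseline - 1) * 100
    print(f"group {g:>3}: {bpw:.4f} bits/weight ({overhead:+.1f}% vs group 128)")
```

Under these assumptions, halving the group size to 64 costs a few percent of extra storage, and group 32 costs roughly three times that.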
When to use AWQ versus the alternatives
AWQ INT4 is the default INT4 method for open-weights LLM serving in mid-2026, but it is not universally the right choice. The decision turns on three questions: what GPU generation are you serving on, how memory-constrained is the deployment, and what is the workload's batch and context profile?
On Hopper (H100/H200) and Blackwell (B200/B300) with FP8 Tensor Cores available, FP8 weights and activations (W8A8 FP8), typically paired with an FP8 KV cache, usually beat AWQ INT4 on throughput at equal or better quality, because FP8 hardware paths are denser and avoid the dequant step. Reach for AWQ on Hopper specifically when memory pressure forces 4-bit (single-H100 70B serving), when the workload is small-batch and decode-dominant (where the dequant overhead is hidden behind the memory-bandwidth bottleneck of decoding), or when you have a published AWQ checkpoint and no FP8 conversion budget.
On Ampere (A100, A40) and AMD MI250, where FP8 is unavailable or kernel-immature, AWQ INT4 is the throughput-favoured production choice. It beats GPTQ INT4 on raw decode tokens-per-second by 10-25 percent on typical Llama-family models, ties or wins on perplexity, and has wider runtime support. The historical GPTQ advantage on Ampere, the Marlin kernel, has been ported to AWQ format (`awq_marlin` in vLLM v0.6+), so the kernel gap has closed.
On edge and on-device targets (Jetson Orin, Apple M-series, llama.cpp on consumer GPUs), the trade-off shifts. GGUF Q4_K_M (a llama.cpp mostly-4-bit scheme that, like AWQ, is weight-only with per-block scales) and MLC-LLM's compiled AWQ checkpoints dominate that space. The Yobitel Edge AI fleet, described in the edge-inference entry, runs llama.cpp with Q4_K_M GGUF when latency is paramount and MLC-LLM with AWQ when WebGPU portability matters; either way, INT4 weight-only is the deployment shape.
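The rules of thumb in this section can be written down as a small lookup. The function name, the hardware labels and the return strings below are all hypothetical, chosen only to mirror the prose; it is a reading aid, not a substitute for benchmarking.

```python
def suggest_weight_format(gpu: str, memory_bound: bool = False,
                          small_batch_decode: bool = False) -> str:
    """Encode this section's rules of thumb; labels are illustrative only."""
    fp8_capable = {"h100", "h200", "b200", "b300"}
    no_fp8 = {"a100", "a40", "mi250"}
    edge = {"jetson-orin", "apple-m", "consumer-gpu"}

    gpu = gpu.lower()
    if gpu in edge:
        return "INT4 weight-only (GGUF Q4_K_M or MLC-LLM AWQ)"
    if gpu in no_fp8:
        return "AWQ INT4"
    if gpu in fp8_capable:
        # FP8 usually wins unless memory or a decode-bound profile favours 4-bit
        if memory_bound or small_batch_decode:
            return "AWQ INT4"
        return "FP8 W8A8"
    raise ValueError(f"no rule of thumb for GPU {gpu!r}")

print(suggest_weight_format("A100"))
print(suggest_weight_format("H100"))
print(suggest_weight_format("H100", memory_bound=True))
```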
Yobitel's Yobibyte Marketplace catalogues both AWQ and GPTQ recipes for the same model so customers can pick by workload. The default selection for Llama 3.1 70B Instruct, Qwen 3 72B, and Mistral Large is AWQ INT4 because TTFT and tokens-per-second dominate the typical chat SLO; GPTQ recipes are retained for teams with an existing GPTQ checkpoint they want to consume unchanged.
Trade-offs and known limitations
AWQ inherits the structural limitations of all weight-only post-training quantisation. Activations stay in FP16/BF16, so the inference path is mixed-precision: the INT4 weights must be dequantised to FP16 inside the kernel on every forward pass, which costs measurable compute and which W8A8 FP8 paths avoid by feeding 8-bit weights and activations to the Tensor Cores directly. On decode-heavy small-batch workloads (b=1, b=2) the dequant overhead is often hidden behind the inherently memory-bound nature of decoding; on prefill-heavy or large-batch workloads, FP8 W8A8 is meaningfully faster on Hopper and above.
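Why small-batch decode is memory-bound can be seen from a crude bandwidth bound: at batch size 1, every generated token has to stream all weights from HBM once. The sketch assumes roughly 3350 GB/s of peak HBM bandwidth (an approximate H100 SXM figure) and ignores KV-cache reads, kernel efficiency and multi-GPU effects, so the numbers are ceilings, not predictions.

```python
def decode_tokens_per_sec_bound(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on b=1 decode rate if weight streaming were the only cost."""
    return bandwidth_gb_s / weight_gb

HBM_BANDWIDTH_GB_S = 3350.0   # assumed peak, approximate

bf16_bound = decode_tokens_per_sec_bound(140.0, HBM_BANDWIDTH_GB_S)
int4_bound = decode_tokens_per_sec_bound(35.0, HBM_BANDWIDTH_GB_S)

print(f"BF16 weights: <= {bf16_bound:.0f} tokens/s per sequence")
print(f"INT4 weights: <= {int4_bound:.0f} tokens/s per sequence")
```

The 4x gap in bytes moved is what leaves room for the dequant arithmetic at small batch; at large batch the kernel becomes compute-bound and that room disappears.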
Calibration sensitivity is the most common operational pitfall. AWQ's scaling factors are derived from activation statistics on the calibration set; if that set is not representative of production prompts (e.g., calibrated on English Wikipedia but serving multilingual code completion), perplexity on the deployed workload can be noticeably worse than the published number. Recalibrate with 512-1024 in-domain samples when domain shift is large.
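A minimal sketch of assembling an in-domain calibration set from production-like prompts. The helper name and thresholds are illustrative. The commented-out call at the end assumes that AutoAWQ's `quantize` accepts a list of strings via `calib_data`, which is the case in the versions I am aware of; check the installed version before relying on it.

```python
import random

def build_calibration_set(texts, n_samples=512, min_chars=200, seed=0):
    """Deduplicate, drop very short texts, and draw a reproducible sample."""
    seen, pool = set(), []
    for text in texts:
        text = text.strip()
        if len(text) >= min_chars and text not in seen:
            seen.add(text)
            pool.append(text)
    random.Random(seed).shuffle(pool)
    return pool[:n_samples]

# Toy corpus standing in for logged, anonymised production prompts
corpus = ["short", "x" * 300, "x" * 300, "y" * 250, "  " + "z" * 400 + "  "]
samples = build_calibration_set(corpus, n_samples=2)
print(len(samples), [len(s) for s in samples])

# Assumed usage with AutoAWQ (not executed here):
# model.quantize(tokenizer, quant_config=quant_config, calib_data=samples)
```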
Long-context quality is generally preserved, but very small models (1B-3B class) can show 1-2 percent perplexity degradation that becomes visible on reasoning benchmarks. For 7B and up, AWQ INT4 is close to a free ~4x reduction in weight memory relative to BF16 as far as general-text perplexity is concerned.
Mathematical / coding accuracy is the workload where INT4 weight-only quantisation (AWQ or GPTQ) has historically lagged BF16 by the largest margin — 3-5 percent on GSM8K and HumanEval for Llama 3.1 70B. If the workload is reasoning-heavy, FP8 W8A8 or selective FP16 retention on math-heavy layers is preferable; the Yobibyte Marketplace exposes both options for the reader to choose between.
Do not naively compare 'INT4 perplexity' numbers across papers. Different calibration sets, different group sizes, different evaluation harnesses produce 0.3-0.5 point swings on the same checkpoint. The InferenceBench methodology runs all INT4 quantisations on the same evaluation suite at the same context length for a fair comparison.
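For a like-for-like comparison, both checkpoints need to be scored on the same token stream with the same context length. A minimal sketch of the metric itself follows; the negative log-likelihood values are made up for illustration and the function names are not from any particular harness.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_gap(ppl_quant, ppl_ref):
    """Fractional perplexity increase of the quantised model over the reference."""
    return (ppl_quant - ppl_ref) / ppl_ref

# Hypothetical per-token NLLs from the SAME evaluation text and context length
nll_bf16 = [1.60, 1.72, 1.55, 1.68]
nll_int4 = [1.62, 1.73, 1.57, 1.69]

ppl_ref = perplexity(nll_bf16)
ppl_q = perplexity(nll_int4)
print(f"BF16 ppl={ppl_ref:.3f}  INT4 ppl={ppl_q:.3f}  "
      f"gap={relative_gap(ppl_q, ppl_ref) * 100:.2f}%")
```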
Practical implementation notes
AutoAWQ (casper-hansen/AutoAWQ) is the canonical production conversion path; it wraps Lin et al.'s reference implementation in a HuggingFace-compatible API and exposes the standard quant_config knobs. Conversion of a 70B model on a single H100 takes 20-40 minutes depending on calibration set size. The snippet below covers the standard recipe: load BF16 weights, set w_bit=4 with group size 128, quantise, save the AWQ checkpoint.
Serving the checkpoint in vLLM is a single flag — `--quantization awq` selects the AWQ format, `awq_marlin` selects the faster Marlin-AWQ kernel path on Ampere and Hopper. The Yobibyte managed alternative skips the conversion entirely: customers select an AWQ recipe in their workspace and Yobitel handles the checkpoint, the kernel selection and the routing to NeoCloud capacity that supports it.
```python
# Producing an AWQ INT4 checkpoint with AutoAWQ
# pip install "autoawq>=0.2.6" transformers
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3.1-70B-Instruct"
OUT_DIR = "./llama-3.1-70b-awq-int4"

model = AutoAWQForCausalLM.from_pretrained(
    MODEL_ID,
    safetensors=True,
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

quant_config = {
    "zero_point": True,   # Asymmetric quantisation: ~0.2 perplexity win
    "q_group_size": 128,  # Industry default; 64 for tighter quality
    "w_bit": 4,           # 4-bit INT4 weights
    "version": "GEMM",    # GEMM = balanced prefill + decode; GEMV = b=1 edge
}

# Calibrates on AutoAWQ's default calibration set (128 samples);
# pass calib_data=... with in-domain samples when domain shift is large
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)

# Serve the AWQ checkpoint with vLLM (self-hosted; Yobibyte handles this end-to-end)
# vllm serve ./llama-3.1-70b-awq-int4 \
#   --quantization awq_marlin \
#   --tensor-parallel-size 2 \
#   --max-model-len 32768 \
#   --enable-prefix-caching \
#   --enable-chunked-prefill
```

Where AWQ fits in the Yobitel stack
Yobitel's Yobibyte Marketplace catalogues AWQ INT4 as the default quantisation recipe for the open-weights Llama, Qwen and Mistral families when the workspace SLO prioritises tokens-per-second and time-to-first-token. The recipe captures the published checkpoint plus the runtime flags that consume it correctly; customers select the recipe by name in their workspace and Yobibyte routes inference to NeoCloud capacity selected by Omniscient Compute based on the workload's KV-cache and concurrency profile. The internal mechanics of that routing are not customer-exposed; what the customer sees is an OpenAI-compatible endpoint that meets the stated SLO at the published price.
Yobitel NeoCloud's H100 SXM5 and H200 SXM5 SKUs run AWQ checkpoints natively through the awq_marlin kernel path. The 80 GB / 141 GB of HBM on those SKUs lets a single GPU host a 70B AWQ model with 30 GB+ of KV cache budget, which is the configuration that drives most chat-shaped Yobibyte workloads on the platform. For larger Qwen3-MoE and DeepSeek-V3 deployments, FP8 paths usually win over AWQ INT4 and become the default recipe instead.
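To put the 30 GB KV cache budget in perspective, the sketch below converts it into a token count. It assumes the Llama 3.1 70B attention shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an FP16 cache; paged-attention block overhead and fragmentation are ignored.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
budget_bytes = 30 * 10**9            # 30 GB KV cache budget
max_tokens = budget_bytes // per_token

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"~{max_tokens:,} cached tokens across all concurrent sequences")
```

That total is shared across every in-flight request, so it sets the ceiling on concurrency times context length rather than on any single sequence.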
Yobitel's InferenceBench publishes side-by-side measurements of AWQ INT4, GPTQ INT4 and FP8 W8A8 on the same model, GPU and runtime across H100, H200, B200 and MI300X tenancies — tokens-per-second, time-to-first-token, p99 latency and cost-per-million-tokens, with all configurations reproducible. For teams deciding between AWQ and the alternatives, InferenceBench is the empirical complement to the analysis in this entry.
References
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration · arXiv (Lin et al., 2023)
- llm-awq on GitHub (reference implementation) · GitHub (MIT-HAN Lab)
- AutoAWQ on GitHub (production conversion) · GitHub
- vLLM AWQ Quantisation Documentation · vLLM
- Marlin: Mixed-Precision Auto-Regressive Parallel Inference of LLMs · GitHub (IST-DASLab)