Hugging Face TGI (Text Generation Inference)

TL;DR

Open-source LLM inference server from Hugging Face, first released August 2022.
Written in Rust (router) and Python (model server) with tight integration into the Hugging Face Hub for one-command model deployment.
Features continuous batching, paged KV cache, FP8 and AWQ quantisation, multi-LoRA hot-swapping and an OpenAI-compatible Messages API.
Powers Hugging Face Inference Endpoints and is a common pick when teams already live inside the HF ecosystem.

Overview#

Text Generation Inference (TGI) is Hugging Face's production stack for serving LLMs. It splits responsibilities between a Rust router that handles HTTP, request validation, queuing and scheduling, and a Python model server that runs the forward pass. The router speaks an OpenAI-compatible Messages API and a native generate API, while the model server uses paged KV cache and continuous batching internally.

Because TGI is maintained by the same team that runs the Hugging Face Hub, it ships with the smoothest path from a Hub model ID to a running endpoint — `text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct` is the entire deployment.

Features#

Continuous batching with paged KV cache.
Multi-LoRA serving — load many adapters into one base-model engine and route per request via an `adapter_id` field.
Quantisation: FP8, AWQ INT4, GPTQ INT4, BitsAndBytes INT8.
Speculative decoding via Medusa heads or draft models.
Watermarking and guidance APIs for safety integrations.
Sharded tensor parallelism across GPUs in a node.
Hardware backends: NVIDIA CUDA, AMD ROCm, Intel Gaudi, AWS Neuron, Google TPU.

Multi-LoRA Serving#

TGI is the open-source reference implementation for serving many LoRA adapters behind a single base model. Adapters are loaded into GPU memory at start-up or hot-loaded at runtime; per-request routing chooses which adapter is active for that forward pass. For SaaS platforms where each tenant has their own fine-tune, this collapses what would have been dozens of dedicated deployments into one.

Cap concurrent adapters at the point where adapter activation matmuls start to dominate the step time — typically around 8-32 adapters per base model on H100.

Deployment#

TGI is distributed primarily as a container image on the Hugging Face registry. The same image powers Hugging Face Inference Endpoints, Sagemaker, Azure ML and many private deployments. Local launches use the `text-generation-launcher` CLI; in Kubernetes, the official Helm chart or KServe handle autoscaling and rollout.

bash

# Launch TGI with Llama 3.1 8B and two LoRA adapters
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantize fp8 \
  --lora-adapters customer-a=org/customer-a-lora,customer-b=org/customer-b-lora \
  --max-batch-prefill-tokens 16384

Licence and Governance#

TGI moved through a non-commercial-restricted licence (HFOIL) in 2023 and returned to Apache 2.0 in 2024 after community feedback. As of 2026 it is Apache 2.0 throughout with no restrictions on commercial use or hosting.

When to Use#

Choose TGI when the team is already invested in Hugging Face tooling, when multi-LoRA per-tenant routing is a hard requirement, or when you want a battle-tested OpenAI-compatible endpoint with the smoothest Hub integration. For raw throughput on the latest model architectures, vLLM and SGLang often pull ahead first; TGI follows close behind with a slightly later release cadence.

References

Text Generation Inference on GitHub · GitHub (Hugging Face)
TGI Documentation · Hugging Face
Hugging Face Inference Endpoints · Hugging Face

Overview#

Features#

Continuous batching with paged KV cache.

Multi-LoRA serving — load many adapters into one base-model engine and route per request via an `adapter_id` field.

Quantisation: FP8, AWQ INT4, GPTQ INT4, BitsAndBytes INT8.

Speculative decoding via Medusa heads or draft models.

Watermarking and guidance APIs for safety integrations.

Sharded tensor parallelism across GPUs in a node.

Hardware backends: NVIDIA CUDA, AMD ROCm, Intel Gaudi, AWS Neuron, Google TPU.

Multi-LoRA Serving#

Cap concurrent adapters at the point where adapter activation matmuls start to dominate the step time — typically around 8-32 adapters per base model on H100.

Deployment#

bash

# Launch TGI with Llama 3.1 8B and two LoRA adapters
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --quantize fp8 \
  --lora-adapters customer-a=org/customer-a-lora,customer-b=org/customer-b-lora \
  --max-batch-prefill-tokens 16384

When to Use#

Hugging Face TGI (Text Generation Inference)

Overview#

Features#

Multi-LoRA Serving#

Deployment#

Licence and Governance#

When to Use#

References

Browse all entries

Deploy on Yobitel

Hugging Face TGI (Text Generation Inference)

Overview#

Features#

Multi-LoRA Serving#

Deployment#

Licence and Governance#

When to Use#

References

Browse all entries

Deploy on Yobitel