TL;DR
- Open-source LLM inference engine originally from UC Berkeley's Sky Computing Lab (June 2023), now governed under the LF AI & Data Foundation with contributors from Meta, NVIDIA, AMD, IBM and most major neoclouds. Apache 2.0, >100 model architectures, OpenAI-compatible REST surface.
- Built on PagedAttention (block-paged KV cache borrowed from OS virtual memory) and continuous batching (token-level scheduling) — together delivering 2-24x higher throughput than naive HuggingFace `model.generate()` at the same latency budget.
- Supports tensor / pipeline / expert parallelism, prefix caching, speculative decoding (draft model, EAGLE-2, Medusa, n-gram), chunked prefill, multi-LoRA hot-swap, AWQ / GPTQ / Marlin / FP8 / FP4 quantisation, and a Prometheus metrics endpoint with request-level histograms.
- Runs on NVIDIA (Ampere / Hopper / Blackwell), AMD ROCm (MI250 / MI300X), Intel Gaudi 2/3, AWS Neuron (Trn1 / Inf2) and Google TPU v5e/v5p. Ships as a PyPI package, a CUDA wheel and the `vllm/vllm-openai` container.
- Default inference engine in Yobitel's Yobibyte platform; scored continuously by Omniscient Compute against TensorRT-LLM and SGLang on InferenceBench across H100 SXM5, H200, B200 and MI300X tenancies.
Overview#
vLLM is an inference engine for transformer language models. It exposes an HTTP server that speaks the OpenAI Chat Completions, Completions and Embeddings APIs, an offline Python entry point (`LLM(...).generate(...)`) for batch inference, and a low-level engine API used by production stacks such as KServe, Ray Serve, NVIDIA Dynamo and the upstream vLLM Production Stack. The project's defining premise is that LLM serving is a memory-and-scheduling problem first and a kernel-tuning problem second.
The original v0.1 release (June 2023) shipped two ideas that the rest of the field has since adopted: PagedAttention, which manages the KV cache as fixed-size blocks indexed by per-sequence block tables; and continuous (iteration-level) batching, which admits and evicts sequences between every forward pass rather than at request boundaries. Together they raised KV-cache memory utilisation from roughly 40 percent to above 95 percent and lifted achievable throughput on a single H100 by an order of magnitude on chat-shaped workloads.
By mid-2026 the project sits under the LF AI & Data Foundation, with maintainers from UC Berkeley, IBM, Meta, NVIDIA, AMD, Anyscale, Snowflake, Databricks and Yobitel. The mainline release cadence is roughly every two to three weeks, with new model architectures typically supported within days of weight publication. vLLM is the runtime against which every newer engine — SGLang, TensorRT-LLM, MAX, MistralRS — is benchmarked. Yobibyte exposes vLLM as its default inference engine; Yobitel customers reach vLLM through a managed workspace rather than building containers, registries and schedulers from raw upstream.
This entry documents the production surface: the CLI and Python APIs, the scheduling and KV-cache internals, the parallelism strategies, the deployment patterns, the limits and quotas, the observability hooks, and the practical sizing and cost models. The aim is to help you stand up vLLM for production LLM serving with the right flags, sizing and operational practices, whether you run raw upstream on your own cluster or consume vLLM through Yobibyte's managed workspaces.
Quick start#
The example below deploys Llama 3.1 70B on a 4x H100 SXM5 node with FP8 weights, prefix caching, chunked prefill and a 32K context window, then issues an OpenAI-compatible chat completion. The first block installs vLLM and serves the model directly on any CUDA 12.4+ host; the second block hits the running endpoint with `curl`; the third block drives the same endpoint from Python using the standard `openai` SDK pointed at the local server.
# 1. Install vLLM and serve Llama 3.1 70B on 4x H100 SXM5
pip install "vllm>=0.8.0"
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-seqs 256 \
--gpu-memory-utilization 0.92 \
--quantization fp8 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--enable-chunked-prefill \
--num-scheduler-steps 8 \
--port 8000
# 2. Hit the OpenAI-compatible endpoint with curl
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
"max_tokens": 128
}'
# 3. Same call from Python using the official openai SDK
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
max_tokens=128,
)
print(reply.choices[0].message.content)
PY
How it works#
vLLM is structured as an asynchronous engine wrapped by a frontend API server. Requests enter through the FastAPI server, are tokenised, validated against `--max-model-len` and the per-request `max_tokens` budget, and handed to the LLMEngine. The engine maintains three queues: WAITING (admitted, not yet running), RUNNING (currently in the active batch), and SWAPPED (paged out to host memory under KV-cache pressure). On every step the scheduler picks the next set of sequences to run based on KV-cache availability and request priority, and the model executor performs one forward pass.
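The WAITING and RUNNING flow described above can be sketched in a few lines. This is a toy model under loud assumptions: 16-token KV blocks, FIFO admission, no swapping, no priorities, and every running sequence emitting one token per step. `Seq`, `step` and the block accounting are hypothetical names for illustration, not vLLM's internal API.

```python
from collections import deque
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per KV block (assumed)


@dataclass
class Seq:
    name: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0

    def blocks_needed(self) -> int:
        # Blocks covering the prompt, the tokens generated so far, and the next token.
        total = self.prompt_len + self.generated + 1
        return -(-total // BLOCK_SIZE)  # ceiling division


def step(waiting: deque, running: list, total_blocks: int) -> list:
    """Run one scheduler iteration; return names of sequences that finished."""
    used = sum(s.blocks_needed() for s in running)
    # Admit waiting sequences in FIFO order while their KV blocks still fit.
    while waiting and used + waiting[0].blocks_needed() <= total_blocks:
        seq = waiting.popleft()
        used += seq.blocks_needed()
        running.append(seq)
    # One forward pass: every running sequence emits exactly one token.
    for seq in running:
        seq.generated += 1
    finished = [s.name for s in running if s.generated >= s.max_new_tokens]
    running[:] = [s for s in running if s.generated < s.max_new_tokens]
    return finished


waiting = deque([Seq("a", 10, 2), Seq("b", 10, 4), Seq("c", 10, 1)])
running = []
history = [step(waiting, running, total_blocks=2) for _ in range(5)]
print(history)
```

In this run sequence `c` is admitted as soon as `a` releases its block, while `b` is still decoding, which is the behaviour that separates iteration-level batching from request-level batching.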
Forward execution uses FlashAttention-3 on Hopper and Blackwell and FlashAttention-2 on Ampere, with a paged-KV variant of the attention kernel that gathers K and V from the block pool via the per-sequence block table. Tensor-parallel and pipeline-parallel ranks coordinate via NCCL; the engine starts one worker process per rank with shared GPU memory for the KV pool. The KV pool is sized at start-up: its byte budget is (`gpu_memory_utilization` x device memory) minus weights minus the activation working set, and the number of blocks is that budget divided by the bytes one block occupies (block size in tokens x per-token KV bytes).
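The start-up sizing arithmetic can be written out as a small helper. This is a sketch only: `kv_pool_blocks` is a hypothetical name, and the weight and activation figures are plain inputs here, whereas a real engine would measure them on the device. The example inputs (17.5 GiB of weights and 3 GiB of activations per GPU, 40,960 KV bytes per token per GPU) are assumed for illustration.

```python
GIB = 2**30


def kv_pool_blocks(device_mem_gib: float, gpu_memory_utilization: float,
                   weights_gib: float, activations_gib: float,
                   block_size_tokens: int, kv_bytes_per_token: int) -> int:
    """Number of KV blocks that fit on one GPU after weights and activations."""
    budget = device_mem_gib * gpu_memory_utilization * GIB
    pool_bytes = budget - (weights_gib + activations_gib) * GIB
    if pool_bytes <= 0:
        raise ValueError("no memory left for the KV pool")
    bytes_per_block = block_size_tokens * kv_bytes_per_token
    return int(pool_bytes // bytes_per_block)


# Illustrative per-GPU inputs (assumed, not measured).
blocks = kv_pool_blocks(80, 0.92, weights_gib=17.5, activations_gib=3.0,
                        block_size_tokens=16, kv_bytes_per_token=40_960)
print(blocks, "blocks,", blocks * 16, "tokens")
```

Raising `gpu_memory_utilization` grows the pool linearly, which is why the flag trades directly against out-of-memory headroom.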
Chunked prefill changes how long inputs are processed. Without it, a single request with a 30K-token prompt monopolises the GPU for a full prefill before any decode tokens emerge, starving short interactive requests. With `--enable-chunked-prefill`, the scheduler interleaves prefill chunks (default 512 tokens) with decode tokens in the same iteration, dramatically improving p99 latency under mixed load. From v0.6 onward chunked prefill is the recommended default for production deployments.
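A rough sketch of how one iteration's token budget might be split between decode and prefill chunks follows. The decode-first ordering, the per-request chunk cap and the name `plan_step` are assumptions made for illustration; vLLM's scheduler applies its own policy under `--max-num-batched-tokens`.

```python
def plan_step(decode_seqs: int, prefill_remaining: list[int],
              token_budget: int, chunk_size: int = 512) -> tuple[int, list[int]]:
    """Return (decode_tokens, prefill_chunk_sizes) for one iteration."""
    # Decode first: each running sequence needs exactly one token this step.
    decode_tokens = min(decode_seqs, token_budget)
    budget = token_budget - decode_tokens
    chunks = []
    for remaining in prefill_remaining:
        # A prompt gets at most one chunk per step, capped by the leftover budget.
        take = min(remaining, chunk_size, budget)
        chunks.append(take)
        budget -= take
    return decode_tokens, chunks


# 100 decoding sequences, one 30K-token prompt and one 200-token prompt waiting.
print(plan_step(100, [30_000, 200], token_budget=768))
```

With these inputs the long prompt advances by one 512-token chunk, the short prompt gets the remaining 156 tokens of budget, and all 100 decoding sequences still make progress in the same step.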
Speculative decoding adds a draft pass per step. The engine runs a small draft model (or EAGLE-2 / Medusa heads attached to the target) to propose k tokens, then verifies them with one parallel forward pass of the target. Proposed tokens are committed up to the first rejection; at that position the target's own token is emitted instead and the remaining proposals are discarded. On chat workloads with a well-matched draft (Llama 3 8B for a 70B target), end-to-end latency drops 1.5-3x at unchanged quality.
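The verification step is easiest to see in its greedy form, sketched below. This is a simplification: it assumes greedy decoding, where acceptance is an exact token match, whereas sampled decoding uses a probabilistic accept/reject rule. `verify_greedy` is a hypothetical helper, and `target_tokens` stands for the target model's greedy choice at each draft position plus one extra position after the last draft token.

```python
def verify_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens committed after one draft-and-verify round."""
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score every draft position plus one")
    committed = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the target's token and drop the rest of the draft.
            committed.append(wanted)
            return committed
        committed.append(drafted)
    # Every draft token matched, so the extra target token comes for free.
    committed.append(target_tokens[-1])
    return committed


print(verify_greedy([5, 7, 9, 11], [5, 7, 8, 2, 4]))
print(verify_greedy([1, 2], [1, 2, 3]))
```

Each round commits at least one token, so the output is identical to plain greedy decoding of the target; the speed-up comes only from how many draft tokens are accepted per target forward pass.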
- PagedAttention: KV cache split into 16-token blocks (configurable), allocated on demand from a global pool. Identical prefix blocks are content-addressed and shared across sequences.
- Continuous batching: scheduler decision happens every iteration; sequences enter and leave the running batch at token boundaries.
- Prefix caching: cached blocks survive request completion until evicted under LRU; subsequent requests with matching prefixes skip prefill on shared tokens.
- Chunked prefill: prefill work split into fixed-size chunks interleaved with decode in the same step.
- Multi-step scheduling: `--num-scheduler-steps n` lets the worker execute n forward passes per scheduler invocation, reducing Python overhead by ~3-5x.
- CUDA graphs: enabled by default for decode-only batches; captures the forward pass to eliminate launch overhead.
Turn on prefix caching, chunked prefill and multi-step scheduling together before reaching for more exotic optimisations. The combined uplift on a shared-system-prompt workload is typically 30-70 percent over defaults.
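To illustrate the block table and content-addressed prefix sharing from the list above, here is a toy block pool. It is a sketch under stated assumptions: 16-token blocks, SHA-256 hashes chained over the prefix, no reference counting and no LRU eviction. `BlockPool` and its methods are hypothetical names, not vLLM classes.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed)


class BlockPool:
    def __init__(self) -> None:
        self.by_hash: dict[str, int] = {}  # content hash -> physical block id
        self.next_id = 0

    def _alloc(self) -> int:
        block_id = self.next_id
        self.next_id += 1
        return block_id

    def block_table(self, token_ids: list[int]) -> list[int]:
        """Map a token sequence to physical block ids, reusing full prefix blocks."""
        table = []
        parent = ""
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = token_ids[start:start + BLOCK_SIZE]
            if len(chunk) < BLOCK_SIZE:
                table.append(self._alloc())  # partial tail block stays private
                break
            # Chain the parent hash so a block is only shared when the whole prefix matches.
            payload = parent + "|" + ",".join(map(str, chunk))
            key = hashlib.sha256(payload.encode()).hexdigest()
            if key not in self.by_hash:
                self.by_hash[key] = self._alloc()
            table.append(self.by_hash[key])
            parent = key
        return table


pool = BlockPool()
system_prompt = list(range(32))  # two full blocks shared by both requests
table_a = pool.block_table(system_prompt + [100, 101])
table_b = pool.block_table(system_prompt + [200])
print(table_a, table_b)
```

Both requests point at the same two physical blocks for the shared system prompt and differ only in their private tail block, so the second request would skip prefill for the first 32 tokens.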
Reference and specifications#
Every long-lived deployment is parameterised through a small number of high-impact flags. The table below is the canonical reference for the engine CLI surface as of vLLM v0.8 (June 2026). Flags marked with an asterisk are also available as `EngineArgs` fields in the Python API. Flags not listed here are either internal tuning knobs that defaults handle correctly, or specialised features documented in the upstream reference.
| Flag | Type | Default | Description |
|---|---|---|---|
| --model | string | (required) | HuggingFace repo id or local path. Drives architecture detection. |
| --tensor-parallel-size * | int | 1 | Shard each weight matrix across N GPUs within a node via NCCL AllReduce. |
| --pipeline-parallel-size * | int | 1 | Split layers into stages across nodes. Tolerates lower interconnect bandwidth than TP. |
| --max-model-len * | int | model-defined | Maximum total tokens per sequence. Bounded by RoPE and KV-cache budget. |
| --max-num-seqs * | int | 256 | Hard cap on concurrent sequences in the running batch. |
| --max-num-batched-tokens * | int | auto | Cap on tokens per iteration; controls prefill / decode mix under chunked prefill. |
| --gpu-memory-utilization * | float | 0.9 | Fraction of GPU memory available to vLLM (weights + activations + KV pool). |
| --swap-space * | int (GB) | 4 | CPU memory reserved for swapping KV blocks when the GPU pool fills up. |
| --kv-cache-dtype * | string | auto | One of auto, fp8, fp8_e4m3, fp8_e5m2. FP8 KV halves cache footprint. |
| --quantization * | string | (off) | One of fp8, awq, gptq, gptq_marlin, bitsandbytes, fp4, int4_w4a16. |
| --enable-prefix-caching * | bool | false | Persist KV blocks across requests; reuse by content hash. |
| --enable-chunked-prefill * | bool | v0.6+ true | Interleave prefill chunks with decode in the same step. |
| --num-scheduler-steps * | int | 1 | Number of forward passes per scheduler invocation; 8-16 typical. |
| --speculative-model * | string | (off) | Draft model id or `[eagle]` / `[medusa]` / `[ngram]` for built-in heads. |
| --num-speculative-tokens * | int | 5 | Number of tokens the draft proposes per step. |
| --enable-lora * | bool | false | Enables multi-LoRA hot-swap; pair with `--max-loras` and `--max-lora-rank`. |
| --max-loras * | int | 1 | Number of LoRA adapters resident in GPU memory. |
| --disable-log-requests | bool | false | Suppress per-request access logs (recommended at high RPS). |
| --rope-scaling * | json | (model) | Override RoPE scaling (linear, dynamic, yarn, longrope) for context extension. |
| --guided-decoding-backend * | string | outlines | One of outlines, xgrammar, lm-format-enforcer; used for JSON / regex / grammar constrained decoding. |
| --enforce-eager | bool | false | Disable CUDA-graph capture; useful when debugging. |
| --worker-use-ray | bool | false | Drive workers via Ray instead of multiprocessing (multi-node). |
| --distributed-executor-backend * | string | mp | Either mp or ray. Use ray for cross-node deployments. |
| --block-size * | int | 16 | KV block size in tokens. 32 sometimes preferred at long context. |
| --preemption-mode * | string | recompute | Either recompute or swap. Swap moves blocks to CPU; recompute re-runs prefill. |
| --served-model-name | string | model id | Override the model name reported via the OpenAI API. |
| --api-key | string | (none) | Optional bearer-token auth for the API server. |
| --enable-auto-tool-choice | bool | false | Enable native tool-calling on Llama 3, Mistral, Hermes, Granite chat templates. |
| --limit-mm-per-prompt | json | (model) | Cap multimodal items per prompt for VLM serving. |
From v0.8 onwards a subset of these flags can be tuned via the `/load_config` admin endpoint without restart. Restartless reconfiguration of tensor-parallel size and quantisation mode is not supported and never will be — those changes require a new engine.
Workload patterns#
Three workload shapes cover the bulk of vLLM production deployments: interactive chat behind an OpenAI-compatible gateway, RAG with long shared system prompts, and offline batch inference for evaluation or labelling. Each has its own preferred set of flags. These are the same three shapes Yobibyte automates for managed customers — the flags below are what a team running raw vLLM on their own Kubernetes signs up to hand-tune; the Yobibyte console derives them from the workspace's stated SLO.
Pattern A: chat endpoint, OpenAI-compatible. Maximise concurrent users at a target p95 time-to-first-token of 250-500 ms. Enable prefix caching for repeated system prompts, chunked prefill so a long prompt does not stall short requests, and multi-step scheduling to amortise Python overhead.
Pattern B: RAG endpoint with a 4-32K shared system prompt across thousands of requests.
Pattern C: offline batch scoring with no API server.
# A — chat endpoint on 2x H100 SXM5 for an 8B-class model
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 16384 \
--max-num-seqs 512 \
--gpu-memory-utilization 0.90 \
--enable-prefix-caching \
--enable-chunked-prefill \
--num-scheduler-steps 16 \
--kv-cache-dtype fp8 \
--quantization fp8 \
--enable-auto-tool-choice \
--port 8000
# B — RAG endpoint, 4-32K shared system prompt across thousands of requests
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.92 \
--enable-prefix-caching \
--enable-chunked-prefill \
--block-size 32 \
--quantization fp8
# C — offline batch scoring (no API server)
python - <<'PY'
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
quantization="fp8",
enable_prefix_caching=True,
max_num_seqs=1024,
gpu_memory_utilization=0.95,
)
prompts = open("eval-prompts.txt").read().splitlines()
out = llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0))
for r in out:
    print(r.outputs[0].text)
PY
Pattern B with very high prefix-cache hit rates can paradoxically saturate decode bandwidth: prefill is cheap, but each decode token still costs one full forward pass. Watch `vllm:prefix_cache_hit_rate` next to `vllm:gpu_cache_usage_perc`; if hits exceed 80 percent and decode latency spikes, scale by adding decode replicas, not bigger TP.
Sizing and capacity planning#
vLLM throughput is bounded first by KV-cache memory, then by tensor-core FLOPs, then by NCCL bandwidth at TP > 2. The planning model below assumes Llama-family architectures with grouped-query attention and FP8 weights and KV; FP16 / BF16 doubles every memory column. Tokens-per-second figures are mid-range observed values from InferenceBench v3 at 4K input / 256 output, mixed concurrency; treat them as planning anchors, not contractual.
KV-cache budget is the constraint that decides max concurrency. The per-token KV size is `2 x n_layers x n_kv_heads x head_dim x dtype_bytes`. For Llama 3.1 70B (80 layers, 8 KV heads under GQA, head dimension 128) at FP8 that is 163,840 bytes, roughly 160 KB per token in total, or about 40 KB per token on each GPU at TP=4. On 4x H100 with TP=4 the weights occupy ~70 GB and activations ~12 GB, leaving ~240 GB of pooled KV: about 1.5 million KV tokens, enough for roughly 45 concurrent sequences at an average of 32K tokens used or roughly 360 sequences averaging 4K.
Tensor-parallel size choice is governed by intra-node bandwidth: TP up to 8 inside one NVLink island is well-behaved; TP across InfiniBand is almost always slower than pipeline parallelism. For two-node deployments, prefer TP=8 + PP=2 over TP=16. For three or more nodes, expert parallelism dominates if the model is MoE; otherwise pipeline with replicas behind a router beats deep PP.
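The per-token formula can be evaluated directly. The sketch below assumes Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dimension 128), one byte per element for FP8, and treats the ~240 GB pool as decimal gigabytes; the function names are illustrative.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # Factor 2 covers the separate K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_sequences(pool_bytes: int, bytes_per_token: int, avg_tokens_per_seq: int) -> int:
    """Sequences that fit in the pool at a given average length."""
    return pool_bytes // (bytes_per_token * avg_tokens_per_seq)


per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=1)
per_token_per_gpu = per_token // 4          # share stored on each rank at TP=4
pool_bytes = 240 * 10**9                    # pooled KV across the 4 GPUs (assumed)

print(per_token, per_token_per_gpu)
print(pool_bytes // per_token)              # total KV tokens in the pool
print(max_sequences(pool_bytes, per_token, 32_768))
print(max_sequences(pool_bytes, per_token, 4_096))
```

Dividing the per-token total by the tensor-parallel size gives the share each GPU stores, while concurrency must be computed from the total across all ranks because every rank holds a slice of every token.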
| Workload | Model | Recommended SKU | Concurrency | Output tok/s | Notes |
|---|---|---|---|---|---|
| Chat, low latency | Llama 3.1 8B | 1x H100 SXM5 80GB | 64-128 | 3,800-5,200 | FP8 weights + KV, multi-step 16. |
| Chat, balanced | Llama 3.1 70B | 4x H100 SXM5 | 128-256 | 2,800-4,200 | TP=4, FP8, chunked prefill 512. |
| Chat, high QPS | Llama 3.1 70B | 8x H100 SXM5 | 256-512 | 5,200-7,800 | TP=8, prefix cache on shared prompts. |
| Long context (128K) | Llama 3.1 70B | 2x H200 141GB | 32-64 | 1,400-2,200 | FP8 KV, block-size 32, swap-space 32GB. |
| MoE serving | Mixtral 8x22B | 8x H100 SXM5 | 192-384 | 4,500-6,800 | TP=8 with expert parallelism. |
| MoE serving | DeepSeek-V3 671B | 16x H100 SXM5 (2 nodes) | 256-512 | 3,200-4,800 | TP=8 + PP=2, NVLink + 400Gb IB. |
| RAG, prefix-heavy | Llama 3.1 70B | 4x H100 SXM5 | 256-512 | 6,000-9,500 | Prefix hit rate >70 percent assumed. |
| Offline batch | Llama 3.1 70B | 4x H100 SXM5 | 1024+ | 8,500-12,000 | Disable streaming, max_num_seqs 1024. |
| Edge inference | Llama 3.1 8B Q4 | 1x L40S 48GB | 16-32 | 1,400-2,000 | AWQ INT4, FP16 KV, eager mode. |
| Blackwell next-gen | Llama 3.1 70B | 4x B200 | 256-512 | 6,800-10,500 | FP4 weights, FP8 KV, FA3 kernels. |
Limits and quotas#
vLLM enforces a small set of hard and soft limits at the engine boundary. Hard limits reject requests with HTTP 400 at the API server; soft limits apply backpressure by extending queue depth. Operational ceilings (memory, NCCL groups, file descriptors) come from the host OS and CUDA runtime.
| Limit | Default | Hard ceiling | How to raise |
|---|---|---|---|
| max_model_len | model-defined | RoPE-limited (e.g. 128K Llama 3.1) | Use --rope-scaling longrope/yarn; verify quality. |
| max_num_seqs | 256 | KV-cache budget | Raise --max-num-seqs; check `gpu_cache_usage_perc`. |
| max_num_batched_tokens | auto (8192) | Activation memory | Raise carefully; watch p99 prefill latency. |
| max_loras | 1 | GPU memory | Raise; activation matmuls cost grows linearly. |
| max_lora_rank | 16 | 64 | Higher ranks raise per-step compute by ~5 percent. |
| Replicas per engine | 1 | Hardware-bounded | Scale by adding pods, not engines per pod. |
| TP size (intra-node) | 1 | 8 (NVLink) | Bounded by GPUs per node. |
| PP size (cross-node) | 1 | ~32 in practice | Bounded by pipeline bubble overhead. |
| Request body size | unlimited | HTTP server limit | Enforce at the reverse proxy; --max-log-len only truncates logged prompts. |
| Concurrent requests / engine | max_num_seqs + queue | Memory-bounded | Add replicas behind a router. |
| Shared memory (NCCL + MIG) | /dev/shm | Container-defined | Mount /dev/shm >= 1GB per worker. |
| File descriptors | 1024 | ulimit | ulimit -n 65536 in container. |
Multi-Instance GPU (MIG) slices on H100 advertise reduced memory but share /dev/shm with siblings. If you run vLLM TP>1 inside a MIG slice you must increase the shared-memory limit on the container; the default 64 MB will OOM NCCL on the first AllReduce.
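The difference between a hard limit and a soft limit, as described at the top of this section, can be sketched as a three-way admission decision. The function name `admit` and its return labels are hypothetical; the real server performs this validation inside its request handling and scheduler.

```python
def admit(prompt_tokens: int, max_tokens: int, max_model_len: int,
          running: int, max_num_seqs: int) -> str:
    """Classify a request as rejected, queued, or runnable."""
    # Hard limit: the request can never fit, so it is refused with HTTP 400.
    if prompt_tokens + max_tokens > max_model_len:
        return "reject_400"
    # Soft limit: the batch is full, so the request waits in the queue.
    if running >= max_num_seqs:
        return "queue"
    return "run"


print(admit(32_000, 1_024, 32_768, running=10, max_num_seqs=256))
print(admit(4_000, 256, 32_768, running=256, max_num_seqs=256))
print(admit(4_000, 256, 32_768, running=10, max_num_seqs=256))
```

Only the first case is visible to the client as an error; the second shows up as queue time rather than a failure, which is why queue-time metrics matter for capacity planning.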
Observability#
vLLM exposes a Prometheus metrics endpoint at `/metrics` covering request throughput, latency histograms, GPU cache utilisation, prefix-cache hit rate, scheduler queue depth and preemption counts. The metric prefix is `vllm:`. Engine logs emit one structured line per request when `--disable-log-requests` is unset; set verbosity with `VLLM_LOGGING_LEVEL=INFO` and switch to JSON output by pointing `VLLM_LOGGING_CONFIG_PATH` at a logging config file, for ingestion into Loki, Splunk or Datadog.
The metrics worth alerting on in production are: time-to-first-token p95, inter-token latency p95, GPU cache usage, prefix-cache hit rate, number of preempted sequences, and request queue time. The following Prometheus rules cover the common failure modes.
- vllm:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
- vllm:time_per_output_token_seconds — decode latency; should be near 1 / (theoretical tok/s) when batch is full.
- vllm:gpu_cache_usage_perc — KV pool fill; consistent reading above 90 percent indicates capacity headroom is gone.
- vllm:prefix_cache_hit_rate — fraction of incoming prefill tokens served from cache.
- vllm:num_preemptions_total — sequences evicted under KV pressure; non-zero in steady state means undersized cluster.
- vllm:request_queue_time_seconds — time from request arrival to first scheduler admission.
- DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with vLLM metrics to distinguish compute, memory and idle bottlenecks.
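Because the latency metrics are histograms, a p95 is an estimate interpolated from cumulative bucket counts. The sketch below mirrors the linear interpolation that Prometheus's `histogram_quantile` performs; the bucket boundaries and counts are invented for illustration and the function is a simplified re-implementation, not a Prometheus client call.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with float("inf").
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile lies beyond the last finite bucket
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT buckets in seconds with cumulative request counts.
ttft = [(0.1, 20), (0.25, 60), (0.5, 90), (1.0, 98), (float("inf"), 100)]
print(histogram_quantile(0.95, ttft))
```

The estimate can never be more precise than the bucket it lands in, so a p95 that sits in a wide bucket should be read as a range rather than an exact latency.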
# Prometheus rules for a vLLM deployment
groups:
  - name: vllm-sla
    interval: 30s
    rules:
      - alert: VLLMHighTimeToFirstToken
        expr: histogram_quantile(0.95,
          sum by (le, model_name) (
          rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "vLLM TTFT p95 above 1s on {{ $labels.model_name }}"
      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "KV pool >95 percent full; preemption imminent"
      - alert: VLLMPreemptionSpike
        expr: increase(vllm:num_preemptions_total[5m]) > 20
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Preemptions rising; capacity insufficient or runaway request"
      - alert: VLLMPrefixCacheCollapse
        expr: vllm:prefix_cache_hit_rate < 0.20
          and vllm:prefix_cache_queries_total > 100
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "Prefix cache hit rate dropped; workload shape changed"
      - alert: VLLMGPUUnderutilised
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
          and rate(vllm:request_success_total[5m]) > 0
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "GPU under 30 percent; investigate Python overhead or PP bubble"
If TTFT p95 is high but GPU utilisation is low, suspect Python-side overhead: raise `--num-scheduler-steps`, ensure CUDA graphs are enabled, and confirm the engine is not running with `--enforce-eager`.
Cost and FinOps#
vLLM cost economics are dominated by three levers: GPU rental rate, achieved tokens-per-second-per-GPU, and average prefix-cache hit rate. Holding model and SKU fixed, doubling sustained throughput halves the unit cost in $/M tokens. The table below uses Yobitel UK list pricing (June 2026) and InferenceBench v3 throughput anchors; substitute your own when planning.
- Spot instances cut GPU rate 40-60 percent but require autoscaling that tolerates 30-90s pre-emption notices. Pair with vLLM's draining endpoint to flush in-flight requests cleanly.
- FP8 weights + FP8 KV is the largest single lever on $/M tokens available on Hopper; BF16 is roughly 1.6x more expensive at the same SLO.
- Prefix-cache hits are accounted at zero prefill cost — high-overlap workloads (multi-tenant agents, shared system prompts) can lift effective throughput 1.5-2x at no infrastructure change.
- FOCUS-conformant billing exports from Yobitel include `inference_engine` and `model_name` resource tags so $/M tokens can be sliced by tenant or product line.
| Configuration | GPU rate ($/h) | Sustained tok/s | $/M output tokens | Notes |
|---|---|---|---|---|
| 1x H100 SXM5, Llama 3.1 8B FP8 | $3.20 | 4,500 | $0.20 | Single replica, prefix cache on. |
| 4x H100 SXM5, Llama 3.1 70B FP8 | $12.40 | 3,500 | $0.98 | TP=4, chunked prefill. |
| 8x H100 SXM5, Llama 3.1 70B FP8 | $24.80 | 6,800 | $1.01 | TP=8, prefix cache 60 percent. |
| 2x H200, Llama 3.1 70B 128K ctx | $8.40 | 1,800 | $1.30 | Long context tax. |
| 4x B200, Llama 3.1 70B FP4 | $22.00 | 9,200 | $0.66 | Blackwell FP4 + FA3. |
| 4x H100 spot, Llama 3.1 70B | $6.20 | 2,800 | $0.62 | Spot interruption averaged in. |
| 8x MI300X, Llama 3.1 70B FP8 | $18.80 | 5,400 | $0.97 | ROCm 6.2, FA-ROCm kernel. |
| Hosted SaaS reference (GPT-4o mini class) | n/a | n/a | $0.60 | List API price; comparison only. |
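The $/M output tokens column follows from the other two columns by simple arithmetic, sketched here. The helper name is illustrative, and the rate is taken to be the hourly price of the whole configuration (all GPUs in the row), which is how the table's figures work out.

```python
def usd_per_million_tokens(rate_per_hour: float, tokens_per_second: float) -> float:
    """Unit cost of output tokens for one configuration."""
    tokens_per_hour = tokens_per_second * 3600
    return rate_per_hour / tokens_per_hour * 1_000_000


# Two rows from the table: 1x H100 with an 8B model, 4x H100 with a 70B model.
print(round(usd_per_million_tokens(3.20, 4_500), 2))
print(round(usd_per_million_tokens(12.40, 3_500), 2))
```

Substituting your own rental rate and a throughput measured on your traffic gives a figure that is directly comparable with the rows above.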
Security and compliance#
vLLM ships with an opt-in bearer-token auth on the API server (`--api-key`); production deployments terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply mTLS or signed-JWT auth at that layer. The engine itself does not enforce per-tenant quotas — those are implemented at the gateway or via the Yobitel platform's multi-tenant router. Network isolation should follow the standard pattern: the inference pod has no egress to the public internet, model weights are pulled from a private registry, and per-replica NetworkPolicy locks ingress to the gateway service account.
Prompt-injection mitigation is a layered concern that vLLM does not solve directly. The engine supports system-prompt pinning via prefix caching (the system prompt is the same blocks every request) and structured-output enforcement via Outlines, XGrammar or LM Format Enforcer. Pair these with retrieval source validation, output classifiers and rate limits at the gateway. The Yobibyte platform enforces this stack by default.
Regulatory implications are model-, data- and deployment-specific. For UK public-sector workloads, deploy on Yobitel sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14 (vLLM as the engine is a control-plane component on those tenancies). For EU GDPR, the engine processes prompt and completion data only in volatile GPU memory and the on-disk scratch path; ensure `--swap-space` resides on encrypted ephemeral storage. For US HIPAA workloads, run inside a BAA-covered VPC and disable request logging; for FedRAMP, run the FIPS-validated CUDA build and pin to NIAP-approved cipher suites at the ingress.
Multi-tenant single-engine vLLM deployments share the prefix cache across tenants by default. If tenants must not see each other's system prompts in any side-channel, either run one engine per tenant or set `--enable-prefix-caching=false` and accept the throughput hit.
Migration and alternatives#
Most production migrations to vLLM come from one of four origins: Hugging Face `pipeline()` / `model.generate()`, Hugging Face TGI, NVIDIA TensorRT-LLM, or a managed SaaS API (OpenAI, Anthropic, AWS Bedrock). The first delivers the largest throughput uplift (5-10x typical at the same latency); the others are roughly at parity on throughput with different operational trade-offs. The table below also lists SGLang and llama.cpp / Ollama for completeness.
If you are running production today on Kubernetes with TGI, the migration is essentially a container swap and a flag rename. The commands after the table produce comparable vLLM deployments on Kubernetes, AWS and GCP; you can roll vLLM behind the same `Service` and shift traffic at the gateway.
| From | Migration effort | Throughput change | Operational notes |
|---|---|---|---|
| HuggingFace pipeline / generate | Low — drop in OpenAI client | 5-10x faster | Eliminates GIL-bound serving loop. |
| TGI (Text Generation Inference) | Low — same OpenAI API | Comparable, vLLM wins on new models | Lose TGI multi-LoRA hot-swap polish; gain wider model support. |
| TensorRT-LLM + Triton | Medium — drop engine build | 10-30 percent slower at same latency | Gain rapid model rotation; lose absolute-min latency. |
| SGLang | Low — same API surface | Roughly equal on chat; SGLang wins on agents | Switch back for RadixAttention-heavy workloads. |
| OpenAI / Bedrock / Anthropic API | High — model substitution | Variable | Gain control, sovereignty; lose hosted model variety. |
| llama.cpp / Ollama (cloud) | Low — same model | 3-8x faster on GPU | Use llama.cpp for CPU and Apple Silicon. |
# Production deployment on Kubernetes with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-vllm } }
  template:
    metadata: { labels: { app: llama3-70b-vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - "--model=meta-llama/Meta-Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size=4"
            - "--max-model-len=32768"
            - "--enable-prefix-caching"
            - "--enable-chunked-prefill"
            - "--quantization=fp8"
            - "--kv-cache-dtype=fp8"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 8000 }]
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML
# Equivalent today on AWS (bare p5 instance with Deep Learning AMI)
AMI_ID=$(aws ec2 describe-images --owners amazon \
--filters "Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4*" \
--query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)
aws ec2 run-instances \
--image-id "$AMI_ID" \
--instance-type p5.48xlarge \
--user-data "$(printf '#!/bin/bash\npip install "vllm>=0.8.0"\nvllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8 --port 8000\n')"
# Equivalent on GCP A3 (H100)
gcloud compute instances create vllm-llama70b \
--machine-type=a3-highgpu-8g \
--accelerator=type=nvidia-h100-80gb,count=8 \
--image-family=pytorch-2-4-cu124 --image-project=deeplearning-platform-release \
--metadata=startup-script='pip install "vllm>=0.8.0" && vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8'
If you must hit absolute-min latency on a fixed model (real-time voice agents, low-latency RAG, public APIs under tight SLOs), keep TensorRT-LLM. vLLM closes most of the gap with FA3 and CUDA graphs, but compiled engines still win the last 10-20 percent.
Troubleshooting#
The error table below covers the failure modes that account for roughly 80 percent of production vLLM incidents observed on Yobitel-operated fleets and InferenceBench community submissions. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom / Error | Cause | Fix |
|---|---|---|
| torch.cuda.OutOfMemoryError on first request | gpu_memory_utilization too high; activations crowd KV pool. | Lower to 0.88; raise --swap-space; check no other process on GPU. |
| NCCL hang on startup with TP>1 | /dev/shm too small or CUDA_VISIBLE_DEVICES misordered. | Mount /dev/shm >= 8GB; pin NVIDIA_VISIBLE_DEVICES per worker; set NCCL_DEBUG=INFO. |
| Very slow first token after deploy | CUDA-graph capture on cold start. | Expected for first 30-60s; pre-warm with a synthetic request before flipping traffic. |
| Prefix cache hit rate near zero | System prompt varies by request (e.g. timestamp). | Move volatile fields out of the cached prefix; re-measure `vllm:prefix_cache_hit_rate`. |
| Speculative decoding regression in throughput | Draft model too large or accept rate too low. | Halve --num-speculative-tokens; switch to EAGLE-2 head; benchmark accept rate. |
| Quantisation accuracy drift on FP4 / INT4 | Calibration set unrepresentative. | Recalibrate on real traffic; pin to AWQ or Marlin paths over raw GPTQ. |
| HTTP 400 prompt too long | Total tokens exceed --max-model-len. | Raise --max-model-len with --rope-scaling longrope, or chunk the prompt client-side. |
| Preemption rate climbs in steady state | max_num_seqs too high for KV-cache budget. | Lower --max-num-seqs; or add replicas; never push gpu_memory_utilization above 0.95. |
| TTFT p95 spikes under mixed load | Long prefill starves decode. | Enable --enable-chunked-prefill; tune --max-num-batched-tokens to 4096-8192. |
| Throughput drops after upgrading driver | FlashAttention kernel selection regressed. | Pin VLLM_ATTENTION_BACKEND=FLASHINFER or FLASH_ATTN; rerun benchmark. |
| LoRA adapter latency dominates | Too many --max-loras resident. | Cap at 8-16 on H100; benchmark adapter activation matmul cost. |
| Multi-node deployment never reaches steady state | Pipeline bubble too large or NCCL over IB misconfigured. | Lower --pipeline-parallel-size; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA. |
Where this fits in the Yobitel stack#
vLLM is the default inference engine inside Yobibyte, Yobitel's AI-native platform. Every model that lands in the Yobibyte catalogue — from Llama 3.1 8B to DeepSeek-V3 671B — is exposed first via a vLLM-backed endpoint, with TensorRT-LLM offered as an opt-in performance variant where lowest-latency SLOs justify the engine build overhead. The Yobibyte control plane handles fleet sizing, prefix-cache-aware routing, multi-LoRA tenancy, draining for spot pre-emption, and FOCUS-conformant cost attribution back to tenants.
Omniscient Compute scores vLLM continuously on InferenceBench v3, the public benchmark suite Yobitel maintains for inference engines across NVIDIA H100, H200, B200 and AMD MI300X tenancies. Each release is benchmarked at fixed input/output token mixes (chat, RAG, long-context, batch) and the results are surfaced to customers as live capacity plans — every recommended SKU and replica count on the Yobibyte console comes from an InferenceBench measurement, not a vendor datasheet.
For UK and EU sovereign workloads, vLLM runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an open-source Apache 2.0 engine, sovereign hardware, and transparent benchmark scoring is what lets Yobitel customers deploy production LLMs without ceding control or visibility to a hosted SaaS API.
References
- Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)
- vLLM Documentation · vLLM Project
- vLLM on GitHub · GitHub
- vLLM Production Stack · GitHub
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · arXiv (Shah et al., 2024)
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills · arXiv (Agrawal et al., 2023)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
- Outlines — Guided Generation for Language Models · GitHub (dottxt)