vLLM

TL;DR

Open-source LLM inference engine originally from UC Berkeley's Sky Computing Lab (June 2023), now governed under the LF AI & Data Foundation with contributors from Meta, NVIDIA, AMD, IBM and most major neoclouds. Apache 2.0, >100 model architectures, OpenAI-compatible REST surface.
Built on PagedAttention (block-paged KV cache borrowed from OS virtual memory) and continuous batching (token-level scheduling) — together delivering 2-24x higher throughput than naive HuggingFace `model.generate()` at the same latency budget.
Supports tensor / pipeline / expert parallelism, prefix caching, speculative decoding (draft model, EAGLE-2, Medusa, n-gram), chunked prefill, multi-LoRA hot-swap, AWQ / GPTQ / Marlin / FP8 / FP4 quantisation, and a Prometheus metrics endpoint with request-level histograms.
Runs on NVIDIA (Ampere / Hopper / Blackwell), AMD ROCm (MI250 / MI300X), Intel Gaudi 2/3, AWS Neuron (Trn1 / Inf2) and Google TPU v5e/v5p. Ships as a PyPI package, a CUDA wheel and the `vllm/vllm-openai` container.
Default inference engine in Yobitel's Yobibyte platform; scored continuously by Omniscient Compute against TensorRT-LLM and SGLang on InferenceBench across H100 SXM5, H200, B200 and MI300X tenancies.

Overview

vLLM is an inference engine for transformer language models. It exposes an HTTP server that speaks the OpenAI Chat Completions, Completions and Embeddings APIs, an offline Python entry point (LLM(...).generate(...)) for batch inference, and a low-level engine API used by production stacks such as KServe, Ray Serve, NVIDIA Dynamo and the upstream vLLM Production Stack. The project's defining premise is that LLM serving is a memory-and-scheduling problem first and a kernel-tuning problem second.

The original v0.1 release (June 2023) shipped two ideas that the rest of the field has since adopted: PagedAttention, which manages the KV cache as fixed-size blocks indexed by per-sequence block tables; and continuous (iteration-level) batching, which admits and evicts sequences between every forward pass rather than at request boundaries. Together they raised KV-cache memory utilisation from roughly 40 percent to above 95 percent and lifted achievable throughput on a single H100 by an order of magnitude on chat-shaped workloads.

By mid-2026 the project sits under the LF AI & Data Foundation, with maintainers from UC Berkeley, IBM, Meta, NVIDIA, AMD, Anyscale, Snowflake, Databricks and Yobitel. The mainline release cadence is roughly every two to three weeks, with new model architectures typically supported within days of weight publication. vLLM is the runtime against which every newer engine — SGLang, TensorRT-LLM, MAX, MistralRS — is benchmarked. Yobibyte exposes vLLM as its default inference engine; Yobitel customers reach vLLM through a managed workspace rather than building containers, registries and schedulers from raw upstream.

This entry documents the production surface: the CLI and Python APIs, the scheduling and KV-cache internals, the parallelism strategies, the deployment patterns, the limits and quotas, the observability hooks, and the practical sizing and cost models you need to operate vLLM at scale. It applies whether you are running raw upstream on your own cluster or consuming vLLM through Yobibyte's managed workspaces.

Quick start

The example below deploys Llama 3.1 70B on a 4x H100 SXM5 node with FP8 weights, prefix caching, chunked prefill and a 32K context window, then issues an OpenAI-compatible chat completion. The first block installs vLLM and serves the model directly on any CUDA 12.4+ host; the second block hits the running endpoint with curl; the third block drives the same endpoint from Python using the standard openai SDK pointed at the local server.

# 1. Install vLLM and serve Llama 3.1 70B on 4x H100 SXM5
pip install "vllm>=0.8.0"

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.92 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --num-scheduler-steps 8 \
    --port 8000

# 2. Hit the OpenAI-compatible endpoint with curl
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
      "max_tokens": 128
    }'

# 3. Same call from Python using the official openai SDK
python - <<'PY'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
    max_tokens=128,
)
print(reply.choices[0].message.content)
PY

How it works

vLLM is structured as an asynchronous engine wrapped by a frontend API server. Requests enter through the FastAPI server, are tokenised, validated against --max-model-len and the per-request max_tokens budget, and handed to the LLMEngine. The engine maintains three queues: WAITING (admitted, not yet running), RUNNING (currently in the active batch), and SWAPPED (paged out to host memory under KV-cache pressure). On every step the scheduler picks the next set of sequences to run based on KV-cache availability and request priority, and the model executor performs one forward pass.
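
The admit-and-retire loop described above can be sketched as a toy scheduler. This is a minimal illustration, not vLLM's scheduler: the `Seq` and `run` names are hypothetical, each sequence is assumed to hold a fixed number of KV blocks for its whole lifetime, and the SWAPPED queue and request priorities are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_blocks: int   # KV blocks held while the sequence is running (simplified)
    tokens_left: int     # decode tokens still to generate


def run(seqs, free_blocks, max_num_seqs):
    """Toy iteration-level scheduler: admit and retire sequences every step."""
    waiting, running, finished_at = deque(seqs), [], {}
    step = 0
    while waiting or running:
        # Admit from WAITING while both the batch cap and the KV pool allow it.
        while (waiting and len(running) < max_num_seqs
               and waiting[0].prompt_blocks <= free_blocks):
            seq = waiting.popleft()
            free_blocks -= seq.prompt_blocks
            running.append(seq)
        if not running:
            raise RuntimeError("head-of-line request can never fit in the KV pool")
        step += 1                       # one forward pass for the whole batch
        for seq in running:
            seq.tokens_left -= 1        # every running sequence emits one token
        for seq in [s for s in running if s.tokens_left == 0]:
            running.remove(seq)         # retire at a token boundary
            free_blocks += seq.prompt_blocks
            finished_at[seq.name] = step
    return finished_at


print(run([Seq("a", 2, 3), Seq("b", 2, 1), Seq("c", 2, 2)],
          free_blocks=4, max_num_seqs=2))
```

In this run, sequence `c` takes the slot freed by `b` after the first step instead of waiting for `a` to finish, which is the behaviour that distinguishes iteration-level batching from request-level batching.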

Forward execution uses FlashAttention-3 on Hopper and Blackwell and FlashAttention-2 on Ampere, with a paged-KV variant of the attention kernel that gathers K and V from the block pool via the per-sequence block table. Tensor-parallel and pipeline-parallel ranks coordinate via NCCL; the engine starts one worker process per rank, and each worker holds its own shard of the KV pool in its GPU's memory. The number of KV blocks is computed at start-up as (gpu_memory_utilization x device memory) minus weights minus activation working set, divided by the bytes per block.

Chunked prefill changes how long inputs are processed. Without it, a single request with a 30K-token prompt monopolises the GPU for a full prefill before any decode tokens emerge, starving short interactive requests. With --enable-chunked-prefill, the scheduler interleaves prefill chunks (default 512 tokens) with decode tokens in the same iteration, dramatically improving p99 latency under mixed load. From v0.6 onward chunked prefill is the recommended default for production deployments.
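
A per-iteration token budget is one way to picture the interleaving. The sketch below is illustrative only: `plan_step` is a hypothetical helper, it assumes decode tokens are scheduled first and that leftover budget is handed to pending prompts in arrival order, and the real scheduler's policy may differ.

```python
def plan_step(decoding, prefilling, token_budget):
    """Plan one iteration: decode tokens first, leftover budget to prefill chunks.

    decoding   : list of sequence ids currently generating (1 token each)
    prefilling : dict of sequence id -> prompt tokens not yet prefilled
    Returns (decode_ids, {seq_id: chunk_tokens}); mutates both arguments.
    """
    decode_ids = decoding[:token_budget]
    budget = token_budget - len(decode_ids)
    chunks = {}
    for seq_id in list(prefilling):
        if budget == 0:
            break
        chunk = min(prefilling[seq_id], budget)
        chunks[seq_id] = chunk
        budget -= chunk
        prefilling[seq_id] -= chunk
        if prefilling[seq_id] == 0:
            del prefilling[seq_id]      # prompt fully prefilled
            decoding.append(seq_id)     # starts decoding on the next step
    return decode_ids, chunks


decoding = ["chat-1", "chat-2"]
prefilling = {"doc": 1200}              # one long prompt arriving mid-stream
for _ in range(4):
    print(plan_step(decoding, prefilling, token_budget=512))
```

The two chat sequences keep emitting a token on every step while the 1,200-token prompt is absorbed over three steps, rather than stalling behind a single monolithic prefill.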

Speculative decoding adds a draft pass per step. The engine runs a small draft model (or EAGLE-2 / Medusa heads attached to the target) to propose k tokens, then verifies them with one parallel forward of the target. Accepted tokens are committed; at the first rejected position the target's own token replaces the draft proposal and the remaining draft tokens are discarded. On chat workloads with a well-matched draft (Llama 3 8B for 70B target), end-to-end latency drops 1.5-3x at unchanged quality.
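
The accept-or-reject step can be sketched for the greedy (temperature 0) case. This is an assumption-laden illustration: `verify_greedy` is a hypothetical helper, and sampled decoding generally relies on a rejection-sampling rule rather than the exact token match shown here.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Greedy verification of k draft tokens against the target's own choices.

    target_tokens[i] is the token the target picks at position i given the
    prefix plus draft_tokens[:i]; it has k + 1 entries (one bonus position),
    all obtained from a single parallel forward pass.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            # First mismatch: keep the target's token and drop the rest of the draft.
            return accepted + [target_tokens[i]]
        accepted.append(tok)
    # Every draft token matched, so the bonus position yields one extra token.
    return accepted + [target_tokens[len(draft_tokens)]]


print(verify_greedy([5, 9, 2], [5, 9, 7, 1]))   # third draft token rejected
print(verify_greedy([5, 9, 2], [5, 9, 2, 4]))   # all accepted plus bonus token
```

Each call commits at least one token, so even a fully rejected draft costs no more target forwards than ordinary decoding; the gain comes from steps where several draft tokens are accepted at once.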

PagedAttention: KV cache split into 16-token blocks (configurable), allocated on demand from a global pool. Identical prefix blocks are content-addressed and shared across sequences.
Continuous batching: scheduler decision happens every iteration; sequences enter and leave the running batch at token boundaries.
Prefix caching: cached blocks survive request completion until evicted under LRU; subsequent requests with matching prefixes skip prefill on shared tokens.
Chunked prefill: prefill work split into fixed-size chunks interleaved with decode in the same step.
Multi-step scheduling: --num-scheduler-steps n lets the worker execute n forward passes per scheduler invocation, reducing Python overhead by ~3-5x.
CUDA graphs: enabled by default for decode-only batches; captures the forward pass to eliminate launch overhead.
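
To make the block-table and content-addressing ideas above concrete, here is a toy pool that maps token blocks to physical block ids. It is a sketch under stated assumptions: `BlockPool` is hypothetical, it tracks ids only (no GPU memory, no LRU eviction), and the chained SHA-256 hash is a stand-in for whatever hashing the engine uses internally.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching the default named above


class BlockPool:
    """Toy content-addressed KV block pool (ids only, no eviction)."""

    def __init__(self):
        self.by_hash = {}   # chained prefix hash -> physical block id
        self.next_id = 0

    def _new_block(self):
        self.next_id += 1
        return self.next_id - 1

    def block_table(self, token_ids):
        """Return (block_table, reused_block_count) for one sequence."""
        table, reused, parent = [], 0, b""
        for start in range(0, len(token_ids), BLOCK_SIZE):
            block = token_ids[start:start + BLOCK_SIZE]
            if len(block) < BLOCK_SIZE:
                table.append(self._new_block())   # partial tail block stays private
                break
            # The hash covers this block and everything before it, so equal
            # hashes imply an identical prefix, not just an identical window.
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            if parent in self.by_hash:
                reused += 1
            else:
                self.by_hash[parent] = self._new_block()
            table.append(self.by_hash[parent])
        return table, reused


pool = BlockPool()
system_prompt = list(range(32))                     # two full blocks
print(pool.block_table(system_prompt + [100, 101, 102]))
print(pool.block_table(system_prompt + [200]))      # shares both prefix blocks
print(pool.block_table([999] + system_prompt[1:] + [1]))  # first token differs
```

Because each hash chains over the whole prefix, changing a single early token (the third call) invalidates every later block, even though the second block's tokens are unchanged.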

Tip: Turn on prefix caching, chunked prefill and multi-step scheduling together before reaching for more exotic optimisations. The combined uplift on a shared-system-prompt workload is typically 30-70 percent over defaults.

Reference and specifications

Every long-lived deployment is parameterised through a small number of high-impact flags. The table below is the canonical reference for the engine CLI surface as of vLLM v0.8 (June 2026). Flags marked with an asterisk are also available as EngineArgs fields in the Python API. Flags not listed here are either internal tuning knobs that defaults handle correctly, or specialised features documented in the upstream reference.

Flag	Type	Default	Description
--model	string	(required)	HuggingFace repo id or local path. Drives architecture detection.
--tensor-parallel-size *	int	1	Shard each weight matrix across N GPUs within a node via NCCL AllReduce.
--pipeline-parallel-size *	int	1	Split layers into stages across nodes. Tolerates lower interconnect bandwidth than TP.
--max-model-len *	int	model-defined	Maximum total tokens per sequence. Bounded by RoPE and KV-cache budget.
--max-num-seqs *	int	256	Hard cap on concurrent sequences in the running batch.
--max-num-batched-tokens *	int	auto	Cap on tokens per iteration; controls prefill / decode mix under chunked prefill.
--gpu-memory-utilization *	float	0.9	Fraction of GPU memory available to vLLM (weights + activations + KV pool).
--swap-space *	int (GB)	4	CPU memory reserved for swapping KV blocks when the GPU pool fills up.
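--kv-cache-dtype *	string	auto	KV-cache data type: auto (follow the model dtype) or fp8.
--quantization *	string	(off)	Weight quantisation method, e.g. fp8, awq or gptq.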
--enable-prefix-caching *	bool	false	Persist KV blocks across requests; reuse by content hash.
--enable-chunked-prefill *	bool	v0.6+ true	Interleave prefill chunks with decode in the same step.
--num-scheduler-steps *	int	1	Number of forward passes per scheduler invocation; 8-16 typical.
--speculative-model *	string	(off)	Draft model id or `[eagle]` / `[medusa]` / `[ngram]` for built-in heads.
--num-speculative-tokens *	int	5	Number of tokens the draft proposes per step.
--enable-lora *	bool	false	Enables multi-LoRA hot-swap; pair with `--max-loras` and `--max-lora-rank`.
--max-loras *	int	1	Number of LoRA adapters resident in GPU memory.
--disable-log-requests	bool	false	Suppress per-request access logs (recommended at high RPS).
--rope-scaling *	json	(model)	Override RoPE scaling (linear, dynamic, yarn, longrope) for context extension.
--guided-decoding-backend *	string	outlines	Structured-output backend, e.g. outlines, xgrammar or lm-format-enforcer.
--enforce-eager	bool	false	Disable CUDA-graph capture; useful when debugging.
--worker-use-ray	bool	false	Drive workers via Ray instead of multiprocessing (multi-node).
--distributed-executor-backend *	string	mp	Worker launch backend: mp (multiprocessing) or ray.
--block-size *	int	16	KV block size in tokens. 32 sometimes preferred at long context.
--preemption-mode *	string	recompute	How preempted sequences recover: recompute (re-run prefill) or swap (page KV blocks to CPU memory).
--served-model-name	string	model id	Override the model name reported via the OpenAI API.
--api-key	string	(none)	Optional bearer-token auth for the API server.
--enable-auto-tool-choice	bool	false	Enable native tool-calling on Llama 3, Mistral, Hermes, Granite chat templates.
--limit-mm-per-prompt	json	(model)	Cap multimodal items per prompt for VLM serving.

Note: From v0.8 onwards a subset of these flags can be tuned via the /load_config admin endpoint without restart. Restartless reconfiguration of tensor-parallel size and quantisation mode is not supported and never will be — those changes require a new engine.

Workload patterns

Three workload shapes cover the bulk of vLLM production deployments: interactive chat behind an OpenAI-compatible gateway, RAG with long shared system prompts, and offline batch inference for evaluation or labelling. Each has its own preferred set of flags. These are the same three shapes Yobibyte automates for managed customers — the flags below are what a team running raw vLLM on their own Kubernetes signs up to hand-tune; the Yobibyte console derives them from the workspace's stated SLO.

Pattern A: chat endpoint, OpenAI-compatible. Maximise concurrent users at a target p95 time-to-first-token of 250-500 ms. Enable prefix caching for repeated system prompts, chunked prefill so a long prompt does not stall short requests, and multi-step scheduling to amortise Python overhead.
Pattern B: RAG endpoint with a 4-32K shared system prompt across thousands of requests.
Pattern C: offline batch scoring with no API server.

# A — chat endpoint on 2x H100 SXM5 for an 8B-class model
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --num-scheduler-steps 16 \
    --kv-cache-dtype fp8 \
    --quantization fp8 \
    --enable-auto-tool-choice \
    --port 8000

# B — RAG endpoint, 4-32K shared system prompt across thousands of requests
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --block-size 32 \
    --quantization fp8

# C — offline batch scoring (no API server)
python - <<'PY'
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    quantization="fp8",
    enable_prefix_caching=True,
    max_num_seqs=1024,
    gpu_memory_utilization=0.95,
)
prompts = open("eval-prompts.txt").read().splitlines()
out = llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0))
for r in out:
    print(r.outputs[0].text)
PY

Warning: Pattern B with very high prefix-cache hit rates can paradoxically saturate decode bandwidth — prefill is cheap but each decode token still costs one full forward. Watch vllm:prefix_cache_hit_rate next to vllm:gpu_cache_usage_perc; if hits exceed 80 percent and decode latency spikes, scale by adding decode replicas, not bigger TP.

Sizing and capacity planning

vLLM throughput is bounded first by KV-cache memory, then by tensor-core FLOPs, then by NCCL bandwidth at TP > 2. The planning model below assumes Llama-family architectures with grouped-query attention and FP8 weights and KV; FP16 / BF16 doubles every memory column. Tokens-per-second figures are mid-range observed values from InferenceBench v3 at 4K input / 256 output, mixed concurrency; treat them as planning anchors, not contractual.

KV-cache budget is the constraint that decides max concurrency. The per-token KV size is 2 x n_layers x n_kv_heads x head_dim x dtype_bytes. For Llama 3.1 70B (80 layers, 8 KV heads under GQA, head dimension 128) at FP8 that is roughly 160 KB per token in total, or 40 KB per GPU at TP=4. On 4x H100 with TP=4 the weights occupy ~70 GB, activations ~12 GB, leaving ~240 GB of pooled KV: about 1.5 million KV tokens, enough for roughly 45 concurrent sequences at average 32K used or about 360 sequences averaging 4K used.

Tensor-parallel size choice is governed by intra-node bandwidth: TP up to 8 inside one NVLink island is well-behaved; TP across InfiniBand is almost always slower than pipeline parallelism. For two-node deployments, prefer TP=8 + PP=2 over TP=16. For three or more nodes, expert parallelism dominates if the model is MoE; otherwise shallow pipeline parallelism with replicas behind a router beats deep PP.
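
The per-token formula above can be evaluated directly. The sketch below assumes the publicly documented Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and takes the ~240 GB pool figure from the text; the helper names are hypothetical. Note that the pool is the total across all four GPUs, so it has to be divided by the total per-token size (about 160 KB), of which each GPU holds a 40 KB shard at TP=4.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes):
    # Factor 2: one K and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def max_sequences(pool_bytes, bytes_per_token, avg_tokens_per_seq):
    # Whole sequences that fit if each one holds avg_tokens_per_seq KV tokens.
    return pool_bytes // (bytes_per_token * avg_tokens_per_seq)


# Assumed model shape; FP8 KV cache means 1 byte per element.
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=1)
pool = 240 * 10**9   # ~240 GB of pooled KV across the four GPUs

print("bytes per token (all ranks):", per_token)
print("bytes per token per GPU at TP=4:", per_token // 4)
print("KV tokens in pool:", pool // per_token)
print("sequences at 32K average:", max_sequences(pool, per_token, 32 * 1024))
print("sequences at 4K average:", max_sequences(pool, per_token, 4 * 1024))
```

These are upper bounds that ignore block-granularity rounding and any prefix sharing; shared prefixes raise the effective concurrency, while partially filled tail blocks lower it slightly.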

Workload	Model	Recommended SKU	Concurrency	Output tok/s	Notes
Chat, low latency	Llama 3.1 8B	1x H100 SXM5 80GB	64-128	3,800-5,200	FP8 weights + KV, multi-step 16.
Chat, balanced	Llama 3.1 70B	4x H100 SXM5	128-256	2,800-4,200	TP=4, FP8, chunked prefill 512.
Chat, high QPS	Llama 3.1 70B	8x H100 SXM5	256-512	5,200-7,800	TP=8, prefix cache on shared prompts.
Long context (128K)	Llama 3.1 70B	2x H200 141GB	32-64	1,400-2,200	FP8 KV, block-size 32, swap-space 32GB.
MoE serving	Mixtral 8x22B	8x H100 SXM5	192-384	4,500-6,800	TP=8 with expert parallelism.
MoE serving	DeepSeek-V3 671B	16x H100 SXM5 (2 nodes)	256-512	3,200-4,800	TP=8 + PP=2, NVLink + 400Gb IB.
RAG, prefix-heavy	Llama 3.1 70B	4x H100 SXM5	256-512	6,000-9,500	Prefix hit rate >70 percent assumed.
Offline batch	Llama 3.1 70B	4x H100 SXM5	1024+	8,500-12,000	Disable streaming, max_num_seqs 1024.
Edge inference	Llama 3.1 8B Q4	1x L40S 48GB	16-32	1,400-2,000	AWQ INT4, FP16 KV, eager mode.
Blackwell next-gen	Llama 3.1 70B	4x B200	256-512	6,800-10,500	FP4 weights, FP8 KV, FA3 kernels.

Limits and quotas

vLLM enforces a small set of hard and soft limits at the engine boundary. Hard limits reject requests with HTTP 400 at the API server; soft limits apply backpressure by extending queue depth. Operational ceilings (memory, NCCL groups, file descriptors) come from the host OS and CUDA runtime.

Limit	Default	Hard ceiling	How to raise
max_model_len	model-defined	RoPE-limited (e.g. 128K Llama 3.1)	Use --rope-scaling longrope/yarn; verify quality.
max_num_seqs	256	KV-cache budget	Raise --max-num-seqs; check `gpu_cache_usage_perc`.
max_num_batched_tokens	auto (8192)	Activation memory	Raise carefully; watch p99 prefill latency.
max_loras	1	GPU memory	Raise; activation matmuls cost grows linearly.
max_lora_rank	16	64	Higher ranks raise per-step compute by ~5 percent.
Replicas per engine	1	Hardware-bounded	Scale by adding pods, not engines per pod.
TP size (intra-node)	1	8 (NVLink)	Bounded by GPUs per node.
PP size (cross-node)	1	~32 in practice	Bounded by pipeline bubble overhead.
Request body size	unlimited	HTTP server limit	Enforce at the reverse proxy; --max-log-len only truncates logged prompts.
Concurrent requests / engine	max_num_seqs + queue	Memory-bounded	Add replicas behind a router.
Shared memory (NCCL + MIG)	/dev/shm	Container-defined	Mount /dev/shm >= 1GB per worker.
File descriptors	1024	ulimit	ulimit -n 65536 in container.

Warning: Multi-Instance GPU (MIG) slices on H100 advertise reduced memory but share /dev/shm with siblings. If you run vLLM TP>1 inside a MIG slice you must increase the shared-memory limit on the container; the default 64 MB will OOM NCCL on the first AllReduce.

Observability

vLLM exposes a Prometheus metrics endpoint at /metrics covering request throughput, latency histograms, GPU cache utilisation, prefix-cache hit rate, scheduler queue depth and preemption counts. The metric prefix is vllm:. Engine logs emit one structured line per request when --disable-log-requests is unset; set the level with VLLM_LOGGING_LEVEL=INFO and switch to JSON output by pointing VLLM_LOGGING_CONFIG_PATH at a logging configuration file, for ingestion into Loki, Splunk or Datadog.

The metrics worth alerting on in production are: time-to-first-token p95, inter-token latency p95, GPU cache usage, prefix-cache hit rate, number of preempted sequences, and request queue time. The following Prometheus rules cover the common failure modes.

vllm:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
vllm:time_per_output_token_seconds — decode latency; should be near 1 / (theoretical tok/s) when batch is full.
vllm:gpu_cache_usage_perc — KV pool fill; consistent reading above 90 percent indicates capacity headroom is gone.
vllm:prefix_cache_hit_rate — fraction of incoming prefill tokens served from cache.
vllm:num_preemptions_total — sequences evicted under KV pressure; non-zero in steady state means undersized cluster.
vllm:request_queue_time_seconds — time from request arrival to first scheduler admission.
DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with vLLM metrics to distinguish compute, memory and idle bottlenecks.

# Prometheus rules for a vLLM deployment
groups:
  - name: vllm-sla
    interval: 30s
    rules:
      - alert: VLLMHighTimeToFirstToken
        expr: histogram_quantile(0.95,
                sum by (le, model_name) (
                  rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "vLLM TTFT p95 above 1s on {{ $labels.model_name }}"

      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "KV pool >95 percent full — preemption imminent"

      - alert: VLLMPreemptionSpike
        expr: increase(vllm:num_preemptions_total[5m]) > 20
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Preemptions rising — capacity insufficient or runaway request"

      - alert: VLLMPrefixCacheCollapse
        expr: vllm:prefix_cache_hit_rate < 0.20
              and vllm:prefix_cache_queries_total > 100
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "Prefix cache hit rate dropped — workload shape changed"

      - alert: VLLMGPUUnderutilised
        expr: avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])) < 30
              and on() sum(rate(vllm:request_success_total[5m])) > 0
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "GPU under 30 percent — investigate Python overhead or PP bubble"

Tip: If TTFT p95 is high but GPU utilisation is low, suspect Python-side overhead: raise --num-scheduler-steps, ensure CUDA graphs are enabled, and confirm the engine is not running with --enforce-eager.
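
The TTFT alert above relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below reproduces that linear interpolation in plain Python so the estimate can be sanity-checked offline; the function and the sample bucket values are illustrative assumptions, not output from a real deployment.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) pairs.

    Interpolates linearly inside the bucket that contains the target rank,
    in the style of PromQL; `buckets` must include the (inf, total) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound        # cannot interpolate into the +Inf bucket
            span = count - lower_count
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count


# Hypothetical TTFT buckets in seconds: (le, cumulative request count).
ttft = [(0.25, 50), (0.5, 80), (1.0, 95), (2.5, 100), (float("inf"), 100)]
print(histogram_quantile(0.95, ttft))
print(histogram_quantile(0.50, ttft))
```

Because the result is interpolated, its precision is limited by the bucket boundaries: a p95 that lands in a wide bucket can move noticeably without any real change in latency.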

Cost and FinOps

vLLM cost economics are dominated by three levers: GPU rental rate, achieved tokens-per-second-per-GPU, and average prefix-cache hit rate. Holding model and SKU fixed, doubling sustained throughput halves the unit cost in $/M tokens. The table below uses Yobitel UK list pricing (June 2026) and InferenceBench v3 throughput anchors; substitute your own when planning.

Spot instances cut GPU rate 40-60 percent but require autoscaling that tolerates 30-90s pre-emption notices. Pair with vLLM's draining endpoint to flush in-flight requests cleanly.
FP8 weights + FP8 KV is the highest-impact $/M-tokens lever available on Hopper; BF16 is roughly 1.6x more expensive at the same SLO.
Prefix-cache hits are accounted at zero prefill cost — high-overlap workloads (multi-tenant agents, shared system prompts) can lift effective throughput 1.5-2x at no infrastructure change.
FOCUS-conformant billing exports from Yobitel include inference_engine and model_name resource tags so $/M tokens can be sliced by tenant or product line.

Configuration	GPU rate ($/h)	Sustained tok/s	$/M output tokens	Notes
1x H100 SXM5, Llama 3.1 8B FP8	$3.20	4,500	$0.20	Single replica, prefix cache on.
4x H100 SXM5, Llama 3.1 70B FP8	$12.40	3,500	$0.98	TP=4, chunked prefill.
8x H100 SXM5, Llama 3.1 70B FP8	$24.80	6,800	$1.01	TP=8, prefix cache 60 percent.
2x H200, Llama 3.1 70B 128K ctx	$8.40	1,800	$1.30	Long context tax.
4x B200, Llama 3.1 70B FP4	$22.00	9,200	$0.66	Blackwell FP4 + FA3.
4x H100 spot, Llama 3.1 70B	$6.20	2,800	$0.62	Spot interruption averaged in.
8x MI300X, Llama 3.1 70B FP8	$18.80	5,400	$0.97	ROCm 6.2, FA-ROCm kernel.
Hosted SaaS reference (GPT-4o mini class)	n/a	n/a	$0.60	List API price; comparison only.
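
The $/M output tokens column follows from the other two columns. A minimal sketch, assuming the GPU rate is the hourly price of the whole configuration (all GPUs together) and that throughput is sustained for the full hour; the function name is hypothetical.

```python
def usd_per_million_tokens(gpu_rate_per_hour, tokens_per_second):
    """Unit cost = hourly spend divided by tokens produced in that hour."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000


# Rate and throughput pairs taken from the table above.
configs = {
    "1x H100, 8B FP8": (3.20, 4500),
    "4x H100, 70B FP8": (12.40, 3500),
    "4x B200, 70B FP4": (22.00, 9200),
}
for name, (rate, tps) in configs.items():
    print(f"{name}: ${usd_per_million_tokens(rate, tps):.2f} per M output tokens")
```

Substituting your own negotiated rate and measured throughput gives a like-for-like comparison; idle time lowers the effective tokens per second and raises the unit cost proportionally.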

Security and compliance

vLLM ships with an opt-in bearer-token auth on the API server (--api-key); production deployments terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply mTLS or signed-JWT auth at that layer. The engine itself does not enforce per-tenant quotas — those are implemented at the gateway or via the Yobitel platform's multi-tenant router. Network isolation should follow the standard pattern: the inference pod has no egress to the public internet, model weights are pulled from a private registry, and per-replica NetworkPolicy locks ingress to the gateway service account.

Prompt-injection mitigation is a layered concern that vLLM does not solve directly. The engine supports system-prompt pinning via prefix caching (the system prompt is the same blocks every request) and structured-output enforcement via Outlines, XGrammar or LM Format Enforcer. Pair these with retrieval source validation, output classifiers and rate limits at the gateway. The Yobibyte platform enforces this stack by default.

Regulatory implications are model-, data- and deployment-specific. For UK public-sector workloads, deploy on Yobitel sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14 (vLLM as the engine is a control-plane component on those tenancies). For EU GDPR, the engine holds prompt and completion data only in volatile GPU memory and, when KV blocks are swapped out, in the host RAM reserved by --swap-space; ensure host swap and any scratch paths sit on encrypted ephemeral storage. For US HIPAA workloads, run inside a BAA-covered VPC and disable request logging; for FedRAMP, run the FIPS-validated CUDA build and pin to NIAP-approved cipher suites at the ingress.

Warning: Multi-tenant single-engine vLLM deployments with prefix caching enabled share the cache across tenants. If tenants must not see each other's system prompts in any side-channel, either run one engine per tenant or disable prefix caching (--no-enable-prefix-caching) and accept the throughput hit.

Migration and alternatives

Most production migrations to vLLM come from one of four origins: HuggingFace pipeline() / model.generate(), Hugging Face TGI, NVIDIA TensorRT-LLM, or a managed SaaS API (OpenAI, Anthropic, AWS Bedrock). The first delivers the largest throughput uplift (5-10x typical at the same latency); the others are roughly at parity on throughput with different operational trade-offs.

If you are running production today on Kubernetes with TGI, the migration is essentially a container swap and a flag rename. The commands after the table show comparable vLLM deployments on Kubernetes, AWS and GCP; you can roll vLLM behind the same Service and shift traffic at the gateway.

From	Migration effort	Throughput change	Operational notes
HuggingFace pipeline / generate	Low — drop in OpenAI client	5-10x faster	Eliminates GIL-bound serving loop.
TGI (Text Generation Inference)	Low — same OpenAI API	Comparable, vLLM wins on new models	Lose TGI multi-LoRA hot-swap polish; gain wider model support.
TensorRT-LLM + Triton	Medium — drop engine build	10-30 percent slower at same latency	Gain rapid model rotation; lose absolute-min latency.
SGLang	Low — same API surface	Roughly equal on chat; SGLang wins on agents	Switch back for RadixAttention-heavy workloads.
OpenAI / Bedrock / Anthropic API	High — model substitution	Variable	Gain control, sovereignty; lose hosted model variety.
llama.cpp / Ollama (cloud)	Low — same model	3-8x faster on GPU	Use llama.cpp for CPU and Apple Silicon.

# Production deployment on Kubernetes with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-vllm } }
  template:
    metadata: { labels: { app: llama3-70b-vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - "--model=meta-llama/Meta-Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size=4"
            - "--max-model-len=32768"
            - "--enable-prefix-caching"
            - "--enable-chunked-prefill"
            - "--quantization=fp8"
            - "--kv-cache-dtype=fp8"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 8000 }]
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML

# Equivalent today on AWS (bare p5 instance with Deep Learning AMI)
AMI_ID=$(aws ec2 describe-images --owners amazon \
    --filters "Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4*" \
    --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)

aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type p5.48xlarge \
    --user-data "$(printf '#!/bin/bash\npip install "vllm>=0.8.0"\nvllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8 --port 8000\n')"

# Equivalent on GCP A3 (H100)
gcloud compute instances create vllm-llama70b \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --image-family=pytorch-2-4-cu124 --image-project=deeplearning-platform-release \
    --metadata=startup-script='pip install "vllm>=0.8.0" && vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8'

Note: If you must hit absolute-min latency on a fixed model (real-time voice agents, low-latency RAG, public APIs under tight SLOs), keep TensorRT-LLM. vLLM closes most of the gap with FA3 and CUDA graphs, but compiled engines still win the last 10-20 percent.

Troubleshooting

The error table below covers the failure modes that account for roughly 80 percent of production vLLM incidents observed on Yobitel-operated fleets and InferenceBench community submissions. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom / Error	Cause	Fix
torch.cuda.OutOfMemoryError on first request	gpu_memory_utilization too high; activations crowd KV pool.	Lower to 0.88; raise --swap-space; check no other process on GPU.
NCCL hang on startup with TP>1	/dev/shm too small or CUDA_VISIBLE_DEVICES misordered.	Mount /dev/shm >= 8GB; pin NVIDIA_VISIBLE_DEVICES per worker; set NCCL_DEBUG=INFO.
Very slow first token after deploy	CUDA-graph capture on cold start.	Expected for first 30-60s; pre-warm with a synthetic request before flipping traffic.
Prefix cache hit rate near zero	System prompt varies by request (e.g. timestamp).	Move volatile fields out of the cached prefix; re-measure `vllm:prefix_cache_hit_rate`.
Speculative decoding regression in throughput	Draft model too large or accept rate too low.	Halve --num-speculative-tokens; switch to EAGLE-2 head; benchmark accept rate.
Quantisation accuracy drift on FP4 / INT4	Calibration set unrepresentative.	Recalibrate on real traffic; pin to AWQ or Marlin paths over raw GPTQ.
HTTP 400 prompt too long	Total tokens exceed --max-model-len.	Raise --max-model-len with --rope-scaling longrope, or chunk the prompt client-side.
Preemption rate climbs in steady state	max_num_seqs too high for KV-cache budget.	Lower --max-num-seqs; or add replicas; never push gpu_memory_utilization above 0.95.
TTFT p95 spikes under mixed load	Long prefill starves decode.	Enable --enable-chunked-prefill; tune --max-num-batched-tokens to 4096-8192.
Throughput drops after upgrading driver	FlashAttention kernel selection regressed.	Pin VLLM_ATTENTION_BACKEND=FLASHINFER or FLASH_ATTN; rerun benchmark.
LoRA adapter latency dominates	Too many --max-loras resident.	Cap at 8-16 on H100; benchmark adapter activation matmul cost.
Multi-node deployment never reaches steady state	Pipeline bubble too large or NCCL over IB misconfigured.	Lower --pipeline-parallel-size; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA.

Where this fits in the Yobitel stack

vLLM is the default inference engine inside Yobibyte, Yobitel's AI-native platform. Every model that lands in the Yobibyte catalogue — from Llama 3.1 8B to DeepSeek-V3 671B — is exposed first via a vLLM-backed endpoint, with TensorRT-LLM offered as an opt-in performance variant where lowest-latency SLOs justify the engine build overhead. The Yobibyte control plane handles fleet sizing, prefix-cache-aware routing, multi-LoRA tenancy, draining for spot pre-emption, and FOCUS-conformant cost attribution back to tenants.

Omniscient Compute scores vLLM continuously on InferenceBench v3, the public benchmark suite Yobitel maintains for inference engines across NVIDIA H100, H200, B200 and AMD MI300X tenancies. Each release is benchmarked at fixed input/output token mixes (chat, RAG, long-context, batch) and the results are surfaced to customers as live capacity plans — every recommended SKU and replica count on the Yobibyte console comes from an InferenceBench measurement, not a vendor datasheet.

For UK and EU sovereign workloads, vLLM runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an open-source Apache 2.0 engine, sovereign hardware, and transparent benchmark scoring is what lets Yobitel customers deploy production LLMs without ceding control or visibility to a hosted SaaS API.

References

Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)
vLLM Documentation · vLLM Project
vLLM on GitHub · GitHub
vLLM Production Stack · GitHub
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · arXiv (Shah et al., 2024)
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills · arXiv (Agrawal et al., 2023)
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
Outlines — Guided Generation for Language Models · GitHub (dottxt)

Quick start

# 1. Install vLLM and serve Llama 3.1 70B on 4x H100 SXM5
pip install "vllm>=0.8.0"

vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.92 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --num-scheduler-steps 8 \
    --port 8000

# 2. Hit the OpenAI-compatible endpoint with curl
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
      "max_tokens": 128
    }'

# 3. Same call from Python using the official openai SDK
python - <<'PY'
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarise PagedAttention in 2 lines."}],
    max_tokens=128,
)
print(reply.choices[0].message.content)
PY

How it works

PagedAttention: KV cache split into 16-token blocks (configurable), allocated on demand from a global pool. Identical prefix blocks are content-addressed and shared across sequences.
Continuous batching: scheduler decision happens every iteration; sequences enter and leave the running batch at token boundaries.
Prefix caching: cached blocks survive request completion until evicted under LRU; subsequent requests with matching prefixes skip prefill on shared tokens.
Chunked prefill: prefill work split into fixed-size chunks interleaved with decode in the same step.
Multi-step scheduling: --num-scheduler-steps n lets the worker execute n forward passes per scheduler invocation, reducing Python overhead by ~3-5x.
CUDA graphs: enabled by default for decode-only batches; captures the forward pass to eliminate launch overhead.
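
As a rough illustration of the PagedAttention and prefix-caching items above, the sketch below models a global pool of fixed-size KV blocks in which full blocks are shared by content hash. It is a toy and not vLLM's allocator: BlockPool, allocate and the SHA-256 hash chaining are hypothetical names and choices made for this example, and LRU eviction is left out.

```python
import hashlib
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block, matching the default quoted above


@dataclass
class BlockPool:
    """Toy global pool of KV blocks with content-addressed prefix sharing."""
    num_blocks: int
    free: list = field(default_factory=list)
    by_hash: dict = field(default_factory=dict)   # content hash -> block id
    refcount: dict = field(default_factory=dict)  # block id -> number of users

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self, token_ids):
        """Return a block table (list of block ids) for one sequence."""
        table, parent = [], b""
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = token_ids[start:start + BLOCK_SIZE]
            full = len(chunk) == BLOCK_SIZE
            # The hash chains over the parent digest, so a block is shared
            # only when the whole prefix up to and including it is identical.
            digest = hashlib.sha256(parent + repr(chunk).encode()).digest()
            if full and digest in self.by_hash:
                block = self.by_hash[digest]      # reuse a shared prefix block
            else:
                if not self.free:
                    raise MemoryError("KV pool exhausted")
                block = self.free.pop()
                if full:                          # only full blocks are shareable
                    self.by_hash[digest] = block
            self.refcount[block] = self.refcount.get(block, 0) + 1
            table.append(block)
            parent = digest
        return table
```

Allocating two sequences that start with the same 32-token system prompt returns block tables whose first two entries are identical, so the pool spends four blocks rather than six.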

Tip: Turn on prefix caching, chunked prefill and multi-step scheduling together before reaching for more exotic optimisations. The combined uplift on a shared-system-prompt workload is typically 30-70 percent over defaults.
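
One way to picture how chunked prefill and iteration-level batching share a step is a per-iteration token budget, which is the role --max-num-batched-tokens plays. The sketch below is a simplified model under stated assumptions, not vLLM's scheduler: Seq and step are hypothetical names, decodes are always planned first, and a sequence starts decoding in the iteration after its last prefill chunk.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    prompt_len: int        # total prompt tokens
    max_new_tokens: int    # decode tokens to generate
    prefilled: int = 0     # prompt tokens already processed
    generated: int = 0     # decode tokens already produced

    @property
    def in_decode(self):
        return self.prefilled >= self.prompt_len

    @property
    def done(self):
        return self.generated >= self.max_new_tokens


def step(running, token_budget):
    """Plan one iteration: decodes first, then fill the rest with prefill chunks."""
    plan = []
    # Each decoding sequence costs exactly one token of budget per iteration.
    for seq in running:
        if seq.in_decode and not seq.done and token_budget > 0:
            plan.append((seq.name, "decode", 1))
            seq.generated += 1
            token_budget -= 1
    # Whatever budget is left goes to prompts, split into chunks that fit.
    for seq in running:
        if not seq.in_decode and token_budget > 0:
            chunk = min(token_budget, seq.prompt_len - seq.prefilled)
            plan.append((seq.name, "prefill", chunk))
            seq.prefilled += chunk
            token_budget -= chunk
    return plan
```

With a budget of 512, a 1,000-token prompt is spread over two iterations while an already-running chat sequence keeps emitting one token per iteration, which is the behaviour that keeps long prefills from stalling decode.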

Reference and specifications

Flag	Type	Default	Description
--model	string	(required)	HuggingFace repo id or local path. Drives architecture detection.
--tensor-parallel-size *	int	1	Shard each weight matrix across N GPUs within a node via NCCL AllReduce.
--pipeline-parallel-size *	int	1	Split layers into stages across nodes. Tolerates lower interconnect bandwidth than TP.
--max-model-len *	int	model-defined	Maximum total tokens per sequence. Bounded by RoPE and KV-cache budget.
--max-num-seqs *	int	256	Hard cap on concurrent sequences in the running batch.
--max-num-batched-tokens *	int	auto	Cap on tokens per iteration; controls prefill / decode mix under chunked prefill.
--gpu-memory-utilization *	float	0.9	Fraction of GPU memory available to vLLM (weights + activations + KV pool).
--swap-space *	int (GB)	4	CPU memory reserved for swapping KV blocks when the GPU pool fills up.
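--kv-cache-dtype *	string	auto	KV-cache storage dtype. auto follows the model dtype; fp8 roughly halves KV memory versus 16-bit.
--quantization *	string	(off)	Weight quantisation method, e.g. fp8, awq or gptq.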
--enable-prefix-caching *	bool	false	Persist KV blocks across requests; reuse by content hash.
--enable-chunked-prefill *	bool	v0.6+ true	Interleave prefill chunks with decode in the same step.
--num-scheduler-steps *	int	1	Number of forward passes per scheduler invocation; 8-16 typical.
--speculative-model *	string	(off)	Draft model id or `[eagle]` / `[medusa]` / `[ngram]` for built-in heads.
--num-speculative-tokens *	int	5	Number of tokens the draft proposes per step.
--enable-lora *	bool	false	Enables multi-LoRA hot-swap; pair with `--max-loras` and `--max-lora-rank`.
--max-loras *	int	1	Number of LoRA adapters resident in GPU memory.
--disable-log-requests	bool	false	Suppress per-request access logs (recommended at high RPS).
--rope-scaling *	json	(model)	Override RoPE scaling (linear, dynamic, yarn, longrope) for context extension.
--guided-decoding-backend *	string	outlines	Library used for guided (structured) decoding, e.g. outlines.
--enforce-eager	bool	false	Disable CUDA-graph capture; useful when debugging.
--worker-use-ray	bool	false	Drive workers via Ray instead of multiprocessing (multi-node).
--distributed-executor-backend *	string	mp	Worker launcher: mp (multiprocessing) or ray.
--block-size *	int	16	KV block size in tokens. 32 sometimes preferred at long context.
--preemption-mode *	string	recompute	How preempted sequences recover: recompute (re-run prefill) or swap (move KV blocks to CPU swap space).
--served-model-name	string	model id	Override the model name reported via the OpenAI API.
--api-key	string	(none)	Optional bearer-token auth for the API server.
--enable-auto-tool-choice	bool	false	Enable native tool-calling on Llama 3, Mistral, Hermes, Granite chat templates.
--limit-mm-per-prompt	json	(model)	Cap multimodal items per prompt for VLM serving.

Note: From v0.8 onwards a subset of these flags can be tuned via the /load_config admin endpoint without restart. Tensor-parallel size and quantisation mode cannot be changed this way; both require starting a new engine.

Workload patterns

# A — chat endpoint on 2x H100 SXM5 for an 8B-class model
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --max-num-seqs 512 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --num-scheduler-steps 16 \
    --kv-cache-dtype fp8 \
    --quantization fp8 \
    --enable-auto-tool-choice \
    --port 8000

# B — RAG endpoint, 4-32K shared system prompt across thousands of requests
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --block-size 32 \
    --quantization fp8

# C — offline batch scoring (no API server)
python - <<'PY'
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    quantization="fp8",
    enable_prefix_caching=True,
    max_num_seqs=1024,
    gpu_memory_utilization=0.95,
)
prompts = open("eval-prompts.txt").read().splitlines()
out = llm.generate(prompts, SamplingParams(max_tokens=256, temperature=0))
for r in out:
    print(r.outputs[0].text)
PY

Warning: Pattern B with very high prefix-cache hit rates can paradoxically saturate decode bandwidth — prefill is cheap but each decode token still costs one full forward. Watch vllm:prefix_cache_hit_rate next to vllm:gpu_cache_usage_perc; if hits exceed 80 percent and decode latency spikes, scale by adding decode replicas, not bigger TP.

Sizing and capacity planning

Workload	Model	Recommended SKU	Concurrency	Output tok/s	Notes
Chat, low latency	Llama 3.1 8B	1x H100 SXM5 80GB	64-128	3,800-5,200	FP8 weights + KV, multi-step 16.
Chat, balanced	Llama 3.1 70B	4x H100 SXM5	128-256	2,800-4,200	TP=4, FP8, chunked prefill 512.
Chat, high QPS	Llama 3.1 70B	8x H100 SXM5	256-512	5,200-7,800	TP=8, prefix cache on shared prompts.
Long context (128K)	Llama 3.1 70B	2x H200 141GB	32-64	1,400-2,200	FP8 KV, block-size 32, swap-space 32GB.
MoE serving	Mixtral 8x22B	8x H100 SXM5	192-384	4,500-6,800	TP=8 with expert parallelism.
MoE serving	DeepSeek-V3 671B	16x H100 SXM5 (2 nodes)	256-512	3,200-4,800	TP=8 + PP=2, NVLink + 400Gb IB.
RAG, prefix-heavy	Llama 3.1 70B	4x H100 SXM5	256-512	6,000-9,500	Prefix hit rate >70 percent assumed.
Offline batch	Llama 3.1 70B	4x H100 SXM5	1024+	8,500-12,000	Disable streaming, max_num_seqs 1024.
Edge inference	Llama 3.1 8B Q4	1x L40S 48GB	16-32	1,400-2,000	AWQ INT4, FP16 KV, eager mode.
Blackwell next-gen	Llama 3.1 70B	4x B200	256-512	6,800-10,500	FP4 weights, FP8 KV, FA3 kernels.
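
The concurrency figures above are bounded by how many tokens of KV cache fit in the pool. A back-of-envelope check, assuming the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) and a hypothetical 200 GiB KV pool summed across the tensor-parallel group, might look like the sketch below. The function names are illustrative and the pool size is an assumption, not a measured value.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_pool_bytes, bytes_per_token, block_size=16):
    """Tokens that fit in the pool, rounded down to whole KV blocks."""
    blocks = kv_pool_bytes // (bytes_per_token * block_size)
    return blocks * block_size


# Llama 3.1 70B with an FP8 KV cache (1 byte per element).
fp8_per_token = kv_bytes_per_token(80, 8, 128, dtype_bytes=1)    # 163,840 bytes
bf16_per_token = kv_bytes_per_token(80, 8, 128, dtype_bytes=2)   # 327,680 bytes

# Hypothetical pool: 200 GiB left for KV after weights and activations.
tokens = max_cached_tokens(200 * 2**30, fp8_per_token)           # 1,310,720 tokens
full_length_seqs = tokens // 32768                               # 40 sequences at 32K
```

Under these assumptions the pool holds 40 sequences at the full 32K context, so higher concurrency depends on most requests being much shorter than --max-model-len or on prefix sharing.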

Limits and quotas

Limit	Default	Hard ceiling	How to raise
max_model_len	model-defined	RoPE-limited (e.g. 128K Llama 3.1)	Use --rope-scaling longrope/yarn; verify quality.
max_num_seqs	256	KV-cache budget	Raise --max-num-seqs; check `gpu_cache_usage_perc`.
max_num_batched_tokens	auto (8192)	Activation memory	Raise carefully; watch p99 prefill latency.
max_loras	1	GPU memory	Raise; activation matmul cost grows linearly with resident adapters.
max_lora_rank	16	64	Higher ranks raise per-step compute by ~5 percent.
Replicas per engine	1	Hardware-bounded	Scale by adding pods, not engines per pod.
TP size (intra-node)	1	8 (NVLink)	Bounded by GPUs per node.
PP size (cross-node)	1	~32 in practice	Bounded by pipeline bubble overhead.
Request body size	unlimited	HTTP server limit	Enforce at the reverse proxy; --max-log-len only truncates logged prompts.
Concurrent requests / engine	max_num_seqs + queue	Memory-bounded	Add replicas behind a router.
Shared memory (NCCL + MIG)	/dev/shm	Container-defined	Mount /dev/shm >= 1GB per worker.
File descriptors	1024	ulimit	ulimit -n 65536 in container.

Warning: Multi-Instance GPU (MIG) slices on H100 advertise reduced memory but share /dev/shm with siblings. If you run vLLM with TP>1 on MIG slices, raise the container's shared-memory limit; with the default 64 MB, NCCL runs out of shared memory on the first AllReduce.

Observability

vllm:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
vllm:time_per_output_token_seconds — decode latency; should be near 1 / (theoretical tok/s) when batch is full.
vllm:gpu_cache_usage_perc — KV pool fill; consistent reading above 90 percent indicates capacity headroom is gone.
vllm:prefix_cache_hit_rate — fraction of incoming prefill tokens served from cache.
vllm:num_preemptions_total — sequences evicted under KV pressure; non-zero in steady state means undersized cluster.
vllm:request_queue_time_seconds — time from request arrival to first scheduler admission.
DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with vLLM metrics to distinguish compute, memory and idle bottlenecks.

# Prometheus rules for a vLLM deployment
groups:
  - name: vllm-sla
    interval: 30s
    rules:
      - alert: VLLMHighTimeToFirstToken
        expr: histogram_quantile(0.95,
                sum by (le, model_name) (
                  rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "vLLM TTFT p95 above 1s on {{ $labels.model_name }}"

      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "KV pool >95 percent full — preemption imminent"

      - alert: VLLMPreemptionSpike
        expr: increase(vllm:num_preemptions_total[5m]) > 20
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Preemptions rising — capacity insufficient or runaway request"

      - alert: VLLMPrefixCacheCollapse
        expr: vllm:prefix_cache_hit_rate < 0.20
              and increase(vllm:prefix_cache_queries_total[15m]) > 100
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "Prefix cache hit rate dropped — workload shape changed"

      - alert: VLLMGPUUnderutilised
        # DCGM and vLLM series carry different label sets, so match with on().
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) < 30
              and on() sum(rate(vllm:request_success_total[5m])) > 0
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "GPU under 30 percent — investigate Python overhead or PP bubble"

Tip: If TTFT p95 is high but GPU utilisation is low, suspect Python-side overhead. Raise --num-scheduler-steps, ensure CUDA graphs are enabled, and confirm the engine is not running with --enforce-eager.
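
The TTFT alert above relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. A minimal Python sketch of that estimate follows. It mirrors the documented behaviour only approximately, and the bucket boundaries and counts in the usage lines are made-up illustration values, not vLLM's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The last pair is expected to be (math.inf, total_count).
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Overflow bucket or empty bucket: no interpolation possible.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT buckets in seconds with cumulative counts.
ttft_buckets = [(0.25, 50), (0.5, 80), (1.0, 94), (2.5, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, ttft_buckets)   # about 1.3 s, above the 1.0 s threshold
```

Because the estimate is interpolated, its resolution is limited by bucket width: a p95 that lands in a wide bucket such as 1.0 to 2.5 seconds can move noticeably with small changes in traffic.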

Cost and FinOps

Spot instances cut GPU rate 40-60 percent but require autoscaling that tolerates 30-90s pre-emption notices. Pair with vLLM's draining endpoint to flush in-flight requests cleanly.
FP8 weights + FP8 KV is the single biggest lever on $/M tokens available on Hopper; BF16 is roughly 1.6x more expensive at the same SLO.
Prefix-cache hits are accounted at zero prefill cost — high-overlap workloads (multi-tenant agents, shared system prompts) can lift effective throughput 1.5-2x at no infrastructure change.
FOCUS-conformant billing exports from Yobitel include inference_engine and model_name resource tags so $/M tokens can be sliced by tenant or product line.

Configuration	GPU rate ($/h)	Sustained tok/s	$/M output tokens	Notes
1x H100 SXM5, Llama 3.1 8B FP8	$3.20	4,500	$0.20	Single replica, prefix cache on.
4x H100 SXM5, Llama 3.1 70B FP8	$12.40	3,500	$0.98	TP=4, chunked prefill.
8x H100 SXM5, Llama 3.1 70B FP8	$24.80	6,800	$1.01	TP=8, prefix cache 60 percent.
2x H200, Llama 3.1 70B 128K ctx	$8.40	1,800	$1.30	Long context tax.
4x B200, Llama 3.1 70B FP4	$22.00	9,200	$0.66	Blackwell FP4 + FA3.
4x H100 spot, Llama 3.1 70B	$6.20	2,800	$0.62	Spot interruption averaged in.
8x MI300X, Llama 3.1 70B FP8	$18.80	5,400	$0.97	ROCm 6.2, FA-ROCm kernel.
Hosted SaaS reference (GPT-4o mini class)	n/a	n/a	$0.60	List API price; comparison only.
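
The $/M output tokens column follows directly from the hourly GPU rate and the sustained throughput. The small helper below reproduces that arithmetic; the function name is illustrative and the inputs are the figures from the table above.

```python
def usd_per_million_tokens(gpu_rate_per_hour, sustained_tok_per_s):
    """Cost of one million output tokens at a given hourly rate and throughput."""
    tokens_per_hour = sustained_tok_per_s * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000


# Rows from the table: (configuration, $/h, sustained tok/s).
rows = [
    ("1x H100 SXM5, Llama 3.1 8B FP8", 3.20, 4500),
    ("4x H100 SXM5, Llama 3.1 70B FP8", 12.40, 3500),
    ("4x B200, Llama 3.1 70B FP4", 22.00, 9200),
]
costs = {name: round(usd_per_million_tokens(rate, tps), 2) for name, rate, tps in rows}
```

Running the same helper with your own measured throughput, rather than the table's, is the quickest way to see how far a prefix-cache or quantisation change moves unit cost.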

Security and compliance

Warning: In multi-tenant single-engine vLLM deployments, the prefix cache is shared across tenants by default. If tenants must not be able to infer each other's system prompts through any side channel (for example cache-hit timing), either run one engine per tenant or disable prefix caching with --no-enable-prefix-caching and accept the throughput hit.

Migration and alternatives

From	Migration effort	Throughput change	Operational notes
HuggingFace pipeline / generate	Low — drop in OpenAI client	5-10x faster	Eliminates GIL-bound serving loop.
TGI (Text Generation Inference)	Low — same OpenAI API	Comparable, vLLM wins on new models	Lose TGI multi-LoRA hot-swap polish; gain wider model support.
TensorRT-LLM + Triton	Medium — drop engine build	10-30 percent slower at same latency	Gain rapid model rotation; lose absolute-min latency.
SGLang	Low — same API surface	Roughly equal on chat; SGLang wins on agents	Switch back for RadixAttention-heavy workloads.
OpenAI / Bedrock / Anthropic API	High — model substitution	Variable	Gain control, sovereignty; lose hosted model variety.
llama.cpp / Ollama (cloud)	Low — same model	3-8x faster on GPU	Use llama.cpp for CPU and Apple Silicon.

# Production deployment on Kubernetes with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-vllm }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-vllm } }
  template:
    metadata: { labels: { app: llama3-70b-vllm } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.8.0
          args:
            - "--model=meta-llama/Meta-Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size=4"
            - "--max-model-len=32768"
            - "--enable-prefix-caching"
            - "--enable-chunked-prefill"
            - "--quantization=fp8"
            - "--kv-cache-dtype=fp8"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 8000 }]
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML

# Equivalent today on AWS (bare p5 instance with Deep Learning AMI)
AMI_ID=$(aws ec2 describe-images --owners amazon \
    --filters "Name=name,Values=Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.4*" \
    --query 'sort_by(Images,&CreationDate)[-1].ImageId' --output text)

aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type p5.48xlarge \
    --user-data "$(printf '#!/bin/bash\npip install "vllm>=0.8.0"\nvllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8 --port 8000\n')"

# Equivalent on GCP A3 (H100); installs vLLM at boot, as in the AWS example
gcloud compute instances create vllm-llama70b \
    --machine-type=a3-highgpu-8g \
    --accelerator=type=nvidia-h100-80gb,count=8 \
    --maintenance-policy=TERMINATE \
    --image-family=pytorch-2-4-cu124 --image-project=deeplearning-platform-release \
    --metadata=startup-script='#!/bin/bash
pip install "vllm>=0.8.0"
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --tensor-parallel-size 8 --quantization fp8 --port 8000'

Note: If you must hit absolute-min latency on a fixed model (real-time voice agents, low-latency RAG, public APIs under tight SLOs), keep TensorRT-LLM. vLLM closes most of the gap with FA3 and CUDA graphs, but compiled engines still win the last 10-20 percent.

Troubleshooting

Symptom / Error	Cause	Fix
torch.cuda.OutOfMemoryError on first request	gpu_memory_utilization too high; activations crowd KV pool.	Lower to 0.88; raise --swap-space; check no other process on GPU.
NCCL hang on startup with TP>1	/dev/shm too small or CUDA_VISIBLE_DEVICES misordered.	Mount /dev/shm >= 8GB; pin NVIDIA_VISIBLE_DEVICES per worker; set NCCL_DEBUG=INFO.
Very slow first token after deploy	CUDA-graph capture on cold start.	Expected for first 30-60s; pre-warm with a synthetic request before flipping traffic.
Prefix cache hit rate near zero	System prompt varies by request (e.g. timestamp).	Move volatile fields out of the cached prefix; re-measure `vllm:prefix_cache_hit_rate`.
Speculative decoding regression in throughput	Draft model too large or accept rate too low.	Halve --num-speculative-tokens; switch to EAGLE-2 head; benchmark accept rate.
Quantisation accuracy drift on FP4 / INT4	Calibration set unrepresentative.	Recalibrate on real traffic; pin to AWQ or Marlin paths over raw GPTQ.
HTTP 400 prompt too long	Total tokens exceed --max-model-len.	Raise --max-model-len with --rope-scaling longrope, or chunk the prompt client-side.
Preemption rate climbs in steady state	max_num_seqs too high for KV-cache budget.	Lower --max-num-seqs; or add replicas; never push gpu_memory_utilization above 0.95.
TTFT p95 spikes under mixed load	Long prefill starves decode.	Enable --enable-chunked-prefill; tune --max-num-batched-tokens to 4096-8192.
Throughput drops after upgrading driver	FlashAttention kernel selection regressed.	Pin VLLM_ATTENTION_BACKEND=FLASHINFER or FLASH_ATTN; rerun benchmark.
LoRA adapter latency dominates	Too many --max-loras resident.	Cap at 8-16 on H100; benchmark adapter activation matmul cost.
Multi-node deployment never reaches steady state	Pipeline bubble too large or NCCL over IB misconfigured.	Lower --pipeline-parallel-size; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA.
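
For the "prefix cache hit rate near zero" row, the fix is purely about prompt layout: anything that changes per request has to come after the part you want cached. The sketch below compares two layouts, using character-level common prefix as a stand-in for token-level block matching. build_prompt and shared_prefix_len are hypothetical helpers written for this illustration.

```python
import os


def build_prompt(system_rules, timestamp, question, volatile_first):
    """Assemble a prompt with the timestamp either before or after the rules."""
    if volatile_first:
        # Anti-pattern: the changing field sits at the very start of the prompt.
        return f"Current time: {timestamp}\n{system_rules}\n{question}"
    # Cache-friendly: the stable rules form the shared prefix.
    return f"{system_rules}\nCurrent time: {timestamp}\n{question}"


def shared_prefix_len(a, b):
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


rules = "You are a support assistant for ACME. Answer briefly. " * 20
t1, t2 = "2025-01-01T10:00:00", "2025-01-01T10:00:07"

bad = shared_prefix_len(build_prompt(rules, t1, "Q1", True),
                        build_prompt(rules, t2, "Q2", True))
good = shared_prefix_len(build_prompt(rules, t1, "Q1", False),
                         build_prompt(rules, t2, "Q2", False))
```

In the first layout the two prompts diverge inside the timestamp, so almost nothing is reusable; in the second the whole rules block is common and only the tail needs prefill.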

