SGLang

TL;DR

Open-source LLM serving framework from LMSYS (the team behind Vicuna and Chatbot Arena), first release January 2024, Apache 2.0. Backed by a Linux Foundation governance proposal in 2026 with contributors from xAI, NVIDIA, AMD, ByteDance, Databricks and Yobitel.
RadixAttention is the headline differentiator versus vLLM: a radix-tree index over the entire KV pool that shares prefix blocks across unrelated requests automatically, lifting effective throughput 2-5x on agent and multi-tenant workloads where prompt overlap is the dominant cost.
First-class structured generation surface — `regex=`, `choices=`, JSON-schema constrained decoding via XGrammar — implemented with a compressed finite-state-machine path that updates logit masks in single-digit microseconds per step.
OpenAI-compatible REST API plus a Python DSL that compiles multi-call programmes (forks, joins, conditional gens) into batched scheduler primitives. Supports FlashInfer kernels, FP8 / FP4 quantisation, tensor / expert parallelism, EAGLE-2 / Medusa speculative decoding, multi-LoRA hot-swap.
Offered inside Yobitel's Yobibyte platform as the recommended engine for agent loops, structured-output workloads and high-prefix-overlap multi-tenant serving; scored continuously against vLLM and TensorRT-LLM on InferenceBench across H100 SXM5, H200, B200 and MI300X tenancies.

Overview

SGLang is an LLM serving runtime that started as a research project in the LMSYS group at UC Berkeley (the same community behind Vicuna and Chatbot Arena) and shipped its first public release in January 2024. Where vLLM optimises for breadth and developer ergonomics, SGLang optimises for the workloads where prompt structure dominates the cost profile — agent loops with shared tool scaffolds, multi-tenant chat with overlapping system prompts, batch evaluations with shared few-shot examples, and any workload that needs constrained JSON / regex / choice outputs at production rates.

The framework has two halves. The runtime is a Python and C++ engine wrapping FlashInfer kernels for attention, with a custom scheduler, the RadixAttention prefix-cache index, an XGrammar-based constrained decoder, and an HTTP server that speaks the OpenAI Chat Completions, Completions and Embeddings APIs. The front-end is a Python DSL (sglang as sgl) that lets you express multi-call programmes — forks, joins, parallel gens, structured outputs — as ordinary Python code; the DSL compiles to scheduler primitives that the runtime executes with maximum batching.

By mid-2026 SGLang sits among the three reference open-source LLM serving engines (with vLLM and TensorRT-LLM), is the default engine in xAI's Grok inference fleet, ships in the NVIDIA NIM catalogue, and has crossed roughly 30,000 GitHub stars. Releases land every two to four weeks. New model architectures are typically supported within one to two weeks of weight publication: slower than vLLM's day-one cadence, faster than TensorRT-LLM's monthly cycle. Yobibyte exposes SGLang as an opt-in engine for structured-generation and prefix-heavy workloads. Yobitel customers reach SGLang through a managed workspace, and the platform routes agent and JSON-constrained traffic to an SGLang-backed endpoint when its measured profile beats the vLLM default.

This entry documents the production surface: the CLI and Python DSL, the RadixAttention and structured-decoding internals, the parallelism strategies, the workload patterns where SGLang beats vLLM (and the ones where it does not), limits, observability hooks, and the sizing, cost and migration models you need to run SGLang at scale on Yobitel and beyond. It applies whether you are operating raw upstream SGLang or consuming it as a Yobibyte opt-in for agent and structured-output workloads.

Quick start

The example below installs SGLang, serves Llama 3.1 70B Instruct on a 4x H100 SXM5 node with FP8 weights, FlashInfer attention, RadixAttention prefix sharing and a 32K context window, then issues both a plain OpenAI-compatible chat completion and a JSON-schema-constrained completion using the structured-generation API. The final snippet drives the same endpoint from the SGLang Python DSL to show the multi-call programme model.

# 1. Install SGLang and serve Llama 3.1 70B on 4x H100 SXM5
pip install "sglang[all]>=0.4.0"

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tp 4 \
    --context-length 32768 \
    --mem-fraction-static 0.88 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend flashinfer \
    --schedule-policy lpm \
    --max-running-requests 256 \
    --port 30000

# 2. Plain OpenAI-compatible chat completion
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Summarise RadixAttention in 2 lines."}],
      "max_tokens": 128
    }'

# 3. JSON-schema constrained completion (XGrammar backend)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Extract the company and amount."}],
      "response_format": {
        "type": "json_schema",
        "json_schema": {
          "name": "extract",
          "schema": {
            "type": "object",
            "properties": {
              "company": { "type": "string" },
              "amount_usd": { "type": "number" }
            },
            "required": ["company", "amount_usd"]
          }
        }
      }
    }'

# 4. The same endpoint, driven by the native SGLang Python DSL
python - <<'PY'
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def triage(s, ticket):
    s += sgl.system("You triage support tickets.")
    s += sgl.user(ticket)
    s += sgl.assistant(
        "Category: " + sgl.gen("category", choices=["billing", "outage", "feature"])
        + "\nSeverity: " + sgl.gen("severity", choices=["P1", "P2", "P3"])
        + "\nReason: " + sgl.gen("reason", max_tokens=80)
    )

state = triage.run(ticket="Production endpoint returning 503 for the last 12 minutes.")
print(state["category"], state["severity"], "-", state["reason"])
PY
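
On the client side, the JSON-schema-constrained response from snippet 3 still arrives as a string inside the usual chat-completions envelope. The sketch below shows one way to unpack and type-check it, assuming the standard OpenAI response shape (`choices[0].message.content`). The `parse_extraction` helper and the sample payload are hypothetical and exist only for illustration.

```python
import json

# Field names and types mirror the schema sent in snippet 3.
REQUIRED = {"company": str, "amount_usd": (int, float)}


def parse_extraction(response_body):
    """Pull the constrained JSON out of a chat-completions response and type-check it."""
    content = json.loads(response_body)["choices"][0]["message"]["content"]
    record = json.loads(content)
    for key, expected in REQUIRED.items():
        if not isinstance(record.get(key), expected):
            raise ValueError(f"field {key!r} missing or wrong type")
    return record


# Hypothetical response body, trimmed to the fields this helper reads.
sample = json.dumps({
    "choices": [{
        "message": {
            "role": "assistant",
            "content": '{"company": "Acme", "amount_usd": 1200.5}',
        }
    }]
})
record = parse_extraction(sample)
print(record["company"], record["amount_usd"])
```

Constrained decoding guarantees the shape of the output, so a check like this mainly guards against truncation when `max_tokens` is reached before the object closes.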

How it works

SGLang is structured as an asynchronous engine wrapped by a FastAPI HTTP server. Requests enter through the server, are tokenised, validated against --context-length, and handed to a scheduler that selects the next set of sequences to run based on RadixAttention cache hits, fairness policy and KV-pool occupancy. Forward execution uses FlashInfer (an open-source attention kernel library developed as a separate project from SGLang) on Hopper and Blackwell, with paged-KV variants that gather K and V from a global block pool via per-sequence block tables. This is conceptually equivalent to PagedAttention but with FlashInfer-specific kernel optimisations for grouped-query attention and FP8 KV.

RadixAttention is the differentiating innovation. The runtime maintains a radix tree keyed by token IDs over the entire KV pool, not just within a single sequence family. Every cached block is content-addressed and shared the moment a second request matches. When a new request arrives, the scheduler walks the radix tree to find the longest matching prefix, reuses those physical blocks, and prefills only the suffix. The eviction policy is least-recently-used over tree branches with a small bias toward keeping high-fanout branches (frequently shared prefixes) resident.
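
The longest-prefix lookup described above can be sketched with a toy index. This is a minimal illustration, not SGLang's implementation: it uses a plain per-token trie rather than a compressed radix tree, omits eviction and block granularity, and the class names, the `admit` helper and the token IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One cached token; a real radix tree would store a run of tokens per edge."""
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy cross-request prefix index keyed by token IDs (illustrative only)."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record that KV entries now exist for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


def admit(cache, tokens):
    """Split a request into (reused, to_prefill) token counts, then cache it."""
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return reused, len(tokens) - reused


cache = PrefixCache()
system = [101, 7, 7, 42, 9]             # hypothetical shared system-prompt token IDs
first = admit(cache, system + [500, 501])
second = admit(cache, system + [600])   # unrelated request, same system prompt
print(first, second)
```

The second request reuses the five system-prompt tokens and prefills only its one-token suffix, which is the saving the paragraph above describes.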

The structured-generation path is implemented with XGrammar, a compressed-FSM library that compiles regex, JSON-schema and choice constraints into a state machine over token IDs. On each decode step, XGrammar masks out tokens that would violate the grammar before the sampling step; the mask update takes single-digit microseconds and adds negligible overhead even at high throughput. Compared to Outlines (the canonical Python implementation) XGrammar trades some flexibility for a 10-30x speedup on the per-step mask update, which is what makes JSON-constrained serving viable at chat-grade latency.
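
To make the masking step concrete, here is a minimal sketch of a `choices=`-style constraint over a toy vocabulary. It recomputes the allowed set by string scanning at every step, which is exactly the work a compiled FSM such as XGrammar avoids. The vocabulary, logits and function names are hypothetical.

```python
import math

VOCAB = ["bill", "ing", "out", "age", "feature", "hello", " world"]  # hypothetical vocab
CHOICES = ["billing", "outage", "feature"]


def allowed_token_ids(generated, choices=CHOICES, vocab=VOCAB):
    """Token IDs whose text keeps `generated` a prefix of at least one choice."""
    return [
        i for i, piece in enumerate(vocab)
        if any(c.startswith(generated + piece) for c in choices)
    ]


def apply_mask(logits, allowed):
    """Set logits of disallowed tokens to -inf so sampling cannot pick them."""
    keep = set(allowed)
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def constrained_greedy(step_logits):
    """Greedy decode under the choice constraint; stop once a choice is complete."""
    text = ""
    for logits in step_logits:
        masked = apply_mask(logits, allowed_token_ids(text))
        text += VOCAB[max(range(len(masked)), key=masked.__getitem__)]
        if text in CHOICES:
            break
    return text


# The raw model prefers "hello" at step 0 and " world" at step 1;
# the mask removes both because neither can lead to a valid choice.
logits = [[0.1, 0.0, 2.0, 0.0, 1.0, 9.0, 0.0],
          [0.0, 3.0, 0.0, 1.0, 0.0, 0.0, 5.0]]
result = constrained_greedy(logits)
print(result)
```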

The Python DSL compiles multi-call programmes into batched scheduler primitives. A fork/join over k branches becomes k parallel sequences scheduled together; a choices= gen becomes a constrained-decoding sequence with a static FSM; a gen with max_tokens becomes a normal autoregressive request. The scheduler sees these primitives as a unified batch and can apply RadixAttention sharing across them automatically: the parallel branches of one fork begin with identical tokens, so they resolve to the same radix-tree path and reuse one copy of the prompt's KV blocks.

RadixAttention: radix-tree-indexed KV pool with cross-request prefix sharing; effective hit rates above 80% on agent and multi-tenant workloads.
FlashInfer kernels: open-source attention kernel library (a separate project from SGLang) with paged-KV, GQA-aware, FP8-KV optimised paths.
XGrammar constrained decoding: regex, JSON-schema and choices implemented as compressed FSMs over the tokenizer; <5µs per-step mask update.
Continuous batching with three scheduler policies: LPM (longest prefix match, RadixAttention-aware), FCFS (first-come-first-served), DFS (depth-first for fork-heavy workloads).
Tensor parallelism (--tp) and expert parallelism (--ep-size) for MoE; pipeline parallelism via --nnodes and --node-rank for multi-host deployments.
Speculative decoding: external draft, EAGLE-2, Medusa heads, n-gram lookahead; selected with --speculative-algorithm, plus a draft model path for the methods that use one.
Multi-LoRA hot-swap via --lora-paths with per-request lora_id routing.
Quantisation: FP8 (E4M3 / E5M2), INT8, AWQ INT4, GPTQ INT4, FP4 on Blackwell.
OpenAI-compatible REST API plus the native sglang Python DSL with fork, gen, select, choices, regex primitives.

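Tip: Use FlashInfer attention (--attention-backend flashinfer), --schedule-policy lpm and the FP8 KV cache together as your baseline on Hopper. The reference table lists the first two as defaults, so confirm they have not been overridden and add --kv-cache-dtype fp8_e4m3. The combined uplift over a configuration without them on a typical agent workload is 30-50% before you change anything else.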

Reference and specifications

Every long-lived SGLang deployment is parameterised through the sglang.launch_server CLI. The table below is the canonical reference for the flags as of SGLang v0.4 (June 2026). Flags marked with an asterisk are also available as ServerArgs fields in the Python API. Flags not listed here are either internal tuning knobs that defaults handle correctly or specialised features documented in the upstream reference.

Flag	Type	Default	Description
--model-path *	string	(required)	HuggingFace repo id or local path. Drives architecture detection.
--tp / --tensor-parallel-size *	int	1	Shard each weight matrix across N GPUs within an NVLink island.
--ep-size *	int	1	Expert-parallel degree for MoE models (DeepSeek, Mixtral, Qwen MoE).
--nnodes / --node-rank	int	1 / 0	Multi-node coordination for pipeline-parallel / large-MoE deployments.
--context-length *	int	model-defined	Maximum total tokens per sequence. Bounded by RoPE and KV budget.
--max-running-requests *	int	auto	Hard cap on concurrent sequences in the running batch.
--max-total-tokens *	int	auto	Cap on tokens resident in the KV pool; sized from `--mem-fraction-static`.
--mem-fraction-static *	float	0.88	Fraction of GPU memory pooled for weights + activations + KV.
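--kv-cache-dtype *	string	auto	KV-cache element type: auto (follow the model dtype) or an FP8 format such as fp8_e4m3.
--quantization *	string	(off)	Weight quantisation scheme, for example fp8.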
--enable-flashinfer	bool	true on H100+	Use the FlashInfer paged-KV attention kernel.
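--attention-backend *	string	flashinfer	Attention kernel backend: flashinfer or triton.
--schedule-policy *	string	lpm	Scheduler policy: lpm (longest-prefix-match, RadixAttention-aware), fcfs (first-come-first-served) or the depth-first policy for fork-heavy workloads.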
--disable-radix-cache	bool	false	Turn off cross-request prefix sharing. Required by some multi-tenant isolation models.
--chunked-prefill-size *	int	8192	Tokens per prefill chunk; interleaves with decode in the same step.
--grammar-backend *	string	xgrammar	Constrained-decoding backend: xgrammar or outlines.
--constrain-output *	string	(off)	Default grammar for all responses — JSON schema path, regex literal or `choices=`.
--speculative-algorithm *	string	(off)	Speculative-decoding method, for example eagle.
--speculative-num-steps *	int	5	Tokens proposed per draft pass.
--lora-paths *	list	(none)	Comma-separated `name=path` pairs for multi-LoRA hot-swap; per-request `lora_id`.
--max-loras-per-batch *	int	1	Number of LoRA adapters active in a single forward; >1 uses S-LoRA path.
--enable-mixed-chunk	bool	false	Allow prefill and decode tokens in the same chunked-prefill batch.
--enable-p2p-check	bool	false	Verify NVLink P2P connectivity at startup; turn on when debugging TP hangs.
--watchdog-timeout *	int (s)	300	Engine watchdog; kill the process if a step exceeds this duration.
--port	int	30000	HTTP/OpenAI-compatible API port.
--served-model-name	string	model id	Override the model name reported via the OpenAI API.
--api-key	string	(none)	Optional bearer-token auth for the API server.
--show-time-cost	bool	false	Per-request server-side timing breakdown in the log.
--enable-metrics	bool	true	Expose Prometheus metrics on /metrics.
--decode-log-interval *	int	40	Number of decode steps between throughput log lines.
--disable-cuda-graph	bool	false	Disable CUDA-graph capture; useful when debugging kernel selection.

Note: --schedule-policy lpm is the policy that makes RadixAttention pay off — it groups requests sharing a long prefix into the same batch so the cached blocks are actually reused. Switching to fcfs strands the cache and erases the cross-request win.
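
The difference between the two policies can be sketched as a queue-ordering function. This is a simplified model under stated assumptions: the cache is a flat list of resident token sequences rather than a radix tree, and the `order_queue` helper and request IDs are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading token IDs that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def order_queue(waiting, cached, policy="lpm"):
    """Order waiting requests for admission.

    waiting: list of (request_id, tokens) in arrival order.
    cached:  token sequences assumed to be resident in the KV pool.
    """
    if policy == "fcfs":
        return [rid for rid, _ in waiting]

    def best_match(tokens):
        return max((shared_prefix_len(tokens, c) for c in cached), default=0)

    # sorted() is stable, so requests with equal matches keep arrival order.
    ranked = sorted(waiting, key=lambda item: -best_match(item[1]))
    return [rid for rid, _ in ranked]


cached = [[1, 2, 3, 4]]
waiting = [("a", [9, 9]), ("b", [1, 2, 3, 4, 5]), ("c", [1, 2, 8])]
lpm_order = order_queue(waiting, cached, "lpm")
fcfs_order = order_queue(waiting, cached, "fcfs")
print(lpm_order, fcfs_order)
```

Under lpm the requests that can reuse cached blocks run first, while those blocks are still resident; under fcfs request "a" goes first and may evict them before "b" and "c" are admitted.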

Workload patterns

Three workload shapes cover the bulk of SGLang production deployments and are where SGLang either beats or matches vLLM. Pattern A is the easy choice: agent loops with a long shared system prompt, where RadixAttention's cross-request sharing is the biggest single optimisation available in the field. Pattern B is structured-output serving (JSON, regex, choice constraints), where SGLang's XGrammar backend is materially faster than Outlines at chat-grade latency. Pattern C is tool-call routing for agent platforms, where the structured-generation API removes a whole layer of client-side parsing. These are the three workflows Yobibyte automates when a customer's measured profile shows the SGLang opt-in beating the vLLM default; the LPM scheduler, XGrammar backend and RadixAttention tuning are what a team running raw SGLang on their own Kubernetes signs up to operate themselves.

# A — multi-tenant agent endpoint on 4x H100 SXM5 with long shared tool scaffold
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tp 4 \
    --context-length 32768 \
    --mem-fraction-static 0.90 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --schedule-policy lpm \
    --max-running-requests 384 \
    --attention-backend flashinfer \
    --port 30000

# B — JSON-schema-constrained extraction endpoint at high RPS
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 1 \
    --context-length 8192 \
    --grammar-backend xgrammar \
    --quantization fp8 \
    --max-running-requests 512 \
    --schedule-policy lpm \
    --port 30000

# C — tool-call routing using the DSL's choices= constraint
python - <<'PY'
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def route(s, query):
    s += sgl.system("Route the query to a tool.")
    s += sgl.user(query)
    s += sgl.assistant(
        "Tool: " + sgl.gen("tool", choices=["search", "calculator", "code_exec", "answer"])
        + "\nArguments: " + sgl.gen("args", max_tokens=120, regex=r"\{.*\}")
    )

for q in ["what is 1.07^12?", "current population of London", "summarise this paper"]:
    out = route.run(query=q)
    print(q, "->", out["tool"], out["args"])
PY

Warning: If you turn RadixAttention off (--disable-radix-cache) for tenant-isolation reasons, SGLang's headline advantage over vLLM evaporates and you should consider whether vLLM is the simpler choice for that fleet.

Sizing and capacity planning

SGLang throughput is bounded first by KV-cache memory, then by tensor-core FLOPs, then by NCCL bandwidth at TP > 2, the same hierarchy as vLLM. The differentiator is that SGLang's effective KV budget is larger because cached prefix blocks are shared across tenants; the sizing table below reports throughput at realistic prefix hit rates rather than the worst case. Tokens-per-second figures are mid-range observed values from InferenceBench v3 at 4K input / 256 output, mixed concurrency, with the prefix-hit-rate assumption noted; treat them as planning anchors rather than contractual figures.

The TP / EP / PP rules of thumb mirror vLLM: TP up to 8 inside one NVLink island is well-behaved; TP across InfiniBand is almost always slower than pipeline or expert parallelism. For DeepSeek-V3 and other large MoE models, EP=8 inside a node combined with TP=8 is the typical baseline; for two-node MoE deployments, EP across the InfiniBand fabric with TP=8 inside each node tends to beat TP=16 + PP=2.

Workload	Model	Recommended SKU	Concurrency	Output tok/s	Notes
Chat, low latency	Llama 3.1 8B	1x H100 SXM5 80GB	64-128	4,200-5,800	FP8 weights + KV, FlashInfer.
Agent, shared scaffold	Llama 3.1 70B	4x H100 SXM5	256-512	5,800-8,400	RadixAttention hit rate ~75%.
JSON-constrained extraction	Llama 3.1 8B	1x H100 SXM5	128-256	3,400-4,600	XGrammar backend.
Multi-tenant chat (shared sys prompt)	Llama 3.1 70B	8x H100 SXM5	512-1,024	8,200-12,500	LPM scheduler, RadixAttention.
Long context (128K)	Llama 3.1 70B	2x H200 141GB	32-64	1,500-2,300	FP8 KV, chunked prefill 4096.
MoE serving	Mixtral 8x22B	8x H100 SXM5	192-384	4,700-7,000	TP=8 + EP=8.
MoE serving	DeepSeek-V3 671B	16x H100 SXM5 (2 nodes)	256-512	3,400-5,000	TP=8 + EP=8 per node, IB400.
Blackwell next-gen	Llama 3.1 70B	4x B200	256-512	7,200-11,000	FP4 weights, FP8 KV, FlashInfer3.
Speculative (EAGLE-2)	Llama 3.1 70B	4x H100 SXM5	32-64	4,000-6,200	Low-concurrency interactive.
AMD ROCm path	Llama 3.1 70B	8x MI300X	256-512	5,200-7,800	ROCm 6.2, AITER kernels.
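
Because KV-cache memory is the first bound, a back-of-envelope token budget is a useful sanity check before reading the table. The sketch below uses the published Llama 3.1 70B shape (80 layers, 8 KV heads under GQA, head dimension 128). It assumes roughly 70 GiB of FP8 weights and 80 GiB per GPU, and it ignores activations, CUDA graphs and fragmentation, so it gives an upper bound rather than a measured figure. The function names are hypothetical.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers (K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_token_budget(gpus, gpu_gib, mem_fraction, weight_gib, per_token_bytes):
    """Upper bound on tokens resident in the KV pool across the TP group."""
    pool_bytes = (gpus * gpu_gib * mem_fraction - weight_gib) * 1024**3
    return int(pool_bytes // per_token_bytes)


# Llama 3.1 70B with an FP8 KV cache (1 byte per element).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)

# Quick-start configuration: 4 GPUs, --mem-fraction-static 0.88.
# weight_gib=70 is an assumption (about 70B parameters at 1 byte each).
budget = kv_token_budget(gpus=4, gpu_gib=80, mem_fraction=0.88,
                         weight_gib=70, per_token_bytes=per_token)

full_length_seqs = budget // 32768   # sequences that fit at the full 32K context
print(per_token, budget, full_length_seqs)
```

The gap between the number of full-length sequences that fit and the --max-running-requests values used in this entry is covered by shorter average sequences and by shared prefix blocks, which is why the hit-rate assumption matters when reading the sizing table.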

Limits and quotas

SGLang enforces a small set of hard and soft limits at the engine boundary. Hard limits reject requests with HTTP 400 at the API server; soft limits apply backpressure by extending queue depth. Operational ceilings (memory, NCCL groups, file descriptors, /dev/shm) come from the host OS and CUDA runtime — the same ones that bite vLLM in the same configurations.

Limit	Default	Hard ceiling	How to raise
context-length	model-defined	RoPE-limited (e.g. 128K Llama 3.1)	Pin RoPE scaling via the HF config; verify quality.
max-running-requests	auto	KV-cache budget	Raise; watch gpu_cache_usage and preemption count.
max-total-tokens	auto from --mem-fraction-static	Device memory	Raise --mem-fraction-static carefully (>0.92 risks activation OOM).
max-loras-per-batch	1	GPU memory	Increase; activation matmul cost grows linearly.
chunked-prefill-size	8192	Activation budget	Smaller chunks lower p99 prefill latency; larger chunks raise throughput.
TP size (intra-node)	1	8 (NVLink)	Bounded by GPUs per NVLink island.
EP size (intra-node, MoE)	1	Number of experts	Useful for DeepSeek / Mixtral / Qwen MoE.
PP size (cross-node)	1	Practical ~16	Bounded by pipeline-bubble overhead.
Speculative draft steps	5	~8	Diminishing returns above 5-6 on most workloads.
Shared memory (NCCL)	/dev/shm	Container-defined	Mount /dev/shm >= 8GB for TP>1.
File descriptors	1024	ulimit	ulimit -n 65536 in container.
Concurrent requests / engine	max-running-requests + queue	Memory-bounded	Add replicas behind a router.
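
The diminishing returns noted for speculative draft steps can be illustrated with the standard geometric estimate for a linear draft chain. The sketch assumes each drafted token is accepted independently with the same probability, which is a simplification, and EAGLE-2's dynamic draft trees do not follow it exactly. The 0.7 acceptance rate is a hypothetical value, not a measurement.

```python
def expected_tokens_per_pass(accept_rate, draft_steps):
    """Expected tokens emitted per target-model pass for a linear draft chain.

    Assumes independent per-token acceptance with probability `accept_rate`;
    the +1 accounts for the token the target model always contributes.
    """
    if accept_rate >= 1.0:
        return draft_steps + 1.0
    return (1.0 - accept_rate ** (draft_steps + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rate of 0.7 across a range of draft lengths.
gains = {k: expected_tokens_per_pass(0.7, k) for k in (1, 3, 5, 8)}
for k, tokens in gains.items():
    print(f"draft_steps={k}: {tokens:.2f} tokens per pass")
```

Each additional draft step adds less than the previous one while still costing a full draft forward, which is consistent with the ceiling suggested in the table.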

Warning: RadixAttention's cross-request sharing means tenants on the same engine can observe each other's prefix existence through cache-hit timing. Where strict tenant isolation is required (regulated public-sector workloads, multi-customer SaaS with confidential prompts), either run one engine per tenant or set --disable-radix-cache and accept the throughput hit.

Observability

SGLang exposes a Prometheus metrics endpoint at /metrics (when --enable-metrics is set, which is the default) covering request throughput, latency histograms, KV-cache utilisation, RadixAttention hit rate, scheduler queue depth and preemption counts. The metric prefix is sglang:. Engine logs emit one structured line per request and per decode-log-interval throughput sample; switch to JSON output via SGLANG_LOGGING_CONFIG for ingestion into Loki, Splunk or Datadog.

The metrics worth alerting on in production are time-to-first-token p95, inter-token latency p95, KV-pool usage, RadixAttention hit rate, request queue time and the watchdog-skipped step counter. The list below names the key series, and the Prometheus rules after it cover the common failure modes.

sglang:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
sglang:e2e_request_latency_seconds — end-to-end p50/p95/p99.
sglang:gen_throughput — sustained output tokens/sec across the active batch.
sglang:cache_hit_rate — RadixAttention hit rate; the headline efficiency metric for this engine.
sglang:num_running_reqs — current batch size; compare with --max-running-requests.
sglang:num_used_tokens — tokens resident in the KV pool; compare with sglang:max_total_num_tokens.
sglang:num_preempt_reqs — non-zero in steady state indicates undersized cluster.
DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with sglang metrics to distinguish compute, memory and idle bottlenecks.

# Prometheus rules for an SGLang deployment
groups:
  - name: sglang-sla
    interval: 30s
    rules:
      - alert: SGLangHighTimeToFirstToken
        expr: histogram_quantile(0.95,
                sum by (le, model_name) (
                  rate(sglang:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "SGLang TTFT p95 above 1s on {{ $labels.model_name }}"

      - alert: SGLangRadixCacheCollapse
        expr: avg_over_time(sglang:cache_hit_rate[10m]) < 0.20
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "RadixAttention hit rate below 20% — workload shape changed, consider vLLM"

      - alert: SGLangKVPressure
        expr: sglang:num_used_tokens / sglang:max_total_num_tokens > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "KV pool >95% full — preemption imminent"

      - alert: SGLangPreemptionSpike
        expr: increase(sglang:num_preempt_reqs[5m]) > 20
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Preemptions rising — capacity insufficient or runaway request"

      - alert: SGLangWatchdogSkips
        expr: increase(sglang:num_watchdog_skipped_steps[10m]) > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Watchdog skipped steps — investigate kernel hang or NCCL stall"
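
For ad-hoc checks without a Prometheus server, the same thresholds can be evaluated against a scrape of /metrics. The sketch below parses a hypothetical sample in Prometheus text exposition format and applies the KV-pressure and cache-collapse thresholds from the rules above. It ignores timestamps, histograms and the `for:` durations, and the helper names and sample values are illustrative only.

```python
def parse_metrics(text):
    """Parse simple samples (no timestamps) from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]   # drop the label set
        samples[name] = float(value)
    return samples


def firing_alerts(samples):
    """Apply the instantaneous form of two thresholds from the alert rules."""
    alerts = []
    used = samples["sglang:num_used_tokens"]
    total = samples["sglang:max_total_num_tokens"]
    if used / total > 0.95:
        alerts.append("SGLangKVPressure")
    if samples["sglang:cache_hit_rate"] < 0.20:
        alerts.append("SGLangRadixCacheCollapse")
    return alerts


# Hypothetical scrape output; values are invented for the example.
SAMPLE = """\
# TYPE sglang:num_used_tokens gauge
sglang:num_used_tokens{model_name="llama-70b"} 1250000
sglang:max_total_num_tokens{model_name="llama-70b"} 1300000
sglang:cache_hit_rate{model_name="llama-70b"} 0.12
"""

alerts = firing_alerts(parse_metrics(SAMPLE))
print(alerts)
```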

Cost and FinOps

SGLang cost economics are dominated by the same three levers as vLLM: GPU rental rate, achieved tokens-per-second-per-GPU, and average prefix-cache hit rate. The difference is that SGLang's effective hit rate on multi-tenant and agent workloads is materially higher because RadixAttention shares across requests, not just within them — which translates directly to lower $/M tokens at the same SLO. The table uses Yobitel UK list pricing (June 2026) and InferenceBench v3 throughput anchors; substitute your own when planning.

RadixAttention hit rate is the largest per-fleet cost lever — a fleet with a 70% hit rate runs at roughly half the $/M tokens of the same fleet with a 10% hit rate.
FP8 weights + FP8 KV is the next-largest lever, identical to vLLM economics.
Speculative decoding pays off at low concurrency only; at the batch sizes that dominate steady-state production it adds compute without lowering wall-clock latency.
FOCUS-conformant billing exports from Yobitel tag each engine with inference_engine=sglang so $/M tokens can be sliced by tenant, model and engine type for direct vLLM / TensorRT-LLM comparison.

Configuration	GPU rate ($/h)	Sustained tok/s	$/M output tokens	Notes
1x H100 SXM5, Llama 3.1 8B FP8	$3.20	5,000	$0.18	Single replica, RadixAttention on.
4x H100 SXM5, Llama 3.1 70B FP8	$12.40	4,200	$0.82	TP=4, LPM scheduler.
4x H100 SXM5, agent workload (75% hit)	$12.40	7,200	$0.48	RadixAttention sharing dominates.
8x H100 SXM5, multi-tenant chat	$24.80	10,500	$0.66	TP=8, shared sys prompt.
2x H200, Llama 3.1 70B 128K ctx	$8.40	1,900	$1.23	Long context tax; FP8 KV.
4x B200, Llama 3.1 70B FP4	$22.00	9,800	$0.62	Blackwell FP4 + FlashInfer3.
8x MI300X, Llama 3.1 70B FP8	$18.80	5,500	$0.95	ROCm 6.2, AITER kernels.
1x H100, JSON-constrained 8B	$3.20	4,000	$0.22	XGrammar overhead <3%.
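
The $/M output tokens column follows directly from the other two columns. The sketch below shows the arithmetic so you can substitute your own rates and throughput; the function name is hypothetical, and the rate is the hourly price of the whole configuration, as in the table.

```python
def usd_per_million_tokens(gpu_rate_per_hour, tokens_per_second):
    """Cost of one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000


# (label, $/h for the whole configuration, sustained tok/s) taken from the table.
rows = [
    ("1x H100, 8B FP8", 3.20, 5000),
    ("4x H100, 70B FP8", 12.40, 4200),
    ("4x H100, agent workload", 12.40, 7200),
]
costs = {label: round(usd_per_million_tokens(rate, tps), 2) for label, rate, tps in rows}
for label, cost in costs.items():
    print(f"{label}: ${cost:.2f} per million output tokens")
```

The second and third rows use the same hardware at the same price; only the sustained throughput changes, which is how a higher prefix hit rate shows up in the cost column.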

Security and compliance

SGLang ships with optional bearer-token auth on the API server (--api-key); production deployments terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply mTLS or signed-JWT auth at that layer. The engine does not enforce per-tenant quotas itself — those are implemented at the gateway or via the Yobitel platform's multi-tenant router. Network isolation should follow the standard pattern: the inference pod has no egress to the public internet, model weights are pulled from a private registry, and per-replica NetworkPolicy locks ingress to the gateway service account.

Structured generation is a useful prompt-injection mitigation surface for SGLang in production. Pinning outputs to a JSON schema or a regex literal (response_format: json_schema at the API or regex= in the DSL) removes the class of jailbreaks that rely on coaxing the model into emitting free-form text outside the expected shape. It constrains the form of the output, not the content of string fields, so pair it with retrieval source validation and output classifiers at the gateway. The Yobibyte platform enforces this stack by default.

Regulatory considerations mirror vLLM. For UK public-sector workloads, deploy on Yobitel sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14. For EU GDPR, the engine processes prompt and completion data only in volatile GPU memory and the on-disk scratch path; encrypt ephemeral storage. For US HIPAA, run inside a BAA-covered VPC and disable per-request logging; for FedRAMP, use FIPS 140-validated cryptographic modules for TLS and pin approved cipher suites at the ingress.

Warning: RadixAttention shares prefix blocks across tenants by default. If tenant prompt content is itself confidential — clinical notes, legal drafts, identifiable PII in the prefix — either run one engine per tenant or disable RadixAttention (--disable-radix-cache) and accept the throughput cost.

Migration and alternatives

Most production migrations to SGLang come from one of three origins: vLLM (for the RadixAttention or structured-generation win), TensorRT-LLM (for faster iteration on new architectures), or TGI / a managed SaaS API. The first is the most common and the most mechanical — both engines speak the OpenAI-compatible API, both run from a HuggingFace repo id, both support FP8 KV and tensor parallelism, and the migration is essentially a container swap.

The decision matrix is straightforward: if your dominant workload is agent loops, multi-tenant chat with shared system prompts, or JSON-constrained extraction, SGLang typically beats vLLM by 1.5-3x on $/M tokens; if your workload is single-tenant chat with no prompt overlap, the two engines are roughly tied and vLLM's larger model coverage usually wins. From TensorRT-LLM, you give up 10-30% peak throughput in exchange for instant model rotation and no engine-build pipeline.

From	Migration effort	Throughput change	Operational notes
vLLM (chat workload)	Low — drop-in OpenAI API swap	Roughly equal	Switch only if structured generation matters.
vLLM (agent / multi-tenant)	Low — drop-in	1.5-3x faster on hit-rate-heavy workloads	RadixAttention is the headline win.
TensorRT-LLM + Triton	Medium — drop engine build	10-30% slower at same latency	Gain rapid model rotation, structured generation; lose absolute-min latency.
TGI (Text Generation Inference)	Low — same OpenAI API	Comparable; SGLang wins on new architectures and structured outputs	Multi-LoRA story is similar.
OpenAI / Bedrock / Anthropic API	High — model substitution	Variable	Gain control, sovereignty; lose hosted model variety.
Outlines-on-vLLM for structured	Low — same API	10-30x faster mask updates via XGrammar	If you currently bottleneck on Outlines, this is the upgrade.
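
Because both engines speak the OpenAI-compatible API, the client side of a vLLM-to-SGLang migration usually reduces to a base-URL change. The sketch below builds (but does not send) the same chat request for each endpoint using only the standard library. The `chat_request` helper is hypothetical, port 8000 is vLLM's default, and port 30000 is the one used throughout this entry.

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct"
VLLM_URL = "http://localhost:8000"      # vLLM's default port
SGLANG_URL = "http://localhost:30000"   # port used throughout this entry


def chat_request(base_url, model, prompt, max_tokens=128, api_key=None):
    """Build an OpenAI-compatible chat-completions request without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


before = chat_request(VLLM_URL, MODEL, "ping")
after = chat_request(SGLANG_URL, MODEL, "ping")
print(before.full_url, "->", after.full_url)
print("identical payload:", before.data == after.data)
```

Passing either request to `urllib.request.urlopen` would send it; the payload is byte-for-byte the same, so only the routing changes during the swap.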

# Kubernetes deployment with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-sglang }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-sglang } }
  template:
    metadata: { labels: { app: llama3-70b-sglang } }
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:v0.4.0-cu124
          args:
            - "python"
            - "-m"
            - "sglang.launch_server"
            - "--model-path=meta-llama/Meta-Llama-3.1-70B-Instruct"
            - "--tp=4"
            - "--context-length=32768"
            - "--mem-fraction-static=0.90"
            - "--quantization=fp8"
            - "--kv-cache-dtype=fp8_e4m3"
            - "--schedule-policy=lpm"
            - "--attention-backend=flashinfer"
            - "--port=30000"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 30000 }]
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML

Note: If you cannot tolerate a structured-generation pass-through layer breaking on a new model architecture, keep an Outlines+vLLM fallback path warm. SGLang's XGrammar is fast but occasionally lags new tokenisers by a release.

Troubleshooting

The error table below covers the failure modes that account for roughly 80% of production SGLang incidents observed on Yobitel-operated fleets and InferenceBench community submissions. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom / Error	Cause	Fix
torch.cuda.OutOfMemoryError at startup	mem-fraction-static too high; activations crowd KV pool.	Lower to 0.85; verify no other process on GPU.
NCCL hang on startup with TP>1	/dev/shm too small or NVLink P2P disabled.	Mount /dev/shm >= 8GB; set --enable-p2p-check; export NCCL_DEBUG=INFO.
cache_hit_rate near zero on a hit-friendly workload	schedule-policy fcfs scattering shared prefixes across batches.	Switch to --schedule-policy lpm.
Throughput unexpectedly lower than vLLM	Workload has no prefix overlap; RadixAttention overhead unrecovered.	Either rework prompts to share a stable prefix or move that workload to vLLM.
Watchdog killed step	Long prefill or driver-level kernel hang.	Raise --watchdog-timeout temporarily; investigate driver and FlashInfer version.
XGrammar parse error on response_format	JSON schema unsupported feature (e.g. `oneOf` with discriminator).	Simplify schema; switch grammar-backend to outlines for that workload.
Multi-LoRA latency spikes	Too many --max-loras-per-batch on small GPUs.	Cap at 8 on H100; benchmark adapter activation matmul.
EAGLE-2 speculative regression in throughput	Draft head from a different fine-tune than the served target.	Recompute draft head; verify acceptance rate via the metrics endpoint.
Preemption rate climbs steadily	max-running-requests too high for current KV budget.	Lower --max-running-requests or add replicas; never push --mem-fraction-static above 0.92.
HTTP 400 'context length exceeded'	Prompt + max_new_tokens exceeds --context-length.	Raise --context-length with HF rope_scaling; or chunk prompt client-side.
Multi-node deployment never converges	PP bubble too large or IB misconfigured.	Lower PP; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA.
Sudden throughput drop after upgrade	FlashInfer kernel selection regressed.	Pin --attention-backend to flashinfer or triton; re-benchmark.

Where this fits in the Yobitel stack

SGLang is the recommended inference engine inside Yobibyte for any workload with significant prefix overlap or structured-output requirements — agent loops, multi-tenant chat with shared system prompts, JSON-constrained extraction, choice-based routing. vLLM remains the default for general single-tenant chat and TensorRT-LLM is the opt-in performance variant for stable production endpoints with hard latency SLOs. The Yobibyte control plane routes traffic to the engine that wins on a given workload's measured profile, not by static assignment.

Omniscient Compute scores SGLang continuously on InferenceBench v3 across NVIDIA H100, H200, B200 and AMD MI300X tenancies at fixed input/output token mixes (chat, agent, RAG, JSON-constrained, long-context, batch). Each release is benchmarked against the latest vLLM and TensorRT-LLM, with results surfaced to customers as live capacity plans — every recommended SKU and replica count on the Yobibyte console comes from an InferenceBench measurement.

For UK and EU sovereign workloads, SGLang runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an open-source Apache 2.0 engine, sovereign hardware, and transparent benchmark scoring is what lets Yobitel customers run agent and structured-generation workloads in regulated environments without ceding control to a hosted SaaS API.

References

SGLang: Efficient Execution of Structured Language Model Programs · arXiv (Zheng et al., 2023)
SGLang on GitHub · GitHub
SGLang Documentation · LMSYS
FlashInfer: Kernel Library for LLM Serving · GitHub (FlashInfer)
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models · arXiv (Dong et al., 2024)
EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)

Quick start

# 1. Install SGLang and serve Llama 3.1 70B on 4x H100 SXM5
pip install "sglang[all]>=0.4.0"

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tp 4 \
    --context-length 32768 \
    --mem-fraction-static 0.88 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --enable-flashinfer \
    --schedule-policy lpm \
    --max-running-requests 256 \
    --port 30000

# 2. Plain OpenAI-compatible chat completion
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Summarise RadixAttention in 2 lines."}],
      "max_tokens": 128
    }'

# 3. JSON-schema constrained completion (default XGrammar backend)
curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
      "messages": [{"role": "user", "content": "Extract the company and amount."}],
      "response_format": {
        "type": "json_schema",
        "json_schema": {
          "name": "extract",
          "schema": {
            "type": "object",
            "properties": {
              "company": { "type": "string" },
              "amount_usd": { "type": "number" }
            },
            "required": ["company", "amount_usd"]
          }
        }
      }
    }'

# 4. The same endpoint, driven by the native SGLang Python DSL
python - <<'PY'
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def triage(s, ticket):
    s += sgl.system("You triage support tickets.")
    s += sgl.user(ticket)
    s += sgl.assistant(
        "Category: " + sgl.gen("category", choices=["billing", "outage", "feature"])
        + "\nSeverity: " + sgl.gen("severity", choices=["P1", "P2", "P3"])
        + "\nReason: " + sgl.gen("reason", max_tokens=80)
    )

state = triage.run(ticket="Production endpoint returning 503 for the last 12 minutes.")
print(state["category"], state["severity"], "-", state["reason"])
PY

How it works

RadixAttention: radix-tree-indexed KV pool with cross-request prefix sharing; effective hit rates above 80% on agent and multi-tenant workloads.
FlashInfer kernels: open-source attention kernel library with paged-KV, GQA-aware and FP8-KV optimised paths.
XGrammar constrained decoding: regex, JSON-schema and choices implemented as compressed FSMs over the tokenizer; <5µs per-step mask update.
Continuous batching with three scheduler policies: LPM (longest prefix match, RadixAttention-aware), FCFS (first-come-first-served), DFS (depth-first for fork-heavy workloads).
Tensor parallelism (--tp) and expert parallelism (--ep-size) for MoE; pipeline parallelism via --nnodes and --node-rank for multi-host deployments.
Speculative decoding: external draft, EAGLE-2, Medusa heads, n-gram lookahead — all enabled with --speculative-algorithm and a draft model path.
Multi-LoRA hot-swap via --lora-paths with per-request lora_id routing.
Quantisation: FP8 (E4M3 / E5M2), INT8, AWQ INT4, GPTQ INT4, FP4 on Blackwell.
OpenAI-compatible REST API plus the native sglang Python DSL with fork, gen, select, choices, regex primitives.
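
The cross-request prefix sharing in the first bullet can be illustrated with a toy token-level prefix tree. This is a simplified sketch, not SGLang's implementation: the `PrefixCache` class and its method names are hypothetical, it uses a plain trie rather than a compressed radix tree, and it ignores eviction, paging and reference counting.

```python
from dataclasses import dataclass, field


@dataclass
class _Node:
    # Maps a token id to the child node reached by that token.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token trie (hypothetical; no edge compression, no eviction)."""

    def __init__(self):
        self.root = _Node()
        self.hit_tokens = 0
        self.total_tokens = 0

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Serve one request: count the reused prefix, then cache the rest."""
        matched = self.match_prefix(tokens)
        self.hit_tokens += matched
        self.total_tokens += len(tokens)
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())
        return matched

    @property
    def hit_rate(self):
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


# Four requests sharing a 100-token "system prompt" plus two unique tokens each.
system_prompt = list(range(100))
cache = PrefixCache()
reused = [cache.insert(system_prompt + [1000 + i, 2000 + i]) for i in range(4)]
print(reused)                    # tokens reused per request
print(round(cache.hit_rate, 3))  # fraction of all prompt tokens served from cache
```

Only the first request pays for the shared prefix; every later request reuses it, which is why hit rate rises with the share of each prompt that is common scaffold.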

Tip: Turn on --enable-flashinfer, --schedule-policy lpm and the FP8 KV cache together as your baseline on Hopper. The combined uplift over the SGLang defaults on a typical agent workload is 30-50% before you change anything else.

Reference and specifications

Flag	Type	Default	Description
--model-path *	string	(required)	HuggingFace repo id or local path. Drives architecture detection.
--tp / --tensor-parallel-size *	int	1	Shard each weight matrix across N GPUs within an NVLink island.
--ep-size *	int	1	Expert-parallel degree for MoE models (DeepSeek, Mixtral, Qwen MoE).
--nnodes / --node-rank	int	1 / 0	Multi-node coordination for pipeline-parallel / large-MoE deployments.
--context-length *	int	model-defined	Maximum total tokens per sequence. Bounded by RoPE and KV budget.
--max-running-requests *	int	auto	Hard cap on concurrent sequences in the running batch.
--max-total-tokens *	int	auto	Cap on tokens resident in the KV pool; sized from `--mem-fraction-static`.
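--mem-fraction-static *	float	0.88	Fraction of GPU memory reserved for the static pool (model weights + KV cache); the remainder is left for activations.
--kv-cache-dtype *	string	auto	KV-cache data type: `auto` or an FP8 format such as `fp8_e4m3`.
--quantization *	string	(off)	Weight quantisation scheme, e.g. `fp8`.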
--enable-flashinfer	bool	true on H100+	Use the FlashInfer paged-KV attention kernel.
--attention-backend *	string	flashinfer	Attention kernel backend: `flashinfer` or `triton`.
--schedule-policy *	string	lpm	Batch ordering policy: `lpm` (longest-prefix-match, RadixAttention-aware), FCFS or DFS.
--disable-radix-cache	bool	false	Turn off cross-request prefix sharing. Required by some multi-tenant isolation models.
--chunked-prefill-size *	int	8192	Tokens per prefill chunk; interleaves with decode in the same step.
--grammar-backend *	string	xgrammar	Constrained-decoding backend: `xgrammar` or `outlines`.
--constrain-output *	string	(off)	Default grammar for all responses — JSON schema path, regex literal or `choices=`.
--speculative-algorithm *	string	(off)	Speculative decoding algorithm, e.g. `eagle`; used together with a draft model path.
--speculative-num-steps *	int	5	Tokens proposed per draft pass.
--lora-paths *	list	(none)	Comma-separated `name=path` pairs for multi-LoRA hot-swap; per-request `lora_id`.
--max-loras-per-batch *	int	1	Number of LoRA adapters active in a single forward; >1 uses S-LoRA path.
--enable-mixed-chunk	bool	false	Allow prefill and decode tokens in the same chunked-prefill batch.
--enable-p2p-check	bool	false	Verify NVLink P2P connectivity at startup; turn on when debugging TP hangs.
--watchdog-timeout *	int (s)	300	Engine watchdog; kill the process if a step exceeds this duration.
--port	int	30000	HTTP/OpenAI-compatible API port.
--served-model-name	string	model id	Override the model name reported via the OpenAI API.
--api-key	string	(none)	Optional bearer-token auth for the API server.
--show-time-cost	bool	false	Per-request server-side timing breakdown in the log.
--enable-metrics	bool	true	Expose Prometheus metrics on /metrics.
--decode-log-interval *	int	40	Number of decode steps between throughput log lines.
--disable-cuda-graph	bool	false	Disable CUDA-graph capture; useful when debugging kernel selection.

Note: --schedule-policy lpm is the policy that makes RadixAttention pay off — it groups requests sharing a long prefix into the same batch so the cached blocks are actually reused. Switching to fcfs strands the cache and erases the cross-request win.
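
The difference between the two policies in the note above can be sketched as a queue-ordering function. This is an illustrative model only: `order_queue` and `common_prefix_len` are hypothetical helpers, the cache is represented as a plain list of token sequences, and the real scheduler also accounts for memory budget and in-flight batches.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def order_queue(waiting, cached, policy="lpm"):
    """Order waiting requests (hypothetical helper).

    'fcfs' keeps arrival order; 'lpm' serves the request with the
    longest already-cached prefix first.
    """
    if policy == "fcfs":
        return list(waiting)

    def matched(req):
        return max((common_prefix_len(req, c) for c in cached), default=0)

    # sorted() is stable, so requests with equal matches keep arrival order.
    return sorted(waiting, key=matched, reverse=True)


cached = [[1, 2, 3, 4, 5]]                      # one cached token sequence
waiting = [[9, 9], [1, 2, 3, 7], [1, 2, 3, 4, 5, 6], [1, 8]]

print(order_queue(waiting, cached, "fcfs"))
print(order_queue(waiting, cached, "lpm"))
```

Under `lpm` the requests that can reuse the most cached tokens run while those blocks are still resident; under `fcfs` they may be scheduled after the blocks have been evicted.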

Workload patterns

# A — multi-tenant agent endpoint on 4x H100 SXM5 with long shared tool scaffold
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tp 4 \
    --context-length 32768 \
    --mem-fraction-static 0.90 \
    --quantization fp8 \
    --kv-cache-dtype fp8_e4m3 \
    --schedule-policy lpm \
    --max-running-requests 384 \
    --enable-flashinfer \
    --port 30000

# B — JSON-schema-constrained extraction endpoint at high RPS
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 1 \
    --context-length 8192 \
    --grammar-backend xgrammar \
    --quantization fp8 \
    --max-running-requests 512 \
    --schedule-policy lpm \
    --port 30000

# C — tool-call routing using the DSL's choices= constraint
python - <<'PY'
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def route(s, query):
    s += sgl.system("Route the query to a tool.")
    s += sgl.user(query)
    s += sgl.assistant(
        "Tool: " + sgl.gen("tool", choices=["search", "calculator", "code_exec", "answer"])
        + "\nArguments: " + sgl.gen("args", max_tokens=120, regex=r"\{.*\}")
    )

for q in ["what is 1.07^12?", "current population of London", "summarise this paper"]:
    out = route.run(query=q)
    print(q, "->", out["tool"], out["args"])
PY

Warning: If you turn RadixAttention off (--disable-radix-cache) for tenant-isolation reasons, SGLang's headline advantage over vLLM evaporates and you should consider whether vLLM is the simpler choice for that fleet.

Sizing and capacity planning

Workload	Model	Recommended SKU	Concurrency	Output tok/s	Notes
Chat, low latency	Llama 3.1 8B	1x H100 SXM5 80GB	64-128	4,200-5,800	FP8 weights + KV, FlashInfer.
Agent, shared scaffold	Llama 3.1 70B	4x H100 SXM5	256-512	5,800-8,400	RadixAttention hit rate ~75%.
JSON-constrained extraction	Llama 3.1 8B	1x H100 SXM5	128-256	3,400-4,600	XGrammar backend.
Multi-tenant chat (shared sys prompt)	Llama 3.1 70B	8x H100 SXM5	512-1,024	8,200-12,500	LPM scheduler, RadixAttention.
Long context (128K)	Llama 3.1 70B	2x H200 141GB	32-64	1,500-2,300	FP8 KV, chunked prefill 4096.
MoE serving	Mixtral 8x22B	8x H100 SXM5	192-384	4,700-7,000	TP=8 + EP=8.
MoE serving	DeepSeek-V3 671B	16x H100 SXM5 (2 nodes)	256-512	3,400-5,000	TP=8 + EP=8 per node, IB400.
Blackwell next-gen	Llama 3.1 70B	4x B200	256-512	7,200-11,000	FP4 weights, FP8 KV, FlashInfer3.
Speculative (EAGLE-2)	Llama 3.1 70B	4x H100 SXM5	32-64	4,000-6,200	Low-concurrency interactive.
AMD ROCm path	Llama 3.1 70B	8x MI300X	256-512	5,200-7,800	ROCm 6.2, AITER kernels.
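
A replica count can be derived from a row of the table with simple arithmetic. The sketch below is a planning aid under stated assumptions: `replicas_needed` is a hypothetical helper, the 20,000 tok/s target is an invented example, and the 0.7 headroom factor is an assumed safety margin rather than a documented recommendation.

```python
import math


def replicas_needed(target_tok_s, per_replica_tok_s, headroom=0.7):
    """Replicas required so each one runs at no more than `headroom` of its measured rate."""
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    usable = per_replica_tok_s * headroom
    return math.ceil(target_tok_s / usable)


# Example: agent workload on 4x H100 SXM5, using the low end of the table's range.
print(replicas_needed(target_tok_s=20_000, per_replica_tok_s=5_800))
```

Planning against the low end of the measured range and keeping headroom leaves room for prefix-hit-rate swings, which move throughput more on this engine than on one without cross-request caching.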

Limits and quotas

Limit	Default	Hard ceiling	How to raise
context-length	model-defined	RoPE-limited (e.g. 128K Llama 3.1)	Pin RoPE scaling via the HF config; verify quality.
max-running-requests	auto	KV-cache budget	Raise; watch gpu_cache_usage and preemption count.
max-total-tokens	auto from --mem-fraction-static	Device memory	Raise --mem-fraction-static carefully (>0.92 risks activation OOM).
max-loras-per-batch	1	GPU memory	Increase; activation matmul cost grows linearly.
chunked-prefill-size	8192	Activation budget	Smaller chunks lower p99 prefill latency; larger chunks raise throughput.
TP size (intra-node)	1	8 (NVLink)	Bounded by GPUs per NVLink island.
EP size (intra-node, MoE)	1	Number of experts	Useful for DeepSeek / Mixtral / Qwen MoE.
PP size (cross-node)	1	Practical ~16	Bounded by pipeline-bubble overhead.
Speculative draft steps	5	~8	Diminishing returns above 5-6 on most workloads.
Shared memory (NCCL)	/dev/shm	Container-defined	Mount /dev/shm >= 8GB for TP>1.
File descriptors	1024	ulimit	ulimit -n 65536 in container.
Concurrent requests / engine	max-running-requests + queue	Memory-bounded	Add replicas behind a router.
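
The relationship between `--mem-fraction-static`, weights and the KV token budget can be estimated by hand. The sketch below is a back-of-envelope model, not how the engine sizes its pool (the engine profiles real free memory at startup). The Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128) comes from the public model config; the 80 GiB per GPU and 70 GiB FP8 weight footprint are round-number assumptions, and both function names are hypothetical.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes for one token; the factor 2 covers the key and value tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_kv_tokens(gpu_mem_gib, num_gpus, mem_fraction_static, weight_gib, per_token_bytes):
    """Tokens that fit in the static pool once the weights are subtracted."""
    static_bytes = gpu_mem_gib * num_gpus * mem_fraction_static * GIB
    kv_pool_bytes = static_bytes - weight_gib * GIB
    return int(kv_pool_bytes // per_token_bytes)


fp8_per_token = kv_bytes_per_token(80, 8, 128, bytes_per_elem=1)
fp16_per_token = kv_bytes_per_token(80, 8, 128, bytes_per_elem=2)
print(fp8_per_token, fp16_per_token)  # bytes per token for FP8 and FP16 KV

# 4x 80 GiB GPUs, mem-fraction-static 0.88, assumed 70 GiB of FP8 weights.
budget = max_kv_tokens(80, 4, 0.88, 70, fp8_per_token)
print(budget, budget // 32_768)       # token budget, and full 32K-token contexts it holds
```

Halving the bytes per element with an FP8 KV cache doubles the token budget for the same pool, which is why the FP8 KV setting shows up in every high-concurrency configuration above.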

Warning: RadixAttention's cross-request sharing means tenants on the same engine can observe each other's prefix existence through cache-hit timing. Where strict tenant isolation is required (regulated public-sector workloads, multi-customer SaaS with confidential prompts), either run one engine per tenant or set --disable-radix-cache and accept the throughput hit.

Observability

sglang:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
sglang:e2e_request_latency_seconds — end-to-end p50/p95/p99.
sglang:gen_throughput — sustained output tokens/sec across the active batch.
sglang:cache_hit_rate — RadixAttention hit rate; the headline efficiency metric for this engine.
sglang:num_running_reqs — current batch size; compare with --max-running-requests.
sglang:num_used_tokens — tokens resident in the KV pool; compare with sglang:max_total_num_tokens.
sglang:num_preempt_reqs — non-zero in steady state indicates undersized cluster.
DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with sglang metrics to distinguish compute, memory and idle bottlenecks.
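
For ad-hoc checks outside Prometheus, the same signals can be read straight from the `/metrics` text output. The parser below is a minimal sketch: it assumes one sample per metric name, no trailing timestamps, and it discards labels; the sample payload is invented for illustration, and `parse_metrics` and `health_flags` are hypothetical helpers. The 0.95 and 0.20 thresholds mirror the alert rules in this section.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, dropping labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


def health_flags(metrics):
    """Derive KV-pool pressure and cache-collapse flags from parsed metrics."""
    kv_util = metrics["sglang:num_used_tokens"] / metrics["sglang:max_total_num_tokens"]
    return {
        "kv_utilisation": kv_util,
        "kv_pressure": kv_util > 0.95,
        "cache_collapse": metrics["sglang:cache_hit_rate"] < 0.20,
    }


# Invented sample payload; a live one would come from GET /metrics.
sample = """\
# HELP sglang:num_used_tokens Tokens resident in the KV pool
sglang:num_used_tokens{model_name="demo"} 96000
sglang:max_total_num_tokens{model_name="demo"} 100000
sglang:cache_hit_rate{model_name="demo"} 0.15
"""

flags = health_flags(parse_metrics(sample))
print(flags)
```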

# Prometheus rules for an SGLang deployment
groups:
  - name: sglang-sla
    interval: 30s
    rules:
      - alert: SGLangHighTimeToFirstToken
        expr: histogram_quantile(0.95,
                sum by (le, model_name) (
                  rate(sglang:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning, team: inference }
        annotations:
          summary: "SGLang TTFT p95 above 1s on {{ $labels.model_name }}"

      - alert: SGLangRadixCacheCollapse
        expr: avg_over_time(sglang:cache_hit_rate[10m]) < 0.20
        for: 15m
        labels: { severity: info }
        annotations:
          summary: "RadixAttention hit rate below 20% — workload shape changed, consider vLLM"

      - alert: SGLangKVPressure
        expr: sglang:num_used_tokens / sglang:max_total_num_tokens > 0.95
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "KV pool >95% full — preemption imminent"

      - alert: SGLangPreemptionSpike
        expr: increase(sglang:num_preempt_reqs[5m]) > 20
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Preemptions rising — capacity insufficient or runaway request"

      - alert: SGLangWatchdogSkips
        expr: increase(sglang:num_watchdog_skipped_steps[10m]) > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Watchdog skipped steps — investigate kernel hang or NCCL stall"

Cost and FinOps

RadixAttention hit rate is the largest per-fleet cost lever — a fleet with a 70% hit rate runs at roughly half the $/M tokens of the same fleet with a 10% hit rate.
FP8 weights + FP8 KV is the next-largest lever, identical to vLLM economics.
Speculative decoding pays off at low concurrency only; at the batch sizes that dominate steady-state production it adds compute without lowering wall-clock latency.
FOCUS-conformant billing exports from Yobitel tag each engine with inference_engine=sglang so $/M tokens can be sliced by tenant, model and engine type for direct vLLM / TensorRT-LLM comparison.
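
The speculative-decoding trade-off in the list above can be made concrete with the standard geometric-series estimate of tokens produced per target-model pass. This is a simplified model: it assumes every drafted token is accepted independently with the same probability `alpha`, it does not model EAGLE-2's tree-shaped drafts or batching effects, and `expected_tokens_per_pass` is a hypothetical helper.

```python
def expected_tokens_per_pass(alpha, draft_steps):
    """Expected tokens emitted per target forward pass.

    With acceptance probability `alpha` per drafted token and `draft_steps`
    drafted tokens, the expectation is (1 - alpha**(k + 1)) / (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(draft_steps + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (draft_steps + 1)) / (1.0 - alpha)


for k in (1, 3, 5, 8):
    print(k, round(expected_tokens_per_pass(0.7, k), 3))
```

The gain flattens quickly as the draft length grows because the result is bounded by 1 / (1 - alpha), while the draft compute keeps growing linearly; at high concurrency that extra compute competes with the batch for the same GPU.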

Configuration	GPU rate ($/h)	Sustained tok/s	$/M output tokens	Notes
1x H100 SXM5, Llama 3.1 8B FP8	$3.20	5,000	$0.18	Single replica, RadixAttention on.
4x H100 SXM5, Llama 3.1 70B FP8	$12.40	4,200	$0.82	TP=4, LPM scheduler.
4x H100 SXM5, agent workload (75% hit)	$12.40	7,200	$0.48	RadixAttention sharing dominates.
8x H100 SXM5, multi-tenant chat	$24.80	10,500	$0.66	TP=8, shared sys prompt.
2x H200, Llama 3.1 70B 128K ctx	$8.40	1,900	$1.23	Long context tax; FP8 KV.
4x B200, Llama 3.1 70B FP4	$22.00	9,800	$0.62	Blackwell FP4 + FlashInfer3.
8x MI300X, Llama 3.1 70B FP8	$18.80	5,500	$0.95	ROCm 6.2, AITER kernels.
1x H100, JSON-constrained 8B	$3.20	4,000	$0.22	XGrammar overhead <3%.
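
The $/M output tokens column follows directly from the hourly rate and the sustained throughput. The sketch below reproduces it from the table's own figures; `usd_per_million_tokens` is a hypothetical helper name, and the calculation covers GPU rental only (no storage, egress or idle time).

```python
def usd_per_million_tokens(gpu_rate_per_hour, tokens_per_second):
    """Cost of one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000


# (configuration, $/h, sustained tok/s) taken from the table above.
rows = [
    ("1x H100, Llama 3.1 8B FP8", 3.20, 5_000),
    ("4x H100, Llama 3.1 70B FP8", 12.40, 4_200),
    ("4x H100, agent workload (75% hit)", 12.40, 7_200),
    ("8x H100, multi-tenant chat", 24.80, 10_500),
    ("2x H200, 128K ctx", 8.40, 1_900),
    ("4x B200, FP4", 22.00, 9_800),
    ("8x MI300X, FP8", 18.80, 5_500),
    ("1x H100, JSON-constrained 8B", 3.20, 4_000),
]

costs = {name: round(usd_per_million_tokens(rate, tps), 2) for name, rate, tps in rows}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}/M tokens")
```

Because the hourly rate is fixed per configuration, any change that raises sustained tok/s (a higher prefix hit rate, FP8 KV) lowers the unit cost in direct proportion.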

Security and compliance

Warning: RadixAttention shares prefix blocks across tenants by default. If tenant prompt content is itself confidential — clinical notes, legal drafts, identifiable PII in the prefix — either run one engine per tenant or disable RadixAttention (--disable-radix-cache) and accept the throughput cost.

Migration and alternatives

From	Migration effort	Throughput change	Operational notes
vLLM (chat workload)	Low — drop-in OpenAI API swap	Roughly equal	Switch only if structured generation matters.
vLLM (agent / multi-tenant)	Low — drop-in	1.5-3x faster on hit-rate-heavy workloads	RadixAttention is the headline win.
TensorRT-LLM + Triton	Medium — drop engine build	10-30% slower at same latency	Gain rapid model rotation, structured generation; lose absolute-min latency.
TGI (Text Generation Inference)	Low — same OpenAI API	Comparable; SGLang wins on new architectures and structured outputs	Multi-LoRA story is similar.
OpenAI / Bedrock / Anthropic API	High — model substitution	Variable	Gain control, sovereignty; lose hosted model variety.
Outlines-on-vLLM for structured	Low — same API	10-30x faster mask updates via XGrammar	If you currently bottleneck on Outlines, this is the upgrade.

# Kubernetes deployment with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-sglang }
spec:
  replicas: 2
  selector: { matchLabels: { app: llama3-70b-sglang } }
  template:
    metadata: { labels: { app: llama3-70b-sglang } }
    spec:
      containers:
        - name: sglang
          image: lmsysorg/sglang:v0.4.0-cu124
          args:
            - "python"
            - "-m"
            - "sglang.launch_server"
            - "--model-path=meta-llama/Meta-Llama-3.1-70B-Instruct"
            - "--tp=4"
            - "--context-length=32768"
            - "--mem-fraction-static=0.90"
            - "--quantization=fp8"
            - "--kv-cache-dtype=fp8_e4m3"
            - "--schedule-policy=lpm"
            - "--enable-flashinfer"
            - "--port=30000"
          resources:
            limits: { nvidia.com/gpu: 4 }
          ports: [{ containerPort: 30000 }]
          volumeMounts:
            - { name: dshm, mountPath: /dev/shm }
      volumes:
        - name: dshm
          emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAML

Note: If you cannot tolerate a structured-generation pass-through layer breaking on a new model architecture, keep an Outlines+vLLM fallback path warm. SGLang's XGrammar is fast but occasionally lags new tokenisers by a release.
