TL;DR
- Open-source LLM serving framework from LMSYS (the team behind Vicuna and Chatbot Arena), first release January 2024, Apache 2.0. Backed by a Linux Foundation governance proposal in 2026 with contributors from xAI, NVIDIA, AMD, ByteDance, Databricks and Yobitel.
- RadixAttention is the headline differentiator versus vLLM: a radix-tree index over the entire KV pool that shares prefix blocks across unrelated requests automatically, lifting effective throughput 2-5x on agent and multi-tenant workloads where prompt overlap is the dominant cost.
- First-class structured generation surface — `regex=`, `choices=`, JSON-schema constrained decoding via XGrammar — implemented with a compressed finite-state-machine path that updates logit masks in single-digit microseconds per step.
- OpenAI-compatible REST API plus a Python DSL that compiles multi-call programmes (forks, joins, conditional gens) into batched scheduler primitives. Supports FlashInfer kernels, FP8 / FP4 quantisation, tensor / expert parallelism, EAGLE-2 / Medusa speculative decoding, multi-LoRA hot-swap.
- Offered inside Yobitel's Yobibyte platform as the recommended engine for agent loops, structured-output workloads and high-prefix-overlap multi-tenant serving; scored continuously against vLLM and TensorRT-LLM on InferenceBench across H100 SXM5, H200, B200 and MI300X tenancies.
Overview#
SGLang is an LLM serving runtime that started as a research project in the LMSYS group at UC Berkeley (the same community behind Vicuna and Chatbot Arena) and shipped its first public release in January 2024. Where vLLM optimises for breadth and developer ergonomics, SGLang optimises for the workloads where prompt structure dominates the cost profile — agent loops with shared tool scaffolds, multi-tenant chat with overlapping system prompts, batch evaluations with shared few-shot examples, and any workload that needs constrained JSON / regex / choice outputs at production rates.
The framework has two halves. The runtime is a Python and C++ engine wrapping FlashInfer kernels for attention, with a custom scheduler, the RadixAttention prefix-cache index, an XGrammar-based constrained decoder, and an HTTP server that speaks the OpenAI Chat Completions, Completions and Embeddings APIs. The front-end is a Python DSL (`sglang as sgl`) that lets you express multi-call programmes — forks, joins, parallel `gen`s, structured outputs — as ordinary Python code; the DSL compiles to scheduler primitives that the runtime executes with maximum batching.
By mid-2026 SGLang sits among the three reference open-source LLM serving engines (with vLLM and TensorRT-LLM), is the default engine in xAI's Grok inference fleet, ships in the NVIDIA NIM catalogue, and has crossed roughly 30,000 GitHub stars. Releases land every two to four weeks. New model architectures are typically supported within one to two weeks of weight publication — slower than vLLM's day-one cadence, faster than TensorRT-LLM's monthly cycle. Yobibyte exposes SGLang as an opt-in engine for structured-generation and prefix-heavy workloads — Yobitel customers reach SGLang through a managed workspace, with the platform routing agent and JSON-constrained traffic to a SGLang-backed endpoint when its measured profile beats the vLLM default.
This entry documents the production surface: the CLI and Python DSL, the RadixAttention and structured-decoding internals, the parallelism strategies, the workload patterns where SGLang beats vLLM (and the ones where it does not), limits, observability hooks, and the sizing, cost and migration models you need to run SGLang at scale on Yobitel and beyond. This entry helps you stand up SGLang for production LLM serving with the right flags, sizing and operational practices — whether you are operating raw upstream or consuming SGLang as a Yobibyte opt-in for agent and structured-output workloads.
Quick start#
The example below installs SGLang, serves Llama 3.1 70B Instruct on a 4x H100 SXM5 node with FP8 weights, FlashInfer attention, RadixAttention prefix sharing and a 32K context window, then issues both a plain OpenAI-compatible chat completion and a JSON-schema-constrained completion using the structured-generation API. The third snippet drives the same endpoint from the SGLang Python DSL to show the multi-call programme model.
# 1. Install SGLang and serve Llama 3.1 70B on 4x H100 SXM5
pip install "sglang[all]>=0.4.0"
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--tp 4 \
--context-length 32768 \
--mem-fraction-static 0.88 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--enable-flashinfer \
--schedule-policy lpm \
--max-running-requests 256 \
--port 30000
# 2. Plain OpenAI-compatible chat completion
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Summarise RadixAttention in 2 lines."}],
"max_tokens": 128
}'
# 3. JSON-schema constrained completion (industry-standard XGrammar backend)
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Extract the company and amount."}],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "extract",
"schema": {
"type": "object",
"properties": {
"company": { "type": "string" },
"amount_usd": { "type": "number" }
},
"required": ["company", "amount_usd"]
}
}
}
}'
# 4. The same endpoint, driven by the native SGLang Python DSL
python - <<'PY'
import sglang as sgl
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
@sgl.function
def triage(s, ticket):
s += sgl.system("You triage support tickets.")
s += sgl.user(ticket)
s += sgl.assistant(
"Category: " + sgl.gen("category", choices=["billing", "outage", "feature"])
+ "\nSeverity: " + sgl.gen("severity", choices=["P1", "P2", "P3"])
+ "\nReason: " + sgl.gen("reason", max_tokens=80)
)
state = triage.run(ticket="Production endpoint returning 503 for the last 12 minutes.")
print(state["category"], state["severity"], "-", state["reason"])
PYHow it works#
SGLang is structured as an asynchronous engine wrapped by a FastAPI HTTP server. Requests enter through the server, are tokenised, validated against `--context-length`, and handed to a scheduler that selects the next set of sequences to run based on RadixAttention cache hits, fairness policy and KV-pool occupancy. Forward execution uses FlashInfer (the LMSYS-maintained attention library) on Hopper and Blackwell, with paged-KV variants that gather K and V from a global block pool via per-sequence block tables — conceptually equivalent to PagedAttention but with FlashInfer-specific kernel optimisations for grouped-query attention and FP8 KV.
RadixAttention is the differentiating innovation. The runtime maintains a radix tree keyed by token IDs over the entire KV pool, not just within a single sequence family. Every cached block is content-addressed and shared the moment a second request matches. When a new request arrives, the scheduler walks the radix tree to find the longest matching prefix, reuses those physical blocks, and prefills only the suffix. The eviction policy is least-recently-used over tree branches with a small bias toward keeping high-fanout branches (frequently shared prefixes) resident.
The structured-generation path is implemented with XGrammar, a compressed-FSM library that compiles regex, JSON-schema and choice constraints into a state machine over token IDs. On each decode step, XGrammar masks out tokens that would violate the grammar before the sampling step; the mask update takes single-digit microseconds and adds negligible overhead even at high throughput. Compared to Outlines (the canonical Python implementation) XGrammar trades some flexibility for a 10-30x speedup on the per-step mask update, which is what makes JSON-constrained serving viable at chat-grade latency.
The Python DSL compiles multi-call programmes into batched scheduler primitives. A `fork`/`join` over k branches becomes k parallel sequences scheduled together; a `choices=` `gen` becomes a constrained-decoding sequence with a static FSM; a `gen` with `max_tokens` becomes a normal autoregressive request. The scheduler sees these primitives as a unified batch and can apply RadixAttention sharing across them automatically — multiple parallel branches of one `fork` will share the prompt prefix exactly because they entered the scheduler together.
- RadixAttention: radix-tree-indexed KV pool with cross-request prefix sharing; effective hit rates above 80% on agent and multi-tenant workloads.
- FlashInfer kernels: LMSYS-maintained attention library with paged-KV, GQA-aware, FP8-KV optimised paths.
- XGrammar constrained decoding: regex, JSON-schema and choices implemented as compressed FSMs over the tokenizer; <5µs per-step mask update.
- Continuous batching with three scheduler policies: LPM (longest prefix match, RadixAttention-aware), FCFS (first-come-first-served), DFS (depth-first for fork-heavy workloads).
- Tensor parallelism (`--tp`) and expert parallelism (`--ep-size`) for MoE; pipeline parallelism via `--nnodes` and `--node-rank` for multi-host deployments.
- Speculative decoding: external draft, EAGLE-2, Medusa heads, n-gram lookahead — all enabled with `--speculative-algorithm` and a draft model path.
- Multi-LoRA hot-swap via `--lora-paths` with per-request `lora_id` routing.
- Quantisation: FP8 (E4M3 / E5M2), INT8, AWQ INT4, GPTQ INT4, FP4 on Blackwell.
- OpenAI-compatible REST API plus the native `sglang` Python DSL with `fork`, `gen`, `select`, `choices`, `regex` primitives.
Turn on `--enable-flashinfer`, `--schedule-policy lpm` and the FP8 KV cache together as your baseline on Hopper. The combined uplift over the SGLang defaults on a typical agent workload is 30-50% before you change anything else.
Reference and specifications#
Every long-lived SGLang deployment is parameterised through the `sglang.launch_server` CLI. The table below is the canonical reference for the flags as of SGLang v0.4 (June 2026). Flags marked with an asterisk are also available as `ServerArgs` fields in the Python API. Flags not listed here are either internal tuning knobs that defaults handle correctly or specialised features documented in the upstream reference.
| Flag | Type | Default | Description |
|---|---|---|---|
| --model-path * | string | (required) | HuggingFace repo id or local path. Drives architecture detection. |
| --tp / --tensor-parallel-size * | int | 1 | Shard each weight matrix across N GPUs within an NVLink island. |
| --ep-size * | int | 1 | Expert-parallel degree for MoE models (DeepSeek, Mixtral, Qwen MoE). |
| --nnodes / --node-rank | int | 1 / 0 | Multi-node coordination for pipeline-parallel / large-MoE deployments. |
| --context-length * | int | model-defined | Maximum total tokens per sequence. Bounded by RoPE and KV budget. |
| --max-running-requests * | int | auto | Hard cap on concurrent sequences in the running batch. |
| --max-total-tokens * | int | auto | Cap on tokens resident in the KV pool; sized from `--mem-fraction-static`. |
| --mem-fraction-static * | float | 0.88 | Fraction of GPU memory pooled for weights + activations + KV. |
| --kv-cache-dtype * | string | auto | auto | fp8_e4m3 | fp8_e5m2. FP8 halves cache footprint at <0.1 EM regression. |
| --quantization * | string | (off) | fp8 | awq | gptq | gptq_marlin | bitsandbytes | fp4 (Blackwell only). |
| --enable-flashinfer | bool | true on H100+ | Use the FlashInfer paged-KV attention kernel. |
| --attention-backend * | string | flashinfer | flashinfer | triton | torch_native. flashinfer is fastest on Hopper/Blackwell. |
| --schedule-policy * | string | lpm | lpm (longest-prefix-match, RadixAttention-aware) | fcfs | dfs. |
| --disable-radix-cache | bool | false | Turn off cross-request prefix sharing. Required by some multi-tenant isolation models. |
| --chunked-prefill-size * | int | 8192 | Tokens per prefill chunk; interleaves with decode in the same step. |
| --grammar-backend * | string | xgrammar | xgrammar | outlines | llguidance. xgrammar is the fastest in production. |
| --constrain-output * | string | (off) | Default grammar for all responses — JSON schema path, regex literal or `choices=`. |
| --speculative-algorithm * | string | (off) | eagle | eagle2 | medusa | ngram. Pair with --speculative-draft-model-path. |
| --speculative-num-steps * | int | 5 | Tokens proposed per draft pass. |
| --lora-paths * | list | (none) | Comma-separated `name=path` pairs for multi-LoRA hot-swap; per-request `lora_id`. |
| --max-loras-per-batch * | int | 1 | Number of LoRA adapters active in a single forward; >1 uses S-LoRA path. |
| --enable-mixed-chunk | bool | false | Allow prefill and decode tokens in the same chunked-prefill batch. |
| --enable-p2p-check | bool | false | Verify NVLink P2P connectivity at startup; turn on when debugging TP hangs. |
| --watchdog-timeout * | int (s) | 300 | Engine watchdog; kill the process if a step exceeds this duration. |
| --port | int | 30000 | HTTP/OpenAI-compatible API port. |
| --served-model-name | string | model id | Override the model name reported via the OpenAI API. |
| --api-key | string | (none) | Optional bearer-token auth for the API server. |
| --show-time-cost | bool | false | Per-request server-side timing breakdown in the log. |
| --enable-metrics | bool | true | Expose Prometheus metrics on /metrics. |
| --decode-log-interval * | int | 40 | Number of decode steps between throughput log lines. |
| --disable-cuda-graph | bool | false | Disable CUDA-graph capture; useful when debugging kernel selection. |
`--schedule-policy lpm` is the policy that makes RadixAttention pay off — it groups requests sharing a long prefix into the same batch so the cached blocks are actually reused. Switching to `fcfs` strands the cache and erases the cross-request win.
Workload patterns#
Three workload shapes cover the bulk of SGLang production deployments and are where SGLang either beats or matches vLLM. Pattern A is the easy choice: agent loops with a long shared system prompt, where RadixAttention's cross-request sharing is the biggest single optimisation available on the field. Pattern B is structured-output serving — JSON, regex, choice constraints — where SGLang's XGrammar backend is materially faster than Outlines at chat-grade latency. Pattern C is tool-call routing for agent platforms, where the structured-generation API removes a whole layer of client-side parsing. These are the three workflows Yobibyte automates when a customer's measured profile shows the SGLang opt-in beating the vLLM default — the LPM scheduler, XGrammar backend and RadixAttention tuning are what a team running raw SGLang on their own Kubernetes signs up to operate themselves.
# A — multi-tenant agent endpoint on 4x H100 SXM5 with long shared tool scaffold
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-70B-Instruct \
--tp 4 \
--context-length 32768 \
--mem-fraction-static 0.90 \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--schedule-policy lpm \
--max-running-requests 384 \
--enable-flashinfer \
--port 30000
# B — JSON-schema-constrained extraction endpoint at high RPS
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--tp 1 \
--context-length 8192 \
--grammar-backend xgrammar \
--quantization fp8 \
--max-running-requests 512 \
--schedule-policy lpm \
--port 30000
# C — tool-call routing using the DSL's choices= constraint
python - <<'PY'
import sglang as sgl
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
@sgl.function
def route(s, query):
s += sgl.system("Route the query to a tool.")
s += sgl.user(query)
s += sgl.assistant(
"Tool: " + sgl.gen("tool", choices=["search", "calculator", "code_exec", "answer"])
+ "\nArguments: " + sgl.gen("args", max_tokens=120, regex=r"\{.*\}")
)
for q in ["what is 1.07^12?", "current population of London", "summarise this paper"]:
out = route.run(query=q)
print(q, "->", out["tool"], out["args"])
PYIf you turn RadixAttention off (`--disable-radix-cache`) for tenant-isolation reasons, SGLang's headline advantage over vLLM evaporates and you should consider whether vLLM is the simpler choice for that fleet.
Sizing and capacity planning#
SGLang throughput is bounded first by KV-cache memory, then by tensor-core FLOPs, then by NCCL bandwidth at TP > 2 — same hierarchy as vLLM. The differentiator is that SGLang's effective KV budget is larger because cached prefix blocks are shared across tenants; the sizing table below reports throughput at realistic prefix hit rates rather than the worst case. Tokens-per-second figures are mid-range observed values from InferenceBench v3 at 4K input / 256 output, mixed concurrency, with the prefix-hit-rate assumption noted; treat as planning anchors rather than contractual.
The TP / EP / PP rules of thumb mirror vLLM: TP up to 8 inside one NVLink island is well-behaved; TP across InfiniBand is almost always slower than pipeline or expert parallelism. For DeepSeek-V3 and other large MoE models, EP=8 inside a node combined with TP=8 is the typical baseline; for two-node MoE deployments, EP across the InfiniBand fabric with TP=8 inside each node tends to beat TP=16 + PP=2.
| Workload | Model | Recommended SKU | Concurrency | Output tok/s | Notes |
|---|---|---|---|---|---|
| Chat, low latency | Llama 3.1 8B | 1x H100 SXM5 80GB | 64-128 | 4,200-5,800 | FP8 weights + KV, FlashInfer. |
| Agent, shared scaffold | Llama 3.1 70B | 4x H100 SXM5 | 256-512 | 5,800-8,400 | RadixAttention hit rate ~75%. |
| JSON-constrained extraction | Llama 3.1 8B | 1x H100 SXM5 | 128-256 | 3,400-4,600 | XGrammar backend. |
| Multi-tenant chat (shared sys prompt) | Llama 3.1 70B | 8x H100 SXM5 | 512-1,024 | 8,200-12,500 | LPM scheduler, RadixAttention. |
| Long context (128K) | Llama 3.1 70B | 2x H200 141GB | 32-64 | 1,500-2,300 | FP8 KV, chunked prefill 4096. |
| MoE serving | Mixtral 8x22B | 8x H100 SXM5 | 192-384 | 4,700-7,000 | TP=8 + EP=8. |
| MoE serving | DeepSeek-V3 671B | 16x H100 SXM5 (2 nodes) | 256-512 | 3,400-5,000 | TP=8 + EP=8 per node, IB400. |
| Blackwell next-gen | Llama 3.1 70B | 4x B200 | 256-512 | 7,200-11,000 | FP4 weights, FP8 KV, FlashInfer3. |
| Speculative (EAGLE-2) | Llama 3.1 70B | 4x H100 SXM5 | 32-64 | 4,000-6,200 | Low-concurrency interactive. |
| AMD ROCm path | Llama 3.1 70B | 8x MI300X | 256-512 | 5,200-7,800 | ROCm 6.2, AITER kernels. |
Limits and quotas#
SGLang enforces a small set of hard and soft limits at the engine boundary. Hard limits reject requests with HTTP 400 at the API server; soft limits apply backpressure by extending queue depth. Operational ceilings (memory, NCCL groups, file descriptors, /dev/shm) come from the host OS and CUDA runtime — the same ones that bite vLLM in the same configurations.
| Limit | Default | Hard ceiling | How to raise |
|---|---|---|---|
| context-length | model-defined | RoPE-limited (e.g. 128K Llama 3.1) | Pin RoPE scaling via the HF config; verify quality. |
| max-running-requests | auto | KV-cache budget | Raise; watch gpu_cache_usage and preemption count. |
| max-total-tokens | auto from --mem-fraction-static | Device memory | Raise --mem-fraction-static carefully (>0.92 risks activation OOM). |
| max-loras-per-batch | 1 | GPU memory | Increase; activation matmul cost grows linearly. |
| chunked-prefill-size | 8192 | Activation budget | Smaller chunks lower p99 prefill latency; larger chunks raise throughput. |
| TP size (intra-node) | 1 | 8 (NVLink) | Bounded by GPUs per NVLink island. |
| EP size (intra-node, MoE) | 1 | Number of experts | Useful for DeepSeek / Mixtral / Qwen MoE. |
| PP size (cross-node) | 1 | Practical ~16 | Bounded by pipeline-bubble overhead. |
| Speculative draft steps | 5 | ~8 | Diminishing returns above 5-6 on most workloads. |
| Shared memory (NCCL) | /dev/shm | Container-defined | Mount /dev/shm >= 8GB for TP>1. |
| File descriptors | 1024 | ulimit | ulimit -n 65536 in container. |
| Concurrent requests / engine | max-running-requests + queue | Memory-bounded | Add replicas behind a router. |
RadixAttention's cross-request sharing means tenants on the same engine can observe each other's prefix existence through cache-hit timing. Where strict tenant isolation is required (regulated public-sector workloads, multi-customer SaaS with confidential prompts), either run one engine per tenant or set `--disable-radix-cache` and accept the throughput hit.
Observability#
SGLang exposes a Prometheus metrics endpoint at `/metrics` (when `--enable-metrics` is set, which is the default) covering request throughput, latency histograms, KV-cache utilisation, RadixAttention hit rate, scheduler queue depth and preemption counts. The metric prefix is `sglang:`. Engine logs emit one structured line per request and per decode-log-interval throughput sample; switch to JSON output via `SGLANG_LOGGING_CONFIG` for ingestion into Loki, Splunk or Datadog.
The metrics worth alerting on in production are time-to-first-token p95, inter-token latency p95, KV-pool usage, RadixAttention hit rate, request queue time and the watchdog-skipped step counter. The following Prometheus rules cover the common failure modes.
- sglang:time_to_first_token_seconds — prefill latency; correlate with prompt-length histogram.
- sglang:e2e_request_latency_seconds — end-to-end p50/p95/p99.
- sglang:gen_throughput — sustained output tokens/sec across the active batch.
- sglang:cache_hit_rate — RadixAttention hit rate; the headline efficiency metric for this engine.
- sglang:num_running_reqs — current batch size; compare with --max-running-requests.
- sglang:num_used_tokens — tokens resident in the KV pool; compare with sglang:max_total_num_tokens.
- sglang:num_preempt_reqs — non-zero in steady state indicates undersized cluster.
- DCGM_FI_DEV_GPU_UTIL and DCGM_FI_DEV_MEM_COPY_UTIL — pair with sglang metrics to distinguish compute, memory and idle bottlenecks.
# Prometheus rules for an SGLang deployment
groups:
- name: sglang-sla
interval: 30s
rules:
- alert: SGLangHighTimeToFirstToken
expr: histogram_quantile(0.95,
sum by (le, model_name) (
rate(sglang:time_to_first_token_seconds_bucket[5m]))) > 1.0
for: 5m
labels: { severity: warning, team: inference }
annotations:
summary: "SGLang TTFT p95 above 1s on {{ $labels.model_name }}"
- alert: SGLangRadixCacheCollapse
expr: avg_over_time(sglang:cache_hit_rate[10m]) < 0.20
for: 15m
labels: { severity: info }
annotations:
summary: "RadixAttention hit rate below 20% — workload shape changed, consider vLLM"
- alert: SGLangKVPressure
expr: sglang:num_used_tokens / sglang:max_total_num_tokens > 0.95
for: 10m
labels: { severity: warning }
annotations:
summary: "KV pool >95% full — preemption imminent"
- alert: SGLangPreemptionSpike
expr: increase(sglang:num_preempt_reqs[5m]) > 20
for: 5m
labels: { severity: critical }
annotations:
summary: "Preemptions rising — capacity insufficient or runaway request"
- alert: SGLangWatchdogSkips
expr: increase(sglang:num_watchdog_skipped_steps[10m]) > 0
for: 5m
labels: { severity: critical }
annotations:
summary: "Watchdog skipped steps — investigate kernel hang or NCCL stall"Cost and FinOps#
SGLang cost economics are dominated by the same three levers as vLLM: GPU rental rate, achieved tokens-per-second-per-GPU, and average prefix-cache hit rate. The difference is that SGLang's effective hit rate on multi-tenant and agent workloads is materially higher because RadixAttention shares across requests, not just within them — which translates directly to lower $/M tokens at the same SLO. The table uses Yobitel UK list pricing (June 2026) and InferenceBench v3 throughput anchors; substitute your own when planning.
- RadixAttention hit rate is the largest per-fleet cost lever — a fleet with a 70% hit rate runs at roughly half the $/M tokens of the same fleet with a 10% hit rate.
- FP8 weights + FP8 KV is the next-largest lever, identical to vLLM economics.
- Speculative decoding pays off at low concurrency only; at the batch sizes that dominate steady-state production it adds compute without lowering wall-clock latency.
- FOCUS-conformant billing exports from Yobitel tag each engine with `inference_engine=sglang` so $/M tokens can be sliced by tenant, model and engine type for direct vLLM / TensorRT-LLM comparison.
| Configuration | GPU rate ($/h) | Sustained tok/s | $/M output tokens | Notes |
|---|---|---|---|---|
| 1x H100 SXM5, Llama 3.1 8B FP8 | $3.20 | 5,000 | $0.18 | Single replica, RadixAttention on. |
| 4x H100 SXM5, Llama 3.1 70B FP8 | $12.40 | 4,200 | $0.82 | TP=4, LPM scheduler. |
| 4x H100 SXM5, agent workload (75% hit) | $12.40 | 7,200 | $0.48 | RadixAttention sharing dominates. |
| 8x H100 SXM5, multi-tenant chat | $24.80 | 10,500 | $0.66 | TP=8, shared sys prompt. |
| 2x H200, Llama 3.1 70B 128K ctx | $8.40 | 1,900 | $1.23 | Long context tax; FP8 KV. |
| 4x B200, Llama 3.1 70B FP4 | $22.00 | 9,800 | $0.62 | Blackwell FP4 + FlashInfer3. |
| 8x MI300X, Llama 3.1 70B FP8 | $18.80 | 5,500 | $0.95 | ROCm 6.2, AITER kernels. |
| 1x H100, JSON-constrained 8B | $3.20 | 4,000 | $0.22 | XGrammar overhead <3%. |
Security and compliance#
SGLang ships with optional bearer-token auth on the API server (`--api-key`); production deployments terminate TLS at an ingress (Envoy, NGINX, AWS ALB) and apply mTLS or signed-JWT auth at that layer. The engine does not enforce per-tenant quotas itself — those are implemented at the gateway or via the Yobitel platform's multi-tenant router. Network isolation should follow the standard pattern: the inference pod has no egress to the public internet, model weights are pulled from a private registry, and per-replica NetworkPolicy locks ingress to the gateway service account.
Structured generation is a load-bearing prompt-injection mitigation surface for SGLang in production. Pinning outputs to a JSON schema or a regex literal — `response_format: json_schema` at the API or `regex=` in the DSL — removes a whole class of jailbreaks where the model is coaxed into emitting free-form text. Pair with retrieval source validation and output classifiers at the gateway. The Yobibyte platform enforces this stack by default.
Regulatory considerations mirror vLLM. For UK public-sector workloads, deploy on Yobitel sovereign tenancies that satisfy NCSC Cloud Security Principles and G-Cloud 14. For EU GDPR, the engine processes prompt and completion data only in volatile GPU memory and the on-disk scratch path; encrypt ephemeral storage. For US HIPAA, run inside a BAA-covered VPC and disable per-request logging; for FedRAMP, run the FIPS-validated CUDA build and pin NIAP-approved cipher suites at the ingress.
RadixAttention shares prefix blocks across tenants by default. If tenant prompt content is itself confidential — clinical notes, legal drafts, identifiable PII in the prefix — either run one engine per tenant or disable RadixAttention (`--disable-radix-cache`) and accept the throughput cost.
Migration and alternatives#
Most production migrations to SGLang come from one of three origins: vLLM (for the RadixAttention or structured-generation win), TensorRT-LLM (for faster iteration on new architectures), or TGI / a managed SaaS API. The first is the most common and the most mechanical — both engines speak the OpenAI-compatible API, both run from a HuggingFace repo id, both support FP8 KV and tensor parallelism, and the migration is essentially a container swap.
The decision matrix is straightforward: if your dominant workload is agent loops, multi-tenant chat with shared system prompts, or JSON-constrained extraction, SGLang typically beats vLLM by 1.5-3x on $/M tokens; if your workload is single-tenant chat with no prompt overlap, the two engines are roughly tied and vLLM's larger model coverage usually wins. From TensorRT-LLM, you give up 10-30% peak throughput in exchange for instant model rotation and no engine-build pipeline.
| From | Migration effort | Throughput change | Operational notes |
|---|---|---|---|
| vLLM (chat workload) | Low — drop-in OpenAI API swap | Roughly equal | Switch only if structured generation matters. |
| vLLM (agent / multi-tenant) | Low — drop-in | 1.5-3x faster on hit-rate-heavy workloads | RadixAttention is the headline win. |
| TensorRT-LLM + Triton | Medium — drop engine build | 10-30% slower at same latency | Gain rapid model rotation, structured generation; lose absolute-min latency. |
| TGI (Text Generation Inference) | Low — same OpenAI API | Comparable; SGLang wins on new architectures and structured outputs | Multi-LoRA story is similar. |
| OpenAI / Bedrock / Anthropic API | High — model substitution | Variable | Gain control, sovereignty; lose hosted model variety. |
| Outlines-on-vLLM for structured | Low — same API | 10-30x faster mask updates via XGrammar | If you currently bottleneck on Outlines, this is the upgrade. |
# Kubernetes deployment with NVIDIA GPU Operator installed
kubectl apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata: { name: llama3-70b-sglang }
spec:
replicas: 2
selector: { matchLabels: { app: llama3-70b-sglang } }
template:
metadata: { labels: { app: llama3-70b-sglang } }
spec:
containers:
- name: sglang
image: lmsysorg/sglang:v0.4.0-cu124
args:
- "python"
- "-m"
- "sglang.launch_server"
- "--model-path=meta-llama/Meta-Llama-3.1-70B-Instruct"
- "--tp=4"
- "--context-length=32768"
- "--mem-fraction-static=0.90"
- "--quantization=fp8"
- "--kv-cache-dtype=fp8_e4m3"
- "--schedule-policy=lpm"
- "--enable-flashinfer"
- "--port=30000"
resources:
limits: { nvidia.com/gpu: 4 }
ports: [{ containerPort: 30000 }]
volumeMounts:
- { name: dshm, mountPath: /dev/shm }
volumes:
- name: dshm
emptyDir: { medium: Memory, sizeLimit: 8Gi }
YAMLIf you cannot tolerate a structured-generation pass-through layer breaking on a new model architecture, keep an Outlines+vLLM fallback path warm. SGLang's XGrammar is fast but occasionally lags new tokenisers by a release.
Troubleshooting#
The error table below covers the failure modes that account for roughly 80% of production SGLang incidents observed on Yobitel-operated fleets and InferenceBench community submissions. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom / Error | Cause | Fix |
|---|---|---|
| torch.cuda.OutOfMemoryError at startup | mem-fraction-static too high; activations crowd KV pool. | Lower to 0.85; verify no other process on GPU. |
| NCCL hang on startup with TP>1 | /dev/shm too small or NVLink P2P disabled. | Mount /dev/shm >= 8GB; set --enable-p2p-check; export NCCL_DEBUG=INFO. |
| cache_hit_rate near zero on a hit-friendly workload | schedule-policy fcfs scattering shared prefixes across batches. | Switch to --schedule-policy lpm. |
| Throughput unexpectedly lower than vLLM | Workload has no prefix overlap; RadixAttention overhead unrecovered. | Either rework prompts to share a stable prefix or move that workload to vLLM. |
| Watchdog killed step | Long prefill or driver-level kernel hang. | Raise --watchdog-timeout temporarily; investigate driver and FlashInfer version. |
| XGrammar parse error on response_format | JSON schema unsupported feature (e.g. `oneOf` with discriminator). | Simplify schema; switch grammar-backend to outlines for that workload. |
| Multi-LoRA latency spikes | Too many --max-loras-per-batch on small GPUs. | Cap at 8 on H100; benchmark adapter activation matmul. |
| EAGLE-2 speculative regression in throughput | Draft head from a different fine-tune than the served target. | Recompute draft head; verify acceptance rate via the metrics endpoint. |
| Preemption rate climbs steadily | max-running-requests too high for current KV budget. | Lower --max-running-requests or add replicas; never push --mem-fraction-static above 0.92. |
| HTTP 400 'context length exceeded' | Prompt + max_new_tokens exceeds --context-length. | Raise --context-length with HF rope_scaling; or chunk prompt client-side. |
| Multi-node deployment never converges | PP bubble too large or IB misconfigured. | Lower PP; set NCCL_IB_HCA, NCCL_SOCKET_IFNAME; verify GPUDirect RDMA. |
| Sudden throughput drop after upgrade | FlashInfer kernel selection regressed. | Pin SGLANG_ATTENTION_BACKEND=flashinfer or triton; re-benchmark. |
Where this fits in the Yobitel stack#
SGLang is the recommended inference engine inside Yobibyte for any workload with significant prefix overlap or structured-output requirements — agent loops, multi-tenant chat with shared system prompts, JSON-constrained extraction, choice-based routing. vLLM remains the default for general single-tenant chat and TensorRT-LLM is the opt-in performance variant for stable production endpoints with hard latency SLOs. The Yobibyte control plane routes traffic to the engine that wins on a given workload's measured profile, not by static assignment.
Omniscient Compute scores SGLang continuously on InferenceBench v3 across NVIDIA H100, H200, B200 and AMD MI300X tenancies at fixed input/output token mixes (chat, agent, RAG, JSON-constrained, long-context, batch). Each release is benchmarked against the latest vLLM and TensorRT-LLM, with results surfaced to customers as live capacity plans — every recommended SKU and replica count on the Yobibyte console comes from an InferenceBench measurement.
For UK and EU sovereign workloads, SGLang runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an open-source Apache 2.0 engine, sovereign hardware, and transparent benchmark scoring is what lets Yobitel customers run agent and structured-generation workloads in regulated environments without ceding control to a hosted SaaS API.
References
- SGLang: Efficient Execution of Structured Language Model Programs · arXiv (Zheng et al., 2023)
- SGLang on GitHub · GitHub
- SGLang Documentation · LMSYS
- FlashInfer: Kernel Library for LLM Serving · GitHub (FlashInfer)
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models · arXiv (Dong et al., 2024)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees · arXiv (Li et al., 2024)
- Efficient Memory Management for LLM Serving with PagedAttention · arXiv (Kwon et al., 2023)