TL;DR
- TensorRT-LLM's term for iteration-level batch scheduling — functionally equivalent to vLLM's continuous batching.
- Implemented inside the TensorRT-LLM batch manager; admits new requests and evicts finished ones between every decoding step.
- Enabled by setting `gpt_model_type` to `inflight_fused_batching` in the model config of Triton's `tensorrtllm_backend`.
- Pairs with paged KV cache, chunked prefill and speculative decoding inside the same engine.
Overview
When NVIDIA built TensorRT-LLM, the team adopted iteration-level scheduling as a core feature and gave it the name 'in-flight batching'. The mechanics are the same as vLLM's continuous batching: between every forward pass, the scheduler evicts sequences that have finished (by emitting an end-of-sequence token or reaching their length limit) and admits waiting sequences into the freed slots.
The naming reflects NVIDIA's preference for describing the technique as 'requests in flight at any moment' rather than 'batch boundaries that move over time'. Both descriptions point at the same scheduling model.
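The evict-and-admit loop described above can be sketched as a toy simulation. This is a minimal illustration, not TensorRT-LLM code: `Request`, `run_inflight` and `max_slots` are hypothetical names, and a fixed `tokens_to_generate` stands in for "runs until an end-of-sequence token appears".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; not a TensorRT-LLM type."""
    name: str
    tokens_to_generate: int  # stand-in for "runs until EOS or a length limit"
    generated: int = 0

    @property
    def finished(self) -> bool:
        return self.generated >= self.tokens_to_generate


def run_inflight(requests, max_slots):
    """Simulate iteration-level scheduling; return {name: iteration it finished on}."""
    waiting = deque(requests)
    active = []
    finished_at = {}
    iteration = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next forward pass.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        # One forward pass: every active sequence produces exactly one token.
        iteration += 1
        for req in active:
            req.generated += 1
        # Record and evict finished sequences so their slots are free next iteration.
        for req in active:
            if req.finished:
                finished_at[req.name] = iteration
        active = [req for req in active if not req.finished]
    return finished_at


schedule = run_inflight(
    [Request("a", 2), Request("b", 5), Request("c", 3)],
    max_slots=2,
)
print(schedule)
```

With two slots, request `a` finishes on iteration 2, `c` takes over its slot on iteration 3, and both `b` and `c` finish on iteration 5. Under static batching, `c` would have to wait for `b` to finish on iteration 5 and would complete on iteration 8.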
Implementation
- Lives inside the TensorRT-LLM `BatchManager` C++ component.
- Coupled with `paged_kv_cache` to handle the irregular KV lifecycles.
- Supports `max_num_sequences`, `max_num_tokens` and `kv_cache_free_gpu_mem_fraction` as scheduling knobs.
- Chunked context (TensorRT-LLM's chunked prefill) splits long prompts across iterations to keep decode latency stable.
- Engine must be built with `--remove_input_padding` and packed-tensor support enabled.
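How a per-iteration token budget and chunked context interact can be sketched in a few lines. This is an illustrative model only: `Sequence` and `plan_iteration` are hypothetical names, the decode-first ordering is an assumption made for the example, and the real batch manager applies further constraints that are not modelled here. The budget argument borrows the name `max_num_tokens` from the list above.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    """Hypothetical sequence state; not a TensorRT-LLM type."""
    name: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def plan_iteration(sequences, max_num_tokens):
    """Return {name: tokens processed this iteration} under a shared token budget."""
    plan = {}
    budget = max_num_tokens
    # Assumption: decoding sequences are served first, one token each.
    for seq in sequences:
        if seq.in_decode and budget > 0:
            plan[seq.name] = 1
            budget -= 1
    # Leftover budget goes to prompt chunks, so a long prompt cannot stall decodes.
    for seq in sequences:
        if not seq.in_decode and budget > 0:
            chunk = min(seq.prompt_len - seq.prefilled, budget)
            plan[seq.name] = chunk
            seq.prefilled += chunk
            budget -= chunk
    return plan


# "chat" is already decoding; "doc" arrives with a 10-token prompt.
seqs = [Sequence("chat", prompt_len=4, prefilled=4), Sequence("doc", prompt_len=10)]
plans = [plan_iteration(seqs, max_num_tokens=5) for _ in range(4)]
for step, plan in enumerate(plans, start=1):
    print(step, plan)
```

With a budget of 5 tokens, the 10-token prompt is processed in chunks of 4, 4 and 2 over three iterations while `chat` keeps producing one token per iteration; on the fourth iteration both sequences decode. Without chunking, the whole prompt would have to fit into a single iteration alongside the running decodes.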
Triton Configuration
The Triton TensorRT-LLM backend exposes in-flight batching via the `gpt_model_type` parameter in `config.pbtxt`. Setting it to `inflight_fused_batching` selects the continuous-batching scheduler; the older `V1` setting falls back to static batching, where a batch runs to completion before new requests are admitted.
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "131072" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.9" }
}
parameters: {
  key: "enable_chunked_context"
  value: { string_value: "true" }
}
Comparison to vLLM Continuous Batching
At the scheduling level there is no meaningful behavioural difference. Both runtimes implement iteration-level scheduling, paged KV cache, chunked prefill and request-level priorities. The differences are in the surrounding implementation (TensorRT-LLM's C++ batch manager versus vLLM's Python+CUDA scheduler) and in how each integrates with quantisation, parallelism and the serving layer.
When to Care About the Distinction
Only when reading NVIDIA documentation. Outside TensorRT-LLM and Triton contexts, 'continuous batching' is the term most teams use. The two are synonymous, and the choice of word does not change deployment behaviour.