TL;DR
- Pattern for serving multiple models from one fleet, sharing GPU capacity, queues and metrics infrastructure.
- Strategies range from coarse-grained (one model per replica, routed by tag) to fine-grained (many adapters in one replica via multi-LoRA, model hot-swap or MIG slicing).
- Common platforms: Triton Inference Server, KServe with ModelMesh, BentoML, Ray Serve, and runtime-native multi-LoRA in vLLM and TGI.
- Trade-off is between GPU utilisation (higher with consolidation) and isolation (better with dedicated replicas).
Why Consolidate
A SaaS platform serving fifty fine-tuned LLM variants cannot afford fifty dedicated GPU fleets — most variants would sit idle at any moment. Multi-model serving consolidates many models behind shared infrastructure so that aggregate utilisation rises and idle capacity falls.
The pattern predates LLMs (NVIDIA's Triton Inference Server has supported it since its 2018 release as TensorRT Inference Server), but it became urgent with the rise of per-tenant fine-tunes and LoRA adapters.
Patterns
- Replica-per-model: the simplest option. Each model runs in its own replica and requests are routed by URL or header. Best isolation, worst utilisation.
- Model-mesh: KServe ModelMesh schedules many models across a pool of replicas, loading and unloading them on demand. Good for workloads that tolerate cold loads.
- Multi-LoRA: a single base model stays in memory, many LoRA adapters are loaded as cheap deltas, and each request is routed to its adapter. Best utilisation when models share a base.
- MIG slicing: partition a MIG-capable GPU into 2-7 instances and serve a different model on each. Isolation is hardware-enforced, but slice sizes come from a fixed set of profiles.
- Time-multiplexed swap: load and unload whole models on demand. The first request after a swap pays the full model load time, which dominates its latency.
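The load-and-unload behaviour shared by the model-mesh and time-multiplexed patterns can be sketched as a least-recently-used cache of resident models. This is a minimal illustration, not how any particular platform implements it: `ModelCache`, its `loader` callback and the string stand-ins for model weights are all hypothetical.

```python
from collections import OrderedDict
from typing import Callable, List


class ModelCache:
    """Keeps at most `capacity` models resident; evicts the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # hypothetical callback that loads weights by name
        self.resident: "OrderedDict[str, object]" = OrderedDict()
        self.cold_loads = 0
        self.evicted: List[str] = []

    def get(self, name: str) -> object:
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return self.resident[name]
        # Cold path: make room first, then pay the load cost on this request.
        while len(self.resident) >= self.capacity:
            victim, _ = self.resident.popitem(last=False)
            self.evicted.append(victim)
        self.cold_loads += 1
        model = self.loader(name)
        self.resident[name] = model
        return model


# Stand-in loader: a real one would read weights into GPU memory.
cache = ModelCache(capacity=2, loader=lambda name: f"<weights:{name}>")
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
```

With room for two models, the five requests above trigger four cold loads, which is why this pattern suits only workloads that can absorb load latency.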
Multi-LoRA Detail
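Multi-LoRA is the high-leverage pattern for fine-tune-heavy platforms. A base model (e.g. Llama 3 8B) lives in GPU memory; per-tenant LoRA adapters, typically a few tens of megabytes each depending on rank and target modules, are loaded alongside it and selected per request by an adapter identifier (TGI takes an `adapter_id` request parameter, while vLLM's OpenAI-compatible server selects the adapter through the `model` field). Forward-pass cost rises only by the adapter's low-rank matmuls, which are small next to the base model's, although batches that mix many adapters add some overhead. TGI, vLLM and TensorRT-LLM all support this pattern; their relative performance depends on version, adapter rank and batch mix, so benchmark on the target workload.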
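To make the "cheap delta" concrete, here is a toy sketch of one linear layer with per-request adapter selection, written in pure Python with tiny matrices. It assumes the row-vector convention y = xW + scale * (xA)B; the function names, the `tenant-a` and `tenant-b` adapters and the 2x2 shapes are illustrative only, and real runtimes use fused GPU kernels.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def vec_mat(x: Vector, m: Matrix) -> Vector:
    """Row vector times matrix: x (n) @ m (n x k) -> (k)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def lora_forward(
    x: Vector,
    base_w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],
    adapter_id: Optional[str],
    scale: float = 1.0,
) -> Vector:
    """Shared base projection plus the low-rank delta of the requested adapter."""
    y = vec_mat(x, base_w)                 # computed once, same weights for every tenant
    if adapter_id is None:
        return y                           # plain base-model request
    a, b = adapters[adapter_id]            # a: d_in x r, b: r x d_out
    delta = vec_mat(vec_mat(x, a), b)      # two small matmuls through rank r
    return [yi + scale * di for yi, di in zip(y, delta)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one rank-`rank` adapter on a d_in x d_out layer."""
    return rank * (d_in + d_out)


base_w = [[1.0, 0.0], [0.0, 1.0]]          # identity, so the delta is easy to see
adapters = {
    "tenant-a": ([[1.0], [0.0]], [[0.0, 2.0]]),   # rank 1
    "tenant-b": ([[0.0], [1.0]], [[1.0, 0.0]]),   # rank 1
}
x = [3.0, 4.0]
out_base = lora_forward(x, base_w, adapters, None)
out_a = lora_forward(x, base_w, adapters, "tenant-a")
out_b = lora_forward(x, base_w, adapters, "tenant-b")
```

The same input yields a different output per adapter while `base_w` is never copied. For scale, a rank-16 adapter on a 4096 x 4096 projection adds 131,072 parameters against 16,777,216 in the full matrix.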
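Cap the number of concurrent adapters per replica. Throughput and latency are sensitive to how many distinct adapters are active in a single batch; a typical sweet spot is 8-32 on an H100, though the right value depends on adapter rank, batch size and runtime, so measure it on the target workload.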
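One way to enforce such a cap is in the batch builder: admit requests in arrival order, but defer any request whose adapter would push the batch past the limit. The sketch below is an assumption about how this could be done, not a description of any runtime's scheduler; `build_batch`, `max_batch` and `max_adapters` are hypothetical names.

```python
from typing import List, Optional, Set, Tuple

# (request_id, adapter_id), where adapter_id is None for a base-model request
Request = Tuple[str, Optional[str]]


def build_batch(
    queue: List[Request], max_batch: int, max_adapters: int
) -> Tuple[List[Request], List[Request]]:
    """Return (batch, remaining), capping distinct adapters in the batch."""
    batch: List[Request] = []
    remaining: List[Request] = []
    active: Set[str] = set()
    for req in queue:
        _, adapter = req
        fits_size = len(batch) < max_batch
        fits_adapters = (
            adapter is None                 # base-model requests need no adapter slot
            or adapter in active            # adapter already part of this batch
            or len(active) < max_adapters   # a free adapter slot remains
        )
        if fits_size and fits_adapters:
            batch.append(req)
            if adapter is not None:
                active.add(adapter)
        else:
            remaining.append(req)           # deferred, original order preserved
    return batch, remaining


queue = [("r1", "a"), ("r2", "b"), ("r3", "c"), ("r4", "a"), ("r5", None)]
batch, remaining = build_batch(queue, max_batch=4, max_adapters=2)
```

Here `r3` is deferred because adapter `c` would be a third distinct adapter, while `r4` and `r5` still fit. Deferred requests stay at the head of the queue, which limits starvation. Runtimes often expose a comparable limit as a setting (vLLM has a `max_loras` engine argument, for example), so check the runtime's documentation before building this yourself.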
Routing Layer
Multi-model fleets need a model-aware router. Its key responsibilities are to select a replica that already has the requested model loaded, balance load per model, honour priority and isolation rules, and expose per-model metrics. KServe's ModelMesh, BentoML's routing layer and the vLLM Production Stack all implement variants of this.
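The first two responsibilities, preferring a warm replica and balancing load, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Replica` record, its `in_flight` counter and the fall-back-to-coldest-replica rule are hypothetical, and production routers also handle priorities, isolation and health checks.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Replica:
    name: str
    loaded: Set[str] = field(default_factory=set)  # models currently resident
    in_flight: int = 0                              # requests being processed


def pick_replica(replicas: List[Replica], model: str) -> Replica:
    """Prefer replicas with the model loaded; break ties by lowest load."""
    if not replicas:
        raise ValueError("no replicas available")
    warm = [r for r in replicas if model in r.loaded]
    # If no replica has the model, fall back to the least-loaded replica,
    # which will then have to load it (a cold start).
    candidates = warm or replicas
    return min(candidates, key=lambda r: r.in_flight)


fleet = [
    Replica("r0", {"a"}, in_flight=5),
    Replica("r1", {"a", "b"}, in_flight=2),
    Replica("r2", set(), in_flight=0),
]
choice_a = pick_replica(fleet, "a").name   # warm on r0 and r1, r1 is less busy
choice_c = pick_replica(fleet, "c").name   # warm nowhere, so least-loaded overall
```

Note that the idle replica `r2` is skipped for model `a` even though it has the lowest load: avoiding a cold load usually matters more than perfect balance.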
Observability
Per-model metrics are essential. Aggregate fleet metrics can hide that one model has a 99th-percentile latency of 8 seconds while the rest run normally. Per-model histograms, queue depths and error rates should be exported by default and used as the primary signal for capacity planning.
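A small worked example shows how the aggregate can mask a struggling model. The sketch uses a nearest-rank percentile over synthetic latencies; the model names, sample counts and latency values are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def percentile(values: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100   # ceil(q * n / 100) in integer math
    return ordered[rank - 1]


def p99_by_model(samples: List[Tuple[str, float]]) -> Dict[str, float]:
    """Group latency samples by model and compute p99 for each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for model, latency_s in samples:
        grouped[model].append(latency_s)
    return {model: percentile(vals, 99) for model, vals in grouped.items()}


# Synthetic traffic: one busy, healthy model and one rarely used, slow one.
samples = [("popular", 0.2)] * 990 + [("tenant-x", 8.0)] * 10

fleet_p99 = percentile([latency for _, latency in samples], 99)
per_model = p99_by_model(samples)
```

With these samples the fleet-wide p99 is 0.2 seconds, because `tenant-x` contributes only 1% of traffic, yet its own p99 is 8.0 seconds. Only the per-model view exposes the problem.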
When to Consolidate
Consolidate when traffic per model is sub-replica scale, when models share a base (multi-LoRA fits), or when the alternative is unaffordable idle capacity. Keep models on dedicated fleets when latency SLOs are tight (cold-load risk), when models share no base, or when regulatory or contractual isolation is required.
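These criteria can be written down as a small decision helper. The field names, the returned labels and especially the order in which the rules are checked are assumptions made for illustration; a real platform would weigh cost and SLO data rather than apply fixed rules.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    peak_replicas_needed: float   # estimated replicas needed at peak (hypothetical metric)
    shares_base_model: bool       # fine-tune of a base already served by the fleet
    tight_latency_slo: bool       # cannot tolerate cold-load latency
    requires_isolation: bool      # regulatory or contractual isolation


def placement(profile: ModelProfile) -> str:
    """Map a model profile to a placement, checking hard constraints first."""
    if profile.requires_isolation or profile.tight_latency_slo:
        return "dedicated"
    if profile.shares_base_model:
        return "multi-lora"
    if profile.peak_replicas_needed < 1.0:
        return "shared-pool"      # sub-replica traffic, no shared base
    return "dedicated"


small_finetune = ModelProfile(0.1, True, False, False)
regulated = ModelProfile(0.1, True, False, True)
small_unrelated = ModelProfile(0.3, False, False, False)
large_unrelated = ModelProfile(4.0, False, False, False)
```

Checking isolation and latency constraints before utilisation reflects the assumption that those requirements are non-negotiable, whereas idle capacity is a cost to be traded off.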
References
- KServe ModelMesh · KServe
- Triton Model Repository Documentation · NVIDIA
- vLLM Multi-LoRA Documentation · vLLM