TL;DR
- Multi-Process Service (MPS) is a CUDA feature that lets multiple processes share a GPU concurrently by funnelling their kernels through a single MPS server.
- Predates MIG by years; works on nearly every modern CUDA-capable GPU (compute capability 3.5 or later, on Linux), including consumer cards and the T4, L4, L40, and A10, none of which support MIG.
- Provides only software isolation — one process can OOM another, and memory bandwidth is shared. But it has near-zero overhead and finer-grained sharing than MIG's fixed profiles.
- Exposed in Kubernetes via the NVIDIA device plugin's `sharing.mps` mode or via third-party fractional-GPU plugins; useful for inference fleets on non-MIG hardware and for workloads needing >7-way sharing.
What MPS Does
Without MPS, a GPU runs one CUDA context at a time. Two processes sharing a GPU time-slice at the context level: each takes turns being active, with context-switch overhead between them. MPS replaces this with a single long-running daemon (`nvidia-cuda-mps-server`) that accepts kernels from multiple client processes and submits them to the GPU in a merged stream. The GPU's hardware scheduler then runs the kernels concurrently on different SMs.
The result is true spatial sharing on a single GPU, with overhead measured in microseconds rather than the milliseconds a context switch costs. For inference workloads where each request fills only a fraction of the SMs, MPS lets the GPU serve several requests in parallel without per-request context overhead.
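To make the difference between time-slicing and spatial sharing concrete, here is a toy back-of-the-envelope model, not a benchmark. It assumes every request runs one kernel of identical duration, that kernels occupying a fraction of the SMs pack perfectly side by side under MPS, and that a context switch has a fixed cost. The function names and all numbers are illustrative assumptions, not measurements.

```python
import math


def time_sliced_ms(n_requests: int, kernel_ms: float, switch_ms: float) -> float:
    """Total time when contexts take turns: kernels run one after another,
    with one context switch between consecutive processes."""
    switches = max(0, n_requests - 1)
    return n_requests * kernel_ms + switches * switch_ms


def mps_ms(n_requests: int, kernel_ms: float, sm_fraction: float) -> float:
    """Total time when kernels run side by side (idealized packing).

    sm_fraction is the share of the GPU's SMs one kernel occupies (0 < f <= 1).
    """
    if not 0.0 < sm_fraction <= 1.0:
        raise ValueError("sm_fraction must be in (0, 1]")
    # How many kernels fit on the GPU at once; epsilon guards float rounding.
    per_wave = max(1, math.floor(1.0 / sm_fraction + 1e-9))
    waves = math.ceil(n_requests / per_wave)
    return waves * kernel_ms


if __name__ == "__main__":
    # Hypothetical workload: 4 requests, 5 ms kernels, each using 25% of the
    # SMs, and an assumed 1 ms context-switch cost.
    print(time_sliced_ms(4, 5.0, 1.0))  # 23.0
    print(mps_ms(4, 5.0, 0.25))         # 5.0
```

Real kernels rarely pack this cleanly, and memory bandwidth is shared between clients, so treat the MPS figure as an upper bound on the benefit.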
MPS vs MIG
MPS and MIG solve the same business problem — sharing a GPU between workloads — but with different trade-offs.
| Property | MPS | MIG |
|---|---|---|
| Isolation | Software | Hardware |
| Memory protection | No | Yes |
| Bandwidth isolation | No | Yes |
| Hardware required | CUDA GPU, compute capability 3.5+ | MIG-capable data-center GPUs (e.g. A100, A30, H100, H200, B200) |
| Max partitions | Up to 48 clients per GPU (Volta and later) | Up to 7 per GPU |
| Partition sizes | Fully dynamic | Fixed profiles |
| Repartition cost | None | Driver reload + node drain |
| Use case | Inference, dev | Multi-tenant prod |
Kubernetes Integration
The NVIDIA Kubernetes device plugin supports an `mps` sharing mode where each physical GPU is advertised as N logical GPUs, each backed by an MPS client. Kubernetes treats these as standard `nvidia.com/gpu` resources, so unmodified workloads can be packed onto a shared GPU.
```yaml
# Device plugin config: 4-way MPS sharing per GPU
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

MPS has no memory isolation. If one client allocates more VRAM than its fair share, the others will OOM. Set `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` per client when you cannot trust the workload mix.
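As a rough illustration of the memory-limit advice above, the sketch below derives an equal per-client budget from the replica count and formats it for `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT`. The helper name, the `reserve_mib` headroom parameter, and the equal-split policy are assumptions made for this example. The `device=limit` string with an `M` suffix follows the form described in the MPS guide, but check it against the documentation for your driver version before relying on it.

```python
def mps_mem_limit(total_mib: int, replicas: int,
                  device: int = 0, reserve_mib: int = 0) -> str:
    """Return a CUDA_MPS_PINNED_DEVICE_MEM_LIMIT value giving each client
    an equal share of device memory.

    total_mib   -- total VRAM on the device, in MiB
    replicas    -- number of MPS clients sharing the device
    reserve_mib -- hypothetical headroom held back from all clients
    """
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    usable = total_mib - reserve_mib
    if usable <= 0:
        raise ValueError("reserve_mib leaves no usable memory")
    share = usable // replicas  # round down so the shares never oversubscribe
    return f"{device}={share}M"


if __name__ == "__main__":
    # A 24 GiB card (24576 MiB) shared 4 ways, matching replicas: 4 above.
    print(mps_mem_limit(24576, 4))  # 0=6144M
```

The resulting string would be set in each client's environment, for example through the container's `env` section.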
When to Use MPS
- Inference fleets on L4, L40S, A10, or T4 — these have no MIG support, so MPS is the only spatial-sharing option.
- Dev clusters where many users share a small number of GPUs and a hard isolation guarantee is not required.
- Workloads needing more than 7-way sharing on a single GPU: MIG caps at 7 instances, while MPS accepts up to 48 clients per GPU on Volta and later.
- Latency-sensitive inference where the 60-90 s MIG repartitioning cost is unacceptable.
When Not to Use MPS
- Multi-tenant production where workloads do not trust each other: software isolation is not enough.
- Frontier-model serving where deterministic bandwidth matters.
- Training, full stop: MPS does not help and adds a single point of failure (the MPS daemon).
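The guidance in the last two sections can be condensed into a small decision helper. This is only a sketch of the article's rules of thumb: the `Workload` fields, the function name, and the returned labels are hypothetical, and a real scheduler policy would weigh more factors.

```python
from dataclasses import dataclass

MIG_MAX_INSTANCES = 7  # per-GPU cap cited in the comparison table


@dataclass
class Workload:
    gpu_supports_mig: bool
    tenants_trust_each_other: bool
    needs_bandwidth_isolation: bool
    is_training: bool
    ways_to_share: int


def pick_sharing_mode(w: Workload) -> str:
    """Return "mps", "mig", or "dedicated" following the rules of thumb above."""
    if w.is_training:
        return "dedicated"  # MPS does not help training
    needs_hard_isolation = (not w.tenants_trust_each_other
                            or w.needs_bandwidth_isolation)
    if needs_hard_isolation:
        # Only MIG offers hardware isolation, and only up to 7 instances.
        if w.gpu_supports_mig and w.ways_to_share <= MIG_MAX_INSTANCES:
            return "mig"
        return "dedicated"
    return "mps"


if __name__ == "__main__":
    # Trusted inference on a GPU without MIG, shared 8 ways.
    print(pick_sharing_mode(Workload(False, True, False, False, 8)))  # mps
```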
Operational Notes
The MPS daemon is a single process per GPU. If it crashes, every client on that GPU dies with it. Monitor the daemon, configure systemd or a sidecar to restart it, and budget for occasional surprise terminations. Telemetry is also harder under MPS than under MIG: DCGM reports the whole GPU as one resource, so per-client billing requires process-level sampling with `nvidia-smi` or eBPF-based attribution.
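One way to approximate process-level sampling is to poll `nvidia-smi --query-compute-apps` and aggregate memory use per PID, as sketched below. This is a minimal example under several assumptions: memory use is only a rough proxy for a client's share of the GPU, how MPS clients appear in this output can vary by driver version, and mapping PIDs back to pods or tenants is left out. The function names are illustrative.

```python
import csv
import io
import subprocess
from collections import defaultdict

# Query per-process GPU memory as CSV: one "pid, used_memory_mib" row per process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Sum used GPU memory (MiB) per PID from nvidia-smi CSV output."""
    usage: dict[int, int] = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        pid_s, mem_s = (cell.strip() for cell in row)
        if not (pid_s.isdigit() and mem_s.isdigit()):
            continue  # skip rows with non-numeric values such as "[N/A]"
        usage[int(pid_s)] += int(mem_s)
    return dict(usage)


def sample() -> dict[int, int]:
    """Run nvidia-smi once and return {pid: used_memory_mib}."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    print(sample())
```

Calling `sample()` on an interval and joining the PIDs against container metadata would give a coarse per-client usage series.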