MPS (Multi-Process Service)

TL;DR

Multi-Process Service (MPS) is a CUDA feature that lets multiple processes share a GPU concurrently by funnelling their kernels through a single MPS server.
Predates MIG by years; works on CUDA GPUs with compute capability 3.5 or newer under Linux, including consumer cards and the T4, L4, L40, and A10, which have no MIG support.
Provides only software isolation — one process can OOM another, and memory bandwidth is shared. But it has near-zero overhead and finer-grained sharing than MIG's fixed profiles.
Exposed in Kubernetes via the NVIDIA device plugin's `sharing.mps` mode or via third-party fractional-GPU plugins; useful for inference fleets on non-MIG hardware and for workloads needing >7-way sharing.

What MPS Does

Without MPS, a GPU runs one CUDA context at a time. Two processes sharing a GPU time-slice at the context level: each takes turns being active, with context-switch overhead between them. MPS replaces this with a long-running control daemon (`nvidia-cuda-mps-control`) that starts a shared server process (`nvidia-cuda-mps-server`). Client processes connect to that server, and their kernels reach the GPU as if from one merged stream of work. The GPU's hardware scheduler can then run kernels from different clients concurrently on different SMs.

The result is true spatial sharing on a single GPU, with overhead measured in microseconds rather than the milliseconds a context switch costs. For inference workloads where each request fills only a fraction of the SMs, MPS lets the GPU serve several requests in parallel without per-request context overhead.
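As a rough illustration of the client side, the sketch below builds the environment a process would need to attach to a running MPS server. `CUDA_MPS_PIPE_DIRECTORY` and `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` are environment variables described in NVIDIA's MPS guide; the helper name `mps_client_env`, the default pipe path, and the `serve.py` script mentioned afterwards are assumptions made for this example.

```python
import os


def mps_client_env(pipe_dir="/tmp/nvidia-mps", active_thread_pct=None, base_env=None):
    """Return an environment dict for a process that should attach to MPS.

    pipe_dir must match the pipe directory the MPS control daemon was
    started with (assumed here to be /tmp/nvidia-mps).
    active_thread_pct optionally caps the share of SM threads the client
    may use; it is an upper bound, not a reservation.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    if active_thread_pct is not None:
        if not 0 < active_thread_pct <= 100:
            raise ValueError("active_thread_pct must be in (0, 100]")
        env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(active_thread_pct)
    return env
```

A client could then be launched with something like `subprocess.Popen(["python", "serve.py"], env=mps_client_env(active_thread_pct=25))`, where `serve.py` stands in for any CUDA program.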

MPS vs MIG

MPS and MIG solve the same business problem — sharing a GPU between workloads — but with different trade-offs.

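| Property | MPS | MIG |
| --- | --- | --- |
| Isolation | Software | Hardware |
| Memory protection | No hardware partitioning; optional per-client limits | Yes |
| Bandwidth isolation | No | Yes |
| Hardware required | CUDA GPU with compute capability 3.5+ | MIG-capable data-center GPUs (e.g. A100, A30, H100, H200, B200) |
| Max partitions | Up to 48 clients per GPU (Volta and newer) | 7 per GPU |
| Partition sizes | Fully dynamic | Fixed profiles |
| Repartition cost | None | GPU drain (plus a GPU reset when toggling MIG mode on Ampere) |
| Use case | Inference, dev | Multi-tenant prod |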

Kubernetes Integration

The NVIDIA Kubernetes device plugin supports an `mps` sharing mode where each physical GPU is advertised as N logical GPUs, each backed by an MPS client. Kubernetes treats these as standard `nvidia.com/gpu` resources, so unmodified workloads can be packed onto a shared GPU.

```yaml
# Device plugin config: 4-way MPS sharing per GPU
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

MPS does not partition memory in hardware. Unless a limit is set, one client can allocate more VRAM than its fair share and push the others into out-of-memory errors. Set `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` per client (for example `0=6G` for device 0) when you cannot trust the workload mix.
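One way to derive a per-client value is to split the GPU's memory evenly across the configured replicas. The sketch below does that arithmetic and formats the result as `<device ordinal>=<size>M`, which follows the format shown in NVIDIA's MPS guide as far as I can tell; the function name, the `headroom_mib` parameter, and the 24 GiB figure in the usage note are assumptions for illustration, so check the accepted suffixes against your driver version.

```python
def fair_share_mem_limit(total_mib, replicas, device_ordinal=0, headroom_mib=0):
    """Format an even per-client memory cap for CUDA_MPS_PINNED_DEVICE_MEM_LIMIT.

    total_mib     -- total device memory in MiB
    replicas      -- number of MPS clients sharing the GPU
    headroom_mib  -- memory held back for the MPS server and driver (assumed)
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    usable_mib = total_mib - headroom_mib
    if usable_mib <= 0:
        raise ValueError("headroom leaves no usable memory")
    per_client_mib = usable_mib // replicas  # round down so the caps never oversubscribe
    return f"{device_ordinal}={per_client_mib}M"
```

For a 24 GiB card shared four ways, `fair_share_mem_limit(24576, 4)` yields `0=6144M`, which would be exported in each client's environment.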

When to Use MPS

Inference fleets on L4, L40S, A10, or T4 — these have no MIG support, so MPS is the only spatial-sharing option.
Dev clusters where many users share a small number of GPUs and a hard isolation guarantee is not required.
Workloads needing more than 7-way sharing on a single GPU: MIG caps at 7 instances, while MPS allows up to 48 clients per GPU on Volta and newer.
Latency-sensitive inference where the 60-90 s MIG repartitioning cost is unacceptable.

When Not to Use MPS

Multi-tenant production where workloads do not trust each other: software isolation is not enough.
Frontier-model serving where deterministic bandwidth matters.
Training: a job that already saturates the GPU gains nothing from MPS, and the MPS daemon adds a single point of failure.
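The guidance in the last two sections can be condensed into a small decision helper. This is only a sketch of the rules of thumb stated in this article, not an NVIDIA recommendation: the function name, the return labels, and the deliberately non-exhaustive `MIG_CAPABLE` set are all assumptions.

```python
# Illustrative, non-exhaustive set of MIG-capable models named in this article.
MIG_CAPABLE = {"A100", "H100", "H200", "B200"}
MIG_MAX_INSTANCES = 7


def choose_sharing_mode(gpu_model, untrusted_tenants, ways):
    """Pick "mps", "mig", or "dedicated" for sharing one GPU `ways` ways."""
    if ways < 1:
        raise ValueError("ways must be at least 1")
    if ways == 1:
        return "dedicated"  # nothing to share
    mig_fits = gpu_model in MIG_CAPABLE and ways <= MIG_MAX_INSTANCES
    if untrusted_tenants:
        # Mutually untrusted workloads need hardware isolation or their own GPU.
        return "mig" if mig_fits else "dedicated"
    # Trusted workloads: MPS is the simpler default here, although MIG
    # would also work on capable hardware.
    return "mps"
```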

Operational Notes

The MPS control daemon (`nvidia-cuda-mps-control`) and the server process it spawns are a shared dependency for every client on the GPU. If the server crashes, every client on that GPU dies with it. Monitor the daemon, configure systemd or a sidecar to restart it, and budget for occasional surprise terminations. Telemetry under MPS is harder than under MIG: DCGM reports the whole GPU as one resource, so per-client billing requires `nvidia-smi` process-level sampling or eBPF-based attribution.
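A minimal sketch of the process-level sampling approach follows. It parses the CSV that `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits` prints and sums memory per PID. The function name is an assumption and the sample rows in the usage note are made up; whether per-client figures are populated under MPS may depend on the driver version, so treat this as a starting point.

```python
import csv
import io
from collections import defaultdict

# Command whose output the parser below expects (values in MiB).
QUERY_CMD = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def memory_by_pid(csv_text):
    """Sum reported GPU memory (MiB) per PID from nvidia-smi CSV output."""
    usage = defaultdict(int)
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        pid, _name, used = (field.strip() for field in row)
        if not (pid.isdigit() and used.isdigit()):
            continue  # skip rows where a value is unavailable, e.g. "[N/A]"
        usage[int(pid)] += int(used)
    return dict(usage)
```

Given the hypothetical output `"1234, python, 2048\n5678, tritonserver, 4096\n1234, python, 512\n"`, the function returns `{1234: 2560, 5678: 4096}`. In practice the text would come from `subprocess.run(QUERY_CMD, capture_output=True, text=True).stdout`, sampled on an interval and joined to pod metadata by PID.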
