TL;DR
- MIG is hardware GPU partitioning: one A100 / H100 / H200 / B100 / B200 presents as up to seven independent, isolated GPUs (up to four on A30), each with its own SMs, L2 slice, HBM bandwidth and NVENC/NVDEC allocation.
- Not time-slicing — silicon-level isolation. Inter-instance memory bandwidth contention is bounded by the partition table; a misbehaving tenant cannot starve another at the hardware layer.
- Standard profile syntax: `<compute>g.<memory>gb`, e.g. `1g.10gb`, `2g.20gb`, `3g.40gb`, `4g.40gb`, `7g.80gb` on H100 80 GB. H200 141 GB exposes larger-memory profiles (`1g.18gb`, `7g.141gb`). Hopper added FP8 and Confidential Compute attestation per slice.
- GPU Operator integration on Kubernetes: 'single' strategy exposes uniform slices as `nvidia.com/gpu`; 'mixed' strategy exposes per-profile resources (e.g. `nvidia.com/mig-1g.10gb`, `nvidia.com/mig-3g.40gb`) — both are first-class Kubernetes resources.
- Yobibyte exposes MIG as workspace tenancy mode `shared` (vs `dedicated` for full-card); Yobitel NeoCloud bills per MIG slice on a FOCUS-conformant per-slice-hour basis so multi-tenant inference economics work at the silicon, billing and observability layers together.
Overview
Multi-Instance GPU (MIG) is the hardware feature NVIDIA introduced with A100 in 2020 that lets one data-centre GPU present as up to seven independent, isolated GPUs. Each MIG instance has its own dedicated SMs (Streaming Multiprocessors), its own slice of L2 cache, its own HBM bandwidth allocation, its own memory controller assignment, and its own NVENC/NVDEC pair where applicable. The instances cannot interfere with each other — there is no shared state at the silicon level beyond the GPU's chassis and PCIe lane.
MIG matters because it allows expensive GPUs to be sold, scheduled and billed at finer granularity than 'one whole GPU per workload'. For inference fleets running many small (7B-class) replicas, MIG lets a single H100 host seven concurrent 7B-class endpoints with hard isolation: each tenant sees what looks like a dedicated 1g.10gb GPU with deterministic performance, and the host operator serves up to seven paying tenants from a card that would otherwise be rented to one. For Kubernetes clusters, MIG slices appear as discrete schedulable resources (`nvidia.com/mig-1g.10gb`) that the scheduler can place workloads onto using the same primitives as full GPUs.
This entry is the reference for teams sizing MIG-aware infrastructure on Hopper or Blackwell: profile tables for A100 / H100 / H200 / B200, the partition-table constraints that make some combinations valid and others not, the GPU Operator single-vs-mixed strategy choice, the confidential-compute attestation story on Hopper, the per-slice cost economics, and the operational pitfalls. Yobitel NeoCloud bills MIG slices independently and Yobibyte exposes a `shared` tenancy mode that maps to MIG under the hood: customers don't see the slice profile but inherit the price advantage. This entry helps you decide when MIG-on-shared makes sense vs full-card-dedicated and how to size the slice mix.
Specifications: profiles and partition tables
MIG profiles encode the compute and memory allocation per slice as `<compute>g.<memory>gb` — e.g. `1g.10gb` is 1 GPC (~14 SMs on H100) and 10 GB of HBM, `7g.80gb` is the entire H100. The valid combinations are constrained by the GPU's partition table; you cannot freely mix arbitrary slice sizes. The tables below list the supported profiles per SKU as of 2026.
- Partition table rule: total compute across active instances cannot exceed 7g (i.e., the full GPU). A valid layout for H100 is `3g.40gb + 2g.20gb + 1g.10gb + 1g.10gb` (= 7g compute, 80 GB memory).
- Switching the partition table (e.g. from 7x1g.10gb to 1x4g.40gb + 1x3g.40gb) requires destroying all existing instances first — destructive operation, plan at cluster-build or maintenance window.
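- MIG mode itself persists across reboots, but the GPU instances do not: they must be recreated after a reboot or GPU reset (MIG Manager does this on Kubernetes). The container runtime mounts a slice by its MIG UUID, so confirm that the recreated layout yields the UUIDs your workloads are pinned to.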
- H200's 1g.18gb slice is the highest-VRAM single-slice MIG profile NVIDIA ships through 2026 — useful for 7B-class chat with 32K+ context that overflows H100's 10 GB.
| GPU | Profile | Compute (GPCs) | Memory | Max instances of this profile | FP8 / TF32 (sparse) per slice |
|---|---|---|---|---|---|
| A100 80 GB | 1g.10gb | 1 | 10 GB | 7 | No FP8 (Ampere); ~22 TFLOPS TF32 |
| A100 80 GB | 2g.20gb | 2 | 20 GB | 3 | ~44 TFLOPS TF32 |
| A100 80 GB | 3g.40gb | 3 | 40 GB | 2 | ~66 TFLOPS TF32 |
| A100 80 GB | 4g.40gb | 4 | 40 GB | 1 | ~89 TFLOPS TF32 |
| A100 80 GB | 7g.80gb | 7 | 80 GB | 1 (full GPU) | ~156 TFLOPS TF32 |
| H100 80 GB SXM5 | 1g.10gb | 1 | 10 GB | 7 | ~565 TFLOPS FP8 sparse |
| H100 80 GB SXM5 | 1g.20gb | 1 | 20 GB | 4 | ~565 TFLOPS FP8 sparse |
| H100 80 GB SXM5 | 2g.20gb | 2 | 20 GB | 3 | ~1,130 TFLOPS FP8 sparse |
| H100 80 GB SXM5 | 3g.40gb | 3 | 40 GB | 2 | ~1,695 TFLOPS FP8 sparse |
| H100 80 GB SXM5 | 4g.40gb | 4 | 40 GB | 1 | ~2,260 TFLOPS FP8 sparse |
| H100 80 GB SXM5 | 7g.80gb | 7 | 80 GB | 1 (full GPU) | ~3,958 TFLOPS FP8 sparse |
| H200 141 GB | 1g.18gb | 1 | 18 GB | 7 | ~565 TFLOPS FP8 sparse |
| H200 141 GB | 1g.35gb | 1 | 35 GB | 4 | ~565 TFLOPS FP8 sparse |
| H200 141 GB | 2g.35gb | 2 | 35 GB | 3 | ~1,130 TFLOPS FP8 sparse |
| H200 141 GB | 3g.71gb | 3 | 71 GB | 2 | ~1,695 TFLOPS FP8 sparse |
| H200 141 GB | 4g.71gb | 4 | 71 GB | 1 | ~2,260 TFLOPS FP8 sparse |
| H200 141 GB | 7g.141gb | 7 | 141 GB | 1 (full GPU) | ~3,958 TFLOPS FP8 sparse |
| B200 192 GB | 1g.23gb | 1 | 23 GB | 7 | ~640 TFLOPS FP4 sparse |
| B200 192 GB | 3g.96gb | 3 | 96 GB | 2 | ~1,950 TFLOPS FP4 sparse |
| B200 192 GB | 7g.192gb | 7 | 192 GB | 1 (full GPU) | ~9,000 TFLOPS FP4 sparse |
The same compute fraction carries a different memory size, and therefore a different profile name, on each GPU: the one-slice profile is `1g.10gb` on H100, `1g.18gb` on H200 and `1g.23gb` on B200. Always check the SKU's documented profile list before designing a partition layout; a manifest that hardcodes `1g.10gb` will fail on H200.
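The two limits in the partition-table rule (at most 7g of compute, and no more memory than the card has) can be checked mechanically before touching a node. The following is a minimal sketch, not NVIDIA tooling: `check_layout` and `is_uint` are hypothetical helpers that parse the profile names and sum them. The sketch ignores placement rules, and because profile names carry rounded memory sizes it is only demonstrated with the H100 80 GB figures; `nvidia-smi mig -lgipp` on the target node remains the authoritative check.

```shell
# Hypothetical helpers: coarse sanity check for a MIG layout.
# Only the compute and memory sums are checked; placement rules are NOT modelled.

# Succeeds when the argument is a non-empty string of digits.
is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; esac; }

# Usage: check_layout MAX_COMPUTE_SLICES MAX_MEMORY_GB PROFILE [PROFILE...]
check_layout() {
  local max_g="$1" max_gb="$2" total_g=0 total_gb=0 profile g rest gb
  shift 2
  for profile in "$@"; do
    g="${profile%%g.*}"      # text before "g."  -> compute slices
    rest="${profile#*g.}"    # text after "g."   -> "<memory>gb"
    gb="${rest%gb}"          # strip the unit    -> memory in GB
    if [ "$rest" = "$profile" ] || [ "$gb" = "$rest" ] \
       || ! is_uint "$g" || ! is_uint "$gb"; then
      echo "unparseable profile: $profile" >&2
      return 2
    fi
    total_g=$(( total_g + g ))
    total_gb=$(( total_gb + gb ))
  done
  if [ "$total_g" -gt "$max_g" ] || [ "$total_gb" -gt "$max_gb" ]; then
    echo "invalid: ${total_g}g / ${total_gb} GB exceeds ${max_g}g / ${max_gb} GB"
    return 1
  fi
  echo "ok: ${total_g}g / ${total_gb} GB"
}

# H100 80 GB layouts taken from the text
check_layout 7 80 3g.40gb 2g.20gb 1g.10gb 1g.10gb   # ok: 7g / 80 GB
check_layout 7 80 4g.40gb 4g.40gb || true           # invalid: 8g of compute
check_layout 7 80 3g.40gb 3g.40gb 1g.10gb || true   # invalid: 90 GB of memory
```

A layout that passes this check can still be rejected by the driver if the instances cannot be placed on aligned slots, so treat a pass as "worth trying", not as proof.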
Architecture: how silicon-level partitioning works
Software GPU sharing comes in two forms: driver time-slicing, which interleaves processes on the full set of SMs on a fast schedule, and CUDA MPS (Multi-Process Service), which lets kernels from several processes run concurrently on the same SMs. Both work for batched inference of trusted workloads but provide no isolation guarantees: a misbehaving tenant can saturate the SMs and starve others, the L2 cache is shared (cache-line eviction patterns leak information across tenants), HBM bandwidth is shared (a bandwidth-heavy tenant degrades a latency-sensitive one), and security boundaries are software-enforced (CUDA context isolation, not silicon).
MIG instead partitions the GPU at the silicon level. In MIG mode the GPU is organised as 7 compute slices (one GPC, Graphics Processing Cluster, each) and 8 memory slices (each a share of HBM capacity, memory controllers and L2 cache), alongside the NVENC/NVDEC engines and the rest. MIG creates 'GPU Instances' (GIs), which are groups of GPCs, memory slices, L2 slices and NVDEC/NVENC channels, and 'Compute Instances' (CIs) within each GI. The hypervisor-level GPU driver enforces the partition: a workload on instance A cannot issue memory accesses, kernel launches or DMA transfers that touch instance B's resources.
Hopper MIG added Confidential Compute (CC-on) attestation per-instance: each MIG slice can be attested independently via SPDM-over-PCIe to NVIDIA's NRAS service, and HBM-resident pages within a slice are sealed against the host kernel. This makes Hopper MIG the first commercial GPU partitioning primitive with FedRAMP Moderate / NCSC OFFICIAL-aligned multi-tenant inference posture.
What MIG does NOT do: NVLink is not exposed across MIG slices on the same GPU (you cannot tensor-parallel across two MIG instances), some peer-to-peer CUDA features are restricted, and certain UVM (Unified Virtual Memory) patterns degrade or fail. MIG is a single-card partitioning primitive — collective operations across MIG instances run over PCIe with the same penalties as cross-host PCIe, not NVLink.
- Silicon partitioning: dedicated SMs, dedicated L2 slice, dedicated HBM memory controllers, dedicated NVDEC/NVENC where applicable.
- No shared state to contend over and no software boundary to bypass at the slice perimeter.
- Hopper added FP8 Tensor Cores per slice (same per-SM throughput as the full GPU) plus Confidential Compute attestation per instance.
- Blackwell extends MIG with FP4 per slice and per-instance MX-format support.
- MIG does not expose NVLink — tensor parallelism across MIG instances is not viable. Use the full card or different cards for TP.
Form factor / power and thermal
MIG is a partitioning primitive, not a form factor, but it affects how operators think about power and density. A 7-slice H100 SXM5 still draws ~700 W TDP — the slicing does not lower aggregate power, it raises per-watt revenue by hosting more concurrent tenants. Thermal envelope, cooling design and rack power budget are sized for the full-card TDP regardless of MIG configuration.
- Supported on data-centre SKUs only: A100 (40/80 GB), A30, H100 (SXM5 / PCIe / NVL), H200 (SXM5e), B100, B200, GB200. NOT supported on L4, L40, L40S, T4, RTX-class workstation cards or consumer GPUs.
- Power draw is GPU-wide, not per slice — a 7-slice H100 still dissipates ~700 W at peak.
- Slice thermal limits are inherited from the host card — there is no per-slice power cap.
- Confidential Compute (CC-on) on Hopper MIG adds ~3-7 % throughput overhead per slice but does not change the thermal envelope.
Software ecosystem: GPU Operator strategies
MIG is configured via `nvidia-smi mig` commands or programmatically via NVML. On Kubernetes, the NVIDIA GPU Operator manages MIG slice creation, labelling and exposure as resources. Two strategies are supported: 'single' (the node is configured for one uniform slice profile across all GPUs; slices expose as `nvidia.com/gpu` so existing manifests work unchanged) and 'mixed' (different profiles on different GPUs; each profile exposes as a distinct resource type `nvidia.com/mig-1g.10gb`, `nvidia.com/mig-3g.40gb`, etc.).
- GPU Operator single strategy: simple, low-friction, every slice on every GPU is the same profile. Best for homogeneous inference fleets (e.g. all 1g.10gb for a 7B-class endpoint pool).
- GPU Operator mixed strategy: heterogeneous slice mix per node. Best for multi-workload fleets where some workloads need 1g.10gb and others need 3g.40gb on the same physical card.
- MIG Manager (part of the GPU Operator) handles partition changes — drains workloads, applies the new partition table, restarts the device plugin. Partition changes are destructive; plan with a node drain.
- Triton Inference Server and most modern inference servers treat MIG slices as discrete CUDA devices. vLLM, TensorRT-LLM, SGLang and TGI all run unchanged on MIG slices (subject to the slice's VRAM and compute budget).
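- DCGM and the DCGM exporter expose per-slice metrics keyed by MIG instance. Framebuffer fields such as `DCGM_FI_DEV_FB_USED` are reported per instance; for utilisation, use the profiling fields (e.g. `DCGM_FI_PROF_GR_ENGINE_ACTIVE`), because `DCGM_FI_DEV_GPU_UTIL` is generally not reported for MIG instances. With those fields, per-tenant observability is uniform.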
- Yobibyte exposes MIG via workspace tenancy mode `shared` (slice-backed) vs `dedicated` (full-card). The workspace API does not surface profile names; Yobibyte's placement layer maps customer-stated workload sizes to slice profiles on Yobitel NeoCloud's MIG-enabled pools.
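The named layouts that MIG Manager applies live in a ConfigMap. As a minimal sketch of what a custom mixed layout might look like, the hypothetical helper below prints a layout in the mig-parted configuration format used by the GPU Operator. The layout name `h100-3-2-1-1`, the ConfigMap name and the namespace are all assumptions made for illustration; check the field names and the Helm value that points the operator at a custom ConfigMap against the GPU Operator documentation for your version.

```shell
# Hypothetical helper: print a mig-parted style layout to stdout.
# The layout name "h100-3-2-1-1" is made up for this example.
render_mig_config() {
  cat <<'YAML'
version: v1
mig-configs:
  h100-3-2-1-1:
    - devices: all
      mig-enabled: true
      mig-devices:
        "3g.40gb": 1
        "2g.20gb": 1
        "1g.10gb": 2
YAML
}

# Possible usage (ConfigMap name and namespace are assumptions):
#   render_mig_config > config.yaml
#   kubectl create configmap custom-mig-config -n gpu-operator --from-file=config.yaml
#   kubectl label node h100-node-01 nvidia.com/mig.config=h100-3-2-1-1 --overwrite
```

The layout mirrors the valid H100 example from the partition-table rule (3g + 2g + 1g + 1g), so each card on the node would expose three distinct profile-typed resources under the mixed strategy.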
# --- MIG management on a bare H100 node ---
# 1) Enable MIG mode on GPU 0 (reboot may be required)
nvidia-smi -i 0 -mig 1
# 2) List available GPU Instance profiles for this SKU
nvidia-smi mig -lgip
# +-----------------------------------------------------------------------------+
# | GPU instance profiles: |
# | GPU Name ID Instances Memory P2P SM DEC ENC |
# | Free/Total GiB |
# | 0 MIG 1g.10gb 19 7/7 9.50 No 14 1 0 |
# | 0 MIG 1g.20gb 15 4/4 19.50 No 14 1 0 |
# | 0 MIG 2g.20gb 14 3/3 19.50 No 28 1 0 |
# | 0 MIG 3g.40gb 9 2/2 39.25 No 42 2 0 |
# | 0 MIG 4g.40gb 5 1/1 39.25 No 56 2 0 |
# | 0 MIG 7g.80gb 0 1/1 78.75 No 98 5 0 |
# +-----------------------------------------------------------------------------+
# 3) Create GPU Instances by profile ID. Three instances of profile 9 (3g.40gb) would be rejected: only 2 slots exist
# Valid: 1x 3g.40gb + 1x 2g.20gb + 2x 1g.10gb = 7g compute, 80 GB memory (largest profiles listed first)
nvidia-smi mig -cgi 9,14,19,19 -C
# 4) List created GPU Instances
nvidia-smi mig -lgi
# 5) List Compute Instances created on each GI
nvidia-smi mig -lci
# 6) Destroy all instances (fails while any instance is in use, so stop workloads first; the partition can then be redone)
nvidia-smi mig -dci && nvidia-smi mig -dgi
# --- Kubernetes via NVIDIA GPU Operator ---
# 1) Set MIG strategy to 'mixed' so multiple profiles can coexist
helm upgrade gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--set mig.strategy=mixed
# 2) Label the node with the desired MIG configuration (defined in a ConfigMap)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite
# 3) Workloads request slices by profile-typed resource
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-7b-shared }
spec:
  replicas: 7
  selector: { matchLabels: { app: vllm-7b-shared } }
  template:
    metadata: { labels: { app: vllm-7b-shared } }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.0
          args: ["--model","meta-llama/Llama-3.1-8B-Instruct",
                 "--quantization","fp8","--max-model-len","8192"]
          resources:
            limits: { nvidia.com/mig-1g.10gb: 1 }
EOF
Sizing: when to slice and how
MIG-sizing decisions reduce to one question: does your workload mix benefit more from many small isolated tenants on one card, or one workload owning the whole card? The thresholds below are what we use to map workloads to MIG profiles on Yobitel NeoCloud.
- Capacity rule: a 7x 1g.10gb H100 hosts seven independent 7B-class FP8 chat endpoints at ~1,500-2,000 TPS each. The gain is in utilisation: seven lightly loaded tenants fill a card that a single tenant would leave partly idle, even though at saturation on one workload the sliced card is slightly slower than the full card.
- Performance per slice is not exactly proportional — overhead and minor cross-slice effects mean 7x 1g.10gb is ~5-10 % less aggregate throughput than 1x 7g.80gb on the same workload. The tenancy benefit usually pays for the overhead.
- MIG does not help training: NVLink is disabled across slices, so the multi-GPU collective primitives every training stack relies on do not work. Train on full cards.
- Yobibyte's `shared` tenancy mode maps to MIG slice placement on Yobitel NeoCloud's MIG-enabled H100 / H200 pools; customers see a workspace-level price and SLA, not the slice profile.
| Workload | Recommended MIG profile | Why |
|---|---|---|
| 7B-class chat, modest QPS, isolated tenant | 1g.10gb (H100) or 1g.18gb (H200) | FP8 weights fit in 10-18 GB; per-tenant isolation guaranteed. |
| 13B-class chat or 7B with 32K+ context | 2g.20gb (H100) or 1g.35gb (H200) | KV cache headroom; bandwidth roughly proportional. |
| 34B-class chat (Codestral, Yi) | 3g.40gb (H100) or 3g.71gb (H200) | Weights fit at FP8; concurrency limited. |
| 70B+ chat | Full card (7g.80gb / 7g.141gb) | Does not fit in any partial slice on H100; on H200, 70B FP8 weights (~70 GB) nearly fill 4g.71gb and leave almost no KV-cache headroom. |
| Embeddings batch, multi-tenant | 1g.10gb x 7 | Throughput-uniform; hard tenant isolation. |
| Multi-tenant inference for trusted internal workloads | MPS or full card | MIG overhead unjustified if isolation is not required. |
| Confidential inference (FedRAMP / NCSC OFFICIAL) | Any MIG profile, CC-on | Per-instance attestation; sealed HBM. |
| Training | Full card or NVLink-attached cluster | MIG disables NVLink across slices; training collectives need full-card NVLink. |
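The "weights fit" judgments in the table come down to simple arithmetic: weight memory is roughly parameters times bits per weight divided by eight. The sketch below is a hypothetical helper, `slice_headroom_gb`, that applies that estimate to a slice size. It is a first-pass filter only: it ignores KV cache, activations and runtime overhead, and it uses the nominal slice size even though usable memory is somewhat lower (the `nvidia-smi mig -lgip` listing shows 9.50 GiB for 1g.10gb).

```shell
# Hypothetical helper: memory left in a slice after loading the weights.
# Usage: slice_headroom_gb PARAMS_BILLIONS BITS_PER_WEIGHT SLICE_GB
# Prints the remaining GB (negative when the weights alone do not fit).
slice_headroom_gb() {
  local params_b="$1" bits="$2" slice_gb="$3" weights_gb
  # weights in GB ~= parameters (billions) x bits per weight / 8, rounded up
  weights_gb=$(( (params_b * bits + 7) / 8 ))
  echo $(( slice_gb - weights_gb ))
}

slice_headroom_gb 7 8 10    # 7B at FP8 on 1g.10gb   -> 3
slice_headroom_gb 34 8 40   # 34B at FP8 on 3g.40gb  -> 6
slice_headroom_gb 70 8 71   # 70B at FP8 on 4g.71gb  -> 1
slice_headroom_gb 70 8 40   # 70B at FP8 on 3g.40gb  -> -30
```

A small positive number is not a green light: long contexts and high concurrency need KV-cache space on top of the weights, which is why the table steps up a profile for 32K+ context.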
Cost and TCO
MIG changes the economics on the supply side rather than the demand side: the per-slice rate is roughly 1/7 of the full-card rate plus a small premium for the partition-table coordination overhead, so a fully sold card grosses slightly more than a full-card rental (about $2.40-3.20 versus $2.20-3.00 per hour in the table below) while serving up to seven tenants instead of one. For the customer, MIG-on-shared is the cheapest path to dedicated-feeling H100 inference for small models.
- Yobitel NeoCloud bills MIG slices per-slice-hour on the FinOps Foundation FOCUS spec (`ServiceName=AcceleratorCompute`, `SkuId=gpu.h100.mig.1g.10gb`) — uniform attribution with full-card SKUs.
- Yobibyte's `shared` tenancy is priced per-token (not per-slice-hour) so customers consume the slice economics through the OpenAI-compatible endpoint without managing the MIG layer.
- Spot/preemptible MIG slices are technically possible but rarely offered — partition stability matters for multi-tenant SLAs, and spot eviction would drain all tenants on the GPU.
- Confidential Compute on MIG slices adds ~3-7 % throughput overhead but does not change billing — CC-on is included in the listed per-slice-hour rate on NeoCloud's sovereign UK / EU regions.
| Configuration | Slice profile | $/slice-hr (NeoCloud, on-demand) | Equivalent $/full-card-hr | Use case |
|---|---|---|---|---|
| H100 SXM5 full-card | 7g.80gb | — | $2.20-3.00 | 70B+ chat, training, latency-critical SLAs |
| H100 SXM5 MIG 1g.10gb | 1g.10gb | $0.34-0.46 | ~$2.40-3.20 | 7B-class chat with hard tenant isolation |
| H100 SXM5 MIG 2g.20gb | 2g.20gb | $0.68-0.92 | ~$2.40-3.20 | 13B-class or 7B with long context |
| H100 SXM5 MIG 3g.40gb | 3g.40gb | $1.00-1.36 | ~$2.40-3.20 | 34B-class chat |
| H200 MIG 1g.18gb | 1g.18gb | $0.46-0.60 | ~$3.20-4.20 | 7B-class chat with very long context |
| B200 MIG 1g.23gb | 1g.23gb | $0.85-1.15 | ~$5.95-8.05 | 7B chat at FP4 with peak throughput |
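The "Equivalent $/full-card-hr" column is the slice rate scaled to a whole card. The sketch below reproduces that arithmetic with two hypothetical helpers, `card_equivalent` and `slice_premium_pct`, using the 1g.10gb and full-card rates from the table; it assumes every slice on the card is sold, which real fleets will not always achieve.

```shell
# Hypothetical helpers for the per-slice arithmetic in the table above.

# card_equivalent SLICE_RATE_USD SLICES_PER_CARD
# Gross $/card-hr when every slice on the card is sold.
card_equivalent() {
  awk -v rate="$1" -v n="$2" 'BEGIN { printf "%.2f\n", rate * n }'
}

# slice_premium_pct SLICE_RATE_USD SLICES_PER_CARD FULL_CARD_RATE_USD
# Percentage by which the fully sold card exceeds the full-card rate.
slice_premium_pct() {
  awk -v rate="$1" -v n="$2" -v full="$3" \
    'BEGIN { printf "%.1f\n", (rate * n / full - 1) * 100 }'
}

card_equivalent 0.34 7          # 2.38
card_equivalent 0.46 7          # 3.22
slice_premium_pct 0.34 7 2.20   # 8.2
slice_premium_pct 0.46 7 3.00   # 7.3
```

On these figures the slicing premium is in the high single digits per cent, so the case for MIG rests mainly on tenant density and isolation, not on a large uplift in the hourly rate.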
Migration and alternatives
The migration paths into and out of MIG come first, followed by the main alternatives when the goal is multi-tenant inference on shared GPUs.
- Migration full-card -> MIG: pick the slice profile, drain workloads, run `nvidia-smi -i N -mig 1`, partition with `nvidia-smi mig -cgi`, re-deploy workloads with updated resource requests. Plan a maintenance window.
- Migration MIG -> full-card: reverse — destroy all instances, disable MIG, redeploy. Also destructive.
- Migration MPS -> MIG: replace `EXCLUSIVE_PROCESS` compute mode with MIG partitioning; isolation upgrade is real, throughput per workload slightly lower.
| Option | Isolation | When to pick | Trade-off vs MIG |
|---|---|---|---|
| MIG (silicon partitioning) | Hardware | Multi-tenant inference with strict isolation; FedRAMP / NCSC OFFICIAL postures | Static partition; no NVLink across slices |
| CUDA MPS (Multi-Process Service) | Software (CUDA context) | Trusted workloads, batched inference, low overhead | No security boundary; bandwidth contention |
| Time-slicing via GPU Operator (device-plugin `sharing.timeSlicing` config) | Software (driver-level) | Dev / test environments, hobby workloads | Best-effort scheduling; no QoS |
| KubeVirt + GPU passthrough (vGPU) | Hypervisor (per-VM) | VM-shaped workloads; existing VDI | Hypervisor overhead; vGPU licence |
| Single tenant per full card | Card-level | Latency-critical SLAs, training, single-tenant | Costly per tenant; under-utilised at off-peak |
| L4 1U sled with N independent cards | Card-level | Density at the chassis level, not the GPU | More physical cards; cheaper per slice equivalent |
Pitfalls / operational notes
Operational issues we see most often on MIG-enabled fleets, ranked by frequency.
- Mode switching is destructive: toggling between MIG and full-card drops all running workloads on the GPU. Plan partitioning at cluster-build time or schedule a node-drain window for changes.
- Profile combinations are constrained: not every combination of slice sizes is valid. A 4g.40gb + 4g.40gb layout is invalid (over 7g compute); a 3g.40gb + 3g.40gb + 1g.10gb is invalid (over 80 GB on H100). Verify with `nvidia-smi mig -lgipp` before designing.
- Performance is not exactly proportional: 7x 1g.10gb is roughly 90-95 % of 1x 7g.80gb aggregate throughput due to L2 fragmentation and cross-slice scheduler overhead.
- NVLink is invisible inside MIG slices: tensor-parallel inference across slices does not work. Use a full card if you need TP>=2 on one host.
- Some CUDA features are restricted: peer-to-peer memory access, certain UVM patterns, and CUDA IPC handles do not work across MIG instances. Most inference servers do not hit these; training stacks often do.
- Per-slice DCGM signals: cards appear as multiple instance UUIDs in DCGM metrics. Make sure dashboards and alerts are keyed on the instance UUID, not just the GPU index.
- MIG slice device files are mode-0666: pod security contexts must be checked; older container runtimes occasionally lose the `/dev/nvidia-caps/*` files on host restart and leave slices unavailable until the GPU Operator's MIG Manager re-applies the partition.
- Confidential Compute (CC-on) attestation: each MIG slice attests independently; ensure your attestation broker handles per-slice nonces, not per-card.
- Driver version drift: MIG support evolved across R450 / R510 / R535 / R550 driver lines. Pin the driver version per pool; mixing versions across MIG-enabled nodes is a debugging nightmare.
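For the per-slice DCGM pitfall above, it helps to have the node's profile-to-UUID mapping on hand when building dashboards and alerts. The sketch below is a hypothetical helper, `list_mig_uuids`, that filters the device listing printed by `nvidia-smi -L` down to one `<profile> <MIG UUID>` pair per line. It assumes the listing uses the `MIG <profile> Device <n>: (UUID: MIG-...)` line shape; verify that against the output of your driver version before relying on it.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print
# "<profile> <MIG UUID>" for each MIG device line. Non-MIG lines are dropped.
list_mig_uuids() {
  sed -nE 's/^[[:space:]]*MIG[[:space:]]+([^[:space:]]+)[[:space:]]+Device[[:space:]]+[0-9]+:[[:space:]]*\(UUID:[[:space:]]*([^)]+)\).*$/\1 \2/p'
}

# On a MIG-enabled node:
#   nvidia-smi -L | list_mig_uuids
```

Capturing this mapping after every re-partition gives a record to reconcile against the instance UUIDs that appear in metrics.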
Where this fits in the Yobitel stack
MIG is the silicon primitive behind Yobibyte's `shared` workspace tenancy mode. When a customer creates a Yobibyte workspace for a 7B-class chat endpoint and selects 'shared' tenancy, the placement layer maps the workload to a MIG slice (typically 1g.10gb on H100 or 1g.18gb on H200) on Yobitel NeoCloud's MIG-enabled pools in UK and EU sovereign regions. The customer sees an OpenAI-compatible endpoint and a per-token price; the slice profile, partition table and instance UUID are managed by Yobibyte and never surfaced — recipe-protected by design.
Yobitel NeoCloud bills MIG slices independently on the FinOps Foundation FOCUS spec, so multi-tenant inference economics work consistently at the silicon (MIG), Kubernetes (GPU Operator mixed strategy with profile-typed resources), workspace (Yobibyte `shared` tenancy) and billing (per-slice-hour FOCUS rows) layers. The UK sovereign region runs MIG with Confidential Compute attested per slice, which is what makes NCSC OFFICIAL-aligned multi-tenant inference postures viable without bespoke isolation infrastructure.
Omniscient Compute indexes MIG-slice SKUs alongside full-card SKUs when arbitrating capacity across NeoCloud and partner clouds, so a 7B chat workspace can land on the cheapest qualifying slice anywhere in the Yobitel-managed estate. InferenceBench publishes per-MIG-profile throughput tables (1g.10gb vs 2g.20gb vs full-card on the same H100) so customers and operators can size the slice mix from first-party benchmarks rather than vendor marketing.