Multi-Instance GPU (MIG)

TL;DR

MIG is hardware GPU partitioning: one A100 / A30 / H100 / H200 / B100 / B200 presents as up to seven independent, isolated GPUs with their own SMs, L2 slice, HBM bandwidth and NVENC/NVDEC allocation.
Not time-slicing — silicon-level isolation. Inter-instance memory bandwidth contention is bounded by the partition table; a misbehaving tenant cannot starve another at the hardware layer.
Standard profile syntax: `<compute>g.<memory>gb` — e.g. `1g.10gb`, `2g.20gb`, `3g.40gb`, `4g.40gb`, `7g.80gb` on H100 80 GB. H200 141 GB exposes wider profiles (1g.18gb, 7g.141gb). Hopper added FP8 + Confidential Compute attestation per slice.
GPU Operator integration on Kubernetes: 'single' strategy exposes uniform slices as `nvidia.com/gpu`; 'mixed' strategy exposes per-profile resources (e.g. `nvidia.com/mig-1g.10gb`, `nvidia.com/mig-3g.40gb`) — both are first-class Kubernetes resources.
Yobibyte exposes MIG as workspace tenancy mode `shared` (vs `dedicated` for full-card); Yobitel NeoCloud bills per MIG slice on a FOCUS-conformant per-slice-hour basis so multi-tenant inference economics work at the silicon, billing and observability layers together.

Overview

Multi-Instance GPU (MIG) is the hardware feature NVIDIA introduced with A100 in 2020 that lets one data-centre GPU present as up to seven independent, isolated GPUs. Each MIG instance has its own dedicated SMs (Streaming Multiprocessors), its own slice of L2 cache, its own HBM bandwidth allocation, its own memory controller assignment, and its own NVENC/NVDEC pair where applicable. The instances cannot interfere with each other: there is no shared state at the silicon level beyond the physical board and its PCIe link.

MIG matters because it allows expensive GPUs to be sold, scheduled and billed at finer granularity than 'one whole GPU per workload'. For inference fleets running many small (7B-class) replicas, MIG lets a single H100 host seven concurrent 7B-class endpoints with hard isolation: each tenant sees what looks like a dedicated 1g.10gb GPU with deterministic performance, and the host operator serves 5-7 tenants per card instead of renting the same H100 to one tenant. For Kubernetes clusters, MIG slices appear as discrete schedulable resources (nvidia.com/mig-1g.10gb) that the scheduler can place workloads onto using the same primitives as full GPUs.

This entry is the reference for teams sizing MIG-aware infrastructure on Hopper or Blackwell: profile tables for A100 / H100 / H200 / B200, the partition-table constraints that make some combinations valid and others not, the GPU Operator single-vs-mixed strategy choice, the confidential-compute attestation story on Hopper, the per-slice cost economics, and the operational pitfalls. Yobitel NeoCloud bills MIG slices independently and Yobibyte exposes a shared tenancy mode that maps to MIG under the hood; customers don't see the slice profile but inherit the price advantage. This entry helps you decide when MIG-on-shared makes sense vs full-card-dedicated and how to size the slice mix.

Specifications: profiles and partition tables

MIG profiles encode the compute and memory allocation per slice as <compute>g.<memory>gb: 1g.10gb is one compute slice (14 SMs on A100; the SM count per slice differs on Hopper and Blackwell) and a nominal 10 GB of HBM, while 7g.80gb is the entire H100. Usable memory is slightly below the nominal figure in the name. The valid combinations are constrained by the GPU's partition table; you cannot freely mix arbitrary slice sizes. The table below lists the supported profiles per SKU as of 2026.

Partition table rule: total compute across active instances cannot exceed 7g (i.e., the full GPU). A valid layout for H100 is 3g.40gb + 2g.20gb + 1g.10gb + 1g.10gb (= 7g compute, 80 GB memory).
Switching the partition table (e.g. from 7x1g.10gb to 1x4g.40gb + 1x3g.40gb) requires destroying all existing instances first — destructive operation, plan at cluster-build or maintenance window.
MIG instance UUIDs are deterministic per profile slot and survive reboots; container runtime mounts the slice by UUID so workloads can be restarted without re-pinning.
H200's 1g.18gb slice is the highest-VRAM 1g profile on Hopper (B200's 1g.23gb is larger still); it is useful for 7B-class chat with 32K+ context that overflows H100's 10 GB.
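
The partition-table rule above can be sanity-checked with simple arithmetic before touching a GPU. The following is a minimal sketch, not NVIDIA's validation logic: the function names are hypothetical, it only sums the nominal compute and memory figures in the profile names, and it does not model the driver's placement rules. Profile names round memory, so the memory sum is only approximate on SKUs such as H200.

```shell
#!/bin/sh
# Sum the compute (g) and memory (GB) of a proposed MIG layout and compare the
# totals with a card budget. Necessary check only: the driver also enforces
# placement rules that this sketch does not model.

# mig_layout_totals PROFILE... -> prints "<total_g> <total_gb>"
mig_layout_totals() {
  mlt_g=0
  mlt_gb=0
  for mlt_p in "$@"; do
    mlt_c=${mlt_p%%g.*}   # "3g.40gb" -> "3"
    mlt_m=${mlt_p#*g.}    # "3g.40gb" -> "40gb"
    mlt_m=${mlt_m%gb}     # "40gb"    -> "40"
    mlt_g=$((mlt_g + mlt_c))
    mlt_gb=$((mlt_gb + mlt_m))
  done
  echo "$mlt_g $mlt_gb"
}

# mig_layout_fits CARD_GB PROFILE... -> exit 0 if total <= 7g and <= CARD_GB
mig_layout_fits() {
  mlf_card_gb=$1
  shift
  # Replace the function's positional parameters with "<total_g> <total_gb>".
  set -- $(mig_layout_totals "$@")
  [ "$1" -le 7 ] && [ "$2" -le "$mlf_card_gb" ]
}

# Try the layouts discussed in this entry against an 80 GB H100.
for layout in "3g.40gb 2g.20gb 1g.10gb 1g.10gb" "4g.40gb 4g.40gb" "3g.40gb 3g.40gb 1g.10gb"; do
  # Word-splitting of $layout is intentional: one profile per word.
  if mig_layout_fits 80 $layout; then
    echo "within budget: $layout"
  else
    echo "over budget:   $layout"
  fi
done
```

A layout that passes this check can still be rejected by the driver, so treat it as a pre-flight filter and confirm the real placement options on the target SKU.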

GPU	Profile	Compute (GPCs)	Memory	Max instances of this profile	Approx. peak tensor throughput per slice (precision as noted)
A100 80 GB	1g.10gb	1	10 GB	7	No FP8 (Ampere); ~22 TFLOPS TF32
A100 80 GB	2g.20gb	2	20 GB	3	~44 TFLOPS TF32
A100 80 GB	3g.40gb	3	40 GB	2	~66 TFLOPS TF32
A100 80 GB	4g.40gb	4	40 GB	1	~89 TFLOPS TF32
A100 80 GB	7g.80gb	7	80 GB	1 (full GPU)	~156 TFLOPS TF32
H100 80 GB SXM5	1g.10gb	1	10 GB	7	~565 TFLOPS FP8 sparse
H100 80 GB SXM5	1g.20gb	1	20 GB	4	~565 TFLOPS FP8 sparse
H100 80 GB SXM5	2g.20gb	2	20 GB	3	~1,130 TFLOPS FP8 sparse
H100 80 GB SXM5	3g.40gb	3	40 GB	2	~1,695 TFLOPS FP8 sparse
H100 80 GB SXM5	4g.40gb	4	40 GB	1	~2,260 TFLOPS FP8 sparse
H100 80 GB SXM5	7g.80gb	7	80 GB	1 (full GPU)	~3,958 TFLOPS FP8 sparse
H200 141 GB	1g.18gb	1	18 GB	7	~565 TFLOPS FP8 sparse
H200 141 GB	1g.35gb	1	35 GB	4	~565 TFLOPS FP8 sparse
H200 141 GB	2g.35gb	2	35 GB	3	~1,130 TFLOPS FP8 sparse
H200 141 GB	3g.71gb	3	71 GB	2	~1,695 TFLOPS FP8 sparse
H200 141 GB	4g.71gb	4	71 GB	1	~2,260 TFLOPS FP8 sparse
H200 141 GB	7g.141gb	7	141 GB	1 (full GPU)	~3,958 TFLOPS FP8 sparse
B200 192 GB	1g.23gb	1	23 GB	7	~640 TFLOPS FP4 sparse
B200 192 GB	3g.96gb	3	96 GB	2	~1,950 TFLOPS FP4 sparse
B200 192 GB	7g.192gb	7	192 GB	1 (full GPU)	~9,000 TFLOPS FP4 sparse

Warning: Profile names look similar across GPUs but the memory allocations differ — '1g.10gb' on H100 is '1g.18gb' on H200 and '1g.23gb' on B200. Always check the SKU's documented profile list before designing a partition layout; a manifest that hardcodes 1g.10gb will fail on H200.
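
One way to avoid hardcoding a profile name is to look it up per SKU when a manifest is rendered. A minimal sketch, assuming the 1g profile names from the table above; the SKU keys (`h100-80gb` and so on) and function names are hypothetical labels, not identifiers used by NVIDIA tooling.

```shell
#!/bin/sh
# Map a SKU label to its smallest (1g) MIG profile, then to the resource name
# that the GPU Operator 'mixed' strategy would expose for that profile.

# smallest_mig_profile SKU -> prints the 1g profile listed for that SKU
smallest_mig_profile() {
  case "$1" in
    a100-80gb|h100-80gb) echo "1g.10gb" ;;
    h200-141gb)          echo "1g.18gb" ;;
    b200-192gb)          echo "1g.23gb" ;;
    *) echo "unknown SKU: $1" >&2; return 1 ;;
  esac
}

# mig_resource_for_sku SKU -> prints e.g. nvidia.com/mig-1g.18gb
mig_resource_for_sku() {
  msr_profile=$(smallest_mig_profile "$1") || return 1
  echo "nvidia.com/mig-$msr_profile"
}

for sku in h100-80gb h200-141gb b200-192gb; do
  echo "$sku -> $(mig_resource_for_sku "$sku")"
done
```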

Architecture: how silicon-level partitioning works

Software GPU sharing (driver time-slicing, or CUDA MPS, the Multi-Process Service) shares one set of SMs across processes: time-slicing interleaves contexts on a fast schedule, while MPS lets kernels from several processes run concurrently. This works for batched inference of trusted workloads but provides no isolation guarantees: a misbehaving tenant can saturate the SMs and starve others, the L2 cache is shared (cache-line eviction patterns leak information across tenants), HBM bandwidth is shared (a bandwidth-heavy tenant degrades a latency-sensitive one), and security boundaries are software-enforced (CUDA context isolation, not silicon).

MIG instead partitions the GPU at the silicon level. For MIG purposes the GPU is organised as seven compute slices (roughly one GPC, Graphics Processing Cluster, each) and eight memory slices (each a share of HBM capacity with its memory controllers and L2 cache), plus the NVENC/NVDEC engines and the rest. MIG creates 'GPU Instances' (GIs), which group compute slices, memory slices and NVDEC/NVENC channels, and 'Compute Instances' (CIs) within each GI. The GPU driver and hardware enforce the partition: a workload on instance A cannot issue memory accesses, kernel launches or DMA transfers that touch instance B's resources.

Hopper MIG added Confidential Compute (CC-on) attestation per-instance: each MIG slice can be attested independently via SPDM-over-PCIe to NVIDIA's NRAS service, and HBM-resident pages within a slice are sealed against the host kernel. This makes Hopper MIG the first commercial GPU partitioning primitive with FedRAMP Moderate / NCSC OFFICIAL-aligned multi-tenant inference posture.

What MIG does NOT do: NVLink is not exposed across MIG slices on the same GPU (you cannot tensor-parallel across two MIG instances), some peer-to-peer CUDA features are restricted, and certain UVM (Unified Virtual Memory) patterns degrade or fail. MIG is a single-card partitioning primitive — collective operations across MIG instances run over PCIe with the same penalties as cross-host PCIe, not NVLink.

Silicon partitioning: dedicated SMs, dedicated L2 slice, dedicated HBM memory controllers, dedicated NVDEC/NVENC where applicable.
No shared state to contend over and no software boundary to bypass at the slice perimeter.
Hopper added FP8 Tensor Cores per slice (same per-SM throughput as the full GPU) plus Confidential Compute attestation per instance.
Blackwell extends MIG with FP4 per slice and per-instance MX-format support.
MIG does not expose NVLink — tensor parallelism across MIG instances is not viable. Use the full card or different cards for TP.
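
Because each instance behaves as its own device, a process is typically pinned to one instance by passing that instance's UUID in `CUDA_VISIBLE_DEVICES`. The sketch below assumes `nvidia-smi -L` prints MIG devices as indented `MIG <profile> Device N: (UUID: MIG-...)` lines; the sample listing and its UUIDs are placeholders, the function names are hypothetical, and older drivers use a different MIG UUID form.

```shell
#!/bin/sh
# Extract MIG instance UUIDs from `nvidia-smi -L` style text.

# list_mig_uuids: read the listing on stdin, print one MIG UUID per line
list_mig_uuids() {
  sed -n 's/^ *MIG .*(UUID: \(MIG-[^)]*\)).*$/\1/p'
}

# Illustrative listing only; the UUIDs below are placeholders.
sample_listing() {
  cat <<'EOF'
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 1g.10gb     Device  0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
  MIG 1g.10gb     Device  1: (UUID: MIG-22222222-2222-2222-2222-222222222222)
EOF
}

# Show the environment setting that would pin one process per instance.
sample_listing | list_mig_uuids | while read -r uuid; do
  echo "CUDA_VISIBLE_DEVICES=$uuid"
done
```

On a real MIG-enabled host the same filter would be fed from the driver, for example `nvidia-smi -L | list_mig_uuids`.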

Form factor / power and thermal

MIG is a partitioning primitive, not a form factor, but it affects how operators think about power and density. A 7-slice H100 SXM5 still draws ~700 W TDP — the slicing does not lower aggregate power, it raises per-watt revenue by hosting more concurrent tenants. Thermal envelope, cooling design and rack power budget are sized for the full-card TDP regardless of MIG configuration.

Supported on data-centre SKUs only: A100 (40/80 GB), A30, H100 (SXM5 / PCIe / NVL), H200 (SXM / NVL), B100, B200, GB200. NOT supported on L4, L40, L40S, T4, RTX-class workstation cards or consumer GPUs.
Power draw is GPU-wide, not per slice — a 7-slice H100 still dissipates ~700 W at peak.
Slice thermal limits are inherited from the host card — there is no per-slice power cap.
Confidential Compute (CC-on) on Hopper MIG adds ~3-7 % throughput overhead per slice but does not change the thermal envelope.

Software ecosystem: GPU Operator strategies

MIG is configured via nvidia-smi mig commands or programmatically via NVML. On Kubernetes, the NVIDIA GPU Operator manages MIG slice creation, labelling and exposure as resources. Two strategies are supported: 'single' (the node is configured for one uniform slice profile across all GPUs; slices expose as nvidia.com/gpu so existing manifests work unchanged) and 'mixed' (different profiles on different GPUs; each profile exposes as a distinct resource type nvidia.com/mig-1g.10gb, nvidia.com/mig-3g.40gb, etc.).

GPU Operator single strategy: simple, low-friction, every slice on every GPU is the same profile. Best for homogeneous inference fleets (e.g. all 1g.10gb for a 7B-class endpoint pool).
GPU Operator mixed strategy: heterogeneous slice mix per node. Best for multi-workload fleets where some workloads need 1g.10gb and others need 3g.40gb on the same physical card.
MIG Manager (part of the GPU Operator) handles partition changes — drains workloads, applies the new partition table, restarts the device plugin. Partition changes are destructive; plan with a node drain.
Triton Inference Server and most modern inference servers treat MIG slices as discrete CUDA devices. vLLM, TensorRT-LLM, SGLang and TGI all run unchanged on MIG slices (subject to the slice's VRAM and compute budget).
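DCGM and the DCGM exporter expose per-slice metrics: DCGM_FI_DEV_FB_USED and the profiling fields such as DCGM_FI_PROF_GR_ENGINE_ACTIVE are reported per MIG instance, so per-tenant observability is uniform. Device-level utilisation (DCGM_FI_DEV_GPU_UTIL) is generally not populated for MIG instances, so key utilisation dashboards on the profiling fields instead.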
Yobibyte exposes MIG via workspace tenancy mode shared (slice-backed) vs dedicated (full-card). The workspace API does not surface profile names; Yobibyte's placement layer maps customer-stated workload sizes to slice profiles on Yobitel NeoCloud's MIG-enabled pools.
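
The practical difference between the two strategies is the resource name a pod has to request. A minimal sketch of that mapping as described above; the function name is hypothetical.

```shell
#!/bin/sh
# Resource name a workload requests under each GPU Operator MIG strategy.

# mig_resource_name STRATEGY PROFILE
mig_resource_name() {
  case "$1" in
    single) echo "nvidia.com/gpu" ;;        # uniform slices look like plain GPUs
    mixed)  echo "nvidia.com/mig-$2" ;;     # one resource type per profile
    *) echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}

echo "single, 1g.10gb -> $(mig_resource_name single 1g.10gb)"
echo "mixed,  1g.10gb -> $(mig_resource_name mixed 1g.10gb)"
echo "mixed,  3g.40gb -> $(mig_resource_name mixed 3g.40gb)"
```

Under 'single', existing manifests that request `nvidia.com/gpu` keep working, which is why it is the lower-friction choice for homogeneous pools.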

# --- MIG management on a bare H100 node ---

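# 1) Enable MIG mode on GPU 0 (a GPU reset or reboot may be required)
nvidia-smi -i 0 -mig 1

# 2) List available GPU Instance profiles for this SKU
#    (sample output is illustrative; memory, SM and decoder counts vary by SKU and driver)
nvidia-smi mig -lgip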
# +-----------------------------------------------------------------------------+
# | GPU instance profiles:                                                      |
# | GPU   Name             ID    Instances   Memory     P2P    SM    DEC    ENC |
# |                              Free/Total   GiB                              |
# |   0  MIG 1g.10gb        19     7/7         9.50      No     14     1      0 |
# |   0  MIG 1g.20gb        15     4/4        19.50      No     14     1      0 |
# |   0  MIG 2g.20gb        14     3/3        19.50      No     28     1      0 |
# |   0  MIG 3g.40gb         9     2/2        39.25      No     42     2      0 |
# |   0  MIG 4g.40gb         5     1/1        39.25      No     56     2      0 |
# |   0  MIG 7g.80gb         0     1/1        78.75      No     98     5      0 |
# +-----------------------------------------------------------------------------+

# 3) Create GPU Instances by profile ID (-C also creates the default Compute Instance in each).
#    Three 3g.40gb instances (ID 9) would be invalid here: that profile has only 2 slots.
#    Valid: 1x 3g.40gb + 2x 1g.10gb + 1x 2g.20gb = 7g compute, 80 GB memory
nvidia-smi mig -cgi 9,19,19,14 -C

# 4) List created GPU Instances
nvidia-smi mig -lgi

# 5) List Compute Instances created on each GI
nvidia-smi mig -lci

# 6) Destroy all instances (stop workloads first: an instance that is in use cannot be destroyed)
nvidia-smi mig -dci && nvidia-smi mig -dgi

# --- Kubernetes via NVIDIA GPU Operator ---

# 1) Set MIG strategy to 'mixed' so multiple profiles can coexist
#    (--reuse-values keeps the release's other settings)
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set mig.strategy=mixed

# 2) Label the node with the desired MIG configuration (defined in a ConfigMap)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# 3) Workloads request slices by profile-typed resource
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-7b-shared }
spec:
  replicas: 7
  selector: { matchLabels: { app: vllm-7b-shared } }
  template:
    metadata: { labels: { app: vllm-7b-shared } }
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        args: ["--model","meta-llama/Llama-3.1-8B-Instruct",
               "--quantization","fp8","--max-model-len","8192"]
        resources:
          limits: { nvidia.com/mig-1g.10gb: 1 }
EOF
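
The node label in step 2 refers to a named layout in the MIG Manager configuration. For a layout that is not among the defaults, a custom entry can be rendered along the following lines. This is a sketch under assumptions: the YAML schema (`version`, `mig-configs`, `devices`, `mig-enabled`, `mig-devices`) follows the mig-parted format as I understand it, the layout name `h100-mixed` and the function name are hypothetical, and how the ConfigMap is wired into the operator should be checked against the GPU Operator documentation for your version.

```shell
#!/bin/sh
# Render a MIG Manager (mig-parted style) config entry for one named layout.
# Arguments after the name are PROFILE=COUNT pairs.

render_mig_config() {
  rmc_name=$1
  shift
  printf 'version: v1\n'
  printf 'mig-configs:\n'
  printf '  %s:\n' "$rmc_name"
  printf '    - devices: all\n'
  printf '      mig-enabled: true\n'
  printf '      mig-devices:\n'
  for rmc_pair in "$@"; do
    # "3g.40gb=1" -> key "3g.40gb", value 1
    printf '        "%s": %s\n' "${rmc_pair%%=*}" "${rmc_pair#*=}"
  done
}

# The mixed H100 layout used in the nvidia-smi example above.
render_mig_config h100-mixed 3g.40gb=1 2g.20gb=1 1g.10gb=2
```

The rendered text would go into the ConfigMap that the MIG Manager reads, and a node would then select it with `nvidia.com/mig.config=h100-mixed`.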

Sizing: when to slice and how

MIG-sizing decisions reduce to one question: does your workload mix benefit more from many small isolated tenants on one card, or one workload owning the whole card? The thresholds below are what we use to map workloads to MIG profiles on Yobitel NeoCloud.

Capacity rule: a 7x 1g.10gb H100 hosts seven independent 7B-class FP8 chat endpoints at ~1,500-2,000 TPS each, i.e. roughly 10,500-14,000 TPS per card in aggregate.
Performance per slice is not exactly proportional: overhead and minor cross-slice effects mean 7x 1g.10gb delivers ~5-10 % less aggregate throughput than 1x 7g.80gb on the same workload. The tenancy benefit usually pays for the overhead.
MIG does not help training: NVLink is disabled across slices, so the multi-GPU collective primitives every training stack relies on do not work. Train on full cards.
Yobibyte's shared tenancy mode maps to MIG slice placement on Yobitel NeoCloud's MIG-enabled H100 / H200 pools; customers see a workspace-level price and SLA, not the slice profile.

Workload	Recommended MIG profile	Why
7B-class chat, modest QPS, isolated tenant	1g.10gb (H100) or 1g.18gb (H200)	FP8 weights fit in 10-18 GB; per-tenant isolation guaranteed.
13B-class chat or 7B with 32K+ context	2g.20gb (H100) or 1g.35gb (H200)	KV cache headroom; bandwidth roughly proportional.
22B-34B-class chat (Codestral, Yi)	3g.40gb (H100) or 3g.71gb (H200)	Weights fit at FP8; concurrency limited.
70B+ chat	Full card (7g.80gb / 7g.141gb)	Cannot fit in any single MIG slice on H100; on H200, 70B FP8 weights (~70 GB) only just fit in 4g.71gb, leaving almost no KV-cache headroom.
Embeddings batch, multi-tenant	1g.10gb x 7	Throughput-uniform; hard tenant isolation.
Multi-tenant inference for trusted internal workloads	MPS or full card	MIG overhead unjustified if isolation is not required.
Confidential inference (FedRAMP / NCSC OFFICIAL)	Any MIG profile, CC-on	Per-instance attestation; sealed HBM.
Training	Full card or NVLink-attached cluster	MIG disables NVLink across slices; training collectives need full-card NVLink.
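
The first rows of the table follow from a simple fit check: weight size is parameter count times bytes per parameter, plus headroom for KV cache and runtime. A rough sketch for H100, using the nominal memory in the profile names; the headroom values and function names are hypothetical, usable memory is a little below nominal, and only memory-distinct profiles are considered.

```shell
#!/bin/sh
# Pick the smallest H100 MIG profile whose nominal memory covers
# weights + headroom. Integer GB arithmetic only.

# weights_gb PARAMS_BILLIONS BYTES_PER_PARAM   (1 for FP8, 2 for FP16)
weights_gb() {
  echo $(( $1 * $2 ))
}

# pick_h100_profile PARAMS_BILLIONS BYTES_PER_PARAM HEADROOM_GB
pick_h100_profile() {
  php_need=$(( $(weights_gb "$1" "$2") + $3 ))
  for php_p in 1g.10gb 2g.20gb 3g.40gb 7g.80gb; do
    php_m=${php_p#*g.}    # "2g.20gb" -> "20gb"
    php_m=${php_m%gb}     # "20gb"    -> "20"
    if [ "$php_need" -le "$php_m" ]; then
      echo "$php_p"
      return 0
    fi
  done
  echo "none"
  return 1
}

echo "7B  FP8, 2 GB headroom: $(pick_h100_profile 7 1 2)"
echo "13B FP8, 4 GB headroom: $(pick_h100_profile 13 1 4)"
echo "34B FP8, 4 GB headroom: $(pick_h100_profile 34 1 4)"
echo "70B FP8, 8 GB headroom: $(pick_h100_profile 70 1 8)"
```

Real sizing also depends on context length, batch size and the serving stack's own overhead, so the headroom figure deserves measurement rather than a guess.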

Cost and TCO

MIG changes the economics on the supply side rather than the demand side: the per-slice rate is roughly 1/7 of the full-card rate plus a small premium for the partition-table coordination overhead, so a fully sold 7-slice card earns slightly more than the full-card rate (roughly 7-9 % at the rates in the table below) while serving 5-7x as many tenants per card. For the customer, MIG-on-shared is the cheapest path to dedicated-feeling H100 inference for small models.

Yobitel NeoCloud bills MIG slices per-slice-hour on the FinOps Foundation FOCUS spec (ServiceName=AcceleratorCompute, SkuId=gpu.h100.mig.1g.10gb) — uniform attribution with full-card SKUs.
Yobibyte's shared tenancy is priced per-token (not per-slice-hour) so customers consume the slice economics through the OpenAI-compatible endpoint without managing the MIG layer.
Spot/preemptible MIG slices are technically possible but rarely offered — partition stability matters for multi-tenant SLAs, and spot eviction would drain all tenants on the GPU.
Confidential Compute on MIG slices adds ~3-7 % throughput overhead but does not change billing — CC-on is included in the listed per-slice-hour rate on NeoCloud's sovereign UK / EU regions.

Configuration	Slice profile	$/slice-hr (NeoCloud, on-demand)	Equivalent $/full-card-hr	Use case
H100 SXM5 full-card	7g.80gb	—	$2.20-3.00	70B+ chat, training, latency-critical SLAs
H100 SXM5 MIG 1g.10gb	1g.10gb	$0.34-0.46	~$2.40-3.20	7B-class chat with hard tenant isolation
H100 SXM5 MIG 2g.20gb	2g.20gb	$0.68-0.92	~$2.40-3.20	13B-class or 7B with long context
H100 SXM5 MIG 3g.40gb	3g.40gb	$1.00-1.36	~$2.40-3.20	34B-class chat
H200 MIG 1g.18gb	1g.18gb	$0.46-0.60	~$3.20-4.20	7B-class chat with very long context
B200 MIG 1g.23gb	1g.23gb	$0.85-1.15	~$5.95-8.05	7B chat at FP4 with peak throughput
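
The "Equivalent $/full-card-hr" column appears to be the slice rate scaled by 7 divided by the slice's compute count. A small sketch of that arithmetic in integer cents; the function names are hypothetical and the scaling rule is inferred from the table rather than stated by the billing spec.

```shell
#!/bin/sh
# Scale a per-slice hourly rate to a full-card equivalent, in integer cents.

# equiv_full_card_cents SLICE_CENTS_PER_HR COMPUTE_SLICES
equiv_full_card_cents() {
  echo $(( $1 * 7 / $2 ))
}

# premium_pct EQUIV_CENTS FULL_CARD_CENTS -> whole-percent premium over full card
premium_pct() {
  echo $(( ($1 - $2) * 100 / $2 ))
}

# format_dollars CENTS -> e.g. $2.38
format_dollars() {
  printf '$%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# Low end of the H100 1g.10gb range versus the low end of the full-card range.
low=$(equiv_full_card_cents 34 1)
echo "1g.10gb at \$0.34/hr is equivalent to $(format_dollars "$low") per card-hour"
echo "premium over a \$2.20 full card: $(premium_pct "$low" 220) %"
```

Integer division truncates, so the 3g.40gb rows come out a cent or so below the rounded figures in the table.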

Migration and alternatives

Alternatives to MIG when the goal is multi-tenant inference on shared GPUs.

Migration full-card -> MIG: pick the slice profile, drain workloads, run nvidia-smi -i N -mig 1, partition with nvidia-smi mig -cgi, re-deploy workloads with updated resource requests. Plan a maintenance window.
Migration MIG -> full-card: reverse — destroy all instances, disable MIG, redeploy. Also destructive.
Migration MPS -> MIG: replace EXCLUSIVE_PROCESS compute mode with MIG partitioning; isolation upgrade is real, throughput per workload slightly lower.
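
The full-card to MIG steps can be written down as a dry-run plan before the maintenance window. The sketch below only prints commands and executes nothing; the function name and node name are hypothetical, the profile IDs are taken from the sample `-lgip` listing earlier in this entry, and on GPU Operator-managed nodes the partition would normally be applied through the `nvidia.com/mig.config` label rather than by hand.

```shell
#!/bin/sh
# Print (do not run) the commands for moving one GPU from full-card to MIG.

# plan_full_card_to_mig NODE GPU_INDEX PROFILE_IDS
plan_full_card_to_mig() {
  pfm_node=$1
  pfm_gpu=$2
  pfm_profiles=$3
  echo "kubectl drain $pfm_node --ignore-daemonsets --delete-emptydir-data"
  echo "nvidia-smi -i $pfm_gpu -mig 1"
  echo "nvidia-smi mig -i $pfm_gpu -cgi $pfm_profiles -C"
  echo "kubectl uncordon $pfm_node"
}

# Seven 1g.10gb instances (profile ID 19 in the sample listing) on GPU 0.
plan_full_card_to_mig h100-node-01 0 19,19,19,19,19,19,19
```

Printing the plan first makes it easy to review the drain and partition steps with whoever owns the node before anything destructive runs.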

Option	Isolation	When to pick	Trade-off vs MIG
MIG (silicon partitioning)	Hardware	Multi-tenant inference with strict isolation; FedRAMP / NCSC OFFICIAL postures	Static partition; no NVLink across slices
CUDA MPS (Multi-Process Service)	Software (CUDA context)	Trusted workloads, batched inference, low overhead	No security boundary; bandwidth contention
Time-slicing via GPU Operator (device plugin sharing.timeSlicing config)	Software (driver-level)	Dev / test environments, hobby workloads	Best-effort scheduling; no QoS
KubeVirt with GPU passthrough or vGPU	Hypervisor (per-VM)	VM-shaped workloads; existing VDI	Hypervisor overhead; vGPU licence
Single tenant per full card	Card-level	Latency-critical SLAs, training, single-tenant	Costly per tenant; under-utilised at off-peak
L4 1U sled with N independent cards	Card-level	Density at the chassis level, not the GPU	More physical cards; cheaper per slice equivalent

Pitfalls / operational notes

Operational issues we see most often on MIG-enabled fleets, ranked by frequency.

Mode switching is destructive: toggling between MIG and full-card drops all running workloads on the GPU. Plan partitioning at cluster-build time or schedule a node-drain window for changes.
Profile combinations are constrained: not every combination of slice sizes is valid. A 4g.40gb + 4g.40gb layout is invalid (over 7g compute); a 3g.40gb + 3g.40gb + 1g.10gb is invalid (over 80 GB on H100). Verify with nvidia-smi mig -lgipp before designing.
Performance is not exactly proportional: 7x 1g.10gb is roughly 90-95 % of 1x 7g.80gb aggregate throughput due to L2 fragmentation and cross-slice scheduler overhead.
NVLink is invisible inside MIG slices: tensor-parallel inference across slices does not work. Use a full card if you need TP>=2 on one host.
Some CUDA features are restricted: peer-to-peer memory access, certain UVM patterns, and CUDA IPC handles do not work across MIG instances. Most inference servers do not hit these; training stacks often do.
Per-slice DCGM signals: cards appear as multiple instance UUIDs in DCGM metrics. Make sure dashboards and alerts are keyed on the instance UUID, not just the GPU index.
MIG slice device files are mode-0666: pod security contexts must be checked; older container runtimes occasionally lose the /dev/nvidia-caps/* files on host restart and leave slices unavailable until the GPU Operator's MIG Manager re-applies the partition.
Confidential Compute (CC-on) attestation: each MIG slice attests independently; ensure your attestation broker handles per-slice nonces, not per-card.
Driver version drift: MIG support evolved across R450 / R510 / R535 / R550 driver lines. Pin the driver version per pool; mixing versions across MIG-enabled nodes is a debugging nightmare.
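
Driver drift across a pool is easy to detect once each node's driver version is collected. A minimal sketch that reads "node version" pairs on stdin; the node names and version strings in the sample are placeholders, the function names are hypothetical, and on each host the version could be gathered with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```shell
#!/bin/sh
# Flag a pool whose nodes report more than one distinct driver version.

# distinct_driver_versions: stdin lines of "<node> <version>" -> unique versions
distinct_driver_versions() {
  awk '{ print $2 }' | sort -u
}

# has_driver_drift: exit 0 if stdin contains more than one distinct version
has_driver_drift() {
  [ "$(distinct_driver_versions | wc -l | tr -d ' ')" -gt 1 ]
}

# Placeholder inventory for illustration only.
sample_inventory() {
  cat <<'EOF'
h100-node-01 550.54.15
h100-node-02 550.54.15
h100-node-03 535.129.03
EOF
}

if sample_inventory | has_driver_drift; then
  echo "driver drift detected:"
  sample_inventory | distinct_driver_versions
else
  echo "pool is on a single driver version"
fi
```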

Where this fits in the Yobitel stack

MIG is the silicon primitive behind Yobibyte's shared workspace tenancy mode. When a customer creates a Yobibyte workspace for a 7B-class chat endpoint and selects 'shared' tenancy, the placement layer maps the workload to a MIG slice (typically 1g.10gb on H100 or 1g.18gb on H200) on Yobitel NeoCloud's MIG-enabled pools in UK and EU sovereign regions. The customer sees an OpenAI-compatible endpoint and a per-token price; the slice profile, partition table and instance UUID are managed by Yobibyte and never surfaced — recipe-protected by design.

Yobitel NeoCloud bills MIG slices independently on the FinOps Foundation FOCUS spec, so multi-tenant inference economics work consistently at the silicon (MIG), Kubernetes (GPU Operator mixed strategy with profile-typed resources), workspace (Yobibyte shared tenancy) and billing (per-slice-hour FOCUS rows) layers. The UK sovereign region runs MIG with Confidential Compute attested per slice, which is what makes NCSC OFFICIAL-aligned multi-tenant inference postures viable without bespoke isolation infrastructure.

Omniscient Compute indexes MIG-slice SKUs alongside full-card SKUs when arbitrating capacity across NeoCloud and partner clouds, so a 7B chat workspace can land on the cheapest qualifying slice anywhere in the Yobitel-managed estate. InferenceBench publishes per-MIG-profile throughput tables (1g.10gb vs 2g.20gb vs full-card on the same H100) so customers and operators can size the slice mix from first-party benchmarks rather than vendor marketing.

References

NVIDIA MIG User Guide · NVIDIA
NVIDIA GPU Operator MIG documentation · NVIDIA
Hopper MIG with Confidential Compute · NVIDIA
DCGM per-MIG-instance field IDs · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC

TL;DR

MIG is hardware GPU partitioning: one A100 / A30 / H100 / H200 / B100 / B200 presents as up to seven independent, isolated GPUs with their own SMs, L2 slice, HBM bandwidth and NVENC/NVDEC allocation.
Not time-slicing — silicon-level isolation. Inter-instance memory bandwidth contention is bounded by the partition table; a misbehaving tenant cannot starve another at the hardware layer.
Standard profile syntax: `<compute>g.<memory>gb` — e.g. `1g.10gb`, `2g.20gb`, `3g.40gb`, `4g.40gb`, `7g.80gb` on H100 80 GB. H200 141 GB exposes wider profiles (1g.18gb, 7g.141gb). Hopper added FP8 + Confidential Compute attestation per slice.
GPU Operator integration on Kubernetes: 'single' strategy exposes uniform slices as `nvidia.com/gpu`; 'mixed' strategy exposes per-profile resources (e.g. `nvidia.com/mig-1g.10gb`, `nvidia.com/mig-3g.40gb`) — both are first-class Kubernetes resources.
Yobibyte exposes MIG as workspace tenancy mode `shared` (vs `dedicated` for full-card); Yobitel NeoCloud bills per MIG slice on a FOCUS-conformant per-slice-hour basis so multi-tenant inference economics work at the silicon, billing and observability layers together.

Overview

Specifications: profiles and partition tables

Partition table rule: total compute across active instances cannot exceed 7g (i.e., the full GPU). A valid layout for H100 is 3g.40gb + 2g.20gb + 1g.10gb + 1g.10gb (= 7g compute, 80 GB memory).
Switching the partition table (e.g. from 7x1g.10gb to 1x4g.40gb + 1x3g.40gb) requires destroying all existing instances first — destructive operation, plan at cluster-build or maintenance window.
MIG instance UUIDs are deterministic per profile slot and survive reboots; container runtime mounts the slice by UUID so workloads can be restarted without re-pinning.
H200's 1g.18gb slice is the highest-VRAM single-slice MIG profile NVIDIA ships through 2026 — useful for 7B-class chat with 32K+ context that overflows H100's 10 GB.

GPU	Profile	Compute (GPCs)	Memory	Max instances of this profile	FP8 / TF32 (sparse) per slice
A100 80 GB	1g.10gb	1	10 GB	7	No FP8 (Ampere); ~22 TFLOPS TF32
A100 80 GB	2g.20gb	2	20 GB	3	~44 TFLOPS TF32
A100 80 GB	3g.40gb	3	40 GB	2	~66 TFLOPS TF32
A100 80 GB	4g.40gb	4	40 GB	1	~89 TFLOPS TF32
A100 80 GB	7g.80gb	7	80 GB	1 (full GPU)	~156 TFLOPS TF32
H100 80 GB SXM5	1g.10gb	1	10 GB	7	~565 TFLOPS FP8 sparse
H100 80 GB SXM5	1g.20gb	1	20 GB	4	~565 TFLOPS FP8 sparse
H100 80 GB SXM5	2g.20gb	2	20 GB	3	~1,130 TFLOPS FP8 sparse
H100 80 GB SXM5	3g.40gb	3	40 GB	2	~1,695 TFLOPS FP8 sparse
H100 80 GB SXM5	4g.40gb	4	40 GB	1	~2,260 TFLOPS FP8 sparse
H100 80 GB SXM5	7g.80gb	7	80 GB	1 (full GPU)	~3,958 TFLOPS FP8 sparse
H200 141 GB	1g.18gb	1	18 GB	7	~565 TFLOPS FP8 sparse
H200 141 GB	1g.35gb	1	35 GB	4	~565 TFLOPS FP8 sparse
H200 141 GB	2g.35gb	2	35 GB	3	~1,130 TFLOPS FP8 sparse
H200 141 GB	3g.71gb	3	71 GB	2	~1,695 TFLOPS FP8 sparse
H200 141 GB	4g.71gb	4	71 GB	1	~2,260 TFLOPS FP8 sparse
H200 141 GB	7g.141gb	7	141 GB	1 (full GPU)	~3,958 TFLOPS FP8 sparse
B200 192 GB	1g.23gb	1	23 GB	7	~640 TFLOPS FP4 sparse
B200 192 GB	3g.96gb	3	96 GB	2	~1,950 TFLOPS FP4 sparse
B200 192 GB	7g.192gb	7	192 GB	1 (full GPU)	~9,000 TFLOPS FP4 sparse

Warning: Profile names look similar across GPUs but the memory allocations differ — '1g.10gb' on H100 is '1g.18gb' on H200 and '1g.23gb' on B200. Always check the SKU's documented profile list before designing a partition layout; a manifest that hardcodes 1g.10gb will fail on H200.

Architecture: how silicon-level partitioning works

Silicon partitioning: dedicated SMs, dedicated L2 slice, dedicated HBM memory controllers, dedicated NVDEC/NVENC where applicable.
No shared state to contend over and no software boundary to bypass at the slice perimeter.
Hopper added FP8 Tensor Cores per slice (same per-SM throughput as the full GPU) plus Confidential Compute attestation per instance.
Blackwell extends MIG with FP4 per slice and per-instance MX-format support.
MIG does not expose NVLink — tensor parallelism across MIG instances is not viable. Use the full card or different cards for TP.

Form factor / power and thermal

Supported on data-centre SKUs only: A100 (40/80 GB), A30, H100 (SXM5 / PCIe / NVL), H200 (SXM5e), B100, B200, GB200. NOT supported on L4, L40, L40S, T4, RTX-class workstation cards or consumer GPUs.
Power draw is GPU-wide, not per slice — a 7-slice H100 still dissipates ~700 W at peak.
Slice thermal limits are inherited from the host card — there is no per-slice power cap.
Confidential Compute (CC-on) on Hopper MIG adds ~3-7 % throughput overhead per slice but does not change the thermal envelope.

Software ecosystem: GPU Operator strategies

GPU Operator single strategy: simple, low-friction, every slice on every GPU is the same profile. Best for homogeneous inference fleets (e.g. all 1g.10gb for a 7B-class endpoint pool).
GPU Operator mixed strategy: heterogeneous slice mix per node. Best for multi-workload fleets where some workloads need 1g.10gb and others need 3g.40gb on the same physical card.
MIG Manager (part of the GPU Operator) handles partition changes — drains workloads, applies the new partition table, restarts the device plugin. Partition changes are destructive; plan with a node drain.
Triton Inference Server and most modern inference servers treat MIG slices as discrete CUDA devices. vLLM, TensorRT-LLM, SGLang and TGI all run unchanged on MIG slices (subject to the slice's VRAM and compute budget).
DCGM and the DCGM exporter expose per-slice metrics — DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, etc. are reported per MIG instance UUID, so per-tenant observability is uniform.
Yobibyte exposes MIG via workspace tenancy mode shared (slice-backed) vs dedicated (full-card). The workspace API does not surface profile names; Yobibyte's placement layer maps customer-stated workload sizes to slice profiles on Yobitel NeoCloud's MIG-enabled pools.

# --- MIG management on a bare H100 node ---

# 1) Enable MIG mode on GPU 0 (reboot may be required)
nvidia-smi -i 0 -mig 1

# 2) List available GPU Instance profiles for this SKU
nvidia-smi mig -lgip
# +-----------------------------------------------------------------------------+
# | GPU instance profiles:                                                      |
# | GPU   Name             ID    Instances   Memory     P2P    SM    DEC    ENC |
# |                              Free/Total   GiB                              |
# |   0  MIG 1g.10gb        19     7/7         9.50      No     14     1      0 |
# |   0  MIG 1g.20gb        15     4/4        19.50      No     14     1      0 |
# |   0  MIG 2g.20gb        14     3/3        19.50      No     28     1      0 |
# |   0  MIG 3g.40gb         9     2/2        39.25      No     42     2      0 |
# |   0  MIG 4g.40gb         5     1/1        39.25      No     56     2      0 |
# |   0  MIG 7g.80gb         0     1/1        78.75      No     98     5      0 |
# +-----------------------------------------------------------------------------+

# 3) Create three instances using profile 9 (3g.40gb on H100) — invalid here, only 2 slots
# Valid: create 1x 3g.40gb + 2x 1g.10gb + 1x 2g.20gb = 7g compute, 70 GB memory
nvidia-smi mig -cgi 9,19,19,14 -C

# 4) List created GPU Instances
nvidia-smi mig -lgi

# 5) List Compute Instances created on each GI
nvidia-smi mig -lci

# 6) Destroy all instances (drains workloads, partition can be redone)
nvidia-smi mig -dci && nvidia-smi mig -dgi

# --- Kubernetes via NVIDIA GPU Operator ---

# 1) Set MIG strategy to 'mixed' so multiple profiles can coexist
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set mig.strategy=mixed

# 2) Label the node with the desired MIG configuration (defined in a ConfigMap)
kubectl label node h100-node-01 nvidia.com/mig.config=all-1g.10gb --overwrite

# 3) Workloads request slices by profile-typed resource
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata: { name: vllm-7b-shared }
spec:
  replicas: 7
  selector: { matchLabels: { app: vllm-7b-shared } }
  template:
    metadata: { labels: { app: vllm-7b-shared } }
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        args: ["--model","meta-llama/Llama-3.1-8B-Instruct",
               "--quantization","fp8","--max-model-len","8192"]
        resources:
          limits: { nvidia.com/mig-1g.10gb: 1 }
EOF

Sizing: when to slice and how

Capacity rule: a 7x 1g.10gb H100 hosts seven independent 7B-class FP8 chat endpoints at ~1,500-2,000 TPS each — aggregate per-card output exceeds a full-card replica because batching efficiency is higher across many cold streams.
Performance per slice is not exactly proportional — overhead and minor cross-slice effects mean 7x 1g.10gb is ~5-10 % less aggregate throughput than 1x 7g.80gb on the same workload. The tenancy benefit usually pays for the overhead.
MIG does not help training: NVLink is disabled across slices, so the multi-GPU collective primitives every training stack relies on do not work. Train on full cards.
Yobibyte's shared tenancy mode maps to MIG slice placement on Yobitel NeoCloud's MIG-enabled H100 / H200 pools; customers see a workspace-level price and SLA, not the slice profile.

Workload	Recommended MIG profile	Why
7B-class chat, modest QPS, isolated tenant	1g.10gb (H100) or 1g.18gb (H200)	FP8 weights fit in 10-18 GB; per-tenant isolation guaranteed.
13B-class chat or 7B with 32K+ context	2g.20gb (H100) or 1g.35gb (H200)	KV cache headroom; bandwidth roughly proportional.
34B-class chat (Codestral, Yi)	3g.40gb (H100) or 3g.71gb (H200)	Weights fit at FP8; concurrency limited.
70B+ chat	Full card (7g.80gb / 7g.141gb)	Does not fit in any sub-card slice on H100. On H200, 70B FP8 weights (~70 GB) nominally fit in 4g.71gb but leave almost no KV-cache headroom.
Embeddings batch, multi-tenant	1g.10gb x 7	Throughput-uniform; hard tenant isolation.
Multi-tenant inference for trusted internal workloads	MPS or full card	MIG overhead unjustified if isolation is not required.
Confidential inference (FedRAMP / NCSC OFFICIAL)	Any MIG profile, CC-on	Per-instance attestation; sealed HBM.
Training	Full card or NVLink-attached cluster	MIG disables NVLink across slices; training collectives need full-card NVLink.
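
The weight-footprint reasoning behind the table can be sketched as a small helper. This is a rough first-pass filter, not a sizing tool: it assumes one byte per parameter at FP8 and two at FP16, uses the nominal profile memory from the profile names, and applies a hypothetical 10 % headroom for KV cache and runtime overhead (`KV_HEADROOM` is an assumption chosen for illustration, not a figure from this article).

```python
# (profile name, nominal memory in GB) for H100 80 GB, smallest first.
H100_PROFILES = [
    ("1g.10gb", 10),
    ("2g.20gb", 20),
    ("3g.40gb", 40),
    ("4g.40gb", 40),
    ("7g.80gb", 80),
]

BYTES_PER_PARAM = {"fp8": 1.0, "fp16": 2.0}
KV_HEADROOM = 0.10  # hypothetical allowance for KV cache and runtime overhead


def weights_gb(params_billion, dtype="fp8"):
    """Approximate weight footprint in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]


def pick_profile(params_billion, dtype="fp8", profiles=H100_PROFILES):
    """Return the smallest profile whose nominal memory covers weights plus headroom."""
    need_gb = weights_gb(params_billion, dtype) * (1 + KV_HEADROOM)
    for name, memory_gb in profiles:
        if memory_gb >= need_gb:
            return name
    return None  # does not fit on a single card under these assumptions


for size in (7, 13, 34, 70):
    print(f"{size}B FP8 ->", pick_profile(size))
```

Long-context workloads need far more than 10 % headroom, which is why the table moves 7B models with 32K+ context up to the 2g.20gb row.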

Cost and TCO

Yobitel NeoCloud bills MIG slices per-slice-hour on the FinOps Foundation FOCUS spec (ServiceName=AcceleratorCompute, SkuId=gpu.h100.mig.1g.10gb) — uniform attribution with full-card SKUs.
Yobibyte's shared tenancy is priced per-token (not per-slice-hour) so customers consume the slice economics through the OpenAI-compatible endpoint without managing the MIG layer.
Spot/preemptible MIG slices are technically possible but rarely offered — partition stability matters for multi-tenant SLAs, and spot eviction would drain all tenants on the GPU.
Confidential Compute on MIG slices adds ~3-7 % throughput overhead but does not change billing — CC-on is included in the listed per-slice-hour rate on NeoCloud's sovereign UK / EU regions.

Configuration	Slice profile	$/slice-hr (NeoCloud, on-demand)	Equivalent $/full-card-hr	Use case
H100 SXM5 full-card	7g.80gb	—	$2.20-3.00	70B+ chat, training, latency-critical SLAs
H100 SXM5 MIG 1g.10gb	1g.10gb	$0.34-0.46	~$2.40-3.20	7B-class chat with hard tenant isolation
H100 SXM5 MIG 2g.20gb	2g.20gb	$0.68-0.92	~$2.40-3.20	13B-class or 7B with long context
H100 SXM5 MIG 3g.40gb	3g.40gb	$1.00-1.36	~$2.40-3.20	34B-class chat
H200 MIG 1g.18gb	1g.18gb	$0.46-0.60	~$3.20-4.20	7B-class chat with very long context
B200 MIG 1g.23gb	1g.23gb	$0.85-1.15	~$5.95-8.05	7B chat at FP4 with peak throughput
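
The "Equivalent $/full-card-hr" column can be reproduced approximately with simple arithmetic. The sketch below assumes the column is normalised per compute slice (slice price divided by its compute slices, multiplied by seven); that normalisation is an interpretation of the table, not a documented NeoCloud formula, and the function name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

COMPUTE_SLICES_PER_CARD = 7


def full_card_equivalent(slice_price, compute_slices):
    """Scale a per-slice hourly price to a 7-compute-slice card (assumed normalisation)."""
    per_compute_slice = Decimal(slice_price) / compute_slices
    total = per_compute_slice * COMPUTE_SLICES_PER_CARD
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (compute slices, low $/slice-hr, high $/slice-hr) copied from the table above.
ROWS = {
    "H100 1g.10gb": (1, "0.34", "0.46"),
    "H100 2g.20gb": (2, "0.68", "0.92"),
    "H100 3g.40gb": (3, "1.00", "1.36"),
    "H200 1g.18gb": (1, "0.46", "0.60"),
    "B200 1g.23gb": (1, "0.85", "1.15"),
}

for name, (g, low, high) in ROWS.items():
    print(name, full_card_equivalent(low, g), "-", full_card_equivalent(high, g))
```

Prices are kept as `Decimal` strings so the results match the two-decimal figures in the table without floating-point drift.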

Migration and alternatives

Alternatives to MIG when the goal is multi-tenant inference on shared GPUs.

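Migration full-card -> MIG: pick the slice profile, drain workloads, enable MIG mode with nvidia-smi -i N -mig 1, partition with nvidia-smi mig -cgi <profiles> -C (-C also creates the default compute instances), then re-deploy workloads with updated resource requests. Plan a maintenance window.
Migration MIG -> full-card: the reverse. Drain workloads, destroy compute instances and then GPU instances (nvidia-smi mig -dci, then nvidia-smi mig -dgi), disable MIG with nvidia-smi -i N -mig 0, and redeploy. Also destructive.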
Migration MPS -> MIG: replace EXCLUSIVE_PROCESS compute mode with MIG partitioning; isolation upgrade is real, throughput per workload slightly lower.

Option	Isolation	When to pick	Trade-off vs MIG
MIG (silicon partitioning)	Hardware	Multi-tenant inference with strict isolation; FedRAMP / NCSC OFFICIAL postures	Static partition; no NVLink across slices
CUDA MPS (Multi-Process Service)	Software (CUDA context)	Trusted workloads, batched inference, low overhead	No security boundary; bandwidth contention
Time-slicing via GPU Operator (device plugin sharing.timeSlicing config)	Software (driver-level)	Dev / test environments, hobby workloads	Best-effort scheduling; no QoS
KubeVirt with GPU passthrough or vGPU	Hypervisor (per-VM)	VM-shaped workloads; existing VDI	Hypervisor overhead; vGPU licence
Single tenant per full card	Card-level	Latency-critical SLAs, training, single-tenant	Costly per tenant; under-utilised at off-peak
L4 1U sled with N independent cards	Card-level	Density at the chassis level, not the GPU	More physical cards; cheaper per slice equivalent

Pitfalls / operational notes

Operational issues we see most often on MIG-enabled fleets, ranked by frequency.

Mode switching is destructive: toggling between MIG and full-card drops all running workloads on the GPU. Plan partitioning at cluster-build time or schedule a node-drain window for changes.
Profile combinations are constrained: not every combination of slice sizes is valid. A 4g.40gb + 4g.40gb layout is invalid (over 7g compute); a 3g.40gb + 3g.40gb + 1g.10gb is invalid (over 80 GB on H100). Verify with nvidia-smi mig -lgipp before designing.
Performance is not exactly proportional: 7x 1g.10gb is roughly 90-95 % of 1x 7g.80gb aggregate throughput due to L2 fragmentation and cross-slice scheduler overhead.
NVLink is invisible inside MIG slices: tensor-parallel inference across slices does not work. Use a full card if you need TP>=2 on one host.
Some CUDA features are restricted: peer-to-peer memory access, certain UVM patterns, and CUDA IPC handles do not work across MIG instances. Most inference servers do not hit these; training stacks often do.
Per-slice DCGM signals: cards appear as multiple instance UUIDs in DCGM metrics. Make sure dashboards and alerts are keyed on the instance UUID, not just the GPU index.
MIG slice device files are mode-0666: pod security contexts must be checked; older container runtimes occasionally lose the /dev/nvidia-caps/* files on host restart and leave slices unavailable until the GPU Operator's MIG Manager re-applies the partition.
Confidential Compute (CC-on) attestation: each MIG slice attests independently; ensure your attestation broker handles per-slice nonces, not per-card.
Driver version drift: MIG support evolved across R450 / R510 / R535 / R550 driver lines. Pin the driver version per pool; mixing versions across MIG-enabled nodes is a debugging nightmare.
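
The profile-combination pitfall above can be caught early with a budget check before touching a node. The sketch below is a simplified, necessary-but-not-sufficient test: it only sums compute slices and nominal memory against an H100 80 GB budget and does not model the placement rules that nvidia-smi mig -lgipp reports. Function and constant names are illustrative.

```python
import re

COMPUTE_SLICES = 7  # compute slices on one H100 80 GB
MEMORY_GB = 80      # nominal memory on one H100 80 GB

PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def parse_profile(name):
    """Split a profile such as '3g.40gb' into (compute slices, memory GB)."""
    match = PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"not a MIG profile name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def check_layout(profiles, compute_slices=COMPUTE_SLICES, memory_gb=MEMORY_GB):
    """Return a list of budget violations; an empty list means the sums fit."""
    parsed = [parse_profile(p) for p in profiles]
    total_g = sum(g for g, _ in parsed)
    total_gb = sum(gb for _, gb in parsed)
    problems = []
    if total_g > compute_slices:
        problems.append(f"compute: {total_g}g requested, {compute_slices}g available")
    if total_gb > memory_gb:
        problems.append(f"memory: {total_gb} GB requested, {memory_gb} GB available")
    return problems


for layout in (["4g.40gb", "4g.40gb"],
               ["3g.40gb", "3g.40gb", "1g.10gb"],
               ["1g.10gb"] * 7):
    print(layout, "->", check_layout(layout) or "fits the budget")
```

A layout that passes this check can still be rejected by the driver, so confirm the final design against nvidia-smi mig -lgipp on the target hardware.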

Where this fits in the Yobitel stack

References

NVIDIA MIG User Guide · NVIDIA
NVIDIA GPU Operator MIG documentation · NVIDIA
Hopper MIG with Confidential Compute · NVIDIA
DCGM per-MIG-instance field IDs · NVIDIA
FinOps Foundation FOCUS billing specification · FinOps Foundation
NCSC Cloud Security Principles · UK NCSC
