GPU Multi-Instance GPU (MIG)

Keywords: gpu multi instance gpu mig, nvidia mig partitioning, gpu isolation mig slices, mig compute instance profile, a100 mig configuration gpu

GPU Multi-Instance GPU (MIG) is a hardware partitioning feature introduced with NVIDIA's A100 (Ampere) architecture that divides a single physical GPU into up to seven independent instances, each with dedicated compute resources, memory bandwidth, and memory capacity. MIG enables multiple users or workloads to share a GPU with hardware-level isolation, guaranteed quality of service, and no performance interference between instances.

MIG Architecture:
- GPU Instances (GI): the first level of partitioning divides the GPU's streaming multiprocessors (SMs) and memory into isolated GPU Instances — each GI has its own memory partition and dedicated portion of the L2 cache
- Compute Instances (CI): each GPU Instance can be further subdivided into Compute Instances that share the GI's memory but have dedicated SM resources — enables finer-grained compute partitioning within a memory domain
- Hardware Isolation: MIG uses hardware memory firewalls between instances — one instance cannot access another's memory, providing security isolation equivalent to separate physical GPUs
- Fault Isolation: ECC errors, GPU hangs, or crashes in one MIG instance don't affect other instances — each instance operates as an independent GPU with its own error handling
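The two-level GI/CI hierarchy above can be sketched as a small data model. This is a minimal illustration (class and field names are invented for the sketch): Compute Instances carve dedicated SMs out of a parent GPU Instance's pool while sharing its memory partition.

```python
from dataclasses import dataclass, field

@dataclass
class ComputeInstance:
    sm_count: int  # dedicated SMs carved out of the parent GI

@dataclass
class GPUInstance:
    memory_gb: int  # isolated memory partition, owned by the GI
    sm_count: int   # SM pool available to this GI's CIs
    compute_instances: list = field(default_factory=list)

    def add_compute_instance(self, sm_count: int) -> ComputeInstance:
        # CIs share the GI's memory but must fit inside its SM pool.
        used = sum(ci.sm_count for ci in self.compute_instances)
        if used + sm_count > self.sm_count:
            raise ValueError("CI SMs exceed the parent GI's SM pool")
        ci = ComputeInstance(sm_count)
        self.compute_instances.append(ci)
        return ci

# A 3g.20gb-style GI (42 SMs, 20 GB on a 40 GB A100) split into
# three 14-SM Compute Instances that all see the same 20 GB.
gi = GPUInstance(memory_gb=20, sm_count=42)
for _ in range(3):
    gi.add_compute_instance(14)
```

The overflow check mirrors the hardware rule: compute partitioning is free-form within a GI, but it can never exceed the SM budget the GI was created with.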

A100 MIG Configurations:
- Full GPU: 108 SMs, 40 GB HBM2, 1555 GB/s bandwidth — used when a single workload needs maximum resources (the profile names below are for the 40 GB A100; the 80 GB model doubles per-instance memory, e.g. 1g.10gb instead of 1g.5gb)
- 7× 1g.5gb: seven instances with ~14 SMs and ~5 GB each — maximum multi-tenancy for small inference workloads
- 3× 2g.10gb + 1× 1g.5gb: three medium instances plus one small — mixed workload deployment
- 2× 3g.20gb + 1× 1g.5gb: two larger instances plus one small — balanced compute and memory for moderate workloads
- 1× 4g.20gb + 1× 3g.20gb: two large instances — suitable for two concurrent training jobs or large inference models
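Valid layouts follow from fixed slice budgets: an A100 in MIG mode exposes 7 compute slices and 8 memory slices, and each profile consumes a fixed number of each (a 3g.20gb, for instance, takes 3 compute slices but 4 memory slices). A quick checker, assuming the 40 GB A100 profile table:

```python
# Compute/memory "slices" consumed by each A100 (40 GB) MIG profile.
# An A100 exposes 7 compute slices and 8 memory slices in MIG mode.
PROFILES = {
    "1g.5gb":  (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}

def layout_fits(profiles: list) -> bool:
    """Return True if the requested instances fit on one A100."""
    compute = sum(PROFILES[p][0] for p in profiles)
    memory = sum(PROFILES[p][1] for p in profiles)
    return compute <= 7 and memory <= 8

print(layout_fits(["1g.5gb"] * 7))          # seven small instances
print(layout_fits(["4g.20gb", "3g.20gb"]))  # two large instances
print(layout_fits(["4g.20gb", "4g.20gb"]))  # exceeds the compute-slice budget
```

This is why 4g.20gb + 3g.20gb works (7 compute, 8 memory slices) while two 4g.20gb instances do not.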

MIG Setup and Management:
- Enable MIG Mode: nvidia-smi -i 0 -mig 1 — requires a GPU reset and driver support, puts the GPU into MIG mode
- Create GPU Instances: nvidia-smi mig -i 0 -cgi 9,14,14 — creates one 3g.20gb (profile 9) and two 2g.10gb (profile 14) GPU Instances; available profile IDs are listed by nvidia-smi mig -lgip
- Create Compute Instance: nvidia-smi mig -i 0 -gi 0 -cci 0 — creates a Compute Instance within GPU Instance 0; a GI is unusable by applications until it contains at least one CI
- Device Enumeration: CUDA_VISIBLE_DEVICES=MIG-GPU-<uuid>/<gi>/<ci> selects a specific MIG instance (drivers from R470 onward also assign each instance its own MIG-<uuid>) — applications see it as a standalone GPU with no awareness of MIG partitioning
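The setup steps above can be collected into one command sequence. This sketch only builds the nvidia-smi command strings (the helper name is invented for illustration); actually running them requires root privileges and an MIG-capable GPU:

```python
def mig_setup_commands(gpu: int, gi_profiles: list) -> list:
    """Build the nvidia-smi sequence to enable MIG on one GPU and carve
    GPU Instances (profile IDs as shown by `nvidia-smi mig -lgip`)."""
    profiles = ",".join(str(p) for p in gi_profiles)
    return [
        f"nvidia-smi -i {gpu} -mig 1",                  # enable MIG mode (needs a GPU reset)
        f"nvidia-smi mig -i {gpu} -cgi {profiles} -C",  # create GIs plus default CIs (-C)
        f"nvidia-smi mig -i {gpu} -lgi",                # list the resulting GPU Instances
    ]

# One 3g.20gb (profile 9) and two 2g.10gb (profile 14) on GPU 0:
for cmd in mig_setup_commands(0, [9, 14, 14]):
    print(cmd)
```

The -C flag on the create step auto-creates the default Compute Instance in each new GI, which is the common case when each GI should behave as one logical GPU.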

Use Cases and Deployment:
- Multi-Tenant Inference: cloud providers assign MIG instances to different customers — each customer gets guaranteed GPU resources without noisy-neighbor interference, improving SLA compliance
- Development and Testing: developers share a single A100 by each receiving a MIG slice — 7 developers can simultaneously develop and test GPU code on one physical GPU
- Mixed Workload Consolidation: run inference serving on smaller slices while a training job uses a larger slice — improves overall GPU utilization from typical 30-40% to 80-90%
- Kubernetes Integration: NVIDIA's device plugin exposes MIG instances as individual GPU resources — Kubernetes schedules pods to specific MIG slices using standard resource requests
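For the Kubernetes case, a pod requests a MIG slice like any other extended resource. A hypothetical pod spec, assuming the NVIDIA device plugin is deployed with its "mixed" strategy, which advertises each MIG profile as a distinct resource name:

```yaml
# Pod requesting one 1g.5gb MIG slice; the image name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: server
      image: my-inference-image:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```

The scheduler then places the pod on a node with a free 1g.5gb instance, and the container sees that slice as its only GPU.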

Performance Characteristics:
- Linear Scaling: a 1g.5gb instance provides approximately 1/7 of full GPU compute, a 3g.20gb provides approximately 3/7 — performance scales linearly with allocated SM count for compute-bound workloads
- Memory Bandwidth: each instance gets a share of HBM bandwidth proportional to its memory slices — a 2g.10gb instance holds 2 of the 8 memory slices, roughly a quarter of total bandwidth, sufficient for many inference workloads
- L2 Cache Partitioning: the L2 cache is physically partitioned between instances — no cache interference means predictable performance regardless of co-running workloads
- No Oversubscription: MIG doesn't allow allocating more resources than physically available — unlike time-slicing or MPS, MIG provides hard resource boundaries
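The proportional-share model above can be made concrete with back-of-the-envelope numbers for a 40 GB A100: 14 SMs per compute slice and 1/8 of roughly 1555 GB/s per memory slice. These are illustrative estimates derived from the slice counts, not measured values:

```python
# Rough per-instance resource estimate for an A100 40 GB under MIG,
# assuming strictly proportional sharing (illustrative, not measured).
TOTAL_BW_GBS = 1555   # A100 40 GB HBM2 bandwidth
SMS_PER_SLICE = 14    # 7 compute slices x 14 SMs usable in MIG mode

def instance_share(compute_slices: int, memory_slices: int) -> dict:
    return {
        "sms": SMS_PER_SLICE * compute_slices,
        "bandwidth_gbs": round(TOTAL_BW_GBS * memory_slices / 8),
    }

print(instance_share(2, 2))  # 2g.10gb: 28 SMs, ~2/8 of bandwidth
print(instance_share(3, 4))  # 3g.20gb: 42 SMs, ~4/8 of bandwidth
```

For compute-bound kernels the SM count is the scaling knob; for bandwidth-bound kernels the memory-slice share dominates, which is why a 3g.20gb (4 memory slices) can outperform its 3/7 compute share would suggest.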

Comparison with Other GPU Sharing:
- MPS (Multi-Process Service): time-shares SM resources without memory isolation — higher utilization for cooperative workloads but no QoS guarantees or security isolation
- Time-Slicing (vGPU): context-switches the entire GPU between users — provides isolation but serializes execution, adding latency jitter
- MIG Advantage: only approach providing simultaneous execution with hardware isolation — combines the utilization benefits of MPS with the isolation guarantees of separate GPUs

MIG has fundamentally changed GPU datacenter economics — by enabling safe multi-tenancy with hardware-enforced isolation, a single A100 can serve seven independent inference workloads simultaneously, cutting per-workload GPU cost to as little as one-seventh while maintaining predictable performance.
