GPU Out of Memory (OOM)

GPU Out of Memory (OOM) errors occur when model weights, gradients, optimizer state, activations, and other intermediate buffers together exceed the available GPU VRAM. They are a common problem in ML development. Fixing them requires understanding where memory goes and applying techniques such as batch size reduction, gradient checkpointing, mixed precision, and quantization to fit the model within the hardware's limits.

What Is GPU OOM?

Why OOM Happens

Diagnosis

Check Current Usage:

# Current GPU memory
nvidia-smi

# Real-time monitoring
watch -n1 nvidia-smi

# Detailed per-process
nvidia-smi pmon -s m
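
The same numbers can be read from a script. The sketch below is a minimal example, assuming `nvidia-smi` is on the PATH; it uses the `--query-gpu` and `--format=csv,noheader,nounits` options, and the helper names `parse_gpu_memory` and `query_gpu_memory` are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse lines such as '0, 1024, 81920' into a list of dicts (values in MiB)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({
            "index": index,
            "used_mib": used,
            "total_mib": total,
            "used_fraction": used / total,
        })
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `query_gpu_memory()` before and after loading a model gives a rough view of how much VRAM the process has claimed, including memory that the framework has reserved but is not currently using.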

Python Memory Tracking:

import torch

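# Memory occupied by live tensors
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
# Memory held by PyTorch's caching allocator (live tensors + cached free blocks)
print(f"Reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Detailed report; memory_summary() returns a string, so print it
print(torch.cuda.memory_summary())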

Memory Estimation:

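Model memory in bytes (FP16) ≈ 2 × parameter count

Example (7B model, everything stored in FP16):
Parameters: 7B × 2 bytes = 14 GB
Optimizer (Adam, 2 states per parameter): 14 GB × 2 = 28 GB
Gradients: 14 GB
Activations: Variable (~10-20 GB)
─────────────────────────────
Training total: ~66-76 GB (little or no headroom on a single 80 GB GPU; with Adam states kept in FP32, 56 GB instead of 28 GB, it does not fit)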
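
The arithmetic above can be wrapped in a small helper for trying other model sizes and precisions. This is a rough sketch, not a profiler: the function name and parameters are illustrative, the activation figure has to be supplied by hand, and framework overhead such as the CUDA context and allocator fragmentation is ignored.

```python
def estimate_training_memory_gb(
    num_params,
    weight_bytes=2,           # FP16 weights
    grad_bytes=2,             # FP16 gradients
    optimizer_state_bytes=2,  # bytes per optimizer state value (4 for FP32 states)
    optimizer_states=2,       # Adam keeps two states per parameter
    activations_gb=0.0,       # workload-dependent, supplied by the caller
):
    """Return a per-component memory estimate in GB (1 GB = 1e9 bytes)."""
    weights = num_params * weight_bytes / 1e9
    gradients = num_params * grad_bytes / 1e9
    optimizer = num_params * optimizer_state_bytes * optimizer_states / 1e9
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations_gb,
        "total": weights + gradients + optimizer + activations_gb,
    }


if __name__ == "__main__":
    # 7B parameters with the FP16 assumptions used in the example above.
    for name, gb in estimate_training_memory_gb(7e9, activations_gb=10).items():
        print(f"{name:12s} {gb:6.1f} GB")
```

Passing `optimizer_state_bytes=4` models the common case where Adam states are kept in FP32, which doubles the optimizer term.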

Solutions

Reduce Batch Size (First try):

# If batch_size=32 OOMs:
batch_size = 16  # Try smaller
# Or even batch_size = 1 with gradient accumulation
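
Finding a batch size that fits can be automated by halving on failure. The sketch below is framework-agnostic and assumes a caller-supplied `run_step(batch_size)` function (hypothetical) that raises an exception when memory runs out. It catches the built-in `MemoryError` by default; with PyTorch, the exception to pass would be `torch.cuda.OutOfMemoryError`, which recent releases provide.

```python
def find_fitting_batch_size(run_step, start_batch_size, oom_error=MemoryError,
                            min_batch_size=1):
    """Halve the batch size until run_step(batch_size) stops raising oom_error.

    run_step is a hypothetical callable that performs one forward/backward
    pass at the given batch size and raises oom_error when memory runs out.
    """
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except oom_error:
            # Not enough memory at this size: try half.
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")
```

On a real GPU it is usually worth freeing leftover tensors and calling `torch.cuda.empty_cache()` between attempts so that a failed attempt does not skew the next one.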

Gradient Accumulation (Same effective batch):

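accumulation_steps = 8  # effective batch = per-step batch × accumulation_steps
for i, batch in enumerate(dataloader):
    # Scale each loss so the accumulated gradient matches one large batch
    loss = model(batch) / accumulation_steps
    loss.backward()  # gradients add up across micro-batches

    # Update weights only once every accumulation_steps micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()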
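
Dividing the loss by `accumulation_steps` is what makes the accumulated gradient equal to the gradient of one large batch. The pure-Python sketch below checks this on a toy one-parameter model with a mean-squared-error loss; the model and function names are illustrative, and the equality assumes all micro-batches have the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Sum the gradients of (micro-batch loss / steps), as in the loop above."""
    steps = len(xs) // micro_batch_size  # assumes len(xs) divides evenly
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [2 * x + 1 for x in xs]
    print("full batch :", mse_grad(0.5, xs, ys))
    print("accumulated:", accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

If the dataset size is not a multiple of the micro-batch size, the last partial group contributes a slightly different weight, which is why training loops often drop or specially handle the final incomplete accumulation window.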

Gradient Checkpointing (Trade compute for memory):

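# Hugging Face Transformers models
model.gradient_checkpointing_enable()

# PyTorch native: recompute a block's activations during backward
# (block and hidden are placeholders for your own module and its input)
from torch.utils.checkpoint import checkpoint
hidden = checkpoint(block, hidden, use_reentrant=False)

# Savings: 2-3× memory reduction, all of it from activations
# Cost: ~20% slower training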

Mixed Precision Training:

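import torch
from torch.cuda.amp import GradScaler  # newer PyTorch releases also expose torch.amp.GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # FP16 compute
        loss = model(batch)

    scaler.scale(loss).backward()  # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales gradients; skips the step on inf/NaN
    scaler.update()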

# Savings: ~2× memory for activations

Quantization (For inference):

# bitsandbytes 4-bit
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config
)

# 7B model: 14 GB → ~4 GB
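
The weight footprint scales with bits per parameter, which is where the figure above comes from. The helper below is a back-of-the-envelope sketch with an illustrative name. The raw result for 4-bit is lower than the ~4 GB quoted because, in practice, quantization constants and any layers kept in higher precision add overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring quantization overhead."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        print(f"7B model at {bits:2d}-bit: {weight_memory_gb(7e9, bits):5.1f} GB")
```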

Clear Cache:

# Drop references first; memory held by live tensors cannot be released
del large_tensor
# Return unused cached blocks to the driver (does not shrink live tensors)
torch.cuda.empty_cache()

# Disable autograd during inference so activations are not kept for backward
with torch.no_grad():
    output = model(inputs)

Memory-Efficient Techniques Summary

Technique              | Memory Savings                 | Trade-off
-----------------------|--------------------------------|-------------------
Smaller batch size     | Linear (activations)           | More iterations
Gradient accumulation  | None (keeps effective batch)   | Code complexity
Gradient checkpointing | 2-3×                           | ~20% slower
Mixed precision (FP16) | ~2× activations                | Minor precision loss
Quantization (INT4)    | 4× weights (vs FP16)           | Quality varies
Flash Attention        | ~2× attention                  | None
DeepSpeed ZeRO         | Split across GPUs              | Multi-GPU needed

Inference OOM

# Use vLLM for efficient inference
from vllm import LLM

llm = LLM(
    model="model-name",          # must be an AWQ-quantized checkpoint
    quantization="awq",          # 4-bit weight quantization
    gpu_memory_utilization=0.9,  # let vLLM use up to 90% of VRAM
    max_model_len=4096,          # reduce context length if needed
)
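
Reducing the context length helps because the key/value cache grows linearly with sequence length. The sketch below uses the standard estimate for a decoder-only transformer: one key and one value tensor per layer. The function name is illustrative, and the shape in the example (32 layers, 32 KV heads, head dimension 128, FP16) is an assumed configuration similar to common 7B models, not a value taken from any specific checkpoint.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for batch_size sequences of seq_len tokens."""
    # Factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


if __name__ == "__main__":
    # Assumed 7B-like shape; substitute the values from your model's config.
    for seq_len in (2048, 4096, 8192):
        gib = kv_cache_bytes(32, 32, 128, seq_len) / 1024**3
        print(f"seq_len={seq_len:5d}: {gib:.1f} GiB per sequence")
```

Because the estimate is per sequence, the number of concurrent requests multiplies it, so both `max_model_len` and the degree of batching determine how much VRAM the cache needs.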

GPU OOM is one of the most common issues in ML development. Understanding where memory goes and applying reduction techniques systematically makes it possible to run larger models on the hardware you have, which makes memory optimization an essential skill for ML engineers.
