GPU Out of Memory (OOM)

Keywords: gpu oom, out of memory, cuda error, gradient checkpointing, quantization, memory optimization, batch size

GPU Out of Memory (OOM) errors occur when model weights, activations, or intermediate computations exceed available GPU VRAM. They are a routine issue in ML development; fitting a model within a memory budget requires understanding where memory goes and applying techniques such as gradient checkpointing, mixed precision, quantization, and batch size reduction.

What Is GPU OOM?

- Error: "CUDA out of memory" or "RuntimeError: CUDA error: out of memory."
- Cause: GPU memory (VRAM) exhausted by model/data/activations.
- Context: Training uses more memory than inference.
- Resolution: Reduce memory usage or increase available VRAM.

Why OOM Happens

- Model Weights: Large models need gigabytes for parameters.
- Activations: Saved for backpropagation during training.
- Optimizer States: Adam keeps two extra tensors (momentum and variance) per parameter, roughly 2x the parameter memory.
- Gradients: Same size as parameters.
- KV Cache: For inference, grows with sequence length.
- Batch Size: More samples = more memory.

Diagnosis

Check Current Usage:
```bash
# Current GPU memory
nvidia-smi

# Real-time monitoring
watch -n1 nvidia-smi

# Detailed per-process memory usage
nvidia-smi pmon -s m
```

Python Memory Tracking:
```python
import torch

# Check memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved (cached): {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Get a detailed snapshot (memory_summary returns a string)
print(torch.cuda.memory_summary())
```
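To find which step actually spikes memory, PyTorch's peak counters are more informative than a point-in-time reading. A minimal sketch (model and batch are assumed to exist already and live on the GPU):

```python
import torch

# Reset the high-water mark, then run the suspect step
torch.cuda.reset_peak_memory_stats()

loss = model(batch)  # assumed: your model and batch, already on the GPU
loss.backward()

peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated during this step: {peak_gb:.2f} GB")
```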

Memory Estimation:
```
Model memory (FP16) ≈ 2 bytes × parameter count

Example (7B model):
Parameters:       7B × 2 bytes = 14 GB
Optimizer (Adam): 14 GB × 2    = 28 GB
Gradients:        14 GB
Activations:      variable (~10-20 GB)
─────────────────────────────
Training total: ~66-76 GB (little to no headroom on a single 80 GB GPU,
and Adam states are often kept in FP32, pushing the total higher)
```
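The same back-of-the-envelope arithmetic as a small helper. This is a rough sketch that mirrors the breakdown above; real usage varies with architecture, sequence length, and allocator overhead:

```python
def estimate_training_memory_gb(params_billion: float,
                                bytes_per_param: int = 2,
                                activations_gb: float = 15.0) -> float:
    """Rough FP16 training-memory estimate (weights + grads + Adam + activations)."""
    weights = params_billion * bytes_per_param  # 1B params × N bytes ≈ N GB
    gradients = weights                         # same dtype and size as weights
    optimizer = 2 * weights                     # Adam: momentum + variance states
    return weights + gradients + optimizer + activations_gb

print(f"~{estimate_training_memory_gb(7):.0f} GB")  # ≈ 71 GB for a 7B model
```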

Solutions

Reduce Batch Size (First try):
```python
# If batch_size=32 OOMs:
batch_size = 16  # Try smaller
# Or even batch_size = 1 combined with gradient accumulation
```
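A common pattern is to catch the OOM and retry with a smaller batch. A sketch, where run_step is a hypothetical stand-in for your forward/backward pass (on older PyTorch versions without torch.cuda.OutOfMemoryError, catch RuntimeError and check for "out of memory" in the message; note that retrying in-process can still fail if memory is fragmented):

```python
import torch

batch_size = 32
while batch_size >= 1:
    try:
        run_step(batch_size)  # hypothetical: your forward/backward at this size
        break                 # step fit in memory; keep this batch size
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release cached blocks before retrying
        batch_size //= 2
        print(f"OOM, retrying with batch_size={batch_size}")
```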

Gradient Accumulation (Same effective batch):
```python
accumulation_steps = 8
for i, batch in enumerate(dataloader):
    loss = model(batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
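The effective batch size here is the per-step batch times accumulation_steps, so a micro-batch of 4 with accumulation_steps = 8 matches the gradients of a batch of 32 while only ever holding a batch of 4 in memory. Dividing the loss by accumulation_steps keeps the gradient scale identical to the large-batch run.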

Gradient Checkpointing (Trade compute for memory):
```python
from transformers import AutoModel

# Hugging Face models expose a helper that wraps PyTorch's
# torch.utils.checkpoint around each transformer layer
model = AutoModel.from_pretrained("model-name")
model.gradient_checkpointing_enable()

# Savings: 2-3× reduction in activation memory
# Cost: ~20% slower training (activations recomputed during backward)
```
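Under the hood this is torch.utils.checkpoint, which can also be applied directly to any submodule. A minimal sketch with a hypothetical two-stage network:

```python
import torch
from torch.utils.checkpoint import checkpoint

class TwoStageNet(torch.nn.Module):  # hypothetical example module
    def __init__(self):
        super().__init__()
        self.stage1 = torch.nn.Sequential(torch.nn.Linear(1024, 4096),
                                          torch.nn.ReLU())
        self.stage2 = torch.nn.Linear(4096, 1024)

    def forward(self, x):
        # stage1's activations are not stored; they are recomputed
        # during backward, trading extra compute for memory
        h = checkpoint(self.stage1, x, use_reentrant=False)
        return self.stage2(h)
```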

Mixed Precision Training:
```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with autocast():  # FP16 compute
        loss = model(batch)

    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Savings: ~2× memory for activations
```
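On Ampere or newer GPUs, bfloat16 is often the simpler choice: it keeps FP32's dynamic range, so the GradScaler loss-scaling machinery is unnecessary. A minimal sketch:

```python
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(batch)  # assumed: model and batch already on the GPU

loss.backward()  # no GradScaler needed with BF16
```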

Quantization (For inference):
```python
# bitsandbytes 4-bit quantization via Hugging Face
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config
)

# 7B model weights: 14 GB (FP16) → ~4 GB (4-bit)
```
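Loading in 4-bit requires the bitsandbytes and accelerate packages to be installed, and it is a weights-only saving: activations and the KV cache still consume VRAM at runtime, so long contexts can OOM even with a quantized model.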

Clear Cache:
```python
# Release unused cached memory back to the driver
torch.cuda.empty_cache()

# To free a tensor, drop all references first, then empty the cache
del large_tensor
torch.cuda.empty_cache()

# Skip storing activations entirely when you don't need gradients
with torch.no_grad():
    output = model(inputs)  # inference without autograd bookkeeping
```
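Note that empty_cache() only returns blocks the caching allocator is no longer using; it cannot shrink memory held by live tensors, so dropping references first is essential. For pure inference, torch.inference_mode() is a slightly stricter and sometimes faster alternative to no_grad().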

Memory-Efficient Techniques Summary

```
Technique              | Memory Savings               | Trade-off
-----------------------|------------------------------|-----------------------------
Smaller batch size     | Linear in batch size         | More iterations
Gradient accumulation  | None (same effective batch)  | Code complexity
Gradient checkpointing | 2-3× activations             | ~20% slower training
Mixed precision (FP16) | ~2× activations              | Minor precision loss
Quantization (INT4)    | ~4× weights                  | Quality varies
Flash Attention        | ~2× attention memory         | Needs supported GPU kernels
DeepSpeed ZeRO         | States split across GPUs     | Requires multi-GPU setup
```
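Flash Attention is available in stock PyTorch 2.x via scaled_dot_product_attention, which dispatches to a fused memory-efficient kernel when the hardware and dtypes allow it. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, heads, seq_len, head_dim), FP16 on a CUDA device
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Avoids materializing the full (seq_len × seq_len) attention matrix
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```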

Inference OOM

```python
# Use vLLM for memory-efficient inference (paged KV cache)
from vllm import LLM

llm = LLM(
    model="model-name",
    quantization="awq",          # requires an AWQ-quantized checkpoint
    gpu_memory_utilization=0.9   # cap vLLM at 90% of VRAM
)

# Reduce the maximum context length to shrink the KV cache
llm = LLM(model="model-name", max_model_len=4096)
```
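A minimal generation call against the LLM object above (the model name is a placeholder; SamplingParams controls decoding):

```python
from vllm import SamplingParams

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain GPU OOM in one sentence."], params)
print(outputs[0].outputs[0].text)
```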

GPU OOM is one of the most common failure modes in ML development. Understanding where memory goes (weights, gradients, optimizer states, activations, and KV cache) and applying these reduction techniques systematically lets you run larger models on the hardware you have, which makes memory optimization an essential skill for ML engineers.
