GPU Out of Memory (OOM)

GPU Out of Memory (OOM) errors occur when model weights, activations, or intermediate computations exceed the available GPU VRAM. They are a common issue in ML development. Fixing them requires understanding where memory goes and applying techniques such as gradient checkpointing, mixed precision, quantization, and batch size reduction to fit models within the available memory.

What Is GPU OOM?

- Error: "CUDA out of memory" or "RuntimeError: CUDA error: out of memory."
- Cause: GPU memory (VRAM) exhausted by model/data/activations.
- Context: Training uses more memory than inference.
- Resolution: Reduce memory usage or increase available VRAM.

Why OOM Happens

- Model Weights: Large models need gigabytes for parameters.
- Activations: Saved for backpropagation during training.
- Optimizer States: Adam keeps two extra values per parameter (first and second moment estimates), roughly 2× the parameter memory.
- Gradients: Same size as parameters.
- KV Cache: For inference, grows with sequence length.
- Batch Size: More samples = more memory.

Diagnosis

Check Current Usage:
```bash
# Current GPU memory
nvidia-smi

# Real-time monitoring
watch -n1 nvidia-smi

# Detailed per-process
nvidia-smi pmon -s m
```
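
The same numbers can be read from a script. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the helper names `parse_gpu_memory` and `query_gpu_memory` and the sample text are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


# Made-up sample output for two GPUs, for illustration only.
sample = "70123, 81920\n512, 81920\n"
for index, (used, total) in enumerate(parse_gpu_memory(sample)):
    print(f"GPU {index}: {used} / {total} MiB ({used / total:.0%} used)")
```

On a machine with a GPU, `query_gpu_memory()` would replace the sample string.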

Python Memory Tracking:
```python
import torch

# Check memory usage
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved (cached): {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Get detailed snapshot (returns a string, so print it)
print(torch.cuda.memory_summary())
```

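Memory Estimation:
```
Model memory (FP16) ≈ 2 bytes × parameters

Example (7B model, all tensors in FP16):
  Parameters:       7B × 2 bytes = 14 GB
  Optimizer (Adam): 14 GB × 2    = 28 GB
  Gradients:                       14 GB
  Activations:      variable       (~10-20 GB)
  ─────────────────────────────
  Training total:   ~66-76 GB (little or no headroom on a single 80 GB GPU)

Keeping optimizer states or master weights in FP32, as is common in
mixed-precision training, pushes the total well past 80 GB.
```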
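
The estimate above can be turned into a small calculator. This is a rough sketch under the same simplifying assumptions (everything stored at one precision, Adam holding two values per parameter, activations supplied as a guessed range); the function name `training_memory_gb` is hypothetical.

```python
def training_memory_gb(num_params, bytes_per_param=2, optimizer_copies=2,
                       activations_gb=(10, 20)):
    """Rough training-memory breakdown in GB (1 GB = 1e9 bytes)."""
    weights = num_params * bytes_per_param / 1e9
    gradients = weights                      # same size as the parameters
    optimizer = weights * optimizer_copies   # Adam: two values per parameter
    fixed = weights + gradients + optimizer
    low, high = activations_gb               # assumed range, workload dependent
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total_low": fixed + low,
        "total_high": fixed + high,
    }


breakdown = training_memory_gb(7e9)
for name, value in breakdown.items():
    print(f"{name:>10}: {value:.1f} GB")
```

Passing `bytes_per_param=4` gives the full-FP32 variant of the same estimate.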

Solutions

Reduce Batch Size (First try):
```python
# If batch_size=32 OOMs:
batch_size = 16  # Try smaller
# Or even batch_size = 1 with gradient accumulation
```
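
Halving the batch size by hand can be automated. The sketch below is one possible approach, not a library feature: `find_fitting_batch_size` and `fake_step` are hypothetical names, and `fake_step` stands in for a real training step. It relies on the OOM error being a `RuntimeError` whose message contains "out of memory", as in the error strings quoted earlier.

```python
def find_fitting_batch_size(run_step, start_batch_size):
    """Halve the batch size until run_step stops raising an OOM error."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise  # unrelated error: do not hide it
            # In real PyTorch code, free cached blocks here before retrying,
            # e.g. with torch.cuda.empty_cache().
            batch_size //= 2
    raise RuntimeError("out of memory even at batch_size=1")


def fake_step(batch_size):
    """Hypothetical stand-in for a training step that fits at most 10 samples."""
    if batch_size > 10:
        raise RuntimeError("CUDA out of memory")


print(find_fitting_batch_size(fake_step, 32))
```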

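Gradient Accumulation (Same effective batch):
```python
accumulation_steps = 8
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    # Scale the loss so the accumulated gradient matches one large batch
    loss = model(batch) / accumulation_steps
    loss.backward()  # Gradients add up across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```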
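
The claim that accumulation gives the same effective batch can be checked numerically without a GPU. The sketch below uses a one-parameter linear model with a hand-written mean-squared-error gradient; the data values and the helper name `mse_grad` are made up for illustration, and the equality holds because all micro-batches have the same size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
w = 0.5

# Gradient from one large batch of 8 samples.
full_grad = mse_grad(w, xs, ys)

# Same 8 samples as 4 micro-batches of 2, each scaled by 1 / accumulation_steps.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(abs(full_grad - accumulated) < 1e-9)
```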

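Gradient Checkpointing (Trade compute for memory):
```python
# Hugging Face Transformers
model.gradient_checkpointing_enable()

# PyTorch native: recompute a block's activations during backward
# (block and hidden_states are placeholders for your module and its input)
from torch.utils.checkpoint import checkpoint
output = checkpoint(block, hidden_states, use_reentrant=False)

# Savings: 2-3× memory reduction
# Cost: ~20% slower training
```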

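Mixed Precision Training:
```python
# Newer PyTorch releases expose these as torch.amp.autocast("cuda")
# and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with autocast():  # FP16 compute
        loss = model(batch)
    scaler.scale(loss).backward()  # Scale the loss before backward
    scaler.step(optimizer)         # Unscales gradients, then steps
    scaler.update()

# Savings: ~2× memory for activations
```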
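
Why the loss is scaled at all can be shown with the standard library alone. The sketch below rounds values to FP16 using `struct`'s half-precision format; the helper `to_fp16`, the gradient value, and the fixed scale are illustrative assumptions (a real `GradScaler` adjusts its scale dynamically).

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest FP16 value and convert it back."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_grad = 1e-8        # below the smallest positive FP16 value
loss_scale = 65536.0    # illustrative fixed scale

# Stored directly in FP16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaled first, it survives FP16 and can be unscaled in higher precision.
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

print(unscaled, recovered)
```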

Quantization (For inference):
```python
# bitsandbytes 4-bit
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    quantization_config=bnb_config,
)

# 7B model: 14 GB → ~4 GB
```
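
The 14 GB to roughly 4 GB figure follows from bits per parameter. A minimal sketch of that arithmetic, with a hypothetical helper `weight_memory_gb`; it counts raw weights only, so real quantized checkpoints come out slightly larger because of scaling constants and any layers kept in higher precision.

```python
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params, precision):
    """Raw weight storage in GB (1 GB = 1e9 bytes) at the given precision."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


for precision in BITS_PER_PARAM:
    print(f"7B @ {precision}: {weight_memory_gb(7e9, precision):.1f} GB")
```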

Clear Cache:
```python
# Clear unused cached memory
torch.cuda.empty_cache()

# Delete unused variables
del large_tensor
torch.cuda.empty_cache()

# Use context manager for temporary tensors
with torch.no_grad():
    # Inference without saving gradients
    output = model(inputs)
```

Memory-Efficient Techniques Summary

```
Technique              | Memory Savings      | Trade-off
-----------------------|---------------------|----------------------
Smaller batch size     | Linear              | More iterations
Gradient accumulation  | None (same effect)  | Code complexity
Gradient checkpointing | 2-3×                | ~20% slower
Mixed precision (FP16) | 2× activations      | Minor precision loss
Quantization (INT4)    | 4× weights          | Quality varies
Flash Attention        | ~2× attention       | None
DeepSpeed ZeRO         | Split across GPUs   | Multi-GPU needed
```

Inference OOM

```python
# Use vLLM for efficient inference
from vllm import LLM

llm = LLM(
    model="model-name",
    quantization="awq",          # 4-bit quantization
    gpu_memory_utilization=0.9,  # Use 90% of VRAM
)

# Reduce context length if needed
llm = LLM(model="model", max_model_len=4096)
```
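
Reducing the context length helps because the KV cache grows linearly with sequence length. The sketch below uses the usual size formula (keys and values, per layer, per KV head); the helper name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, FP16) are assumptions chosen to resemble a 7B-class model, not the configuration of any specific checkpoint.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for seq_len in (4096, 32768):
    gib = kv_cache_bytes(seq_len=seq_len, **shape) / 2**30
    print(f"{seq_len:>6} tokens: {gib:.1f} GiB per sequence")
```

Multiplying by the number of concurrent sequences gives the cache budget a serving engine has to fit alongside the weights.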

GPU OOM is one of the most common issues in ML development. Understanding where memory goes and systematically applying reduction techniques makes it possible to run larger models on available hardware, which makes memory optimization an essential skill for ML engineers.
