Gradient Accumulation
What is Gradient Accumulation?
Gradient accumulation sums the gradients from several mini-batches (micro-batches) before each weight update. This simulates a larger batch size without the memory cost of processing that batch in a single pass.
Why Use It?
| Constraint | How accumulation helps |
|---|---|
| GPU memory limits batch size | Accumulate several smaller batches per update |
| Need larger effective batch | Averaging over more samples gives more stable gradients |
| Single GPU training | Match the batch sizes used in multi-GPU runs |
How It Works
Standard Training
# Each step: forward → backward → update
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()       # Update every batch
    optimizer.zero_grad()
With Gradient Accumulation
accumulation_steps = 4
for i, batch in enumerate(dataloader):
    loss = model(batch)
    loss = loss / accumulation_steps  # Scale loss
    loss.backward()                   # Accumulate gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()              # Update every N batches
        optimizer.zero_grad()
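One detail of the loop above: when the number of batches is not a multiple of accumulation_steps, the gradients from the final partial group are never applied. The sketch below is a framework-free illustration of where updates would fire, with an optional flush on the last batch. The helper name `update_steps` and the `flush_remainder` flag are hypothetical, introduced only for this example.

```python
def update_steps(num_batches, accumulation_steps, flush_remainder=True):
    """Return the 1-based micro-batch indices at which the optimizer would step."""
    steps = []
    for i in range(num_batches):
        is_boundary = (i + 1) % accumulation_steps == 0
        is_last = (i + 1) == num_batches
        # Step on every N-th micro-batch, and optionally on the final one
        if is_boundary or (flush_remainder and is_last):
            steps.append(i + 1)
    return steps


print(update_steps(10, 4, flush_remainder=False))  # [4, 8]
print(update_steps(10, 4))                         # [4, 8, 10]
```

If the remainder is flushed this way, the last update averages fewer micro-batches than the others, so a loss divided by accumulation_steps yields a smaller gradient for that step. Whether to flush, rescale, or drop the remainder is a design choice.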
Effective Batch Size
effective_batch_size = batch_size × accumulation_steps × num_gpus
Example:
batch_size = 4
accumulation_steps = 8
num_gpus = 1
effective_batch_size = 4 × 8 × 1 = 32
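The same arithmetic can be wrapped in two small helpers: one computes the effective batch size, the other inverts the formula to find how many accumulation steps reach a target. This is a minimal sketch; the function names are illustrative and not part of any library.

```python
import math


def effective_batch_size(batch_size, accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer update."""
    return batch_size * accumulation_steps * num_gpus


def accumulation_steps_for(target_batch_size, batch_size, num_gpus=1):
    """Smallest number of accumulation steps that reaches at least the target."""
    return math.ceil(target_batch_size / (batch_size * num_gpus))


print(effective_batch_size(4, 8, 1))     # 32
print(accumulation_steps_for(32, 4, 1))  # 8
```

When the target is not an exact multiple of batch_size × num_gpus, rounding up overshoots the target slightly.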
Important Considerations
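Loss Scaling
Divide the loss by the number of accumulation steps. Gradients from successive backward passes are summed, so without this division the accumulated gradient would be N times the average over the effective batch: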
loss = loss / accumulation_steps
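The effect of this scaling can be checked by hand. The sketch below assumes a hypothetical one-parameter model y = w·x with a mean-squared-error loss and equal-sized micro-batches, and uses hand-derived gradients in place of autograd. Because the gradient is linear in the loss, dividing each micro-batch loss by accumulation_steps is equivalent to dividing its gradient.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Reference: gradient of the mean loss over the full batch of 8
full_grad = mse_grad(w, xs, ys)

accumulation_steps = 4
micro = len(xs) // accumulation_steps  # 2 samples per micro-batch

scaled_sum = 0.0    # accumulates gradients of (loss / accumulation_steps)
unscaled_sum = 0.0  # accumulates gradients of the unscaled loss
for k in range(accumulation_steps):
    mx = xs[k * micro:(k + 1) * micro]
    my = ys[k * micro:(k + 1) * micro]
    g = mse_grad(w, mx, my)
    scaled_sum += g / accumulation_steps
    unscaled_sum += g

print(math.isclose(scaled_sum, full_grad))                         # True
print(math.isclose(unscaled_sum, accumulation_steps * full_grad))  # True
```

The match is exact only when every micro-batch has the same number of samples and the loss is a per-sample mean.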
Learning Rate
A larger effective batch may call for a different learning rate:
- Linear scaling rule: lr = base_lr × effective_batch_size / base_batch_size
- Or use warmup and tune the LR empirically
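As a rough illustration of the two options, the sketch below implements the linear scaling rule and a simple linear warmup. The function names, the warmup shape, and the example values are assumptions made for this example, not a recommended recipe.

```python
def scaled_lr(base_lr, base_batch_size, effective_batch_size):
    """Linear scaling rule: LR grows in proportion to the effective batch size."""
    return base_lr * effective_batch_size / base_batch_size


def warmup_lr(target_lr, update_step, warmup_updates):
    """Ramp linearly up to target_lr over the first warmup_updates optimizer updates."""
    if update_step >= warmup_updates:
        return target_lr
    return target_lr * (update_step + 1) / warmup_updates


# Hypothetical values: base LR tuned at batch 32, now training at effective batch 128
target = scaled_lr(base_lr=1e-4, base_batch_size=32, effective_batch_size=128)
schedule = [warmup_lr(target, step, warmup_updates=4) for step in range(6)]
print(schedule)
```

With accumulation, update_step counts optimizer updates rather than micro-batches, so a warmup of 4 updates at accumulation_steps = 8 spans 32 micro-batches.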
Memory Usage
| Component | With Accumulation |
|---|---|
| Model weights | Same |
| Activations | Per micro-batch |
| Gradients | Accumulate (same size) |
| Optimizer states | Same |
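The table can be turned into a back-of-the-envelope estimate. The sketch below is a deliberately crude model, not a measurement: it assumes 4-byte values, an optimizer that keeps two states per parameter, and activation memory that grows linearly with the micro-batch size. The per-sample activation figure is a made-up placeholder.

```python
def training_memory_bytes(num_params, micro_batch_size, activation_bytes_per_sample,
                          bytes_per_value=4, optimizer_states_per_param=2):
    """Rough memory breakdown; accumulation steps do not appear in any term."""
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value  # one buffer, reused across micro-batches
    optimizer = num_params * bytes_per_value * optimizer_states_per_param
    activations = micro_batch_size * activation_bytes_per_sample
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "activations": activations,
        "total": weights + gradients + optimizer + activations,
    }


# Batch of 32 in one pass vs. micro-batch of 4 accumulated over 8 steps
one_pass = training_memory_bytes(1_000_000, 32, activation_bytes_per_sample=2_000_000)
accumulated = training_memory_bytes(1_000_000, 4, activation_bytes_per_sample=2_000_000)
print(one_pass["total"], accumulated["total"])  # 80000000 24000000
```

Under these assumptions only the activation term changes, which is why accumulation trades extra forward and backward passes for a smaller peak footprint.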
Batch Normalization
If the model uses BatchNorm (rare in LLMs), its statistics are computed per micro-batch, so they can differ from those of a true large batch.
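A small numeric example makes the difference concrete. The sketch below uses made-up activation values for a single feature and compares the mean and (biased) variance over a full batch of 8 with the same statistics computed over two micro-batches of 4.

```python
import statistics

# Hypothetical activations for one feature across a batch of 8 samples
full_batch = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
micro_batches = [full_batch[:4], full_batch[4:]]

# Statistics a normalization layer would see with the full batch in one pass
full_mean = statistics.fmean(full_batch)
full_var = statistics.pvariance(full_batch)

# Statistics it sees when each micro-batch is processed separately
micro_stats = [(statistics.fmean(mb), statistics.pvariance(mb)) for mb in micro_batches]

print(full_mean, full_var)  # 7.0 21.5
print(micro_stats)          # [(2.5, 1.25), (11.5, 1.25)]
```

Each micro-batch is normalized with its own mean and variance, so the normalized activations are not the ones a single batch of 8 would produce. Gradient accumulation does not correct for this.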
Hugging Face Implementation
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",               # Placeholder path for checkpoints
    per_device_train_batch_size=4,     # Micro-batch
    gradient_accumulation_steps=8,     # Accumulate 8 steps
    # Effective: 4 × 8 = 32 per GPU
)
Complete Example
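import torch

# Assumes model, optimizer, scheduler, dataloader, and
# gradient_accumulation_steps are already defined.
model.train()
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    loss = outputs.loss / gradient_accumulation_steps
    loss.backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        # Clip the accumulated gradients once, just before the update
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # Undo the scaling to report the last micro-batch's unscaled loss
        print(f"Step {step + 1}: loss = {loss.item() * gradient_accumulation_steps:.4f}")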