
Gradient Accumulation

What is Gradient Accumulation?

Gradient accumulation sums gradients over several mini-batches (micro-batches) before each weight update, simulating a larger batch size without the extra activation memory a larger batch would require.

Why Use It?

Constraint | Solution
GPU memory limits batch size | Accumulate smaller batches
Need larger effective batch | More stable gradients
Single GPU training | Match multi-GPU batch sizes

How It Works

Standard Training

# Each step: forward → backward → update
for batch in dataloader:
    loss = model(batch)
    loss.backward()
    optimizer.step()  # Update every batch
    optimizer.zero_grad()

With Gradient Accumulation

accumulation_steps = 4

for i, batch in enumerate(dataloader):
    loss = model(batch)
    loss = loss / accumulation_steps  # Scale loss
    loss.backward()  # Accumulate gradients

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()  # Update every N batches
        optimizer.zero_grad()
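
One detail the loop above leaves open is what happens when the number of batches is not a multiple of accumulation_steps: the last few gradients are accumulated but never applied. A minimal sketch of one common way to handle this, flushing at the final batch, follows. The helper name `update_points` is hypothetical, and it only computes where an `optimizer.step()` would fire; it does not train anything.

```python
def update_points(num_batches, accumulation_steps):
    """Return (batch_index, group_size) pairs at which an optimizer step would fire.

    group_size is the number of micro-batches accumulated for that step. It is
    smaller than accumulation_steps only for a trailing partial group.
    """
    points = []
    pending = 0  # micro-batches accumulated since the last update
    for i in range(num_batches):
        pending += 1
        is_full = pending == accumulation_steps
        is_last = i == num_batches - 1
        if is_full or is_last:
            points.append((i, pending))
            pending = 0
    return points


print(update_points(10, 4))  # [(3, 4), (7, 4), (9, 2)]
```

In this example the final update is built from only 2 micro-batches, so losses divided by accumulation_steps = 4 would under-weight it. Whether to rescale that last group, or simply drop it, is a design choice that depends on the training setup.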

Effective Batch Size

effective_batch_size = batch_size × accumulation_steps × num_gpus

Example:
batch_size = 4
accumulation_steps = 8
num_gpus = 1
effective_batch_size = 4 × 8 × 1 = 32
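
The formula above can also be inverted to pick accumulation_steps for a target batch size. A small sketch, assuming the hypothetical helper names `effective_batch_size` and `steps_for_target`:

```python
def effective_batch_size(batch_size, accumulation_steps, num_gpus=1):
    """Samples contributing to each weight update."""
    return batch_size * accumulation_steps * num_gpus


def steps_for_target(target_batch_size, batch_size, num_gpus=1):
    """Smallest accumulation_steps whose effective batch reaches the target."""
    per_step = batch_size * num_gpus
    return -(-target_batch_size // per_step)  # integer ceiling division


print(effective_batch_size(4, 8, 1))  # 32
print(steps_for_target(32, 4, 1))     # 8
print(steps_for_target(256, 4, 8))    # 8
```

When the target is not an exact multiple of batch_size × num_gpus, the rounded-up step count gives an effective batch slightly larger than the target.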

Important Considerations

Loss Scaling

Divide the loss by the number of accumulation steps so that the accumulated gradient equals the average over the effective batch rather than the sum:

loss = loss / accumulation_steps
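
To see why the division matters, here is a minimal pure-Python sketch using a one-parameter model y = w·x with a mean-squared-error loss. The data and the helper `grad_mse` are illustrative assumptions. Because differentiation is linear, dividing each micro-batch gradient by accumulation_steps has the same effect as dividing the loss before calling backward().

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

accumulation_steps = 4
micro = len(xs) // accumulation_steps  # 2 samples per micro-batch

full_grad = grad_mse(w, xs, ys)  # gradient from one batch of 8

scaled = 0.0    # accumulated with the 1/accumulation_steps factor
unscaled = 0.0  # accumulated without it
for k in range(accumulation_steps):
    mx = xs[k * micro:(k + 1) * micro]
    my = ys[k * micro:(k + 1) * micro]
    g = grad_mse(w, mx, my)
    scaled += g / accumulation_steps
    unscaled += g

print(full_grad)  # -76.5
print(scaled)     # -76.5
print(unscaled)   # -306.0
```

The scaled sum matches the full-batch gradient, while the unscaled sum is accumulation_steps times larger, which would act like an unintended increase in learning rate. The exact match relies on equally sized micro-batches and a loss that averages over samples.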

Learning Rate

The learning rate may need adjusting for the larger effective batch.
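
Two commonly cited heuristics for this adjustment are linear scaling and square-root scaling of the learning rate with the batch-size ratio. The sketch below is a starting point under those assumptions, not a guarantee of good results; the function name `scale_lr` and the example values are hypothetical.

```python
import math


def scale_lr(base_lr, base_batch_size, effective_batch_size, rule="linear"):
    """Scale a learning rate tuned at base_batch_size to a new effective batch."""
    ratio = effective_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


# Learning rate tuned at batch size 8, now training with effective batch 32
print(scale_lr(1e-4, 8, 32, rule="linear"))
print(scale_lr(1e-4, 8, 32, rule="sqrt"))
```

If accumulation is used only to reproduce a batch size that an existing recipe already specifies, the recipe's learning rate usually applies unchanged.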

Memory Usage

Component | With Accumulation
Model weights | Same
Activations | Per micro-batch
Gradients | Accumulate (same size)
Optimizer states | Same
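
The table above can be turned into a rough back-of-the-envelope estimate. The sketch below assumes fp32 everywhere and an Adam-style optimizer with two states per parameter. The parameter count and the activation size per sample are made-up illustrative values, and the name `estimate_memory_gb` is hypothetical. Real usage depends on precision, architecture, and framework overhead.

```python
def estimate_memory_gb(num_params, activation_bytes_per_sample, micro_batch_size,
                       bytes_per_param=4, optimizer_states_per_param=2):
    """Very rough training-memory estimate in GB (1 GB = 1e9 bytes)."""
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param  # one gradient buffer, reused across micro-batches
    optim = num_params * bytes_per_param * optimizer_states_per_param
    activations = activation_bytes_per_sample * micro_batch_size
    return (weights + grads + optim + activations) / 1e9


NUM_PARAMS = 1_000_000_000        # hypothetical 1B-parameter model
ACT_PER_SAMPLE = 500_000_000      # hypothetical 0.5 GB of activations per sample

# Effective batch 32 via micro-batch 4 with 8 accumulation steps
print(estimate_memory_gb(NUM_PARAMS, ACT_PER_SAMPLE, micro_batch_size=4))   # 18.0
# The same effective batch processed in a single pass
print(estimate_memory_gb(NUM_PARAMS, ACT_PER_SAMPLE, micro_batch_size=32))  # 32.0
```

accumulation_steps does not appear in the estimate at all: only the micro-batch size drives activation memory, which is where the saving comes from.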

Batch Normalization

BatchNorm (rare in LLMs) computes its statistics over each micro-batch, so they may differ from the statistics of a true large batch.
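
A small numeric illustration of that difference, using made-up activation values and the standard library only. It compares the mean and population variance of one batch of 8 against the same data split into two micro-batches of 4.

```python
from statistics import fmean, pvariance

# Hypothetical activations for a single feature across 8 samples
activations = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
micro_batch_size = 4

full_stats = (fmean(activations), pvariance(activations))

micro_stats = []
for start in range(0, len(activations), micro_batch_size):
    chunk = activations[start:start + micro_batch_size]
    micro_stats.append((fmean(chunk), pvariance(chunk)))

print(full_stats)   # (7.0, 21.5)
print(micro_stats)  # [(2.5, 1.25), (11.5, 1.25)]
```

Each micro-batch would be normalized with its own mean and a much smaller variance than the full batch, so the normalized values differ from what a single batch of 8 would produce. Accumulating gradients does not undo this.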

Hugging Face Implementation

from transformers import TrainingArguments

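args = TrainingArguments(
    output_dir="./output",           # Placeholder path; older transformers versions require it
    per_device_train_batch_size=4,   # Micro-batch size per GPU
    gradient_accumulation_steps=8,   # Accumulate 8 micro-batches per update
    # Effective: 4 × 8 = 32 per GPU
)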

Complete Example

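import torch

# Assumes model, optimizer, scheduler, dataloader, and
# gradient_accumulation_steps are already defined.
model.train()
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    # Scale so the accumulated gradient is the mean over the effective batch
    loss = outputs.loss / gradient_accumulation_steps
    loss.backward()  # Adds to the gradients already stored in .grad

    if (step + 1) % gradient_accumulation_steps == 0:
        # Clip the accumulated gradient once, just before the update
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()  # Advance the schedule once per update, not per micro-batch
        optimizer.zero_grad()

        # Undo the scaling to report the last micro-batch's unscaled loss
        print(f"Step {step + 1}: loss = {loss.item() * gradient_accumulation_steps:.4f}")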