Structured Pruning

Structured pruning is a model compression technique that removes entire structural units (channels, filters, attention heads, or layers) from a neural network. Unstructured pruning zeros individual weights, which leaves sparse matrices that need specialized hardware or sparse kernels to run faster. Structured pruning instead produces smaller dense models that run faster on standard GPUs and CPUs with no sparse-computation support, which makes it the most readily deployable form of pruning for real-world inference acceleration.

Structured vs. Unstructured Pruning

| Aspect | Unstructured | Structured |
| --- | --- | --- |
| What's removed | Individual weights | Entire channels/heads/layers |
| Sparsity pattern | Random zeros | Smaller dense matrices |
| Hardware support | Needs sparse kernels | Standard dense hardware |
| Actual speedup | Often minimal (without sparse HW) | Proportional to pruning |
| Max sparsity | 90-95% | 30-70% |
| Accuracy impact | Low at moderate sparsity | Higher at same sparsity ratio |
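
The first two rows of the comparison can be seen on a single weight matrix. The following is a minimal NumPy sketch, not taken from any particular library; the matrix size, the 50% ratio, and the L1-norm row ranking are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))  # toy layer: 8 output units, 16 inputs

# Unstructured: zero the 50% smallest-magnitude weights. The shape is unchanged,
# so a dense matmul does the same amount of work as before.
threshold = np.median(np.abs(w))
w_unstructured = np.where(np.abs(w) > threshold, w, 0.0)

# Structured: drop the 4 rows (output units) with the smallest L1 norm.
# The result is a genuinely smaller dense matrix.
row_norms = np.abs(w).sum(axis=1)
keep_rows = np.sort(np.argsort(row_norms)[4:])
w_structured = w[keep_rows]

print(w_unstructured.shape, np.count_nonzero(w_unstructured))  # (8, 16) 64
print(w_structured.shape)                                      # (4, 16)
```

Both variants remove half of the parameters, but only the structured one shrinks the matrix that standard dense kernels have to multiply.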

Structural Units for Pruning

Convolution: Remove entire output filters
  Original: Conv(C_in=256, C_out=512, 3×3) → 512 filters
  Pruned:   Conv(C_in=256, C_out=384, 3×3) → 384 filters (25% removed)
  → Must also remove corresponding input channels in next layer
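
The coupling between a pruned layer and its successor can be sketched in NumPy as follows. The `(C_out, C_in, kH, kW)` weight layout, the shape of the next layer, and the choice to keep the first 384 filters are assumptions for illustration; a real pipeline would choose `keep_idx` from an importance score.

```python
import numpy as np

def prune_conv_pair(w, w_next, keep_idx):
    """Remove output filters of one conv layer and the matching
    input channels of the layer that consumes its output.

    w:        weights of the pruned layer, shape (C_out, C_in, kH, kW)
    w_next:   weights of the following layer, shape (C_next, C_out, kH, kW)
    keep_idx: indices of the output filters to keep
    """
    keep_idx = np.sort(np.asarray(keep_idx))
    return w[keep_idx], w_next[:, keep_idx]

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 256, 3, 3)).astype(np.float32)
w_next = rng.standard_normal((512, 512, 3, 3)).astype(np.float32)  # hypothetical next layer

keep_idx = np.arange(384)  # placeholder selection: keep 384 of 512 filters
w_pruned, w_next_pruned = prune_conv_pair(w, w_next, keep_idx)

print(w_pruned.shape)       # (384, 256, 3, 3)
print(w_next_pruned.shape)  # (512, 384, 3, 3)
```

Any per-channel parameters between the two layers (convolution bias, batch norm scale, shift, and running statistics) would need the same `keep_idx` slice.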

Transformer: Remove attention heads or FFN neurons
  Original: 32 attention heads, FFN dim 4096
  Pruned:   24 attention heads, FFN dim 3072
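
A rough NumPy sketch of what removing heads and FFN neurons means for the weight matrices is shown below. It assumes an `x @ W` layout with one contiguous column block per head, a hypothetical model width of 256, and a two-matrix FFN; the heads and neurons kept here are placeholders rather than the result of any importance ranking.

```python
import numpy as np

def prune_heads(w_q, w_k, w_v, w_o, n_heads, keep_heads):
    """Drop attention heads from the projection matrices.

    w_q, w_k, w_v: shape (d_model, n_heads * head_dim), one column block per head
    w_o:           shape (n_heads * head_dim, d_model), one row block per head
    """
    head_dim = w_q.shape[1] // n_heads
    # Column (or row) indices that belong to the heads we keep
    idx = np.concatenate(
        [np.arange(h * head_dim, (h + 1) * head_dim) for h in sorted(keep_heads)]
    )
    return w_q[:, idx], w_k[:, idx], w_v[:, idx], w_o[idx, :]

def prune_ffn(w_up, w_down, keep_neurons):
    """Drop hidden neurons: columns of the up-projection, rows of the down-projection."""
    idx = np.sort(np.asarray(keep_neurons))
    return w_up[:, idx], w_down[idx, :]

d_model, n_heads, head_dim, ffn_dim = 256, 32, 8, 4096  # d_model and head_dim are assumed
rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.standard_normal((d_model, n_heads * head_dim)) for _ in range(3))
w_o = rng.standard_normal((n_heads * head_dim, d_model))
w_up = rng.standard_normal((d_model, ffn_dim))
w_down = rng.standard_normal((ffn_dim, d_model))

q, k, v, o = prune_heads(w_q, w_k, w_v, w_o, n_heads, keep_heads=range(24))
up, down = prune_ffn(w_up, w_down, keep_neurons=np.arange(3072))

print(q.shape, o.shape)      # (256, 192) (192, 256)
print(up.shape, down.shape)  # (256, 3072) (3072, 256)
```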
  
Layer pruning: Remove entire transformer layers
  Original: 32 layers
  Pruned:   24 layers (remove least important 8 layers)
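
Selecting which layers to drop can be sketched in plain Python. The per-layer importance scores below are made up for the example; in practice they would come from one of the importance criteria, and the function name is hypothetical.

```python
def select_layers_to_keep(importance, n_remove):
    """Return the indices of the layers to keep, in their original order."""
    if not 0 <= n_remove < len(importance):
        raise ValueError("n_remove must be in [0, number of layers)")
    # Rank layer indices from least to most important
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    removed = set(ranked[:n_remove])
    return [i for i in range(len(importance)) if i not in removed]

# Made-up scores for a 32-layer model (a permutation of 0..31)
importance = [(i * 7) % 32 for i in range(32)]
keep = select_layers_to_keep(importance, n_remove=8)

print(len(keep))  # 24
```

Keeping the surviving layers in their original order matters, because each layer was trained to consume the output of the layers before it.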

Importance Criteria

| Criterion | What It Measures | Computation |
| --- | --- | --- |
| L1 norm | Magnitude of filter weights | Sum of abs(weights) |
| Taylor expansion | Gradient × activation | Requires forward + backward |
| BN scaling factor | Batch norm γ (Network Slimming) | Already computed |
| Fisher information | Sensitivity of loss to removal | Second-order approximation |
| Geometric median | Redundancy among filters | Pairwise distance |
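
Two of these criteria are simple enough to sketch in NumPy. The L1 score follows the table directly. The Taylor score shown here is one common first-order form (absolute value of the spatially averaged gradient × activation, averaged over the batch) and is an assumption about which variant is meant. The array shapes and toy values are illustrative.

```python
import numpy as np

def l1_filter_scores(w):
    """L1 norm of each output filter; w has shape (C_out, C_in, kH, kW)."""
    return np.abs(w).reshape(w.shape[0], -1).sum(axis=1)

def taylor_channel_scores(activation, gradient):
    """First-order Taylor score per channel.

    activation, gradient: shape (N, C, H, W), where gradient is the
    derivative of the loss with respect to the activation.
    """
    per_sample = np.abs((activation * gradient).mean(axis=(2, 3)))  # (N, C)
    return per_sample.mean(axis=0)                                  # (C,)

# Four toy filters whose weights all equal the given scale
scale = np.array([0.5, 0.1, 2.0, 1.0])
w = np.ones((4, 2, 3, 3)) * scale[:, None, None, None]
scores = l1_filter_scores(w)
prune_first = np.argsort(scores)[:2]  # indices of the two weakest filters
print(prune_first)  # [1 0]

# Toy activations and gradients for 3 channels
act = np.ones((2, 3, 2, 2))
grad = np.ones((2, 3, 2, 2)) * np.array([0.1, -0.5, 0.0])[None, :, None, None]
t_scores = taylor_channel_scores(act, grad)  # channel 2 scores zero
```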

Network Slimming (BN-based)

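import torch
import torch.nn as nn

# Step 1: train with an L1 penalty on BN gamma (bn.weight); bn_layers is a list of nn.BatchNorm2d
def slimming_loss(task_loss, bn_layers, lam=1e-4):
    return task_loss + lam * sum(bn.weight.abs().sum() for bn in bn_layers)

# Step 2: after training, channels with small |gamma| are unimportant; pick one global threshold
def global_threshold(bn_layers, prune_ratio):
    gammas = torch.cat([bn.weight.detach().abs() for bn in bn_layers])
    return torch.quantile(gammas, prune_ratio)

# Step 3: remove channels where |gamma| <= threshold in a conv -> bn -> next_conv chain
def prune_channels(conv, bn, next_conv, threshold):
    keep = bn.weight.detach().abs() > threshold
    conv.weight = nn.Parameter(conv.weight.detach()[keep])  # remove output channels
    if conv.bias is not None:
        conv.bias = nn.Parameter(conv.bias.detach()[keep])
    for name in ("weight", "bias"):  # slice BN gamma and beta the same way
        setattr(bn, name, nn.Parameter(getattr(bn, name).detach()[keep]))
    bn.running_mean, bn.running_var = bn.running_mean[keep], bn.running_var[keep]
    next_conv.weight = nn.Parameter(next_conv.weight.detach()[:, keep])  # remove input channels

# Step 4: fine-tune the pruned model to recover accuracy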

LLM Structured Pruning

| Method | What's Pruned | Model | Pruning % | Quality |
| --- | --- | --- | --- | --- |
| LLM-Pruner | Coupled structures | Llama-7B | 20-50% | Good |
| Sheared Llama | Width + depth | Llama-2 | 40-60% | Strong |
| SliceGPT | Embedding dimensions | Various | 25-30% | Good |
| LaCo | Layers (merge similar) | Various | 25-50% | Moderate |
| MiniLLM | Distill + prune | Various | 50-75% | Good |

Pruning + Fine-tuning Pipeline

[Pretrained model]
      ↓
[Compute importance scores for all structures]
      ↓
[Remove lowest-importance structures]
      ↓
[Fine-tune on subset of training data (1-10%)]
      ↓
[Repeat if needed (iterative pruning)]
      ↓
[Pruned model: 30-70% smaller, ~1-3% accuracy loss]
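
The loop in this pipeline can be sketched in plain Python. Everything here is a stand-in: `score_fn` and `finetune_fn` are hypothetical callables that a real setup would replace with importance scoring and recovery training, and the structures are represented only by integer identifiers.

```python
def iterative_prune(units, score_fn, finetune_fn, target_ratio, step_ratio=0.1):
    """Repeatedly score, remove the lowest-scoring units, and fine-tune.

    units:        list of prunable structure identifiers
    score_fn:     callable(units) -> dict mapping unit -> importance
    finetune_fn:  callable(units) -> None, stands in for recovery training
    target_ratio: fraction of the original units to remove in total
    step_ratio:   fraction of the original units to remove per round
    """
    original = len(units)
    target = round(original * (1 - target_ratio))
    while len(units) > target:
        scores = score_fn(units)
        n_remove = min(max(1, round(original * step_ratio)), len(units) - target)
        drop = set(sorted(units, key=scores.__getitem__)[:n_remove])
        units = [u for u in units if u not in drop]
        finetune_fn(units)
    return units

# Toy run: 100 units whose importance equals their index
log = []
remaining = iterative_prune(
    units=list(range(100)),
    score_fn=lambda us: {u: u for u in us},
    finetune_fn=lambda us: log.append(len(us)),
    target_ratio=0.3,
)
print(len(remaining), log)  # 70 [90, 80, 70]
```

Rescoring after every round is the point of the iterative variant: importance estimates change once neighbouring structures have been removed and the model has been fine-tuned.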

Speedup Results

| Model | Pruning Ratio | Accuracy Retention | Actual Speedup |
| --- | --- | --- | --- |
| ResNet-50 (filter pruning) | 30% | 99% of original | 1.4× |
| ResNet-50 (filter pruning) | 50% | 97% of original | 2.0× |
| BERT (attention heads) | 40% | 98.5% of original | 1.5× |
| Llama-7B → 5.5B (Sheared) | 20% | 96% of original | 1.3× |
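
For context on how a pruning ratio relates to compute, the multiply-accumulate count of a convolution scales with the product of its input and output channels. The sketch below uses the 256 → 512 convolution from the structural-units example; the 28 × 28 output map and the assumption that the preceding layer is pruned by the same 25% are illustrative, and the result is a theoretical figure rather than a measured speedup.

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one k x k convolution layer."""
    return c_in * c_out * k * k * h_out * w_out

before = conv_macs(256, 512, 3, 28, 28)
after_out_only = conv_macs(256, 384, 3, 28, 28)  # only this layer's filters pruned
after_both = conv_macs(192, 384, 3, 28, 28)      # input channels pruned by 25% as well

print(after_out_only / before)  # 0.75
print(after_both / before)      # 0.5625
```

Measured end-to-end speedups are usually smaller than such ratios suggest, since layers are rarely pruned uniformly and memory traffic and fixed overheads do not shrink in step with arithmetic.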

Structured pruning is the most practical path to neural network compression on standard hardware. Because it removes entire architectural units rather than individual weights, it produces genuinely smaller, faster models that need no sparse computation libraries. That makes it a go-to technique for deploying efficient models on GPUs, CPUs, and mobile devices, where real-world speedup matters more than theoretical sparsity ratios.
