Structured pruning is a model compression technique that removes entire structural units (channels, filters, attention heads, or layers) from a neural network. Unlike unstructured pruning, which zeros individual weights and produces sparse matrices that need specialized hardware to exploit, structured pruning yields smaller dense models that run faster on standard GPUs and CPUs without any sparse computation support, making it the most practically deployable form of pruning for real-world inference acceleration.
Structured vs. Unstructured Pruning
| Aspect | Unstructured | Structured |
|--------|-------------|------------|
| What's removed | Individual weights | Entire channels/heads/layers |
| Resulting weights | Full-size matrices with scattered zeros | Smaller dense matrices |
| Hardware support | Needs sparse kernels | Standard dense hardware |
| Actual speedup | Often minimal (without sparse HW) | Proportional to pruning |
| Max sparsity | 90-95% | 30-70% |
| Accuracy impact | Low at moderate sparsity | Higher at same sparsity ratio |
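To make the hardware row concrete, here is a minimal PyTorch sketch (the layer size and `prune_ratio` are illustrative, not taken from the table) contrasting the two approaches on a single linear layer: unstructured pruning keeps the original weight shape and only zeros entries, while structured pruning yields a genuinely smaller dense layer.
```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024)
prune_ratio = 0.5

# Unstructured: zero the smallest 50% of individual weights.
# The matrix keeps its 1024x1024 shape, so dense matmul cost is unchanged.
threshold = layer.weight.detach().abs().flatten().quantile(prune_ratio)
unstructured_weight = layer.weight.detach() * (layer.weight.detach().abs() > threshold)

# Structured: drop the 50% of output neurons with the smallest L1 norm.
# The result is a smaller dense layer that is faster on any hardware.
row_importance = layer.weight.detach().abs().sum(dim=1)          # one score per output neuron
keep = row_importance.argsort(descending=True)[:512].sort().values  # keep original order
pruned = nn.Linear(1024, 512)
pruned.weight.data = layer.weight.data[keep]
pruned.bias.data = layer.bias.data[keep]
```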
Structural Units for Pruning
```
Convolution: Remove entire output filters
  Original: Conv(C_in=256, C_out=512, 3×3) → 512 filters
  Pruned:   Conv(C_in=256, C_out=384, 3×3) → 384 filters (25% removed)
  → Must also remove the corresponding input channels in the next layer

Transformer: Remove attention heads or FFN neurons
  Original: 32 attention heads, FFN dim 4096
  Pruned:   24 attention heads, FFN dim 3072

Layer pruning: Remove entire transformer layers
  Original: 32 layers
  Pruned:   24 layers (remove the 8 least important layers)
```
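A hedged PyTorch sketch of the convolution case above (the variable names are illustrative): after dropping output filters from one conv layer, the next layer's input channels must be sliced with the same index set.
```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
conv2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)

# Keep the 384 filters of conv1 with the largest L1 norm (25% removed).
importance = conv1.weight.detach().abs().sum(dim=(1, 2, 3))   # one score per output filter
keep = importance.argsort(descending=True)[:384].sort().values

pruned1 = nn.Conv2d(256, 384, kernel_size=3, padding=1)
pruned1.weight.data = conv1.weight.data[keep]
pruned1.bias.data = conv1.bias.data[keep]

# The next layer must drop the corresponding *input* channels.
pruned2 = nn.Conv2d(384, 512, kernel_size=3, padding=1)
pruned2.weight.data = conv2.weight.data[:, keep]
pruned2.bias.data = conv2.bias.data.clone()
```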
Importance Criteria
| Criterion | What It Measures | Computation |
|-----------|-----------------|-------------|
| L1 norm | Magnitude of filter weights | Sum of abs(weights) |
| Taylor expansion | Gradient × activation | Requires forward + backward |
| BN scaling factor | Batch norm γ (Network Slimming) | Already computed |
| Fisher information | Sensitivity of loss to removal | Second-order approximation |
| Geometric median | Redundancy among filters | Pairwise distance |
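As a sketch of the first two criteria in the table (assuming PyTorch; `conv` is any `nn.Conv2d`, and the Taylor variant shown uses the weight-gradient product, one common first-order approximation):
```python
import torch
import torch.nn as nn

def l1_importance(conv: nn.Conv2d) -> torch.Tensor:
    # L1 norm criterion: magnitude of each output filter's weights.
    return conv.weight.detach().abs().sum(dim=(1, 2, 3))

def taylor_importance(conv: nn.Conv2d) -> torch.Tensor:
    # First-order Taylor criterion: |weight * gradient| summed per filter.
    # Requires a forward + backward pass so that conv.weight.grad is populated.
    return (conv.weight.detach() * conv.weight.grad).abs().sum(dim=(1, 2, 3))
```
In both cases the filters are ranked by their score and the lowest-scoring ones become the removal candidates.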
Network Slimming (BN-based)
```python
# Step 1: Train with L1 regularization on the BN scale factors (gamma)
l1_lambda = 1e-4  # sparsity strength (hyperparameter)
loss = task_loss + l1_lambda * sum(bn.weight.abs().sum() for bn in bn_layers)

# Step 2: After training, channels with small gamma are unimportant
all_gammas = torch.cat([bn.weight.detach().abs().flatten() for bn in bn_layers])
global_threshold = torch.quantile(all_gammas, prune_ratio)

# Step 3: Remove channels where gamma < threshold
for layer, next_layer in zip(layers, layers[1:]):
    mask = layer.bn.weight.detach().abs() > global_threshold
    layer.conv.weight.data = layer.conv.weight.data[mask]               # remove output channels
    next_layer.conv.weight.data = next_layer.conv.weight.data[:, mask]  # remove matching input channels

# Step 4: Fine-tune the pruned model to recover accuracy
```
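One common way to realize Step 1 inside a training loop is to add the L1 subgradient directly to the BN scale gradients after the backward pass; a hedged sketch (the model, optimizer, and `l1_lambda` value are placeholders):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_step(model, x, y, optimizer, l1_lambda=1e-4):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Sparsity-inducing term: add the subgradient of l1_lambda * |gamma|
    # to every BatchNorm scale parameter before the optimizer step.
    with torch.no_grad():
        for m in model.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.weight.grad.add_(l1_lambda * m.weight.sign())
    optimizer.step()
    return loss.item()
```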
LLM Structured Pruning
| Method | What's Pruned | Model | Pruning % | Quality |
|--------|-------------|-------|----------|--------|
| LLM-Pruner | Coupled structures | Llama-7B | 20-50% | Good |
| Sheared Llama | Width + depth | Llama-2 | 40-60% | Strong |
| SliceGPT | Embedding dimensions | Various | 25-30% | Good |
| LaCo | Layers (merge similar) | Various | 25-50% | Moderate |
| MiniLLM | Distill + prune | Various | 50-75% | Good |
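The depth-pruning row is the simplest to sketch. A minimal example (assuming the decoder is exposed as an `nn.ModuleList` of blocks and that a per-layer `importance` score has already been computed; both names are hypothetical) keeping 24 of 32 layers:
```python
import torch
import torch.nn as nn

def prune_layers(blocks: nn.ModuleList, importance: torch.Tensor, keep: int) -> nn.ModuleList:
    # Keep the `keep` highest-scoring layers, preserving their original order.
    idx = importance.argsort(descending=True)[:keep].sort().values
    return nn.ModuleList(blocks[i] for i in idx.tolist())

# e.g. model.layers = prune_layers(model.layers, layer_importance, keep=24)
```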
Pruning + Fine-tuning Pipeline
```
[Pretrained model]
↓
[Compute importance scores for all structures]
↓
[Remove lowest-importance structures]
↓
[Fine-tune on subset of training data (1-10%)]
↓
[Repeat if needed (iterative pruning)]
↓
[Pruned model: 30-70% smaller, ~1-3% accuracy loss]
```
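A hedged sketch of this loop as a function (the `score_fn`, `remove_fn`, and `finetune_fn` callables are placeholders for whichever importance criterion, surgery routine, and trainer you use):
```python
def iterative_prune(model, score_fn, remove_fn, finetune_fn, target_ratio, steps=4):
    # Remove an equal fraction of the *remaining* structures each round so that
    # the compound removal reaches target_ratio: (1 - per_step) ** steps == 1 - target_ratio.
    per_step = 1.0 - (1.0 - target_ratio) ** (1.0 / steps)
    for _ in range(steps):
        scores = score_fn(model)                # e.g. L1 norm or Taylor importance
        model = remove_fn(model, scores, per_step)
        model = finetune_fn(model)              # short fine-tune on 1-10% of the data
    return model
```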
Speedup Results
| Model | Pruning Ratio | Accuracy Retention | Actual Speedup |
|-------|-------------|-------------------|---------------|
| ResNet-50 (30% filter pruning) | 30% | 99% of original | 1.4× |
| ResNet-50 (50% filter pruning) | 50% | 97% of original | 2.0× |
| BERT (40% attention heads) | 40% | 98.5% of original | 1.5× |
| Llama-7B → 5.5B (Sheared) | 20% | 96% of original | 1.3× |
Structured pruning is the most practical path to neural network compression on standard hardware. By removing entire architectural units rather than individual weights, it produces genuinely smaller, faster models that accelerate on any device without sparse computation libraries, making it the go-to technique for deploying efficient models on GPUs, CPUs, and mobile devices, where real-world speedup matters more than theoretical sparsity ratios.