# Model Pruning and Compression
## What is Pruning?
Pruning removes unnecessary weights or structures from a neural network to shrink its size and speed up inference, ideally with little loss in accuracy.
## Pruning Types
### Unstructured Pruning
Remove individual weights:
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

fc = nn.Linear(128, 64)
# Prune the 50% of weights with the lowest L1 magnitude
prune.l1_unstructured(fc, name="weight", amount=0.5)
# The binary mask (0 = pruned) is stored as a buffer on the module
print(fc.weight_mask)
```
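Note that pruning via `torch.nn.utils.prune` is reparametrized rather than destructive: the mask is applied on the fly during the forward pass. To bake the zeros into the weight tensor and drop the mask, continuing the `fc` example above:
```python
# Make the pruning permanent: folds weight_orig * weight_mask
# into fc.weight and removes the reparametrization buffers
prune.remove(fc, "weight")
```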
### Structured Pruning
Remove entire structures such as channels, filters, or attention heads:
```python
# Prune specific attention heads per layer,
# e.g. heads_to_prune = {0: [1, 3], 5: [0]}
def prune_heads(model, heads_to_prune):
    for layer_idx, head_indices in heads_to_prune.items():
        # Assumes each attention module exposes a prune_heads()
        # method, as Hugging Face Transformers layers do
        model.layers[layer_idx].attention.prune_heads(head_indices)
```
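PyTorch also ships structured pruning directly; a minimal sketch that masks entire output channels of a convolution by L2 norm:
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3)
# Zero out the 25% of output channels (dim=0) with the
# smallest L2 norm; whole filters are masked at once
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
```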
## Pruning Criteria
| Criterion | Prunes weights with... |
|-----------|------------------------|
| Magnitude | Smallest absolute values |
| Gradient | Smallest gradient impact |
| Activation | Least-activated neurons |
| Taylor | Smallest first-order Taylor estimate of loss change |
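As a concrete illustration, a sketch of the magnitude and first-order Taylor scores (the helper names are mine, not a library API; `taylor_scores` assumes a backward pass has already populated the gradients):
```python
import torch
import torch.nn as nn

def magnitude_scores(module: nn.Linear) -> torch.Tensor:
    # Magnitude criterion: importance = |w|
    return module.weight.detach().abs()

def taylor_scores(module: nn.Linear) -> torch.Tensor:
    # First-order Taylor criterion: importance ~ |w * dL/dw|,
    # the estimated change in loss if the weight were zeroed
    return (module.weight * module.weight.grad).detach().abs()
```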
## One-Shot vs Iterative
```python
# `prune` and `finetune` are placeholders for your own
# pruning and finetuning routines

# One-shot: prune all at once, then recover accuracy
pruned_model = prune(model, amount=0.5)
pruned_model = finetune(pruned_model)

# Iterative: prune gradually, finetuning between rounds
for _ in range(iterations):
    model = prune(model, amount=0.1)  # 10% each round
    model = finetune(model)
```
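A runnable version of the iterative loop using PyTorch's built-in global magnitude pruning (repeated calls compound, so each round removes 10% of the weights that remain; finetuning is elided):
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
params = [(m, "weight") for m in model if isinstance(m, nn.Linear)]

for _ in range(5):
    # Remove the 10% smallest-magnitude weights across all layers
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=0.1
    )
    # ...finetune the model here before the next round...
```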
## SparseGPT
SparseGPT performs efficient one-shot pruning of LLMs using second-order (Hessian) information:
```python
# Conceptual sketch only: the real algorithm works on the
# inverse-Hessian diagonal and processes columns in blocks;
# compute_hessian and compute_threshold are placeholders
def sparse_gpt_prune(layer, sparsity):
    W = layer.weight
    # Hessian of the layer-wise reconstruction loss,
    # estimated from the layer's input activations
    H = compute_hessian(layer)
    for col in range(W.shape[1]):
        # OBS-style saliency: estimated error from zeroing a weight
        scores = W[:, col] ** 2 / H.diagonal()[col]
        threshold = compute_threshold(scores, sparsity)
        mask = scores > threshold
        # Zero the low-saliency weights; SparseGPT then updates the
        # remaining weights to compensate for the pruning error
        W[:, col] *= mask
```
## Other Compression Techniques
| Technique | Description |
|-----------|-------------|
| Quantization | Reduce numeric precision (e.g. FP16, INT8) |
| Distillation | Train a smaller student to mimic the large model |
| Low-rank factorization | Decompose weight matrices into smaller factors |
| Weight sharing | Reuse the same weights across layers or positions |
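To make low-rank factorization concrete, a minimal sketch that splits one `nn.Linear` into two via truncated SVD (`low_rank_factorize` is an illustrative helper, not a library function):
```python
import torch
import torch.nn as nn

def low_rank_factorize(fc: nn.Linear, rank: int) -> nn.Sequential:
    # W ~ (U[:, :r] @ diag(S[:r])) @ Vh[:r, :]
    U, S, Vh = torch.linalg.svd(fc.weight, full_matrices=False)
    first = nn.Linear(fc.in_features, rank, bias=False)
    second = nn.Linear(rank, fc.out_features, bias=fc.bias is not None)
    with torch.no_grad():
        first.weight.copy_(Vh[:rank, :])
        second.weight.copy_(U[:, :rank] * S[:rank])
        if fc.bias is not None:
            second.bias.copy_(fc.bias)
    return nn.Sequential(first, second)

# 1024*1024 weights become 2 * (1024*64), roughly 8x fewer
small = low_rank_factorize(nn.Linear(1024, 1024), rank=64)
```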
## Sparsity Formats
| Format | Use Case |
|--------|----------|
| Dense + mask | Simple, flexible |
| CSR/CSC | Unstructured sparse |
| Block sparse | Hardware accelerated |
| N:M sparsity | NVIDIA Ampere/Ada |
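For instance, converting a masked dense tensor to CSR with PyTorch's built-in sparse support:
```python
import torch

w = torch.randn(4, 8)
w[w.abs() < 0.5] = 0.0       # zeros from a dense mask
w_csr = w.to_sparse_csr()    # compressed sparse row storage
# Only the non-zero values (plus index arrays) are stored
print(w_csr.values().numel(), "non-zeros stored")
```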
## N:M Sparsity (NVIDIA)
2:4 sparsity: at most 2 non-zero values in every block of 4 consecutive weights.
- Hardware-accelerated by Sparse Tensor Cores on Ampere and later GPUs (A100, H100)
- 2x theoretical speedup on matrix multiplies
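A minimal pure-PyTorch sketch of enforcing the 2:4 pattern by keeping the two largest-magnitude weights per group of four (real tooling such as NVIDIA ASP also handles layout and finetuning; `enforce_2_to_4` is an illustrative name):
```python
import torch

def enforce_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    # Assumes weight.numel() is divisible by 4
    groups = weight.reshape(-1, 4)
    # Keep the 2 largest-magnitude entries in each group of 4
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (groups * mask).reshape(weight.shape)

w24 = enforce_2_to_4(torch.randn(64, 64))
assert (w24.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```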
## Tools
| Tool | Purpose |
|------|---------|
| torch.nn.utils.prune | PyTorch's built-in pruning module |
| Neural Magic | Sparse inference (DeepSparse engine) |
| SparseML | Sparsity recipes and pipelines |
| NVIDIA ASP | Automatic SParsity (2:4) in Apex |
## Best Practices
- Start with structured pruning when you need real speedups; unstructured sparsity rarely accelerates dense hardware
- Finetune after pruning to recover accuracy
- Use gradual (iterative) pruning to reach high sparsity levels
- Consider N:M sparsity on NVIDIA GPUs with Sparse Tensor Cores
- Combine pruning with quantization for maximum compression, as sketched below
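For the last point, a sketch combining magnitude pruning with PyTorch's dynamic INT8 quantization (finetuning between the steps is omitted):
```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Prune 50% of each Linear layer's weights, then bake in the zeros
for m in model:
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")

# 2) Quantize the remaining weights to INT8 for inference
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```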