# Model Pruning and Compression
## What is Pruning?
Pruning removes unnecessary weights or structures from a neural network to shrink its size and speed up inference, ideally with little loss in accuracy.
## Pruning Types
### Unstructured Pruning
Remove individual weights:
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

fc = nn.Linear(128, 64)
# Prune the 50% of weights with the lowest L1 magnitude
prune.l1_unstructured(fc, name="weight", amount=0.5)
# The binary mask (0 = pruned) is stored as a buffer on the module
print(fc.weight_mask)
```
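Note that pruning via `torch.nn.utils.prune` is reparametrized rather than destructive: the mask is applied on the fly during the forward pass. To bake the zeros into the weight tensor and drop the mask, continuing the `fc` example above:
```python
# Make the pruning permanent: folds weight_orig * weight_mask
# into fc.weight and removes the reparametrization buffers
prune.remove(fc, "weight")
```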
### Structured Pruning
Remove entire structures such as channels, filters, or attention heads:
```python
# Prune specific attention heads per layer,
# e.g. heads_to_prune = {0: [1, 3], 5: [0]}
def prune_heads(model, heads_to_prune):
    for layer_idx, head_indices in heads_to_prune.items():
        # Assumes each attention module exposes a prune_heads()
        # method, as Hugging Face Transformers layers do
        model.layers[layer_idx].attention.prune_heads(head_indices)
```
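PyTorch also ships structured pruning directly; a minimal sketch that masks entire output channels of a convolution by L2 norm:
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, kernel_size=3)
# Zero out the 25% of output channels (dim=0) with the
# smallest L2 norm; whole filters are masked at once
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
```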
## Pruning Criteria
| Criterion | Prunes weights with... |
|-----------|------------------------|
| Magnitude | Smallest absolute values |
| Gradient | Smallest gradient impact |
| Activation | Least-activated neurons |
| Taylor | Smallest first-order Taylor estimate of loss change |
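As a concrete illustration, a sketch of the magnitude and first-order Taylor scores (the helper names are mine, not a library API; `taylor_scores` assumes a backward pass has already populated the gradients):
```python
import torch
import torch.nn as nn

def magnitude_scores(module: nn.Linear) -> torch.Tensor:
    # Magnitude criterion: importance = |w|
    return module.weight.detach().abs()

def taylor_scores(module: nn.Linear) -> torch.Tensor:
    # First-order Taylor criterion: importance ~ |w * dL/dw|,
    # the estimated change in loss if the weight were zeroed
    return (module.weight * module.weight.grad).detach().abs()
```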
## One-Shot vs Iterative
```python
# `prune` and `finetune` are placeholders for your own
# pruning and finetuning routines

# One-shot: prune all at once, then recover accuracy
pruned_model = prune(model, amount=0.5)
pruned_model = finetune(pruned_model)

# Iterative: prune gradually, finetuning between rounds
for _ in range(iterations):
    model = prune(model, amount=0.1)  # 10% each round
    model = finetune(model)
```
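A runnable version of the iterative loop using PyTorch's built-in global magnitude pruning (repeated calls compound, so each round removes 10% of the weights that remain; finetuning is elided):
```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
params = [(m, "weight") for m in model if isinstance(m, nn.Linear)]

for _ in range(5):
    # Remove the 10% smallest-magnitude weights across all layers
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=0.1
    )
    # ...finetune the model here before the next round...
```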
## SparseGPT
SparseGPT performs efficient one-shot pruning of LLMs using second-order (Hessian) information:
```python
# Conceptual sketch only: the real algorithm works on the
# inverse-Hessian diagonal and processes columns in blocks;
# compute_hessian and compute_threshold are placeholders
def sparse_gpt_prune(layer, sparsity):
    W = layer.weight
    # Hessian of the layer-wise reconstruction loss,
    # estimated from the layer's input activations
    H = compute_hessian(layer)
    for col in range(W.shape[1]):
        # OBS-style saliency: estimated error from zeroing a weight
        scores = W[:, col] ** 2 / H.diagonal()[col]
        threshold = compute_threshold(scores, sparsity)
        mask = scores > threshold
        # Zero the low-saliency weights; SparseGPT then updates the
        # remaining weights to compensate for the pruning error
        W[:, col] *= mask
```
## Other Compression Techniques
| Technique | Description |
|-----------|-------------|
| Quantization | Reduce numeric precision (e.g. FP16, INT8) |
| Distillation | Train a smaller student to mimic the large model |
| Low-rank factorization | Decompose weight matrices into smaller factors |
| Weight sharing | Reuse the same weights across layers or positions |
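To make low-rank factorization concrete, a minimal sketch that splits one `nn.Linear` into two via truncated SVD (`low_rank_factorize` is an illustrative helper, not a library function):
```python
import torch
import torch.nn as nn

def low_rank_factorize(fc: nn.Linear, rank: int) -> nn.Sequential:
    # W ~ (U[:, :r] @ diag(S[:r])) @ Vh[:r, :]
    U, S, Vh = torch.linalg.svd(fc.weight, full_matrices=False)
    first = nn.Linear(fc.in_features, rank, bias=False)
    second = nn.Linear(rank, fc.out_features, bias=fc.bias is not None)
    with torch.no_grad():
        first.weight.copy_(Vh[:rank, :])
        second.weight.copy_(U[:, :rank] * S[:rank])
        if fc.bias is not None:
            second.bias.copy_(fc.bias)
    return nn.Sequential(first, second)

# 1024*1024 weights become 2 * (1024*64), roughly 8x fewer
small = low_rank_factorize(nn.Linear(1024, 1024), rank=64)
```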
## Sparsity Formats
| Format | Use Case |
|--------|----------|
| Dense + mask | Simple, flexible |
| CSR/CSC | Unstructured sparse |
| Block sparse | Hardware accelerated |
| N:M sparsity | NVIDIA Ampere/Ada |
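For instance, converting a masked dense tensor to CSR with PyTorch's built-in sparse support:
```python
import torch

w = torch.randn(4, 8)
w[w.abs() < 0.5] = 0.0       # zeros from a dense mask
w_csr = w.to_sparse_csr()    # compressed sparse row storage
# Only the non-zero values (plus index arrays) are stored
print(w_csr.values().numel(), "non-zeros stored")
```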
## N:M Sparsity (NVIDIA)
2:4 sparsity: at most 2 non-zero values in every block of 4 consecutive weights.
- Hardware-accelerated by Sparse Tensor Cores on Ampere and later GPUs (A100, H100)
- 2x theoretical speedup on matrix multiplies
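A minimal pure-PyTorch sketch of enforcing the 2:4 pattern by keeping the two largest-magnitude weights per group of four (real tooling such as NVIDIA ASP also handles layout and finetuning; `enforce_2_to_4` is an illustrative name):
```python
import torch

def enforce_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    # Assumes weight.numel() is divisible by 4
    groups = weight.reshape(-1, 4)
    # Keep the 2 largest-magnitude entries in each group of 4
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(1, keep, True)
    return (groups * mask).reshape(weight.shape)

w24 = enforce_2_to_4(torch.randn(64, 64))
assert (w24.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```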
## Tools
| Tool | Purpose |
|------|---------|
| torch.nn.utils.prune | PyTorch's built-in pruning module |
| Neural Magic | Sparse inference (DeepSparse engine) |
| SparseML | Sparsity recipes and pipelines |
| NVIDIA ASP | Automatic SParsity (2:4) in Apex |
## Best Practices
- Start with structured pruning when you need real speedups; unstructured sparsity rarely accelerates dense hardware
- Finetune after pruning to recover accuracy
- Use gradual (iterative) pruning to reach high sparsity levels
- Consider N:M sparsity on NVIDIA GPUs with Sparse Tensor Cores
- Combine pruning with quantization for maximum compression, as sketched below
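For the last point, a sketch combining magnitude pruning with PyTorch's dynamic INT8 quantization (finetuning between the steps is omitted):
```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Prune 50% of each Linear layer's weights, then bake in the zeros
for m in model:
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")

# 2) Quantize the remaining weights to INT8 for inference
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```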