Model Pruning and Compression

What is Pruning?

Pruning removes unnecessary weights or structures from a neural network to reduce its size and speed up inference.

Pruning Types

Unstructured Pruning

Remove individual weights:

import torch.nn.utils.prune as prune

# Prune the 50% of weights in model.fc with the lowest magnitude (L1 norm)
prune.l1_unstructured(model.fc, name="weight", amount=0.5)

# Inspect the binary pruning mask (1 = kept, 0 = pruned)
model.fc.weight_mask
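As a minimal sketch of how the mask behaves in practice, the snippet below prunes a small stand-alone `nn.Linear` layer (an assumption standing in for `model.fc`), measures the resulting sparsity, and then calls `prune.remove` to make the pruning permanent. The layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
fc = nn.Linear(8, 4)  # hypothetical stand-in for model.fc (32 weights)

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(fc, name="weight", amount=0.5)

# While pruning is active, `weight` is recomputed as weight_orig * weight_mask.
has_mask = hasattr(fc, "weight_mask")
sparsity = (fc.weight == 0).float().mean().item()

# Fold the mask into the weight tensor and drop the reparametrization.
prune.remove(fc, "weight")
is_permanent = not hasattr(fc, "weight_mask")
sparsity_after = (fc.weight == 0).float().mean().item()

print(sparsity, sparsity_after)
```

Until `prune.remove` is called, the original values are still stored in `weight_orig`, so the mask alone does not reduce memory use.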

Structured Pruning

Remove entire channels/heads:

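# Prune attention heads
# heads_to_prune maps a layer index to the list of head indices to remove,
# e.g. {0: [1, 3], 5: [0]}. The model is assumed to expose
# model.layers[i].attention.prune_heads(); adapt this to your architecture.
def prune_heads(model, heads_to_prune):
    for layer_idx, head_indices in heads_to_prune.items():
        model.layers[layer_idx].attention.prune_heads(head_indices)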
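To make "removing entire channels" concrete, here is a small pure-Python sketch that ranks the output channels (rows) of a weight matrix by L2 norm and physically drops the weakest ones. The function name `prune_rows_by_norm` and the toy matrix are illustrative and not part of any library.

```python
import math

def prune_rows_by_norm(weight, amount):
    """Drop the fraction `amount` of rows with the smallest L2 norm.

    Returns the smaller matrix and the indices of the rows that were kept.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    n_prune = int(round(amount * len(weight)))
    # Row indices sorted from weakest to strongest channel.
    order = sorted(range(len(weight)), key=lambda i: norms[i])
    kept = sorted(order[n_prune:])
    return [weight[i] for i in kept], kept

weight = [
    [0.9, -1.2, 0.4],     # strong channel
    [0.01, 0.02, -0.01],  # weak channel
    [-0.7, 0.3, 1.1],     # strong channel
    [0.05, -0.03, 0.02],  # weak channel
]
pruned, kept = prune_rows_by_norm(weight, amount=0.5)
print(kept)    # [0, 2]
print(pruned)
```

Unlike a mask, the result is a genuinely smaller dense matrix, so it runs faster without sparse kernels. The input dimension of the following layer has to be trimmed to the same `kept` indices.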

Pruning Criteria

Criterion | Prune by
Magnitude | Smallest absolute weights
Gradient | Smallest gradient impact
Activation | Least activated neurons
Taylor | First-order Taylor approximation
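The magnitude and Taylor criteria can rank the same weights differently. The sketch below, using made-up weight and gradient values, scores each weight by |w| and by the first-order Taylor estimate |w * dL/dw| of how much the loss changes when that weight is set to zero. All function names are illustrative.

```python
def magnitude_scores(weights):
    return [abs(w) for w in weights]

def taylor_scores(weights, grads):
    # First-order estimate of |L(w) - L(w with w_i = 0)| is |w_i * dL/dw_i|.
    return [abs(w * g) for w, g in zip(weights, grads)]

def lowest(scores):
    """Index of the weight that would be pruned first."""
    return min(range(len(scores)), key=lambda i: scores[i])

# Illustrative values only.
weights = [0.05, -0.80, 0.30, 1.50]
grads = [2.00, 0.10, 0.50, 0.01]

mag = magnitude_scores(weights)
tay = taylor_scores(weights, grads)
print(lowest(mag))  # 0: smallest |w|
print(lowest(tay))  # 3: large weight, but the loss barely depends on it
```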

One-Shot vs Iterative

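# prune_model() and finetune() are placeholder helpers, not library functions
# (named prune_model to avoid shadowing the torch.nn.utils.prune module).

# One-shot: prune to the target sparsity in a single step, then fine-tune
pruned_model = prune_model(model, amount=0.5)
pruned_model = finetune(pruned_model)

# Iterative: prune gradually, fine-tuning after each step
for _ in range(iterations):
    model = prune_model(model, amount=0.1)  # 10% each time
    model = finetune(model)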
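One detail worth checking in the iterative loop: assuming each round prunes 10% of the weights that are still non-zero, the overall sparsity after n rounds is 1 - 0.9^n, not n x 10%. The helper names below are illustrative.

```python
def cumulative_sparsity(amount_per_round, rounds):
    """Overall fraction of zeroed weights when each round prunes
    `amount_per_round` of the weights that are still non-zero."""
    remaining = 1.0
    for _ in range(rounds):
        remaining *= 1.0 - amount_per_round
    return 1.0 - remaining

def rounds_to_reach(target, amount_per_round):
    """Smallest number of rounds whose cumulative sparsity meets `target`."""
    rounds = 0
    while cumulative_sparsity(amount_per_round, rounds) < target:
        rounds += 1
    return rounds

print(round(cumulative_sparsity(0.1, 5), 4))  # 0.4095
print(rounds_to_reach(0.5, 0.1))              # 7
```

Under this assumption, matching the 50% one-shot target takes seven rounds of 10% pruning rather than five.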

SparseGPT

Efficient one-shot pruning for LLMs:

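# Conceptual sketch: uses second-order (Hessian) information.
# compute_hessian, cholesky_of_inverse and compute_threshold are placeholder helpers.
def sparse_gpt_prune(layer, sparsity):
    W = layer.weight.data                # (out_features, in_features)
    H = compute_hessian(layer)           # layer-wise Hessian from calibration inputs
    Hinv = cholesky_of_inverse(H)        # upper Cholesky factor of H^-1

    for col in range(W.shape[1]):
        d = Hinv[col, col]

        # Find which weights to prune: saliency w^2 / d^2
        scores = W[:, col] ** 2 / d ** 2
        threshold = compute_threshold(scores, sparsity)
        mask = scores > threshold

        # Prune, then update the remaining (not yet visited) columns
        # to compensate for the error this introduces
        err = W[:, col] * ~mask / d
        W[:, col] *= mask
        W[:, col + 1:] -= err.unsqueeze(1) * Hinv[col, col + 1:].unsqueeze(0)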

Other Compression Techniques

Technique | Description
Quantization | Reduce precision (FP16, INT8)
Distillation | Train smaller model
Low-rank factorization | Decompose weight matrices
Weight sharing | Reuse weights
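Of the techniques in the table, quantization is the easiest to illustrate in a few lines. The following is a minimal sketch of symmetric per-tensor INT8 quantization. The function names are illustrative, and real toolchains add calibration, per-channel scales, and more.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to the range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.0, 0.731]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer step bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)
```

Each weight is stored as one signed byte plus a single shared floating-point `scale`, which is where the size reduction comes from.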

Sparsity Formats

Format | Use Case
Dense + mask | Simple, flexible
CSR/CSC | Unstructured sparse
Block sparse | Hardware accelerated
N:M sparsity | NVIDIA Ampere/Ada
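To show what the CSR row in the table stores, here is a pure-Python sketch that converts a pruned dense matrix into the three CSR arrays and multiplies it by a vector while touching only the non-zero entries. The function names are illustrative. In practice a library such as SciPy or PyTorch would handle this.

```python
def dense_to_csr(matrix):
    """Convert a list-of-lists matrix to CSR arrays (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(col)
        # row_ptr[i + 1] marks where row i ends in `values`.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr

def csr_matvec(values, col_indices, row_ptr, x):
    """Multiply the CSR matrix by a dense vector, visiting only non-zeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_indices[k]] for k in range(start, end)))
    return y

pruned = [
    [0.0, 2.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, -3.0],
    [0.0, 0.0, 0.0, 0.0],
]
values, col_indices, row_ptr = dense_to_csr(pruned)
print(values)       # [2.0, 1.5, -3.0]
print(col_indices)  # [1, 0, 3]
print(row_ptr)      # [0, 1, 3, 3]
result = csr_matvec(values, col_indices, row_ptr, [1.0, 1.0, 1.0, 1.0])
```

CSR stores one value and one column index per non-zero plus one pointer per row, so it only saves memory over the dense layout once the matrix is sufficiently sparse.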

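N:M Sparsity (NVIDIA)

2:4 sparsity: at most 2 non-zero values in every block of 4 consecutive elements. This is the pattern accelerated by the sparse Tensor Cores introduced with the Ampere architecture.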
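A minimal sketch of how a 2:4 pattern can be imposed by magnitude: in each group of four consecutive weights, keep the two with the largest absolute value and zero the rest. The function `prune_n_m` is illustrative and not an NVIDIA API.

```python
def prune_n_m(weights, n=2, m=4):
    """Keep the n largest-magnitude values in each consecutive group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    out = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Positions of the n largest |w| within the group.
        ranked = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(ranked[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

row = [0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.6]
sparse_row = prune_n_m(row)
print(sparse_row)  # [0.0, -0.9, 0.4, 0.0, 0.7, 0.0, 0.0, 0.6]
```

Because every group has the same number of non-zeros, the overall sparsity is fixed at 50% and the layout stays regular enough for hardware to exploit.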

Tools

Tool | Purpose
torch.nn.utils.prune | PyTorch pruning
Neural Magic | Sparse inference
SparseML | Sparsity recipes
NVIDIA ASP | Automatic sparsity

Best Practices
