Model Quantization is a compression technique that reduces neural network weight and activation precision from 32-bit floating-point to lower-bitwidth representations (INT8, INT4, or even binary), achieving 2-8× model size reduction and 2-4× inference speedup on hardware with integer compute units, with carefully managed accuracy degradation.
Quantization Fundamentals:
- Uniform Quantization: maps continuous float values to discrete integer levels at uniform intervals; q = round(x/scale + zero_point); scale = (max-min)/(2^bits - 1); covers the range linearly
- Symmetric vs Asymmetric: symmetric quantization uses zero_point=0 (range is [-max, max]); asymmetric uses non-zero offset for skewed distributions (e.g., ReLU activations are always non-negative); asymmetric is more precise for one-sided distributions
- Per-Tensor vs Per-Channel: per-tensor uses one scale for the entire tensor; per-channel uses different scales for each output channel of a weight tensor — per-channel captures weight distribution variation across channels, critical for accuracy in convolutional networks
- Calibration: determining scale and zero_point from representative data statistics; methods include MinMax (use the full range of observed values), percentile (clip the range at e.g. the 99.9th percentile to ignore outliers), and MSE minimization (choose the clipping range that minimizes quantization error); see the NumPy sketch after this list
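A minimal NumPy sketch of the formulas above, assuming MinMax calibration; the helper names (calibrate_asymmetric, quantize, dequantize) are illustrative rather than taken from any library:

```python
import numpy as np

def calibrate_asymmetric(x, bits=8):
    """MinMax calibration: map the observed [min, max] onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2**bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    return scale, zero_point

def calibrate_symmetric(x, bits=8):
    """Symmetric calibration: zero_point = 0, range [-max|x|, +max|x|]."""
    qmax = 2**(bits - 1) - 1            # 127 for INT8
    return np.abs(x).max() / qmax, 0

def quantize(x, scale, zero_point, bits=8, signed=False):
    qmin = -(2**(bits - 1)) if signed else 0
    qmax = 2**(bits - 1) - 1 if signed else 2**bits - 1
    q = np.round(x / scale + zero_point)      # q = round(x/scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Asymmetric quantization suits one-sided (ReLU-like) activation distributions
x = np.abs(np.random.randn(1024).astype(np.float32))
scale, zp = calibrate_asymmetric(x)
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
print("max abs error:", np.abs(x - x_hat).max())  # roughly bounded by scale/2
```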
Post-Training Quantization (PTQ):
- Static PTQ: calibrate quantization parameters on a representative dataset; all weights and activations quantized to fixed integers at inference; requires 100-1000 calibration samples; typically achieves <1% accuracy loss for INT8 on vision models
- Dynamic PTQ: weights quantized statically; activations quantized dynamically at inference based on observed range per batch or per-token; slightly higher overhead but adapts to input-dependent activation distributions (a PyTorch sketch follows this list)
- GPTQ (LLM-Specific): layer-wise quantization using second-order information (Hessian); quantizes weights column-by-column while compensating for quantization error in remaining columns; enables INT4 weight quantization of LLMs with minimal perplexity increase
- AWQ (Activation-Aware Weight Quantization): identifies salient weight channels by analyzing activation magnitudes; scales salient channels up before quantization to preserve their precision — 4-bit LLM quantization with better quality than uniform rounding
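As a concrete example of dynamic PTQ from the list above, a minimal sketch using PyTorch's built-in torch.ao.quantization.quantize_dynamic (assuming a recent PyTorch version and CPU inference): Linear weights are converted to INT8 once, and activation scales are computed on the fly per batch.

```python
import torch
import torch.nn as nn

# A toy FP32 model; in practice this would be a pretrained network in eval mode
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic PTQ: weights of the listed module types are quantized to INT8 ahead of time,
# activation ranges are observed at runtime for each batch
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},          # module types to quantize
    dtype=torch.qint8,
)

x = torch.randn(4, 512)
print(quantized(x).shape)   # same interface as the FP32 model
print(quantized)            # Linear layers replaced by dynamically quantized versions
```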
Quantization-Aware Training (QAT):
- Simulated Quantization: insert quantization-dequantization (fake quantization) operations during training; forward pass uses quantized values, backward pass uses straight-through estimator (STE) to approximate the gradient through the non-differentiable rounding operation (see the sketch after this list)
- Benefits: model learns to compensate for quantization error during training; typically recovers 0.5-2% accuracy over PTQ for aggressive quantization (INT4, INT2); essential when PTQ accuracy loss is unacceptable
- Computation Cost: QAT requires full retraining or fine-tuning (10-100 epochs); 2-3× more expensive than standard training due to additional quantization operations; justified only when PTQ fails to meet accuracy targets
- Mixed-Precision QAT: different layers quantized to different bitwidths based on sensitivity analysis; first and last layers often kept at higher precision (INT8) while middle layers use INT4; automated mixed-precision search finds optimal per-layer bitwidth allocation
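A hand-rolled sketch of fake quantization with a straight-through estimator in PyTorch (an illustration of the idea, not the torch.ao.quantization QAT API): the forward pass sees symmetrically rounded weights, while gradients flow back to the FP32 master weights as if rounding were the identity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(x, bits=8):
    """Simulate symmetric quantization in the forward pass; STE in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale
    # Straight-through estimator: forward returns the quantized value q,
    # backward treats round() as identity so gradients reach x unchanged
    return x + (q - x).detach()

class QATLinear(nn.Module):
    """Linear layer whose weights are fake-quantized during training."""
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        w_q = fake_quantize(self.linear.weight, self.bits)
        return F.linear(x, w_q, self.linear.bias)

layer = QATLinear(128, 64, bits=4)
loss = layer(torch.randn(32, 128)).pow(2).mean()
loss.backward()                                # gradients reach the FP32 master weights
print(layer.linear.weight.grad is not None)    # True
```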
Hardware Acceleration:
- INT8 Tensor Cores: NVIDIA A100/H100 Tensor Cores achieve 2× throughput for INT8 vs FP16 GEMM (624 TOPS vs 312 TFLOPS on A100); inference frameworks like TensorRT automatically leverage INT8 operations
- INT4 Support: specialized hardware (Qualcomm Hexagon DSP, Apple Neural Engine) provides INT4 compute; GPU support emerging through packed INT4 operations and lookup-table-based computation (see the packing sketch after this list)
- Inference Frameworks: TensorRT, ONNX Runtime, OpenVINO, and llama.cpp provide optimized quantized kernels; automatic graph optimization fuses quantize/dequantize operations with compute kernels to minimize overhead
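To illustrate the packed INT4 storage referenced above, a simplified NumPy sketch that packs two unsigned 4-bit values into each byte and unpacks them for compute; real kernels (e.g. in TensorRT or llama.cpp) additionally fuse dequantization into the matmul, which is not shown here.

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of values in [0, 15] into bytes, two nibbles per byte."""
    q = q.astype(np.uint8)
    return (q[0::2] & 0x0F) | ((q[1::2] & 0x0F) << 4)

def unpack_int4(packed):
    """Inverse of pack_int4: recover the two 4-bit values stored in each byte."""
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = (packed >> 4) & 0x0F
    return out

# Symmetric 4-bit weight quantization to [-8, 7], shifted to unsigned [0, 15] for packing
w = np.random.randn(4096).astype(np.float32)
scale = np.abs(w).max() / 7.0
q = np.clip(np.round(w / scale), -8, 7).astype(np.int8) + 8
packed = pack_int4(q)                                    # 2 KB instead of 16 KB of FP32
w_hat = (unpack_int4(packed).astype(np.int8) - 8) * scale
print(packed.nbytes, np.abs(w - w_hat).max())
```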
Model quantization is among the most practical and widely deployed techniques for efficient neural network inference, enabling deployment of large language models on consumer hardware (running 70B-parameter models on a laptop via INT4 quantization) and real-time inference on edge devices without prohibitive accuracy loss.