Embedding Compression and Dimensionality Reduction is the technique of reducing the size of learned vector representations while preserving the semantic relationships encoded in those representations. It enables lower storage costs, faster similarity search, reduced memory bandwidth, and improved interpretability, through methods ranging from classical linear projections (PCA) to modern learned compression techniques like Matryoshka Representation Learning.
Why Compress Embeddings
- Storage: 1M embeddings × 1536 dimensions × 4 bytes = 6GB → impractical for edge devices.
- Latency: Larger vectors → slower ANN search → higher query latency.
- Memory: GPU VRAM limits batch size for re-ranking → smaller embeddings → larger batches.
- Bandwidth: Embedding serving at scale → TB/day of data transfer.
PCA (Principal Component Analysis)
- Finds orthogonal directions of maximum variance in embedding space.
- Project n-dim embeddings onto top-k PCA components → k-dim representation.
- Linear, fast, interpretable → widely used for visualization (k=2 or 3).
- Limitation: Linear → cannot capture non-linear manifold structure.
```python
from sklearn.decomposition import PCA

# embeddings_train / embeddings_all: (N, 1536) float32 arrays of precomputed embeddings
pca = PCA(n_components=64)  # 1536 → 64 dims
pca.fit(embeddings_train)   # fit on a representative training sample
embeddings_compressed = pca.transform(embeddings_all)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```
UMAP and t-SNE (Visualization)
- t-SNE: Models pairwise similarities in high-dim and low-dim spaces → KL divergence minimization → 2D/3D visualization.
- Preserves local structure; clusters appear clearly → ideal for inspecting embedding quality.
- Slow: O(N²) naively; O(N log N) with Barnes-Hut; not suitable for large N.
- UMAP: Constructs fuzzy topological graph in high-dim → optimizes low-dim layout.
- Faster than t-SNE; better preserves global structure; can be used for compression, not just visualization (see the sketch after this list).
- Hyperparameters: n_neighbors (local vs global), min_dist (cluster spread).
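Beyond 2-D/3-D plots, UMAP can target higher output dimensionalities for compression. A minimal sketch, assuming the umap-learn package and a placeholder embedding matrix; the hyperparameter values are illustrative, not tuned:

```python
import numpy as np
import umap  # umap-learn package

embeddings = np.random.rand(1000, 1536).astype(np.float32)  # placeholder for real embeddings

reducer = umap.UMAP(
    n_neighbors=15,   # smaller → more local structure, larger → more global
    min_dist=0.1,     # how tightly points pack in the low-dim layout
    n_components=32,  # 2-3 for plots, higher (e.g., 32) for compression
    metric="cosine",
)
embeddings_32d = reducer.fit_transform(embeddings)  # (1000, 32)
```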
Matryoshka Representation Learning (MRL)
- Train single embedding model to produce representations at multiple resolutions simultaneously.
- Loss: Sum of losses at multiple truncation points: L = L_{d=8} + L_{d=16} + L_{d=32} + ... + L_{d=1536}.
- First 8 dimensions capture coarsest semantic structure; first 1536 capture finest detail.
- At inference: Use smaller prefix (e.g., 128-d) for fast approximate retrieval → rerank with full 1536-d (see the sketch after this list).
- OpenAI text-embedding-3 models use MRL → users can specify desired dimensions.
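Using a Matryoshka embedding at a lower resolution amounts to truncating a prefix and re-normalizing. A minimal numpy sketch of the coarse-retrieve / full-rerank pattern described above, with random placeholder vectors and an illustrative 128-d prefix:

```python
import numpy as np

def truncate_and_normalize(emb: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions of an MRL embedding and re-normalize to unit length."""
    prefix = emb[..., :dims]
    return prefix / np.linalg.norm(prefix, axis=-1, keepdims=True)

# Placeholder stand-ins for real MRL embeddings: unit-normalized 1536-d doc vectors plus one query.
docs = np.random.randn(10_000, 1536).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.01 * np.random.randn(1536).astype(np.float32)

# Stage 1: fast approximate retrieval with the 128-d prefix.
docs_128 = truncate_and_normalize(docs, 128)
q_128 = truncate_and_normalize(query, 128)
candidates = np.argsort(-(docs_128 @ q_128))[:100]

# Stage 2: re-rank the candidates with the full 1536-d vectors.
q_full = query / np.linalg.norm(query)
reranked = candidates[np.argsort(-(docs[candidates] @ q_full))]
```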
Product Quantization (PQ)
- Split d-dimensional vector into M subvectors of d/M dimensions each.
- Quantize each subvector into one of K centroids → represent with log₂K bits.
- Total bits: M × log₂K (instead of 32-bit floats × d).
- Example: 128-d, M=8, K=256 → 64 bits instead of 4096 bits → 64× compression (see the FAISS sketch below).
- Quality: Near-exact nearest neighbor retrieval; used in FAISS for billion-scale search.
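A minimal FAISS sketch of the example above (128-d vectors, M=8 subquantizers, 8 bits per subquantizer → 64-bit codes); the database and query vectors are random placeholders:

```python
import numpy as np
import faiss

d, M, nbits = 128, 8, 8                              # 8 subvectors × 2^8 centroids → 64 bits/vector
xb = np.random.rand(50_000, d).astype(np.float32)    # placeholder database vectors
xq = xb[:5] + 0.01                                   # a few perturbed queries

index = faiss.IndexPQ(d, M, nbits)
index.train(xb)            # learn the 8 codebooks (k-means per subvector)
index.add(xb)              # store 8-byte codes instead of 512-byte float vectors
distances, ids = index.search(xq, 10)                # approximate 10-nearest-neighbor search
```

For billion-scale workloads, FAISS typically combines PQ with an inverted-file coarse quantizer (IndexIVFPQ) so only a fraction of the codes is scanned per query.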
Knowledge Distillation for Embeddings
- Teacher: Large, high-quality embedding model (e.g., 7B LLM).
- Student: Smaller, faster model trained to match teacher's embeddings.
- Loss: MSE between teacher and student embeddings on same inputs.
- Result: a ~125M-parameter student can approach the quality of a 7B teacher at roughly 50× lower inference cost (training step sketched below).
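A minimal PyTorch sketch of the distillation objective above. The teacher and student here are stand-in linear layers rather than real transformer encoders, and the inputs are random feature tensors; in practice both models would encode the same batch of texts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholders: a real teacher/student would be transformer encoders (e.g., 7B vs. 125M params).
teacher = nn.Linear(768, 1024).requires_grad_(False)  # frozen, high-quality embedding model
student = nn.Linear(768, 1024)                        # smaller, faster model being trained
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

features = torch.randn(32, 768)                       # placeholder for encoded inputs
with torch.no_grad():
    t_emb = F.normalize(teacher(features), dim=-1)    # teacher embeddings (targets)
s_emb = F.normalize(student(features), dim=-1)        # student embeddings

loss = F.mse_loss(s_emb, t_emb)                       # distillation loss: match the teacher
optimizer.zero_grad()
loss.backward()
optimizer.step()
```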
Scalar and Binary Quantization
- Scalar (int8): Float32 → int8 per dimension → 4× compression, ~1% quality loss.
- Binary: Float → sign bit only → 32× compression, useful for coarse retrieval + re-ranking (both schemes are sketched below).
- FAISS supports both; binary quantization enables billion-scale retrieval on CPUs.
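A minimal numpy sketch of both schemes: per-dimension symmetric int8 quantization and sign-bit binarization with Hamming-distance search. The scaling here is simplified for illustration; production systems such as FAISS use calibrated quantizers:

```python
import numpy as np

emb = np.random.randn(1000, 1536).astype(np.float32)        # placeholder embeddings

# Scalar (int8): one scale per dimension → 4× compression.
scale = np.abs(emb).max(axis=0) / 127.0
emb_i8 = np.clip(np.round(emb / scale), -127, 127).astype(np.int8)
emb_approx = emb_i8.astype(np.float32) * scale               # approximate reconstruction

# Binary: keep only the sign bit → 32× compression (1536 bits = 192 bytes per vector).
emb_bits = np.packbits((emb > 0).astype(np.uint8), axis=1)

def hamming(query_bits: np.ndarray, db_bits: np.ndarray) -> np.ndarray:
    """Hamming distance between one packed query code and all packed database codes."""
    return np.unpackbits(np.bitwise_xor(query_bits, db_bits), axis=1).sum(axis=1)

dists = hamming(emb_bits[0][None, :], emb_bits)              # coarse ranking; re-rank top hits with floats
```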
Embedding compression and dimensionality reduction are the scaling layer that makes semantic search feasible at internet scale. By reducing 1536-dimensional embeddings to 128 dimensions with under 5% quality loss, or to binary hashes for coarse retrieval, these techniques let vector databases serve billions of documents on hardware that would be overwhelmed by raw full-precision embeddings, keeping the retrieval backbone of modern AI applications both affordable and fast enough to deliver millisecond-latency results in real-time, user-facing applications.