Contrastive Self-Supervised Learning

Keywords: contrastive representation learning, SimCLR, momentum contrast, NT-Xent loss, positive/negative pairs, projection head

Contrastive Self-Supervised Learning is a self-supervised framework in which models learn to distinguish augmented views of the same sample (positive pairs) from views of different samples (negative pairs), yielding visual representations that rival supervised pretraining without any labeled data.

Contrastive Learning Objective:
- Positive pairs: two augmented versions of same image; should have similar embeddings
- Negative pairs: augmentations of different images; should have dissimilar embeddings
- Contrastive loss: pull positive embeddings together; push negative embeddings apart
- Unsupervised signal: no labels required; matching views across augmentations supplies the training signal
- Representation quality: learned representations effectively capture visual structure and semantic information

NT-Xent Loss (Normalized Temperature-Scaled Cross Entropy):
- Softmax contrast: normalize similarity scores; apply softmax and cross-entropy loss
- NT-Xent formulation: for a positive pair (i, j), ℓ(i, j) = -log[exp(sim(z_i, z_j)/τ) / ∑_{k≠i} exp(sim(z_i, z_k)/τ)] (a runnable sketch follows this list)
- Temperature parameter: τ controls distribution sharpness; values such as 0.07 (MoCo) or 0.5 (SimCLR) are common; smaller τ weights hard negatives more heavily
- Similarity metric: usually cosine similarity between normalized embeddings
- Batch as negatives: positive pair from single image; 2N-2 negatives from other batch samples
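
A minimal NT-Xent sketch in PyTorch (the function name and dimensions are illustrative; `z1` and `z2` hold the embeddings of the two augmented views for a batch of N images):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over a batch of N positive pairs (2N views in total)."""
    N = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d), unit norm
    sim = z @ z.t() / temperature                       # scaled cosine sims
    # Exclude each view's similarity with itself from the softmax.
    mask = torch.eye(2 * N, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))
    # For anchor i the positive is the other view of the same image;
    # the remaining 2N - 2 views in the batch serve as negatives.
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(N)])
    return F.cross_entropy(sim, targets.to(sim.device))
```

With τ = 0.5 this matches the SimCLR default; MoCo-style setups typically use τ ≈ 0.07.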

SimCLR Framework:
- Large batch size: 4096 samples typical; large batch provides diverse negatives
- Strong augmentation: color jitter, random crops, Gaussian blur; augmentation strength crucial
- Non-linear projection head: a two-layer MLP whose hidden width exceeds the projection dimension; used only during pretraining and markedly improves downstream performance (sketched after this list)
- Contrastive training: benefits strongly from scale; in the SimCLR experiments, linear-evaluation accuracy improves steadily with larger batches and longer training
- Downstream fine-tuning: linear evaluation on frozen representations; evaluate transfer quality
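
A sketch of the projection head, assuming a ResNet-style 2048-d encoder feature as input (dimensions follow the SimCLR paper but are configurable):

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """SimCLR-style two-layer MLP projection head (dimensions illustrative:
    the paper maps a 2048-d ResNet feature to a 128-d contrastive space)."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        # The contrastive loss is applied to z = g(h);
        # downstream tasks use the encoder feature h directly.
        return self.net(h)
```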

Momentum Contrast (MoCo):
- Queue mechanism: maintain queue of previous embeddings; large dictionary without large batch
- Momentum encoder: slowly updated copy of the main encoder, maintained as an exponential moving average (EMA) of its weights (update sketched after this list)
- Key advantage: decouples dictionary size from batch size; enables large dictionaries with manageable batch sizes
- MoCo variants: MoCo v2 adds an MLP projection head and stronger augmentation; MoCo v3 drops the queue in favor of large-batch negatives and adapts the method to Vision Transformers
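
A minimal sketch of the momentum (EMA) update, assuming `encoder_q` and `encoder_k` are architecturally identical PyTorch modules:

```python
import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """EMA update of the key encoder from the query encoder (MoCo-style):
    theta_k <- m * theta_k + (1 - m) * theta_q."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)
```

With m = 0.999 the key encoder changes slowly, which keeps the queued embeddings consistent across training iterations.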

Contrastive Learning Variants:
- BYOL (Bootstrap Your Own Latent): no negative pairs; an online network learns to predict the output of a momentum-updated target network; that this avoids collapse was a surprising finding
- SimSiam: simplified BYOL that drops the momentum encoder; a stop-gradient on one branch plus a predictor head suffices, highlighting the role of architectural asymmetry (loss sketched after this list)
- SwAV: combines online clustering with contrastive learning; views are trained to predict each other's cluster (prototype) assignments, which act as self-labels
- DenseCL: dense prediction in contrastive learning; helps downstream dense prediction tasks
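
A minimal sketch of the SimSiam objective, assuming `p1, p2` are the predictor outputs and `z1, z2` the projector outputs for the two views:

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Symmetrized SimSiam objective: negative cosine similarity with a
    stop-gradient on the projector outputs (the key anti-collapse ingredient)."""
    def d(p, z):
        # z.detach() implements the stop-gradient on the target branch.
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * (d(p1, z2) + d(p2, z1))
```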

Representation Learning Insights:
- Invariance to augmentation: learned representation invariant to geometric/color transforms; semantic-preserving
- Feature reuse: representations learned via contrastive learning transfer well to downstream tasks
- Self-supervised parity: on standard benchmarks, contrastive pretraining without labels approaches the quality of supervised pretraining
- Scaling with model size: larger models benefit more from contrastive pretraining, narrowing the gap to supervised baselines as capacity grows

Downstream Fine-Tuning:
- Linear evaluation: freeze the representation; train only a linear classifier on the downstream task (sketched after this list)
- Full fine-tuning: also update the encoder parameters on the downstream task; typically yields further gains over linear evaluation
- Transfer quality: downstream accuracy reflects representation quality; benchmark for unsupervised method quality
- Task diversity: tested on classification, detection, segmentation; strong across diverse tasks
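
A linear-evaluation sketch in PyTorch; `encoder`, `loader`, and the hyperparameters are assumed and illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(encoder, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    """Linear evaluation: freeze the encoder, train only a linear classifier."""
    for p in encoder.parameters():
        p.requires_grad = False
    encoder.eval()
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                h = encoder(x)              # frozen representation
            loss = F.cross_entropy(clf(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```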

Positive Pair Construction:
- Image augmentation: random crops, color distortion, Gaussian blur; chosen to alter appearance while preserving semantic content (pipeline sketched after this list)
- Augmentation strength: stronger augmentation makes the pretext task harder and, up to a point, yields better learned features
- Domain-specific augmentation: video contrastive (temporal consistency), 3D point clouds (rotation-invariance)
- Negative pair sampling: importance sampling (hard negatives) vs uniform sampling (standard)
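
A SimCLR-style augmentation pipeline using torchvision; parameter values follow the paper's ImageNet defaults but should be tuned per domain:

```python
from torchvision import transforms

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    # Color jitter applied with probability 0.8, strength s = 1.0.
    transforms.RandomApply(
        [transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    # Kernel size of roughly 10% of the image side, per the paper.
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
])

# Two independent draws give the positive pair for one image:
# view1, view2 = simclr_augment(img), simclr_augment(img)
```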

Contrastive Learning Theory:
- Mutual information lower bound: the InfoNCE objective lower-bounds the mutual information between the two views (written out after this list)
- Optimal augmentation: theoretically, the best augmentation strength balances similarity between views against the information they retain
- Connection to noise-contrastive estimation: InfoNCE descends from NCE, which fits unnormalized probability models by discriminating data from noise samples
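
A commonly cited form of the bound, from the InfoNCE/CPC paper (van den Oord et al., 2018), written as a short math block:

```latex
% InfoNCE mutual-information bound: with one positive and N - 1 negatives
% per anchor, the InfoNCE loss L satisfies
I(v_1; v_2) \;\geq\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}
% so minimizing the loss pushes up a lower bound on the information
% shared between the two augmented views.
```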

Scaling to Billion-Parameter Models:
- Foundation models: CLIP, ALIGN, and LiT train paired image and text encoders with a contrastive objective
- Vision-language pretraining: contrastive alignment between images and their text descriptions (symmetric loss sketched after this list)
- Scale benefits: larger models, larger batches, more data → substantial improvements
- Emergent capabilities: scaling contrastive pretraining enables impressive zero-shot performance
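
A CLIP-style symmetric contrastive loss sketch, assuming `img_emb` and `txt_emb` are embeddings of N matched image-text pairs (names are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style sketch)."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(img.size(0))         # matched pairs on the diagonal
    # Cross-entropy in both directions: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```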

Contrastive self-supervised learning builds representations from augmentation-based positive and negative pairs, achieving quality competitive with supervised pretraining by maximizing agreement, and implicitly shared information, between augmented views.
