Contrastive Learning Frameworks (SimCLR, MoCo, DINO, BYOL)

Keywords: contrastive learning simclr moco, dino self supervised learning, byol contrastive framework, self supervised visual representation, contrastive loss infoNCE

Contrastive Learning Frameworks (SimCLR, MoCo, DINO, BYOL) is a family of self-supervised representation learning methods that train visual encoders without labeled data, classically by learning to distinguish similar (positive) pairs from dissimilar (negative) pairs, achieving representation quality that rivals or exceeds supervised pretraining on downstream vision tasks.

Contrastive Learning Foundations

Contrastive learning trains encoders to map augmented views of the same image (positive pairs) to nearby points in embedding space while pushing apart representations of different images (negative pairs). The InfoNCE loss treats the task as classification: for a query embedding q and positive key k+, minimize $-\log \frac{\exp(q \cdot k^{+} / \tau)}{\sum_i \exp(q \cdot k_i / \tau)}$, where τ is the temperature and the denominator sums over all keys, including the negatives. The quality of learned representations depends critically on augmentation strategies, negative sampling, and projection head design.
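
A minimal PyTorch sketch of this loss is shown below; the function name, tensor shapes, and default temperature are assumptions for illustration, with one positive key and K negative keys per query.

```python
# Minimal InfoNCE loss sketch (PyTorch). Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """query: (D,), positive_key: (D,), negative_keys: (K, D)."""
    query = F.normalize(query, dim=-1)
    positive_key = F.normalize(positive_key, dim=-1)
    negative_keys = F.normalize(negative_keys, dim=-1)

    # Similarity of the query to its positive key and to each negative key.
    pos_logit = (query * positive_key).sum(-1, keepdim=True)   # (1,)
    neg_logits = negative_keys @ query                          # (K,)
    logits = torch.cat([pos_logit, neg_logits]) / temperature   # (1 + K,)

    # InfoNCE is cross-entropy with the positive pair as the correct class.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```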

SimCLR: Simple Contrastive Learning of Representations

- Framework: Two random augmentations of the same image pass through a shared encoder (ResNet) and projection head (MLP); other images in the mini-batch serve as negatives
- Augmentation pipeline: Random crop and resize, strong color jitter (applied with probability 0.8), Gaussian blur, and random horizontal flip; cropping and color distortion are the most critical transformations (see the sketch after this list)
- Projection head: 2-layer MLP projects encoder features to 128-dim space where contrastive loss is computed; representations before projection head transfer better to downstream tasks
- Large batch requirement: Performance scales with batch size (4096-8192 used in practice); for a batch of N images, each augmented view is contrasted against the other 2N − 2 views in the batch as negatives
- SimCLR v2: Adds larger ResNet backbone, deeper projection head (3 layers), and MoCo-style momentum encoder, achieving 79.8% ImageNet linear evaluation accuracy
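
A hedged sketch of a SimCLR-style augmentation pipeline and projection head in PyTorch/torchvision follows; the kernel size, probabilities, and layer widths are illustrative rather than the exact published recipe.

```python
# SimCLR-style augmentations and 2-layer projection head (illustrative values).
import torch.nn as nn
from torchvision import transforms

s = 1.0  # color-distortion strength; jitter is applied with probability 0.8
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),  # roughly 10% of the 224px crop
    transforms.ToTensor(),
])

# The contrastive loss is computed on the 128-dim projection; downstream tasks
# typically reuse the 2048-dim ResNet-50 features from before this head.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(inplace=True),
    nn.Linear(2048, 128),
)
```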

MoCo: Momentum Contrast

- Queue-based negatives: Maintains a dictionary queue of 65,536 negative keys, decoupling negative count from batch size
- Momentum encoder: Key encoder updated as an exponential moving average of the query encoder weights (m=0.999), keeping the representations stored in the queue consistent (see the sketch after this list)
- Memory efficiency: Requires only standard batch sizes (256) unlike SimCLR's large batch dependency
- MoCo v2: Incorporates SimCLR improvements (stronger augmentation, MLP projection head), matching or exceeding SimCLR performance with far smaller batches (256 vs. 4096)
- MoCo v3: Extends to Vision Transformers (ViT) with patch-based processing and stability improvements for transformer training
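
The two mechanisms above can be sketched as follows (PyTorch, single-device, no distributed gathering); the buffer shapes and helper names are assumptions.

```python
# MoCo's momentum (EMA) update and FIFO queue of negative keys (simplified).
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # key_params <- m * key_params + (1 - m) * query_params
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1 - m)

@torch.no_grad()
def dequeue_and_enqueue(queue, queue_ptr, keys):
    """queue: (D, K) key buffer, queue_ptr: 1-element long tensor, keys: (B, D)."""
    batch_size = keys.shape[0]
    ptr = int(queue_ptr)
    # Overwrite the oldest keys with the newest batch (assumes K % B == 0).
    queue[:, ptr:ptr + batch_size] = keys.T
    queue_ptr[0] = (ptr + batch_size) % queue.shape[1]
```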

BYOL: Bootstrap Your Own Latent

- No negatives required: Achieves strong representations without negative pairs, challenging the assumption that contrastive learning requires negatives
- Asymmetric architecture: Online network (encoder + projector + predictor) learns to predict the target network's representations; target network is momentum-updated (EMA)
- Predictor prevents collapse: The additional predictor MLP in the online network, combined with a stop-gradient on the target, prevents representational collapse to a constant (see the loss sketch after this list)
- Performance: 74.3% ImageNet linear evaluation with ResNet-50—competitive with contrastive methods while simpler conceptually
- Batch normalization role: BatchNorm in the projector implicitly provides a form of contrastive signal through batch statistics; removing it can cause collapse
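
A sketch of the BYOL regression objective, assuming the online prediction and target projection have already been computed; the negative-cosine form below is equivalent to MSE between L2-normalized vectors.

```python
# BYOL loss: online predictor output regressed onto the target projection,
# with a stop-gradient on the target branch (PyTorch; networks omitted).
import torch.nn.functional as F

def byol_loss(online_prediction, target_projection):
    p = F.normalize(online_prediction, dim=-1)
    z = F.normalize(target_projection.detach(), dim=-1)  # stop-gradient on target
    return 2 - 2 * (p * z).sum(dim=-1).mean()

# In practice the loss is symmetrized over the two augmented views, and the
# target network's weights are an EMA of the online network (never updated by SGD).
```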

DINO: Self-Distillation with No Labels

- Self-distillation: Student and teacher networks (both ViT) process different crops of the same image; student trained to match teacher's output distribution via cross-entropy
- Multi-crop strategy: Teacher receives 2 global crops (224x224); student receives 2 global + several local crops (96x96)—local-to-global correspondence enables learning of spatial structure
- Emergent properties: DINO-trained ViTs spontaneously learn object segmentation—attention maps cleanly segment foreground objects without any segmentation supervision
- Centering and sharpening: Teacher outputs are centered (a running mean is subtracted) and sharpened (low softmax temperature) to prevent collapse to a uniform or single-dimension output (see the loss sketch after this list)
- DINOv2 (Meta, 2023): Scaled to ViT-g with curated LVD-142M dataset, producing frozen visual features competitive with fine-tuned models across dense and semantic tasks
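
The core DINO objective can be sketched as below; the temperatures and centering momentum are illustrative values, and `center` stands for the running-mean buffer described in the list above.

```python
# DINO self-distillation loss: sharpen and center the teacher, then match the
# student to it with cross-entropy (PyTorch; hyperparameters illustrative).
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    # Centering and low-temperature sharpening of the teacher jointly prevent
    # collapse to a uniform or single-dimension output.
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

def update_center(center, teacher_logits, momentum=0.9):
    # EMA update of the center from the current batch of teacher outputs.
    return center * momentum + teacher_logits.mean(dim=0, keepdim=True) * (1 - momentum)
```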

Downstream Transfer and Impact

- Linear evaluation protocol: Freeze the encoder and train a linear classifier on labeled data; this measures representation quality independent of fine-tuning capacity (see the sketch after this list)
- Semi-supervised learning: Contrastive pre-training dramatically improves accuracy with limited labels (1% or 10% ImageNet labels)
- Dense prediction: Contrastive features transfer to detection, segmentation, and depth estimation with minimal adaptation
- Foundation model pretraining: DINOv2 features serve as general-purpose visual representations competitive with CLIP for many tasks
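
A minimal sketch of the linear evaluation protocol, assuming a pretrained `encoder` and a labeled `train_loader`; the optimizer settings are placeholders.

```python
# Linear evaluation: freeze the pretrained encoder, train only a linear head.
import torch
import torch.nn as nn

def linear_eval(encoder, feature_dim, num_classes, train_loader, epochs=10):
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False                 # backbone stays frozen

    classifier = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                features = encoder(images)      # representations are not updated
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```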

Contrastive and self-distillation frameworks have fundamentally changed visual representation learning, proving that large-scale unlabeled data combined with carefully designed learning objectives can produce features rivaling decades of supervised pretraining research.
