Contrastive Learning

Contrastive learning is a self-supervised representation learning framework that trains neural networks to pull the representations of semantically similar (positive) pairs close together in embedding space while pushing dissimilar (negative) pairs apart. Trained on unlabeled data, the resulting visual and textual representations can rival or exceed supervised pretraining when transferred to downstream tasks.

The Core Idea

Without labels, the model cannot learn "this is a cat." Instead, contrastive learning creates a pretext task: "these two views of the same image should have similar representations, while views of different images should have different representations." The model learns features that capture semantic similarity by solving this discrimination task at scale.

InfoNCE Loss

The standard contrastive objective is InfoNCE, which adapts noise-contrastive estimation and can be read as a lower bound on the mutual information between the two views:

L = −log(exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ))

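where z_i and z_j are the embeddings of a positive pair, the sum over k runs over the positive and every negative in the batch, sim is cosine similarity, and τ is a temperature parameter that controls how sharply the softmax concentrates on the most similar candidates. Minimizing the loss raises the similarity of the positive pair relative to all negatives.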
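
The loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `info_nce_loss`, the default temperature of 0.1, and the convention that row i of each array holds the two views of sample i are all assumptions made for the example. It also uses only cross-view negatives, which is a simplification of what SimCLR does.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """InfoNCE over a batch: row i of z_a and row i of z_b form the positive pair."""
    # L2-normalise so that the dot product equals cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # logits[i, k] = sim(z_a[i], z_b[k]) / tau
    logits = z_a @ z_b.T / temperature

    # Numerically stable log-softmax across each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal.
    return -np.mean(np.diag(log_prob))
```

When every embedding in the batch is identical, the model cannot tell the positive from the negatives and the loss equals log(N) for a batch of N pairs. When the positives are well separated from the negatives, the loss approaches zero.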

Key Methods

- SimCLR (Chen et al., 2020): Generate two augmented views of each image (random crop, color jitter, Gaussian blur). Pass both through the same encoder + projection head. The two views form a positive pair; the views of all other images in the batch serve as negatives. Because the batch is the only source of negatives, it works best with large batch sizes (4096+). Simple but compute-intensive.

- MoCo (He et al., 2020): Encodes keys with a second encoder whose weights are a momentum (exponential moving average) copy of the query encoder, and stores keys from recent batches in a queue that supplies the negatives. The queue decouples the negative count from batch size, enabling effective contrastive learning with normal batch sizes (256). Because the momentum encoder changes slowly, the queued keys stay consistent with each other, which stabilizes training.

- BYOL (Grill et al., 2020) / DINO (Caron et al., 2021) (Non-Contrastive): Technically not contrastive (no explicit negatives), but closely related. A student network learns to predict the output of a momentum-teacher network from a different augmented view of the same image, so no negatives are needed at all. DINO (self-distillation) applied to Vision Transformers produces features with emergent object segmentation properties.

- CLIP (Radford et al., 2021): Contrastive learning between image and text representations. Positive pairs are matching (image, caption) from the internet; negatives are non-matching combinations in the batch. Learns a shared embedding space enabling zero-shot image classification by comparing image embeddings to text embeddings of class descriptions.
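
As an illustration of the MoCo mechanism described in the list above, the following sketch shows the two moving parts: the momentum update of the key encoder and the fixed-size queue of negatives. It is a simplified sketch under stated assumptions, not the authors' code. The names `momentum_update` and `KeyQueue` are hypothetical, parameters are represented as plain lists of NumPy arrays rather than a real network, and each enqueued batch is assumed to be no larger than the queue.

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """EMA update: key <- m * key + (1 - m) * query, applied per parameter array."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class KeyQueue:
    """Fixed-size FIFO buffer of key embeddings used as negatives (hypothetical helper)."""

    def __init__(self, size, dim):
        self.keys = np.zeros((size, dim))
        self.ptr = 0      # next slot to overwrite
        self.filled = 0   # number of valid rows so far

    def enqueue(self, batch_keys):
        # Overwrite the oldest entries, wrapping around at the end of the buffer.
        n = batch_keys.shape[0]
        idx = (self.ptr + np.arange(n)) % self.keys.shape[0]
        self.keys[idx] = batch_keys
        self.ptr = (self.ptr + n) % self.keys.shape[0]
        self.filled = min(self.filled + n, self.keys.shape[0])

    def negatives(self):
        # Only rows that have been written at least once are valid negatives.
        return self.keys[: self.filled]
```

In a training loop, each step would encode the current batch with the key encoder, score the queries against `negatives()`, enqueue the new keys, and then call `momentum_update`. With a momentum close to 1 the key encoder drifts slowly, which is what keeps old and new entries in the queue comparable.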

Why Augmentation Is Critical

The augmentations define what the model learns to be invariant to. Crop-based augmentation forces the model to recognize objects regardless of position; color jitter forces color invariance. The choice of augmentations encodes the inductive bias about what constitutes "semantically similar."
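
A rough sketch of how a positive pair might be produced from a single image follows. It is deliberately simplified: the helper names (`random_crop`, `color_jitter`, `two_views`) and the jitter strength of 0.4 are assumptions for illustration, the crop is not resized, and the jitter only rescales brightness and per-channel intensity. Production pipelines typically use richer transforms from an image library.

```python
import numpy as np

def random_crop(image, crop_size, rng):
    """Cut a random crop_size x crop_size window from an (H, W, C) image."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size]

def color_jitter(image, strength, rng):
    """Scale overall brightness and each channel by random factors near 1."""
    brightness = rng.uniform(1 - strength, 1 + strength)
    channel_scale = rng.uniform(1 - strength, 1 + strength,
                                size=(1, 1, image.shape[2]))
    return np.clip(image * brightness * channel_scale, 0.0, 1.0)

def two_views(image, crop_size, rng, strength=0.4):
    """Return two independently augmented views of one image (a positive pair)."""
    return tuple(
        color_jitter(random_crop(image, crop_size, rng), strength, rng)
        for _ in range(2)
    )
```

Because the two views differ in crop position and color, the only way for the encoder to map them to nearby embeddings is to rely on content that survives both transforms. Changing the transforms in `two_views` therefore changes which invariances the representation acquires.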

Contrastive learning lets models learn visual and textual features without labels. It rests on a simple principle: different views of the same thing should look alike in feature space. That principle is enough to learn representations that support downstream tasks from classification to retrieval.
