Loss Functions

Loss functions are the mathematical objectives that quantify the discrepancy between model predictions and desired outputs and guide optimization through gradient descent. The choice of loss function determines what the model actually learns to optimize: with the wrong loss, a model can minimize its objective perfectly while failing at the actual task.

Classification Losses

Cross-Entropy Loss (Standard)
$L = -\sum_{c=1}^{C} y_c \log(p_c)$
- For binary: $L = -[y\log(p) + (1-y)\log(1-p)]$.
- Default for classification tasks. Pairs with softmax output.
- Weights every example equally, so on imbalanced data the majority class dominates the total loss and minority classes are learned poorly.
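
As a rough illustration of the two formulas above, the following sketch computes cross-entropy for a single example using only the standard library. The function names (`softmax`, `cross_entropy`, `binary_cross_entropy`) and the `eps` clipping constant are illustrative choices, not part of any particular framework.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Multi-class cross-entropy for one example with a one-hot label.

    With a one-hot label, the sum over classes reduces to -log(p_target).
    """
    probs = softmax(logits)
    return -math.log(probs[target])

def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for predicted probability p and label y in {0, 1}.

    p is clipped away from 0 and 1 so the logarithm stays finite
    (eps is an assumed clipping constant).
    """
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```

A uniform prediction over C classes gives a loss of log(C), which is a handy sanity check for an untrained classifier.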

Focal Loss (Lin et al., 2017)
$L_{focal} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$
- Down-weights loss for easy, well-classified examples.
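- γ = 2 (default): The modulating factor $(1 - p_t)^\gamma$ equals 0.01 at p_t = 0.9, so easy examples (p_t > 0.9) contribute at least 100x less to the loss than under plain cross-entropy.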
- Designed for object detection (RetinaNet) where background class dominates.
- Mitigates class imbalance without oversampling.
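
The down-weighting effect can be checked numerically. This is a minimal per-example sketch of the focal loss formula above. The helper `modulating_factor` and the default `alpha_t=1.0` are assumptions made for illustration, and `p_t` is taken to be the predicted probability of the true class.

```python
import math

def modulating_factor(p_t, gamma=2.0):
    """The (1 - p_t)^gamma term that scales down well-classified examples."""
    return (1.0 - p_t) ** gamma

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for one example, where p_t is the probability of the true class.

    With gamma = 0 and alpha_t = 1 this reduces to plain cross-entropy.
    """
    return -alpha_t * modulating_factor(p_t, gamma) * math.log(p_t)

# Compare an easy example (p_t = 0.9) with a hard one (p_t = 0.1).
for p_t in (0.9, 0.1):
    ce = -math.log(p_t)
    fl = focal_loss(p_t)
    print(f"p_t={p_t}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

The easy example keeps about 1% of its cross-entropy loss, while the hard example keeps about 81%, so the gradient is concentrated on the examples the model still gets wrong.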

Label Smoothing
$y_{smooth} = (1 - \epsilon) \cdot y_{onehot} + \epsilon / C$
- Replace hard one-hot labels with soft labels (ε = 0.1 typical).
- Prevents overconfident predictions.
- Improves generalization and calibration.
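
The smoothing formula can be sketched as a small helper that builds the soft target vector. The function name `smooth_labels` and its argument names are hypothetical.

```python
def smooth_labels(target, num_classes, epsilon=0.1):
    """Build a smoothed target distribution from a class index.

    Every class receives epsilon / num_classes, and the true class
    additionally keeps (1 - epsilon), so the vector still sums to 1.
    """
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if c == target else off_value
        for c in range(num_classes)
    ]

# Four classes, true class 0, epsilon = 0.1:
# the true class gets 0.9 + 0.025 and each other class gets 0.025.
print(smooth_labels(0, 4))
```

Because the target never reaches exactly 1, the cross-entropy minimum is no longer at infinitely large logits, which is what discourages overconfident predictions.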

Metric Learning Losses

| Loss | Inputs | Purpose |
|------|--------|---------|
| Triplet Loss | Anchor, positive, negative | Learn distance metric |
| InfoNCE | Anchor, positive, N negatives | Contrastive learning (CLIP, SimCLR) |
| ArcFace | Features + class centers | Face recognition |
| Circle Loss | Flexible weighting of pairs | Unified metric learning |
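
To make the InfoNCE row concrete, here is a sketch of the commonly used form: a softmax cross-entropy in which the positive pair must win against N negatives. Cosine similarity, the `temperature` value, and the function names are assumptions chosen for illustration; real systems such as CLIP or SimCLR differ in details like batching and symmetric terms.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE for one anchor: -log softmax of the positive similarity.

    The positive's logit is placed first, followed by one logit per negative.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]
```

The loss approaches zero when the positive is much more similar to the anchor than every negative, and equals log(N + 1) when all candidates are equally similar.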

Triplet Loss
$L = \max(0, \|a - p\|^2 - \|a - n\|^2 + \text{margin})$
- Pull anchor-positive pairs closer than anchor-negative pairs by margin.
- Mining strategy: Semi-hard negatives (farther from the anchor than the positive, but still inside the margin, so the loss is nonzero) typically give the best training signal.
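
The triplet formula and the semi-hard condition can be sketched as follows. Embeddings are plain Python lists here, the default `margin=0.2` is an arbitrary illustrative value, and `is_semi_hard` encodes the usual definition d(a, p) < d(a, n) < d(a, p) + margin, which is an assumption about what "semi-hard" means in this text.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

def is_semi_hard(anchor, positive, negative, margin=0.2):
    """True when the negative is farther than the positive but within the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap < d_an < d_ap + margin
```

A negative that is already far beyond the margin yields zero loss and no gradient, which is why mining focuses on triplets that still violate the margin.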

Regression Losses

| Loss | Formula | Robustness to Outliers |
|------|---------|----------------------|
| MSE (L2) | $(y - \hat{y})^2$ | Sensitive (squares large errors) |
| MAE (L1) | $\lvert y - \hat{y} \rvert$ | Robust (linear penalty) |
| Huber | L2 for small errors, L1 for large | Configurable (δ parameter) |
| Log-Cosh | $\log(\cosh(y - \hat{y}))$ | Smooth approximation of Huber (quadratic near zero, linear for large errors) |
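
The four losses in the table can be compared on a single residual r = y − ŷ. In this sketch, the Huber form with a 1/2 factor on the quadratic branch is the common convention rather than something the table specifies, and `log_cosh` uses an algebraically equivalent rewrite to avoid overflow for large residuals.

```python
import math

def mse(r):
    """Squared error: grows quadratically with the residual."""
    return r * r

def mae(r):
    """Absolute error: grows linearly with the residual."""
    return abs(r)

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it (common 1/2-scaled form)."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)

def log_cosh(r):
    """log(cosh(r)), written as |r| + log(1 + exp(-2|r|)) - log(2) for stability."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

# An outlier with residual 10 is penalized very differently by each loss.
for name, fn in (("MSE", mse), ("MAE", mae), ("Huber", huber), ("Log-Cosh", log_cosh)):
    print(f"{name}: {fn(10.0):.4f}")
```

For a residual of 10, MSE gives 100 while MAE gives 10 and Huber (δ = 1) gives 9.5, which shows why a few outliers can dominate training under MSE.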

LLM Training Losses

- Autoregressive LM: Cross-entropy on next-token prediction.
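- DPO (Direct Preference Optimization): $L = -\log\sigma\left(\beta\left(\log\frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right)$, where $y_w$ and $y_l$ are the preferred and rejected responses to prompt $x$ and $\pi_{ref}$ is the frozen reference model.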
- Preference losses: Train model to prefer "good" outputs over "bad" outputs.
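
The DPO objective can be sketched for a single preference pair, given sequence-level log-probabilities that are assumed to be already computed by the policy and the reference model. The function name, argument names, and the default `beta=0.1` are illustrative, not taken from any library.

```python
import math

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses.

    Each argument is a log-probability of the whole response; the log-ratios
    in the formula become differences of log-probabilities.
    """
    log_ratio_w = policy_logp_w - ref_logp_w  # preferred response
    log_ratio_l = policy_logp_l - ref_logp_l  # rejected response
    z = beta * (log_ratio_w - log_ratio_l)
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in an overflow-safe form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference model the loss is log(2), and it falls as the policy raises the preferred response's probability relative to the rejected one.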

Loss function design is one of the most impactful and underappreciated aspects of deep learning. The loss function is the specification of what the model should learn, and innovations in loss functions (focal loss, contrastive losses, DPO) have enabled breakthroughs that architecture changes alone could not achieve.
