Cross-Entropy Loss is the standard loss function for classification tasks in deep learning, measuring the divergence between the model's predicted probability distribution and the true label distribution. Derived from information theory — specifically Shannon entropy and Kullback-Leibler divergence — cross-entropy loss has strong theoretical grounding and produces gradients that enable efficient, stable optimization of classification models from logistic regression to billion-parameter LLMs.
Mathematical Foundation
Given a true label distribution $y$ and a predicted probability distribution $p$, cross-entropy is:
$$H(y, p) = -\sum_{c=1}^{C} y_c \log p_c$$
For one-hot encoded labels (standard classification with $C$ classes):
$$L = -\log p_{y^*}$$
where $y^*$ is the true class index. Cross-entropy simply becomes the negative log probability of the correct class.
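The reduction from the general sum to a single log term can be checked numerically. The following is a minimal sketch in plain Python; the probability vector and the helper name `cross_entropy` are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(y, p):
    """H(y, p) = -sum_c y_c * log(p_c). Terms with y_c == 0 are skipped."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc > 0)

# Hypothetical predicted distribution over C = 3 classes.
p = [0.7, 0.2, 0.1]
# One-hot label: the true class index y* is 0.
y_onehot = [1.0, 0.0, 0.0]

loss_general = cross_entropy(y_onehot, p)  # full sum over classes
loss_onehot = -math.log(p[0])              # shortcut: -log p_{y*}

print(loss_general, loss_onehot)
```

Both values agree because every term with $y_c = 0$ drops out of the sum.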
Binary Cross-Entropy
For binary classification ($C = 2$) with sigmoid output:
$$L = -[y \log(p) + (1-y) \log(1-p)]$$
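- When $y=1$: Loss $= -\log(p)$. A confident correct prediction ($p \approx 1$) gives near-zero loss; a confident wrong prediction ($p \approx 0$) gives a very large loss.
- When $y=0$: Loss $= -\log(1-p)$. The same behavior applies, mirrored: $p \approx 0$ gives near-zero loss and $p \approx 1$ gives a very large loss.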
- Used in: binary classifiers, multi-label classification (each class uses its own sigmoid), logistic regression.
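As a rough illustration of the cases listed above, here is a small plain-Python sketch of binary cross-entropy. The clamping constant `eps` and the choice to sum (rather than average) the per-class losses in the multi-label helper are assumptions made for this example.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """L = -[y log p + (1 - y) log(1 - p)], with p clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def multi_label_loss(targets, probs):
    """Multi-label case: one independent sigmoid output per class, losses summed."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(targets, probs))

confident_right = binary_cross_entropy(1, 0.99)  # small loss
confident_wrong = binary_cross_entropy(1, 0.01)  # large loss
mirrored = binary_cross_entropy(0, 0.01)         # y = 0 case mirrors y = 1

print(confident_right, confident_wrong, mirrored)
print(multi_label_loss([1, 0, 1], [0.9, 0.2, 0.6]))
```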
Why Cross-Entropy Outperforms MSE for Classification
Mean Squared Error (MSE) can look like the intuitive choice, but paired with a sigmoid or softmax output it trains classifiers poorly:
| Property | Cross-Entropy | MSE for Classification |
|---|---|---|
| Gradient near decision boundary | Full error signal $(p - y)$ | Error damped by the sigmoid slope $p(1-p)$ |
| Gradient when confidently wrong | Strong correction | Near-zero (sigmoid saturates, gradient vanishes) |
| Probabilistic interpretation | Information-theoretically grounded | Not principled |
| Training speed | Fast convergence | Slow convergence |
| Calibration | Tends to be better calibrated | Tends to be poorly calibrated |
With sigmoid+MSE, if a model predicts 0.01 for a positive example (very wrong), the gradient of $(p-y)^2$ with respect to the logit is $2(p-y)\,p(1-p)$, which is tiny because the saturated sigmoid makes $p(1-p) \approx 0.01$. Cross-entropy avoids this: its gradient with respect to the logit is simply $(p - y)$, proportional to the error regardless of saturation.
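The saturation argument can be made concrete with the 0.01 example. This sketch uses the closed-form derivatives with respect to the logit $z$ (with $p = \sigma(z)$); the function names are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    """d/dz (p - y)^2 = 2 (p - y) * p (1 - p), using dp/dz = p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce_wrt_logit(z, y):
    """d/dz of binary cross-entropy simplifies to p - y."""
    return sigmoid(z) - y

y = 1.0
z_wrong = math.log(0.01 / 0.99)  # logit chosen so that sigmoid(z) = 0.01

g_mse = grad_mse_wrt_logit(z_wrong, y)  # about -0.0196: almost no signal
g_ce = grad_ce_wrt_logit(z_wrong, y)    # about -0.99: strong correction

# At the decision boundary (z = 0, p = 0.5) both gradients are non-negligible.
g_mse_boundary = grad_mse_wrt_logit(0.0, y)  # -0.25
g_ce_boundary = grad_ce_wrt_logit(0.0, y)    # -0.5

print(g_mse, g_ce, g_mse_boundary, g_ce_boundary)
```

For the confidently wrong prediction, the cross-entropy gradient is roughly 50 times larger in magnitude than the MSE gradient.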
Cross-Entropy + Softmax (The Standard Recipe)
The most common pattern in deep learning:
1. Network outputs logits $z \in \mathbb{R}^C$ (any real values)
2. Apply softmax: $p_c = e^{z_c} / \sum_j e^{z_j}$
3. Compute cross-entropy: $L = -\log p_{y^*} = -z_{y^*} + \log \sum_j e^{z_j}$
The gradient is $\partial L / \partial z_c = p_c - \mathbb{1}[c = y^*]$: the predicted probability minus the one-hot target, which is cheap to compute and numerically well-behaved.
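In PyTorch, torch.nn.CrossEntropyLoss takes raw logits and fuses log-softmax and negative log-likelihood into a single numerically stable operation using the log-sum-exp trick. Avoid computing log(softmax(x)) in two steps, since the intermediate softmax can underflow to zero and produce infinite losses; use F.log_softmax or F.cross_entropy directly.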
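To show why the log-sum-exp trick matters, here is a dependency-free sketch of a stable log-softmax and the resulting loss and gradient. It illustrates the idea only and is not PyTorch's implementation; the function names and the extreme logits are assumptions chosen to trigger overflow in a naive version.

```python
import math

def log_softmax(z):
    """log p_c = z_c - logsumexp(z), computed after subtracting max(z)."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy_from_logits(z, target):
    """L = -log p_{y*} = -z_{y*} + logsumexp(z)."""
    return -log_softmax(z)[target]

def grad_wrt_logits(z, target):
    """dL/dz_c = p_c - 1[c == y*]."""
    probs = [math.exp(lp) for lp in log_softmax(z)]
    return [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]

# A naive exp(1000.0) raises OverflowError; the shifted version does not.
logits = [1000.0, 0.0, -1000.0]
loss = cross_entropy_from_logits(logits, target=1)
grad = grad_wrt_logits(logits, target=1)

print(loss, grad)
```

Subtracting the maximum logit leaves the softmax unchanged but keeps every exponent at or below zero, so nothing overflows.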
Cross-Entropy as Language Model Loss
Large language models are trained to minimize cross-entropy loss over next-token prediction:
$$L_{\text{LM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{< t})$$
This is equivalent to the average negative log-likelihood of the training data under the model's distribution. Perplexity, the primary intrinsic metric for language model evaluation, is simply $e^{L_{\text{LM}}}$ (with natural logarithms). Pretraining large models such as GPT-4 means minimizing this loss over trillions of tokens, and much of the capability of modern LLMs emerges from gradient descent on cross-entropy loss at massive scale.
Loss Function Variants
| Variant | Formula | Use Case |
|---|---|---|
| Cross-Entropy | $-\log p_{y^*}$ | Standard multi-class classification |
| Binary CE | $-y\log p - (1-y)\log(1-p)$ | Binary or multi-label |
| Label Smoothing CE | Smooth targets: true class $1-\epsilon+\epsilon/C$, others $\epsilon/C$ | Prevents overconfidence, improves calibration |
| Focal Loss | $(1-p_{y^*})^\gamma \cdot \text{CE}$ | Class imbalance in object detection (RetinaNet) |
| NLL Loss | $-\log p_{y^*}$ (input is already log-softmax output) | PyTorch NLLLoss, for models that output log-probabilities |
| KL Divergence | $\sum_c q_c \log(q_c / p_c)$ | Knowledge distillation, VAE training |
Label Smoothing
One important industrial practice is label smoothing, which replaces hard one-hot targets with soft targets:
$$\tilde{y}_c = (1 - \epsilon) \cdot \mathbb{1}[c = y^*] + \frac{\epsilon}{C}$$
With $\epsilon = 0.1$ (a common default), the correct class gets $0.9 + 0.1/C$ and every other class gets $0.1/C$. This discourages overconfident predictions and tends to improve calibration.
Cross-entropy loss is the default loss function for classification across domains: image recognition, NLP, speech, medical imaging, and the next-token prediction objective that underlies modern large language models.
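Two quantities from this section, perplexity $e^{L_{\text{LM}}}$ and the smoothed targets $\tilde{y}_c$, can be sketched in a few lines of plain Python. The token probabilities and the function names are hypothetical. Note that under the $\epsilon/C$ formula as written, the true class receives $1 - \epsilon + \epsilon/C$.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural logs)."""
    loss = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(loss)

def smoothed_targets(true_class, num_classes, epsilon=0.1):
    """y~_c = (1 - eps) * 1[c == y*] + eps / C."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if c == true_class else 0.0) + uniform
        for c in range(num_classes)
    ]

# Hypothetical model that assigns probability 0.25 to each of 4 observed tokens.
ppl = perplexity([math.log(0.25)] * 4)

# C = 10 classes, true class index 2, epsilon = 0.1.
targets = smoothed_targets(true_class=2, num_classes=10)

print(ppl)      # as uncertain as a uniform choice among 4 tokens
print(targets)  # true class near 0.91, every other class near 0.01
```

A perplexity of 4 here means the model is as uncertain as a uniform guess over four options, and the smoothed targets still sum to one.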