Cross-Entropy Loss

Cross-Entropy Loss is the standard loss function for classification tasks in deep learning, measuring the divergence between the model's predicted probability distribution and the true label distribution. Derived from information theory — specifically Shannon entropy and Kullback-Leibler divergence — cross-entropy loss has strong theoretical grounding and produces gradients that enable efficient, stable optimization of classification models from logistic regression to billion-parameter LLMs.

Mathematical Foundation

Given a true label distribution $y$ and a predicted probability distribution $p$, cross-entropy is:

$$H(y, p) = -\sum_{c=1}^{C} y_c \log p_c$$

For one-hot encoded labels (standard classification with $C$ classes):

$$L = -\log p_{y^*}$$

where $y^*$ is the true class index. Cross-entropy simply becomes the negative log probability of the correct class.
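The reduction from the full sum to a single log term can be checked numerically. The snippet below is a minimal sketch using only the Python standard library; the three-class distribution and the name `cross_entropy` are illustrative assumptions, not part of any library.

```python
import math

def cross_entropy(y, p):
    """H(y, p) = -sum_c y_c * log(p_c); terms with y_c == 0 contribute nothing."""
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y, p) if y_c > 0)

# Hypothetical 3-class example: the true class is index 1.
p = [0.2, 0.7, 0.1]          # predicted distribution
y_one_hot = [0.0, 1.0, 0.0]  # one-hot label
true_class = 1

full_sum = cross_entropy(y_one_hot, p)   # general formula
shortcut = -math.log(p[true_class])      # one-hot shortcut: -log p_{y*}
print(full_sum, shortcut)                # both are -log(0.7)
```

Only the probability assigned to the true class matters for a one-hot label; how the remaining mass is spread over the wrong classes does not change the loss.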

Binary Cross-Entropy

For binary classification ($C = 2$) with sigmoid output:

$$L = -[y \log(p) + (1-y) \log(1-p)]$$
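As a rough illustration of the binary formula, the sketch below evaluates it for a confident correct and a confident wrong prediction. The function name and the clamping constant `eps` are assumptions made for this example.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """-[y*log(p) + (1-y)*log(1-p)] for a label y in {0, 1} and probability p."""
    # Clamp p away from 0 and 1 so log() never receives 0 (eps is an illustrative choice).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_right = binary_cross_entropy(1, 0.9)  # small loss, about 0.105
confident_wrong = binary_cross_entropy(1, 0.1)  # large loss, about 2.303
print(confident_right, confident_wrong)
```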

Why Cross-Entropy Outperforms MSE for Classification

Mean Squared Error (MSE) may look like the intuitive choice, but combined with a sigmoid or softmax output it works poorly for classification:

| Property | Cross-Entropy | MSE for Classification |
| --- | --- | --- |
| Gradient near decision boundary | Strong signal | Weaker (scaled down by the sigmoid derivative) |
| Gradient when very wrong | Strong correction | Near-zero (sigmoid saturates, gradient vanishes) |
| Probabilistic interpretation | Information-theoretically grounded | Not principled |
| Training speed | Fast convergence | Slow convergence |
| Calibration | Better calibrated | Poor calibration |

With sigmoid+MSE, if a model predicts 0.01 for a positive example (very wrong), the gradient of $(p-y)^2$ with respect to the logit is tiny because sigmoid is saturated. Cross-entropy avoids this: the gradient with respect to the logit is simply $(p - y)$ — proportional to the error, regardless of saturation.
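The saturation effect described above can be reproduced with a few lines of arithmetic. This is a minimal sketch for a single sigmoid output; the helper names are hypothetical, and the derivatives are written out by hand from the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_logit(z, y):
    # d/dz (p - y)^2 = 2 (p - y) * p (1 - p), with p = sigmoid(z)
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_ce_wrt_logit(z, y):
    # d/dz -[y log p + (1 - y) log(1 - p)] = p - y
    return sigmoid(z) - y

y = 1.0                        # positive example
z = math.log(0.01 / 0.99)      # logit for which sigmoid(z) = 0.01
g_mse = grad_mse_wrt_logit(z, y)
g_ce = grad_ce_wrt_logit(z, y)
# The MSE gradient is roughly -0.02, the cross-entropy gradient roughly -0.99.
print(g_mse, g_ce)
```

The extra factor $p(1-p)$ in the MSE gradient is what shrinks the update when the sigmoid is saturated; cross-entropy cancels it exactly.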

Cross-Entropy + Softmax (The Standard Recipe)

The most common pattern in deep learning:

1. Network outputs logits $z \in \mathbb{R}^C$ (any real values)
2. Apply softmax: $p_c = e^{z_c} / \sum_j e^{z_j}$
3. Compute cross-entropy: $L = -\log p_{y^*} = -z_{y^*} + \log \sum_j e^{z_j}$

The gradient is $\partial L / \partial z_c = p_c - \mathbb{1}[c = y^*]$, which is clean, numerically stable, and fast to compute.
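One way to gain confidence in the gradient formula is to compare it against finite differences. The sketch below does this in plain Python; the logits, the step size `h`, and all function names are assumptions chosen for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, target):
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    # dL/dz_c = p_c - 1[c == target]
    p = softmax(z)
    return [p_c - (1.0 if c == target else 0.0) for c, p_c in enumerate(p)]

def numeric_grad(z, target, h=1e-6):
    # Central finite differences, perturbing one logit at a time.
    grads = []
    for c in range(len(z)):
        up = z[:c] + [z[c] + h] + z[c + 1:]
        down = z[:c] + [z[c] - h] + z[c + 1:]
        grads.append((loss(up, target) - loss(down, target)) / (2 * h))
    return grads

z = [2.0, -1.0, 0.5]  # hypothetical logits
target = 0
g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(z, target)
print(g_analytic)
print(g_numeric)
```

The two lists should agree to several decimal places, and the analytic gradient sums to zero because the softmax probabilities sum to one.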

In PyTorch, torch.nn.CrossEntropyLoss fuses softmax and log into a single numerically stable operation using the log-sum-exp trick. Never manually implement log(softmax(x)) — use F.log_softmax or F.cross_entropy directly.
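To see why the fused form matters, the following standard-library sketch contrasts a naive implementation with one based on the log-sum-exp trick. It is not PyTorch's implementation, only an illustration of the same idea; the extreme logits and function names are assumptions.

```python
import math

def naive_cross_entropy(z, target):
    # Exponentiates the raw logits, so large values overflow.
    exps = [math.exp(v) for v in z]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(z, target):
    # L = -z_target + logsumexp(z), with logsumexp(z) = m + log(sum(exp(z - m)))
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    return log_sum_exp - z[target]

z = [1000.0, 0.0, -1000.0]  # hypothetical extreme logits
target = 1

try:
    naive = naive_cross_entropy(z, target)
except OverflowError:
    naive = None  # math.exp(1000.0) is out of range for a float

stable = stable_cross_entropy(z, target)
print(naive, stable)
```

Subtracting the maximum logit before exponentiating keeps every exponent at or below zero, so the largest term is exactly 1 and the sum cannot overflow.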

Cross-Entropy as Language Model Loss

Large language models are trained to minimize cross-entropy loss over next-token prediction:

$$L_{\text{LM}} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_{<t})$$

This is equivalent to negative log-likelihood of the training data under the model's distribution. Perplexity — the primary metric for language model evaluation — is simply $e^{L_{\text{LM}}}$.
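The relationship between the loss and perplexity can be sketched in a few lines. The token probabilities below are made up for illustration, and natural logarithms are assumed so that perplexity is the exponential of the loss.

```python
import math

def lm_cross_entropy(token_probs):
    """Average negative log-probability (in nats) assigned to the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    return math.exp(lm_cross_entropy(token_probs))

# Hypothetical probabilities a model assigned to each actual next token.
token_probs = [0.25, 0.25, 0.25, 0.25]
lm_loss = lm_cross_entropy(token_probs)
ppl = perplexity(token_probs)
print(lm_loss, ppl)
```

A model that always assigns probability 1/4 to the correct token has a perplexity of about 4: it is as uncertain as a uniform choice among four options.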

Models such as GPT-4 are pretrained by minimizing this loss over very large text corpora, reportedly on the order of trillions of tokens. Much of the capability of modern LLMs emerges from gradient descent on cross-entropy loss at massive scale.

Loss Function Variants

| Variant | Formula | Use Case |
| --- | --- | --- |
| Cross-Entropy | $-\log p_{y^*}$ | Standard multi-class classification |
| Binary CE | $-y\log p - (1-y)\log(1-p)$ | Binary or multi-label |
| Label Smoothing CE | Smooth targets: true class $= 1-\epsilon+\epsilon/C$, others $= \epsilon/C$ | Prevents overconfidence, improves calibration |
| Focal Loss | $(1-p_{y^*})^\gamma \cdot \text{CE}$ | Class imbalance in object detection (RetinaNet) |
| NLL Loss | $-\log p_{y^*}$ (with log-softmax input) | PyTorch's NLLLoss, for inputs that are already log-probabilities |
| KL Divergence | $\sum_c q_c \log(q_c / p_c)$ | Knowledge distillation, VAE training |
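As one example from the table, the focal loss weighting can be sketched as follows. The function name and the choice $\gamma = 2$ are illustrative assumptions; only the formula itself comes from the table.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """(1 - p_true)^gamma * CE, where CE = -log(p_true)."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy = focal_loss(0.9)                      # well-classified example, heavily down-weighted
hard = focal_loss(0.1)                      # misclassified example, close to plain CE
plain_ce_easy = focal_loss(0.9, gamma=0.0)  # gamma = 0 recovers ordinary cross-entropy
print(easy, hard, plain_ce_easy)
```

With $\gamma = 2$, an example predicted at 0.9 contributes about 1% of its plain cross-entropy, so abundant easy examples no longer dominate the total loss.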

Label Smoothing

Label smoothing, widely used in practice, replaces hard one-hot targets with soft targets:

$$\tilde{y}_c = (1 - \epsilon) \cdot \mathbb{1}[c = y^*] + \frac{\epsilon}{C}$$

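With $\epsilon = 0.1$ (a common default), the correct class gets $0.9 + 0.1/C$ and every other class gets $0.1/C$, so the targets still sum to 1. This discourages overconfident predictions and tends to improve calibration.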
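A small sketch of the smoothed targets, following the displayed formula $(1-\epsilon)\,\mathbb{1}[c = y^*] + \epsilon/C$. The helper names, the five-class setup, and the model output are assumptions for illustration only.

```python
import math

def smooth_targets(true_class, num_classes, eps=0.1):
    """(1 - eps) * one_hot + eps / C for every class c."""
    uniform = eps / num_classes
    return [(1.0 - eps) * (1.0 if c == true_class else 0.0) + uniform
            for c in range(num_classes)]

def cross_entropy(targets, probs):
    return -sum(t * math.log(p) for t, p in zip(targets, probs) if t > 0)

targets = smooth_targets(true_class=2, num_classes=5, eps=0.1)
# The true class gets 0.9 + 0.02 = 0.92; each other class gets 0.02.
probs = [0.05, 0.05, 0.8, 0.05, 0.05]  # hypothetical model output
smoothed_loss = cross_entropy(targets, probs)
hard_loss = -math.log(probs[2])
print(targets, smoothed_loss, hard_loss)
```

Because every class keeps a little target mass, pushing the predicted probability of the true class toward 1 eventually increases the smoothed loss, which is what discourages extreme logits.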

Cross-entropy loss is the default loss function for classification across domains, including image recognition, NLP, speech, and medical imaging, and it is the next-token prediction objective that underlies modern large language models.
