Leaky ReLU (Leaky Rectified Linear Unit)

Leaky ReLU (Leaky Rectified Linear Unit) is a modification of the ReLU activation function that passes a small, non-zero gradient for negative inputs, preventing the "dying neuron" failure mode in which a ReLU neuron outputs zero for every input and stops learning. Introduced by Maas et al. (Stanford, 2013), Leaky ReLU and related activations (PReLU, RReLU, ELU, GELU) form an important family that gives up ReLU's exact zero output on negative inputs in exchange for more robust training dynamics across diverse network architectures.

The Dying ReLU Problem

ReLU is defined as $f(x) = \max(0, x)$, so its gradient is exactly 0 for all negative inputs. This creates a failure mode:

1. A neuron receives a very negative weighted sum (e.g., after a large incorrect weight update).
2. ReLU outputs 0, and the gradient through the neuron is 0.
3. The neuron's weights receive no gradient signal, so they never update.
4. The neuron stays "dead" for as long as its pre-activation is negative on every training example, because its own weights can no longer change to revive it.

With high learning rates or poor initialization, significant fractions of neurons can die in the first few batches of training. Networks with 20-40% dead neurons are functional but waste capacity. Leaky ReLU eliminates dying neurons by ensuring the gradient is always non-zero.
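The failure mode above can be reproduced with a single neuron. The following is a minimal sketch, not a realistic training loop: the inputs, the starting bias of -10, and the constant upstream gradient are all hypothetical values chosen so that the pre-activation starts out negative for every input.

```python
def relu_grad(z):
    """Derivative of max(0, z): 0 for every non-positive z."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: alpha instead of 0 on the negative side."""
    return 1.0 if z > 0 else alpha


def train_neuron(act_grad, w, b, inputs, upstream=-1.0, lr=0.1, epochs=2000):
    """Plain SGD on one neuron with pre-activation z = w * x + b.

    `upstream` stands in for dLoss/dOutput. A constant negative value means
    the loss always wants the neuron's output to increase (an assumption
    made to keep the sketch short).
    """
    for _ in range(epochs):
        for x in inputs:
            z = w * x + b
            local = act_grad(z)              # dOutput/dz
            w -= lr * upstream * local * x   # chain rule: dLoss/dw
            b -= lr * upstream * local       # chain rule: dLoss/db
    return w, b


inputs = [0.5, 1.0, 1.5]   # hypothetical inputs
w0, b0 = 1.0, -10.0        # bias pushed far negative, e.g. by a bad update

relu_w, relu_b = train_neuron(relu_grad, w0, b0, inputs)
leaky_w, leaky_b = train_neuron(leaky_relu_grad, w0, b0, inputs)

print("ReLU:      ", relu_w, relu_b)     # parameters never move
print("Leaky ReLU:", leaky_w, leaky_b)   # parameters climb until z > 0
```

With ReLU the local gradient is 0 on every step, so `w` and `b` keep their initial values. With the leak, each step moves them by a small amount until the pre-activation turns positive and the neuron resumes normal learning.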

Leaky ReLU Formula and Properties

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$

where $\alpha$ is the leak coefficient, typically $\alpha = 0.01$.

Gradient: $$\frac{d}{dx}\text{LeakyReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{if } x < 0 \end{cases}$$
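As a small sketch of the two formulas above in scalar Python: the function names are illustrative, and returning $\alpha$ at exactly $x = 0$ (where the derivative is undefined) is a convention assumed here rather than something the formula prescribes.

```python
def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """1 for positive inputs, alpha otherwise (including x == 0 by convention)."""
    return 1.0 if x > 0 else alpha


def numeric_grad(f, x, h=1e-6):
    """Central finite difference, used only to sanity-check the analytic gradient."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-3.0, -0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x), numeric_grad(leaky_relu, x))
```

Away from zero, the analytic and numeric gradients agree, and the gradient never reaches 0 for any $\alpha > 0$.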

Key properties:

Comparison of ReLU Variants

| Activation | Formula ($x \leq 0$) | Gradient ($x < 0$) | Dead Neurons | Extra Params | Used In |
|---|---|---|---|---|---|
| ReLU | 0 | 0 | Yes | 0 | ResNet, CNNs generally |
| Leaky ReLU | $0.01x$ | 0.01 | No | 0 | GANs, fast training |
| PReLU | $\alpha x$ (learned) | $\alpha$ (learned) | No | 1 per channel (or 1 per layer) | Deep face recognition |
| RReLU | $\alpha x$ ($\alpha$ random) | Random $\alpha$ | No | 0 | Regularized training |
| ELU | $\alpha(e^x - 1)$ | $\alpha e^x$ | No | 0 (fixed $\alpha$) | Deep CNNs; basis for SELU |
| SELU | $\lambda \alpha(e^x - 1)$ | $\lambda \alpha e^x$ | No | 0 (fixed) | Self-normalizing networks |
| GELU | $x\Phi(x)$ | Smooth | No | 0 | BERT, GPT, many transformers |
| SiLU/Swish | $x\sigma(x)$ | Smooth | No | 0 | LLaMA, EfficientNet |

Parametric ReLU (PReLU)

PReLU (He et al., Microsoft Research, 2015) treats the leak coefficient $\alpha$ as a learnable parameter that is updated by backpropagation together with the network weights.
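Learning $\alpha$ only requires the derivative of the output with respect to $\alpha$, which is $x$ on the negative side and 0 on the positive side. The sketch below fits a single shared $\alpha$ by gradient descent on a toy regression problem. The data, the target slope of 0.1, and the learning rate are hypothetical; the starting value 0.25 follows the initialization reported by He et al.

```python
def prelu(x, alpha):
    """Same shape as Leaky ReLU, but alpha is a trainable value."""
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x):
    """d prelu / d alpha: x on the negative side, 0 on the positive side."""
    return 0.0 if x > 0 else x


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]          # hypothetical inputs
target_slope = 0.1                              # hypothetical "true" leak
targets = [prelu(x, target_slope) for x in xs]

alpha = 0.25   # initial value used by He et al.
lr = 0.1       # hypothetical learning rate

for _ in range(200):
    grad = 0.0
    for x, t in zip(xs, targets):
        err = prelu(x, alpha) - t
        grad += 2.0 * err * prelu_grad_alpha(x)  # d(err^2)/d(alpha)
    alpha -= lr * grad / len(xs)                 # mean-squared-error step

print(alpha)  # approaches target_slope
```

Only the negative inputs contribute to the update, since the positive branch does not depend on $\alpha$.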

Leaky ReLU in GANs

Leaky ReLU is a standard choice in GAN discriminators, typically with a larger slope of $\alpha = 0.2$ (as in DCGAN) rather than 0.01.
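One way to see why the slope matters here: the generator's only training signal is the gradient that flows back through the discriminator. The sketch below is a worst-case illustration, not a model of a real GAN. It ignores the weight matrices and assumes every unit on the path sits on its negative side, in which case the gradient is scaled by $\alpha$ once per layer.

```python
def negative_path_gradient(alpha, depth):
    """Gradient scale after `depth` activations that are all on their
    negative side (weights ignored; a deliberately simplified assumption)."""
    return alpha ** depth


for alpha in (0.0, 0.01, 0.2):   # ReLU, default Leaky ReLU, GAN-style slope
    scales = [negative_path_gradient(alpha, d) for d in (1, 2, 4)]
    print(f"alpha={alpha}: {scales}")
```

Under these assumptions, ReLU blocks the signal entirely, a slope of 0.01 shrinks it by a factor of 100 per layer, and a slope of 0.2 by a factor of 5 per layer.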

ELU: Smooth Alternative

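The Exponential Linear Unit (ELU) replaces the linear negative branch with an exponential one, which gives a smooth transition at $x = 0$ (the derivative is continuous there when $\alpha = 1$) and saturates to $-\alpha$ for large negative inputs: $$\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$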

ELU advantages over Leaky ReLU:

ELU disadvantages: Exponential computation is slower than Leaky ReLU's multiply.
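The ELU definition and its derivative can be sketched in a few lines of standard-library Python. The function names are illustrative, and `math.expm1` is used for $e^x - 1$ to keep precision near zero.

```python
import math


def elu(x, alpha=1.0):
    """x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x, alpha=1.0):
    """1 for positive inputs, alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-20.0, -1.0, 0.0, 1.0):
    print(x, elu(x), elu_grad(x))

# With alpha = 1 the left-hand slope at 0 equals the right-hand slope of 1,
# so there is no kink; any other alpha leaves a jump in the derivative.
print(elu_grad(0.0, alpha=1.0), elu_grad(0.0, alpha=0.5))
```

The loop also shows the saturation behaviour: at $x = -20$ the output is very close to $-\alpha$ and the gradient is close to 0, so ELU trades the constant leak of Leaky ReLU for a bounded negative output, at the price of one exponential per negative input.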

When to Use Each in Practice

Leaky ReLU is a pragmatic, near-zero-cost alternative to ReLU wherever neuron death is a concern, most notably in GANs, very deep networks without batch normalization, and training setups with high learning rates or aggressive optimization.
