Activation Functions Survey (ReLU, GELU, SiLU) compares the fundamental non-linearities used in deep learning, which enable neural networks to learn complex functions; each activation offers different trade-offs in complexity, gradient flow, and computational efficiency across modern architectures from CNNs to transformers.
ReLU (Rectified Linear Unit):
- Formula: ReLU(x) = max(0, x), i.e. the identity for x>0 and zero for x≤0
- History: popularized around 2010-2011, it enabled efficient training of very deep networks and underpinned architectures such as AlexNet and ResNet
- Advantages: computationally trivial (a single comparison); sparse activation (roughly half of units inactive for zero-centered inputs) provides implicit regularization
- Gradient: ∂ReLU/∂x = 0 for x<0 and 1 for x>0; gradients are clean where the unit is active but exactly zero for negative inputs (dying ReLU problem), as illustrated in the sketch after this list
- Dying ReLU: neurons whose pre-activations stay negative output zero permanently in some settings, particularly early in training with poor initialization
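A minimal NumPy sketch of ReLU and its (sub)gradient as defined above; this is illustrative code, not tied to any particular framework.

```python
import numpy as np

def relu(x):
    # max(0, x): identity for x > 0, zero otherwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # dReLU/dx: 1 for x > 0, 0 for x < 0 (0 used at x = 0 by convention)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```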
ReLU Variants and Extensions:
- Leaky ReLU: LeakyReLU(x) = x if x>0 else 0.01·x; the small negative slope prevents complete zero-out and improves gradient flow
- ELU (Exponential Linear Unit): ELU(x) = x if x>0 else α(e^x - 1); the smooth exponential transition reduces the mean-shift effect
- SELU (Scaled ELU): adds fixed scale factors that enable self-normalization, maintaining approximately unit mean/variance across layers without explicit normalization (these variants are sketched in code after this list)
- GELU (Gaussian Error Linear Unit): GELU(x) = x·Φ(x), where Φ is the standard normal CDF; a smooth counterpart to ReLU with better gradient flow
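A NumPy sketch of the ReLU variants above, using the commonly cited defaults for the negative slope, α, and the SELU scale; the defaults are assumptions for illustration, not prescriptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # x for x > 0, a small negative slope otherwise
    return np.where(x > 0, x, negative_slope * x)

def elu(x, alpha=1.0):
    # smooth exponential transition for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # scaled ELU; the constants are those derived in the SELU paper
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x), elu(x), selu(x), sep="\n")
```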
GELU (Gaussian Error Linear Unit):
- Formula: GELU(x) = x·Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))); the tanh approximation enables efficient computation (compared with the exact form in the sketch after this list)
- Motivation: weights the input by its probability of being kept under a standard Gaussian, GELU(x) = x·P(X≤x) for X~N(0,1), giving a smooth, probabilistic form of gating
- Characteristics: smooth everywhere with non-zero gradient even for negative inputs, which eliminates the dying-unit problem
- Adoption: standard in BERT, GPT-2, RoBERTa; chosen for superior downstream task performance vs ReLU
- Computational Cost: slightly higher than ReLU because of the tanh/erf evaluation; negligible overhead on modern hardware
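A sketch comparing the exact GELU (via the Gaussian CDF) with the tanh approximation given above; NumPy plus math.erf is used here purely for illustration.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    cdf = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    return x * cdf

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small (well below 1e-2)
```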
GELU vs ReLU Empirical Comparison:
- Perplexity: GELU reaches 3.2 versus ReLU's 3.5 on WikiText-103 in the reported comparison; consistently better language modeling
- Fine-tuning: BERT-base with GELU outperforms the ReLU variant by 1-2% on GLUE tasks, consistently across task types
- Training Convergence: GELU enables slightly faster convergence (fewer steps to same loss) due to better gradient flow
- Computational Speed: GELU is marginally slower per step but reaches target performance in fewer steps; net training time is comparable
SiLU (Swish, Sigmoid Linear Unit):
- Formula: SiLU(x) = x·sigmoid(x); self-gated, with the sigmoid controlling the output magnitude based on the input (sketched in code after this list)
- Characteristics: smooth everywhere; gradients do not vanish over entire input regions (no dying units); non-monotonic, dipping slightly below zero for moderately negative inputs and saturating toward zero as x becomes very negative
- Gating Intuition: sigmoid(x) acts as a soft gate (0.5 at zero, approaching 0 or 1 for extreme inputs), enabling learned importance weighting
- Adoption: used in EfficientNet and mobile architectures; standard in modern LLMs, where Llama and PaLM use SwiGLU, a SiLU-gated feed-forward variant
- Performance: SiLU typically matches or exceeds GELU with slightly lower computational overhead
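A NumPy sketch of SiLU and, for context, a SwiGLU-style gated feed-forward unit of the kind used in Llama/PaLM-style models; the weight names and shapes here are illustrative assumptions, not any specific model's layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU(x) = x * sigmoid(x): the sigmoid acts as a soft, input-dependent gate
    return x * sigmoid(x)

def swiglu(x, W_gate, W_up, W_down):
    # SwiGLU-style feed-forward: down-project SiLU(x W_gate) * (x W_up)
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))          # (batch, d_model) -- toy sizes
W_gate = rng.normal(size=(8, 16))
W_up = rng.normal(size=(8, 16))
W_down = rng.normal(size=(16, 8))
print(swiglu(x, W_gate, W_up, W_down).shape)  # (2, 8)
```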
Modern Activation Function Trends:
- Transformer Standard: GELU or SiLU is the default in most new transformer architectures; plain ReLU is increasingly rare in new designs
- Parameter Efficiency: SiLU supports more efficient parameter utilization; smaller SiLU models can match or outperform larger ReLU models
- Scaling Laws: activation choice influences scaling behavior, and SiLU/GELU models are generally reported to scale more efficiently with compute
- Hardware Alignment: modern GPU kernels handle GELU/SiLU nearly as efficiently as ReLU, so there is no practical speed penalty for the richer activation
Gradient Flow Characteristics:
- Gradient Magnitude: ReLU's gradient is a hard 0/1 switch; GELU/SiLU gradients vary smoothly and continuously, reducing the risk of abruptly zeroed gradients (compare the derivative sketch after this list)
- Second Derivative: GELU/SiLU have non-zero second derivatives enabling better curvature information for optimization
- Initialization Interaction: GELU/SiLU train well with standard He initialization without extra tuning, thanks to their smoother gradient flow relative to ReLU
- Deep Network Training: networks deeper than roughly 50 layers train more stably with GELU/SiLU than with ReLU, as seen in modern architectures
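To make the comparison concrete, here is a sketch of the analytic first derivatives of ReLU, exact GELU, and SiLU (elementwise NumPy; illustrative only).

```python
import numpy as np
from math import erf, sqrt, pi

def relu_grad(x):
    # hard 0/1 switch
    return (x > 0).astype(float)

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * pdf(x), pdf = standard normal density
    cdf = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    pdf = np.exp(-0.5 * x**2) / sqrt(2.0 * pi)
    return cdf + x * pdf

def silu_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu_grad(x))  # abrupt: [0. 0. 0. 1. 1.]
print(gelu_grad(x))  # smooth, small but non-zero for negative inputs
print(silu_grad(x))  # smooth, dips slightly below zero for negative inputs
```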
Activation Statistics and Learned Representations:
- Sparsity: ReLU naturally induces roughly 50% sparsity; GELU/SiLU are much less sparse (80-90% of units active), which affects capacity/efficiency trade-offs
- Information Content: mutual-information analyses between activations and outputs indicate that GELU/SiLU preserve more information than ReLU
- Saturation: tanh/sigmoid saturate for roughly |x|>2; GELU/SiLU are less prone to saturation, enabling better gradient flow
- Dead Neuron Rate: ReLU networks commonly show 5-15% dead neurons depending on initialization, versus <1% for GELU/SiLU (a measurement sketch follows this list)
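A sketch of how activation sparsity and dead-unit rates might be measured on a random linear layer; the layer sizes, weight scaling, and near-zero threshold are assumptions for illustration, not figures from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 256))                  # a batch of inputs
W = rng.normal(size=(256, 512)) / np.sqrt(256)    # 1/sqrt(fan_in)-scaled weights
pre = X @ W                                       # pre-activations

relu_out = np.maximum(0.0, pre)
silu_out = pre / (1.0 + np.exp(-pre))             # x * sigmoid(x)

# Fraction of activations that are exactly zero (ReLU) or near zero (SiLU)
print("ReLU zero fraction:", np.mean(relu_out == 0.0))            # ~0.5
print("SiLU near-zero fraction:", np.mean(np.abs(silu_out) < 1e-3))

# "Dead" ReLU units: units that never fire across the whole batch
# (near zero for random weights; rises with poor initialization or training)
print("ReLU dead-unit rate:", np.mean(np.all(relu_out == 0.0, axis=0)))
```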
Computational Complexity and Hardware Considerations:
- FLOPs: ReLU is a single comparison (essentially free); GELU roughly 2-3 FLOPs and SiLU roughly 3 FLOPs (sigmoid + multiply) under common counting conventions (see the micro-benchmark sketch after this list)
- Peak Throughput: A100 tensor cores achieve >90% peak FLOPS for matrix multiply regardless of activation (overhead <5%)
- Memory Bandwidth: activations are elementwise, bandwidth-bound operations whose cost is negligible next to the matrix multiplies that dominate runtime
- Mobile Devices: ReLU is still slightly preferred on very constrained hardware; GELU/SiLU are viable with modern optimization
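A rough CPU micro-benchmark sketch comparing a matrix multiply against the elementwise activations; timings are only indicative, and on GPUs the gap is typically larger still.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2048, 2048)).astype(np.float32)
B = rng.normal(size=(2048, 2048)).astype(np.float32)

def avg_time(fn, reps=10):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

matmul_t = avg_time(lambda: A @ B)
relu_t = avg_time(lambda: np.maximum(0.0, A))
gelu_t = avg_time(lambda: 0.5 * A * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (A + 0.044715 * A**3))))
silu_t = avg_time(lambda: A / (1.0 + np.exp(-A)))

print(f"matmul {matmul_t*1e3:.2f} ms | ReLU {relu_t*1e3:.2f} ms | "
      f"GELU {gelu_t*1e3:.2f} ms | SiLU {silu_t*1e3:.2f} ms")
```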
Activation Function Selection by Task:
- Image Classification: GELU yields a reported 1-2% improvement over ReLU on ImageNet; SiLU is comparable to GELU
- Language Modeling: GELU/SiLU clearly superior to ReLU (2-4% improvement); standard in modern LLMs
- Object Detection: ReLU still common in detector backbones; GELU increasingly adopted in newer architectures
- Reinforcement Learning: ReLU traditional; GELU emerging as better choice for policy/value networks
Theoretical Understanding:
- Expressiveness: smooth activations (GELU, SiLU) are sometimes argued to be more expressive than piecewise-linear ReLU
- Universal Approximation: universal approximation holds for all of these activations given sufficient neurons, so the purely theoretical advantages are marginal
- Optimization Landscape: GELU/SiLU produce smoother loss landscapes that are empirically easier to optimize
- Implicit Regularization: ReLU's sparsity provides some implicit regularization; GELU/SiLU rely more on explicit regularization (dropout, weight decay)
Activation Functions Survey (ReLU, GELU, SiLU) reveals a fundamental shift in modern architecture design: a transition from ReLU's computational simplicity to the superior optimization properties of GELU/SiLU, enabling more efficient scaling of deep networks.