Activation Functions Survey (ReLU, GELU, SiLU) compares the fundamental non-linearities used in deep learning, which enable neural networks to learn complex functions; each activation offers different trade-offs in complexity, gradient flow, and computational efficiency across modern architectures from CNNs to transformers.
ReLU (Rectified Linear Unit):
- Formula: ReLU(x) = max(0, x), i.e. the identity for x>0 and zero for x≤0
- History: popularized around 2010-2011, it enabled efficient training of very deep networks and underpinned architectures such as AlexNet and ResNet
- Advantages: computationally trivial (a single comparison); sparse activation (roughly half of units inactive for zero-centered inputs) provides implicit regularization
- Gradient: ∂ReLU/∂x = 0 for x<0 and 1 for x>0; gradients are clean where the unit is active but exactly zero for negative inputs (dying ReLU problem), as illustrated in the sketch after this list
- Dying ReLU: neurons whose pre-activations stay negative output zero permanently in some settings, particularly early in training with poor initialization
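A minimal NumPy sketch of ReLU and its (sub)gradient as defined above; this is illustrative code, not tied to any particular framework.

```python
import numpy as np

def relu(x):
    # max(0, x): identity for x > 0, zero otherwise
    return np.maximum(0.0, x)

def relu_grad(x):
    # dReLU/dx: 1 for x > 0, 0 for x < 0 (0 used at x = 0 by convention)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```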
ReLU Variants and Extensions:
- Leaky ReLU: LeakyReLU(x) = x if x>0 else 0.01·x; the small negative slope prevents complete zero-out and improves gradient flow
- ELU (Exponential Linear Unit): ELU(x) = x if x>0 else α(e^x - 1); the smooth exponential transition reduces the mean-shift effect
- SELU (Scaled ELU): adds fixed scale factors that enable self-normalization, maintaining approximately unit mean/variance across layers without explicit normalization (these variants are sketched in code after this list)
- GELU (Gaussian Error Linear Unit): GELU(x) = x·Φ(x), where Φ is the standard normal CDF; a smooth counterpart to ReLU with better gradient flow
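A NumPy sketch of the ReLU variants above, using the commonly cited defaults for the negative slope, α, and the SELU scale; the defaults are assumptions for illustration, not prescriptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # x for x > 0, a small negative slope otherwise
    return np.where(x > 0, x, negative_slope * x)

def elu(x, alpha=1.0):
    # smooth exponential transition for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # scaled ELU; the constants are those derived in the SELU paper
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x), elu(x), selu(x), sep="\n")
```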
GELU (Gaussian Error Linear Unit):
- Formula: GELU(x) = x·Φ(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))); the tanh approximation enables efficient computation (compared with the exact form in the sketch after this list)
- Motivation: weights the input by its probability of being kept under a standard Gaussian, GELU(x) = x·P(X≤x) for X~N(0,1), giving a smooth, probabilistic form of gating
- Characteristics: smooth everywhere with non-zero gradient even for negative inputs, which eliminates the dying-unit problem
- Adoption: standard in BERT, GPT-2, RoBERTa; chosen for superior downstream task performance vs ReLU
- Computational Cost: slightly higher than ReLU because of the tanh/erf evaluation; negligible overhead on modern hardware
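A sketch comparing the exact GELU (via the Gaussian CDF) with the tanh approximation given above; NumPy plus math.erf is used here purely for illustration.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    cdf = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    return x * cdf

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(sqrt(2.0 / pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small (well below 1e-2)
```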
GELU vs ReLU Empirical Comparison:
- Perplexity: GELU reaches 3.2 versus ReLU's 3.5 on WikiText-103 in the reported comparison; consistently better language modeling
- Fine-tuning: BERT-base with GELU outperforms the ReLU variant by 1-2% on GLUE tasks, consistently across task types
- Training Convergence: GELU enables slightly faster convergence (fewer steps to same loss) due to better gradient flow
- Computational Speed: GELU is marginally slower per step but reaches target performance in fewer steps; net training time is comparable
SiLU (Swish, Sigmoid Linear Unit):
- Formula: SiLU(x) = x·sigmoid(x); self-gated, with the sigmoid controlling the output magnitude based on the input (sketched in code after this list)
- Characteristics: smooth everywhere; gradients do not vanish over entire input regions (no dying units); non-monotonic, dipping slightly below zero for moderately negative inputs and saturating toward zero as x becomes very negative
- Gating Intuition: sigmoid(x) acts as a soft gate (0.5 at zero, approaching 0 or 1 for extreme inputs), enabling learned importance weighting
- Adoption: used in EfficientNet and mobile architectures; standard in modern LLMs, where Llama and PaLM use SwiGLU, a SiLU-gated feed-forward variant
- Performance: SiLU typically matches or exceeds GELU with slightly lower computational overhead
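A NumPy sketch of SiLU and, for context, a SwiGLU-style gated feed-forward unit of the kind used in Llama/PaLM-style models; the weight names and shapes here are illustrative assumptions, not any specific model's layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    # SiLU(x) = x * sigmoid(x): the sigmoid acts as a soft, input-dependent gate
    return x * sigmoid(x)

def swiglu(x, W_gate, W_up, W_down):
    # SwiGLU-style feed-forward: down-project SiLU(x W_gate) * (x W_up)
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))          # (batch, d_model) -- toy sizes
W_gate = rng.normal(size=(8, 16))
W_up = rng.normal(size=(8, 16))
W_down = rng.normal(size=(16, 8))
print(swiglu(x, W_gate, W_up, W_down).shape)  # (2, 8)
```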
Modern Activation Function Trends:
- Transformer Standard: GELU or SiLU is the default in most new transformer architectures; plain ReLU is increasingly rare in new designs
- Parameter Efficiency: SiLU supports more efficient parameter utilization; smaller SiLU models can match or outperform larger ReLU models
- Scaling Laws: activation choice influences scaling behavior, and SiLU/GELU models are generally reported to scale more efficiently with compute
- Hardware Alignment: modern GPU kernels handle GELU/SiLU nearly as efficiently as ReLU, so there is no practical speed penalty for the richer activation
Gradient Flow Characteristics:
- Gradient Magnitude: ReLU's gradient is a hard 0/1 switch; GELU/SiLU gradients vary smoothly and continuously, reducing the risk of abruptly zeroed gradients (compare the derivative sketch after this list)
- Second Derivative: GELU/SiLU have non-zero second derivatives enabling better curvature information for optimization
- Initialization Interaction: GELU/SiLU train well with standard He initialization without extra tuning, thanks to their smoother gradient flow relative to ReLU
- Deep Network Training: networks deeper than roughly 50 layers train more stably with GELU/SiLU than with ReLU, as seen in modern architectures
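To make the comparison concrete, here is a sketch of the analytic first derivatives of ReLU, exact GELU, and SiLU (elementwise NumPy; illustrative only).

```python
import numpy as np
from math import erf, sqrt, pi

def relu_grad(x):
    # hard 0/1 switch
    return (x > 0).astype(float)

def gelu_grad(x):
    # d/dx [x * Phi(x)] = Phi(x) + x * pdf(x), pdf = standard normal density
    cdf = 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))
    pdf = np.exp(-0.5 * x**2) / sqrt(2.0 * pi)
    return cdf + x * pdf

def silu_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu_grad(x))  # abrupt: [0. 0. 0. 1. 1.]
print(gelu_grad(x))  # smooth, small but non-zero for negative inputs
print(silu_grad(x))  # smooth, dips slightly below zero for negative inputs
```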
Activation Statistics and Learned Representations:
- Sparsity: ReLU naturally induces roughly 50% sparsity; GELU/SiLU are much less sparse (80-90% of units active), which affects capacity/efficiency trade-offs
- Information Content: mutual-information analyses between activations and outputs indicate that GELU/SiLU preserve more information than ReLU
- Saturation: tanh/sigmoid saturate for roughly |x|>2; GELU/SiLU are less prone to saturation, enabling better gradient flow
- Dead Neuron Rate: ReLU networks commonly show 5-15% dead neurons depending on initialization, versus <1% for GELU/SiLU (a measurement sketch follows this list)
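A sketch of how activation sparsity and dead-unit rates might be measured on a random linear layer; the layer sizes, weight scaling, and near-zero threshold are assumptions for illustration, not figures from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 256))                  # a batch of inputs
W = rng.normal(size=(256, 512)) / np.sqrt(256)    # 1/sqrt(fan_in)-scaled weights
pre = X @ W                                       # pre-activations

relu_out = np.maximum(0.0, pre)
silu_out = pre / (1.0 + np.exp(-pre))             # x * sigmoid(x)

# Fraction of activations that are exactly zero (ReLU) or near zero (SiLU)
print("ReLU zero fraction:", np.mean(relu_out == 0.0))            # ~0.5
print("SiLU near-zero fraction:", np.mean(np.abs(silu_out) < 1e-3))

# "Dead" ReLU units: units that never fire across the whole batch
# (near zero for random weights; rises with poor initialization or training)
print("ReLU dead-unit rate:", np.mean(np.all(relu_out == 0.0, axis=0)))
```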
Computational Complexity and Hardware Considerations:
- FLOPs: ReLU is a single comparison (essentially free); GELU roughly 2-3 FLOPs and SiLU roughly 3 FLOPs (sigmoid + multiply) under common counting conventions (see the micro-benchmark sketch after this list)
- Peak Throughput: A100 tensor cores achieve >90% peak FLOPS for matrix multiply regardless of activation (overhead <5%)
- Memory Bandwidth: activations are elementwise, bandwidth-bound operations whose cost is negligible next to the matrix multiplies that dominate runtime
- Mobile Devices: ReLU is still slightly preferred on very constrained hardware; GELU/SiLU are viable with modern optimization
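A rough CPU micro-benchmark sketch comparing a matrix multiply against the elementwise activations; timings are only indicative, and on GPUs the gap is typically larger still.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2048, 2048)).astype(np.float32)
B = rng.normal(size=(2048, 2048)).astype(np.float32)

def avg_time(fn, reps=10):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

matmul_t = avg_time(lambda: A @ B)
relu_t = avg_time(lambda: np.maximum(0.0, A))
gelu_t = avg_time(lambda: 0.5 * A * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (A + 0.044715 * A**3))))
silu_t = avg_time(lambda: A / (1.0 + np.exp(-A)))

print(f"matmul {matmul_t*1e3:.2f} ms | ReLU {relu_t*1e3:.2f} ms | "
      f"GELU {gelu_t*1e3:.2f} ms | SiLU {silu_t*1e3:.2f} ms")
```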
Activation Function Selection by Task:
- Image Classification: GELU yields a reported 1-2% improvement over ReLU on ImageNet; SiLU is comparable to GELU
- Language Modeling: GELU/SiLU clearly superior to ReLU (2-4% improvement); standard in modern LLMs
- Object Detection: ReLU still common in detector backbones; GELU increasingly adopted in newer architectures
- Reinforcement Learning: ReLU traditional; GELU emerging as better choice for policy/value networks
Theoretical Understanding:
- Expressiveness: smooth activations (GELU, SiLU) are sometimes argued to be more expressive than piecewise-linear ReLU
- Universal Approximation: universal approximation holds for all of these activations given sufficient neurons, so the purely theoretical advantages are marginal
- Optimization Landscape: GELU/SiLU produce smoother loss landscapes that are empirically easier to optimize
- Implicit Regularization: ReLU's sparsity provides some implicit regularization; GELU/SiLU rely more on explicit regularization (dropout, weight decay)
Activation Functions Survey (ReLU, GELU, SiLU) reveals a fundamental shift in modern architecture design: a transition from ReLU's computational simplicity to the superior optimization properties of GELU/SiLU, enabling more efficient scaling of deep networks.