Deep Learning
Activation Functions & Optimizers: Complete Deep Dive with Formulas, Use Cases, and Trade-offs
2 March 2026

A complete, no-compromise guide to activation functions and optimizers covering all formulas, advantages, disadvantages, and real-world usage across modern AI systems.
Activation functions and optimizers are the two most fundamental components of neural networks. Activation functions determine how neurons respond, while optimizers determine how models learn.
1. What is an Activation Function?
A neuron first computes:
z = wᵀx + b
Then an activation function f is applied:
a = f(z)
Without activation functions, deep networks collapse into linear models regardless of depth. They introduce non-linearity, enabling complex pattern learning.
2. Why Non-Linearity is Important
Suppose every layer is only linear:
y = W3(W2(W1x + b1) + b2) + b3
This whole expression can still be reduced to:
y = Wx + b
Stacking linear layers results in another linear function. Activation functions break this limitation.
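The collapse of stacked linear layers can be checked numerically. A minimal NumPy sketch (the layer sizes and random weights are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))

# Three linear layers with no activation in between
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))

y_deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The same map collapses to a single linear layer y = Wx + b
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3
y_flat = W @ x + b

assert np.allclose(y_deep, y_flat)  # identical outputs
```

The assertion passes: three linear layers compute exactly the same function as one.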
3. Activation Functions (Complete List)
Binary Step
f(x) = {1 if x ≥ 0, 0 if x < 0}
- Advantage: Simple
- Disadvantage: Not differentiable, zero gradients
- Use: Early perceptrons (historical)
Linear / Identity
f(x) = x
f'(x) = 1
- Advantage: Simple
- Disadvantage: No non-linearity
- Use: Regression outputs
Sigmoid
σ(x) = 1 / (1 + e^-x)
σ'(x) = σ(x)(1 - σ(x))
- Advantage: Probabilities (0–1)
- Disadvantage: Vanishing gradient
- Use: Binary output, LSTM gates
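The sigmoid formula above overflows in naive form for large negative inputs, so implementations usually switch to the equivalent form e^x / (1 + e^x) there. A sketch of a numerically stable sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid: picks the form that never overflows."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))    # safe when x >= 0
    ex = np.exp(x[~pos])                         # safe when x < 0
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # σ'(x) = σ(x)(1 - σ(x))

x = np.array([-500.0, -1.0, 0.0, 1.0, 500.0])
print(sigmoid(x))                       # ≈ [0, 0.269, 0.5, 0.731, 1]
print(sigmoid_grad(np.array([0.0])))    # σ'(0) = 0.25, the maximum
```

The derivative peaking at 0.25 is why deep stacks of sigmoids suffer vanishing gradients: each layer multiplies the gradient by at most 0.25.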
Tanh
tanh(x) = (e^x - e^-x)/(e^x + e^-x)
tanh'(x) = 1 - tanh²(x)
ReLU
f(x) = max(0, x)
Leaky ReLU
f(x) = x (x > 0), αx (x ≤ 0)
PReLU
f(x) = x (x > 0), ax (x ≤ 0), where a is learned during training
ELU
f(x) = x (x > 0), α(e^x - 1) (x ≤ 0)
SELU
f(x) = λ(x if x > 0 else α(e^x - 1))
Softplus
f(x) = ln(1 + e^x)
Softsign
f(x) = x / (1 + |x|)
Swish
f(x) = x · sigmoid(x)
GELU
f(x) = x · Φ(x), where Φ is the standard normal CDF
Mish
f(x) = x · tanh(ln(1 + e^x))
Hard Sigmoid
f(x) = max(0, min(1, (x + 1)/2))
Hard Swish
f(x) = x · ReLU6(x + 3)/6
Softmax
softmax(z_i) = e^{z_i} / Σ_j e^{z_j}
GLU
GLU(a, b) = a ⊗ σ(b)
4. Optimizers
An optimizer updates the parameters θ to minimize the loss. The general update rule:
θ(t+1) = θ(t) - η∇L(θ)
SGD
θ = θ - η∇J(θ)
Mini-batch SGD
θ = θ - η∇J(θ; B)
Momentum
v_t = βv_{t-1} + ηg_t
θ = θ - v_t
Nesterov
θ = θ - η∇L(θ - βv)
Adagrad
θ = θ - η / √(G + ε) · g, where G accumulates past squared gradients
RMSProp
θ = θ - η · g / √(E[g²] + ε)
Adam
m_t = β1 m_{t-1} + (1 - β1)g_t
v_t = β2 v_{t-1} + (1 - β2)g_t²
m̂ = m_t / (1 - β1^t),  v̂ = v_t / (1 - β2^t)
θ = θ - η m̂ / (√v̂ + ε)
AdamW
θ = θ - η(m̂ / (√v̂ + ε) + λθ)
AMSGrad
v̂_t = max(v̂_{t-1}, v_t)
5. Compact comparison table
| Item | Purpose | Formula Idea | Best Use | Main Issue |
|---|---|---|---|---|
| Sigmoid | Squash to (0,1) | 1/(1+e^-x) | Binary output, gates | Vanishing gradient |
| Tanh | Squash to (-1,1) | tanh(x) | RNN hidden state | Vanishing gradient |
| ReLU | Keep positives | max(0,x) | Hidden layers | Dying ReLU |
| Leaky ReLU | Small negative slope | x, αx | CNN/GAN | Tune slope |
| GELU | Smooth gating | xΦ(x) | Transformers | Costlier |
| Softmax | Class probabilities | exp normalize | Multi-class output | Output only |
| SGD | Simple update | -ηg | Generalization | Slow/noisy |
| Momentum | SGD + velocity | Running avg | Vision/CNN | Tune beta |
| RMSProp | Adaptive learning rate | Avg grad² | RNN | May generalize worse |
| Adam | Adaptive moments | m, v estimates | General DL | Memory heavy |
| AdamW | Adam + weight decay | Adam + decay | Transformers | Needs tuning |
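The activation formulas in the table can be sketched directly in NumPy. Φ in GELU is replaced here by the common tanh-based approximation; the test values are arbitrary:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x / (1.0 + np.exp(-x))   # x · σ(x)

def gelu(x):
    # tanh approximation of x · Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))            # [0.  0.  0.  0.5 2. ]
print(softmax(x).sum())   # sums to ≈ 1.0
```

Note the max-subtraction in softmax: without it, e^{z_i} overflows for large logits even though the normalized result is well-defined.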
6. Key Differences
| Component | Role |
|---|---|
| Activation | Controls neuron output |
| Optimizer | Updates weights |
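The optimizer's role (updating weights from gradients) can be illustrated with a from-scratch Adam step minimizing a toy quadratic. The hyperparameters are the common defaults; the loss and step count are made up for the sketch:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize L(θ) = (θ - 3)², whose gradient is 2(θ - 3)
theta = adam_minimize(lambda th: 2 * (th - 3.0), np.array([0.0]))
print(theta)   # converges near [3.]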
7. Best Practical Defaults
- CNN → ReLU + SGD Momentum
- Transformer → GELU + AdamW
- RNN → Tanh + Adam
- Binary → Sigmoid + Adam
- Multi-class → Softmax + Adam
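As an illustration of the binary-classification default (sigmoid output trained on cross-entropy), a tiny logistic-regression loop with plain gradient descent; the dataset, learning rate, and step count are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary dataset: the label is the sign of the first feature
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output layer
    grad_w = X.T @ (p - y) / len(y)           # gradient of binary cross-entropy w.r.t. w
    grad_b = np.mean(p - y)                   # ... and w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = np.mean((p > 0.5) == (y > 0.5))
print(acc)   # high accuracy on this separable training set
```

The clean pairing of sigmoid with cross-entropy is visible in the gradient: the error term is simply (p - y), with no extra σ' factor to vanish.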
Final Insight
Activation functions define how neurons respond. Optimizers define how models learn. Together, they are the mathematical foundation of deep learning.