
Deep Learning

Activation Functions & Optimizers: Complete Deep Dive with Formulas, Use Cases, and Trade-offs

2 March 2026


A complete, no-compromise guide to activation functions and optimizers covering all formulas, advantages, disadvantages, and real-world usage across modern AI systems.

Activation functions and optimizers are the two most fundamental components of neural networks. Activation functions determine how neurons respond, while optimizers determine how models learn.

1. What is an Activation Function

A neuron first computes:

z = wᵀx + b

Then an activation function f is applied:

a = f(z)

Without activation functions, deep networks collapse into linear models regardless of depth. They introduce non-linearity, enabling complex pattern learning.
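A minimal sketch of this two-step computation in plain Python (the `neuron` helper and the example weights are purely illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b, f):
    """Single neuron: z = w·x + b, then a = f(z)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, so a = sigmoid(0.1) ≈ 0.525
a = neuron([0.5, -0.25], [1.0, 2.0], 0.1, sigmoid)
```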

2. Why Non-Linearity is Important

Suppose every layer is only linear:

y = W3(W2(W1x + b1) + b2) + b3

This composition reduces to a single linear function:

y = Wx + b, where W = W3W2W1 and b = W3W2b1 + W3b2 + b3

Stacking linear layers results in another linear function. Activation functions break this limitation.
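The collapse is easy to verify numerically in the one-dimensional case (a small sketch with arbitrary example weights):

```python
# Two stacked linear "layers" without activations, scalar case.
W1, b1 = 2.0, 1.0
W2, b2 = 3.0, -0.5

def stacked(x):
    return W2 * (W1 * x + b1) + b2

# The equivalent single linear layer: W = W2*W1, b = W2*b1 + b2.
W, b = W2 * W1, W2 * b1 + b2

for x in (-2.0, 0.0, 3.5):
    assert abs(stacked(x) - (W * x + b)) < 1e-12  # identical outputs
```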

3. Activation Functions (Complete List)

Binary Step

f(x) = {1 if x ≥ 0, 0 if x < 0}
  • Advantage: Simple
  • Disadvantage: Not differentiable, zero gradients
  • Use: Early perceptrons (historical)

Linear / Identity

f(x) = x
f'(x) = 1
  • Advantage: Simple
  • Disadvantage: No non-linearity
  • Use: Regression outputs

Sigmoid

σ(x) = 1 / (1 + e^-x)
σ'(x) = σ(x)(1 - σ(x))
  • Advantage: Probabilities (0–1)
  • Disadvantage: Vanishing gradient
  • Use: Binary output, LSTM gates
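The derivative identity above can be sanity-checked against a central finite difference (a quick illustrative check in plain Python):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# σ'(x) = σ(x)(1 − σ(x)) should match (σ(x+h) − σ(x−h)) / 2h for small h.
x, h = 0.7, 1e-5
analytic = sigmoid(x) * (1.0 - sigmoid(x))
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
assert abs(analytic - numeric) < 1e-9
```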

Tanh

tanh(x) = (e^x - e^-x)/(e^x + e^-x)
tanh'(x) = 1 - tanh²(x)
  • Advantage: Zero-centered output in (−1, 1)
  • Disadvantage: Vanishing gradient
  • Use: RNN hidden states

ReLU

f(x) = max(0, x)
  • Advantage: Cheap to compute; no vanishing gradient for x > 0
  • Disadvantage: Dying ReLU (neurons stuck at zero)
  • Use: Default for hidden layers, especially CNNs

Leaky ReLU

f(x) = x (x > 0), αx (x ≤ 0), with a small fixed slope α (commonly 0.01)

PReLU

f(x) = x (x > 0), ax (x ≤ 0), where the slope a is learned during training

ELU

f(x) = x (x > 0), α(e^x − 1) (x ≤ 0)

SELU

f(x) = λ(x if x>0 else α(e^x - 1))

Softplus

f(x) = ln(1 + e^x)

Softsign

f(x) = x / (1 + |x|)

Swish

f(x) = x * sigmoid(x)

GELU

f(x) = xΦ(x)

Mish

f(x) = x * tanh(ln(1 + e^x))

Hard Sigmoid

f(x) = max(0, min(1, (x+1)/2))

Hard Swish

f(x) = x * ReLU6(x+3)/6

Softmax

softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

GLU

GLU(a, b) = a ⊗ σ(b), where ⊗ is element-wise multiplication and a, b are two halves of a linear projection
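A handful of the activations listed above, written as plain-Python functions (an illustrative sketch; real frameworks apply these element-wise to tensors):

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def swish(x):                      # x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def gelu(x):                       # x * Φ(x), exact Gaussian CDF form
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(z):
    m = max(z)                     # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum logit before exponentiating changes nothing mathematically but prevents overflow when logits are large.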

4. Optimizers

Every optimizer below refines the basic gradient-descent update:

θ_{t+1} = θ_t − η∇L(θ_t)

where η is the learning rate.

SGD

θ = θ - η∇J(θ)

Mini-batch

θ = θ - η∇J(θ; B), where B is a randomly sampled mini-batch

Momentum

v_t = βv_{t-1} + ηg_t
θ = θ - v_t

Nesterov

v_t = βv_{t-1} + η∇L(θ − βv_{t-1})
θ = θ - v_t

Adagrad

θ = θ - (η / √(G_t + ε)) · g_t, where G_t is the accumulated sum of squared gradients

RMSProp

E[g²]_t = ρE[g²]_{t-1} + (1 − ρ)g_t²
θ = θ - η g_t / √(E[g²]_t + ε)

Adam

m_t = β1 m_{t-1} + (1 − β1)g_t
v_t = β2 v_{t-1} + (1 − β2)g_t²
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)   (bias correction)
θ = θ - η m̂_t/(√v̂_t + ε)

AdamW

θ = θ - η(m̂/(√v̂ + ε) + λθ), which decouples the weight-decay term λθ from the adaptive gradient step

AMSGrad

v̂_t = max(v̂_{t-1}, v_t), then update as in Adam using v̂_t; the running maximum keeps the effective learning rate from increasing
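The momentum and Adam updates above can be sketched in plain Python and tried on the toy loss L(θ) = θ² (all hyperparameter values here are illustrative):

```python
import math

def momentum_step(theta, g, v, lr=0.01, beta=0.9):
    v = beta * v + lr * g              # v_t = βv_{t-1} + ηg_t
    return theta - v, v                # θ = θ − v_t

def adam_step(theta, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first-moment estimate
    v = b2 * v + (1 - b2) * g * g      # second-moment estimate
    m_hat = m / (1 - b1 ** t)          # bias correction (m, v start at 0)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize L(θ) = θ², whose gradient is 2θ, starting from θ = 5.
theta_m, v_m = 5.0, 0.0
theta_a, m_a, v_a = 5.0, 0.0, 0.0
for t in range(1, 1001):
    theta_m, v_m = momentum_step(theta_m, 2.0 * theta_m, v_m)
    theta_a, m_a, v_a = adam_step(theta_a, 2.0 * theta_a, m_a, v_a, t)
# Both runs drive θ toward the minimum at 0.
```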

5. Compact Comparison Table

| Item | Purpose | Formula Idea | Best Use | Main Issue |
|------|---------|--------------|----------|------------|
| Sigmoid | Squash to (0,1) | 1/(1+e^-x) | Binary output, gates | Vanishing gradient |
| Tanh | Squash to (−1,1) | tanh(x) | RNN hidden state | Vanishing gradient |
| ReLU | Keep positives | max(0,x) | Hidden layers | Dying ReLU |
| Leaky ReLU | Small negative slope | x, αx | CNN/GAN | Tune slope |
| GELU | Smooth gating | xΦ(x) | Transformers | Costlier |
| Softmax | Class probabilities | exp normalize | Multi-class output | Output only |
| SGD | Simple update | −ηg | Generalization | Slow/noisy |
| Momentum | SGD + velocity | Running avg | Vision/CNN | Tune beta |
| RMSProp | Adaptive learning rate | Avg grad² | RNN | May generalize worse |
| Adam | Adaptive moments | m, v estimates | General DL | Memory heavy |
| AdamW | Adam + weight decay | Adam + decay | Transformers | Needs tuning |

6. Key Differences

| Component | Role |
|-----------|------|
| Activation | Controls neuron output |
| Optimizer | Updates weights |

7. Best Practical Defaults

  • CNN → ReLU + SGD Momentum
  • Transformer → GELU + AdamW
  • RNN → Tanh + Adam
  • Binary → Sigmoid + Adam
  • Multi-class → Softmax + Adam

Final Insight

Activation functions define how neurons respond. Optimizers define how models learn. Together, they are the mathematical foundation of deep learning.
