Deep Learning
Activation Functions & Optimizers: Complete Deep Dive with Formulas, Use Cases, and Trade-offs
2 March 2026

A complete, no-compromise guide to activation functions and optimizers covering all formulas, advantages, disadvantages, and real-world usage across modern AI systems.
Activation functions and optimizers are the two most fundamental components of neural networks. Activation functions determine how neurons respond, while optimizers determine how models learn.
1. What is an Activation Function?
A neuron first computes:
z = wᵀx + b
Then an activation function f is applied:
a = f(z)
Without activation functions, deep networks collapse into linear models regardless of depth. They introduce non-linearity, enabling complex pattern learning.
2. Why Non-Linearity is Important
Suppose every layer is only linear:
y = W3(W2(W1x + b1) + b2) + b3
This whole expression can still be reduced to:
y = Wx + b
Stacking linear layers results in another linear function. Activation functions break this limitation.
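The collapse of stacked linear layers can be checked numerically. A minimal NumPy sketch (the layer sizes and random weights are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))

# Three linear layers with no activation in between
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5, 1))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3, 1))
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=(2, 1))

y_deep = W3 @ (W2 @ (W1 @ x + b1) + b2) + b3

# The same map collapses to a single linear layer y = Wx + b
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3
y_flat = W @ x + b

assert np.allclose(y_deep, y_flat)  # identical outputs
```

The assertion passes: three linear layers compute exactly the same function as one.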
3. Activation Functions (Complete List)
Binary Step
f(x) = {1 if x ≥ 0, 0 if x < 0}
- Advantage: Simple
- Disadvantage: Not differentiable, zero gradients
- Use: Early perceptrons (historical)
Linear / Identity
f(x) = x
f'(x) = 1
- Advantage: Simple
- Disadvantage: No non-linearity
- Use: Regression outputs
Sigmoid
σ(x) = 1 / (1 + e^-x)
σ'(x) = σ(x)(1 - σ(x))
- Advantage: Probabilities (0–1)
- Disadvantage: Vanishing gradient
- Use: Binary output, LSTM gates
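The sigmoid formula above overflows in naive form for large negative inputs, so implementations usually switch to the equivalent form e^x / (1 + e^x) there. A sketch of a numerically stable sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid: picks the form that never overflows."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))    # safe when x >= 0
    ex = np.exp(x[~pos])                         # safe when x < 0
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # σ'(x) = σ(x)(1 - σ(x))

x = np.array([-500.0, -1.0, 0.0, 1.0, 500.0])
print(sigmoid(x))                       # ≈ [0, 0.269, 0.5, 0.731, 1]
print(sigmoid_grad(np.array([0.0])))    # σ'(0) = 0.25, the maximum
```

The derivative peaking at 0.25 is why deep stacks of sigmoids suffer vanishing gradients: each layer multiplies the gradient by at most 0.25.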
Tanh
tanh(x) = (e^x - e^-x)/(e^x + e^-x)
tanh'(x) = 1 - tanh²(x)
ReLU
f(x) = max(0, x)
Leaky ReLU
f(x) = x (x > 0), αx (x ≤ 0)
PReLU
f(x) = x (x > 0), ax (x ≤ 0), where a is learned during training
ELU
f(x) = x (x > 0), α(e^x - 1) (x ≤ 0)
SELU
f(x) = λ(x if x > 0 else α(e^x - 1))
Softplus
f(x) = ln(1 + e^x)
Softsign
f(x) = x / (1 + |x|)
Swish
f(x) = x · sigmoid(x)
GELU
f(x) = x · Φ(x), where Φ is the standard normal CDF
Mish
f(x) = x · tanh(ln(1 + e^x))
Hard Sigmoid
f(x) = max(0, min(1, (x + 1)/2))
Hard Swish
f(x) = x · ReLU6(x + 3)/6
Softmax
softmax(z_i) = e^{z_i} / Σ_j e^{z_j}
GLU
GLU(a, b) = a ⊗ σ(b)
4. Optimizers
An optimizer updates the parameters θ to minimize the loss. The general update rule:
θ(t+1) = θ(t) - η∇L(θ)
SGD
θ = θ - η∇J(θ)
Mini-batch SGD
θ = θ - η∇J(θ; B)
Momentum
v_t = βv_{t-1} + ηg_t
θ = θ - v_t
Nesterov
θ = θ - η∇L(θ - βv)
Adagrad
θ = θ - η / √(G + ε) · g, where G accumulates past squared gradients
RMSProp
θ = θ - η · g / √(E[g²] + ε)
Adam
m_t = β1 m_{t-1} + (1 - β1)g_t
v_t = β2 v_{t-1} + (1 - β2)g_t²
m̂ = m_t / (1 - β1^t),  v̂ = v_t / (1 - β2^t)
θ = θ - η m̂ / (√v̂ + ε)
AdamW
θ = θ - η(m̂ / (√v̂ + ε) + λθ)
AMSGrad
v̂_t = max(v̂_{t-1}, v_t)
5. Compact comparison table
| Item | Purpose | Formula Idea | Best Use | Main Issue |
|---|---|---|---|---|
| Sigmoid | Squash to (0,1) | 1/(1+e^-x) | Binary output, gates | Vanishing gradient |
| Tanh | Squash to (-1,1) | tanh(x) | RNN hidden state | Vanishing gradient |
| ReLU | Keep positives | max(0,x) | Hidden layers | Dying ReLU |
| Leaky ReLU | Small negative slope | x, αx | CNN/GAN | Tune slope |
| GELU | Smooth gating | xΦ(x) | Transformers | Costlier |
| Softmax | Class probabilities | exp normalize | Multi-class output | Output only |
| SGD | Simple update | -ηg | Generalization | Slow/noisy |
| Momentum | SGD + velocity | Running avg | Vision/CNN | Tune beta |
| RMSProp | Adaptive learning rate | Avg grad² | RNN | May generalize worse |
| Adam | Adaptive moments | m, v estimates | General DL | Memory heavy |
| AdamW | Adam + weight decay | Adam + decay | Transformers | Needs tuning |
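The activation formulas in the table can be sketched directly in NumPy. Φ in GELU is replaced here by the common tanh-based approximation; the test values are arbitrary:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def swish(x):
    return x / (1.0 + np.exp(-x))   # x · σ(x)

def gelu(x):
    # tanh approximation of x · Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))            # [0.  0.  0.  0.5 2. ]
print(softmax(x).sum())   # sums to ≈ 1.0
```

Note the max-subtraction in softmax: without it, e^{z_i} overflows for large logits even though the normalized result is well-defined.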
6. Key Differences
| Component | Role |
|---|---|
| Activation | Controls neuron output |
| Optimizer | Updates weights |
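The optimizer's role (updating weights from gradients) can be illustrated with a from-scratch Adam step minimizing a toy quadratic. The hyperparameters are the common defaults; the loss and step count are made up for the sketch:

```python
import numpy as np

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)   # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize L(θ) = (θ - 3)², whose gradient is 2(θ - 3)
theta = adam_minimize(lambda th: 2 * (th - 3.0), np.array([0.0]))
print(theta)   # converges near [3.]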
7. Best Practical Defaults
- CNN → ReLU + SGD Momentum
- Transformer → GELU + AdamW
- RNN → Tanh + Adam
- Binary → Sigmoid + Adam
- Multi-class → Softmax + Adam
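As an illustration of the binary-classification default (sigmoid output trained on cross-entropy), a tiny logistic-regression loop with plain gradient descent; the dataset, learning rate, and step count are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary dataset: the label is the sign of the first feature
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid output layer
    grad_w = X.T @ (p - y) / len(y)           # gradient of binary cross-entropy w.r.t. w
    grad_b = np.mean(p - y)                   # ... and w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = np.mean((p > 0.5) == (y > 0.5))
print(acc)   # high accuracy on this separable training set
```

The clean pairing of sigmoid with cross-entropy is visible in the gradient: the error term is simply (p - y), with no extra σ' factor to vanish.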
Final Insight
Activation functions define how neurons respond. Optimizers define how models learn. Together, they are the mathematical foundation of deep learning.