Description
An Activation Function is a mathematical operation applied to the output of a neuron in an artificial neural network. Its purpose is to introduce non-linearity into the network’s architecture, enabling it to learn complex patterns and solve non-trivial tasks such as image recognition, natural language processing, and financial forecasting.
Without activation functions, a stack of layers collapses into a single linear transformation, so the network behaves like a linear regression model no matter how many layers it has. Activation functions break this limitation, allowing networks to model complex relationships and make decisions beyond linear separability.
Why Activation Functions Matter
- 🚀 Non-Linearity: Enables the neural network to model and approximate complex, real-world functions.
- 🧠 Decision Boundaries: Helps form intricate classification boundaries in high-dimensional data.
- 🎯 Learnability: Influences how the model learns through gradient descent and backpropagation.
- 🔁 Enables Deep Learning: Essential for deep networks with multiple hidden layers.
In essence, activation functions define how information flows and transforms from one layer to another in a neural network.
Where It’s Used in a Neural Network
An activation function is applied after a neuron computes the weighted sum of its inputs and adds a bias:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = activation(z)
Here, z is the linear output, and a is the activated output that gets passed to the next layer.
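For illustration, here is a minimal NumPy sketch of a single neuron; the weights, inputs, and bias are arbitrary example values, and ReLU (defined below) stands in for the activation:

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.4, 0.1, -0.6])   # weights w1..wn
b = 0.2                          # bias

z = np.dot(w, x) + b             # linear output z
a = np.maximum(0.0, z)           # activated output a (here: ReLU)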
Types of Activation Functions
1. Linear Activation Function
Formula:
f(x) = x
Characteristics:
- Does not introduce non-linearity
- Rarely used in hidden layers
- Used occasionally in output layers for regression
2. Binary Step Function
Formula:
f(x) = 1 if x ≥ 0 else 0
Characteristics:
- Simple thresholding logic
- Non-differentiable
- Not suitable for gradient-based learning
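Despite its limitations, the step function is easy to express; a one-line NumPy sketch:

import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)   # 1 for non-negative inputs, 0 otherwise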
3. Sigmoid Function
Formula:
f(x) = 1 / (1 + e^-x)
Output Range: (0, 1)
Pros:
- Smooth gradient
- Output can be interpreted as probability
- Historically popular
Cons:
- Causes vanishing gradients
- Saturates quickly
- Not zero-centered
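A minimal NumPy sketch, with example inputs showing how quickly the function saturates:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes any real input into (0, 1)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ≈ [0.000045, 0.5, 0.999955]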
4. Hyperbolic Tangent (Tanh)
Formula:
f(x) = (e^x - e^-x) / (e^x + e^-x)
Output Range: (-1, 1)
Pros:
- Zero-centered output
- Steeper gradient than sigmoid
Cons:
- Still suffers from vanishing gradients for large input magnitudes
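NumPy ships tanh directly, so a quick sketch only needs example inputs:

import numpy as np

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))   # ≈ [-0.995, 0.0, 0.995]; zero-centered but saturating for large |x|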
5. ReLU (Rectified Linear Unit)
Formula:
f(x) = max(0, x)
Output Range: [0, ∞)
Pros:
- Sparse activation (many zero outputs)
- Computationally efficient
- Reduces likelihood of vanishing gradient
Cons:
- Dying ReLU Problem: neurons can become inactive and never recover
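A minimal NumPy sketch; note how every negative input maps to zero, which is the sparse activation mentioned above:

import numpy as np

def relu(x):
    return np.maximum(0, x)   # zeroes out negative inputs

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]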
6. Leaky ReLU
Formula:
f(x) = x if x > 0 else αx
Where α is a small positive constant (e.g., 0.01).
Pros:
- Addresses dying ReLU
- Allows small gradients when input is negative
7. Parametric ReLU (PReLU)
A generalization of Leaky ReLU where α is learned during training.
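In PyTorch, for example, this corresponds to nn.PReLU, whose slope is a parameter updated by the optimizer along with the weights (the layer sizes below are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.PReLU(),          # α is learnable; PyTorch initializes it to 0.25
    nn.Linear(64, 10)
)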
8. Softmax Function
Formula:
softmax(xᵢ) = e^(xᵢ) / Σ e^(xⱼ) for j in all classes
Purpose:
Used in the output layer of classification networks to produce a probability distribution over classes.
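A NumPy sketch using the common max-subtraction trick for numerical stability:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow without changing the result
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.659, 0.242, 0.099], sums to 1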
9. Swish
Formula:
f(x) = x * sigmoid(x)
Proposed by researchers at Google Brain, Swish has shown improved performance over ReLU in some deep networks.
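Swish with a fixed β = 1 is the same as the SiLU activation (available in PyTorch as nn.SiLU); a NumPy sketch:

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)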
10. GELU (Gaussian Error Linear Unit)
Used in Transformer architectures like BERT.
Formula (approximate):
f(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
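A NumPy sketch of the tanh approximation above (the exact GELU uses the Gaussian CDF instead):

import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))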
Visual Comparison of Activation Functions
| Function | Non-Linearity | Differentiable | Popular Use Case |
|---|---|---|---|
| Linear | ❌ | ✅ | Regression output |
| Sigmoid | ✅ | ✅ | Binary classification |
| Tanh | ✅ | ✅ | NLP, RNNs |
| ReLU | ✅ | ✅ | CNNs, MLPs |
| Leaky ReLU | ✅ | ✅ | Deep networks |
| Softmax | ✅ | ✅ | Multi-class classification |
| GELU | ✅ | ✅ | Transformers, LLMs |
Choosing the Right Activation Function
- ReLU: Default choice for hidden layers in most deep learning models
- Sigmoid/Tanh: Good for shallow networks or traditional tasks
- Softmax: Required for multi-class classification outputs
- Linear: Use in regression problems
- GELU/Swish: State-of-the-art deep architectures (BERT, EfficientNet)
Tip: Always match the activation function with the loss function and task type (e.g., use softmax with categorical_crossentropy).
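As a concrete illustration of that pairing, a minimal Keras sketch with a softmax output and categorical cross-entropy (the layer sizes and input shape are arbitrary examples):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')   # probability distribution over 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])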
Mathematical Properties
| Function | Differentiable | Bounded Output | Zero-Centered | Non-Linearity |
|---|---|---|---|---|
| Linear | ✅ | ❌ | ✅ | ❌ |
| Sigmoid | ✅ | ✅ | ❌ | ✅ |
| Tanh | ✅ | ✅ | ✅ | ✅ |
| ReLU | ✅ | ❌ | ❌ | ✅ |
| Softmax | ✅ | ✅ | ❌ (outputs are positive and sum to 1) | ✅ |
Sample Code Snippets
ReLU in PyTorch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)
Sigmoid in TensorFlow
import tensorflow as tf
input_tensor = tf.keras.Input(shape=(16,))  # example input shape, chosen for illustration
output = tf.keras.layers.Dense(1, activation='sigmoid')(input_tensor)
Custom Leaky ReLU in NumPy
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
Pitfalls to Avoid
❌ Using sigmoid in deep hidden layers — leads to vanishing gradients
❌ Forgetting to normalize input data — can disrupt activation behavior
❌ Using softmax in hidden layers — only appropriate for final classification layer
❌ Not tuning α in Leaky ReLU or PReLU — improper slopes can hurt learning
❌ Stacking non-zero-centered activations — slows down convergence in some networks
Activation Function in Backpropagation
In backpropagation, the derivative of the activation function is used to compute gradients for weight updates. A good activation function should:
- Be smooth and differentiable
- Avoid large flat regions where gradients vanish
- Be computationally efficient on GPU/TPU hardware
Example: Derivative of sigmoid
f'(x) = sigmoid(x) * (1 - sigmoid(x))
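A NumPy sketch of that derivative; note how small the gradient becomes away from zero, which is the vanishing-gradient issue mentioned earlier:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # f'(x) = sigmoid(x) * (1 - sigmoid(x))

print(sigmoid_grad(np.array([0.0, 5.0])))   # ≈ [0.25, 0.0066]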
Related Keywords
Artificial Neuron
Backpropagation
Binary Step Function
Computational Graph
Deep Neural Network
Differentiable Function
Gradient Vanishing
Hyperbolic Tangent
Layer Normalization
Linear Activation
Loss Function
Neural Network
Non Linearity
Output Layer
Piecewise Function
ReLU Function
Sigmoid Function
Softmax Function
Threshold Function
Transfer Function









