Description

An Activation Function is a mathematical operation applied to the output of a neuron in an artificial neural network. Its purpose is to introduce non-linearity into the network’s architecture, enabling it to learn complex patterns and solve non-trivial tasks such as image recognition, natural language processing, and financial forecasting.

Without activation functions, a neural network would simply behave like a linear regression model, regardless of how many layers it has. Activation functions help break this limitation, allowing networks to model complex relationships and make decisions beyond linear separability.

Why Activation Functions Matter

  • 🚀 Non-Linearity
    Enables the neural network to model and approximate complex, real-world functions.
  • 🧠 Decision Boundaries
    Helps form intricate classification boundaries in high-dimensional data.
  • 🎯 Learnability
    Influences how the model learns through gradient descent and backpropagation.
  • 🔁 Enables Deep Learning
    Essential for deep networks with multiple hidden layers.

In essence, activation functions define how information flows and transforms from one layer to another in a neural network.

Where It’s Used in a Neural Network

An activation function is applied after a neuron computes the weighted sum of its inputs and adds a bias:

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = activation(z)

Here, z is the linear output, and a is the activated output that gets passed to the next layer.
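This computation can be sketched in NumPy; the weights, inputs, and bias below are illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative weights, inputs, and bias
w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
b = 0.1

z = np.dot(w, x) + b   # linear output: w1*x1 + ... + wn*xn + b
a = sigmoid(z)         # activated output passed to the next layer
```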

Types of Activation Functions

1. Linear Activation Function

Formula:

f(x) = x

Characteristics:

  • Does not introduce non-linearity
  • Rarely used in hidden layers
  • Used occasionally in output layers for regression

2. Binary Step Function

Formula:

f(x) = 1 if x ≥ 0 else 0

Characteristics:

  • Simple thresholding logic
  • Non-differentiable
  • Not suitable for gradient-based learning
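A minimal NumPy sketch of the step function; note that its gradient is zero everywhere except at the jump, which is why gradient descent cannot use it:

```python
import numpy as np

def binary_step(x):
    # 1 where x >= 0, else 0
    return np.where(x >= 0, 1, 0)

print(binary_step(np.array([-2.0, 0.0, 3.0])))  # [0 1 1]
```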

3. Sigmoid Function

Formula:

f(x) = 1 / (1 + e^-x)

Output Range: (0, 1)

Pros:

  • Smooth gradient
  • Output can be interpreted as probability
  • Historically popular

Cons:

  • Causes vanishing gradients
  • Saturates quickly
  • Not zero-centered
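The vanishing-gradient problem is easy to see numerically: the sigmoid's derivative peaks at 0.25 and collapses toward zero for large input magnitudes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the largest possible gradient
print(sigmoid_grad(10.0))  # ~4.5e-05: the function has saturated
```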

4. Hyperbolic Tangent (Tanh)

Formula:

f(x) = (e^x - e^-x) / (e^x + e^-x)

Output Range: (-1, 1)

Pros:

  • Zero-centered output
  • Steeper gradient than sigmoid

Cons:

  • Still suffers from vanishing gradients for large input magnitudes

5. ReLU (Rectified Linear Unit)

Formula:

f(x) = max(0, x)

Output Range: [0, ∞)

Pros:

  • Sparse activation (many zero outputs)
  • Computationally efficient
  • Reduces likelihood of vanishing gradient

Cons:

  • Dying ReLU Problem: neurons can become inactive and never recover
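The dying-ReLU issue follows directly from the gradient: it is exactly zero for every negative input, so a neuron whose pre-activations are all negative receives no weight updates. A short NumPy sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 0 for all negative inputs: a neuron stuck here gets no updates
    return np.where(x > 0, 1.0, 0.0)

z = np.array([-3.0, -0.5, 2.0])
print(relu(z))        # [0. 0. 2.]
print(relu_grad(z))   # [0. 0. 1.]
```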

6. Leaky ReLU

Formula:

f(x) = x if x > 0 else αx

Where α is a small positive constant (e.g., 0.01).

Pros:

  • Addresses dying ReLU
  • Allows small gradients when input is negative

7. Parametric ReLU (PReLU)

A generalization of Leaky ReLU where α is learned during training.
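In PyTorch the learnable slope is exposed through nn.PReLU, whose α parameter (0.25 by default) is updated by the optimizer along with the other weights:

```python
import torch
import torch.nn as nn

prelu = nn.PReLU(init=0.25)  # alpha starts at 0.25 and is learned during training
x = torch.tensor([-2.0, 3.0])
print(prelu(x))  # negative input scaled by alpha: [-0.5, 3.0]
```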

8. Softmax Function

Formula:

softmax(xᵢ) = e^(xᵢ) / Σ e^(xⱼ) for j in all classes

Purpose:
Used in the output layer of classification networks to produce a probability distribution over classes.
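A common NumPy implementation subtracts the maximum before exponentiating; this leaves the result unchanged but prevents overflow for large inputs:

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps exp() from overflowing; it does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```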

9. Swish

Formula:

f(x) = x * sigmoid(x)

Developed by researchers at Google, Swish has been shown to outperform ReLU in some deep networks.
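A one-line NumPy sketch of the formula above; note that Swish behaves like the identity for large positive inputs and approaches zero for large negative ones:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # equivalent to x * sigmoid(x)

print(swish(0.0))  # 0.0
```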

10. GELU (Gaussian Error Linear Unit)

Used in Transformer architectures like BERT.

Formula (approximate):

f(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
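The tanh approximation above translates directly into NumPy (PyTorch's nn.GELU provides the exact form as well):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as given in the formula above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(0.0))  # 0.0; for large positive x, gelu(x) approaches x
```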

Visual Comparison of Activation Functions

| Function   | Non-Linearity | Differentiable | Popular Use Case           |
|------------|---------------|----------------|----------------------------|
| Linear     | ❌            | ✅             | Regression output          |
| Sigmoid    | ✅            | ✅             | Binary classification      |
| Tanh       | ✅            | ✅             | NLP, RNNs                  |
| ReLU       | ✅            | ✅ (except 0)  | CNNs, MLPs                 |
| Leaky ReLU | ✅            | ✅ (except 0)  | Deep networks              |
| Softmax    | ✅            | ✅             | Multi-class classification |
| GELU       | ✅            | ✅             | Transformers, LLMs         |

Choosing the Right Activation Function

  • ReLU: Default choice for hidden layers in most deep learning models
  • Sigmoid/Tanh: Suitable for shallow networks, recurrent gates, or binary outputs
  • Softmax: Required for multi-class classification outputs
  • Linear: Use in the output layer for regression problems
  • GELU/Swish: State-of-the-art deep architectures (BERT, EfficientNet)

Tip: Always match the activation function with the loss function and task type (e.g., use softmax with categorical_crossentropy).
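For example, pairing softmax with categorical cross-entropy can be sketched in NumPy (the logits and labels here are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def categorical_crossentropy(y_true, y_pred):
    # y_true is one-hot; the loss is -log of the probability of the true class
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([2.0, 0.5, 0.1])  # raw network outputs (illustrative)
y_true = np.array([1.0, 0.0, 0.0])  # true class is index 0

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(loss)  # small, since softmax put most of its mass on the correct class
```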

Mathematical Properties

| Function | Differentiable | Bounded Output  | Zero-Centered | Non-Linearity |
|----------|----------------|-----------------|---------------|---------------|
| Linear   | ✅             | ❌              | ✅            | ❌            |
| Sigmoid  | ✅             | ✅              | ❌            | ✅            |
| Tanh     | ✅             | ✅              | ✅            | ✅            |
| ReLU     | ✅ (except 0)  | ❌              | ❌            | ✅            |
| Softmax  | ✅             | ✅ (normalized) | ❌            | ✅            |

Sample Code Snippets

ReLU in PyTorch

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)

Sigmoid in TensorFlow

import tensorflow as tf

output = tf.keras.layers.Dense(1, activation='sigmoid')(input_tensor)

Custom Leaky ReLU in NumPy

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

Pitfalls to Avoid

  • Using sigmoid in deep hidden layers — leads to vanishing gradients
  • Forgetting to normalize input data — can disrupt activation behavior
  • Using softmax in hidden layers — only appropriate for the final classification layer
  • Not tuning α in Leaky ReLU or PReLU — improper slopes can hurt learning
  • Stacking non-zero-centered activations — slows down convergence in some networks

Activation Function in Backpropagation

In backpropagation, the derivative of the activation function is used to compute gradients for weight updates. A good activation function should:

  • Be smooth and differentiable
  • Avoid flat regions where gradients vanish
  • Be computationally efficient on GPU/TPU hardware

Example: Derivative of sigmoid

f'(x) = sigmoid(x) * (1 - sigmoid(x))
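This closed form can be verified numerically with a central finite difference, a useful sanity check when implementing any activation's derivative by hand:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form against a numerical (finite-difference) gradient
x, h = 0.7, 1e-6
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(numerical - sigmoid_deriv(x)) < 1e-8)  # True
```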

Related Keywords

Artificial Neuron
Backpropagation
Binary Step Function
Computational Graph
Deep Neural Network
Differentiable Function
Gradient Vanishing
Hyperbolic Tangent
Layer Normalization
Linear Activation
Loss Function
Neural Network
Non Linearity
Output Layer
Piecewise Function
ReLU Function
Sigmoid Function
Softmax Function
Threshold Function
Transfer Function