Description
An Activation Function is a mathematical operation applied to the output of a neuron in an artificial neural network. Its purpose is to introduce non-linearity into the network’s architecture, enabling it to learn complex patterns and solve non-trivial tasks such as image recognition, natural language processing, and financial forecasting.
Without activation functions, a stack of layers collapses into a single linear transformation, so the network behaves like a linear regression model no matter how many layers it has. Activation functions break this limitation, allowing networks to model complex relationships and make decisions beyond linear separability.
Why Activation Functions Matter
- 🚀 Non-Linearity: Enables the neural network to model and approximate complex, real-world functions.
- 🧠 Decision Boundaries: Helps form intricate classification boundaries in high-dimensional data.
- 🎯 Learnability: Influences how the model learns through gradient descent and backpropagation.
- 🔁 Enables Deep Learning: Essential for deep networks with multiple hidden layers.
In essence, activation functions define how information flows and transforms from one layer to another in a neural network.
Where It’s Used in a Neural Network
An activation function is applied after a neuron computes the weighted sum of its inputs and adds a bias:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = activation(z)
Here, z is the linear output, and a is the activated output that gets passed to the next layer.
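For illustration, here is a minimal NumPy sketch of a single neuron; the weights, inputs, and bias are arbitrary example values, and ReLU (defined below) stands in for the activation:

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.4, 0.1, -0.6])   # weights w1..wn
b = 0.2                          # bias

z = np.dot(w, x) + b             # linear output z
a = np.maximum(0.0, z)           # activated output a (here: ReLU)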
Types of Activation Functions
1. Linear Activation Function
Formula:
f(x) = x
Characteristics:
- Does not introduce non-linearity
- Rarely used in hidden layers
- Used occasionally in output layers for regression
2. Binary Step Function
Formula:
f(x) = 1 if x ≥ 0 else 0
Characteristics:
- Simple thresholding logic
- Non-differentiable
- Not suitable for gradient-based learning
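Despite its limitations, the step function is easy to express; a one-line NumPy sketch:

import numpy as np

def binary_step(x):
    return np.where(x >= 0, 1, 0)   # 1 for non-negative inputs, 0 otherwise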
3. Sigmoid Function
Formula:
f(x) = 1 / (1 + e^-x)
Output Range: (0, 1)
Pros:
- Smooth gradient
- Output can be interpreted as probability
- Historically popular
Cons:
- Causes vanishing gradients
- Saturates quickly
- Not zero-centered
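A minimal NumPy sketch, with example inputs showing how quickly the function saturates:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes any real input into (0, 1)

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ≈ [0.000045, 0.5, 0.999955]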
4. Hyperbolic Tangent (Tanh)
Formula:
f(x) = (e^x - e^-x) / (e^x + e^-x)
Output Range: (-1, 1)
Pros:
- Zero-centered output
- Steeper gradient than sigmoid
Cons:
- Still suffers from vanishing gradients for large input magnitudes
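NumPy ships tanh directly, so a quick sketch only needs example inputs:

import numpy as np

x = np.array([-3.0, 0.0, 3.0])
print(np.tanh(x))   # ≈ [-0.995, 0.0, 0.995]; zero-centered but saturating for large |x|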
5. ReLU (Rectified Linear Unit)
Formula:
f(x) = max(0, x)
Output Range: [0, ∞)
Pros:
- Sparse activation (many zero outputs)
- Computationally efficient
- Reduces likelihood of vanishing gradient
Cons:
- Dying ReLU Problem: neurons can become inactive and never recover
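A minimal NumPy sketch; note how every negative input maps to zero, which is the sparse activation mentioned above:

import numpy as np

def relu(x):
    return np.maximum(0, x)   # zeroes out negative inputs

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]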
6. Leaky ReLU
Formula:
f(x) = x if x > 0 else αx
Where α is a small positive constant (e.g., 0.01).
Pros:
- Addresses dying ReLU
- Allows small gradients when input is negative
7. Parametric ReLU (PReLU)
A generalization of Leaky ReLU where α is learned during training.
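In PyTorch, for example, this corresponds to nn.PReLU, whose slope is a parameter updated by the optimizer along with the weights (the layer sizes below are illustrative):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.PReLU(),          # α is learnable; PyTorch initializes it to 0.25
    nn.Linear(64, 10)
)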
8. Softmax Function
Formula:
softmax(xᵢ) = e^(xᵢ) / Σ e^(xⱼ) for j in all classes
Purpose:
Used in the output layer of classification networks to produce a probability distribution over classes.
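A NumPy sketch using the common max-subtraction trick for numerical stability:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow without changing the result
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # ≈ [0.659, 0.242, 0.099], sums to 1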
9. Swish
Formula:
f(x) = x * sigmoid(x)
Proposed by researchers at Google Brain, Swish has shown improved performance over ReLU in some deep networks.
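Swish with a fixed β = 1 is the same as the SiLU activation (available in PyTorch as nn.SiLU); a NumPy sketch:

import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)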
10. GELU (Gaussian Error Linear Unit)
Used in Transformer architectures like BERT.
Formula (approximate):
f(x) = 0.5x(1 + tanh(√(2/π)(x + 0.044715x³)))
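A NumPy sketch of the tanh approximation above (the exact GELU uses the Gaussian CDF instead):

import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))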
Visual Comparison of Activation Functions
| Function | Non-Linearity | Differentiable | Popular Use Case |
|---|---|---|---|
| Linear | ❌ | ✅ | Regression output |
| Sigmoid | ✅ | ✅ | Binary classification |
| Tanh | ✅ | ✅ | NLP, RNNs |
| ReLU | ✅ | ✅ | CNNs, MLPs |
| Leaky ReLU | ✅ | ✅ | Deep networks |
| Softmax | ✅ | ✅ | Multi-class classification |
| GELU | ✅ | ✅ | Transformers, LLMs |
Choosing the Right Activation Function
- ReLU: Default choice for hidden layers in most deep learning models
- Sigmoid/Tanh: Good for shallow networks or traditional tasks
- Softmax: Required for multi-class classification outputs
- Linear: Use in regression problems
- GELU/Swish: State-of-the-art deep architectures (BERT, EfficientNet)
Tip: Always match the activation function with the loss function and task type (e.g., use softmax with categorical_crossentropy).
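As a concrete illustration of that pairing, a minimal Keras sketch with a softmax output and categorical cross-entropy (the layer sizes and input shape are arbitrary examples):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')   # probability distribution over 10 classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])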
Mathematical Properties
| Function | Differentiable | Bounded Output | Zero-Centered | Non-Linearity |
|---|---|---|---|---|
| Linear | ✅ | ❌ | ✅ | ❌ |
| Sigmoid | ✅ | ✅ | ❌ | ✅ |
| Tanh | ✅ | ✅ | ✅ | ✅ |
| ReLU | ✅ | ❌ | ❌ | ✅ |
| Softmax | ✅ | ✅ | ❌ (outputs are positive and sum to 1) | ✅ |
Sample Code Snippets
ReLU in PyTorch
import torch.nn as nn
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)
)
Sigmoid in TensorFlow
import tensorflow as tf
input_tensor = tf.keras.Input(shape=(16,))  # example input shape, chosen for illustration
output = tf.keras.layers.Dense(1, activation='sigmoid')(input_tensor)
Custom Leaky ReLU in NumPy
import numpy as np
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
Pitfalls to Avoid
❌ Using sigmoid in deep hidden layers — leads to vanishing gradients
❌ Forgetting to normalize input data — can disrupt activation behavior
❌ Using softmax in hidden layers — only appropriate for final classification layer
❌ Not tuning α in Leaky ReLU or PReLU — improper slopes can hurt learning
❌ Stacking non-zero-centered activations — slows down convergence in some networks
Activation Function in Backpropagation
In backpropagation, the derivative of the activation function is used to compute gradients for weight updates. A good activation function should:
- Be smooth and differentiable
- Avoid large flat regions where gradients vanish
- Be computationally efficient on GPU/TPU hardware
Example: Derivative of sigmoid
f'(x) = sigmoid(x) * (1 - sigmoid(x))
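A NumPy sketch of that derivative; note how small the gradient becomes away from zero, which is the vanishing-gradient issue mentioned earlier:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # f'(x) = sigmoid(x) * (1 - sigmoid(x))

print(sigmoid_grad(np.array([0.0, 5.0])))   # ≈ [0.25, 0.0066]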
Related Keywords
Artificial Neuron
Backpropagation
Binary Step Function
Computational Graph
Deep Neural Network
Differentiable Function
Gradient Vanishing
Hyperbolic Tangent
Layer Normalization
Linear Activation
Loss Function
Neural Network
Non Linearity
Output Layer
Piecewise Function
ReLU Function
Sigmoid Function
Softmax Function
Threshold Function
Transfer Function









