Description

Batch Gradient Descent is an optimization algorithm used in training machine learning models, particularly neural networks. It is a variant of the gradient descent method that computes the gradient of the cost function with respect to model parameters using the entire training dataset for each update.

As a cornerstone of supervised learning, batch gradient descent helps the model minimize a loss function (such as mean squared error or cross-entropy) by adjusting weights and biases in the opposite direction of the gradient.

For convex loss functions with a suitably chosen learning rate, this approach provably converges to the global minimum, and its core idea underlies the optimizers used throughout classical machine learning and deep learning.

How It Works

The basic idea of gradient descent is to find the minimum of a function (typically a loss or cost function) by iteratively moving in the direction of the steepest descent, which is the negative gradient.

In batch gradient descent, the gradient is calculated across all training examples at once:

Update Rule:

θ := θ - η * ∇J(θ)

Where:

  • θ: vector of parameters (weights, biases)
  • η: learning rate (step size)
  • ∇J(θ): gradient of the cost function with respect to θ
  • J(θ): cost function (e.g., loss over all samples)
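As a minimal sketch of this rule in NumPy (the parameter vector, learning rate, and gradient values below are made up for illustration):

```python
import numpy as np

# Hypothetical current parameters and learning rate (illustrative values)
theta = np.array([0.5, -1.2])   # θ
eta = 0.1                       # η
grad_J = np.array([0.4, -0.8])  # ∇J(θ), assumed already computed

# One batch gradient descent step: move against the gradient
theta = theta - eta * grad_J    # θ := θ - η * ∇J(θ)
print(theta)  # ≈ [0.46, -1.12]
```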

Workflow of Batch Gradient Descent

  1. Initialize Parameters
    Start with random values for weights and biases.
  2. Forward Pass
    Use the model to compute predictions on the entire training set.
  3. Compute Loss
    Calculate how far predictions are from actual labels.
  4. Backpropagation
    Compute gradients of the loss with respect to parameters.
  5. Update Parameters
    Use the gradient and learning rate to adjust weights.
  6. Repeat
    Iterate this process for many epochs until convergence.
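The six steps above can be sketched as a single training loop; the linear model and the toy data here are made up for illustration:

```python
import numpy as np

# Toy data following y = 3x + 1 (made up)
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[4.0], [7.0], [10.0]])
X_b = np.c_[np.ones((3, 1)), X]       # prepend a bias column

theta = np.zeros((2, 1))              # 1. initialize parameters
eta, n_epochs = 0.05, 5000

for epoch in range(n_epochs):
    preds = X_b @ theta               # 2. forward pass on the entire set
    errors = preds - y                # 3. loss would be (1/(2m)) * sum(errors**2)
    grad = X_b.T @ errors / len(X_b)  # 4. gradient of the loss w.r.t. theta
    theta -= eta * grad               # 5. update parameters
                                      # 6. repeat for many epochs

print(theta.ravel())  # approaches [1, 3], i.e. y = 3x + 1
```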

Mathematical Formulation

Let:

  • X: input matrix of shape (n_samples, n_features)
  • y: target vector of shape (n_samples,)
  • θ: model parameters
  • hθ(X): model prediction
  • J(θ): cost function (e.g., MSE)

The cost function in batch gradient descent is:

J(θ) = (1/2n) * Σ (hθ(xᵢ) - yᵢ)²

The gradient with respect to θ is:

∇J(θ) = (1/n) * Xᵀ(hθ(X) - y)

Then update:

θ := θ - η * ∇J(θ)

Example: Linear Regression with Batch Gradient Descent

import numpy as np

# Inputs: a perfect linear relationship y = 2x
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10]).reshape(-1, 1)

# Add bias term (column of ones)
X_b = np.c_[np.ones((5, 1)), X]

# Initialize parameters randomly (seeded for reproducibility)
np.random.seed(42)
theta = np.random.randn(2, 1)

# Hyperparameters
learning_rate = 0.01
n_iterations = 1000
m = len(X_b)  # number of training samples

for iteration in range(n_iterations):
    # Gradient of J(theta) = (1/2m) * sum of squared errors,
    # i.e. (1/m) * X^T (X @ theta - y), matching the formula above
    gradients = (1/m) * X_b.T.dot(X_b.dot(theta) - y)
    theta -= learning_rate * gradients

print("Final parameters:", theta)

Advantages of Batch Gradient Descent

Stable Convergence
Each update is based on the entire dataset, leading to smooth, stable convergence.

Accurate Gradient Estimates
Averaging over the full dataset yields exact, noise-free gradients, unlike stochastic estimates.

Theoretical Guarantees
For convex cost functions with a suitably small learning rate, it provably converges to the global minimum.

Ideal for Small Datasets
Perfect for datasets that easily fit into memory.

Limitations

Memory Intensive
Requires entire dataset to be loaded into memory.

Slow on Large Datasets
One update per epoch can make training inefficient for large-scale data.

Redundant Computation
Similar training examples contribute nearly identical gradient information, yet every example is reprocessed in full on each epoch.

Not Suited for Online Learning
Cannot handle streaming data effectively.

Batch vs Stochastic vs Mini-Batch Gradient Descent

Type               Dataset Used per Update     Pros                                  Cons
Batch              Entire dataset              Stable, accurate                      Memory-heavy, slow on big data
Stochastic (SGD)   1 example                   Fast, online learning possible        Noisy updates, may not converge
Mini-Batch         Subsets of data (batches)   Balance between speed and stability   Requires tuning batch size
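In code, the three variants differ only in which rows feed the gradient computation. A sketch (the `grad` helper and the toy data are assumptions for illustration):

```python
import numpy as np

def grad(X_b, y, theta):
    # Average gradient of (1/(2m)) * squared error over the rows given
    return X_b.T @ (X_b @ theta - y) / len(X_b)

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((6, 1)), np.arange(6.0).reshape(-1, 1)]
y = X_b @ np.array([[1.0], [2.0]])  # made-up linear targets
theta = np.zeros((2, 1))
eta = 0.05

# Batch: one update per epoch, using every row
theta -= eta * grad(X_b, y, theta)

# Stochastic (SGD): one update per single randomly drawn example
i = rng.integers(len(X_b))
theta -= eta * grad(X_b[i:i+1], y[i:i+1], theta)

# Mini-batch: one update per small random subset (batch size 2 here)
idx = rng.choice(len(X_b), size=2, replace=False)
theta -= eta * grad(X_b[idx], y[idx], theta)
```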

Best Practices

  • Normalize or standardize input features
  • Monitor training and validation loss
  • Use appropriate learning rate (with decay or scheduling)
  • Combine with momentum or adaptive optimizers (e.g., Adam, RMSprop)
  • Use early stopping to avoid overfitting
  • For large datasets, prefer mini-batch variant
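As one example, the first practice (standardizing features) takes a few lines of NumPy; the feature matrix below is made up to show two very differently scaled columns:

```python
import numpy as np

# Made-up feature matrix: column scales differ by three orders of magnitude
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

# Standardize each feature to zero mean and unit variance
mean, std = X.mean(axis=0), X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0))  # ≈ [0, 0]
print(X_std.std(axis=0))   # ≈ [1, 1]
```

Without this step, the larger-scale feature dominates the gradient and forces a much smaller learning rate.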

Use in Deep Learning

In deep neural networks, batch gradient descent is rarely used directly due to the size of modern datasets. Instead, mini-batch gradient descent (a compromise between batch and stochastic) is the preferred method.

However, the idea and derivation of all variants stem from the same core principles of batch gradient descent.

Visual Illustration

Epoch 1
 ↳ Entire training set used → compute average gradient
 ↳ Update weights

Epoch 2
 ↳ Again, full dataset used → compute new gradient
 ↳ Update weights
...

Loss function curve is smooth but updates are infrequent.

When to Use Batch Gradient Descent

✅ When dataset is small and fits in memory
✅ When computational cost is not a concern
✅ When seeking smooth convergence
✅ When theoretical convergence properties are important
✅ When performing analytical studies of loss landscapes

Key Snippets

Gradient Update in Numpy

gradient = (1 / len(X)) * X.T @ (X @ theta - y)
theta -= learning_rate * gradient

Learning Rate Scheduler (Manual Decay)

learning_rate = initial_rate / (1 + decay_rate * epoch)

Cost Function (MSE)

import numpy as np

def compute_cost(X, y, theta):
    # MSE with the 1/2 factor, matching J(θ) = (1/2n) * Σ (hθ(xᵢ) - yᵢ)²
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y)**2)
    return cost

Related Keywords

Activation Function
Artificial Neural Network
Backpropagation
Convergence Rate
Cost Function
Deep Learning
Epoch
Gradient Descent
Learning Rate
Loss Function
Mini-Batch Gradient Descent
Momentum Optimization
Neural Network Training
Optimization Algorithm
Parameter Update
Regularization Technique
Stochastic Gradient Descent
Training Dataset
Weight Initialization
Weight Update