Description
Batch Gradient Descent is an optimization algorithm used in training machine learning models, particularly neural networks. It is a variant of the gradient descent method that computes the gradient of the cost function with respect to model parameters using the entire training dataset for each update.
As a cornerstone of supervised learning, batch gradient descent helps the model minimize a loss function (such as mean squared error or cross-entropy) by adjusting weights and biases in the opposite direction of the gradient.
For convex loss functions, this approach is guaranteed to converge to the global minimum (given a suitably small learning rate), and it underpins both classical machine learning and modern deep learning systems.
How It Works
The basic idea of gradient descent is to find the minimum of a function (typically a loss or cost function) by iteratively moving in the direction of the steepest descent, which is the negative gradient.
In batch gradient descent, the gradient is calculated across all training examples at once:
Update Rule:
θ := θ - η * ∇J(θ)
Where:
- θ: vector of parameters (weights, biases)
- η: learning rate (step size)
- ∇J(θ): gradient of the cost function with respect to θ
- J(θ): cost function (e.g., loss over all samples)
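The update rule can be seen in action on a one-dimensional toy problem. The sketch below (illustrative values, not from the text) minimizes the convex function J(θ) = θ², whose gradient is 2θ:

```python
# Minimal sketch: gradient descent on J(theta) = theta**2 (gradient 2*theta).
# Starting point and learning rate are illustrative choices.
theta = 5.0   # initial parameter
eta = 0.1     # learning rate (step size)

for _ in range(100):
    grad = 2 * theta             # dJ/dtheta
    theta = theta - eta * grad   # update rule: theta := theta - eta * grad

print(theta)  # very close to 0, the minimum of J
```

Each step multiplies θ by (1 − 2η), so the parameter shrinks geometrically toward the minimum.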
Workflow of Batch Gradient Descent
- Initialize Parameters: start with random values for weights and biases.
- Forward Pass: use the model to compute predictions on the entire training set.
- Compute Loss: calculate how far predictions are from the actual labels.
- Backpropagation: compute gradients of the loss with respect to the parameters.
- Update Parameters: use the gradient and learning rate to adjust the weights.
- Repeat: iterate this process for many epochs until convergence.
Mathematical Formulation
Let:
- X: input matrix of shape (n_samples, n_features)
- y: target vector of shape (n_samples,)
- θ: model parameters
- hθ(X): model prediction
- J(θ): cost function (e.g., MSE)
The cost function in batch gradient descent is:
J(θ) = (1/2n) * Σ (hθ(xᵢ) - yᵢ)²
The gradient with respect to θ is:
∇J(θ) = (1/n) * Xᵀ(hθ(X) - y)
Then update:
θ := θ - η * ∇J(θ)
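A useful sanity check on the closed-form gradient is to compare it against a finite-difference estimate of ∇J(θ). The sketch below does this on small synthetic data (all names and values are illustrative):

```python
import numpy as np

# Verify the analytic gradient (1/n) * X.T @ (X @ theta - y) against a
# central finite-difference estimate, using synthetic data.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)
y = rng.normal(size=n)

def J(t):
    # Cost function J(θ) = (1/2n) * Σ (hθ(xᵢ) - yᵢ)² for a linear model
    r = X @ t - y
    return (1.0 / (2 * n)) * np.sum(r ** 2)

# Closed-form gradient from the formulation above
analytic = (1.0 / n) * X.T @ (X @ theta - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    numeric[i] = (J(theta + e) - J(theta - e)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # maximum discrepancy; should be tiny
```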
Example: Linear Regression with Batch Gradient Descent
import numpy as np

# Inputs: the data follows y = 2x, so we expect intercept ≈ 0, slope ≈ 2
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10]).reshape(-1, 1)

# Add bias term (column of ones)
X_b = np.c_[np.ones((5, 1)), X]

# Initialize parameters
theta = np.random.randn(2, 1)

# Hyperparameters
learning_rate = 0.01
n_iterations = 1000
m = len(X_b)

for iteration in range(n_iterations):
    # Gradient of J(θ) = (1/2m) * Σ (hθ(xᵢ) - yᵢ)², as derived above
    gradients = (1/m) * X_b.T.dot(X_b.dot(theta) - y)
    theta -= learning_rate * gradients

print("Final parameters:", theta)
Advantages of Batch Gradient Descent
✅ Stable Convergence
Each update is based on the entire dataset, leading to smooth, stable convergence.
✅ Accurate Gradient Estimates
Averaging over the full dataset removes sampling noise from the gradient estimate.
✅ Theoretical Guarantees
For convex functions, it converges to the global minimum.
✅ Ideal for Small Datasets
Perfect for datasets that easily fit into memory.
Limitations
❌ Memory Intensive
Requires entire dataset to be loaded into memory.
❌ Slow on Large Datasets
One update per epoch can make training inefficient for large-scale data.
❌ Redundant Computation
Gradients are recomputed over the entire dataset every epoch, even when many examples contribute nearly identical information.
❌ Not Suited for Online Learning
Cannot handle streaming data effectively.
Batch vs Stochastic vs Mini-Batch Gradient Descent
| Type | Dataset Used per Update | Pros | Cons |
|---|---|---|---|
| Batch | Entire dataset | Stable, accurate | Memory-heavy, slow on big data |
| Stochastic (SGD) | 1 example | Fast, online learning possible | Noisy updates, may not converge |
| Mini-Batch | Subsets of data (batches) | Balance between speed and stability | Requires tuning batch size |
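The three schemes in the table differ only in how much data feeds each update. A minimal sketch (synthetic data, illustrative hyperparameters) contrasting them on the same linear-regression objective:

```python
import numpy as np

# Synthetic regression data with true weights [3, -1] and small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.1, size=100)

def grad(theta, Xb, yb):
    # Gradient of J(θ) = (1/2n) * Σ (hθ(xᵢ) - yᵢ)² on the given subset
    return (1.0 / len(yb)) * Xb.T @ (Xb @ theta - yb)

eta, epochs = 0.1, 200

# Batch: one update per epoch, computed on the full dataset
theta_batch = np.zeros(2)
for _ in range(epochs):
    theta_batch -= eta * grad(theta_batch, X, y)

# Stochastic (SGD): one update per (shuffled) training example
theta_sgd = np.zeros(2)
for _ in range(epochs):
    for i in rng.permutation(len(y)):
        theta_sgd -= eta * grad(theta_sgd, X[i:i+1], y[i:i+1])

# Mini-batch: one update per shuffled chunk of 16 examples
theta_mb = np.zeros(2)
for _ in range(epochs):
    idx = rng.permutation(len(y))
    for s in range(0, len(y), 16):
        b = idx[s:s+16]
        theta_mb -= eta * grad(theta_mb, X[b], y[b])

print(theta_batch, theta_sgd, theta_mb)  # all near the true weights [3, -1]
```

Batch descent traces a smooth path; SGD fluctuates around the optimum; mini-batch sits in between, which is why it dominates in practice.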
Best Practices
- Normalize or standardize input features
- Monitor training and validation loss
- Use appropriate learning rate (with decay or scheduling)
- Combine with momentum or adaptive optimizers (e.g., Adam, RMSprop)
- Use early stopping to avoid overfitting
- For large datasets, prefer mini-batch variant
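The first practice above, standardizing features, can be sketched in a few lines (the feature values are illustrative; uneven scales like these make gradient descent converge slowly):

```python
import numpy as np

# Two features on very different scales (illustrative values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

# Standardize: subtract the per-column mean, divide by the per-column std
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0), X_std.std(axis=0))  # each column: mean 0, std 1
```

With comparable feature scales, a single learning rate works well across all parameters.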
Use in Deep Learning
In deep neural networks, batch gradient descent is rarely used directly due to the size of modern datasets. Instead, mini-batch gradient descent (a compromise between batch and stochastic) is the preferred method.
However, the idea and derivation of all variants stem from the same core principles of batch gradient descent.
Visual Illustration
Epoch 1
↳ Entire training set used → compute average gradient
↳ Update weights
Epoch 2
↳ Again, full dataset used → compute new gradient
↳ Update weights
...
Loss function curve is smooth but updates are infrequent.
When to Use Batch Gradient Descent
✅ When dataset is small and fits in memory
✅ When computational cost is not a concern
✅ When seeking smooth convergence
✅ When theoretical convergence properties are important
✅ When performing analytical studies of loss landscapes
Key Snippets
Gradient Update in NumPy
gradient = (1 / len(X)) * X.T @ (X @ theta - y)
theta -= learning_rate * gradient
Learning Rate Scheduler (Manual Decay)
learning_rate = initial_rate / (1 + decay_rate * epoch)
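Evaluating this schedule for the first few epochs shows the step size shrinking smoothly (the rate values are illustrative):

```python
# Manual decay: learning_rate = initial_rate / (1 + decay_rate * epoch)
initial_rate, decay_rate = 0.1, 0.05
rates = [initial_rate / (1 + decay_rate * epoch) for epoch in range(5)]
print(rates)  # starts at 0.1 and decreases each epoch
```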
Cost Function (MSE)
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1 / (2 * m)) * np.sum((predictions - y)**2)
    return cost
Related Keywords
Activation Function
Artificial Neural Network
Backpropagation
Convergence Rate
Cost Function
Deep Learning
Epoch
Gradient Descent
Learning Rate
Loss Function
Mini-Batch Gradient Descent
Momentum Optimization
Neural Network Training
Optimization Algorithm
Parameter Update
Regularization Technique
Stochastic Gradient Descent
Training Dataset
Weight Initialization
Weight Update