Description
Stochastic Gradient Descent (SGD) is one of the most fundamental and widely used optimization algorithms in machine learning and deep learning. It is a variant of the standard Gradient Descent algorithm that updates model parameters using a single training example (or a small batch) at a time, rather than the entire dataset.
This allows SGD to be faster, more memory-efficient, and capable of escaping local minima due to its inherent randomness.
Mathematical Formulation
The goal of any optimization algorithm is to minimize a loss function L(θ), where θ represents the model parameters (weights and biases).
Standard Gradient Descent (Full Batch)
θ = θ - η * ∇L(θ)
Where:
- θ is the vector of parameters
- η is the learning rate
- ∇L(θ) is the gradient of the loss function with respect to the parameters
Stochastic Gradient Descent (SGD)
Instead of computing the gradient over the entire dataset, SGD uses a single randomly selected data point:
θ = θ - η * ∇L_i(θ)
Where:
- L_i(θ) is the loss for the i-th data sample
- The gradient ∇L_i(θ) is only an approximation of the full gradient, introducing randomness into the process
Why Use SGD?
- ✅ Faster convergence for large datasets
- ✅ Lower memory requirements
- ✅ Better generalization due to noise
- ✅ Can escape local minima or saddle points
Types of Gradient Descent
| Type | Description |
|---|---|
| Full-Batch | Uses the entire dataset for each step |
| Stochastic | Uses one sample per step (SGD) |
| Mini-Batch | Uses a small batch (e.g., 32 or 64) |
Code Examples
Python (Basic SGD)
```python
import numpy as np

def sgd_step(theta, grad, learning_rate):
    # Move the parameters one step against the gradient
    return theta - learning_rate * grad
```
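As a quick usage sketch, the step function above can fit a one-dimensional linear model by sampling one data point per update. The toy data, hyperparameters, and variable names below are illustrative only:

```python
# Toy example: recover the slope of y = 3x with single-sample SGD
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0
for step in range(1000):
    i = rng.integers(len(x))                # pick one random sample
    grad = 2 * (w * x[i] - y[i]) * x[i]     # gradient of the squared error for that sample
    w = sgd_step(w, grad, learning_rate=0.01)

print(w)  # ends up close to 3
```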
PyTorch Example
```python
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous step
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)  # compute the loss
    loss.backward()                  # backpropagate to compute gradients
    optimizer.step()                 # update the parameters
```
TensorFlow/Keras Example
```python
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```
Learning Rate (η)
The learning rate is one of the most critical hyperparameters in SGD. It determines the step size at each iteration:
- Too high: risk of overshooting the minimum
- Too low: slow convergence
- Ideal: balances speed and stability
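To make this concrete, the minimal sketch below runs gradient descent on the one-dimensional quadratic L(θ) = θ², whose gradient is 2θ; the two learning rates are illustrative values, not recommendations:

```python
def run(learning_rate, steps=20, theta=1.0):
    for _ in range(steps):
        grad = 2 * theta                      # dL/dθ for L(θ) = θ²
        theta = theta - learning_rate * grad
    return theta

print(run(0.1))   # shrinks toward 0: each step multiplies θ by 0.8
print(run(1.1))   # overshoots: each step multiplies θ by -1.2, so |θ| grows
```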
Learning Rate Scheduling
You can adapt the learning rate during training using schedulers:
- Step decay: multiply the learning rate by a fixed factor every few epochs
- Exponential decay: shrink the learning rate continuously by an exponential factor
- Adaptive scheduling (e.g., ReduceLROnPlateau): reduce the learning rate when a monitored metric stops improving
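As one concrete option, PyTorch provides step decay via torch.optim.lr_scheduler.StepLR. The sketch below assumes model, dataloader, and a train_one_epoch helper exist as in the PyTorch example above; the epoch count, step_size, and gamma values are placeholders:

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the learning rate every 10 epochs

for epoch in range(50):
    train_one_epoch(model, dataloader, optimizer)  # hypothetical helper wrapping the usual training loop
    scheduler.step()                               # apply the decay once per epoch
```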
SGD with Momentum
Momentum adds a fraction of the previous update to the current update:
v = β * v - η * ∇L_i(θ)
θ = θ + v
Where:
- v is the velocity (the accumulated gradient)
- β is the momentum coefficient (e.g., 0.9)
This helps smooth updates, avoid oscillations, and accelerate convergence.
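As a minimal sketch, the two update equations can be implemented directly in NumPy on a toy quadratic objective (the target vector and hyperparameters below are made up for illustration). In PyTorch the same idea is enabled by passing momentum=0.9 to optim.SGD; its internal formulation differs slightly from the textbook equations but captures the same behavior.

```python
import numpy as np

# Toy objective: L(θ) = ||θ - target||², so ∇L(θ) = 2 (θ - target)
target = np.array([1.0, -2.0, 0.5])

theta = np.zeros_like(target)
v = np.zeros_like(theta)       # velocity
beta, lr = 0.9, 0.01

for step in range(500):
    grad = 2 * (theta - target)
    v = beta * v - lr * grad   # accumulate a decaying sum of past gradients
    theta = theta + v

print(theta)  # approaches target
```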
Nesterov Accelerated Gradient (NAG)
An improvement over basic momentum:
v = β * v - η * ∇L(θ + β * v)
θ = θ + v
This uses a lookahead strategy for better updates.
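In PyTorch, Nesterov momentum is a flag on the same SGD optimizer rather than a separate class; a minimal sketch (the hyperparameter values are placeholders):

```python
import torch.optim as optim

# Same optimizer as before, with the lookahead (Nesterov) variant enabled
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```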
Mini-Batch SGD
Combines the benefits of both stochastic and batch methods by computing the gradient over a small batch of samples (e.g., 32, 64, 128):
```python
for inputs, labels in mini_batch_loader:
    optimizer.zero_grad()
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()
```
Advantages:
- Better generalization than batch
- More stable than pure SGD
- Efficient GPU utilization
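The mini-batches themselves usually come from a data loader. In PyTorch this is torch.utils.data.DataLoader; the sketch below assumes dataset is an existing Dataset object, and the batch size is a placeholder:

```python
from torch.utils.data import DataLoader

# Shuffling preserves the stochasticity that gives SGD its noise benefits
mini_batch_loader = DataLoader(dataset, batch_size=32, shuffle=True)
```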
Convergence Behavior
While full-batch gradient descent follows a smooth path to the minimum, SGD exhibits noisy but more dynamic behavior:
- May take longer to converge
- But can jump out of local minima
- Encourages better generalization
Use Cases
- Training deep neural networks (DNNs, CNNs, RNNs)
- Logistic regression
- Matrix factorization
- Online learning
- Reinforcement learning
Strengths
- ✅ Efficient for large datasets
- ✅ Can escape local minima
- ✅ Suitable for online learning
- ✅ Simple to implement
Limitations
- ❌ Noisy updates
- ❌ Sensitive to learning rate
- ❌ May never truly converge (just oscillate near the minimum)
Formulas Summary
Full-Batch Gradient Descent
θ = θ - η * ∇L(θ)
Stochastic Gradient Descent
θ = θ - η * ∇L_i(θ)
SGD with Momentum
v = β * v - η * ∇L_i(θ)
θ = θ + v
Nesterov Accelerated Gradient
v = β * v - η * ∇L(θ + β * v)
θ = θ + v
Tips for Using SGD Effectively
- Use mini-batches instead of pure stochastic updates
- Combine with momentum or Nesterov for stability
- Use learning rate decay or schedulers
- Monitor validation loss to adjust learning rates
- Consider switching to Adam or RMSprop for adaptive learning
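If you do switch to an adaptive optimizer, the change in PyTorch is a one-line swap; a sketch (the learning rate is a placeholder):

```python
import torch.optim as optim

# Drop-in replacement for optim.SGD in the training loop above
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```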
Summary Table
| Concept | Description |
|---|---|
| Optimizer Type | First-order optimization algorithm |
| Update Style | One sample or small batch at a time |
| Speed | Fast, especially on large datasets |
| Memory Usage | Low |
| Common Enhancements | Momentum, NAG, learning rate scheduling |
| Loss Surface Behavior | Noisy, can escape local minima |
Related Keywords
Adam Optimizer
Backpropagation
Batch Gradient Descent
Convergence Rate
Cost Function
Deep Learning Optimization
Gradient Calculation
Learning Rate Decay
Mini Batch Learning
Momentum Optimizer
Nesterov Acceleration
Neural Network Training
Online Learning
Optimization Algorithm
Parameter Update
Regularization Technique
Step Size
Training Loop
Vanishing Gradient
Weight Adjustment