Description
Stochastic Gradient Descent (SGD) is one of the most fundamental and widely used optimization algorithms in machine learning and deep learning. It is a variant of the standard Gradient Descent algorithm that updates model parameters using a single training example (or a small batch) at a time, rather than the entire dataset.
This allows SGD to be faster, more memory-efficient, and capable of escaping local minima due to its inherent randomness.
Mathematical Formulation
The goal of any optimization algorithm is to minimize a loss function L(θ), where θ represents the model parameters (weights and biases).
Standard Gradient Descent (Full Batch)
θ = θ - η * ∇L(θ)
Where:
- θ is the vector of parameters
- η is the learning rate
- ∇L(θ) is the gradient of the loss function with respect to the parameters
Stochastic Gradient Descent (SGD)
Instead of computing the gradient over the entire dataset, SGD uses a single randomly selected data point:
θ = θ - η * ∇L_i(θ)
Where:
- L_i(θ) is the loss for the i-th data sample
- The gradient ∇L_i(θ) is only an approximation of the full gradient, introducing randomness into the process
Why Use SGD?
- ✅ Faster convergence for large datasets
- ✅ Lower memory requirements
- ✅ Better generalization due to noise
- ✅ Can escape local minima or saddle points
Types of Gradient Descent
| Type | Description |
|---|---|
| Full-Batch | Uses the entire dataset for each step |
| Stochastic | Uses one sample per step (SGD) |
| Mini-Batch | Uses a small batch (e.g., 32 or 64) |
Code Examples
Python (Basic SGD)
```python
import numpy as np

def sgd_step(theta, grad, learning_rate):
    # Move the parameters one step against the gradient
    return theta - learning_rate * grad
```
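As a quick usage sketch, the step function above can fit a one-dimensional linear model by sampling one data point per update. The toy data, hyperparameters, and variable names below are illustrative only:

```python
# Toy example: recover the slope of y = 3x with single-sample SGD
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)

w = 0.0
for step in range(1000):
    i = rng.integers(len(x))                # pick one random sample
    grad = 2 * (w * x[i] - y[i]) * x[i]     # gradient of the squared error for that sample
    w = sgd_step(w, grad, learning_rate=0.01)

print(w)  # ends up close to 3
```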
PyTorch Example
```python
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01)

for inputs, labels in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous step
    outputs = model(inputs)          # forward pass
    loss = loss_fn(outputs, labels)  # compute the loss
    loss.backward()                  # backpropagate to compute gradients
    optimizer.step()                 # update the parameters
```
TensorFlow/Keras Example
```python
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01), loss='mse')
```
Learning Rate (η)
The learning rate is one of the most critical hyperparameters in SGD. It determines the step size at each iteration:
- Too high: risk of overshooting the minimum
- Too low: slow convergence
- Ideal: balances speed and stability
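To make this concrete, the minimal sketch below runs gradient descent on the one-dimensional quadratic L(θ) = θ², whose gradient is 2θ; the two learning rates are illustrative values, not recommendations:

```python
def run(learning_rate, steps=20, theta=1.0):
    for _ in range(steps):
        grad = 2 * theta                      # dL/dθ for L(θ) = θ²
        theta = theta - learning_rate * grad
    return theta

print(run(0.1))   # shrinks toward 0: each step multiplies θ by 0.8
print(run(1.1))   # overshoots: each step multiplies θ by -1.2, so |θ| grows
```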
Learning Rate Scheduling
You can adapt the learning rate during training using schedulers:
- Step decay: multiply the learning rate by a fixed factor every few epochs
- Exponential decay: shrink the learning rate continuously by an exponential factor
- Adaptive scheduling (e.g., ReduceLROnPlateau): reduce the learning rate when a monitored metric stops improving
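As one concrete option, PyTorch provides step decay via torch.optim.lr_scheduler.StepLR. The sketch below assumes model, dataloader, and a train_one_epoch helper exist as in the PyTorch example above; the epoch count, step_size, and gamma values are placeholders:

```python
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)  # halve the learning rate every 10 epochs

for epoch in range(50):
    train_one_epoch(model, dataloader, optimizer)  # hypothetical helper wrapping the usual training loop
    scheduler.step()                               # apply the decay once per epoch
```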
SGD with Momentum
Momentum adds a fraction of the previous update to the current update:
v = β * v - η * ∇L_i(θ)
θ = θ + v
Where:
- v is the velocity (the accumulated gradient)
- β is the momentum coefficient (e.g., 0.9)
This helps smooth updates, avoid oscillations, and accelerate convergence.
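As a minimal sketch, the two update equations can be implemented directly in NumPy on a toy quadratic objective (the target vector and hyperparameters below are made up for illustration). In PyTorch the same idea is enabled by passing momentum=0.9 to optim.SGD; its internal formulation differs slightly from the textbook equations but captures the same behavior.

```python
import numpy as np

# Toy objective: L(θ) = ||θ - target||², so ∇L(θ) = 2 (θ - target)
target = np.array([1.0, -2.0, 0.5])

theta = np.zeros_like(target)
v = np.zeros_like(theta)       # velocity
beta, lr = 0.9, 0.01

for step in range(500):
    grad = 2 * (theta - target)
    v = beta * v - lr * grad   # accumulate a decaying sum of past gradients
    theta = theta + v

print(theta)  # approaches target
```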
Nesterov Accelerated Gradient (NAG)
An improvement over basic momentum:
v = β * v - η * ∇L(θ + β * v)
θ = θ + v
This uses a lookahead strategy for better updates.
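In PyTorch, Nesterov momentum is a flag on the same SGD optimizer rather than a separate class; a minimal sketch (the hyperparameter values are placeholders):

```python
import torch.optim as optim

# Same optimizer as before, with the lookahead (Nesterov) variant enabled
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```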
Mini-Batch SGD
Combines the benefits of both stochastic and batch methods by computing the gradient over a small batch of samples (e.g., 32, 64, 128):
```python
for inputs, labels in mini_batch_loader:
    optimizer.zero_grad()
    predictions = model(inputs)
    loss = loss_fn(predictions, labels)
    loss.backward()
    optimizer.step()
```
Advantages:
- Better generalization than batch
- More stable than pure SGD
- Efficient GPU utilization
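The mini-batches themselves usually come from a data loader. In PyTorch this is torch.utils.data.DataLoader; the sketch below assumes dataset is an existing Dataset object, and the batch size is a placeholder:

```python
from torch.utils.data import DataLoader

# Shuffling preserves the stochasticity that gives SGD its noise benefits
mini_batch_loader = DataLoader(dataset, batch_size=32, shuffle=True)
```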
Convergence Behavior
While full-batch gradient descent follows a smooth path to the minimum, SGD exhibits noisy but more dynamic behavior:
- May take longer to converge
- But can jump out of local minima
- Encourages better generalization
Use Cases
- Training deep neural networks (DNNs, CNNs, RNNs)
- Logistic regression
- Matrix factorization
- Online learning
- Reinforcement learning
Strengths
- ✅ Efficient for large datasets
- ✅ Can escape local minima
- ✅ Suitable for online learning
- ✅ Simple to implement
Limitations
- ❌ Noisy updates
- ❌ Sensitive to learning rate
- ❌ May never truly converge (just oscillate near the minimum)
Formulas Summary
Full-Batch Gradient Descent
θ = θ - η * ∇L(θ)
Stochastic Gradient Descent
θ = θ - η * ∇L_i(θ)
SGD with Momentum
v = β * v - η * ∇L_i(θ)
θ = θ + v
Nesterov Accelerated Gradient
v = β * v - η * ∇L(θ + β * v)
θ = θ + v
Tips for Using SGD Effectively
- Use mini-batches instead of pure stochastic updates
- Combine with momentum or Nesterov for stability
- Use learning rate decay or schedulers
- Monitor validation loss to adjust learning rates
- Consider switching to Adam or RMSprop for adaptive learning
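If you do switch to an adaptive optimizer, the change in PyTorch is a one-line swap; a sketch (the learning rate is a placeholder):

```python
import torch.optim as optim

# Drop-in replacement for optim.SGD in the training loop above
optimizer = optim.Adam(model.parameters(), lr=1e-3)
```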
Summary Table
| Concept | Description |
|---|---|
| Optimizer Type | First-order optimization algorithm |
| Update Style | One sample or small batch at a time |
| Speed | Fast, especially on large datasets |
| Memory Usage | Low |
| Common Enhancements | Momentum, NAG, learning rate scheduling |
| Loss Surface Behavior | Noisy, can escape local minima |
Related Keywords
Adam Optimizer
Backpropagation
Batch Gradient Descent
Convergence Rate
Cost Function
Deep Learning Optimization
Gradient Calculation
Learning Rate Decay
Mini Batch Learning
Momentum Optimizer
Nesterov Acceleration
Neural Network Training
Online Learning
Optimization Algorithm
Parameter Update
Regularization Technique
Step Size
Training Loop
Vanishing Gradient
Weight Adjustment