Description
Conjugate Gradient (CG) is an iterative optimization algorithm primarily used to solve systems of linear equations of the form:
A·x = b
where A is a symmetric positive-definite (SPD) matrix, x is the vector of unknowns, and b is the known result vector. Unlike direct methods (such as Gaussian elimination), Conjugate Gradient is favored for large, sparse systems where matrix inversion is computationally expensive or infeasible.
Originally proposed in the 1950s by Magnus Hestenes and Eduard Stiefel, the CG method has found applications in scientific computing, engineering simulations (e.g., finite element analysis), and in the training of certain machine learning models.
How It Works
The Conjugate Gradient method is based on the idea of minimizing the quadratic form associated with the linear system:
f(x) = ½ · xᵀ·A·x - bᵀ·x
Minimizing f(x) yields the same solution as solving A·x = b, because the gradient ∇f(x) = A·x - b vanishes exactly at the solution of the linear system.
CG doesn’t compute a full matrix inverse or perform dense matrix operations. Instead, it uses successive projections along conjugate directions (directions that are A-orthogonal) to iteratively approach the solution.
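This equivalence is easy to check numerically. The sketch below (assuming NumPy and SciPy are available) minimizes the quadratic form for the same 2×2 system used in the worked example later in this article and compares the result with a direct solve:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Quadratic form f(x) = 1/2·xᵀ·A·x - bᵀ·x
f = lambda x: 0.5 * x @ A @ x - b @ x

x_direct = np.linalg.solve(A, b)        # solve A·x = b directly
x_min = minimize(f, x0=np.zeros(2)).x   # minimize the quadratic form
print(x_direct, x_min)                  # both give (approximately) the same vector
```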
Iterative Update Scheme:
At each iteration k, the algorithm:
- Computes the residual vector rₖ = b - A·xₖ
- Determines a new search direction pₖ that is A-conjugate to the previous ones
- Updates the solution vector xₖ
- Repeats until convergence
Key Components
- Initial Guess (x₀): Starting point for the iterative solution. Often set to the zero vector.
- Residual Vector (rₖ): Measures how far the current approximation is from satisfying the system: rₖ = b - A·xₖ
- Search Direction (pₖ): A sequence of A-conjugate vectors used for descent (a numerical check of this conjugacy is sketched after this list).
- Step Size (αₖ): Determines how far to move along the current search direction.
- Conjugacy Coefficient (βₖ): Used to ensure the new direction is A-conjugate to the previous ones.
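To make the A-conjugacy concrete, the short sketch below builds the first two search directions for the 2×2 system used in the worked example later in this article and checks that p₀ᵀ·A·p₁ ≈ 0 (a minimal illustration, not part of the original text):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

p0 = b.copy()                     # with x₀ = 0, the first direction is r₀ = b
alpha0 = (p0 @ p0) / (p0 @ A @ p0)
r1 = p0 - alpha0 * (A @ p0)       # residual after one step
beta0 = (r1 @ r1) / (p0 @ p0)
p1 = r1 + beta0 * p0              # second search direction

print(p0 @ A @ p1)                # ≈ 0: p0 and p1 are A-conjugate
```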
Algorithm (Pseudocode)
```python
import numpy as np

def conjugate_gradient(A, b, x0, tol, max_iter):
    """Solve A·x = b for a symmetric positive-definite matrix A."""
    x = x0.copy()
    r = b - A @ x                # initial residual
    p = r.copy()                 # first search direction is the residual
    for k in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # step size along p
        x = x + alpha * p                 # update the solution
        r_new = r - alpha * Ap            # update the residual
        if np.linalg.norm(r_new) < tol:   # stop once the residual is small
            break
        beta = (r_new @ r_new) / (r @ r)  # conjugacy coefficient
        p = r_new + beta * p              # next A-conjugate direction
        r = r_new
    return x
```
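One possible way to exercise this routine (a sketch: the random SPD test matrix, tolerance, and iteration cap are arbitrary choices, not part of the article):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)      # SPD and reasonably well-conditioned
b = rng.standard_normal(100)

x_cg = conjugate_gradient(A, b, x0=np.zeros(100), tol=1e-10, max_iter=1000)
x_ref = np.linalg.solve(A, b)        # direct solve for comparison
print(np.linalg.norm(x_cg - x_ref))  # should be very small
```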
Use Cases
- Large Sparse Linear Systems: Solving A·x = b where A is huge and sparse (common in simulations, PDEs, etc.); a sparse-solver sketch follows this list.
- Finite Element Analysis (FEA): Used in structural mechanics to solve large stiffness systems.
- Image Processing: Deconvolution and reconstruction tasks where direct methods are too costly.
- Physics Simulations: E.g., computational fluid dynamics (CFD) and thermal simulations.
- Kernel Methods in Machine Learning: Scenarios where the kernel matrix is SPD and large.
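As an illustration of the sparse-system use case, the sketch below solves a classic 1-D Poisson (tridiagonal) system with SciPy's built-in scipy.sparse.linalg.cg; the matrix size and right-hand side are arbitrary choices.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# 1-D Poisson matrix: tridiagonal, sparse, and SPD.
n = 1000
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)                      # info == 0 indicates convergence
print(info, np.linalg.norm(A @ x - b))  # residual norm of the computed solution
```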
Benefits and Limitations
✅ Benefits
- Memory Efficient: No need to store or invert the entire matrix A; only matrix-vector products are required.
- Fast for Sparse Problems: Particularly when A is sparse and well-conditioned.
- Parallelization Friendly: Most operations (dot products, matrix-vector multiplications) can be parallelized.
- Converges in ≤ n steps: In exact arithmetic, CG converges in at most n steps (where n is the size of A).
❌ Limitations
- SPD Requirement: CG only works for symmetric positive-definite matrices. If A is not SPD, CG may diverge or behave unpredictably.
- Condition Number Sensitivity: Poor conditioning can slow convergence dramatically (see the sketch after this list).
- Not Suitable for All ML Tasks: Despite its theoretical appeal, CG is rarely used to train modern deep learning models, where stochastic methods such as SGD and Adam dominate.
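The conditioning issue can be made concrete with a small experiment. The sketch below (illustrative assumptions: diagonal SPD test matrices with arbitrarily chosen eigenvalue ranges) counts how many CG iterations are needed to reach a fixed residual tolerance.

```python
import numpy as np

def cg_iterations(A, b, tol=1e-8, max_iter=10_000):
    """Count CG iterations until the residual norm drops below tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return k + 1
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return max_iter

n = 200
rng = np.random.default_rng(2)
b = rng.standard_normal(n)

well = np.diag(np.linspace(1.0, 10.0, n))   # condition number ~10
ill = np.diag(np.linspace(1.0, 1e6, n))     # condition number ~1e6
print(cg_iterations(well, b), cg_iterations(ill, b))  # ill-conditioned case needs far more iterations
```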
Comparison with Related Algorithms
| Algorithm | SPD Required | Storage | Use Case |
|---|---|---|---|
| Conjugate Gradient | ✅ Yes | Low | Sparse linear systems |
| Gradient Descent | ❌ No | Low | General optimization |
| LU Decomposition | ❌ No | High | Small dense matrices |
| Jacobi/Gauss-Seidel | ❌ No | Low | Simple stationary iterations; smoothing/preconditioning |
| GMRES | ❌ No | Medium | Nonsymmetric systems |
Example (Numerical)
Solve the system:
A = [[4, 1],
[1, 3]]
b = [1, 2]
Initial guess: x₀ = [0, 0]
We want to solve: A·x = b
Step 0:
r₀ = b - A·x₀ = [1, 2]
p₀ = r₀
Step 1:
α₀ = (r₀ᵀ·r₀) / (p₀ᵀ·A·p₀)
= (1² + 2²) / ([1,2] · [6,7]) = 5 / 20 = 0.25, where A·p₀ = [6, 7]
x₁ = x₀ + α₀·p₀ = [0,0] + 0.25 * [1,2] = [0.25, 0.5]
r₁ = r₀ - α₀·A·p₀ = [1,2] - 0.25 * [6,7] = [-0.5, 0.25]
Continue in this fashion until ||r|| is below a chosen tolerance.
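For reference, the same worked example can be reproduced with the conjugate_gradient routine from the pseudocode section above (the tolerance and iteration cap below are arbitrary):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = conjugate_gradient(A, b, x0=np.zeros(2), tol=1e-10, max_iter=10)
print(x)                      # ≈ [0.0909, 0.6364], i.e. [1/11, 7/11]
print(np.linalg.solve(A, b))  # direct solution for comparison
```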
Preconditioning
A preconditioner M is often introduced to accelerate convergence. Instead of solving A·x = b, CG solves:
M⁻¹·A·x = M⁻¹·b
where M is a matrix that approximates A while remaining cheap to invert (equivalently, M⁻¹ approximates A⁻¹ and is cheap to apply). Common preconditioners include:
- Jacobi (Diagonal)
- Incomplete Cholesky
- SSOR (Symmetric Successive Over-Relaxation)
This leads to the Preconditioned Conjugate Gradient (PCG) algorithm.
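A minimal PCG sketch using SciPy's cg with a Jacobi (diagonal) preconditioner is shown below. Note that SciPy's M argument expects an operator that applies M⁻¹ in the notation above; the test matrix with a widely varying diagonal is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

# Illustrative SPD test system with a widely varying diagonal.
n = 1000
main = np.linspace(2.0, 2000.0, n)
A = diags([-1.0, main, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi preconditioner: apply the inverse of the diagonal of A.
d_inv = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: d_inv * v)

def count_iterations(**kwargs):
    """Run SciPy's cg and count its iterations via the callback."""
    count = 0
    def cb(xk):
        nonlocal count
        count += 1
    cg(A, b, callback=cb, **kwargs)
    return count

print(count_iterations())     # plain CG
print(count_iterations(M=M))  # preconditioned CG: noticeably fewer iterations
```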
Real-World Analogy
Imagine you’re trying to reach the lowest point in a curved valley. Instead of walking straight downhill (gradient descent), you choose smarter paths that avoid zig-zagging, each time moving in a direction that corrects the inefficiency of the last. Over time, you reach the bottom with fewer, more strategic steps. That’s what Conjugate Gradient does: it chooses A-conjugate (A-orthogonal) search directions so that no step undoes the progress of a previous one.
Key Formulas Summary
- Residual: rₖ = b - A·xₖ
- Step size: αₖ = (rₖᵀ·rₖ) / (pₖᵀ·A·pₖ)
- Solution update: xₖ₊₁ = xₖ + αₖ·pₖ
- Residual update: rₖ₊₁ = rₖ - αₖ·A·pₖ
- Conjugacy coefficient: βₖ = (rₖ₊₁ᵀ·rₖ₊₁) / (rₖᵀ·rₖ)
- Conjugate direction update: pₖ₊₁ = rₖ₊₁ + βₖ·pₖ
Related Keywords
- A-Conjugate Direction
- Finite Element Method
- Gradient Descent
- Iterative Solver
- Jacobi Method
- Krylov Subspace
- Linear System Solver
- Matrix Factorization
- Preconditioner
- Sparse Matrix