Description
Conjugate Gradient (CG) is an iterative optimization algorithm primarily used to solve systems of linear equations of the form:
A·x = b
where A is a symmetric positive-definite (SPD) matrix, x is the vector of unknowns, and b is the known result vector. Unlike direct methods (such as Gaussian elimination), Conjugate Gradient is favored for large, sparse systems where matrix inversion is computationally expensive or infeasible.
Originally proposed in the 1950s by Magnus Hestenes and Eduard Stiefel, the CG method has found applications in scientific computing, engineering simulations (e.g., finite element analysis), and in the training of certain machine learning models.
How It Works
The Conjugate Gradient method is based on the idea of minimizing the quadratic form associated with the linear system:
f(x) = ½ · xᵀ·A·x - bᵀ·x
Minimizing f(x) yields the same solution as solving A·x = b, because the gradient ∇f(x) = A·x - b vanishes exactly at the solution of the linear system.
CG doesn’t compute a full matrix inverse or perform dense matrix operations. Instead, it uses successive projections along conjugate directions (directions that are A-orthogonal) to iteratively approach the solution.
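This equivalence is easy to check numerically. The sketch below (assuming NumPy and SciPy are available) minimizes the quadratic form for the same 2×2 system used in the worked example later in this article and compares the result with a direct solve:

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

# Quadratic form f(x) = 1/2·xᵀ·A·x - bᵀ·x
f = lambda x: 0.5 * x @ A @ x - b @ x

x_direct = np.linalg.solve(A, b)        # solve A·x = b directly
x_min = minimize(f, x0=np.zeros(2)).x   # minimize the quadratic form
print(x_direct, x_min)                  # both give (approximately) the same vector
```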
Iterative Update Scheme:
At each iteration k, the algorithm:
- Computes the residual vector rₖ = b - A·xₖ
- Determines a new search direction pₖ that is A-conjugate to the previous ones
- Updates the solution vector xₖ
- Repeats until convergence
Key Components
- Initial Guess (x₀): Starting point for the iterative solution. Often set to the zero vector.
- Residual Vector (rₖ): Measures how far the current approximation is from satisfying the system: rₖ = b - A·xₖ
- Search Direction (pₖ): A sequence of A-conjugate vectors used for descent (a numerical check of this conjugacy is sketched after this list).
- Step Size (αₖ): Determines how far to move along the current search direction.
- Conjugacy Coefficient (βₖ): Used to ensure the new direction is A-conjugate to the previous ones.
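To make the A-conjugacy concrete, the short sketch below builds the first two search directions for the 2×2 system used in the worked example later in this article and checks that p₀ᵀ·A·p₁ ≈ 0 (a minimal illustration, not part of the original text):

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])

p0 = b.copy()                     # with x₀ = 0, the first direction is r₀ = b
alpha0 = (p0 @ p0) / (p0 @ A @ p0)
r1 = p0 - alpha0 * (A @ p0)       # residual after one step
beta0 = (r1 @ r1) / (p0 @ p0)
p1 = r1 + beta0 * p0              # second search direction

print(p0 @ A @ p1)                # ≈ 0: p0 and p1 are A-conjugate
```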
Algorithm (Pseudocode)
```python
import numpy as np

def conjugate_gradient(A, b, x0, tol, max_iter):
    """Solve A·x = b for a symmetric positive-definite matrix A."""
    x = x0.copy()
    r = b - A @ x                # initial residual
    p = r.copy()                 # first search direction is the residual
    for k in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)        # step size along p
        x = x + alpha * p                 # update the solution
        r_new = r - alpha * Ap            # update the residual
        if np.linalg.norm(r_new) < tol:   # stop once the residual is small
            break
        beta = (r_new @ r_new) / (r @ r)  # conjugacy coefficient
        p = r_new + beta * p              # next A-conjugate direction
        r = r_new
    return x
```
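One possible way to exercise this routine (a sketch: the random SPD test matrix, tolerance, and iteration cap are arbitrary choices, not part of the article):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)      # SPD and reasonably well-conditioned
b = rng.standard_normal(100)

x_cg = conjugate_gradient(A, b, x0=np.zeros(100), tol=1e-10, max_iter=1000)
x_ref = np.linalg.solve(A, b)        # direct solve for comparison
print(np.linalg.norm(x_cg - x_ref))  # should be very small
```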
Use Cases
- Large Sparse Linear Systems: Solving A·x = b where A is huge and sparse (common in simulations, PDEs, etc.); a sparse-solver sketch follows this list.
- Finite Element Analysis (FEA): Used in structural mechanics to solve large stiffness systems.
- Image Processing: Deconvolution and reconstruction tasks where direct methods are too costly.
- Physics Simulations: E.g., computational fluid dynamics (CFD) and thermal simulations.
- Kernel Methods in Machine Learning: Scenarios where the kernel matrix is SPD and large.
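As an illustration of the sparse-system use case, the sketch below solves a classic 1-D Poisson (tridiagonal) system with SciPy's built-in scipy.sparse.linalg.cg; the matrix size and right-hand side are arbitrary choices.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

# 1-D Poisson matrix: tridiagonal, sparse, and SPD.
n = 1000
A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)                      # info == 0 indicates convergence
print(info, np.linalg.norm(A @ x - b))  # residual norm of the computed solution
```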
Benefits and Limitations
✅ Benefits
- Memory Efficient: No need to store or invert the entire matrix A; only matrix-vector products are required.
- Fast for Sparse Problems: Particularly when A is sparse and well-conditioned.
- Parallelization Friendly: Most operations (dot products, matrix-vector multiplications) can be parallelized.
- Converges in ≤ n steps: In exact arithmetic, CG converges in at most n steps (where n is the size of A).
❌ Limitations
- SPD Requirement: CG only works for symmetric positive-definite matrices. If A is not SPD, CG may diverge or behave unpredictably.
- Condition Number Sensitivity: Poor conditioning can slow convergence dramatically (see the sketch after this list).
- Not Suitable for All ML Tasks: Despite its theoretical appeal, CG is rarely used to train modern deep learning models, where stochastic methods such as SGD and Adam dominate.
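The conditioning issue can be made concrete with a small experiment. The sketch below (illustrative assumptions: diagonal SPD test matrices with arbitrarily chosen eigenvalue ranges) counts how many CG iterations are needed to reach a fixed residual tolerance.

```python
import numpy as np

def cg_iterations(A, b, tol=1e-8, max_iter=10_000):
    """Count CG iterations until the residual norm drops below tol."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    for k in range(max_iter):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            return k + 1
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return max_iter

n = 200
rng = np.random.default_rng(2)
b = rng.standard_normal(n)

well = np.diag(np.linspace(1.0, 10.0, n))   # condition number ~10
ill = np.diag(np.linspace(1.0, 1e6, n))     # condition number ~1e6
print(cg_iterations(well, b), cg_iterations(ill, b))  # ill-conditioned case needs far more iterations
```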
Comparison with Related Algorithms
| Algorithm | SPD Required | Storage | Use Case |
|---|---|---|---|
| Conjugate Gradient | ✅ Yes | Low | Sparse linear systems |
| Gradient Descent | ❌ No | Low | General optimization |
| LU Decomposition | ❌ No | High | Small dense matrices |
| Jacobi/Gauss-Seidel | ❌ No | Low | Simple stationary iterations; smoothing/preconditioning |
| GMRES | ❌ No | Medium | Nonsymmetric systems |
Example (Numerical)
Solve the system:
A = [[4, 1],
[1, 3]]
b = [1, 2]
Initial guess: x₀ = [0, 0]
We want to solve: A·x = b
Step 0:
r₀ = b - A·x₀ = [1, 2]
p₀ = r₀
Step 1:
α₀ = (r₀ᵀ·r₀) / (p₀ᵀ·A·p₀)
= (1² + 2²) / ([1,2] · [6,7]) = 5 / 20 = 0.25, where A·p₀ = [6, 7]
x₁ = x₀ + α₀·p₀ = [0,0] + 0.25 * [1,2] = [0.25, 0.5]
r₁ = r₀ - α₀·A·p₀ = [1,2] - 0.25 * [6,7] = [-0.5, 0.25]
Continue in this fashion until ||r|| is below a chosen tolerance.
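For reference, the same worked example can be reproduced with the conjugate_gradient routine from the pseudocode section above (the tolerance and iteration cap below are arbitrary):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = conjugate_gradient(A, b, x0=np.zeros(2), tol=1e-10, max_iter=10)
print(x)                      # ≈ [0.0909, 0.6364], i.e. [1/11, 7/11]
print(np.linalg.solve(A, b))  # direct solution for comparison
```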
Preconditioning
A preconditioner M is often introduced to accelerate convergence. Instead of solving A·x = b, CG solves:
M⁻¹·A·x = M⁻¹·b
where M is a matrix that approximates A while remaining cheap to invert (equivalently, M⁻¹ approximates A⁻¹ and is cheap to apply). Common preconditioners include:
- Jacobi (Diagonal)
- Incomplete Cholesky
- SSOR (Symmetric Successive Over-Relaxation)
This leads to the Preconditioned Conjugate Gradient (PCG) algorithm.
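A minimal PCG sketch using SciPy's cg with a Jacobi (diagonal) preconditioner is shown below. Note that SciPy's M argument expects an operator that applies M⁻¹ in the notation above; the test matrix with a widely varying diagonal is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

# Illustrative SPD test system with a widely varying diagonal.
n = 1000
main = np.linspace(2.0, 2000.0, n)
A = diags([-1.0, main, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi preconditioner: apply the inverse of the diagonal of A.
d_inv = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: d_inv * v)

def count_iterations(**kwargs):
    """Run SciPy's cg and count its iterations via the callback."""
    count = 0
    def cb(xk):
        nonlocal count
        count += 1
    cg(A, b, callback=cb, **kwargs)
    return count

print(count_iterations())     # plain CG
print(count_iterations(M=M))  # preconditioned CG: noticeably fewer iterations
```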
Real-World Analogy
Imagine you’re trying to reach the lowest point in a curved valley. Instead of walking straight downhill (gradient descent), you choose smarter paths that avoid zig-zagging, each time moving in a direction that corrects the inefficiency of the last. Over time, you reach the bottom with fewer, more strategic steps. That’s what Conjugate Gradient does: it chooses A-conjugate (A-orthogonal) search directions so that no step undoes the progress of a previous one.
Key Formulas Summary
- Residual: rₖ = b - A·xₖ
- Step size: αₖ = (rₖᵀ·rₖ) / (pₖᵀ·A·pₖ)
- Solution update: xₖ₊₁ = xₖ + αₖ·pₖ
- Residual update: rₖ₊₁ = rₖ - αₖ·A·pₖ
- Conjugacy coefficient: βₖ = (rₖ₊₁ᵀ·rₖ₊₁) / (rₖᵀ·rₖ)
- Conjugate direction update: pₖ₊₁ = rₖ₊₁ + βₖ·pₖ
Related Keywords
- A-Conjugate Direction
- Finite Element Method
- Gradient Descent
- Iterative Solver
- Jacobi Method
- Krylov Subspace
- Linear System Solver
- Matrix Factorization
- Preconditioner
- Sparse Matrix