Parallel Iterative Solvers

Parallel iterative solvers are essential for large-scale scientific computing. They solve sparse linear systems (Ax = b) by iterative refinement on distributed-memory systems and are critical for CFD, electromagnetics, and structural mechanics.

Conjugate Gradient Algorithm and Parallelization

- CG Algorithm: Solves symmetric positive-definite (SPD) systems. Iteratively refines solution: x_{k+1} = x_k + α_k p_k (p_k = conjugate search direction).
- Per-Iteration Operations: One SpMV (sparse matrix-vector multiply) A×p, two inner products (r·r and p·A×p), and a few vector updates. All are highly parallelizable.
- Communication Bottlenecks: Inner products require a global reduction (allreduce) in every iteration. These synchronization points limit scalability at tens of thousands of cores.
- Iteration Count: CG converges in O(√κ) iterations (κ = condition number = λ_max / λ_min). Preconditioning reduces κ, often dramatically, which cuts the iteration count and improves scalability.
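The CG steps listed above can be sketched in a few lines. This is a minimal serial illustration in plain Python lists, not a production solver: the dense `matvec` stands in for the SpMV, `dot` marks where a distributed code would call an allreduce, and the 1D Poisson test matrix and all function names are assumptions made for the example.

```python
import math

def matvec(A, v):
    """Dense matrix-vector product; stands in for the distributed SpMV."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def dot(u, v):
    """Inner product; in a distributed code this is the allreduce."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def conjugate_gradient(A, b, rtol=1e-10, max_iter=1000):
    """Unpreconditioned CG for an SPD matrix A, starting from x_0 = 0."""
    x = [0.0] * len(b)
    r = list(b)                      # r_0 = b - A x_0 = b
    p = list(r)                      # first search direction
    rs_old = dot(r, r)
    r0_norm = math.sqrt(rs_old)
    if r0_norm == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)  # step length alpha_k
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * Ap_i for r_i, Ap_i in zip(r, Ap)]
        rs_new = dot(r, r)
        if math.sqrt(rs_new) <= rtol * r0_norm:
            return x, k
        beta = rs_new / rs_old       # makes the new direction A-conjugate
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
        rs_old = rs_new
    return x, max_iter

def poisson_1d(n):
    """Tridiagonal SPD test matrix: 2 on the diagonal, -1 next to it."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

A = poisson_1d(20)
b = [1.0] * 20
x, iters = conjugate_gradient(A, b)
print("iterations:", iters)
```

Each pass through the loop contains exactly one `matvec` and two `dot` calls, which is why the two reductions per iteration become the synchronization points in a distributed run.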

Preconditioning Techniques

- Incomplete LU (ILU): Approximate LU factorization that retains only significant entries, giving a preconditioner M = L×U ≈ A. The preconditioned system M^(-1)×A×x = M^(-1)×b is solved in place of the original.
- Algebraic Multigrid (AMG): Automatically constructs coarse levels from the matrix entries, without mesh geometry. Solves the coarse problem (fewer unknowns) and interpolates the correction back to the fine level. Can cut iteration counts by roughly an order of magnitude on elliptic problems.
- ILU vs AMG Trade-off: ILU is cheaper per iteration; AMG needs fewer iterations but has higher setup and per-iteration cost. AMG typically wins on large elliptic problems, where the reduced iteration count compensates for the overhead.
- Parallel Preconditioners: Domain decomposition preconditioners (each subdomain solves local system) parallelize well. Block Jacobi, Additive Schwarz.
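To illustrate how a domain decomposition preconditioner plugs into CG, here is a minimal serial sketch of Block Jacobi, where M is the block diagonal of A and each block solve is independent (one block per process in a parallel code). The helper names (`solve_dense`, `block_jacobi`, `pcg`), the 1D Poisson test matrix, and the block size are assumptions for the example, and the small dense Gaussian elimination stands in for whatever local solver a real code would use.

```python
import math

def dot(u, v):
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for one small block."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def block_jacobi(A, block_size):
    """Return a function computing z = M^{-1} r, M = block diagonal of A."""
    n = len(A)
    ranges = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    blocks = [[row[lo:hi] for row in A[lo:hi]] for lo, hi in ranges]
    def apply(r):
        z = []
        for (lo, hi), blk in zip(ranges, blocks):
            z.extend(solve_dense(blk, r[lo:hi]))  # independent local solve
        return z
    return apply

def pcg(A, b, apply_m_inv, rtol=1e-8, max_iter=500):
    """Preconditioned CG; apply_m_inv(r) returns M^{-1} r."""
    x = [0.0] * len(b)
    r = list(b)
    z = apply_m_inv(r)
    p = list(z)
    rz = dot(r, z)
    r0_norm = math.sqrt(dot(r, r))
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * Ap_i for r_i, Ap_i in zip(r, Ap)]
        if math.sqrt(dot(r, r)) <= rtol * r0_norm:
            return x, k
        z = apply_m_inv(r)
        rz_new = dot(r, z)
        p = [z_i + (rz_new / rz) * p_i for z_i, p_i in zip(z, p)]
        rz = rz_new
    return x, max_iter

n = 64
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [1.0 + (i % 3) for i in range(n)]
x_plain, it_plain = pcg(A, b, lambda r: list(r))   # M = identity
x_bj, it_bj = pcg(A, b, block_jacobi(A, 16))       # 4 blocks of 16 rows
print("CG iterations without / with Block Jacobi:", it_plain, it_bj)
```

The preconditioner application needs no communication at all, which is what makes Block Jacobi attractive in parallel; the price is that its effectiveness usually degrades as the number of blocks grows.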

Krylov Subspace Methods

- GMRES (Generalized Minimal Residual): Solves nonsymmetric systems. Minimizes the residual norm over the Krylov subspace. Memory grows with the iteration count because all Krylov basis vectors are stored.
- BiCGSTAB: Nonsymmetric solver with fixed memory, lower than GMRES. A stabilized variant of the BiConjugate Gradient (BiCG) algorithm that smooths BiCG's irregular convergence, although breakdown is still possible.
- QMR (Quasi-Minimal Residual): Alternative nonsymmetric solver. Typically shows smoother residual curves than BiCG-type methods.
- Krylov Subspace Dimension: A larger subspace (k = 50-100) converges in fewer iterations but needs more memory. Restarted GMRES(k) discards the Krylov basis every k iterations, which bounds memory but can slow or stall convergence.
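The memory side of the restart trade-off is simple arithmetic: GMRES(k) keeps k+1 basis vectors of length n. The sketch below computes that footprint for the basis alone, assuming 8-byte double-precision values and a hypothetical global problem size; solution, right-hand side, and preconditioner storage are not counted.

```python
def gmres_basis_bytes(n_unknowns, restart, bytes_per_value=8):
    """Bytes needed for the restart+1 Krylov basis vectors of GMRES(restart)."""
    return (restart + 1) * n_unknowns * bytes_per_value

n = 100_000_000  # hypothetical global number of unknowns
for k in (30, 50, 100):
    gib = gmres_basis_bytes(n, k) / 2**30
    print(f"GMRES({k}): {gib:.1f} GiB for the Krylov basis (all processes combined)")
```

Short-recurrence methods such as CG and BiCGSTAB keep only a fixed handful of vectors regardless of iteration count, which is the memory advantage noted in the list above.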

Sparse Matrix-Vector Product (SpMV)

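- SpMV Parallelization: Distribute matrix rows across processors. Each processor computes the output entries for its own rows after receiving the off-processor input-vector entries those rows reference (halo exchange). With a row distribution no global reduction is needed.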
- Storage Format: CSR (Compressed Sparse Row) stores the nonzeros row by row, with a column index per nonzero and a row-pointer array. Other formats (COO, ELL, HYB) suit particular sparsity patterns; ELL and HYB are common on GPUs.
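- Communication Pattern: Matrices with irregular sparsity (wide stencils, unstructured meshes) make each processor exchange data with many neighbors, approaching all-to-all communication in the worst case. Network topology and bisection bandwidth then limit scalability.
- Bandwidth Limiting: SpMV is typically memory-bound (roofline model): it performs only about two flops per stored nonzero, so sustained performance is a small fraction of the theoretical floating-point peak, roughly 10-30% at best on most systems. Memory bandwidth utilization drives speed.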
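The row-wise distribution and CSR layout can be sketched together. The example below is a serial simulation, assuming the input vector is distributed the same way as the rows: each simulated process computes only its own rows and reports which off-process x entries it would have to receive first. The function names and the small tridiagonal test matrix are illustrative assumptions.

```python
def dense_to_csr(A):
    """Build CSR arrays: nonzero values, their column indices, row pointers."""
    vals, cols, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                vals.append(a)
                cols.append(j)
        row_ptr.append(len(vals))
    return vals, cols, row_ptr

def csr_spmv_rows(vals, cols, row_ptr, x, row_lo, row_hi):
    """Compute y[row_lo:row_hi], the rows owned by one process."""
    y = []
    for i in range(row_lo, row_hi):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[cols[k]]   # two flops per stored nonzero
        y.append(s)
    return y

def halo_columns(cols, row_ptr, row_lo, row_hi):
    """Column indices outside the owned range: x entries to receive."""
    needed = {cols[k] for k in range(row_ptr[row_lo], row_ptr[row_hi])}
    return sorted(j for j in needed if not row_lo <= j < row_hi)

n, nprocs = 8, 2
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
x = [float(i) for i in range(n)]
vals, cols, row_ptr = dense_to_csr(A)

chunk = n // nprocs
y = []
for rank in range(nprocs):
    lo, hi = rank * chunk, (rank + 1) * chunk
    print(f"rank {rank} owns rows {lo}..{hi - 1}, "
          f"needs x entries {halo_columns(cols, row_ptr, lo, hi)}")
    y.extend(csr_spmv_rows(vals, cols, row_ptr, x, lo, hi))
```

For this tridiagonal matrix each process needs a single neighbor value; with unstructured meshes the halo set grows and spans more processes, which is the communication cost described above.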

Domain Decomposition Methods

- Partitioning Strategy: Divide domain (mesh) into subdomains, each assigned to processor. Interface edges connect subdomains.
- Local Solve: Each processor solves its local subdomain problem independently. Boundary values are exchanged across processors to enforce the interface conditions.
- Schwarz Methods: Additive Schwarz (concurrent solves, exchange solutions). Multiplicative Schwarz (sequential solves, better convergence but less parallel).
- Scalability: Domain decomposition enables weak scaling (fixed work per processor, scale problem size). Strong scaling limited by interface synchronization.
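The difference between the two Schwarz variants can be seen in a small serial sketch. It assumes non-overlapping subdomains on a 1D Poisson matrix and uses the methods as plain stationary iterations rather than as preconditioners; the function names and the dense local solver are assumptions for the example. In the additive sweep every subdomain uses the same old residual (so the solves could run concurrently); in the multiplicative sweep each subdomain sees the updates of the ones before it.

```python
import math

def norm(v):
    return math.sqrt(sum(v_i * v_i for v_i in v))

def residual(A, b, x):
    return [b_i - sum(a * x_j for a, x_j in zip(row, x))
            for row, b_i in zip(A, b)]

def solve_dense(M, rhs):
    """Gaussian elimination with partial pivoting for one local problem."""
    n = len(rhs)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def schwarz_sweep(A, b, x, ranges, multiplicative):
    """One pass over all subdomains, each correcting its own unknowns."""
    r = residual(A, b, x)            # additive: all subdomains share this
    new_x = list(x)
    for lo, hi in ranges:
        if multiplicative:
            r = residual(A, b, new_x)  # sees earlier subdomain updates
        local = [row[lo:hi] for row in A[lo:hi]]
        for i, d_i in zip(range(lo, hi), solve_dense(local, r[lo:hi])):
            new_x[i] += d_i
    return new_x

def schwarz_solve(A, b, ranges, multiplicative, rtol=1e-8, max_sweeps=5000):
    x = [0.0] * len(b)
    b_norm = norm(b)
    for sweep in range(1, max_sweeps + 1):
        x = schwarz_sweep(A, b, x, ranges, multiplicative)
        if norm(residual(A, b, x)) <= rtol * b_norm:
            return x, sweep
    return x, max_sweeps

n = 16
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [1.0] * n
ranges = [(0, 4), (4, 8), (8, 12), (12, 16)]   # four subdomains
x_add, sweeps_add = schwarz_solve(A, b, ranges, multiplicative=False)
x_mul, sweeps_mul = schwarz_solve(A, b, ranges, multiplicative=True)
print("sweeps additive / multiplicative:", sweeps_add, sweeps_mul)
```

The multiplicative variant needs fewer sweeps here, but its subdomain loop is inherently sequential, which mirrors the convergence-versus-parallelism trade-off in the list above.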

PETSc and Trilinos Frameworks

- PETSc (Portable, Extensible Toolkit for Scientific Computation): Open-source library developed at Argonne National Laboratory. Provides distributed matrices, vectors, and solvers.
- Solver Suite: KSP (Krylov subspace solvers), PC (preconditioners), SNES (nonlinear solvers), TS (time steppers). Integrated profiling, and run-time selection of solvers and preconditioners through command-line options.
- Trilinos: Sandia National Laboratories library. Emphasis on performance on modern architectures. Includes Belos (iterative linear solvers), MueLu (algebraic multigrid).
- Adoption: Both widely used in CFD, finite-element codes. Provide industrial-strength implementations, tested on 100k+ core systems.

Convergence and Scalability Analysis

- Convergence Monitoring: Residual norm ‖r_k‖ = ‖b - A×x_k‖ tracked per iteration. Convergence criterion: ‖r_k‖ < ε‖r_0‖ (relative tolerance, ~1e-6).
- Stagnation: Residual plateaus without converging. Indicates an inadequate preconditioner or severe ill-conditioning. Remedy: switch solver or preconditioner.
- Weak Scaling: Work per processor stays constant while problem size grows proportionally with processor count. Ideally both iteration count and time per iteration stay constant; in practice the iteration count can grow with problem size unless the preconditioner is scalable (e.g., multigrid), and global reduction cost grows with processor count.
- Strong Scaling: Fixed problem size, processor count increases. As local work shrinks, synchronization points dominate at high core counts, typically limiting speedup at tens of thousands of cores.
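The scaling behavior above can be made concrete with a toy cost model for one CG iteration: local SpMV work that shrinks as 1/P plus allreduce latency that grows as log2(P). Every constant below (time per flop, network latency, nonzeros per row, reductions per iteration) is an assumption chosen for illustration, not a measurement, and halo-exchange cost is left out for simplicity.

```python
import math

def iteration_time(n_global, procs, t_flop=1e-9, t_latency=2e-6,
                   nnz_per_row=7, reductions=2):
    """Toy model of one CG iteration; all constants are assumptions."""
    local_rows = n_global / procs
    compute = 2 * nnz_per_row * local_rows * t_flop          # SpMV flops
    allreduce = reductions * t_latency * math.log2(procs)    # tree reduction
    return compute + allreduce

def strong_scaling_efficiency(n_global, procs):
    """Fixed problem size: T(1) / (P * T(P))."""
    return iteration_time(n_global, 1) / (procs * iteration_time(n_global, procs))

def weak_scaling_efficiency(rows_per_proc, procs):
    """Fixed work per process: T(1) / T(P) with n = rows_per_proc * P."""
    return (iteration_time(rows_per_proc, 1)
            / iteration_time(rows_per_proc * procs, procs))

for p in (16, 1024, 16384):
    print(f"P={p:6d}  strong eff={strong_scaling_efficiency(10_000_000, p):.2f}  "
          f"weak eff={weak_scaling_efficiency(100_000, p):.2f}")
```

In this model strong-scaling efficiency collapses once the per-process compute time falls to the order of the reduction latency, while weak-scaling efficiency degrades only slowly; real measurements would also reflect iteration-count growth and halo traffic.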
