Parallel Iterative Solvers are essential for large-scale scientific computing: they solve sparse linear systems (Ax=b) by iterative refinement across distributed-memory systems, and are critical for CFD, electromagnetics, and structural mechanics.
Conjugate Gradient Algorithm and Parallelization
- CG Algorithm: Solves symmetric positive-definite (SPD) systems. Iteratively refines solution: x_{k+1} = x_k + α_k p_k (p_k = conjugate search direction).
- Per-Iteration Operations: One SpMV (sparse matrix-vector multiply) A×p, two inner products (dot products of residuals and search directions), and a few vector updates. The local arithmetic in all of them parallelizes well.
- Communication Bottlenecks: Inner products require global reduction (allreduce) per iteration. Synchronization points limit scalability beyond tens of thousands of cores.
- Iteration Count: Convergence = O(√κ) iterations (κ = condition number = λ_max / λ_min). Preconditioning reduces κ dramatically, cutting the iteration count and therefore the number of global reductions.
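The CG iteration described above can be sketched in plain Python. This is a minimal serial illustration, not a production solver: the dense `matvec` stands in for a distributed SpMV, `dot` marks where an allreduce would occur on distributed memory, and `poisson_1d` is a hypothetical test matrix introduced only for this example.

```python
def matvec(A, v):
    """Dense stand-in for the SpMV A×p."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product; on distributed memory this is a local sum plus an allreduce."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, rtol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, starting from x_0 = 0. Returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                      # r_0 = b - A x_0 = b
    p = list(r)                      # first search direction
    rr = dot(r, r)
    r0_norm = rr ** 0.5
    if r0_norm == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)                                   # SpMV
        alpha = rr / dot(p, Ap)                             # reduction 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]       # x_{k+1} = x_k + alpha_k p_k
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)                                  # reduction 2
        if rr_new ** 0.5 <= rtol * r0_norm:
            return x, k
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]        # next conjugate direction
        rr = rr_new
    return x, max_iter


def poisson_1d(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1), which is SPD."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]
```

Each pass through the loop performs one matrix-vector product and two inner products, which is where the per-iteration synchronization cost comes from.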
Preconditioning Techniques
- Incomplete LU (ILU): Approximate LU factorization that keeps only significant entries (fill-in outside a chosen sparsity pattern or below a drop tolerance is discarded). Preconditioner M = L×U ≈ A. Solve the preconditioned system M^(-1)×A×x = M^(-1)×b.
- Algebraic Multigrid (AMG): Automatically constructs a hierarchy of coarse levels from the matrix itself, without mesh geometry. Solves coarse problem (fewer unknowns), interpolates correction back to fine level. Often cuts iteration counts by roughly 10x on suitable (e.g., elliptic) problems.
- ILU vs AMG Trade-off: ILU cheaper per iteration; AMG fewer iterations but higher setup and per-iteration cost. AMG often wins on large elliptic problems (fewer iterations compensate for the overhead).
- Parallel Preconditioners: Domain decomposition preconditioners (each subdomain solves local system) parallelize well. Block Jacobi, Additive Schwarz.
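As a sketch of how a preconditioner enters the iteration, the block below shows preconditioned CG with a Jacobi (diagonal) preconditioner, which is the block-size-1 case of Block Jacobi. It applies M^(-1) to the residual instead of forming M^(-1)×A, which is the usual way to keep the method symmetric. The names `jacobi_preconditioner`, `preconditioned_cg`, and `scaled_poisson_1d` are hypothetical, chosen for this illustration.

```python
def matvec(A, v):
    """Dense stand-in for a sparse matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_preconditioner(A):
    """Return a function applying M^(-1), with M = diag(A)."""
    inv_diag = [1.0 / A[i][i] for i in range(len(A))]
    return lambda r: [d * ri for d, ri in zip(inv_diag, r)]


def preconditioned_cg(A, b, apply_Minv, rtol=1e-10, max_iter=1000):
    """Preconditioned CG for SPD A, starting from x_0 = 0. Returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)
    r0_norm = dot(r, r) ** 0.5
    if r0_norm == 0.0:
        return x, 0
    z = apply_Minv(r)                # preconditioned residual z = M^(-1) r
    p = list(z)
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= rtol * r0_norm:
            return x, k
        z = apply_Minv(r)            # purely local work for a diagonal M
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def scaled_poisson_1d(n):
    """Hypothetical badly scaled SPD test matrix S×T×S, T = tridiag(-1, 2, -1)."""
    s = [10.0 ** (i % 3) for i in range(n)]
    return [[s[i] * (2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0) * s[j]
             for j in range(n)] for i in range(n)]
```

Applying a diagonal preconditioner needs no communication at all, which is why Jacobi-type and domain-local preconditioners parallelize well even when they reduce κ less than ILU or AMG.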
Krylov Subspace Methods
- GMRES (Generalized Minimal Residual): Solves non-symmetric systems. Minimizes residual norm over Krylov subspace. Memory grows with iteration count (stores all Krylov basis vectors).
- BiCGSTAB: Nonsymmetric solver with fixed, lower memory than GMRES (short recurrences). Stabilized variant of the BiConjugate Gradient (BiCG) algorithm; smoother convergence than BiCG, though breakdown remains possible.
- QMR (Quasi-Minimal Residual): Alternative nonsymmetric solver. Smoother iteration behavior than BiCGSTAB.
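- Krylov Subspace Dimension: Larger subspace (k=50-100) converges faster but needs more memory. Restarted GMRES(k) discards the Krylov basis every k iterations and restarts from the current iterate, which bounds memory but can slow or stall convergence.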
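The memory side of the restart trade-off is simple arithmetic. The sketch below assumes GMRES(k) keeps k + 1 basis vectors of double-precision values and ignores the small Hessenberg matrix and other work vectors; the function names are hypothetical.

```python
def gmres_basis_bytes(n_unknowns, restart, bytes_per_value=8):
    """Memory for the restart + 1 Arnoldi basis vectors kept by GMRES(restart)."""
    return (restart + 1) * n_unknowns * bytes_per_value


def gmres_basis_bytes_per_rank(n_unknowns, restart, n_ranks, bytes_per_value=8):
    """Same storage per rank, assuming a balanced row partition."""
    rows_per_rank = -(-n_unknowns // n_ranks)   # ceiling division
    return (restart + 1) * rows_per_rank * bytes_per_value
```

Under these assumptions, a system with 10^8 unknowns and k=50 needs 51 × 10^8 × 8 bytes, about 40.8 GB, for the basis alone, while BiCGSTAB-style short recurrences need only a handful of vectors regardless of iteration count.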
Sparse Matrix-Vector Product (SpMV)
- SpMV Parallelization: Distribute matrix rows across processors. Each processor computes the output entries for its own rows. No reduction is needed; instead, each processor first gathers the input-vector entries its rows reference from neighboring processors (halo exchange).
- Storage Format: CSR (Compressed Sparse Row) stores nonzeros row by row with column indices and row pointers. Alternative formats (COO, ELL, HYB) suit particular sparsity patterns; ELL and HYB are common on GPUs.
- Communication Pattern: Irregular sparsity (wide stencils, unstructured meshes) increases the number of neighbors each processor exchanges with, approaching all-to-all communication in the worst case. Network bandwidth and topology then limit scalability.
- Bandwidth Limiting: SpMV typically memory-bound (roofline model). Achieved floating-point rate ~10-30% of theoretical peak on most systems, often less. Memory bandwidth utilization drives speed.
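A row-partitioned CSR SpMV can be sketched as follows. The code is serial and only simulates the partition: `csr_spmv_rows` computes the output rows one rank would own, and `halo_columns` lists the input-vector entries that rank would have to receive, assuming the vector is partitioned the same way as the rows. All function names are hypothetical.

```python
def dense_to_csr(A):
    """Convert a dense matrix to CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_spmv_rows(values, col_idx, row_ptr, x, row_start, row_end):
    """Compute y[row_start:row_end] = (A x)[row_start:row_end] for one rank."""
    y_local = []
    for i in range(row_start, row_end):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]      # indirect access: memory-bound
        y_local.append(s)
    return y_local


def halo_columns(col_idx, row_ptr, row_start, row_end):
    """Entries of x referenced by the local rows but owned by another rank."""
    needed = {col_idx[k] for k in range(row_ptr[row_start], row_ptr[row_end])}
    return sorted(j for j in needed if not (row_start <= j < row_end))
```

For a tridiagonal matrix split into two contiguous row blocks, each rank needs exactly one remote entry of x, which illustrates why narrow stencils communicate only with nearest neighbors.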
Domain Decomposition Methods
- Partitioning Strategy: Divide domain (mesh) into subdomains, each assigned to processor. Interface edges connect subdomains.
- Local Solve: Each processor solves local subdomain independently. Interface conditions exchange boundary values across processors.
- Schwarz Methods: Additive Schwarz (concurrent solves, exchange solutions). Multiplicative Schwarz (sequential solves, better convergence but less parallel).
- Scalability: Domain decomposition enables weak scaling (fixed work per processor, scale problem size). Strong scaling limited by interface synchronization.
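The additive Schwarz idea can be illustrated with a small serial sketch. It assumes non-overlapping subdomains (which makes it equivalent to a block Jacobi iteration) and uses it as a stand-alone stationary iteration; with overlap, damping or use as a Krylov preconditioner is generally needed instead. `solve_dense` and `additive_schwarz` are hypothetical helper names.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (the local subdomain solve)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def additive_schwarz(A, b, subdomains, tol=1e-10, max_iter=500):
    """Iterate x += sum_i R_i^T A_i^(-1) R_i (b - A x). Returns (x, sweeps)."""
    n = len(b)
    x = [0.0] * n
    b_norm = sum(v * v for v in b) ** 0.5
    for sweep in range(max_iter):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if sum(v * v for v in r) ** 0.5 <= tol * b_norm:
            return x, sweep
        # Every subdomain uses the same residual, so the solves are independent
        # and could run concurrently, one per processor.
        corrections = []
        for idx in subdomains:
            A_loc = [[A[i][j] for j in idx] for i in idx]
            corrections.append(solve_dense(A_loc, [r[i] for i in idx]))
        for idx, d in zip(subdomains, corrections):
            for i, di in zip(idx, d):
                x[i] += di
    return x, max_iter
```

A multiplicative variant would recompute the residual after each subdomain correction, which is what makes it sequential but faster to converge.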
PETSc and Trilinos Frameworks
- PETSc (Portable, Extensible Toolkit for Scientific Computation): Open-source library developed at Argonne National Laboratory. Provides distributed matrices, vectors, solvers.
- Solver Suite: KSP (Krylov solver package), PC (preconditioner), SNES (nonlinear solver), TS (timestepper). Integrated profiling; solvers and preconditioners selectable at runtime via command-line options (e.g., -ksp_type, -pc_type).
- Trilinos: Sandia National Laboratories library. Emphasis on performance on modern architectures. Includes Belos (iterative linear solvers), MueLu (algebraic multigrid).
- Adoption: Both widely used in CFD, finite-element codes. Provide industrial-strength implementations, tested on 100k+ core systems.
Convergence and Scalability Analysis
- Convergence Monitoring: Residual norm ‖r_k‖ = ‖b - A×x_k‖ tracked per iteration. Convergence criterion: ‖r_k‖ < ε‖r_0‖ (relative tolerance, ~1e-6).
- Stagnation: Residual plateaus without convergence. Indicates preconditioner inadequate, ill-conditioning. Switch solver/preconditioner.
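- Weak Scaling: Work per processor constant, problem size increases proportionally with processor count. Ideal: iteration count and time per iteration both stay constant. In practice, allreduce latency grows with processor count, and iteration count grows with problem size unless the preconditioner is scalable (e.g., multigrid).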
- Strong Scaling: Fixed problem size, processor count increases. Synchronization points dominate at high core counts, limiting speedup beyond 10,000 cores.
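The weak- and strong-scaling behavior above can be explored with a toy cost model. This is a sketch under loud assumptions: per-iteration time is local work proportional to the rows a rank owns, plus two allreduces whose latency grows like log2 of the rank count. The constants `t_row` and `t_hop` are made-up placeholders, not measurements, and halo-exchange cost is ignored.

```python
import math


def iteration_time(n_rows_local, n_ranks, t_row=1e-8, t_hop=2e-6, n_reductions=2):
    """Assumed model: local SpMV/vector work plus tree-based allreduce latency."""
    compute = n_rows_local * t_row
    reduce_latency = n_reductions * t_hop * math.log2(n_ranks)
    return compute + reduce_latency


def strong_scaling_efficiency(n_rows, n_ranks, **model):
    """Fixed total problem size: T(1) / (P * T(P))."""
    t1 = iteration_time(n_rows, 1, **model)
    tp = iteration_time(n_rows / n_ranks, n_ranks, **model)
    return t1 / (n_ranks * tp)


def weak_scaling_efficiency(n_rows_local, n_ranks, **model):
    """Fixed work per rank: T(1) / T(P)."""
    t1 = iteration_time(n_rows_local, 1, **model)
    return t1 / iteration_time(n_rows_local, n_ranks, **model)
```

In this model the reduction term is untouched by adding ranks while the compute term shrinks, so strong-scaling efficiency falls once the two become comparable; weak-scaling efficiency degrades only slowly because the compute term stays fixed.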