Parallel Stencil Computation

Parallel stencil computation is a numerical method in which each grid point is updated from a fixed pattern of neighboring values (the stencil). It is ubiquitous in computational fluid dynamics, weather simulation, image processing, and PDE solvers. It is also one of the most important parallel computing patterns: the regular, local data access enables efficient parallelization through domain decomposition with halo exchange, and well-tuned codes can approach near-linear scaling to millions of cores when communication is properly overlapped with computation.

Stencil Pattern

A 2D 5-point stencil:
``new[i][j] = w0*old[i][j] + w1*old[i-1][j] + w2*old[i+1][j] + w3*old[i][j-1] + w4*old[i][j+1]``

Each point depends only on its immediate neighbors, and the update is applied to every point of a 2D or 3D grid at each timestep. Examples: Jacobi iteration, Gauss-Seidel (whose in-place updates impose a dependency ordering), the heat equation, the wave equation, and weather prediction.
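As a concrete illustration of the 5-point update, here is a minimal pure-Python sketch of one sweep. The function name `stencil_step`, the default weights (a plain 4-neighbor average), and the choice to leave boundary cells unchanged are assumptions made for this example, not something the pattern itself prescribes.

```python
def stencil_step(old, w=(0.0, 0.25, 0.25, 0.25, 0.25)):
    """Apply one 5-point stencil sweep to the interior of a 2D grid.

    Boundary rows and columns are copied unchanged (assumed fixed values).
    """
    rows, cols = len(old), len(old[0])
    new = [row[:] for row in old]  # separate output buffer, Jacobi-style
    w0, w1, w2, w3, w4 = w
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = (w0 * old[i][j]
                         + w1 * old[i - 1][j] + w2 * old[i + 1][j]
                         + w3 * old[i][j - 1] + w4 * old[i][j + 1])
    return new


grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 4.0  # a single hot cell in the centre
result = stencil_step(grid)
```

Because every output cell reads only from `old`, all interior points can be updated independently, which is what makes the pattern easy to parallelize.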

Domain Decomposition

The grid is divided into subdomains, one per processor. Each processor updates its local subdomain independently — except at subdomain boundaries, where stencil calculations need values from adjacent processors' domains.
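One common way to assign subdomains is a block partition along each axis. The sketch below is only an illustration: the helper names `block_range` and `decompose_2d` are hypothetical, and the rule of giving leftover rows or columns to the lowest-numbered processes is an assumption.

```python
def block_range(n, parts, rank):
    """Return the half-open index range [start, stop) owned by `rank`."""
    base, extra = divmod(n, parts)
    start = rank * base + min(rank, extra)  # earlier ranks absorb the remainder
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def decompose_2d(n_rows, n_cols, p_rows, p_cols):
    """Map each (pr, pc) process coordinate to its (row range, column range)."""
    return {
        (pr, pc): (block_range(n_rows, p_rows, pr),
                   block_range(n_cols, p_cols, pc))
        for pr in range(p_rows)
        for pc in range(p_cols)
    }


# A 10 x 8 grid split over a 3 x 2 process grid.
layout = decompose_2d(10, 8, 3, 2)
```

Here `layout[(0, 0)]` covers rows 0 to 3 and columns 0 to 3, so subdomain sizes differ by at most one row or column.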

Halo Exchange (Ghost Cells)

- Ghost/Halo Region: Each subdomain is padded with extra layers of cells copied from neighboring processors. The number of layers equals the stencil radius, typically 1 to 3.
- Exchange Protocol: Before each timestep, each processor sends its boundary cells to its neighbors and receives their boundary cells into its ghost region. A 2D decomposition with 4 neighbors requires 4 send/receive pairs per timestep (stencils that read diagonal neighbors also need corner values).
- Communication Volume: For an N×N local subdomain with a 1-cell halo, communication per timestep is 4N elements (the surface) while computation is N² elements (the volume). The surface-to-volume ratio 4N/N² = 4/N shrinks as N grows, so larger subdomains have a better computation-to-communication ratio.
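The exchange protocol can be emulated in a single process to check that it reproduces the undecomposed result. This sketch assumes a 1D decomposition into row strips with a one-row halo and fixed physical boundaries. List copies stand in for the messages a real code would send, and all function names are hypothetical.

```python
def sweep(local):
    """Jacobi sweep (4-neighbour average) over the interior of `local`."""
    new = [row[:] for row in local]
    for i in range(1, len(local) - 1):
        for j in range(1, len(local[0]) - 1):
            new[i][j] = 0.25 * (local[i - 1][j] + local[i + 1][j]
                                + local[i][j - 1] + local[i][j + 1])
    return new


def split_with_halo(grid, parts):
    """Cut the interior rows into strips, each padded with one row above and below."""
    base, extra = divmod(len(grid) - 2, parts)
    strips, start = [], 1
    for k in range(parts):
        stop = start + base + (1 if k < extra else 0)
        strips.append([row[:] for row in grid[start - 1:stop + 1]])
        start = stop
    return strips


def exchange_halos(strips):
    """Refresh ghost rows between adjacent strips (stand-in for send/receive)."""
    for upper, lower in zip(strips, strips[1:]):
        upper[-1] = lower[1][:]  # lower's first owned row -> upper's bottom ghost
        lower[0] = upper[-2][:]  # upper's last owned row -> lower's top ghost


def gather(strips):
    """Reassemble the global grid from owned rows plus the two physical boundaries."""
    rows = [strips[0][0][:]]
    for s in strips:
        rows.extend(r[:] for r in s[1:-1])
    rows.append(strips[-1][-1][:])
    return rows


grid = [[float((3 * i + 5 * j) % 7) for j in range(6)] for i in range(9)]
strips = split_with_halo(grid, 3)
reference = grid
for _ in range(4):
    exchange_halos(strips)               # halo exchange before each timestep
    strips = [sweep(s) for s in strips]  # independent local updates
    reference = sweep(reference)         # undecomposed reference run
matches = gather(strips) == reference
```

If the exchange step is skipped, the ghost rows go stale after the first sweep and the two results diverge.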

Optimization Techniques

- Communication-Computation Overlap: Start halo exchange (non-blocking MPI_Isend/Irecv), compute interior points (which don't need ghost cells), then wait for halo exchange completion and compute boundary points. Hides communication latency behind useful computation.
- Temporal Blocking (Tiling): Instead of exchanging halos every timestep, widen the halo to k times the stencil radius (k cells for a radius-1 stencil) and compute k timesteps between exchanges. This reduces communication frequency by k× at the cost of redundantly computing cells in the expanded halo.
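- Cache-Oblivious Tiling: Tile both the spatial and temporal dimensions to maximize data reuse within the cache hierarchy. This is achieved through recursive space-time decomposition, which needs no tile sizes tuned to a particular cache; wavefront and other explicitly sized space-time tilings are the cache-aware counterpart.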
- Vectorization (SIMD): Stencil operations on contiguous grid rows vectorize naturally, because adjacent grid points are processed by adjacent SIMD lanes. Aligning and padding rows to vector-width or cache-line boundaries helps avoid unaligned accesses and improves vectorization efficiency.
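Temporal blocking is the least obvious technique in the list above, so here is a small single-process sketch of it. It assumes a 1D radius-1 smoothing stencil with fixed end cells, and the function names are hypothetical. For simplicity it sweeps the whole window at every step and discards the halo afterwards, rather than shrinking the computed region step by step.

```python
def sweep_1d(u):
    """One 3-point smoothing step; the two end cells are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
    return new


def advance_block(u, start, stop, k):
    """Advance the owned cells u[start:stop] by k steps using a k-wide halo."""
    lo, hi = max(start - k, 0), min(stop + k, len(u))
    window = u[lo:hi]  # owned cells plus halo: the only data "communicated"
    for _ in range(k):
        # Halo cells go stale from the outside in, one cell per step,
        # so after k steps the owned cells are still exact.
        window = sweep_1d(window)
    return window[start - lo:stop - lo]


def run_temporally_blocked(u, parts, k, rounds):
    """Run rounds * k timesteps with only `rounds` halo exchanges."""
    n = len(u)
    bounds = [(p * n // parts, (p + 1) * n // parts) for p in range(parts)]
    for _ in range(rounds):
        # Every block reads the same snapshot, as if halos were exchanged once.
        u = [x for start, stop in bounds
             for x in advance_block(u, start, stop, k)]
    return u


initial = [float((7 * i) % 11) for i in range(24)]
blocked = run_temporally_blocked(initial, parts=3, k=3, rounds=2)

reference = initial
for _ in range(6):  # 2 rounds * 3 steps, exchanging every step
    reference = sweep_1d(reference)
```

In this setup `blocked` equals `reference`, while the blocked run needs 2 exchanges instead of 6. The price is the extra halo cells each block recomputes.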

GPU Stencil Implementation

Load a tile of the grid (plus halo) into shared memory, synchronize the thread block, and then have each thread compute one grid point from shared memory. Shared-memory reads are fast, and row-contiguous access by adjacent threads typically avoids bank conflicts. Each thread block processes one tile, and the tiles together cover the whole grid.
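The tiling logic can be illustrated without a GPU. The following is a CPU-side Python emulation, not GPU code: the loops over tiles stand in for thread blocks, the inner loops stand in for threads, and the local list `shared` stands in for shared memory. The tile size, the function names, and the 4-neighbor averaging stencil are assumptions for the example.

```python
def direct_sweep(old):
    """Reference: 4-neighbour average computed straight from the global grid."""
    new = [row[:] for row in old]
    for i in range(1, len(old) - 1):
        for j in range(1, len(old[0]) - 1):
            new[i][j] = 0.25 * (old[i - 1][j] + old[i + 1][j]
                                + old[i][j - 1] + old[i][j + 1])
    return new


def tiled_sweep(old, tile=4):
    """Same sweep, but each tile first stages its data (plus halo) locally."""
    rows, cols = len(old), len(old[0])
    new = [row[:] for row in old]
    for bi in range(1, rows - 1, tile):        # one iteration per "thread block"
        for bj in range(1, cols - 1, tile):
            h = min(tile, rows - 1 - bi)       # tile height, clipped at the edge
            w = min(tile, cols - 1 - bj)       # tile width, clipped at the edge
            # "Shared memory": the tile plus a one-cell halo on every side.
            shared = [old[bi - 1 + r][bj - 1:bj + w + 1] for r in range(h + 2)]
            for ti in range(1, h + 1):         # one iteration per "thread"
                for tj in range(1, w + 1):
                    new[bi + ti - 1][bj + tj - 1] = 0.25 * (
                        shared[ti - 1][tj] + shared[ti + 1][tj]
                        + shared[ti][tj - 1] + shared[ti][tj + 1])
    return new


grid = [[float((2 * i + 3 * j) % 5) for j in range(11)] for i in range(10)]
same = tiled_sweep(grid, tile=4) == direct_sweep(grid)
```

On a real GPU the staging step is followed by a block-wide barrier before any thread reads `shared`; the sequential emulation needs none.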

Parallel Stencil Computation is the poster child of structured parallel computing — combining regular data access, predictable communication, and natural domain decomposition into a pattern that scales to the largest supercomputers on Earth, underpinning the simulations that predict weather, design aircraft, and model physical phenomena.
