Parallel Stencil Computation is the high-performance computing pattern for algorithms where each output element is computed from a fixed neighborhood (stencil) of input elements — the core computational pattern in finite difference methods for PDEs, image processing kernels, cellular automata, and lattice physics simulations. Stencil computations are memory-bandwidth bound and require carefully designed tiling and communication strategies to achieve high efficiency on multi-core CPUs, GPUs, and distributed clusters.
What Is a Stencil?
- A stencil defines which neighboring grid points contribute to the computation of each output point.
- 1D 3-point stencil:
out[i] = ain[i-1] + bin[i] + c*in[i+1]. - 2D 5-point stencil (von Neumann):
out[i,j] = in[i-1,j] + in[i+1,j] + in[i,j-1] + in[i,j+1] − 4*in[i,j]. - 3D 7-point stencil: Laplacian in 3D → heat equation, Poisson's equation.
- High-order stencil: 25-point or 49-point stencil → more neighbors → higher accuracy, more flops, more memory traffic.
Arithmetic Intensity of Stencil Computations
- 3D 7-point stencil: 7 loads + 1 store per output point, 7 FMAs → AI ≈ 7/8 ≈ 0.88 FLOP/byte.
- Modern GPUs: Ridge point ~30 FLOP/byte → stencil is extremely memory bandwidth bound.
- CPU cache blocking: Reuse data from cache → AI increases → approaches cache bandwidth limit.
GPU Stencil Implementation
__global__ void stencil_3d_7pt(float* out, const float* in, int N) {
int i = blockIdx.x*blockDim.x + threadIdx.x;
int j = blockIdx.y*blockDim.y + threadIdx.y;
int k = blockIdx.z*blockDim.z + threadIdx.z;
if (i>0 && i<N-1 && j>0 && j<N-1 && k>0 && k<N-1) {
out[IDX(i,j,k)] = in[IDX(i-1,j,k)] + in[IDX(i+1,j,k)]
+ in[IDX(i,j-1,k)] + in[IDX(i,j+1,k)]
+ in[IDX(i,j,k-1)] + in[IDX(i,j,k+1)]
- 6.0f * in[IDX(i,j,k)];
}
}
Shared Memory Tiling (GPU)
- Load tile + halo into shared memory → compute stencil from shared memory → fewer global loads.
- For 2D 5-point stencil, 16×16 tile + 1-cell halo: 18×18=324 loads → 16×16=256 output points → reuse factor 324/256=1.27 for center, but saves boundary redundancy.
- For 3D stencil: 3D tiling in shared memory → significant reuse → 2–3× speedup over naive.
Wavefront Parallelism
- Some stencils have data dependencies:
out[i,t+1]depends onout[i-1,t],out[i,t],out[i+1,t](time-stepping). - Naive: Time step → barrier → time step → sequential time dimension.
- Wavefront: Diagonal sweeps — compute multiple time steps simultaneously using diamond-shaped execution.
- Points on same diagonal are independent → computed in parallel → exposes temporal parallelism.
Diamond Tiling
- Advanced cache-oblivious tiling for time-dependent stencils.
- Diamond tiles: Each tile computes multiple time steps for a spatial region.
- Data reuse across time: Load spatial region once → compute T time steps → store → reduces memory bandwidth by T×.
- Enables out-of-cache computation: Tile fits in L2 → multiple time steps → reduce DRAM traffic.
- Framework: PLUTO (Polyhedral Loop Transformation), PolyMage → automatic diamond tiling generation.
Stencil Applications
| Application | Stencil Type | Domain Size |
|---|---|---|
| Weather simulation (WRF) | 3D 7-point | 1000×1000×50 grid points |
| Seismic RTM | 3D 25-point high order | 500³ points |
| DNS turbulence simulation | 3D spectral + physical | 4096³ points |
| Image bilateral filter | 2D variable | 4K×4K pixels |
| Lattice QCD | 4D U(1) stencil | 64⁴ lattice sites |
| Cardiac electrophysiology | 3D reaction-diffusion | Heart mesh |
Distributed Stencil with MPI
- Decompose 3D grid across MPI ranks (domain decomposition).
- Halo exchange: Each rank sends boundary planes to neighbors → fills ghost cells → computes stencil.
- Overlap: Non-blocking halo exchange → compute interior → wait for halos → compute boundary.
- Scaling: 7-point stencil, 3D decomposition → communication grows as P^(2/3) → good weak scaling.
Parallel stencil computation is the fundamental computational pattern of physics-based simulation — from predicting next week's weather to simulating nuclear reactors to rendering realistic fluid effects in movies, stencil computations on distributed parallel hardware are the mathematical machinery that transforms differential equations into numerical solutions, making parallel stencil optimization one of the most impactful skills in scientific high-performance computing.
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.