Parallel Prefix Sum (Scan) is the foundational parallel algorithm that computes all prefix sums of an input array in O(log N) steps — transforming [a₀, a₁, a₂, ...] into [a₀, a₀+a₁, a₀+a₁+a₂, ...] (inclusive scan) — and serving as the core building block for parallel sorting, stream compaction, radix sort, histogram computation, and memory allocation in GPU computing.
Why Scan Is Fundamental
Scan converts a seemingly sequential operation (running sum) into a parallel operation. More importantly, scan solves the "parallel addressing" problem: when each thread produces a variable number of outputs and needs to know where in the output array to write them, an exclusive scan of the output counts gives each thread its starting write position. This makes scan the gateway to parallelizing any irregular, data-dependent computation.
Algorithm Variants
- Hillis-Steele (Inclusive Scan): In step k, each element adds the element k positions to its left. After log2(N) steps, all prefix sums are complete. Does O(N log N) total work — simple but not work-efficient.
- Blelloch (Work-Efficient Scan): Two phases:
1. Up-Sweep (Reduce): Build a reduction tree bottom-up, computing partial sums at each level. O(N) work in log2(N) steps.
2. Down-Sweep: Propagate partial sums back down the tree to compute all prefix sums. O(N) work in log2(N) steps.
Total work: O(N) — optimal. Total steps: O(log N). This is the standard GPU scan algorithm.
GPU Implementation (Large Arrays)
For arrays larger than one thread block:
1. Each thread block scans its local chunk (e.g., 1024 elements) and writes the block's total sum to an auxiliary array.
2. A second kernel scans the auxiliary array (block sums).
3. A third kernel adds each block's prefix to all elements in that block.
This three-kernel approach handles arrays of any size. CUB and Thrust libraries provide optimized implementations that achieve >90% of peak memory bandwidth on modern GPUs.
Applications
- Stream Compaction: Given a predicate array [1,0,1,1,0], exclusive scan gives write positions [0,1,1,2,3]. Threads with predicate=1 write to the scanned position, compacting the array without gaps.
- Radix Sort: Each radix digit is sorted by computing histograms and scans to determine output positions for each digit value.
- Sparse Matrix Construction: Scan of per-row nonzero counts gives the CSR row pointer array.
- Dynamic Memory Allocation: Each thread declares how much memory it needs; scan gives each thread its allocation offset.
Parallel Prefix Sum is the Swiss Army knife of parallel algorithms — appearing inside nearly every non-trivial GPU algorithm as the mechanism that converts irregular, data-dependent parallelism into efficient, conflict-free parallel execution.