Parallel Prefix Scan is a foundational parallel algorithm that computes cumulative sums (or any associative operation) across array elements in logarithmic depth; it is essential for stream compaction, sorting, and many other GPU workloads.
Inclusive vs Exclusive Scan Definitions
- Inclusive Scan: Output[i] = sum(Input[0..i]). Last element = total sum. Example: input=[1,2,3,4] → output=[1,3,6,10].
- Exclusive Scan: Output[i] = sum(Input[0..i-1]). Last element = sum of first N-1 elements. Example: input=[1,2,3,4] → output=[0,1,3,6].
- Scan Generalization: Replace addition with any associative operator (min, max, bitwise AND/OR). Inclusive/exclusive semantics apply to any operator; an exclusive scan additionally needs the operator's identity element.
- Prefix Operation: Generic term for all cumulative operations. Scan = GPU terminology; prefix sum = generic algorithmic terminology.
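A minimal sequential reference (host code) makes the two conventions concrete, using the same input as the examples above:

```cuda
#include <cstdio>

int main() {
    const int N = 4;
    int in[N] = {1, 2, 3, 4};
    int inclusive[N], exclusive[N];

    int running = 0;                    // identity element of +
    for (int i = 0; i < N; ++i) {
        exclusive[i] = running;         // sum of in[0..i-1]; 0 for i == 0
        running += in[i];
        inclusive[i] = running;         // sum of in[0..i]
    }
    // inclusive = {1, 3, 6, 10}, exclusive = {0, 1, 3, 6}
    for (int i = 0; i < N; ++i)
        printf("in=%d inclusive=%d exclusive=%d\n", in[i], inclusive[i], exclusive[i]);
    return 0;
}
```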
Blelloch Up-Sweep/Down-Sweep Algorithm
- Work-Efficient Scan: O(N) work (same as sequential), O(log N) depth (parallel). Contrasts with Hillis-Steele (O(N log N) work).
- Up-Sweep Phase: Tree reduction bottom-up. At stride 1, each pair's left partial sum is added into its right element; the stride doubles each level (2, 4, 8, ...) for log2(N) levels, leaving the total sum in the last element.
- Down-Sweep Phase: Tree traversal top-down. The root (last element) is set to the identity (0); at each level a node passes its value down to its left child and gives its right child the sum of its value and the old left-child value. After log2(N) levels this yields the exclusive scan (see the kernel sketch after this list).
- Example: Array [1,2,3,4,5,6,7,8]. Up-sweep computes sums at each level. Down-sweep produces [0,1,3,6,10,15,21,28] (exclusive scan).
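A single-block kernel sketch of the up-sweep/down-sweep structure, following the classic GPU Gems 3 formulation; it assumes n is a power of two, the whole array fits in one block's shared memory, and it omits bank-conflict padding:

```cuda
// Exclusive Blelloch scan for one block. Launch as
// blelloch_scan<<<1, n/2, n * sizeof(int)>>>(d_in, d_out, n); each thread owns two elements.
__global__ void blelloch_scan(const int* in, int* out, int n) {
    extern __shared__ int temp[];                 // n ints of shared memory
    int tid = threadIdx.x;

    temp[2 * tid]     = in[2 * tid];              // each thread loads two elements
    temp[2 * tid + 1] = in[2 * tid + 1];

    // Up-sweep (reduce): add left partial sums into right elements, stride doubling each level.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Down-sweep: clear the root (identity), then swap-and-add down the tree.
    if (tid == 0) temp[n - 1] = 0;
    for (int d = 1; d < n; d <<= 1) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            int t    = temp[ai];
            temp[ai] = temp[bi];                  // left child gets the node's value
            temp[bi] += t;                        // right child = node value + old left sum
        }
    }
    __syncthreads();

    out[2 * tid]     = temp[2 * tid];             // exclusive scan result
    out[2 * tid + 1] = temp[2 * tid + 1];
}
```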
Hillis-Steele Step-Efficient Scan
- Step Efficient: O(N log N) work, O(log N) depth. Work redundancy acceptable for parallelism benefit.
- Algorithm: Pass k adds the element at offset 2^k to the left: pass 0 adds offset 1; pass 1 adds offset 2; pass 2 adds offset 4.
- Correctness: After log2(N) passes, every element holds the sum of itself and all elements to its left. Output = inclusive scan (shift right and insert the identity to obtain the exclusive variant).
- Implementation: Simple, short code with good memory locality; commonly used as the warp- and block-level building block inside GPU scan libraries such as Thrust and CUB (minimal kernel sketch below).
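A single-block Hillis-Steele sketch (inclusive variant). Shared memory is double-buffered so each pass reads only the previous pass's values; the sketch assumes the block is launched with exactly n threads and 2*n ints of dynamic shared memory:

```cuda
// Inclusive Hillis-Steele scan for one block. Launch as
// hillis_steele_scan<<<1, n, 2 * n * sizeof(int)>>>(d_in, d_out, n);
__global__ void hillis_steele_scan(const int* in, int* out, int n) {
    extern __shared__ int temp[];                 // two ping-pong buffers of n ints
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout * n + tid] = in[tid];               // inclusive: each thread loads its own element
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        pout = 1 - pout;                          // swap read/write buffers
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
        else
            temp[pout * n + tid] = temp[pin * n + tid];
        __syncthreads();
    }
    out[tid] = temp[pout * n + tid];              // inclusive scan result
}
```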
GPU Scan Implementation Using Shared Memory
- Block-Level Scan: Shared memory scan for block_size elements (typically 512-1024). Load thread-local elements into shared memory, perform in-memory scan.
- Warp-Level Scan: Fast scan within a warp (32 threads) using shuffle intrinsics (__shfl_up_sync); partial sums move between lanes in registers rather than through shared memory.
- Multi-Block Scan: Larger arrays require multiple blocks. Scan each block independently, scan the array of per-block sums, then add each block's scanned sum (its global offset) to its elements (host-side sketch after this list).
- Bank Conflict Avoidance: Access shared memory with index patterns that avoid bank conflicts; power-of-two strides need padding (e.g., one extra element per 32) to stay conflict-free.
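A host-side sketch of the multi-block composition described above. `block_scan` and `add_block_offsets` are hypothetical kernels (a per-block exclusive scan that also emits its block total, and a pass that adds each block's offset); the structure, not the kernel bodies, is the point:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels assumed to be defined elsewhere:
__global__ void block_scan(const int* in, int* out, int* block_sums, int n);
__global__ void add_block_offsets(int* data, const int* block_offsets, int n);

void large_scan(const int* d_in, int* d_out, int n) {
    const int BLOCK = 1024;                              // elements per block (assumption)
    int num_blocks = (n + BLOCK - 1) / BLOCK;

    int* d_block_sums;
    cudaMalloc(&d_block_sums, num_blocks * sizeof(int));

    // 1. Scan each block independently; each block writes its total into d_block_sums[blockIdx.x].
    block_scan<<<num_blocks, BLOCK>>>(d_in, d_out, d_block_sums, n);

    // 2. Exclusive-scan the per-block sums (repeat hierarchically if num_blocks > BLOCK).
    block_scan<<<1, BLOCK>>>(d_block_sums, d_block_sums, nullptr, num_blocks);

    // 3. Add each block's scanned sum (its global offset) to its elements.
    add_block_offsets<<<num_blocks, BLOCK>>>(d_out, d_block_sums, n);

    cudaFree(d_block_sums);
}
```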
Warp-Level Scan via Shuffle Operations
- __shfl_sync() / __shfl_up_sync(): Warp communication intrinsics. __shfl_sync broadcasts a register value from lane srcLane to the calling threads; __shfl_up_sync returns the value held by the lane delta positions lower, which is the variant used for scan.
- Shuffle-Based Scan: Each warp computes its scan entirely in registers via shuffle operations (no shared memory). Every thread keeps a running partial sum and pulls partial sums from lower lanes at doubling offsets (register-only sketch after this list).
- Efficiency: No shared-memory traffic or bank conflicts; each step is a single low-latency instruction. Throughput is ample for 32 threads.
- Hierarchical Composition: Larger arrays apply the 32-lane warp scan hierarchically: per-warp totals are scanned (typically in shared memory) to form warp offsets within a block, and per-block totals are combined through global memory.
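A register-only warp scan sketch using __shfl_up_sync, assuming a full 32-lane warp and addition as the operator:

```cuda
// Inclusive warp scan in registers: lane i returns the sum of values held by lanes 0..i.
__device__ int warp_inclusive_scan(int value) {
    const unsigned FULL_MASK = 0xffffffffu;
    int lane = threadIdx.x & 31;

    // Kogge-Stone pattern over registers: each step pulls the partial sum
    // from the lane `offset` positions below and adds it in.
    for (int offset = 1; offset < 32; offset <<= 1) {
        int lower = __shfl_up_sync(FULL_MASK, value, offset);
        if (lane >= offset) value += lower;       // lanes below `offset` keep their value
    }
    return value;
}
```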
Segmented Scan and Applications
- Segmented Scan: Multiple independent scans on contiguous segments. Segment boundaries determined by flag array.
- Implementation: A carry-in value propagates within each segment; the flag marks a segment boundary, at which the carry-in resets to the identity (sequential reference after this list).
- Stream Compaction Use Case: Scan the 0/1 predicate array (element selected or not); an ordinary exclusive scan of the predicate yields the output index of each selected element.
- RLE Compression: Run-length encoding scans run-start flags to number the runs and compute each run's output position.
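A sequential reference for flag-based segmented inclusive scan; the parallel version lifts (flag, value) pairs into a single associative operator, but the carry-reset semantics are easiest to see serially:

```cuda
#include <cstdio>

// flags[i] == 1 marks the start of a new segment, which resets the running carry.
void segmented_inclusive_scan(const int* in, const int* flags, int* out, int n) {
    int carry = 0;
    for (int i = 0; i < n; ++i) {
        if (flags[i]) carry = 0;                  // segment boundary: reset carry-in
        carry += in[i];
        out[i] = carry;
    }
}

int main() {
    int in[]    = {1, 2, 3, 4, 5, 6};
    int flags[] = {1, 0, 0, 1, 0, 0};             // two segments: [1,2,3] and [4,5,6]
    int out[6];
    segmented_inclusive_scan(in, flags, out, 6);
    for (int i = 0; i < 6; ++i) printf("%d ", out[i]);   // prints: 1 3 6 4 9 15
    printf("\n");
    return 0;
}
```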
Prefix Sum Applications in GPU Computing
- Stream Compaction: Filter an array, removing unwanted elements. An exclusive scan of the boolean predicate yields output indices; the compacted array is built by scattering the selected elements (worked sketch after this list).
- Radix Sort: Counting sort histogram (count elements in each bucket), then prefix scan of counts yields output positions. Elements scattered to output based on positions.
- Histogram Computation: Count occurrences of each value. An atomics-based histogram is slow under contention; an alternative sorts the keys and uses a segmented scan/reduction over equal-key runs to count without atomics.
- Polynomial Evaluation: Horner's method (y = a_n + x(a_{n-1} + x(a_{n-2} + ...))) is a linear recurrence; it parallelizes by scanning the composition of the per-step linear maps (an associative operator), exposing fine-grained parallelism.
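A scan-based stream-compaction sketch that keeps the even elements (a stand-in predicate). It combines Thrust's exclusive_scan with two small custom kernels; a minimal illustration under those assumptions, not a tuned implementation:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void flag_evens(const int* in, int* flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] % 2 == 0) ? 1 : 0;      // 1 = keep, 0 = drop
}

__global__ void scatter_selected(const int* in, const int* flags,
                                 const int* indices, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[indices[i]] = in[i];      // indices[i] = compacted output slot
}

void compact_evens(thrust::device_vector<int>& d_in, thrust::device_vector<int>& d_out) {
    int n = static_cast<int>(d_in.size());
    thrust::device_vector<int> flags(n), indices(n);

    flag_evens<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(flags.data()), n);

    // Exclusive scan of the 0/1 predicate gives each kept element its output index.
    thrust::exclusive_scan(flags.begin(), flags.end(), indices.begin());

    int count = static_cast<int>(indices.back()) + static_cast<int>(flags.back());
    d_out.resize(count);

    scatter_selected<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(d_in.data()),
        thrust::raw_pointer_cast(flags.data()),
        thrust::raw_pointer_cast(indices.data()),
        thrust::raw_pointer_cast(d_out.data()), n);
}
```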
Performance and Scalability
- Blelloch Complexity: O(N) work, O(log N) depth. Speedup becomes practical once the array is large enough to amortize kernel-launch overhead; arrays larger than a single block require a multi-kernel composition (or a single-pass decoupled look-back scheme).
- Bandwidth: Scan is memory-bound; a multi-kernel implementation makes at least two passes over the data. Peak-bandwidth utilization of ~50-80% is typical; the bottleneck is memory, not computation.
- Optimized Libraries: thrust::inclusive_scan and CUB's DeviceScan are heavily tuned (CUB uses a single-pass decoupled look-back scan) and reach a large fraction of peak HBM bandwidth; the best algorithm variant depends on the GPU model and element type.
- Multi-GPU Scaling: A global scan across GPUs composes local scans with an exchange of per-GPU totals (an allgather or exclusive-scan collective) to compute each GPU's global offset, which is then added to its local results; the collective exchange dominates communication cost (sketch after this list).
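A multi-GPU composition sketch assuming one MPI rank per GPU, each holding one contiguous chunk in rank order; MPI_Exscan supplies each rank's global offset. This is one possible composition, not the only one:

```cuda
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/constant_iterator.h>

// Turns per-rank chunks into one global inclusive scan.
void distributed_inclusive_scan(thrust::device_vector<long long>& d_chunk, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    // 1. Local inclusive scan on this rank's GPU.
    thrust::inclusive_scan(d_chunk.begin(), d_chunk.end(), d_chunk.begin());

    // 2. Exchange per-rank totals: MPI_Exscan gives each rank the sum of all
    //    lower ranks' totals (undefined on rank 0, so force it to zero).
    long long local_total = d_chunk.empty() ? 0 : static_cast<long long>(d_chunk.back());
    long long offset = 0;
    MPI_Exscan(&local_total, &offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    if (rank == 0) offset = 0;

    // 3. Add the global offset to every local element.
    thrust::transform(d_chunk.begin(), d_chunk.end(),
                      thrust::make_constant_iterator(offset),
                      d_chunk.begin(), thrust::plus<long long>());
}
```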