SIMT Execution and Warp Divergence

Keywords: SIMT execution model divergence, warp divergence branch, predicated execution GPU, branch reconvergence hardware, warp voting functions

SIMT Execution and Warp Divergence characterizes the single-instruction, multiple-thread execution model, in which all threads of a warp execute the same instruction. Divergent control flow is therefore serialized, while warp voting functions enable fine-grained, warp-level cooperation.

SIMT Execution Model Fundamentals

- Warp Definition: 32 threads executing in lockstep on NVIDIA GPUs (including Ampere and Hopper). In the basic model, all threads of a warp issue the same instruction simultaneously.
- Program Counter Synchronicity: Pre-Volta GPUs keep a single program counter per warp; Volta and later track a PC per thread (independent thread scheduling) but still issue one instruction per warp per cycle. Branches create divergence: some threads take the branch, others do not.
- Instruction-Level Parallelism (ILP): Independent instructions from a thread's stream (e.g., an FP32 operation and a memory load) can be in flight concurrently, subject to available execution units and latencies, which lets the scheduler overlap compute and memory work.
- SIMT vs SIMD: SIMT is scalar per thread (each thread owns scalar registers; the hardware groups 32 threads into an implicit vector), whereas SIMD exposes explicit vector registers that the programmer or compiler must fill. SIMT is the simpler programming model.
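The scalar-per-thread point above is visible in even the simplest kernel. A minimal sketch (the SAXPY example is illustrative, not from the original text):

```cuda
// Each thread computes on its own scalar registers; the hardware runs 32 of
// these scalar programs in lockstep per warp. The "vectorization" is implicit,
// unlike SIMD intrinsics, where the programmer manages vector registers.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // scalar per-thread index
    if (i < n)
        y[i] = a * x[i] + y[i];                    // scalar per-thread math
}
```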

Warp Divergence at Branch Points

- Branch Condition: if (thread_id < 16) {...}. Some threads take branch, others skip.
- Divergence Impact: The warp serializes: it executes the if-path with the taking threads active and the remaining threads masked off (idle, not computing), then executes the else-path under the complementary mask.
- Serial Execution: Both paths execute sequentially, not in parallel. With a 50/50 split across a single if-else, effective throughput is halved, the worst case for that construct.
- Convergence Stack: Hardware maintains active masks tracking which threads execute. A stack-based mechanism keyed to immediate post-dominators (IPDOM) in the control-flow graph manages nested divergence.
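The serialization described above can be made concrete with a small kernel (the kernel name and output buffer are illustrative):

```cuda
// Intra-warp divergence: lanes 0-15 take the if-path, lanes 16-31 the
// else-path, so the warp runs both bodies back to back with complementary
// active masks rather than in parallel.
__global__ void divergent_kernel(float *out)
{
    int tid = threadIdx.x;
    if (tid < 16) {
        out[tid] = tid * 2.0f;  // executed first; lanes 16-31 masked off
    } else {
        out[tid] = tid * 0.5f;  // executed second; lanes 0-15 masked off
    }
    // reconvergence point: all 32 lanes active again
}
```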

Predicated Execution

- Predicate Register: A 1-bit flag per thread (NVIDIA SASS provides a small set of per-thread predicate registers). An instruction can be guarded by a predicate: @p0 instruction executes only in threads where p0 is true.
- Predication Implementation: All instructions in both arms are issued, but the predicate masks result write-back. Predicated-off threads produce no side effects (their state is unchanged).
- Branch Elimination: The compiler predicates small if-else blocks so no explicit branch is emitted, avoiding divergence bookkeeping and exposing more ILP.
- Predicate Overhead: One extra instruction to set the predicate, plus execution of masked-off instructions whose results are discarded. Predication typically beats an explicit branch when each arm is small (on the order of a few instructions).
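A sketch of source code that compilers commonly turn into predicated code rather than a branch (kernel and parameter names are illustrative; the exact lowering depends on the compiler and architecture):

```cuda
__global__ void predicated_kernel(const float *in, float *out,
                                  float threshold, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    float v = in[tid];
    // A small two-way choice like this is typically compiled to a predicated
    // select rather than an actual branch, so the warp never diverges here.
    out[tid] = (v > threshold) ? v : 0.0f;
}
```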

Branch Reconvergence via IPDOM Stack

- Immediate Post-Dominator (IPDOM): Derived from dominance analysis of the control-flow graph (CFG). The immediate post-dominator of a branch is the first block that every path leaving the branch must pass through, i.e., the earliest guaranteed reconvergence block.
- Reconvergence Point: The IPDOM is the block where the paths from the divergence point rejoin; there, all threads become active again.
- Stack Mechanism: On a divergent branch, hardware pushes divergence info (active mask, reconvergence target) onto the stack. When a path reaches the reconvergence point, the entry is popped and execution resumes with the restored mask.
- Nesting Complexity: Nested divergence (if within if) creates stack depth > 1. Deep nesting (>8 levels) possible but rare.
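Nested divergence and the matching push/pop pairs can be annotated directly on a kernel (the kernel is illustrative; the comments describe the conceptual stack behavior, not a specific hardware trace):

```cuda
__global__ void nested_divergence(int *out)
{
    int tid = threadIdx.x;
    if (tid < 16) {              // outer divergence: push entry (depth 1)
        if (tid < 8) {           // inner divergence: push entry (depth 2)
            out[tid] = 1;        // lanes 0-7 active
        } else {
            out[tid] = 2;        // lanes 8-15 active
        }                        // inner reconvergence: pop (back to depth 1)
    } else {
        out[tid] = 3;            // lanes 16-31 active
    }                            // outer reconvergence: pop (all lanes active)
}
```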

Warp Voting Functions

- __ballot_sync(mask, predicate): Ballot across the warp. Returns a 32-bit integer with bit i set if thread i's predicate is true. mask specifies which threads participate.
- __any_sync(mask, predicate): Reduction OR. Returns nonzero if any participating thread's predicate is true, else 0.
- __all_sync(mask, predicate): Reduction AND. Returns nonzero if all participating threads' predicates are true, else 0.
- Use Cases: ballot() for warp-level histogram; any() for early exit (any thread found solution); all() for synchronization (all threads ready).
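The three primitives compose naturally; a minimal sketch (kernel and buffer names are illustrative):

```cuda
__global__ void vote_example(const int *data, int *result)
{
    int tid = threadIdx.x;
    bool pred = data[tid] > 0;

    // Bit i of 'ballot' reflects lane i's predicate across the full warp.
    unsigned ballot = __ballot_sync(0xFFFFFFFFu, pred);
    int positives = __popc(ballot);        // population count of set bits

    if (__any_sync(0xFFFFFFFFu, pred)) {   // at least one lane positive
        if (tid == 0)
            result[0] = positives;         // lane 0 records the warp count
    }
    if (__all_sync(0xFFFFFFFFu, pred)) {   // every lane positive
        if (tid == 0)
            result[1] = 1;
    }
}
```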

Avoiding Divergence via Data-Dependent Branching Analysis

- Divergence Detection: Profilers such as Nsight Compute report branch-efficiency and divergent-branch metrics, indicating how often warps diverge and how much it costs.
- Data-Dependent Patterns: Analyzing branch conditions predicts whether divergence is likely. Example: if (array[tid] > threshold) diverges heavily when array values are random relative to the threshold.
- Sorting Trick: For highly divergent conditionals, sort or bin the data by the condition value so threads in the same warp see the same outcome; whole warps then take a single path and divergence largely disappears.
- Early Exit: Check loop-termination conditions warp-wide via ballot/all voting. Finished lanes are masked off while the warp continues until every lane is done, reducing warp idleness from per-lane early returns.
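The early-exit pattern can be sketched as a warp-wide search loop (kernel and parameter names are illustrative; the write to *found races if multiple lanes match and is left unsynchronized for brevity):

```cuda
// Each lane scans its own slice; the warp exits as a unit once every lane
// is done, instead of lanes returning early one by one.
__global__ void warp_search(const int *data, int n_per_lane,
                            int target, int *found)
{
    int lane = threadIdx.x & 31;
    const int *slice = data + lane * n_per_lane;
    bool done = false;

    for (int i = 0; i < n_per_lane; ++i) {
        if (!done && slice[i] == target) {
            *found = lane * n_per_lane + i;  // illustrative; unsynchronized
            done = true;
        }
        // Uniform warp-wide exit: __all_sync returns the same value in
        // every lane, so all lanes break together.
        if (__all_sync(0xFFFFFFFFu, done))
            break;
    }
}
```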

Structured vs Unstructured Control Flow

- Structured Flow: Single entry/exit loops, if-else blocks. Compiler easily determines reconvergence points. Simple hardware handling.
- Unstructured Flow: Multiple exits (break, return), goto statements. Complicates reconvergence analysis. Modern GPUs handle but with overhead.
- Best Practice: Favor structured loops/conditionals. Avoid deep nesting. Minimize branches in hot kernels.

Performance Implications

- Branch Cost: GPUs do not rely on CPU-style speculative branch prediction; a branch that is uniform across a warp is cheap, and the dominant cost is divergence serialization plus reconvergence bookkeeping, hidden where possible by switching to other ready warps.
- Occupancy Trade-off: Under loop divergence, a warp holds its scheduler slot until its last lane finishes; only then does the warp retire and free resources for other warps. Lanes that exit early therefore waste issue slots in the meantime.
- Warp Efficiency Metric: Percentage of threads executing useful work. Divergence reduces warp efficiency (inactive threads masked). Target >80% warp efficiency.
