CUDA Dynamic Parallelism is the ability for GPU kernels to launch other GPU kernels directly from the device, eliminating round-trips to the CPU for recursive or adaptive algorithms where the next work unit depends on computed results.
Traditional GPU Programming Constraint
- Old model: CPU → launch kernel → GPU runs → CPU reads results → CPU decides next work → launch next kernel.
- Round-trip CPU-GPU overhead: 10–50 µs per kernel launch.
- Problem: Algorithms needing recursive subdivision required hundreds of CPU-GPU round-trips.
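To make the round-trip cost concrete, here is a hypothetical host-side driver for this old pattern (the kernel body and halving rule are placeholders, not from the original): each level of recursion pays a full launch-plus-synchronize trip through the CPU.

```cuda
#include <cuda_runtime.h>

// Placeholder per-level kernel: the real work would depend on the algorithm.
__global__ void process_level(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// CPU-driven recursion: one full CPU<->GPU round trip per level.
void host_driven_recursion(int* d_data, int n) {
    while (n > 1) {
        int blocks = (n + 255) / 256;
        process_level<<<blocks, 256>>>(d_data, n);
        cudaDeviceSynchronize();  // CPU blocks until the GPU finishes
        n /= 2;                   // CPU decides the next work unit (simplified)
    }
}
```

With tens of microseconds of overhead per iteration, a subdivision hundreds of levels deep spends a large fraction of its time in launch latency rather than computation.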
Dynamic Parallelism Solution
```cuda
__global__ void parent_kernel(int* data, int n) {
    if (n > THRESHOLD) {
        // Launch a child kernel from within the GPU kernel
        child_kernel<<<n/2, 256>>>(data, n/2);
        cudaDeviceSynchronize(); // Wait for the child (legacy CDP; removed in CUDA 12)
        merge_results<<<1, 32>>>(data, n);
    } else {
        base_case(data, n); // Small enough: handle directly in this kernel
    }
}
```
- Child kernels: Inherit parent's CUDA context.
- Synchronization: `cudaDeviceSynchronize()` within a kernel waits for outstanding child grids (note: device-side synchronization was deprecated in CUDA 11.6 and removed in CUDA 12, which instead guarantees ordering via tail launch).
- Stream inheritance: Children run on parent's stream by default.
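Because children inherit the parent's stream by default, sibling child grids serialize unless placed on separate streams. A sketch of launching concurrent children on device-created streams (`child_kernel` is assumed from the snippet above; device-side streams must be created with `cudaStreamNonBlocking`):

```cuda
__global__ void child_kernel(int* data, int n);  // assumed, as in the earlier example

__global__ void parent(int* left, int* right, int n) {
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        // Device-side streams require the non-blocking flag.
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
        // The two children can now run concurrently instead of serializing.
        child_kernel<<<(n + 255) / 256, 256, 0, s1>>>(left, n);
        child_kernel<<<(n + 255) / 256, 256, 0, s2>>>(right, n);
        // Destruction is asynchronous; queued work on the streams still completes.
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}
```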
When Dynamic Parallelism Helps
- Adaptive mesh refinement: Refine only high-error regions → launch child kernels for refined areas.
- Quicksort on GPU: Partition → recursively sort two halves from device.
- Sparse BFS: Expand only non-empty frontier → don't launch fixed-size kernels.
- Traversal algorithms: Octree, BVH traversal with unknown depth.
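The quicksort case illustrates the pattern well. A minimal device-side sketch in the spirit of NVIDIA's `cdpSimpleQuicksort` sample (the `partition` helper and the insertion-sort base case are assumed, not shown):

```cuda
// Assumed helper: Lomuto/Hoare partition returning the pivot's final index.
__device__ int partition(int* data, int lo, int hi);

__global__ void quicksort(int* data, int lo, int hi, int depth) {
    if (lo >= hi) return;
    // Fall back to a sequential base case before hitting the nesting limit.
    if (depth >= 24) { /* insertion_sort(data, lo, hi); */ return; }
    int p = partition(data, lo, hi);
    // Recurse on both halves directly from the device: no CPU round trips.
    quicksort<<<1, 1>>>(data, lo, p - 1, depth + 1);
    quicksort<<<1, 1>>>(data, p + 1, hi, depth + 1);
}
```

Code using dynamic parallelism must be compiled with relocatable device code and linked against the device runtime (`nvcc -rdc=true ... -lcudadevrt`).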
Performance Considerations
- Child launch overhead: ~500 ns on modern NVIDIA GPUs (vs. 10–50 µs for a CPU-to-GPU launch).
- Memory: Child grid descriptors are stored in global memory, a small per-launch overhead.
- Nesting: Up to 24 levels of nesting supported (CUDA 5.0+).
- Overhead vs. benefit: Only worthwhile when CPU launch overhead is the actual bottleneck.
Alternatives
- Persistent threads: One kernel with internal work queue instead of nested launches.
- CUDA Graphs: Pre-record dynamic work patterns if structure is known.
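The persistent-threads alternative can be sketched as a single long-lived grid pulling from a global work queue (a simplified, hypothetical sketch: it assumes all work is enqueued before the kernel drains the queue, with no late producers):

```cuda
__global__ void persistent_worker(int* queue, int* head, int* tail, int* data) {
    while (true) {
        int idx = atomicAdd(head, 1);  // each thread claims the next work item
        if (idx >= *tail) break;       // queue drained (simplified termination check)
        int item = queue[idx];
        data[item] *= 2;               // placeholder work
        // Adaptive algorithms would push new items here instead of launching
        // a child grid:  queue[atomicAdd(tail, 1)] = new_item;
    }
}
```

The trade-off: persistent threads avoid launch overhead entirely but require hand-written scheduling and termination logic, while dynamic parallelism keeps the familiar grid-launch model.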
CUDA Dynamic Parallelism is the key enabler for GPU-native recursive and adaptive algorithms: it eliminates the synchronization bottleneck that forced CPU coordination for work-adaptive GPU programs, enabling fully GPU-resident implementations of tree algorithms and adaptive solvers.