CUDA Dynamic Parallelism is the ability for GPU kernels to launch other GPU kernels directly from the device, eliminating round-trips to the CPU for recursive or adaptive algorithms where the next work unit depends on computed results.
Traditional GPU Programming Constraint
- Old model: CPU → launch kernel → GPU runs → CPU reads results → CPU decides next work → launch next kernel.
- Round-trip CPU-GPU overhead: 10–50 µs per kernel launch.
- Problem: Algorithms needing recursive subdivision required hundreds of CPU-GPU round-trips.
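To make the round-trip cost concrete, here is a hypothetical host-side driver for this old pattern (the kernel body and halving rule are placeholders, not from the original): each level of recursion pays a full launch-plus-synchronize trip through the CPU.

```cuda
#include <cuda_runtime.h>

// Placeholder per-level kernel: the real work would depend on the algorithm.
__global__ void process_level(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// CPU-driven recursion: one full CPU<->GPU round trip per level.
void host_driven_recursion(int* d_data, int n) {
    while (n > 1) {
        int blocks = (n + 255) / 256;
        process_level<<<blocks, 256>>>(d_data, n);
        cudaDeviceSynchronize();  // CPU blocks until the GPU finishes
        n /= 2;                   // CPU decides the next work unit (simplified)
    }
}
```

With tens of microseconds of overhead per iteration, a subdivision hundreds of levels deep spends a large fraction of its time in launch latency rather than computation.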
Dynamic Parallelism Solution
```cuda
__global__ void parent_kernel(int* data, int n) {
    if (n > THRESHOLD) {
        // Launch a child kernel from within the GPU kernel
        child_kernel<<<n/2, 256>>>(data, n/2);
        cudaDeviceSynchronize(); // Wait for the child (legacy CDP; removed in CUDA 12)
        merge_results<<<1, 32>>>(data, n);
    } else {
        base_case(data, n); // Small enough: handle directly in this kernel
    }
}
```
- Child kernels: Inherit parent's CUDA context.
- Synchronization: `cudaDeviceSynchronize()` within a kernel waits for outstanding child grids (note: device-side synchronization was deprecated in CUDA 11.6 and removed in CUDA 12, which instead guarantees ordering via tail launch).
- Stream inheritance: Children run on parent's stream by default.
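Because children inherit the parent's stream by default, sibling child grids serialize unless placed on separate streams. A sketch of launching concurrent children on device-created streams (`child_kernel` is assumed from the snippet above; device-side streams must be created with `cudaStreamNonBlocking`):

```cuda
__global__ void child_kernel(int* data, int n);  // assumed, as in the earlier example

__global__ void parent(int* left, int* right, int n) {
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        // Device-side streams require the non-blocking flag.
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
        // The two children can now run concurrently instead of serializing.
        child_kernel<<<(n + 255) / 256, 256, 0, s1>>>(left, n);
        child_kernel<<<(n + 255) / 256, 256, 0, s2>>>(right, n);
        // Destruction is asynchronous; queued work on the streams still completes.
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}
```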
When Dynamic Parallelism Helps
- Adaptive mesh refinement: Refine only high-error regions → launch child kernels for refined areas.
- Quicksort on GPU: Partition → recursively sort two halves from device.
- Sparse BFS: Expand only non-empty frontier → don't launch fixed-size kernels.
- Traversal algorithms: Octree, BVH traversal with unknown depth.
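The quicksort case illustrates the pattern well. A minimal device-side sketch in the spirit of NVIDIA's `cdpSimpleQuicksort` sample (the `partition` helper and the insertion-sort base case are assumed, not shown):

```cuda
// Assumed helper: Lomuto/Hoare partition returning the pivot's final index.
__device__ int partition(int* data, int lo, int hi);

__global__ void quicksort(int* data, int lo, int hi, int depth) {
    if (lo >= hi) return;
    // Fall back to a sequential base case before hitting the nesting limit.
    if (depth >= 24) { /* insertion_sort(data, lo, hi); */ return; }
    int p = partition(data, lo, hi);
    // Recurse on both halves directly from the device: no CPU round trips.
    quicksort<<<1, 1>>>(data, lo, p - 1, depth + 1);
    quicksort<<<1, 1>>>(data, p + 1, hi, depth + 1);
}
```

Code using dynamic parallelism must be compiled with relocatable device code and linked against the device runtime (`nvcc -rdc=true ... -lcudadevrt`).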
Performance Considerations
- Child launch overhead: ~500 ns on modern NVIDIA GPUs (vs. 10–50 µs for a CPU-to-GPU launch).
- Memory: Child grid descriptors are stored in global memory, a small per-launch overhead.
- Nesting: Up to 24 levels of nesting supported (CUDA 5.0+).
- Overhead vs. benefit: Only worthwhile when CPU launch overhead is the actual bottleneck.
Alternatives
- Persistent threads: One kernel with internal work queue instead of nested launches.
- CUDA Graphs: Pre-record dynamic work patterns if structure is known.
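The persistent-threads alternative can be sketched as a single long-lived grid pulling from a global work queue (a simplified, hypothetical sketch: it assumes all work is enqueued before the kernel drains the queue, with no late producers):

```cuda
__global__ void persistent_worker(int* queue, int* head, int* tail, int* data) {
    while (true) {
        int idx = atomicAdd(head, 1);  // each thread claims the next work item
        if (idx >= *tail) break;       // queue drained (simplified termination check)
        int item = queue[idx];
        data[item] *= 2;               // placeholder work
        // Adaptive algorithms would push new items here instead of launching
        // a child grid:  queue[atomicAdd(tail, 1)] = new_item;
    }
}
```

The trade-off: persistent threads avoid launch overhead entirely but require hand-written scheduling and termination logic, while dynamic parallelism keeps the familiar grid-launch model.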
CUDA Dynamic Parallelism is the key enabler for GPU-native recursive and adaptive algorithms: it eliminates the synchronization bottleneck that forced CPU coordination for work-adaptive GPU programs, enabling fully GPU-resident implementations of tree algorithms and adaptive solvers.