CUDA Streams and Asynchronous Execution


CUDA streams and asynchronous execution let kernel execution, memory transfers, and host work proceed concurrently, hiding latency and improving GPU utilization through fine-grained task scheduling and pipelining.

Stream Concept and Execution Model

- CUDA Stream: Ordered queue of GPU operations (kernels, memory transfers, callbacks). Operations within a stream execute sequentially; operations in different streams may execute concurrently.
- Default Stream (Stream 0): Operations are enqueued to the default stream when none is specified. The legacy default stream synchronizes with all other blocking streams (an implicit barrier), so avoid it when concurrency is the goal. Streams created with cudaStreamNonBlocking, or code built with --default-stream per-thread, opt out of this behavior.
- Non-Default Streams: Explicitly created streams execute independently of the default stream and of each other. Kernels in different streams may run in parallel if the hardware permits (see the sketch after this list).
- Hardware Concurrency Limits: Streams map onto a limited pool of hardware work queues (32 with Hyper-Q since Kepler; the driver defaults to 8 connections, tunable via CUDA_DEVICE_MAX_CONNECTIONS). Actual concurrency also depends on SM count, kernel complexity, and resource contention.
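
A minimal sketch of launching independent work into non-default streams; the `scale` kernel, stream count, and buffer sizes are illustrative, not part of any real API:

```cuda
#include <cuda_runtime.h>

// Illustrative lightweight kernel; each stream gets its own slice of work.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4, n = 1 << 20;
    float* d_buf[kStreams];
    cudaStream_t streams[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);        // non-default stream
        cudaMalloc(&d_buf[s], n * sizeof(float));
    }

    // Kernels in different non-default streams may run concurrently,
    // subject to SM availability and per-kernel resource usage.
    for (int s = 0; s < kStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n, 2.0f);

    cudaDeviceSynchronize();                  // wait for all streams to drain

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```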

CUDA Events and Synchronization Primitives

- Event Creation: cudaEventCreate() allocates an event. Events capture GPU timestamps, enabling precise device-side measurements (create with cudaEventDisableTiming when timing is not needed, to reduce overhead).
- Event Recording: cudaEventRecord(event, stream) inserts the event into a stream; the GPU records its timestamp when execution reaches that point in the stream.
- Event Query: cudaEventQuery(event) checks without blocking whether the event has completed (it returns cudaErrorNotReady otherwise); cudaEventSynchronize(event) blocks the host until it has.
- Elapsed Time Measurement: cudaEventElapsedTime(&ms, start, stop) returns the time between two recorded events in milliseconds. This is more accurate than host-side timing, which also captures launch overhead on asynchronous calls (see the sketch after this list).
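
A minimal timing sketch using events; the `busy` kernel and problem size are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel whose duration we want to measure.
__global__ void busy(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i] + 1.0f;
}

int main() {
    const int n = 1 << 22;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);        // note: cudaEventCreate, not cudaCreateEvent
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);      // timestamped when the stream reaches this point
    busy<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // block the host until 'stop' has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```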

Multi-Stream Concurrency and Its Limits

- Maximum Concurrent Kernels: A GPU can execute multiple kernels simultaneously when resources permit; recent architectures (Volta and later) allow up to 128 resident grids per device, with the practical count limited by register and shared-memory usage.
- Occupancy Trade-off: Multiple lightweight kernels co-execute when each uses few registers; heavy kernels (high register or shared-memory footprint) reduce how many fit concurrently (the occupancy sketch after this list shows how to estimate headroom).
- Load Balancing: Kernel scheduling across SMs is dynamic: the GPU scheduler dispatches thread blocks to SMs as they free up, so new kernels fill in as earlier ones drain.
- Resource Contention: L2 cache and DRAM bandwidth are shared device-wide, and the register file and shared memory are shared among blocks co-resident on an SM; kernels with large memory footprints increase latency for lighter concurrent kernels.
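
A sketch of estimating per-SM headroom with the runtime occupancy API; the `light` kernel and block size are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative low-register-pressure kernel.
__global__ void light(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    int blocksPerSm = 0;
    const int blockSize = 256;
    // How many blocks of 'light' can be resident on one SM at once;
    // more headroom per SM leaves more room for other kernels to co-execute.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, light,
                                                  blockSize, /*dynamicSmem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d resident blocks/SM x %d SMs = %d blocks device-wide\n",
           blocksPerSm, prop.multiProcessorCount,
           blocksPerSm * prop.multiProcessorCount);
    return 0;
}
```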

Asynchronous Memory Copy with Compute Overlap

- cudaMemcpyAsync: Non-blocking transfer; the call returns to the host as soon as the copy is enqueued. Kernels in other streams can progress concurrently with the copy.
- DMA Engine Limitations: Most GPUs expose one host-to-device and one device-to-host copy engine (query asyncEngineCount), so one H2D and one D2H transfer can proceed simultaneously, but multiple H2D transfers serialize.
- Pinned vs Pageable Memory: Pinned (page-locked) host memory is required for truly asynchronous transfers; with pageable memory the driver stages through an internal pinned buffer and the copy becomes effectively synchronous (see the sketch after this list).
- PCIe Bandwidth: PCIe 3.0 x16 offers ~16 GB/s of theoretical bandwidth per direction (the link is full duplex); protocol overhead limits practice to roughly 12-13 GB/s per direction.
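
A sketch of overlapping an H2D copy with an unrelated kernel, assuming pinned host memory and two non-default streams; the `process` kernel, buffer names, and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel operating on a buffer unrelated to the transfer.
__global__ void process(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_a, *d_a, *d_b;
    cudaMallocHost(&h_a, bytes);  // pinned host memory: required for a truly async copy
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);      // contents are irrelevant for this demo

    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    // The H2D transfer uses a copy engine while the kernel uses the SMs,
    // so the two proceed concurrently in their separate streams.
    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, copyStream);
    process<<<(n + 255) / 256, 256, 0, computeStream>>>(d_b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
    cudaFreeHost(h_a);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```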

Overlap Efficiency

- Perfect Overlap Conditions: A 10 ms kernel overlapped with a simultaneous 10 ms memcpy takes 10 ms total; the pipelined time is max(T_compute, T_copy), so full overlap requires the two durations to match.
- Bottleneck Analysis: If the copy is faster than the kernel (e.g., 5 ms memcpy vs. 10 ms kernel), the copy engine idles 5 ms per iteration and the kernel is the bottleneck; giving the idle engine extra work hides the gap.
- Pipelining: Stage K computes while stage K-1 copies its output to the host and stage K+1 uploads its input; enough in-flight stages keep the GPU saturated (see the sketch after this list).
- Profiler Visualization: Nsight Systems shows a timeline of kernel, memcpy, and host activities; rows displayed side by side make overlap, or its absence, immediately visible.
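
A sketch of the chunked pipeline described above, assuming pinned host buffers and chunks round-robined over a few streams; the `stage` kernel, chunk count, and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

// Illustrative per-chunk kernel.
__global__ void stage(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int kStreams = 4, kChunks = 16, chunk = 1 << 18;
    const size_t total = (size_t)kChunks * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  total * sizeof(float));  // pinned, so copies stay async
    cudaMallocHost(&h_out, total * sizeof(float));
    cudaMalloc(&d_in,  total * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Round-robin chunks over streams: while chunk k computes, chunk k+1's
    // upload and chunk k-1's download proceed on the copy engines.
    for (int c = 0; c < kChunks; ++c) {
        cudaStream_t s = streams[c % kStreams];
        size_t off = (size_t)c * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, s);
        stage<<<(chunk + 255) / 256, 256, 0, s>>>(d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```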

Stream Priority and Quality of Service

- Stream Priority: cudaStreamCreateWithPriority() assigns a priority within the range reported by cudaDeviceGetStreamPriorityRange(); numerically lower values mean higher priority, and many devices expose only a few levels. The GPU scheduler prefers pending work from high-priority streams.
- Priority Effectiveness: Priorities influence which pending thread blocks are dispatched next but do not preempt blocks already running, so a saturating low-priority kernel can still delay high-priority work.
- Use Cases: Latency-critical kernels (e.g., audio processing) are prioritized over background kernels (e.g., profiling); real-time applications leverage priorities (see the sketch after this list).
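
A sketch of creating prioritized streams; the range query matters because valid values differ per device:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int leastPrio, greatestPrio;
    // Returns the device's valid range; numerically LOWER = HIGHER priority.
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);
    printf("priority range: least=%d greatest=%d\n", leastPrio, greatestPrio);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamNonBlocking, greatestPrio);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamNonBlocking, leastPrio);

    // Enqueue latency-critical work into highPrio and background work
    // into lowPrio; pending blocks from highPrio are dispatched first.

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}
```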

Best Practices for Hiding PCIe Latency

- Batch Transfers: Coalesce many small transfers into fewer large ones; per-transfer launch latency is amortized, and one large cudaMemcpyAsync() beats many tiny ones.
- Pageable Memory Workaround: If data lives in pageable memory, copy it into a pinned staging buffer and issue the async transfer from there, or pin it in place with cudaHostRegister(); slower than a native pinned allocation but keeps transfers asynchronous (see the sketch after this list).
- Bidirectional Pipeline: Overlap H2D (input data), compute (processing), and D2H (output data) in a three-stage pipeline to maintain GPU utilization across phases.
- Persistent Kernels: Long-running kernels with internal loops reduce launch overhead: a single kernel iterates over work internally instead of paying launch latency for every stage.
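
A sketch of the pinned staging workaround, assuming the source data lives in an ordinary pageable std::vector; names and sizes are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

int main() {
    const size_t n = 1 << 20, bytes = n * sizeof(float);
    std::vector<float> pageable(n, 1.0f);  // ordinary pageable allocation

    float *staging, *d_buf;
    cudaMallocHost(&staging, bytes);       // pinned bounce buffer
    cudaMalloc(&d_buf, bytes);

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Host-side copy into the pinned buffer, then a truly asynchronous
    // H2D transfer that can overlap with other streams' work.
    std::memcpy(staging, pageable.data(), bytes);
    cudaMemcpyAsync(d_buf, staging, bytes, cudaMemcpyHostToDevice, s);

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(staging);
    return 0;
}
```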
