CUDA Streams and Asynchronous Execution enable concurrent kernel launches, memory transfers, and host-device synchronization, hiding latencies and improving GPU utilization through fine-grained task scheduling and pipelining.
Stream Concept and Execution Model
- CUDA Stream: Ordered queue of GPU operations (kernels, memory transfers, callbacks). Operations within stream execute sequentially; operations in different streams may execute concurrently.
- Default Stream (Stream 0): All operations enqueued to default stream if not specified. Default stream synchronizes with all other streams (implicit barrier). Avoid for concurrent execution.
- Non-Default Streams: Streams 1, 2, 3,... execute independent of default and each other. Multiple kernels in different streams execute in parallel (if hardware permits).
- Hardware Concurrency Limits: Modern GPUs (Volta+) support 32-128 concurrent streams. Depends on SM count, kernel complexity, resource contention.
CUDA Events and Synchronization Primitives
- cudaEvent Creation: cudaCreateEvent() allocates event. Event timestamps GPU wall-clock time for precise measurements.
- Event Recording: cudaEventRecord(event, stream) inserts event into stream. GPU records timestamp when reaching that point in stream.
- Event Query: cudaEventQuery(event) checks if event reached (non-blocking). cudaEventSynchronize(event) blocks host until event reached.
- Elapsed Time Measurement: (event_end_time - event_start_time) gives kernel duration. More accurate than host timing due to GPU pipeline effects.
Multi-Stream Concurrency and Concurrency Limitations
- Maximum Concurrent Kernels: SM (Streaming Multiprocessor) can execute multiple kernels simultaneously if occupancy permits. Typically 8-32 concurrent kernels depending on register/shared memory usage.
- Occupancy Trade-off: Multiple lightweight kernels co-execute, each using fewer registers. Heavy kernels (high register count) limit concurrent kernel count.
- Load Balancing: Kernel scheduling across SMs dynamic. GPU scheduler assigns kernels to free SMs, migrates work as kernels complete.
- Resource Contention: L2 cache, register file, shared memory shared across all kernels. High-memory-footprint kernels increase latency for light concurrent kernels.
Asynchronous Memory Copy with Compute Overlap
- cudaMemcpyAsync: Non-blocking memory transfer returns immediately after DMA queued. Compute kernels in other streams progress concurrently with memcpy.
- DMA Engine Limitations: Most GPUs support 1 host-to-device and 1 device-to-host DMA concurrently (bidirectional, but only 1 H2D and 1 D2H). Cannot overlap multiple H2D transfers.
- Pinned vs Pageable Memory: Pinned (DMA-able) host memory required for asynchronous transfers. Pageable memory requires intermediate staging → performance loss.
- PCIe Bandwidth: 16x PCIe 3.0 = 16 GB/s. Theoretical bidirectional = 8 GB/s each direction (practical: 12-14 GB/s unidirectional due to protocol overhead).
Overlap Efficiency
- Perfect Overlap Conditions: Kernel computation (10ms) + simultaneous memcpy (10ms) = 10ms total (no additional delay). Requires computation and memcpy duration matched.
- Bottleneck Analysis: If memcpy faster than kernel (e.g., 5ms memcpy, 10ms kernel), GPU idles 5ms before next memcpy. Padding with extra work hides idle.
- Pipelining: Stage K computes while Stage K-1 copies output to host. Multiple overlapping stages maintain GPU saturation.
- Profiler Visualization: Nsight Systems shows timeline of kernel, memcpy, host activities. Overlapping activities side-by-side visualize concurrency.
Stream Priority and Quality of Service
- Stream Priority: cudaStreamCreateWithPriority() assigns priority (1-32, higher = higher priority). GPU scheduler prefers high-priority streams.
- Priority Effectiveness: Only works if GPU has spare resources (not fully saturated). Under full load, priority irrelevant.
- Use Cases: Critical kernels (audio processing) prioritized over background kernels (profiling). Real-time applications leverage priorities.
Best Practices for Hiding PCIe Latency
- Batch Transfers: Multiple cudaMemcpyAsync() calls in rapid succession (all within same stream) amortize PCIe latency.
- Unpinned Memory Workaround: If pinned memory unavailable, use temporary pinned buffer, copy into unpinned (slower but functional).
- Bidirectional Pipeline: Overlap H2D (input data), compute (processing), D2H (output data) in 3-stage pipeline. Maintains GPU utilization across phases.
- Persistent Kernels: Long-running kernels with internal loops reduce kernel launch overhead. Single kernel overlaps internal stages vs multiple kernel launches.
cuda streams asynchronous gpucuda event synchronizationmulti stream overlapasync memcpy compute overlapstream priority cuda
Related Topics
Explore 500+ Semiconductor & AI Topics
From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.