GPU Stream and Event Synchronization is the CUDA programming model for managing concurrent operations on the GPU by organizing kernels and memory transfers into streams (ordered sequences of operations) and using events to synchronize between them – effective stream usage enables overlapping computation with data transfer, concurrent kernel execution, and precise timing measurements.
CUDA Stream Fundamentals:
- Stream Definition: a stream is a sequence of GPU operations that execute in order – operations in different streams may execute concurrently if the GPU has available resources
- Default Stream: operations without an explicit stream use stream 0 (the default/legacy stream) – the default stream implicitly synchronizes with all other blocking streams unless the code is compiled with --default-stream per-thread
- Stream Creation: cudaStreamCreate(&stream) creates a new stream that still serializes with the legacy default stream – cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) creates a non-blocking stream that doesn't synchronize with the default stream
- Stream Destruction: cudaStreamDestroy(stream) releases stream resources – any pending operations in the stream complete before the resources are freed
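A minimal sketch of this lifecycle (the `scale` kernel, launch configuration, and array size are illustrative, not from the source):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    // cudaStreamNonBlocking: does not synchronize with the legacy default stream
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 2.0f, n);

    cudaStreamSynchronize(stream);  // block the host until the stream drains
    cudaStreamDestroy(stream);      // pending work would complete first anyway
    cudaFree(d_x);
    return 0;
}
```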
Overlapping Computation and Transfer:
- Concurrent Copy and Execute: launch memory copies (cudaMemcpyAsync) on one stream and kernels on another – the GPU's copy engines (DMA) and compute engines (SMs) operate independently
- Double Buffering Pattern: alternate between two buffers – while the GPU computes on buffer A, transfer results from the previous iteration and load input for the next iteration using buffer B
- Triple Buffering: three buffers provide maximum overlap – one being computed, one being uploaded, one being downloaded – hides both upload and download latency simultaneously
- Pinned Memory Requirement: cudaMemcpyAsync requires pinned (page-locked) host memory allocated with cudaMallocHost (or cudaHostAlloc) to be truly asynchronous – with pageable memory the copy is staged and blocks the host regardless of stream assignment
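The double-buffering pattern might look like the following sketch (the chunk count, sizes, and the `process` kernel are illustrative assumptions):

```cuda
// Two device buffers and two streams: while one stream computes chunk i,
// the other stream transfers chunk i+1, and vice versa.
const int kChunks = 8, kElems = 1 << 20;
float *h_data;
cudaMallocHost(&h_data, kChunks * kElems * sizeof(float));  // pinned: required for async copies

float *d_buf[2];
cudaStream_t s[2];
for (int b = 0; b < 2; ++b) {
    cudaMalloc(&d_buf[b], kElems * sizeof(float));
    cudaStreamCreate(&s[b]);
}

for (int c = 0; c < kChunks; ++c) {
    int b = c % 2;  // alternate buffer/stream pairs
    float *h_chunk = h_data + (size_t)c * kElems;
    cudaMemcpyAsync(d_buf[b], h_chunk, kElems * sizeof(float),
                    cudaMemcpyHostToDevice, s[b]);
    process<<<(kElems + 255) / 256, 256, 0, s[b]>>>(d_buf[b], kElems);
    cudaMemcpyAsync(h_chunk, d_buf[b], kElems * sizeof(float),
                    cudaMemcpyDeviceToHost, s[b]);
}
cudaDeviceSynchronize();  // wait for every chunk before touching h_data
```

Operations issued to the same stream still run in order, so each chunk's upload, kernel, and download stay correctly sequenced while the two streams' chunks overlap.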
Multi-Stream Concurrency:
- Concurrent Kernels: multiple small kernels on different streams can execute simultaneously if the GPU has enough free SM resources – an Ampere A100 supports up to 128 concurrent kernels
- Resource Partitioning: concurrent kernels share SM resources – thread blocks from different kernels are scheduled onto whichever SMs have free registers, shared memory, and warp slots, so a small-grid kernel can fill SMs another kernel leaves idle
- Stream Priority: cudaStreamCreateWithPriority(&stream, flags, priority) assigns a scheduling priority – pending blocks from high-priority streams are scheduled ahead of low-priority work as resources free up at thread block boundaries (blocks already running are not preempted)
- Breadth-First Launch: launch all streams' operations in interleaved order (kernel1_stream1, kernel1_stream2, ...) rather than depth-first (all of stream1, then all of stream2) – ensures the GPU sees concurrent work from all streams
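Breadth-first ordering can be sketched as follows (`stage1`/`stage2` are assumed kernels and the buffers and launch parameters are illustrative):

```cuda
const int kStreams = 4;
cudaStream_t s[kStreams];
for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&s[i]);

// Breadth-first: interleave launches so every stream has work immediately.
for (int i = 0; i < kStreams; ++i)
    stage1<<<32, 256, 0, s[i]>>>(d_in[i], d_tmp[i]);
for (int i = 0; i < kStreams; ++i)
    stage2<<<32, 256, 0, s[i]>>>(d_tmp[i], d_out[i]);

// Depth-first (avoid): issuing all of stream 0's work, then all of stream 1's,
// can delay the point at which the scheduler sees concurrent work.
```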
CUDA Events:
- Event Creation: cudaEventCreate(&event) creates an event marker – events are recorded into streams and can be queried or waited on from the host or other streams
- Recording: cudaEventRecord(event, stream) inserts the event into the stream's operation queue – the event is "completed" when all preceding operations in that stream finish
- Host Synchronization: cudaEventSynchronize(event) blocks the host thread until the event completes – more fine-grained than cudaStreamSynchronize, which waits for all stream operations
- Query: cudaEventQuery(event) returns cudaSuccess if the event is complete or cudaErrorNotReady if still pending – enables non-blocking polling from the host
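A non-blocking polling loop built from these calls might look like this (the `worker` kernel, `stream`, and `do_host_work` are illustrative assumptions):

```cuda
cudaEvent_t done;
cudaEventCreate(&done);

worker<<<32, 256, 0, stream>>>(d_data);
cudaEventRecord(done, stream);  // completes once worker (and prior work) finishes

// Non-blocking poll: overlap host-side work with the GPU.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    do_host_work();  // assumed host-side function
}
cudaEventDestroy(done);
```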
Inter-Stream Synchronization:
- cudaStreamWaitEvent: cudaStreamWaitEvent(stream, event, 0) makes all subsequent operations in stream wait until event is complete – establishes a dependency between two streams without blocking the host
- Use Case: launch data preparation on stream 1, record an event, then make stream 2 (which runs the compute kernel) wait on that event – ensures data is ready before computation begins
- Fork-Join Pattern: launch multiple independent kernels on separate streams, create events for each, and have a final stream wait on all events before launching the reduction kernel – maximizes concurrency while maintaining correctness
- Graph Capture: cudaStreamBeginCapture captures a sequence of stream operations into a CUDA Graph – replaying the graph eliminates per-launch overhead (5-10 µs per kernel launch saved)
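The fork-join pattern can be sketched as follows (the `produce`/`reduce` kernels and device buffers are illustrative assumptions):

```cuda
// Two producer streams feed one consumer stream; the host never blocks.
cudaStream_t prod[2], consumer;
cudaEvent_t ready[2];
cudaStreamCreate(&consumer);
for (int i = 0; i < 2; ++i) {
    cudaStreamCreate(&prod[i]);
    // cudaEventDisableTiming: sync-only events are cheaper to record
    cudaEventCreateWithFlags(&ready[i], cudaEventDisableTiming);
    produce<<<32, 256, 0, prod[i]>>>(d_part[i]);
    cudaEventRecord(ready[i], prod[i]);
    cudaStreamWaitEvent(consumer, ready[i], 0);  // dependency enforced on the GPU
}
reduce<<<1, 256, 0, consumer>>>(d_part[0], d_part[1], d_out);
```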
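Graph capture and replay can be sketched as follows (`step1`/`step2` are assumed kernels; the iteration count is illustrative):

```cuda
cudaGraph_t graph;
cudaGraphExec_t graphExec;

// Capture the stream's launch sequence once...
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
step1<<<32, 256, 0, stream>>>(d_a);
step2<<<32, 256, 0, stream>>>(d_a, d_b);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// ...then replay it with a single launch per iteration.
for (int iter = 0; iter < 1000; ++iter)
    cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```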
Timing with Events:
- Elapsed Time: cudaEventElapsedTime(&ms, start, stop) returns the time in milliseconds between two recorded events – GPU-side timing with ~0.5 µs resolution, unaffected by host-side scheduling
- Pattern: cudaEventRecord(start, stream) → launch kernel → cudaEventRecord(stop, stream) → cudaEventSynchronize(stop) → cudaEventElapsedTime(&ms, start, stop) – the standard GPU timing pattern
- Profiling: events provide accurate per-kernel timing without external profiler overhead – essential for performance optimization and regression detection in production code
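The standard timing pattern written out (the `kernel_under_test` name and launch configuration are illustrative):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
kernel_under_test<<<32, 256, 0, stream>>>(d_data);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);              // wait only for the stop event
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // milliseconds between the two records
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```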
Common Pitfalls:
- Default Stream Serialization: accidentally using the default stream for one operation serializes the entire GPU – always use explicit streams for concurrent work
- Insufficient Concurrency: launching many small kernels sequentially on one stream wastes GPU resources – distribute across multiple streams to utilize idle SMs
- Missing Synchronization: reading GPU results on the host without synchronization leads to undefined behavior – always synchronize (event, stream, or device) before accessing results
- Pinned Memory Exhaustion: allocating too much pinned memory with cudaMallocHost can degrade system performance – pinned memory can't be swapped, limiting available system memory
Stream and event management is the programmer's primary tool for maximizing GPU utilization – well-structured multi-stream applications achieve 90-95% GPU utilization by overlapping transfers with computation, hiding latency behind concurrent operations, and minimizing synchronization barriers.