Parallel Debugging and Correctness tools enable systematic identification and fixing of concurrency bugs (race conditions, deadlocks, synchronization errors) that are notoriously difficult to reproduce and diagnose in multi-threaded and GPU applications.
CUDA-GDB Debugger for GPU Code
- CUDA-GDB: Integrated debugging environment for CUDA applications. Debugs both host (C/C++) and device (CUDA kernel) code simultaneously.
- Breakpoint Setting: Set breakpoints on host or kernel code. Kernel breakpoints trigger per-thread or per-warp (all threads in warp break together).
- Variable Inspection: Inspect host variables (standard gdb) and device variables (kernel local variables, shared memory, global memory).
- Thread Navigation: Switch between host threads and kernel threads. Query thread registers, memory contents, execution state.
CUDA-GDB Capabilities and Limitations
- Single-Stepping: Step through kernel instructions at warp granularity, not per individual thread. All active threads in the warp advance together (lockstep execution).
- Conditional Breakpoints: Break only when a condition holds, e.g. threadIdx.x == 5 && blockIdx.x == 0. Enables targeted debugging of specific GPU threads.
- Print/Watch: Monitor variable changes (memory access patterns). Track memory writes, identify corruption sources.
- Performance Impact: Debugging 10-100x slower than normal execution. Suitable for small inputs, quick turnaround debugging.
Compute Sanitizer (formerly cuda-memcheck)
- Compute Sanitizer: Runtime memory-debugging tool suite, successor to the deprecated cuda-memcheck. Detects out-of-bounds accesses, uninitialized reads, memory leaks.
- Memcheck Detector: Instruments kernels to track memory accesses. Every load/store checked against allocated memory ranges.
- False Positive Filtering: Shared memory aliasing (intentional reuse of the same buffer across phases) can trigger false positives. Known-safe patterns can be suppressed via configuration.
- Overhead: Instrumentation adds 5-50x slowdown. Suitable for correctness validation, not performance profiling.
Race Condition and Synchronization Detectors
- Racecheck: Detects data races (concurrent accesses to the same memory location without synchronization). Works by dynamic analysis: kernels are instrumented and hazards are reported at runtime.
- Race Pattern: Two threads access same memory location, at least one write, without synchronization (barrier, atomic). Pattern flagged as race.
- Shared Memory Races: Racecheck targets shared memory hazards, the common case in GPU computing. It does not track global memory, where concurrent access is often intentional and mediated by atomics.
- False Positives: Properly synchronized code with complex synchronization patterns may trigger false alarms. Expert review necessary.
Initcheck and Other Detectors
- Initcheck: Detects reads of uninitialized device global memory. Tracks which locations have been written; reads of unwritten locations are flagged.
- Synccheck: Detects invalid use of synchronization primitives, e.g. __syncthreads() inside a divergent conditional so that only part of the block reaches the barrier. Such misuse is undefined behavior, a correctness bug rather than just a performance issue.
- Tool Selection: Compute Sanitizer runs one detector per pass, selected with --tool (memcheck, racecheck, initcheck, synccheck). Results are reported with source-line mapping when the binary is built with line information.
Intel Inspector for CPU Parallelism
- Inspector XE: Detects data races, memory corruption, memory leaks in OpenMP/pthreads applications.
- Synchronization Analysis: Tracks locks, barriers, semaphores. Identifies missing synchronization (race conditions), deadlocks.
- Memory Tracking: Similar to cuda-memcheck. Monitors memory allocation, deallocation, accesses.
- Lightweight vs Detailed: Light collection (minimal overhead, less info) for production; detailed collection for debugging (significant overhead).
Valgrind Helgrind for Multi-threaded Debugging
- Helgrind Tool: Valgrind's thread-error detector for multi-threaded C/C++ programs (a sibling tool of Memcheck). Detects races, synchronization issues via dynamic binary instrumentation.
- Happens-Before Graph: Constructs synchronization graph. Race = two accesses violating happens-before relation (no synchronization path between them).
- False Positive Rate: Conservative analysis yields a significant false-positive rate (many reported races are benign). Manual verification of detected races required.
- Overhead: 100-500x slowdown. Practical only for small test cases.
Parallel Correctness Workflows
- Regression Testing: Correctness tests run with multiple thread counts (2, 4, 8, etc.). Race conditions more likely with higher thread counts (higher contention).
- Stress Testing: High contention artificially induced (tight loops, memory pressure). Amplifies race conditions, makes reproduction easier.
- Determinism: Parallel programs are inherently non-deterministic (thread scheduling varies from run to run). Record-and-replay systems capture the execution schedule, enable deterministic replay for debugging.
- Symbol Debugging: Build with debug symbols (-g compiler flag). Tools correlate memory addresses with source lines, enable source-level debugging.
Deadlock Detection and Avoidance
- Deadlock Conditions: Circular wait (Thread A holds lock L1 waiting for L2; Thread B holds L2 waiting for L1). All four Coffman conditions must hold: mutual exclusion, hold-and-wait, no preemption, circular wait.
- Static Analysis: Code analysis identifying potential deadlock patterns (lock acquisition order violations).
- Dynamic Detection: Runtime monitoring of lock wait-for graph. Cycle detection → deadlock alert.
- Prevention Strategies: Enforce a global lock ordering (always acquire A before B before C), which breaks circular wait. Timed locks (timeout instead of indefinite wait) allow recovery when a deadlock forms.