OpenMP Parallel Programming provides a pragmatic, standards-based API for shared-memory parallelism using directives, enabling rapid parallel code development without explicit thread management.
Fork-Join Model and Pragma Syntax
- OpenMP Execution Model: Main thread creates team of worker threads at parallel regions. Workers execute concurrently, rejoin at implicit barrier.
- Pragma Syntax: #pragma omp directives inserted before loops/code blocks. The compiler (not the preprocessor) interprets the pragmas and generates the threading code; a compiler without OpenMP support ignores them, so the code still builds serially.
- Region Definition: #pragma omp parallel creates team. Implicit barrier at end (threads wait for all to complete before proceeding).
- Multiple Region Types: parallel, parallel for, parallel sections (and parallel workshare in Fortran). Each combined construct pairs team creation with a work-distribution scheme; see the sketch below for the basic region.
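A minimal sketch of the fork-join model in C (thread count and output order depend on the runtime; compile with `-fopenmp` or the compiler's equivalent):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    // Fork: the initial thread creates a team of worker threads.
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("hello from thread %d of %d\n", tid, omp_get_num_threads());
    }   // Join: implicit barrier; all threads finish before execution continues.
    printf("back to a single thread\n");
    return 0;
}
```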
Parallel For Loops and Work Distribution
- #pragma omp parallel for: Divides loop iterations across threads. Implicit team creation + loop distribution + implicit barrier.
- Static Scheduling: Iterations 0..N-1 divided into contiguous chunks assigned to threads round-robin when the loop starts; the assignment is deterministic, not negotiated during execution. Good for balanced loops, poor when iteration costs vary.
- Dynamic Scheduling: Chunks grabbed by threads as they finish previous chunks. Good for imbalanced loops (iterations vary in time), higher overhead.
- Guided Scheduling: Chunk size decreases as loop progresses. Reduces overhead vs full dynamic while maintaining load balance.
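A sketch of schedule selection on a hypothetical triangular workload, where iteration i does O(i) work (the function name and chunk size of 64 are illustrative assumptions). Swapping in schedule(static) or schedule(guided, 64) changes only the clause:

```c
#include <math.h>

#define N 100000

// Iteration i sums i+1 terms, so late iterations cost far more than
// early ones; a dynamic schedule rebalances this across threads.
void triangular_sums(double *out) {
    #pragma omp parallel for schedule(dynamic, 64)  // 64-iteration chunks
    for (int i = 0; i < N; i++) {
        double acc = 0.0;
        for (int j = 0; j <= i; j++)
            acc += sin((double)j);
        out[i] = acc;
    }
}
```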
Reduction and Shared/Private Variable Clauses
- Reduction Clause: #pragma omp parallel for reduction(+:sum) accumulates partial sums from threads into global sum. Prevents race conditions.
- Supported Operators: +, -, *, &, |, ^, &&, || for arithmetic/logical types; min, max. (Division is not a reduction operator.) Custom reductions via declare reduction (OpenMP 4.0+).
- Shared Clause: Variables marked shared are accessible to all threads (synchronization required for concurrent writes). Variables declared outside the region, including globals, are shared by default.
- Private Clause: Each thread gets an independent, uninitialized copy (use firstprivate to copy in the original value). Only the iteration variable of a worksharing loop is private by default.
- Critical Section: #pragma omp critical serializes updates (only one thread enters at a time). Simpler than managing explicit locks, but the protected code runs serially, so keep it short.
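A short reduction example: each thread keeps a private partial sum that OpenMP combines into the shared variable at the end of the loop, avoiding both the race of a naive shared update and the serialization of a critical section:

```c
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    // reduction(+:sum) gives each thread a private accumulator and
    // adds the partial results into the shared `sum` after the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / (double)i;   // partial sum of the harmonic series
    printf("H(1e6) ~= %f\n", sum);
    return 0;
}
```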
Task Parallelism (OpenMP 3.0+)
- omp task Directive: Generates a task for deferred, asynchronous execution. The encountering thread enqueues the task; any thread in the team may execute it when available.
- Recursive Parallelism: Quicksort, tree traversal naturally expressed via tasks. Each task spawns subtasks, creating dynamic task tree.
- Task Dependencies: #pragma omp task depend(in:A) depend(out:B) specifies data dependencies (OpenMP 4.0+). The runtime scheduler orders tasks to respect them while otherwise executing asynchronously.
- Taskgroup: #pragma omp taskgroup waits at the end of the region for all tasks spawned inside it, including descendants, to complete (taskwait waits only for direct children).
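A sketch of recursive task parallelism with naive Fibonacci; the sequential cutoff of 20 is a tuning assumption to avoid swamping the runtime with tiny tasks, and taskwait is used here to join the two direct child tasks:

```c
#include <stdio.h>

long fib(int n) {
    if (n < 20)                       // sequential cutoff (assumed tuning)
        return n < 2 ? n : fib(n - 1) + fib(n - 2);
    long x, y;
    #pragma omp task shared(x)        // spawn subtask for fib(n-1)
    x = fib(n - 1);
    #pragma omp task shared(y)        // spawn subtask for fib(n-2)
    y = fib(n - 2);
    #pragma omp taskwait              // wait for both children
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single                // one thread spawns the root of the task tree
    result = fib(35);
    printf("fib(35) = %ld\n", result);
    return 0;
}
```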
SIMD Vectorization Directives
- #pragma omp simd: Instructs the compiler to vectorize the loop for SIMD units (AVX-512, NEON, etc.). Compiler generates vector instructions for supported data types.
- Vector Length Control: #pragma omp simd simdlen(16) hints the preferred number of SIMD lanes; the compiler chooses the widest supported width consistent with the hint.
- Collapse: #pragma omp simd collapse(2) enables vectorization across nested loops. Collapses 2D loop into 1D for better vectorization.
- Reduction + SIMD: #pragma omp simd reduction(+:sum) combines vectorization with reduction: partial sums accumulate in vector lanes and are combined after the loop, as in the sketch below.
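A minimal SIMD reduction sketch, a dot product: the pragma tells the compiler the loop is safe to vectorize even though `sum` carries a dependence, because the reduction clause licenses lane-wise partial sums:

```c
#include <stddef.h>

// Dot product with an explicit SIMD hint; partial sums live in
// vector lanes and are horizontally added after the loop.
double dot(const double *a, const double *b, size_t n) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```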
Nested Parallelism
- Nested Parallel Regions: Inner parallel regions create additional levels of threads. Nesting must be enabled (omp_set_max_active_levels() or OMP_MAX_ACTIVE_LEVELS) and is bounded by implementation limits; 2-3 levels is a practical ceiling.
- omp_get_level(): Query current nesting depth. omp_get_ancestor_thread_num(level) identifies ancestor threads in the hierarchy.
- Performance Considerations: Excessive nesting oversubscribes cores and multiplies synchronization and scheduling overhead. Typically avoid more than 2 levels; see the sketch below.
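A small sketch of enabling and inspecting nested regions (the num_threads values are illustrative; real codes usually size teams to the machine topology):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_max_active_levels(2);     // allow two active nesting levels
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            // Level 2: report this thread and its level-1 ancestor.
            printf("level %d, thread %d, ancestor at level 1: %d\n",
                   omp_get_level(), omp_get_thread_num(),
                   omp_get_ancestor_thread_num(1));
        }
    }
    return 0;
}
```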
Target Offloading to GPUs (OpenMP 4.0+, extended in 4.5/5.0)
- #pragma omp target: Offload computation to GPU. Similar to CUDA but uses OpenMP syntax.
- Target Data: #pragma omp target data map(to:A[0:N]) specifies data transfer (host to device) for an enclosing region, avoiding repeated transfers across multiple kernels.
- Parallel Teams: #pragma omp target teams distribute parallel for combines multiple levels of parallelism (a league of teams, each with multiple threads, analogous to GPU blocks and threads).
- GPU Kernels: omp target regions compile to GPU kernels. NVIDIA/AMD/Intel compilers generate ISA-specific code.
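A sketch of a combined offload construct, a vector add; it assumes an offload-capable toolchain (e.g., clang with -fopenmp-targets, or a vendor compiler) and falls back to the host if no device is present:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Map inputs to the device, run the loop across teams of threads,
    // and map the result back to the host.
    #pragma omp target teams distribute parallel for \
            map(tofrom: a[0:N]) map(to: b[0:N])
    for (int i = 0; i < N; i++)
        a[i] += b[i];

    printf("a[0] = %f\n", a[0]);   // expect 3.0
    return 0;
}
```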
Real-World Applications and Performance
- Adoption: OpenMP is the de facto standard for shared-memory parallelism in scientific/HPC codes (Fortran, C/C++) and is commonly cited as appearing in a large majority of HPC applications.
- Performance Predictability: Static scheduling easier to profile/optimize; dynamic scheduling less predictable.
- Compiler Variability: Different compilers generate code of different quality; vendor compilers (e.g., Intel's) have historically outperformed GCC/Clang on OpenMP-heavy code, so benchmark before committing.
- Hybrid Paradigms: MPI (distributed memory across nodes) + OpenMP (shared memory within a node) is the dominant model in HPC, scaling to hundreds or thousands of cores across clusters, as in the sketch below.
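A minimal hybrid sketch, assuming an MPI installation (build with `mpicc -fopenmp`, launch with `mpirun`); the loop body is a placeholder for real per-node work:

```c
#include <mpi.h>
#include <stdio.h>

// Hybrid pattern: one MPI rank per node, one OpenMP team per rank.
// MPI_THREAD_FUNNELED suffices because only the main thread calls MPI.
int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   // intra-node parallelism
    for (int i = 0; i < 1000000; i++)
        local += 1.0;                             // placeholder workload

    double global = 0.0;                          // inter-node reduction
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```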