Advanced MPI Communication encompasses sophisticated messaging primitives beyond basic send/receive, including persistent requests for reduced overhead, one-sided remote-memory-access patterns, and specialized datatype handling for irregular communication.
MPI Persistent Requests
- Persistent Send/Recv: Pre-allocate a send/recv request (MPI_Send_init, MPI_Recv_init) that binds the call parameters (buffer, count, datatype, dest/source, tag, communicator). The request is then reused across iterations of tight loops.
- Performance Benefit: Request setup and argument-processing overhead is amortized across many reuses. The gain is largest for small, latency-bound messages where per-call setup dominates total cost; large, bandwidth-limited transfers see little difference.
- Usage Pattern: Start/complete cycle (MPI_Start, MPI_Wait). Multiple requests can be started (MPI_Startall) enabling pipelined communication.
- Compared to Non-Persistent: Each MPI_Isend/MPI_Irecv allocates and initializes a fresh request (a small cost that accumulates over many iterations). Persistent requests are typically ~5-10% faster in tight message loops; see the sketch below.
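A minimal sketch of the start/wait cycle, assuming exactly two ranks exchanging a fixed 1024-element buffer (buffer size, tag, and iteration count are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sendbuf[1024], recvbuf[1024];
    int peer = (rank == 0) ? 1 : 0;
    MPI_Request reqs[2];

    /* Set up the persistent requests once, outside the loop. */
    MPI_Send_init(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int iter = 0; iter < 100; ++iter) {
        /* ... fill sendbuf ... */
        MPI_Startall(2, reqs);                      /* start both transfers          */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete; requests stay reusable */
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}
```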
One-Sided Communication (Remote Memory Access, RMA)
- MPI Window Creation: MPI_Win_create(base, size, ...) registers memory region for RMA access. Other processes can read/write this window.
- RMA Operations: MPI_Put (write remote memory), MPI_Get (read remote memory), MPI_Accumulate (atomic operation on remote memory).
- Advantages: The origin process initiates the operation (Put/Get) without a matching call at the target, decoupling data movement from the target's execution. Completion is determined at the origin through window synchronization, enabling asynchronous communication.
- Use Cases: Producer-consumer, work-stealing, load-balancing algorithms naturally express via RMA.
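A minimal sketch of fence-synchronized RMA, assuming each rank exposes a single int and writes its rank into its right neighbor's window (the ring pattern is illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = 0;                                 /* memory exposed for RMA */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int value = rank;
    int target = (rank + 1) % size;

    MPI_Win_fence(0, win);                         /* open access epoch          */
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                         /* close epoch: puts are done */

    /* 'local' now holds the rank of the left neighbor. */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```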
MPI Window Synchronization Semantics
- Fence Synchronization: MPI_Win_fence() is a collective call over all processes of the window that separates RMA epochs; all RMA operations issued before the fence are complete when it returns.
- Post-Start-Complete-Wait (PSCW): More selective active-target synchronization. The target exposes its window with MPI_Win_post() and closes the exposure with MPI_Win_wait(); the origin opens an access epoch with MPI_Win_start() and closes it with MPI_Win_complete(). Only the processes that actually communicate synchronize.
- Lock Synchronization: MPI_Win_lock() acquires exclusive/shared lock on target process. MPI_Win_unlock() releases. Enables fine-grained mutual exclusion.
- Memory Model: Fence and PSCW are active-target synchronization (the target participates in the synchronization calls: the whole window group for fence, only the involved processes for PSCW). Lock/unlock is passive-target: the target makes no synchronization call at all, as in the sketch below.
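A minimal passive-target sketch, assuming at least two ranks: rank 0 atomically increments a counter that lives in rank 1's window while rank 1 makes no matching synchronization call (window memory is allocated with MPI_Win_allocate for this purpose):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *counter;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &counter, &win);
    *counter = 0;
    MPI_Barrier(MPI_COMM_WORLD);      /* make sure all counters are initialized */

    if (rank == 0) {
        int one = 1;
        /* Exclusive lock on rank 1's window, atomic add, then release. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Accumulate(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, MPI_SUM, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);      /* update visible before anyone reads it */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```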
Derived Datatypes and Communication of Non-Contiguous Data
- Basic and Contiguous Datatypes: Predefined types (MPI_FLOAT, MPI_INT, etc.) describe single elements; MPI_Type_contiguous(count, oldtype, &newtype) builds a type for a contiguous run of them.
- Vector Datatype: MPI_Type_vector(count, blocklen, stride, oldtype, &newtype) communicates evenly-spaced blocks. Example: a column of a row-major matrix (stride = number of columns per row), as in the sketch below.
- Indexed Datatype: MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, &newtype) allows arbitrary block lengths and displacements. Example: the nonzero entries of sparse matrix rows.
- Struct Datatype: MPI_Type_create_struct() combines multiple types with offsets. Example: structure containing integer + float fields.
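A minimal sketch using MPI_Type_vector to send one column of a small row-major matrix as a single message (matrix dimensions, the chosen column, and the two-rank pattern are illustrative):

```c
#include <mpi.h>

#define NROWS 4
#define NCOLS 5

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double a[NROWS][NCOLS];
    MPI_Datatype column;

    /* NROWS blocks of 1 element each, separated by a stride of NCOLS elements. */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0) {
        for (int i = 0; i < NROWS; ++i)
            for (int j = 0; j < NCOLS; ++j)
                a[i][j] = i * NCOLS + j;
        MPI_Send(&a[0][2], 1, column, 1, 0, MPI_COMM_WORLD);   /* column 2 */
    } else if (rank == 1) {
        double col[NROWS];                                      /* arrives contiguously */
        MPI_Recv(col, NROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```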
Derived Datatype Usage
- MPI_Type_commit(): Finalizes the datatype definition before it is used in communication. Committing lets the MPI implementation optimize its internal representation (e.g., detect contiguous regions to avoid packing).
- Packing Advantage: A derived datatype lets the MPI library pack/unpack non-contiguous memory itself (or stream it directly), so a single MPI call replaces hand-written pack loops or many small calls and reduces host-CPU overhead.
- Subarray Extraction: MPI_Type_create_subarray() describes a rectangular region of an N-dimensional array. Useful for domain decomposition, e.g. exchanging slabs or faces of a block-decomposed 3D domain (see the sketch below).
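A minimal sketch of MPI_Type_create_subarray describing an interior block of a 2-D local array; the extents and offsets are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int sizes[2]    = {6, 8};   /* full local array        */
    int subsizes[2] = {2, 3};   /* block to communicate    */
    int starts[2]   = {2, 4};   /* offset of the block     */
    MPI_Datatype block;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &block);
    MPI_Type_commit(&block);

    /* 'block' can now be used as the datatype argument of any send/recv,
       e.g. MPI_Send(array, 1, block, dest, tag, comm). */

    MPI_Type_free(&block);
    MPI_Finalize();
    return 0;
}
```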
Neighborhood Collectives (MPI 3.0+)
- MPI_Neighbor_allgather: Local gather from neighbors (defined by topology/graph). Replaces global allgather for sparse communication patterns.
- MPI_Neighbor_alltoall: Local all-to-all (each rank sends to all neighbors, receives from all). Efficient for stencil computations.
- Topology Definition: MPI_Dist_graph_create() defines custom neighbor topology (sparse directed graph). Enables application-specific communication patterns.
- Optimization Opportunity: Neighborhood collectives permit more aggressive optimization (fewer ranks participate, topology-aware routing).
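A minimal sketch combining a 1-D periodic Cartesian topology with MPI_Neighbor_allgather, so each rank collects one value from its left and right neighbors (the ring layout and payload are illustrative):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[1] = {size}, periods[1] = {1};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 1, dims, periods, 0, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);

    int mine = rank;
    int from_neighbors[2];   /* [0] = left neighbor, [1] = right neighbor */

    /* The neighbor set is implied by the topology attached to 'cart'. */
    MPI_Neighbor_allgather(&mine, 1, MPI_INT,
                           from_neighbors, 1, MPI_INT, cart);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```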
MPI-4 Features and Enhancements
- Persistent Collectives: MPI_Allreduce_init() similar to persistent send/recv. Pre-allocate collective request, reuse in loops.
- Partitioned Point-to-Point: MPI_Psend_init/MPI_Precv_init split one message buffer into partitions; each partition is flagged ready (MPI_Pready) as it is produced, e.g. by different threads, so transfer of early partitions overlaps with computation of later ones.
- Request-Based Collectives: Non-blocking collectives (MPI_Iallreduce, MPI_Ibcast, etc., standardized in MPI 3.0) return a request immediately, allowing collective communication to overlap with computation and successive collectives to be pipelined.
- Topology-Aware Mapping: Queries machine topology, maps ranks to optimize communication locality (reduce inter-socket/inter-switch traffic).
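A minimal sketch of an MPI-4 persistent collective, assuming an MPI-4 library is available; the iteration count and the reduced value are illustrative:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    double local_sum, global_sum;
    MPI_Request req;

    /* Set up the persistent allreduce once. */
    MPI_Allreduce_init(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, MPI_INFO_NULL, &req);

    for (int iter = 0; iter < 100; ++iter) {
        local_sum = (double)iter;           /* compute a local contribution     */
        MPI_Start(&req);                    /* launch this iteration's allreduce */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* complete; request stays reusable  */
    }

    MPI_Request_free(&req);
    MPI_Finalize();
    return 0;
}
```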
Real-World Optimization Strategies
- Double Buffering: Alternate between two buffers so communication and computation overlap: while one buffer is being computed on (e.g., on the GPU), the other is transferred or sent asynchronously.
- Batching: Collect multiple small messages, send single large message. Reduces overhead (fewer syscalls, network headers).
- Stencil Optimization: Halos (boundary rows/columns) are exchanged separately from the bulk data; interior points are computed while the halo exchange is in flight (see the sketch below).
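A minimal sketch of the interior/halo overlap using nonblocking point-to-point; compute_interior(), compute_boundary(), the row length N, and the buffer layout (left halo, left boundary row, right boundary row, right halo) are hypothetical placeholders:

```c
#include <mpi.h>

#define N 1024   /* elements per row (illustrative) */

static void compute_interior(double *u) { (void)u; /* stencil on interior points */ }
static void compute_boundary(double *u) { (void)u; /* stencil on boundary points */ }

void halo_step(double *u, int left, int right, MPI_Comm comm) {
    MPI_Request reqs[4];

    /* Post receives for incoming halos and sends of our own boundary rows. */
    MPI_Irecv(&u[0],     N, MPI_DOUBLE, left,  0, comm, &reqs[0]);  /* left halo   */
    MPI_Irecv(&u[3 * N], N, MPI_DOUBLE, right, 1, comm, &reqs[1]);  /* right halo  */
    MPI_Isend(&u[N],     N, MPI_DOUBLE, left,  1, comm, &reqs[2]);  /* to the left */
    MPI_Isend(&u[2 * N], N, MPI_DOUBLE, right, 0, comm, &reqs[3]);  /* to the right */

    compute_interior(u);   /* interior points need no halo data */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(u);   /* boundary points use the freshly received halos */
}
```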