Collective I/O (MPI-IO)

Keywords: collective io, mpi io, parallel file read, hdf5 parallel write, aggregated io

Collective I/O (MPI-IO) is a coordinated parallel file access technique in which multiple MPI processes cooperate to read or write a shared file using large, contiguous I/O operations. Through a two-phase protocol with designated aggregator processes, it transforms many small, non-contiguous accesses from individual processes into a few large transfers. Because this matches the parallel file system's preference for large sequential operations, it can improve throughput by 10-100× compared to independent I/O.

The Parallel I/O Problem

```
Without collective I/O (independent I/O):
Rank 0: Read bytes [0:100], [400:500], [800:900] ← 3 small reads
Rank 1: Read bytes [100:200], [500:600], [900:1000] ← 3 small reads
Rank 2: Read bytes [200:300], [600:700], [1000:1100] ← 3 small reads
Rank 3: Read bytes [300:400], [700:800], [1100:1200] ← 3 small reads
→ 12 separate I/O requests → file system thrashes

With collective I/O (two-phase):
Aggregator 0: Read bytes [0:600] ← 1 large read
Aggregator 1: Read bytes [600:1200] ← 1 large read
→ 2 large I/O requests → communicate data to correct ranks
→ 10-100× faster on parallel file systems
```
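The strided pattern above can be handed to the MPI library in a single collective call by describing each rank's layout as a file view. Below is a minimal sketch, assuming 4 ranks and 100-byte blocks as in the diagram; the file name input.dat is hypothetical:

```c
// Sketch: express the strided read pattern as a file view, then read collectively
// so the MPI library is free to aggregate the requests (two-phase I/O).
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    // Each rank owns 3 blocks of 100 bytes, spaced 400 bytes apart in the file.
    MPI_Datatype filetype;
    MPI_Type_vector(3, 100, 400, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    // The displacement shifts each rank to its first block: rank 0 -> byte 0, rank 1 -> byte 100, ...
    MPI_File_set_view(fh, (MPI_Offset)rank * 100, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    char buf[300];  // the 3 blocks land contiguously in memory
    MPI_Status status;
    MPI_File_read_all(fh, buf, 300, MPI_BYTE, &status);  // collective read

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```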

Two-Phase I/O Protocol

```
Phase 1 (I/O): Aggregator processes perform large contiguous reads/writes
to the parallel file system (Lustre, GPFS)

Phase 2 (Communication): Aggregators redistribute data to/from all processes
via MPI communication (AlltoAll)

Result: File system sees large, sequential I/O (efficient)
Processes get their non-contiguous data (correct)
```
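The two phases run inside the MPI library, not in application code, but a simplified sketch makes the idea concrete. Here a single hypothetical aggregator (rank 0) performs the large read and then scatters each rank's piece; real implementations such as ROMIO use multiple aggregators and all-to-all style exchanges:

```c
// Conceptual sketch of two-phase I/O (not the actual ROMIO implementation).
// Assumption: each rank wants `chunk` contiguous bytes and rank 0 is the sole aggregator.
#include <mpi.h>
#include <stdlib.h>

void two_phase_read(MPI_Comm comm, const char *path, int chunk, char *mybuf) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    char *agg_buf = NULL;
    if (rank == 0) {
        // Phase 1 (I/O): the aggregator issues one large contiguous read.
        agg_buf = malloc((size_t)chunk * nprocs);
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_Status status;
        MPI_File_read_at(fh, 0, agg_buf, chunk * nprocs, MPI_BYTE, &status);
        MPI_File_close(&fh);
    }

    // Phase 2 (communication): redistribute each rank's piece over the network.
    MPI_Scatter(agg_buf, chunk, MPI_BYTE, mybuf, chunk, MPI_BYTE, 0, comm);

    free(agg_buf);  // free(NULL) is a no-op on the non-aggregator ranks
}
```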

MPI-IO API

```c
// Open the file collectively (all ranks in MPI_COMM_WORLD participate)
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.dat",
              MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

// Set per-rank view (each rank writes a different contiguous portion)
MPI_File_set_view(fh, (MPI_Offset)rank * chunk_size * sizeof(double),
                  MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

// Collective write (all ranks participate)
MPI_Status status;
MPI_File_write_all(fh, local_data, chunk_size, MPI_DOUBLE, &status);
// ^^^ _all suffix = collective

MPI_File_close(&fh);
```
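An equivalent pattern skips the file view and passes each rank's offset explicitly via the explicit-offset collective; note that every rank in the communicator must make the call, even if it contributes zero elements. A minimal sketch, reusing fh, rank, chunk_size, local_data, and status from the example above:

```c
// Same per-rank layout, but with an explicit offset instead of a file view.
// fh, rank, chunk_size, local_data, and status are assumed from the previous example.
MPI_Offset offset = (MPI_Offset)rank * chunk_size * sizeof(double);
MPI_File_write_at_all(fh, offset, local_data, chunk_size, MPI_DOUBLE, &status);
```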

Independent vs. Collective I/O

| Aspect | Independent (MPI_File_write) | Collective (MPI_File_write_all) |
|--------|------------------------------|---------------------------------|
| Coordination | None | All ranks in the communicator |
| I/O pattern | Each rank issues its own I/O | Aggregators combine requests |
| Small accesses | Many small I/Os (slow) | Merged into large I/Os (fast) |
| Network traffic | No extra MPI traffic (direct file access) | Added MPI communication phase |
| Typical throughput on Lustre | 1-10 GB/s | 50-200 GB/s |

HDF5 Parallel I/O

```c
// HDF5 collective I/O (built on MPI-IO)
hid_t plist = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, plist);

// Collective transfer property
hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);

H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, data);
```
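The memspace and filespace handles above carry each rank's selection. A minimal sketch of how they might be built for a 1-D dataset split into equal contiguous slices per rank; the dataset name "temperature" and the variables total, nprocs, rank, and data are assumptions:

```c
// Sketch: per-rank hyperslab selection for a 1-D dataset of `total` doubles,
// split into equal contiguous slices. Dataset name "temperature" is an assumption.
hid_t dataset = H5Dopen2(file, "temperature", H5P_DEFAULT);

hsize_t count = total / nprocs;          // elements this rank reads
hsize_t start = (hsize_t)rank * count;   // this rank's offset in the file

hid_t filespace = H5Dget_space(dataset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);

hid_t memspace = H5Screate_simple(1, &count, NULL);  // contiguous in-memory buffer

H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, data);

H5Sclose(memspace);
H5Sclose(filespace);
H5Dclose(dataset);
```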

Tuning Collective I/O

| Parameter | What | Impact |
|-----------|------|--------|
| cb_nodes | Number of aggregator processes | More aggregators → more parallel I/O |
| cb_buffer_size | Buffer size per aggregator | Larger → fewer I/O calls |
| striping_factor | Lustre stripe count | Match cb_nodes to stripe count |
| romio_ds_write | Data sieving for writes | Helps non-contiguous patterns |

```c
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "64");              // 64 aggregators
MPI_Info_set(info, "cb_buffer_size", "67108864");  // 64 MB buffer
MPI_File_open(comm, "output.dat", mode, info, &fh);
```
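The striping_factor row of the table can be acted on through the same hint object. The stripe settings only take effect when the file is created, and whether any hint is honored is up to the MPI implementation and file system; the values below are illustrative:

```c
// Illustrative additions to the hint object above, set before MPI_File_open.
// Honored (or silently ignored) at the discretion of the MPI library / file system.
MPI_Info_set(info, "striping_factor", "64");     // Lustre stripe count, matched to cb_nodes
MPI_Info_set(info, "striping_unit", "4194304");  // 4 MB Lustre stripe size
```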

Collective I/O is the essential technique for achieving high throughput on parallel file systems. Because file systems such as Lustre and GPFS are optimized for large sequential accesses rather than many small random ones, collective I/O through MPI-IO transforms the access pattern from process-centric to file-system-friendly, delivering the 100+ GB/s aggregate bandwidth that HPC simulations and AI training data pipelines require for checkpointing and data loading at scale.
