Collective I/O (MPI-IO)

Keywords: collective io, mpi io, parallel file read, hdf5 parallel write, aggregated io

Collective I/O (MPI-IO) is a coordinated parallel file access technique in which multiple MPI processes cooperate to read or write a shared file using large, contiguous I/O operations. Through a two-phase protocol with designated aggregator processes, it transforms many small, non-contiguous accesses from individual processes into a few large transfers. Because this matches the parallel file system's preference for large sequential operations, it can improve throughput by 10-100× compared to independent I/O.

The Parallel I/O Problem

```
Without collective I/O (independent I/O):
Rank 0: Read bytes [0:100], [400:500], [800:900] ← 3 small reads
Rank 1: Read bytes [100:200], [500:600], [900:1000] ← 3 small reads
Rank 2: Read bytes [200:300], [600:700], [1000:1100] ← 3 small reads
Rank 3: Read bytes [300:400], [700:800], [1100:1200] ← 3 small reads
→ 12 separate I/O requests → file system thrashes

With collective I/O (two-phase):
Aggregator 0: Read bytes [0:600] ← 1 large read
Aggregator 1: Read bytes [600:1200] ← 1 large read
→ 2 large I/O requests → communicate data to correct ranks
→ 10-100× faster on parallel file systems
```
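The strided pattern above can be handed to the MPI library in a single collective call by describing each rank's layout as a file view. Below is a minimal sketch, assuming 4 ranks and 100-byte blocks as in the diagram; the file name input.dat is hypothetical:

```c
// Sketch: express the strided read pattern as a file view, then read collectively
// so the MPI library is free to aggregate the requests (two-phase I/O).
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "input.dat", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);

    // Each rank owns 3 blocks of 100 bytes, spaced 400 bytes apart in the file.
    MPI_Datatype filetype;
    MPI_Type_vector(3, 100, 400, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    // The displacement shifts each rank to its first block: rank 0 -> byte 0, rank 1 -> byte 100, ...
    MPI_File_set_view(fh, (MPI_Offset)rank * 100, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    char buf[300];  // the 3 blocks land contiguously in memory
    MPI_Status status;
    MPI_File_read_all(fh, buf, 300, MPI_BYTE, &status);  // collective read

    MPI_Type_free(&filetype);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```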

Two-Phase I/O Protocol

```
Phase 1 (I/O): Aggregator processes perform large contiguous reads/writes
to the parallel file system (Lustre, GPFS)

Phase 2 (Communication): Aggregators redistribute data to/from all processes
via MPI communication (AlltoAll)

Result: File system sees large, sequential I/O (efficient)
Processes get their non-contiguous data (correct)
```
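The two phases run inside the MPI library, not in application code, but a simplified sketch makes the idea concrete. Here a single hypothetical aggregator (rank 0) performs the large read and then scatters each rank's piece; real implementations such as ROMIO use multiple aggregators and all-to-all style exchanges:

```c
// Conceptual sketch of two-phase I/O (not the actual ROMIO implementation).
// Assumption: each rank wants `chunk` contiguous bytes and rank 0 is the sole aggregator.
#include <mpi.h>
#include <stdlib.h>

void two_phase_read(MPI_Comm comm, const char *path, int chunk, char *mybuf) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    char *agg_buf = NULL;
    if (rank == 0) {
        // Phase 1 (I/O): the aggregator issues one large contiguous read.
        agg_buf = malloc((size_t)chunk * nprocs);
        MPI_File fh;
        MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_Status status;
        MPI_File_read_at(fh, 0, agg_buf, chunk * nprocs, MPI_BYTE, &status);
        MPI_File_close(&fh);
    }

    // Phase 2 (communication): redistribute each rank's piece over the network.
    MPI_Scatter(agg_buf, chunk, MPI_BYTE, mybuf, chunk, MPI_BYTE, 0, comm);

    free(agg_buf);  // free(NULL) is a no-op on the non-aggregator ranks
}
```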

MPI-IO API

```c
// Open the file collectively (all ranks in MPI_COMM_WORLD participate)
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "output.dat",
              MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

// Set per-rank view (each rank writes a different contiguous portion)
MPI_File_set_view(fh, (MPI_Offset)rank * chunk_size * sizeof(double),
                  MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

// Collective write (all ranks participate)
MPI_Status status;
MPI_File_write_all(fh, local_data, chunk_size, MPI_DOUBLE, &status);
// ^^^ _all suffix = collective

MPI_File_close(&fh);
```
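An equivalent pattern skips the file view and passes each rank's offset explicitly via the explicit-offset collective; note that every rank in the communicator must make the call, even if it contributes zero elements. A minimal sketch, reusing fh, rank, chunk_size, local_data, and status from the example above:

```c
// Same per-rank layout, but with an explicit offset instead of a file view.
// fh, rank, chunk_size, local_data, and status are assumed from the previous example.
MPI_Offset offset = (MPI_Offset)rank * chunk_size * sizeof(double);
MPI_File_write_at_all(fh, offset, local_data, chunk_size, MPI_DOUBLE, &status);
```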

Independent vs. Collective I/O

| Aspect | Independent (MPI_File_write) | Collective (MPI_File_write_all) |
|--------|------------------------------|---------------------------------|
| Coordination | None | All ranks in the communicator |
| I/O pattern | Each rank issues its own I/O | Aggregators combine requests |
| Small accesses | Many small I/Os (slow) | Merged into large I/Os (fast) |
| Network traffic | No extra MPI traffic (direct file access) | Added MPI communication phase |
| Typical throughput on Lustre | 1-10 GB/s | 50-200 GB/s |

HDF5 Parallel I/O

```c
// HDF5 collective I/O (built on MPI-IO)
hid_t plist = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, plist);

// Collective transfer property
hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);

H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, data);
```
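The memspace and filespace handles above carry each rank's selection. A minimal sketch of how they might be built for a 1-D dataset split into equal contiguous slices per rank; the dataset name "temperature" and the variables total, nprocs, rank, and data are assumptions:

```c
// Sketch: per-rank hyperslab selection for a 1-D dataset of `total` doubles,
// split into equal contiguous slices. Dataset name "temperature" is an assumption.
hid_t dataset = H5Dopen2(file, "temperature", H5P_DEFAULT);

hsize_t count = total / nprocs;          // elements this rank reads
hsize_t start = (hsize_t)rank * count;   // this rank's offset in the file

hid_t filespace = H5Dget_space(dataset);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);

hid_t memspace = H5Screate_simple(1, &count, NULL);  // contiguous in-memory buffer

H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, data);

H5Sclose(memspace);
H5Sclose(filespace);
H5Dclose(dataset);
```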

Tuning Collective I/O

| Parameter | What | Impact |
|-----------|------|--------|
| cb_nodes | Number of aggregator processes | More aggregators → more parallel I/O |
| cb_buffer_size | Buffer size per aggregator | Larger → fewer I/O calls |
| striping_factor | Lustre stripe count | Match cb_nodes to stripe count |
| romio_ds_write | Data sieving for writes | Helps non-contiguous patterns |

```c
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "64");              // 64 aggregators
MPI_Info_set(info, "cb_buffer_size", "67108864");  // 64 MB buffer
MPI_File_open(comm, "output.dat", mode, info, &fh);
```
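The striping_factor row of the table can be acted on through the same hint object. The stripe settings only take effect when the file is created, and whether any hint is honored is up to the MPI implementation and file system; the values below are illustrative:

```c
// Illustrative additions to the hint object above, set before MPI_File_open.
// Honored (or silently ignored) at the discretion of the MPI library / file system.
MPI_Info_set(info, "striping_factor", "64");     // Lustre stripe count, matched to cb_nodes
MPI_Info_set(info, "striping_unit", "4194304");  // 4 MB Lustre stripe size
```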

Collective I/O is the essential technique for achieving high throughput on parallel file systems. Because file systems such as Lustre and GPFS are optimized for large sequential accesses rather than many small random ones, collective I/O through MPI-IO transforms the access pattern from process-centric to file-system-friendly, delivering the 100+ GB/s aggregate bandwidth that HPC simulations and AI training data pipelines require for checkpointing and data loading at scale.
