Collective I/O (MPI-IO)

Collective I/O in MPI-IO is a coordinated parallel file access technique in which multiple MPI processes cooperate to read or write a shared file. A two-phase I/O protocol with designated aggregator processes turns the many small, non-contiguous accesses of individual processes into a few large, contiguous transfers. Because parallel file systems favor large sequential operations, this can improve throughput by roughly 10-100× over independent I/O for such access patterns.

The Parallel I/O Problem

 Without collective I/O (independent I/O):
 Rank 0: Read bytes [0:100], [400:500], [800:900]    ← 3 small reads
 Rank 1: Read bytes [100:200], [500:600], [900:1000]  ← 3 small reads
 Rank 2: Read bytes [200:300], [600:700], [1000:1100]  ← 3 small reads
 Rank 3: Read bytes [300:400], [700:800], [1100:1200]  ← 3 small reads
 → 12 separate I/O requests → file system thrashes

 With collective I/O (two-phase):
 Aggregator 0: Read bytes [0:600]     ← 1 large read
 Aggregator 1: Read bytes [600:1200]  ← 1 large read
 → 2 large I/O requests, then the aggregators send each rank its blocks
 → can be 10-100× faster on parallel file systems
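
The request counts above can be reproduced with a small, MPI-free sketch. The layout (4 ranks, 100-byte blocks, 3 blocks per rank, 2 aggregators) comes from the example; the function names are hypothetical helpers for illustration only, and the even split of the file between aggregators is an assumption, not a description of how any particular MPI library partitions file domains.

```c
#include <assert.h>

/* Round-robin block layout from the example: global block b belongs to
   rank b % nranks. Returns the start offset of the k-th block of `rank`. */
long block_offset(int rank, int k, int nranks, long block_size)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks && k >= 0 && block_size > 0);
    return ((long)k * nranks + rank) * block_size;
}

/* Independent I/O: every rank issues one request per block it owns. */
int independent_requests(int nranks, int blocks_per_rank)
{
    assert(nranks > 0 && blocks_per_rank >= 0);
    return nranks * blocks_per_rank;
}

/* Collective I/O: each aggregator covers one contiguous file domain.
   Stores the half-open byte range [*start, *end) for aggregator `agg`,
   assuming the file is split evenly between aggregators. */
void aggregator_domain(int agg, int naggs, long file_size, long *start, long *end)
{
    assert(naggs > 0 && agg >= 0 && agg < naggs && file_size >= 0);
    long chunk = (file_size + naggs - 1) / naggs;   /* ceiling division */
    long s = (long)agg * chunk;
    long e = s + chunk;
    *start = (s < file_size) ? s : file_size;
    *end = (e < file_size) ? e : file_size;
}
```

With the example's numbers, independent_requests(4, 3) gives the 12 small requests, while aggregator_domain yields the two ranges [0:600] and [600:1200].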

Two-Phase I/O Protocol

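 Phase 1 (I/O): Aggregator processes perform large contiguous reads
                from the parallel file system (Lustre, GPFS)

 Phase 2 (Communication): Aggregators redistribute the data to the processes
                          that requested it via MPI communication
                          (an all-to-all style exchange)

 For writes the two phases run in reverse order: processes first send their
 data to the aggregators, which then issue the large contiguous writes.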

 Result: File system sees large, sequential I/O (efficient)
         Processes get their non-contiguous data (correct)
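
To make the two phases concrete, here is a single-process simulation of a two-phase read, using the same 4-rank, 2-aggregator layout as the earlier example. It is only a sketch: an in-memory array stands in for the shared file, memcpy stands in for both the file read and the MPI exchange, and the name two_phase_read is hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { NRANKS = 4, NAGGS = 2, BLOCK = 100, ROUNDS = 3,
       FILE_SIZE = NRANKS * BLOCK * ROUNDS };

/* Simulated two-phase collective read.
   file: FILE_SIZE bytes standing in for the shared file
   out:  out[r] receives rank r's ROUNDS blocks, packed back to back
   Returns the number of simulated file-system reads. */
int two_phase_read(const unsigned char *file,
                   unsigned char out[NRANKS][BLOCK * ROUNDS])
{
    enum { DOMAIN = FILE_SIZE / NAGGS };   /* bytes per aggregator */
    unsigned char staging[DOMAIN];
    int reads = 0;

    assert(file != NULL && out != NULL);

    for (int agg = 0; agg < NAGGS; agg++) {
        long start = (long)agg * DOMAIN;

        /* Phase 1 (I/O): one large contiguous read per aggregator. */
        memcpy(staging, file + start, DOMAIN);
        reads++;

        /* Phase 2 (communication): hand each block to the rank that owns it.
           A real implementation would send MPI messages here. */
        for (long off = 0; off < DOMAIN; off += BLOCK) {
            long block = (start + off) / BLOCK;   /* global block index */
            int rank = (int)(block % NRANKS);     /* round-robin owner */
            long k = block / NRANKS;              /* slot in owner's buffer */
            memcpy(out[rank] + k * BLOCK, staging + off, BLOCK);
        }
    }
    return reads;
}
```

The simulated file system is touched twice in total, yet each rank ends up with its three non-contiguous blocks packed in order.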

MPI-IO API

// Open file collectively (every rank in the communicator must make this call)
MPI_File fh;
MPI_Status status;
MPI_File_open(MPI_COMM_WORLD, "output.dat",
              MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

// Set per-rank view (each rank writes a different portion).
// The displacement is a byte offset; compute it as MPI_Offset to avoid int overflow.
MPI_Offset disp = (MPI_Offset)rank * chunk_size * sizeof(double);
MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

// Collective write (all ranks participate)
MPI_File_write_all(fh, local_data, chunk_size, MPI_DOUBLE, &status);
//          ^^^ _all suffix = collective

MPI_File_close(&fh);
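
The displacement passed to MPI_File_set_view is a byte offset, and for large files the product rank * chunk_size * sizeof(double) can exceed the range of a 32-bit int. The arithmetic is sketched below without MPI. Here long long stands in for MPI_Offset (an assumption, since the actual type is implementation-defined), and view_displacement is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset at which `rank` starts in a file made of equal, contiguous
   per-rank chunks. long long stands in for MPI_Offset in this sketch. */
long long view_displacement(int rank, long long chunk_elems, size_t elem_size)
{
    assert(rank >= 0 && chunk_elems >= 0 && elem_size > 0);
    /* Widen before multiplying so the product is computed in 64 bits. */
    return (long long)rank * chunk_elems * (long long)elem_size;
}
```

For example, rank 1024 with 2^20 doubles per rank starts at byte 2^33, which no longer fits in a 32-bit int.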

Independent vs. Collective I/O

 Aspect                 Independent (MPI_File_write)    Collective (MPI_File_write_all)
 Coordination           None                            All ranks in communicator
 I/O pattern            Each rank issues own I/O        Aggregators combine requests
 Small accesses         Many small I/Os (slow)          Merged into large I/Os (fast)
 Inter-process traffic  None (direct file access)       MPI communication phase
 Throughput on Lustre   1-10 GB/s                       50-200 GB/s
 (illustrative; actual figures depend on the system and access pattern)

HDF5 Parallel I/O

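// HDF5 collective I/O (built on MPI-IO)
hid_t plist = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, plist);
H5Pclose(plist);

// Collective transfer property
hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);

// dataset, memspace, and filespace are assumed to have been set up earlier
// (an opened dataset plus each rank's hyperslab selection)
H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, data);

// Release the handles created above
H5Pclose(xfer);
H5Fclose(file);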

Tuning Collective I/O

 Parameter         What                              Impact
 cb_nodes          Number of aggregator processes    More aggregators → more parallel I/O
 cb_buffer_size    Buffer size per aggregator        Larger → fewer I/O calls
 striping_factor   Lustre stripe count               Match cb_nodes to stripe count
 romio_ds_write    Data sieving for writes           Helps non-contiguous patterns

// Hints are passed as string key/value pairs when the file is opened
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "64");           // 64 aggregators
MPI_Info_set(info, "cb_buffer_size", "67108864"); // 64 MB buffer
MPI_File_open(comm, "output.dat", mode, info, &fh);
MPI_Info_free(&info);  // safe after open: the file handle keeps its own copy
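
The "larger → fewer I/O calls" relationship for cb_buffer_size is a ceiling division, and hint values must be supplied as decimal strings. The MPI-free sketch below shows both. The helper names are hypothetical, and the model (one file-system call per buffer-sized piece of an aggregator's file domain) is a simplifying assumption rather than a description of any specific MPI implementation.

```c
#include <assert.h>
#include <stdio.h>

/* Number of file-system calls one aggregator needs to cover its file domain
   when it moves at most cb_buffer_size bytes per call (ceiling division). */
long long io_calls_per_aggregator(long long domain_bytes, long long cb_buffer_size)
{
    assert(domain_bytes >= 0 && cb_buffer_size > 0);
    return (domain_bytes + cb_buffer_size - 1) / cb_buffer_size;
}

/* MPI_Info values are strings, so numeric hints are formatted as decimal text.
   Returns 0 on success, -1 if the buffer is too small. */
int format_hint(char *buf, size_t len, long long value)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%lld", value);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Under this model an aggregator with a 1 GiB file domain needs 64 calls with a 16 MiB buffer and 16 calls with a 64 MiB buffer; format_hint(buf, sizeof buf, 64LL * 1024 * 1024) produces the "67108864" string used in the snippet above.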

Collective I/O is a key technique for achieving high throughput on parallel file systems. File systems such as Lustre are optimized for large sequential accesses rather than many small random ones, and collective I/O through MPI-IO reshapes the access pattern from process-centric to file-system-friendly. On large systems this helps reach the 100+ GB/s aggregate bandwidth that HPC simulations and AI training data pipelines need for checkpointing and data loading at scale.
