Known Issues with MPI¶
This page holds known issues observed with MPI and concrete MPI implementations.
Performance Loss with MPI-IO-Module OMPIO¶
Open MPI v4.1.x introduced a couple of major enhancements, e.g., the
OMPIO module is now the
default module for MPI-IO on all filesystems incl. Lustre (cf.
NEWS file in Open MPI source code).
Prior to this,
ROMIO was the default MPI-IO module for Lustre.
Colleagues of ZIH have found that some MPI-IO access patterns suffer a significant performance loss
using OMPIO as the MPI-IO module with
OpenMPI/4.1.x modules on ZIH systems. At the moment, the root
cause is unclear and needs further investigation.
A workaround for this performance loss is to use the "old", i.e.,
ROMIO MPI-IO module. This
is achieved by setting the environment variable
OMPI_MCA_io before executing the application as follows
marie@login$ export OMPI_MCA_io=^ompio
marie@login$ srun [...]
or setting the option as an argument, in case you invoke mpirun directly:
marie@login$ mpirun --mca io ^ompio [...]
Mpirun on Partitions¶
Using mpirun on the partitions alpha and ml leads to wrong resource distribution when more than
one node is involved. This yields a strange distribution like e.g. SLURM_NTASKS_PER_NODE=15,1 even
though --tasks-per-node=8 was specified. Unless you really know what you're doing (e.g.
use rank pinning via a Perl script), avoid using mpirun.
Another issue arises when using the Intel toolchain: mpirun called a different MPI, which caused an 8-9x slowdown of the PALM application in comparison to using srun or the GCC-compiled version of the application (which uses the correct MPI).
R Parallel Library on Multiple Nodes¶
Using the R parallel library on MPI clusters has shown problems when using more than a few compute nodes. The error messages indicate that there are buggy interactions of R/Rmpi/Open MPI and UCX. Disabling UCX has solved these problems in our experiments.
We invoked the R script successfully with the following command:
marie@login$ mpirun -mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx -np 1 Rscript --vanilla the-script.R
where the arguments
-mca btl_openib_allow_ib true --mca pml ^ucx --mca osc ^ucx disable usage of UCX.
MPI Function MPI_Win_allocate¶
The function MPI_Win_allocate is a one-sided MPI call that allocates memory and returns a window
object for RDMA operations (ref. man page).
Using MPI_Win_allocate rather than separate MPI_Alloc_mem + MPI_Win_create may allow the MPI implementation to optimize the memory allocation. (Source: Using Advanced MPI)
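For illustration, here is a minimal sketch of a one-sided setup via MPI_Win_allocate; the buffer size, displacement unit, and the placeholder communication are arbitrary choices for this example:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* MPI_Win_allocate performs the memory allocation and the window
       creation in a single call, which lets the MPI implementation
       place the buffer in memory favorable for RDMA. */
    const MPI_Aint size = 1024 * sizeof(double); /* arbitrary example size */
    double *buf = NULL;
    MPI_Win win;
    MPI_Win_allocate(size, sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD,
                     &buf, &win);

    /* ... one-sided communication (MPI_Put/MPI_Get) would go here ... */

    MPI_Win_free(&win); /* also frees the memory attached to the window */
    MPI_Finalize();
    return 0;
}
```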
It was observed, at least for the
OpenMPI/4.0.5 module, that using
MPI_Win_allocate instead of
MPI_Alloc_mem in conjunction with
MPI_Win_create leads to segmentation faults in the calling
application. To be precise, the segfaults occurred on the partition
romeo when about 200 GB per node
were allocated. In contrast, the segmentation faults vanished when the implementation was
refactored to call the MPI_Alloc_mem + MPI_Win_create functions.
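A sketch of the refactored variant, assuming the same example buffer size as above; allocating the memory and creating the window in two separate steps avoids the problematic code path:

```c
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const MPI_Aint size = 1024 * sizeof(double); /* arbitrary example size */
    double *buf = NULL;
    MPI_Win win;

    /* Step 1: allocate the memory explicitly ... */
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);

    /* Step 2: ... and expose it via a window created over the buffer. */
    MPI_Win_create(buf, size, sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    /* ... one-sided communication would go here ... */

    MPI_Win_free(&win); /* the window does not own the memory here ... */
    MPI_Free_mem(buf);  /* ... so the buffer must be freed separately  */
    MPI_Finalize();
    return 0;
}
```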