Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
I can reproduce this issue with version 4.1.1 but not with versions 3.1.4, 3.1.1, and 2.1.2, so this looks to me like a regression in Open MPI 4 (see #10433 (comment)).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
All mentioned versions were built using EasyBuild.
Please describe the system on which you are running
- Operating system/version: Red Hat Enterprise Linux release 8.3 (Ootpa)
- Computer hardware: Cray CS500, nodes contain two Intel(R) Xeon(R) Gold 6148(F) CPUs (system description)
- Network type: 100Gbps Omni-Path
Details of the problem
I use one-sided communication to distribute parts of a 2000x2000 character array to the different MPI processes. For that, I use MPI_Get on each process to fetch its data block from rank 0. The MPI_Get is surrounded by calls to MPI_Win_fence as follows:
MPI_Win_fence(0, win);
MPI_Get(local, local_height*local_width, MPI_CHAR, 0, global_offset, 1, mydata_in_global, win);
MPI_Win_fence(0, win);
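For reference, here is a minimal self-contained sketch of this pattern. It is a simplification, not the code from the gist: names such as global, local, local_height, and local_width are placeholders, the block layout is assumed to be contiguous rows with the process count dividing 2000, and the derived target datatype mydata_in_global from the snippet above is replaced by a plain MPI_CHAR count.

/* Minimal sketch of the fence/MPI_Get pattern; all names besides the MPI
   calls are hypothetical and the data layout is assumed (contiguous rows). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int global_height = 2000, global_width = 2000;
    const int local_height = global_height / nranks;   /* assumes nranks divides 2000 */
    const int local_width  = global_width;

    /* Rank 0 exposes the full array in the window; other ranks expose nothing. */
    char *global = NULL;
    MPI_Win win;
    if (rank == 0) {
        MPI_Win_allocate((MPI_Aint)global_height * global_width, 1,
                         MPI_INFO_NULL, MPI_COMM_WORLD, &global, &win);
        for (long i = 0; i < (long)global_height * global_width; i++)
            global[i] = 'A' + (char)(i % 26);
    } else {
        MPI_Win_allocate(0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &global, &win);
    }

    char *local = malloc((size_t)local_height * local_width);
    MPI_Aint global_offset = (MPI_Aint)rank * local_height * local_width;

    /* The pattern in question: a single epoch in which every rank issues its MPI_Get. */
    MPI_Win_fence(0, win);
    MPI_Get(local, local_height * local_width, MPI_CHAR,
            0, global_offset, local_height * local_width, MPI_CHAR, win);
    MPI_Win_fence(0, win);

    free(local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}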
This works correctly on most systems that I work on. On one specific compute cluster that uses the PSM2 MTL by default, MPI_Get will non-deterministically return wrong data.
A workaround is to make sure that only one MPI_Get call is performed within each communication epoch:
for (int i = 0; i < nranks; i++) {
    MPI_Win_fence(0, win);   /* start a new epoch; only rank i issues its MPI_Get in it */
    if (rank == i) MPI_Get(local, local_height*local_width, MPI_CHAR, 0, global_offset, 1, mydata_in_global, win);
}
MPI_Win_fence(0, win);       /* close the last epoch */
The problem also goes away when the use of PSM2 is avoided by setting either of the following environment variables:
OMPI_MCA_mtl=^psm2
OMPI_MCA_pml=ob1
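For example, a run of the reproducer with one of these workarounds applied could look as follows (the output file name is arbitrary; passing --mca on the mpirun command line is equivalent to setting the corresponding OMPI_MCA_* environment variable):

export OMPI_MCA_pml=ob1
mpirun -np 20 ./reproducer workaround.out

# or, equivalently, via the mpirun command line:
mpirun --mca mtl "^psm2" -np 20 ./reproducer workaround.out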
The problem only shows up with a significant number of processes (at least 10, more reliably with 20) and only in some of many repeated program runs. So far, I have only tested on a single node, i.e., communication probably does not go over the network.
The semantics and correctness of one-sided communication are described in section 12.7 of the current MPI standard. It states that concurrent MPI_Put operations on the same window within the same epoch are undefined behavior. However, for MPI_Get I could not find such a statement, and it seems counter-intuitive that multiple read accesses would influence each other.
Apart from Open MPI v2, v3, and v4, I also tested Intel MPI, which runs the program without issues.
You can find a code that reproduces the problem in the following gist: https://gist.github.com/michaellass/85486202494a149c9f24f48ad1786497
I run it via the following script, which creates 50 output files that should all be identical:
for i in $(seq 50); do
mpirun -np 20 ./reproducer $i.out
done
echo
echo "All runs should create the same checksum:"
sha1sum *.out