
Parallel MPI_Get on the same window provides wrong data when using mtl/psm2 and osc/ucx #10433

@michaellass

Description


Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

I can reproduce this issue with version 4.1.1 but not with versions 3.1.4, 3.1.1 and 2.1.2. So this looks to me like a regression in Open MPI 4 (see #10433 (comment)).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

All mentioned versions were built using EasyBuild.

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux release 8.3 (Ootpa)
  • Computer hardware: Cray CS500, nodes contain two Intel(R) Xeon(R) Gold 6148(F) CPUs (system description)
  • Network type: 100Gbps Omni-Path

Details of the problem

I use one-sided communication to distribute parts of a 2000x2000 character array to the different MPI processes. For that, each process calls MPI_Get to fetch its data block from rank 0. MPI_Get is surrounded by calls to MPI_Win_fence as follows:

MPI_Win_fence(0, win);
MPI_Get(local, local_height*local_width, MPI_CHAR, 0, global_offset, 1, mydata_in_global, win);
MPI_Win_fence(0, win);
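
For context, here is a minimal, self-contained sketch of this pattern. It is not the actual reproducer from the gist linked below; the 2000x2000 array size, the row-block decomposition, and the use of MPI_Type_create_subarray to build mydata_in_global are assumptions made purely to illustrate one way win and mydata_in_global could be set up:

/* Sketch only: rank 0 exposes an N x N char array through an MPI window;
 * every rank fetches its block of rows with a single fenced MPI_Get.
 * Array size, row-block decomposition and the subarray datatype are
 * illustrative assumptions, not the code from the gist. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define GLOBAL_N 2000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    if (GLOBAL_N % nranks != 0) {
        if (rank == 0) fprintf(stderr, "number of ranks must divide %d\n", GLOBAL_N);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    const int local_height = GLOBAL_N / nranks;   /* rows per rank */
    const int local_width  = GLOBAL_N;            /* full rows     */

    /* Rank 0 owns the global array and exposes it through the window;
     * the other ranks expose an empty window. */
    char *global = NULL;
    if (rank == 0) {
        global = malloc((size_t)GLOBAL_N * GLOBAL_N);
        for (size_t i = 0; i < (size_t)GLOBAL_N * GLOBAL_N; i++)
            global[i] = 'A' + (char)(i % 26);
    }
    MPI_Win win;
    MPI_Win_create(global, rank == 0 ? (MPI_Aint)GLOBAL_N * GLOBAL_N : 0,
                   1 /* disp_unit in bytes */, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Datatype selecting this rank's block inside the global array.  Since
     * the subarray already encodes the block's position, the displacement
     * passed to MPI_Get is 0 in this sketch. */
    int sizes[2]    = {GLOBAL_N, GLOBAL_N};
    int subsizes[2] = {local_height, local_width};
    int starts[2]   = {rank * local_height, 0};
    MPI_Datatype mydata_in_global;
    MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C,
                             MPI_CHAR, &mydata_in_global);
    MPI_Type_commit(&mydata_in_global);
    MPI_Aint global_offset = 0;

    char *local = malloc((size_t)local_height * local_width);

    /* The pattern in question: every rank issues its MPI_Get within the
     * same fence epoch. */
    MPI_Win_fence(0, win);
    MPI_Get(local, local_height * local_width, MPI_CHAR, 0, global_offset,
            1, mydata_in_global, win);
    MPI_Win_fence(0, win);

    /* Basic sanity check against the known initialization pattern. */
    size_t first = (size_t)rank * local_height * GLOBAL_N;
    if (local[0] != (char)('A' + first % 26))
        fprintf(stderr, "rank %d received wrong data\n", rank);

    MPI_Type_free(&mydata_in_global);
    MPI_Win_free(&win);
    free(local);
    free(global);
    MPI_Finalize();
    return 0;
}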

This works correctly on most systems that I work on. On one specific compute cluster that uses the PSM2 MTL by default, however, MPI_Get non-deterministically provides wrong data.

A workaround is to make sure that only one MPI_Get call is performed within each communication epoch:

for (int i = 0; i < nranks; i++) {
  MPI_Win_fence(0, win);
  if (rank == i) MPI_Get(local, local_height*local_width, MPI_CHAR, 0, global_offset, 1, mydata_in_global, win);
}
MPI_Win_fence(0, win);

The problem also goes away when the use of PSM2 is avoided by setting either of the following environment variables:

  • OMPI_MCA_mtl=^psm2
  • OMPI_MCA_pml=ob1

The problem only shows up with a sufficiently large number of processes (at least 10, more reliably 20) and only in some of many repeated program runs. So far, I have only tested on a single node, i.e., communication probably does not go over the network.

The semantics and correctness of one-sided communication are described in section 12.7 of the current MPI standard. It states that parallel MPI_Put operations to the same window within the same epoch are undefined behavior. However, I could not find such a statement for MPI_Get, and it seems counter-intuitive that multiple read accesses would influence each other.

Apart from Open MPI v2, v3, and v4, I also tested Intel MPI, which runs the program without issues.

You can find code that reproduces the problem in the following gist: https://gist.github.com/michaellass/85486202494a149c9f24f48ad1786497
I run it via the following script, which creates 50 output files that should all be identical:

for i in $(seq 50); do
  mpirun -np 20 ./reproducer $i.out
done

echo
echo "All runs should create the same checksum:"

sha1sum *.out
