
incorrect MPI_TAG_UB, throws "'boost::wrapexcept<boost::mpi::exception>' what(): MPI_Recv: MPI_ERR_TAG: invalid tag" #6940

Closed
opoplawski opened this issue Aug 29, 2019 · 5 comments

Comments

@opoplawski (Contributor)

Background information

This is copied from https://bugzilla.redhat.com/show_bug.cgi?id=1746564

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Fedora packages

Please describe the system on which you are running

  • Operating system/version: Fedora 31
  • Computer hardware: x86_64
  • Network type:

Details of the problem

The value of MPI_TAG_UB is 8388608 (2^23) instead of 2147483647 (2^31-1). C++
code containing a call to boost::mpi::reduce() and compiled against the openmpi
and boost-openmpi libraries throws an "invalid tag" exception when run with 2 or
more MPI processes.

Steps to Reproduce:

  1. install openmpi-devel boost-devel boost-openmpi-devel
  2. compile the attached minimum working example (sample.cpp; a hedged sketch of what it might contain follows this list)
  3. run the output binary with two MPI processes
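
The actual sample.cpp is only available in attachment.tar.gz, so the following is a
minimal sketch of what such a reproducer might look like, assuming a string
reduction over two ranks (the exact strings and structure are guesses):

#include <boost/mpi.hpp>
#include <functional>
#include <iostream>
#include <string>

namespace mpi = boost::mpi;

int main(int argc, char* argv[]) {
    mpi::environment env(argc, argv);
    mpi::communicator world;

    // Each rank contributes a word; reduce() with a user-defined operation on a
    // serialized type goes through Boost.MPI's own reduce path rather than a
    // native MPI_Reduce.
    std::string word = (world.rank() == 0) ? "zero " : "one";
    std::string result;
    mpi::reduce(world, word, result, std::plus<std::string>(), 0);

    if (world.rank() == 0) {
        std::cout << "The result is " << result << std::endl;
    }
    return 0;
}

Something along the lines of mpicxx sample.cpp -lboost_mpi -lboost_serialization
followed by mpirun -np 2 ./a.out should reproduce the crash; exact library names
may differ on Fedora.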

Actual results:
The program crashes with the error message "terminate called after throwing an
instance of 'boost::wrapexcept<boost::mpi::exception>' what(): MPI_Recv:
MPI_ERR_TAG: invalid tag"

Expected results:
The program should print the string "The result is zero one"

Additional info:
When compiling openmpi 4.0.1 and boost 1.69 from sources, the MPI_TAG_UB has
the correct value and the code sample does not throw an exception. The correct
behavior is also observed in Fedora 30, which has the same boost version but
openmpi 3.1.4. This bug was investigated using the fedora:31 Docker image. The
Dockerfiles, code sample, bash commands and error output are attached. Our
analysis of this bug
(espressomd/espresso#2985 (comment))
shows it is the root cause for Bug 1728057.
attachment.tar.gz

--- Comment #1 from Philip Kovacs [email protected] ---
I took some time to look at this -- the problem is somewhere in the ucx layer.
Now I don't have f31, but I do have f32 rawhide, and I did observe the
max tag = 8388608 problem using the get_tag.cc sample provided.
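
The get_tag.cc attachment is not reproduced in this thread; it presumably just
queries the MPI_TAG_UB attribute on MPI_COMM_WORLD, along these lines:

#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    // MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD; the value returned
    // through the attribute pointer is an int holding the largest usable tag.
    int* tag_ub = nullptr;
    int flag = 0;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag) {
        std::printf("MPI_TAG_UB = %d\n", *tag_ub);
    }

    MPI_Finalize();
    return 0;
}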

I reconfigured openmpi 4.0.1 without ucx and that resolves the problem:

mpirun -np 4 ./get_tag

MPI_TAG_UB = 2147483647
MPI_TAG_UB = 2147483647
MPI_TAG_UB = 2147483647
MPI_TAG_UB = 2147483647
@ggouaillardet (Contributor)

MPI_TAG_UB can be influenced by the hardware that is being used. For example, if a given interconnect can do message tag matching at the hardware level, it might restrict how many bits are available for the tag.

"MPI_TAG_UB has the correct value"

sounds like a misunderstanding of what can be expected from MPI_TAG_UB.

@hjelmn (Member) commented Aug 29, 2019

Indeed. We can set the tag ub to whatever. Closing.

@hjelmn closed this as completed Aug 29, 2019
@hjelmn (Member) commented Aug 29, 2019

Definitely a boost bug. They have their own reduce implementation (seriously) and use their own "collectives tag" which is obviously out of range.
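
For illustration only (this is not Boost.MPI's actual code): a library that picks its
own tag for internal collectives can fold it into whatever upper bound the MPI
implementation actually advertises, instead of assuming 2^31 - 1 is available:

#include <mpi.h>

// Map a library-chosen tag (assumed non-negative) into [0, MPI_TAG_UB).
int clamp_tag(MPI_Comm comm, int desired_tag) {
    int* tag_ub = nullptr;
    int flag = 0;
    MPI_Comm_get_attr(comm, MPI_TAG_UB, &tag_ub, &flag);
    // 32767 is the smallest value the MPI standard lets an implementation
    // advertise for MPI_TAG_UB, so it is a safe fallback.
    int ub = (flag && tag_ub) ? *tag_ub : 32767;
    return desired_tag % ub;
}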

@yosefe (Contributor) commented Aug 29, 2019

@opoplawski Even though boost should not make any assumption about MPI_TAG_UB, other than that it is at least 32768, this issue might be fixed by #6792

@hjelmn (Member) commented Aug 29, 2019

So your pml had an off-by-one error? Not sure how that passed MTT. Thought we had a test for sending the max tag. If not, please add one.
