incorrect MPI_TAG_UB, throws "'boost::wrapexcept<boost::mpi::exception>' what(): MPI_Recv: MPI_ERR_TAG: invalid tag" #6940
Comments
sounds like a misunderstanding of what can be expected from
Indeed. We can set the tag ub to whatever. Closing.
Definitely a boost bug. They have their own reduce implementation (seriously) and use their own "collectives tag" which is obviously out of range.
@opoplawski Even though boost should not make any assumption about MPI_TAG_UB, other than that it's at least 32767, this issue might be fixed by #6792
So your pml had an off-by-one error? Not sure how that passed MTT. Thought we had a test for sending max tag. If not then please add one.
Background information
This is copied from https://bugzilla.redhat.com/show_bug.cgi?id=1746564
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Fedora packages
Please describe the system on which you are running
Details of the problem
The value of MPI_TAG_UB is 8388608 (2^23) instead of 2147483647 (2^31-1). C++
code containing a call to boost::mpi::reduce() and compiled with the openmpi
and boost-openmpi libraries will throw an "invalid tag" exception if 2 or more
MPI threads are used.
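To illustrate why a smaller upper bound can turn a previously accepted tag into
an invalid one, here is a minimal sketch. It is not the attached get_tag.cc and
not Boost.MPI's actual code. It assumes a tag is valid when
0 <= tag <= MPI_TAG_UB, and `kHypotheticalReservedTag` is a made-up stand-in
for whatever tag the library reserves for collectives; the real value is not
given in this report.

```cpp
#include <cstdint>

// Upper bounds quoted in this report.
constexpr std::int64_t kReportedTagUb = 8388608;     // 2^23, Fedora 31 packages
constexpr std::int64_t kExpectedTagUb = 2147483647;  // 2^31 - 1, source build

static_assert(kReportedTagUb == (std::int64_t{1} << 23), "2^23");
static_assert(kExpectedTagUb == (std::int64_t{1} << 31) - 1, "2^31 - 1");

// Assumption: a tag is accepted when it lies in [0, tag_ub].
constexpr bool tag_is_valid(std::int64_t tag, std::int64_t tag_ub) {
    return tag >= 0 && tag <= tag_ub;
}

// Hypothetical reserved tag chosen near the top of the 2^31 - 1 range.
// This is an illustration only, not the value Boost.MPI uses.
constexpr std::int64_t kHypotheticalReservedTag = kExpectedTagUb;
```

With these definitions, `tag_is_valid(kHypotheticalReservedTag, kExpectedTagUb)`
is true, while `tag_is_valid(kHypotheticalReservedTag, kReportedTagUb)` is
false, which would match an "invalid tag" error appearing only with the lower
bound.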
Steps to Reproduce:
Actual results:
The program crashes with error message "terminate called after throwing an
instance of 'boost::wrapexcept<boost::mpi::exception>' what(): MPI_Recv:
MPI_ERR_TAG: invalid tag"
Expected results:
The program should print the string "The result is zero one"
Additional info:
When compiling openmpi 4.0.1 and boost 1.69 from sources, the MPI_TAG_UB has
the correct value and the code sample does not throw an exception. The correct
behavior is also observed in Fedora 30, which has the same boost version but
openmpi 3.1.4. This bug was investigated using the fedora:31 Docker image. The
Dockerfiles, code sample, bash commands and error output are attached. Our
analysis of this bug
(espressomd/espresso#2985 (comment))
shows it is the root cause for Bug 1728057.
attachment.tar.gz
--- Comment #1 from Philip Kovacs ---
I took some time to look at this -- the problem is somewhere in the ucx layer.
Now I don't have f31, but I do have f32 rawhide and I did observe the
max tag = 8388608 problem using the get_tag.cc sample provided.
I reconfigured openmpi 4.0.1 without ucx and that resolves the problem: