-
Notifications
You must be signed in to change notification settings - Fork 900
coll/base: only retain datatypes/op if the request has not yet completed #6889
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll/base: only retain datatypes/op if the request has not yet completed #6889
Conversation
I don't have much to show, only the trace: *** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x4008aa3d
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f2607b6c5e0]
[ 1] /ompi/__install/lib/libmpi.so.40(+0x9f7f4)[0x7f2607e187f4]
[ 2] /ompi/__install/lib/libmpi.so.40(+0x9f8fa)[0x7f2607e188fa]
[ 3] /ompi/__install/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x75)[0x7f25fa78c635]
[ 4] /ompi/__install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f260720becc]
[ 5] /ompi/__install/lib/libmpi.so.40(ompi_request_default_wait+0x125)[0x7f2607dc9105]
[ 6] /ompi/__install/lib/libmpi.so.40(PMPI_Wait+0x4e)[0x7f2607e030ae]
[ 7] /atest/src/atest[0x4144bf]
[ 8] /atest/src/atest[0x40dd88]
[ 9] /atest/src/atest[0x402acf]
[10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f26077bbc05]
[11] /atest/src/atest[0x402d24]
*** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------------------------- |
@amaslenn can you please share your test (or a pointer) so I can investigate this ? |
I uploaded test binary with my command line to this gist: https://gist.github.com/amaslenn/628d3e7378452fa26e32e8ec70e46788 Can't provide source code. |
base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t to correctly support the retention of datatypes/operators This fixes a regression introduced in open-mpi/ompi@0fe756d Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit open-mpi/ompi@f8eef0f)
Since ompi_coll_base_nbc_request_t is to be used in an opal_free_list_t, it must be returned into a "clean" state. So cleanup some data in the callback completion subroutines. This fixes a regression introduced in open-mpi/ompi@0fe756d Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit open-mpi/ompi@0862c40)
a non blocking collective might return ompi_request_null, so we should not retain anything in that case. Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit open-mpi/ompi@63d3ccd)
ed5167e
to
39ec580
Compare
@amaslenn my bad, I screwed up when issuing this PR (I only cherry picked one out of three commits). The PR should be good now (and it passe the test you sent me). |
bot:ompi:retest |
Checked on my side, test passed. |
This should be OK to merge. |
a non blocking collective might return ompi_request_null, so we should not
retain anything in that case.
Signed-off-by: Gilles Gouaillardet [email protected]
(cherry picked from commit 63d3ccd)