Skip to content

coll/base: only retain datatypes/op if the request has not yet completed #6889

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 19, 2019

Conversation

ggouaillardet
Copy link
Contributor

a non blocking collective might return ompi_request_null, so we should not
retain anything in that case.

Signed-off-by: Gilles Gouaillardet [email protected]

(cherry picked from commit 63d3ccd)

@amaslenn
Copy link

mpich's test passed, but I've got an error with our internal test. Is this PR final?

I don't have much to show, only the trace:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x4008aa3d
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7f2607b6c5e0]
[ 1] /ompi/__install/lib/libmpi.so.40(+0x9f7f4)[0x7f2607e187f4]
[ 2] /ompi/__install/lib/libmpi.so.40(+0x9f8fa)[0x7f2607e188fa]
[ 3] /ompi/__install/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x75)[0x7f25fa78c635]
[ 4] /ompi/__install/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f260720becc]
[ 5] /ompi/__install/lib/libmpi.so.40(ompi_request_default_wait+0x125)[0x7f2607dc9105]
[ 6] /ompi/__install/lib/libmpi.so.40(PMPI_Wait+0x4e)[0x7f2607e030ae]
[ 7] /atest/src/atest[0x4144bf]
[ 8] /atest/src/atest[0x40dd88]
[ 9] /atest/src/atest[0x402acf]
[10] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f26077bbc05]
[11] /atest/src/atest[0x402d24]
*** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

@ggouaillardet
Copy link
Contributor Author

@amaslenn can you please share your test (or a pointer) so I can investigate this ?

@amaslenn
Copy link

@amaslenn can you please share your test (or a pointer) so I can investigate this ?

I uploaded test binary with my command line to this gist: https://gist.github.com/amaslenn/628d3e7378452fa26e32e8ec70e46788

Can't provide source code.

base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators

This fixes a regression introduced in open-mpi/ompi@0fe756d

Signed-off-by: Gilles Gouaillardet <[email protected]>

(cherry picked from commit open-mpi/ompi@f8eef0f)
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned into a "clean" state.
So cleanup some data in the callback completion subroutines.

This fixes a regression introduced in open-mpi/ompi@0fe756d

Signed-off-by: Gilles Gouaillardet <[email protected]>

(cherry picked from commit open-mpi/ompi@0862c40)
a non blocking collective might return ompi_request_null, so we should not
retain anything in that case.

Signed-off-by: Gilles Gouaillardet <[email protected]>

(cherry picked from commit open-mpi/ompi@63d3ccd)
@ggouaillardet ggouaillardet force-pushed the topic/v4.0.x/nbc_fixes branch from ed5167e to 39ec580 Compare August 12, 2019 16:17
@ggouaillardet
Copy link
Contributor Author

@amaslenn my bad, I screwed up when issuing this PR (I only cherry picked one out of three commits). The PR should be good now (and it passe the test you sent me).

@hppritcha
Copy link
Member

bot:ompi:retest

@amaslenn
Copy link

Checked on my side, test passed.

@amaslenn
Copy link

This should be OK to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants