coll/{base,libnbc}: fix datatypes/operator retention #6880
Base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t to correctly support the retention of datatypes/operators.
This fixes a regression introduced in open-mpi/ompi@0fe756d.
Signed-off-by: Gilles Gouaillardet <[email protected]>
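To illustrate what "basing one request type on top of another" can look like in C, here is a minimal sketch using struct embedding. All names (`obj_t`, `base_request_t`, `derived_request_t`, `base_retain`, `base_release_all`) are hypothetical stand-ins, not the Open MPI definitions; the only assumption is that the base part is the first member, so a pointer to the derived request can be passed to helpers that expect the base type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted object (stand-in for a datatype or op). */
typedef struct { int refcount; } obj_t;

/* Hypothetical base request: remembers what it retained. */
typedef struct base_request {
    obj_t *retained[2];
    int    nretained;
} base_request_t;

/* Hypothetical derived request: the base part MUST be the first member. */
typedef struct derived_request {
    base_request_t super;
    int            schedule_step; /* derived-only state */
} derived_request_t;

/* Retain an object on behalf of the request; returns 0 on success. */
static int base_retain(base_request_t *req, obj_t *o) {
    assert(req != NULL && o != NULL);
    if (req->nretained >= 2) {
        return -1; /* no slot left in this sketch */
    }
    o->refcount++;
    req->retained[req->nretained++] = o;
    return 0;
}

/* Release everything the request retained (e.g. at completion). */
static void base_release_all(base_request_t *req) {
    assert(req != NULL);
    for (int i = 0; i < req->nretained; i++) {
        req->retained[i]->refcount--;
        req->retained[i] = NULL;
    }
    req->nretained = 0;
}
```

Because `super` is the first member, `(base_request_t *)&derived` and `&derived.super` refer to the same address, so the retain/release helpers only need to be written once for the base type.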
Since ompi_coll_base_nbc_request_t is used in an opal_free_list_t, it must be returned in a "clean" state, so clean up some data in the completion callback subroutines.
This fixes a regression introduced in open-mpi/ompi@0fe756d.
Signed-off-by: Gilles Gouaillardet <[email protected]>
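The "clean state" requirement can be sketched with a toy free list. This is not `opal_free_list_t`; `pool_t`, `pooled_request_t`, `pool_get`, `pool_return`, and `on_complete` are hypothetical names. The point is only that an item recycled from a pool keeps whatever its previous user left in it, so the completion path resets the per-operation fields before handing the item back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pooled request with per-operation state. */
typedef struct pooled_request {
    struct pooled_request *next;     /* free-list link */
    const void            *datatype; /* retained handle, NULL when clean */
    const void            *op;       /* retained handle, NULL when clean */
} pooled_request_t;

/* Hypothetical LIFO free list. */
typedef struct { pooled_request_t *head; } pool_t;

/* Take an item from the pool; returns NULL if the pool is empty. */
static pooled_request_t *pool_get(pool_t *p) {
    pooled_request_t *r = p->head;
    if (r != NULL) {
        p->head = r->next;
        r->next = NULL;
    }
    return r;
}

/* Put an item back; the pool itself does NOT reset the item's fields. */
static void pool_return(pool_t *p, pooled_request_t *r) {
    r->next = p->head;
    p->head = r;
}

/* Completion callback: clear stale state, then recycle the item. */
static void on_complete(pool_t *p, pooled_request_t *r) {
    assert(p != NULL && r != NULL);
    r->datatype = NULL;
    r->op = NULL;
    pool_return(p, r);
}
```

Without the two resets in `on_complete`, the next caller of `pool_get` would see the previous operation's handles and could release them a second time.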
@AboorvaDevarajan @amaslenn could you please give this PR a try?
@bosilca can you please review this?
With the patch But the tests below still seem to fail for np > 2,
@jladd-mlnx Coverity fails on MLNX's Jenkins (licence issue).
Could you please have it fixed? I have some PRs that clearly need it before I merge into master.
Looks like the issue is still here:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: (128)
Failing at address: (nil)
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fb7ed51e5e0]
[ 1] /ompi/__install/lib/libmpi.so.0(ompi_coll_base_retain_datatypes_w+0x7c)[0x7fb7ed7cb0cc]
[ 2] /ompi/__install/lib/libmpi.so.0(PMPI_Ialltoallw+0x2c1)[0x7fb7ed78fa21]
[ 3] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401c28]
[ 4] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401f81]
[ 5] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb7ed16dc05]
[ 6] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401af9]
*** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A non-blocking collective might return ompi_request_null, so we should not retain anything in that case.
Signed-off-by: Gilles Gouaillardet <[email protected]>
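A minimal sketch of the guard described in this commit message, under stated assumptions: `request_t`, `obj_t`, `REQUEST_NULL`, and `retain_datatype` are hypothetical stand-ins rather than Open MPI symbols. The idea is that when the collective hands back the shared null request, there is no per-operation request to attach references to, so the retain step is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference-counted datatype. */
typedef struct { int refcount; } obj_t;

/* Hypothetical request that can hold one retained datatype. */
typedef struct { obj_t *datatype; } request_t;

/* Stand-in for a shared "null request" sentinel. */
static request_t null_request;
#define REQUEST_NULL (&null_request)

/* Retain dt on behalf of req unless req is the null sentinel.
 * Returns 1 if a reference was taken, 0 if the call was a no-op. */
static int retain_datatype(request_t *req, obj_t *dt) {
    assert(req != NULL && dt != NULL);
    if (req == REQUEST_NULL) {
        return 0; /* nothing will ever complete, so nothing to release later */
    }
    dt->refcount++;
    req->datatype = dt;
    return 1;
}
```

Skipping the retain for the sentinel also avoids writing into an object that is shared by every caller.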
@amaslenn I pushed a new commit that should fix this issue.
@AboorvaDevarajan I cannot reproduce any crash, could you please upload a stack trace?
:bot:aws:retest
@AboorvaDevarajan, @amaslenn, how does this PR look now that it's been updated?
Will want this on v4.0.x after #6863 goes in.
My tests have passed 👍
It's all passing now with the updated fixes, thanks 👍
Thanks! I issued #6889 to backport the fix to the
Refs. #6870
Refs. #6876