Skip to content

coll/{base,libnbc}: fix datatypes/operator retention #6880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 12, 2019

Conversation

ggouaillardet
Copy link
Contributor

Refs. #6870
Refs. #6876

base ompi_coll_libnbc_request_t on top of ompi_coll_base_nbc_request_t
to correctly support the retention of datatypes/operators

This fixes a regression introduced in open-mpi/ompi@0fe756d

Signed-off-by: Gilles Gouaillardet <[email protected]>
Since ompi_coll_base_nbc_request_t is to be used in an
opal_free_list_t, it must be returned into a "clean" state.
So cleanup some data in the callback completion subroutines.

This fixes a regression introduced in open-mpi/ompi@0fe756d

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Copy link
Contributor Author

@AboorvaDevarajan @amaslenn could you please give this PR a try ?

@bosilca can you please review this?
Then I will merge and backport into v4.0.x

@AboorvaDevarajan
Copy link
Member

With the patch igather_gap and iscatter_gap tests are passing, I ran the tests on 2 nodes (20 procs on each node) for about 10+ iterations, it all passed.

But the below tests still seems to fail for np > 2,

ireduce_nocommute_inter 
ireduce_nocommute_stride_inter
ireduce_nocommute_gap_inter

@ggouaillardet
Copy link
Contributor Author

@jladd-mlnx Coverity fails on MLNX's Jenkins (licence issue)

+ cov-analyze --dir /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build/all
Coverity Static Analysis version 8.5.0 on Linux 3.10.0-327.el7.x86_64 x86_64
Internal version numbers: c8d197a567 p-kent-push-26368.915

[FATAL] No license file (license.dat) or license configuration file (license.config) found in the default location: /hpc/local/commercial/coverity/cov-sa/bin
++ cov-format-errors --dir /var/lib/jenkins/jobs/gh-ompi-master-pr/workspace/cov_build/all
++ awk '/Processing [0-9]+ errors?/ { print $2 }'
[FATAL] No license file (license.dat) or license configuration file (license.config) found in the default location: /hpc/local/commercial/coverity/cov-sa/bin

Could you please have it fixed ?

I have some PR that would clearly need it before I merge into master.

@amaslenn
Copy link

amaslenn commented Aug 8, 2019

Looks like the issue is still here:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code:  (128)
Failing at address: (nil)
[ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x7fb7ed51e5e0]
[ 1] /ompi/__install/lib/libmpi.so.0(ompi_coll_base_retain_datatypes_w+0x7c)[0x7fb7ed7cb0cc]
[ 2] /ompi/__install/lib/libmpi.so.0(PMPI_Ialltoallw+0x2c1)[0x7fb7ed78fa21]
[ 3] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401c28]
[ 4] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401f81]
[ 5] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb7ed16dc05]
[ 6] /mpich-3.3.1/test/mpi/coll/nbicalltoallw[0x401af9]
*** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

a non blocking collective might return ompi_request_null, so we should not
retain anything in that case.

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Copy link
Contributor Author

@amaslenn I pushed a new commit that should fix this issue

@AboorvaDevarajan I cannot reproduce any crash, could you please upload a stack trace ?

@ggouaillardet ggouaillardet requested a review from bosilca August 9, 2019 01:07
@ggouaillardet
Copy link
Contributor Author

:bot:aws:retest

@gpaulsen
Copy link
Member

gpaulsen commented Aug 9, 2019

@AboorvaDevarajan , @amaslenn, How does this PR look now that it's been updated?

@gpaulsen
Copy link
Member

gpaulsen commented Aug 9, 2019

Will want this on v4.0.x after #6863 goes in.

@amaslenn
Copy link

@AboorvaDevarajan , @amaslenn, How does this PR look now that it's been updated?

My tests have passed 👍

@AboorvaDevarajan
Copy link
Member

It's all passing now with the updated fixes, Thanks 👍

@ggouaillardet ggouaillardet merged commit cf4910b into open-mpi:master Aug 12, 2019
@ggouaillardet
Copy link
Contributor Author

Thanks ! I issued #6889 to backported the fix to the v4.0.x branch

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants