coll/nbc: non blocking collective fails with ompi master #6870

Closed
AboorvaDevarajan opened this issue Aug 7, 2019 · 3 comments

Comments

@AboorvaDevarajan
Member

AboorvaDevarajan commented Aug 7, 2019

Background Information

A few of the non-blocking collective test cases in the IBM suite fail with the recent ompi master branch.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OMPI version : master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: RHEL 7.3
  • Computer hardware: ppc64le
  • Network type: IB

Details of the problem

Here is the list of non-blocking collective test cases that fail:

collective/igather_gap 
collective/iscatter_gap 
collective/intercomm/ireduce_nocommute_inter 
collective/intercomm/ireduce_nocommute_stride_inter 
collective/intercomm/ireduce_nocommute_gap_inter 
[smpici@c712f6n06 test]$ mpirun -mca pml ob1 -host c712f6n06:1,c712f6n07:1 -np 2  -x LD_LIBRARY_PATH --prefix /nfs_smpi_ci/abd/os/ompi-install/ ./test2
[c712f6n06:14673:0:14673] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe86300a87d2903a6)
==== backtrace ====
    0  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x25650) [0x100003915650]
    1  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x259a4) [0x1000039159a4]
    2  [0x100000050478]
    3  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_request_finalize+0x54) [0x1000000e5f24]
    4  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_mpi_finalize+0x990) [0x1000000e8e30]
    5  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(PMPI_Finalize+0x44) [0x1000001155c4]
    6  ./test2() [0x10000ff8]
    7  /lib64/libc.so.6(+0x25200) [0x100000255200]
    8  /lib64/libc.so.6(__libc_start_main+0xc4) [0x1000002553f4]
===================

==== backtrace ====
/lib64/libc.so.6(+0x25200)[0x100000255200]
[c712f6n06:14673] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x1000002553f4]
[c712f6n06:14673] *** End of error message ***
    0  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x25650) [0x100003915650]
    1  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x259a4) [0x1000039159a4]
    2  [0x100000050478]
    3  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_request_finalize+0x54) [0x1000000e5f24]
    4  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_mpi_finalize+0x990) [0x1000000e8e30]
    5  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(PMPI_Finalize+0x44) [0x1000001155c4]
    6  ./test2() [0x10000ff8]
    7  /lib64/libc.so.6(+0x25200) [0x100000255200]
    8  /lib64/libc.so.6(__libc_start_main+0xc4) [0x1000002553f4]
===================

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node c712f6n06 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Some additional info

When I revert this patch, the issue is not seen:
0fe756d

One issue I think I see in the patch is that nbc_req_cons is never called, since the ompi_coll_base_nbc_request_t object is not dynamically allocated, so req->data.objs.objs appears to carry garbage values.

@AboorvaDevarajan changed the title from "coll/nbc: non blocking collective failswith ompi master" to "coll/nbc: non blocking collective fails with ompi master" on Aug 7, 2019
@ggouaillardet
Contributor

I'll have a look.
The requests are pulled from a free list, so the constructor should be invoked.

@ggouaillardet
Contributor

Oh, I see.

ompi_coll_libnbc_request_t should have ompi_coll_base_nbc_request_t as a parent instead of ompi_request_t.

@AboorvaDevarajan
Member Author

I reran the tests on the recent branch and they pass with these fixes (#6880, #6889), so I am closing this.
