
coll/nbc: non blocking collective fails with ompi master #6870

Closed
@AboorvaDevarajan

Description


Background Information

A few of the non-blocking collective test cases in the IBM test suite fail with the recent OMPI master branch.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OMPI version: master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version: RHEL 7.3
  • Computer hardware: ppc64le
  • Network type: IB

Details of the problem

Here is the list of non-blocking collective test cases that fail:

collective/igather_gap 
collective/iscatter_gap 
collective/intercomm/ireduce_nocommute_inter 
collective/intercomm/ireduce_nocommute_stride_inter 
collective/intercomm/ireduce_nocommute_gap_inter 
[smpici@c712f6n06 test]$ mpirun -mca pml ob1 -host c712f6n06:1,c712f6n07:1 -np 2  -x LD_LIBRARY_PATH --prefix /nfs_smpi_ci/abd/os/ompi-install/ ./test2
[c712f6n06:14673:0:14673] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe86300a87d2903a6)
==== backtrace ====
    0  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x25650) [0x100003915650]
    1  /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x259a4) [0x1000039159a4]
    2  [0x100000050478]
    3  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_request_finalize+0x54) [0x1000000e5f24]
    4  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_mpi_finalize+0x990) [0x1000000e8e30]
    5  /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(PMPI_Finalize+0x44) [0x1000001155c4]
    6  ./test2() [0x10000ff8]
    7  /lib64/libc.so.6(+0x25200) [0x100000255200]
    8  /lib64/libc.so.6(__libc_start_main+0xc4) [0x1000002553f4]
===================

==== backtrace ====
/lib64/libc.so.6(+0x25200)[0x100000255200]
[c712f6n06:14673] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x1000002553f4]
[c712f6n06:14673] *** End of error message ***
===================

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node c712f6n06 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Some additional info

When I revert this patch, the issue is not seen:
0fe756d

At least one issue I think I'm seeing in the patch is that the nbc_req_cons constructor will never be called, since the ompi_coll_base_nbc_request_t object is not dynamically allocated; as a result, req->data.objs.objs seems to be propagating garbage values.
