Background Information
A few of the non-blocking collective test cases in the IBM test suite fail with a recent OMPI master branch.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
OMPI version : master
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
Please describe the system on which you are running
- Operating system/version: RHEL 7.3
- Computer hardware: ppc64le
- Network type: IB
Details of the problem
Here is the list of non-blocking collective test cases that fail:
collective/igather_gap
collective/iscatter_gap
collective/intercomm/ireduce_nocommute_inter
collective/intercomm/ireduce_nocommute_stride_inter
collective/intercomm/ireduce_nocommute_gap_inter
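The sources for ./test2 and the IBM suite are not included here; as a rough, hypothetical sketch, the pattern these tests exercise is a non-blocking collective that is started, completed, and then followed by MPI_Finalize (where the crash in the backtrace below occurs). Buffers and names here are illustrative only, not the actual test code:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendbuf = rank;
    int *recvbuf = (rank == 0) ? malloc(size * sizeof(int)) : NULL;

    /* non-blocking collective, completed before finalize */
    MPI_Request req;
    MPI_Igather(&sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(recvbuf);
    MPI_Finalize();   /* the segfault reported below shows up inside finalize */
    return 0;
}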
[smpici@c712f6n06 test]$ mpirun -mca pml ob1 -host c712f6n06:1,c712f6n07:1 -np 2 -x LD_LIBRARY_PATH --prefix /nfs_smpi_ci/abd/os/ompi-install/ ./test2
[c712f6n06:14673:0:14673] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xe86300a87d2903a6)
==== backtrace ====
0 /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x25650) [0x100003915650]
1 /nfs_smpi_ci/abd/os/ucx-install/lib/libucs.so.0(+0x259a4) [0x1000039159a4]
2 [0x100000050478]
3 /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_request_finalize+0x54) [0x1000000e5f24]
4 /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(ompi_mpi_finalize+0x990) [0x1000000e8e30]
5 /nfs_smpi_ci/abd/os/ompi-install/lib/libmpi.so.0(PMPI_Finalize+0x44) [0x1000001155c4]
6 ./test2() [0x10000ff8]
7 /lib64/libc.so.6(+0x25200) [0x100000255200]
8 /lib64/libc.so.6(__libc_start_main+0xc4) [0x1000002553f4]
===================
/lib64/libc.so.6(+0x25200)[0x100000255200]
[c712f6n06:14673] [ 6] /lib64/libc.so.6(__libc_start_main+0xc4)[0x1000002553f4]
[c712f6n06:14673] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node c712f6n06 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Some additional info
When I revert this patch, the issue is not seen:
0fe756d
At least one issue I think I see in the patch is that nbc_req_cons will never be called, because the ompi_coll_base_nbc_request_t object is not dynamically allocated; as a result, req->data.objs.objs appears to be picking up garbage values.
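To make that hypothesis concrete, here is a small self-contained sketch (plain C, not OMPI code; nbc_req_cons and the objs field are stand-ins for the real names) of how a constructor that only runs on one allocation path leaves an object created another way with uninitialized fields:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct base_request {          /* stands in for the base request type */
    int complete;
} base_request_t;

typedef struct nbc_request {           /* stands in for ompi_coll_base_nbc_request_t */
    base_request_t super;
    void **objs;                       /* stands in for req->data.objs.objs */
} nbc_request_t;

/* stands in for nbc_req_cons: initializes the derived fields */
static void nbc_req_cons(nbc_request_t *req) { req->objs = NULL; }

int main(void)
{
    /* Path 1: created through the path that runs the constructor -> field is sane. */
    nbc_request_t *good = malloc(sizeof(*good));
    nbc_req_cons(good);

    /* Path 2 (the suspected path): the object is not created through that path,
     * so the constructor never runs and 'objs' holds whatever was in memory. */
    nbc_request_t *bad = malloc(sizeof(*bad));
    memset(bad, 0xA5, sizeof(*bad));   /* simulate reused, uninitialized memory */
    /* nbc_req_cons(bad) is never called */

    printf("good->objs = %p\n", (void *)good->objs);
    printf("bad->objs  = %p (garbage)\n", (void *)bad->objs);

    free(good);
    free(bad);
    return 0;
}

If the cleanup code later dereferences or releases whatever objs points at (as request finalization would), a garbage value like this would explain the segfault in ompi_request_finalize above.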