Closed
Description
ARMCI-MPI is experiencing incorrect results with shared-memory accumulate with Open-MPI 4.0. Version 3.x is fine.
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Homebrew or built from source.
Please describe the system on which you are running
- Operating system/version: Mac or Linux
- Computer hardware: Intel CPUs.
- Network type: Shared-memory communication.
Details of the problem
https://travis-ci.org/pmodels/armci-mpi/jobs/477111650 is but one example.
Local testing
MacOS Mojave
Open-MPI from Homebrew:
jrhammon-mac02:prk-repo jrhammon$ mpicc -show
clang -I/usr/local/Cellar/open-mpi/4.0.0/include -L/usr/local/opt/libevent/lib -L/usr/local/Cellar/open-mpi/4.0.0/lib -lmpi
jrhammon-mac02:prk-repo jrhammon$ brew info open-mpi
open-mpi: stable 4.0.0 (bottled), HEAD
High performance message passing library
https://www.open-mpi.org/
Conflicts with:
mpich (because both install MPI compiler wrappers)
/usr/local/Cellar/open-mpi/4.0.0 (753 files, 10.7MB) *
Poured from bottle on 2019-01-14 at 11:27:52
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/open-mpi.rb
==> Dependencies
Required: gcc ✔, libevent ✔
==> Options
--HEAD
Install HEAD version
==> Analytics
install: 8,380 (30 days), 32,457 (90 days), 95,919 (365 days)
install_on_request: 2,965 (30 days), 11,847 (90 days), 37,813 (365 days)
build_error: 0 (30 days)
Not using datatypes works fine:
ARMCI_STRIDED_METHOD=IOV ARMCI_IOV_METHOD=BATCHED ARMCI_USE_WIN_ALLOCATE=1 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
ARMCI_STRIDED_METHOD=IOV ARMCI_IOV_METHOD=BATCHED ARMCI_USE_WIN_ALLOCATE=0 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
Using datatypes fails differently depending on whether windows are allocated or created:
ARMCI_STRIDED_METHOD=DIRECT ARMCI_IOV_METHOD=DIRECT ARMCI_USE_WIN_ALLOCATE=1 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
fails here:
Testing non-blocking vector gets and puts
Now veryfying the vector put data for correctness
Puts OK
1:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0
:element index=0 elem was supposed to be 3.890000 but is 0.000000
2:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0
:element index=0 elem was supposed to be 4.890000 but is 0.000000
3:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0
:element index=0 elem was supposed to be 5.890000 but is 0.000000
Now veryfying the vector get data for correctness
0:while verifying data of a op from proc=1 giov index=0 ptr_arr_index=0
:element index=0 elem was supposed to be 3.890000 but is 1010105040100.000000
[3] ARMCI Error: vector non-blocking failed
[0] ARMCI Error: vector non-blocking failed
[1] ARMCI Error: vector non-blocking failed
[2] ARMCI Error: vector non-blocking failed
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 0.
ARMCI_STRIDED_METHOD=DIRECT ARMCI_IOV_METHOD=DIRECT ARMCI_USE_WIN_ALLOCATE=0 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
ARMCI test program (4 processes)
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
Testing strided gets and puts
(Only std output for process 0 is printed)
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
--------array[5]--------
local[2:4] -> remote[2:4] -> local[0:2]
ERROR: a [2] (proc=2):2.000000 b [0] 0.000000
A = 2.000000 B = 0.000000
ERROR: a [2] (proc=1):2.000000 b [0] 0.000000
A = 2.000000 B = 0.000000
ERROR: a [2] (proc=0):2.000000
[jrhammon-mac02:19084] *** Process received signal ***
[jrhammon-mac02:19084] Signal: Abort trap: 6 (6)
[jrhammon-mac02:19084] Signal code: (0)
[jrhammon-mac02:19084] [ 0] 0 libsystem_platform.dylib 0x00007fff63f33b3d _sigtramp + 29
[jrhammon-mac02:19084] [ 1] 0 libsystem_c.dylib 0x00007fff63d95000 __dso_handle + 0
[jrhammon-mac02:19084] [ 2] 0 libsystem_c.dylib 0x00007fff63df11c9 abort + 127
[jrhammon-mac02:19084] [ 3] 0 libsystem_c.dylib 0x00007fff63df133c _UTF2_init + 0
[jrhammon-mac02:19084] [ 4] 0 libsystem_c.dylib 0x00007fff63e15c8e __chk_fail_overlap + 0
[jrhammon-mac02:19084] [ 5] 0 libsystem_c.dylib 0x00007fff63e15c5e __chk_fail + 0
[jrhammon-mac02:19084] [ 6] 0 libsystem_c.dylib 0x00007fff63e1615b __memcpy_chk + 0
[jrhammon-mac02:19084] [ 7] 0 armci-test 0x0000000106d59f68 compare_patches + 1656
[jrhammon-mac02:19084] [ 8] 0 armci-test 0x0000000106d5aa52 test_dim + 1682
[jrhammon-mac02:19084] [ 9] 0 armci-test 0x0000000106d61fa8 main + 344
[jrhammon-mac02:19084] [10] 0 libdyld.dylib 0x00007fff63d48ed9 start + 1
[jrhammon-mac02:19084] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 0.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1] ARMCI Error: Bailing out
[2] ARMCI Error: Bailing out
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node jrhammon-mac02 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
[jrhammon-mac02.ra.intel.com:19083] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[jrhammon-mac02.ra.intel.com:19083] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
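For convenience, the four configurations exercised in this report (IOV/BATCHED versus DIRECT/DIRECT, each with ARMCI_USE_WIN_ALLOCATE set to 1 and 0) can be swept in one loop. This is only a sketch: the environment variables and the mpirun command line are copied from the runs in this report, while the `DRY_RUN` switch and the `run_matrix` function name are hypothetical helpers introduced here. By default the commands are printed rather than executed.

```shell
#!/bin/sh
# Sweep the four ARMCI-MPI configurations from this report.
# DRY_RUN and run_matrix are illustrative names, not part of ARMCI-MPI or Open MPI.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 launches them.
DRY_RUN="${DRY_RUN:-1}"

run_matrix() {
    # Each pair is STRIDED_METHOD:IOV_METHOD, as used in the report.
    for method in IOV:BATCHED DIRECT:DIRECT; do
        strided="${method%%:*}"
        iov="${method##*:}"
        for alloc in 1 0; do
            cmd="ARMCI_STRIDED_METHOD=$strided ARMCI_IOV_METHOD=$iov ARMCI_USE_WIN_ALLOCATE=$alloc mpirun -oversubscribe -n 4 ./tests/contrib/armci-test"
            if [ "$DRY_RUN" = "1" ]; then
                echo "$cmd"
            else
                echo "== $cmd"
                env ARMCI_STRIDED_METHOD="$strided" \
                    ARMCI_IOV_METHOD="$iov" \
                    ARMCI_USE_WIN_ALLOCATE="$alloc" \
                    mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
                # Report the status of mpirun for this configuration.
                echo "exit status: $?"
            fi
        done
    done
}

run_matrix
```

Running it with `DRY_RUN=0` from the ARMCI-MPI build directory should reproduce the passing IOV runs and the two distinct DIRECT failures shown above, assuming the same Open-MPI 4.0.0 installation.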