RMA accumulate is busted in 4.0 #6275

@jeffhammond

ARMCI-MPI is experiencing incorrect results with shared-memory accumulate with Open-MPI 4.0. Version 3.x is fine.

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Homebrew or built from source.

Please describe the system on which you are running

  • Operating system/version: Mac or Linux
  • Computer hardware: Intel CPUs.
  • Network type: Shared-memory communication.

Details of the problem

https://travis-ci.org/pmodels/armci-mpi/jobs/477111650 is but one example.

Local testing

macOS Mojave

Open-MPI from Homebrew:

jrhammon-mac02:prk-repo jrhammon$ mpicc -show
clang -I/usr/local/Cellar/open-mpi/4.0.0/include -L/usr/local/opt/libevent/lib -L/usr/local/Cellar/open-mpi/4.0.0/lib -lmpi
jrhammon-mac02:prk-repo jrhammon$ brew info open-mpi
open-mpi: stable 4.0.0 (bottled), HEAD
High performance message passing library
https://www.open-mpi.org/
Conflicts with:
  mpich (because both install MPI compiler wrappers)
/usr/local/Cellar/open-mpi/4.0.0 (753 files, 10.7MB) *
  Poured from bottle on 2019-01-14 at 11:27:52
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/open-mpi.rb
==> Dependencies
Required: gcc ✔, libevent ✔
==> Options
--HEAD
	Install HEAD version
==> Analytics
install: 8,380 (30 days), 32,457 (90 days), 95,919 (365 days)
install_on_request: 2,965 (30 days), 11,847 (90 days), 37,813 (365 days)
build_error: 0 (30 days)

Not using datatypes works fine:

ARMCI_STRIDED_METHOD=IOV ARMCI_IOV_METHOD=BATCHED ARMCI_USE_WIN_ALLOCATE=1 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
ARMCI_STRIDED_METHOD=IOV ARMCI_IOV_METHOD=BATCHED ARMCI_USE_WIN_ALLOCATE=0 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test

Using datatypes fails differently depending on whether windows are allocated or created:

ARMCI_STRIDED_METHOD=DIRECT ARMCI_IOV_METHOD=DIRECT ARMCI_USE_WIN_ALLOCATE=1 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test

fails here:

Testing non-blocking vector gets and puts

	Now veryfying the vector put data for correctness
	Puts OK


1:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0 
 :element index=0 elem was supposed to be 3.890000 but is 0.000000
2:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0 
 :element index=0 elem was supposed to be 4.890000 but is 0.000000
3:while verifying data of a op from proc=0 giov index=0 ptr_arr_index=0 
 :element index=0 elem was supposed to be 5.890000 but is 0.000000	Now veryfying the vector get data for correctness
0:while verifying data of a op from proc=1 giov index=0 ptr_arr_index=0 
 :element index=0 elem was supposed to be 3.890000 but is 1010105040100.000000[3] ARMCI Error: vector non-blocking failed
[0] ARMCI Error: vector non-blocking failed
[1] ARMCI Error: vector non-blocking failed
[2] ARMCI Error: vector non-blocking failed
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 0.

With ARMCI_USE_WIN_ALLOCATE=0, the same test instead fails in the strided get/put test and aborts:

ARMCI_STRIDED_METHOD=DIRECT ARMCI_IOV_METHOD=DIRECT ARMCI_USE_WIN_ALLOCATE=0 mpirun -oversubscribe -n 4 ./tests/contrib/armci-test
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
ARMCI test program (4 processes)
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!

Testing strided gets and puts
(Only std output for process 0 is printed)

[0] ARMCI Warning: MPI Datatypes are broken in RMA in older versions of Open-MPI!
--------array[5]--------
local[2:4] -> remote[2:4] -> local[0:2] 
ERROR: a [2] (proc=2):2.000000 b [0] 0.000000

A = 2.000000 B = 0.000000
ERROR: a [2] (proc=1):2.000000 b [0] 0.000000

A = 2.000000 B = 0.000000
ERROR: a [2] (proc=0):2.000000[jrhammon-mac02:19084] *** Process received signal ***
[jrhammon-mac02:19084] Signal: Abort trap: 6 (6)
[jrhammon-mac02:19084] Signal code:  (0)
[jrhammon-mac02:19084] [ 0] 0   libsystem_platform.dylib            0x00007fff63f33b3d _sigtramp + 29
[jrhammon-mac02:19084] [ 1] 0   libsystem_c.dylib                   0x00007fff63d95000 __dso_handle + 0
[jrhammon-mac02:19084] [ 2] 0   libsystem_c.dylib                   0x00007fff63df11c9 abort + 127
[jrhammon-mac02:19084] [ 3] 0   libsystem_c.dylib                   0x00007fff63df133c _UTF2_init + 0
[jrhammon-mac02:19084] [ 4] 0   libsystem_c.dylib                   0x00007fff63e15c8e __chk_fail_overlap + 0
[jrhammon-mac02:19084] [ 5] 0   libsystem_c.dylib                   0x00007fff63e15c5e __chk_fail + 0
[jrhammon-mac02:19084] [ 6] 0   libsystem_c.dylib                   0x00007fff63e1615b __memcpy_chk + 0
[jrhammon-mac02:19084] [ 7] 0   armci-test                          0x0000000106d59f68 compare_patches + 1656
[jrhammon-mac02:19084] [ 8] 0   armci-test                          0x0000000106d5aa52 test_dim + 1682
[jrhammon-mac02:19084] [ 9] 0   armci-test                          0x0000000106d61fa8 main + 344
[jrhammon-mac02:19084] [10] 0   libdyld.dylib                       0x00007fff63d48ed9 start + 1
[jrhammon-mac02:19084] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 0.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1] ARMCI Error: Bailing out
[2] ARMCI Error: Bailing out
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node jrhammon-mac02 exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------
[jrhammon-mac02.ra.intel.com:19083] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[jrhammon-mac02.ra.intel.com:19083] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
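
For anyone trying to reproduce this, the four configurations above (datatypes off/on, windows allocated/created) can be swept in one pass. The following is only a sketch: the environment variables and the mpirun command line are copied from the runs above, while ARMCI_TEST and DRY_RUN are hypothetical knobs introduced here for convenience. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sweep the four ARMCI-MPI configurations shown in this report.
# ARMCI_TEST and DRY_RUN are hypothetical helpers, not ARMCI-MPI settings.
ARMCI_TEST=${ARMCI_TEST:-./tests/contrib/armci-test}
DRY_RUN=${DRY_RUN:-1}

run_matrix() {
    # Each entry is STRIDED_METHOD:IOV_METHOD, as paired in the runs above.
    for method in IOV:BATCHED DIRECT:DIRECT; do
        strided=${method%%:*}
        iov=${method##*:}
        for alloc in 1 0; do
            # Build the command as positional parameters so it can be
            # either printed or executed without re-quoting.
            set -- env ARMCI_STRIDED_METHOD="$strided" \
                       ARMCI_IOV_METHOD="$iov" \
                       ARMCI_USE_WIN_ALLOCATE="$alloc" \
                       mpirun -oversubscribe -n 4 "$ARMCI_TEST"
            if [ "$DRY_RUN" = 1 ]; then
                echo "$*"
            else
                echo "== $*"
                "$@" || echo "FAILED (exit $?): strided=$strided iov=$iov alloc=$alloc"
            fi
        done
    done
}

run_matrix
```

With the behavior reported here, the two IOV/BATCHED runs would be expected to pass and the two DIRECT/DIRECT runs to fail on Open-MPI 4.0.0.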
