Description
I've found a bug in the handling of mildly complex datatypes, introduced in commit 639f4b1 by @bosilca, and have finally produced a small reproducer for it.
In 2.0.3, the problem can be fixed simply by copying over the prior version of opal/datatype/opal_datatype_add.c from 2.0.2. For more recent versions, the fix is not so straightforward.
The issue affects all versions of Open MPI from 2.0.3 onwards, including 4.0.1.
I've tested 2.0.2, 2.0.3, 2.0.4 and 4.0.1, and a colleague tested 2.1. All of these were built from released source tarballs, except 2.0.2, for which I used an operating system package (and also verified the result with a build from the source tarball).
- Operating system/version:
I've tested this on multiple installations of Open MPI on Linux (Debian 9 and RedHat 6).
- Computer hardware:
x86_64
- Network type:
Can be reproduced with IB, tcp, MPI_COMM_SELF and shared memory.
Details of the problem
The two attached example programs both build with a simple mpicc invocation. With an older C compiler, -std=gnu99 might be needed for the second example. Both programs are expected to exit silently on success, but they fail with diagnostic output on affected Open MPI versions.
$ mpicc -o dt openmpi_datatype.c
$ mpirun -np 1 ./dt
pack_buffer[1] = 1 != src_data[0] = 0
pack_buffer[2] = 2 != src_data[0] = 0
dst_data[1] = 1 != ref_dst_data[1] = 0
dst_data[2] = 2 != ref_dst_data[2] = 0
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[11460,1],0]
Exit code: 1
--------------------------------------------------------------------------
$ mpicc -std=gnu99 -o dt2 openmpi_datatype2.c
$ mpirun -np 1 ./dt2
data mismatch at j=1, i=0
b = {
9, 14, 13, 12, 11, 10, 9, 14,
15, 9, 10, 11, 12, 13, 16, 9,
22, 17, 18, 19, 20, 21, 22, 17,
30, 25, 26, 27, 28, 29, 30, 25 }
b expected = {
9, 14, 13, 12, 11, 10, 9, 14,
14, 9, 10, 11, 12, 13, 14, 9,
22, 17, 18, 19, 20, 21, 22, 17,
30, 25, 26, 27, 28, 29, 30, 25 }
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[11281,1],0]
Exit code: 1
--------------------------------------------------------------------------