bug in datatype handling #7019

@tjahns

Description

I've found a bug in the handling of mildly complex datatypes, introduced in commit 639f4b1 by @bosilca, and have now finally put together a small reproducer for it.

In 2.0.3, the problem can be fixed simply by copying over the prior version of opal/datatype/opal_datatype_add.c from 2.0.2, but for more recent versions the fix is not so straightforward.

The issue affects all versions of OpenMPI from 2.0.3 onwards including 4.0.1.

I've tested 2.0.2, 2.0.3, 2.0.4 and 4.0.1, and a colleague tested 2.1. All versions were built from the released source tarballs, except 2.0.2, for which I used an operating system package (and also verified the result with a build from the source tarball).

  • Operating system/version:

I've tested this on multiple installations of OpenMPI on Linux (Debian 9 and RedHat 6)

  • Computer hardware:

x86_64

  • Network type:

Can be reproduced with IB, tcp, MPI_COMM_SELF and shared memory.


Details of the problem

The two attached example programs both build with a simple mpicc invocation. If the C compiler is a little older, -std=gnu99 might be needed for the second example. Both programs are expected to exit silently on success, but they fail with diagnostic output on affected OpenMPI versions.

$ mpicc -o dt openmpi_datatype.c
$ mpirun -np 1 ./dt
pack_buffer[1] = 1 != src_data[0] = 0
pack_buffer[2] = 2 != src_data[0] = 0
dst_data[1] = 1 != ref_dst_data[1] = 0
dst_data[2] = 2 != ref_dst_data[2] = 0
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11460,1],0]
  Exit code:    1
--------------------------------------------------------------------------
$ mpicc -std=gnu99 -o dt2 openmpi_datatype2.c
$ mpirun -np 1 ./dt2
data mismatch at j=1, i=0
b = {
     9, 14, 13, 12, 11, 10,  9, 14,
    15,  9, 10, 11, 12, 13, 16,  9,
    22, 17, 18, 19, 20, 21, 22, 17,
    30, 25, 26, 27, 28, 29, 30, 25 }
b expected = {
     9, 14, 13, 12, 11, 10,  9, 14,
    14,  9, 10, 11, 12, 13, 14,  9,
    22, 17, 18, 19, 20, 21, 22, 17,
    30, 25, 26, 27, 28, 29, 30, 25 }
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11281,1],0]
  Exit code:    1
--------------------------------------------------------------------------

openmpi_datatype.zip
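
The attached sources are not reproduced here. As a rough illustration of the kind of check such a reproducer performs, the following sketch builds the expected packed buffer by hand for an indexed layout of ints and reports mismatches in a style similar to the output above. The names reference_pack and count_mismatches are hypothetical and not taken from the attached programs, and the MPI side (for example an MPI_Pack call using a datatype created with MPI_Type_indexed) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: gather the ints selected by (blocklens[b], displs[b])
 * pairs from src into out, in block order. This is the element order one
 * would expect when packing an indexed datatype over ints.
 * Returns the number of elements written to out. */
size_t reference_pack(const int *src, const int *blocklens,
                      const int *displs, size_t nblocks, int *out)
{
    size_t n = 0;
    for (size_t b = 0; b < nblocks; ++b) {
        /* This sketch only handles non-negative lengths and displacements. */
        assert(blocklens[b] >= 0 && displs[b] >= 0);
        for (int k = 0; k < blocklens[b]; ++k)
            out[n++] = src[displs[b] + k];
    }
    return n;
}

/* Hypothetical helper: compare two buffers element by element, print one
 * diagnostic line per differing element, and return the mismatch count. */
size_t count_mismatches(const char *name, const int *got,
                        const int *expected, size_t n)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i) {
        if (got[i] != expected[i]) {
            fprintf(stderr, "%s[%zu] = %d != expected %d\n",
                    name, i, got[i], expected[i]);
            ++bad;
        }
    }
    return bad;
}
```

In a test along these lines, the buffer produced by the MPI library would be passed as got and the hand-built buffer as expected, and the program would exit with a non-zero status if count_mismatches returns anything other than 0.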
