
MPI-4: Broken P{send|recv}_init routines #10390

Closed
@dalcinl

Description

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

git branch v5.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

mkdir BUILD
cd BUILD
../configure --prefix=/home/devel/mpi/openmpi/5.0.0 --without-ofi --without-ucx --with-pmix=internal --enable-debug --enable-mem-debug --disable-man-pages --disable-sphinx && make -j 32 install

NOTE: I'm configuring without OFI and UCX; PMIx is internal, and hwloc comes from the system (Fedora 36, version 2.5.0).

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 8c39d8e6a95d6fa78e765b5e86c324bd8a4ecd56 3rd-party/openpmix (v4.1.2-58-g8c39d8e6)
 f75647a0518b5a476011f543200fca1cf8600cb8 3rd-party/prrte (v2.0.2-99-gf75647a051)

Please describe the system on which you are running

  • Operating system/version: Linux 5.17.6 (Fedora 35)
  • Computer hardware: AMD Ryzen Threadripper PRO 3995WX 64-Cores
  • Network type: isolated

Details of the problem

I could not find any Open MPI-specific test for MPI-4 partitioned communication. Therefore, I wrote my own trivial reproducer:

#include <mpi.h>

int main(int argc, char *argv[])
{
    int buf[1] = {0};
    MPI_Request sreq,rreq;

    MPI_Init(&argc, &argv);

    MPI_Psend_init(buf, 1, 1, MPI_INT, 0, 0, MPI_COMM_SELF, MPI_INFO_NULL, &sreq); /* 1 partition of 1 MPI_INT, dest 0, tag 0 */
    MPI_Precv_init(buf, 1, 1, MPI_INT, 0, 0, MPI_COMM_SELF, MPI_INFO_NULL, &rreq); /* 1 partition of 1 MPI_INT, source 0, tag 0 */

    MPI_Request_free(&sreq);
    MPI_Request_free(&rreq);
    MPI_Finalize();
    return 0;
}

$ mpicc --version
gcc (GCC) 11.3.1 20220421 (Red Hat 11.3.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ mpicc -g3 tmp.c

$ ./a.out 
[localhost:2807591] *** Process received signal ***
[localhost:2807591] Signal: Segmentation fault (11)
[localhost:2807591] Signal code: Address not mapped (1)
[localhost:2807591] Failing at address: (nil)
[localhost:2807591] [ 0] /lib64/libc.so.6(+0x55e30)[0x7f67508efe30]
[localhost:2807591] *** End of error message ***
Segmentation fault (core dumped)

Valgrind is not helpful:

$ valgrind -q ./a.out 
==2807606== Jump to the invalid address stated on the next line
==2807606==    at 0x0: ???
==2807606==    by 0x4011C8: main (tmp.c:10)
==2807606==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2807606== 
[localhost:2807606] *** Process received signal ***
[localhost:2807606] Signal: Segmentation fault (11)
[localhost:2807606] Signal code: Invalid permissions (2)
[localhost:2807606] Failing at address: (nil)
[localhost:2807606] [ 0] /lib64/libc.so.6(+0x55e30)[0x4d71e30]
[localhost:2807606] *** End of error message ***
Segmentation fault (core dumped)

I cannot see why the jump address is 0x0. The symbol is definitely in the library:

$ nm /home/devel/mpi/openmpi/5.0.0/lib/libmpi.so | grep Psend_init
0000000000133ac5 W MPI_Psend_init
0000000000133ac5 T PMPI_Psend_init

Perhaps a compiler bug?
