Closed
It looks like MPI_Comm_spawn after singleton MPI initialization is broken in the main branch.
Consider this reproducer:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;

    MPI_Init(NULL, NULL);

    /* A spawned child only disconnects from its parent. */
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL != parent)
        MPI_Comm_disconnect(&parent);

    /* The parent spawns one copy of the command given in argv[1]. */
    if (argc > 1) {
        printf("Spawning '%s' ... ", argv[1]);
        MPI_Comm_spawn(argv[1], MPI_ARGV_NULL,
                       1, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                       &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
        printf("OK\n");
    }

    MPI_Finalize();
    return 0;
}
First I run that code with Open MPI v4.1.2 (system package from Fedora 36) in the following two ways:
$ mpiexec -n 1 ./a.out ./a.out
Spawning './a.out' ... OK
$ ./a.out ./a.out
Spawning './a.out' ... OK
Note that the second way does not use mpiexec (that is, what the MPI standard calls singleton MPI initialization).
Next I run the code with ompi/main. I've configured with:
./configure \
--without-ofi \
--without-ucx \
--with-pmix=internal \
--with-prrte=internal \
--with-libevent=internal \
--with-hwloc=internal \
--enable-debug \
--enable-mem-debug \
--disable-man-pages \
--disable-sphinx
The first way (using mpiexec) seems to work just fine. The second way (singleton MPI init) fails:
$ mpiexec -n 1 ./a.out ./a.out
Spawning './a.out' ... OK
$ ./a.out ./a.out
[kw61149:1105609] OPAL ERROR: Error in file ../../ompi/dpm/dpm.c at line 2122
[kw61149:00000] *** An error occurred in MPI_Comm_spawn
[kw61149:00000] *** reported by process [440139776,0]
[kw61149:00000] *** on communicator MPI_COMM_SELF
[kw61149:00000] *** MPI_ERR_UNKNOWN: unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] *** and MPI will try to terminate your MPI job as well)
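The MPI_ERR_UNKNOWN abort above says little about what actually failed. As a rough sketch (not something run for this report), the reproducer could switch MPI_COMM_SELF to MPI_ERRORS_RETURN and print the error class and message returned by MPI_Comm_spawn. It uses only standard MPI calls; whether Open MPI returns a more specific error here is an assumption, and the state of MPI after a failed spawn is not guaranteed.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    char msg[MPI_MAX_ERROR_STRING];
    int rc, eclass, len;

    MPI_Init(NULL, NULL);

    /* A spawned child only disconnects from its parent. */
    MPI_Comm_get_parent(&parent);
    if (MPI_COMM_NULL != parent)
        MPI_Comm_disconnect(&parent);

    if (argc > 1) {
        /* Return error codes to the caller instead of aborting.
           Errors raised by MPI_Comm_spawn use the handler of the
           communicator passed to it, here MPI_COMM_SELF. */
        MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

        printf("Spawning '%s' ... ", argv[1]);
        fflush(stdout);
        rc = MPI_Comm_spawn(argv[1], MPI_ARGV_NULL,
                            1, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &intercomm, MPI_ERRCODES_IGNORE);
        if (MPI_SUCCESS != rc) {
            /* Translate the error code into a class and a message. */
            MPI_Error_class(rc, &eclass);
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "FAILED (class %d): %s\n", eclass, msg);
        } else {
            MPI_Comm_disconnect(&intercomm);
            printf("OK\n");
        }
    }

    MPI_Finalize();
    return 0;
}
```

Built with mpicc and run as ./a.out ./a.out (no mpiexec), this exercises the same singleton path as the original reproducer.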
PS: The lack of singleton MPI initialization complicates things for Python users who want to dynamically spawn MPI processes as needed via mpi4py without requiring the parent process to be launched through mpiexec.