MPI comm_spawn is hanging on disconnect #4542

@rhc54

I'm seeing this hang on the current master when running orte/test/mpi/simple_spawn.c:

$ mpirun -H rhc001:24 -n 3 ./simple_spawn
[17760257:0 pid 176346] starting up on node rhc001!
[17760257:1 pid 176347] starting up on node rhc001!
[17760257:2 pid 176348] starting up on node rhc001!
0 completed MPI_Init
2 completed MPI_Init
Parent [pid 176348] about to spawn!
Parent [pid 176346] about to spawn!
1 completed MPI_Init
Parent [pid 176347] about to spawn!
[17760258:0 pid 176355] starting up on node rhc001!
[17760258:1 pid 176356] starting up on node rhc001!
[17760258:2 pid 176357] starting up on node rhc001!
Parent done with spawn
Parent sending message to child
Parent done with spawn
Parent done with spawn
2 completed MPI_Init
Hello from the child 2 of 3 on host rhc001 pid 176357
0 completed MPI_Init
Hello from the child 0 of 3 on host rhc001 pid 176355
1 completed MPI_Init
Hello from the child 1 of 3 on host rhc001 pid 176356
Parent disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
Child 0 received msg: 38
<hang forever>

I'm not sure when this started. @ggouaillardet, would you have a chance to take a peek? If it is PMIx-related, please let me know and I'll dive into it.
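
For anyone triaging from the transcript alone, it may help to tally which processes actually printed a "disconnected" line. The Python sketch below does that for the tail of the output above. The function name `missing_disconnects` is hypothetical. The expectation that each of the 3 parents and 3 children prints exactly one such line is an assumption read off the output, not taken from simple_spawn.c.

```python
import re


def missing_disconnects(log_text, nprocs=3):
    """Return (parents_missing, child_ranks_missing) for a simple_spawn transcript.

    Assumption: every parent prints "Parent disconnected" once and every
    child prints "Child <rank> disconnected" once when it gets that far.
    """
    parents_done = 0
    children_done = set()
    for line in log_text.splitlines():
        line = line.strip()
        if line == "Parent disconnected":
            # Parent lines carry no rank, so they can only be counted.
            parents_done += 1
            continue
        match = re.fullmatch(r"Child (\d+) disconnected", line)
        if match:
            children_done.add(int(match.group(1)))
    missing_children = sorted(set(range(nprocs)) - children_done)
    return nprocs - parents_done, missing_children


# Tail of the transcript from this report, copied verbatim.
LOG = """\
Parent disconnected
Child 2 disconnected
Parent disconnected
Child 1 disconnected
Child 0 received msg: 38
"""

parents_missing, children_missing = missing_disconnects(LOG)
print(parents_missing, children_missing)  # prints: 1 [0]
```

On the output above, one parent and child rank 0 never report a disconnect. Child 0 is the process that received the message; the log does not say which parent rank is the one still stuck.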
