Description
On the trunk and v1.5 branches (r22536), the IBM test loop_spawn is hanging. The exact iteration on which it hangs is nondeterministic; it hangs for me somewhere around iteration 200.
I'm running on two 4-core Linux nodes as follows:
$ mpirun -np 3 -bynode loop_spawn
parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
parent: MPI_Comm_spawn #40 return : 0
parent: MPI_Comm_spawn #60 return : 0
parent: MPI_Comm_spawn #80 return : 0
parent: MPI_Comm_spawn #100 return : 0
parent: MPI_Comm_spawn #120 return : 0
parent: MPI_Comm_spawn #140 return : 0
parent: MPI_Comm_spawn #160 return : 0
[...hang...]
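For context, the structure the backtraces below imply for the parent side of the test is roughly the following. This is a hedged reconstruction, ''not'' the actual IBM test source: the iteration limit, the per-iteration MPI_Comm_disconnect, and the use of MPI_INFO_NULL and MPI_COMM_WORLD are assumptions; only the spawn target ("./loop_child", maxprocs=1, root=0) and the progress message come from the output and backtraces above.

```c
/* Sketch of the parent side (loop_spawn); assumptions noted inline.
 * Build with: mpicc -o loop_spawn loop_spawn_sketch.c */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm child;
    int i, rc, errcode;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 2000; ++i) {               /* iteration count: assumed */
        /* All parents collectively spawn one child per iteration;
         * comm and info arguments are assumptions. */
        rc = MPI_Comm_spawn("./loop_child", MPI_ARGV_NULL, 1,
                            MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                            &child, &errcode);
        if (0 == i % 20) {
            printf("parent: MPI_Comm_spawn #%d return : %d\n", i, rc);
        }
        MPI_Comm_disconnect(&child);           /* per-iteration cleanup: assumed */
    }
    MPI_Finalize();
    return 0;
}
```

The child presumably just calls MPI_Init (which performs the connect-back to the parent via dyn_init, as seen in its backtrace below), then disconnects and exits.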
Note that this does ''not'' happen on the v1.4 branch; the test seems to work fine there. This suggests that something has changed on the trunk/v1.5 that caused the problem.
'''SIDENOTE:''' When using the openib BTL with this test on the v1.4 branch, the test fails much later (around iteration 1300 for me) because of what looks like a problem in the openib BTL; see https://svn.open-mpi.org/trac/ompi/ticket/1928.
I was unable to determine ''why'' it was hanging. The backtraces from 2 of the 3 parent processes appear to be nearly the same:
(gdb) bt
#0 0x0000002a9625590c in epoll_wait () from /lib64/tls/libc.so.6
#1 0x0000002a95a7016c in epoll_dispatch (base=0x519f20, arg=0x519db0,
tv=0x7fbfffd110) at epoll.c:210
#2 0x0000002a95a6d83b in opal_event_base_loop (base=0x519f20, flags=2)
at event.c:823
#3 0x0000002a95a6d568 in opal_event_loop (flags=2) at event.c:746
#4 0x0000002a95a49cb2 in opal_progress () at runtime/opal_progress.c:189
#5 0x0000002a958d458f in orte_grpcomm_base_allgather_list (
names=0x7fbfffd410, sbuf=0x7fbfffd2e0, rbuf=0x7fbfffd280)
at base/grpcomm_base_allgather.c:155
#6 0x0000002a958d5535 in orte_grpcomm_base_full_modex (procs=0x7fbfffd410,
modex_db=true) at base/grpcomm_base_modex.c:115
#7 0x0000002a969fb470 in modex (procs=0x7fbfffd410)
at grpcomm_bad_module.c:607
#8 0x0000002a9d418f67 in connect_accept (comm=0x5012e0, root=0,
port_string=0x7fbfffd590 "", send_first=false, newcomm=0x7fbfffd990)
at dpm_orte.c:375
#9 0x0000002a956c909a in PMPI_Comm_spawn (
command=0x7fbfffda00 "./loop_child", argv=0x0, maxprocs=1, info=0x5016e0,
root=0, comm=0x5012e0, intercomm=0x7fbfffdc20,
array_of_errcodes=0x7fbfffdc38) at pcomm_spawn.c:126
#10 0x0000000000400c86 in main (argc=1, argv=0x7fbfffdd28) at loop_spawn.c:34
(gdb)
Here's a backtrace from one of the two children:
(gdb) bt
#0 0x0000002a9623df89 in sched_yield () from /lib64/tls/libc.so.6
#1 0x0000002a95a49d0d in opal_progress () at runtime/opal_progress.c:220
#2 0x0000002a958d458f in orte_grpcomm_base_allgather_list (
names=0x7fbfffd7a0, sbuf=0x7fbfffd670, rbuf=0x7fbfffd610)
at base/grpcomm_base_allgather.c:155
#3 0x0000002a958d5535 in orte_grpcomm_base_full_modex (procs=0x7fbfffd7a0,
modex_db=true) at base/grpcomm_base_modex.c:115
#4 0x0000002a969fb470 in modex (procs=0x7fbfffd7a0)
at grpcomm_bad_module.c:607
#5 0x0000002a97673f67 in connect_accept (comm=0x5016d0, root=0,
port_string=0x674a60 "4103012352.0;tcp://172.29.218.140:55452;tcp://10.10.2\
0.250:55452;tcp://10.10.30.250:55452+4103012353.0;tcp://172.29.218.202:54128;tc\
p://10.10.20.202:54128;tcp://10.10.30.202:54128:562", send_first=true,
newcomm=0x7fbfffd940) at dpm_orte.c:375
#6 0x0000002a97675f27 in dyn_init () at dpm_orte.c:946
#7 0x0000002a956ae7f0 in ompi_mpi_init (argc=1, argv=0x7fbfffdc78,
requested=0, provided=0x7fbfffdb48) at runtime/ompi_mpi_init.c:846
#8 0x0000002a956d5fd1 in PMPI_Init (argc=0x7fbfffdb9c, argv=0x7fbfffdb90)
at pinit.c:84
#9 0x0000000000400b14 in main (argc=1, argv=0x7fbfffdc78) at loop_child.c:17
(gdb)
So they all appear to be in a modex. Beyond that, I am unfamiliar with this portion of the code base...