Hi,
After 10-20 hours, MPI_Barrier hangs with the following stack.
It looks like some error happens during mca_btl_tcp_proc_create, and then mca_btl_tcp_proc_destruct cannot proceed because mca_btl_tcp_component.tcp_lock is not recursive.
The stack below has been captured using OMPI 2.0.2a1, and the OMPI 2.0.2 release does not seem to fix this issue.
I would be grateful for any ideas about the root cause of this error (wrong host configuration, etc.).
Unfortunately (as usual), the application is too large and I do not have a small reproducer.
Here is some information that seems relevant.
MPI is used to distribute jobs to CUDA cards inside one host.
There are no inter-host communications.
Each MPI process is multithreaded.
MPI is called from different threads but the calls are serialized at the application level.
MPI is built with --enable-thread-multiple, and the application calls MPI_Init_thread with MPI_THREAD_MULTIPLE.
The application uses 50-100 communicators, all created in the master thread at application launch; they stay alive for the lifetime of the application.
Each communicator contains all MPI processes.
All processes get stuck with identical stacks.
I used gdb to attach to the processes after they got stuck and verified that the processes passed the correct communicator to MPI_Barrier.
```
#0 __lll_lock_wait ()
   at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f0e250c7649 in _L_lock_909 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007f0e250c7470 in __GI___pthread_mutex_lock (
   mutex=0x7f0e1ad547f8 <mca_btl_tcp_component+440>)
   at ../nptl/pthread_mutex_lock.c:79
#3 0x00007f0e1ab4f76d in mca_btl_tcp_proc_destruct ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#4 0x00007f0e1ab4fce1 in mca_btl_tcp_proc_create ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#5 0x00007f0e1ab48e2c in mca_btl_tcp_add_procs ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#6 0x00007f0e1ab50270 in mca_btl_tcp_proc_lookup ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#7 0x00007f0e1ab4ac15 in mca_btl_tcp_component_recv_handler ()
   from /home/espetrov/lib/openmpi/mca_btl_tcp.so
#8 0x00007f0e238c4108 in event_process_active_single_queue ()
   from /home/espetrov/lib/libopen-pal.so.20
#9 0x00007f0e238c437c in event_process_active ()
   from /home/espetrov/lib/libopen-pal.so.20
#10 0x00007f0e238c49cb in opal_libevent2022_event_base_loop ()
   from /home/espetrov/lib/libopen-pal.so.20
#11 0x00007f0e23881894 in opal_progress ()
   from /home/espetrov/lib/libopen-pal.so.20
#12 0x00007f0e23886c2d in sync_wait_mt ()
   from /home/espetrov/lib/libopen-pal.so.20
#13 0x00007f0e24a042b9 in ompi_request_default_wait ()
   from /home/espetrov/lib/libmpi.so.20
#14 0x00007f0e24a5b97d in ompi_coll_base_barrier_intra_recursivedoubling ()
   from /home/espetrov/lib/libmpi.so.20
#15 0x00007f0e24a18284 in PMPI_Barrier () from /home/espetrov/lib/libmpi.so.20
```
```
bash$ netstat -i
Kernel Interface table
Iface   MTU   Met  RX-OK    RX-ERR RX-DRP RX-OVR TX-OK   TX-ERR TX-DRP TX-OVR Flg
eth1    8950  0    25787150 0      1      0      7448504 0      0      0      BMRU
lo      65536 0    1265039  0      0      0      1265039 0      0      0      LRU
vlan762 8950  0    1154868  0      0      0      91581   0      0      0      BMRU
```
EDIT: Added verbatim blocks