Closed
Description
I have a problem with mpi4py 2.0.0 built against Open MPI 2.0.1 (on Debian).
This is a regression compared to Open MPI 1.10.3.
The mpi4py test suite (and build) fails on the following code:
print("DEBUG: main", file=sys.stderr)
name, version = MPI.get_vendor()
print("DEBUG: vendor,version = %s, %s" % (name, version), file=sys.stderr)
cartcomm = MPI.COMM_SELF.Create_cart([1], periods=[0])
try:
    try:
        print("DEBUG: befor allgather", file=sys.stderr)
        print('DEBUG: size = %s, rank = %s' % (cartcomm.Get_size(), cartcomm.Get_rank()), file=sys.stderr)
        cartcomm.neighbor_allgather(None, None)
        print("DEBUG: After allgather", file=sys.stderr)
    except:
        print("DEBUG: in except", file=sys.stderr)
(Please excuse the print statements.) It hangs in neighbor_allgather() when the test code is run outside mpiexec, as a single process.
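Because the call hangs rather than fails, it can help to run the reproducer in a child process with a timeout, so that a hang is reported instead of blocking the whole test run. The following is a minimal sketch, not part of mpi4py or its test suite: `REPRODUCER` and `run_with_timeout` are hypothetical names, the reproducer string is adapted from the snippet above with a `Free()` call added, and the ten-second default is an arbitrary assumption.

```python
import subprocess
import sys

# Hypothetical reproducer, adapted from the snippet in this report.
# It assumes mpi4py is importable by the child interpreter.
REPRODUCER = """
from mpi4py import MPI
cartcomm = MPI.COMM_SELF.Create_cart([1], periods=[0])
try:
    cartcomm.neighbor_allgather(None, None)
finally:
    cartcomm.Free()
"""


def run_with_timeout(code, timeout_s=10.0):
    """Run `code` in a fresh interpreter and classify the outcome.

    Returns 'ok' (exit status 0), 'error' (non-zero exit status),
    or 'hang' (the child did not finish within `timeout_s` seconds).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising TimeoutExpired.
        return "hang"
    return "ok" if proc.returncode == 0 else "error"
```

Calling `run_with_timeout(REPRODUCER)` on an affected installation would be expected to report a hang, but that depends on the MPI build in use.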
Debug output:
nose.importer: DEBUG: find module part test_cco_ngh_buf (test_cco_ngh_buf) in ['/srv/build/mpi4py/mpi4py-2.0.0/test']
DEBUG: main
DEBUG: vendor,version = Open MPI, (2, 0, 1)
DEBUG: befor allgather
DEBUG: size = 1, rank = 0
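The output shows a communicator of size 1, and the topology was created with periods=[0]. In a non-periodic Cartesian topology a shift past the edge yields MPI_PROC_NULL, so this single rank has no real neighbors in either direction. That may be relevant to the hang. The following toy model illustrates the neighbor computation in pure Python; it does not call MPI, and `PROC_NULL`, `cart_neighbors_1d`, and `real_neighbor_count` are hypothetical names used only for illustration.

```python
PROC_NULL = None  # stand-in for MPI_PROC_NULL in this toy model


def cart_neighbors_1d(rank, size, periodic):
    """Return (left, right) neighbor ranks along one Cartesian dimension."""
    left, right = rank - 1, rank + 1
    if periodic:
        # Periodic dimensions wrap around, so every rank has two neighbors.
        return left % size, right % size
    # Non-periodic dimensions have no neighbor beyond either edge.
    return (
        left if left >= 0 else PROC_NULL,
        right if right < size else PROC_NULL,
    )


def real_neighbor_count(rank, size, periodic):
    """Count the neighbors that are actual ranks rather than PROC_NULL."""
    neighbors = cart_neighbors_1d(rank, size, periodic)
    return sum(n is not PROC_NULL for n in neighbors)
```

For the case in this report, `real_neighbor_count(0, 1, periodic=False)` is 0, whereas a periodic dimension of size 1 would make rank 0 its own neighbor on both sides.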
I have a debug build of Open MPI (--enable-debug), so I can get a backtrace:
0x00007f1ab185d90c in ompi_request_default_wait_all (count=0, requests=<optimized out>, statuses=<optimized out>) at request/req_wait.c:346
346     request/req_wait.c: No such file or directory.
(gdb) where
#0 0x00007f1ab185d90c in ompi_request_default_wait_all (count=0, requests=<optimized out>, statuses=<optimized out>) at request/req_wait.c:346
#1 0x00007f1a9e8c9166 in mca_coll_basic_neighbor_allgather_cart (module=0x1f3d0c0, comm=0x1fda650, rdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, rcount=1,
    rbuf=0x16d37d8, sdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, scount=<optimized out>, sbuf=0x7ffc21cad868) at coll_basic_neighbor_allgather.c:109
#2 mca_coll_basic_neighbor_allgather (sbuf=0x7ffc21cad868, scount=<optimized out>, sdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, rbuf=<optimized out>, rcount=1,
    rdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, comm=0x1fda650, module=0x1f3d0c0) at coll_basic_neighbor_allgather.c:238
#3 0x00007f1ab1883b67 in PMPI_Neighbor_allgather (sendbuf=sendbuf@entry=0x7ffc21cad868, sendcount=sendcount@entry=1, sendtype=0x7f1ab1af0ca0 <ompi_mpi_int>,
    recvbuf=recvbuf@entry=0x16d37d0, recvcount=recvcount@entry=1, recvtype=<optimized out>, comm=0x1fda650) at pneighbor_allgather.c:118
#4 0x00007f1ab1bf801d in __pyx_f_6mpi4py_3MPI_PyMPI_neighbor_allgather (__pyx_v_comm=0x1fda650, __pyx_v_sendobj=<optimized out>) at src/mpi4py.MPI.c:48950
#5 __pyx_pf_6mpi4py_3MPI_8Topocomm_22neighbor_allgather (__pyx_v_self=<optimized out>, __pyx_v_sendobj=<optimized out>) at src/mpi4py.MPI.c:50162
#6 __pyx_pw_6mpi4py_3MPI_8Topocomm_23neighbor_allgather (__pyx_v_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at src/mpi4py.MPI.c:50129
#7 0x00000000004c398a in call_function (oparg=<optimized out>, pp_stack=0x7ffc21cad990) at ../Python/ceval.c:4350
So it's stuck in WAIT_SYNC_RELEASE(&sync), and the sync struct contains:
$5 = {count = 0, status = 0, condition = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
__nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, lock = {__data = {__lock = 0, __count = 0,
__owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
__size = '\000' <repeats 39 times>, __align = 0}, next = 0x0, prev = 0x0, signaling = true}
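One possible reading of the trace is that the wait was entered with count=0: no requests were posted, so nothing ever completes and nothing signals the sync object. The sketch below is a toy model of that failure pattern in Python, not Open MPI's actual implementation. `CountdownSync`, `wait_all_naive`, and `wait_all_guarded` are hypothetical names, and the early return in the guarded version is an assumed remedy for illustration only, not a confirmed fix.

```python
import threading


class CountdownSync:
    """Toy sync object: signalled by the completion that brings count to zero."""

    def __init__(self, count):
        self.count = count
        self._signaled = False
        self._cond = threading.Condition()

    def complete_one(self):
        # Called once per finished request; the last one wakes the waiter.
        with self._cond:
            self.count -= 1
            if self.count == 0:
                self._signaled = True
                self._cond.notify_all()

    def wait(self, timeout=None):
        # Returns True if signalled, False if the timeout expired first.
        with self._cond:
            return self._cond.wait_for(lambda: self._signaled, timeout)


def wait_all_naive(n_requests, timeout):
    """Wait for n_requests completions; with zero requests nobody signals."""
    sync = CountdownSync(n_requests)
    for _ in range(n_requests):
        sync.complete_one()  # in this toy model requests finish immediately
    return sync.wait(timeout)


def wait_all_guarded(n_requests, timeout):
    """Same as above, but returns immediately when there is nothing to wait for."""
    if n_requests == 0:
        return True
    return wait_all_naive(n_requests, timeout)
```

In this model, `wait_all_naive(0, timeout=None)` would block forever, which resembles the state shown in the struct dump (count = 0, status = 0), while the guarded variant returns at once.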
The configuration is:
/configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --with-libfabric --with-jdk-dir=/usr/lib/jvm/default-java --enable-mpi-java --enable-debug --enable-mpi-thread-multiple --disable-silent-rules --enable-mpi-cxx --with-hwloc=/usr/ --with-libltdl=/usr/ --with-devel-headers --with-slurm --with-sge --without-tm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi --libdir=${prefix}/lib/openmpi/lib --includedir=${prefix}/lib/openmpi/include
This happens in a VM on my laptop (and also on our build machines), so no unusual hardware is involved.
How should I proceed to debug this?