
Hang in PMPI_Neighbor_allgather() #2071

@amckinstry

I have a problem with mpi4py 2.0.0 and OpenMPI 2.0.1 (on Debian).
This is a regression compared to OpenMPI 1.10.3.

mpi4py fails its test suite (and build) because of the following code:


import sys
from mpi4py import MPI

print("DEBUG: main", file=sys.stderr)
name, version = MPI.get_vendor()
print("DEBUG: vendor,version = %s, %s" % (name, version), file=sys.stderr)
# Single-process, non-periodic 1-D Cartesian communicator: rank 0 has no neighbours.
cartcomm = MPI.COMM_SELF.Create_cart([1], periods=[0])
try:
    print("DEBUG: befor allgather", file=sys.stderr)
    print('DEBUG: size = %s, rank = %s' % (cartcomm.Get_size(), cartcomm.Get_rank()), file=sys.stderr)
    cartcomm.neighbor_allgather(None, None)  # hangs here
    print("DEBUG: After allgather", file=sys.stderr)
except Exception:
    print("DEBUG: in except", file=sys.stderr)

(Please excuse the print statements.) It hangs in neighbor_allgather() when the test code is run outside mpiexec, i.e. as a single process.
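Since the failure mode is a hang rather than an error, it can help to run the reproducer under a watchdog so that a test run terminates either way. The following is a minimal sketch using only the Python 3 standard library; the helper name `run_with_timeout` is made up for illustration and is not part of mpi4py or its test suite.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process.

    Returns ("ok", returncode) if the child exits within timeout_s seconds,
    or ("hang", None) if it had to be killed after the timeout.
    """
    proc = subprocess.Popen(cmd)
    try:
        return ("ok", proc.wait(timeout=timeout_s))
    except subprocess.TimeoutExpired:
        # The child is still running: treat it as hung and clean it up.
        proc.kill()
        proc.wait()
        return ("hang", None)
```

Assuming the snippet above is saved as a file named `repro.py` (a hypothetical name), `run_with_timeout([sys.executable, "repro.py"], 30)` would report `("hang", None)` on an affected installation instead of blocking forever.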

Debug output:

nose.importer: DEBUG: find module part test_cco_ngh_buf (test_cco_ngh_buf) in ['/srv/build/mpi4py/mpi4py-2.0.0/test']
DEBUG: main
DEBUG: vendor,version = Open MPI, (2, 0, 1)
DEBUG: befor allgather
DEBUG: size = 1, rank = 0
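The `size = 1` line matters here: on a one-process, non-periodic 1-D Cartesian topology, both shifts from rank 0 fall off the grid, so the rank has no real neighbours to exchange data with. The sketch below models that rule in plain Python without calling MPI; `PROC_NULL` is a stand-in for `MPI_PROC_NULL`, and both function names are hypothetical.

```python
PROC_NULL = None  # stand-in for MPI_PROC_NULL in this pure-Python model


def cart_neighbors_1d(rank, size, periodic):
    """Return (source, dest) for a +1 shift along a 1-D Cartesian topology."""
    def resolve(r):
        if 0 <= r < size:
            return r
        # Off the grid: wrap around if periodic, otherwise there is no neighbour.
        return r % size if periodic else PROC_NULL
    return resolve(rank - 1), resolve(rank + 1)


def real_neighbor_count(rank, size, periodic):
    """Count neighbours that would actually take part in communication."""
    return sum(n is not PROC_NULL for n in cart_neighbors_1d(rank, size, periodic))
```

For the case in this report, `real_neighbor_count(0, 1, False)` is 0, so a neighbourhood collective on this communicator has nothing to send or receive.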

I have a debug build of OpenMPI (--enable-debug), so I can get a backtrace:

0x00007f1ab185d90c in ompi_request_default_wait_all (count=0, requests=<optimized out>, statuses=<optimized out>) at request/req_wait.c:346
346     request/req_wait.c: No such file or directory.
(gdb) where
#0  0x00007f1ab185d90c in ompi_request_default_wait_all (count=0, requests=<optimized out>, statuses=<optimized out>) at request/req_wait.c:346
#1  0x00007f1a9e8c9166 in mca_coll_basic_neighbor_allgather_cart (module=0x1f3d0c0, comm=0x1fda650, rdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, rcount=1,
    rbuf=0x16d37d8, sdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, scount=<optimized out>, sbuf=0x7ffc21cad868) at coll_basic_neighbor_allgather.c:109
#2  mca_coll_basic_neighbor_allgather (sbuf=0x7ffc21cad868, scount=<optimized out>, sdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, rbuf=<optimized out>, rcount=1,
    rdtype=0x7f1ab1af0ca0 <ompi_mpi_int>, comm=0x1fda650, module=0x1f3d0c0) at coll_basic_neighbor_allgather.c:238
#3  0x00007f1ab1883b67 in PMPI_Neighbor_allgather (sendbuf=sendbuf@entry=0x7ffc21cad868, sendcount=sendcount@entry=1, sendtype=0x7f1ab1af0ca0 <ompi_mpi_int>,
    recvbuf=recvbuf@entry=0x16d37d0, recvcount=recvcount@entry=1, recvtype=<optimized out>, comm=0x1fda650) at pneighbor_allgather.c:118
#4  0x00007f1ab1bf801d in __pyx_f_6mpi4py_3MPI_PyMPI_neighbor_allgather (__pyx_v_comm=0x1fda650, __pyx_v_sendobj=<optimized out>) at src/mpi4py.MPI.c:48950
#5  __pyx_pf_6mpi4py_3MPI_8Topocomm_22neighbor_allgather (__pyx_v_self=<optimized out>, __pyx_v_sendobj=<optimized out>) at src/mpi4py.MPI.c:50162
#6  __pyx_pw_6mpi4py_3MPI_8Topocomm_23neighbor_allgather (__pyx_v_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at src/mpi4py.MPI.c:50129
#7  0x00000000004c398a in call_function (oparg=<optimized out>, pp_stack=0x7ffc21cad990) at ../Python/ceval.c:4350

So it's stuck in WAIT_SYNC_RELEASE(&sync), and the sync struct contains:

$5 = {count = 0, status = 0, condition = {__data = {__lock = 0, __futex = 0, __total_seq = 0, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x0,
      __nwaiters = 0, __broadcast_seq = 0}, __size = '\000' <repeats 47 times>, __align = 0}, lock = {__data = {__lock = 0, __count = 0,
      __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = '\000' <repeats 39 times>, __align = 0}, next = 0x0, prev = 0x0, signaling = true}
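One reading of the trace, offered as a hypothesis and not a confirmed diagnosis: `ompi_request_default_wait_all` is entered with `count=0`, and the sync object also shows `count = 0`, so if the wake-up is only ever triggered by a request completing, nothing would release the waiter. The toy model below illustrates that pattern in plain Python 3. It is not Open MPI code; `WaitSync`, `wait_all_naive`, and `wait_all_guarded` are hypothetical names, and a timeout stands in for the hang so the example terminates.

```python
import threading


class WaitSync:
    """Toy completion counter: the waiter is woken when count drops to zero."""

    def __init__(self, count):
        self.count = count
        self._signaled = False
        self._cond = threading.Condition()

    def update(self, n=1):
        # Called once per completed request; the only place that signals.
        with self._cond:
            self.count -= n
            if self.count == 0:
                self._signaled = True
                self._cond.notify_all()

    def wait(self, timeout):
        # Returns True if signaled, False if the timeout expired first.
        with self._cond:
            return self._cond.wait_for(lambda: self._signaled, timeout)


def complete_now(sync):
    """A 'request' that completes immediately."""
    sync.update()


def wait_all_naive(requests, timeout=0.2):
    sync = WaitSync(len(requests))
    for request in requests:
        request(sync)
    # With zero requests, update() is never called, so this times out.
    return sync.wait(timeout)


def wait_all_guarded(requests, timeout=0.2):
    # Short-circuit the empty case before waiting on anything.
    if not requests:
        return True
    return wait_all_naive(requests, timeout)
```

In this model `wait_all_naive([])` times out while `wait_all_guarded([])` returns immediately; whether the real code path behaves the same way would need to be checked against the Open MPI sources.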

The configuration is:

/configure --build=x86_64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-silent-rules --libdir=${prefix}/lib/x86_64-linux-gnu --libexecdir=${prefix}/lib/x86_64-linux-gnu --disable-maintainer-mode --disable-dependency-tracking --with-libfabric --with-jdk-dir=/usr/lib/jvm/default-java --enable-mpi-java --enable-debug --enable-mpi-thread-multiple --disable-silent-rules --enable-mpi-cxx --with-hwloc=/usr/ --with-libltdl=/usr/ --with-devel-headers --with-slurm --with-sge --without-tm --enable-heterogeneous --disable-vt --sysconfdir=/etc/openmpi --libdir=${prefix}/lib/openmpi/lib --includedir=${prefix}/lib/openmpi/include

This happens in a VM on my laptop (and also on our build machines), so no unusual hardware is involved.
How should I proceed to debug this?
