CUDA GPUDirect blocks on message size greater than GPUDirect RDMA limit with openib btl #3972

@avivkhavich

Description

Background information

When using CUDA GPUDirect to MPI_Send and MPI_Recv directly from GPU buffers with message sizes greater than the GPUDirect RDMA limit, Open MPI simply hangs instead of falling back to the expected behavior of staging the buffers through host memory.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Git clone.
Commands used:

./autogen.pl

./configure --prefix=/labhome/ompi/install --with-knem=/path/to/knem --with-mxm=/path/to/mxm --with-slurm --with-pmi --with-platform=contrib/platform/mellanox/optimized --with-hcoll=/path/to/hcoll --with-ucx=/path/to/ucx --with-cuda

make all install

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server 7.2 (Maipo) (3.10.0-327.el7.x86_64)
    CUDA distribution: 8.0
    CUDA GPUDirect Driver: Mellanox OFED GPUDirect RDMA

  • Computer hardware: 2 nodes, 48 cores/node, 2 GPUs per node
    CPUs: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
    GPUs: NVIDIA Tesla P100-PCIE-16GB

  • Network type: InfiniBand

Details of the problem

GPUDirect is intended to improve latency and bandwidth by allowing RDMA transfers directly from GPU to GPU, bypassing the CPU. However, there is a message size limit on GPUDirect RDMA transfers; above this limit, Open MPI is expected to fall back to staging the buffers through host memory with cudaMemcpy.
The limit can be changed with the parameter "-mca btl_openib_cuda_rdma_limit x", where x is the message size in bytes.
Unfortunately, rather than behaving as expected, the program simply hangs in MPI_Recv.
From what I have read, this may occur because Open MPI uses blocking copies. The option "-mca mpi_common_cuda_cumemcpy_async 1" is meant to instruct Open MPI to use non-blocking, asynchronous copies when staging buffers through the host.
However, enabling this option changes nothing, even though that option has reportedly been enabled by default since Open MPI v1.10.
I attempted the same test with v1.10 and it completed successfully. A rough sketch of the host-staging behavior I would expect above the limit is included below.
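
For reference, this is roughly the staging pattern I would expect above the GDR limit. It is purely illustrative and not Open MPI's internal code path; the helper names staged_send/staged_recv, the pinned bounce buffer, and the single-shot (unpipelined) copy are my own simplifications.

#include <cuda_runtime_api.h>
#include <mpi.h>

/* Illustrative only: stage a device buffer through pinned host memory.
 * staged_send/staged_recv are hypothetical helpers, not Open MPI functions. */
static void staged_send(const float *dev_buf, size_t elements, int dst, int tag,
                        MPI_Comm comm, cudaStream_t stream)
{
        float *host_buf;
        cudaMallocHost((void **)&host_buf, elements * sizeof(float)); /* pinned */
        cudaMemcpyAsync(host_buf, dev_buf, elements * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  /* wait for this stream's copy to finish */
        MPI_Send(host_buf, (int)elements, MPI_FLOAT, dst, tag, comm);
        cudaFreeHost(host_buf);
}

static void staged_recv(float *dev_buf, size_t elements, int src, int tag,
                        MPI_Comm comm, cudaStream_t stream)
{
        float *host_buf;
        MPI_Status st;
        cudaMallocHost((void **)&host_buf, elements * sizeof(float));
        MPI_Recv(host_buf, (int)elements, MPI_FLOAT, src, tag, comm, &st);
        cudaMemcpyAsync(dev_buf, host_buf, elements * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        cudaStreamSynchronize(stream);
        cudaFreeHost(host_buf);
}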

Below is gpudirect_bug.c, which replicates the behavior. argv[1] is taken as the number of floating-point elements in the send and receive buffers, so the actual array size is that value multiplied by sizeof(float).
The btl_openib_cuda_rdma_limit parameter is a limit on the total message size; in my case an extra 56 bytes appeared to be used as a header, so a limit of 1000 is effectively a limit of 944 bytes of payload. This small discrepancy is only a minor inconvenience, but it should perhaps be corrected to apply to the user message length instead.
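
As a quick sanity check on that accounting (assuming the ~56-byte per-message overhead I observed, which I have not confirmed against the source):

#include <stdio.h>

int main(void)
{
        /* Assumed: -mca btl_openib_cuda_rdma_limit 1000 and the ~56-byte
         * header observed in my runs. */
        const size_t rdma_limit   = 1000;
        const size_t header_bytes = 56;
        size_t max_payload  = rdma_limit - header_bytes;     /* 944 bytes  */
        size_t max_elements = max_payload / sizeof(float);   /* 236 floats */
        printf("largest message that still uses GDR: %zu bytes (%zu floats)\n",
               max_payload, max_elements);
        return 0;
}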

On my command line I passed 1000, so the array was 4000 bytes while the RDMA limit was 1000 bytes; the message was above the limit and the program hung. Message sizes below the limit behave as expected.

gpudirect_bug.c:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <mpi.h>
int main(int argc, char** argv)
{
        MPI_Init(NULL,NULL);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Status st;
        float *dev_buf;
        size_t elements = atoi(argv[1]);
        size_t arraysize = elements * sizeof(float);
        if (rank == 0) {
                printf("Array size: %zu\n", arraysize);
        }

        cudaMalloc((void**)&dev_buf, arraysize);
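        /* dev_buf lives entirely in device memory; the payload is never copied
         * to the host by this test program. */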

        MPI_Barrier(MPI_COMM_WORLD);
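        /* Simple ping-pong on the device buffer; with 1000 floats (4000 bytes)
         * and btl_openib_cuda_rdma_limit=1000, both ranks hang in MPI_Recv. */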
        if (rank == 0) {
                MPI_Send(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
                MPI_Recv(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(dev_buf, elements, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
        }
        MPI_Barrier(MPI_COMM_WORLD);

        cudaFree(dev_buf);
        MPI_Finalize();
        return 0;
}

Compilation:

mpicc -o gpu_test gpudirect_bug.c -lcudart -g

Command line:

mpirun -np 2 -bind-to core -mca pml ob1 -mca btl openib,self --map-by node -H vulcan03,vulcan02 --display-map -mca coll '^hcoll' -mca mtl '^mxm' -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_cuda_rdma_limit 1000 gpu_test 1000

Stack traces (each rank's main thread is blocked in MPI_Recv, spinning in btl_openib_component_progress):

== vulcan02 == 24222
Thread 5 (Thread 0x7ffff42ff700 (LWP 24223)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663840, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663840, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 24224)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5d0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5d0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 24240)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 24241)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 24222)):
#0  0x00007fffe9491abe in progress_one_device (device=0x7ae600) at btl_openib_component.c:3689
#1  btl_openib_component_progress () at btl_openib_component.c:3765
#2  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#3  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#4  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffd580) at pml_ob1_irecv.c:135
#5  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffd580) at precv.c:79
#6  0x00000000004011ac in main (argc=2, argv=0x7fffffffd6e8) at /labhome/avivkh/with_hpl/a.c:122
== vulcan03 == 9658
Thread 5 (Thread 0x7ffff42ff700 (LWP 9661)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x663860, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x663860, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff663445e in progress_engine (obj=<optimized out>) at runtime/opal_progress_threads.c:105
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7ffff2c23700 (LWP 9662)):
#0  0x00007ffff70b3d13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff6670b53 in epoll_dispatch (base=0x68e5f0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff66745a0 in opal_libevent2022_event_base_loop (base=0x68e5f0, flags=flags@entry=1) at event.c:1630
#3  0x00007ffff32a321e in progress_engine (obj=<optimized out>) at runtime/pmix_progress_threads.c:109
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffda73d700 (LWP 9675)):
#0  0x00007ffff70b4bcf in accept4 () from /usr/lib64/libc.so.6
#1  0x00007fffe8a28676 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a1be9d in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffd9f3c700 (LWP 9676)):
#0  0x00007ffff70a8dfd in poll () from /usr/lib64/libc.so.6
#1  0x00007fffe8a277b3 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00007fffe8a8af56 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00007fffe8a29068 in ?? () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00007ffff7384dc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff70b373d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fad740 (LWP 9658)):
#0  0x00007ffff7389210 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007fffeaf01b24 in mlx5_poll_cq_1 () from /usr/lib64/libmlx5-rdmav2.so
#2  0x00007fffe9490e83 in ibv_poll_cq (wc=0x7fffffffa7e0, num_entries=<optimized out>, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1458
#3  poll_device (device=device@entry=0x7a2f00, count=count@entry=0) at btl_openib_component.c:3608
#4  0x00007fffe9491c22 in progress_one_device (device=0x7a2f00) at btl_openib_component.c:3741
#5  btl_openib_component_progress () at btl_openib_component.c:3765
#6  0x00007ffff662e574 in opal_progress () at runtime/opal_progress.c:222
#7  0x00007fffe81fc425 in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:392
#8  mca_pml_ob1_recv (addr=0x10316000000, count=<optimized out>, datatype=0x6022e0 <ompi_mpi_float>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>, status=0x7fffffffdcd0) at pml_ob1_irecv.c:135
#9  0x00007ffff75ff41c in PMPI_Recv (buf=<optimized out>, count=<optimized out>, type=<optimized out>, source=<optimized out>, tag=<optimized out>, comm=0x6020e0 <ompi_mpi_comm_world>, status=0x7fffffffdcd0) at precv.c:79
#10 0x00000000004012e6 in main (argc=2, argv=0x7fffffffde38) at /labhome/avivkh/with_hpl/a.c:138
