osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed. #3573

Closed
@alsrgv

Background information

I'm running osu_latency from http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz in CUDA mode over RoCE with RDMA-CM, and I'm getting the assertion failure shown in the title. The failure seems to occur once the message size goes above 32768 bytes.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OpenMPI v2.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure --with-cuda --enable-debug --prefix=/home/asergeev/openmpi

Please describe the system on which you are running

  • Operating system/version: Debian GNU/Linux 8 (jessie), Linux opusgpu25-wbu2 4.4.66 #1 SMP Wed May 3 23:47:24 UTC 2017 x86_64 GNU/Linux
  • Computer hardware: 128GB RAM, 4xTitan P40
  • Network type: 06:00.0 Ethernet controller: Mellanox Technologies MT27630 Family

Details of the problem

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.02
1                      10.47
2                      10.55
4                      11.39
8                      10.74
16                     10.68
32                     10.39
64                     10.66
128                    10.61
256                    11.53
512                    10.33
1024                   10.93
2048                   11.40
4096                   12.05
8192                   14.10
16384                  18.47
32768                  50.95
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
[opusgpu25-wbu2:15898] *** Process received signal ***
[opusgpu25-wbu2:15898] Signal: Aborted (6)
[opusgpu25-wbu2:15898] Signal code:  (-6)
[opusgpu25-wbu2:15898] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7f31d8fab8f0]
[opusgpu25-wbu2:15898] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31d8c26077]
[opusgpu25-wbu2:15898] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31d8c27458]
[opusgpu25-wbu2:15898] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7f31d8c1f266]
[opusgpu25-wbu2:15898] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7f31d8c1f312]
[opusgpu25-wbu2:15898] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7f31c2f8d01a]
[opusgpu25-wbu2:15898] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7f31c2f8d576]
[opusgpu25-wbu2:15898] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7f31c2b56c89]
[opusgpu25-wbu2:15898] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7f31c2b59637]
[opusgpu25-wbu2:15898] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7f31c2b4f659]
[opusgpu25-wbu2:15898] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7f31c2b4f6bb]
[opusgpu25-wbu2:15898] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7f31c2b5008d]
[opusgpu25-wbu2:15898] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7f31c2f98cee]
[opusgpu25-wbu2:15898] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7f31c2f99d79]
[opusgpu25-wbu2:15898] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7f31c2f9a425]
[opusgpu25-wbu2:15898] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7f31c2f9a7a9]
[opusgpu25-wbu2:15898] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7f31c2f9a865]
[opusgpu25-wbu2:15898] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7f31d861c054]
[opusgpu25-wbu2:15898] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7f31c2b49d70]
[opusgpu25-wbu2:15898] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7f31c2b4be2f]
[opusgpu25-wbu2:15898] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7f31da1a496f]
[opusgpu25-wbu2:15898] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:15898] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31d8c12b45]
[opusgpu25-wbu2:15898] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:15898] *** End of error message ***
[opusgpu25-wbu2:15892] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:15892] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
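
For reference, here is a minimal sketch of how the queue specification passed via btl_openib_receive_queues might relate to the assertion. It rests on two assumptions that the output above does not confirm: that the openib BTL picks the first queue whose buffer size is at least the requested fragment size, and that 255 is the "no matching queue" sentinel the assertion checks for. The helper names are hypothetical and are not Open MPI APIs.

```python
# Assumption: 255 is the "no matching queue" value guarded by `qp != 255`.
NO_ORDER = 255


def parse_receive_queues(spec):
    """Return [(queue_type, buffer_size_bytes), ...] in the order given.

    Only the first two comma-separated fields of each queue are read;
    the remaining fields (buffer counts, watermarks, ...) are ignored.
    """
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        queues.append((fields[0], int(fields[1])))
    return queues


def first_fitting_queue(queues, size):
    """Index of the first queue whose buffer holds `size` bytes, else NO_ORDER."""
    for index, (_queue_type, buffer_size) in enumerate(queues):
        if buffer_size >= size:
            return index
    return NO_ORDER


# The exact value used in the mpirun command above.
SPEC = "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32"
queues = parse_receive_queues(SPEC)

for size in (128, 4096, 65536, 65537):
    print(size, first_fitting_queue(queues, size))
```

If the assumptions hold, any allocation request larger than 65536 bytes (the largest buffer size in this specification) would match no queue and yield 255, which is consistent with the failure appearing only at larger message sizes.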

The test passes if I specify a large RDMA limit (btl_openib_cuda_rdma_limit 10000000), like this:

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.10
1                      10.41
2                      10.20
4                      10.97
8                      10.86
16                     10.57
32                     11.03
64                     10.81
128                    10.28
256                    10.47
512                    10.61
1024                   10.54
2048                   11.12
4096                   12.69
8192                   13.90
16384                  17.58
32768                  23.90
65536                  37.57
131072                 69.27
262144                140.70
524288                248.32
[opusgpu25-wbu2:16870] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:16870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
1048576               481.38
2097152              1023.72
4194304              2071.60

However, it still fails with the same assertion if I disable GPUDirect altogether (btl_openib_want_cuda_gdr 0), even with the large RDMA limit:

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 0 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.19
1                      27.48
2                      26.84
4                      27.11
8                      26.85
16                     26.91
32                     26.94
64                     27.58
128                    26.96
[opusgpu25-wbu2:17545] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:17545] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
256                    28.59
512                    27.02
1024                   28.46
2048                   28.17
4096                   30.12
8192                   32.22
16384                  40.80
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
32768                  52.48
[opusgpu25-wbu2:17551] *** Process received signal ***
[opusgpu25-wbu2:17551] Signal: Aborted (6)
[opusgpu25-wbu2:17551] Signal code:  (-6)
[opusgpu25-wbu2:17551] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7fc8569948f0]
[opusgpu25-wbu2:17551] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc85660f077]
[opusgpu25-wbu2:17551] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc856610458]
[opusgpu25-wbu2:17551] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fc856608266]
[opusgpu25-wbu2:17551] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fc856608312]
[opusgpu25-wbu2:17551] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7fc83c82301a]
[opusgpu25-wbu2:17551] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7fc83c823576]
[opusgpu25-wbu2:17551] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7fc83c3ecc89]
[opusgpu25-wbu2:17551] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7fc83c3ef637]
[opusgpu25-wbu2:17551] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7fc83c3e5659]
[opusgpu25-wbu2:17551] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7fc83c3e56bb]
[opusgpu25-wbu2:17551] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7fc83c3e608d]
[opusgpu25-wbu2:17551] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7fc83c82ecee]
[opusgpu25-wbu2:17551] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7fc83c82fd79]
[opusgpu25-wbu2:17551] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7fc83c830425]
[opusgpu25-wbu2:17551] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7fc83c8307a9]
[opusgpu25-wbu2:17551] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7fc83c830865]
[opusgpu25-wbu2:17551] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7fc856005054]
[opusgpu25-wbu2:17551] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7fc83c3dfd70]
[opusgpu25-wbu2:17551] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7fc83c3e1e2f]
[opusgpu25-wbu2:17551] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7fc857b8d96f]
[opusgpu25-wbu2:17551] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:17551] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc8565fbb45]
[opusgpu25-wbu2:17551] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:17551] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
