Description
Background information
I'm running osu_latency from http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz in CUDA mode over RoCE with RDMA-CM, and I get the error in the title (Assertion `qp != 255' failed in mca_btl_openib_alloc). The error seems to occur once the message size goes above 32768 bytes.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
OpenMPI v2.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
./configure --with-cuda --enable-debug --prefix=/home/asergeev/openmpi
Please describe the system on which you are running
- Operating system/version: Debian GNU/Linux 8 (jessie), Linux opusgpu25-wbu2 4.4.66 #1 SMP Wed May 3 23:47:24 UTC 2017 x86_64 GNU/Linux
- Computer hardware: 128GB RAM, 4xTitan P40
- Network type: 06:00.0 Ethernet controller: Mellanox Technologies MT27630 Family
Details of the problem
The failing command and its output (the benchmark aborts right after printing the 32768-byte result):
(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: opusgpu34-wbu2
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4117
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 3.02
1 10.47
2 10.55
4 11.39
8 10.74
16 10.68
32 10.39
64 10.66
128 10.61
256 11.53
512 10.33
1024 10.93
2048 11.40
4096 12.05
8192 14.10
16384 18.47
32768 50.95
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
[opusgpu25-wbu2:15898] *** Process received signal ***
[opusgpu25-wbu2:15898] Signal: Aborted (6)
[opusgpu25-wbu2:15898] Signal code: (-6)
[opusgpu25-wbu2:15898] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7f31d8fab8f0]
[opusgpu25-wbu2:15898] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31d8c26077]
[opusgpu25-wbu2:15898] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31d8c27458]
[opusgpu25-wbu2:15898] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7f31d8c1f266]
[opusgpu25-wbu2:15898] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7f31d8c1f312]
[opusgpu25-wbu2:15898] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7f31c2f8d01a]
[opusgpu25-wbu2:15898] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7f31c2f8d576]
[opusgpu25-wbu2:15898] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7f31c2b56c89]
[opusgpu25-wbu2:15898] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7f31c2b59637]
[opusgpu25-wbu2:15898] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7f31c2b4f659]
[opusgpu25-wbu2:15898] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7f31c2b4f6bb]
[opusgpu25-wbu2:15898] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7f31c2b5008d]
[opusgpu25-wbu2:15898] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7f31c2f98cee]
[opusgpu25-wbu2:15898] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7f31c2f99d79]
[opusgpu25-wbu2:15898] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7f31c2f9a425]
[opusgpu25-wbu2:15898] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7f31c2f9a7a9]
[opusgpu25-wbu2:15898] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7f31c2f9a865]
[opusgpu25-wbu2:15898] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7f31d861c054]
[opusgpu25-wbu2:15898] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7f31c2b49d70]
[opusgpu25-wbu2:15898] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7f31c2b4be2f]
[opusgpu25-wbu2:15898] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7f31da1a496f]
[opusgpu25-wbu2:15898] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:15898] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31d8c12b45]
[opusgpu25-wbu2:15898] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:15898] *** End of error message ***
[opusgpu25-wbu2:15892] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:15892] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
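One possible reading of the assertion, offered as an assumption rather than something confirmed against the Open MPI source: the BTL looks for a receive queue whose buffer is large enough for the fragment it wants to allocate, and 255 is the "no queue fits" value that the assertion rejects. The sketch below models that lookup for the btl_openib_receive_queues string used above. The helper names parse_receive_queues and pick_queue, and the first-fit rule itself, are hypothetical and only illustrate the idea.

```python
# Simplified model (an assumption, NOT Open MPI source code): choose the
# first receive queue whose buffer size can hold the requested fragment.
NO_QUEUE = 255  # the value that the failed assertion `qp != 255' rejects


def parse_receive_queues(spec):
    """Split a btl_openib_receive_queues string into (type, buffer_size) pairs.

    Only the first two fields of each entry are interpreted: the queue type
    (P or S) and the buffer size in bytes. The remaining fields are ignored.
    """
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        queues.append((fields[0], int(fields[1])))
    return queues


def pick_queue(queues, nbytes):
    """Return the index of the first queue that fits nbytes, else NO_QUEUE."""
    for index, (_kind, size) in enumerate(queues):
        if size >= nbytes:
            return index
    return NO_QUEUE


# The queue specification passed on the mpirun command line in this report.
SPEC = "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32"
queues = parse_receive_queues(SPEC)
largest = max(size for _kind, size in queues)

if __name__ == "__main__":
    print("largest buffer:", largest)
    for nbytes in (32768, 65536, 65537, 131072):
        print(nbytes, "->", pick_queue(queues, nbytes))
```

Under this model, any allocation larger than 65536 bytes has no matching queue with the specification above. Whether that is what actually happens here depends on the fragment size the BTL requests, which this report does not show.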
I am able to make it pass if I specify a large RDMA limit (btl_openib_cuda_rdma_limit), like this:
(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: opusgpu34-wbu2
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4117
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 3.10
1 10.41
2 10.20
4 10.97
8 10.86
16 10.57
32 11.03
64 10.81
128 10.28
256 10.47
512 10.61
1024 10.54
2048 11.12
4096 12.69
8192 13.90
16384 17.58
32768 23.90
65536 37.57
131072 69.27
262144 140.70
524288 248.32
[opusgpu25-wbu2:16870] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:16870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
1048576 481.38
2097152 1023.72
4194304 2071.60
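To compare the runs, the osu_latency tables can be parsed into size/latency pairs. The following is a minimal sketch; parse_latency is a hypothetical helper, and the two sample strings are short excerpts copied from the outputs above (default RDMA limit versus btl_openib_cuda_rdma_limit 10000000).

```python
def parse_latency(text):
    """Return {message_size: latency_us} from osu_latency output lines."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly two fields and start with an integer size;
        # headers, warnings and backtrace lines are skipped.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        try:
            table[int(parts[0])] = float(parts[1])
        except ValueError:
            continue
    return table


# Excerpt of the failing run (default btl_openib_cuda_rdma_limit).
DEFAULT_LIMIT = """\
16384 18.47
32768 50.95
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
"""

# Excerpt of the passing run (btl_openib_cuda_rdma_limit 10000000).
LARGE_LIMIT = """\
16384 17.58
32768 23.90
65536 37.57
"""

default_run = parse_latency(DEFAULT_LIMIT)
large_run = parse_latency(LARGE_LIMIT)

# Largest message size that produced a result before the failing run aborted.
last_completed = max(default_run)
# How much slower the 32768-byte message is with the default limit.
ratio_32k = default_run[32768] / large_run[32768]

if __name__ == "__main__":
    print("last completed size:", last_completed)
    print("32768-byte latency ratio: %.2f" % ratio_32k)
```

With these excerpts, the failing run completes 32768 bytes as its last size, and that size is already roughly twice as slow as in the run with the large RDMA limit.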
However, it still fails with the same assertion if I disable GPUDirect altogether (btl_openib_want_cuda_gdr 0):
(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 0 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: opusgpu34-wbu2
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4117
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Latency (us)
0 3.19
1 27.48
2 26.84
4 27.11
8 26.85
16 26.91
32 26.94
64 27.58
128 26.96
[opusgpu25-wbu2:17545] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:17545] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
256 28.59
512 27.02
1024 28.46
2048 28.17
4096 30.12
8192 32.22
16384 40.80
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
32768 52.48
[opusgpu25-wbu2:17551] *** Process received signal ***
[opusgpu25-wbu2:17551] Signal: Aborted (6)
[opusgpu25-wbu2:17551] Signal code: (-6)
[opusgpu25-wbu2:17551] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7fc8569948f0]
[opusgpu25-wbu2:17551] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc85660f077]
[opusgpu25-wbu2:17551] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc856610458]
[opusgpu25-wbu2:17551] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fc856608266]
[opusgpu25-wbu2:17551] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fc856608312]
[opusgpu25-wbu2:17551] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7fc83c82301a]
[opusgpu25-wbu2:17551] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7fc83c823576]
[opusgpu25-wbu2:17551] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7fc83c3ecc89]
[opusgpu25-wbu2:17551] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7fc83c3ef637]
[opusgpu25-wbu2:17551] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7fc83c3e5659]
[opusgpu25-wbu2:17551] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7fc83c3e56bb]
[opusgpu25-wbu2:17551] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7fc83c3e608d]
[opusgpu25-wbu2:17551] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7fc83c82ecee]
[opusgpu25-wbu2:17551] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7fc83c82fd79]
[opusgpu25-wbu2:17551] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7fc83c830425]
[opusgpu25-wbu2:17551] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7fc83c8307a9]
[opusgpu25-wbu2:17551] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7fc83c830865]
[opusgpu25-wbu2:17551] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7fc856005054]
[opusgpu25-wbu2:17551] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7fc83c3dfd70]
[opusgpu25-wbu2:17551] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7fc83c3e1e2f]
[opusgpu25-wbu2:17551] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7fc857b8d96f]
[opusgpu25-wbu2:17551] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:17551] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc8565fbb45]
[opusgpu25-wbu2:17551] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:17551] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
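Both failing runs abort with the same call path. As a reading aid, the sketch below pulls the named frames out of a backtrace in the format printed above. The regular expression and the named_frames helper are illustrative assumptions written for this log format, not part of Open MPI, and the sample is a four-line excerpt of the second backtrace.

```python
import re

# Matches "[ N] /path/to/lib.so(symbol+0xOFFSET)"; frames without a symbol
# name, such as "(+0x1ac89)", do not match and are skipped.
FRAME = re.compile(r"\[\s*(\d+)\]\s+\S+\((\w+)\+0x[0-9a-f]+\)")


def named_frames(trace):
    """Return (frame_number, symbol) pairs for frames that carry a symbol."""
    frames = []
    for line in trace.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


# Excerpt of the backtrace from the run with btl_openib_want_cuda_gdr 0.
SAMPLE = """\
[opusgpu25-wbu2:17551] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7fc83c82301a]
[opusgpu25-wbu2:17551] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7fc83c823576]
[opusgpu25-wbu2:17551] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7fc83c3ecc89]
[opusgpu25-wbu2:17551] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7fc83c3ef637]
"""

frames = named_frames(SAMPLE)

if __name__ == "__main__":
    for number, symbol in frames:
        print(number, symbol)
```

Read from the bottom up, the named frames in the full backtrace show mca_pml_ob1_send_request_schedule_once calling into mca_btl_openib_prepare_src, which calls mca_btl_openib_alloc, where the assertion fires.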