
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed. #3573


Closed
alsrgv opened this issue May 24, 2017 · 10 comments


alsrgv commented May 24, 2017

Thank you for taking the time to submit an issue!

Background information

I'm running osu_latency from http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz in CUDA mode with RoCE RDMA-CM, and I'm getting the error in the title. The error seems to happen when the message size is above 32768.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

OpenMPI v2.1.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

./configure --with-cuda --enable-debug --prefix=/home/asergeev/openmpi

Please describe the system on which you are running

  • Operating system/version: Debian GNU/Linux 8 (jessie), Linux opusgpu25-wbu2 4.4.66 #1 SMP Wed May 3 23:47:24 UTC 2017 x86_64 GNU/Linux
  • Computer hardware: 128GB RAM, 4xTitan P40
  • Network type: 06:00.0 Ethernet controller: Mellanox Technologies MT27630 Family

Details of the problem

I'm running osu_latency from http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz in CUDA mode with RoCE RDMA-CM, and I'm getting the error in the title. The error seems to happen when the message size is above 32768.

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.02
1                      10.47
2                      10.55
4                      11.39
8                      10.74
16                     10.68
32                     10.39
64                     10.66
128                    10.61
256                    11.53
512                    10.33
1024                   10.93
2048                   11.40
4096                   12.05
8192                   14.10
16384                  18.47
32768                  50.95
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
[opusgpu25-wbu2:15898] *** Process received signal ***
[opusgpu25-wbu2:15898] Signal: Aborted (6)
[opusgpu25-wbu2:15898] Signal code:  (-6)
[opusgpu25-wbu2:15898] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7f31d8fab8f0]
[opusgpu25-wbu2:15898] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f31d8c26077]
[opusgpu25-wbu2:15898] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f31d8c27458]
[opusgpu25-wbu2:15898] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7f31d8c1f266]
[opusgpu25-wbu2:15898] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7f31d8c1f312]
[opusgpu25-wbu2:15898] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7f31c2f8d01a]
[opusgpu25-wbu2:15898] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7f31c2f8d576]
[opusgpu25-wbu2:15898] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7f31c2b56c89]
[opusgpu25-wbu2:15898] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7f31c2b59637]
[opusgpu25-wbu2:15898] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7f31c2b4f659]
[opusgpu25-wbu2:15898] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7f31c2b4f6bb]
[opusgpu25-wbu2:15898] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7f31c2b5008d]
[opusgpu25-wbu2:15898] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7f31c2f98cee]
[opusgpu25-wbu2:15898] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7f31c2f99d79]
[opusgpu25-wbu2:15898] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7f31c2f9a425]
[opusgpu25-wbu2:15898] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7f31c2f9a7a9]
[opusgpu25-wbu2:15898] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7f31c2f9a865]
[opusgpu25-wbu2:15898] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7f31d861c054]
[opusgpu25-wbu2:15898] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7f31c2b49d70]
[opusgpu25-wbu2:15898] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7f31c2b4be2f]
[opusgpu25-wbu2:15898] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7f31da1a496f]
[opusgpu25-wbu2:15898] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:15898] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f31d8c12b45]
[opusgpu25-wbu2:15898] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:15898] *** End of error message ***
[opusgpu25-wbu2:15892] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:15892] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

I am able to make it pass if I specify a large RDMA limit, like this:

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.10
1                      10.41
2                      10.20
4                      10.97
8                      10.86
16                     10.57
32                     11.03
64                     10.81
128                    10.28
256                    10.47
512                    10.61
1024                   10.54
2048                   11.12
4096                   12.69
8192                   13.90
16384                  17.58
32768                  23.90
65536                  37.57
131072                 69.27
262144                140.70
524288                248.32
[opusgpu25-wbu2:16870] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:16870] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
1048576               481.38
2097152              1023.72
4194304              2071.60

But it still fails if I disable GPUDirect altogether.

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000 -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 0 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu34-wbu2
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       3.19
1                      27.48
2                      26.84
4                      27.11
8                      26.85
16                     26.91
32                     26.94
64                     27.58
128                    26.96
[opusgpu25-wbu2:17545] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:17545] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
256                    28.59
512                    27.02
1024                   28.46
2048                   28.17
4096                   30.12
8192                   32.22
16384                  40.80
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
32768                  52.48
[opusgpu25-wbu2:17551] *** Process received signal ***
[opusgpu25-wbu2:17551] Signal: Aborted (6)
[opusgpu25-wbu2:17551] Signal code:  (-6)
[opusgpu25-wbu2:17551] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7fc8569948f0]
[opusgpu25-wbu2:17551] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fc85660f077]
[opusgpu25-wbu2:17551] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fc856610458]
[opusgpu25-wbu2:17551] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fc856608266]
[opusgpu25-wbu2:17551] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fc856608312]
[opusgpu25-wbu2:17551] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7fc83c82301a]
[opusgpu25-wbu2:17551] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7fc83c823576]
[opusgpu25-wbu2:17551] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7fc83c3ecc89]
[opusgpu25-wbu2:17551] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7fc83c3ef637]
[opusgpu25-wbu2:17551] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7fc83c3e5659]
[opusgpu25-wbu2:17551] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7fc83c3e56bb]
[opusgpu25-wbu2:17551] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7fc83c3e608d]
[opusgpu25-wbu2:17551] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7fc83c82ecee]
[opusgpu25-wbu2:17551] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1ad79)[0x7fc83c82fd79]
[opusgpu25-wbu2:17551] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b425)[0x7fc83c830425]
[opusgpu25-wbu2:17551] [15] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b7a9)[0x7fc83c8307a9]
[opusgpu25-wbu2:17551] [16] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7fc83c830865]
[opusgpu25-wbu2:17551] [17] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7fc856005054]
[opusgpu25-wbu2:17551] [18] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7fc83c3dfd70]
[opusgpu25-wbu2:17551] [19] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7fc83c3e1e2f]
[opusgpu25-wbu2:17551] [20] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7fc857b8d96f]
[opusgpu25-wbu2:17551] [21] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:17551] [22] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fc8565fbb45]
[opusgpu25-wbu2:17551] [23] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:17551] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
@jsquyres
Member

@jladd-mlnx @artpol84 I know you guys don't typically care about openib, but this part from the above bug report caught my eye:

  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

I mention this because on March 7, Josh committed b286478, which added some part numbers to the openib INI file. Are you missing some more part numbers?

@jladd-mlnx
Member

Hmmm...will take a look.


alsrgv commented May 24, 2017

BTW, if I run an actual workload (TensorFlow-MPI) with these parameters, I get a kernel BUG.

(env)asergeev@opusgpu25-wbu2:~/benchmarks/scripts/tf_cnn_benchmarks$ mpirun -mca btl self,openib -mca btl_openib_cuda_rdma_limit 10000000  -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 4 python tf_cnn_benchmarks.py --model inception3 --batch_size 64

TensorFlow:  1.2
Model:       inception3
Mode:        training
Batch size:  64 globalBatch size:  64 globalBatch size:  64 globalBatch size:  64 global
             64 per device
Device:      gpu:0
Data format: NCHW
Optimizer:   sgd
==========
Generating model
TensorFlow:  1.2
Model:       inception3
Mode:        training
Batch size:  64 globalBatch size:  64 globalBatch size:  64 globalBatch size:  64 global
             64 per device
Device:      gpu:0
Data format: NCHW
Optimizer:   sgd
==========
Generating model
TensorFlow:  1.2
Model:       inception3
Mode:        training
Batch size:  64 globalBatch size:  64 globalBatch size:  64 globalBatch size:  64 global
             64 per device
Device:      gpu:0
Data format: NCHW
Optimizer:   sgd
==========
Generating model
TensorFlow:  1.2
Model:       inception3
Mode:        training
Batch size:  64 globalBatch size:  64 globalBatch size:  64 globalBatch size:  64 global
             64 per device
Device:      gpu:0
Data format: NCHW
Optimizer:   sgd
==========
Generating model
2017-05-24 19:06:45.159314: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Tesla P40
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:84:00.0
Total memory: 22.38GiB
Free memory: 22.22GiB
2017-05-24 19:06:45.159354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-24 19:06:45.159363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y
2017-05-24 19:06:45.159386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0)
2017-05-24 19:06:45.201809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Tesla P40
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:85:00.0
Total memory: 22.38GiB
Free memory: 22.22GiB
2017-05-24 19:06:45.201864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-24 19:06:45.201886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y
2017-05-24 19:06:45.201904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:85:00.0)
2017-05-24 19:06:45.203449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Tesla P40
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:05:00.0
Total memory: 22.38GiB
Free memory: 22.22GiB
2017-05-24 19:06:45.203488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-24 19:06:45.203497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y
2017-05-24 19:06:45.203531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:05:00.0)
2017-05-24 19:06:45.227068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] Found device 0 with properties:
name: Tesla P40
major: 6 minor: 1 memoryClockRate (GHz) 1.531
pciBusID 0000:04:00.0
Total memory: 22.38GiB
Free memory: 22.22GiB
2017-05-24 19:06:45.227124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:927] DMA: 0
2017-05-24 19:06:45.227134: I tensorflow/core/common_runtime/gpu/gpu_device.cc:937] 0:   Y
2017-05-24 19:06:45.227151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0)
2017-05-24 19:06:45.530388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0)
...
2017-05-24 19:06:48.069435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0)
2017-05-24 19:06:48.070738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0)
WARNING:tensorflow:Error encountered when serializing LAYER_NAME_UIDS.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'dict' object has no attribute 'name'
2017-05-24 19:06:51.840443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:05:00.0)
WARNING:tensorflow:Error encountered when serializing LAYER_NAME_UIDS.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'dict' object has no attribute 'name'
2017-05-24 19:06:51.963929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:85:00.0)
WARNING:tensorflow:Error encountered when serializing LAYER_NAME_UIDS.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'dict' object has no attribute 'name'
WARNING:tensorflow:Error encountered when serializing LAYER_NAME_UIDS.
Type is unsupported, or the types of the items don't match field type in CollectionDef.
'dict' object has no attribute 'name'
2017-05-24 19:06:52.346766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0)
2017-05-24 19:06:52.363217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P40, pci bus id: 0000:84:00.0)
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            opusgpu25-wbu2
  Device name:           mlx5_1
  Device vendor ID:      0x02c9
  Device vendor part ID: 4117

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
Running warm up
Running warm up
Running warm up
Running warm up
[opusgpu25-wbu2:76413] 7 more processes have sent help message help-mpi-btl-openib.txt / no device params found
[opusgpu25-wbu2:76413] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Dmesg:

[ 4247.160209] kernel BUG at mm/memory.c:3202!
[ 4247.165533] invalid opcode: 0000 [#4] SMP
[ 4247.170798] Modules linked in: tcp_diag(E) inet_diag(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) xfrm_user(E) xfrm_algo(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) overlay(E) binfmt_misc(E) nv_peer_mem(OE) nvidia_uvm(POE) 8021q(E) garp(E) mrp(E) stp(E) llc(E) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs(E) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_core(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E)
[ 4247.255426]  ghash_clmulni_intel(E) hmac(E) nouveau(E) drbg(E) mgag200(E) video(E) ansi_cprng(E) ttm(E) snd_pcm(E) drm_kms_helper(E) knem(OE) aesni_intel(E) snd_timer(E) aes_x86_64(E) mxm_wmi(E) evdev(E) dcdbas(E) psmouse(E) lrw(E) sb_edac(E) gf128mul(E) snd(E) iTCO_wdt(E) glue_helper(E) soundcore(E) ablk_helper(E) drm(E) iTCO_vendor_support(E) cryptd(E) i2c_algo_bit(E) pcspkr(E) edac_core(E) mei_me(E) lpc_ich(E) usbhid(E) shpchp(E) mei(E) mfd_core(E) 8250_fintek(E) wmi(E) hid(E) uhci_hcd(E) ohci_hcd(E) tpm_tis(E) ipmi_watchdog(E) tpm(E) processor(E) acpi_power_meter(E) button(E) ipmi_si(E) ipmi_poweroff(E) ipmi_devintf(E) ipmi_msghandler(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) md_mod(E) sg(E) sd_mod(E) ahci(E) libahci(E) mlx5_core(OE) libata(E) mlx_compat(OE) ehci_pci(E) inet_lro(E)
[ 4247.338609]  vxlan(E) ehci_hcd(E) ip6_udp_tunnel(E) udp_tunnel(E) usbcore(E) ptp(E) crc32c_intel(E) scsi_mod(E) usb_common(E) pps_core(E) fjes(E)
[ 4247.353322] CPU: 5 PID: 78626 Comm: python Tainted: P      D    OE   4.4.66 #1
[ 4247.362166] Hardware name: Dell Inc. PowerEdge C4130/0VCHW8, BIOS 2.4.2 01/06/2017
[ 4247.371380] task: ffff88203341a7c0 ti: ffff881f202bc000 task.ti: ffff881f202bc000
[ 4247.380543] RIP: 0010:[<ffffffff8119aac1>]  [<ffffffff8119aac1>] handle_mm_fault+0x11a1/0x1840
[ 4247.391040] RSP: 0018:ffff881f202bf8b8  EFLAGS: 00010246
[ 4247.397844] RAX: 0000000000000000 RBX: 000001020e000000 RCX: ffff880000000000
[ 4247.406676] RDX: ffff881ed5179000 RSI: 00003fffffe00000 RDI: 0000000000000000
[ 4247.415502] RBP: 0000000000000000 R08: 0000001fe0b6a120 R09: 0000000000000000
[ 4247.424246] R10: 0000000000000000 R11: 0000000000000120 R12: ffff881023581870
[ 4247.432989] R13: ffff881fc7ae6380 R14: 00003ffffffff000 R15: ffff88203249c400
[ 4247.441807] FS:  00007f39ef68c700(0000) GS:ffff88203ea40000(0000) knlGS:0000000000000000
[ 4247.451615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4247.458806] CR2: 00007f3b35ef5a43 CR3: 0000002027c77000 CR4: 00000000003406e0
[ 4247.467600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4247.476364] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4247.485218] Stack:
[ 4247.488270]  ffffffff811733fd 0000001f00000000 ffff880000000000 0000160000000000
[ 4247.497415]  0000001fe0b6a120 0000000000000000 00000000024000c0 0000000000000000
[ 4247.506585]  0000000000000000 ffffffff81824a3b ffff88103ec19400 ffffffff81194409
[ 4247.515770] Call Trace:
[ 4247.519393]  [<ffffffff811733fd>] ? __alloc_pages_nodemask+0x1cd/0xbf0
[ 4247.527577]  [<ffffffff81194409>] ? follow_page_pte+0x209/0x3d0
[ 4247.535225]  [<ffffffff81194a5f>] ? __get_user_pages+0x13f/0x620
[ 4247.542825]  [<ffffffff811c3f6a>] ? __kmalloc+0x12a/0x1a0
[ 4247.549741]  [<ffffffff813091b0>] ? sg_kfree+0x20/0x20
[ 4247.556357]  [<ffffffffc04626bc>] ? ib_umem_get_ex+0x4ec/0x7d0 [ib_core]
[ 4247.564707]  [<ffffffffc056bd22>] ? mlx5_ib_reg_user_mr+0x82/0x830 [mlx5_ib]
[ 4247.573478]  [<ffffffff812fa3f2>] ? idr_mark_full+0x52/0x60
[ 4247.580529]  [<ffffffff811c2a2a>] ? cmpxchg_double_slab.isra.60+0x2a/0xe0
[ 4247.588957]  [<ffffffffc03e4279>] ? ib_uverbs_exp_reg_mr_ex+0x2c9/0x440 [ib_uverbs]
[ 4247.598409]  [<ffffffff811c40aa>] ? __slab_free+0xca/0x250
[ 4247.605396]  [<ffffffff811c416b>] ? __slab_free+0x18b/0x250
[ 4247.612436]  [<ffffffff811c40aa>] ? __slab_free+0xca/0x250
[ 4247.619407]  [<ffffffffc03df8f2>] ? ib_uverbs_write+0x612/0x690 [ib_uverbs]
[ 4247.627953]  [<ffffffff811dda63>] ? __vfs_write+0x33/0x120
[ 4247.634868]  [<ffffffff812cdf96>] ? blk_finish_plug+0x26/0x40
[ 4247.642051]  [<ffffffff811aaf9c>] ? SyS_madvise+0x22c/0x710
[ 4247.649021]  [<ffffffff811de154>] ? vfs_write+0xa4/0x190
[ 4247.655742]  [<ffffffff811deed2>] ? SyS_write+0x52/0xc0
[ 4247.662298]  [<ffffffff815a66b6>] ? entry_SYSCALL_64_fastpath+0x16/0x75
[ 4247.670409] Code: 00 83 60 04 fe 81 4c 24 10 00 04 00 00 e9 b9 f8 ff ff 48 8b 7c 24 50 89 44 24 08 e8 7a eb fd ff 8b 44 24 08 89 c5 e9 f7 f0 ff ff <0f> 0b 8d 50 e2 83 fa 01 0f 86 ee 04 00 00 83 f8 1d c7 44 24 10
[ 4247.693606] RIP  [<ffffffff8119aac1>] handle_mm_fault+0x11a1/0x1840
[ 4247.701347]  RSP <ffff881f202bf8b8>
[ 4247.706009] ---[ end trace a7262b7d381a4c62 ]---


alsrgv commented May 24, 2017

@jsquyres @jladd-mlnx @artpol84 it looks like my device was added in commit 2779765, but that change was never backported to 2.x.
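
For reference, the change amounts to adding part ID 4117 to the ConnectX4 stanza of the device-params INI file, roughly like this (an illustrative sketch; the other fields keep whatever values the shipped ConnectX4 section already has):

[Mellanox ConnectX4]
vendor_id = 0x2c9
vendor_part_id = 4115,4116,4117
use_eager_rdma = 1
mtu = 4096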

If I manually add that entry to openmpi/share/openmpi/mca-btl-openib-device-params.ini in the Mellanox ConnectX4 section, I still get:

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       2.77
1                      10.80
2                      10.40
4                      10.40
8                      10.34
16                     10.40
32                     10.51
64                     10.34
128                    10.85
256                    10.83
512                    10.44
1024                   11.05
2048                   11.51
4096                   12.59
8192                   14.89
16384                  17.27
32768                  51.52
osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed.
[opusgpu25-wbu2:17594] *** Process received signal ***
[opusgpu25-wbu2:17594] Signal: Aborted (6)
[opusgpu25-wbu2:17594] Signal code:  (-6)
[opusgpu25-wbu2:17594] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7fcbe92f28f0]
[opusgpu25-wbu2:17594] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7fcbe8f6d077]
[opusgpu25-wbu2:17594] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7fcbe8f6e458]
[opusgpu25-wbu2:17594] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7fcbe8f66266]
[opusgpu25-wbu2:17594] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7fcbe8f66312]
[opusgpu25-wbu2:17594] [ 5] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x76)[0x7fcbd2f8d01a]
[opusgpu25-wbu2:17594] [ 6] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(mca_btl_openib_prepare_src+0xac)[0x7fcbd2f8d576]
[opusgpu25-wbu2:17594] [ 7] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x1ac89)[0x7fcbd2b56c89]
[opusgpu25-wbu2:17594] [ 8] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_schedule_once+0x2d1)[0x7fcbd2b59637]
[opusgpu25-wbu2:17594] [ 9] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x13659)[0x7fcbd2b4f659]
[opusgpu25-wbu2:17594] [10] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0x136bb)[0x7fcbd2b4f6bb]
[opusgpu25-wbu2:17594] [11] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback_ack+0x25b)[0x7fcbd2b5008d]
[opusgpu25-wbu2:17594] [12] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x19cee)[0x7fcbd2f98cee]
[opusgpu25-wbu2:17594] [13] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b71f)[0x7fcbd2f9a71f]
[opusgpu25-wbu2:17594] [14] /home/asergeev/openmpi/lib/openmpi/mca_btl_openib.so(+0x1b865)[0x7fcbd2f9a865]
[opusgpu25-wbu2:17594] [15] /home/asergeev/openmpi/lib/libopen-pal.so.20(opal_progress+0xa9)[0x7fcbe8963054]
[opusgpu25-wbu2:17594] [16] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(+0xdd70)[0x7fcbd2b49d70]
[opusgpu25-wbu2:17594] [17] /home/asergeev/openmpi/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x525)[0x7fcbd2b4be2f]
[opusgpu25-wbu2:17594] [18] /home/asergeev/openmpi/lib/libmpi.so(PMPI_Send+0x2a7)[0x7fcbea4eb96f]
[opusgpu25-wbu2:17594] [19] /home/asergeev/omb/pt2pt/osu_latency[0x401225]
[opusgpu25-wbu2:17594] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7fcbe8f59b45]
[opusgpu25-wbu2:17594] [21] /home/asergeev/omb/pt2pt/osu_latency[0x4014be]
[opusgpu25-wbu2:17594] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node opusgpu25-wbu2 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

If I remove -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32, I get:

(env)asergeev@opusgpu25-wbu2:~$ mpirun -mca btl_base_verbose 100 -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D
[opusgpu34-wbu2:20880] mca: base: components_register: registering framework btl components
[opusgpu34-wbu2:20880] mca: base: components_register: found loaded component openib
[opusgpu34-wbu2:20880] mca: base: components_register: component openib register function successful
[opusgpu34-wbu2:20880] mca: base: components_register: found loaded component self
[opusgpu34-wbu2:20880] mca: base: components_register: component self register function successful
[opusgpu34-wbu2:20880] mca: base: components_open: opening btl components
[opusgpu34-wbu2:20880] mca: base: components_open: found loaded component openib
[opusgpu34-wbu2:20880] mca: base: components_open: component openib open function successful
[opusgpu34-wbu2:20880] mca: base: components_open: found loaded component self
[opusgpu34-wbu2:20880] mca: base: components_open: component self open function successful
[opusgpu34-wbu2:20880] select: initializing btl component openib
[opusgpu34-wbu2:20880] Checking distance from this process to device=mlx5_0
[opusgpu34-wbu2:20880] hwloc_distances->nbobjs=2
[opusgpu34-wbu2:20880] hwloc_distances->latency[0]=1.000000
[opusgpu34-wbu2:20880] hwloc_distances->latency[1]=2.100000
[opusgpu34-wbu2:20880] hwloc_distances->latency[2]=2.100000
[opusgpu34-wbu2:20880] hwloc_distances->latency[3]=1.000000
[opusgpu34-wbu2:20880] ibv_obj->logical_index=0
[opusgpu34-wbu2:20880] my_obj->logical_index=0
[opusgpu34-wbu2:20880] Process is bound: distance to device is 1.000000
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:637:init_one_port] looking for mlx5_0:1 GID index 0
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:668:init_one_port] my IB subnet_id for HCA mlx5_0 port 1 is 0000000000000000
[opusgpu34-wbu2][[18551,1],1][btl_openib_ip.c:366:add_rdma_addr] Adding addr 10.191.32.76 (0x4c20bf0a) subnet 0xabf2000 as mlx5_0:1
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:1351:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu34-wbu2:20880] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu34-wbu2][[18551,1],1][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu34-wbu2][[18551,1],1][connect/btl_openib_connect_rdmacm.c:2024:rdmacm_component_query] rdmacm CPC only supported when the first QP is a PP QP; skipped
[opusgpu34-wbu2][[18551,1],1][connect/btl_openib_connect_udcm.c:451:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_0:1
[opusgpu34-wbu2][[18551,1],1][connect/btl_openib_connect_udcm.c:500:udcm_component_query] unavailable for use on mlx5_0:1; skipped
[opusgpu34-wbu2:20880] select: init of component openib returned failure
[opusgpu34-wbu2][[18551,1],1][connect/btl_openib_connect_rdmacm.c:2201:rdmacm_component_finalize] rdmacm_component_finalize
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           opusgpu34-wbu2
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
[opusgpu34-wbu2:20880] mca: base: close: component openib closed
[opusgpu34-wbu2:20880] mca: base: close: unloading component openib
[opusgpu34-wbu2:20880] select: initializing btl component self
[opusgpu34-wbu2:20880] select: init of component self returned success
[opusgpu25-wbu2:23686] mca: base: components_register: registering framework btl components
[opusgpu25-wbu2:23686] mca: base: components_register: found loaded component self
[opusgpu25-wbu2:23686] mca: base: components_register: component self register function successful
[opusgpu25-wbu2:23686] mca: base: components_register: found loaded component openib
[opusgpu25-wbu2:23686] mca: base: components_register: component openib register function successful
[opusgpu25-wbu2:23686] mca: base: components_open: opening btl components
[opusgpu25-wbu2:23686] mca: base: components_open: found loaded component self
[opusgpu25-wbu2:23686] mca: base: components_open: component self open function successful
[opusgpu25-wbu2:23686] mca: base: components_open: found loaded component openib
[opusgpu25-wbu2:23686] mca: base: components_open: component openib open function successful
[opusgpu25-wbu2:23686] select: initializing btl component self
[opusgpu25-wbu2:23686] select: init of component self returned success
[opusgpu25-wbu2:23686] select: initializing btl component openib
[opusgpu25-wbu2:23686] Checking distance from this process to device=mlx5_0
[opusgpu25-wbu2:23686] hwloc_distances->nbobjs=2
[opusgpu25-wbu2:23686] hwloc_distances->latency[0]=1.000000
[opusgpu25-wbu2:23686] hwloc_distances->latency[1]=2.100000
[opusgpu25-wbu2:23686] hwloc_distances->latency[2]=2.100000
[opusgpu25-wbu2:23686] hwloc_distances->latency[3]=1.000000
[opusgpu25-wbu2:23686] ibv_obj->logical_index=0
[opusgpu25-wbu2:23686] my_obj->logical_index=0
[opusgpu25-wbu2:23686] Process is bound: distance to device is 1.000000
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:637:init_one_port] looking for mlx5_0:1 GID index 0
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:668:init_one_port] my IB subnet_id for HCA mlx5_0 port 1 is 0000000000000000
[opusgpu25-wbu2][[18551,1],0][btl_openib_ip.c:366:add_rdma_addr] Adding addr 10.191.32.74 (0x4a20bf0a) subnet 0xabf2000 as mlx5_0:1
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:1351:setup_qps] srq: rd_num is 256 rd_low is 192 sd_max is 128 rd_max is 64 srq_limit is 12
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu25-wbu2][[18551,1],0][btl_openib_component.c:1351:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[opusgpu25-wbu2:23686] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
[opusgpu25-wbu2][[18551,1],0][connect/btl_openib_connect_rdmacm.c:2024:rdmacm_component_query] rdmacm CPC only supported when the first QP is a PP QP; skipped
[opusgpu25-wbu2][[18551,1],0][connect/btl_openib_connect_udcm.c:451:udcm_component_query] UD CPC only supported on InfiniBand; skipped on mlx5_0:1
[opusgpu25-wbu2][[18551,1],0][connect/btl_openib_connect_udcm.c:500:udcm_component_query] unavailable for use on mlx5_0:1; skipped
[opusgpu25-wbu2:23686] select: init of component openib returned failure
[opusgpu25-wbu2][[18551,1],0][connect/btl_openib_connect_rdmacm.c:2201:rdmacm_component_finalize] rdmacm_component_finalize
[opusgpu25-wbu2:23686] mca: base: close: component openib closed
[opusgpu25-wbu2:23686] mca: base: close: unloading component openib
[opusgpu25-wbu2:23686] mca: bml: Using self btl for send to [[18551,1],0] on node opusgpu25-wbu2
[opusgpu34-wbu2:20880] mca: bml: Using self btl for send to [[18551,1],1] on node opusgpu34-wbu2
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[18551,1],1]) is on host: opusgpu34-wbu2
  Process 2 ([[18551,1],0]) is on host: opusgpu25-wbu2
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[opusgpu34-wbu2:20880] *** An error occurred in MPI_Barrier
[opusgpu34-wbu2:20880] *** reported by process [1215758337,1]
[opusgpu34-wbu2:20880] *** on communicator MPI_COMM_WORLD
[opusgpu34-wbu2:20880] *** MPI_ERR_INTERN: internal error
[opusgpu34-wbu2:20880] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[opusgpu34-wbu2:20880] ***    and potentially your MPI job)
# OSU MPI-CUDA Latency Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
[opusgpu25-wbu2:23680] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[opusgpu25-wbu2:23680] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[opusgpu25-wbu2:23680] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[opusgpu25-wbu2:23680] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal

@jladd-mlnx
Member

@alsrgv

In order to use RoCE, you need to keep -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32.

As for the OSU failures, @bureddy is able to reproduce this on one of our P100 setups and is investigating. Hopefully, we can help nail down the issue.

@jladd-mlnx
Member

@alsrgv,

@bureddy figured the issue out. From his analysis:

It is triggering the RDMA pipeline protocol, whose default message size is 128K, and you do not have a 128K QP in your choice of send queues. You can fix this by doing one of the following:

  1. Limit the max send size to 64K to match your largest send queue: -mca btl_openib_max_send_size 65536 (see the example invocation below)

  2. Change your max send queue length: -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,131072,256,128,32
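
For example, applying option 1 to the earlier osu_latency run just adds one parameter to the existing command line (illustrative; hosts and paths as in the runs above):

mpirun -mca btl self,openib -H opusgpu25-wbu2,opusgpu34-wbu2 -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_max_send_size 65536 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32 --prefix /home/asergeev/openmpi -n 2 /home/asergeev/omb/pt2pt/osu_latency -d cuda D D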


alsrgv commented May 25, 2017

Thanks @jladd-mlnx, OSU works perfectly now!

I still have an issue running a real TensorFlow workload: mpirun -mca btl self,openib -mca btl_openib_want_cuda_gdr 1 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,131072,256,128,32 --prefix /home/asergeev/openmpi -n 4 python tf_cnn_benchmarks.py --model inception3 --batch_size 64

It hangs, and in dmesg we see a kernel BUG:

[  751.653125] kernel BUG at mm/memory.c:3202!
[  751.658386] invalid opcode: 0000 [#3] SMP
[  751.663647] Modules linked in: nv_peer_mem(OE) tcp_diag(E) inet_diag(E) ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) xfrm_user(E) xfrm_algo(E) iptable_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) nf_nat_ipv4(E) xt_addrtype(E) iptable_filter(E) ip_tables(E) xt_conntrack(E) x_tables(E) nf_nat(E) nf_conntrack(E) br_netfilter(E) bridge(E) overlay(E) binfmt_misc(E) nvidia_uvm(POE) 8021q(E) garp(E) mrp(E) stp(E) llc(E) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs(E) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_en(OE) vxlan(E) ip6_udp_tunnel(E) udp_tunnel(E) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) intel_rapl(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E)
[  751.747354]  nouveau(E) ghash_clmulni_intel(E) knem(OE) psmouse(E) hmac(E) mgag200(E) video(E) usbhid(E) drbg(E) ttm(E) hid(E) drm_kms_helper(E) snd_pcm(E) uhci_hcd(E) ansi_cprng(E) sb_edac(E) drm(E) snd_timer(E) iTCO_wdt(E) aesni_intel(E) iTCO_vendor_support(E) snd(E) aes_x86_64(E) soundcore(E) mei_me(E) lpc_ich(E) lrw(E) mxm_wmi(E) gf128mul(E) evdev(E) dcdbas(E) glue_helper(E) ablk_helper(E) cryptd(E) i2c_algo_bit(E) ohci_hcd(E) pcspkr(E) edac_core(E) mfd_core(E) ipmi_watchdog(E) mei(E) shpchp(E) 8250_fintek(E) wmi(E) tpm_tis(E) tpm(E) processor(E) acpi_power_meter(E) button(E) ipmi_si(E) ipmi_poweroff(E) ipmi_devintf(E) ipmi_msghandler(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) dm_mod(E) md_mod(E) sg(E) sd_mod(E) ahci(E) libahci(E) ehci_pci(E) mlx5_core(OE) ehci_hcd(E) libata(E) mlx_compat(OE)
[  751.830377]  inet_lro(E) usbcore(E) ptp(E) crc32c_intel(E) scsi_mod(E) usb_common(E) pps_core(E) fjes(E)
[  751.841184] CPU: 21 PID: 24814 Comm: python Tainted: P      D    OE   4.4.66 #1
[  751.850098] Hardware name: Dell Inc. PowerEdge C4130/0VCHW8, BIOS 2.4.2 01/06/2017
[  751.859318] task: ffff8820256b27c0 ti: ffff881f10410000 task.ti: ffff881f10410000
[  751.868482] RIP: 0010:[<ffffffff8119aac1>]  [<ffffffff8119aac1>] handle_mm_fault+0x11a1/0x1840
[  751.878891] RSP: 0018:ffff881f10413a18  EFLAGS: 00010246
[  751.885660] RAX: 0000000000000000 RBX: 000001020e000000 RCX: ffff880000000000
[  751.894443] RDX: ffff881ec9a8b000 RSI: 00003fffffe00000 RDI: 0000000000000000
[  751.903225] RBP: 0000000000000000 R08: 0000001f1bf4b120 R09: 0000000000000000
[  751.912023] R10: 0000000000000000 R11: 0000000000000120 R12: ffff880fe099c730
[  751.920798] R13: ffff881fd1043380 R14: 00003ffffffff000 R15: ffff8810233d0c00
[  751.929562] FS:  00007f59a563f700(0000) GS:ffff88203eb40000(0000) knlGS:0000000000000000
[  751.939411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  751.946691] CR2: 00007f5ab1abba43 CR3: 0000002012de9000 CR4: 00000000003406e0
[  751.955432] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  751.964228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  751.972928] Stack:
[  751.975939]  ffffffff811733fd 0000001f00000000 ffff880000000000 0000160000000000
[  751.985038]  0000001f1bf4b120 ffff882000000000 00400000024000c0 ffff8820256b27c0
[  751.994171]  ffff8820329145a8 ffff881ec9ed8000 ffff88103ec19400 ffffffff81194409
[  752.003291] Call Trace:
[  752.006882]  [<ffffffff811733fd>] ? __alloc_pages_nodemask+0x1cd/0xbf0
[  752.015077]  [<ffffffff81194409>] ? follow_page_pte+0x209/0x3d0
[  752.022561]  [<ffffffff81194a5f>] ? __get_user_pages+0x13f/0x620
[  752.030143]  [<ffffffff811c3f6a>] ? __kmalloc+0x12a/0x1a0
[  752.037061]  [<ffffffff813091b0>] ? sg_kfree+0x20/0x20
[  752.043707]  [<ffffffffc04bf9c1>] ? ib_umem_get+0x261/0x7a0 [ib_core]
[  752.051779]  [<ffffffffc06a2419>] ? mr_umem_get.isra.19+0x39/0x170 [mlx5_ib]
[  752.060517]  [<ffffffffc06a599d>] ? mlx5_ib_reg_user_mr+0x7d/0x4c0 [mlx5_ib]
[  752.069242]  [<ffffffffc03ccbe1>] ? ib_uverbs_reg_mr+0x1b1/0x350 [ib_uverbs]
[  752.077967]  [<ffffffffc03c8258>] ? ib_uverbs_write+0x208/0x410 [ib_uverbs]
[  752.086593]  [<ffffffff811dda63>] ? __vfs_write+0x33/0x120
[  752.093589]  [<ffffffff812cdf96>] ? blk_finish_plug+0x26/0x40
[  752.100857]  [<ffffffff811aaf9c>] ? SyS_madvise+0x22c/0x710
[  752.107947]  [<ffffffff811de154>] ? vfs_write+0xa4/0x190
[  752.114686]  [<ffffffff811deed2>] ? SyS_write+0x52/0xc0
[  752.121317]  [<ffffffff815a66b6>] ? entry_SYSCALL_64_fastpath+0x16/0x75
[  752.129478] Code: 00 83 60 04 fe 81 4c 24 10 00 04 00 00 e9 b9 f8 ff ff 48 8b 7c 24 50 89 44 24 08 e8 7a eb fd ff 8b 44 24 08 89 c5 e9 f7 f0 ff ff <0f> 0b 8d 50 e2 83 fa 01 0f 86 ee 04 00 00 83 f8 1d c7 44 24 10
[  752.152791] RIP  [<ffffffff8119aac1>] handle_mm_fault+0x11a1/0x1840
[  752.160561]  RSP <ffff881f10413a18>
[  752.165237] ---[ end trace 690c3802e704fd45 ]---

Mellanox drivers are version 4.0-2.0.0, but we had the same issue with 3.3 as well.


bureddy commented May 25, 2017

@alsrgv
Does it work if you disable GDR?
Can you also check with --mca mpi_leave_pinned 0?
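
For example, something like this (a sketch reusing the TensorFlow command from above; the two flags of interest are shown explicitly and can also be tried separately):

mpirun -mca btl self,openib --mca mpi_leave_pinned 0 -mca btl_openib_want_cuda_gdr 0 -mca btl_openib_receive_queues P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,131072,256,128,32 --prefix /home/asergeev/openmpi -n 4 python tf_cnn_benchmarks.py --model inception3 --batch_size 64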


alsrgv commented May 26, 2017

Yes, the culprit turned out to be GDR - I was trying to go from one GPU to another GPU on the same box via GPUDirect, to avoid smcuda IPC. The reason I wanted to avoid the latter is that TensorFlow forces me to set CUDA_VISIBLE_DEVICES, which doesn't play well with IPC. I guess I will have to live with the non-IPC version of smcuda.

I was able to get GDR working across nodes. Now I need to beef up my cabling to actually see the difference; there's no difference on a 25Gbit port :-)

Thanks for all your help!

@jladd-mlnx
Member

@alsrgv I'm closing this issue. Please feel free to reopen if needed.
