osu_latency: btl_openib.c:1441: mca_btl_openib_alloc: Assertion `qp != 255' failed. #3573
@jladd-mlnx @artpol84 I know you guys don't typically care about
I mention this because on March 7, Josh committed b286478, which added some part numbers to the
Hmmm... will take a look.
BTW, if I run an actual workload (TensorFlow-MPI) with these parameters, I get a kernel BUG.
Dmesg:
@jsquyres @jladd-mlnx @artpol84 looks like my device got added in commit 2779765, but it was never backported into 2.x. If I manually add it to
If I remove
@bureddy figured the issue out. From his analysis: It is triggering RDMA pipeline protocol where the default message size is 128K. You do not have a 128K QP in your choice of send queues. You can fix this by doing one of the following:
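One concrete way to apply that suggestion would be to extend the openib BTL's receive-queue list with a 128K queue so the pipeline's fragments have a matching QP. The queue specification below is an illustrative sketch; the buffer sizes and counts are assumptions, not values confirmed in this issue:

```shell
# Illustrative sketch (sizes/counts are assumptions): append a 128K shared
# receive queue (S,131072,...) so the RDMA pipeline's default 128K fragments
# have a matching QP in the openib BTL's send-queue choices.
mpirun -np 2 \
    --mca btl openib,self,vader \
    --mca btl_openib_receive_queues \
        P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,131072,1024,1008,64 \
    ./osu_latency D D
```

Alternatively, shrinking the pipeline's fragment size so it fits one of the existing queues should have the same effect.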
Thanks @jladd-mlnx, OSU works perfectly now! I still have an issue running a real TensorFlow workload: it hangs, and in
Mellanox drivers are version 4.0-2.0.0, but we had the same issue with 3.3 as well.
@alsrgv
Yes, the culprit turned out to be GDR - I was trying to go from one GPU to another GPU on the same box via GPUDirect, to avoid I was able to make GDR work across nodes. Now I need to beef up my cabling to actually see the difference; there's no difference on a 25Gbit port :-) Thanks for all your help!
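For the same-box GPU-to-GPU path, Open MPI's smcuda BTL normally uses CUDA IPC; if that path is the one misbehaving, it can be toggled off for comparison. This is a generic debugging sketch, not a step taken from this thread:

```shell
# Debugging sketch (not from this thread): disable CUDA IPC in the smcuda BTL
# to compare the same-node GPU-to-GPU path against staging through host memory.
mpirun -np 2 --mca btl_smcuda_use_cuda_ipc 0 ./osu_latency D D
```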
@alsrgv I'm closing this issue. Please feel free to reopen if needed.
Background information
I'm running osu_latency from http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.3.2.tar.gz in CUDA mode with RoCE RDMA-CM, and I'm getting the error in the title. The error seems to happen when the message size is above 32768.
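A run along the lines described might look like the following. The host names are placeholders, and selecting RDMA-CM via `btl_openib_cpc_include rdmacm` is an assumption about how the benchmark was launched, not a command taken from this issue:

```shell
# Hypothetical reproduction command; node1/node2 are placeholder host names.
# "D D" asks osu_latency to place both send and receive buffers on the GPU.
mpirun -np 2 -H node1,node2 \
    --mca btl openib,self \
    --mca btl_openib_cpc_include rdmacm \
    ./osu_latency D D
```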
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Open MPI v2.1.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
./configure --with-cuda --enable-debug --prefix=/home/asergeev/openmpi
Please describe the system on which you are running
Details of the problem
I am able to make it pass if I specify large RDMA limit, like this:
But then it still fails if I disable GPU direct altogether.
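One way to express a "large RDMA limit" is to raise the message size at which the openib BTL switches to the RDMA pipeline protocol; the parameter and value below are assumptions for illustration, not the exact command used here:

```shell
# Illustrative workaround (parameter and value are assumptions): raise the RDMA
# pipeline threshold to 1 MB so 128K messages stay off the pipeline protocol.
mpirun -np 2 --mca btl_openib_min_rdma_pipeline_size 1048576 ./osu_latency D D
```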