Atomics get completion with error on mlx5 #1241
Do any of the ivb machines have the latest MOFED? Looks like this might be a change in behavior from the tested version (2.x).
Yes, they're both running MOFED 3.1.
Narrowing down the issue: it only appears with GPU Direct RDMA, and disappears if I remove ATOMIC_FOPS from btl_openib_flags. I guess you are doing IB atomics directly on the memory location, which won't work if it is GPU memory mapped directly to the HCA. Even if PCIe supports atomics (from the NIC to the GPU), the GPU won't support them through the PCI BAR mapping. Do you think it would be possible to disable FOPS when the target is GPU memory? I guess not, since the source has no knowledge of the destination buffer ... unless there is a RDV protocol? Another option would be to disable FOPS when the window is on the GPU. If we can't find a good way to do that, I'll do a patch to disable ATOMIC_FOPS whenever GDR is activated.
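A minimal sketch of how the "disable FOPS when the window is on the GPU" idea could classify a window buffer with the CUDA driver API. The helper name is made up for illustration and is not part of Open MPI; it only shows the kind of check that would be needed:

    #include <cuda.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helper (not Open MPI code): returns true when "ptr" points
     * into device memory, i.e. when IB atomics through the PCI BAR mapping
     * would be unsafe and ATOMIC_FOPS should not be used for the window. */
    static bool window_buffer_is_device_memory(const void *ptr)
    {
        CUmemorytype mem_type = CU_MEMORYTYPE_HOST;
        CUresult rc = cuPointerGetAttribute(&mem_type,
                                            CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                            (CUdeviceptr) (uintptr_t) ptr);

        /* Pointers the CUDA driver does not know about are plain host memory. */
        if (CUDA_ERROR_INVALID_VALUE == rc) {
            return false;
        }
        return (CUDA_SUCCESS == rc) && (CU_MEMORYTYPE_DEVICE == mem_type);
    }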
Re-testing, it has in fact nothing to do with GPU Direct. I thought it did because the non-GPU Direct test was actually running on the tcp BTL ... so we're back to your first question about the MOFED version. Do you have an idea where to look and what to try?
On the IBM onesided/c_accumulate test, the crash seems to happen on the first fetch_and_add operation. There are many cswap operations before it which work fine.
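For reference, a minimal standalone pattern that exercises the same two operations, a compare-and-swap followed by a fetch-and-add on an allocated window. This is only a sketch of the failing sequence, not the IBM onesided/c_accumulate test itself, and it needs at least two ranks (multiple ranks per node to match the failing configuration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, *win_buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            fprintf(stderr, "run with at least 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* One int of window memory per rank. */
        MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &win_buf, &win);
        *win_buf = 0;
        MPI_Win_lock_all(0, win);

        /* Compare-and-swap targeting rank 0: this kind of operation succeeds. */
        int compare = 0, origin = rank + 1, result = -1;
        MPI_Compare_and_swap(&origin, &compare, &result, MPI_INT, 0, 0, win);
        MPI_Win_flush(0, win);

        /* Fetch-and-add targeting rank 1: the first operation of this kind is
         * where the REMOTE ACCESS ERROR shows up with mlx5. */
        int one = 1, fetched = -1;
        MPI_Fetch_and_op(&one, &fetched, MPI_INT, 1, 0, MPI_SUM, win);
        MPI_Win_flush(1, win);

        MPI_Win_unlock_all(win);
        printf("rank %d: cswap result %d, fetch_and_add result %d\n",
               rank, result, fetched);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }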
Adding traces: the crash happens when we do the accumulate on rank 1; the accumulate on rank 0 works. Btw, it only crashes with multiple ranks per node; it works fine with 1 rank per node or with multiple ranks on a single node. It may therefore be related to #1209. The atomics between the nodes seem to work; what doesn't work is:
Looking at this now. I probably overlooked this problem because it doesn't fail on ugni for some reason. It's probably a difference in how the registration keys work.
If atomics are not globally visible (cpu and nic atomics do not mix), then a btl endpoint must be used to access local ranks. To avoid issues caused by having the same region registered with multiple handles, osc/rdma was updated to always use the handle for rank 0. There was a bug in the update that caused osc/rdma to continue using the local endpoint for accessing the state even though the pointer/handle are not valid for that endpoint. This commit fixes the bug. Fixes open-mpi#1241. Signed-off-by: Nathan Hjelm <[email protected]>
If atomics are not globally visible (cpu and nic atomics do not mix), then a btl endpoint must be used to access local ranks. To avoid issues caused by having the same region registered with multiple handles, osc/rdma was updated to always use the handle for rank 0. There was a bug in the update that caused osc/rdma to continue using the local endpoint for accessing the state even though the pointer/handle are not valid for that endpoint. This commit fixes the bug. Fixes open-mpi/ompi#1241. (cherry picked from open-mpi/ompi@49d2f44) Signed-off-by: Nathan Hjelm <[email protected]>
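To make the failure mode in that commit message more concrete: when CPU and NIC atomics do not mix, osc/rdma reaches even on-node peers through a NIC endpoint, so the state address and registration handle passed to the btl must be the ones obtained for that same endpoint. A stripped-down sketch of that invariant, using hypothetical types rather than the real osc/rdma structures:

    #include <stdint.h>

    /* Hypothetical stand-ins for the btl endpoint and registration handle types. */
    typedef struct endpoint endpoint_t;
    typedef struct reg_handle reg_handle_t;

    /* A peer's state in this sketch: the "self" view, and the view registered
     * through rank 0's endpoint (used when every access must go through the NIC). */
    typedef struct {
        endpoint_t   *local_endpoint;   /* shared-memory / self endpoint     */
        uint64_t      local_state;      /* address valid for local_endpoint  */
        reg_handle_t *local_handle;

        endpoint_t   *rank0_endpoint;   /* NIC endpoint of rank 0 on the node */
        uint64_t      rank0_state;      /* address valid for rank0_endpoint   */
        reg_handle_t *rank0_handle;
    } peer_t;

    typedef struct {
        endpoint_t   *endpoint;
        uint64_t      address;
        reg_handle_t *handle;
    } state_target_t;

    /* The bug: picking rank0_state/rank0_handle while still issuing the atomic
     * on local_endpoint (or vice versa).  The fix: select all three together. */
    static state_target_t select_state_target(const peer_t *peer, int nic_atomics_only)
    {
        state_target_t target;
        if (nic_atomics_only) {
            target.endpoint = peer->rank0_endpoint;
            target.address  = peer->rank0_state;
            target.handle   = peer->rank0_handle;
        } else {
            target.endpoint = peer->local_endpoint;
            target.address  = peer->local_state;
            target.handle   = peer->local_handle;
        }
        return target;
    }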
A dozen MTT tests related to atomics still fail using mlx5. They all fail with the same error message:
mlx5: drossetti-ivy4.nvidia.com: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 110818b2 000001d0
[[53836,1],0][btl_openib_component.c:3611:handle_wc] from drossetti-ivy4 to: drossetti-ivy4 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 25e14b8 opcode 3 vendor error 136 qp_idx 0
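For anyone decoding the log: in libibverbs, work-completion status 10 is IBV_WC_REM_ACCESS_ERR, i.e. the remote side rejected the access (typically a protection/rkey problem on the target buffer), which fits a wrong registration handle being used for the atomic. A minimal completion-drain loop in the spirit of the handle_wc() check, not the actual Open MPI code and with device, CQ and QP setup omitted:

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <inttypes.h>

    /* Drain a completion queue and report failed work completions.  "cq" is
     * assumed to come from an already initialized verbs context. */
    static void drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (IBV_WC_SUCCESS != wc.status) {
                /* The failure in this issue reports IBV_WC_REM_ACCESS_ERR
                 * (numeric status 10) plus a vendor-specific error code. */
                fprintf(stderr,
                        "completion with error: %s (status %d) wr_id 0x%" PRIx64
                        " opcode %d vendor_err 0x%x\n",
                        ibv_wc_status_str(wc.status), (int) wc.status,
                        wc.wr_id, (int) wc.opcode, wc.vendor_err);
            }
        }
    }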