
Atomics get completion with error on mlx5 #1241


Closed
sjeaugey opened this issue Dec 18, 2015 · 7 comments · Fixed by #1319 or open-mpi/ompi-release#911

@sjeaugey
Member

A dozen MTT tests related to atomics still fail using mlx5. They all fail with the same error message:

mlx5: drossetti-ivy4.nvidia.com: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 110818b2 000001d0
[[53836,1],0][btl_openib_component.c:3611:handle_wc] from drossetti-ivy4 to: drossetti-ivy4 error
polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 25e14b8 opcode 3 vendor
error 136 qp_idx 0
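
For context, that message is printed when the openib BTL polls a completion queue and finds a failed work completion. Status 10 is IBV_WC_REM_ACCESS_ERR (the remote HCA refused the access) and opcode 3 matches IBV_WC_COMP_SWAP, i.e. an IB compare-and-swap atomic. A minimal libibverbs sketch (not Open MPI's handle_wc() itself) of where such an error surfaces:

```c
/* Minimal sketch, not Open MPI code: drain a completion queue and report
 * failed work completions the way the message above does. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* status 10 == IBV_WC_REM_ACCESS_ERR, opcode 3 == IBV_WC_COMP_SWAP */
            fprintf(stderr,
                    "completion with error: %s (%d) wr_id %llx opcode %d vendor_err 0x%x\n",
                    ibv_wc_status_str(wc.status), wc.status,
                    (unsigned long long) wc.wr_id, wc.opcode,
                    (unsigned) wc.vendor_err);
        }
    }
}
```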

@sjeaugey sjeaugey added the bug label Dec 18, 2015
@sjeaugey sjeaugey added this to the v2.x milestone Dec 18, 2015
@hjelmn
Member

hjelmn commented Dec 18, 2015

Do any of the ivb machines have the latest MOFED? It looks like this might be a change in behavior from the tested version (2.x).

@sjeaugey
Member Author

Yes, they're both running MOFED 3.1.

@sjeaugey
Member Author

sjeaugey commented Jan 6, 2016

Narrowing the issue: it only appears with GPU Direct RDMA, and disappears if I remove ATOMIC_FOPS from btl_openib_flags.

I guess you are doing IB atomics directly on the memory location, which won't work if it is GPU memory mapped directly to the HCA. Even if PCIe supports atomics (from the NIC to the GPU), the GPU won't support them through the PCI BAR mapping.

Do you think it would be possible to disable FOPS when the target is GPU memory? I guess not, since the source has no knowledge of the destination buffer ... unless there is an RDV (rendezvous) protocol? Another option would be to disable FOPS when the window is on the GPU.

If we can't find a good way to do that, I'll write a patch to disable ATOMIC_FOPS whenever GDR is activated.
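
For the record, a sketch of what that fallback could look like (not an actual patch; the GDR condition is a placeholder, while MCA_BTL_FLAGS_ATOMIC_FOPS and the btl_flags field are the real BTL capability bits being referred to):

```c
/* Hypothetical workaround sketch, not a real patch: during openib BTL module
 * setup, stop advertising fetch-and-op atomics whenever GPUDirect RDMA is in
 * use, so the one-sided code falls back to a non-atomic path. */
if (gpudirect_rdma_enabled) {                        /* placeholder condition */
    openib_btl->super.btl_flags &= ~MCA_BTL_FLAGS_ATOMIC_FOPS;
}
```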

@sjeaugey sjeaugey changed the title Atomics get completion with error on mlx5 Atomics get completion with error on mlx5 when GDR is enabled Jan 7, 2016
@sjeaugey sjeaugey assigned sjeaugey and unassigned hjelmn Jan 12, 2016
@sjeaugey sjeaugey changed the title Atomics get completion with error on mlx5 when GDR is enabled Atomics get completion with error on mlx5 Jan 12, 2016
@sjeaugey
Member Author

Re-testing, it has in fact nothing to do with GPU Direct. I thought so because the non-GPU Direct test was also running over the tcp BTL ... so we're back to your first question about the MOFED version. Do you have an idea where to look and what to try?

@sjeaugey sjeaugey assigned hjelmn and unassigned sjeaugey Jan 12, 2016
@sjeaugey
Member Author

On the IBM onesided/c_accumulate test, the crash seems to happen on the first fetch_and_add operation. There are many cswap operations before it which work fine.
Edit: fetch_and_add is not the problem here; it crashes the same way if we do a cswap.
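
For reference, a minimal standalone sketch (not the IBM test itself) that exercises the same one-sided atomic path, a cswap followed by a fetch_and_add on a remote window:

```c
/* Minimal sketch of the failing pattern: MPI-3 one-sided atomics on a
 * window, which osc/rdma maps to BTL/IB cswap and fetch-and-add. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long buf = 0, result = 0, compare = 0, swap = 1, one = 1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Win_create(&buf, sizeof(buf), sizeof(buf), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    int target = (rank + 1) % nprocs;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* compare-and-swap: if buf on the target is 0, set it to 1 */
    MPI_Compare_and_swap(&swap, &compare, &result, MPI_LONG, target, 0, win);
    /* fetch-and-add: atomically add 1 to buf on the target */
    MPI_Fetch_and_op(&one, &result, MPI_LONG, target, 0, MPI_SUM, win);
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Per the observations in the next comment, running it with two or more ranks per node over the openib BTL should be what hits the failing path.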

@sjeaugey
Member Author

Adding traces shows that the crash happens when we do the accumulate on rank 1. Accumulate on rank 0 works.

Btw, it only crashes with multiple ranks per node; it works fine with 1 rank per node or with multiple ranks on a single node. It may therefore be related to #1209.

Atomics between the nodes seem to work; what doesn't work is:

  • rank 0 or rank 1 doing remote atomic to rank 1 memory (local but through IB)
  • rank 2 or rank 3 doing remote atomic to rank 3 memory (also local but through IB)

@hjelmn
Member

hjelmn commented Jan 22, 2016

Looking at this now. I probably overlooked this problem because it doesn't fail on ugni for some reason. It's probably a difference in how the registration keys work.

hjelmn added a commit to hjelmn/ompi that referenced this issue Jan 22, 2016
If atomics are not globally visible (cpu and nic atomics do not mix)
then a btl endpoint must be used to access local ranks. To avoid
issues that are caused by having the same region registered with
multiple handles osc/rdma was updated to always use the handle for
rank 0. There was a bug in the update that caused osc/rdma to continue
using the local endpoint for accessing the state even though the
pointer/handle are not valid for that endpoint. This commit fixes the
bug.

Fixes open-mpi#1241.

Signed-off-by: Nathan Hjelm <[email protected]>
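
An illustrative sketch of the selection logic the commit message describes, using hypothetical, simplified names (not the actual osc/rdma structures):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for osc/rdma's peer and module state;
 * the field names are illustrative only. */
typedef struct {
    bool  is_local;
    bool  use_cpu_atomics;
    void *state_endpoint;   /* BTL endpoint used to reach the peer's state */
    void *state_handle;     /* registration handle valid for that endpoint */
} peer_t;

typedef struct {
    bool  atomics_globally_visible;  /* do CPU and NIC atomics mix? */
    void *leader_endpoint;           /* endpoint of the rank exposing the state */
    void *leader_state_handle;       /* that rank's registration handle */
} module_t;

static void select_state_access(peer_t *peer, const module_t *module)
{
    if (module->atomics_globally_visible && peer->is_local) {
        /* CPU and NIC atomics mix: safe to use CPU atomics on the locally
         * mapped state, no endpoint needed. */
        peer->use_cpu_atomics = true;
        peer->state_endpoint  = NULL;
        peer->state_handle    = NULL;
    } else {
        /* NIC-only atomics: every access, including to on-node peers, must go
         * through the endpoint/handle of the rank that exposed the state.
         * The bug was keeping the local endpoint here while the pointer and
         * handle belonged to that other endpoint. */
        peer->use_cpu_atomics = false;
        peer->state_endpoint  = module->leader_endpoint;
        peer->state_handle    = module->leader_state_handle;
    }
}
```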
hjelmn added a commit to hjelmn/ompi-release that referenced this issue Jan 22, 2016
(same commit message as above; cherry picked from open-mpi/ompi@49d2f44)
artpol84 pushed a commit to artpol84/ompi that referenced this issue Jan 26, 2016
(same commit message as above)
jsquyres pushed a commit to jsquyres/ompi that referenced this issue Sep 19, 2016
opal/asm/base/MIPS.asm: fix uClibc regdef.h include path
bosilca pushed a commit to bosilca/ompi that referenced this issue Oct 3, 2016
(same commit message as above)