OMPI+UCX on GPUs : drop of performance compared with pt2pt #7965
Comments
@cayrols currently, UCX/CUDA RMA does not have all the CUDA performance protocols and it supports only limited cases. It will be improved in future releases.
Still, this can hardly explain why faking RMA over pt2pt over UCX gives significantly better results than RMA over UCX.
pt2pt uses the UCX tag API, which has better CUDA protocol selection, including cuda-ipc. With PUT, it depends only on GPUDirect RDMA over IB. Also, on Summit, GPUDirect RDMA performance drops slightly for message sizes > 4 MB. I quickly ran on Summit: I can downgrade pt2pt Isend performance to PUT performance if I disable cuda_ipc and use GPUDirect RDMA Put for the UCX tag API.
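(For reference, that comparison can be reproduced roughly as follows; this is an illustrative sketch, the benchmark binary is a placeholder, and the exact UCX_TLS transport list varies per system.)

```sh
# Illustrative sketch only; ./bench is a placeholder benchmark binary.
# With cuda_ipc available for intra-node GPU transfers:
mpirun -np 2 -x UCX_TLS=rc,cuda_copy,cuda_ipc,gdr_copy ./bench
# Without cuda_ipc, so GPU transfers rely on GPUDirect RDMA / staged copies:
mpirun -np 2 -x UCX_TLS=rc,cuda_copy,gdr_copy ./bench
```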
@bureddy May I kindly ask whether this behavior applies only to PUT, or to all RMA operations?
I guess we have to wait for openucx/ucx#5473 to get merged before seeing better GPU RMA performance.
@cayrols Recent versions of UCX have added support for CUDA RMA. They are not enabled by default yet, but can be enabled with
That is great news! Thank you for the info. I will definitely try it!
Hi all,
I am contacting you because of some OSC issues on GPUs.
First of all, I have noticed that since version 5.0 the pt2pt OSC component has been removed. Therefore, if I do not compile with UCX, which is optional, one-sided communication does not work, at least on GPUs.
(Note that I have tried compiling ompi 5.0 with the latest UCX, and a simple MPI_Put from GPU to GPU causes a deadlock.)
I found that OMPI 4.0.4 with the latest UCX compiles and runs.
(Note that, for small sizes, i.e., say < 1e6 doubles, we need to set the variable UCX_ZCOPY_THRESH=1.)
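Concretely, the two configurations being compared can be selected along these lines (a sketch only; the process count and binary name are placeholders, not the exact command lines used):

```sh
# Sketch only; process count and binary name are placeholders.
# One-sided communication over the pt2pt OSC component (OMPI 4.0.4):
mpirun -np 2 --mca osc pt2pt ./bench_ucx
# One-sided communication over the ucx OSC component, lowering the
# zero-copy threshold so that small messages also take the zcopy path:
mpirun -np 2 --mca osc ucx -x UCX_ZCOPY_THRESH=1 ./bench_ucx
```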
The context is the following:
I have a bunch of GPUs that are exchanging data using MPI_Put routine and MPI_Fence, mainly.
When using pt2pt, I get the expected bandwidth per rank. However, when using UCX, the same bandwidth drops.
I have put together a reproducer so that you can try it yourself.
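The core of the reproducer is essentially the pattern below (a simplified sketch for illustration, not the attached code verbatim; the buffer size and neighbor pattern are placeholders, and it assumes a CUDA-aware MPI/UCX build so the window can be backed by device memory):

```c
/* Simplified sketch of the communication pattern (not the attached
 * reproducer verbatim); assumes a CUDA-aware MPI/UCX build so that
 * MPI_Win_create can be called on device memory. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 24;                     /* doubles per rank (placeholder size) */
    double *d_send, *d_win;
    cudaMalloc((void **)&d_send, n * sizeof(double));
    cudaMalloc((void **)&d_win,  n * sizeof(double));

    /* Expose the GPU buffer through an MPI window. */
    MPI_Win win;
    MPI_Win_create(d_win, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Each rank puts its GPU data into the next rank's GPU window,
     * synchronized with fences; this is the bandwidth being measured. */
    int target = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Put(d_send, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    MPI_Win_free(&win);
    cudaFree(d_send);
    cudaFree(d_win);
    MPI_Finalize();
    return 0;
}
```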
This reproducer gives me the following performance on Summit with pt2pt:
Now, when using UCX, I got:
From this output, you can see in the last column that the bandwidth per rank reaches roughly 3.5 GB/s with pt2pt, whereas with UCX it does not exceed 2.5 GB/s.
I have tried many flags, like
-x UCX_TLS=ib,cuda_copy,cuda_ipc
and others, but none gave me bandwidth similar to, or better than, pt2pt.
So if you have any ideas, maybe @bosilca, that would be really great.
Many thanks.
Details of the installation
Here are the flags that I used to install it on Summit
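A CUDA-aware OMPI 4.0.4 + UCX build generally takes the following shape (an illustrative sketch; the install prefixes, CUDA path, and any extra options are placeholders rather than the exact flags used here):

```sh
# Illustrative sketch; prefixes and paths are placeholders.
# UCX with CUDA support:
./contrib/configure-release --prefix=$UCX_INSTALL --with-cuda=$CUDA_HOME
make -j && make install

# Open MPI 4.0.4 with CUDA support, built against that UCX:
./configure --prefix=$OMPI_INSTALL --with-cuda=$CUDA_HOME --with-ucx=$UCX_INSTALL
make -j && make install
```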
Note that the output of ucx_info is:
Reproducer
Note that I had to change the extension from .cu to .LOG to attach it.
bench_ucx.LOG