
OMPI+UCX on GPUs: drop of performance compared with pt2pt #7965


Open
cayrols opened this issue Jul 25, 2020 · 8 comments

cayrols commented Jul 25, 2020

Hi all,

I am contacting you because of some OSC issues on GPUs.
First of all, I have noticed that the pt2pt OSC component has been removed since version 5.0. Therefore, if I do not compile with UCX, which is optional, the OSC feature does not work, at least on GPUs.
(Note that I have tried to compile OMPI 5.0 with the latest UCX, and a simple MPI_Put from GPU to GPU causes a deadlock.)

I found out that OMPI 4.0.4 with the latest UCX compiles and runs.
(Note that, for small sizes, i.e., fewer than about 1e6 doubles, we need to set the variable UCX_ZCOPY_THRESH=1.)

The context is the following:
I have a set of GPUs that exchange data mainly using the MPI_Put routine and MPI_Win_fence.
When using pt2pt, I get the expected bandwidth per rank. However, when using UCX, that bandwidth drops.
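
For reference, the pattern being timed is essentially the following. This is a minimal sketch rather than the attached reproducer; the element count, the ring neighbor, and the absence of error checking are illustrative assumptions:

/* Minimal sketch: MPI_Put between GPU buffers inside a fence epoch. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nval = 1000000;                       /* illustrative count of doubles */
    double *d_src, *d_dst;
    cudaMalloc((void **)&d_src, nval * sizeof(double));
    cudaMalloc((void **)&d_dst, nval * sizeof(double));

    /* Expose the GPU destination buffer through an MPI window. */
    MPI_Win win;
    MPI_Win_create(d_dst, (MPI_Aint)nval * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int peer = (rank + 1) % size;                   /* simple ring neighbor */

    MPI_Win_fence(0, win);                          /* open the access epoch */
    MPI_Put(d_src, nval, MPI_DOUBLE, peer, 0, nval, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                          /* close the epoch: data is delivered */

    MPI_Win_free(&win);
    cudaFree(d_src);
    cudaFree(d_dst);
    MPI_Finalize();
    return 0;
}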

I have written a reproducer so that you can give it a try.
This reproducer gives me the following performance on Summit:

$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ^ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx

          Label  nval(x1e+00)  size  Volume(x1MB)  PingPong x20 (ms)  avg_send(ms)  GB/s

   MPI_Alltoall        1000	     8	         0	     10.49	      0.52	      0.17
   MPI_Alltoall       10000	     8	         0	     15.54	      0.78	      1.15
   MPI_Alltoall       50000	     8	         4	     33.96	      1.70	      2.63
   MPI_Alltoall      100000	     8	         9	     72.16	      3.61	      2.48
   MPI_Alltoall      250000	     8	        22	    142.01	      7.10	      3.15
   MPI_Alltoall      500000	     8	        45	    273.70	     13.69	      3.27
   MPI_Alltoall     1000000	     8	        91	    515.96	     25.80	      3.47
   MPI_Alltoall    10000000	     8	       915	   5057.64	    252.88	      3.54

 Coll:MPI_Isend        1000	     8	         0	     11.11	      0.56	      0.16
 Coll:MPI_Isend       10000	     8	         0	     21.60	      1.08	      0.83
 Coll:MPI_Isend       50000	     8	         4	     30.88	      1.54	      2.90
 Coll:MPI_Isend      100000	     8	         9	     66.84	      3.34	      2.68
 Coll:MPI_Isend      250000	     8	        22	    137.92	      6.90	      3.24
 Coll:MPI_Isend      500000	     8	        45	    253.12	     12.66	      3.53
 Coll:MPI_Isend     1000000	     8	        91	    496.56	     24.83	      3.60
 Coll:MPI_Isend    10000000	     8	       915	   4930.47	    246.52	      3.63

   Coll:MPI_Put        1000	     8	         0	      9.06	      0.45	      0.20
   Coll:MPI_Put       10000	     8	         0	      8.14	      0.41	      2.20
   Coll:MPI_Put       50000	     8	         4	     29.46	      1.47	      3.04
   Coll:MPI_Put      100000	     8	         9	     58.10	      2.90	      3.08
   Coll:MPI_Put      250000	     8	        22	    141.57	      7.08	      3.16
   Coll:MPI_Put      500000	     8	        45	    282.01	     14.10	      3.17
   Coll:MPI_Put     1000000	     8	        91	    560.79	     28.04	      3.19
   Coll:MPI_Put    10000000	     8	       915	   5589.39	    279.47	      3.20

Now, when using UCX, I got:

$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_LOG_LEVEL=WARN ./bench_ucx

          Label  nval(x1e+00)  size  Volume(x1MB)  PingPong x20 (ms)  avg_send(ms)  GB/s
   MPI_Alltoall        1000	     8	         0	      9.77	      0.49	      0.18
   MPI_Alltoall       10000	     8	         0	     15.59	      0.78	      1.15
   MPI_Alltoall       50000	     8	         4	     43.95	      2.20	      2.03
   MPI_Alltoall      100000	     8	         9	     79.04	      3.95	      2.26
   MPI_Alltoall      250000	     8	        22	    159.94	      8.00	      2.79
   MPI_Alltoall      500000	     8	        45	    276.29	     13.81	      3.24
   MPI_Alltoall     1000000	     8	        91	    524.07	     26.20	      3.41
   MPI_Alltoall    10000000	     8	       915	   5048.09	    252.40	      3.54

 Coll:MPI_Isend        1000	     8	         0	      8.10	      0.40	      0.22
 Coll:MPI_Isend       10000	     8	         0	     32.06	      1.60	      0.56
 Coll:MPI_Isend       50000	     8	         4	     52.25	      2.61	      1.71
 Coll:MPI_Isend      100000	     8	         9	     59.39	      2.97	      3.01
 Coll:MPI_Isend      250000	     8	        22	    126.40	      6.32	      3.54
 Coll:MPI_Isend      500000	     8	        45	    257.02	     12.85	      3.48
 Coll:MPI_Isend     1000000	     8	        91	    542.05	     27.10	      3.30
 Coll:MPI_Isend    10000000	     8	       915	   4891.06	    244.55	      3.66

   Coll:MPI_Put        1000	     8	         0	      2.26	      0.11	      0.79
   Coll:MPI_Put       10000	     8	         0	      7.09	      0.35	      2.52
   Coll:MPI_Put       50000	     8	         4	     37.25	      1.86	      2.40
   Coll:MPI_Put      100000	     8	         9	     74.29	      3.71	      2.41
   Coll:MPI_Put      250000	     8	        22	    186.02	      9.30	      2.40
   Coll:MPI_Put      500000	     8	        45	    372.22	     18.61	      2.40
   Coll:MPI_Put     1000000	     8	        91	    740.21	     37.01	      2.42
   Coll:MPI_Put    10000000	     8	       915	   7467.59	    373.38	      2.39

From these outputs, you can see in the last column that the bandwidth per rank reaches about 3.5 GB/s when using pt2pt, whereas with UCX it does not exceed 2.5 GB/s.

I have tried many flags, such as -x UCX_TLS=ib,cuda_copy,cuda_ipc and others, but none gave me a bandwidth similar to, or better than, the pt2pt one.

So, if you have any ideas, maybe @bosilca, that would be really great.

Many thanks.


Details of the installation

Here are the flags that I used to install it on Summit:

$ wget https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.4.tar.gz &&
   tar -xzvf openmpi-4.0.4.tar.gz &&
   cd openmpi-4.0.4 &&
   ./configure \
    --prefix=<prefix_path> \
    --enable-picky \
    --enable-visibility \
    --enable-contrib-no-build=vt \
    --enable-mpirun-prefix-by-default \
    --enable-dlopen \
    --enable-mpi1-compatibility \
    --enable-shared \
    --enable-mpirun-prefix-by-default \
    --with-cma \
    --with-hwloc=${HWLOC_ROOT} \
    --with-cuda=${CUDA_ROOT} \
    --with-zlib=/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-7.4.0/zlib-1.2.11-tdykbkiueylpgx2rshpms3k3ncw5g3f6 \
    --with-ucx=${UCX_ROOT} \
    --with-mxm=/opt/mellanox/mxm \
    --with-pmix=internal \
    --with-wrapper-ldflags= \
    --without-lsf \
    --without-psm \
    --without-libfabric \
    --without-verbs \
    --without-psm2 \
    --without-alps \
    --without-sge \
    --without-slurm \
    --without-tm \
    --without-loadleveler \
    --disable-debug \
    --disable-memchecker \
    --disable-oshmem \
    --disable-java \
    --disable-mpi-java \
    --disable-man-pages &&
  make -j 20 &&
  make install

Note that the output of ucx_info is:

$ ucx_info -v
# UCT version=1.10.0 revision bbf159e
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/ccs/home/scayrols/Installs/ucx-git/gcc-7.4.0/gcc-/hwloc-1.11.11/cuda-/gdrcopy- --enable-compiler-opt=3 --enable-optimizations --disable-profiling --disable-frame-pointer --disable-memtrack --disable-debug --disable-debug-data --disable-params-check --disable-backtrace-detail --disable-logging --disable-mt --with-cuda=/sw/summit/cuda/10.1.243 --with-gdrcopy=/sw/summit/gdrcopy/2.0

Reproducer

Note that I had to change the extension from .cu to .LOG in order to attach it.

bench_ucx.LOG


bureddy commented Jul 27, 2020

@cayrols Currently, UCX/CUDA RMA does not have all the CUDA performance protocols and supports only limited cases. It will be improved in future releases.


bosilca commented Jul 27, 2020

Still, this can hardly explain why faking RMA over pt2pt (itself over UCX) gives significantly better results than RMA directly over UCX.


bureddy commented Jul 27, 2020

pt2pt uses the UCX tag API, which has better CUDA protocol selections, including cuda-ipc. With PUT, it depends only on GPUDirect RDMA over IB. Also, on Summit, GPUDirect RDMA performance drops slightly for message sizes > 4 MB.

I quickly ran on Summit: I can downgrade the pt2pt Isend performance to the PUT performance if I disable cuda_ipc and use GPUDirect RDMA put for the UCX tag API (-x UCX_TLS=ib,cuda_copy -x UCX_RNDV_SCHEME=put_zcopy):

$mpirun -n 12 -H a04n16:6,a04n17:6   --mca osc ucx -x LD_LIBRARY_PATH -x UCX_ZCOPY_THRESH=1 -x UCX_TLS=ib,cuda_copy -x UCX_RNDV_SCHEME=put_zcopy -x UCX_LOG_LEVEL=WARN ./bench_ucx

 Coll:MPI_Isend        1000          8           0           23.00            1.15            0.08
 Coll:MPI_Isend       10000          8           0           20.19            1.01            0.89
 Coll:MPI_Isend       50000          8           4           48.56            2.43            1.84
 Coll:MPI_Isend      100000          8           9           81.46            4.07            2.20
 Coll:MPI_Isend      250000          8          22          202.02           10.10            2.21
 Coll:MPI_Isend      500000          8          45          401.14           20.06            2.23
 Coll:MPI_Isend     1000000          8          91          754.47           37.72            2.37
 Coll:MPI_Isend    10000000          8         915         7265.20          363.26            2.46

   Coll:MPI_Put        1000          8           0            2.25            0.11            0.80
   Coll:MPI_Put       10000          8           0          229.63           11.48            0.08
   Coll:MPI_Put       50000          8           4          183.35            9.17            0.49
   Coll:MPI_Put      100000          8           9           99.24            4.96            1.80
   Coll:MPI_Put      250000          8          22          178.76            8.94            2.50
   Coll:MPI_Put      500000          8          45          384.26           19.21            2.33
   Coll:MPI_Put     1000000          8          91          926.76           46.34            1.93
   Coll:MPI_Put    10000000          8         915         7296.07          364.80            2.45


jli111 commented Jul 28, 2020

@bureddy May I kindly ask whether this behavior is specific to PUT or the same for all RMA operations?
Will the GET protocol have better performance for CUDA?


bosilca commented Jul 28, 2020

I guess we have to wait for openucx/ucx#5473 to get merged before seeing better GPU RMA performance.


bureddy commented Jul 28, 2020

@jijo733 Same behavior with GET as well.
@bosilca PR #5473 does not really change the RMA behavior; it is tuning for the tag RNDV protocols.
I will have a look at whether we can improve the intra-node RMA case.


janjust commented Oct 21, 2022

@cayrols Recent versions of UCX have added support for CUDA RMA. It is not enabled by default yet, but can be enabled with the UCX_PROTO_ENABLE=y flag.
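
For example, the flag can be exported through mpirun in the same way as the other UCX variables in this thread; the hosts and binary below are just reused from the earlier runs for illustration:

$ mpirun -n 12 -H c05n05:6,c05n06:6 --mca osc ucx -x LD_LIBRARY_PATH -x UCX_PROTO_ENABLE=y ./bench_ucx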


cayrols commented Oct 21, 2022

That is great news! Thank you for the info. I will definitely try it!
