Add CUDA support for the OFI MTL #8536

Conversation
Because libfabric only supports FI_HMEM hints in versions 1.9.0 and greater, do not compile the OFI MTL when CUDA support is requested but the libfabric version is older. Signed-off-by: William Zhang <[email protected]>
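For illustration only, one way such a guard could be expressed directly against the libfabric headers (the PR makes this decision at configure time, so this sketch and the macro name MTL_OFI_HAVE_FI_HMEM are assumptions, not the PR's code):

```c
#include <rdma/fabric.h>

/* FI_HMEM first appeared in libfabric 1.9.0, so anything older cannot
 * express the hints needed for CUDA buffers. */
#if FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION) >= FI_VERSION(1, 9)
#define MTL_OFI_HAVE_FI_HMEM 1
#else
#define MTL_OFI_HAVE_FI_HMEM 0
#endif
```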
Update MTL/OFI structs to store needed data. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
The PML/CM recv paths already have CUDA detection: both recv and irecv call opal_convertor_copy_and_prepare_for_recv, which performs the detection. This patch and a subsequent MTL datapack patch address an error when using heterogeneous device send buffers (the code mallocs a temporary send buffer, and because CONVERTOR_CUDA is never set, the subsequent opal_cuda_memcpy attempts a memcpy from a device buffer to a host buffer, which fails). Preparing the send buffers and setting CONVERTOR_CUDA allows the correct cuMemcpy to be used. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
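For context, this kind of detection amounts to a pointer-attribute query against the CUDA driver API; a minimal sketch of that idea (the helper name is illustrative, not Open MPI's actual function):

```c
#include <cuda.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the "is this a device buffer?" check:
 * cuPointerGetAttribute() fails for plain host pointers, so success
 * plus a DEVICE memory type means the buffer lives on the GPU. */
static bool buffer_is_on_device(const void *buf)
{
    unsigned int memtype = 0;
    CUresult rc = cuPointerGetAttribute(&memtype,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        (CUdeviceptr)(uintptr_t)buf);
    return rc == CUDA_SUCCESS && memtype == CU_MEMORYTYPE_DEVICE;
}
```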
Signed-off-by: William Zhang <[email protected]>
opal/datatype/opal_datatype_cuda.c
@@ -111,6 +111,59 @@ bool opal_cuda_check_one_buf(char *buf, opal_convertor_t *convertor )
    return ( ftable.gpu_is_gpu_buffer(buf, convertor));
}

/*
The OPAL datatype packing and unpacking is too low level to provide support for memory allocation. I understand this seems to be a popular place to put CUDA functions, but I would suggest these functions move to opal/mca/common/cuda.
opal/datatype/opal_datatype_cuda.h
@@ -18,12 +18,16 @@ struct opal_common_cuda_function_table {
    int (*gpu_cu_memcpy_async)(void*, const void*, size_t, opal_convertor_t*);
    int (*gpu_cu_memcpy)(void*, const void*, size_t);
    int (*gpu_memmove)(void*, void*, size_t);
    int (*gpu_malloc)(void*, size_t);
    int (*gpu_free)(void*);
Technically I think it makes much more sense to move the entire CUDA structure outside of the datatype (especially now that it starts having additional capabilities).
Hmm, thanks for the feedback, I'll see what I can do about it.
…LOCAL support Add a check for whether Libfabric has at least one provider with FI_HMEM support, and use that to determine whether Libfabric has CUDA support. Add provider hints for FI_MR_LOCAL, and, if Libfabric has CUDA support, also add hints for FI_HMEM and FI_MR_HMEM. In the case where Open MPI is built with CUDA support but Libfabric is not, the OFI MTL is not selected. Signed-off-by: William Zhang <[email protected]>
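A rough, standalone sketch of what those hints look like against the public libfabric API (not the PR's code; error handling trimmed):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <stdio.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *providers = NULL;

    /* Ask for tagged messaging plus device-memory (FI_HMEM) capability,
     * and advertise that local and device buffers will be registered. */
    hints->caps                 = FI_MSG | FI_TAGGED | FI_HMEM;
    hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_HMEM;

    /* FI_HMEM requires at least API version 1.9. */
    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &providers) == 0 &&
        providers != NULL) {
        printf("FI_HMEM-capable provider found: %s\n",
               providers->fabric_attr->prov_name);
        fi_freeinfo(providers);
    }
    fi_freeinfo(hints);
    return 0;
}
```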
Change the unsigned int to a size_t for cuMemAlloc. Signed-off-by: William Zhang <[email protected]>
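For reference, cuMemAlloc takes the size as a size_t and returns the device pointer through an out parameter, so a wrapper has to accept a double pointer; a minimal sketch with illustrative names:

```c
#include <cuda.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative wrappers matching
 * CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
 * the caller passes the address of its pointer (a double pointer). */
static int gpu_malloc_sketch(void **buf, size_t size)
{
    CUdeviceptr dptr = 0;
    if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS) {
        return -1;
    }
    *buf = (void *)(uintptr_t)dptr;
    return 0;
}

static int gpu_free_sketch(void *buf)
{
    return cuMemFree((CUdeviceptr)(uintptr_t)buf) == CUDA_SUCCESS ? 0 : -1;
}
```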
@rajachan, you mentioned on Tuesday that AWS is doing some additional evaluation on this PR, is that correct?
I approve but agree with George's comments.
Signed-off-by: William Zhang <[email protected]>
…o Libfabric Adds functions for memory registration and deregistration. Modifies the OFI MTL to use those functions to register memory, pass the registered buffer's memory descriptor as a parameter to the Libfabric API, and deregister the memory after the request finishes. This patch, alongside an earlier commit, fixes an issue with using heterogeneous device send buffers in the MTLs. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
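A minimal sketch of that register/send/deregister flow against the public libfabric API, assuming an already-open domain and endpoint (names are illustrative, and deregistration is shown inline for brevity whereas real code defers it until the request completes):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

/* Register a (possibly device) buffer, hand its descriptor to the
 * transfer call, then release the registration. */
static ssize_t send_registered(struct fid_domain *domain, struct fid_ep *ep,
                               const void *buf, size_t len,
                               fi_addr_t dest, uint64_t tag, void *context)
{
    struct fid_mr *mr = NULL;
    if (fi_mr_reg(domain, buf, len, FI_SEND, 0, 0, 0, &mr, NULL) != 0) {
        return -1;
    }

    ssize_t rc = fi_tsend(ep, buf, len, fi_mr_desc(mr), dest, tag, context);

    /* Real code must keep the registration alive until the send
     * completes; deregistering here only keeps the sketch short. */
    fi_close(&mr->fid);
    return rc;
}
```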
Signed-off-by: William Zhang <[email protected]>
Updated. There was a minor bug in the malloc code which I fixed (it didn't use a double pointer), and I basically just moved all of the opal_datatype_cuda.c and opal_datatype_cuda.h code into the corresponding common_cuda files.
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/8423dc40cc266d16a37d32334aa7c6b9
Looks like there's something I'm missing with compilation. Didn't hit this error during personal testing, not sure what's different in the CI setup compared to the manual build.
@wckzhang did you figure it out? I was going to try a local ppc64le build if you're stuck.
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/26475e9bedc1d8ff0754ee6cbe4ac127
I fixed the first error, but it looks like there's another bug with cuda.h now.
Oh yeah, I'm pretty sure I know what's up: I only tested with CUDA compiled in. I'll add checks so the code is only built when compiled with CUDA.
Bit of a dumb solution but it compiled properly both with and without cuda. |
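Presumably that means guarding the CUDA-specific code behind the build-time flag so cuda.h is only pulled in when configured with CUDA; a minimal sketch of the pattern, assuming Open MPI's OPAL_CUDA_SUPPORT define (the helper name is illustrative):

```c
#include <stdint.h>
#include <string.h>

#if OPAL_CUDA_SUPPORT
#include <cuda.h>
#endif

/* Copy via cuMemcpy only when built with CUDA and the buffer is on a
 * device; otherwise fall back to a plain memcpy. */
static void copy_buffer(void *dst, const void *src, size_t len, int on_device)
{
#if OPAL_CUDA_SUPPORT
    if (on_device) {
        /* With unified addressing, cuMemcpy handles host and device
         * pointers in either direction. */
        cuMemcpy((CUdeviceptr)(uintptr_t)dst,
                 (CUdeviceptr)(uintptr_t)src, len);
        return;
    }
#endif
    (void)on_device;
    memcpy(dst, src, len);
}
```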
Signed-off-by: William Zhang <[email protected]>
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/ff4dbb87c599b309a70414807dccd5c0
Hmm, I don't see what the error is this time in the output. Are the files truncated?
It looks like it hit a 20-minute timeout on the make stage. Let me adjust that, then I'll ask for a retest.
bot:ibm:retest
Unfortunately this PR breaks a bunch of IBM tests which involve derived, non-contiguous datatypes when using the OFI MTL.
    MCA_PML_CM_SWITCH_CUDA_CONVERTOR_OFF(flags, datatype, count);  \
    (req_send)->req_base.req_convertor.flags |= flags;             \
    /* Sets CONVERTOR_CUDA flag if CUDA buffer */                   \
    opal_convertor_prepare_for_send(                                 \
This broke use of the OFI MTL when not supporting CUDA.
Can you describe what the exact test matrix was, i.e. OMPI compiled w/ CUDA, using a provider that supports/doesn't support FI_HMEM, using host buffers?
Is this the same issue as your previous comment? This segment seems to only affect contiguous memory; does removing this section of code fix your non-contiguous use case?
I was wrong about non-contiguous in the comment; it is for contiguous. Looking at the names of the IBM tests which fail, it has to do with cases where the buffer is contiguous but there's a "gap" at the start of the datatype.
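To make that shape concrete, here is a minimal example of such a datatype (contiguous data, but with a gap before the first element); it only illustrates the failing case and is not one of the IBM tests:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One contiguous block of 4 doubles displaced by 2 elements: the
     * data itself is contiguous, yet the datatype starts with a gap. */
    int blocklen = 4;
    int disp     = 2;
    MPI_Datatype gap_type;
    MPI_Type_indexed(1, &blocklen, &disp, MPI_DOUBLE, &gap_type);
    MPI_Type_commit(&gap_type);

    double buf[8] = {0};
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(buf, 1, gap_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, gap_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&gap_type);
    MPI_Finalize();
    return 0;
}
```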
Please read my comment on #8906 about this code.
PR open-mpi#8536 introduced a regression in non-CUDA environments when an application is using derived, but contiguous, datatypes. Related to open-mpi#8905. Signed-off-by: Howard Pritchard <[email protected]>
PR open-mpi#8536 introduced a regression in non-CUDA environments when an application is using derived, but contiguous, datatypes. Related to open-mpi#8905. Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 9e99182)
This patch series touches the CUDA, opal datatype, PML CM, MTL datatype, and OFI MTL code.
The important things added here are:
A public function to allocate CUDA memory for datapacking of heterogeneous buffers and device-to-device copies.
Pointer attribute detection in the CM PML layer.
Selecting libfabric providers with FI_HMEM and FI_MR_HMEM capabilities.
Only compiling the OFI MTL when CUDA support is requested if the libfabric version is >= 1.9.
Memory registration for CUDA buffers.