Add CUDA support for the OFI MTL #8536

Conversation
Because libfabric only supports FI_HMEM hints in versions 1.9.0 and greater, do not compile the OFI MTL when CUDA support is requested but the libfabric version is older. Signed-off-by: William Zhang <[email protected]>
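For illustration only, one way such a guard could be expressed directly against the libfabric headers (the PR makes this decision at configure time, so this sketch and the macro name MTL_OFI_HAVE_FI_HMEM are assumptions, not the PR's code):

```c
#include <rdma/fabric.h>

/* FI_HMEM first appeared in libfabric 1.9.0, so anything older cannot
 * express the hints needed for CUDA buffers. */
#if FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION) >= FI_VERSION(1, 9)
#define MTL_OFI_HAVE_FI_HMEM 1
#else
#define MTL_OFI_HAVE_FI_HMEM 0
#endif
```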
Update MTL/OFI structs to store needed data. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
The PML/CM recv paths already have CUDA detection: both recv and irecv call opal_convertor_copy_and_prepare_for_recv, which performs the detection. This patch and a subsequent MTL datapack patch address an error when using heterogeneous device send buffers (the code mallocs a temporary send buffer, and because CONVERTOR_CUDA is never set, the subsequent opal_cuda_memcpy attempts a memcpy from a device buffer to a host buffer, which fails). Preparing the send buffers and setting CONVERTOR_CUDA allows the correct cuMemcpy to be used. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
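For context, this kind of detection amounts to a pointer-attribute query against the CUDA driver API; a minimal sketch of that idea (the helper name is illustrative, not Open MPI's actual function):

```c
#include <cuda.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the "is this a device buffer?" check:
 * cuPointerGetAttribute() fails for plain host pointers, so success
 * plus a DEVICE memory type means the buffer lives on the GPU. */
static bool buffer_is_on_device(const void *buf)
{
    unsigned int memtype = 0;
    CUresult rc = cuPointerGetAttribute(&memtype,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        (CUdeviceptr)(uintptr_t)buf);
    return rc == CUDA_SUCCESS && memtype == CU_MEMORYTYPE_DEVICE;
}
```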
Signed-off-by: William Zhang <[email protected]>
opal/datatype/opal_datatype_cuda.c
@@ -111,6 +111,59 @@ bool opal_cuda_check_one_buf(char *buf, opal_convertor_t *convertor )
    return ( ftable.gpu_is_gpu_buffer(buf, convertor));
}

/*
The OPAL datatype packing and unpacking is too low level to provide support for memory allocation. I understand this seems to be a popular place to put CUDA functions, but I would suggest these functions move to opal/mca/common/cuda.
opal/datatype/opal_datatype_cuda.h
@@ -18,12 +18,16 @@ struct opal_common_cuda_function_table {
    int (*gpu_cu_memcpy_async)(void*, const void*, size_t, opal_convertor_t*);
    int (*gpu_cu_memcpy)(void*, const void*, size_t);
    int (*gpu_memmove)(void*, void*, size_t);
    int (*gpu_malloc)(void*, size_t);
    int (*gpu_free)(void*);
Technically I think it makes much more sense to move the entire CUDA structure outside of the datatype (especially now that it starts having additional capabilities).
Hmm, thanks for the feedback, I'll see what I can do about it.
…LOCAL support Add a check for whether Libfabric has at least one provider with FI_HMEM support, and use that to determine whether Libfabric has CUDA support. Add provider hints for FI_MR_LOCAL, and, if Libfabric has CUDA support, also add hints for FI_HMEM and FI_MR_HMEM. In the case where Open MPI is built with CUDA support but Libfabric is not, the OFI MTL is not selected. Signed-off-by: William Zhang <[email protected]>
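A rough, standalone sketch of what those hints look like against the public libfabric API (not the PR's code; error handling trimmed):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <stdio.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo(), *providers = NULL;

    /* Ask for tagged messaging plus device-memory (FI_HMEM) capability,
     * and advertise that local and device buffers will be registered. */
    hints->caps                 = FI_MSG | FI_TAGGED | FI_HMEM;
    hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_HMEM;

    /* FI_HMEM requires at least API version 1.9. */
    if (fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &providers) == 0 &&
        providers != NULL) {
        printf("FI_HMEM-capable provider found: %s\n",
               providers->fabric_attr->prov_name);
        fi_freeinfo(providers);
    }
    fi_freeinfo(hints);
    return 0;
}
```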
Change the unsigned int to a size_t for cuMemAlloc. Signed-off-by: William Zhang <[email protected]>
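For reference, cuMemAlloc takes the size as a size_t and returns the device pointer through an out parameter, so a wrapper has to accept a double pointer; a minimal sketch with illustrative names:

```c
#include <cuda.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative wrappers matching
 * CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize);
 * the caller passes the address of its pointer (a double pointer). */
static int gpu_malloc_sketch(void **buf, size_t size)
{
    CUdeviceptr dptr = 0;
    if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS) {
        return -1;
    }
    *buf = (void *)(uintptr_t)dptr;
    return 0;
}

static int gpu_free_sketch(void *buf)
{
    return cuMemFree((CUdeviceptr)(uintptr_t)buf) == CUDA_SUCCESS ? 0 : -1;
}
```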
@rajachan, you mentioned on Tuesday that AWS is doing some additional evaluation on this PR, is that correct?
I approve but agree with George's comments.
Signed-off-by: William Zhang <[email protected]>
…o Libfabric Adds functions for memory registration and deregistration. Modifies the OFI MTL to use those functions to register memory, pass the registered buffer's memory descriptor as a parameter to the Libfabric API, and deregister the memory after the request finishes. This patch, alongside an earlier commit, fixes an issue with using heterogeneous device send buffers in the MTLs. Signed-off-by: William Zhang <[email protected]> Co-authored-by: William Zhang <[email protected]> Co-authored-by: Palmer Stolly <[email protected]>
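A minimal sketch of that register/send/deregister flow against the public libfabric API, assuming an already-open domain and endpoint (names are illustrative, and deregistration is shown inline for brevity whereas real code defers it until the request completes):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

/* Register a (possibly device) buffer, hand its descriptor to the
 * transfer call, then release the registration. */
static ssize_t send_registered(struct fid_domain *domain, struct fid_ep *ep,
                               const void *buf, size_t len,
                               fi_addr_t dest, uint64_t tag, void *context)
{
    struct fid_mr *mr = NULL;
    if (fi_mr_reg(domain, buf, len, FI_SEND, 0, 0, 0, &mr, NULL) != 0) {
        return -1;
    }

    ssize_t rc = fi_tsend(ep, buf, len, fi_mr_desc(mr), dest, tag, context);

    /* Real code must keep the registration alive until the send
     * completes; deregistering here only keeps the sketch short. */
    fi_close(&mr->fid);
    return rc;
}
```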
Signed-off-by: William Zhang <[email protected]>
Updated. There was a minor bug in the malloc code which I fixed (it didn't use a double pointer), and I basically just moved all of the opal_datatype_cuda.c and opal_datatype_cuda.h code into the corresponding common_cuda files.
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/8423dc40cc266d16a37d32334aa7c6b9
Looks like there's something I'm missing with compilation. Didn't hit this error during personal testing, not sure what's different in the CI setup compared to the manual build.
@wckzhang did you figure it out? I was going to try a local ppc64le build if you're stuck.
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/26475e9bedc1d8ff0754ee6cbe4ac127
I fixed the first error, but it looks like there's another bug with cuda.h now.
Oh yeah, I'm pretty sure I know what's up: I only tested with CUDA compiled in. I'll add checks so the code is only built when compiled with CUDA.
Bit of a dumb solution but it compiled properly both with and without cuda. |
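Presumably that means guarding the CUDA-specific code behind the build-time flag so cuda.h is only pulled in when configured with CUDA; a minimal sketch of the pattern, assuming Open MPI's OPAL_CUDA_SUPPORT define (the helper name is illustrative):

```c
#include <stdint.h>
#include <string.h>

#if OPAL_CUDA_SUPPORT
#include <cuda.h>
#endif

/* Copy via cuMemcpy only when built with CUDA and the buffer is on a
 * device; otherwise fall back to a plain memcpy. */
static void copy_buffer(void *dst, const void *src, size_t len, int on_device)
{
#if OPAL_CUDA_SUPPORT
    if (on_device) {
        /* With unified addressing, cuMemcpy handles host and device
         * pointers in either direction. */
        cuMemcpy((CUdeviceptr)(uintptr_t)dst,
                 (CUdeviceptr)(uintptr_t)src, len);
        return;
    }
#endif
    (void)on_device;
    memcpy(dst, src, len);
}
```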
Signed-off-by: William Zhang <[email protected]>
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/ff4dbb87c599b309a70414807dccd5c0
Hmm, I don't see what the error is this time in the output. Are the files truncated?
It looks like it hit a 20-minute timeout on the make stage. Let me adjust that, then I'll ask for a retest.
bot:ibm:retest
Unfortunately this PR breaks a bunch of IBM tests which involve derived, non-contiguous datatypes when using the OFI MTL.
    MCA_PML_CM_SWITCH_CUDA_CONVERTOR_OFF(flags, datatype, count);  \
    (req_send)->req_base.req_convertor.flags |= flags;             \
    /* Sets CONVERTOR_CUDA flag if CUDA buffer */                   \
    opal_convertor_prepare_for_send(                                 \
This broke use of the OFI MTL when not supporting CUDA.
Can you describe what the exact test matrix was, i.e. OMPI compiled w/ CUDA, using a provider that supports/doesn't support FI_HMEM, using host buffers?
Is this the same issue as your previous comment? This segment seems to only affect contiguous memory; does removing this section of code fix your non-contiguous use case?
I was wrong about non-contiguous in the comment; it is for contiguous. Looking at the names of the IBM tests which fail, it has to do with cases where the buffer is contiguous but there's a "gap" at the start of the datatype.
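To make that shape concrete, here is a minimal example of such a datatype (contiguous data, but with a gap before the first element); it only illustrates the failing case and is not one of the IBM tests:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One contiguous block of 4 doubles displaced by 2 elements: the
     * data itself is contiguous, yet the datatype starts with a gap. */
    int blocklen = 4;
    int disp     = 2;
    MPI_Datatype gap_type;
    MPI_Type_indexed(1, &blocklen, &disp, MPI_DOUBLE, &gap_type);
    MPI_Type_commit(&gap_type);

    double buf[8] = {0};
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(buf, 1, gap_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1, gap_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Type_free(&gap_type);
    MPI_Finalize();
    return 0;
}
```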
Please read my comment on #8906 about this code.
PR open-mpi#8536 introduced a regression in non-CUDA environments when an application is using derived, but contiguous, datatypes. Related to open-mpi#8905. Signed-off-by: Howard Pritchard <[email protected]>
PR open-mpi#8536 introduced a regression in non-CUDA environments when an application is using derived, but contiguous, datatypes. Related to open-mpi#8905. Signed-off-by: Howard Pritchard <[email protected]> (cherry picked from commit 9e99182)
This patch series touches the CUDA, opal datatype, PML CM, MTL datatype, and OFI MTL code.
The important things added here are:
A public function to allocate CUDA memory for datapacking of heterogeneous buffers and device-to-device copies.
Pointer attribute detection in the CM PML layer.
Selecting libfabric providers with FI_HMEM and FI_MR_HMEM capabilities.
Only compiling the OFI MTL when CUDA support is requested if the libfabric version is >= 1.9.
Memory registration for CUDA buffers.