
Add CUDA support for the OFI MTL #8536


Merged
merged 10 commits into from Mar 9, 2021

Conversation

Contributor

@wckzhang wckzhang commented Feb 26, 2021

This patch series touches the CUDA, OPAL datatype, PML CM, MTL datatype, and OFI MTL code.

The important things added here are:
- A public function to allocate CUDA memory for datapacking of heterogeneous buffers and device-to-device copies (see the sketch below).
- Pointer attribute detection in the CM PML layer.
- Selecting libfabric providers with FI_HMEM and FI_MR_HMEM capabilities.
- Compiling the OFI MTL only when the libfabric version is >= 1.9, if CUDA support is requested.
- Memory registration for CUDA buffers.
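
As a hedged illustration of the first item (not the PR's actual function), allocating a device-side staging buffer and doing a device-to-device copy with the CUDA driver API could look roughly like this; the helper name stage_device_buffer is hypothetical:

```c
#include <cuda.h>
#include <stddef.h>

/* Hypothetical helper: allocate a device staging buffer and copy a device
 * source buffer into it without bouncing through host memory. */
static int stage_device_buffer(CUdeviceptr src, size_t size,
                               CUdeviceptr *staged)
{
    if (cuMemAlloc(staged, size) != CUDA_SUCCESS) {
        return -1;
    }
    if (cuMemcpyDtoD(*staged, src, size) != CUDA_SUCCESS) {
        (void) cuMemFree(*staged);
        return -1;
    }
    return 0;
}
```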

wckzhang and others added 4 commits February 26, 2021 15:08
Because libfabric only supports FI_HMEM hints in versions 1.9.0 and
greater, do not compile the OFI MTL when CUDA support is requested but
the libfabric version is older than that.

Signed-off-by: William Zhang <[email protected]>
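
A hedged sketch of the idea (the PR's actual check happens at configure time, not in the C sources): libfabric's version macros from <rdma/fabric.h> can be used to refuse a CUDA-enabled build against an older libfabric. OPAL_CUDA_SUPPORT is assumed here to be the macro Open MPI defines when CUDA support is requested.

```c
#include <rdma/fabric.h>

#if defined(OPAL_CUDA_SUPPORT) && OPAL_CUDA_SUPPORT
#if FI_VERSION_LT(FI_VERSION(FI_MAJOR_VERSION, FI_MINOR_VERSION), \
                  FI_VERSION(1, 9))
#error "CUDA support in the OFI MTL requires libfabric >= 1.9 (FI_HMEM hints)"
#endif
#endif
```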
Update MTL/OFI structs to store needed data.

Signed-off-by: William Zhang <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Palmer Stolly <[email protected]>
PML/CM recv paths already have CUDA detection: both recv and irecv call
opal_convertor_copy_and_prepare_for_recv, which performs the detection.

This patch and a subsequent MTL datapack patch address an error when using
heterogeneous device send buffers (the code mallocs a temporary send
buffer and, because CONVERTOR_CUDA is never set, the subsequent
opal_cuda_memcpy attempts a memcpy from the device buffer to the host
buffer, which causes an error). Preparing the send buffers and setting
CONVERTOR_CUDA allows the correct cuMemcpy to be used.

Signed-off-by: William Zhang <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Palmer Stolly <[email protected]>
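
A hedged sketch (not the PR's code) of the kind of pointer-attribute detection that decides whether a send buffer is device memory, so the CONVERTOR_CUDA flag can be set before packing:

```c
#include <cuda.h>
#include <stdint.h>

/* Returns 1 if buf is CUDA device memory, 0 otherwise. Pointers the CUDA
 * driver does not know about return an error and are treated as host. */
static int buffer_is_device_memory(const void *buf)
{
    CUmemorytype mem_type = (CUmemorytype) 0;
    CUresult rc = cuPointerGetAttribute(&mem_type,
                                        CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                        (CUdeviceptr)(uintptr_t) buf);
    return (CUDA_SUCCESS == rc && CU_MEMORYTYPE_DEVICE == mem_type);
}
```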
@@ -111,6 +111,59 @@ bool opal_cuda_check_one_buf(char *buf, opal_convertor_t *convertor )
    return ( ftable.gpu_is_gpu_buffer(buf, convertor));
}

/*
Member

The OPAL datatype packing and unpacking is too low level to provide support for memory allocation. I understand this seems to be a popular place to put CUDA functions, but I would suggest these functions move to opal/mca/common/cuda.

@@ -18,12 +18,16 @@ struct opal_common_cuda_function_table {
    int (*gpu_cu_memcpy_async)(void*, const void*, size_t, opal_convertor_t*);
    int (*gpu_cu_memcpy)(void*, const void*, size_t);
    int (*gpu_memmove)(void*, void*, size_t);
    int (*gpu_malloc)(void*, size_t);
    int (*gpu_free)(void*);
Member

Technically I think it makes much more sense to move the entire CUDA structure outside of the datatype (especially now that it starts having additional capabilities).

Contributor Author

Hmm, thanks for the feedback, I'll see what I can do about it.
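
For illustration only (not the PR's implementation, which ultimately lives with the rest of the common CUDA code): the new gpu_malloc / gpu_free entries shown in the diff above could be backed by the CUDA driver API roughly like this. Note the size_t byte count and the double-pointer output, both of which come up later in this PR.

```c
#include <cuda.h>
#include <stddef.h>
#include <stdint.h>

/* Matches the table signature int (*gpu_malloc)(void*, size_t): the first
 * argument is really a void** holding where to store the device pointer. */
static int mca_cuda_malloc(void *buffer, size_t size)
{
    CUdeviceptr dptr = 0;
    if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS) {
        return -1;   /* OPAL_ERROR in the real code base */
    }
    *(void **) buffer = (void *)(uintptr_t) dptr;
    return 0;        /* OPAL_SUCCESS */
}

static int mca_cuda_free(void *buffer)
{
    return (cuMemFree((CUdeviceptr)(uintptr_t) buffer) == CUDA_SUCCESS) ? 0 : -1;
}
```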

@gpaulsen gpaulsen added this to the v5.0.0 milestone Mar 2, 2021
wckzhang added 2 commits March 4, 2021 00:47
…LOCAL support

Add a check to see if Libfabric has at least one provider with FI_HMEM
support, and use this to record whether Libfabric has CUDA support.

Add provider hints for FI_MR_LOCAL, and if Libfabric has CUDA support,
also add hints for FI_HMEM and FI_MR_HMEM.

In the case where Open MPI is built with CUDA support but Libfabric is
not, the MTL/OFI is not picked.

Signed-off-by: William Zhang <[email protected]>
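
A hedged sketch of the provider selection described above (not the MTL's actual code; query_hmem_provider is a hypothetical helper): ask libfabric only for providers that advertise FI_HMEM, and set the FI_MR_LOCAL / FI_MR_HMEM registration modes in the hints.

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

static int query_hmem_provider(void)
{
    struct fi_info *hints, *providers = NULL;
    int ret;

    hints = fi_allocinfo();
    if (NULL == hints) {
        return -1;
    }

    /* Only providers that can handle device (e.g. CUDA) memory. */
    hints->caps = FI_MSG | FI_TAGGED | FI_HMEM;
    /* HMEM providers generally require explicit memory registration. */
    hints->domain_attr->mr_mode = FI_MR_LOCAL | FI_MR_HMEM | FI_MR_ALLOCATED;

    ret = fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &providers);
    if (0 == ret && NULL != providers) {
        printf("FI_HMEM-capable provider: %s\n",
               providers->fabric_attr->prov_name);
        fi_freeinfo(providers);
    }

    fi_freeinfo(hints);
    return ret;
}
```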
Change the unsigned int to a size_t for cuMemAlloc

Signed-off-by: William Zhang <[email protected]>
Member

gpaulsen commented Mar 4, 2021

@rajachan, you mentioned on Tuesday that AWS is doing some additional evaluation on this PR, is that correct?

Member

@hppritcha hppritcha left a comment

I approve but agree with George's comments.

wckzhang and others added 3 commits March 5, 2021 00:31
…o Libfabric

Adds functions for memory registration and deregistration.

Modifies MTL/OFI to use those functions to register memory, send the
registered buffer's memory descriptor as a parameter to the Libfabric API,
and deregister the memory after the request is finished.

This patch, alongside an earlier commit, fixes an issue with using
heterogeneous device send buffers in the MTLs.

Signed-off-by: William Zhang <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Palmer Stolly <[email protected]>
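
A hedged sketch of the flow this commit describes (not the MTL's code; the domain/endpoint setup and the deferred deregistration on request completion are assumed to exist elsewhere): register the CUDA buffer, pass its descriptor to the libfabric send call, and close the registration afterwards.

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_tagged.h>

static ssize_t send_cuda_buffer(struct fid_domain *domain, struct fid_ep *ep,
                                void *cuda_buf, size_t len, int cuda_device,
                                fi_addr_t dest, uint64_t tag)
{
    struct fid_mr *mr = NULL;
    struct iovec iov = { .iov_base = cuda_buf, .iov_len = len };
    struct fi_mr_attr attr = { 0 };
    ssize_t ret;

    attr.mr_iov      = &iov;
    attr.iov_count   = 1;
    attr.access      = FI_SEND | FI_RECV;
    attr.iface       = FI_HMEM_CUDA;   /* buffer lives in CUDA device memory */
    attr.device.cuda = cuda_device;

    ret = fi_mr_regattr(domain, &attr, 0, &mr);
    if (0 != ret) {
        return ret;
    }

    /* The registration descriptor is passed to the data-transfer call. */
    ret = fi_tsend(ep, cuda_buf, len, fi_mr_desc(mr), dest, tag, NULL);

    /* The real code deregisters once the request completes; this sketch
     * closes the MR immediately for brevity. */
    (void) fi_close(&mr->fid);
    return ret;
}
```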
Contributor Author

wckzhang commented Mar 5, 2021

Updated. There was a minor bug in the malloc code which I fixed (it didn't use a double pointer), and I basically moved all of the opal_datatype_cuda.c and opal_datatype_cuda.h code into the corresponding common_cuda files.


ibm-ompi commented Mar 5, 2021

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/8423dc40cc266d16a37d32334aa7c6b9

Contributor Author

wckzhang commented Mar 5, 2021

Looks like there's something I'm missing with compilation. I didn't hit this error during personal testing; not sure what's different in the CI setup compared to the manual build.

make[2]: *** No rule to make target 'mca/common/cuda/libmca_common_cuda.la', needed by 'libopen-pal.la'.  Stop.

Member

gpaulsen commented Mar 5, 2021

@wckzhang did you figure it out? I was going to try a local ppc64le build if you're stuck.


ibm-ompi commented Mar 5, 2021

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/26475e9bedc1d8ff0754ee6cbe4ac127

Contributor Author

wckzhang commented Mar 6, 2021

I fixed the first error; it looks like I have another bug with cuda.h now.

Contributor Author

wckzhang commented Mar 6, 2021

Oh yeah, I'm pretty sure I know what's up. I only tested with CUDA compiled in; I will add checks so that code is only used when compiled with CUDA.

Contributor Author

wckzhang commented Mar 6, 2021

A bit of a dumb solution, but it compiled properly both with and without CUDA.
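
For context, a hedged illustration of the fix's shape (not the actual diff): guard the CUDA-only calls so the same source compiles whether or not CUDA support was built. OPAL_CUDA_SUPPORT is assumed to be the configure-time macro; register_cuda_memory is a hypothetical helper.

```c
#include <stddef.h>

#if OPAL_CUDA_SUPPORT
int register_cuda_memory(void *buf, size_t len);   /* hypothetical helper */
#endif

static int maybe_register_buffer(void *buf, size_t len)
{
#if OPAL_CUDA_SUPPORT
    return register_cuda_memory(buf, len);
#else
    (void) buf;
    (void) len;
    return 0;   /* nothing to do in a non-CUDA build */
#endif
}
```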


ibm-ompi commented Mar 8, 2021

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ff4dbb87c599b309a70414807dccd5c0

Contributor Author

wckzhang commented Mar 8, 2021

Hmm, I don't see what the error is this time in the output. Are the files truncated?

Member

jjhursey commented Mar 8, 2021

It looks like it hit a 20-minute timeout on the make stage. Let me adjust that, then I'll ask for a retest.

Member

jjhursey commented Mar 8, 2021

bot:ibm:retest

@open-mpi open-mpi deleted 9 comments from ibm-ompi Mar 8, 2021
@wckzhang wckzhang requested review from bosilca and hppritcha March 8, 2021 16:19
@hppritcha
Member

Unfortunately this PR breaks a bunch of IBM tests which involve derived, non-contiguous datatypes when using the OFI MTL.

MCA_PML_CM_SWITCH_CUDA_CONVERTOR_OFF(flags, datatype, count); \
(req_send)->req_base.req_convertor.flags |= flags; \
/* Sets CONVERTOR_CUDA flag if CUDA buffer */ \
opal_convertor_prepare_for_send( \
Member

This broke use of the OFI MTL when not supporting CUDA.

Contributor Author

Can you describe what the exact test matrix was, i.e., OMPI compiled with CUDA, using a provider that supports/doesn't support FI_HMEM, using host buffers?

Contributor Author

Is this the same issue as your previous comment? This segment seems to only affect contiguous memory; does removing this section of code fix your non-contiguous use case?

Member


I was wrong about non-contiguous in the comment; it is for contiguous. Looking at the names of the IBM tests which fail, it has to do with cases where the buffer is contiguous but there's a "gap" at the start of the datatype.

Member


Please read my comment on #8906 about this code.

hppritcha added a commit to hppritcha/ompi that referenced this pull request Apr 30, 2021
PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Apr 30, 2021
PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request May 3, 2021
PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request May 4, 2021
PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit 9e99182)
7 participants