UCX: initialize cuda from ucx pml component #7893
@Akshay-Venkatesh @yosefe @jladd-mlnx please review
bot:ompi:retest
@bureddy did a user report this issue?
@hppritcha this issue was reported by our internal application team. It is easily reproducible with any of the OSU CUDA tests with options ( I just had a discussion with @yosefe. I think a better solution would be to move this code to some common Open MPI init code instead of doing it from different BTLs (tcp, smcuda, openib) and PMLs (ucx). Any suggestions?
Requesting changes so we don't merge until ready based on comments.
Is there a place independent of PMLs and BTLs where CUDA initialization can be moved to? Ultimately, the function pointer table in CUDA initialization needs to be set up to (1) detect whether buffers passed to MPI are CUDA buffers and (2) issue copy operations for pack/unpack. Until now this logic was invoked from the smcuda, openib, and tcp BTLs (not sure why). Shouldn't this logic be triggered in the datatype component whenever CUDA support is requested in Open MPI?
mca_common_cuda_stage_one_init correctly handles the case where it is called multiple times, as long as there is the same number of calls to mca_common_cuda_fini. You should be able to call it from the UCX PML init function.
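The balanced init/fini behaviour described in this comment can be illustrated with a minimal reference-counting sketch. It is not the Open MPI implementation: the names demo_cuda_init, demo_cuda_fini and demo_cuda_is_ready are hypothetical, and the real one-time setup is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared CUDA support layer; not Open MPI code. */
static int  demo_cuda_refcount = 0;     /* number of outstanding init calls */
static bool demo_cuda_ready    = false; /* true while the shared state is set up */

/* Safe to call from any number of components; only the first call does work. */
int demo_cuda_init(void)
{
    if (demo_cuda_refcount++ == 0) {
        /* One-time setup would go here, e.g. filling a function pointer table. */
        demo_cuda_ready = true;
    }
    return 0;
}

/* Must be called once per demo_cuda_init(); only the last call tears down. */
void demo_cuda_fini(void)
{
    assert(demo_cuda_refcount > 0);
    if (--demo_cuda_refcount == 0) {
        demo_cuda_ready = false;
    }
}

/* Reports whether the shared state is currently initialized. */
bool demo_cuda_is_ready(void)
{
    return demo_cuda_ready;
}
```

With this pattern, a BTL and a PML can each call the init function from their own open path without coordinating, provided each one also calls the fini function exactly once on close.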
bot:ompi:retest
@yosefe can you please review?
ompi/mca/pml/ucx/pml_ucx.c
@@ -230,22 +233,37 @@ int mca_pml_ucx_open(void)

    /* Query UCX attributes */
    attr.field_mask = UCP_ATTR_FIELD_REQUEST_SIZE;
#if HAVE_UCP_ATTR_MEMORY_TYPES && OPAL_CUDA_SUPPORT
No need for OPAL_CUDA_SUPPORT here; use #if HAVE_UCP_ATTR_MEMORY_TYPES only.
fixed.
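A minimal sketch of the guard discussed in this thread, assuming the feature macro alone decides whether the memory-types attribute is requested and inspected. Everything prefixed with DEMO_ or demo_ is a hypothetical stand-in so the example compiles without the UCX headers; in a real build the macro would come from configure and the types from UCX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the configure macro and the attribute flags. */
#define DEMO_HAVE_ATTR_MEMORY_TYPES  1
#define DEMO_ATTR_FIELD_REQUEST_SIZE (UINT64_C(1) << 0)
#define DEMO_ATTR_FIELD_MEMORY_TYPES (UINT64_C(1) << 1)
#define DEMO_MEMORY_TYPE_CUDA        1

/* Hypothetical attribute struct: which fields are valid, and a memory-type bitmap. */
typedef struct {
    uint64_t field_mask;
    uint64_t memory_types;
} demo_attr_t;

/* Build the query mask; the memory-types field is requested only if available. */
uint64_t demo_query_mask(void)
{
    uint64_t mask = DEMO_ATTR_FIELD_REQUEST_SIZE;
#if DEMO_HAVE_ATTR_MEMORY_TYPES
    mask |= DEMO_ATTR_FIELD_MEMORY_TYPES;
#endif
    return mask;
}

/* Decide whether CUDA setup is needed based on the queried attributes. */
bool demo_needs_cuda_init(const demo_attr_t *attr)
{
#if DEMO_HAVE_ATTR_MEMORY_TYPES
    return (attr->field_mask & DEMO_ATTR_FIELD_MEMORY_TYPES) != 0 &&
           (attr->memory_types & (UINT64_C(1) << DEMO_MEMORY_TYPE_CUDA)) != 0;
#else
    (void)attr; /* without the attribute there is nothing to inspect */
    return false;
#endif
}
```

Keeping a single feature macro in the condition means the code path depends only on what the transport library reports, not on a second build-time switch.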
ompi/mca/pml/ucx/pml_ucx.h
@@ -57,6 +57,7 @@ struct mca_pml_ucx_module {
    mca_pml_ucx_freelist_t convs;

    int priority;
    int cuda_initialized;
where is it set to 0?
@yosefe initialized to "false" in mca_pml_ucx_open() (https://github.com/open-mpi/ompi/pull/7893/files#diff-80c6bd864dd92a9d6cdcfa297313cc9cR247)
Signed-off-by: Devendar Bureddy <[email protected]>
bot:ompi:retest
This PR is ready to merge
ompi/cuda is initialized only from the tcp/smcuda/openib component open functions.
Using CUDA memory with the UCX PML segfaults if it is used without any of these components (-mca pml ucx -mca btl ^tcp,smcuda,openib).