Conversation

bureddy
Member

@bureddy bureddy commented Jun 29, 2020

ompi/cuda is initialized only from the tcp/smcuda/openib component open. CUDA memory segfaults with the UCX PML if it is used without any of these components (-mca pml ucx -mca btl ^tcp,smcuda,openib).

@bureddy
Member Author

bureddy commented Jun 29, 2020

@Akshay-Venkatesh @yosefe @jladd-mlnx please review

@awlauria
Contributor

bot:ompi:retest

@hppritcha
Member

@bureddy did a user report this issue?

@bureddy
Member Author

bureddy commented Jul 1, 2020

@hppritcha this issue was reported by our internal application team. It is easily reproducible with any OSU CUDA test using the options (-mca pml ucx -mca btl ^tcp,smcuda,openib).

I just had a discussion with @yosefe. I think a better solution would be to move this code into some common Open MPI init code instead of doing it from different BTLs (tcp, smcuda, openib) and PMLs (ucx).

Any suggestions?

Contributor

@awlauria awlauria left a comment


Requesting changes so we don't merge until this is ready, based on the review comments.

@Akshay-Venkatesh
Contributor

cc @jsquyres @bosilca

Is there a place independent of PMLs and BTLs where CUDA initialization can be moved to?

Ultimately, the function pointer table in CUDA initialization needs to be set up to (1) detect whether buffers passed to MPI are CUDA buffers and (2) issue copy operations for pack/unpack. Until now this logic was invoked from the smcuda, openib, and tcp BTLs (not sure why). Shouldn't this logic be triggered in the datatype component whenever CUDA support is requested in Open MPI?

@bosilca
Member

bosilca commented Jul 7, 2020

mca_common_cuda_stage_one_init correctly handles the case where it is called multiple times, as long as there is a matching number of calls to mca_common_cuda_fini. You should be able to call it from the UCX PML init function.

@bureddy
Member Author

bureddy commented Jul 8, 2020

bot:ompi:retest

@awlauria
Contributor

awlauria commented Jul 8, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 11, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 11, 2020

@yosefe can you please review?

@awlauria awlauria dismissed their stale review July 11, 2020 18:15

Updated.

@@ -230,22 +233,37 @@ int mca_pml_ucx_open(void)

/* Query UCX attributes */
attr.field_mask = UCP_ATTR_FIELD_REQUEST_SIZE;
#if HAVE_UCP_ATTR_MEMORY_TYPES && OPAL_CUDA_SUPPORT
Contributor


No need for OPAL_CUDA_SUPPORT; use #if HAVE_UCP_ATTR_MEMORY_TYPES

Member Author


fixed.

@@ -57,6 +57,7 @@ struct mca_pml_ucx_module {
mca_pml_ucx_freelist_t convs;

int priority;
int cuda_initialized;
Contributor


where is it set to 0?

Member Author


@bureddy
Member Author

bureddy commented Jul 13, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 13, 2020

This PR is ready to merge

@jladd-mlnx jladd-mlnx merged commit aa8f7f4 into open-mpi:master Jul 13, 2020
@bureddy bureddy deleted the cuda-ucx branch July 13, 2020 19:19