Conversation

bureddy
Member

@bureddy bureddy commented Jun 29, 2020

ompi/cuda is initialized only from the tcp/smcuda/openib component open. CUDA memory segfaults with the UCX PML if it is used without any of these components (-mca pml ucx -mca btl ^tcp,smcuda,openib).

@bureddy
Member Author

bureddy commented Jun 29, 2020

@Akshay-Venkatesh @yosefe @jladd-mlnx please review

@awlauria
Contributor

bot:ompi:retest

@hppritcha
Member

@bureddy did a user report this issue?

@bureddy
Member Author

bureddy commented Jul 1, 2020

@hppritcha this issue was reported by our internal application team. It is easily reproducible with any OSU CUDA test using the options (-mca pml ucx -mca btl ^tcp,smcuda,openib).

I just had a discussion with @yosefe. I think a better solution would be to move this code into some common Open MPI init code instead of doing it from different BTLs (tcp, smcuda, openib) and PMLs (ucx).

Any suggestions?

Contributor

@awlauria awlauria left a comment


Requesting changes so we don't merge until this is ready, based on the review comments.

@Akshay-Venkatesh
Contributor

cc @jsquyres @bosilca

Is there a place independent of PMLs and BTLs where CUDA initialization can be moved to?

Ultimately, the function pointer table in CUDA initialization needs to be set up to (1) detect whether buffers passed to MPI are CUDA buffers and (2) issue copy operations for pack/unpack. Until now this logic was invoked from the smcuda, openib, and tcp BTLs (not sure why). Shouldn't this logic be triggered in the datatype component whenever CUDA support is requested in Open MPI?

@bosilca
Member

bosilca commented Jul 7, 2020

mca_common_cuda_stage_one_init correctly handles the case where it is called multiple times, as long as there is a matching number of calls to mca_common_cuda_fini. You should be able to call it from the UCX PML init function.

@bureddy
Member Author

bureddy commented Jul 8, 2020

bot:ompi:retest

@awlauria
Contributor

awlauria commented Jul 8, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 11, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 11, 2020

@yosefe can you please review?

@awlauria awlauria dismissed their stale review July 11, 2020 18:15

Updated.

@@ -230,22 +233,37 @@ int mca_pml_ucx_open(void)

/* Query UCX attributes */
attr.field_mask = UCP_ATTR_FIELD_REQUEST_SIZE;
#if HAVE_UCP_ATTR_MEMORY_TYPES && OPAL_CUDA_SUPPORT
Contributor


No need for OPAL_CUDA_SUPPORT; use #if HAVE_UCP_ATTR_MEMORY_TYPES

Member Author


fixed.

@@ -57,6 +57,7 @@ struct mca_pml_ucx_module {
mca_pml_ucx_freelist_t convs;

int priority;
int cuda_initialized;
Contributor


where is it set to 0?

Member Author


@bureddy
Member Author

bureddy commented Jul 13, 2020

bot:ompi:retest

@bureddy
Member Author

bureddy commented Jul 13, 2020

This PR is ready to merge

@jladd-mlnx jladd-mlnx merged commit aa8f7f4 into open-mpi:master Jul 13, 2020
@bureddy bureddy deleted the cuda-ucx branch July 13, 2020 19:19