
pml/cm: fix a problem introduced with cuda support #8906


Merged (1 commit) on May 3, 2021

Conversation

hppritcha
Member

PR #8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to #8905.

Signed-off-by: Howard Pritchard [email protected]

@hppritcha hppritcha requested a review from wckzhang April 30, 2021 18:11
@wckzhang
Contributor

Patch should be fine, but shouldn't opal_convertor_prepare_for_send be able to handle the use case?

@hppritcha
Member Author

looks like ompi jenkins is having problems cloning from github at the moment
bot:ompi:retest

@hppritcha
Member Author

@jsquyres any idea why the call to opal_convertor_prepare_for_send breaks things here? looks like it adds a bunch of additional overhead we don't want in any case when not using CUDA.

@jsquyres
Member

> @jsquyres any idea why the call to opal_convertor_prepare_for_send breaks things here? looks like it adds a bunch of additional overhead we don't want in any case when not using CUDA.

I am not knowledgeable about the innards of DDT -- I think this is a question for @bosilca...

@bosilca
Member

bosilca commented Apr 30, 2021

What is the call to convertor_prepare_for_send breaking? What I see in the patch is that if the datatype is contiguous, then the convertor will not be properly initialized if CUDA support is not active. This can be perceived as a shortcut (saving one function call), because the data to be sent is contiguous, of size count*data.size. I assume whoever wrote that code took care to properly account for the potential use of a lower bound in the datatype.

@hppritcha
Member Author

Jithin did some optimization work in #602 that introduced this bypass of opal_convertor_prepare_for_send (or related code).
As it was, master was failing lots of IBM tests after the merge of #8536; this patch fixes that.

See hpc#42

I was also seeing this in MTT but had not been looking into the issue at that point.

Contributor

@wckzhang left a comment


Might as well encapsulate the MCA_PML_SWITCH_CUDA_CONVERTOR_OFF call within the if statement as well; it doesn't make sense to call it otherwise.

PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha hppritcha force-pushed the topic/swat_issue_8905 branch from d15c0e8 to 9e99182 Compare May 3, 2021 17:27
@hppritcha
Member Author

@wckzhang reworked per your suggestion

@hppritcha hppritcha merged commit 17b723b into open-mpi:master May 3, 2021