
pml/cm: fix a problem introduced with cuda support #8906


Merged (1 commit) on May 3, 2021

Conversation

hppritcha
Member

PR #8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to #8905.

Signed-off-by: Howard Pritchard [email protected]

@hppritcha hppritcha requested a review from wckzhang April 30, 2021 18:11
@wckzhang
Contributor

Patch should be fine, but shouldn't opal_convertor_prepare_for_send be able to handle the use case?

@hppritcha
Member Author

looks like ompi jenkins is having problems cloning from github at the moment
bot:ompi:retest

@hppritcha
Member Author

@jsquyres any idea why the call to opal_convertor_prepare_for_send breaks things here? looks like it adds a bunch of additional overhead we don't want in any case when not using CUDA.

@jsquyres
Member

> @jsquyres any idea why the call to opal_convertor_prepare_for_send breaks things here? looks like it adds a bunch of additional overhead we don't want in any case when not using CUDA.

I am not knowledgeable about the innards of DDT -- I think this is a question for @bosilca...

@bosilca
Member

bosilca commented Apr 30, 2021

What is the call to convertor_prepare_for_send breaking? What I see in the patch is that if the datatype is contiguous, then the convertor will not be properly initialized if CUDA support is not active. This can be perceived as a shortcut (saving one function call), because the data to be sent is contiguous, of size count*data.size. I assume whoever wrote that code took care to properly account for the potential use of a lower bound in the datatype.

@hppritcha
Member Author

Jithin did some optimization work in #602 that introduced this bypass of opal_convertor_prepare_for_send (or related code).
As it was, master was failing lots of IBM tests after the merge of #8536; this patch fixes that.

See hpc#42

I was also seeing this in MTT but had not been looking into the issue at that point.

Contributor

@wckzhang left a comment


Might as well encapsulate the MCA_PML_SWITCH_CUDA_CONVERTOR_OFF call within the if statement as well; it doesn't make sense to call it otherwise.

PR open-mpi#8536 introduced a regression in non-CUDA environments
when an application is using derived, but contiguous, datatypes.

Related to open-mpi#8905.

Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha hppritcha force-pushed the topic/swat_issue_8905 branch from d15c0e8 to 9e99182 Compare May 3, 2021 17:27
@hppritcha
Member Author

@wckzhang reworked per your suggestion

@hppritcha hppritcha merged commit 17b723b into open-mpi:master May 3, 2021