Skip to content

common/ucx: Fix case where variable goes out of scope #10409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 24, 2022

Conversation

jjhursey
Copy link
Member

If a program makes the following call sequence:

MPI_T_init_thread
MPI_T_finalize
MPI_Init

Then a segv will occur in MPI_Init because the memory backing the
*opal_common_ucx.tls will have been freed by the mca_base_var_finalize
in MPI_T_finalize. But the 'opal_common_ucx.tls` will be valid.

So add a check for this scenario so that we properly initialize the
default string even in this case.

bot:notacherrypick

@jjhursey
Copy link
Member Author

Related PR #10365

MTT found a few cases (environment/err, random/attr-error-code, environment/final) where this call sequence noted above resulted in segv's only for the v4.1 branch. main and v5.0.x have a slightly different setup that seems to avoid this issue.

The segv is caused by a NULL pointer being passed to a strcmp in opal_common_ucx_support_level.

@jjhursey
Copy link
Member Author

1 sec. I see another problem. Putting WIP on this.

@jjhursey
Copy link
Member Author

jjhursey commented May 18, 2022

PR #10365 picked up a lock (opal_common_ucx_mutex) in the cherry-pick, but missed the unlock somehow (the unlock is on the main and v5.0.x branches). I'm not sure how it was missed.
In any case, I pushed a new commit to this PR that adds the unlock in the correct location.

@janjust
Copy link
Contributor

janjust commented May 18, 2022

Should we maybe squash the commit since it's a critical commit for the initial commit?

If a program makes the following call sequence:
```
MPI_T_init_thread
MPI_T_finalize
MPI_Init
```
Then a segv will occur in MPI_Init because the memory backing the
`*opal_common_ucx.tls` will have been freed by the `mca_base_var_finalize`
in `MPI_T_finalize`. But the 'opal_common_ucx.tls` will be valid.

So add a check for this scenario so that we properly initialize the
default string even in this case.

 * Add missing unlock

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey
Copy link
Member Author

Good idea. Squashed.
@bwbarrett you will want this PR in before the release due to the unlock issue.

@bwbarrett bwbarrett merged commit 6e4f0a5 into open-mpi:v4.1.x May 24, 2022
@jjhursey jjhursey deleted the v4.1-fix-mpi-t branch May 24, 2022 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants