Skip to content

common/ucx: fix variable registration #10365

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 16, 2022

Conversation

jjhursey
Copy link
Member

@jjhursey jjhursey commented May 11, 2022

The registration code in common/ucx was overly complex. It assumed
that re-registration is not desireable when it essentially does the
same as the find before the registration. The only thing that needs
protection is resetting the char * variables before they are
registered.

The primary purpose of this commit to fix two memory leaks caused
by the extra allocations to hold the strings. These allocations
are unnecessary.


Fix the MCA registration modeling it after the common/ofi component.

bot:notacherrypick

The registration code in common/ucx was overly complex. It assumed
that re-registration is not desireable when it essentially does the
same as the find before the registration. The only thing that needs
protection is resetting the char * variables before they are
registered.

The primary purpose of this commit to fix two memory leaks caused
by the extra allocations to hold the strings. These allocations
are unnecessary.

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 5ad772c)
@jjhursey jjhursey added this to the v4.1.4 milestone May 11, 2022
@jjhursey jjhursey requested a review from hppritcha May 11, 2022 02:09
@jjhursey
Copy link
Member Author

This was found by MTT testing on ppc64le (regardless of compiler) and configured with

./configure --prefix=$MPI_ROOT --without-hcoll --enable-debug \
    --enable-mpirun-prefix-by-default --disable-dlopen --enable-io-romio \
    --disable-io-ompio --enable-mpi1-compatibility --with-ucx=/opt/ucx/

I believe that the --disable-dlopen is necessary. Without it the test seemed to run fine. With it I would get a segv with the stack below 100% of the time in MPI_Init:

(gdb) bt
#0  0x00007fffb635c5fc in __strcmp_power8 () from /lib64/libc.so.6
#1  0x00007fffb5c3a014 in opal_common_ucx_support_level (context=0x1002bb0b3f0) at common_ucx.c:233
#2  0x00007fffb6866e50 in mca_pml_ucx_component_init (priority=0x7fffca1e4dc0, enable_progress_threads=false, enable_mpi_threads=false)
    at pml_ucx_component.c:131
#3  0x00007fffb6858e50 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at base/pml_base_select.c:127
#4  0x00007fffb68e5344 in ompi_mpi_init (argc=0, argv=0x0, requested=0, provided=0x7fffca1e5020, reinit_ok=false) at runtime/ompi_mpi_init.c:647
#5  0x00007fffb664ffa4 in PMPI_Init (argc=0x0, argv=0x0) at pinit.c:69
#6  0x00000000100016b0 in main (argc=1, argv=0x7fffca1e54f8) at attr-error-code.c:28
(gdb) p *opal_common_ucx.tls
$3 = 0x0
(gdb) p *opal_common_ucx.devices
$4 = 0x0

The problem is the same one that Nathan fixed for main and v5.0.x.

@jjhursey jjhursey requested a review from gpaulsen May 11, 2022 02:13
@janjust
Copy link
Contributor

janjust commented May 11, 2022

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jjhursey
Copy link
Member Author

I'm able to reproduce the ompi_info segv. From the stack it has something to do with the tls registration. I'm looking into it now. Good on CI for catching that!

@jjhursey
Copy link
Member Author

jjhursey commented May 11, 2022

The problem seems to be specific to ompi_info since it is related to when the common_ucx library is unloaded in this section of code:
*

ompi_info_close_components();
OBJ_RELEASE(ompi_info_cmd_line);
OBJ_DESTRUCT(&mca_types);
for (i=0; i < component_map.size; i++) {
if (NULL != (map = (opal_info_component_map_t*)opal_pointer_array_get_item(&component_map, i))) {
OBJ_RELEASE(map);
}
}
OBJ_DESTRUCT(&component_map);
opal_info_finalize();
/* Put our own call to opal_finalize_util() here because we called
it up above (and it refcounts) */
opal_finalize_util();

The ompi_info_close_components() will close all frameworks including the common_ucx. This causes the symbol opal_common_ucx.tls to become an invalid storage location. The problem is that the MCA parameter opal_common_ucx_tls points to this memory for its storage. So when we get to opal_finalize_util() the MCA infrastructure tries to look at var->mbv_storage (which is opal_common_ucx.tls) and we have an invalid read and memory corruption leading to a segv.

I must be either missing some other commit from main/v5.0.x or something else is going on.

 * Model the MCA string variable handling after the common/ofi
   component that uses a second degree of indirection to keep
   the pointer valid even when the library closes (i.e., in ompi_info).

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey
Copy link
Member Author

I pushed a new commit to this PR that addresses the ompi_info segv that CI caught. The issue that caused the segv is described in the previous comment. The solution was to add one level of indirection to put the string on the heap - modeled after what common/ofi does with its MCA parameters.

This PR is ready to go.

@awlauria awlauria requested a review from janjust May 11, 2022 18:29
@bwbarrett bwbarrett merged commit 6685213 into open-mpi:v4.1.x May 16, 2022
@jjhursey jjhursey deleted the v4.1-fix-ucx-common branch May 16, 2022 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants