Skip to content

Updated the documentation about shared memory. #13218

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 3, 2025

Conversation

xbw22109
Copy link
Contributor

This PR fixes errors regarding the shared memory file path in the document, and also updates two other parts of the shared memory documentation:

  1. Corrected the error about the file path.

previous .. error:: notes:

shared-memory.rst:78 .. error:: TODO Is this correct?
shared-memory.rst:86 .. error:: TODO The filename above will certainly be wrong.
shared-memory.rst:91 .. error:: TODO The MCA param name above is definitely wrong.
  1. Moved the explanation about Shared Memory Mechanisms from docs/launching-apps/localhost.rst to docs/tuning-apps/networking/shared-memory.rst.

previous .. error:: notes:

localhost.rst:41 .. error:: TODO This should really be moved to the networking section.
  1. Fixed titles of the Shared Memory section.

previous .. error:: notes:

shared-memory.rst:4 .. error:: TODO This section needs to be converted from FAQ Q&A style
shared-memory.rst:5            to regular documentation style.

Additional Notes on Modification 1:

1. The default output path is /dev/shm with filename sm_segment.nodename.user_id.job_id.my_node_rank, not the session directory:

mca output:
image
code opal/mca/btl/sm/btl_sm_component.c:192:
image
code opal/mca/btl/sm/btl_sm_component.c:381:
image

2.About the session directory:

The "session directory" is a concept from OpenPMIx. It can be set manually via --pmixmca, or automatically by mpirun/prun or Open MPI (in singleton mode). While not central to this doc section, it's relevant because fallback paths may use the session dir. So, the MCA param is mentioned for completeness.

3.About single-copy methods and the backing file

After enabling single-copy methods, it might seem that a shared memory mapping file is no longer needed.
However, from reading the source code, I confirmed that it is still required.
This information is relevant for performance tuning, so I added it to the documentation.

Copy link

Hello! The Git Commit Checker CI bot found a few problems with this PR:

aaf5f27: Updated the documentation about shared memory.

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@xbw22109 xbw22109 force-pushed the update-shared-mem-document branch from aaf5f27 to 0861121 Compare April 29, 2025 05:56
@xbw22109
Copy link
Contributor Author

xbw22109 commented Apr 29, 2025

@bosilca Hello, this is my first pr and I need to be approved for workflow execution.

@rhc54
Copy link
Contributor

rhc54 commented Apr 29, 2025

Might be worth explaining a tad more about the session directory and this shmem backing file. The reason we created a session directory (way back in the beginning of OMPI - predates PMIx by more than 10 years) was to provide a location where we could collect all files OMPI creates, thus providing a simple way to clean them all up upon termination. The daemon just whack the entire session directory tree after the job completes (normally or abnormally) before it exits. We changed that for the shmem backing file (to use /dev, if it exists) due to size limitations (IIRC).

The problem with moving that file is that we now lose the ability to ensure cleanup. If the process itself abnormally terminates, the daemon that shepherds it has no idea that the shmem backing file was created somewhere outside the session directory - and therefore has no way to clean it up. This has generated some user complaints, but I'm not sure of the best solution. Perhaps we should use the PMIx_Job_ctrl API to register the backing file for cleanup upon termination when it isn't in the session directory? Or maybe that has already been done?

Regardless, might be worth noting that in the docs somewhere. If we don't register the cleanup location and your app abnormally terminates, then you may need to ensure (somehow) that you go back and remove those shmem backing files. Otherwise, they will just build up over time (as other jobs abnormally die), consuming disk space.

@bosilca
Copy link
Member

bosilca commented Apr 29, 2025

We can certainly do work to improve this documentation., But as a first contribution this is already awesome as it is.

bosilca
bosilca previously approved these changes Apr 29, 2025
@xbw22109
Copy link
Contributor Author

Might be worth explaining a tad more about the session directory and this shmem backing file. The reason we created a session directory (way back in the beginning of OMPI - predates PMIx by more than 10 years) was to provide a location where we could collect all files OMPI creates, thus providing a simple way to clean them all up upon termination. The daemon just whack the entire session directory tree after the job completes (normally or abnormally) before it exits. We changed that for the shmem backing file (to use /dev, if it exists) due to size limitations (IIRC).

The problem with moving that file is that we now lose the ability to ensure cleanup. If the process itself abnormally terminates, the daemon that shepherds it has no idea that the shmem backing file was created somewhere outside the session directory - and therefore has no way to clean it up. This has generated some user complaints, but I'm not sure of the best solution. Perhaps we should use the PMIx_Job_ctrl API to register the backing file for cleanup upon termination when it isn't in the session directory? Or maybe that has already been done?

Regardless, might be worth noting that in the docs somewhere. If we don't register the cleanup location and your app abnormally terminates, then you may need to ensure (somehow) that you go back and remove those shmem backing files. Otherwise, they will just build up over time (as other jobs abnormally die), consuming disk space.

I noticed the cleanup registration in opal/mca/btl/sm/btl_sm_component.c, which calls PMIx_Job_control_nb. That has already been done.

    // Note: Use the node_rank not the local_rank for the backing file.
    // This makes the file unique even when recovering from failures.
    rc = opal_asprintf(&sm_file, "%s" OPAL_PATH_SEP "sm_segment.%s.%u.%x.%d",
                       mca_btl_sm_component.backing_directory, opal_process_info.nodename,
                       geteuid(), OPAL_PROC_MY_NAME.jobid, opal_process_info.my_node_rank);
    if (0 > rc) {
        free(btls);
        return NULL;
    }
    opal_pmix_register_cleanup(sm_file, false, false, false);

the opal_pmix_register_cleanup function will also be called by mca_coll_xhc_shmem_create from ompi/mca/coll/xhc/coll_xhc.c.


    err = opal_asprintf(&shmem_file, "%s" OPAL_PATH_SEP
        "xhc_shmem_seg.%u@%s.%x.%d:%d_%s:%d:%d", mca_coll_xhc_component.shmem_backing,
        geteuid(), opal_process_info.nodename, OPAL_PROC_MY_NAME.jobid,
        ompi_comm_rank(MPI_COMM_WORLD), ompi_comm_get_local_cid(ompi_comm),
        name, id1, id2);

    if(err < 0) {
        return NULL;
    }

    // Not 100% sure what this does!, copied from btl/sm
    opal_pmix_register_cleanup(shmem_file, false, false, false);

However, there's a bug #13002 . My tests show that neither /dev/shm nor session dir can be cleaned up correctly when running as a singleton. When started using mpirun, both paths will be cleaned up. This means that the cleanup registration works fine when using mpirun.

Maybe I can try to fix the bug later and revise the documentation. Looks like we also need to add running as a singleton to the documentation somewhere.

@xbw22109
Copy link
Contributor Author

For the bug #13002 and the cleanup related issues : Can we make this pr focus on sm's documentation? Let’s make other changes later.

@rhc54
Copy link
Contributor

rhc54 commented Apr 30, 2025

For the bug #13002 and the cleanup related issues : Can we make this pr focus on sm's documentation? Let’s make other changes later.

Agreed - I wasn't trying to expand the scope here. Just trying to capture thoughts for the future.

Thanks for checking the registration! Part of the problem with singletons is that the registration doesn't work in that case as there is no daemon shepherding the process. We would need an alternate mechanism to ensure cleanup of the non-session-directory file for singletons. The issue I filed was that we don't cleanup the session directory when aborting a singleton- you are welcome to expand that to a more general "we don't cleanup" to include the non-session-directory file(s) if you like.

@xbw22109
Copy link
Contributor Author

xbw22109 commented May 1, 2025

I fixed the format error and mca parameter error, I think it’s ready for review.

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xbw22109 Many thanks for these doc updates!

@jsquyres jsquyres merged commit ba4b68c into open-mpi:main May 3, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants