ompi/mpi_init: fix barrier #2089
Conversation
@jladd-mlnx @rhc54 @hppritcha @dsolt @jsquyres SLURM processes the new request in a separate thread, and when the node is overutilized that thread is scheduled too late, causing significant delays in collective progress. I suspect this may be the case for other resource managers as well.
I know we have to cycle opal_progress because some of the MTLs (and maybe BTLs as well) are actually doing things during this time. However, I don't believe they will be significantly impacted. We should ensure everyone checks, though, once this is committed.
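For context, here is a minimal sketch (plain C, with hypothetical names such as `fence_active` and `aggressive_fence_wait`) of the aggressive wait under discussion: the application process spins on opal_progress() until the fence completion callback releases it, which keeps the MTLs/BTLs progressing but also keeps a core fully busy while the RM daemon is trying to run.

```c
#include <stdbool.h>
#include "opal/runtime/opal_progress.h"   /* opal_progress() */

/* Cleared by the fence completion callback (name is illustrative only). */
static volatile bool fence_active = true;

static void aggressive_fence_wait(void)
{
    /* Tight spin: fabrics keep making progress, but the process burns
     * ~100% of a core while waiting, competing with the RM daemon that
     * is driving the collective. */
    while (fence_active) {
        opal_progress();
    }
}
```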
@artpol84 presented a detailed breakdown of
I assume you mean that @artpol84 presented this detailed breakdown to you internally? I don't see it here, nor do I see the references you seem to be citing. Is there some missing info?
@rhc54 This was presented internally.
Okay, understood. After thinking about this awhile, I think this is something better controlled by another parameter for several reasons:
We know we had to retain the call to opal_progress during the barrier, or else we "hang" in some environments. It is therefore possible that MTLs/BTLs requiring progress may take more time to complete whatever they are doing at that point, thereby causing MPI_Init to take longer to complete even though the Slurm barrier algorithm would have completed faster. So I'd recommend adding a param to select between lazy and aggressive barriers, defaulting to our current behavior. Then let's give folks an opportunity to try this in different environments, using different fabrics and RMs, and see if we can get a sense of the best solution.
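A minimal sketch of the kind of parameter being suggested, assuming a boolean MCA variable; the name `lazy_wait_in_init` and the helper below are hypothetical, not necessarily what this PR registers:

```c
#include <stdbool.h>
#include "opal/mca/base/mca_base_var.h"   /* mca_base_var_register() */

/* Default of false preserves the current aggressive opal_progress spin. */
static bool ompi_mpi_lazy_wait_in_init = false;

static void register_lazy_wait_param(void)
{
    (void) mca_base_var_register("ompi", "mpi", NULL, "lazy_wait_in_init",
                                 "Use a sleep-based (lazy) wait instead of "
                                 "spinning on opal_progress during the "
                                 "modex/barrier in MPI_Init",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_3, MCA_BASE_VAR_SCOPE_READONLY,
                                 &ompi_mpi_lazy_wait_in_init);
}
```

With a flag like this, the wait in MPI_Init can keep cycling opal_progress by default and only switch to the sleep-based wait when the user opts in, matching the "default to current behavior" request above.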
@rhc54 I was about to suggest that as well. Will update the PR later today.
@rhc54, these are very good points. We will update the PR accordingly based on your guidance.
@rhc54 to be clear, overall behavior was under consideration. We are not interested in SLURM-only performance.
@artpol84 out of curiosity, did you try to
@ggouaillardet No, we just used the extant macro (i.e. we did not add any new functionality) that is utilized in
@ggouaillardet I don't think it makes much difference. The time to execute the barrier was ~240ms, and doing 100ns waits will not affect this noticeably.
Just to be clear: I wrote that macro. We can't use sched_yield because the nature of that call changed a while ago, and using it for this purpose has been discouraged. Thus, we throw a small amount of sleep into the code so the kernel scheduler swaps us out. We did a number of experiments to see how the amount of sleep time affects overall behavior, and it does indeed make some difference. Our primary concern was to keep CPU utilization down during finalize, as users were complaining about the power consumption, while not sleeping so long that it caused us to inordinately lengthen the launch/terminate benchmarks.
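For reference, a sketch of the sleep-based wait pattern described here; `WAIT_FOR_COMPLETION` and the 10-microsecond interval are illustrative, not the literal OMPI macro:

```c
#include <unistd.h>

/* Sleep briefly instead of calling sched_yield(): the short nap lets the
 * kernel scheduler run someone else (e.g. the RM daemon driving the
 * collective), keeping CPU and power usage low while we wait for the
 * flag to clear. */
#define WAIT_FOR_COMPLETION(flag)   \
    do {                            \
        while (flag) {              \
            usleep(10);             \
        }                           \
    } while (0)
```

The sleep interval is the tuning knob: long enough that the process actually gets swapped out and CPU/power usage drops, short enough that launch/terminate benchmarks are not inordinately lengthened.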
Relax CPU usage pressure from the application processes when doing modex and barrier in ompi_mpi_init. We see significant latencies in the SLURM/pmix plugin barrier progress because app processes aggressively call opal_progress, pushing away the daemon process doing the collective progress.
Force-pushed from 7fc45ed to 0861884
Jeff, we need this in 2.1. Is this still possible?
@rhc54 please have a look
@jladd-mlnx @jsquyres I understand the timing is tight with this update to the PR, but can we still take it into 2.1?
bot:mellanox:retest
Jenkins seems to be fine with this now.
Parameter added, default behavior preserved. Merging.
Relax CPU usage pressure from the application processes when doing
modex and barrier in ompi_mpi_init.
We see significant latencies in the SLURM/pmix plugin barrier progress
because app processes aggressively call opal_progress, pushing away
the daemon process doing the collective progress.