ompi/mpi_init: fix barrier #2089

Merged: 1 commit merged into open-mpi:master on Sep 27, 2016

Conversation

artpol84
Contributor

Relax CPU usage pressure from the application processes when doing
modex and barrier in ompi_mpi_init.

We see significant latencies in the SLURM/pmix plugin barrier progress
because app processes aggressively call opal_progress, pushing away
the daemon process doing the collective progress.

artpol84 added this to the v2.1.0 milestone Sep 19, 2016
@artpol84
Contributor Author

artpol84 commented Sep 19, 2016

@jladd-mlnx @rhc54 @hppritcha @dsolt @jsquyres
This fix substantially (3x) improves barrier performance for the SLURM/pmix plugin. I haven't noticed any degradation in the OMPI/pmix case.
Given that libevent initialization may now take up to 200ms, doing a lazy wait (with 100µs sleeps) shouldn't negatively affect the startup time. On the other hand, we need to give the daemon a chance to progress.

SLURM processes new requests in a separate thread, and when the node is overutilized this thread is scheduled too late, causing significant delays in collective progress. I guess this may be the case for other resource managers as well.

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I know we have to cycle opal_progress because some of the MTLs (and maybe BTLs as well) are actually doing things during this time. However, I don't believe they will be significantly impacted. We should ensure everyone checks, though, once this is committed.

@jladd-mlnx
Member

jladd-mlnx commented Sep 19, 2016

@artpol84 presented a detailed breakdown of ompi_mpi_init during a direct launch with srun using the PMIx plugin for the back-end collective operations. In order to put the observations into context, a ping-pong type benchmark was used to measure unicast times between PMIx daemons. The findings are as follows:

  1. Unicast times are "fast" with 1ppn.
  2. The PMIx barrier in ompi_mpi_finalize is "fast" regardless of number of MPI processes launched on a host. In this context, fast means that the time to fan-in is equal to the number of steps in the fan-in tree multiplied by the unicast value measured in [1].
  3. The PMIx barrier executed during ompi_mpi_init was observed to be significantly slower than the PMIx barrier executed during ompi_mpi_finalize. Investigation of the individual unicast times comprising the fan-in stage executed during ompi_mpi_init reveal significant degradation when compared against the values measured in [1].
  4. The degradation observed in [3] is a function of the number of MPI processes started on the host. With PPN = 1-14, the barriers in ompi_mpi_finalize and ompi_mpi_init have comparable completion times. Beyond 14 PPN, the performance of the PMIx barrier in ompi_mpi_init drops precipitously.
  5. We determined that aggressive polling of progress impedes the PMIx daemon from obtaining the CPU resources necessary to progress the PMIx barrier operation.
  6. The solution is to complete the barrier in ompi_mpi_init with a lazy wait, which enables better sharing of CPU resources (as is done in ompi_mpi_finalize); see the sketch below.
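
As a rough illustration of [5] and [6], here is a minimal standalone sketch (not the actual ompi_mpi_init code; the flag and function names are hypothetical, and opal_progress() is stood in for by a comment):

#include <stdbool.h>
#include <unistd.h>

static volatile bool fence_active = true;  /* cleared by the fence completion callback */

/* Aggressive wait: spins on progress and burns a full core, competing
 * with the PMIx daemon for the CPU time it needs to drive the collective. */
static void aggressive_wait(void)
{
    while (fence_active) {
        /* opal_progress(); */
    }
}

/* Lazy wait: still polls progress, but sleeps ~100us per iteration so
 * the kernel scheduler can run the daemon. */
static void lazy_wait(void)
{
    while (fence_active) {
        /* opal_progress(); */
        usleep(100);
    }
}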

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I assume you mean that @artpol84 presented this detailed breakdown to you internally? I don't see it here, nor do I see the references you seem to be citing. Is there some missing info?

@jladd-mlnx
Member

@rhc54 This was presented internally.

@rhc54
Contributor

rhc54 commented Sep 19, 2016

Okay, understood. After thinking about this awhile, I think this is something better controlled by another parameter for several reasons:

  • the PMIx community is advocating that RMs improve their collective algorithms, including moving to the use of network offload libraries. As that happens, this change may actually prove to be detrimental.
  • the barrier in mpi_init can be slower due to multiple factors, e.g., connection formation times, as this is the first point where an inter-daemon collective operation is performed. This is typically not the case for algorithms that follow the same pattern used to transmit the launch message, or for RMs that "pre-establish" the connections between daemons. Still, one needs to account for such factors, and every RM does it differently. Thus, no one-size-fits-all solution may be best.
  • making the barrier always be "lazy" may result in sub-optimal behavior. In other words, while your work may indicate that Slurm's barrier operation went faster, that doesn't mean that OMPI will necessarily start faster overall. Without seeing the report, it is difficult to tell - it sounds like the focus was more on how Slurm behaved as opposed to optimizing overall behavior.

We know we had to retain the call to opal_progress during the barrier, else we "hang" in some environments. It is therefore possible that MTLs/BTLs requiring progress may take more time to complete whatever they are doing at that point, thereby causing MPI_Init to take longer to complete even though the Slurm barrier algorithm would have completed faster.

So I'd recommend adding a param to select between lazy and aggressive barriers, defaulting to our current behavior. Then let's give folks an opportunity to try this in different environments, using different fabrics and RMs, and see if we can get a sense of the best solution.
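
A minimal sketch of the kind of parameter being suggested, following the registration pattern used in ompi/runtime/ompi_mpi_params.c. The parameter and variable names here are hypothetical, not necessarily what this PR ends up using:

/* Hypothetical sketch only: default false keeps the current aggressive behavior. */
#include "opal/mca/base/mca_base_var.h"

static bool ompi_mpi_lazy_wait_in_init = false;

static void register_lazy_wait_param(void)
{
    (void) mca_base_var_register("ompi", "mpi", NULL, "lazy_wait_in_init",
                                 "Use a lazy (sleep-based) wait for the modex/barrier in "
                                 "MPI_Init instead of aggressively spinning on opal_progress",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY,
                                 &ompi_mpi_lazy_wait_in_init);
}

/* ...and, as an excerpt, at the barrier in ompi_mpi_init,
 * 'active' being the fence completion flag: */
if (ompi_mpi_lazy_wait_in_init) {
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);
} else {
    while (active) {
        opal_progress();    /* current aggressive behavior */
    }
}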

@artpol84
Contributor Author

@rhc54 I was about to suggest that as well. Will update the PR later today.

@jladd-mlnx
Member

@rhc54, these are very good points. We will update the PR accordingly based on your guidance.

@artpol84
Contributor Author

@rhc54 to be clear, overall behavior was under consideration; we are not interested in SLURM-only performance.
I was checking both the mpirun and srun cases.
mpirun startup time hasn't changed at all, while the SLURM startup time decreased significantly, as did the overall runtime of ring_c.c.

@ggouaillardet
Contributor

@artpol84 out of curiosity, did you try sched_yield() instead of usleep(100)?

@jladd-mlnx
Member

@ggouaillardet No, we just used the extant macro (i.e., we did not add any new functionality) that is utilized in ompi_mpi_finalize in a similar fashion:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                  \
    do {                                                                    \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,    \
                            "%s lazy waiting on RTE event at %s:%d",        \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),             \
                            __FILE__, __LINE__);                            \
        while ((flg)) {                                                     \
            opal_progress();                                                \
            usleep(100);                                                    \
        }                                                                   \
    }while(0);
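
For context, a hypothetical sketch of how this macro gets driven (the flag and callback names are illustrative, not the literal ompi_mpi_init code): the flag starts true, the fence completion callback clears it, and the macro polls progress with a short sleep until then.

static volatile bool active;

/* completion callback for the non-blocking fence/modex */
static void fence_release(int status, void *cbdata)
{
    volatile bool *flag = (volatile bool *) cbdata;
    *flag = false;                 /* let the waiting loop exit */
}

/* ... */
active = true;
/* start the non-blocking fence here, passing fence_release and &active */
OMPI_LAZY_WAIT_FOR_COMPLETION(active);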

@artpol84
Contributor Author

@ggouaillardet I don't think it makes much difference. The time to execute the barrier was ~240ms, and doing 100µs waits will not affect this noticeably.
After the fix the time decreased to 90ms.

@rhc54
Contributor

rhc54 commented Sep 20, 2016

Just to be clear: I wrote that macro. We can't use sched_yield as the nature of that call changed a while ago, and so using it for this purpose has been discouraged. Thus, we throw a small amount of sleep into the code so the kernel scheduler swaps us out.

We did a number of experiments to see how the amount of sleep time affects overall behavior, and it does indeed make some difference. Our primary concern was to keep CPU utilization down during finalize, as users were complaining about the power consumption, while not sleeping so long that it inordinately lengthened the launch/terminate benchmarks.

jsquyres removed this from the v2.1.0 milestone Sep 26, 2016
@artpol84
Contributor Author

Jeff, we need this in 2.1. Is this still possible?

@artpol84
Contributor Author

@rhc54 please have a look

artpol84 added this to the v2.1.0 milestone Sep 27, 2016
@artpol84
Contributor Author

@jladd-mlnx @jsquyres I understand the timing is tight with the update of this PR, but can we still take it into 2.1?

@artpol84
Contributor Author

bot:mellanox:retest

@artpol84
Contributor Author

Jenkins seems to be fine with this now.

@jladd-mlnx
Member

Parameter added, default behavior preserved. Merging.

jladd-mlnx merged commit 4b0b7fd into open-mpi:master Sep 27, 2016
artpol84 mentioned this pull request Oct 17, 2016
clementFoyer pushed a commit to bosilca/ompi that referenced this pull request Nov 4, 2016