ompi/mpi_init: fix barrier #2089

Merged: 1 commit merged into open-mpi:master on Sep 27, 2016

Conversation

artpol84
Contributor

Relax CPU usage pressure from the application processes when doing
modex and barrier in ompi_mpi_init.

We see significant latencies in the SLURM/pmix plugin barrier progress
because app processes aggressively call opal_progress, pushing away
the daemon process doing the collective progress.

artpol84 added this to the v2.1.0 milestone Sep 19, 2016
@artpol84
Contributor Author

artpol84 commented Sep 19, 2016

@jladd-mlnx @rhc54 @hppritcha @dsolt @jsquyres
This fix substantially (3x) improves barrier performance for the SLURM/pmix plugin. I haven't noticed any degradation in the OMPI/pmix case.
Given that libevent initialization may now take up to 200ms, doing a lazy wait (with 100µs sleeps) shouldn't negatively affect the startup time. On the other hand, we need to give the daemon a chance to progress.

SLURM processes new requests in a separate thread, and when the node is overutilized this thread is scheduled too late, causing significant delays in collective progress. I guess this may be the case for other resource managers as well.

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I know we have to cycle opal_progress because some of the MTLs (and maybe BTLs as well) are actually doing things during this time. However, I don't believe they will be significantly impacted. We should ensure everyone checks, though, once this is committed.

@jladd-mlnx
Member

jladd-mlnx commented Sep 19, 2016

@artpol84 presented a detailed breakdown of ompi_mpi_init during a direct launch with srun using the PMIx plugin for the back-end collective operations. In order to put the observations into context, a ping-pong type benchmark was used to measure unicast times between PMIx daemons. The findings are as follows:

  1. Unicast times are "fast" with 1ppn.
  2. The PMIx barrier in ompi_mpi_finalize is "fast" regardless of number of MPI processes launched on a host. In this context, fast means that the time to fan-in is equal to the number of steps in the fan-in tree multiplied by the unicast value measured in [1].
  3. The PMIx barrier executed during ompi_mpi_init was observed to be significantly slower than the PMIx barrier executed during ompi_mpi_finalize. Investigation of the individual unicast times comprising the fan-in stage executed during ompi_mpi_init reveal significant degradation when compared against the values measured in [1].
  4. The degradation observed in [3] is a function of the number of MPI processes started on the host. With PPN = 1-14, the barriers in ompi_mpi_finalize and ompi_mpi_init have comparable completion times. Beyond 14 PPN, the performance of the PMIx barrier in ompi_mpi_init drops precipitously.
  5. We determined that aggressive polling of progress impedes the PMIx daemon from obtaining the CPU resources necessary to progress the PMIx barrier operation.
  6. The solution is to complete the barrier in ompi_mpi_init with a lazy wait, which enables better sharing of CPU resources (as is done in ompi_mpi_finalize); see the sketch below.
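
As a rough illustration of [5] and [6], here is a minimal standalone sketch (not the actual ompi_mpi_init code; the flag and function names are hypothetical, and opal_progress() is stood in for by a comment):

#include <stdbool.h>
#include <unistd.h>

static volatile bool fence_active = true;  /* cleared by the fence completion callback */

/* Aggressive wait: spins on progress and burns a full core, competing
 * with the PMIx daemon for the CPU time it needs to drive the collective. */
static void aggressive_wait(void)
{
    while (fence_active) {
        /* opal_progress(); */
    }
}

/* Lazy wait: still polls progress, but sleeps ~100us per iteration so
 * the kernel scheduler can run the daemon. */
static void lazy_wait(void)
{
    while (fence_active) {
        /* opal_progress(); */
        usleep(100);
    }
}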

@rhc54
Contributor

rhc54 commented Sep 19, 2016

I assume you mean that @artpol84 presented this detailed breakdown to you internally? I don't see it here, nor do I see the references you seem to be citing. Is there some missing info?

@jladd-mlnx
Member

@rhc54 This was presented internally.

@rhc54
Contributor

rhc54 commented Sep 19, 2016

Okay, understood. After thinking about this awhile, I think this is something better controlled by another parameter for several reasons:

  • the PMIx community is advocating that RMs improve their collective algorithms, including moving to the use of network offload libraries. As that happens, this change may actually prove to be detrimental.
  • the barrier in mpi_init can be slower due to multiple factors, e.g., connection formation times, as this is the first point where an inter-daemon collective operation is performed. This is typically not the case for algorithms that follow the same pattern used to transmit the launch message, or for RMs that "pre-establish" the connections between daemons. Still, one needs to account for such factors, and every RM does it differently. Thus, no one-size-fits-all solution may be best.
  • making the barrier always be "lazy" may result in sub-optimal behavior. In other words, while your work may indicate that Slurm's barrier operation went faster, that doesn't mean that OMPI will necessarily start faster overall. Without seeing the report, it is difficult to tell - it sounds like the focus was more on how Slurm behaved as opposed to optimizing overall behavior.

We know we had to retain the call to opal_progress during the barrier, else we "hang" in some environments. It is therefore possible that MTLs/BTLs requiring progress may take more time to complete whatever they are doing at that point, thereby causing MPI_Init to take longer to complete even though the Slurm barrier algorithm would have completed faster.

So I'd recommend adding a param to select between lazy and aggressive barriers, defaulting to our current behavior. Then let's give folks an opportunity to try this in different environments, using different fabrics and RMs, and see if we can get a sense of the best solution.
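
A minimal sketch of the kind of parameter being suggested, following the registration pattern used in ompi/runtime/ompi_mpi_params.c. The parameter and variable names here are hypothetical, not necessarily what this PR ends up using:

/* Hypothetical sketch only: default false keeps the current aggressive behavior. */
#include "opal/mca/base/mca_base_var.h"

static bool ompi_mpi_lazy_wait_in_init = false;

static void register_lazy_wait_param(void)
{
    (void) mca_base_var_register("ompi", "mpi", NULL, "lazy_wait_in_init",
                                 "Use a lazy (sleep-based) wait for the modex/barrier in "
                                 "MPI_Init instead of aggressively spinning on opal_progress",
                                 MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                 OPAL_INFO_LVL_9,
                                 MCA_BASE_VAR_SCOPE_READONLY,
                                 &ompi_mpi_lazy_wait_in_init);
}

/* ...and, as an excerpt, at the barrier in ompi_mpi_init,
 * 'active' being the fence completion flag: */
if (ompi_mpi_lazy_wait_in_init) {
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);
} else {
    while (active) {
        opal_progress();    /* current aggressive behavior */
    }
}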

@artpol84
Contributor Author

@rhc54 I was about to suggest that as well. Will update the PR later today.

@jladd-mlnx
Member

@rhc54, these are very good points. We will update the PR accordingly based on your guidance.

@artpol84
Contributor Author

@rhc54 to be clear, overall behavior was under consideration; we are not interested in SLURM-only performance.
I was checking both the mpirun and srun cases.
mpirun startup time hasn't changed at all, while the SLURM startup time decreased significantly, as did the overall runtime of ring_c.c.

@ggouaillardet
Contributor

@artpol84 out of curiosity, did you try sched_yield() instead of usleep(100)?

@jladd-mlnx
Member

@ggouaillardet No, we just used the extant macro (i.e., we did not add any new functionality) that is utilized in ompi_mpi_finalize in a similar fashion:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                  \
    do {                                                                    \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,    \
                            "%s lazy waiting on RTE event at %s:%d",        \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),             \
                            __FILE__, __LINE__);                            \
        while ((flg)) {                                                     \
            opal_progress();                                                \
            usleep(100);                                                    \
        }                                                                   \
    }while(0);
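
For context, a hypothetical sketch of how this macro gets driven (the flag and callback names are illustrative, not the literal ompi_mpi_init code): the flag starts true, the fence completion callback clears it, and the macro polls progress with a short sleep until then.

static volatile bool active;

/* completion callback for the non-blocking fence/modex */
static void fence_release(int status, void *cbdata)
{
    volatile bool *flag = (volatile bool *) cbdata;
    *flag = false;                 /* let the waiting loop exit */
}

/* ... */
active = true;
/* start the non-blocking fence here, passing fence_release and &active */
OMPI_LAZY_WAIT_FOR_COMPLETION(active);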

@artpol84
Contributor Author

@ggouaillardet I don't think it makes much difference. The time to execute the barrier was ~240ms, and doing 100µs waits will not affect this noticeably.
After the fix the time decreased to 90ms.

@rhc54
Contributor

rhc54 commented Sep 20, 2016

Just to be clear: I wrote that macro. We can't use sched_yield as the nature of that call changed a while ago, and so using it for this purpose has been discouraged. Thus, we throw a small amount of sleep into the code so the kernel scheduler swaps us out.

We did a number of experiments to see how the amount of sleep time affects overall behavior, and it does indeed make some difference. Our primary concern was to keep CPU utilization down during finalize, as users were complaining about the power consumption, while not sleeping so long that it inordinately lengthened the launch/terminate benchmarks.

jsquyres removed this from the v2.1.0 milestone Sep 26, 2016
@artpol84
Contributor Author

Jeff, we need this in 2.1. Is this still possible?

@artpol84
Contributor Author

@rhc54 please have a look

artpol84 added this to the v2.1.0 milestone Sep 27, 2016
@artpol84
Contributor Author

@jladd-mlnx @jsquyres I understand the timing is tight with the update of this PR, but can we still take it into 2.1?

@artpol84
Contributor Author

bot:mellanox:retest

@artpol84
Contributor Author

Jenkins seems to be fine with this now.

@jladd-mlnx
Member

Parameter added, default behavior preserved. Merging.

jladd-mlnx merged commit 4b0b7fd into open-mpi:master Sep 27, 2016
artpol84 mentioned this pull request Oct 17, 2016
clementFoyer pushed a commit to bosilca/ompi that referenced this pull request Nov 4, 2016