ompi: Avoid unnecessary PMIx lookups when adding procs. #3011

Merged: 1 commit, Feb 28, 2017

Conversation

artpol84
Contributor

On a medium-sized KNL cluster, linear scaling with the job size was observed in this part of the code.
This PR reduces the number of lookups from O(n) to O(ppn).
ompi_proc_complete_init() was taking 9 ms at 64 procs and 180 ms at 640 procs. After the fix the delay stayed flat at around 10 ms.

@rhc54 I know you wanted to fix this in some other way, but I believe we want this in v2.x for scalability. I'm going to open the PR there.

@hjelmn @jjhursey please consider this while doing your performance tests.

@jladd-mlnx, FYI.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

ahem - you need to sign this off before committing it. I'm not sure it makes sense to bring this to 2.x given that we are not including the largest time eaters.

artpol84 force-pushed the add_proc_fix/master branch from 9f4464a to 0cb86e7 on February 22, 2017 00:33
@artpol84
Contributor Author

We see this problem with just 40 nodes and 16 ppn. I guess this is a different situation.

@artpol84
Contributor Author

Thanks for the "sign-off" note.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

Let's get everyone together and talk about it at next week's OMPI telecon. We need to decide where we are going to draw the line on v2.1 scalability, or else we are going to be making these decisions one at a time, and get into trouble.

artpol84 mentioned this pull request on Feb 22, 2017
@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

The difference with this PR is that it solves the direct-launch scalability issue as well.

@artpol84
Contributor Author

2.x is not frozen yet, and those are minor, localized changes.
Let's discuss it on the telecon.

artpol84 force-pushed the add_proc_fix/master branch from 0cb86e7 to 0f15785 on February 22, 2017 01:09
@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Is there any reason why we would ever run with ompi_add_procs_cutoff > 0? This doesn't impact whether or not we do an async modex. The only thing this param does is determine whether we add ompi_proc_t structures for every proc in the job prior to communicating with them.

I cannot see any reason why we would ever do that. If we remove that parameter, then this code can be further optimized. We would go ahead and set up structures for all the local procs, but we wouldn't need to do a modex recv on locality for any procs after that point, as we would already know that they are non-local.

There are a few further optimizations we can do here, but getting a better understanding of the role of ompi_add_procs_cutoff is certainly something we need to do.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@artpol84 My concern is that you are chasing milliseconds when we know that there are tens of minutes of time delays built into the v2.1 release. We had said earlier today that we weren't going to worry about scaling in this series, and so we didn't plan on bringing the larger optimizations across. If we want to change that decision, then we should probably put a priority on the bigger gains as well.

I'm not saying this change isn't worth doing - just want to ensure we are accurately setting expectations.

@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

@rhc54
64 procs - 9 ms (~0.1 ms per proc)
640 procs - 180 ms (~0.28 ms per proc)
....
8192 x 64 x 0.1 ms = 52 s
8192 x 64 x 0.28 ms = 146 s

@artpol84
Contributor Author

@hjelmn wasn't hitting this, if I understand correctly, because he was running /bin/true (correct me if I am wrong).

@hjelmn
Member

hjelmn commented Feb 22, 2017

Commented on the v2.x patch. This is a bug that existed before PMIx. I was seeing a launch slowdown with MPI apps but didn't have time to track it down. Will try this with scaling.pl once my KNL system is back this weekend.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Can you please answer my question about ompi_add_procs_cutoff? I think this change isn't quite complete.

@hjelmn
Member

hjelmn commented Feb 22, 2017

@rhc54 I would support making the add_procs cutoff a hidden parameter in 2.1 (like mpi_preconnect_mpi) as no user should set it unless they have an all-to-all type code. For 3.0 we should probably evaluate all the scaling variables and collapse them down into one.

As for this change, I think I see what you are talking about. ompi_proc_complete_init_single does the modex receive on locality. It is probably worth further optimizing to avoid that extra modex call. We should probably discuss what we can do in the context of v2.x vs v3.x.

@artpol84
Contributor Author

For some reason the v2.x PR went through our Jenkins fine, but this one has some problems. I'm looking into it.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Agreed, except that I'm not sure the cutoff has any impact on the all-to-all code either. You'll still simply create all the ompi proc_t's when needed. Worth taking a look at and pondering a bit before we go over to 2.1.

artpol84 force-pushed the add_proc_fix/master branch from 0f15785 to dcc0959 on February 22, 2017 07:28
@artpol84
Contributor Author

@rhc54 I noticed that the ring_c test was hanging; it was mpirun at the cleanup stage, with the following backtrace:

Thread 12 (Thread 0x7ffff09f3700 (LWP 12957)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x742cd0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x742cd0, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x718f88) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 11 (Thread 0x7ffff01f2700 (LWP 12958)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x715980, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x715980, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x715918) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 10 (Thread 0x7fffef9f1700 (LWP 12959)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x748010, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x748010, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x716378) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 9 (Thread 0x7fffef1f0700 (LWP 12960)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x748860, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x748860, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x7487f8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 8 (Thread 0x7fffee9ef700 (LWP 12961)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x7492c0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x7492c0, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x749258) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x7fffee1ee700 (LWP 12962)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x749e60, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x749e60, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x749df8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7fffed9ed700 (LWP 12963)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x74a990, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x74a990, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x74a928) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7fffed1ec700 (LWP 12964)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x74b540, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x74b540, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x74b4d8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7fffec9eb700 (LWP 12965)):
#0  0x00007ffff635f8f3 in select () from /usr/lib64/libc.so.6
#1  0x00007ffff126966e in listen_thread (obj=0x7ffff14787b0 <mca_oob_tcp_component+1008>) at oob_tcp_listener.c:701
#2  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffea876700 (LWP 12966)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x7b17e0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x7b17e0, flags=1) at event.c:1630
#3  0x00007fffeaf48d69 in progress_engine (obj=0x7b1798) at runtime/pmix_progress_threads.c:151
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffea075700 (LWP 12967)):
#0  0x00007ffff635f8f3 in select () from /usr/lib64/libc.so.6
#1  0x00007fffeaf6378d in listen_thread (obj=0x0) at base/ptl_base_listener.c:215
#2  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fb2740 (LWP 12956)):
#0  0x00007ffff663bef7 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00007fffeaf48ba9 in pmix_thread_join (t=0x7b1798, thr_return=0x0) at runtime/pmix_progress_threads.c:66
#2  0x00007fffeaf48de2 in stop_progress_engine (trk=0x7b16a0) at runtime/pmix_progress_threads.c:166
#3  0x00007fffeaf49852 in pmix_progress_thread_pause (name=0x7fffeaf77ec0 "PMIX-wide async progress thread") at runtime/pmix_progress_threads.c:335
#4  0x00007fffeaef1652 in OPAL_MCA_PMIX2X_PMIx_server_finalize () at server/pmix_server.c:249
#5  0x00007fffeb1b1f06 in pmix2x_server_finalize () at pmix2x_server_south.c:176
#6  0x00007ffff7b334a6 in pmix_server_finalize () at orted/pmix/pmix_server.c:385
#7  0x00007ffff3d8ddfb in rte_finalize () at ess_hnp_module.c:811
#8  0x00007ffff7aea6e8 in orte_finalize () at runtime/orte_finalize.c:76
#9  0x0000000000401725 in orterun (argc=22, argv=0x7fffffffcc98) at orterun.c:219
#10 0x0000000000401070 in main (argc=22, argv=0x7fffffffcc98) at main.c:13

Out of curiosity - do we really need all of those threads? I wasn't expecting to see so many.

@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

This might be related to #2982
When I tried to reproduce it manually, everything worked fine, so I guess it was a race condition.

artpol84 force-pushed the add_proc_fix/master branch from dcc0959 to 717f3fe on February 22, 2017 09:16
@rhc54
Contributor

rhc54 commented Feb 22, 2017

Interesting - yeah, it probably is a race condition. @hjelmn and I added a bunch of oob progress threads to help with communication delays during launch, but we can probably dial those back if/when the new mapping scheme tests out.

3 participants