ompi: Avoid unnecessary PMIx lookups when adding procs. #3011

Merged: 1 commit, Feb 28, 2017

Conversation

artpol84
Contributor

On a medium-sized KNL cluster, linear scaling with the job size was observed in this part of the code.
This PR reduces the number of lookups from O(n) to O(ppn).
ompi_proc_complete_init() was taking 9 ms at 64 procs and 180 ms at 640 procs. After the fix the delay stayed flat at around 10 ms.

@rhc54 I know you wanted to fix this in some other way, but I believe we want this in v2.x for scalability. I'm going to open the PR there.

@hjelmn @jjhursey please consider this while doing your performance tests.

@jladd-mlnx, FYI.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

ahem - you need to sign this off before committing it. I'm not sure it makes sense to bring this to 2.x given that we are not including the largest time eaters.

artpol84 force-pushed the add_proc_fix/master branch from 9f4464a to 0cb86e7 on February 22, 2017 00:33
@artpol84
Contributor Author

We see this problem with just 40 nodes and 16 ppn. I guess this is a different situation.

@artpol84
Contributor Author

Thanks for the "sign-off" note.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

Let's get everyone together and talk about it at next week's OMPI telecon. We need to decide where we are going to draw the line on v2.1 scalability, or else we are going to be making these decisions one at a time, and get into trouble.

artpol84 mentioned this pull request on Feb 22, 2017
@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

The difference with this PR is that it solves the direct-launch scalability issue as well.

@artpol84
Contributor Author

2.x is not frozen yet, and those are minor, localized changes.
Let's discuss it on the telecon.

artpol84 force-pushed the add_proc_fix/master branch from 0cb86e7 to 0f15785 on February 22, 2017 01:09
@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Is there any reason why we would ever run with ompi_add_procs_cutoff > 0? This doesn't impact whether or not we do an async modex. The only thing this param does is determine whether we add ompi_proc_t structures for every proc in the job prior to communicating with them.

I cannot see any reason why we would ever do that. If we remove that parameter, then this code can be further optimized. We would go ahead and set up structures for all the local procs, but we wouldn't need to do a modex recv on locality for any procs after that point, as we would already know that they are non-local.

There are a few further optimizations we can do here, but getting a better understanding of the role of ompi_add_procs_cutoff is certainly something we need to do.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@artpol84 My concern is that you are chasing milliseconds when we know that there are tens of minutes of time delays built into the v2.1 release. We had said earlier today that we weren't going to worry about scaling in this series, and so we didn't plan on bringing the larger optimizations across. If we want to change that decision, then we should probably put a priority on the bigger gains as well.

I'm not saying this change isn't worth doing - just want to ensure we are accurately setting expectations.

@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

@rhc54
64 procs - 9 ms (~0.1 ms per proc)
640 procs - 180 ms (~0.28 ms per proc)
....
8192 x 64 x 0.1 ms = 52 s
8192 x 64 x 0.28 ms = 146 s

@artpol84
Contributor Author

@hjelmn wasn't hitting this, if I understand correctly, because he was running /bin/true (correct me if I am wrong).

@hjelmn
Member

hjelmn commented Feb 22, 2017

Commented on the v2.x patch. This is a bug that existed before PMIx. I was seeing a launch slowdown with MPI apps but didn't have time to track it down. Will try this with scaling.pl once my KNL system is back this weekend.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Can you please answer my question about ompi_add_procs_cutoff? I think this change isn't quite complete.

@hjelmn
Member

hjelmn commented Feb 22, 2017

@rhc54 I would support making the add_procs cutoff a hidden parameter in 2.1 (like mpi_preconnect_mpi) as no user should set it unless they have an all-to-all type code. For 3.0 we should probably evaluate all the scaling variables and collapse them down into one.

As for this change, I think I see what you are talking about. ompi_proc_complete_init_single does the modex receive on locality. It is probably worth further optimizing to avoid that extra modex call. We should probably discuss what we can do in the context of v2.x vs v3.x.

@artpol84
Contributor Author

For some reason the v2.x PR went through our Jenkins fine, but this one has some problems. I'm looking into it.

@rhc54
Contributor

rhc54 commented Feb 22, 2017

@hjelmn Agreed, except that I'm not sure the cutoff has any impact on the all-to-all code either. You'll still simply create all the ompi proc_t's when needed. Worth taking a look at and pondering a bit before we go over to 2.1.

artpol84 force-pushed the add_proc_fix/master branch from 0f15785 to dcc0959 on February 22, 2017 07:28
@artpol84
Contributor Author

@rhc54 I noticed that the ring_c test was hanging; it was mpirun at the cleanup stage, with the following backtrace:

Thread 12 (Thread 0x7ffff09f3700 (LWP 12957)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x742cd0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x742cd0, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x718f88) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 11 (Thread 0x7ffff01f2700 (LWP 12958)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x715980, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x715980, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x715918) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 10 (Thread 0x7fffef9f1700 (LWP 12959)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x748010, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x748010, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x716378) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 9 (Thread 0x7fffef1f0700 (LWP 12960)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x748860, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x748860, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x7487f8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 8 (Thread 0x7fffee9ef700 (LWP 12961)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x7492c0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x7492c0, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x749258) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x7fffee1ee700 (LWP 12962)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x749e60, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x749e60, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x749df8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7fffed9ed700 (LWP 12963)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x74a990, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x74a990, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x74a928) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7fffed1ec700 (LWP 12964)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x74b540, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x74b540, flags=1) at event.c:1630
#3  0x00007ffff77c9d48 in progress_engine (obj=0x74b4d8) at runtime/opal_progress_threads.c:105
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7fffec9eb700 (LWP 12965)):
#0  0x00007ffff635f8f3 in select () from /usr/lib64/libc.so.6
#1  0x00007ffff126966e in listen_thread (obj=0x7ffff14787b0 <mca_oob_tcp_component+1008>) at oob_tcp_listener.c:701
#2  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fffea876700 (LWP 12966)):
#0  0x00007ffff63687a3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007ffff7820e93 in epoll_dispatch (base=0x7b17e0, tv=<optimized out>) at epoll.c:407
#2  0x00007ffff78248e0 in opal_libevent2022_event_base_loop (base=0x7b17e0, flags=1) at event.c:1630
#3  0x00007fffeaf48d69 in progress_engine (obj=0x7b1798) at runtime/pmix_progress_threads.c:151
#4  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fffea075700 (LWP 12967)):
#0  0x00007ffff635f8f3 in select () from /usr/lib64/libc.so.6
#1  0x00007fffeaf6378d in listen_thread (obj=0x0) at base/ptl_base_listener.c:215
#2  0x00007ffff663adc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x00007ffff63681cd in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7ffff7fb2740 (LWP 12956)):
#0  0x00007ffff663bef7 in pthread_join () from /usr/lib64/libpthread.so.0
#1  0x00007fffeaf48ba9 in pmix_thread_join (t=0x7b1798, thr_return=0x0) at runtime/pmix_progress_threads.c:66
#2  0x00007fffeaf48de2 in stop_progress_engine (trk=0x7b16a0) at runtime/pmix_progress_threads.c:166
#3  0x00007fffeaf49852 in pmix_progress_thread_pause (name=0x7fffeaf77ec0 "PMIX-wide async progress thread") at runtime/pmix_progress_threads.c:335
#4  0x00007fffeaef1652 in OPAL_MCA_PMIX2X_PMIx_server_finalize () at server/pmix_server.c:249
#5  0x00007fffeb1b1f06 in pmix2x_server_finalize () at pmix2x_server_south.c:176
#6  0x00007ffff7b334a6 in pmix_server_finalize () at orted/pmix/pmix_server.c:385
#7  0x00007ffff3d8ddfb in rte_finalize () at ess_hnp_module.c:811
#8  0x00007ffff7aea6e8 in orte_finalize () at runtime/orte_finalize.c:76
#9  0x0000000000401725 in orterun (argc=22, argv=0x7fffffffcc98) at orterun.c:219
#10 0x0000000000401070 in main (argc=22, argv=0x7fffffffcc98) at main.c:13

Out of curiosity - do we really need all of those threads? I wasn't expecting to see so many.

@artpol84
Contributor Author

artpol84 commented Feb 22, 2017

This might be related to #2982
When I tried to reproduce it manually, everything worked fine, so I guess it was a race condition.

artpol84 force-pushed the add_proc_fix/master branch from dcc0959 to 717f3fe on February 22, 2017 09:16
@rhc54
Contributor

rhc54 commented Feb 22, 2017

Interesting - yeah, it probably is a race condition. @hjelmn and I added a bunch of oob progress threads to help with communication delays during launch, but we can probably dial those back if/when the new mapping scheme tests out.

3 participants