ompi: Avoid unnecessary PMIx lookups when adding procs. #3011
Conversation
ahem - you need to sign this off before committing it. I'm not sure it makes sense to bring this to 2.x given that we are not including the largest time eaters.
Force-pushed from 9f4464a to 0cb86e7.
We see this problem with just 40 nodes and 16 ppn. I guess this is a different situation.
Thanks for the "sign-off" note.
Let's get everyone together and talk about it at next week's OMPI telecon. We need to decide where we are going to draw the line on v2.1 scalability, or else we will be making these decisions one at a time and get into trouble.
The difference with this PR is that it also solves the direct-launch scalability issue.
2.x is not frozen yet, and these are minor, localized changes.
Force-pushed from 0cb86e7 to 0f15785.
@hjelmn Is there any reason why we would ever run with the add_procs cutoff set? I cannot see any reason why we would ever do that. If we remove that parameter, then this code can be further optimized. We would go ahead and set up structures for all the local procs, but we wouldn't need to do a modex recv on locality for any procs after that point, as we would already know that they are non-local. There are a few further optimizations we can do here, but getting a better understanding of the role of that cutoff comes first.
@artpol84 My concern is that you are chasing milliseconds when we know that there are tens of minutes of time delays built into the v2.1 release. We had said earlier today that we weren't going to worry about scaling in this series, and so we didn't plan on bringing the larger optimizations across. If we want to change that decision, then we should probably put a priority on the bigger gains as well. I'm not saying this change isn't worth doing - just want to ensure we are accurately setting expectations.
@rhc54
@hjelmn wasn't hitting this, if I understand correctly, because he was running /bin/true (correct me if I am wrong).
Commented on the v2.x patch. This is a bug that existed before PMIx. I was seeing a launch slowdown with MPI apps but didn't have time to track it down. Will try this with scaling.pl once my knl system is back this weekend. |
@hjelmn Can you please answer my question about the cutoff parameter?
@rhc54 I would support making the add_procs cutoff a hidden parameter in 2.1 (like mpi_preconnect_mpi), as no user should set it unless they have an all-to-all type code. For 3.0 we should probably evaluate all the scaling variables and collapse them down into one. As for this change, I think I see what you are talking about.
For some reason the v2.x PR went through our Jenkins fine, but this one has some problems. I'm looking into that.
@hjelmn Agreed, except that I'm not sure the cutoff has any impact on the all-to-all code either. You'll still simply create all the ompi proc_t's when needed. Worth taking a look at and pondering a bit before we go over to 2.1.
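To make the cutoff discussion above a bit more concrete, here is a minimal sketch of how an add_procs-style cutoff is typically applied, assuming hypothetical names (add_procs_cutoff, create_proc_struct, add_procs) rather than the actual OMPI symbols: small jobs set up every proc eagerly at init, while large jobs defer setup until a rank is first used, which is why an all-to-all pattern ends up creating all of them regardless of the cutoff.

```c
/* Hedged sketch of an add_procs-style cutoff; the identifiers here are
 * hypothetical stand-ins, not the actual OMPI symbols. */
#include <stddef.h>
#include <stdio.h>

/* Tuning knob: below this job size, set up every proc eagerly at init;
 * at or above it, defer setup until a rank is first used. */
static size_t add_procs_cutoff = 1024;

static void create_proc_struct(size_t rank)
{
    /* placeholder for allocating/initializing a per-proc structure */
    printf("created proc %zu\n", rank);
}

static void add_procs(size_t job_size)
{
    if (job_size < add_procs_cutoff) {
        /* Small job: eager setup of all procs. */
        for (size_t r = 0; r < job_size; ++r) {
            create_proc_struct(r);
        }
        return;
    }
    /* Large job: nothing to do here; proc structures get created on
     * demand the first time a rank is communicated with.  An all-to-all
     * pattern therefore ends up creating all of them anyway, which is
     * the point raised in the comment above. */
}

int main(void)
{
    add_procs(16);   /* below the cutoff: all 16 procs are set up eagerly */
    return 0;
}
```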
Force-pushed from 0f15785 to dcc0959.
@rhc54 I noticed that the ring_c test was hanging; mpirun was stuck at the cleanup stage with the following backtrace:
Out of curiosity, do we really need all of those threads? I wasn't expecting to see so many.
This might be related to #2982 |
Signed-off-by: Artem Polyakov <[email protected]>
Force-pushed from dcc0959 to 717f3fe.
Interesting - yeah, it probably is a race condition. @hjelmn and I added a bunch of oob progress threads to help with communication delays during launch, but we can probably dial those back if/when the new mapping scheme tests out.
On the medium KNL cluster, linear scaling was observed in this part of the code.
This PR reduces the number of lookups from O(n) to O(ppn).
ompi_proc_complete_init() was taking 9 ms with 64 procs and 180 ms with 640 procs. After the fix the delay stayed at around 10 ms.
@rhc54 I know you wanted to fix this in some other way, but I believe we want this in v2.x for scalability. I'm going to open the PR there.
@hjelmn @jjhursey please consider this while doing your performance tests.
@jladd-mlnx, FYI.
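For reference, here is a minimal sketch of the O(n) to O(ppn) reduction described above, using hypothetical stand-ins (proc_t, lookup_locality(), local_peer_ranks) rather than the real OMPI/PMIx identifiers: instead of issuing a locality lookup for every proc in the job, only the ranks reported as local peers are queried, and every other rank is marked non-local without a lookup.

```c
/* Hedged sketch of the O(n) -> O(ppn) locality-lookup change; the names
 * below are hypothetical, not the actual OMPI/PMIx identifiers. */
#include <stdint.h>

#define LOCALITY_NONLOCAL 0u   /* placeholder locality flags */
#define LOCALITY_LOCAL    1u

typedef struct {
    int      rank;
    uint16_t locality;
} proc_t;

/* Stand-in for one per-rank runtime query (the "lookup" being counted). */
static uint16_t lookup_locality(int rank)
{
    (void)rank;
    return LOCALITY_LOCAL;
}

/* Before: one lookup per proc in the job -> O(n) queries. */
static void complete_init_old(proc_t *procs, int nprocs)
{
    for (int i = 0; i < nprocs; ++i) {
        procs[i].locality = lookup_locality(procs[i].rank);
    }
}

/* After: query only the ranks already known to be on this node (e.g. the
 * local-peer list the runtime provides) -> O(ppn) queries; every other
 * rank is marked non-local without asking. */
static void complete_init_new(proc_t *procs, int nprocs,
                              const int *local_peer_ranks, int nlocal)
{
    for (int i = 0; i < nprocs; ++i) {
        procs[i].locality = LOCALITY_NONLOCAL;
    }
    for (int j = 0; j < nlocal; ++j) {
        int r = local_peer_ranks[j];
        if (0 <= r && r < nprocs) {
            procs[r].locality = lookup_locality(r);
        }
    }
}

int main(void)
{
    enum { N = 8 };
    proc_t procs[N];
    const int local_peers[] = { 0, 1 };   /* 2 ppn in this toy example */

    for (int i = 0; i < N; ++i) { procs[i].rank = i; }
    complete_init_old(procs, N);                   /* 8 lookups */
    complete_init_new(procs, N, local_peers, 2);   /* 2 lookups */
    return 0;
}
```

With a fixed ppn, the number of lookups no longer grows with the job size, which matches the roughly constant ~10 ms measured after the fix.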