-
Notifications
You must be signed in to change notification settings - Fork 900
ess/base: be sure to update the routing tree #4591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ess/base: be sure to update the routing tree #4591
Conversation
so any tree spawn operation properly gets the number of children underneath us. This commit is a tiny subset of open-mpi/ompi@347ca41 that should have been back-ported into the v3.0.x branch Fixes open-mpi#4578 Thanks to Carlos Eduardo de Andrade for reporting. Signed-off-by: Gilles Gouaillardet <[email protected]>
@karasevb @jladd-mlnx can you give this a try and check whether this fixes #4465 too ? |
@hppritcha i will update |
Thanks @ggouaillardet ! |
I'm going to check this fix for #4465 |
The #4465 issue still remains. I have it reproduced with latest OMPI@master and v3.0.x |
@karasevb can you also test v3.1.x as well? |
@karasevb |
@artpol84 v3.1.x has the same hang. The hanging does not reproduce when try to add option
That fixes the hang with master/v3.0.x/v3.1x. And also v3.0.x without this PR works well. |
Only thing I can suggest is adding @ggouaillardet Do you have any luck reproducing it? I can't. |
i will give it a try. |
@ggouaillardet the node running
I have checked the default
That is my case why I get a hang when try to run more than 64 nodes. |
@karasevb what is the exact version (e.g. commit id) you are running ? i used to reproduce this only in can you please post the output of a hang with |
Here is the verbose output during the hang: |
@ggouaillardet I use this commit 0377959 |
@karasevb i observed the same symptom (No children + hang) before this PR was merged. can you please double check you built from the unmodified commit you mentionned and post your |
@ggouaillardet I had rechecked v3.0.x with the latest commit a69a84e Config:
|
@karasevb thanks, can you also please post your |
@ggouaillardet
|
@ggouaillardet |
This is making less and less sense ... When running on 100 nodes under slurm, are you using I am not sure I can spawn 100 vm to test that ... are you able to reproduce the hang with less nodes ? |
I use |
@karasevb Your log clearly shows the problems (there are two), but I'm not sure I understand the cause. First, as @ggouaillardet pointed out, when we read the hostfile we are not finding node-001 in your allocation. Remember, when you are in a managed system, the hostfile acts solely as a filter on the nodes allocated by SLURM. So if node-001 isn't included in your allocation, then we will ignore it. I suspect that is what is happening here. Second, it looks like the OMPI libraries on node-002 aren't correct. With the updated v3.0.1 code, there should have been a call to compute the routing tree when the daemon starts up. You can see the output from that call on your frontend when mpirun starts up:
This output should have been seen on node-002, but it isn't - which makes me believe that the OMPI libraries on that node were not updated. Another problem I see is:
Note that we assigned a daemon to node-003, which isn't in the mfile you provided. So now I am totally confused - where is node-003 coming from? And why is the newly launched orted not updating its routes per the path @ggouaillardet committed? |
I believe I see the problems. First, I believe you gave us the incorrect mfile - it actually includes node-002 and node-003, which is why we first launch to node-002. Second, your opal_prefix is incorrectly set. Look at the ssh line we are trying to execute:
See the
So the path to your installation is botched and you are picking up whatever default OMPI libraries exist on the backend. Someone might look at how that weird string got in there. Meantime, try adding |
@rhc54 Sorry for the confusion, I attached not appropriate hosts file. In the host file actually contains:
|
@rhc54 |
@karasevb No problem - just let us know when you edit the output (e.g., to remove the actual path) so we don't waste time wondering what it means. I can only assume that the library path is incorrect, or that the backend libraries have not been updated. Either way, the output from the backend nodes doesn't match the patch @ggouaillardet committed. |
Me and @karasevb did some more debugging now. I guess we hit the case that disables @ggouaillardet patch. I think that this codepath has not been tested. It starts here: Not having node regex doesn't trigger @ggouaillardet addition here: For debugging purposes it is enough to change |
Odd - I didn't think we had that switch any more since we can express it as a regex. I can take a look since @ggouaillardet is probably offline by now. |
Oh...the light begins to dawn. You have a No wonder nobody can reproduce your problem! |
Yes we do. |
Slurm regex works fine with it btw. |
That's fine - we don't. Will try to take a look at how to get around it, but not exactly at the top of the priority list. First time we've seen someone do that, so it is far from normal practice. Still, not saying it shouldn't be allowed. |
Now that we know the problem, I'm closing the associated issue and filed a more specific replacement here: |
@rhc54 thank you |
so any tree spawn operation properly gets the number of children underneath us.
This commit is a tiny subset of 347ca41
that should have been back-ported into the v3.0.x branch
Fixes #4578
Thanks to Carlos Eduardo de Andrade for reporting.
Signed-off-by: Gilles Gouaillardet [email protected]