
ess/base: be sure to update the routing tree #4591


Merged

Conversation

ggouaillardet
Contributor

so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of 347ca41
that should have been back-ported into the v3.0.x branch

Fixes #4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>

@ggouaillardet
Contributor Author

@karasevb @jladd-mlnx can you give this a try and check whether this fixes #4465 too?

@ggouaillardet
Contributor Author

@hppritcha I will update NEWS once #4583 gets merged (otherwise there will be a conflict to resolve in the very near future ...)

@rhc54
Contributor

rhc54 commented Dec 8, 2017

Thanks @ggouaillardet!

@karasevb
Member

I'm going to check this fix for #4465

@karasevb
Member

The #4465 issue still remains. I have reproduced it with the latest OMPI master and v3.0.x.

@artpol84
Contributor

artpol84 commented Dec 12, 2017

@karasevb can you test v3.1.x as well?

@artpol84
Contributor

@karasevb
Also, please try with oob/tcp. By default, if an IB fabric is present, oob/ud is supposed to be used, so we want to try oob/tcp to see whether this is an OOB issue.
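(For reference, a forced-TCP run would look roughly like the command used later in this thread, just without the radix option; the hostfile name is the one from this thread:)

mpirun -npernode 1 --hostfile mfile -mca plm rsh -mca oob tcp hostname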

@karasevb
Member

@artpol84 v3.1.x has the same hang.
I tried launching with -mca oob tcp, but it has no effect; the hang is still observed.

The hang does not reproduce when I add the option --mca routed_radix 100. I tried to run on 100 nodes:

mpirun -npernode 1 --hostfile mfile --mca routed_radix 100 -mca plm rsh -x LD_LIBRARY_PATH hostname

That avoids the hang on master/v3.0.x/v3.1.x. Also, v3.0.x without this PR works well.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

The only thing I can suggest is adding -mca plm_base_verbose 5 -mca routed_base_verbose 5 and seeing if anything pops out. It bothers me that none of us can seem to reproduce this behavior.

@ggouaillardet Have you had any luck reproducing it? I can't.

@ggouaillardet
Contributor Author

I will give it a try.
@karasevb @artpol84 is the node running mpirun part of the mfile?
Can you always force --mca oob tcp in order to make sure the issue is independent of the oob component?
Last but not least, can you try to reproduce the issue with a minimal number of nodes, for example by forcing a smaller radix with --mca routed_radix 2?

@karasevb
Member

@ggouaillardet the node running mpirun is not included in the mfile.
I found out that the routed_radix value must be greater than or equal to the number of nodes for a successful execution (mfile_2 has two nodes):

$mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp  --mca routed_radix 2 hostname
node-002
node-003

$mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp  --mca routed_radix 1 hostname
<hanging>

I have checked the default routed_radix value:

$./ompi_info -a
		...
        MCA routed radix: parameter "routed_radix" (current value: "64", data source: default, level: 9 dev/all, type: int)
                          Radix to be used for routed radix tree
		...

That explains why I get a hang when I try to run on more than 64 nodes.
@ggouaillardet Is this the expected behavior?
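(For intuition only: a toy model of a radix-k routing tree, not ORTE's actual routed/radix code. The child rule used here, children(i) = i*k+1 .. i*k+k, is an assumption for illustration; it shows why a radix at least as large as the daemon count makes every daemon a direct child of mpirun, which is consistent with the observation above that the hang disappears once routed_radix is at least the number of nodes.)

/* toy_radix.c - illustrative sketch only, not Open MPI source */
#include <stdio.h>

/* print the children of daemon `me` in a radix-k tree of `n` daemons,
 * using the simple rule children(me) = me*k+1 .. me*k+k (capped at n) */
static void print_children(int me, int k, int n)
{
    printf("radix %d, daemon %d children:", k, me);
    for (int c = me * k + 1; c <= me * k + k && c < n; c++) {
        printf(" %d", c);
    }
    printf("\n");
}

int main(void)
{
    /* 3 daemons total: mpirun plus two orteds, as with mfile_2 */
    print_children(0, 1, 3);   /* radix 1: 0 -> 1 -> 2, a chain of relays  */
    print_children(0, 2, 3);   /* radix 2: 0 parents both daemons directly */
    return 0;
}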

@ggouaillardet
Contributor Author

@karasevb what is the exact version (e.g. commit id) you are running? I was able to reproduce this only on v3.0.x before this PR was merged.

Can you please post the output of a hang with -mca plm_base_verbose 10?

@karasevb
Member

Here is the verbose output during the hang:
https://github.com/karasevb/test/blob/master/mpirun_rsh.log

@karasevb
Member

@ggouaillardet I am using commit 0377959

@ggouaillardet
Contributor Author

@karasevb I observed the same symptom (No children + hang) before this PR was merged.
I rebuilt from the same commit and am unable to reproduce the issue.

Can you please double-check that you built from the unmodified commit you mentioned, and post your configure command line?

@karasevb
Member

@ggouaillardet I rechecked v3.0.x with the latest commit, a69a84e.
Here is the updated output with -mca plm_base_verbose 10: https://github.com/karasevb/test/blob/master/mpirun_rsh_2.log

Config:

Configure command line: '--prefix=<path to work dir>/install/ompi_v3.0.x' '--enable-debug' '--without-ucx' '--enable-orterun-prefix-by-default'

@ggouaillardet
Contributor Author

@karasevb thanks, can you also please post your mfile_2?

@karasevb
Member

@ggouaillardet mfile_2 is pretty simple:

node-001
node-002

@karasevb
Member

@ggouaillardet
I am running mpirun inside a Slurm allocation, and I noticed something interesting: with a Slurm allocation of 100 nodes, the hang is reproduced, but when I reduce the Slurm allocation to 4 nodes, I see the hang only before this PR was merged (as you pointed out above).
Perhaps this is an issue with our system.
Could you try to reproduce this?

@ggouaillardet
Contributor Author

This is making less and less sense ...
The front-end sshes to node-002, but according to the hostfile it should really ssh to node-001.
Are you running this from salloc?

When running on 100 nodes under Slurm, are you using sbatch (and hence mpirun runs on node-xxx) or salloc (and hence mpirun runs on the front-end)?

I am not sure I can spawn 100 VMs to test that ... are you able to reproduce the hang with fewer nodes?

@karasevb
Member

I use salloc, and mpirun is running on the front-end.
I reproduced the hang on two different clusters. The node-count threshold for reproducing it was different on each cluster: 59 nodes on one and 72 on the other.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

@karasevb Your log clearly shows the problems (there are two), but I'm not sure I understand the cause.

First, as @ggouaillardet pointed out, when we read the hostfile we are not finding node-001 in your allocation. Remember, when you are in a managed system, the hostfile acts solely as a filter on the nodes allocated by SLURM. So if node-001 isn't included in your allocation, then we will ignore it. I suspect that is what is happening here.
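(A hypothetical illustration of that filtering: if the SLURM allocation contains node-002 and node-003 but the mfile lists node-001 and node-002, only node-002 passes the filter; node-001 is silently dropped and never gets a daemon.)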

Second, it looks like the OMPI libraries on node-002 aren't correct. With the updated v3.0.1 code, there should have been a call to compute the routing tree when the daemon starts up. You can see the output from that call on your frontend when mpirun starts up:

[frontend:03834] [[57145,0],0]: parent -1 num_children 1
[frontend:03834] [[57145,0],0]:    child 1
[frontend:03834] [[57145,0],0]:            relation 2
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial 0 found child 1
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 0
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 1
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children returning found value 0
[frontend:03834] [[57145,0],0] routed:binomial 0 found child 2
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 0
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 1
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 1 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 1
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 3
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 2
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 2 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children returning found value 0
[frontend:03834] [[57145,0],0]: parent 0 num_children 2
[frontend:03834] [[57145,0],0]:    child 1
[frontend:03834] [[57145,0],0]:    child 2
[frontend:03834] [[57145,0],0] routed:direct: update routing plan

This output should have been seen on node-002, but it isn't - which makes me believe that the OMPI libraries on that node were not updated.

Another problem I see is:

[frontend:03834] [[57145,0],0] plm:base:setup_vm assigning new daemon [[57145,0],1] to node node-002
[frontend:03834] [[57145,0],0] plm:base:setup_vm add new daemon [[57145,0],2]
[frontend:03834] [[57145,0],0] plm:base:setup_vm assigning new daemon [[57145,0],2] to node node-003

Note that we assigned a daemon to node-003, which isn't in the mfile you provided. So now I am totally confused - where is node-003 coming from? And why is the newly launched orted not updating its routes per the patch @ggouaillardet committed?

@rhc54
Contributor

rhc54 commented Dec 13, 2017

I believe I see the problems. First, I believe you gave us the incorrect mfile - it actually includes node-002 and node-003, which is why we first launch to node-002.

Second, your opal_prefix is incorrectly set. Look at the ssh line we are trying to execute:

[frontend:03834] [[57145,0],0] plm:rsh: final template argv:

        /bin/ssh <template>     PATH=<path to workdir>/install/ompi_v3.0.x/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   <path to workdir>/install/ompi_v3.0.x/bin/orted -mca ess "env" -mca ess_base_jobid "3745054720" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_hnp_uri "3745054720.0;tcp://10.131.5.129,10.131.200.231,172.17.10.240:37506" -mca oob "tcp" --mca routed_radix "1" -mca plm_base_verbose "10" -mca routed_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"

See the <path to workdir> argument? That should have been the actual path, not some odd string. You can see in the actual ssh command that nothing is substituted for it:

[frontend:03834] [[57145,0],0] plm:rsh: executing: (/bin/ssh) [/bin/ssh node-002     PATH=<path to workdir>/install/ompi_v3.0.x/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   <path to workdir>/install/ompi_v3.0.x/bin/orted -mca ess "env" -mca ess_base_jobid "3745054720" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_hnp_uri "3745054720.0;tcp://10.131.5.129,10.131.200.231,172.17.10.240:37506" -mca oob "tcp" --mca routed_radix "1" -mca plm_base_verbose "10" -mca routed_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"]

So the path to your installation is botched and you are picking up whatever default OMPI libraries exist on the backend.

Someone might look at how that weird string got in there. Meantime, try adding --enable-orterun-prefix-by-default to your configure line, drop the -x from your mpirun cmd line, and I expect things will run just fine.
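(For reference, the suggested combination would look roughly like this, with the paths anonymized the same way as in the log above and the hostfile name taken from this thread; note there is no -x LD_LIBRARY_PATH:)

./configure --prefix=<path to workdir>/install/ompi_v3.0.x --enable-debug --without-ucx --enable-orterun-prefix-by-default
mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp hostname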

@karasevb
Member

@rhc54 Sorry for the confusion, I attached the wrong hosts file. The host file actually contains:

node-002
node-003

@artpol84
Contributor

@rhc54
Regarding <path to workdir> - that is an anonymization of the log; there is a real path there.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

@karasevb No problem - just let us know when you edit the output (e.g., to remove the actual path) so we don't waste time wondering what it means.

I can only assume that the library path is incorrect, or that the backend libraries have not been updated. Either way, the output from the backend nodes doesn't match the patch @ggouaillardet committed.

@artpol84
Contributor

artpol84 commented Dec 13, 2017

@karasevb and I did some more debugging. I believe we hit a case that bypasses @ggouaillardet's patch; I think this code path has not been tested.

It starts here:
https://github.com/open-mpi/ompi/blob/v3.0.x/orte/mca/plm/base/plm_base_launch_support.c#L1559
In our case we have a large number of nodes, so the node list is not appended to the ssh cmdline (there is no -mca orte_node_regex <..> argument); this can be seen in the log @karasevb posted above.
However, if the number of nodes is small, this argument is appended and the list of nodes is available.

Not having the node regex means @ggouaillardet's addition here is never triggered:
https://github.com/open-mpi/ompi/pull/4591/files#diff-90f3ebd36767e677e669d28e2977b492L530

For debugging purposes it is enough to change ORTE_MAX_REGEX_CMD_LENGTH in https://github.com/open-mpi/ompi/blob/v3.0.x/orte/mca/plm/base/plm_base_launch_support.c#L1559 to a small value, and you will be able to trigger the same code path.
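(A standalone sketch of the gate described above, not the Open MPI source; the constant's value and the helper name are assumed purely for illustration, and the real check lives at the plm_base_launch_support.c link above:)

/* regex_gate.c - illustration of a length-gated node regex, not OMPI code */
#include <stdio.h>
#include <string.h>

#define ORTE_MAX_REGEX_CMD_LENGTH 1024   /* value assumed for this sketch */

/* decide whether the compressed node list is passed to the remote orted */
static void add_node_regex(const char *regex)
{
    if (strlen(regex) < ORTE_MAX_REGEX_CMD_LENGTH) {
        printf("appending: -mca orte_node_regex \"%s\"\n", regex);
    } else {
        printf("regex too long (%zu chars): daemons start without the node list,\n"
               "so the routing-tree update from this PR is never exercised\n",
               strlen(regex));
    }
}

int main(void)
{
    char big[2048];

    add_node_regex("node-002,node-003");   /* small allocation: regex fits */

    memset(big, 'x', sizeof(big) - 1);     /* stand-in for a 100-node list */
    big[sizeof(big) - 1] = '\0';
    add_node_regex(big);                   /* large allocation: dropped    */
    return 0;
}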

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Odd - I didn't think we had that switch any more since we can express it as a regex. I can take a look since @ggouaillardet is probably offline by now.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Oh...the light begins to dawn. You have a - in your node name! It is confusing the regex processor, since we think that denotes a range.

No wonder nobody can reproduce your problem!
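(To illustrate the failure mode, leaving the exact ORTE regex syntax aside: a host list such as node-002,node-003 is compressed into a prefix plus a numeric range, and the literal '-' that belongs to the prefix is the same character the compressed form uses to write ranges like 2-3, so the parser on the receiving side can split the names in the wrong place and reconstruct the wrong hosts.)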

@artpol84
Contributor

Yes we do.

@artpol84
Contributor

Slurm's regex works fine with it, btw.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

That's fine - ours doesn't. I will try to take a look at how to get around it, but it's not exactly at the top of the priority list. This is the first time we've seen someone do that, so it is far from normal practice. Still, I'm not saying it shouldn't be allowed.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Now that we know the problem, I'm closing the associated issue and have filed a more specific replacement here:

#4621

@artpol84
Contributor

@rhc54 thank you
