
ess/base: be sure to update the routing tree #4591


Merged

Conversation

ggouaillardet
Contributor

so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of 347ca41
that should have been back-ported into the v3.0.x branch

Fixes #4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>

@ggouaillardet
Contributor Author

@karasevb @jladd-mlnx can you give this a try and check whether this fixes #4465 too?

@ggouaillardet
Contributor Author

@hppritcha I will update NEWS once #4583 gets merged (otherwise there will be a conflict to resolve in the very near future ...)

@rhc54
Contributor

rhc54 commented Dec 8, 2017

Thanks @ggouaillardet!

@karasevb
Member

I'm going to check this fix for #4465

@karasevb
Member

The #4465 issue still remains. I have reproduced it with the latest OMPI master and v3.0.x.

@artpol84
Contributor

artpol84 commented Dec 12, 2017

@karasevb can you test v3.1.x as well?

@artpol84
Contributor

@karasevb
Also, please try with oob/tcp. By default, if an IB fabric is present, oob/ud is supposed to be used, so we want to try oob/tcp to see whether this is an OOB issue.
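(For reference, a forced-TCP run would look roughly like the command used later in this thread, just without the radix option; the hostfile name is the one from this thread:)

mpirun -npernode 1 --hostfile mfile -mca plm rsh -mca oob tcp hostname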

@karasevb
Member

@artpol84 v3.1.x has the same hang.
I tried launching with -mca oob tcp, but it has no effect; the hang is still observed.

The hang does not reproduce when I add the option --mca routed_radix 100. I tried to run on 100 nodes:

mpirun -npernode 1 --hostfile mfile --mca routed_radix 100 -mca plm rsh -x LD_LIBRARY_PATH hostname

That avoids the hang on master/v3.0.x/v3.1.x. Also, v3.0.x without this PR works well.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

The only thing I can suggest is adding -mca plm_base_verbose 5 -mca routed_base_verbose 5 and seeing if anything pops out. It bothers me that none of us can seem to reproduce this behavior.

@ggouaillardet Have you had any luck reproducing it? I can't.

@ggouaillardet
Contributor Author

I will give it a try.
@karasevb @artpol84 is the node running mpirun part of the mfile?
Can you always force --mca oob tcp in order to make sure the issue is independent of the oob component?
Last but not least, can you try to reproduce the issue with a minimal number of nodes, for example by forcing a smaller radix with --mca routed_radix 2?

@karasevb
Member

@ggouaillardet the node running mpirun is not included in the mfile.
I found out that the routed_radix value must be greater than or equal to the number of nodes for a successful execution (mfile_2 has two nodes):

$mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp  --mca routed_radix 2 hostname
node-002
node-003

$mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp  --mca routed_radix 1 hostname
<hanging>

I have checked the default routed_radix value:

$./ompi_info -a
		...
        MCA routed radix: parameter "routed_radix" (current value: "64", data source: default, level: 9 dev/all, type: int)
                          Radix to be used for routed radix tree
		...

That explains why I get a hang when I try to run on more than 64 nodes.
@ggouaillardet Is this the expected behavior?
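(For intuition only: a toy model of a radix-k routing tree, not ORTE's actual routed/radix code. The child rule used here, children(i) = i*k+1 .. i*k+k, is an assumption for illustration; it shows why a radix at least as large as the daemon count makes every daemon a direct child of mpirun, which is consistent with the observation above that the hang disappears once routed_radix is at least the number of nodes.)

/* toy_radix.c - illustrative sketch only, not Open MPI source */
#include <stdio.h>

/* print the children of daemon `me` in a radix-k tree of `n` daemons,
 * using the simple rule children(me) = me*k+1 .. me*k+k (capped at n) */
static void print_children(int me, int k, int n)
{
    printf("radix %d, daemon %d children:", k, me);
    for (int c = me * k + 1; c <= me * k + k && c < n; c++) {
        printf(" %d", c);
    }
    printf("\n");
}

int main(void)
{
    /* 3 daemons total: mpirun plus two orteds, as with mfile_2 */
    print_children(0, 1, 3);   /* radix 1: 0 -> 1 -> 2, a chain of relays  */
    print_children(0, 2, 3);   /* radix 2: 0 parents both daemons directly */
    return 0;
}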

@ggouaillardet
Contributor Author

@karasevb what is the exact version (e.g. commit id) you are running? I was able to reproduce this only on v3.0.x before this PR was merged.

Can you please post the output of a hang with -mca plm_base_verbose 10?

@karasevb
Member

Here is the verbose output during the hang:
https://github.com/karasevb/test/blob/master/mpirun_rsh.log

@karasevb
Member

@ggouaillardet I am using commit 0377959

@ggouaillardet
Contributor Author

@karasevb I observed the same symptom (No children + hang) before this PR was merged.
I rebuilt from the same commit and am unable to reproduce the issue.

Can you please double-check that you built from the unmodified commit you mentioned, and post your configure command line?

@karasevb
Member

@ggouaillardet I rechecked v3.0.x with the latest commit, a69a84e.
Here is the updated output with -mca plm_base_verbose 10: https://github.com/karasevb/test/blob/master/mpirun_rsh_2.log

Config:

Configure command line: '--prefix=<path to work dir>/install/ompi_v3.0.x' '--enable-debug' '--without-ucx' '--enable-orterun-prefix-by-default'

@ggouaillardet
Contributor Author

@karasevb thanks, can you also please post your mfile_2?

@karasevb
Member

@ggouaillardet mfile_2 is pretty simple:

node-001
node-002

@karasevb
Member

@ggouaillardet
I am running mpirun inside a Slurm allocation, and I noticed something interesting: with a Slurm allocation of 100 nodes, the hang is reproduced, but when I reduce the Slurm allocation to 4 nodes, I see the hang only before this PR was merged (as you pointed out above).
Perhaps this is an issue with our system.
Could you try to reproduce this?

@ggouaillardet
Contributor Author

This is making less and less sense ...
The front-end sshes to node-002, but according to the hostfile it should really ssh to node-001.
Are you running this from salloc?

When running on 100 nodes under Slurm, are you using sbatch (and hence mpirun runs on node-xxx) or salloc (and hence mpirun runs on the front-end)?

I am not sure I can spawn 100 VMs to test that ... are you able to reproduce the hang with fewer nodes?

@karasevb
Member

I use salloc, and mpirun is running on the front-end.
I reproduced the hang on two different clusters. The node-count threshold for reproducing it was different on each cluster: 59 nodes on one and 72 on the other.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

@karasevb Your log clearly shows the problems (there are two), but I'm not sure I understand the cause.

First, as @ggouaillardet pointed out, when we read the hostfile we are not finding node-001 in your allocation. Remember, when you are in a managed system, the hostfile acts solely as a filter on the nodes allocated by SLURM. So if node-001 isn't included in your allocation, then we will ignore it. I suspect that is what is happening here.
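(A hypothetical illustration of that filtering: if the SLURM allocation contains node-002 and node-003 but the mfile lists node-001 and node-002, only node-002 passes the filter; node-001 is silently dropped and never gets a daemon.)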

Second, it looks like the OMPI libraries on node-002 aren't correct. With the updated v3.0.1 code, there should have been a call to compute the routing tree when the daemon starts up. You can see the output from that call on your frontend when mpirun starts up:

[frontend:03834] [[57145,0],0]: parent -1 num_children 1
[frontend:03834] [[57145,0],0]:    child 1
[frontend:03834] [[57145,0],0]:            relation 2
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial 0 found child 1
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 0
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 1
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children returning found value 0
[frontend:03834] [[57145,0],0] routed:binomial 0 found child 2
[frontend:03834] [[57145,0],0] routed:binomial rank 0 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 0
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 1
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 1 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children of rank 1
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 3
[frontend:03834] [[57145,0],0] routed:binomial find children checking peer 2
[frontend:03834] [[57145,0],0] routed:binomial find children computing tree
[frontend:03834] [[57145,0],0] routed:binomial rank 2 parent 0 me 2 num_procs 3
[frontend:03834] [[57145,0],0] routed:binomial find children returning found value 0
[frontend:03834] [[57145,0],0]: parent 0 num_children 2
[frontend:03834] [[57145,0],0]:    child 1
[frontend:03834] [[57145,0],0]:    child 2
[frontend:03834] [[57145,0],0] routed:direct: update routing plan

This output should have been seen on node-002, but it isn't - which makes me believe that the OMPI libraries on that node were not updated.

Another problem I see is:

[frontend:03834] [[57145,0],0] plm:base:setup_vm assigning new daemon [[57145,0],1] to node node-002
[frontend:03834] [[57145,0],0] plm:base:setup_vm add new daemon [[57145,0],2]
[frontend:03834] [[57145,0],0] plm:base:setup_vm assigning new daemon [[57145,0],2] to node node-003

Note that we assigned a daemon to node-003, which isn't in the mfile you provided. So now I am totally confused - where is node-003 coming from? And why is the newly launched orted not updating its routes per the patch @ggouaillardet committed?

@rhc54
Contributor

rhc54 commented Dec 13, 2017

I believe I see the problems. First, I believe you gave us the incorrect mfile - it actually includes node-002 and node-003, which is why we first launch to node-002.

Second, your opal_prefix is incorrectly set. Look at the ssh line we are trying to execute:

[frontend:03834] [[57145,0],0] plm:rsh: final template argv:

        /bin/ssh <template>     PATH=<path to workdir>/install/ompi_v3.0.x/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   <path to workdir>/install/ompi_v3.0.x/bin/orted -mca ess "env" -mca ess_base_jobid "3745054720" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "3" -mca orte_hnp_uri "3745054720.0;tcp://10.131.5.129,10.131.200.231,172.17.10.240:37506" -mca oob "tcp" --mca routed_radix "1" -mca plm_base_verbose "10" -mca routed_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"

See the <path to workdir> argument? That should have been the actual path, not some odd string. You can see in the actual ssh command that nothing is substituted for it:

[frontend:03834] [[57145,0],0] plm:rsh: executing: (/bin/ssh) [/bin/ssh node-002     PATH=<path to workdir>/install/ompi_v3.0.x/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=<path to workdir>/install/ompi_v3.0.x/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   <path to workdir>/install/ompi_v3.0.x/bin/orted -mca ess "env" -mca ess_base_jobid "3745054720" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_hnp_uri "3745054720.0;tcp://10.131.5.129,10.131.200.231,172.17.10.240:37506" -mca oob "tcp" --mca routed_radix "1" -mca plm_base_verbose "10" -mca routed_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"]

So the path to your installation is botched and you are picking up whatever default OMPI libraries exist on the backend.

Someone might look at how that weird string got in there. Meantime, try adding --enable-orterun-prefix-by-default to your configure line, drop the -x from your mpirun cmd line, and I expect things will run just fine.
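(For reference, the suggested combination would look roughly like this, with the paths anonymized the same way as in the log above and the hostfile name taken from this thread; note there is no -x LD_LIBRARY_PATH:)

./configure --prefix=<path to workdir>/install/ompi_v3.0.x --enable-debug --without-ucx --enable-orterun-prefix-by-default
mpirun -npernode 1 -hostfile mfile_2 -mca plm rsh -mca oob tcp hostname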

@karasevb
Member

@rhc54 Sorry for the confusion, I attached the wrong hosts file. The host file actually contains:

node-002
node-003

@artpol84
Contributor

@rhc54
Regarding <path to workdir> - that is an anonymization of the log; there is a real path there.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

@karasevb No problem - just let us know when you edit the output (e.g., to remove the actual path) so we don't waste time wondering what it means.

I can only assume that the library path is incorrect, or that the backend libraries have not been updated. Either way, the output from the backend nodes doesn't match the patch @ggouaillardet committed.

@artpol84
Contributor

artpol84 commented Dec 13, 2017

@karasevb and I did some more debugging. I believe we hit a case that bypasses @ggouaillardet's patch; I think this code path has not been tested.

It starts here:
https://github.com/open-mpi/ompi/blob/v3.0.x/orte/mca/plm/base/plm_base_launch_support.c#L1559
In our case we have a large number of nodes, so the node list is not appended to the ssh cmdline (there is no -mca orte_node_regex <..> argument); this can be seen in the log @karasevb posted above.
However, if the number of nodes is small, this argument is appended and the list of nodes is available.

Not having the node regex means @ggouaillardet's addition here is never triggered:
https://github.com/open-mpi/ompi/pull/4591/files#diff-90f3ebd36767e677e669d28e2977b492L530

For debugging purposes it is enough to change ORTE_MAX_REGEX_CMD_LENGTH in https://github.com/open-mpi/ompi/blob/v3.0.x/orte/mca/plm/base/plm_base_launch_support.c#L1559 to a small value, and you will be able to trigger the same code path.
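(A standalone sketch of the gate described above, not the Open MPI source; the constant's value and the helper name are assumed purely for illustration, and the real check lives at the plm_base_launch_support.c link above:)

/* regex_gate.c - illustration of a length-gated node regex, not OMPI code */
#include <stdio.h>
#include <string.h>

#define ORTE_MAX_REGEX_CMD_LENGTH 1024   /* value assumed for this sketch */

/* decide whether the compressed node list is passed to the remote orted */
static void add_node_regex(const char *regex)
{
    if (strlen(regex) < ORTE_MAX_REGEX_CMD_LENGTH) {
        printf("appending: -mca orte_node_regex \"%s\"\n", regex);
    } else {
        printf("regex too long (%zu chars): daemons start without the node list,\n"
               "so the routing-tree update from this PR is never exercised\n",
               strlen(regex));
    }
}

int main(void)
{
    char big[2048];

    add_node_regex("node-002,node-003");   /* small allocation: regex fits */

    memset(big, 'x', sizeof(big) - 1);     /* stand-in for a 100-node list */
    big[sizeof(big) - 1] = '\0';
    add_node_regex(big);                   /* large allocation: dropped    */
    return 0;
}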

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Odd - I didn't think we had that switch any more since we can express it as a regex. I can take a look since @ggouaillardet is probably offline by now.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Oh...the light begins to dawn. You have a - in your node name! It is confusing the regex processor, since we think that denotes a range.

No wonder nobody can reproduce your problem!
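(To illustrate the failure mode, leaving the exact ORTE regex syntax aside: a host list such as node-002,node-003 is compressed into a prefix plus a numeric range, and the literal '-' that belongs to the prefix is the same character the compressed form uses to write ranges like 2-3, so the parser on the receiving side can split the names in the wrong place and reconstruct the wrong hosts.)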

@artpol84
Contributor

Yes we do.

@artpol84
Contributor

Slurm's regex works fine with it, btw.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

That's fine - ours doesn't. I will try to take a look at how to get around it, but it's not exactly at the top of the priority list. This is the first time we've seen someone do that, so it is far from normal practice. Still, I'm not saying it shouldn't be allowed.

@rhc54
Contributor

rhc54 commented Dec 13, 2017

Now that we know the problem, I'm closing the associated issue and have filed a more specific replacement here:

#4621

@artpol84
Contributor

@rhc54 thank you
