
Hanging with more than 64 hosts #4578

Closed
ceandrade opened this issue Dec 6, 2017 · 12 comments

Comments

@ceandrade

Problem

I cannot run jobs in parallel on a cluster with more than 64 machines.


System

Open MPI 3.0.0 (openmpi-3.0.0.tar.bz2, Sep 12, 2017), compiled from the distribution tarball.

Linux version 2.6.32-573.3.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Aug 13 22:55:16 UTC 2015


Details of the problem

I'm trying to use Open MPI on a cluster with 200 machines. Unfortunately, mpirun appears to freeze when I submit a job:

$ mpirun -np 1 -hostfile hosts.txt -pernode hostname

where hosts.txt has 200 machines.

If the number of machines is at most 64, I can get the answer pretty fast:

$ mpirun -np 64 -hostfile hosts_64.txt -pernode hostname
machine1
machine2
....
machine64

This, however, blocks forever:

$ mpirun -np 65 -hostfile hosts_65.txt -pernode hostname
<no response for 10min...>
@rhc54
Contributor

rhc54 commented Dec 6, 2017

If you add "-mca routed_radix 300" to your command line, does it make a difference?
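
For example, applied to the hostfile command from your original report (just a sketch reusing the hosts.txt name from above), that would look like:

$ mpirun -mca routed_radix 300 -np 1 -hostfile hosts.txt -pernode hostname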

@ceandrade
Author

It got better, but it still didn't scale beyond 128 nodes, even with "-mca routed_radix 600". I also tried "routed_debruijn" and "routed_binomial", but without success.
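
For the record, I assume the alternative routed components are selected through the routed MCA framework, along the lines of:

$ mpirun -mca routed binomial -np 1 -hostfile hosts.txt -pernode hostname
$ mpirun -mca routed debruijn -np 1 -hostfile hosts.txt -pernode hostname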

@ggouaillardet
Contributor

@ceandrade can you please confirm that the node on which you run mpirun is not part of your hosts_64.txt hostfile?

@ceandrade
Author

ceandrade commented Dec 7, 2017

@ggouaillardet I actually tried both: launching from a machine outside the cluster and from machines within the cluster. This is the command line:

$ mpirun -v -display-devel-map -display-allocation -mca routed_radix 600 --bind-to none -np 1 -hostfile hosts_266.txt -pernode hostname

but no output. Note that I now have 266 identical machines in my cluster. Is there a way to produce more detailed logging that I can share with you?

@ggouaillardet
Contributor

I can reproduce the issue with 65 tasks and with the default radix, but only when mpirun is invoked from a node not in the hostfile. I will resume investigations tomorrow. Stay tuned!

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Another thing to try: add "-mca plm_rsh_num_concurrent 300" to your cmd line and see if that helps.
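
In other words, something roughly like this (a sketch based on the command you posted above):

$ mpirun -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -np 1 -hostfile hosts_266.txt -pernode hostname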

@rhc54
Contributor

rhc54 commented Dec 7, 2017

Alternatively, you can add "-mca plm_rsh_no_tree_spawn 1" to the command line; the effect in your case will be the same.
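
That is, roughly:

$ mpirun -mca plm_rsh_no_tree_spawn 1 -np 1 -hostfile hosts_266.txt -pernode hostname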

@ceandrade
Author

@rhc54 It worked both with the options "-mca plm_rsh_num_concurrent 300 -mca routed_radix 600" and with the option "-mca plm_rsh_no_tree_spawn 1" alone. The latter is much faster than the former.

I don't understand MPI very well (I just use it as another tool), but apparently the tree-based launch has some problem with more than 128 nodes.

So, this solution looks good enough for me. I leave it up to you whether to keep this ticket open for further investigation.

Thanks to @rhc54 and @ggouaillardet for your help and prompt answers!

@rhc54
Contributor

rhc54 commented Dec 7, 2017

OK, thanks for the report, and sorry for the problem. Since @ggouaillardet can reproduce it, I will defer to him for the eventual solution now that we have isolated the problem to the tree spawn code.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Dec 8, 2017
so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of open-mpi/ompi@347ca41
that should have been back-ported into the v3.0.x branch

Fixes open-mpi#4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Contributor

@ceandrade this issue will be fixed in the upcoming 3.0.1 release.
Meanwhile, you can manually download and apply the patch available at https://github.com/ggouaillardet/ompi/commit/e445d47c3f8ff908d19eb18cbd6d953e1272f7ca.patch
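
Roughly, assuming you built from the 3.0.0 tarball and have an already-configured source tree (the exact steps may differ on your system):

$ wget https://github.com/ggouaillardet/ompi/commit/e445d47c3f8ff908d19eb18cbd6d953e1272f7ca.patch
$ cd openmpi-3.0.0
$ patch -p1 < ../e445d47c3f8ff908d19eb18cbd6d953e1272f7ca.patch
$ make all install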

Thanks for the report!

@ceandrade
Author

Thank you very much @ggouaillardet. I am amazed how fast you guys are!

@rhc54
Contributor

rhc54 commented Dec 12, 2017

Marking this as complete

@rhc54 rhc54 closed this as completed Dec 12, 2017
sam6258 pushed a commit to sam6258/ompi that referenced this issue Jan 24, 2018
so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of open-mpi/ompi@347ca41
that should have been back-ported into the v3.0.x branch

Fixes open-mpi#4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Scott Miller <[email protected]>
sam6258 pushed a commit to sam6258/ompi that referenced this issue Jan 25, 2018
so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of open-mpi/ompi@347ca41
that should have been back-ported into the v3.0.x branch

Fixes open-mpi#4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit e445d47)
sam6258 pushed a commit to sam6258/ompi that referenced this issue Feb 1, 2018
so any tree spawn operation properly gets the number of children underneath us.

This commit is a tiny subset of open-mpi/ompi@347ca41
that should have been back-ported into the v3.0.x branch

Fixes open-mpi#4578

Thanks to Carlos Eduardo de Andrade for reporting.

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit e445d47)