Hanging with more than 64 hosts #4578
Comments
If you add "-mca routed_radix 300" to your command line, does it make a difference?
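For example, a minimal sketch of where the option goes on the command line (the hostfile name and the hostname test program here are placeholders, not the reporter's exact job):

# illustrative only: substitute your own hostfile and application
$ mpirun -mca routed_radix 300 -hostfile hosts.txt -pernode hostname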
It got better, but didn't scale to more than 128 nodes, even with "-mca routed_radix 600". I also tried "routed_debruijn" and "routed_binomial", but with no success.
@ceandrade can you please confirm the node on which you run mpirun?
@ggouaillardet I tried both, actually. I launched from a machine outside the cluster and from machines within the cluster. This is the line:
$ mpirun -v -display-devel-map -display-allocation -mca routed_radix 600 --bind-to none -np 1 -hostfile hosts_266.txt -pernode hostname
but no answer. Note that I have 266 identical machines working in my cluster now. Is there a way to create some more detailed logging info that I can share with you guys?
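For more detailed launch-time logging, one option is to raise the launcher framework's MCA verbosity; treating the exact level as an assumption, something along these lines:

# assumption: plm_base_verbose controls launcher (rsh/tree-spawn) verbosity; 10 is just a high level
$ mpirun -mca plm_base_verbose 10 -mca routed_radix 600 --bind-to none -np 1 -hostfile hosts_266.txt -pernode hostname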
I can reproduce the issue with 65 tasks and with the default radix, but only when …
Another thing to try: add "-mca plm_rsh_num_concurrent 300" to your command line and see if that helps.
Alternatively, you can add "-mca plm_rsh_no_tree_spawn 1" to the command line; the effect in your case will be the same.
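Concretely, the two suggestions above would look like this on the command line (a sketch; the hostfile name and the hostname test program stand in for the real job):

# option 1: raise the number of concurrent rsh/ssh daemon launches
$ mpirun -mca plm_rsh_num_concurrent 300 -hostfile hosts.txt -pernode hostname
# option 2: disable the tree-based launch entirely
$ mpirun -mca plm_rsh_no_tree_spawn 1 -hostfile hosts.txt -pernode hostname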
@rhc54 It worked both with the options "-mca plm_rsh_num_concurrent 300 -mca routed_radix 600" and with the option "-mca plm_rsh_no_tree_spawn 1" alone. The latter is much faster than the former. I don't understand MPI very well (I just use it as another tool), but apparently the tree-based launch has some problem with more than 128 nodes. This solution looks to be enough for me, so I leave it up to you guys to decide whether to keep this ticket open for further investigation. Thanks to @rhc54 and @ggouaillardet for your help and prompt answers!
ok, thanks for the report, and sorry for the problem. Since @ggouaillardet can reproduce it, I will defer to him for the eventual solution now that we have isolated the problem to the tree spawn code.
so any tree spawn operation properly gets the number of children underneath us.
This commit is a tiny subset of open-mpi/ompi@347ca41 that should have been back-ported into the v3.0.x branch.
Fixes open-mpi#4578
Thanks to Carlos Eduardo de Andrade for reporting.
Signed-off-by: Gilles Gouaillardet <[email protected]>
@ceandrade this issue will be fixed in an upcoming release. Thanks for the report!
Thank you very much @ggouaillardet. I am amazed at how fast you guys are!
Marking this as complete |
so any tree spawn operation properly gets the number of children underneath us.
This commit is a tiny subset of open-mpi/ompi@347ca41 that should have been back-ported into the v3.0.x branch.
Fixes open-mpi#4578
Thanks to Carlos Eduardo de Andrade for reporting.
Signed-off-by: Gilles Gouaillardet <[email protected]>
Signed-off-by: Scott Miller <[email protected]>
so any tree spawn operation properly gets the number of children underneath us.
This commit is a tiny subset of open-mpi/ompi@347ca41 that should have been back-ported into the v3.0.x branch.
Fixes open-mpi#4578
Thanks to Carlos Eduardo de Andrade for reporting.
Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit e445d47)
Problem
I cannot run jobs in parallel on a cluster with more than 64 machines.
System
Open MPI 3.0.0 (openmpi-3.0.0.tar.bz2, Sep 12, 2017), compiled from the source/distribution tarball.
Linux version 2.6.32-573.3.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Aug 13 22:55:16 UTC 2015
Details of the problem
I'm trying to use Open MPI on a cluster with 200 machines. Unfortunately, mpirun looks like it is frozen when I submit the jobs, where hosts.txt has 200 machines. If the number of machines is at most 64, I can get the answer pretty fast; with the full 200-machine host list, the run blocks forever.
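For reference, the hanging invocation has roughly this shape (a sketch reconstructed from the command quoted earlier in the thread, not the verbatim command from the original report; hosts.txt stands in for the 200-machine hostfile):

# illustrative reconstruction: one hostname process per node listed in the hostfile
$ mpirun --bind-to none -np 1 -hostfile hosts.txt -pernode hostname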