Problem
I cannot run parallel jobs on a cluster with more than 64 machines.
System
Using openmpi-3.0.0.tar.bz2, Sep 12, 2017, compiled from source/distribution tarball.
Linux version 2.6.32-573.3.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Aug 13 22:55:16 UTC 2015
Details of the problem
I'm trying to use Open MPI on a cluster with 200 machines. Unfortunately, mpirun
appears to freeze when I submit a job:
$ mpirun -np 1 -hostfile hosts.txt -pernode hostname
where hosts.txt lists 200 machines.
If the hostfile lists at most 64 machines, the output comes back quickly:
$ mpirun -np 64 -hostfile hosts_64.txt -pernode hostname
machine1
machine2
....
machine64
With 65 machines, the same command hangs indefinitely:
$ mpirun -np 65 -hostfile hosts_65.txt -pernode hostname
<no response for 10min...>
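For anyone trying to reproduce the 64/65 threshold, the two runs above can be scripted. This is only a minimal sketch under some assumptions: hosts.txt has one hostname per line, hosts_64.txt and hosts_65.txt are simply its first 64 and 65 lines, and GNU coreutils `timeout` is available. The helper names `make_hostfile` and `try_hosts` are hypothetical, and the 60-second limit is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: write the first N hosts of hosts.txt to hosts_N.txt.
# Assumes hosts.txt holds one hostname per line.
make_hostfile() {
    n="$1"
    head -n "$n" hosts.txt > "hosts_${n}.txt"
}

# Hypothetical helper: rerun the test from this report against N hosts.
# timeout gives up after 60 seconds (arbitrary), so a hang is reported
# instead of blocking the shell.
try_hosts() {
    n="$1"
    make_hostfile "$n"
    if timeout 60 mpirun -np "$n" -hostfile "hosts_${n}.txt" -pernode hostname > /dev/null; then
        echo "$n hosts: ok"
    else
        echo "$n hosts: failed or timed out"
    fi
}
```

After sourcing the functions, running `try_hosts 64` followed by `try_hosts 65` should show whether the hang starts exactly at 65 hosts on a given cluster.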