Hanging with more than 64 hosts #4578

Closed
@ceandrade

Problem

I cannot run parallel jobs on a cluster with more than 64 machines.


System

Using Open MPI 3.0.0 (openmpi-3.0.0.tar.bz2, Sep 12, 2017), compiled from the source distribution tarball.

Linux version 2.6.32-573.3.1.el6.x86_64 ([email protected]) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Thu Aug 13 22:55:16 UTC 2015


Details of the problem

I'm trying to use Open MPI on a cluster with 200 machines. Unfortunately, mpirun appears to freeze when I submit jobs:

$ mpirun -np 1 -hostfile hosts.txt -pernode hostname

where hosts.txt has 200 machines.

If the hostfile lists at most 64 machines, the command returns quickly:

$ mpirun -np 64 -hostfile hosts_64.txt -pernode hostname
machine1
machine2
....
machine64

With 65 machines, the same command blocks indefinitely:

$ mpirun -np 65 -hostfile hosts_65.txt -pernode hostname
<no response for 10min...>
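
To confirm where the hang starts without waiting 10 minutes per attempt, the runs above could be wrapped in a time limit and repeated for increasing host counts. The following is only a sketch under assumptions that are not part of the report: GNU coreutils `timeout` is available, `hosts.txt` lists one host per line, and 60 seconds is long enough to call a run hung. The helper names (`make_hostfile`, `probe`, `first_failure`) and the `LAUNCHER` and `LIMIT` variables are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find the smallest host count at which the launch stops returning.
# Assumptions (not from the report): `timeout` from GNU coreutils exists,
# the hostfile has one host per line, and LIMIT seconds means "hung".

LAUNCHER="${LAUNCHER:-mpirun}"   # overridable so the loop can be dry-run
LIMIT="${LIMIT:-60}"             # seconds before a run counts as hung

# Write the first N hosts of the full hostfile to a new hostfile.
make_hostfile() {
    head -n "$1" "$2" > "$3"
}

# Return 0 only if the launch on N hosts succeeds within LIMIT seconds.
probe() {
    local n="$1"
    local src="$2"
    local tmp
    local rc
    tmp="$(mktemp)"
    make_hostfile "$n" "$src" "$tmp"
    # Same flags as in the report: one process per node, N nodes.
    timeout "$LIMIT" "$LAUNCHER" -np "$n" -hostfile "$tmp" -pernode hostname \
        > /dev/null 2>&1
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Print the first host count in [lo, hi] whose launch fails or times out.
first_failure() {
    local n="$1"
    local hi="$2"
    local src="$3"
    while [ "$n" -le "$hi" ]; do
        if ! probe "$n" "$src"; then
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    return 1
}
```

After sourcing the file, `first_failure 60 70 hosts.txt` walks the host counts from 60 to 70. If the cluster behaves as described above, one would expect it to report 65, although that has not been verified here. A linear scan is used instead of a binary search because only a small range around the suspected threshold needs checking.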
