RSH launcher hangs on more than 65 nodes #4465

@jladd-mlnx

Description

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x
master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

Please describe the system on which you are running

  • Operating system/version:
    RedHat 7.3

  • Computer hardware:
    Intel dual socket Broadwell

  • Network type:
    IB


Details of the problem

With Open MPI 3.0.x and master, the following command hangs when the hostfile contains 66 or more nodes (NOTE: this was originally reported to us by a customer; on their system, the hang occurs at 65 nodes or more):

$ mpirun --npernode 1 --hostfile mfile -mca plm rsh --debug-daemons -x LD_LIBRARY_PATH hostname

Debug output looks as follows:

Daemon [[50860,0],46] checking in as pid 24964 on host clx-hercules-096
[clx-hercules-096:24964] [[50860,0],46] orted: up and running - waiting for commands!
[clx-hercules-096:24964] [[50860,0],46] orted_cmd: received tree_spawn
Daemon [[50860,0],6] checking in as pid 4077 on host clx-hercules-008
[clx-hercules-008:04077] [[50860,0],6] orted: up and running - waiting for commands!
[clx-hercules-008:04077] [[50860,0],6] orted_cmd: received tree_spawn
Daemon [[50860,0],41] checking in as pid 30550 on host clx-hercules-060
[clx-hercules-060:30550] [[50860,0],41] orted: up and running - waiting for commands!
[clx-hercules-060:30550] [[50860,0],41] orted_cmd: received tree_spawn
Daemon [[50860,0],22] checking in as pid 11025 on host clx-hercules-025
[clx-hercules-025:11025] [[50860,0],22] orted: up and running - waiting for commands!
[clx-hercules-025:11025] [[50860,0],22] orted_cmd: received tree_spawn
Daemon [[50860,0],30] checking in as pid 8031 on host clx-hercules-034
[clx-hercules-034:08031] [[50860,0],30] orted: up and running - waiting for commands!
[clx-hercules-034:08031] [[50860,0],30] orted_cmd: received tree_spawn
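
Since the threshold differs between systems (66 nodes here, 65 on the customer's system), it may help to find the exact node count at which the hang starts on a given cluster. The following is a rough sketch, not something we ran as part of this report. The helper names `first_n_hosts` and `find_hang_threshold` are hypothetical, and it assumes a hostfile with one host per line and GNU coreutils `timeout`:

```shell
# Hypothetical helper: copy the first N lines (hosts) of a hostfile.
first_n_hosts() {
    head -n "$1" "$2" > "$3"
}

# Hypothetical helper: print the smallest node count in [lo, hi] for which
# the launch does not finish within `limit` seconds (default 60).
find_hang_threshold() {
    hostfile=$1 lo=$2 hi=$3 limit=${4:-60}
    sub=$(mktemp)
    n=$lo
    while [ "$n" -le "$hi" ]; do
        first_n_hosts "$n" "$hostfile" "$sub"
        timeout "$limit" mpirun --npernode 1 --hostfile "$sub" \
            -mca plm rsh -x LD_LIBRARY_PATH hostname >/dev/null 2>&1
        # GNU timeout exits with status 124 when the time limit is hit.
        if [ "$?" -eq 124 ]; then
            rm -f "$sub"
            echo "$n"
            return 0
        fi
        n=$((n + 1))
    done
    rm -f "$sub"
    return 1
}
```

For example, `find_hang_threshold mfile 60 70` would test 60 through 70 nodes and print the first count that times out. A launch killed by `timeout` may leave daemons behind on the remote nodes, so those may need to be cleaned up between runs.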

We found that disabling the tree-based RSH launch works around the issue, i.e.

$ mpirun --npernode 1 --hostfile mfile -mca plm rsh -mca plm_rsh_no_tree_spawn true --debug-daemons -x LD_LIBRARY_PATH hostname

does NOT hang and produces the expected result.
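
The same workaround can presumably be applied without editing each command line, since Open MPI reads MCA parameters from environment variables named `OMPI_MCA_<param>`. A minimal sketch, assuming these are exported in the shell that invokes mpirun (we have not separately verified the hang with this form):

```shell
# Select the rsh launcher and disable tree-based spawning via the
# environment instead of -mca flags on the mpirun command line.
export OMPI_MCA_plm=rsh
export OMPI_MCA_plm_rsh_no_tree_spawn=true
```

With these set, the command reduces to `mpirun --npernode 1 --hostfile mfile --debug-daemons -x LD_LIBRARY_PATH hostname`.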

This appears to be a regression from the 2.x series. We do NOT observe this issue in any of the 2.x branches.
