Skip to content
This repository was archived by the owner on Sep 30, 2022. It is now read-only.

orte: Expand the application of !orte_keep_fqdn_hostnames #1354

Merged
merged 3 commits into from
Sep 12, 2016
Merged

orte: Expand the application of !orte_keep_fqdn_hostnames #1354

merged 3 commits into from
Sep 12, 2016

Conversation

jjhursey
Copy link
Member

  • Expand the use of the orte_keep_fqdn_hostnames MCA parameter when
    it is set to false.
  • If that parameter is set to false (default) then short hostnames
    (e.g., node01) will match with the long hostnames (e.g.,
    node01.mycluster.org). This allows a user (or resource manager)
    to mix the use of short and long hostnames.
    • Note that this mechanism does not perform a DNS lookup, but
      instead strips off the FQDN by truncating the hostname string at
      the first . character (when not an IP address).
      • By default (false) the following is true:
        node01 == node01.mycluster.org == node01.bogus.com
        since we use node01 as the hostname.

(cherry picked from commit open-mpi/ompi@d26dd2c)

bot:assign: @rhc54
bot:label:enhancement
bot:milestone:v2.0.2

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2144/ for details.

@rhc54
Copy link

rhc54 commented Sep 1, 2016

@jjhursey My apologies - I missed this on the master commit. There is a function for checking if a hostname is an IP address: opal_net_isaddr is in opal/util/net.h. Might be good to use it here instead of having these specific checks - see what you think.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 1, 2016

So the suggestion is to replace this logic:

if (0 == inet_pton(AF_INET, node_name, &buf) && 0 == inet_pton(AF_INET6, node_name, &buf)) {
    if (NULL != (ptr = strchr(node_name, '.'))) {
        *ptr = '\0';
    }
}

With this logic:

if ( !opal_net_isaddr(node_name) ) {
    if (NULL != (ptr = strchr(node_name, '.'))) {
        *ptr = '\0';
    }
}

That seems cleaner to me. I didn't know about the opal_net_isaddr method - seems handy. I'll update the version on master then add that commit to this PR.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 2, 2016

@rhc54 I have a PR for master here with these changes if you wanted to take a look:

I'll wait for CI, then work on bringing it into this PR.

@jsquyres
Copy link
Member

jsquyres commented Sep 6, 2016

@jjhursey Do you have any tests for this behavior, perchance?

jjhursey and others added 2 commits September 6, 2016 13:10
 * Expand the use of the `orte_keep_fqdn_hostnames` MCA parameter when
   it is set to false.
 * If that parameter is set to false (default) then short hostnames
   (e.g., `node01`) will match with the long hostnames (e.g.,
   `node01.mycluster.org`). This allows a user (or resource manager)
    to mix the use of short and long hostnames.
  - Note that this mechanism does _not_ perform a DNS lookup, but
    instead strips off the FQDN by truncating the hostname string at
    the first `.` character (when not an IP address).
     - By default (`false`) the following is true:
       `node01 == node01.mycluster.org == node01.bogus.com`
       since we use `node01` as the hostname.

(cherry picked from commit open-mpi/ompi@d26dd2c)
 * Switch to use opal_net_isaddr() for checking if a name is an IP
   address - as it is a bit cleaner, and uses common functionality.

(cherry picked from commit open-mpi/ompi@fe937d1)
@jjhursey
Copy link
Member Author

jjhursey commented Sep 6, 2016

I just applied the cherry-pick of open-mpi/ompi#2047 into this PR. We should probably keep these as 2 commits, since they are on the master branch.

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2160/ for details.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 6, 2016

I don't have any automated tests for this, but you can recreate it using rankfiles, hostfiles, and --host. Below are a couple examples from a build of this PR, the -mca orte_keep_fqdn_hostnames t example is more like what the current repo will do without this PR. This PR fixes the default case: -mca orte_keep_fqdn_hostnames f

Rankfile

shell$ cat rankfile-mixed
rank 0=mpi01.mycluster.org slot=0
rank 1=mpi01 slot=1
rank 2=mpi02.mycluster.org slot=0
rank 3=mpi02 slot=1
shell$ mpirun --display-allocation --do-not-launch -rf rankfile-mixed hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=2 max_slots=0 slots_inuse=0 state=UP
    mpi02: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
shell$ mpirun --display-allocation --do-not-launch -rf rankfile-mixed -mca orte_keep_fqdn_hostnames t hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=1 max_slots=0 slots_inuse=0 state=UP
    mpi02.mycluster.org: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
    mpi02: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
  • Notice that for the rankfile -mca orte_keep_fqdn_hostnames t case it gets the slots wrong too, but that should probably be addressed in a different PR.

Dash Host (--host)

shell$ mpirun --display-allocation --do-not-launch -np 4 --host mpi01,mpi01.mycluster.org,mpi02,mpi02.mycluster.org hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=2 max_slots=0 slots_inuse=0 state=UP
    mpi02: slots=2 max_slots=0 slots_inuse=0 state=UP
=================================================================
shell$ mpirun --display-allocation --do-not-launch -np 4 --host mpi01,mpi01.mycluster.org,mpi02,mpi02.mycluster.org -mca orte_keep_fqdn_hostnames t  hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=2 max_slots=0 slots_inuse=0 state=UP
    mpi02: slots=1 max_slots=0 slots_inuse=0 state=UP
    mpi02.mycluster.org: slots=1 max_slots=0 slots_inuse=0 state=UP
=================================================================

hostfile

shell$ cat hostfile-mixed
mpi01.mycluster.org
mpi01
mpi02.mycluster.org
mpi02
shell$ mpirun --display-allocation --do-not-launch -np 4 --hostfile ./hostfile-mixed  hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=2 max_slots=0 slots_inuse=0 state=UP
    mpi02: slots=2 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
shell$ mpirun --display-allocation --do-not-launch -np 4 --hostfile ./hostfile-mixed -mca orte_keep_fqdn_hostnames t  hostname

======================   ALLOCATED NODES   ======================
    mpi01: slots=2 max_slots=0 slots_inuse=0 state=UP
    mpi02.mycluster.org: slots=16 max_slots=0 slots_inuse=0 state=UNKNOWN
    mpi02: slots=16 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================

@jjhursey
Copy link
Member Author

jjhursey commented Sep 6, 2016

@rhc54 I think this is ready for a final review.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit open-mpi/ompi@722d5ee)
@jjhursey
Copy link
Member Author

jjhursey commented Sep 7, 2016

Added @jsquyres commit from PR open-mpi/ompi#2060

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2161/ for details.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 7, 2016

bot:mellanox:retest

@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2162/ for details.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 7, 2016

The mellanox failures do not look related. Mellanox CI was passing before I added Jeff's removal of an unused variable. Now it is failing in different locations all after the build when running tests. I think this is a Mellanox CI machine issue.

@jjhursey
Copy link
Member Author

jjhursey commented Sep 8, 2016

bot:mellanox:retest

@mellanox-github
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2164/ for details.

@rhc54
Copy link

rhc54 commented Sep 9, 2016

👍

@hppritcha
Copy link
Member

@jsquyres I think this is ready to merge in.

@jsquyres jsquyres merged commit 5aabc18 into open-mpi:v2.x Sep 12, 2016
@mellanox-github
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ompi-release-pr/2183/ for details.

@jjhursey jjhursey deleted the topic/mixed-hostname branch September 30, 2016 03:16
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants