Skip to content

v4.0.x: Backport: Ensure that --host / --hostfile nodes are always used in order provided #6508

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented Mar 20, 2019

Backport: Ensure that nodes are always used in order provided

Corresponds to following commits to OMPI master:

35a5971
2794ae4
aed06e6
5aa775c

Signed-off-by: Ralph Castain [email protected]

rhc54 added 2 commits March 19, 2019 21:29
Corresponds to following commits to OMPI master:

35a5971
2794ae4
aed06e6
5aa775c

Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
@rhc54 rhc54 added the bug label Mar 20, 2019
@rhc54 rhc54 added this to the v4.0.1 milestone Mar 20, 2019
@rhc54 rhc54 self-assigned this Mar 20, 2019
@rhc54 rhc54 requested review from jsquyres and jjhursey March 20, 2019 04:33
@rhc54
Copy link
Contributor Author

rhc54 commented Mar 20, 2019

@jsquyres @jjhursey Please check this out. Once we confirm it is okay, I'll use it as the basis for backports to the v3.x.y series.

@jsquyres
Copy link
Member

This is a big change. I see:

  • Deletion of the compress/bzip and compress/gzip components
  • Addition of the compress/zlib component (which looks like it's at least kind of a rename of gzip->zlib)
  • Deletion of the regx framework

I think that these are technically backwards-incompatible changes, right? Realistically, the end-user impact will likely be zero (i.e., I find it hard to believe that any real world user would be selecting / passing MCA params to the bzip/gzip components), but still -- we should acknowledge that we're intentionally breaking backwards compatibility here.

Also, I see changes to DSS packed/unpacked messages. Does this PR change the wire protocol? If so, does this have implications in container environments, e.g., where mpirun is outside the container and is a different version of OMPI than is inside the container?

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 20, 2019

I think that these are technically backwards-incompatible changes, right? Realistically, the end-user impact will likely be zero (i.e., I find it hard to believe that any real world user would be selecting / passing MCA params to the bzip/gzip components), but still -- we should acknowledge that we're intentionally breaking backwards compatibility here.

Nobody could use those components because they were only used for CR, which was removed from the release branches long ago.

Also, I see changes to DSS packed/unpacked messages. Does this PR change the wire protocol? If so, does this have implications in container environments, e.g., where mpirun is outside the container and is a different version of OMPI than is inside the container?

We have always required that the daemons and mpirun all be at the same release level. Within that constraint, this doesn't break anything. However, if mpirun is outside the container, and the daemons are from some other version inside the container, then yes - launch will be broken.

@hppritcha
Copy link
Member

@gpaulsen this is huge to merge in while we are in rc state. I'd say this waits till 4.0.2.

@gpaulsen
Copy link
Member

@hppritcha I agree. 4.0.2 would be sufficient.

@jsquyres
Copy link
Member

@rhc54 FWIW, I'm getting assertion failures with this PR:

$ mpirun --host mpi001,mpi002 hostname
mpirun: base/rml_base_msg_handlers.c:184: orte_rml_base_process_msg: Assertion `((void *)0) != orte_node_regex' failed.
[1]    14227 abort (core dumped)  mpirun --host mpi001,mpi002 hostname

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 21, 2019

@jsquyres Should be okay now - I missed a spot in the backport.

@jsquyres
Copy link
Member

@rhc54 Something still seems to be missing. I'm not getting assertion failures any more, but I'm getting the old/bad behavior:

$ mpirun --host mpi001,mpi002 ~/foo.sh
mpi001: 0 rank
mpi002: 1 rank
$ mpirun --host mpi002,mpi001 ~/foo.sh
mpi001: 0 rank
mpi002: 1 rank

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 25, 2019

I'm beginning to wonder if this is worth all the pain...hitting my limits on how much time I can devote to it.

@hppritcha hppritcha modified the milestones: v4.0.1, v4.0.2 Mar 25, 2019
@hppritcha hppritcha removed the request for review from jsquyres April 2, 2019 15:08
@jsquyres jsquyres changed the title Backport: Ensure that nodes are always used in order provided v4.0.x: Backport: Ensure that nodes are always used in order provided Apr 9, 2019
@jsquyres jsquyres changed the title v4.0.x: Backport: Ensure that nodes are always used in order provided v4.0.x: Backport: Ensure that --host / --hostfile nodes are always used in order provided Apr 9, 2019
@jjhursey
Copy link
Member

I'm going to take myself off as a review to prevent the Pull Request Reminder from bugging me. Add me back when this is ready for review.

@jjhursey jjhursey removed their request for review April 23, 2019 15:11
@gpaulsen
Copy link
Member

@ggouaillardet - Ralph said in today's web-ex that he realistically won't have time for this in the coming months. Is this something you could help with?

@gpaulsen gpaulsen removed this from the v4.0.2 milestone Jun 7, 2019
@gpaulsen
Copy link
Member

gpaulsen commented Jun 7, 2019

Removing milestone as we're currently unsure of the scope, schedule of getting this PR working.

@hppritcha hppritcha modified the milestone: v4.0.2 Jul 13, 2019
@rhc54 rhc54 closed this Nov 27, 2019
@rhc54 rhc54 deleted the cmr40x/order branch November 27, 2019 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants