@jjhursey
Member

@jjhursey jjhursey commented Feb 6, 2018

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit ce901ba)
@jjhursey jjhursey added the bug label Feb 6, 2018
@jjhursey jjhursey added this to the v3.0.1 milestone Feb 6, 2018
@jjhursey jjhursey requested a review from bosilca February 6, 2018 15:42
@hppritcha
Member

@bwbarrett We discussed this at the webex today and think it should be pulled in before the release. We also need b643852.

@hppritcha
Member

@jjhursey could you add commit b643852 to this PR?

@jjhursey
Member Author

jjhursey commented Feb 6, 2018

It didn't apply cleanly, so I'll have to investigate a bit. I'll try to get it done today.

@jjhursey
Member Author

jjhursey commented Feb 7, 2018

@bosilca I could not reproduce the hang with this branch, but I did reproduce a different issue related to failed executions.

From the trace below you can see that if a job fails to launch, it still retains its allocation. Eventually the DVM runs out of available slots and starts rejecting submissions. I did not see this when testing the v3.1.x branch.

[jjhursey@node03]  orte-dvm --host node03:2,node04:2,node05:2 &
[1] 108781
VMURI: 1511063552.0;tcp://x.x.x.x:39429;ud://31323.23.1
DVM ready

[jjhursey@node03] mpirun --hnp file:/tmp/ompi.`hostname`.${EUID}/dvm/contact.txt -npernode 1 -H node04,node05 hostname
[ORTE] Task: 0 is launched! (Job ID: [23057,1])
node05
node04
[ORTE] Task: 0 returned: 0 (Job ID: [23057,1])
[jjhursey@node03] mpirun --hnp file:/tmp/ompi.`hostname`.${EUID}/dvm/contact.txt -npernode 1 -H node04,node05 bogus
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 1; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       node05
Executable: bogus
--------------------------------------------------------------------------
[jjhursey@node03] mpirun --hnp file:/tmp/ompi.`hostname`.${EUID}/dvm/contact.txt -npernode 1 -H node04,node05 hostname
node05
[ORTE] Task: 0 is launched! (Job ID: [23057,3])
node04
[ORTE] Task: 0 returned: 0 (Job ID: [23057,3])
[jjhursey@node03] mpirun --hnp file:/tmp/ompi.`hostname`.${EUID}/dvm/contact.txt -npernode 1 -H node04,node05 bogus
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       node04
Executable: bogus
--------------------------------------------------------------------------
[jjhursey@node03] mpirun --hnp file:/tmp/ompi.`hostname`.${EUID}/dvm/contact.txt -npernode 1 -H node04,node05 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
[jjhursey@node03] 
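
The slot exhaustion in the trace above can be illustrated with a toy model. This is only a sketch of the suspected accounting behavior, not ORTE code: the `ToyDvm` class, its `submit` method, and the `release_on_failure` flag are hypothetical names invented for illustration. It assumes each job takes one slot per target node (as with `-npernode 1`) and that a failed launch never gives its slots back.

```python
from typing import Dict, List


class ToyDvm:
    """Toy slot accounting for a DVM. Hypothetical model, not ORTE internals."""

    def __init__(self, slots: Dict[str, int], release_on_failure: bool) -> None:
        self.free = dict(slots)  # free slots per node
        self.release_on_failure = release_on_failure

    def submit(self, nodes: List[str], launch_ok: bool) -> str:
        # Reject the job if any target node has no free slot.
        if any(self.free[n] < 1 for n in nodes):
            return "rejected"
        # Reserve one slot per node (mirrors -npernode 1).
        for n in nodes:
            self.free[n] -= 1
        # A finished job returns its slots; a failed launch returns them
        # only when release_on_failure is set (assumed "fixed" behavior).
        if launch_ok or self.release_on_failure:
            for n in nodes:
                self.free[n] += 1
        return "completed" if launch_ok else "failed"


def replay(release_on_failure: bool) -> List[str]:
    """Replay the five submissions from the trace: hostname, bogus, hostname, bogus, hostname."""
    dvm = ToyDvm({"node03": 2, "node04": 2, "node05": 2}, release_on_failure)
    targets = ["node04", "node05"]
    launches = [True, False, True, False, True]  # False stands for the bogus executable
    return [dvm.submit(targets, ok) for ok in launches]


if __name__ == "__main__":
    print(replay(release_on_failure=False))
    print(replay(release_on_failure=True))
```

With two slots per node, each failed launch leaks one slot on node04 and node05, so under this model the fifth submission is rejected after two failures, which matches the "already filled" message in the trace. Setting `release_on_failure=True` lets the fifth submission complete.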

@rhc54
Contributor

rhc54 commented Feb 7, 2018

@jjhursey Please see #4760 - it really does need to be committed, but is waiting for your review

@jjhursey
Member Author

jjhursey commented Feb 8, 2018

From testing, it doesn't seem that this patch is needed in the v3.0.x series. Closing this PR.

@jjhursey jjhursey closed this Feb 8, 2018
@jjhursey jjhursey deleted the fix/v3.0.x/badexe branch February 8, 2018 16:58
4 participants