Conversation

@rhc54 (Contributor) commented Jan 27, 2018

Since we now support the dynamic addition of hosts to the orte_node_pool, there is no longer any reason to require advance specification of all possible nodes. Instead, use a precedence method to initially allocate only those hosts that were specified on the command line:

  • rankfile, if given, as that will specify the nodes

  • -host, aggregated across all app_contexts

  • -hostfile, aggregated across all app_contexts

  • default hostfile

  • assign local node
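As a rough illustration of the precedence order above (this is not the actual ORTE implementation; the type and function names here are invented for the sketch):

```c
#include <stdbool.h>

/* Which source the initial allocation should be drawn from */
typedef enum {
    SRC_RANKFILE,          /* rankfile names the nodes directly */
    SRC_DASH_HOST,         /* -host, aggregated across app_contexts */
    SRC_HOSTFILE,          /* -hostfile, aggregated across app_contexts */
    SRC_DEFAULT_HOSTFILE,  /* site-wide default hostfile */
    SRC_LOCAL_NODE         /* fall back to the node we are running on */
} alloc_source_t;

/* Flags recording what the user supplied on the command line */
typedef struct {
    bool have_rankfile;
    bool have_dash_host;
    bool have_hostfile;
    bool have_default_hostfile;
} alloc_inputs_t;

/* First match wins, in the order listed in the PR description */
static alloc_source_t pick_alloc_source(const alloc_inputs_t *in)
{
    if (in->have_rankfile)         return SRC_RANKFILE;
    if (in->have_dash_host)        return SRC_DASH_HOST;
    if (in->have_hostfile)         return SRC_HOSTFILE;
    if (in->have_default_hostfile) return SRC_DEFAULT_HOSTFILE;
    return SRC_LOCAL_NODE;
}
```

The key point is that only the highest-precedence source present is used for the initial allocation; anything else can be added dynamically later.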

Fix slots_inuse accounting so that the nodes are correctly reset upon error termination - e.g., when oversubscribed without permission.
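A minimal sketch of the accounting reset, with illustrative names (not ORTE's actual data structures): on abnormal termination, the slots the failed job was holding must be returned so the next launch sees accurate availability.

```c
#include <stddef.h>

/* Simplified stand-in for a node-pool entry */
typedef struct {
    const char *name;
    int slots;        /* total slots on the node */
    int slots_inuse;  /* slots currently allocated to jobs */
} node_t;

/* Release the slots a terminated job was using on each node.
 * used_by_job[i] is how many slots the job held on nodes[i]. */
static void release_job_slots(node_t *nodes, size_t nnodes,
                              const int *used_by_job)
{
    for (size_t i = 0; i < nnodes; i++) {
        nodes[i].slots_inuse -= used_by_job[i];
        if (nodes[i].slots_inuse < 0) {
            nodes[i].slots_inuse = 0;  /* guard against double-release */
        }
    }
}
```

Without a reset like this, an oversubscription failure would leave slots_inuse inflated, and subsequent launches would be rejected for lack of slots.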

Ensure we accurately track the user's specified desires for oversubscribe and no-use-local when dynamically spawning jobs.

Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit c9b3e68)

@rhc54 (Contributor, Author) commented Jan 27, 2018

You may or may not want this backport from the PMIx reference server. While investigating a problem reported by @ggouaillardet regarding oversubscription, I found that the slot accounting system was generating erroneous results whenever a job failed, and that multiple job executions were resulting in allocation confusion as the hostfile was adding nodes on every invocation.

These changes should have no impact on one-shot executions (i.e., mpirun), but they definitely resolve some orte-dvm problems. However, I can't absolutely swear that they won't impact somebody doing something unusual. You all know how many times we have discovered that someone's corner-case behavior changed when we touched the allocation code!

So look this over carefully - as I said, totally up to you.

@ggouaillardet (Contributor)

I will review this.

Out of curiosity, what is the rationale for having pmix-reference-server in a dedicated repository? At first glance, it looks like a trimmed version of Open MPI, so could an autogen.pl option be used to achieve the same result (and hence reduce the need to synchronize both repositories in both directions)?

@rhc54 (Contributor, Author) commented Jan 27, 2018

It is mostly politics - asking PMIx users and developers to clone the OMPI repository is unacceptable to some non-trivial portion of the community. It also allows for divergence, as this PR may well represent.

@rhc54 (Contributor, Author) commented Feb 4, 2018

@jjhursey The DVM will likely have problems without this change as the resource accounting is off.

@jsquyres (Member) commented Feb 6, 2018

@jjhursey @ggouaillardet Ping. Can you guys review this? Thanks!

@jjhursey (Member) commented Feb 7, 2018

(Reviewing/testing now.) Sorry for the delay.

@jjhursey (Member) commented Feb 7, 2018

I reviewed the code and it looked good. However, I am trying to do some runtime testing to verify the fix and hit a snag that I think I've seen before on master.

[jjhursey@node03 ompi-rhc54] orte-dvm --host node03:2,node04:2 --system-server &
[1] 156171
[jjhursey@node03 ompi-rhc54] DVM ready

[jjhursey@node03 ompi-rhc54] prun -n 1 hostname
[node03:156211] *** Process received signal ***
[node03:156211] Signal: Segmentation fault (11)
[node03:156211] Signal code: Address not mapped (1)
[node03:156211] Failing at address: 0x50
[node03:156211] [ 0] [0x3fffa2010478]
[node03:156211] [ 1] /.../ompi/install/ompi-rhc54-dbg/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_PMIx_tool_init+0x2214)[0x3fffa0d1bfd0]
[node03:156211] [ 2] /.../ompi/install/ompi-rhc54-dbg/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_tool_init+0x368)[0x3fffa0c4e620]
[node03:156211] [ 3] /.../ompi/install/ompi-rhc54-dbg/lib/libopen-rte.so.0(orte_ess_base_tool_setup+0x2d0)[0x3fffa1f3d524]
[node03:156211] [ 4] /.../ompi/install/ompi-rhc54-dbg/lib/openmpi/mca_ess_tool.so(+0x20e0)[0x3fffa18020e0]
[node03:156211] [ 5] /.../ompi/install/ompi-rhc54-dbg/lib/libopen-rte.so.0(orte_init+0x4bc)[0x3fffa1ec1dd4]
[node03:156211] [ 6] prun[0x10003e04]
[node03:156211] [ 7] prun[0x10002330]
[node03:156211] [ 8] /lib64/libc.so.6(+0x24700)[0x3fffa1944700]
[node03:156211] [ 9] /lib64/libc.so.6(__libc_start_main+0xc4)[0x3fffa19448f4]
[node03:156211] *** End of error message ***
Segmentation fault (core dumped)
#0  0x00003fffa0d1bfec in OPAL_MCA_PMIX3X_PMIx_tool_init (proc=0x3fffa0dc0110 <mca_pmix_pmix3x_component+264>, info=0x0, ninfo=0)
    at tool/pmix_tool.c:389
389	    (void)strncpy(pmix_globals.mypeer->info->pname.nspace, proc->nspace, PMIX_MAX_NSLEN);

I have most of a fix and will post it when it's done. It looks like a NULL pointer dereference in the PMIx_tool_init logic.

@rhc54 (Contributor, Author) commented Feb 7, 2018

Yeah, this is one of the fixes that failed to come across - it should be covered by a separate PR that got committed, I think. If not, it certainly is fixed upstream in PMIx.

@jjhursey (Member) commented Feb 7, 2018

The NULL pointer was in the info attribute. With this patch, prun no longer crashes, but it only launches on one node.

diff --git a/opal/mca/pmix/pmix3x/pmix/src/tool/pmix_tool.c b/opal/mca/pmix/pmix3x/pmix/src/tool/pmix_tool.c
index 31020c56..675d724b 100644
--- a/opal/mca/pmix/pmix3x/pmix/src/tool/pmix_tool.c
+++ b/opal/mca/pmix/pmix3x/pmix/src/tool/pmix_tool.c
@@ -386,6 +386,10 @@ PMIX_EXPORT int PMIx_tool_init(pmix_proc_t *proc,
     if (NULL == pmix_globals.mypeer->nptr->nspace) {
         pmix_globals.mypeer->nptr->nspace = strdup(proc->nspace);
     }
+    if( NULL == pmix_globals.mypeer->info ) {
+        pmix_globals.mypeer->info = PMIX_NEW(pmix_rank_info_t);
+        pmix_globals.mypeer->info->pname.nspace = (char*)malloc(sizeof(char)*(PMIX_MAX_NSLEN+1));
+    }
     (void)strncpy(pmix_globals.mypeer->info->pname.nspace, proc->nspace, PMIX_MAX_NSLEN);
     pmix_globals.mypeer->info->pname.rank = proc->rank;
[jjhursey@node03 ~] orte-dvm --host node03:2,node04:2 --system-server &
[1] 93423
[jjhursey@node03 ~] DVM ready

[jjhursey@node03 ~] prun -n 2  hostname
node03
[jjhursey@node03 ~] prun -npernode 2  hostname
node04
node04

@rhc54 (Contributor, Author) commented Feb 7, 2018

Sigh - I fear things are getting out of sync as the PRs sit too long. I can try to resync this one.

Ralph Castain added 2 commits February 7, 2018 11:29
@rhc54 (Contributor, Author) commented Feb 7, 2018

Okay, I have brought this up to date. The problem wasn't in the mapping code; the IOF output was simply being lost.

@jjhursey (Member) commented Feb 8, 2018

I did a run with the updated branch, and it looks like it is working just fine now.

I do see this warning from the PMIx v3 component (it might just be my build, though):
mca_bfrops_v3.so: undefined symbol: pmix_bfrops_base_pack_iof_channel (ignored)

@rhc54 I think this is good to go. I don't know if you also want to wait for @ggouaillardet to sign off as well.

This is what I tested:

[jjhursey@node03 ~]  orte-dvm --host node03:2,node04:2,node05:2 &
[1] 155939
DVM ready

[jjhursey@node03 ~] prun -n 2  hostname
node03
node03
[jjhursey@node03 ~] prun -n 2 --nolocal  hostname
node04
node04
[jjhursey@node03 ~] prun -n 20  hostname
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 20 slots
that were requested by the application:
  hostname

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[node03:156135] Job failed to spawn: (null)
[jjhursey@node03 ~] prun -n 20 --oversubscribe  hostname
node04
node04
node04
node04
node04
node03
node03
node03
node03
node03
node03
node03
node05
node05
node05
node05
node05
node04
node05
node03
[jjhursey@node03 ~] prun -n 20 --oversubscribe --nolocal  hostname
node05
node05
node05
node05
node04
node05
node04
node05
node04
node05
node04
node05
node04
node05
node04
node04
node04
node05
node04
node04

@rhc54 (Contributor, Author) commented Feb 8, 2018

Yeah, that loading error is irrelevant - I just forgot to remove a component that only applies to the PMIx IOF branch. I can do that separately.

Thx for checking this out!

@rhc54 rhc54 merged commit efd715e into open-mpi:master Feb 8, 2018
@rhc54 rhc54 deleted the topic/map branch February 8, 2018 18:46