-
Notifications
You must be signed in to change notification settings - Fork 900
ORTE regex fails when nodenames include hyphens #4621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Just to further clarify: this problem will occur regardless of the plm launcher being used (e.g., when using mpirun under slurm) as we always try to pass the regex to the backend daemons on launch, and there is no guarantee that the length restriction will save us like it did in the referenced case. |
@rhc54 i am a bit puzzled with this ... i renamed my virtual cluster there is a more obvious error when the regex is too long, and this does not involve that can be simply reproduced with the inline patch below, and i can reproduce the issue when running
from
at first glance, i did not see diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..54930db 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1573,6 +1573,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
ORTE_ERROR_LOG(rc);
return rc;
}
+#if 0
/* if this is too long, then we'll have to do it with
* a phone home operation instead */
if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
@@ -1582,6 +1583,7 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
/* mark that the nidmap has been communicated */
orte_nidmap_communicated = true;
}
+#endif
free(param);
if (!orte_static_ports && !orte_fwd_mpirun_port) { |
If you only have two nodes in the hostfile, such as the Mellanox folks had, then the regex cannot be more than 1024 characters in length (which is the cmd_length limit). We run over 20k nodes with it and have never hit the limit. I would look at the regex generator and see what it is doing wrong in that scenario. |
One place to start might be to simply print out the regex and see why it is so long... |
ok, let me put it this way then, i created a cluster whose nodes are called the regex (that can be seen with should we fix the real issue (e.g. the phone home does not work) before optimizing the regex generation ? |
I honestly don't think that is the Once we understand and fix that, then sure - we can look into why the alternative code path isn't working. I honestly don't remember the code path for that case, so it would take a little looking to see how it is supposed to work. |
@rhc54 i am fine with that. @karasevb is Open MPI installed on NFS ? if not, can you please double check the very same Open MPI is deployed on all the nodes ? i have a real hard time trying to reproduce the issue, and all i can think of is (as @rhc54 previously suggested it) some node is running an older version of |
Another thing we could do is provide a patch for orte/util/regex.c that adds some debug and have @karasevb run it to see what is going on over there. |
@ggouaillardet Actually, I just realized that your output indeed shows the problem. The regex should have compressed node-001 and node-002 into a regex like node-[3:1-2]. The hyphen is confusing it into thinking every node name is unique and cannot be compressed. This is why Mellanox is failing. They have a lot of nodes (we don't filter them when making the original regex), and the regex generator thinks they are all unique names. Otherwise, it would have output something like "node-[3:1-999]". |
@rhc54 minor detail, the regex should be |
here is a patch that fixes the issue that occurs when the regex is too long. @rhc54 can you please have a look at it ? diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c
index 118813e..8e941e7 100644
--- a/orte/mca/plm/base/plm_base_launch_support.c
+++ b/orte/mca/plm/base/plm_base_launch_support.c
@@ -1575,12 +1575,20 @@ int orte_plm_base_orted_append_basic_args(int *argc, char ***argv,
}
/* if this is too long, then we'll have to do it with
* a phone home operation instead */
- if (strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
+ if (false && strlen(param) < ORTE_MAX_REGEX_CMD_LENGTH) {
opal_argv_append(argc, argv, "-"OPAL_MCA_CMD_LINE_ID);
opal_argv_append(argc, argv, "orte_node_regex");
opal_argv_append(argc, argv, param);
/* mark that the nidmap has been communicated */
orte_nidmap_communicated = true;
+ } else if (NULL != orte_tree_launch_cmd) {
+ assert (!orte_nidmap_communicated);
+ if (ORTE_SUCCESS != (rc = opal_dss.pack(orte_tree_launch_cmd, ¶m, 1, OPAL_STRING))) {
+ ORTE_ERROR_LOG(rc);
+ return rc;
+ }
+ /* mark that the nidmap has been communicated */
+ orte_nidmap_communicated = true;
}
free(param);
diff --git a/orte/mca/plm/rsh/plm_rsh_module.c b/orte/mca/plm/rsh/plm_rsh_module.c
index 7b5ae93..3968de7 100644
--- a/orte/mca/plm/rsh/plm_rsh_module.c
+++ b/orte/mca/plm/rsh/plm_rsh_module.c
@@ -802,6 +802,25 @@ static int remote_spawn(opal_buffer_t *launch)
goto cleanup;
}
+ /* extract the node regex if needed, and update the routing tree */
+ if (NULL == orte_node_regex) {
+ n = 1;
+ if (ORTE_SUCCESS != (rc = opal_dss.unpack(launch, &orte_node_regex, &n, OPAL_STRING))) {
+ ORTE_ERROR_LOG(rc);
+ goto cleanup;
+ }
+
+ if (NULL != orte_node_regex) {
+ if (ORTE_SUCCESS != (rc = orte_util_nidmap_parse(orte_node_regex))) {
+ ORTE_ERROR_LOG(rc);
+ goto cleanup;
+ }
+ /* be sure to update the routing tree so any tree spawn operation
+ * properly gets the number of children underneath us */
+ orte_routed.update_routing_plan(NULL);
+ }
+ }
+
/* get the updated routing list */
rtmod = orte_rml.get_routed(orte_coll_conduit);
OBJ_CONSTRUCT(&coll, opal_list_t); |
@ggouaillardet I confirm, this patch fixes the hang. |
"about the "lot of nodes" thing, are you saying that even with a two line hostfile, and forcing the plm/rsh, the regex includes all the (100 iirc) nodes from the slurm allocation ?" I think it might, yes. Even though we are only launching daemons on the initial limited set of nodes, the orte_node_pool still includes all the nodes in the slurm allocation, and that is what is used to build the regex. We filter the allocation prior to defining the daemons for the initial launch. IIRC, the rationale was that a subsequent comm_spawn might utilize the rest of that allocation, and we need the node pool on the daemons to match that in mpirun for the distributed mapping to work. However, we could probably get that to update correctly since we now include a revised regex in the launch command. Might be worth a shot, if you have a few cycles to try it. Alternatively, it also might be useful to get the regex to handle the hyphen correctly. As I said, we have never seen someone do that (until now), but it isn't forbidden, and so it is always possible that someone else might someday do it. |
I will take a crack at it tomorrow. About the hyphen, a cheap trick might be to convert |
True - thanks! |
@rhc54 when running from is this the intended behavior ? |
Hmmm...it would be nice if n2 was given the port for n1 so it can callback to there. That certainly is true for static ports, but may be an optimization for the non-static case that we didn't write yet. |
thanks, i'll add this to my todo list ! |
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]>
Just to follow-up on the "phone home" question. There is an MCA param that is supposed to be set on the cmd line by the daemon on n1 (in your example) when tree-spawning n2. This directs the n2 daemon to send its "phone home" message to the daemon on n1 and provides the necessary contact info. What actually happens today is that n2 will send the message directly to the HNP on n0, but it will route the message thru n1. So the message is indeed received at n0 - it just went thru the routing tree. If you want the message to actually land on the daemon on n1, then you would need to modify the code in orte/orted/orted_main.c:
We should be putting the process name into ORTE_PROC_MY_PARENT, not a local "parent" location. It is possible that we update ORTE_PROC_MY_PARENT elsewhere, so you might check
This is actually okay - just need to add a call to update the route to ORTE_PROC_MY_PARENT
Include a check of ORTE_PROC_MY_PARENT - if it is set, then we send to the parent instead of the HNP. Note that the rollup code already exists in orted_main, so this should all flow correctly from there. It would probably be a good idea to use the rollup as it dramatically reduces the number of messages handled by the HNP. |
thanks for the lengthy explanation. iirc, i naively tried to have i will resume this from Monday. |
I fixed it (#4628) - found that treespawn wasn't actually doing what it should anyway. Every daemon had to phone directly home and then get a "remote spawn" message to launch its children. Root cause was the missing parent URI that prevented rollup. |
This was fixed by @ggouaillardet |
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]>
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]>
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]> Signed-off-by: Scott Miller <[email protected]>
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]> Signed-off-by: Scott Miller <[email protected]>
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit f3e2a31)
'-' is not an alpha character nor a digit, but it is a valid hostname character and should be handled as an alpha character, otherwise, nodes such as node-001 do not get "compressed" in the regex. Refs open-mpi#4621 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry picked from commit f3e2a31)
If a cluster is configured with
-
in the names of their nodes, then the ORTE regex parser gets confused and thinks the hyphen is part of a range designation. This leads to unusually large regex results and can cause odd behavior (e.g., unable to tree spawn).Fortunately, we haven't seen this in 14 years, but Mellanox now has a cluster that hits the problem. No rules against such configuration, so probably something that should be addressed in case someone else does it.
The text was updated successfully, but these errors were encountered: