ompi/proc: fix local proc discovery #3101
The wrong process name was used to look up local processes. A wildcard was used where the local process name was needed. Signed-off-by: Nathan Hjelm <[email protected]>
It is already like this in v2.x:
but master with built-in pmix was failing, so I changed it to a wildcard
Let me investigate
what is your configuration?
@artpol84 Ok so not a 2.x blocker.
Ok, I see the problem.
This is only a problem for master, as @hjelmn noted.
I will prepare the PR with the fix this weekend
And we indeed may want to port this to v2.x as well.
I don't understand some of these comments. The access to any modex information is handled by the opal/pmix components, which are supposed to correct for the change in behavior between the PMIx versions. Rather than creating a new key, it sounds like what is required is to fix the external pmix component "glue" so it correctly handles the difference between v1.2 and v2.0.
@hjelmn Are you saying master didn't work when built against an external copy of PMIx v1.2? Or are you saying that master doesn't work with the internal version of pmix?
@rhc54 @hjelmn as I expected, Jenkins failed at runtime with this PR. This is because internal pmix/master wants the wildcard.
This is what I'm proposing. Each pmix component will create a local key using
This new key will be a "glue". We don't need to modify the pmix codebase; this will be an ompi-internal key that pmix will handle in the same way as other keys submitted by the application.
To complete my idea. I'm proposing to access and process and here: https://github.com/open-mpi/ompi/blob/v2.x/ompi/proc/proc.c#L349 we will use UPD: the name of the key (
No, please don't do that - we'll wind up doing this with every key somebody accesses. The correct fix is to simply update the alps, s1, and s2 components so they properly handle the wildcard rank.
Sure, let's discuss first.
sure it will - we just have to adjust the ext1 component. should have already been done as we know there is a change wrt rank wildcard data, so this is a general problem there.
OK, great.
To make sure - you mean something like this:
not exactly - we know that ORTE registered the nspace and passed certain keys down with rank wildcard. You need to look thru the server_south code to see what (if anything) we do with wildcard rank, and then trace the data that was passed into the 1.x pmix server to see how it was actually stored. The client code needs to take in the wildcard rank and do whatever conversion is required to ensure we look it up from the right place.
here is what I meant:
basically for pmix v1 we should always replace wildcard with
agreed - i'm pretty sure that will solve the problem.
then we just have to update the alps, s1, and s2 components to likewise ignore the wildcard
Looking at the ext1 component, I think all you have to do is change this line (number 451) in the client:
    p.rank = proc->vpid;
with:
    if (OPAL_VPID_WILDCARD == proc->vpid) {
        p.rank = my_proc.rank;
    } else {
        p.rank = proc->vpid;
    }
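For readers following along, the translation suggested above can be sketched in isolation. This is only an illustration, not the actual ext1 code: the `sketch_name_t` type, the `SKETCH_VPID_WILDCARD` value, and the `sketch_lookup_rank` helper are hypothetical stand-ins for the real OPAL name type, `OPAL_VPID_WILDCARD`, and the client-side lookup path.

```c
#include <stdint.h>

/* Hypothetical sentinel standing in for OPAL_VPID_WILDCARD; the real
 * value is defined by OPAL and may differ. */
#define SKETCH_VPID_WILDCARD UINT32_MAX

/* Hypothetical, simplified process name (job id + rank within the job). */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} sketch_name_t;

/* Pick the rank to hand to a PMIx v1.x lookup. v1.x has no notion of a
 * wildcard rank, so a wildcard request is redirected to the caller's own
 * rank; any concrete rank is passed through unchanged. */
static uint32_t sketch_lookup_rank(const sketch_name_t *proc, uint32_t my_rank)
{
    if (SKETCH_VPID_WILDCARD == proc->vpid) {
        return my_rank;
    }
    return proc->vpid;
}
```

With this shape, a request for job-level data made against the wildcard rank from rank 7 would be looked up under rank 7, while a request for a specific peer's data is unaffected.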
…tored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks Closes open-mpi#3101 Signed-off-by: Ralph Castain <[email protected]>