
Conversation


@hjelmn hjelmn commented Mar 4, 2017

The wrong process name was used to lookup local processes. A wildcard
was used when the local process name is needed.

Signed-off-by: Nathan Hjelm <[email protected]>

hjelmn commented Mar 4, 2017

@artpol84 The local process discovery was failing with Cray PMI. This change fixes it. @jsquyres This is a blocker for 2.1.0.

@hjelmn hjelmn removed this from the v2.1.0 milestone Mar 4, 2017

artpol84 commented Mar 4, 2017

It is already like this in v2.x:
https://github.com/open-mpi/ompi/blob/v2.x/ompi/proc/proc.c#L349


artpol84 commented Mar 4, 2017

But master with the built-in pmix was failing, so I changed it to a wildcard.


artpol84 commented Mar 4, 2017

Let me investigate


artpol84 commented Mar 4, 2017

What is your configuration: external or internal PMIx v1.2?
It seems that this key is stored differently in PMIx v1.2 and pmix/master.


hjelmn commented Mar 4, 2017

@artpol84 Ok so not a 2.x blocker.


artpol84 commented Mar 4, 2017

Ok, I see the problem.
Originally, access to OPAL_PMIX_LOCAL_PEERS was performed inside the pmix components, and it worked fine since the API change between v1.2 and v2.x was handled transparently/correctly by each component. But now we access this key from a central location in the OMPI code, which exposes the API portability issue.
The simplest solution is to create another key locally with a copy of the value but the same access pattern (PMIx_Store_internal).


artpol84 commented Mar 4, 2017

This is only a problem for master, as @hjelmn noted.


artpol84 commented Mar 4, 2017

I will prepare the PR with the fix this weekend


artpol84 commented Mar 4, 2017

And we indeed may want to port this to v2.x as well.


rhc54 commented Mar 4, 2017

I don't understand some of these comments. The access to any modex information is handled by the opal/pmix components, which are supposed to correct for the change in behavior between the PMIx versions. Rather than creating a new key, it sounds like what is required is to fix the external pmix component "glue" so it correctly handles the difference between v1.2 and v2.0.


rhc54 commented Mar 4, 2017

@hjelmn Are you saying master didn't work when built against an external copy of PMIx v1.2? Or are you saying that master doesn't work with the internal version of pmix?


artpol84 commented Mar 4, 2017

@rhc54
@hjelmn is saying that the alps component doesn't work. I expect that pmix v1.2 won't work either.
The reason is that we are requesting OPAL_PMIX_LOCAL_PEERS directly: in pmix/v1.2 you need to pass your own proc as the argument to PMIx_Get, while in pmix/master you have to pass the WILDCARD rank.

@hjelmn As I expected, Jenkins failed at runtime with this PR; this is because the internal pmix/master wants the wildcard.


artpol84 commented Mar 5, 2017

> Rather than creating a new key, it sounds like what is required is to fix the external pmix component "glue" so it correctly handles the difference between v1.2 and v2.0.

This is what I'm proposing: each pmix component will create a local key using PMIx_Store_internal with the set of local ranks.
Access to OPAL_PMIX_LOCAL_PEERS differs between versions, but this new local key will always be available in the same way.


artpol84 commented Mar 5, 2017

This new key will be the "glue". We don't need to modify the pmix codebase; this will be an OMPI-internal key that pmix handles the same way as any other key submitted by the application.


artpol84 commented Mar 5, 2017

To complete my idea: I'm proposing to access and process the OPAL_PMIX_LOCAL_PEERS value internally in each opal/pmix component (in the PMI1/PMI2 components these will obviously be different keys). The result will be stored uniformly under, let's say, OPAL_PMIX_LOCAL_PEERS_NEW, and this key could already be an array of integers instead of a string.

Then here: https://github.com/open-mpi/ompi/blob/v2.x/ompi/proc/proc.c#L349 we would use OPAL_PMIX_LOCAL_PEERS_NEW instead of OPAL_PMIX_LOCAL_PEERS, making the API differences transparent to the rest of the OMPI code.

UPD: the name of the key (OPAL_PMIX_LOCAL_PEERS_NEW) should be something more meaningful, of course.


rhc54 commented Mar 5, 2017

No, please don't do that - we'll wind up doing this with every key somebody accesses. The correct fix is to simply update the alps, s1, and s2 components so they properly handle the wildcard rank.


artpol84 commented Mar 5, 2017

Sure, let's discuss first.
The only problem with your proposal is that external pmix v1.2 won't work.


rhc54 commented Mar 5, 2017

Sure it will - we just have to adjust the ext1 component. This should have already been done, since we know there is a change w.r.t. rank-wildcard data, so this is a general problem there.


artpol84 commented Mar 5, 2017

OK, great.


artpol84 commented Mar 5, 2017


rhc54 commented Mar 5, 2017

Not exactly - we know that ORTE registered the nspace and passed certain keys down with the wildcard rank. You need to look through the server_south code to see what (if anything) we do with the wildcard rank, and then trace the data that was passed into the 1.x pmix server to see how it was actually stored. The client code needs to take in the wildcard rank and do whatever conversion is required to ensure we look it up from the right place.


artpol84 commented Mar 5, 2017

Here is what I meant: here https://github.com/open-mpi/ompi/blob/master/opal/mca/pmix/ext1x/pmix1x_client.c#L455:L464 you return data directly without going into PMIx_Get.
For keys like local_peers you can call PMIx_Get but intercept the proc, converting the wildcard into the current process (as pmix v1 requires).
I think what you've said above is similar, right?


artpol84 commented Mar 5, 2017

Basically, for PMIx v1 we should always replace the wildcard with this_proc, since PMIx v1 doesn't expose the wildcard at the API level, IIRC.


rhc54 commented Mar 5, 2017

Agreed - I'm pretty sure that will solve the problem.


rhc54 commented Mar 5, 2017

Then we just have to update the alps, s1, and s2 components to likewise ignore the wildcard.


rhc54 commented Mar 5, 2017

Looking at the ext1 component, I think all you have to do is replace this line (number 451) in the client:

        p.rank = proc->vpid;

with:

        if (OPAL_VPID_WILDCARD == proc->vpid) {
            p.rank = my_proc.rank;
        } else {
            p.rank = proc->vpid;
        }

@rhc54 rhc54 closed this in #3104 Mar 6, 2017
bosilca pushed a commit to bosilca/ompi that referenced this pull request Mar 16, 2017
…tored against wildcard rank in the cray, s1, and s2 components, and that the ext1 component translates all wildcard rank requests into the peer's rank since v1.x of PMIx doesn't understand wildcard ranks

Closes open-mpi#3101

Signed-off-by: Ralph Castain <[email protected]>
thananon pushed a commit to thananon/ompi that referenced this pull request Mar 31, 2017
markalle pushed a commit to markalle/ompi that referenced this pull request Apr 5, 2017