
WIP: config: default to DSO for plugins with external deps #8800


Closed


acolinisi
Contributor

With the new default of linking all plugins statically, the dependencies of all the plugins (e.g. transport plugins) become direct dependencies of top-level binaries (e.g. opal_wrapper, aka mpicc).

These direct dependencies are harmful because they are false: the dependency is unnecessary for running most of the binaries.
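
For illustration, the leaked dependency can be seen by inspecting the wrapper binary with ldd; the output below is a hypothetical sketch, not captured from a real build:

	$ ldd $(which mpicc) | grep ugni
	        libugni.so.0 => not found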

But if that's not sufficient justification, the current default behavior also
differs from the old behavior in a user-unfriendly way:

	mpicc --version
	mpicc: error while loading shared libraries:
	libugni.so.0: cannot open shared object file: No such file or directory

The workaround the user has to figure out is either to rebuild with --enable-mca-dso, for all components or for a specific set of them, or to set LD_LIBRARY_PATH system-wide or for each invocation of mpicc and every other binary; neither is reasonable.
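
For concreteness, a hedged sketch of both workarounds; the component name and library path here are illustrative, not recommendations:

	# Workaround 1: rebuild the offending components as DSOs
	# (components are named framework-component, e.g. btl-ugni):
	./configure --enable-mca-dso=btl-ugni
	make && make install

	# Workaround 2: point the loader at the external library on
	# every invocation (the path is illustrative):
	LD_LIBRARY_PATH=/opt/cray/ugni/default/lib64:$LD_LIBRARY_PATH mpicc --version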

In openpmix/prrte#871 it was decided that a more user-friendly behavior can be provided by implementing a smart default for which components should be compiled in DSO mode. This PR implements that default.

One more component that should also default to DSO is btl-sm, but only when Cray xpmem is found (it is enabled by default). This would require some kind of conditional logic, so this PR is incomplete.
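
A minimal sketch of that conditional logic, written as the shell it would reduce to in the generated configure script; the variable names (btl_sm_have_xpmem, btl_sm_dso_default) are hypothetical, not existing OMPI symbols:

	# Default btl/sm to DSO only when xpmem support was detected,
	# since xpmem drags in an external shared library.
	if test "$btl_sm_have_xpmem" = "yes"; then
	    btl_sm_dso_default=1    # build btl/sm as a loadable DSO
	else
	    btl_sm_dso_default=0    # keep btl/sm linked statically
	fi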

See this issue for rationale: openpmix/prrte#871

This patch to OMPI is a twin to the patches that were already committed to
PRRTE and PMIx:
PRRTE e3914b0ad0df30939d329fe615a0ec37e64dcf2a
PMIx 0b8f747

Signed-off-by: Alexei Colin <[email protected]>

@ompiteam-bot

Can one of the admins verify this patch?

@rhc54
Contributor

rhc54 commented Apr 11, 2021

ok to test

@acolinisi
Contributor Author

@rhc54 what do you say about this part:

One more component that should also default to DSO is btl-sm, but only when Cray xpmem is found (it is enabled by default). This would require some kind of conditional logic, so this PR is incomplete.

acolinisi added a commit to ISI-apex/casper-ebuilds that referenced this pull request Apr 11, 2021
Plugins that themselves depend on an external .so library must not be
linked statically, because that external dependency becomes a dependency
of all MPI binaries, including mpicc, etc.

In the latest master this is already the default, but we didn't bump
the snapshot yet.

See open-mpi/ompi#8800
acolinisi added a commit to ISI-apex/casper-ebuilds that referenced this pull request Apr 11, 2021
That patch is incomplete though (btl-sm needs to also be DSO if and only if
xpmem is used), so we won't be able to rely on this default. A commit to the
ebuild that explicitly sets --enable-mca-dso is coming soon.

Also see: open-mpi/ompi#8800
@bwbarrett
Member

bwbarrett commented Apr 12, 2021

This was an intentional behavior choice on the part of the Open MPI community. This patch should not be accepted.

We should think about how to minimize dependency leakage for the wrapper compiler, but this approach is not the way to do that.

@alexeicolin

This was an intentional behavior choice on the part of the Open MPI community. This patch should not be accepted.

We should think about how to minimize dependency leakage for the wrapper compiler, but this approach is not the way to do that.

Ok. There seems to be ambiguity on what the community decided. These commits suggest the opposite decision:

PRRTE e3914b0ad0df30939d329fe615a0ec37e64dcf2a
PMIx 0b8f747

@rhc54
Contributor

rhc54 commented Apr 12, 2021

Ok. There seems to be ambiguity on what the community decided

Two different communities with different needs, I'm afraid. The biggest difference is that no application ever links against PRRTE, while applications do link against OMPI.

@bwbarrett
Member

FWIW, I'm not sure that Ralph's fix in PRRTE or PMIx was the right solution to that problem, either. If a customer configures OMPI with --enable-mca-static, we can't have a broken build, and that's what that patch suggests we'd have.

@rhc54
Contributor

rhc54 commented Apr 12, 2021

Let's chat about it this week - I may have misunderstood how that option worked. In the case you cite, I would have expected the OMPI-PRRTE and OMPI-PMIx integration to pass appropriate configure options to "do the right thing". If that isn't going to happen, then yes, we might have a problem.

@alexeicolin

alexeicolin commented Apr 12, 2021

In this comment openpmix/prrte#871 (comment) Josh essentially said that users should not compile with --enable-mca-static on certain systems. So the build would be broken on some systems, but it would be the user's fault for having passed --enable-mca-static on those systems.

The question of whether the default behavior should try to be helpful seems somewhat orthogonal to the underlying issue that a build with all components static won't work on some systems. If the user did not pass any flags, what is gained by producing a broken build for them by default? Changing the default as Ralph did seems like a no-loss proposition: the build is still broken with --enable-mca-static, but at least it is not broken by default.

Default A: unless instructed otherwise with --enable-mca*, link everything statically everywhere.
Default B: unless instructed otherwise with --enable-mca*, link statically components that can sensibly be linked statically.

I don't see any issues with Default B, and Default B seems superior to Default A: with B you'll get a prted and mpicc that work out-of-the-box by default on all systems. It's much more tolerable to produce broken binaries in response to the user choosing flags incompatible with their system, than it is to produce broken binaries by default and expect the user to figure out the right flags that yield working binaries.
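
In configure terms, Default B would behave roughly as if configure implicitly supplied a DSO list covering the components with external dependencies; a hypothetical illustration (the component list is made up):

	# Default A: plain configure links every component statically.
	./configure

	# Default B: as if something like this had been passed
	# automatically for components with external shared-library deps:
	./configure --enable-mca-dso=btl-ugni,btl-uct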

P.S. I'm not trying to defend this PR. I'm happy to pass the flag explicitly; just thinking about what the best approach would be.

@gpaulsen
Member

@acolinisi @alexeicolin - can we close this PR?

@bwbarrett
Member

Again, this was a decision the Open MPI community made. A Cray system is an example of a system where DSOs directly and negatively impact performance. The right solution isn't to try to change defaults on DSOs, but to fix the library dependency problem for mpicc. We always knew that we'd have to fix these things; they're real problems for large systems that need fixing.

I strongly disagree with the approach the PMIx team has taken here; for large systems, DSOs are a problem and we need to sort out dependencies to avoid relying on DSOs, rather than sticking our head in the sand.

@rhc54
Contributor

rhc54 commented Jan 12, 2022

I strongly disagree with the approach the PMIx team has taken here; for large systems, DSOs are a problem and we need to sort out dependencies to avoid relying on DSOs, rather than sticking our head in the sand.

I don't have an issue with a different approach - I was simply trying to "stick a finger in the dyke" as it was causing problems and complaints from a number of places. I'd welcome a better alternative if someone can develop it.

@alexeicolin

@acolinisi @alexeicolin - can we close this PR?

Please go ahead and close. All I could contribute is the opinion from a user's perspective, summarized in my comment #8800 (comment)

@bwbarrett
Member

@alexeicolin I don't think anyone is disputing that we need to fix this issue; we need to fix it before launch. It's how to fix it that we're arguing about. There's absolutely no reason that mpicc/mpirun/etc. NEED to depend on libugni.so. Given the DSO load scalability problems, we need to break that dependency by not linking mpicc against the same library set as the MPI application, not by creating more artifacts that don't scale.

For what it's worth, the PMIx developers agree with this sentiment; they just wanted to unbreak things faster for their users, which means they're only going to fix the core issue of scalability around DSO loading because OMPI is holding the line. I thought we had an issue on this, but clearly we don't. I've cut #9869 and made it a blocker for the 5.0 release.
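
To make the "mpicc should not need the transport libraries" point concrete: the wrapper's job is essentially to compose compiler and linker flags, which Open MPI exposes through the wrappers' --showme options; the output below is a hypothetical sketch and varies by install:

	$ mpicc --showme:link
	-L/usr/local/lib -lmpi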

@bwbarrett bwbarrett closed this Jan 13, 2022