Summit fix for libstc++ #564

Merged: 11 commits into master from libstdcpp on Jul 28, 2021

Conversation

nksauter
Contributor

Problem statement: On Summit, "import boost.python" fails.
This is likely due to the practice, in high-performance computing centers, of linking the Python extension modules against one libstdc++ library, whereas the system Python is linked against a different library behind the scenes.
Solution: do not use ldd to check the libstdc++ version in import_ext.
The referenced code is part of the original cctbx design, but is likely not applicable to modern systems.
Removing it will save an external call to "ldd" during every boost python import.
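
For context, here is a rough sketch of the kind of ldd-based runtime check being removed. This is illustrative only; the helper name libstdcxx_of is not an actual cctbx identifier, although python_libstdcxx_so is referenced later in the thread.

import re
import subprocess
import sys

def libstdcxx_of(path):
  # Return the libstdc++.so.* path reported by "ldd <path>", or None.
  out = subprocess.check_output(["ldd", path]).decode()
  m = re.search(r"libstdc\+\+\.so\S*\s+=>\s+(\S+)", out)
  return m.group(1) if m else None

# One ldd call against the interpreter when boost.python support is set up ...
python_libstdcxx_so = libstdcxx_of(sys.executable)

def check_extension(ext_path):
  # ... and another ldd call for every Boost.Python extension that gets imported.
  ext_libstdcxx_so = libstdcxx_of(ext_path)
  if (python_libstdcxx_so and ext_libstdcxx_so
      and ext_libstdcxx_so != python_libstdcxx_so):
    raise RuntimeError("libstdc++ mismatch: %s vs %s"
                       % (ext_libstdcxx_so, python_libstdcxx_so))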

Contributor

@rwgk rwgk left a comment

Hm... I think it's a good idea to remove this code and see what happens. Back when I added this code, we got a lot of problem reports related to libstdc++ version mixes. I assume version mixes are still a problem, but maybe it doesn't happen anymore in practice, or not as often?

The world has changed a lot. A few months ago I saw an article (also already old, but newer than this code) that recommends strongly against RTLD_GLOBAL (the 0x100 in sys.setdlopenflags(0x100|0x2) here). I don't know how those recommendations play with modern Boost.Python. pybind11 doesn't need RTLD_GLOBAL, and even hides all symbols by default. Following that modern role model is probably best, if feasible, and will probably allow you to have a mix of libstdc++ versions.
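
For reference, a minimal sketch (assuming Linux/glibc flag values, where RTLD_GLOBAL is 0x100 and RTLD_NOW is 0x2) of the dlopen-flag choices being discussed; this is not the actual boost/python.py code.

import os
import sys

# The hard-coded flags mentioned above:
sys.setdlopenflags(0x100 | 0x2)  # RTLD_GLOBAL | RTLD_NOW on Linux/glibc

# The same thing written with symbolic constants:
sys.setdlopenflags(os.RTLD_GLOBAL | os.RTLD_NOW)

# The pybind11-style alternative: keep each extension's symbols local.
sys.setdlopenflags(os.RTLD_LOCAL | os.RTLD_NOW)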

@luc-j-bourhis
Contributor

luc-j-bourhis commented Nov 17, 2020

I am afraid I had no idea that code was there! No educated opinion on this, sorry.

@nksauter
Contributor Author

nksauter commented Nov 17, 2020 via email

@rwgk
Contributor

rwgk commented Nov 17, 2020

Thinking about it a bit more, I'm beginning to feel a bit uneasy: if you actually are mixing libstdc++ versions, while also using RTLD_GLOBAL, you may get lucky today, but run into weird, hard-to-debug situations later. Essentially, this PR is removing a warning light from the dashboard. Maybe. My understanding of how dlopen works is too incomplete to really know what could bite you how.

@nksauter nksauter requested a review from JBlaschke November 17, 2020 16:52
@gbunkoczi
Contributor

gbunkoczi commented Nov 17, 2020 via email

@nksauter
Contributor Author

OK, I think we may need a deeper analysis; perhaps Billy & Johannes can lend a hand. One very bad element of the current design is that it seemingly forks multiple subprocesses (potentially hundreds) at run time, killing performance on MPI-based applications. Are there alternatives, such as 1) prescreening all our shared objects with a separate command, 2) making the check optional with an environment flag, or 3) removing the RTLD_GLOBAL behavior altogether?

@rwgk
Contributor

rwgk commented Nov 17, 2020

  4. Link against the same libstdc++ (I don't know how feasible this is).

Your 2) is probably quickest [with or without 4)]. Assuming you're still using something like libtbx/configure.py, use a command-line flag or some hint from the environment to figure out when to define the environment flag. Then just keep in the back of your mind that in theory you could have binary incompatibilities (mix of libstdc++ versions) but hope for the best :-)

@bkpoon
Member

bkpoon commented Nov 17, 2020

We are migrating to using conda for distribution (there are already cctbx-base and cctbx packages on the conda-forge channel). That means that everything will be compiled with the same compiler and linked to the same libstdc++.

One ABI issue (it came up when building Rosetta for Phenix) was the ABI change in GCC 5.1, so we could not link code built using an earlier version of GCC with anything built with GCC >= 5.1.

I would also put myself in the camp of not having a sufficiently deep understanding of everything in the linked article, but we are testing development builds with different compilers on Azure Pipelines using dependencies from conda. The build is structured so that we should not be mixing libstdc++ (e.g. Boost is compiled from source instead of using the Boost conda package). At runtime, a newer libstdc++ may be used, but we have always done that with the binary installers (i.e. we were not shipping our own libstdc++). I think this would be more of a problem if we were trying to link clang compiled code with gcc compiled code where the ABI might be different.

I think the issue with Summit and other HPC systems is that the compilers are usually customized for that particular system and I think the expectation is that all the code is compiled with the same compiler. @JBlaschke would know more.

@JBlaschke
Contributor

I needed to ruminate over all the points raised here -- the topic is complex, and I think the points are all good. However, if we want a solution that scales (at all, not just well), then we need to address this. Here are my recommendations:

  1. Accept the code change in this PR
  2. Add a libstdc++ check at compile time -- this check already exists, so it just needs to be included in either the build or the testing script.
  3. Enable a mode where mixing libstdc++ versions is permitted

I might need @bkpoon's help to "wire up" the existing import_ext tests with the SConscript, but 2 and 3 should be pretty easy, right?

Here is my reasoning (all points referencing the list above):

  1. for pt (1): line 50 is really bad for scaling: ldd (and subprocess for that matter) is not meant to scale up to massive parallelism -- we see this on Summit: at about 220 MPI ranks, it just quits on us.
  2. for pt (1): (this is more for idealism) libstdc++ won't start changing while we run our application -- so why check it every time we import a boost extension? This is something that should be checked at compile time (or when the user runs a check).
  3. for pt (2): I share everyone's concern about mixing libstdc++ versions. I don't fully understand why this was a problem in the past -- a naive reading could be "just use RPATHs and elf-fiddling to point the dynamic linker to the 'correct' library versions for each extension module and be done with it". But this is very tricky to get right, and will depend on which dynamic linker the target system is using. So I understand the need for a "non-expert" mode which enforces a strict standard which guarantees correct behavior.
  4. for pt (3): pt (2) won't cut it for some of these more niche applications -- think of modules compiled with clang, or pgi, needing to be loaded by python (which is frequently built using gcc), all while using libraries targeting specialized hardware. This is the reality for cutting edge HPC. But I think it's safe to say that those users know what they are doing (or at least, enjoy debugging these sort of problems).

@JBlaschke
Contributor

Hm... I think it's a good idea to remove this code and see what happens. Back when I added this code, we got a lot of problem reports related to libstdc++ version mixes. I assume version mixes are still a problem, but maybe it doesn't happen anymore in practice, or not as often?

The world has changed a lot. A few months ago I saw an article (also already old, but newer than this code) that recommends strongly against RTLD_GLOBAL (the 0x100 in sys.setdlopenflags(0x100|0x2) here). I don't know how those recommendations play with modern Boost.Python. pybind11 doesn't need RTLD_GLOBAL, and even hides all symbols by default. Following that modern role model is probably best, if feasible, and will probably allow you to have a mix of libstdc++ versions.

Thinking about it a bit more, I'm beginning to feel a bit uneasy: if you actually are mixing libstdc++ versions, while also using RTLD_GLOBAL, you may get lucky today, but run into weird, hard-to-debug situations later. Essentially, this PR is removing a warning light from the dashboard. Maybe. My understanding of how dlopen works is too incomplete to really know what could bite you how.

@rwgk thanks for the article -- I share your concern, and had to think about this. But I think we need to give some degree of flexibility here to HPC users -- mainly because "swapping out" STLs is still standard practice for things like code injection, or for how some compilers support advanced features. I wish it wasn't.

Regardless of how we proceed, I think that this check should happen at compile time. Or are you concerned with the user changing LD_LIBRARY_PATH? (setting an RPATH should guard against that).

@rwgk
Contributor

rwgk commented Dec 7, 2020

I think, as long as RTLD_GLOBAL is used, I'd definitely keep the ldd check in general(**), but I'd move it to boost/python.py where the boost_python_meta_ext is imported. If that passes the check, and assuming all extensions are built consistently, the originally desired protection is still practically complete.

(**) Only if the one ldd per process startup is still somehow causing difficulties on summit, I'd think about avoiding that as well, probably via inspection of environment variables.

@rwgk
Contributor

rwgk commented Dec 7, 2020

I think, as long as RTLD_GLOBAL is used, I'd definitely keep the ldd check in general(**), but I'd move it to boost/python.py where the boost_python_meta_ext is imported. If that passes the check, and assuming all extensions are built consistently, the originally desired protection is still practically complete.

(**) Only if the one ldd per process startup is still somehow causing difficulties on summit, I'd think about avoiding that as well, probably via inspection of environment variables.

Ugh, sorry ... the check is in boost/python.py already, so implementing the run-only-once idea could be super simple, boiling down to:

...
def import_ext(name, optional=False, check_libstdcxx_so=False):
...
  if (check_libstdcxx_so and python_libstdcxx_so is not None):
...
ext = import_ext("boost_python_meta_ext", check_libstdcxx_so=True)

If that's still causing an issue, make the True in the last line above conditional on the environment (and also avoid the check for sys.executable).
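
Continuing that pseudocode, a minimal sketch of what "conditional on the environment" could look like; the LIBTBX_NO_LDD_CHECK variable name is purely hypothetical and used only for illustration.

import os

# Hypothetical opt-out variable (illustration only, not an agreed-upon name):
check = "LIBTBX_NO_LDD_CHECK" not in os.environ
ext = import_ext("boost_python_meta_ext", check_libstdcxx_so=check)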

@JBlaschke
Contributor

Wouldn't that still run once per MPI rank, which will prevent scaling? Do we expect the libstdc++ version to change between runs? We seem to already have a test for this sort of thing (at least one that we can build on) here: libtbx.import_all_ext -- my suggestion would be to run this test in libtbx.refresh.

Also: do we actually need RTLD_GLOBAL?

@rwgk
Contributor

rwgk commented Dec 8, 2020

Wouldn't that still run once per MPI rank, which will prevent scaling? Do we expect the libstdc++ version to change between runs? We seem to already have a test for this sort of thing (at least one that we can build on) here: libtbx.import_all_ext -- my suggestion would be to run this test in libtbx.refresh.

I'm afraid there are too many things I'm uncertain about, including what exactly MPI rank is, and how you guarantee that libtbx.refresh is run after the extensions are built, therefore my rather conservative suggestions.

Initially I understood that the ldd command is run for each imported extension on Summit; now I'm uncertain about that. Staying with my conservative approach for a second: I'd look for an environment variable that's typical for the MPI environment, then change one line, in addition to my suggestion from yesterday:

if (sys.platform.startswith("linux") and "MPI_SOMETHING" not in os.environ):

You don't even have to change the rest of the logic because python_libstdcxx_so will be None, which will guard the second ldd command inside import_ext. That way you have the behavior you want on summit, without throwing out the protection completely for all other environments.

If you are absolutely certain that nobody ever links Python with libstdcxx in any environment, you can delete the code completely.

If you decide to keep the protection, I think the suggestion from yesterday is safe. I'd definitely go for it. In retrospect I think I was too happy back then to have plugged a hole, not thinking enough about the runtime impact as the number of extension grows.

Also: do we actually need RTLD_GLOBAL?

At some point in the early 2000s, yes; I could not figure out a different way back then. Now: IDK.
Boost.Python has a central registry that needs to be globally visible. I know that's possible even with RTLD_LOCAL in principle, because that's what pybind11 is doing, but I don't know if it will need tweaks in Boost.Python, or if someone implemented what's needed already.

@JBlaschke
Contributor

@rwgk I hope you're having a good Christmas vacation. I spent some time thinking over your suggestions; unfortunately I still think that they are not future-proof. My objective is to eliminate as many edge cases as possible in the behavior of CCTBX -- for reference: the ldd problem on Summit took several days to track down (since it was intermittent, and an unexpected design). The objective of this PR is to avoid that for the next generation of machines.

I'm afraid there are too many things I'm uncertain about, including what exactly MPI rank is, and how you guarantee that libtbx.refresh is run after the extensions are built, therefore my rather conservative suggestions.

Right, so we are creating an MPI rank per CPU (i.e. an MPI-managed thread) on Summit. Therefore there are potentially tens of thousands of ranks (more on exascale machines). Hence, even if each rank calls ldd only once, that would still be too much.

My last conversation with @bkpoon suggested that libtbx.refresh is called every time the dispatchers are generated.

I'd look for an environment variable that's typical for the MPI environment, then change one line, in addition to my suggestion from yesterday

There is no guaranteed environment variable that is set by MPI -- MPI is a standard, so different implementations will define different things. What you suggest could be implemented by querying the MPI communicator size, but that's dicey as it would mean importing the mpi4py module (or at least trying to). It would also radically change the way import_ext works depending on the MPI communicator size. This might not sound too bad at first, but many MPI programs are debugged by running in single-threaded mode. Personally I feel that this approach is going in the wrong direction, as this code is meant to guard against unusual behavior.
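
To make that concrete, here is a sketch of the approach being argued against (assuming mpi4py is installed); note that merely importing mpi4py.MPI typically initializes MPI, which is part of the objection.

def running_under_mpi():
  # Gate the ldd check on the MPI communicator size -- shown only to
  # illustrate the idea being rejected above.
  try:
    from mpi4py import MPI  # importing this usually initializes MPI already
  except ImportError:
    return False
  return MPI.COMM_WORLD.Get_size() > 1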

At some point in the early 2000s, yes; I could not figure out a different way back then. Now: IDK. Boost.Python has a central registry that needs to be globally visible. I know that's possible even with RTLD_LOCAL in principle, because that's what pybind11 is doing, but I don't know if it will need tweaks in Boost.Python, or if someone implemented what's needed already.

A quick google hasn't revealed anything definitive -- thanks for the explanation; as a pybind11 user I was always shielded from needing this.

@JBlaschke
Contributor

This discussion has become long and a bit convoluted now, so I want to bring it back to the lowest common denominator. We have three solutions:

  1. Check the libstdc++ linkage during compile time (@bkpoon's suggestion is to include it in libtbx.refresh, as that is always called when the dispatchers are built). Pros: straightforward run-time behavior (and possibly faster import_ext times [1]). Cons: this won't catch a libstdc++ version change if the environment differs between login nodes (where code is built) and compute nodes (I think this should be super rare).

  2. Skip the libstdc++ check if mpi4py.MPI.COMM_WORLD.Get_size() > 1 (i.e. only run it in serial). Pros: this will catch environment changes (e.g. in LD_LIBRARY_PATH) that cause the dynamic linker to find the "wrong" libraries at run time. Cons: this leaves a parallel run completely unprotected (as we are not checking the linker during build time either); and it changes the fundamental behavior between parallel and serial programs.

  3. Check libstdc++ only once per MPI rank. Pros: not a big change -- essentially the same level and kind of protection. Cons: doesn't solve the problem at scale.

[1]: IDK if that's an actual issue. I have seen highly variable run times on Cori, and it might be the ldd calls, but I need to confirm this.

@rwgk
Contributor

rwgk commented Dec 26, 2020

I'm sorry, I feel my contributions here backfired badly. This is something I'd spend no more than a couple hours on myself before doing something simple to close the case.

https://www.open-mpi.org/faq/?category=running#mpi-environmental-variables

"Open MPI guarantees that these variables will remain stable throughout future releases"

If that's not doing the trick, maybe removing the code for all environments is the right thing to do ... unless someone already knows for sure that there are environments that link the python binary with libstdc++.

@JBlaschke
Contributor

I'm sorry, I feel my contributions here backfired badly. This is something I'd spend no more than a couple hours on myself before doing something simple to close the case.

No worries @rwgk -- I found this conversation enlightening though: when we started considering this PR, we didn't know why this code was here in the first place. I would therefore vote for pt. 1 (check library versions at compile time only).

https://www.open-mpi.org/faq/?category=running#mpi-environmental-variables

"Open MPI guarantees that these variables will remain stable throughout future releases"

If that's not doing the trick, maybe removing the code for all environments is the right thing to do ... unless someone already knows for sure that there are environments that link the python binary with libstdc++.

In case anyone else reads this and goes off on a wrong tangent: Your solution will work for OpenMPI -- that doesn't make the code portable though since different implementations (OpenMPI, MPICH, MVAPICH, Cray MPICH) will define different environment variables (to the best of my knowledge, the MPI standard does not define environment variable names -- missed opportunity?).

@graeme-winter
Contributor

Slightly OT but ...

A quick google hasn't revealed anything definitive -- thanks for the explanation; as a pybind11 user I was always shielded from needing this.

I'd be fascinated to hear your experiences about this sometime - I spend a lot of time banging my head against boost::python and this has been mooted as a much better solution today (appreciate that the goalposts have moved in the last couple of decades!)

@rwgk
Contributor

rwgk commented Jan 8, 2021

I'd be fascinated to hear your experiences about this sometime - I spend a lot of time banging my head against boost::python

What were the things you were struggling with? Just curious.

and this has been mooted as a much better solution today

It's a mixed bag. For anyone starting fresh I'd definitely recommend using pybind11, although I've discovered some serious issues with the quality of implementation***. For that specific aspect, I'd say Boost.Python still has the clear lead. But we're actively working on catching up!

*** pybind/pybind11#2672 (comment)

@JBlaschke
Contributor

Hey all -- especially @nksauter, @bkpoon and @rwgk

I've just found some time to put what we discussed here into practice. libtbx.refresh now looks at the libtbx.env.lib_path folder and checks the linkage of all the .so files therein. Since libtbx.refresh is called at the end of a build, we're effectively checking with ldd only once, and only at compile time -- or whenever the dispatchers are rebuilt, i.e. whenever linkage can change.

An environment variable, CCTBX_CHECK_LDD, can override this test (if it is set but is not one of "1", "on", or "true", the check is skipped). This is a necessary feature, as Summit's runtime environment links Python strangely on the management nodes (but not on the login or compute nodes -- long story, very strange; TL;DR: conda+Summit=strange -- but I digress). So sometimes there are good reasons to ignore this.

I recommend that @bkpoon look over my changes as they might not be cctbx-y.
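
For readers following along, a hedged sketch of the build-time check described above; the real implementation lives in this PR, and the helper below is illustrative rather than the exact code.

import glob
import os
import re
import subprocess
import sys

def _libstdcxx_of(path):
  # Return the libstdc++.so.* path reported by "ldd <path>", or None.
  out = subprocess.check_output(["ldd", path]).decode()
  m = re.search(r"libstdc\+\+\.so\S*\s+=>\s+(\S+)", out)
  return m.group(1) if m else None

def check_libcpp(lib_path):
  # Compare each extension's libstdc++ against the Python executable's,
  # once per build instead of once per import.
  python_so = _libstdcxx_of(sys.executable)
  mismatches = []
  for ext in glob.glob(os.path.join(lib_path, "*.so")):
    ext_so = _libstdcxx_of(ext)
    if python_so and ext_so and ext_so != python_so:
      mismatches.append((ext, ext_so))
  return mismatches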

@JBlaschke
Contributor

@bkpoon It seems that some CI is failing. I understand the python v2 errors -- is there an elegant way to make this python v2-compatible? I don't understand the other errors; maybe you can take a look if you have time.

Comment on lines 3245 to 3252
def is_true(x):
  yes = {"1", "on", "true"}
  return (x.strip().lower() in yes)

if is_true(os.environ.get("CCTBX_CHECK_LDD", "True")):
  print("Checking that extension modules and python executabel link to same version of libstdc++")
  print("Set CCTBX_CHECK_LDD=false to disable this step")
  check_libcpp()
Member

We probably only need to check that CCTBX_CHECK_LDD is set, not the specific value. So maybe something like

if os.environ.get("CCTBX_CHECK_LDD", None) is not None:

instead of the is_true function.

Member

that will accept an empty string. Just use

if os.getenv("CCTBX_CHECK_LDD"):

Also: *executable

Member

and that (and your comment) inverts the logic, so make it CCTBX_SKIP_LDD_CHECK

Contributor

Personally I prefer to be more strict, so that things like CCTBX_CHECK_LDD="" or CCTBX_CHECK_LDD="off" won't accidentally trigger the check -- I know we can use unset, but it's easy to overlook a blank variable using a simple echo $CCTBX_CHECK_LDD (i.e. there is no quick one-liner in bash to check for nullity).
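
For concreteness, a quick illustration (not code from the PR) of how the three variants treat a variable that is set but empty:

import os

os.environ["CCTBX_CHECK_LDD"] = ""  # e.g. after `export CCTBX_CHECK_LDD=""`

print(os.environ.get("CCTBX_CHECK_LDD", None) is not None)  # True: counts as "set"
print(bool(os.getenv("CCTBX_CHECK_LDD")))                    # False: empty string is falsy
print(os.environ["CCTBX_CHECK_LDD"].strip().lower()
      in {"1", "on", "true"})                                # False: the is_true() behavior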

Member

Wait, are we always running this check on linux or only when the environment variable is set? At the start of def check_libcpp, there is a check for linux that returns True.

Member

@bkpoon bkpoon Jan 27, 2021

Oh, it is run regardless of the environment variable being set. But if it is set, it only runs if it's "1", "on", or "true" (case insensitive)? If that's the case, then it's fine. The message explains how to turn it off. But the earlier check of linux seems to be inverted.

Contributor

Yes, the earlier check was inverted -- thanks for catching that. Once that's fixed, the test is on by default on linux (unless it's turned off).

@bkpoon
Member

bkpoon commented Jan 27, 2021

Oh, and Python 2 does not support f-strings. And for the CI branch pipeline to pass, you'll need to get the latest changes from master. That's why the Full builds work since that pipeline runs on the merged code.

@bkpoon
Member

bkpoon commented Jan 27, 2021

How do I check this on Summit? What's the command to test on the compute nodes?

@JBlaschke
Contributor

How do I check this on Summit? What's the command to test on the compute nodes?

Do you have an OLCF account?

@bkpoon
Member

bkpoon commented Jan 28, 2021

Yep, I got around to renewing my account for the project.

@JBlaschke
Contributor

Oh, and Python 2 does not support f-strings. And for the CI branch pipeline to pass, you'll need to get the latest changes from master. That's why the Full builds work since that pipeline runs on the merged code.

Thanks, I replaced the f-strings with their non-f-string counterparts. This should fix the python 2 compatibility issues. Just waiting to see what the CI tests show us.

@JBlaschke
Contributor

How do I check this on Summit? What's the command to test on the compute nodes?

What exactly do you want to test? If you want to reproduce the original issue (ldd resulting in python segfaulting), then I suggest you run any job on more than 5 nodes.

If you just want to test this on a login node, then all you need to do is install cctbx (I use the build scripts from here: https://github.com/JBlaschke/cctbx_deployment -- but that's probably overkill for you). And then run libtbx.refresh.

If you want to test this on a management node, run: bsub -W 0:10 -nnodes 1 -P <your project id> -Is /bin/bash and then run libtbx.refresh. If you want to actually have this run on a compute node, run the bsub command above, but then run jsrun -n 1 -a 1 libtbx.refresh.

@JBlaschke
Contributor

JBlaschke commented Jul 8, 2021

@dermen is running into this also. So I would like to push all y'all to get this merged ASAP

@phyy-nx
Contributor

phyy-nx commented Jul 21, 2021

From group meeting today, we will move libcpp compatibility check to a separate function with a corresponding dispatcher that acts as a test. Goal is to merge by end of the week.

@JBlaschke
Contributor

Alright, I've moved this test into a separate libtbx.check_libcpp test, and removed it from libtbx.refresh. The idea is that users who suspect that they are encountering libstdc++ version compatibility issues can run the check themselves. I'll wait until CI finishes running.

@JBlaschke
Contributor

@bkpoon I can't seem to satisfy the syntax checker -- it insists that:

libtbx/command_line/check_libcpp.py: missing 'from __future__ import division'

however, if I do add this import, it complains that it is unused. What should I do?

@bkpoon
Member

bkpoon commented Jul 26, 2021

You had more than one space between from and __future__. My commit should work.

@phyy-nx phyy-nx merged commit e6ee44e into master Jul 28, 2021
@phyy-nx phyy-nx deleted the libstdcpp branch July 28, 2021 20:44
russell-taylor pushed a commit to ReliaSolve/cctbx_project that referenced this pull request Aug 11, 2021
* do not use ldd to check the libstc++ version in import-ext <= this can cause failures on summit
* check for libstdc++ during libtbx.refresh
* Update libtbx/env_config.py
* f-string -> non-f-string for backward compatibility
* set up a seperate libstdc++ test called libtbx.check_libcpp

Co-authored-by: Nicholas Sauter <[email protected]>
Co-authored-by: Johannes Blaschke <[email protected]>
Co-authored-by: Billy K. Poon <[email protected]>