-
Notifications
You must be signed in to change notification settings - Fork 124
Summit fix for libstc++ #564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... I think it's a good idea to remove this code and see what happens. Back when I added this code, we got a lot of problem reports related to libstdc++
version mixes. I assume version mixes are still a problem, but maybe it doesn't happen anymore in practice, or not as often?
The world has changed a lot. A few months ago I saw an article (also already old, but newer than this code) that recommends strongly against RTLD_GLOBAL
(the 0x2
in sys.setdlopenflags(0x100|0x2)
here). I don't know how those recommendations play with modern Boost.Python. pybind11 doesn't need RTLD_GLOBAL
, and even hides all symbols by default. Following that modern role model is probably best, if feasible, and will probably allow you to have a mix of libstdc++
versions.
I am afraid I had no idea that code was there! No educated opinion on this, sorry. |
Thanks Luc! Ralf's review yesterday was quite informative. What I know is
that the ldd test caused problems for us at Oak Ridge (Summit), and
removing the code still seems to pass all the tests. I'll wait for
additional comment till Wednesday and then merge.
Nick
…On Tue, Nov 17, 2020 at 8:19 AM Luc J. Bourhis ***@***.***> wrote:
I am afraid I had no idea that code was there!
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#564 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADQ24VXLUFHRYAS7J3BYAXTSQKPBLANCNFSM4TXSSFKQ>
.
|
Thinking about it a bit more, I'm beginning to feel a bit uneasy: if you actually are mixing |
Hmm, I am afraid I am also not going to be overly helpful here. My
knowledge of the dynamic linker is not sufficiently deep to say what can go
wrong.
…On Tue, Nov 17, 2020 at 4:46 PM Ralf W. Grosse-Kunstleve < ***@***.***> wrote:
Thinking about it a bit more, I'm beginning to feel a bit uneasy: if you
actually *are* mixing libstdc++ versions, while also using RTLD_GLOBAL,
you may get lucky today, but run into weird, hard-to-debug situations
later. Essentially, this PR is removing a warning light from the dashboard.
Maybe. My understanding of how dlopen works is too incomplete to really
know what could bite you how.
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#564 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABBQ4DOAKQEC3TORTW45EOLSQKSGLANCNFSM4TXSSFKQ>
.
|
OK, I think we may need a deeper analysis; perhaps Billy & Johannes can lend a hand. One very bad element of the current design is that it seemingly forks multiple (potentially hundreds) of subprocesses at run time, killing performance on MPI-based applications. Are there alternatives, including 1) prescreening all our shared objects with a separate command, 2) making the check optional with an environment flag, or 3) removing the RTLD_GLOBAL behavior altogether? |
Your 2) is probably quickest [with or without 4)]. Assuming you're still using something like libtbx/configure.py, use a command-line flag or some hint from the environment to figure out when to define the environment flag. Then just keep in the back of your mind that in theory you could have binary incompatibilities (mix of libstdc++ versions) but hope for the best :-) |
We are migrating to using One ABI issue (came up with building Rosetta for Phenix) was the ABI change in GCC 5.1, so we could not link code built using an earlier version of GCC with anything built with GCC >= 5.1. I would also put myself in the camp of not having a sufficiently deep understanding of everything in the linked article, but we are testing development builds with different compilers on Azure Pipelines using dependencies from I think the issue with Summit and other HPC systems is that the compilers are usually customized for that particular system and I think the expectation is that all the code is compiled with the same compiler. @JBlaschke would know more. |
I needed to ruminate over all the points raised here -- the topic is complex I think the points raised here are all good. However, if we want a solution that scales (at all, not just well) then we need to address this. Here are my recommendations:
I might need @bkpoon's help to "wire up" the existing Here is my reasoning (all points referencing the list above):
|
@rwgk thanks for the article -- I share your concern, and had to think about this. But I think we need to give some degree of flexibility here to HPC users -- mainly because "swapping out" stl's is still standard practice for things like code injection, or how some compiler support for advanced features. I wish it wasn't. Regardless of how we proceed, I think that this check should happen at compile time. Or are you concerned with the user changing |
I think, as long as (**) Only if the one |
Ugh, sorry ... the check is in
If that's still causing an issue, make the |
Wouldn't that still run once per MPI rank, which will prevent scaling? Do we expect the Also: do we need actually |
I'm afraid there are too many things I'm uncertain about, including what exactly MPI rank is, and how you guarantee that Initially I understood the
You don't even have to change the rest of the logic because If you are absolutely certain that nobody ever links Python with If you decide to keep the protection, I think the suggestion from yesterday is safe. I'd definitely go for it. In retrospect I think I was too happy back then to have plugged a hole, not thinking enough about the runtime impact as the number of extension grows.
At some point in the early 2000s, yes, I could figure out a different way back then. Now: IDK |
@rwgk I hope you're having a good christmas vacation. I spend some time thinking over your suggestions, unfortunately I still think that they are not future-proof. My objective is to eliminate as many edge cases in the behavior of CCTBX -- for reference: the
Right, so we are creating an MPI rank per CPU (i.e. MPI-managed thread) on Summit. Therefore there are potentially tens-of-thousands (more on exascale machines). Hence if each thread calls My last conversation with @bkpoon suggested that
There is no guaranteed environment variable that is set by MPI -- since MPI is a standard, so different implementations will define different things. What you suggest, could be implemented by querying the MPI Comm size, but that's dicey as it would mean importing the
A quick google has revealed anything definitive -- thanks for the explanation, as a |
This discussion has become long and a bit convoluted now, so I want to bring to back to the LCD: We have three solutions:
[1]: IDK if that's an actual issue. I have seen highly variable run times on Cori, and it might be the |
I'm sorry, I feel my contributions here backfired badly. This is something I'd spend no more than a couple hours on myself before doing something simple to close the case. https://www.open-mpi.org/faq/?category=running#mpi-environmental-variables "Open MPI guarantees that these variables will remain stable throughout future releases" If that's not doing the trick, maybe removing the code for all environments is the right thing to do ... unless someone already knows for sure that there are environments that link the python binary with libstdc++. |
No worries @rwgk -- I found this conversation enlightening though: when we started considering this PR, we didn't know why this code was here in the first place. I would therefore vote for pt. 1 (check library versions at compile time only).
In case anyone else reads this and goes off on a wrong tangent: Your solution will work for OpenMPI -- that doesn't make the code portable though since different implementations (OpenMPI, MPICH, MVAPICH, Cray MPICH) will define different environment variables (to the best of my knowledge, the MPI standard does not define environment variable names -- missed opportunity?). |
Slightly OT but ...
I'd be fascinated to hear your experiences about this sometime - I spend a lot of time banging my head against |
What were the things you were struggling with? Just curious.
It's a mixed bag. For anyone starting fresh I'd definitely recommend using pybind11, although I've discovered some serious issues with the quality of implementation***. For that specific aspect, I'd say Boost.Python still has the clear lead. But we're actively working on catching up! |
Hey all -- especially @nksauter, @bkpoon and @rwgk I've just found some time to put what we discussed here in practice. An environment variable: I recommend that @bkpoon look over my changes as they might not be cctbx-y. |
@bkpoon It seems that some CI is failing. I understand the python v2 errors -- is there an elegant way to make this python v2-compatible? I don't understand the other errors, maybe you can take a look, if you have time. |
libtbx/env_config.py
Outdated
def is_true(x): | ||
yes = {"1", "on", "true"} | ||
return (x.strip().lower() in yes) | ||
|
||
if is_true(os.environ.get("CCTBX_CHECK_LDD", "True")): | ||
print("Checking that extension modules and python executabel link to same version of libstdc++") | ||
print("Set CCTBX_CHECK_LDD=false to disable this step") | ||
check_libcpp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably only need to check that CCTBX_CHECK_LDD
is set, not the specific value. So maybe something like
if os.environ.get("CCTBX_CHECK_LDD", None) is not None:
instead of the is_true
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that will accept an empty string. Just use
if os.getenv("CCTBX_CHECK_LDD"):
Also: *executable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and that (and your comment) inverses the logic, so make it CCTBX_SKIP_LDD_CHECK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I prefer to be more strict so that things like CCTBX_CHECK_LDD=""
or CCTBX_CHECK_LDD="off"
won't accidentally trigger the check -- I know we can use unset
, but it's easy to overlook a blank variable using simple echo $CCTBX_CHECK_LDD
(i.e. there is no quick one-liner in bash to check for nullity).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, are we always running this check on linux or only when the environment variable is set? At the start of def check_libcpp
, there is a check for linux
that returns True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, it is run regardless of the environment variable being set. But if it is set, it only runs if it's "1", "on", or "true" (case insensitive)? If that's the case, then it's fine. The message explains how to turn it off. But the earlier check of linux seems to be inverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes the earlier check was inverted -- thanks for catching that -- once that's fixed, the test is on by default on linux (unless it's turned off).
Oh, and Python 2 does not support f-strings. And for the |
How do I check this on Summit? What's the command to test on the compute nodes? |
Do you have an OLCF account? |
Yep, I got around to renewing my account for the project. |
Thanks, I replaced the f-strings with their non-f-string counterparts. This should fix the python 2 compatibility issues. Just waiting to see what the CI tests show us. |
What exactly do you want to test? If you want to reproduce the original issue ( If you just want to test this one a login node, then all you need to do is install cctbx (I use the build scripts from here: https://github.com/JBlaschke/cctbx_deployment -- but that's probably overkill for you). And then run If you want to test this on a management node, run: |
@dermen is running into this also. So I would like to push all y'all to get this merged ASAP |
From group meeting today, we will move libcpp compatibility check to a separate function with a corresponding dispatcher that acts as a test. Goal is to merge by end of the week. |
…n cause failures on summit
Co-authored-by: Billy K. Poon <[email protected]>
Allright, i've moved this test into a seperate |
@bkpoon I can't seem to satisfy the syntax checker -- it insists that:
however, if I do add this import, it complains that it is unused. What should I do? |
You had more than one space between |
* do not use ldd to check the libstc++ version in import-ext <= this can cause failures on summit * check for libstdc++ during libtbx.refresh * Update libtbx/env_config.py * f-string -> non-f-string for backward compatibility * set up a seperate libstdc++ test called libtbx.check_libcpp Co-authored-by: Nicholas Sauter <[email protected]> Co-authored-by: Johannes Blaschke <[email protected]> Co-authored-by: Billy K. Poon <[email protected]>
Problem statement: On Summit, "import boost.python" fails.
This is likely due to the practice, in high-performance computing centers, of linking the Python extension modules against one stdc++ library, whereas the system Python is linked against a different library behind the scene.
Solution: do not use ldd to check the libstc++ version in import-ext
The referenced code is part of the original cctbx design, but is likely not applicable to modern systems.
Removing it will save an external call to "ldd" during every boost python import.