Enable building both an interpreter that statically links libpython and a shared library too #133312
Comments
…eter This option changes the behavior of --enable-shared to continue to build the libpython3.x.so shared library, but not use it for linking the python3 interpreter executable. Instead, the executable is linked directly against the libpython .o files as it would be with --disable-shared. There are two benefits of this change. First, libpython uses thread-local storage, which is noticeably slower when used in a loaded module instead of in the main program, because the main program can take advantage of constant offsets from the thread state pointer but loaded modules have to dynamically call a function __tls_get_addr() to potentially allocate their thread-local storage area. (There is another thread-local storage model for dynamic libraries which mitigates most of this performance hit, but it comes at the cost of preventing dlopen("libpython3.x.so"), which is a use case we want to preserve.) Second, this improves the user experience around relocatable Python a little bit, in that we don't need to use an $ORIGIN-relative path to locate libpython3.x.so, which has some mild benefits around musl (which does not support $ORIGIN-relative DT_NEEDED, only $ORIGIN-relative DT_RPATH/DT_RUNPATH), users who want to make the interpreter setuid or setcap (which prevents processing $ORIGIN), etc.
Hi! The 2002 Debian change predated the use of See:
(Apologies if you're already aware of `-fno-semantic-interposition`, but there was no mention of how this change intersects with it in the OP, when it was the first thing that came to mind when I saw this proposal... :-) )
Oh! That would make sense, I was dimly aware of it but I didn't connect the two in my mind, thanks. I think that explains why I'm not seeing a noticeable performance benefit on the "normal" interpreter build but I am seeing it on the free-threaded build; the general slowdown from merely using a dynamic library is gone, but the slowdown on thread-local storage from using a dynamic library is still there, as noted in the benchmark result I posted above. (See my comments on astral-sh/python-build-standalone#592 for some
Feature or enhancement
Proposal:
Right now, if you `./configure --enable-shared`, you get a bin/python3 that uses libpython3.x.so, and if you leave the option out (or explicitly `./configure --disable-shared`), you get a bin/python3 that statically links libpython into itself and no libpython3.x.so.

It's very useful to have a libpython3.x.so available for applications that need it because they embed Python in various ways. At the same time, there are some performance speedups from not having the extra layer of indirection in bin/python3. It would be useful to have a build option that gets you a best-of-both-worlds build (at the cost of more disk space): a bin/python3 that statically links libpython and a libpython3.x.so for other binaries that might need it.
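To see which of these two configurations a given interpreter came from, a minimal sketch along the following lines may help. It only reads the build-time configuration recorded by `sysconfig`; the helper name `describe_libpython_linkage` is made up for illustration, and the values reflect what `./configure` recorded, not necessarily what the binary actually loads at run time.

```python
import sys
import sysconfig


def describe_libpython_linkage():
    """Summarize the build-time libpython settings of the running interpreter.

    Hypothetical helper for illustration; it reports configure-time values only.
    """
    # Py_ENABLE_SHARED is 1 when CPython was configured with --enable-shared.
    # It may be 0 or missing (None) otherwise, so coerce it to a bool.
    enable_shared = bool(sysconfig.get_config_var("Py_ENABLE_SHARED"))
    return {
        "executable": sys.executable,
        "enable_shared": enable_shared,
        # LDLIBRARY names the libpython file produced by the build
        # (may be None on platforms that do not record it).
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        # LIBDIR is the installation directory for libraries.
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    for key, value in describe_libpython_linkage().items():
        print(f"{key}: {value}")
```

On Linux, running `ldd` on the interpreter binary is a more direct check of whether it depends on libpython3.x.so at run time.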
As a data point, this is useful enough that Debian currently does this in their Python package in a roundabout way: they build Python twice, once with `--enable-shared` and once without, and they then assemble the package by taking the libpython3.x.so from the former build and everything else from the latter build. (See, for instance, the debian/rules file for Debian python3 3.13.3-2: in lines 395-412 they do a `--enable-shared` build into `$(buildd_shared)`, in lines 438-450 they do a non-shared build into `$(buildd_static)`, in line 878 they do a `make -C $(buildd_static) install`, and in lines 939-940 they copy libpython3.x.so.1.0 out of `$(buildd_shared)`.)

There is not a particular need to do two separate builds, since the behavior changes from `--enable-shared` happen after most of the compilation has happened, in generating the final binaries. All that needs to happen is that the Makefile builds the interpreter binary the way it would for a static build, and also builds the shared library as it would if it were a dependency of the interpreter.

I have implemented this change and will open a PR momentarily.
Details on the performance benefits: while this is always a little bit of folklore, I can point to three specific things. First, Debian made this change in 2002 based on a reported 50% speedup/penalty in https://bugs.debian.org/131813 (it is interesting that the maintainer was not able to reproduce the problem, but the end user nonetheless saw the benefit from the change).
Second, there is obviously a benefit from loading one fewer file at process startup, though the impact is most obvious when your files are not in cache and process startup dominates your runtime. I see a ~15% penalty from `python3 -c True` on an AWS t2.medium VM from using the shared library:

Finally, in a free-threaded build, running the old "pystones" benchmark in multiple threads is about 10% slower with the shared library:

where `threadstone.py` is

and `pystone.py` is taken from just before 61fd70e.

This particular penalty is very understandable. In a shared library, thread-local storage for variables (globals or statics) in that library is allocated dynamically and on demand with the help of a function call to the C runtime that needs to be called whenever you're making an access and don't already have the right pointer cached. In the main executable, thread-local storage can be allocated up front, statically, with a fixed offset from the register that holds the thread-local storage area. So, code that makes heavy use of thread-local storage will perform better if compiled directly into the main binary. (A convenient thing about how ELF handles this is that there is a relocation type for thread-local accesses, and while generated code starts off including the function call to the helper function, the relocation is able to overwrite that function call with effectively no-op instructions if it's being linked into a main executable. So the same .o file can be used in both cases without having to tell the compiler up front if the code is going into a main executable or a shared library, without putting the performance benefits at risk.)
See more details on the benchmarks, more benchmarks, and the generated assembly code for thread-local storage access in astral-sh/python-build-standalone#592.
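For anyone who wants to try the `python3 -c True` startup comparison on their own builds, here is a rough sketch. It is not the script used for the numbers above; the function name `median_startup_seconds` and the default of 20 runs are arbitrary choices for illustration.

```python
import statistics
import subprocess
import sys
import time


def median_startup_seconds(python=sys.executable, runs=20):
    """Return the median wall-clock time, in seconds, of `python -c True`."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the interpreter fails to start.
        subprocess.run([python, "-c", "True"], check=True)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Pass interpreter paths on the command line to compare builds, e.g. one
    # configured with --enable-shared and one without. With no arguments,
    # only the running interpreter is timed.
    for path in sys.argv[1:] or [sys.executable]:
        print(f"{path}: {median_startup_seconds(path) * 1000:.2f} ms")
```

Because repeated runs keep the files in the page cache, this measures warm-cache startup only, which is likely to understate the cold-cache case where the extra file load matters most.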
Has this already been discussed elsewhere?
This is a minor feature, which does not need previous discussion elsewhere
Links to previous discussion of this feature:
No response
Linked PRs