
Enable building both an interpreter that statically links libpython and a shared library too #133312


Open · geofft opened this issue May 2, 2025 · 2 comments
Labels: build (The build process and cross-build), type-feature (A feature request or enhancement)

Comments


geofft commented May 2, 2025

Feature or enhancement

Proposal:

Right now, if you ./configure --enable-shared, you get a bin/python3 that uses libpython3.x.so, and if you leave the option out (or explicitly ./configure --disable-shared), you get a bin/python3 that statically links libpython into itself and no libpython3.x.so.

It's very useful to have a libpython3.x.so available for applications that need it because they embed Python in various ways. At the same time, there are some performance speedups from not having the extra layer of indirection in bin/python3. It would be useful to have a build option that gets you a best-of-both-worlds build (at the cost of more disk space): a bin/python3 that statically links libpython and a libpython3.x.so for other binaries that might need it.
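
For concreteness, the embedding use case looks roughly like the following minimal sketch using the documented C API (illustrative only, not part of this proposal; the file name and build line are assumptions):

/* Minimal, standard embedding sketch.  On an --enable-shared build it is
 * typically compiled and linked against libpython3.x.so with something like:
 *
 *   cc embed.c $(python3-config --cflags --ldflags --embed) -o embed
 */
#include <Python.h>

int main(void)
{
    Py_Initialize();                          /* start an interpreter */
    PyRun_SimpleString("print('embedded')");  /* run some Python code */
    return Py_FinalizeEx() < 0 ? 1 : 0;       /* shut it down cleanly */
}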

As a data point, this is useful enough that Debian currently does this in their Python package in a roundabout way: they build Python twice, once with --enable-shared and once without, and they then assemble the package by taking the libpython3.x.so from the former build and everything else from the latter build. (See, for instance, the debian/rules file for Debian python3 3.13.3-2: in lines 395-412 they do a --enable-shared build into $(buildd_shared), in lines 438-450 they do a non-shared build into $(buildd_static), in line 878 they do a make -C $(buildd_static) install, and in lines 939-940 they copy libpython3.x.so.1.0 out of $(buildd_shared).)

There is no particular need to do two separate builds, since the behavioral differences from --enable-shared only come into play near the end of the build, when the final binaries are generated. All that needs to happen is for the Makefile to link the interpreter binary the way it would for a static build, and also build the shared library as it would if it were a dependency of the interpreter.

I have implemented this change and will open a PR momentarily.

Details on the performance benefits: while this is always a little bit of folklore, I can point to three specific things. First, Debian made this change in 2002 based on a reported 50% speedup/penalty in https://bugs.debian.org/131813 (it is interesting that the maintainer was not able to reproduce the problem, but the end user nonetheless saw the benefit from the change).

Second, there is an obvious benefit from loading one fewer file at process startup, though the impact is most noticeable when your files are not in cache and process startup dominates your runtime. I see a ~15% penalty on python3 -c True from using the shared library, on an AWS t2.medium VM:

ubuntu@ip-172-16-0-59:~$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 -c True' --prepare 'echo 3 | sudo tee /proc/sys/vm/drop_caches'
Benchmark 1: b/python/install/bin/python3 -c True
  Time (mean ± σ):     183.1 ms ±   7.5 ms    [User: 14.9 ms, System: 16.4 ms]
  Range (min … max):   173.1 ms … 195.1 ms    10 runs
 
Benchmark 2: c/python/install/bin/python3 -c True
  Time (mean ± σ):     155.8 ms ±   7.4 ms    [User: 14.3 ms, System: 16.6 ms]
  Range (min … max):   142.5 ms … 166.9 ms    11 runs
 
Summary
  c/python/install/bin/python3 -c True ran
    1.17 ± 0.07 times faster than b/python/install/bin/python3 -c True

Finally, in a free-threaded build, running the old "pystones" benchmark in multiple threads is about 10% slower with the shared library:

ubuntu@ip-172-16-0-59:~/ft$ hyperfine -L dir b,c '{dir}/python/install/bin/python3 ~/threadstone.py'
Benchmark 1: b/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      2.208 s ±  0.064 s    [User: 3.722 s, System: 0.545 s]
  Range (min … max):    2.155 s …  2.379 s    10 runs
 
Benchmark 2: c/python/install/bin/python3 ~/threadstone.py
  Time (mean ± σ):      1.993 s ±  0.052 s    [User: 3.353 s, System: 0.487 s]
  Range (min … max):    1.921 s …  2.122 s    10 runs
 
Summary
  c/python/install/bin/python3 ~/threadstone.py ran
    1.11 ± 0.04 times faster than b/python/install/bin/python3 ~/threadstone.py

where threadstone.py is

import pystone  # pystone.py from CPython, taken just before commit 61fd70e (see below)
import concurrent.futures

# Submit 500 pystone runs to a default-sized thread pool and wait for
# them all to finish.
t = concurrent.futures.ThreadPoolExecutor()
for i in range(500):
    t.submit(pystone.pystones, 1000)
t.shutdown()

and pystone.py is taken from just before 61fd70e.

This particular penalty is easy to explain. In a shared library, thread-local storage for that library's variables (globals or statics) is allocated dynamically and on demand, with the help of a call into the C runtime (__tls_get_addr) that has to be made on every access for which the right pointer is not already cached. In the main executable, thread-local storage can be allocated up front, statically, at a fixed offset from the register that holds the thread-local storage area. So code that makes heavy use of thread-local storage performs better when compiled directly into the main binary.

(A convenient thing about how ELF handles this is that there is a relocation type for thread-local accesses: generated code starts out containing the call to the helper function, but when the object is linked into a main executable, the relocation lets the linker overwrite that call with effectively no-op instructions. So the same .o file can be used in both cases, without telling the compiler up front whether the code will end up in a main executable or a shared library, and without giving up the performance benefit.)
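
To make this concrete, here is a minimal C sketch (not taken from CPython; the file and symbol names are made up) of the kind of thread-local access in question. Compiling it with and without -fPIC and inspecting the assembly shows the fixed-offset access versus the __tls_get_addr call described above.

/* Compile two ways and compare the generated assembly:
 *
 *   cc -O2 -S tls_demo.c          # main-executable case: fixed offset
 *   cc -O2 -fPIC -S tls_demo.c    # shared-library case: __tls_get_addr
 */
#include <stdio.h>

_Thread_local unsigned long counter;   /* one copy per thread */

unsigned long bump(void)
{
    /* In a main executable this is a load/store at a constant offset
     * from the thread pointer.  Built with -fPIC for a shared library,
     * a dynamic TLS model is used instead and the access goes through
     * a call to __tls_get_addr(). */
    return ++counter;
}

int main(void)
{
    printf("%lu\n", bump());
    return 0;
}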

See more details on the benchmarks, more benchmarks, and the generated assembly code for thread-local storage access in astral-sh/python-build-standalone#592.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

@geofft geofft added the type-feature A feature request or enhancement label May 2, 2025
geofft added a commit to geofft/cpython that referenced this issue May 2, 2025
…eter

This option changes the behavior of --enable-shared to continue to build
the libpython3.x.so shared library, but not use it for linking the
python3 interpreter executable. Instead, the executable is linked
directly against the libpython .o files as it would be with
--disable-shared.

There are two benefits of this change. First, libpython uses
thread-local storage, which is noticeably slower when used in a loaded
module instead of in the main program, because the main program can take
advantage of constant offsets from the thread state pointer but loaded
modules have to dynamically call a function __tls_get_addr() to
potentially allocate their thread-local storage area. (There is another
thread-local storage model for dynamic libraries which mitigates most of
this performance hit, but it comes at the cost of preventing
dlopen("libpython3.x.so"), which is a use case we want to preserve.)

Second, this improves the user experience around relocatable Python a
little bit, in that we don't need to use an $ORIGIN-relative path to
locate libpython3.x.so, which has some mild benefits around musl (which
does not support $ORIGIN-relative DT_NEEDED, only $ORIGIN-relative
DT_RPATH/DT_RUNPATH), users who want to make the interpreter setuid or
setcap (which prevents processing $ORIGIN), etc.
@AlexWaygood AlexWaygood added the build The build process and cross-build label May 2, 2025

edmorley commented May 19, 2025

Details on the performance benefits: while this is always a little bit of folklore, I can point to three specific things. First, Debian made this change in 2002 based on a reported 50% speedup/penalty in https://bugs.debian.org/131813

Hi! The 2002 Debian change predates the use of -fno-semantic-interposition (which landed upstream in CPython in 3.10) - is it possible that the benefits of this dual approach (static python + still generating a shared library) are much smaller now that that change is in place?


(Apologies if you're already aware of `-fno-semantic-interposition`, but there was no mention of how this change intersects with it in the OP, and it was the first thing that came to mind when I saw this proposal... :-) )
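
As an illustration (a minimal C sketch, not CPython code; the names and build lines are made up), semantic interposition is what forces calls between functions in the same shared library through the PLT, which is the overhead the flag removes:

/* Two functions that would live in the same shared library, e.g.
 *
 *   cc -O2 -fPIC -shared interpose_demo.c -o libdemo.so
 *   cc -O2 -fPIC -fno-semantic-interposition -shared interpose_demo.c -o libdemo.so
 */
long square(long x)
{
    return x * x;
}

long sum_of_squares(long a, long b)
{
    /* With default ELF semantics, each call to square() has to go
     * through the PLT and cannot be inlined, because another library
     * (or LD_PRELOAD) is allowed to interpose its own square() at run
     * time.  With -fno-semantic-interposition the compiler may bind
     * and inline the call directly, removing that per-call overhead. */
    return square(a) + square(b);
}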


geofft commented May 19, 2025

Oh! That would make sense; I was dimly aware of it but didn't connect the two in my mind. Thanks.

I think that explains why I'm not seeing a noticeable performance benefit on the "normal" interpreter build but I am seeing it on the free-threaded build; the general slowdown from merely using a dynamic library is gone, but the slowdown on thread-local storage from using a dynamic library is still there, as noted in the benchmark result I posted above. (See my comments on astral-sh/python-build-standalone#592 for some hyperfine runs that don't show a significant difference with/without this change.)
