-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
[mypyc] librt base64: use existing SIMD CPU dispatch by customizing build flags #20253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b611c27 to
067b1b8
Compare
This comment has been minimized.
This comment has been minimized.
0ce45ae to
cca1dc2
Compare
This comment has been minimized.
This comment has been minimized.
c584ff9 to
6e96889
Compare
This comment has been minimized.
This comment has been minimized.
6e96889 to
d270f09
Compare
This comment has been minimized.
This comment has been minimized.
d270f09 to
a0c90f5
Compare
This comment has been minimized.
This comment has been minimized.
1 similar comment
This comment has been minimized.
This comment has been minimized.
|
I think there should also be a documented flag for setting the hardware floor (mostly useful for avx2 in order to avoid CPU dispatch if you know your hardware supports it) |
Thank you for the review @jhance ; is adding a flag for setting the hardware floor a blocker for merging? According to upstream, and confirmed by my review of the code, the codec choice (as a result of the CPU detection) is only done once and is saved for the lifetime of the program. mypy/mypyc/lib-rt/base64/lib.c Lines 12 to 15 in 094f66d
Prior to this PR, the CPU detection was already being run: on X86_64 systems mypy/mypyc/lib-rt/base64/codec_choose.c Lines 254 to 264 in 094f66d
In my opinion there is no performance advantage for bypassing the one-time CPU detection on X86_64 systems. |
|
Benchmarking results (n=100, AMD EPYC 9454 48-Core Processor) |
c56196b to
2354438
Compare
This comment has been minimized.
This comment has been minimized.
2354438 to
379cd1e
Compare
This comment has been minimized.
This comment has been minimized.
| X86_64 = platform.machine() in ("x86_64", "AMD64", "amd64") | ||
|
|
||
|
|
||
| def spawn(self, cmd, **kwargs) -> None: # type: ignore[no-untyped-def] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reason why not to annotate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried annotating this before, but the signature varies too much between setuptools/distutils versions and Python versions.
| X86_64 = platform.machine() in ("x86_64", "AMD64", "amd64") | ||
|
|
||
|
|
||
| def spawn(self, cmd, **kwargs) -> None: # type: ignore[no-untyped-def] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks similar if not the same, any particular reason why not have it in a shared location, or is this because of the fact these are setup files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the later; the existing code also has duplication issues as already noted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there is more duplicated we can try to share more of the code, but it can happen outside this PR.
mypyc/build_setup.py
Outdated
| "base64/arch/sse41": ["-msse4.1", "-DBASE64_WITH_SSE41"], | ||
| "base64/arch/sse42": ["-msse4.2", "-DBASE64_WITH_SSE42"], | ||
| "base64/arch/avx2": ["-mavx2", "-DBASE64_WITH_AVX2"], | ||
| "base64/arch/avx": ["-mavx", "-DBASE64_WITH_AVX"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these BASE64_WITH... #defines need to enabled in all files. Otherwise the codec choosing code doesn't get triggered (which happens in codec_choose.c). With these changes we compile the SIMD versions, but I don't think they will be used at runtime. I ran a microbenchmark and performance was slower on an AMD system with this PR.
Here's the benchmark I used (added it to run-base64.test temporarily):
[case testXXX_librt_experimental]
import time
from librt.base64 import b64encode
a = b"foo"
b = a * 10000
def bench1(b: bytes, n: int) -> None:
for i in range(n):
b64encode(b)
bench1(b, 1000000) # Warmup
t0 = time.time()
n = 1000 * 200
bench1(b, n)
td = time.time() - t0
print(len(b) * n / td / 1024 / 1024, "MB/s")There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you need to force optimization level to 3 to get meaningful benchmark results (e.g. patch mypyc.build.mypycify and force opt_level to be 3).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JukkaL Thanks for looking into this. My laptop died this morning, so feel free to push additional fixes to my branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these
BASE64_WITH...#defines need to enabled in all files. Otherwise the codec choosing code doesn't get triggered (which happens incodec_choose.c).
Okay, I agree that for X86-64, all the HAVE_* definitions should always be enabled.
I guess the easiest way is to edit mypyc/lib-rt/base64/config.h to set those flags inside a #if defined(__x86_64__) && defined(__LP64__) check and trim the above flags to just setting -mavx2 and similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm back; I got a new laptop charger :-D
Thank you for the micro benchmark, it helps a lot!
FYI: In the lib-rt
setup.pythe 3rd optimization level is enabledLines 130 to 131 in 379cd1e
if compiler.compiler_type == "unix": # type: ignore[attr-defined] cflags += ["-O3"] I'm surprised you had an issue with mypycify, as
3is the default levelLine 545 in 379cd1e
opt_level: str = "3",
Ah, it got overridden by this line, so setting MYPYC_OPT_LEVEL=3 pytest -n0 -vvv -s mypyc -k testXXX_librt_experimental is easier than patching
Line 283 in 1f09855
| opt_level = int(os.environ.get("MYPYC_OPT_LEVEL", 0)) |
I've added a commit to set the HAVE_{SSSE3,SSE41,SSE42,AVX,AVX2} flags automatically for amd64/x86-64 systems, removing the need for the BASE64_WITH_* definitions on the compile time.
The baseline speed on my system using your benchmarking was 9,089 MB/s before my changes, now it is 14,461 MB/s. It also showed that all the -mavx2 -mavx flags were being added also to the final linking stage, which is obviously not appropriate:
INFO root:spawn.py:77 gcc -shared -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -Wl,-rpath,/home/mi/crusoe/.pyenv/versions/3.13.2/lib -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -Wl,-rpath,/home/mi/crusoe/.pyenv/versions/3.13.2/lib build/temp.linux-x86_64-cpython-313/build/base64/arch/avx/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/avx2/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/avx512/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/generic/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/neon32/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/neon64/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/sse41/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/sse42/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/ssse3/codec.o build/temp.linux-x86_64-cpython-313/build/base64/codec_choose.o build/temp.linux-x86_64-cpython-313/build/base64/lib.o build/temp.linux-x86_64-cpython-313/build/base64/tables/tables.o build/temp.linux-x86_64-cpython-313/build/bytes_ops.o build/temp.linux-x86_64-cpython-313/build/dict_ops.o build/temp.linux-x86_64-cpython-313/build/exc_ops.o build/temp.linux-x86_64-cpython-313/build/float_ops.o build/temp.linux-x86_64-cpython-313/build/generic_ops.o build/temp.linux-x86_64-cpython-313/build/getargs.o build/temp.linux-x86_64-cpython-313/build/getargsfast.o build/temp.linux-x86_64-cpython-313/build/init.o build/temp.linux-x86_64-cpython-313/build/int_ops.o build/temp.linux-x86_64-cpython-313/build/librt_base64.o build/temp.linux-x86_64-cpython-313/build/list_ops.o build/temp.linux-x86_64-cpython-313/build/misc_ops.o build/temp.linux-x86_64-cpython-313/build/pythonsupport.o build/temp.linux-x86_64-cpython-313/build/set_ops.o build/temp.linux-x86_64-cpython-313/build/str_ops.o build/temp.linux-x86_64-cpython-313/build/tuple_ops.o -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -o build/lib.linux-x86_64-cpython-313/librt/base64.cpython-313-x86_64-linux-gnu.so -mssse3 -msse4.2 -msse4.1 -mavx -mavx2 -mavx
So the next commit limits the matches to when the term ends in .c. The new speed is 14,242 MB/s, a 57% improvement from the baseline (before this PR).
…y using HAVE_[x86 features]
…tional compiler flags
This comment has been minimized.
This comment has been minimized.
…macOS universal builds
Matches previous behavior
|
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅ |
|
@JukkaL all tests pass, I think this is ready for squashing and merging so it can be cherry-picked for the 1.19 release branch |
JukkaL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, looks good. We can optionally deal with the additional code duplication later on, but it's a pre-existing issue and we can live with it for now.
| X86_64 = platform.machine() in ("x86_64", "AMD64", "amd64") | ||
|
|
||
|
|
||
| def spawn(self, cmd, **kwargs) -> None: # type: ignore[no-untyped-def] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there is more duplicated we can try to share more of the code, but it can happen outside this PR.
|
Thank you @JukkaL ! |
…uild flags (#20253) Fixes the current SSE4.2 requirement added in 1b6ebb1 / #20244 This PR fully enables the existing x86-64 CPU detection and dispatch code for SSSE3, SSE4.1, SSE4.2, AVX, and AVX2 in the base64 module. To use the existing CPU dispatch from the [upstream base64 code](https://github.com/aklomp/base64), one needs to compile the sources in each of the CPU specific codec directories with a specific compiler flag; alas this is difficult to do with setuptools, but I found a solution inspired by https://stackoverflow.com/a/68508804 Note that I did not enable the AVX512 path with this PR, as many intel CPUs that support AVX512 can come with a performance hit if AVX512 is sporadically used; the performance of the AVX512 (encoding) path need to be evaluated in the context of how mypyc uses base64 in various realistic scenarios. (There is no AVX512 accelerated decoding path in the upstream base64 codebase, it falls back to the avx2 decoder). If there are additional performance concerns, then I suggest benchmarking with the openmp feature of base64 turned on, for multi-core processing.
Fixes the current SSE4.2 requirement added in 1b6ebb1 / #20244
This PR fully enables the existing x86-64 CPU detection and dispatch code for SSSE3, SSE4.1, SSE4.2, AVX, and AVX2 in the base64 module.
To use the existing CPU dispatch from the upstream base64 code, one needs to compile the sources in each of the CPU specific codec directories with a specific compiler flag; alas this is difficult to do with setuptools, but I found a solution inspired by https://stackoverflow.com/a/68508804
Note that I did not enable the AVX512 path with this PR, as many intel CPUs that support AVX512 can come with a performance hit if AVX512 is sporadically used; the performance of the AVX512 (encoding) path need to be evaluated in the context of how mypyc uses base64 in various realistic scenarios. (There is no AVX512 accelerated decoding path in the upstream base64 codebase, it falls back to the avx2 decoder).
If there are additional performance concerns, then I suggest benchmarking with the openmp feature of base64 turned on, for multi-core processing.