@mr-c
Contributor

@mr-c mr-c commented Nov 18, 2025

Fixes the current SSE4.2 requirement added in 1b6ebb1 / #20244

This PR fully enables the existing x86-64 CPU detection and dispatch code for SSSE3, SSE4.1, SSE4.2, AVX, and AVX2 in the base64 module.

To use the existing CPU dispatch from the upstream base64 code, one needs to compile the sources in each of the CPU specific codec directories with a specific compiler flag; alas this is difficult to do with setuptools, but I found a solution inspired by https://stackoverflow.com/a/68508804
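A minimal sketch of the per-directory flag idea, assuming the compile command is available as a list of arguments (as it is inside a `spawn` override). The helper name `add_arch_flags` and the `ARCH_FLAGS` table are hypothetical and only illustrate the technique; the table in the actual patch may differ.

```python
from typing import Dict, List, Sequence

# Hypothetical table: path fragment -> extra compiler flags.
# Directory names follow the upstream base64 layout; flags are the
# GCC/Clang options for each instruction set.
ARCH_FLAGS: Dict[str, List[str]] = {
    "base64/arch/ssse3": ["-mssse3"],
    "base64/arch/sse41": ["-msse4.1"],
    "base64/arch/sse42": ["-msse4.2"],
    "base64/arch/avx2": ["-mavx2"],
    "base64/arch/avx": ["-mavx"],
}


def add_arch_flags(
    cmd: Sequence[str], flags: Dict[str, List[str]] = ARCH_FLAGS
) -> List[str]:
    """Return a copy of cmd with per-directory flags appended.

    Only arguments ending in ".c" are inspected, so a link command that
    lists object files is returned unchanged.
    """
    new_cmd = list(cmd)
    for arg in cmd:
        if not arg.endswith(".c"):
            continue
        normalized = arg.replace("\\", "/")
        for fragment, extra in flags.items():
            # Require a trailing "/" so "arch/avx" does not also match
            # sources that live under "arch/avx2".
            if fragment + "/" in normalized:
                new_cmd.extend(extra)
    return new_cmd


compile_cmd = ["gcc", "-c", "build/base64/arch/avx2/codec.c", "-o", "codec.o"]
print(add_arch_flags(compile_cmd))
```

A replacement `spawn` in a setup script could call a helper like this on `cmd` before delegating to the original method; how that delegation is wired up depends on the setuptools/distutils version in use.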

Note that I did not enable the AVX512 path in this PR, as many Intel CPUs that support AVX512 incur a performance hit when AVX512 is used sporadically; the performance of the AVX512 (encoding) path needs to be evaluated in the context of how mypyc uses base64 in various realistic scenarios. (There is no AVX512-accelerated decoding path in the upstream base64 codebase; it falls back to the AVX2 decoder.)

If there are additional performance concerns, then I suggest benchmarking with the openmp feature of base64 turned on, for multi-core processing.

@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from b611c27 to 067b1b8 Compare November 18, 2025 12:24
@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from 0ce45ae to cca1dc2 Compare November 18, 2025 14:17
@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch 2 times, most recently from c584ff9 to 6e96889 Compare November 18, 2025 15:00
@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from 6e96889 to d270f09 Compare November 18, 2025 16:42
@mr-c mr-c changed the title librt base64: use existing SIMD CPU dispatch by customizing build flags [mypyc] librt base64: use existing SIMD CPU dispatch by customizing build flags Nov 18, 2025
@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from d270f09 to a0c90f5 Compare November 18, 2025 17:27
@jhance
Collaborator

jhance commented Nov 19, 2025

I think there should also be a documented flag for setting the hardware floor (mostly useful for avx2 in order to avoid CPU dispatch if you know your hardware supports it)

@mr-c
Contributor Author

mr-c commented Nov 19, 2025

> I think there should also be a documented flag for setting the hardware floor (mostly useful for avx2 in order to avoid CPU dispatch if you know your hardware supports it)

Thank you for the review, @jhance; is adding a flag for setting the hardware floor a blocker for merging?

According to upstream, and confirmed by my review of the code, the codec choice (as a result of the CPU detection) is only done once and is saved for the lifetime of the program.

// These static function pointers are initialized once when the library is
// first used, and remain in use for the remaining lifetime of the program.
// The idea being that CPU features don't change at runtime.
static struct codec codec = { NULL, NULL };

Prior to this PR, the CPU detection was already being run: on x86-64 systems BASE64_WITH_SSE42 was always defined, so HAVE_SSE42 was always defined as well, and the following check ran to confirm CPU support for SSE4.2 before that codec was selected:

#if HAVE_SSE42
    // Check for SSE42 support:
    if (max_level >= 1) {
        __cpuid(1, eax, ebx, ecx, edx);
        if (ecx & bit_SSE42) {
            codec->enc = base64_stream_encode_sse42;
            codec->dec = base64_stream_decode_sse42;
            return true;
        }
    }
#endif

In my opinion there is no performance advantage to bypassing the one-time CPU detection on x86-64 systems.
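To illustrate why the detection cost is paid only once, here is a small Python model of the pattern described above. It is not the upstream C code: the preference order, the feature names, and the fixed feature set returned by `detect_features` are assumptions made for the sketch.

```python
from typing import FrozenSet, Optional

# Assumed preference order, fastest first (illustrative only).
PREFERENCE = ("avx2", "avx", "sse42", "sse41", "ssse3")

_cached_codec: Optional[str] = None
detect_calls = 0  # counts how often the (simulated) CPU probe runs


def detect_features() -> FrozenSet[str]:
    """Stand-in for the CPUID probe; returns a made-up feature set."""
    global detect_calls
    detect_calls += 1
    return frozenset({"ssse3", "sse41", "sse42"})


def choose_codec() -> str:
    """Pick the best available codec once and reuse it afterwards."""
    global _cached_codec
    if _cached_codec is None:
        features = detect_features()
        # Fall back to a generic codec when no SIMD feature is present.
        _cached_codec = next(
            (name for name in PREFERENCE if name in features), "generic"
        )
    return _cached_codec


for _ in range(3):
    choose_codec()
print(choose_codec(), detect_calls)
```

However many times `choose_codec` is called, the probe runs a single time, which mirrors the "initialized once" comment in the quoted C snippet.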

@mr-c
Contributor Author

mr-c commented Nov 19, 2025

Benchmarking results (n=100, AMD EPYC 9454 48-Core Processor)

master                         6.098s (0.0%)  | stdev 0.059s 
librt_base64_simd_cpu_dispatch 6.076s (-0.4%) | stdev 0.070s

@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from c56196b to 2354438 Compare November 20, 2025 10:59
@mr-c mr-c force-pushed the librt_base64_simd_cpu_dispatch branch from 2354438 to 379cd1e Compare November 22, 2025 23:11
@mr-c mr-c mentioned this pull request Nov 25, 2025
X86_64 = platform.machine() in ("x86_64", "AMD64", "amd64")


def spawn(self, cmd, **kwargs) -> None: # type: ignore[no-untyped-def]
Contributor

Is there any reason not to annotate this?

Contributor Author

Yes, I tried annotating this before, but the signature varies too much between setuptools/distutils versions and Python versions.

X86_64 = platform.machine() in ("x86_64", "AMD64", "amd64")


def spawn(self, cmd, **kwargs) -> None: # type: ignore[no-untyped-def]
Contributor

This looks similar, if not identical. Is there a particular reason not to put it in a shared location, or is it because these are setup files?

Contributor Author

I think the latter; the existing code also has duplication issues, as already noted.

Collaborator

Now that there is more duplication, we can try to share more of the code, but that can happen outside this PR.

"base64/arch/sse41": ["-msse4.1", "-DBASE64_WITH_SSE41"],
"base64/arch/sse42": ["-msse4.2", "-DBASE64_WITH_SSE42"],
"base64/arch/avx2": ["-mavx2", "-DBASE64_WITH_AVX2"],
"base64/arch/avx": ["-mavx", "-DBASE64_WITH_AVX"],
Collaborator

I think these BASE64_WITH... #defines need to be enabled in all files. Otherwise the codec choosing code doesn't get triggered (which happens in codec_choose.c). With these changes we compile the SIMD versions, but I don't think they will be used at runtime. I ran a microbenchmark, and performance was slower on an AMD system with this PR.

Here's the benchmark I used (added it to run-base64.test temporarily):

[case testXXX_librt_experimental]
import time
from librt.base64 import b64encode

a = b"foo"
b = a * 10000

def bench1(b: bytes, n: int) -> None:
    for i in range(n):
        b64encode(b)

bench1(b, 1000000)  # Warmup

t0 = time.time()
n = 1000 * 200
bench1(b, n)
td = time.time() - t0
print(len(b) * n / td / 1024 / 1024, "MB/s")

Collaborator

Ah, you need to force the optimization level to 3 to get meaningful benchmark results (e.g. patch mypyc.build.mypycify and force opt_level to be 3).

Contributor Author

@mr-c mr-c Nov 26, 2025

@JukkaL Thanks for looking into this. My laptop died this morning, so feel free to push additional fixes to my branch

Contributor Author

@mr-c mr-c Nov 26, 2025

FYI: In the lib-rt setup.py the 3rd optimization level is enabled

mypy/mypyc/lib-rt/setup.py

Lines 130 to 131 in 379cd1e

if compiler.compiler_type == "unix": # type: ignore[attr-defined]
    cflags += ["-O3"]

I'm surprised you had an issue with mypycify, as 3 is the default level

opt_level: str = "3",

Contributor Author

> I think these BASE64_WITH... #defines need to be enabled in all files. Otherwise the codec choosing code doesn't get triggered (which happens in codec_choose.c).

Okay, I agree that for X86-64, all the HAVE_* definitions should always be enabled.

I guess the easiest way is to edit mypyc/lib-rt/base64/config.h to set those flags inside a #if defined(__x86_64__) && defined(__LP64__) check and trim the above flags to just setting -mavx2 and similar.

Contributor Author

@mr-c mr-c Nov 27, 2025

I'm back; I got a new laptop charger :-D

Thank you for the micro benchmark, it helps a lot!

> FYI: In the lib-rt setup.py the 3rd optimization level is enabled
>
> mypy/mypyc/lib-rt/setup.py
>
> Lines 130 to 131 in 379cd1e
>
> if compiler.compiler_type == "unix": # type: ignore[attr-defined]
>     cflags += ["-O3"]
>
> I'm surprised you had an issue with mypycify, as 3 is the default level
>
> opt_level: str = "3",

Ah, it got overridden by the line below, so running MYPYC_OPT_LEVEL=3 pytest -n0 -vvv -s mypyc -k testXXX_librt_experimental is easier than patching:

opt_level = int(os.environ.get("MYPYC_OPT_LEVEL", 0))
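As a small illustration of why the benchmark ran unoptimized, the sketch below mirrors the quoted line in a standalone function. The function name `resolve_opt_level` and the dictionary-based environment are hypothetical; only the lookup expression comes from the quoted code.

```python
from typing import Mapping


def resolve_opt_level(environ: Mapping[str, str]) -> int:
    """Mirror the quoted lookup: default to 0 when MYPYC_OPT_LEVEL is unset."""
    return int(environ.get("MYPYC_OPT_LEVEL", 0))


# Without the variable the level is 0, regardless of mypycify's own default.
print(resolve_opt_level({}))
# Setting the variable on the command line selects the optimized build.
print(resolve_opt_level({"MYPYC_OPT_LEVEL": "3"}))
```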

I've added a commit to set the HAVE_{SSSE3,SSE41,SSE42,AVX,AVX2} flags automatically for amd64/x86-64 systems, removing the need for the BASE64_WITH_* definitions at compile time.

Using your benchmark, the baseline speed on my system was 9,089 MB/s before my changes; now it is 14,461 MB/s. The build log also showed that the -mavx2, -mavx, and similar flags were being added to the final link step as well, which is not appropriate:

INFO root:spawn.py:77 gcc -shared -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -Wl,-rpath,/home/mi/crusoe/.pyenv/versions/3.13.2/lib -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -Wl,-rpath,/home/mi/crusoe/.pyenv/versions/3.13.2/lib build/temp.linux-x86_64-cpython-313/build/base64/arch/avx/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/avx2/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/avx512/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/generic/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/neon32/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/neon64/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/sse41/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/sse42/codec.o build/temp.linux-x86_64-cpython-313/build/base64/arch/ssse3/codec.o build/temp.linux-x86_64-cpython-313/build/base64/codec_choose.o build/temp.linux-x86_64-cpython-313/build/base64/lib.o build/temp.linux-x86_64-cpython-313/build/base64/tables/tables.o build/temp.linux-x86_64-cpython-313/build/bytes_ops.o build/temp.linux-x86_64-cpython-313/build/dict_ops.o build/temp.linux-x86_64-cpython-313/build/exc_ops.o build/temp.linux-x86_64-cpython-313/build/float_ops.o build/temp.linux-x86_64-cpython-313/build/generic_ops.o build/temp.linux-x86_64-cpython-313/build/getargs.o build/temp.linux-x86_64-cpython-313/build/getargsfast.o build/temp.linux-x86_64-cpython-313/build/init.o build/temp.linux-x86_64-cpython-313/build/int_ops.o build/temp.linux-x86_64-cpython-313/build/librt_base64.o build/temp.linux-x86_64-cpython-313/build/list_ops.o build/temp.linux-x86_64-cpython-313/build/misc_ops.o build/temp.linux-x86_64-cpython-313/build/pythonsupport.o build/temp.linux-x86_64-cpython-313/build/set_ops.o build/temp.linux-x86_64-cpython-313/build/str_ops.o build/temp.linux-x86_64-cpython-313/build/tuple_ops.o -L/home/mi/crusoe/.pyenv/versions/3.13.2/lib -o 
build/lib.linux-x86_64-cpython-313/librt/base64.cpython-313-x86_64-linux-gnu.so -mssse3 -msse4.2 -msse4.1 -mavx -mavx2 -mavx

So the next commit applies the extra flags only when the matched argument ends in .c. The new speed is 14,242 MB/s, a 57% improvement over the baseline (before this PR).
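The improvement figure can be checked with a short calculation. The throughput numbers are the ones reported in this comment; the helper `percent_change` is just a hypothetical name for the usual relative-change formula.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


baseline_mb_s = 9089.0        # before this PR
with_link_flags_mb_s = 14461.0  # SIMD flags still leaking into the link step
final_mb_s = 14242.0          # flags applied to .c files only

print(f"{percent_change(baseline_mb_s, with_link_flags_mb_s):.1f}%")
print(f"{percent_change(baseline_mb_s, final_mb_s):.1f}%")
```

The two post-change measurements differ by under 2%, so the gain is consistent with coming from the SIMD codecs being selected at runtime rather than from the flags on the link command; run-to-run variance was not reported for this microbenchmark.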

@github-actions
Contributor

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

@mr-c
Contributor Author

mr-c commented Nov 27, 2025

@JukkaL all tests pass; I think this is ready for squashing and merging so that it can be cherry-picked into the 1.19 release branch.

Collaborator

@JukkaL JukkaL left a comment

Thanks for the updates, looks good. We can optionally deal with the additional code duplication later on, but it's a pre-existing issue and we can live with it for now.

@JukkaL JukkaL merged commit 0a5c790 into python:master Nov 28, 2025
21 checks passed
@mr-c mr-c deleted the librt_base64_simd_cpu_dispatch branch November 28, 2025 10:01
@mr-c
Contributor Author

mr-c commented Nov 28, 2025

Thank you @JukkaL !

p-sawicki pushed a commit that referenced this pull request Nov 28, 2025
…uild flags (#20253)

Fixes the current SSE4.2 requirement added in
1b6ebb1
/ #20244

This PR fully enables the existing x86-64 CPU detection and dispatch
code for SSSE3, SSE4.1, SSE4.2, AVX, and AVX2 in the base64 module.

To use the existing CPU dispatch from the [upstream base64
code](https://github.com/aklomp/base64), one needs to compile the
sources in each of the CPU specific codec directories with a specific
compiler flag; alas this is difficult to do with setuptools, but I found
a solution inspired by https://stackoverflow.com/a/68508804

Note that I did not enable the AVX512 path in this PR, as many Intel
CPUs that support AVX512 incur a performance hit when AVX512 is used
sporadically; the performance of the AVX512 (encoding) path needs to
be evaluated in the context of how mypyc uses base64 in various
realistic scenarios. (There is no AVX512-accelerated decoding path in
the upstream base64 codebase; it falls back to the AVX2 decoder.)

If there are additional performance concerns, then I suggest
benchmarking with the openmp feature of base64 turned on, for multi-core
processing.
@github-project-automation github-project-automation bot moved this from Todo to Done in GC-Content-Calculator Dec 12, 2025