Skip to content

GH-131521: fix clangcl build on Windows for zlib-ng #131526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 24, 2025
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 37 additions & 12 deletions PCbuild/zlib-ng.vcxproj
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,7 @@
<PreprocessorDefinitions>%(PreprocessorDefinitions);ZLIB_COMPAT;WITH_GZFILEOP;NO_FSEEKO;HAVE_BUILTIN_ASSUME_ALIGNED;_CRT_SECURE_NO_DEPRECATE;_CRT_NONSTDC_NO_DEPRECATE;</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">%(PreprocessorDefinitions);X86_FEATURES;X86_HAVE_XSAVE_INTRIN;X86_SSE2;X86_SSSE3;X86_SSE42;X86_PCLMULQDQ_CRC;X86_AVX2;X86_AVX512;X86_AVX512VNNI;X86_VPCLMULQDQ_CRC</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(Configuration) == 'Debug'">%(PreprocessorDefinitions);ZLIB_DEBUG</PreprocessorDefinitions>
<PreprocessorDefinitions Condition="$(PlatformToolset) == 'ClangCL'">%(PreprocessorDefinitions);HAVE_BUILTIN_CTZ</PreprocessorDefinitions>
<EnableEnhancedInstructionSet Condition="$(Platform) == 'Win32' or $(Platform) == 'x64'">AdvancedVectorExtensions2</EnableEnhancedInstructionSet>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd feel better we'd not enable that for all *.c files (at lest not on Win32). MSVC would be totally happy without it (and AFAIR defaults to SSE2 if not set). Maybe I'd then have to throw in more -m for clang-cl, but IMHO we're safer without that even in the MSVC case, so that the resulting binary can run on ancient CPUs, too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__ preprocessor checks that will behave differently without this setting).

Putting it just on the .c files from arch\x86 should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I suspect you're right on this. I'm not sure whether the intrinsics code is skipped or still generates fine without this, but it wouldn't surprise me if it's being ignored (there are __AVX*__ preprocessor checks that will behave differently without this setting).

Well spotted! Although MSVC allows you to use intrinsics even if the respective /arch part isn't set, it won't set the macros __AVX__, etc. Please note, that it never sets macros like __SSE2__:

Doing a regex __[A-Z]+[0-9]+__ only reveals

#  if (defined(X86_SSE2) && defined(__SSE2__)) || defined(__x86_64__) || defined(_M_X64) || defined(X86_NOCHECK_SSE2)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	76	37
#  if defined(X86_SSSE3) && defined(__SSSE3__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	95	37
#if defined(X86_PCLMULQDQ_CRC) && defined(__PCLMUL__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	110	43
#  if defined(X86_AVX2) && defined(__AVX2__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	123	36
#  if defined(X86_AVX512) && defined(__AVX512F__) && defined(__AVX512DQ__) && defined(__AVX512BW__) && defined(__AVX512VL__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	147	38
#    if defined(X86_AVX512VNNI) && defined(__AVX512VNNI__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	159	44
#    if defined(__PCLMUL__) && defined(__AVX512F__) && defined(__VPCLMULQDQ__)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_functions.h	166	17

which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.

The other hits are in x86_intrins.h:

#ifdef __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	7	8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 10) \	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	10	63
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	18	8
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	24	11
#endif // __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	27	11
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	31	8
#if (!defined(__clang__) && !defined(__NVCOMPILER) && defined(__GNUC__) && __GNUC__ < 9)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	32	63
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	67	11
#ifdef __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	74	8
#endif // __AVX2__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	78	11
#ifdef __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	80	8
#endif // __AVX512F__	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	84	11
#if defined(_MSC_VER) && defined(__AVX512F__) && !defined(_MM_K0_REG8)	E:\cpython_clang_pgo2\externals\zlib-ng-2.2.4\arch\x86\x86_intrins.h	88	34

where intrinsic quirks of some compiler versions are taken care of. There are some for MSVC that even include __AVX512F__, which wouldn't be covered with the current AdvancedVectorExtensions2. My latest commits should fix this.

Putting it just on the .c files from arch\x86 should be enough. Those are already in a conditional ItemGroup, so the condition isn't needed each time (the ClangCL condition is, though).

Yeah, I've only set it to those files that really need them, rendering some clang specific stuff obsolete :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which are within an #ifdef DISABLE_RUNTIME_CPU_DETECTION, which isn't set.

This makes me quite confident, that you're spot on #131438 (comment)

I believe it's detected and called dynamically based on availability, but it's all compiled at compile time.

I.e., we are in the "dynamic dispatch" code path
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.h#L12-L26

#ifdef DISABLE_RUNTIME_CPU_DETECTION

#  include "arch_functions.h"

/* When compiling with native instructions it is not necessary to use functable.
 * Instead we use native_ macro indicating the best available variant of arch-specific
 * functions for the current platform.
 */
#  define FUNCTABLE_INIT ((void)0)
#  define FUNCTABLE_CALL(name) native_ ## name
#  define FUNCTABLE_FPTR(name) &native_ ## name

#else

struct functable_s {

where functable_s will be populated at runtime according to the capabilities of the CPU in:
https://github.com/zlib-ng/zlib-ng/blob/860e4cff7917d93f54f5d7f0bc1d0e8b1a3cb988/functable.c#L45-L49

static void init_functable(void) {
    struct functable_s ft;
    struct cpu_features cf;

    cpu_check_features(&cf);

Using the /arch option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the /arch option only for the needed files will now guarantee that the binary can run even on ancient CPUs and will use optimized code paths according to the CPU capabilites.

This is similar to #130213 / #130447. Thus, I've checked using clang-cl 18.1.8: it is picky about just "seeing" intrinsics without the appropriate -m architecture - and it happily compiled :)

</ClCompile>
</ItemDefinitionGroup>
Expand Down Expand Up @@ -141,18 +142,42 @@
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_sse2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\compare256_sse2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\slide_hash_sse2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_ssse3.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_ssse3.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_sse42.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\crc32_pclmulqdq.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\slide_hash_avx2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_avx2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\compare256_avx2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx2.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx512.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_avx512.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx512_vnni.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\crc32_vpclmulqdq.c" />
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_ssse3.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_ssse3.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_sse42.c">
Copy link
Member Author

@chris-eibl chris-eibl Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting: the filename suggests sse43 sse42, but -mssse3 deemed sufficient.
Ups: edited typo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth matching the filename, so that if an updated zlib-ng uses SSE4.3 instructions we don't need to modify anything here.

The file relies on intrinsics, so it's not going to automatically generate newer instructions.

<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\crc32_pclmulqdq.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mssse3 -mpclmul</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\slide_hash_avx2.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx2</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_avx2.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx2</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\compare256_avx2.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx2</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx2.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx2</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx512.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx512bw</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\chunkset_avx512.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx512bw -mavx512vl -mbmi2</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\adler32_avx512_vnni.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx512bw -mavx512vl -mavx512vnni</AdditionalOptions>
</ClCompile>
<ClCompile Include="$(zlibNgDir)\arch\x86\crc32_vpclmulqdq.c">
<AdditionalOptions Condition="$(PlatformToolset) == 'ClangCL'">%(AdditionalOptions) -mavx512f -mvpclmulqdq</AdditionalOptions>
</ClCompile>
</ItemGroup>
<ItemGroup>
<ClInclude Include="..\PC\zconf.h" />
Expand Down
Loading