Description
In the following example:
```c
__attribute__((target("default")))
static int ctz(unsigned i) { return __builtin_ctz(i); }
__attribute__((target("arch=skylake")))
static int ctz(unsigned i) { return __builtin_ctz(i); }

__attribute__((target("default")))
int indirect_ctz(unsigned i) { return ctz(i); }
__attribute__((target("arch=skylake")))
int indirect_ctz(unsigned i) { return ctz(i); }
```
I would expect that `indirect_ctz` [default] and `indirect_ctz` [clone .arch_skylake] could be optimized into direct calls to `ctz` [default] and `ctz` [clone .arch_skylake], respectively. As can be seen on the Godbolt link above, GCC performs this optimization (and then inlines the calls). With Clang, however, both versions of `indirect_ctz` simply call the ifunc-resolved version of `ctz`, which prevents inlining optimizations from taking effect.
Additionally, Clang also appears unable to perform this optimization when `__attribute__((target_clones))` is used for one or both of `ctz` and `indirect_ctz`:

- Example 1
- Example 2
- Example 3

(In examples 2 and 3, GCC is able to optimize the call into a direct call to the target-specific implementation, but fails to inline it.)