
[clang] -ffast-math in 19.1.0 prevents function from returning intended __m128 bitmask #118152


Open
shinfd opened this issue Nov 30, 2024 · 19 comments · May be fixed by #118603


shinfd commented Nov 30, 2024

Hello.

A function like the one below may return an unexpected __m128 value when compiled with clang 19.1.0 using -ffast-math -O2.

auto maskFunc = []() {
    return _mm_castsi128_ps(_mm_set_epi32( 0x7FFFFFFF, 0x00000000, 0x00000000, 0xFFFFFFFF ));
};

Sample code at Compiler Explorer: https://godbolt.org/z/xG5r1n7xY, demonstrating a different result from the lambda return versus an immediate assignment.
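
Below is a hedged reconstruction of the kind of comparison in that Compiler Explorer link (not the exact source): the directly constructed value may keep the intended bit pattern, while the value returned through the lambda may lose its NaN/inf lanes under -ffast-math -O2 with clang 19.

#include <cstdio>
#include <emmintrin.h>

int main() {
    auto maskFunc = []() {
        return _mm_castsi128_ps(_mm_set_epi32( 0x7FFFFFFF, 0x00000000, 0x00000000, 0xFFFFFFFF ));
    };
    __m128 fromLambda = maskFunc();
    __m128 direct = _mm_castsi128_ps(_mm_set_epi32( 0x7FFFFFFF, 0x00000000, 0x00000000, 0xFFFFFFFF ));

    alignas(16) unsigned a[4], b[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(a), _mm_castps_si128(fromLambda));
    _mm_store_si128(reinterpret_cast<__m128i*>(b), _mm_castps_si128(direct));
    // Intended bit pattern, high lane to low lane: 7fffffff 00000000 00000000 ffffffff
    std::printf("lambda: %08x %08x %08x %08x\n", a[3], a[2], a[1], a[0]);
    std::printf("direct: %08x %08x %08x %08x\n", b[3], b[2], b[1], b[0]);
    return 0;
}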

Workarounds:

  • Remove -ffast-math or -O2
  • Add -fno-finite-math-only
  • Use clang 18

My speculation is that this happens because the i32 values set in the sample function are NaN when interpreted as float, and the -ffast-math optimizer discarded such a return value as invalid. However, the __m128 value was intended for use in bitwise-operation intrinsics like _mm_and_ps(), not as a value in floating-point calculations.

This was permitted in clang 18, but no longer as of clang 19.1.0. Was this behavior change intended?

If so, is it no longer possible to guarantee assigning and returning an arbitrary __m128 value, which may contain NaN or INFINITY, with -ffast-math enabled from clang 19 onward? In addition to the bitwise intrinsics, comparison intrinsics like _mm_cmpgt_ps() use 0xFFFFFFFF as their truth value, so handling non-finite values is necessary for multiple intrinsics by definition.

Or is it simply not recommended to use these intrinsics with -ffast-math, or more specifically -ffinite-math-only? The -ffast-math -fno-finite-math-only combination is a valid workaround, but it affects the entire compilation, potentially preventing other, uninvolved code from benefiting from finite-math-only optimizations.

Regards,

github-actions bot added the clang (Clang issues not falling into any other category) label Nov 30, 2024
EugeneZelenko added the llvm:optimizations and floating-point (Floating-point math) labels and removed the clang (Clang issues not falling into any other category) label Nov 30, 2024

slipher commented Nov 30, 2024

+1, we were also affected by this. https://godbolt.org/z/7v7hb8E9q is a minimized example of what happened.


RKSimon commented Dec 1, 2024

CC @andykaylor


arsenm commented Dec 1, 2024

However, the __m128 value was intended for use in bitwise operation intrinsics like _mm_and_ps(), and not as a value in floating point calculation.

If it has a floating-point type, then yes, this is not allowed to contain a NaN or inf under finite-math-only. I don't know how __m128 is defined or intended to be used, but if this is wrong, it should not be considered a floating-point type.


slipher commented Dec 1, 2024

It should not be considered a floating-point type but rather a multi-purpose type that can contain floating-point values or bit masks. The most common 32-bit bit patterns used with the bitwise instructions are 0, 0xFFFFFFFF, and 0x80000000, with the latter being used to twiddle the sign bit. It occurs to me that with "fast math", all 3 of these values are potentially unreliable because of -fno-signed-zeros. Luckily for us, Clang does not (yet?) replace 0x80000000 masks with zeroes :)


shinfd commented Dec 2, 2024

I would like to point out that 0x80000000 is a valid float, a negative zero, so that one is guaranteed to always be permitted.

The most common technique affected by this change is using 0x7FFFFFFF with _mm_and_ps() to calculate the absolute value of a __m128 value, like the one described in the Stack Overflow question below.

x86 - Fastest way to compute absolute value using SSE - Stack Overflow

This no longer works as of clang 19. The workaround is to use 0x80000000 (a valid float) and _mm_andnot_ps() instead; both forms are sketched below. In other situations that use bitwise intrinsics, such a workaround may not be available, so changing compiler options is likely the most practical remedy.
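
A minimal sketch of both forms, assuming SSE2 for the cast intrinsics (the helper names are illustrative):

#include <emmintrin.h>

// Clears the sign bit with a 0x7FFFFFFF mask; that mask is a NaN bit pattern,
// so under -ffinite-math-only clang 19 may discard it.
static inline __m128 abs_ps_and(__m128 v) {
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF));
    return _mm_and_ps(v, mask);
}

// Workaround: andnot against 0x80000000 (-0.0f, a valid float) also clears
// the sign bit, but it relies on signed zero being preserved.
static inline __m128 abs_ps_andnot(__m128 v) {
    const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000 in every lane
    return _mm_andnot_ps(signmask, v);
}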


If it has a floating-point type, then yes, this is not allowed to contain a NaN or inf under finite-math-only.

As I wrote in the original description, there are intrinsics, such as the comparisons, that emit the NaN bit pattern 0xFFFFFFFF by design. This is not an error output but their standard output. I have to agree that this was a bad design choice by Intel, but it makes it slightly harder to treat __m128 as a pure float type, and more of a hybrid type, as slipher suggested.

The current symptom can be found in implementations like the ones slipher and I showed above, which use constant non-finite values as inputs to the intrinsics.

However, the question can be extended to how future optimizations may handle NaNs produced by SSE intrinsics. If such outputs may also be affected, it may be wise to advise against using certain SSE intrinsics in combination with -ffast-math, possibly even emitting a warning; this boils down to the last point I made in the original description.

A crude example: given two __m128 values, this sample code selects the greater-or-equal components of the two.

https://godbolt.org/z/1eqMWWWKE

As the inputs to _mm_cmpge_ps() are constant, an optimization could potentially deduce that the interim variable vCmp will end up storing NaN and treat this entire code as undefined. That is not currently happening, and the code runs as intended, but is that guaranteed for the future as well?
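
For reference, a hedged sketch of that kind of select idiom (not the exact Godbolt source); vCmp holds the 0xFFFFFFFF / 0x00000000 comparison mask:

#include <xmmintrin.h>

// Per lane, returns a where a >= b and b otherwise.
static inline __m128 select_ge(__m128 a, __m128 b) {
    __m128 vCmp = _mm_cmpge_ps(a, b);            // all-ones (a NaN pattern) where a >= b
    return _mm_or_ps(_mm_and_ps(vCmp, a),        // keep a where the mask is set
                     _mm_andnot_ps(vCmp, b));    // keep b elsewhere
}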

Regards,


andykaylor commented Dec 2, 2024

The header file defines __m128 as a vector of floats, and clang chooses to lower intrinsics to target-independent IR as often as possible for optimization purposes, so applying the no-NaNs and no-infinities rules is a natural consequence. However, there is some question as to whether people using intrinsics actually want the fast-math flags to apply to their intrinsics. My experience is that some do and some don't. In clang, we do apply fast-math to the intrinsics. There are some problems with that. If you don't want fast-math flags applied to the intrinsics, you could do something like this:

#pragma float_control(push)
#pragma float_control(precise, on)
#include "immintrin.h"
#pragma float_control(pop)

There are some other problems with that.

I have expressed recently that as our optimizations based on the no-NaNs and no-infinities options become more aggressive, these options are going to be less useful to a broader group of people. For that reason, I recently changed the -ffp-model=fast option to no longer make the finite-math-only assumption and introduced -ffp-model=aggressive to keep the old behavior for those who really mean to say that their program will not encounter NaNs or infinities. The -ffast-math and -funsafe-math-optimizations options retain the same meaning they have in gcc, for compatibility reasons, but I think many people should consider using -funsafe-math-optimizations instead of -ffast-math.

CC @jcranmer-intel


slipher commented Dec 2, 2024

-funsafe-math-optimizations implies -fno-signed-zeros which, as I pointed out above, would also seem to be incompatible with bitwise operations under the "__m128 is always floats" model, since Clang could replace 0x80000000 (-0.0f) with 0x00000000 (0.0f) or vice versa.

@andykaylor commented:

-funsafe-math-optimizations implies -fno-signed-zeros which, as I pointed out above, would also seem to be incompatible with bitwise operations under the "__m128 is always floats" model, since Clang could replace 0x80000000 (-0.0f) with 0x00000000 (0.0f) or vice versa.

Yes, that's true, but of course -ffast-math also implies no signed zeroes. I think part of the problem we're seeing here is that the Intel intrinsics, especially the older intrinsics, were not designed to be strongly typed, but the implementation has become so over time. Many intrinsics users seem to think of the intrinsics as a sort of near-inline-assembly construct and just want the compiler to do what they said, but as I alluded to previously, clang doesn't see things that way.

If you want the raw bitwise functionality of _mm_and_ps, you may be better off using _mm_castps_si128 and _mm_and_si128, but that may just move the problem, since what you're really doing is a kind of type punning that clang doesn't like. I understand why you're doing it and why it's useful, but I'm not sure we have good rules to enable it.
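
For illustration, a minimal sketch of that integer route (assumes SSE2, since the casts and _mm_and_si128 were introduced there; the helper name is made up):

#include <emmintrin.h>

static inline __m128 and_via_int(__m128 a, __m128 b) {
    __m128i ia = _mm_castps_si128(a);   // reinterpret the bits as integers
    __m128i ib = _mm_castps_si128(b);
    return _mm_castsi128_ps(_mm_and_si128(ia, ib)); // integer AND, then back to float
}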


shinfd commented Dec 2, 2024

Apologies for my oversight regarding signed zero and its handling under -ffast-math. Thank you, slipher, for the follow-up.

So the use of 0x80000000 and _mm_andnot_ps() to calculate abs seems to function as intended at the time of writing, with multiple code samples tested, but because it relies on signed zero, the optimized behavior may potentially change.

Therefore, for the currently available clang 19 (and for future versions, if the change is to stay), in order to ensure handling of an arbitrary bitmask, I guess we should be using -fsigned-zeros in addition to -fno-finite-math-only when enabling -ffast-math.

@jcranmer-intel commented:

There are a few separate but not entirely orthogonal issues going on here.

  • __m128 at times really wants to just be "128-bit vector, number and type of lanes is unimportant" as opposed to <4 x float>. But I think the semantics today are too baked into the type to really change this; c'est la vie.
  • These functions are, from the user's perspective, largely atomic builtins, and the properties of the operations, including their fast-math attributes, should generally be inherited. This is particularly true for something like _mm_add_ps, where we want to be able to contract _mm_add_ps(_mm_mul_ps(x, y), z) into an FMA instruction (see the sketch after this list).
  • Note that several functions correspond to operations that can't get fast-math flags at all, like _mm_andnot_ps. Having the fast-math flags apply to these intrinsics just doesn't make sense.
  • To a degree, the value-based fast-math connotations (no NaNs, no infinities, no signed zero) don't make a lot of sense to lift onto these intrinsics in the first place, so even if we're generally inheriting the fast-math flags, we probably wouldn't want to inherit those (and if the user did want them, they could generally be inferred from other code).
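
A minimal sketch of the contraction case from the second bullet: under fast-math (or -ffp-contract=fast) and with FMA available, the multiply and add may be fused into a single FMA instruction, which is the kind of optimization one would not want to lose.

#include <xmmintrin.h>

static inline __m128 mul_add(__m128 x, __m128 y, __m128 z) {
    return _mm_add_ps(_mm_mul_ps(x, y), z); // candidate for FMA contraction
}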

The change that really broke this code is that now we have nofpclass(nan inf) applied to parameters/returns with __m128. It is probably advisable to change the definition of __m128 in such a way that we don't generate nofpclass(nan inf) with it, even in -ffast-math mode. What are your thoughts, @arsenm?


arsenm commented Dec 2, 2024

The change that really broke this code is that now we have nofpclass(nan inf) applied to parameters/returns with __m128. It is probably advisable to change the definition of __m128 in such a way that we don't generate nofpclass(nan inf) with it, even in -ffast-math mode. What are your thoughts, @arsenm?

I think the way to go is to stop treating the type __m128 as a floating-point type eligible for fast-math attributes/flags. IIRC I originally wanted to check that the original source type is FP, rather than the IR type, but I don't remember where that ended up.

@andykaylor commented:

I think the way to go is to stop treating the type __m128 as a floating-point type eligible for fast-math attributes/flags. IIRC I originally wanted to check that the original source type is FP, rather than the IR type, but I don't remember where that ended up.

The problem with that is that it would also block the FMA formation in the case that Joshua cited in his second bullet.

It seems to me that the problem is with the intrinsics like _mm_and_ps that want to treat the values as integers even though they accept arguments that say the values are floating-point. I suspect that there are integer equivalents that could be used with an intermediate cast intrinsic in all such cases. Could we deprecate the "bitwise floating-point" intrinsics? Maybe change the header definition to a macro that invokes the equivalent cast and integer intrinsics?

CC: @phoebewang


shinfd commented Dec 4, 2024

I believe deprecating them outright is a bit too extreme, since the intrinsics function correctly without -ffast-math. Warning when they are used in conjunction with -ffast-math seems more sensible, as I previously suggested (and that is only if clang development proceeds in this direction).

I suspect that there are integer equivalents that could be used with an intermediate cast intrinsic in all such cases.

Unfortunately, although the float bitwise intrinsics were defined in the original SSE, __m128i and the integer intrinsics (bitwise operations and casts) were only added later, starting with SSE2.

x86 intrinsics list | Microsoft Learn

So a straight replacement of the float intrinsics with their integer equivalents will result in an error when the -mno-sse2 option is used.
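
As a hedged sketch of that constraint (the helper name is made up), any drop-in that wants to use the integer route would need an SSE1 fallback, which simply reintroduces the original float bitwise intrinsic:

#include <immintrin.h>

static inline __m128 bitwise_and_ps(__m128 a, __m128 b) {
#if defined(__SSE2__)
    return _mm_castsi128_ps(
        _mm_and_si128(_mm_castps_si128(a), _mm_castps_si128(b)));
#else
    return _mm_and_ps(a, b);  // SSE1-only path, still subject to fast-math handling
#endif
}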


arsenm commented Dec 4, 2024

The problem with that is that it would also block the FMA formation in the case that Joshua cited in his second bullet.

It's ugly, but I suppose this only applies for nofpclass, and not the fast math flags.

@phoebewang commented:

How about something in between? #118603

@phoebewang commented:

How about something in between? #118603

Turns out float_control doesn't work for it: https://godbolt.org/z/7bEEMYvxW
The InstCombine pass does the replacement regardless.
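
A hedged reconstruction of the kind of test behind that link (not the exact Godbolt source): even with the mask-producing helper compiled under float_control(precise), the NaN bit pattern may still be dropped, because the rewrite happens in LLVM's InstCombine rather than in the clang frontend.

#include <emmintrin.h>

#pragma float_control(push)
#pragma float_control(precise, on)
static __m128 make_mask() {
    return _mm_castsi128_ps(
        _mm_set_epi32(0x7FFFFFFF, 0x00000000, 0x00000000, 0xFFFFFFFF));
}
#pragma float_control(pop)

__m128 abs_like(__m128 v) {
    return _mm_and_ps(v, make_mask());  // mask may not survive -ffast-math -O2
}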

@jcranmer-intel commented:

The problem I noticed is that using __m128 on a function parameter or return value for any function causes it to get nofpclass, so the user-defined helper functions that return __m128 get the nofpclass, which is sufficient for the poison implications, even if the intrinsics themselves don't have it.

A potential fix for that is to make __m128 somehow not get nofpclass independent of the current finite-math-only rules, and I don't know enough about the clang frontend to know how difficult it is to plumb something like that with an attribute or the like.

@phoebewang commented:

The problem I noticed is that using __m128 on a function parameter or return value for any function causes it to get nofpclass, so the user-defined helper functions that return __m128 get the nofpclass, which is sufficient for the poison implications, even if the intrinsics themselves don't have it.

A potential fix for that is to make __m128 somehow not get nofpclass independent of the current finite-math-only rules, and I don't know enough about the clang frontend to know how difficult it is to plumb something like that with an attribute or the like.

Thanks for pointing that out! I think it's a big hammer to introduce an attribute like that, and I don't think it's what we want. If __m128 doesn't behave like a normal floating-point type, why don't we replace it with __m128i instead?

I think the proper way to fix this is to make canApplyNoFPClass respect FPOptionsOverride. I'd consider it a bug that nofpclass cannot be controlled by float_control.

@andykaylor commented:

I'd consider it a bug that nofpclass cannot be controlled by float_control.

I would agree with this, but it doesn't seem to be limited to nofpclass. The float_control pragma misses some other attributes as well ("approx-func-fp-math" for example).
