Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202

Hendiadyoin1 · 2022-04-30T11:54:33Z

X87 and SSE have simple rounding and converting store instructions, which are essentially equivalent to l{0,2}rint[fl]?

Clang/LLVM does not seem to replace calls to rint with these, and neither does it vectorise these when used to round/convert vectors in all cases.
(truncation is properly replaced)

Some examples follow below

GCC is listed aswell,
The main difference to them is, that they do schedule their fldcw for truncation earlier and replace rintl, as well as use some bit-magic for rintf

Note: Using f32x4 for float __vector(4) and i32x4 for int __vector(4)
Note: cvtss2si != cvttss2si
Note: Assuming Overflows etc are UB, and HW's behaviour is acceptable

Scenario	LLVM	GCC	Effective instruction(s)
`rintl`	`call rintl@PLT`	`frndint`	`frndint`
`(int)rintl`	`call rintl@PLT` +truncation	`call rintl@PLT` +truncation	`fistp m16/m32/m64`
`lrintl`	`call lrintl`	`call lrintl`	`fistp m16/m32/m64`

`lrint`	`call lrintl`	`call lrintl`	`cvtss2si r32/r64, xmmX`
`(int)rintf`	`call rintf@PLT;cvttss2si`	Bit magic+`cvttss2si`	`cvtss2si r32, xmmX`
`(int)rintf (SSE4.2)`	`roundss + cvttss2si`	`roundss + cvttss2si`	`cvtss2si r32, xmmX`
`4x lrintf (f32x4->i32x4)`	4x (shuffle+`call lrintl`)	4x (shuffle+`call lrintl`	`cvtps2dq xmmY, xmmX`

Tested using glodbolt and x86_64 Clang 14.0.0 as well as x86_64 GCC 11.2 with O2 and O3

Update: Seems like most cases are now cought, only coalecsing cvtss2sis to cvtps2dq

The text was updated successfully, but these errors were encountered:

llvmbot · 2022-04-30T12:50:05Z

@llvm/issue-subscribers-backend-x86

efriedma-quic · 2022-05-01T00:05:29Z

Brief notes:

Some additional optimizations are enabled here by -fno-math-errno. We could probably be more aggressive here by default.
LLVM doesn't currently combine (int)rint(f) -> lrint(f).
LLVM doesn't currently have any patterns to generate frndint.
The "bit magic" gcc does is mostly a floating-point trick. The rounding step of a floating-point "add" can be used to round a number. The rest is just ensuring the number is small, and doing the necessary sign bit manipulation.

Hendiadyoin1 · 2022-05-01T10:24:28Z

looking at the man page for rint and lrint,
it seems that rint cannot fail in practice, while lrint does not seem to set errno, and can be assumed infallible with -ffinite-math-only
and following the infallibility logic, patterns like "(int)floor" could be replaced by cw/mxcsr accesses and conversions instead of a call and a truncation, which would access the controll registers in a lot of cases anyway (or roundss+any-conversion)

Looking at the output with -fno-math-errno LLVM and GCC seem replace lrint correctly with cvtss2si, but fails to vecotrize it.

Edit:
The lrint man page seems quite weird:
It states:

See math_error(7) for information on how to determine whether an error has >occurred when calling these functions.

The following errors can occur:

Domain error: x is a NaN or infinite, or the rounded value is too large
An invalid floating-point exception (FE_INVALID) is raised.
These functions do not set errno.

So it says it can raise an exception and cause a domnain-error, but not touch errno, although domain errors are signaled in errno?

lygstate · 2022-12-19T20:53:08Z

ping for this, mesa want this at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19978

enh-google · 2023-02-07T01:08:55Z

Edit: The lrint man page seems quite weird: It states:

See math_error(7) for information on how to determine whether an error has >occurred when calling these functions.
The following errors can occur:
Domain error: x is a NaN or infinite, or the rounded value is too large
An invalid floating-point exception (FE_INVALID) is raised.
These functions do not set errno.

So it says it can raise an exception and cause a domnain-error, but not touch errno, although domain errors are signaled in errno?

not for these functions, no. the FE_INVALID is the clue here --- it's telling you that you need to use the functionality in <fenv.h> to recognize errors from these functions. (as opposed to the nearbyint() family, which can't report errors via any mechanism.)

and note also that the C standard doesn't guarantee any floating point exceptions here either --- the exact wording in the standard is "The rint functions differ from the nearbyint functions (7.12.9.3) only in that the rint functions
may raise the "inexact" floating-point exception if the result differs in value from the argument." (emphasis mine)

-*-

anyway, i'm here as the maintainer of Android's libc ("bionic"), having been pointed at this bug after i asked our llvm folks if they knew why -- unlike arm64 and risc-v --__builtin_lrint and __builtin_lrintf aren't implemented for x86/x86-64, despite the fact that they're both single instructions. (Android's LP64 ABIs have ld128, so there's no optimization for lrintl() for us.)

"i'd use this" (well, the __builtin_s anyway) :-)

nickdesaulniers · 2023-03-01T01:22:49Z

LLVM doesn't currently combine (int)rint(f) -> lrint(f).

Sounds like simplifylibcalls could use some additions here? cc @davidbolvansky

So it looks like clang (the front end) is lowering calls to __builtin_lrint to either @llvm.lrint.i64.f64 (as expected) or @lrint based on -fno-math-errno? Looking at https://cs.android.com/android/platform/superproject/+/master:bionic/libm/builtins.cpp;l=69?q=lrint&ss=android%2Fplatform%2Fsuperproject:bionic%2Flibm%2F, @enh-google does bionic build libm with -fno-math-errno (or some flag that implicitly sets that) for __aarch64__ and __riscv but not __x86_64__/__i386__? It does look like LLVM can lower those intrinsics on x86, when the frontend does generate them.
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/lrint-conv-i32.ll

cc @phoebewang

stephenhines · 2023-03-01T02:23:43Z

https://cs.android.com/android/platform/superproject/+/master:bionic/libm/Android.bp;l=432 shows that bionic does build with -fno-math-errno for all targets. It might be good to create a small reproducer in the Android build system (in bionic/ itself) to make sure that there's not another flag interfering with this somehow, or just that the places where this would be expected to show up aren't doing something that interferes with @llvm.lrint.i64.f64 being selected.

enh-google · 2024-07-15T18:17:19Z

sorry, missed that there was a question for me...

yes, bionic does explicitly set -fno-math-errno when building libm.

yes, it does that unconditionally for all architectures.

(and note that we have a bunch of other x86/x86-64 builtins that work fine in https://cs.android.com/android/platform/superproject/+/main:bionic/libm/builtins.cpp.)

does anyone have a godbolt link where this actually works? i haven't been able to come up with one. though

double r(double x) { return __builtin_rint(x); }
float rf(float x) { return __builtin_rintf(x); }
long double fl(long double x) { return __builtin_rintl(x); }

does work in gcc. but trunk clang as of godbolt today still emits function calls.

@appujee

efriedma-quic · 2024-07-15T18:37:12Z

https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.

Hendiadyoin1 · 2024-07-15T18:47:24Z

Checked also again with lrint which successfully generates a single cvtsd2si
Sadly doing so for a vector does not seem to be coalesced, so reducing the scope
https://godbolt.org/z/Pofx7eP4G
Casts/truncation seems to be properly cought though

enh-google · 2024-07-15T19:05:44Z

https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking.

ah, thanks! so it looks like what we're actually missing is -msse4, since that's not in the Android x86 ABI. does SSE4.1 or SSE4.2 not imply SSE4 though? i'm surprised from https://developer.android.com/ndk/guides/abis#sa that this doesn't "just work" on x86-64... certainly changing your -msse4 to -msse4.1 still seems to work. and our build system looks like it actually explicitly says -msse4 -msse4.1 -msse4.2, so i'll have to have a deeper look there and work out what's going on...

efriedma-quic · 2024-07-15T19:10:16Z

Sadly doing so for a vector does not seem to be coalesced, so reducing the scope
https://godbolt.org/z/Pofx7eP4G

I think this is just SLP not detecting the pattern and transforming it to a vector operation; you might want to file a separate bug specifically about SLP.

phoebewang · 2024-07-16T04:02:00Z

I have improved the lowering of vector lrint/llrint by #90065. I think the x86 specific issue can be closed.

appujee · 2024-07-17T15:15:26Z

For android this optimization is still missed. https://godbolt.org/z/xKEfeq97E

$ clang -O2 -ffast-math

# After x86-isel
#Machine code for function fl1(double): IsSSA, TracksLiveness
Frame Objects:
  fi#0: size=8, align=8, at location [SP+8]
  fi#1: size=8, align=8, at location [SP+8]
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  MOVSDmr %stack.1, 1, $noreg, 0, $noreg, %0:fr64 :: (store (s64) into %stack.1); example.cpp:1:46
  %1:rfp80 = nofpexcept LD_Fp64m80 %stack.1, 1, $noreg, 0, $noreg, implicit-def dead $fpsw, implicit $fpcw :: (load (s64) from %stack.1); example.cpp:1:46
  IST_Fp64m80 %stack.0, 1, $noreg, 0, $noreg, killed %1:rfp80, implicit-def dead $fpsw, implicit $fpcw :: (store (s64) into %stack.0); example.cpp:1:29
  %2:gr64 = MOV64rm %stack.0, 1, $noreg, 0, $noreg :: (load (s64) from %stack.0); example.cpp:1:29
  $rax = COPY %2:gr64; example.cpp:1:22
  RET 0, $rax; example.cpp:1:22

# End machine code for function fl1(double).

clang -O2 -ffast-math --target=x86_64-linux-android

# After x86-isel
# Machine code for function fl1(double): IsSSA, TracksLiveness
Function Live Ins: $xmm0 in %0

bb.0.entry:
  liveins: $xmm0
  %0:fr64 = COPY $xmm0
  ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  $xmm0 = COPY %0:fr64; example.cpp:1:46
  CALL64pcrel32 target-flags(x86-plt) &__extenddftf2, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0, implicit-def $rsp, implicit-def $ssp, implicit-def $xmm0; example.cpp:1:46
  ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
  %1:vr128 = COPY $xmm0; example.cpp:1:46
  $xmm0 = COPY %1:vr128; example.cpp:1:29
  TCRETURNdi64 target-flags(x86-plt) &lrintl, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0; example.cpp:1:29

# End machine code for function fl1(double).

Hendiadyoin1 · 2024-07-17T19:42:27Z

I have improved the lowering of vector lrint/llrint by #90065. I think the x86 specific issue can be closed.

Sadly this seems to not be able to see through the truncation to 32bit in this example: https://godbolt.org/z/Pofx7eP4G
with -m32 it works nicely though, great work

phoebewang · 2024-07-18T02:08:11Z

For https://godbolt.org/z/xKEfeq97E, the problem is long double interpreted as fp128 instead of fp80 on android. If you don't really need long double, you can use __builtin_lrint.
For https://godbolt.org/z/Pofx7eP4G, the difference comes from IR, which would be SLP issue as @efriedma-quic suggested.
Both are not x86 specific issues.

enh-google · 2024-07-18T11:44:27Z

yeah, no Android-supported architecture has hardware fp128 support, so that's expected. and i messed up in my previous comment --- copying directly from the bionic source now, what i'm trying to do is add __i386__ and __x86_64__ to

#if defined(__aarch64__) || defined(__riscv)
long lrint(double x) { return __builtin_lrint(x); }
long lrintf(float x) { return __builtin_lrintf(x); }
long long llrint(double x) { return __builtin_llrint(x); }
long long llrintf(float x) { return __builtin_llrintf(x); }
#endif

and the reason it doesn't work (found by @pirama-arumuga-nainar yesterday) is that we -include a header containing #pragma STDC FENV_ACCESS ON before this. if i hack in a quick #pragma STDC FENV_ACCESS OFF into this file, i see the same [inlined single instruction] results from the [much] earlier godbolt link.

...but unless i misunderstood https://www.felixcloutier.com/x86/cvtsd2si "When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register" sounds like this instruction does take <fenv.h> into account (just like the corresponding arm64 and riscv64 instructions), so it seems like a bug that #pragma STDC FENV_ACCESS ON disables use of it for x86/x86-64?

phoebewang · 2024-07-18T13:57:09Z

It's not a bug. When #pragma STDC FENV_ACCESS ON, the front end will generate llvm.experimental.constrained.lrint. We just haven't handled the STRICT_LRINT nodes yet.

github-actions bot added the new issue label Apr 30, 2022

EugeneZelenko added backend:X86 and removed new issue labels Apr 30, 2022

RKSimon mentioned this issue May 1, 2022

Failure to vectorize @llvm.lrint.i64.f32 #55208

Closed

fhahn added the missed-optimization label May 11, 2022

Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202

Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202

Comments

Hendiadyoin1 commented Apr 30, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

llvmbot commented Apr 30, 2022

Uh oh!

efriedma-quic commented May 1, 2022

Uh oh!

Hendiadyoin1 commented May 1, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lygstate commented Dec 19, 2022

Uh oh!

enh-google commented Feb 7, 2023

Uh oh!

nickdesaulniers commented Mar 1, 2023

Uh oh!

stephenhines commented Mar 1, 2023

Uh oh!

enh-google commented Jul 15, 2024

Uh oh!

efriedma-quic commented Jul 15, 2024

Uh oh!

Hendiadyoin1 commented Jul 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enh-google commented Jul 15, 2024

Uh oh!

efriedma-quic commented Jul 15, 2024

Uh oh!

phoebewang commented Jul 16, 2024

Uh oh!

appujee commented Jul 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Hendiadyoin1 commented Jul 17, 2024

Uh oh!

phoebewang commented Jul 18, 2024

Uh oh!

enh-google commented Jul 18, 2024

Uh oh!

phoebewang commented Jul 18, 2024

Uh oh!

Hendiadyoin1 commented Apr 30, 2022 •

edited

Loading

Hendiadyoin1 commented May 1, 2022 •

edited

Loading

Hendiadyoin1 commented Jul 15, 2024 •

edited

Loading

appujee commented Jul 17, 2024 •

edited

Loading