-
Notifications
You must be signed in to change notification settings - Fork 13.6k
Missed Optimization - Replacement of rint/lrint with X87/SSE specific instructions #55202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@llvm/issue-subscribers-backend-x86 |
Brief notes:
|
looking at the man page for Looking at the output with Edit:
So it says it can raise an exception and cause a domnain-error, but not touch errno, although domain errors are signaled in errno? |
ping for this, mesa want this at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/19978 |
not for these functions, no. the and note also that the C standard doesn't guarantee any floating point exceptions here either --- the exact wording in the standard is "The rint functions differ from the nearbyint functions (7.12.9.3) only in that the rint functions -*- anyway, i'm here as the maintainer of Android's libc ("bionic"), having been pointed at this bug after i asked our llvm folks if they knew why -- unlike arm64 and risc-v -- "i'd use this" (well, the |
Sounds like simplifylibcalls could use some additions here? cc @davidbolvansky So it looks like clang (the front end) is lowering calls to cc @phoebewang |
https://cs.android.com/android/platform/superproject/+/master:bionic/libm/Android.bp;l=432 shows that bionic does build with |
sorry, missed that there was a question for me... yes, bionic does explicitly set yes, it does that unconditionally for all architectures. (and note that we have a bunch of other x86/x86-64 builtins that work fine in https://cs.android.com/android/platform/superproject/+/main:bionic/libm/builtins.cpp.) does anyone have a godbolt link where this actually works? i haven't been able to come up with one. though
does work in gcc. but trunk clang as of godbolt today still emits function calls. |
https://godbolt.org/z/4KsKv9f69 shows clang generating roundss/roundsd, if that's what you're asking. |
Checked also again with |
ah, thanks! so it looks like what we're actually missing is |
I think this is just SLP not detecting the pattern and transforming it to a vector operation; you might want to file a separate bug specifically about SLP. |
I have improved the lowering of vector lrint/llrint by #90065. I think the x86 specific issue can be closed. |
For android this optimization is still missed. https://godbolt.org/z/xKEfeq97E $ clang -O2 -ffast-math # After x86-isel
#Machine code for function fl1(double): IsSSA, TracksLiveness
Frame Objects:
fi#0: size=8, align=8, at location [SP+8]
fi#1: size=8, align=8, at location [SP+8]
Function Live Ins: $xmm0 in %0
bb.0.entry:
liveins: $xmm0
%0:fr64 = COPY $xmm0
MOVSDmr %stack.1, 1, $noreg, 0, $noreg, %0:fr64 :: (store (s64) into %stack.1); example.cpp:1:46
%1:rfp80 = nofpexcept LD_Fp64m80 %stack.1, 1, $noreg, 0, $noreg, implicit-def dead $fpsw, implicit $fpcw :: (load (s64) from %stack.1); example.cpp:1:46
IST_Fp64m80 %stack.0, 1, $noreg, 0, $noreg, killed %1:rfp80, implicit-def dead $fpsw, implicit $fpcw :: (store (s64) into %stack.0); example.cpp:1:29
%2:gr64 = MOV64rm %stack.0, 1, $noreg, 0, $noreg :: (load (s64) from %stack.0); example.cpp:1:29
$rax = COPY %2:gr64; example.cpp:1:22
RET 0, $rax; example.cpp:1:22
# End machine code for function fl1(double). clang -O2 -ffast-math --target=x86_64-linux-android # After x86-isel
# Machine code for function fl1(double): IsSSA, TracksLiveness
Function Live Ins: $xmm0 in %0
bb.0.entry:
liveins: $xmm0
%0:fr64 = COPY $xmm0
ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
$xmm0 = COPY %0:fr64; example.cpp:1:46
CALL64pcrel32 target-flags(x86-plt) &__extenddftf2, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0, implicit-def $rsp, implicit-def $ssp, implicit-def $xmm0; example.cpp:1:46
ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp; example.cpp:1:46
%1:vr128 = COPY $xmm0; example.cpp:1:46
$xmm0 = COPY %1:vr128; example.cpp:1:29
TCRETURNdi64 target-flags(x86-plt) &lrintl, 0, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $xmm0; example.cpp:1:29
# End machine code for function fl1(double). |
Sadly this seems to not be able to see through the truncation to 32bit in this example: https://godbolt.org/z/Pofx7eP4G |
For https://godbolt.org/z/xKEfeq97E, the problem is |
yeah, no Android-supported architecture has hardware fp128 support, so that's expected. and i messed up in my previous comment --- copying directly from the bionic source now, what i'm trying to do is add
and the reason it doesn't work (found by @pirama-arumuga-nainar yesterday) is that we ...but unless i misunderstood https://www.felixcloutier.com/x86/cvtsd2si "When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register" sounds like this instruction does take <fenv.h> into account (just like the corresponding arm64 and riscv64 instructions), so it seems like a bug that |
It's not a bug. When |
Uh oh!
There was an error while loading. Please reload this page.
X87 and SSE have simple rounding and converting store instructions, which are essentially equivalent to
l{0,2}rint[fl]?
Clang/LLVM does not seem to replace calls to
rint
with these, and neither does it vectorise these when used to round/convert vectors in all cases.(truncation is properly replaced)
Some examples follow below
GCC is listed aswell,
The main difference to them is, that they do schedule their
fldcw
for truncation earlier and replacerintl
, as well as use some bit-magic forrintf
Note: Using
f32x4
forfloat __vector(4)
andi32x4
forint __vector(4)
Note:
cvtss2si
!=cvttss2si
Note: Assuming Overflows etc are UB, and HW's behaviour is acceptable
rintl
call rintl@PLT
frndint
frndint
(int)rintl
call rintl@PLT
+truncationcall rintl@PLT
+truncationfistp m16/m32/m64
lrintl
call lrintl
call lrintl
fistp m16/m32/m64
lrint
call lrintl
call lrintl
cvtss2si r32/r64, xmmX
(int)rintf
call rintf@PLT;cvttss2si
cvttss2si
cvtss2si r32, xmmX
(int)rintf (SSE4.2)
roundss + cvttss2si
roundss + cvttss2si
cvtss2si r32, xmmX
4x lrintf (f32x4->i32x4)
call lrintl
)call lrintl
cvtps2dq xmmY, xmmX
Tested using glodbolt and
x86_64 Clang 14.0.0
as well asx86_64 GCC 11.2
with O2 and O3Update: Seems like most cases are now cought, only coalecsing
cvtss2si
s tocvtps2dq
The text was updated successfully, but these errors were encountered: