JIT: Accelerate more casts on x86 #116805

saucecontrol · 2025-06-19T04:03:46Z

This is the follow-up to #114597. It addresses the invalid removal of intermediate double casts in long->double->float chains and does more cleanup/normalization in fgMorphExpandCast, removing more of the platform-specific logic.

It also adds some more acceleration to x86. With this, all integral<->floating casts are accelerated with AVX-512, and all 32-bit integral<->floating casts are accelerated with the baseline ISAs.

Example AVX-512 implementation of floating->long/ulong casts:

 G_M38981_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
-       sub      esp, 8
-       vzeroupper 
-						;; size=6 bbWeight=1 PerfScore 1.25
+						;; size=0 bbWeight=1 PerfScore 0.00
 G_M38981_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref
-       vmovsd   xmm0, qword ptr [esp+0x0C]
-       sub      esp, 8
-       ; npt arg push 0
-       ; npt arg push 1
-       vmovsd   qword ptr [esp], xmm0
-       call     CORINFO_HELP_DBL2LNG
-       ; gcr arg pop 2
-						;; size=19 bbWeight=1 PerfScore 6.25
+       vmovsd   xmm0, qword ptr [esp+0x04]
+       vcmpsd   k1, xmm0, xmm0, 7
+       vcvttpd2qq xmm1 {k1}{z}, xmm0
+       vcmpsd   k1, xmm0, qword ptr [@RWD00], 29
+       vpcmpeqd xmm0, xmm0, xmm0
+       vpsrlq   xmm1 {k1}, xmm0, 1
+       vmovd    eax, xmm1
+       vpextrd  edx, xmm1, 1
+						;; size=51 bbWeight=1 PerfScore 18.50
 G_M38981_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      esp, 8
        ret      8
-						;; size=6 bbWeight=1 PerfScore 2.25
+						;; size=3 bbWeight=1 PerfScore 2.00
+RWD00  	dq	43E0000000000000h

Example SSE2/SSE4.1 implementation of uint->floating casts:

 G_M24329_IG01:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000000 {}
 						;; size=3 bbWeight=1 PerfScore 0.25
 G_M24329_IG02:        ; bbWeight=1, gcrefRegs=00000000 {}, byrefRegs=00000002 {ecx}, byref
        ; byrRegs +[ecx]
-       push     0
-       ; npt arg push 0
-       push     dword ptr [ecx]
-       ; npt arg push 1
-       call     [CORINFO_HELP_LNG2DBL]
-       ; byrRegs -[ecx]
-       ; gcr arg pop 2
-       fstp     qword ptr [esp]
-       movsd    xmm0, qword ptr [esp]
+       mov      eax, dword ptr [ecx]
+       xorps    xmm0, xmm0
+       cvtsi2sd xmm0, eax
+       movaps   xmm1, xmm0
+       addsd    xmm1, qword ptr [reloc @RWD00]
+       movaps   xmm2, xmm0
+       blendvpd xmm0, xmm1
+       movsd    qword ptr [esp], xmm0
        fld      qword ptr [esp]
-						;; size=21 bbWeight=1 PerfScore 10.00
+						;; size=36 bbWeight=1 PerfScore 15.83
 G_M24329_IG03:        ; bbWeight=1, epilog, nogc, extend
        add      esp, 8
        ret      
 						;; size=4 bbWeight=1 PerfScore 1.25
+RWD00  	dq	41F0000000000000h

Full diffs

dotnet-policy-service · 2025-06-19T04:04:33Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

saucecontrol · 2025-06-19T23:16:57Z

cc @tannergooding

src/coreclr/jit/hwintrinsiccodegenxarch.cpp

src/coreclr/jit/morph.cpp

src/coreclr/jit/lowerxarch.cpp

src/coreclr/jit/codegenxarch.cpp

src/coreclr/jit/decomposelongs.cpp

saucecontrol · 2025-07-08T20:20:00Z

The codegen for AVX-512 floating->long regressed after merging in main (likely something with #116983). I'll give it another look.

saucecontrol · 2025-07-10T06:35:20Z

#117485 resolved the regression. Diffs look good again.

src/coreclr/jit/decomposelongs.cpp

tannergooding · 2025-07-10T17:03:59Z

src/coreclr/jit/decomposelongs.cpp

+        CorInfoType srcBaseType = CORINFO_TYPE_UNDEF;
+        CorInfoType dstBaseType = CORINFO_TYPE_UNDEF;
+
+        if (varTypeIsFloating(srcType))


Do we know if the compiler is CSEing this check with the above check as expected?

-- Asking since manually caching might be a way to win some throughput back.

Not for sure, but I generally assume C++ compilers will handle 'obvious' ones like this. It should be noted, the throughput hit to x86 directly correlates with the number of casts that are now inlined.

i.e. the only significant throughput hit is on the coreclr_tests collection

which is also the one that had the most casts in it

👍. The biggest concern is the TP hit to minopts. It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

I think #117512 will reduce the hit a bit.

It may be desirable to leave that using the helper there so that floating-point heavy code doesn't start up slower.

This is interesting, because the same argument could apply to the complicated saturating logic that we have for x64 as well. #97529 introduced a similar throughput regression, and although it was done for correctness instead of perf, the throughput hit could have been avoided by using the helper in minopts there too.

the throughput hit could have been avoided by using the helper in minopts there too.

AFAIR, the JIT throughput hit there ended up being very minimal (and often an improvement). It was the perf score and code output size that regressed, which was expected.

If there was a significant perf score hit to minopts, then yes the same would apply here and it would likely be beneficial to ensure that is doing the "better" thing as well.

src/coreclr/jit/flowgraph.cpp

tannergooding · 2025-08-08T16:09:26Z

@saucecontrol there's some merge conflicts on this one and it is also dependent on #117571 going in first, correct?

saucecontrol · 2025-08-08T16:37:03Z

My plan was to trim this down to just the new acceleration for x86, but yeah, that would build on the part split into #117571. I'll set this to draft in the meantime.

dotnet-policy-service · 2025-10-07T03:42:33Z

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 19, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 19, 2025

build-analysis bot mentioned this pull request Jun 19, 2025

browser-wasm Windows build error #116746

Closed

accelerate more casts on x86

91116c7

saucecontrol force-pushed the lng2flt2 branch from c153a51 to 91116c7 Compare June 19, 2025 21:17

formatting

f65daa7

saucecontrol marked this pull request as ready for review June 19, 2025 23:16

saucecontrol commented Jun 20, 2025

View reviewed changes

src/coreclr/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved

src/coreclr/jit/morph.cpp Show resolved Hide resolved

src/coreclr/jit/lowerxarch.cpp Show resolved Hide resolved

saucecontrol added 2 commits June 21, 2025 10:17

add method header

f23c534

Merge remote-tracking branch 'upstream/main' into lng2flt2

3af2fd6