[SLPVectorizer] Performance degradation with xor + fshl on X86 #63980
Here is the reproducer: https://godbolt.org/z/s1PjYMs5a |
@alexey-bataev @RKSimon any pointers here on what the issue might be? |
So, I tried printing the cost model for xor and fshl under different mattr:
opt chk.ll -passes="print<cost-model>" -disable-output -mtriple=x86_64-- -mattr=+avx2 2>&1
Changing to -mattr=+avx2 on the above testcase should increase the cost of the vector fshl from 1 to 4. However, the debug output shows the SLP cost as -11! Shouldn't it go from -8 to -5 instead?
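For reference, a minimal input of the kind fed to that command could look like this (a hypothetical stand-in for chk.ll, not the original file):

```llvm
; Hypothetical stand-in for chk.ll: one scalar and one <2 x i32> rotate
; (fshl with both value operands equal), so print<cost-model> reports
; the two costs being compared above.
declare i32 @llvm.fshl.i32(i32, i32, i32)
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)

define i32 @scalar_rot(i32 %x) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 15)
  ret i32 %r
}

define <2 x i32> @vector_rot(<2 x i32> %x) {
  %r = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 15, i32 15>)
  ret <2 x i32> %r
}
```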
|
The v2i32 types should be legalized to v4i32 inside the TTI callbacks. AVX512 does have a cheap v4i32 rotate instruction, which is what those fshl should lower to, but on AVX2 they will expand to shifts/or - hence the higher cost. What I think is going on, though, is the high cost of the scalar rotate instructions - we don't cost constant rotate amounts as cheaper than the variable rotate instructions, despite them being notably faster on almost all hardware: { ISD::ROTL, MVT::i32, { 2, 3, 1, 3 } }. So we'd need to add lower costs for constant rotate amounts. |
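In IR terms, the two scalar cases that entry currently costs identically are these (a minimal sketch; the lowerings noted in the comments are the usual x86 ones):

```llvm
declare i32 @llvm.fshl.i32(i32, i32, i32)

; Constant rotate amount: lowers to a single immediate rotate
; (e.g. roll $1, %edi), cheap on almost all hardware.
define i32 @rot_const(i32 %x) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 1)
  ret i32 %r
}

; Variable rotate amount: lowers to a rotate by %cl, which is what the
; { ISD::ROTL, MVT::i32, { 2, 3, 1, 3 } } costs are really modelling.
define i32 @rot_var(i32 %x, i32 %amt) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 %amt)
  ret i32 %r
}
```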
I don't think it is directly related to the SLP vectorizer, but rather to the X86 cost model. And Simon probably identified the actual issue. |
Yep, thank you. I was looking at the cost of the vector ones and trying to increase those.
Earlier it was -8. For reference, the scalar assembly is:
I tested this with a couple of different mattr and I think the cost may not be reflecting the lowering.
It does reflect it for avx512 and for avx2 with +xop (which lowers to a single instruction: vprotd $1, %xmm0, %xmm0). However, for the mattr from this repro.ll, we lower fshl into three vector instructions:
|
So there are a couple of problems to address:
|
Thanks @RKSimon for taking this up! I instrumented the SLPVectorizer cost model and confirmed that the fshl costs correctly reflect what we get in the lowered code. |
Fixes issue identified on #63980 where the undef rotate amounts (during widening from v2i32 -> v4i32) were being constant folded to 0 when the shift amounts are created during expansion, losing the splat'd shift amounts.
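As an illustration of the case being fixed (a sketch, not the test from the patch): when the illegal v2i32 below is widened to v4i32, the extra shift-amount lanes are undef, and expansion was folding the whole amount vector to 0 instead of keeping the splat.

```llvm
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)

; v2i32 is not a legal X86 type; it is widened to v4i32, so the shift
; amounts conceptually become <i32 1, i32 1, i32 undef, i32 undef>.
; The fix preserves the splat'd 1 rather than folding the vector to 0.
define <2 x i32> @rot1(<2 x i32> %x) {
  %r = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
  ret <2 x i32> %r
}
```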
Thank you for landing the above change @RKSimon. However, we don't see any difference in the performance of the internal benchmark. While we do generate better vector code (in some cases, as seen in the test changes for the patch), it is still not as good as the scalar code. It looks like what needs correction is the X86 cost model; even with the lower costs for constant rotate amounts, we still SLP vectorize it. Also, for the mattr where we see the performance degradation (across Skylake and Cascade Lake), we haven't reduced the number of vector instructions (I haven't checked the throughput of the changed instructions). The original code generated was:
With the above improvement, we have:
For:
|
Sorry, still working on this - as you said, the costs still need attention (that needs some refactoring, while the backend fixes were pretty trivial). |
No worries and thank you! I just wanted to clarify my understanding that to completely avoid the perf degradation here, we should not SLP vectorize this code (by fixing the X86 cost model). |
Including test coverage for Issue #63980
As noted on #63980 rotate by immediate amounts is much cheaper than variable amounts. This still needs to be expanded to vector rotate cases, and we need to add reasonable funnel-shift costs as well (very tricky as there's a huge range in CPU behaviour for these).
I reduced the testcase with bugpoint to show the minimal code needed for SLP vectorizing that same loop: cat input.ll:
Run this through the SLP vectorizer. The target-features attribute for the function is:
|
@RKSimon @alexey-bataev I went through the SLPVectorizer debug output and I think I found the reason why we vectorize. It looks like we over-estimate the cost of the scalar loads as well, assuming we generate 8 loads whereas we in fact only generate 2. Consider the same reduced example as above; here is the main snippet:
This is the assembly for the above scalar code with the accurate mattr:
This is the vectorized assembly:
According to the X86/SLPVectorizer cost model, we calculate the SLP benefit as:
Scalar: 8 (loads) + 6 (xor) + 2 (store) + 2 (fshl, cost 1 each) = 18 // The fshl cost was corrected by Simon's patch to the lower cost of 1 per fshl.
SLP benefit = Vector - Scalar = 12 - 18 = -6
However, in the scalar assembly we have only 2 loads (not 8), so the correct scalar cost should be 12. Is this a correct understanding? |
Possibly need to estimate that some of the loads are "free" in scalar code? |
It depends on the cost kind - folded loads aren't "free" for rthroughput / latency / codesize (uop count on x86) - just instruction count. SLP uses rthroughput. Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage. |
This is what I meant, actually. It's just that this work requires lots of time; accounting for the rthroughput of the "folded" loads is much easier to do. |
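To make the folding concrete (a sketch; the assembly in the comments is the assumed x86 lowering): the loads are separate IR instructions and TTI charges each of them, but the backend merges them into the memory operands of the xors, so they never appear as standalone instructions.

```llvm
; Two of the "8 loads" from the cost calculation above: distinct IR
; instructions, so the scalar cost model charges each of them.
define i32 @xor_fold(ptr %p, i32 %acc) {
  %q = getelementptr inbounds i32, ptr %p, i64 1
  %a = load i32, ptr %p
  %b = load i32, ptr %q
  ; On x86 both loads fold into the xors, e.g.
  ;   xorl (%rdi), %esi
  ;   xorl 4(%rdi), %esi
  ; so the scalar assembly contains no standalone load instructions.
  %x1 = xor i32 %acc, %a
  %x2 = xor i32 %x1, %b
  ret i32 %x2
}
```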
llvm-mca analysis of the scalar vs vector snippets puts the vector version clearly in the lead: https://llvm.godbolt.org/z/rWPqdG1ve so we need to look further into why the benchmark isn't reflecting this. |
It could be that the above code is an over-simplified version (I did reduce it using bugpoint). I will take a look at the final version and see what is going on. I tried with the exact mattr of the machine (apart from the mcpu); it didn't make a difference in the above mca output. |
I've done the following experiment:
However, running through regular perf on Cascade Lake shows the scalar output is clearly better (a higher throughput of 2.55 IPC for scalar versus 1.63 IPC for vector). Does this possibly point to the scheduler model used in MCA not being accurate enough? We do see that the SLP model doesn't account for the folded loads, so that is possibly orthogonal to why MCA doesn't reflect the perf output. With SLP:
Without SLP:
|
@annamthomas Am I looking at these numbers correctly? The SLP (vector) version seems faster than the non-SLP (scalar) one? Or are you just looking at the IPC? |
@RKSimon Sorry, the above numbers should be ignored. Perf is running on a Java VM, which means we need to allow sufficient warmup time before measuring. I analyzed this in detail along with the original benchmark to figure out why llvm-mca doesn't reflect the degradation. I think the reason is that the reciprocal throughput computed by mca is correct only when there are no loop-carried dependencies. However, this entire sequence (source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/provider/SHA2.java#L160) in fact has a loop-carried dependency. The SLP vectorizer decides to vectorize this (unrolled-by-2) loop and generates bad performance. I have used perf normalized to benchmark iterations (instead of per second, as was done above) to give an accurate performance comparison: Without SLP
With SLP:
As we can see, SLP has higher cycles/op (4329.138 compared to 2692.120), while IPC is comparable. |
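A reduced shape of the dependency being described (a hypothetical sketch, not the SHA2 IR): each iteration's rotate feeds the next iteration through a phi, so iterations cannot overlap the way a pure reciprocal-throughput model assumes.

```llvm
declare i32 @llvm.fshl.i32(i32, i32, i32)

define i32 @carried(ptr %p, i32 %n) {
entry:
  br label %loop
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %acc = phi i32 [ 0, %entry ], [ %acc.next, %loop ]
  %gep = getelementptr inbounds i32, ptr %p, i32 %i
  %v = load i32, ptr %gep
  ; %acc.next depends on the previous iteration's %acc.next via %acc:
  ; a loop-carried chain that serializes the xor+rotate across iterations.
  %x = xor i32 %acc, %v
  %acc.next = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 15)
  %i.next = add nuw i32 %i, 1
  %done = icmp eq i32 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret i32 %acc.next
}
```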
@alexey-bataev Just to bring this discussion back: we continue seeing the huge degradation with SLP vectorizing this code. Any pointers on how we can fix this (without hacky code downstream) are greatly appreciated. Should we account for folded loads in the scalar cost and just treat them as "free"? |
That's not an issue in the SLP vectorizer, it's definitely an issue in the cost model. I think we need to teach TTI about folded loads for some scalar operations. I had a patch for this somewhere on Phabricator some time ago, but it did not go in. Plus, it would require some extra tuning, I assume. |
@annamthomas Let me take another look at this |
@annamthomas Do you still see the perf regression when setting slp-threshold=2? At that threshold the vectorised rotate is still there, but we stop replacing 4 i32 loads with a v4i32 load and 4 extracts, which might have been another problem - we're probably going to be better off keeping them scalar: https://godbolt.org/z/qdzTYG9Gv |
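The trade-off described here, in sketch form (illustrative, not the IR from the godbolt link):

```llvm
; Shape SLP produces below the default threshold: one wide load feeding
; scalar users through extracts.
define i32 @wide_load(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %e0 = extractelement <4 x i32> %v, i32 0
  %e1 = extractelement <4 x i32> %v, i32 1
  %e2 = extractelement <4 x i32> %v, i32 2
  %e3 = extractelement <4 x i32> %v, i32 3
  %s0 = xor i32 %e0, %e1
  %s1 = xor i32 %s0, %e2
  %s2 = xor i32 %s1, %e3
  ret i32 %s2
}

; Scalar shape that is probably better here: four plain loads, each of
; which the backend can fold into its xor.
define i32 @scalar_loads(ptr %p) {
  %p1 = getelementptr inbounds i32, ptr %p, i64 1
  %p2 = getelementptr inbounds i32, ptr %p, i64 2
  %p3 = getelementptr inbounds i32, ptr %p, i64 3
  %a = load i32, ptr %p
  %b = load i32, ptr %p1
  %c = load i32, ptr %p2
  %d = load i32, ptr %p3
  %s0 = xor i32 %a, %b
  %s1 = xor i32 %s0, %c
  %s2 = xor i32 %s1, %d
  ret i32 %s2
}
```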
reverse-ping @annamthomas |
I'll check and let you know on Monday @RKSimon (just returned from vacation). |
@RKSimon Even with -slp-threshold=2, we see the same degradation. I've confirmed in the IR that we no longer replace the 4 i32 loads with a v4i32 load and 4 extracts, so it doesn't seem that's the problem here. |
…e cost analysis. (#124129) We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.) Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can. Noticed while having yet another attempt at #63980
I've finally gotten to the bottom of this - we have gaps in the handling of funnel shift intrinsic costs, especially when the costs are for type only - #124175 partially addresses this, but even then we need to extend the x86 cost tables to prevent fallback to the generic costs in some cases (which are still absurdly cheap even when it knows we custom lower the intrinsic). I know what I need to get done now (and can already avoid the bad vectorization with a few hacks!). Better late than never :( |
…tor IntrinsicCostAttributes (#124254) getVectorCallCosts determines the cost of a vector intrinsic, based off an existing scalar intrinsic call - but we were including the scalar argument data to the IntrinsicCostAttributes, which meant that not only was the cost calculation not type-only based, it was making incorrect assumptions about constant values etc. This also exposed an issue that x86 relied on fallback calculations for funnel shift costs - this is great when we have the argument data as that improves the accuracy of uniform shift amounts etc., but meant that type-only costs would default to Cost=2 for all custom lowered funnel shifts, which was far too cheap. This is the reverse of #124129 where we weren't including argument data when we could. Fixes #63980
…e cost analysis. (#124129) (REAPPLIED) We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.) Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can. Noticed while having yet another attempt at #63980 Reapplied cleanup now that #125223 and #124984 have landed.
With the SLP vectorizer, a hot loop with 6 xors + 2 fshls gets reduced to 3 xors + 1 fshl. We vectorize with a VF of 2.
The SLP cost model gives it a cost of -8.
This is the loop in question:
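(The IR block did not survive in this copy; the following is an illustrative sketch of the shape described above, not the actual reproducer. It matches the scalar operation counts used in the cost discussion earlier in the thread: 8 loads + 6 xors + 2 fshls + 2 stores, with the two xor chains interleaved as in an unrolled-by-2 loop.)

```llvm
declare i32 @llvm.fshl.i32(i32, i32, i32)

; Illustrative stand-in for the loop body: 8 loads feeding two
; interleaved chains of three xors, each chain ending in a
; rotate-by-constant whose result is stored back.
define void @body(ptr %in, ptr %out) {
  %p1 = getelementptr inbounds i32, ptr %in, i64 1
  %p2 = getelementptr inbounds i32, ptr %in, i64 2
  %p3 = getelementptr inbounds i32, ptr %in, i64 3
  %p4 = getelementptr inbounds i32, ptr %in, i64 4
  %p5 = getelementptr inbounds i32, ptr %in, i64 5
  %p6 = getelementptr inbounds i32, ptr %in, i64 6
  %p7 = getelementptr inbounds i32, ptr %in, i64 7
  %v0 = load i32, ptr %in
  %v1 = load i32, ptr %p1
  %v2 = load i32, ptr %p2
  %v3 = load i32, ptr %p3
  %v4 = load i32, ptr %p4
  %v5 = load i32, ptr %p5
  %v6 = load i32, ptr %p6
  %v7 = load i32, ptr %p7
  %a0 = xor i32 %v0, %v2
  %a1 = xor i32 %a0, %v4
  %a2 = xor i32 %a1, %v6
  %r0 = call i32 @llvm.fshl.i32(i32 %a2, i32 %a2, i32 15)
  %b0 = xor i32 %v1, %v3
  %b1 = xor i32 %b0, %v5
  %b2 = xor i32 %b1, %v7
  %r1 = call i32 @llvm.fshl.i32(i32 %b2, i32 %b2, i32 15)
  store i32 %r0, ptr %out
  %o1 = getelementptr inbounds i32, ptr %out, i64 1
  store i32 %r1, ptr %o1
  ret void
}
```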
When we vectorize it, we get:
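(Again a sketch rather than the original block: the two xor chains land in the two lanes of <2 x i32>, giving 3 xors + 1 fshl + 1 store.)

```llvm
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)

; The same body after SLP with VF=2: pairs of scalar loads become
; <2 x i32> loads, the xor chains collapse to 3 vector xors, and the
; two rotates become one vector fshl.
define void @body_slp(ptr %in, ptr %out) {
  %p2 = getelementptr inbounds i32, ptr %in, i64 2
  %p4 = getelementptr inbounds i32, ptr %in, i64 4
  %p6 = getelementptr inbounds i32, ptr %in, i64 6
  %w0 = load <2 x i32>, ptr %in
  %w1 = load <2 x i32>, ptr %p2
  %w2 = load <2 x i32>, ptr %p4
  %w3 = load <2 x i32>, ptr %p6
  %a0 = xor <2 x i32> %w0, %w1
  %a1 = xor <2 x i32> %a0, %w2
  %a2 = xor <2 x i32> %a1, %w3
  %r = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %a2, <2 x i32> %a2, <2 x i32> <i32 15, i32 15>)
  store <2 x i32> %r, ptr %out
  ret void
}
```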
We see about a 40% degradation on a benchmark that exercises this hot loop.
The assembly for this loop shows the 3 xors become vpxor and the fshl lowering uses xmm registers:
While looking at the cost model for X86 arithmetic instructions, I do not see an entry for v2i32 XOR. Should we actually vectorize this loop?
I will attach the IR reproducer; -slp-threshold=2 shows we only vectorize this tree and still see the 40% degradation.