[X86][SSE2] Failure to vectorize int16_t[8] to pminsw pattern #48223
Comments
This is a recent (SLP?) regression, present since LLVM 12. LLVM 11 produces the expected codegen.
@spatel Could this be related to the limitations you added to SLP for shifts by multiples of 8, to prevent it from destroying bswap patterns?
This is an unintended consequence of an extra SROA pass between loop unrolling and SLP that we got from the switch to the new pass manager. We also made the old pass manager match this behavior with: So we can reproduce the behavior in LLVM 11 by toggling the pass manager (because the extra SROA wasn't in the legacy PM at that time, but it was in the new PM): I'm not familiar with SROA, but it is converting the trailing insertvalue into shift+mask ops like: ...which seems ok in general, but it breaks up the expected min/max reduction pattern in this example, and I'm not sure if we can recover. Canonicalizing to the new min/max intrinsics doesn't appear to change anything. I'm not sure how to solve this: (1) change the pass ordering? (2) add a bailout to SROA for min/max reductions?
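To make the shift+mask form mentioned above concrete, here is a rough C++ sketch of what packing four i16 lanes into one i64 half of the { i64, i64 } return value amounts to. The helper names `pack4` and `lane_of` are hypothetical, and this illustrates the general pattern only; it is not the IR that SROA actually emitted in this report.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: packs four 16-bit lanes into one 64-bit word,
// with lane 0 in the low bits (little-endian lane order, as on x86).
std::uint64_t pack4(const std::array<std::int16_t, 4>& v) {
    std::uint64_t out = 0;
    for (int i = 0; i < 4; ++i) {
        // Zero-extend the lane to 64 bits, then shift it into place and OR it in.
        std::uint64_t lane = static_cast<std::uint16_t>(v[i]);
        out |= lane << (16 * i);
    }
    return out;
}

// Hypothetical inverse: shift + mask to recover lane i from the packed word.
std::int16_t lane_of(std::uint64_t word, int i) {
    return static_cast<std::int16_t>(
        static_cast<std::uint16_t>((word >> (16 * i)) & 0xFFFF));
}
```

Each lane goes through its own zext, shift and or, so the eight per-lane min results no longer look like one contiguous vector write. That is plausibly why the reduction pattern becomes hard for SLP to match.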
We have that ugly pattern because the function must return { i64, i64 }. But what if we could produce { i16, i16, i16, i16, i16, i16, i16, i16 } instead? I.e., is there any particular reason why we forbid bitcasts between structs?
I think we would need to define what it means if there are padding bytes in the structs being bitcast. Structs don’t exist in SelectionDAG so we’d need to teach SelectionDAGBuilder how to break down such a bitcast.
Doesn't SLP need to see store instructions as seeds? So is it really the ugly pattern that's breaking it, or the lack of stores?
Something strange that SLP is doing is that although the 2 insertvalue seeds have exactly the same shift pattern, we end up vectorizing them with non-uniform shifts. e.g. It seems to be because we've recognised that we can load combine the first 2 elements to <2 x i16>, but we completely fail to do that for the remaining pairs... @alexey-bataev Any suggestions?
Known problem - vectorization of the copyable elements. We have 2
The godbolt link shows that the codegen is pretty good now, so I assume it is fixed.
The X86 backend shuffle combining is saving us from some poor vectorised IR
I'm going to reopen this: as shown in 63f3a5b, we're relying on the backend (x86 shuffle combining) to do a lot of the work for this, and the middle-end still fails on SSE2 targets despite having low costs for a legal v8i16 SMIN instruction.
Extended Description
https://gcc.godbolt.org/z/57nGK3
This ends up as a horrid mix of scalar and <2 x i16> smin patterns.
Much of the problem seems to be that we end up trying to store the final array as a { i64, i64 } aggregate, resulting in a load of zext+shift+or to pack the i16s.
Even more impressive: with -march=atom we end up with masked gather calls...
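For readers without access to the godbolt link, the kind of kernel at issue can be sketched as below. This is an assumption about the shape of the reproducer (an elementwise signed min over eight int16_t lanes, returned by value), not a copy of the linked source, and the names `V8` and `min8` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Eight signed 16-bit lanes: 16 bytes, i.e. exactly one SSE2 register.
using V8 = std::array<std::int16_t, 8>;

// Elementwise signed minimum. Ideally this maps to a single pminsw
// (v8i16 SMIN) on SSE2 targets.
V8 min8(const V8& a, const V8& b) {
    V8 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = std::min(a[i], b[i]);
    return r;
}
```

Under the x86-64 SysV calling convention, a 16-byte aggregate of integers like this is returned in two general-purpose registers, which is where the { i64, i64 } return type in the IR comes from.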