-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[X86] Questionable codegen for shuffles + combinations #63946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@llvm/issue-subscribers-backend-x86 |
Related to #63710 |
RKSimon
added a commit
that referenced
this issue
Jul 19, 2023
Current AVX2 codegen: has_equal_in_u32(long long __vector(4), long long __vector(4)): # @has_equal_in_u32(long long __vector(4), long long __vector(4))
vpshufd ymm2, ymm1, 57 # ymm2 = ymm1[1,2,3,0,5,6,7,4]
vpshufd ymm3, ymm1, 78 # ymm3 = ymm1[2,3,0,1,6,7,4,5]
vpshufd ymm4, ymm1, 147 # ymm4 = ymm1[3,0,1,2,7,4,5,6]
vpcmpeqd ymm5, ymm0, ymm1
vpcmpeqd ymm6, ymm2, ymm0
vpor ymm5, ymm6, ymm5
vpcmpeqd ymm3, ymm3, ymm0
vpor ymm3, ymm5, ymm3
vpcmpeqd ymm5, ymm4, ymm0
vpermq ymm6, ymm1, 78 # ymm6 = ymm1[2,3,0,1]
vpcmpeqd ymm6, ymm6, ymm0
vpor ymm3, ymm3, ymm6
vpor ymm3, ymm3, ymm5
vpermq ymm2, ymm2, 78 # ymm2 = ymm2[2,3,0,1]
vpcmpeqd ymm2, ymm2, ymm0
vpermq ymm1, ymm1, 27 # ymm1 = ymm1[3,2,1,0]
vpcmpeqd ymm1, ymm1, ymm0
vpor ymm1, ymm2, ymm1
vpor ymm1, ymm3, ymm1
vpermq ymm2, ymm4, 78 # ymm2 = ymm4[2,3,0,1]
vpcmpeqd ymm0, ymm2, ymm0
vpor ymm0, ymm1, ymm0
vpslld ymm0, ymm0, 31
vpsrad ymm0, ymm0, 31
ret We still have an (expanded) sign_extend_inreg node pattern at the end as ComputeSignBits hits max depth before getting to all the vpcmpeqd nodes. |
The reassociation pass is converting: define dso_local noundef <4 x i64> @has_equal_in_u32(long long __vector(4), long long __vector(4))(<4 x i64> noundef %a, <4 x i64> noundef %b0) local_unnamed_addr {
entry:
%0 = bitcast <4 x i64> %b0 to <8 x i32>
%permil = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4>
%1 = bitcast <8 x i32> %permil to <4 x i64>
%permil1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
%2 = bitcast <8 x i32> %permil1 to <4 x i64>
%permil2 = shufflevector <8 x i32> %permil, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
%3 = bitcast <8 x i32> %permil2 to <4 x i64>
%perm = shufflevector <4 x i64> %b0, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm3 = shufflevector <4 x i64> %1, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm4 = shufflevector <4 x i64> %2, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm5 = shufflevector <4 x i64> %3, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%4 = bitcast <4 x i64> %a to <8 x i32>
%cmp.i = icmp eq <8 x i32> %4, %0
%cmp.i48 = icmp eq <8 x i32> %permil, %4
%cmp.i50 = icmp eq <8 x i32> %permil1, %4
%cmp.i52 = icmp eq <8 x i32> %permil2, %4
%5 = bitcast <4 x i64> %perm to <8 x i32>
%cmp.i54 = icmp eq <8 x i32> %4, %5
%6 = bitcast <4 x i64> %perm3 to <8 x i32>
%cmp.i56 = icmp eq <8 x i32> %4, %6
%7 = bitcast <4 x i64> %perm4 to <8 x i32>
%cmp.i58 = icmp eq <8 x i32> %4, %7
%8 = bitcast <4 x i64> %perm5 to <8 x i32>
%cmp.i60 = icmp eq <8 x i32> %4, %8
%or.i6869 = or <8 x i1> %cmp.i48, %cmp.i
%or.i627071 = or <8 x i1> %cmp.i52, %cmp.i50
%or.i637273 = or <8 x i1> %cmp.i56, %cmp.i54
%or.i647475 = or <8 x i1> %cmp.i60, %cmp.i58
%or.i657677 = or <8 x i1> %or.i627071, %or.i6869
%or.i667879 = or <8 x i1> %or.i647475, %or.i637273
%or.i678081 = or <8 x i1> %or.i667879, %or.i657677
%or.i6780 = sext <8 x i1> %or.i678081 to <8 x i32>
%or.i67 = bitcast <8 x i32> %or.i6780 to <4 x i64>
ret <4 x i64> %or.i67
} to define dso_local noundef <4 x i64> @has_equal_in_u32(long long __vector(4), long long __vector(4))(<4 x i64> noundef %a, <4 x i64> noundef %b0) local_unnamed_addr {
entry:
%0 = bitcast <4 x i64> %b0 to <8 x i32>
%permil = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4>
%1 = bitcast <8 x i32> %permil to <4 x i64>
%permil1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
%2 = bitcast <8 x i32> %permil1 to <4 x i64>
%permil2 = shufflevector <8 x i32> %permil, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
%3 = bitcast <8 x i32> %permil2 to <4 x i64>
%perm = shufflevector <4 x i64> %b0, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm3 = shufflevector <4 x i64> %1, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm4 = shufflevector <4 x i64> %2, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%perm5 = shufflevector <4 x i64> %3, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
%4 = bitcast <4 x i64> %a to <8 x i32>
%cmp.i = icmp eq <8 x i32> %4, %0
%cmp.i48 = icmp eq <8 x i32> %permil, %4
%cmp.i50 = icmp eq <8 x i32> %permil1, %4
%cmp.i52 = icmp eq <8 x i32> %permil2, %4
%5 = bitcast <4 x i64> %perm to <8 x i32>
%cmp.i54 = icmp eq <8 x i32> %4, %5
%6 = bitcast <4 x i64> %perm3 to <8 x i32>
%cmp.i56 = icmp eq <8 x i32> %4, %6
%7 = bitcast <4 x i64> %perm4 to <8 x i32>
%cmp.i58 = icmp eq <8 x i32> %4, %7
%8 = bitcast <4 x i64> %perm5 to <8 x i32>
%cmp.i60 = icmp eq <8 x i32> %4, %8
%or.i647475 = or <8 x i1> %cmp.i48, %cmp.i
%or.i637273 = or <8 x i1> %or.i647475, %cmp.i50
%or.i667879 = or <8 x i1> %or.i637273, %cmp.i54
%or.i627071 = or <8 x i1> %or.i667879, %cmp.i52
%or.i6869 = or <8 x i1> %or.i627071, %cmp.i56
%or.i657677 = or <8 x i1> %or.i6869, %cmp.i58
%or.i678081 = or <8 x i1> %or.i657677, %cmp.i60
%or.i6780 = sext <8 x i1> %or.i678081 to <8 x i32>
%or.i67 = bitcast <8 x i32> %or.i6780 to <4 x i64>
ret <4 x i64> %or.i67
} In doing so we're making the or-chain serial and increasing its depth, preventing value tracking from traversing it to the max depth. |
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 2, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 2, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 2, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 3, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 3, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 6, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
razmser
pushed a commit
to SuduIDE/llvm-project
that referenced
this issue
Oct 11, 2023
…CKSS/PACKUS to legalization/lowering Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds. Helps some regressions from D152928 and llvm#63946 Fixes llvm#63710
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi!
The code: https://godbolt.org/z/8n64TrnqK
Clang introduced a lot of
And other things. It is possoble that I don't understand why this is an optimization but looks suspicious.
Code pasted
The text was updated successfully, but these errors were encountered: