Skip to content

[X86] Questionable codegen for shuffles + combinations #63946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
DenisYaroshevskiy opened this issue Jul 19, 2023 · 4 comments · Fixed by #123329
Closed

[X86] Questionable codegen for shuffles + combinations #63946

DenisYaroshevskiy opened this issue Jul 19, 2023 · 4 comments · Fixed by #123329

Comments

@DenisYaroshevskiy
Copy link

DenisYaroshevskiy commented Jul 19, 2023

Hi!

The code: https://godbolt.org/z/8n64TrnqK

Clang introduced a lot of

        vextracti128    xmm6, ymm3, 1
        vpackssdw       xmm3, xmm3, xmm6

And other things. It is possoble that I don't understand why this is an optimization but looks suspicious.

Code pasted

#include "immintrin.h"

__m256i has_equal_in_u32(__m256i a, __m256i b0) {
  __m256i b1 = _mm256_shuffle_epi32(b0, 57);     // [1,2,3,0,5,6,7,4]
  __m256i b2 = _mm256_shuffle_epi32(b0, 78);     // [2,3,0,1,6,7,4,5]
  __m256i b3 = _mm256_shuffle_epi32(b1, 78); 
  __m256i b4 = _mm256_permute4x64_epi64(b0, 78); // [2,3,0,1]
  __m256i b5 = _mm256_permute4x64_epi64(b1, 78);
  __m256i b6 = _mm256_permute4x64_epi64(b2, 78);
  __m256i b7 = _mm256_permute4x64_epi64(b3, 78);

  b0 = _mm256_cmpeq_epi32(a, b0);
  b1 = _mm256_cmpeq_epi32(a, b1);
  b2 = _mm256_cmpeq_epi32(a, b2);
  b3 = _mm256_cmpeq_epi32(a, b3);
  b4 = _mm256_cmpeq_epi32(a, b4);
  b5 = _mm256_cmpeq_epi32(a, b5);
  b6 = _mm256_cmpeq_epi32(a, b6);
  b7 = _mm256_cmpeq_epi32(a, b7);

  b0 = _mm256_or_si256(b0, b1);
  b1 = _mm256_or_si256(b2, b3);
  b2 = _mm256_or_si256(b4, b5);
  b3 = _mm256_or_si256(b6, b7);

  b0 = _mm256_or_si256(b0, b1);
  b1 = _mm256_or_si256(b2, b3);

  return _mm256_or_si256(b0, b1);
}
@llvmbot
Copy link
Member

llvmbot commented Jul 19, 2023

@llvm/issue-subscribers-backend-x86

@RKSimon
Copy link
Collaborator

RKSimon commented Jul 19, 2023

Related to #63710

@RKSimon RKSimon self-assigned this Jul 19, 2023
@RKSimon RKSimon changed the title Questionable codegen for shuffles + combinations (x86) [X86] Questionable codegen for shuffles + combinations Jul 19, 2023
RKSimon added a commit that referenced this issue Aug 17, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and #63946

Fixes #63710
@RKSimon
Copy link
Collaborator

RKSimon commented Aug 18, 2023

Current AVX2 codegen:

has_equal_in_u32(long long __vector(4), long long __vector(4)):            # @has_equal_in_u32(long long __vector(4), long long __vector(4))
        vpshufd ymm2, ymm1, 57                  # ymm2 = ymm1[1,2,3,0,5,6,7,4]
        vpshufd ymm3, ymm1, 78                  # ymm3 = ymm1[2,3,0,1,6,7,4,5]
        vpshufd ymm4, ymm1, 147                 # ymm4 = ymm1[3,0,1,2,7,4,5,6]
        vpcmpeqd        ymm5, ymm0, ymm1
        vpcmpeqd        ymm6, ymm2, ymm0
        vpor    ymm5, ymm6, ymm5
        vpcmpeqd        ymm3, ymm3, ymm0
        vpor    ymm3, ymm5, ymm3
        vpcmpeqd        ymm5, ymm4, ymm0
        vpermq  ymm6, ymm1, 78                  # ymm6 = ymm1[2,3,0,1]
        vpcmpeqd        ymm6, ymm6, ymm0
        vpor    ymm3, ymm3, ymm6
        vpor    ymm3, ymm3, ymm5
        vpermq  ymm2, ymm2, 78                  # ymm2 = ymm2[2,3,0,1]
        vpcmpeqd        ymm2, ymm2, ymm0
        vpermq  ymm1, ymm1, 27                  # ymm1 = ymm1[3,2,1,0]
        vpcmpeqd        ymm1, ymm1, ymm0
        vpor    ymm1, ymm2, ymm1
        vpor    ymm1, ymm3, ymm1
        vpermq  ymm2, ymm4, 78                  # ymm2 = ymm4[2,3,0,1]
        vpcmpeqd        ymm0, ymm2, ymm0
        vpor    ymm0, ymm1, ymm0
        vpslld  ymm0, ymm0, 31
        vpsrad  ymm0, ymm0, 31
        ret

We still have an (expanded) sign_extend_inreg node pattern at the end as ComputeSignBits hits max depth before getting to all the vpcmpeqd nodes.

@RKSimon
Copy link
Collaborator

RKSimon commented Aug 20, 2023

The reassociation pass is converting:

define dso_local noundef <4 x i64> @has_equal_in_u32(long long __vector(4), long long __vector(4))(<4 x i64> noundef %a, <4 x i64> noundef %b0) local_unnamed_addr {
entry:
  %0 = bitcast <4 x i64> %b0 to <8 x i32>
  %permil = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4>
  %1 = bitcast <8 x i32> %permil to <4 x i64>
  %permil1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
  %2 = bitcast <8 x i32> %permil1 to <4 x i64>
  %permil2 = shufflevector <8 x i32> %permil, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
  %3 = bitcast <8 x i32> %permil2 to <4 x i64>
  %perm = shufflevector <4 x i64> %b0, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm3 = shufflevector <4 x i64> %1, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm4 = shufflevector <4 x i64> %2, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm5 = shufflevector <4 x i64> %3, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %4 = bitcast <4 x i64> %a to <8 x i32>
  %cmp.i = icmp eq <8 x i32> %4, %0
  %cmp.i48 = icmp eq <8 x i32> %permil, %4
  %cmp.i50 = icmp eq <8 x i32> %permil1, %4
  %cmp.i52 = icmp eq <8 x i32> %permil2, %4
  %5 = bitcast <4 x i64> %perm to <8 x i32>
  %cmp.i54 = icmp eq <8 x i32> %4, %5
  %6 = bitcast <4 x i64> %perm3 to <8 x i32>
  %cmp.i56 = icmp eq <8 x i32> %4, %6
  %7 = bitcast <4 x i64> %perm4 to <8 x i32>
  %cmp.i58 = icmp eq <8 x i32> %4, %7
  %8 = bitcast <4 x i64> %perm5 to <8 x i32>
  %cmp.i60 = icmp eq <8 x i32> %4, %8
  %or.i6869 = or <8 x i1> %cmp.i48, %cmp.i
  %or.i627071 = or <8 x i1> %cmp.i52, %cmp.i50
  %or.i637273 = or <8 x i1> %cmp.i56, %cmp.i54
  %or.i647475 = or <8 x i1> %cmp.i60, %cmp.i58
  %or.i657677 = or <8 x i1> %or.i627071, %or.i6869
  %or.i667879 = or <8 x i1> %or.i647475, %or.i637273
  %or.i678081 = or <8 x i1> %or.i667879, %or.i657677
  %or.i6780 = sext <8 x i1> %or.i678081 to <8 x i32>
  %or.i67 = bitcast <8 x i32> %or.i6780 to <4 x i64>
  ret <4 x i64> %or.i67
}

to

define dso_local noundef <4 x i64> @has_equal_in_u32(long long __vector(4), long long __vector(4))(<4 x i64> noundef %a, <4 x i64> noundef %b0) local_unnamed_addr {
entry:
  %0 = bitcast <4 x i64> %b0 to <8 x i32>
  %permil = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 0, i32 5, i32 6, i32 7, i32 4>
  %1 = bitcast <8 x i32> %permil to <4 x i64>
  %permil1 = shufflevector <8 x i32> %0, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
  %2 = bitcast <8 x i32> %permil1 to <4 x i64>
  %permil2 = shufflevector <8 x i32> %permil, <8 x i32> poison, <8 x i32> <i32 2, i32 3, i32 0, i32 1, i32 6, i32 7, i32 4, i32 5>
  %3 = bitcast <8 x i32> %permil2 to <4 x i64>
  %perm = shufflevector <4 x i64> %b0, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm3 = shufflevector <4 x i64> %1, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm4 = shufflevector <4 x i64> %2, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %perm5 = shufflevector <4 x i64> %3, <4 x i64> poison, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %4 = bitcast <4 x i64> %a to <8 x i32>
  %cmp.i = icmp eq <8 x i32> %4, %0
  %cmp.i48 = icmp eq <8 x i32> %permil, %4
  %cmp.i50 = icmp eq <8 x i32> %permil1, %4
  %cmp.i52 = icmp eq <8 x i32> %permil2, %4
  %5 = bitcast <4 x i64> %perm to <8 x i32>
  %cmp.i54 = icmp eq <8 x i32> %4, %5
  %6 = bitcast <4 x i64> %perm3 to <8 x i32>
  %cmp.i56 = icmp eq <8 x i32> %4, %6
  %7 = bitcast <4 x i64> %perm4 to <8 x i32>
  %cmp.i58 = icmp eq <8 x i32> %4, %7
  %8 = bitcast <4 x i64> %perm5 to <8 x i32>
  %cmp.i60 = icmp eq <8 x i32> %4, %8
  %or.i647475 = or <8 x i1> %cmp.i48, %cmp.i
  %or.i637273 = or <8 x i1> %or.i647475, %cmp.i50
  %or.i667879 = or <8 x i1> %or.i637273, %cmp.i54
  %or.i627071 = or <8 x i1> %or.i667879, %cmp.i52
  %or.i6869 = or <8 x i1> %or.i627071, %cmp.i56
  %or.i657677 = or <8 x i1> %or.i6869, %cmp.i58
  %or.i678081 = or <8 x i1> %or.i657677, %cmp.i60
  %or.i6780 = sext <8 x i1> %or.i678081 to <8 x i32>
  %or.i67 = bitcast <8 x i32> %or.i6780 to <4 x i64>
  ret <4 x i64> %or.i67
}

In doing so we're making the or-chain serial and increasing its depth, preventing value tracking from traversing it to the max depth.

razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 2, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 2, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 2, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 3, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 3, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 6, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
razmser pushed a commit to SuduIDE/llvm-project that referenced this issue Oct 11, 2023
…CKSS/PACKUS to legalization/lowering

Don't prematurely fold TRUNCATE nodes to PACKSS/PACKUS target nodes - we miss out on generic TRUNCATE folds.

Helps some regressions from D152928 and llvm#63946

Fixes llvm#63710
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants