
[X86][SSE2] Failure to vectorize int16_t[8] to pminsw pattern #48223


Open
RKSimon opened this issue Jan 26, 2021 · 10 comments


@RKSimon (Collaborator) commented Jan 26, 2021

Bugzilla Link 48879
Version trunk
OS Windows NT
CC @alexey-bataev,@anton-afanasyev,@weiguozhi,@topperc,@davidbolvansky,@dtemirbulatov,@LebedevRI,@RKSimon,@phoebewang,@rotateright

Extended Description

https://gcc.godbolt.org/z/57nGK3

#include <algorithm>
#include <array>
#include <cstdint>

typedef int16_t T;
constexpr int N = 8;

std::array<T,N> compute_min(std::array<T,N>& x, std::array<T,N>& y) {
    std::array<T,N> result;
    for (int i = 0; i != N; ++i) {
      result[i] = std::min(x[i], y[i]);
    }
    return result;
}

This ends up as a horrid mix of scalar and <2 x i16> smin patterns.

Much of the problem seems to be that we end up trying to store the final array as an { i64, i64 } aggregate, resulting in a lot of zext+shift+or ops to pack the i16s.

Even more impressively, with -march=atom we end up with masked gather calls....
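
For reference, the shape we would hope the middle-end to reach is a single <8 x i16> smin, which SSE2 selects directly to pminsw. A hand-written sketch of that IR (not compiler output; the function and value names are made up for illustration):

define { i64, i64 } @compute_min_ideal(ptr %x, ptr %y) {
  ; Load both arrays as whole <8 x i16> vectors.
  %a = load <8 x i16>, ptr %x, align 2
  %b = load <8 x i16>, ptr %y, align 2
  ; A single vector smin - on SSE2 this becomes pminsw.
  %m = call <8 x i16> @llvm.smin.v8i16(<8 x i16> %a, <8 x i16> %b)
  ; Repack into the { i64, i64 } ABI return without any scalar zext+shift+or chains.
  %v = bitcast <8 x i16> %m to <2 x i64>
  %lo = extractelement <2 x i64> %v, i32 0
  %hi = extractelement <2 x i64> %v, i32 1
  %r0 = insertvalue { i64, i64 } poison, i64 %lo, 0
  %r1 = insertvalue { i64, i64 } %r0, i64 %hi, 1
  ret { i64, i64 } %r1
}

declare <8 x i16> @llvm.smin.v8i16(<8 x i16>, <8 x i16>)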

@davidbolvansky (Collaborator)

This is a recent (SLP?) regression introduced in LLVM 12; LLVM 11 produces the expected codegen.

@RKSimon (Collaborator, Author) commented Apr 4, 2021

@spatel Could this be related to the limitations you added to SLP for shifts by multiples of 8, to prevent it from destroying bswap patterns?

@rotateright (Contributor)

This is an unintended consequence of an extra SROA pass between loop unrolling and SLP that we got from the switch to the new pass manager.

We also made the old pass manager match this behavior with:
https://reviews.llvm.org/D87972

So we can repro the behavior in LLVM 11 by toggling the pass manager (because the extra SROA wasn't in the legacy PM at that time, but it was in the new PM):
https://gcc.godbolt.org/z/rv3KT1Ex7

I'm not familiar with SROA, but it is converting the trailing insertvalue into shift+mask ops like:
%.sroa.4.0.insert.ext = zext i16 %27 to i64
%.sroa.4.0.insert.shift = shl i64 %.sroa.4.0.insert.ext, 48
%.sroa.4.0.insert.mask = and i64 undef, 281474976710655
%.sroa.4.0.insert.insert = or i64 %.sroa.4.0.insert.mask, %.sroa.4.0.insert.shift

...which seems ok in general, but it's breaking up the expected min/max reduction pattern in this example, and I'm not sure if we can recover.

Canonicalizing to the new min/max intrinsics doesn't appear to change anything.

I'm not sure how to solve this: (1) change the pass ordering? (2) add a bailout to SROA for min/max reductions?

@LebedevRI (Member)

We have that ugly pattern because the function must return { i64, i64 },
and we must somehow produce it.

But what if we could produce { i16, i16, i16, i16, i16, i16, i16, i16 }
and then just bitcast it to { i64, i64 }?

I.e., is there any particular reason why we forbid bitcasts between structs?

@topperc (Collaborator) commented Apr 6, 2021

> We have that ugly pattern because the function must return { i64, i64 },
> and we must somehow produce it.
>
> But what if we could produce { i16, i16, i16, i16, i16, i16, i16, i16 }
> and then just bitcast it to { i64, i64 }?
>
> I.e., is there any particular reason why we forbid bitcasts between structs?

I think we would need to define what it means if there are padding bytes in the structs being bitcast.

Structs don’t exist in SelectionDAG so we’d need to teach SelectionDAGBuilder how to break down such a bitcast.

@topperc (Collaborator) commented Apr 6, 2021

Doesn't SLP need to see store instructions as seeds? So is it really the ugly pattern that's breaking it, or the lack of stores?

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 11, 2021
@RKSimon (Collaborator, Author) commented May 1, 2022

Something strange that SLP is doing: although the 2 insertvalue seeds have exactly the same shift pattern, we end up vectorizing them with non-uniform shifts.

e.g.
%11 = shl nuw <2 x i64> %10, <i64 32, i64 48>

It seems to be because we've recognised that we can load combine the first 2 elements to <2 x i16>, but we completely fail to do that for the remaining pairs.....

@alexey-bataev Any suggestions?

define dso_local { i64, i64 } @_Z11compute_minRSt5arrayIsLm8EES1_(ptr nocapture noundef nonnull readonly align 2 dereferenceable(16) %x, ptr nocapture noundef nonnull readonly align 2 dereferenceable(16) %y) local_unnamed_addr #0 {
entry:
  %arrayidx.i.i.1 = getelementptr inbounds [8 x i16], ptr %x, i64 0, i64 1
  %arrayidx.i.i10.1 = getelementptr inbounds [8 x i16], ptr %y, i64 0, i64 1
  %arrayidx.i.i.2 = getelementptr inbounds [8 x i16], ptr %x, i64 0, i64 2
  %arrayidx.i.i10.2 = getelementptr inbounds [8 x i16], ptr %y, i64 0, i64 2
  %arrayidx.i.i.4 = getelementptr inbounds [8 x i16], ptr %x, i64 0, i64 4
  %arrayidx.i.i10.4 = getelementptr inbounds [8 x i16], ptr %y, i64 0, i64 4
  %arrayidx.i.i.5 = getelementptr inbounds [8 x i16], ptr %x, i64 0, i64 5
  %arrayidx.i.i10.5 = getelementptr inbounds [8 x i16], ptr %y, i64 0, i64 5
  %arrayidx.i.i.6 = getelementptr inbounds [8 x i16], ptr %x, i64 0, i64 6
  %arrayidx.i.i10.6 = getelementptr inbounds [8 x i16], ptr %y, i64 0, i64 6
  %0 = load <2 x i16>, ptr %arrayidx.i.i10.2, align 2
  %1 = load <2 x i16>, ptr %arrayidx.i.i.2, align 2
  %2 = call <2 x i16> @llvm.smin.v2i16(<2 x i16> %0, <2 x i16> %1)
  %3 = zext <2 x i16> %2 to <2 x i64>
  %4 = shl nuw <2 x i64> %3, <i64 32, i64 48>
  %5 = extractelement <2 x i64> %4, i32 0
  %6 = extractelement <2 x i64> %4, i32 1
  %7 = load <2 x i16>, ptr %arrayidx.i.i10.6, align 2
  %8 = load <2 x i16>, ptr %arrayidx.i.i.6, align 2
  %9 = call <2 x i16> @llvm.smin.v2i16(<2 x i16> %7, <2 x i16> %8)
  %10 = zext <2 x i16> %9 to <2 x i64>
  %11 = shl nuw <2 x i64> %10, <i64 32, i64 48>
  %12 = extractelement <2 x i64> %11, i32 0
  %13 = extractelement <2 x i64> %11, i32 1
  %14 = load i16, ptr %y, align 2
  %15 = load i16, ptr %x, align 2
  %16 = load i16, ptr %arrayidx.i.i10.1, align 2
  %17 = load i16, ptr %arrayidx.i.i.1, align 2
  %18 = load i16, ptr %arrayidx.i.i10.4, align 2
  %19 = load i16, ptr %arrayidx.i.i.4, align 2
  %20 = insertelement <2 x i16> poison, i16 %14, i32 0
  %21 = insertelement <2 x i16> %20, i16 %18, i32 1
  %22 = insertelement <2 x i16> poison, i16 %15, i32 0
  %23 = insertelement <2 x i16> %22, i16 %19, i32 1
  %24 = call <2 x i16> @llvm.smin.v2i16(<2 x i16> %21, <2 x i16> %23)
  %25 = load i16, ptr %arrayidx.i.i10.5, align 2
  %26 = load i16, ptr %arrayidx.i.i.5, align 2
  %27 = insertelement <2 x i16> poison, i16 %16, i32 0
  %28 = insertelement <2 x i16> %27, i16 %25, i32 1
  %29 = insertelement <2 x i16> poison, i16 %17, i32 0
  %30 = insertelement <2 x i16> %29, i16 %26, i32 1
  %31 = call <2 x i16> @llvm.smin.v2i16(<2 x i16> %28, <2 x i16> %30)
  %32 = insertelement <2 x i64> poison, i64 %6, i32 0
  %33 = insertelement <2 x i64> %32, i64 %13, i32 1
  %34 = insertelement <2 x i64> poison, i64 %5, i32 0
  %35 = insertelement <2 x i64> %34, i64 %12, i32 1
  %36 = or <2 x i64> %33, %35
  %37 = zext <2 x i16> %31 to <2 x i64>
  %38 = shl nuw nsw <2 x i64> %37, <i64 16, i64 16>
  %39 = or <2 x i64> %36, %38
  %40 = zext <2 x i16> %24 to <2 x i64>
  %41 = or <2 x i64> %39, %40
  %42 = extractelement <2 x i64> %41, i32 0
  %.fca.0.insert = insertvalue { i64, i64 } poison, i64 %42, 0
  %43 = extractelement <2 x i64> %41, i32 1
  %.fca.1.insert = insertvalue { i64, i64 } %.fca.0.insert, i64 %43, 1
  ret { i64, i64 } %.fca.1.insert
}

@alexey-bataev (Member)

Known problem - vectorization of the copyable elements. We have 2 or-reductions of 3 shls and 1 zext. This needs non-power-of-2 vectorization + some extra work for reductions + vectorization of copyable elements.
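
To spell out the shape SLP would have to handle: after SROA, each i64 lane of the return value is an or-reduction over one plain zext and three shifted zexts of the four smin results (compare the %32-%41 chain in the dump above, which builds both lanes at once with <2 x i64> ops). A hand-written distillation of a single lane, with illustrative value names:

define i64 @pack_lane(i16 %m0, i16 %m1, i16 %m2, i16 %m3) {
  ; Widen the four i16 min results to i64.
  %e0 = zext i16 %m0 to i64
  %e1 = zext i16 %m1 to i64
  %e2 = zext i16 %m2 to i64
  %e3 = zext i16 %m3 to i64
  ; Three of the four operands are shifted into their lane positions...
  %s1 = shl nuw nsw i64 %e1, 16
  %s2 = shl nuw i64 %e2, 32
  %s3 = shl nuw i64 %e3, 48
  ; ...and the lane is a 4-wide or-reduction: 3 shls + 1 zext.
  %o1 = or i64 %e0, %s1
  %o2 = or i64 %o1, %s2
  %lane = or i64 %o2, %s3
  ret i64 %lane
}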

@alexey-bataev (Member)

The godbolt link shows that the codegen is pretty good now, so I assume it is fixed.

RKSimon added a commit that referenced this issue May 12, 2025
The X86 backend shuffle combining is saving us from some poor vectorised IR
@RKSimon (Collaborator, Author) commented May 12, 2025

I'm going to reopen this - as shown in 63f3a5b, we're relying on the backend (x86 shuffle combining) to do a lot of the work for this, and the middle-end still fails on SSE2 targets despite having low costs for a legal v8i16 SMIN instruction.

@RKSimon reopened this on May 12, 2025