
[SLPVectorizer] Performance degradation with xor + fshl on X86 #63980


Closed
annamthomas opened this issue Jul 20, 2023 · 30 comments · Fixed by #124254

@annamthomas
Contributor

With the SLP vectorizer, a hot loop with 6 xors + 2 fshls gets reduced to 3 xors + 1 fshl; we vectorize with a VF of 2.
The SLP cost model gives it a cost of -8.

This is the loop in question:

  %iv = phi i64 [ %add323, %vectorized_slp_bb ], [ 16, %bb2 ]
  %add288 = add nsw i64 %iv, -3
  %getelementptr289 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add288
  %load290 = load i32, ptr addrspace(1) %getelementptr289, align 4, !tbaa !28, !noundef !3
  %add291 = add nsw i64 %iv, -8
  %getelementptr292 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add291
  %load293 = load i32, ptr addrspace(1) %getelementptr292, align 4, !tbaa !28, !noundef !3
  %xor294 = xor i32 %load293, %load290
  %add295 = add nsw i64 %iv, -14 
  %getelementptr296 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add295
  %load297 = load i32, ptr addrspace(1) %getelementptr296, align 4, !tbaa !28, !noundef !3
  %xor298 = xor i32 %xor294, %load297
  %add299 = add nsw i64 %iv, -16 
  %getelementptr300 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add299
  %load301 = load i32, ptr addrspace(1) %getelementptr300, align 4, !tbaa !28, !noundef !3
  %xor302 = xor i32 %xor298, %load301
  %call303 = call i32 @llvm.fshl.i32(i32 %xor302, i32 %xor302, i32 1) #5
  %getelementptr304 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %iv 
  store i32 %call303, ptr addrspace(1) %getelementptr304, align 4, !tbaa !28 
  %add305 = add nuw nsw i64 %iv, 1
  %add306 = add nsw i64 %iv, -2
  %getelementptr307 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add306
  %load308 = load i32, ptr addrspace(1) %getelementptr307, align 4, !tbaa !28, !noundef !3
  %add309 = add nsw i64 %iv, -7
  %getelementptr310 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add309
  %load311 = load i32, ptr addrspace(1) %getelementptr310, align 4, !tbaa !28, !noundef !3
  %xor312 = xor i32 %load311, %load308
  %add313 = add nsw i64 %iv, -13 
  %getelementptr314 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add313
  %load315 = load i32, ptr addrspace(1) %getelementptr314, align 4, !tbaa !28, !noundef !3
  %xor316 = xor i32 %xor312, %load315
  %add317 = add nsw i64 %iv, -15
  %getelementptr318 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add317
  %load319 = load i32, ptr addrspace(1) %getelementptr318, align 4, !tbaa !28, !noundef !3
  %xor320 = xor i32 %xor316, %load319
  %call321 = call i32 @llvm.fshl.i32(i32 %xor320, i32 %xor320, i32 1) #5
  %getelementptr322 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add305
  store i32 %call321, ptr addrspace(1) %getelementptr322, align 4, !tbaa !28
  %add323 = add nuw nsw i64 %iv, 2
  %icmp324 = icmp ugt i64 %add305, 78
  br i1 %icmp324, label %6, label %vectorized_slp_bb

When we vectorize it, we get:

vectorized_slp_bb:                                ; preds = %vectorized_slp_bb, %bb2
  %iv = phi i64 [ %add323, %vectorized_slp_bb ], [ 16, %bb2 ]
  %add288 = add nsw i64 %iv, -3
  %getelementptr289 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add288
  %add291 = add nsw i64 %iv, -8
  %getelementptr292 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add291
  %add295 = add nsw i64 %iv, -14 
  %getelementptr296 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add295
  %add299 = add nsw i64 %iv, -16 
  %getelementptr300 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %add299
  %getelementptr304 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr4, i64 %iv 
  %add305 = add nuw nsw i64 %iv, 1
  %25 = load <2 x i32>, ptr addrspace(1) %getelementptr289, align 4, !tbaa !28 
  %26 = load <2 x i32>, ptr addrspace(1) %getelementptr292, align 4, !tbaa !28 
  %27 = xor <2 x i32> %26, %25 
  %28 = load <2 x i32>, ptr addrspace(1) %getelementptr296, align 4, !tbaa !28 
  %29 = xor <2 x i32> %27, %28 
  %30 = load <2 x i32>, ptr addrspace(1) %getelementptr300, align 4, !tbaa !28 
  %31 = xor <2 x i32> %29, %30 
  %32 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %31, <2 x i32> %31, <2 x i32> <i32 1, i32 1>) 
  store <2 x i32> %32, ptr addrspace(1) %getelementptr304, align 4, !tbaa !28 
  %add323 = add nuw nsw i64 %iv, 2
  %icmp324 = icmp ugt i64 %add305, 78
  br i1 %icmp324, label %6, label %vectorized_slp_bb

We see about a 40% degradation on a benchmark that exercises this hot loop.
The assembly for this loop shows that we use 3 scalar xor instructions instead of vpxor, and the fshl is lowered using xmm registers:

  movq  -12(%rax,%rcx,4), %rdx 
  xorq  8(%rax,%rcx,4), %rdx 
  xorq  -36(%rax,%rcx,4), %rdx 
  xorq  -44(%rax,%rcx,4), %rdx 
  vmovq %rdx, %xmm0
  vpsrld  $31, %xmm0, %xmm1
  vpaddd  %xmm0, %xmm0, %xmm0
  vpor  %xmm1, %xmm0, %xmm0
  vmovq %xmm0, 20(%rax,%rcx,4)
  addq  $2, %rcx 
  cmpq  $78, %rcx 

While looking at the cost model for X86 arithmetic instructions, I do not see an entry for v2i32 XOR. Should we actually be vectorizing this loop?

I will attach the IR reproducer; -slp-threshold=2 shows we only vectorize this tree, and we still see the 40% degradation.

@annamthomas
Contributor Author

annamthomas commented Jul 20, 2023

Here is the reproducer: https://godbolt.org/z/s1PjYMs5a

@annamthomas
Contributor Author

annamthomas commented Jul 20, 2023

@alexey-bataev @RKSimon any pointers here on what the issue might be?

@annamthomas
Contributor Author

annamthomas commented Jul 20, 2023

So, I tried printing the cost-model for xor and fshl under different mattr:

opt chk.ll -passes="print<cost-model>" 2>&1 -disable-output -mtriple=x86_64-- -mattr=+avx512f,+avx512bw

Cost Model: Found an estimated cost of 1 for instruction:   %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction:   %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction:   %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction:   %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)

opt chk.ll -passes="print" 2>&1 -disable-output -mtriple=x86_64-- -mattr=+avx2

Cost Model: Found an estimated cost of 1 for instruction:   %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction:   %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction:   %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction:   %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)

Changing to -mattr=+avx2 on the above testcase should increase the cost of the vector fshl from 1 to 4. However, the debug output shows an SLP cost of -11! Shouldn't the cost go from -8 to -5 instead?

SLP: Found cost = -11 for VF=2
SLP: Decided to vectorize cost = -11

@llvmbot
Member

llvmbot commented Jul 20, 2023

@llvm/issue-subscribers-backend-x86

@RKSimon
Collaborator

RKSimon commented Jul 20, 2023

The v2i32 types should be legalized to v4i32 inside the TTI callbacks. AVX512 does have a cheap v4i32 rotate instruction, which is what those fshl calls should lower to, but on AVX2 they will be expanded to shifts/or, hence the higher cost.

What I think is going on, though, is the high cost of the scalar rotate instructions: we don't cost constant rotate amounts as cheaper than variable rotate amounts, despite them being notably faster on almost all hardware:

{ ISD::ROTL,       MVT::i32,     {  2,  3,  1,  3 } },

So we'd need to add lower costs for constant rotate amounts.
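As a rough illustration of that direction (hypothetical table and helper names with assumed cost values, not the actual LLVM data structures or the eventual patch), the idea is to keep the existing per-cost-kind entry for variable-amount rotates and consult a cheaper table first when the amount is a compile-time constant:

#include <cstddef>
#include <iterator>
#include <optional>

// Hypothetical mirror of the { ISD, MVT, {per-cost-kind costs} } entry shape
// quoted above; the real tables live in X86TargetTransformInfo.cpp.
struct CostEntry {
  int Op;       // e.g. ROTL
  int Ty;       // e.g. i32
  int Cost[4];  // one value per cost kind (rthroughput, latency, ...)
};

constexpr int ROTL = 0, I32 = 0;

constexpr CostEntry VariableRotateTbl[] = {
    {ROTL, I32, {2, 3, 1, 3}},  // variable-amount rotate, as in the entry quoted above
};
constexpr CostEntry ConstRotateTbl[] = {
    {ROTL, I32, {1, 1, 1, 1}},  // rol $imm / rorx $imm, assumed ~1 across cost kinds
};

static std::optional<int> lookup(const CostEntry *Tbl, std::size_t N, int Op,
                                 int Ty, int Kind) {
  for (std::size_t I = 0; I < N; ++I)
    if (Tbl[I].Op == Op && Tbl[I].Ty == Ty)
      return Tbl[I].Cost[Kind];
  return std::nullopt;
}

// Prefer the constant-amount table when the rotate amount is a constant.
int rotateCost(int Op, int Ty, bool AmountIsConstant, int Kind) {
  if (AmountIsConstant)
    if (auto C = lookup(ConstRotateTbl, std::size(ConstRotateTbl), Op, Ty, Kind))
      return *C;
  return lookup(VariableRotateTbl, std::size(VariableRotateTbl), Op, Ty, Kind)
      .value_or(4);  // generic fallback guess
}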

@alexey-bataev
Member

I don't think it is directly related to the SLP vectorizer; it is more likely an issue in the X86 cost model. And Simon has probably identified the actual issue.

@annamthomas
Contributor Author

annamthomas commented Jul 20, 2023

Yep, thank you. I was looking at the cost of the vector ones and trying to increase those.
@RKSimon I tried reducing the costs of ROTL and ROTR for scalar i32 to all ones, just as a quick check to see what happens.
That does reduce the benefit of SLP vectorization, so it is a step in the right direction (but we still vectorize):

SLP: Found cost = -6 for VF=2
SLP: Decided to vectorize cost = -6  

Earlier it was -8.

For reference, the scalar assembly is:

movl    8(%rcx,%rax,4), %esi
xorl    -12(%rcx,%rax,4), %esi
movl    -44(%rcx,%rax,4), %edi
xorl    -36(%rcx,%rax,4), %edi
xorl    %esi, %edi
rorxl   $31, %edi, %esi
movl    %esi, 20(%rcx,%rax,4)
incq    %rax
cmpq    $78, %rax
jbe     -38 

AVX512 does have a cheap v4i32 rotate instruction which is what those fshl should lower to

I tested this with a couple of different mattr settings and I think the cost may not be reflecting the lowering.

cat chk.ll
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)

define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
  %res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
  ret <2 x i32> %res
}
llc -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vbmi,+avx512vbmi2 -mcpu=cascadelake chk.ll 

splatconstant_funnnel_v2i32:            # @splatconstant_funnnel_v2i32
# %bb.0:
	vprold	$1, %xmm0, %xmm0
	retq

This is true for AVX512, and also for AVX2 with +xop (it lowers to a single instruction: vprotd $1, %xmm0, %xmm0).

However, with the mattr from this repro.ll, we lower the fshl into three vector instructions:

llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll 
splatconstant_funnnel_v2i32:            # @splatconstant_funnnel_v2i32
# %bb.0:
	vpsrlvd	.LCPI0_0(%rip), %xmm0, %xmm1
	vpsllvd	.LCPI0_1(%rip), %xmm0, %xmm0
	vpor	%xmm1, %xmm0, %xmm0
	retq

RKSimon self-assigned this Jul 21, 2023
@RKSimon
Collaborator

RKSimon commented Jul 21, 2023

So there are a couple of problems to address:

  • missing rotate/funnel-shift by constant scalar/vector costs
  • pre-AVX512 the DAG fails to expand rotate-shift-by-imm to use vector-shift-by-imm

@annamthomas
Contributor Author

Thanks @RKSimon for taking this up! I instrumented the SLPVectorizer cost model and confirmed that the fshl costs correctly reflect what we get in the lowered code.

RKSimon added a commit that referenced this issue Jul 22, 2023
Fixes issue identified on #63980 where the undef rotate amounts (during widening from v2i32 -> v4i32) were being constant folded to 0 when the shift amounts are created during expansion, losing the splat'd shift amounts.
@annamthomas
Contributor Author

Thank you for landing the above change, @RKSimon. However, we don't see any difference in the performance of the internal benchmark. While we do generate better vector code (in some cases, as seen in the test changes for the patch), it is still not as good as the scalar code. It looks like what needs correcting is the X86 cost model; however, even with the lower costs for constant rotate amounts, we still SLP vectorize it.

Also, for the mattr where we see the performance degradation (we see it across Skylake and Cascade Lake), we haven't reduced the number of vector instructions (I haven't checked the throughput of the changed instructions).

Original code generated was:

llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll 
splatconstant_funnnel_v2i32:            # @splatconstant_funnnel_v2i32
# %bb.0:
	vpsrlvd	.LCPI0_0(%rip), %xmm0, %xmm1
	vpsllvd	.LCPI0_1(%rip), %xmm0, %xmm0
	vpor	%xmm1, %xmm0, %xmm0
	retq

With the above improvement, we have:

	vpsrld	$31, %xmm0, %xmm1
	vpaddd	%xmm0, %xmm0, %xmm0
	vpor	%xmm1, %xmm0, %xmm0
	retq

For:

chk.ll :
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)

define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
  %res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
  ret <2 x i32> %res
}

@RKSimon
Collaborator

RKSimon commented Jul 27, 2023

Sorry, still working on this. As you said, the costs still need attention (that needs some refactoring, while the backend fixes were pretty trivial).

@annamthomas
Contributor Author

annamthomas commented Jul 27, 2023

No worries, and thank you! I just wanted to confirm my understanding that to completely avoid the perf degradation here, we should not SLP vectorize this code at all (by fixing the X86 cost model).

RKSimon added a commit that referenced this issue Jul 27, 2023
RKSimon added a commit that referenced this issue Jul 27, 2023
As noted on #63980 rotate by immediate amounts is much cheaper than variable amounts.

This still needs to be expanded to vector rotate cases, and we need to add reasonable funnel-shift costs as well (very tricky as there's a huge range in CPU behaviour for these).
@annamthomas
Contributor Author

I reduced the testcase with bugpoint to show the minimal code needed to SLP-vectorize that same loop:

cat input.ll:

; ModuleID = 'reduced.ll'
source_filename = "reduced.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"

define void @blam(ptr addrspace(1) %arg, ptr addrspace(1) %arg1, i32 %arg2, ptr addrspace(1) %arg3) gc "statepoint-example" personality ptr @wibble !prof !2 {
bb:
  %load = load i64, ptr addrspace(256) inttoptr (i64 64 to ptr addrspace(256)), align 64, !invariant.load !3
  %getelementptr = getelementptr inbounds i8, ptr addrspace(1) %arg, i64 64
  %load4 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !tbaa !4, !dereferenceable_or_null !11, !align !12, !noundef !3
  %icmp = icmp eq ptr addrspace(1) %load4, null
  %load5 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !dereferenceable_or_null !11, !align !12, !noundef !3
  %icmp6 = icmp eq ptr addrspace(1) %load5, null
  %getelementptr7 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 16
  %getelementptr8 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 8
  %load9 = load i32, ptr addrspace(1) %getelementptr8, align 8, !range !13, !invariant.load !3, !noundef !3
  %icmp10 = icmp ugt i32 %load9, 79
  br label %bb11

bb11:                                             ; preds = %bb11, %bb
  %phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
  %add = add nsw i64 %phi, -3
  %getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
  %load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
  %add14 = add nsw i64 %phi, -8
  %getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
  %load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
  %xor = xor i32 %load16, %load13
  %add17 = add nsw i64 %phi, -14
  %getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
  %load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
  %xor20 = xor i32 %xor, %load19
  %add21 = add nsw i64 %phi, -16
  %getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
  %load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
  %xor24 = xor i32 %xor20, %load23
  %call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
  %getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
  store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
  %add26 = add nuw nsw i64 %phi, 1
  %add27 = add nsw i64 %phi, -2
  %getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
  %load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
  %add30 = add nsw i64 %phi, -7
  %getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
  %load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
  %xor33 = xor i32 %load32, %load29
  %add34 = add nsw i64 %phi, -13
  %getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
  %load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
  %xor37 = xor i32 %xor33, %load36
  %add38 = add nsw i64 %phi, -15
  %getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
  %load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
  %xor41 = xor i32 %xor37, %load40
  %call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
  %getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
  store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
  %add44 = add nuw nsw i64 %phi, 2
  %icmp45 = icmp ugt i64 %add26, 78
  br label %bb11
}

declare ptr @wibble()

declare void @wobble()

; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.fshl.i32(i32, i32, i32) #0

; Function Attrs: nocallback nofree nosync nounwind willreturn memory(read)
declare i64 @llvm.read_register.i64(metadata) #1

attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(read) }
attributes #2 = { "inline-remark"="cost=never, reason=unavailable definition" }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{!"function_entry_count", i64 32768}
!3 = !{}
!4 = !{!5, !9, i64 64}
!5 = !{!"sun/security/provider/SHA.instance", !6, i64 0, !9, i64 64, !9, i64 72}
!6 = !{!"sun/security/provider/DigestBase.instance", !7, i64 0, !9, i64 16, !9, i64 24, !9, i64 28, !9, i64 32, !9, i64 40, !9, i64 48, !9, i64 56}
!7 = !{!"java/security/MessageDigestSpi.instance", !8, i64 0, !9, i64 8}
!8 = !{!"java/lang/Object.instance", !9, i64 0}
!9 = !{!"tbaa_local_fields", !10, i64 0}
!10 = !{!"dolphin-tbaa-access-type"}
!11 = !{i64 16}
!12 = !{i64 8}
!13 = !{i32 0, i32 2147483646}
!14 = !{!15, !15, i64 0}
!15 = !{!"int.array", !10}

Run this through opt -passes=slp-vectorizer input.ll.
The cost is -8.
With the exact target-features for this function from the machine where we saw the degradation, the cost is -8 as well.
This may not mean much, though, because when I tried reducing the target-features blindly (to simulate what bugpoint would do), the SLP benefit increased. So something in the target-features matters, and it is very hard to find out what exactly (so that we can update the X86 costs for those target-features).

The target-features attribute for the function is:

attributes #0 = { "target-features"="+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx
512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f" }

@annamthomas
Contributor Author

@RKSimon @alexey-bataev I went through the SLPVectorizer debug output and I think I found the reason why we vectorize. It looks like we also over-estimate the cost of the scalar loads, assuming we generate 8 loads whereas we in fact only generate 2.

Consider the same reduced example as above; here is the main snippet:

  %phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
  %add = add nsw i64 %phi, -3
  %getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
  %load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
  %add14 = add nsw i64 %phi, -8
  %getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
  %load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
  %xor = xor i32 %load16, %load13
  %add17 = add nsw i64 %phi, -14
  %getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
  %load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
  %xor20 = xor i32 %xor, %load19
  %add21 = add nsw i64 %phi, -16
  %getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
  %load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
  %xor24 = xor i32 %xor20, %load23
  %call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
  %getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
  store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
  %add26 = add nuw nsw i64 %phi, 1
  %add27 = add nsw i64 %phi, -2
  %getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
  %load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
  %add30 = add nsw i64 %phi, -7
  %getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
  %load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
  %xor33 = xor i32 %load32, %load29
  %add34 = add nsw i64 %phi, -13
  %getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
  %load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
  %xor37 = xor i32 %xor33, %load36
  %add38 = add nsw i64 %phi, -15
  %getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
  %load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
  %xor41 = xor i32 %xor37, %load40
  %call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
  %getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
  store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
  %add44 = add nuw nsw i64 %phi, 2
  %icmp45 = icmp ugt i64 %add26, 78
  br label %bb11

This is the assembly for the above scalar code with the accurate mattr:

# %bb.0:                                # %bb
  movq  64(%rdi), %rax 
  movl  $17, %ecx
  .p2align  4, 0x90
.LBB0_1:                                # %bb11
                                        # =>This Inner Loop Header: Depth=1
  movl  -20(%rax,%rcx,4), %edx
  movl  -16(%rax,%rcx,4), %esi
  xorl  (%rax,%rcx,4), %edx
  xorl  -44(%rax,%rcx,4), %edx
  xorl  -52(%rax,%rcx,4), %edx
  rorxl $31, %edx, %edx
  movl  %edx, 12(%rax,%rcx,4)
  xorl  4(%rax,%rcx,4), %esi
  xorl  -40(%rax,%rcx,4), %esi
  xorl  -48(%rax,%rcx,4), %esi
  rorxl $31, %esi, %edx                 
  movl  %edx, 16(%rax,%rcx,4)           
  addq  $2, %rcx
  jmp .LBB0_1

This is the vectorized assembly:

# %bb.0:                                # %bb
  movq  64(%rdi), %rax
  movl  $17, %ecx
  .p2align  4, 0x90
.LBB0_1:                                # %bb11
                                        # =>This Inner Loop Header: Depth=1
  movq  -20(%rax,%rcx,4), %rdx
  xorq  (%rax,%rcx,4), %rdx             
  xorq  -44(%rax,%rcx,4), %rdx
  xorq  -52(%rax,%rcx,4), %rdx
  vmovq %rdx, %xmm0
  vpsrld  $31, %xmm0, %xmm1
  vpaddd  %xmm0, %xmm0, %xmm0
  vpor  %xmm1, %xmm0, %xmm0
  vmovq %xmm0, 12(%rax,%rcx,4)
  addq  $2, %rcx
  jmp .LBB0_1 

According to the X86 cost model/SLPVectorizer cost model, we calculate the SLP benefit as:

Scalar: 8 (loads) + 6 (xor) + 2 (stores) + 2 (two fshl at cost 1 each) = 18 // The scalar fshl cost was corrected by Simon's patch to the lower cost of 1 per fshl.
Vector: 4 (four VF=2 loads) + 3 (three VF=2 xor) + 1 (one VF=2 store) + 4 (one vectorized fshl) = 12 // We will need to correct the vector fshl cost as well (it would be lower for a constant rotate amount).

SLP Benefit = Vector - Scalar = 12 - 18 = -6

However, in the scalar assembly, we have only 2 loads (not 8). So the correct cost for scalar should be 12. Is this a correct understanding?
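(For reference, redoing the arithmetic above with the six folded loads treated as free: Scalar = 2 (loads) + 6 (xor) + 2 (store) + 2 (fshl) = 12, which matches the vector cost of 12, so the computed benefit would be 0 and the tree would no longer look profitable. This is just the implication of the numbers above, not current cost-model behaviour.)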

@alexey-bataev
Member

Possibly need to estimate that some of the loads are "free" in scalar code?

@RKSimon
Collaborator

RKSimon commented Jul 28, 2023

It depends on the cost kind: folded loads aren't "free" for rthroughput / latency / codesize (uop count on x86), only for instruction count. SLP uses rthroughput.

Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.

@alexey-bataev
Member

Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.

This is what I meant, actually. Just this work requires lots of time, accounting rthroughput for the "folded" loads is much easier to do.

eymay pushed a commit to eymay/llvm-project that referenced this issue Jul 28, 2023
eymay pushed a commit to eymay/llvm-project that referenced this issue Jul 28, 2023
As noted on llvm#63980 rotate by immediate amounts is much cheaper than variable amounts.

This still needs to be expanded to vector rotate cases, and we need to add reasonable funnel-shift costs as well (very tricky as there's a huge range in CPU behaviour for these).
@RKSimon
Collaborator

RKSimon commented Jul 30, 2023

llvm-mca analysis of the scalar vs vector snippets puts the vector version clearly in the lead: https://llvm.godbolt.org/z/rWPqdG1ve

So we need to look further into why the benchmark isn't reflecting this.

@annamthomas
Contributor Author

It could be that the above code is an over-simplified version (I reduced it using bugpoint). I will take a look at the final version and see what is going on. I tried with the exact mattr from the machine (apart from the mcpu); it didn't make a difference in the above mca output.

@annamthomas
Contributor Author

I've done the following experiment:

  1. Bisected by running the pipeline only up to just before SLP (good score) versus including SLP (bad score). This shows there are no further optimizations after SLP that might cause different codegen, and the score degradation is the same as when the rest of the pipeline runs after SLP.
  2. With that, I ran the entire assembly generated in both cases through llvm-mca. We still see that the vectorized output has better (i.e. lower) reciprocal throughput: https://llvm.godbolt.org/z/9qd1zzcnE

However, running regular perf on Cascade Lake shows the scalar output is clearly better (higher throughput of 2.55 IPC for scalar versus 1.63 IPC for vector). This possibly suggests that the scheduler model used in MCA is not accurate enough?

We do see that the SLP model doesn't account for the folded loads. So that is possibly orthogonal to why MCA doesn't reflect the perf output?

With SLP:

         26,785.87 msec task-clock                #    1.179 CPUs utilized          
            32,405      context-switches          #    0.001 M/sec                  
               233      cpu-migrations            #    0.009 K/sec                  
           324,958      page-faults               #    0.012 M/sec                  
   101,476,320,179      cycles                    #    3.788 GHz                    
   165,888,098,787      instructions              #    1.63  insn per cycle         
    20,465,290,828      branches                  #  764.033 M/sec                  
       245,105,776      branch-misses             #    1.20% of all branches        

      22.717993889 seconds time elapsed

      25.694972000 seconds user
       1.077673000 seconds sys

Without SLP:

         27,514.70 msec task-clock                #    1.209 CPUs utilized          
            32,624      context-switches          #    0.001 M/sec                  
               270      cpu-migrations            #    0.010 K/sec                  
           327,263      page-faults               #    0.012 M/sec                  
   104,237,391,611      cycles                    #    3.788 GHz                    
   265,523,567,495      instructions              #    2.55  insn per cycle         
    29,650,594,977      branches                  # 1077.628 M/sec                  
       345,374,111      branch-misses             #    1.16% of all branches        

      22.766190302 seconds time elapsed

      26.432187000 seconds user
       1.067867000 seconds sys

@RKSimon
Collaborator

RKSimon commented Aug 14, 2023

@annamthomas Am I looking at these numbers correctly? The SLP (vector) version seems faster than the non-SLP (scalar) one? Or are you just looking at the IPC?

@annamthomas
Contributor Author

@RKSimon Sorry, the above numbers should be ignored. Perf is running on a Java VM, which means we need to allow sufficient warmup time before measuring. I analyzed this in detail along with the original benchmark to figure out why llvm-mca doesn't reflect the degradation. I think the reason is that the reciprocal throughput computed by mca is correct only when there are no loop-carried dependencies. However, this entire sequence (source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/provider/SHA2.java#L160) in fact has a loop-carried dependency.
We write W[16] on the first iteration and read it on the third iteration.
The original loop is unrolled by a factor of 2, which generates the scalar code we see in the description: #63980 (comment)

The SLP vectorizer decides to vectorize this (unrolled-by-2) loop, and the vectorized version performs badly.
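For context, here is a minimal sketch of that recurrence (the linked source is Java; this is a C++ rendering for illustration only, assuming the standard SHA-1 style message expansion implied by the -3/-8/-14/-16 offsets and the rotate-by-1 in the IR above):

#include <cstdint>

// W[t] depends on W[t-3], so a value stored on one iteration is read back
// a few iterations later: the loop-carried dependency that llvm-mca's
// straight-line reciprocal-throughput analysis does not model.
void expandMessageSchedule(uint32_t W[80]) {
  for (int t = 16; t < 80; ++t) {
    uint32_t x = W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16];
    W[t] = (x << 1) | (x >> 31);  // rotate left by 1, i.e. the fshl by 1
  }
}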

I have used perf counters normalized to benchmark iterations (instead of the per-second numbers above) to give an accurate performance comparison:

Without SLP
SecureRandomBench.nextBytes thrpt 1457544.609 ± 31492.455 ops/s

SecureRandomBench.nextBytes:CPI                       thrpt             0.658              clks/insn
SecureRandomBench.nextBytes:IPC                       thrpt             1.520              insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses     thrpt            58.094                   #/op
SecureRandomBench.nextBytes:L1-dcache-loads           thrpt          1137.904                   #/op
SecureRandomBench.nextBytes:L1-dcache-stores          thrpt           639.830                   #/op
SecureRandomBench.nextBytes:L1-icache-load-misses     thrpt           135.188                   #/op
SecureRandomBench.nextBytes:LLC-load-misses           thrpt             0.198                   #/op
SecureRandomBench.nextBytes:LLC-loads                 thrpt             0.596                   #/op
SecureRandomBench.nextBytes:LLC-store-misses          thrpt             0.067                   #/op
SecureRandomBench.nextBytes:LLC-stores                thrpt             0.287                   #/op
SecureRandomBench.nextBytes:branch-misses             thrpt             7.700                   #/op
SecureRandomBench.nextBytes:branches                  thrpt           874.864                   #/op
SecureRandomBench.nextBytes:cycles                    thrpt          2692.120                   #/op
SecureRandomBench.nextBytes:dTLB-load-misses          thrpt             0.199                   #/op
SecureRandomBench.nextBytes:dTLB-loads                thrpt          1142.688                   #/op
SecureRandomBench.nextBytes:dTLB-store-misses         thrpt             0.012                   #/op
SecureRandomBench.nextBytes:dTLB-stores               thrpt           639.797                   #/op
SecureRandomBench.nextBytes:iTLB-load-misses          thrpt             4.489                   #/op
SecureRandomBench.nextBytes:iTLB-loads                thrpt             8.200                   #/op
SecureRandomBench.nextBytes:instructions              thrpt          4090.985                   #/op

With SLP:
SecureRandomBench.nextBytes thrpt 909508.096 ± 982.611 ops/s



SecureRandomBench.nextBytes:CPI                       thrpt            0.659            clks/insn
SecureRandomBench.nextBytes:IPC                       thrpt            1.517            insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses     thrpt           95.428                 #/op
SecureRandomBench.nextBytes:L1-dcache-loads           thrpt         1830.926                 #/op
SecureRandomBench.nextBytes:L1-dcache-stores          thrpt         1029.726                 #/op
SecureRandomBench.nextBytes:L1-icache-load-misses     thrpt          220.363                 #/op
SecureRandomBench.nextBytes:LLC-load-misses           thrpt            0.296                 #/op
SecureRandomBench.nextBytes:LLC-loads                 thrpt            0.903                 #/op
SecureRandomBench.nextBytes:LLC-store-misses          thrpt            0.112                 #/op
SecureRandomBench.nextBytes:LLC-stores                thrpt            0.489                 #/op
SecureRandomBench.nextBytes:branch-misses             thrpt           12.103                 #/op
SecureRandomBench.nextBytes:branches                  thrpt         1402.117                 #/op
SecureRandomBench.nextBytes:cycles                    thrpt         4329.138                 #/op
SecureRandomBench.nextBytes:dTLB-load-misses          thrpt            0.422                 #/op
SecureRandomBench.nextBytes:dTLB-loads                thrpt         1819.267                 #/op
SecureRandomBench.nextBytes:dTLB-store-misses         thrpt            0.027                 #/op
SecureRandomBench.nextBytes:dTLB-stores               thrpt         1023.054                 #/op
SecureRandomBench.nextBytes:iTLB-load-misses          thrpt            7.023                 #/op
SecureRandomBench.nextBytes:iTLB-loads                thrpt           13.154                 #/op
SecureRandomBench.nextBytes:instructions              thrpt         6567.043                 #/op

As we can see, SLP has higher cycles/op (4329.138 compared to 2692.120), while IPC is comparable.

@annamthomas
Contributor Author

This is what I meant, actually. Just this work requires lots of time, accounting rthroughput for the "folded" loads is much easier to do.

@alexey-bataev Just to bring this discussion back: we continue to see the huge degradation from SLP vectorizing this code. Any pointers on how we can fix this (without hacky code downstream) are greatly appreciated. Should we consider folded loads in the scalar cost and just treat them as "free"?
llvm-mca doesn't give an accurate representation of the actual cost, since we have a loop-carried dependency (which MCA doesn't model).

@alexey-bataev
Member

That's not an issue in the SLP vectorizer; it is definitely an issue in the cost model. I think we need to teach TTI about folded loads for some scalar operations. I had a patch for this a while back somewhere on Phabricator, but it did not go in. Plus, it required some extra tuning, I assume.
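A minimal sketch of that idea (a hypothetical helper, not an existing TTI hook; a real change would also need the target's folding rules), just to make the shape of the fix concrete:

#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

// Hypothetical predicate: treat a scalar load as (nearly) free for
// reciprocal-throughput costing if its only user is a simple binary op
// that x86 can fold it into as a memory operand (e.g. the xorl mem,reg
// instructions in the scalar assembly above).
static bool looksLikeFoldableLoad(const llvm::LoadInst &LI) {
  if (!LI.isSimple() || !LI.hasOneUse())
    return false;
  // A real implementation would also check the operand position, type
  // legality, and whether the user is actually foldable on the target.
  return llvm::isa<llvm::BinaryOperator>(*LI.user_begin());
}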

@RKSimon
Collaborator

RKSimon commented Jun 14, 2024

@annamthomas Let me take another look at this

@RKSimon
Collaborator

RKSimon commented Jun 21, 2024

@annamthomas Are you still seeing the perf regression when setting -slp-threshold=2?

At that threshold the vectorised rotate is still there, but we stop replacing 4 i32 loads with a v4i32 load and 4 extracts, which might have been another problem; we're probably going to be better off keeping them scalar: https://godbolt.org/z/qdzTYG9Gv

@RKSimon
Collaborator

RKSimon commented Jul 2, 2024

reverse-ping @annamthomas

@annamthomas
Contributor Author

annamthomas commented Jul 5, 2024

I'll check and let you know on Monday, @RKSimon (I just returned from vacation).

@annamthomas
Contributor Author

@RKSimon Even with -slp-threshold=2, we see the same degradation. I've confirmed in the IR that we no longer replace the 4 i32 loads with a v4i32 load and 4 extracts, so it doesn't seem that's the problem here.

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jan 23, 2025
…e cost analysis.

We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)

Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.

Noticed while having yet another attempt at llvm#63980
RKSimon added a commit that referenced this issue Jan 23, 2025
…e cost analysis. (#124129)

We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)

Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.

Noticed while having yet another attempt at #63980
@RKSimon
Collaborator

RKSimon commented Jan 23, 2025

I've finally gotten to the bottom of this: we have gaps in the handling of funnel-shift intrinsic costs, especially when the costs are type-only. #124175 partially addresses this, but even then we need to extend the X86 cost tables to prevent falling back to the generic costs in some cases (which are still absurdly cheap even when it knows we custom-lower the intrinsic).
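To make that gap concrete, here is a self-contained sketch (hypothetical types and illustrative cost numbers, not LLVM's real TTI interfaces): a type-only query carries no argument data, so it cannot tell a cheap splat-constant funnel shift from an expensive variable one and must not fall back to the cheap generic cost.

#include <optional>

// Hypothetical stand-in for an intrinsic cost query, for illustration only.
struct FunnelShiftQuery {
  int Bits;                        // element width
  std::optional<int> ShiftAmount;  // set only when the amount is a known constant splat
};

int x86FunnelShiftCost(const FunnelShiftQuery &Q) {
  if (Q.ShiftAmount)
    return 1;  // argument-aware: splat-constant amount lowers to a cheap rotate-by-imm
  // Type-only: no argument data, so assume the expensive custom lowering
  // (shift/shift/or) rather than the generic Cost=2 fallback.
  return 4;
}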

I know what I need to get done now (and can already avoid the bad vectorization with a few hacks!).

Better late than never :(

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jan 24, 2025
…tor IntrinsicCostAttributes

getVectorCallCosts determines the cost of a vector intrinsic, based off an existing scalar intrinsic call - but we were including the scalar argument data to the IntrinsicCostAttributes, which meant that not only was the cost calculation not type-only based, it was making incorrect assumptions about constant values etc.

This also exposed an issue that x86 relied on fallback calculations for funnel shift costs - this is great when we have the argument data as that improves the accuracy of uniform shift amounts etc., but meant that type-only costs would default to Cost=2 for all custom lowered funnel shifts, which was far too cheap.

This is the reverse of llvm#124129 where we weren't including argument data when we could.

Fixes llvm#63980
RKSimon added a commit that referenced this issue Jan 24, 2025
…tor IntrinsicCostAttributes (#124254)

getVectorCallCosts determines the cost of a vector intrinsic, based off
an existing scalar intrinsic call - but we were including the scalar
argument data to the IntrinsicCostAttributes, which meant that not only
was the cost calculation not type-only based, it was making incorrect
assumptions about constant values etc.

This also exposed an issue that x86 relied on fallback calculations for
funnel shift costs - this is great when we have the argument data as
that improves the accuracy of uniform shift amounts etc., but meant that
type-only costs would default to Cost=2 for all custom lowered funnel
shifts, which was far too cheap.

This is the reverse of #124129 where we weren't including argument data
when we could.

Fixes #63980
RKSimon added a commit that referenced this issue Feb 3, 2025
…e cost analysis. (#124129) (REAPPLIED)

We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)

Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.

Noticed while having yet another attempt at #63980

Reapplied cleanup now that #125223 and #124984 have landed.
Icohedron pushed a commit to Icohedron/llvm-project that referenced this issue Feb 11, 2025
…e cost analysis. (llvm#124129) (REAPPLIED)

We were only constructing the IntrinsicCostAttributes with the arg type info, and not the args themselves, preventing more detailed cost analysis (constant / uniform args etc.)

Just pass the whole IntrinsicInst to the constructor and let it resolve everything it can.

Noticed while having yet another attempt at llvm#63980

Reapplied cleanup now that llvm#125223 and llvm#124984 have landed.