[AArch64][SLP] Add NFC test cases for floating point reductions #106507


Merged 1 commit into llvm:main from sushgokh:GRCO-699 on Sep 12, 2024

Conversation

sushgokh
Contributor

A follow-up patch will be added to fix some of the tests.

@llvmbot
Member

llvmbot commented Aug 29, 2024

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Sushant Gokhale (sushgokh)

Changes

A follow-up patch will be added to fix some of the tests.


Patch is 24.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106507.diff

2 Files Affected:

  • (modified) llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll (+38-34)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll (+225)
diff --git a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
index a68c21f7943432..c41a532f9f831e 100644
--- a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
@@ -2,6 +2,8 @@
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s | FileCheck %s
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+fullfp16 < %s | FileCheck %s --check-prefix=FP16
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+bf16 < %s | FileCheck %s --check-prefix=BF16
+; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu \
+; RUN:     -mattr=+neoversev2 < %s | FileCheck %s --check-prefixes=FP16,NEOV2
 
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 
@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
-; FP16-LABEL: 'strict_fp_reductions'
-; FP16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
 ; BF16-LABEL: 'strict_fp_reductions'
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
@@ -38,6 +29,17 @@ define void @strict_fp_reductions() {
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'strict_fp_reductions'
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
   %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)
@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
-; FP16-LABEL: 'fast_fp_reductions'
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
 ; BF16-LABEL: 'fast_fp_reductions'
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
@@ -121,6 +100,29 @@ define void @fast_fp_reductions() {
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'fast_fp_reductions'
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
   %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
@@ -172,3 +174,5 @@ declare double @llvm.vector.reduce.fadd.v2f64(double, <2 x double>)
 declare double @llvm.vector.reduce.fadd.v4f64(double, <4 x double>)
 declare double @llvm.vector.reduce.fadd.v7f64(double, <7 x double>)
 declare double @llvm.vector.reduce.fadd.v9f64(double, <9 x double>)
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; FP16: {{.*}}
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
new file mode 100644
index 00000000000000..10fd1f7e7a2995
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
@@ -0,0 +1,225 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S -mcpu=neoverse-v2 | FileCheck %s
+
+define half @reduction_half2(<2 x half> %vec2){
+; CHECK-LABEL: define half @reduction_half2(
+; CHECK-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT0:%.*]] = extractelement <2 x half> [[VEC2]], i64 0
+; CHECK-NEXT:    [[ELT1:%.*]] = extractelement <2 x half> [[VEC2]], i64 1
+; CHECK-NEXT:    [[ADD1:%.*]] = fadd fast half [[ELT1]], [[ELT0]]
+; CHECK-NEXT:    ret half [[ADD1]]
+entry:
+  %elt0 = extractelement <2 x half> %vec2, i64 0
+  %elt1 = extractelement <2 x half> %vec2, i64 1
+  %add1 = fadd fast half %elt1, %elt0
+
+  ret half %add1
+}
+
+define half @reduction_half4(<4 x half> %vec4){
+; CHECK-LABEL: define half @reduction_half4(
+; CHECK-SAME: <4 x half> [[VEC4:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[VEC4]])
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <4 x half> %vec4, i64 0
+  %elt1 = extractelement <4 x half> %vec4, i64 1
+  %elt2 = extractelement <4 x half> %vec4, i64 2
+  %elt3 = extractelement <4 x half> %vec4, i64 3
+  %add1 = fadd fast half %elt1, %elt0
+  %add2 = fadd fast half %elt2, %add1
+  %add3 = fadd fast half %elt3, %add2
+
+  ret half %add3
+}
+
+define half @reduction_half8(<8 x half> %vec8){
+; CHECK-LABEL: define half @reduction_half8(
+; CHECK-SAME: <8 x half> [[VEC8:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT4:%.*]] = extractelement <8 x half> [[VEC8]], i64 4
+; CHECK-NEXT:    [[ELT5:%.*]] = extractelement <8 x half> [[VEC8]], i64 5
+; CHECK-NEXT:    [[ELT6:%.*]] = extractelement <8 x half> [[VEC8]], i64 6
+; CHECK-NEXT:    [[ELT7:%.*]] = extractelement <8 x half> [[VEC8]], i64 7
+; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <8 x half> [[VEC8]], <8 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[ELT4]]
+; CHECK-NEXT:    [[OP_RDX1:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT:    [[OP_RDX2:%.*]] = fadd fast half [[OP_RDX]], [[OP_RDX1]]
+; CHECK-NEXT:    [[TMP0:%.*]] = fadd fast half [[OP_RDX2]], [[ELT7]]
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <8 x half> %vec8, i64 0
+  %elt1 = extractelement <8 x half> %vec8, i64 1
+  %elt2 = extractelement <8 x half> %vec8, i64 2
+  %elt3 = extractelement <8 x half> %vec8, i64 3
+  %elt4 = extractelement <8 x half> %vec8, i64 4
+  %elt5 = extractelement <8 x half> %vec8, i64 5
+  %elt6 = extractelement <8 x half> %vec8, i64 6
+  %elt7 = extractelement <8 x half> %vec8, i64 7
+  %add1 = fadd fast half %elt1, %elt0
+  %add2 = fadd fast half %elt2, %add1
+  %add3 = fadd fast half %elt3, %add2
+  %add4 = fadd fast half %elt4, %add3
+  %add5 = fadd fast half %elt5, %add4
+  %add6 = fadd fast half %elt6, %add5
+  %add7 = fadd fast half %elt7, %add6
+
+  ret half %add7
+}
+
+define half @reduction_half16(<16 x half> %vec16) {
+; CHECK-LABEL: define half @reduction_half16(
+; CHECK-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT4:%.*]] = extractelement <16 x half> [[VEC16]], i64 4
+; CHECK-NEXT:    [[ELT5:%.*]] = extractelement <16 x half> [[VEC16]], i64 5
+; CHECK-NEXT:    [[ELT6:%.*]] = extractelement <16 x half> [[VEC16]], i64 6
+; CHECK-NEXT:    [[ELT7:%.*]] = extractelement <16 x half> [[VEC16]], i64 7
+; CHECK-NEXT:    [[ELT12:%.*]] = extractelement <16 x half> [[VEC16]], i64 12
+; CHECK-NEXT:    [[ELT13:%.*]] = extractelement <16 x half> [[VEC16]], i64 13
+; CHECK-NEXT:    [[ELT14:%.*]] = extractelement <16 x half> [[VEC16]], i64 14
+; CHECK-NEXT:    [[ELT15:%.*]] = extractelement <16 x half> [[VEC16]], i64 15
+; CHECK-NEXT:    [[TMP4:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP4]])
+; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:    [[TMP3:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[TMP3]]
+; CHECK-NEXT:    [[OP_RDX1:%.*]] = fadd fast half [[OP_RDX]], [[ELT4]]
+; CHECK-NEXT:    [[OP_RDX2:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT:    [[OP_RDX3:%.*]] = fadd fast half [[ELT7]], [[ELT12]]
+; CHECK-NEXT:    [[OP_RDX4:%.*]] = fadd fast half [[ELT13]], [[ELT14]]
+; CHECK-NEXT:    [[OP_RDX5:%.*]] = fadd fast half [[OP_RDX1]], [[OP_RDX2]]
+; CHECK-NEXT:    [[OP_RDX6:%.*]] = fadd fast half [[OP_RDX3]], [[OP_RDX4]]
+; CHECK-NEXT:    [[OP_RDX7:%.*]] = fadd fast half [[OP_RDX5]], [[OP_RDX6]]
+; CHECK-NEXT:    [[TMP0:%.*]] = fadd fast half [[OP_RDX7]], [[ELT15]]
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <16 x half> %vec16, i64 0
+  %elt1 = extractelement <16 x half> %vec16, i64 1
+  %elt2 = extractelement <16 x half> %vec16, i64 2
+  %elt3 = extractelement <16 x half> %vec16, i64 3
+  %elt4 = extractelement <16 x half> %vec16, i64 4
+  %elt5 = extractelement <16 x half> %vec16, i64 5
+  %elt6 = extractelement <16 x half> %vec16, i64 6
+  %elt7 = extractelement <16 x half> %vec16, i64 7
+  %elt8 = extractelement <16 x half> %vec16, i64 8
+  %elt9 = extractelement <16 x half> %vec16, i64 9
+  %elt10 = extractelement <16 x half> %vec16, i64 10
+  %elt11 = extractelement <16 x half> %vec16...
[truncated]

@davemgreen
Collaborator

Do you need these to be based on the CPU? I assume your follow-up will alter the costs in some way?

@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'strict_fp_reductions'
Member

These checks were removed; they must be restored.

@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'fast_fp_reductions'
Member

same

@sushgokh
Contributor Author

Do you need these to be based on the CPU? I assume your follow-up will alter the costs in some way?

I thought of adding a RUN line for Neoverse-V2 because:

  1. The throughput of the 'faddp' instruction, generated for these reductions, has doubled from 2 to 4 from Neoverse-V1/V2 onwards.
  2. I want to avoid the trap of failing to detect issues because of generalization across multiple CPUs: if we aim for a cost that is one size fits all, we may be hiding some cost-modelling issues.

But yeah, I haven't run the benchmarks yet with/without the Neoverse-V2 option.
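
(For reference: these reductions lower to a chain of Log2(NumElts) pairwise faddp steps; a fast <4 x float> reduce.fadd, for example, becomes roughly faddp v0.4s, v0.4s, v0.4s followed by faddp s0, v0.2s, so faddp throughput directly affects the reduction cost.)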

Collaborator

@davemgreen davemgreen left a comment

Can you explain what it is you plan to change? All the tests are fast and it might be something that needs to be dependent on that.

@sushgokh
Contributor Author

sushgokh commented Sep 2, 2024

Can you explain what it is you plan to change? All the tests are fast and it might be something that needs to be dependent on that.

The reduction cost for @llvm.vector.reduce.fadd with 'fast'

@davemgreen
Collaborator

They do look a little high, so it sounds like a sensible change to make. Do you intend to change them in general or just for -mcpu=neoverse-v2? If they are just a series of faddp's, then we could hopefully produce better costs for every CPU. If so, you could remove the -mcpu options from the tests and just add -mattr where needed.

For the SLP tests it might be good to have at least some non-fast tests along with the ones you have.
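
For instance (a hypothetical sketch, not part of this patch), a strict variant of one of these tests could look like:

; Strict variant (sketch): without 'fast'/'reassoc' the fadd chain is
; order-sensitive, so SLP is expected to keep the scalar chain rather than
; form a single llvm.vector.reduce.fadd.
define half @reduction_half4_strict(<4 x half> %vec4) {
entry:
  %elt0 = extractelement <4 x half> %vec4, i64 0
  %elt1 = extractelement <4 x half> %vec4, i64 1
  %elt2 = extractelement <4 x half> %vec4, i64 2
  %elt3 = extractelement <4 x half> %vec4, i64 3
  %add1 = fadd half %elt1, %elt0
  %add2 = fadd half %elt2, %add1
  %add3 = fadd half %elt3, %add2
  ret half %add3
}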

@sushgokh
Contributor Author

sushgokh commented Sep 2, 2024

Do you intend to change them in general or just for -mcpu=neoverse-v2?

Right now, the patch intends to address the issue just for Neoverse-V2. I couldn't test the patch on any other CPU (<= Neoverse-N2). Could you help me in this regard?

@davemgreen
Collaborator

Sure, I can try that. If you put up a patch for it, I can give it a test in a few places.

@sushgokh
Contributor Author

sushgokh commented Sep 3, 2024

Sure, I can try that. If you put up a patch for it, I can give it a test in a few places.

--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4056,6 +4059,23 @@ AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
   switch (ISD) {
   default:
     break;
+  case ISD::FADD: {
+    if (MTy.isVector()) {
+      unsigned NumElts = MTy.getVectorNumElements();
+      if (ValTy->getElementCount().getFixedValue() >= 2 && NumElts >= 2 &&
+          isPowerOf2_32(NumElts)
+          //&& ST->getProcFamily() == AArch64Subtarget::NeoverseV2)
+      ) {
+        // Floating point reductions are lowered to series of faddp
+        // instructions.
+        // For Neoverse-V1 onwards, for `faddp` instruction, Latency=2 and
+        // Throughput=4.
+        unsigned NumFAddpIns = Log2_32(NumElts);
+        return (LT.first - 1) +
+               /*Latency=*/2 * divideCeil(NumFAddpIns, /*Throughput=*/4);
+      }
+    }
+  } break;
   case ISD::ADD:
     if (const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy))
       return (LT.first - 1) + Entry->Cost;
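
For illustration (my arithmetic, not taken from the thread): with +fullfp16, <8 x half> is a single legal vector (LT.first = 1), so NumElts = 8, NumFAddpIns = Log2(8) = 3, and the formula returns (1 - 1) + 2 * divideCeil(3, 4) = 2.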

@davemgreen
Collaborator

Hi again - It looks like we should be able to treat a faddp the same as a fadd cost-wise on most modern CPUs (and by default). Some older CPUs prior to cortex-a73 (but not little cores) had them a little higher; we might want to add a target feature if needed, but I think this would make a good default cost model.
The Throughput=4 isn't really meaningful with how we model costs at the moment, and the Latency=2 would only be used for TCK_Latency (although we don't currently handle that very thoroughly). The default TCK_RecipThroughput just adds together reciprocal-throughput estimates that are relative to one another. The cost should either be similar to a fadd for each step (which I believe is 1 now), or doubling it is probably fine if that produces better results (and would then probably be OK for any CPU).

It might be easier in this case to add any extra CostModel/AArch64 tests in the same PR as the cost-model adjustments, as that will show what tests we really need. The SLP ones look like good additions if we remove the -mcpu option, but it might be good to have at least some tests for both fast and non-fast. FP16 costs are usually dependent on whether +fullfp16 is present (they should ideally promote to fp32 otherwise), so it might be worth having an extra RUN line for those if it will be relevant in the end.
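
Under that simpler per-step model (a sketch of the suggestion above, not a decided cost), the same <8 x half> case would cost (LT.first - 1) + Log2(8) = 3 if each faddp step costs 1 like a fadd, or 6 if each step is doubled.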

@sushgokh
Contributor Author

sushgokh commented Sep 4, 2024

@davemgreen Thanks for the help. Yes, I will add more tests (and change the costing as well) for the next patch.

@sushgokh sushgokh force-pushed the GRCO-699 branch 2 times, most recently from 0836b93 to 6b7925f on September 10, 2024 at 09:46
@sushgokh sushgokh requested a review from davemgreen September 10, 2024 09:49
Comment on lines 3 to 4
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mcpu=neoverse-v2 \
; RUN: -S | FileCheck %s --check-prefix=NEOV2
Collaborator

I think you can remove the V2 results from here now, unless you expect them to be different.

Contributor Author

The result is different for the v16 half type.

Collaborator

Different based on the CPU, or based on -mattr=+fullfp16 or -mattr=+sve2?

Contributor Author

Different for the following cases in isolation:

-mattr=+fullfp16
-mattr=+sve2
-mcpu=neoverse-v2

But surprisingly the same for the following case!

-mattr=neoversev2

Collaborator

Oh I see what you mean, the codegen is different already. That does sound odd; it is likely because the scalar cost is lower with fullfp16, and -mcpu=neoverse-v2 implies +sve2, which in turn implies +fullfp16, so +fullfp16 is enough to show the difference.

Can you change the run lines to:

; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16

It should collapse a lot of the run lines together when they are the same. These all look like a useful set of tests and we can hopefully get them in.

Contributor Author

@davemgreen The presence of a function attribute, as below, with -mattr=+fullfp16

; NEOV2-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {

is proving to be a hindrance to merging outputs with shared check-prefixes.

There are manual ways to get around this issue, such as:

  1. Checking only the required part, as below, with CHECK-SAME:
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux | FileCheck %s
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 | FileCheck %s 

define half @reduce_fast_half2(<2 x half> %vec2) {
; CHECK-LABEL: define half @reduce_fast_half2(
; CHECK-SAME: <2 x half> [[VEC2:%.*]]) 
...
...
  2. Maybe using the regular expression {{.*}} in place of the function attribute

The cons of the manual approach here would be:

  1. Auto-updating the tests is not possible
  2. For a large number of tests, this is a lot of manual work

Is there any way to scrub the function attribute here so that using shared check-prefixes becomes possible?

Collaborator

Oh I hadn't seen that before (and I got the first check prefix wrong, sorry about that).

Maybe try:

; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=-fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-NOFP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16

The -fullfp16 seems to work around the issue with the attributes.

Collaborator

@davemgreen davemgreen left a comment

Sorry - sometimes GitHub doesn't send out notifications for new patches. This LGTM now, thanks.

; NOFP16-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
; NOFP16-NEXT: [[ENTRY:.*:]]
; NOFP16-NEXT: [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH8000, <16 x half> [[VEC16]])
; NOFP16-NEXT: ret half [[TMP0]]
Collaborator

This looks a little off; there is usually an extra newline, I think. You might have to give it a quick regenerate.

Contributor Author

Yes, right. I have been asked a few times in review to remove the last blank line generated by the update_test_checks script, and I removed them manually. But I have run the test through llvm-lit and it passes. I hope this is fine.

A follow-up patch will be added to fix some of the tests.
@sushgokh sushgokh merged commit 7a6945f into llvm:main Sep 12, 2024
4 of 5 checks passed
sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Sep 12, 2024
Updating the failing test in this patch.
sushgokh added a commit that referenced this pull request Sep 12, 2024
Updating the failing test in this patch.
@sushgokh sushgokh deleted the GRCO-699 branch September 12, 2024 19:59
citymarina pushed a commit to citymarina/llvm-project that referenced this pull request Oct 7, 2024
[AArch64][SLP] Add NFC test cases for floating point reductions (#106507)

A follow-up patch will be added to fix some of the tests.

Pull request: llvm#106507

(cherry picked from commit 7a6945f)
citymarina pushed a commit to citymarina/llvm-project that referenced this pull request Oct 7, 2024
Updating the failing test in this patch.

(cherry picked from commit d37d057)