[AArch64][SLP] Add NFC test cases for floating point reductions #106507


Merged 1 commit into llvm:main from sushgokh:GRCO-699 on Sep 12, 2024

Conversation

sushgokh
Contributor

A follow-up patch will be added to fix some of the tests.

@llvmbot
Member

llvmbot commented Aug 29, 2024

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-llvm-transforms

Author: Sushant Gokhale (sushgokh)

Changes

A follow-up patch will be added to fix some of the tests.


Patch is 24.48 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106507.diff

2 Files Affected:

  • (modified) llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll (+38-34)
  • (added) llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll (+225)
diff --git a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
index a68c21f7943432..c41a532f9f831e 100644
--- a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
@@ -2,6 +2,8 @@
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s | FileCheck %s
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+fullfp16 < %s | FileCheck %s --check-prefix=FP16
 ; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+bf16 < %s | FileCheck %s --check-prefix=BF16
+; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu \
+; RUN:     -mattr=+neoversev2 < %s | FileCheck %s --check-prefixes=FP16,NEOV2
 
 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 
@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
-; FP16-LABEL: 'strict_fp_reductions'
-; FP16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
 ; BF16-LABEL: 'strict_fp_reductions'
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
@@ -38,6 +29,17 @@ define void @strict_fp_reductions() {
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'strict_fp_reductions'
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
   %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)
@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
-; FP16-LABEL: 'fast_fp_reductions'
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 35 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
 ; BF16-LABEL: 'fast_fp_reductions'
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
@@ -121,6 +100,29 @@ define void @fast_fp_reductions() {
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
 ; BF16-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'fast_fp_reductions'
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 38 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
   %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
@@ -172,3 +174,5 @@ declare double @llvm.vector.reduce.fadd.v2f64(double, <2 x double>)
 declare double @llvm.vector.reduce.fadd.v4f64(double, <4 x double>)
 declare double @llvm.vector.reduce.fadd.v7f64(double, <7 x double>)
 declare double @llvm.vector.reduce.fadd.v9f64(double, <9 x double>)
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; FP16: {{.*}}
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
new file mode 100644
index 00000000000000..10fd1f7e7a2995
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
@@ -0,0 +1,225 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S -mcpu=neoverse-v2 | FileCheck %s
+
+define half @reduction_half2(<2 x half> %vec2){
+; CHECK-LABEL: define half @reduction_half2(
+; CHECK-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT0:%.*]] = extractelement <2 x half> [[VEC2]], i64 0
+; CHECK-NEXT:    [[ELT1:%.*]] = extractelement <2 x half> [[VEC2]], i64 1
+; CHECK-NEXT:    [[ADD1:%.*]] = fadd fast half [[ELT1]], [[ELT0]]
+; CHECK-NEXT:    ret half [[ADD1]]
+entry:
+  %elt0 = extractelement <2 x half> %vec2, i64 0
+  %elt1 = extractelement <2 x half> %vec2, i64 1
+  %add1 = fadd fast half %elt1, %elt0
+
+  ret half %add1
+}
+
+define half @reduction_half4(<4 x half> %vec4){
+; CHECK-LABEL: define half @reduction_half4(
+; CHECK-SAME: <4 x half> [[VEC4:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[VEC4]])
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <4 x half> %vec4, i64 0
+  %elt1 = extractelement <4 x half> %vec4, i64 1
+  %elt2 = extractelement <4 x half> %vec4, i64 2
+  %elt3 = extractelement <4 x half> %vec4, i64 3
+  %add1 = fadd fast half %elt1, %elt0
+  %add2 = fadd fast half %elt2, %add1
+  %add3 = fadd fast half %elt3, %add2
+
+  ret half %add3
+}
+
+define half @reduction_half8(<8 x half> %vec8){
+; CHECK-LABEL: define half @reduction_half8(
+; CHECK-SAME: <8 x half> [[VEC8:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT4:%.*]] = extractelement <8 x half> [[VEC8]], i64 4
+; CHECK-NEXT:    [[ELT5:%.*]] = extractelement <8 x half> [[VEC8]], i64 5
+; CHECK-NEXT:    [[ELT6:%.*]] = extractelement <8 x half> [[VEC8]], i64 6
+; CHECK-NEXT:    [[ELT7:%.*]] = extractelement <8 x half> [[VEC8]], i64 7
+; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <8 x half> [[VEC8]], <8 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[ELT4]]
+; CHECK-NEXT:    [[OP_RDX1:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT:    [[OP_RDX2:%.*]] = fadd fast half [[OP_RDX]], [[OP_RDX1]]
+; CHECK-NEXT:    [[TMP0:%.*]] = fadd fast half [[OP_RDX2]], [[ELT7]]
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <8 x half> %vec8, i64 0
+  %elt1 = extractelement <8 x half> %vec8, i64 1
+  %elt2 = extractelement <8 x half> %vec8, i64 2
+  %elt3 = extractelement <8 x half> %vec8, i64 3
+  %elt4 = extractelement <8 x half> %vec8, i64 4
+  %elt5 = extractelement <8 x half> %vec8, i64 5
+  %elt6 = extractelement <8 x half> %vec8, i64 6
+  %elt7 = extractelement <8 x half> %vec8, i64 7
+  %add1 = fadd fast half %elt1, %elt0
+  %add2 = fadd fast half %elt2, %add1
+  %add3 = fadd fast half %elt3, %add2
+  %add4 = fadd fast half %elt4, %add3
+  %add5 = fadd fast half %elt5, %add4
+  %add6 = fadd fast half %elt6, %add5
+  %add7 = fadd fast half %elt7, %add6
+
+  ret half %add7
+}
+
+define half @reduction_half16(<16 x half> %vec16) {
+; CHECK-LABEL: define half @reduction_half16(
+; CHECK-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT:  [[ENTRY:.*:]]
+; CHECK-NEXT:    [[ELT4:%.*]] = extractelement <16 x half> [[VEC16]], i64 4
+; CHECK-NEXT:    [[ELT5:%.*]] = extractelement <16 x half> [[VEC16]], i64 5
+; CHECK-NEXT:    [[ELT6:%.*]] = extractelement <16 x half> [[VEC16]], i64 6
+; CHECK-NEXT:    [[ELT7:%.*]] = extractelement <16 x half> [[VEC16]], i64 7
+; CHECK-NEXT:    [[ELT12:%.*]] = extractelement <16 x half> [[VEC16]], i64 12
+; CHECK-NEXT:    [[ELT13:%.*]] = extractelement <16 x half> [[VEC16]], i64 13
+; CHECK-NEXT:    [[ELT14:%.*]] = extractelement <16 x half> [[VEC16]], i64 14
+; CHECK-NEXT:    [[ELT15:%.*]] = extractelement <16 x half> [[VEC16]], i64 15
+; CHECK-NEXT:    [[TMP4:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP4]])
+; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:    [[TMP3:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT:    [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[TMP3]]
+; CHECK-NEXT:    [[OP_RDX1:%.*]] = fadd fast half [[OP_RDX]], [[ELT4]]
+; CHECK-NEXT:    [[OP_RDX2:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT:    [[OP_RDX3:%.*]] = fadd fast half [[ELT7]], [[ELT12]]
+; CHECK-NEXT:    [[OP_RDX4:%.*]] = fadd fast half [[ELT13]], [[ELT14]]
+; CHECK-NEXT:    [[OP_RDX5:%.*]] = fadd fast half [[OP_RDX1]], [[OP_RDX2]]
+; CHECK-NEXT:    [[OP_RDX6:%.*]] = fadd fast half [[OP_RDX3]], [[OP_RDX4]]
+; CHECK-NEXT:    [[OP_RDX7:%.*]] = fadd fast half [[OP_RDX5]], [[OP_RDX6]]
+; CHECK-NEXT:    [[TMP0:%.*]] = fadd fast half [[OP_RDX7]], [[ELT15]]
+; CHECK-NEXT:    ret half [[TMP0]]
+entry:
+  %elt0 = extractelement <16 x half> %vec16, i64 0
+  %elt1 = extractelement <16 x half> %vec16, i64 1
+  %elt2 = extractelement <16 x half> %vec16, i64 2
+  %elt3 = extractelement <16 x half> %vec16, i64 3
+  %elt4 = extractelement <16 x half> %vec16, i64 4
+  %elt5 = extractelement <16 x half> %vec16, i64 5
+  %elt6 = extractelement <16 x half> %vec16, i64 6
+  %elt7 = extractelement <16 x half> %vec16, i64 7
+  %elt8 = extractelement <16 x half> %vec16, i64 8
+  %elt9 = extractelement <16 x half> %vec16, i64 9
+  %elt10 = extractelement <16 x half> %vec16, i64 10
+  %elt11 = extractelement <16 x half> %vec16...
[truncated]

@davemgreen
Collaborator

Do you need these to be based on the CPU? I assume your follow-up will alter the costs in some way?

@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'strict_fp_reductions'
Member

These checks were removed; they must be restored.

@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'fast_fp_reductions'
Member

same

@sushgokh
Contributor Author

Do you need these to be based on the CPU? I assume your follow-up will alter the costs in some way?

I thought of adding a RUN line for Neoverse-V2 because:

  1. The throughput of the 'faddp' instruction, generated for these reductions, has doubled from 2 to 4 from Neoverse-V1/V2 onwards.
  2. I want to avoid the trap of failing to detect issues because of generalization across multiple CPUs: if we aim for a cost that is one size fits all, we may be hiding some cost-modelling issues.

But yeah, I haven't run the benchmarks yet with/without the Neoverse-V2 option.
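
(For reference: these reductions lower to a chain of Log2(NumElts) pairwise faddp steps; a fast <4 x float> reduce.fadd, for example, becomes roughly faddp v0.4s, v0.4s, v0.4s followed by faddp s0, v0.2s, so faddp throughput directly affects the reduction cost.)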

Collaborator

@davemgreen davemgreen left a comment

Can you explain what it is you plan to change? All the tests are fast and it might be something that needs to be dependent on that.

@sushgokh
Contributor Author

sushgokh commented Sep 2, 2024

Can you explain what it is you plan to change? All the tests are fast and it might be something that needs to be dependent on that.

The reduction cost for @llvm.vector.reduce.fadd with 'fast'

@davemgreen
Collaborator

They do look a little high, so it sounds like a sensible change to make. Do you intend to change them in general or just for -mcpu=neoverse-v2? If they are just a series of faddp's, then we could hopefully produce better costs for every CPU. If so, you could remove the -mcpu options from the tests and just add -mattr where needed.

For the SLP tests it might be good to have at least some non-fast tests along with the ones you have.
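
For instance (a hypothetical sketch, not part of this patch), a strict variant of one of these tests could look like:

; Strict variant (sketch): without 'fast'/'reassoc' the fadd chain is
; order-sensitive, so SLP is expected to keep the scalar chain rather than
; form a single llvm.vector.reduce.fadd.
define half @reduction_half4_strict(<4 x half> %vec4) {
entry:
  %elt0 = extractelement <4 x half> %vec4, i64 0
  %elt1 = extractelement <4 x half> %vec4, i64 1
  %elt2 = extractelement <4 x half> %vec4, i64 2
  %elt3 = extractelement <4 x half> %vec4, i64 3
  %add1 = fadd half %elt1, %elt0
  %add2 = fadd half %elt2, %add1
  %add3 = fadd half %elt3, %add2
  ret half %add3
}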

@sushgokh
Contributor Author

sushgokh commented Sep 2, 2024

Do you intend to change them in general or just for -mcpu=neoverse-v2?

Right now, the patch intends to address the issue just for Neoverse-V2. I couldn't test the patch on any other CPU (<= Neoverse-N2). Could you help me in this regard?

@davemgreen
Collaborator

Sure, I can try that. If you put up a patch for it, I can give it a test in a few places.

@sushgokh
Contributor Author

sushgokh commented Sep 3, 2024

Sure, I can try that. If you put up a patch for it, I can give it a test in a few places.

--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -4056,6 +4059,23 @@ AArch64TTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
   switch (ISD) {
   default:
     break;
+  case ISD::FADD: {
+    if (MTy.isVector()) {
+      unsigned NumElts = MTy.getVectorNumElements();
+      if (ValTy->getElementCount().getFixedValue() >= 2 && NumElts >= 2 &&
+          isPowerOf2_32(NumElts)
+          //&& ST->getProcFamily() == AArch64Subtarget::NeoverseV2)
+      ) {
+        // Floating point reductions are lowered to series of faddp
+        // instructions.
+        // For Neoverse-V1 onwards, for `faddp` instruction, Latency=2 and
+        // Throughput=4.
+        unsigned NumFAddpIns = Log2_32(NumElts);
+        return (LT.first - 1) +
+               /*Latency=*/2 * divideCeil(NumFAddpIns, /*Throughput=*/4);
+      }
+    }
+  } break;
   case ISD::ADD:
     if (const auto *Entry = CostTableLookup(CostTblNoPairwise, ISD, MTy))
       return (LT.first - 1) + Entry->Cost;
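
For illustration (my arithmetic, not taken from the thread): with +fullfp16, <8 x half> is a single legal vector (LT.first = 1), so NumElts = 8, NumFAddpIns = Log2(8) = 3, and the formula returns (1 - 1) + 2 * divideCeil(3, 4) = 2.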

@davemgreen
Collaborator

Hi again - It looks like we should be able to treat a faddp the same as a fadd cost-wise on most modern CPUs (and by default). Some older CPUs prior to cortex-a73 (but not little cores) had them a little higher; we might want to add a target feature if needed, but I think this would make a good default cost model.
The Throughput=4 isn't really meaningful with how we model costs at the moment, and the Latency=2 would only be used for TCK_Latency (although we don't currently handle that very thoroughly). The default TCK_RecipThroughput just adds together reciprocal-throughput estimates that are relative to one another. The cost should either be similar to a fadd for each step (which I believe is 1 now), or doubling it is probably fine if that produces better results (and would then probably be OK for any CPU).

It might be easier in this case to add any extra CostModel/AArch64 tests in the same PR as the cost-model adjustments, as that will show what tests we really need. The SLP ones look like good additions if we remove the -mcpu option, but it might be good to have at least some tests for both fast and non-fast. FP16 costs are usually dependent on whether +fullfp16 is present (they should ideally promote to fp32 otherwise), so it might be worth having an extra RUN line for those if it will be relevant in the end.
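
Under that simpler per-step model (a sketch of the suggestion above, not a decided cost), the same <8 x half> case would cost (LT.first - 1) + Log2(8) = 3 if each faddp step costs 1 like a fadd, or 6 if each step is doubled.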

@sushgokh
Contributor Author

sushgokh commented Sep 4, 2024

@davemgreen Thanks for the help. Yes, I will add more tests (and change the costing as well) for the next patch.

@sushgokh sushgokh force-pushed the GRCO-699 branch 2 times, most recently from 0836b93 to 6b7925f on September 10, 2024 at 09:46
@sushgokh sushgokh requested a review from davemgreen September 10, 2024 09:49
Comment on lines 3 to 4
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mcpu=neoverse-v2 \
; RUN: -S | FileCheck %s --check-prefix=NEOV2
Collaborator

I think you can remove the V2 results from here now, unless you expect them to be different.

Contributor Author

The result is different for the v16 half type.

Collaborator

Different based on the CPU, or based on -mattr=+fullfp16 or -mattr=+sve2?

Contributor Author

Different for the following cases in isolation:

-mattr=+fullfp16
-mattr=+sve2
-mcpu=neoverse-v2

But surprisingly the same for the following case!

-mattr=neoversev2

Collaborator

Oh I see what you mean, the codegen is different already. That does sound odd; it is likely because the scalar cost is lower with fullfp16, and -mcpu=neoverse-v2 implies +sve2, which in turn implies +fullfp16, so +fullfp16 is enough to show the difference.

Can you change the run lines to:

; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16

It should collapse a lot of the run lines together when they are the same. These all look like a useful set of tests and we can hopefully get them in.

Contributor Author

@davemgreen The presence of a function attribute, as below, with -mattr=+fullfp16

; NEOV2-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {

is proving to be a hindrance to merging outputs with shared check-prefixes.

There are manual ways to get around this issue, such as:

  1. Checking only the required part, as below, with CHECK-SAME:
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux | FileCheck %s
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 | FileCheck %s 

define half @reduce_fast_half2(<2 x half> %vec2) {
; CHECK-LABEL: define half @reduce_fast_half2(
; CHECK-SAME: <2 x half> [[VEC2:%.*]]) 
...
...
  2. Maybe using the regular expression {{.*}} in place of the function attribute

The cons of the manual approach here would be:

  1. Auto-updating the tests is not possible
  2. For a large number of tests, this is a lot of manual work

Is there any way to scrub the function attribute here so that using shared check-prefixes becomes possible?

Collaborator

Oh I hadn't seen that before (and I got the first check prefix wrong, sorry about that).

Maybe try:

; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=-fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-NOFP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16

The -fullfp16 seems to work around the issue with the attributes.

Collaborator

@davemgreen davemgreen left a comment

Sorry - sometimes GitHub doesn't send out notifications for new patches. This LGTM now, thanks.

; NOFP16-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
; NOFP16-NEXT: [[ENTRY:.*:]]
; NOFP16-NEXT: [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH8000, <16 x half> [[VEC16]])
; NOFP16-NEXT: ret half [[TMP0]]
Collaborator

This looks a little off; there is usually an extra newline, I think. You might have to give it a quick regenerate.

Contributor Author

Yes, right. I have been asked a few times in review to remove the last blank line generated by the update_test_checks script, and I removed them manually. But I have run the test through llvm-lit and it passes. I hope this is fine.

A follow-up patch will be added to fix some of the tests.
@sushgokh sushgokh merged commit 7a6945f into llvm:main Sep 12, 2024
4 of 5 checks passed
sushgokh added a commit to sushgokh/llvm-project that referenced this pull request Sep 12, 2024
Updating the failing test in this patch.
sushgokh added a commit that referenced this pull request Sep 12, 2024
Updating the failing test in this patch.
@sushgokh sushgokh deleted the GRCO-699 branch September 12, 2024 19:59
citymarina pushed a commit to citymarina/llvm-project that referenced this pull request Oct 7, 2024
[AArch64][SLP] Add NFC test cases for floating point reductions (#106507)

A follow-up patch will be added to fix some of the tests.

Pull request: llvm#106507

(cherry picked from commit 7a6945f)
citymarina pushed a commit to citymarina/llvm-project that referenced this pull request Oct 7, 2024
Updating the failing test in this patch.

(cherry picked from commit d37d057)