[AArch64][SLP] Add NFC test cases for floating point reductions #106507
Conversation
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-transforms
Author: Sushant Gokhale (sushgokh)
Changes: A successive patch would be added to fix some of the tests.
Patch is 24.48 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/106507.diff
2 Files Affected:
diff --git a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
index a68c21f7943432..c41a532f9f831e 100644
--- a/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
+++ b/llvm/test/Analysis/CostModel/AArch64/reduce-fadd.ll
@@ -2,6 +2,8 @@
; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s | FileCheck %s
; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+fullfp16 < %s | FileCheck %s --check-prefix=FP16
; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu -mattr=+bf16 < %s | FileCheck %s --check-prefix=BF16
+; RUN: opt -passes='print<cost-model>' 2>&1 -disable-output -mtriple=aarch64--linux-gnu \
+; RUN: -mattr=+neoversev2 < %s | FileCheck %s --check-prefixes=FP16,NEOV2
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
-; FP16-LABEL: 'strict_fp_reductions'
-; FP16-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
; BF16-LABEL: 'strict_fp_reductions'
; BF16-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
@@ -38,6 +29,17 @@ define void @strict_fp_reductions() {
; BF16-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'strict_fp_reductions'
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 14 for instruction: %fadd_v4f32 = call float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 28 for instruction: %fadd_v8f32 = call float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 6 for instruction: %fadd_v2f64 = call double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f64 = call double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 18 for instruction: %fadd_v4f8 = call bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR0000, <4 x bfloat> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
%fadd_v4f16 = call half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
%fadd_v8f16 = call half @llvm.vector.reduce.fadd.v8f16(half 0.0, <8 x half> undef)
@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
-; FP16-LABEL: 'fast_fp_reductions'
-; FP16-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 27 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 35 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 35 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
-; FP16-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
-;
; BF16-LABEL: 'fast_fp_reductions'
; BF16-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
@@ -121,6 +100,29 @@ define void @fast_fp_reductions() {
; BF16-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; BF16-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+; NEOV2-LABEL: 'fast_fp_reductions'
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0xH0000, <4 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16 = call fast half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 30 for instruction: %fadd_v8f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v8f16(half 0xH0000, <8 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %fadd_v11f16 = call fast half @llvm.vector.reduce.fadd.v11f16(half 0xH0000, <11 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 38 for instruction: %fadd_v13f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v13f16(half 0xH0000, <13 x half> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32 = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %fadd_v4f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v4f32(float 0.000000e+00, <4 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32 = call fast float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v8f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v8f32(float 0.000000e+00, <8 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %fadd_v13f32 = call fast float @llvm.vector.reduce.fadd.v13f32(float 0.000000e+00, <13 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v5f32_reassoc = call reassoc float @llvm.vector.reduce.fadd.v5f32(float 0.000000e+00, <5 x float> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64 = call fast double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %fadd_v2f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v2f64(double 0.000000e+00, <2 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64 = call fast double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 5 for instruction: %fadd_v4f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v4f64(double 0.000000e+00, <4 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 9 for instruction: %fadd_v7f64 = call fast double @llvm.vector.reduce.fadd.v7f64(double 0.000000e+00, <7 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 15 for instruction: %fadd_v9f64_reassoc = call reassoc double @llvm.vector.reduce.fadd.v9f64(double 0.000000e+00, <9 x double> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 10 for instruction: %fadd_v4f8 = call reassoc bfloat @llvm.vector.reduce.fadd.v4bf16(bfloat 0xR8000, <4 x bfloat> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
+; NEOV2-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
%fadd_v4f16_fast = call fast half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
%fadd_v4f16_reassoc = call reassoc half @llvm.vector.reduce.fadd.v4f16(half 0.0, <4 x half> undef)
@@ -172,3 +174,5 @@ declare double @llvm.vector.reduce.fadd.v2f64(double, <2 x double>)
declare double @llvm.vector.reduce.fadd.v4f64(double, <4 x double>)
declare double @llvm.vector.reduce.fadd.v7f64(double, <7 x double>)
declare double @llvm.vector.reduce.fadd.v9f64(double, <9 x double>)
+;; NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+; FP16: {{.*}}
diff --git a/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
new file mode 100644
index 00000000000000..10fd1f7e7a2995
--- /dev/null
+++ b/llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll
@@ -0,0 +1,225 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S -mcpu=neoverse-v2 | FileCheck %s
+
+define half @reduction_half2(<2 x half> %vec2){
+; CHECK-LABEL: define half @reduction_half2(
+; CHECK-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ELT0:%.*]] = extractelement <2 x half> [[VEC2]], i64 0
+; CHECK-NEXT: [[ELT1:%.*]] = extractelement <2 x half> [[VEC2]], i64 1
+; CHECK-NEXT: [[ADD1:%.*]] = fadd fast half [[ELT1]], [[ELT0]]
+; CHECK-NEXT: ret half [[ADD1]]
+entry:
+ %elt0 = extractelement <2 x half> %vec2, i64 0
+ %elt1 = extractelement <2 x half> %vec2, i64 1
+ %add1 = fadd fast half %elt1, %elt0
+
+ ret half %add1
+}
+
+define half @reduction_half4(<4 x half> %vec4){
+; CHECK-LABEL: define half @reduction_half4(
+; CHECK-SAME: <4 x half> [[VEC4:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[VEC4]])
+; CHECK-NEXT: ret half [[TMP0]]
+entry:
+ %elt0 = extractelement <4 x half> %vec4, i64 0
+ %elt1 = extractelement <4 x half> %vec4, i64 1
+ %elt2 = extractelement <4 x half> %vec4, i64 2
+ %elt3 = extractelement <4 x half> %vec4, i64 3
+ %add1 = fadd fast half %elt1, %elt0
+ %add2 = fadd fast half %elt2, %add1
+ %add3 = fadd fast half %elt3, %add2
+
+ ret half %add3
+}
+
+define half @reduction_half8(<8 x half> %vec8){
+; CHECK-LABEL: define half @reduction_half8(
+; CHECK-SAME: <8 x half> [[VEC8:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ELT4:%.*]] = extractelement <8 x half> [[VEC8]], i64 4
+; CHECK-NEXT: [[ELT5:%.*]] = extractelement <8 x half> [[VEC8]], i64 5
+; CHECK-NEXT: [[ELT6:%.*]] = extractelement <8 x half> [[VEC8]], i64 6
+; CHECK-NEXT: [[ELT7:%.*]] = extractelement <8 x half> [[VEC8]], i64 7
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <8 x half> [[VEC8]], <8 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[ELT4]]
+; CHECK-NEXT: [[OP_RDX1:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT: [[OP_RDX2:%.*]] = fadd fast half [[OP_RDX]], [[OP_RDX1]]
+; CHECK-NEXT: [[TMP0:%.*]] = fadd fast half [[OP_RDX2]], [[ELT7]]
+; CHECK-NEXT: ret half [[TMP0]]
+entry:
+ %elt0 = extractelement <8 x half> %vec8, i64 0
+ %elt1 = extractelement <8 x half> %vec8, i64 1
+ %elt2 = extractelement <8 x half> %vec8, i64 2
+ %elt3 = extractelement <8 x half> %vec8, i64 3
+ %elt4 = extractelement <8 x half> %vec8, i64 4
+ %elt5 = extractelement <8 x half> %vec8, i64 5
+ %elt6 = extractelement <8 x half> %vec8, i64 6
+ %elt7 = extractelement <8 x half> %vec8, i64 7
+ %add1 = fadd fast half %elt1, %elt0
+ %add2 = fadd fast half %elt2, %add1
+ %add3 = fadd fast half %elt3, %add2
+ %add4 = fadd fast half %elt4, %add3
+ %add5 = fadd fast half %elt5, %add4
+ %add6 = fadd fast half %elt6, %add5
+ %add7 = fadd fast half %elt7, %add6
+
+ ret half %add7
+}
+
+define half @reduction_half16(<16 x half> %vec16) {
+; CHECK-LABEL: define half @reduction_half16(
+; CHECK-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
+; CHECK-NEXT: [[ENTRY:.*:]]
+; CHECK-NEXT: [[ELT4:%.*]] = extractelement <16 x half> [[VEC16]], i64 4
+; CHECK-NEXT: [[ELT5:%.*]] = extractelement <16 x half> [[VEC16]], i64 5
+; CHECK-NEXT: [[ELT6:%.*]] = extractelement <16 x half> [[VEC16]], i64 6
+; CHECK-NEXT: [[ELT7:%.*]] = extractelement <16 x half> [[VEC16]], i64 7
+; CHECK-NEXT: [[ELT12:%.*]] = extractelement <16 x half> [[VEC16]], i64 12
+; CHECK-NEXT: [[ELT13:%.*]] = extractelement <16 x half> [[VEC16]], i64 13
+; CHECK-NEXT: [[ELT14:%.*]] = extractelement <16 x half> [[VEC16]], i64 14
+; CHECK-NEXT: [[ELT15:%.*]] = extractelement <16 x half> [[VEC16]], i64 15
+; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT: [[TMP1:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP4]])
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <16 x half> [[VEC16]], <16 x half> poison, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT: [[TMP3:%.*]] = call fast half @llvm.vector.reduce.fadd.v4f16(half 0xH8000, <4 x half> [[TMP2]])
+; CHECK-NEXT: [[OP_RDX:%.*]] = fadd fast half [[TMP1]], [[TMP3]]
+; CHECK-NEXT: [[OP_RDX1:%.*]] = fadd fast half [[OP_RDX]], [[ELT4]]
+; CHECK-NEXT: [[OP_RDX2:%.*]] = fadd fast half [[ELT5]], [[ELT6]]
+; CHECK-NEXT: [[OP_RDX3:%.*]] = fadd fast half [[ELT7]], [[ELT12]]
+; CHECK-NEXT: [[OP_RDX4:%.*]] = fadd fast half [[ELT13]], [[ELT14]]
+; CHECK-NEXT: [[OP_RDX5:%.*]] = fadd fast half [[OP_RDX1]], [[OP_RDX2]]
+; CHECK-NEXT: [[OP_RDX6:%.*]] = fadd fast half [[OP_RDX3]], [[OP_RDX4]]
+; CHECK-NEXT: [[OP_RDX7:%.*]] = fadd fast half [[OP_RDX5]], [[OP_RDX6]]
+; CHECK-NEXT: [[TMP0:%.*]] = fadd fast half [[OP_RDX7]], [[ELT15]]
+; CHECK-NEXT: ret half [[TMP0]]
+entry:
+ %elt0 = extractelement <16 x half> %vec16, i64 0
+ %elt1 = extractelement <16 x half> %vec16, i64 1
+ %elt2 = extractelement <16 x half> %vec16, i64 2
+ %elt3 = extractelement <16 x half> %vec16, i64 3
+ %elt4 = extractelement <16 x half> %vec16, i64 4
+ %elt5 = extractelement <16 x half> %vec16, i64 5
+ %elt6 = extractelement <16 x half> %vec16, i64 6
+ %elt7 = extractelement <16 x half> %vec16, i64 7
+ %elt8 = extractelement <16 x half> %vec16, i64 8
+ %elt9 = extractelement <16 x half> %vec16, i64 9
+ %elt10 = extractelement <16 x half> %vec16, i64 10
+ %elt11 = extractelement <16 x half> %vec16...
[truncated]
Do you need these to be based on the CPU? I assume your followup will alter the costs in some way?
@@ -17,17 +19,6 @@ define void @strict_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 20 for instruction: %fadd_v4f128 = call fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'strict_fp_reductions' |
These checks were removed; they must be restored.
@@ -76,29 +78,6 @@ define void @fast_fp_reductions() {
; CHECK-NEXT: Cost Model: Found an estimated cost of 12 for instruction: %fadd_v4f128 = call reassoc fp128 @llvm.vector.reduce.fadd.v4f128(fp128 undef, <4 x fp128> undef)
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
;
; FP16-LABEL: 'fast_fp_reductions' |
same
I thought of adding a RUN line for Neoverse-V2 because…
But yeah, I haven't run the benchmarks yet with/without the Neoverse-V2 option.
Can you explain what it is you plan to change? All the tests are fast,
and it might be something that needs to be dependent on that.
The reduction cost for @llvm.vector.reduce.fadd with 'fast'…
They do look a little high; it sure sounds like a sensible change to make. Do you intend to change them in general or just for -mcpu=neoverse-v2? If they are just a series of faddp's, then we could hopefully produce better costs for every CPU. If so you could remove the -mcpu options from the tests and just add -mattr where needed. For the SLP tests it might be good to have at least some non-fast tests along with the ones you have.
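For reference, a non-fast test here would mean a strict, in-order fadd chain that carries no fast/reassoc flags, so SLP cannot freely reassociate it into a reduction intrinsic. A hypothetical sketch (the function name is illustrative, not one of the tests that was eventually added):
define float @reduce_strict_float4(<4 x float> %vec4) {
entry:
  %elt0 = extractelement <4 x float> %vec4, i64 0
  %elt1 = extractelement <4 x float> %vec4, i64 1
  %elt2 = extractelement <4 x float> %vec4, i64 2
  %elt3 = extractelement <4 x float> %vec4, i64 3
  ; no 'fast' flag: the original association order must be preserved
  %add1 = fadd float %elt1, %elt0
  %add2 = fadd float %elt2, %add1
  %add3 = fadd float %elt3, %add2
  ret float %add3
}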
Right now, the patch intends to address the issue just for Neoverse-V2. I couldn't test the patch for any other CPU (<= Neoverse-N2). Could you maybe help me in this regard?
Sure, I can try that. If you put up a patch for it I can give it a test in a few places.
Hi again - It looks like we should be able to treat a faddp the same as a fadd cost-wise on most modern CPUs (and by default). Some older CPUs prior to Cortex-A73 (but not little cores) had them a little higher; we might want to add a target feature if needed, but I think this would make a good default cost model. It might be easier in this case to add any extra CostModel/AArch64 tests in the same PR as the cost-model adjustments, as that will show what tests we really need. The SLP ones look like good additions if we remove the -mcpu option, but it might be good to have at least some tests for both fast and non-fast. FP16 costs are usually dependent on whether +fullfp16 is present (they should ideally promote to fp32 otherwise), so it might be worth having an extra run line for those if it will be relevant in the end.
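For illustration, a fast fadd reduction typically lowers to a short chain of pairwise adds on AArch64, which is why costing faddp like a plain fadd seems plausible. A rough sketch; the exact codegen varies by CPU and element type:
; IR:  %r = call fast float @llvm.vector.reduce.fadd.v4f32(float 0.0, <4 x float> %v)
; AArch64, roughly:
;   faddp v0.4s, v0.4s, v0.4s   ; pairwise add: 4 lanes -> 2 partial sums
;   faddp s0, v0.2s             ; pairwise add: 2 partial sums -> scalar result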
@davemgreen Thanks for the help. Yes, I will add more tests (and change the costing as well) for the next patch.
Force-pushed from 0836b93 to 6b7925f.
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mcpu=neoverse-v2 \
; RUN: -S | FileCheck %s --check-prefix=NEOV2
I think you can remove the V2 results from here now, unless you expect them to be different.
The result is different for the v16 half type.
Different based on the CPU, or based on -mattr=+fullfp16 or -mattr=+sve2?
Different for the following cases in isolation:
-mattr=+fullfp16
-mattr=+sve2
-mcpu=neoverse-v2
But surprisingly the same for the following case!
-mattr=neoversev2
Oh I see what you mean, the codegen is different already. That does sound odd; it is likely because the scalar cost is lower with fullfp16. And -mcpu=neoverse-v2 implies +sve2, and +sve2 implies +fullfp16, so +fullfp16 is enough to show the difference.
Can you change the run lines to:
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16
It should collapse a lot of the run lines together when they are the same. These all look like a useful set of tests and we can hopefully get them in.
@davemgreen The presence of a function attribute, as below with -mattr=+fullfp16,
; NEOV2-SAME: <2 x half> [[VEC2:%.*]]) #[[ATTR0:[0-9]+]] {
is proving to be a hindrance to merging the outputs with check-prefixes.
There are manual ways to get over this issue, such as:
- Checking only the required part with CHECK-SAME, as below:
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux | FileCheck %s
; RUN: opt < %s -S -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 | FileCheck %s
define half @reduce_fast_half2(<2 x half> %vec2) {
; CHECK-LABEL: define half @reduce_fast_half2(
; CHECK-SAME: <2 x half> [[VEC2:%.*]])
...
...
- Maybe using the regular expression {{.*}} in place of the function attribute
The cons of the manual approach here would be:
- Auto-updating the tests is not possible
- For a large number of tests, this is a lot of manual work
Any way to scrub the function attribute here so that using check-prefixes becomes possible?
Oh I hadn't seen that before (and I got the first check prefix wrong, sorry about that).
Maybe try:
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=-fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-NOFP16
; RUN: opt < %s -passes=slp-vectorizer -mtriple=aarch64-unknown-linux -mattr=+fullfp16 -S | FileCheck %s --check-prefixes=CHECK,CHECK-FP16
The -fullfp16 seems to work around the issue with the attributes.
Sorry - sometimes GitHub doesn't send out notifications for new patches. This LGTM now, thanks.
; NOFP16-SAME: <16 x half> [[VEC16:%.*]]) #[[ATTR0]] {
; NOFP16-NEXT: [[ENTRY:.*:]]
; NOFP16-NEXT: [[TMP0:%.*]] = call fast half @llvm.vector.reduce.fadd.v16f16(half 0xH8000, <16 x half> [[VEC16]])
; NOFP16-NEXT: ret half [[TMP0]] |
This looks a little off; there is usually an extra newline, I think. You might have to give it a quick regenerate.
Yes, right. I have gotten review comments a few times asking to remove the last blank line generated by the update_test_checks script, and removed them manually. But I have run the test through llvm-lit and it passes. I hope this is fine.
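For reference, regenerating the checks is typically done by pointing the script at a freshly built opt; the --opt-binary value and build path below are assumptions, not taken from this PR:
python llvm/utils/update_test_checks.py --opt-binary=build/bin/opt \
    llvm/test/Transforms/SLPVectorizer/AArch64/reduce-fadd.ll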
A successive patch would be added to fix some of the tests.
Updating the failing test in this patch.
…#106507) A successive patch would be added to fix some of the tests. Pull request: llvm#106507 (cherry picked from commit 7a6945f)
Updating the failing test in this patch. (cherry picked from commit d37d057)