[AArch64][SVE] Use FeatureUseFixedOverScalableIfEqualCost for A320 #152156
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Ties Stuij (stuij)

Changes

With this new A320 in-order core, we follow the earlier addition of FeatureUseFixedOverScalableIfEqualCost to the A510 and A520 (#132246), reaping the same code-generation benefit of preferring fixed-width over scalable vectors when their costs are equal. So when we have:
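```
void foo(float* a, float* b, float* dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```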
When compiling without the feature enabled, we get:
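```
...
ld1b  { z0.b }, p0/z, [x0, x10]
ld1b  { z2.b }, p0/z, [x1, x10]
add   x12, x0, x10
ldr   z1, [x12, #1, mul vl]
add   x12, x1, x10
ldr   z3, [x12, #1, mul vl]
fadd  z0.s, z2.s, z0.s
add   x12, x2, x10
fadd  z1.s, z3.s, z1.s
dech  x11
st1b  { z0.b }, p0, [x2, x10]
incb  x10, all, mul #2
str   z1, [x12, #1, mul vl]
...
```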
When compiling with the feature enabled, we get:
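```
...
ldp   q0, q1, [x12, #-16]
ldp   q2, q3, [x11, #-16]
subs  x13, x13, #8
fadd  v0.4s, v2.4s, v0.4s
fadd  v1.4s, v3.4s, v1.4s
add   x11, x11, #32
add   x12, x12, #32
stp   q0, q1, [x10, #-16]
add   x10, x10, #32
...
```

One way to reproduce this comparison locally, assuming a clang build that includes this patch, is `clang --target=aarch64-none-elf -mcpu=cortex-a320 -O3 -S foo.c` (the file name is illustrative).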
Full diff: https://github.com/llvm/llvm-project/pull/152156.diff (2 files affected)
diff --git a/llvm/lib/Target/AArch64/AArch64Processors.td b/llvm/lib/Target/AArch64/AArch64Processors.td
index adc984ad795af..5c5ebf391b19d 100644
--- a/llvm/lib/Target/AArch64/AArch64Processors.td
+++ b/llvm/lib/Target/AArch64/AArch64Processors.td
@@ -741,7 +741,8 @@ def ProcessorFeatures {
FeaturePAuth, FeaturePsUAO, FeatureRAS,
FeatureRDM, FeatureTRBE, FeatureVH,
FeatureFlagM, FeaturePredRes, FeatureSB, FeatureSSBS,
- FeatureSVE, FeatureSVE2];
+ FeatureSVE, FeatureSVE2,
+ FeatureUseFixedOverScalableIfEqualCost];
list<SubtargetFeature> A53 = [HasV8_0aOps, FeatureCRC, FeatureSHA2, FeatureAES,
FeatureFPARMv8, FeatureNEON, FeaturePerfMon];
list<SubtargetFeature> A55 = [HasV8_2aOps, FeatureSHA2, FeatureAES, FeatureFPARMv8,
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-fixed-width-inorder-core.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-fixed-width-inorder-core.ll
index 20bc0af648458..76a7536501bd6 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-fixed-width-inorder-core.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-fixed-width-inorder-core.ll
@@ -1,6 +1,7 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a510 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA510
; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a520 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA520
+; RUN: opt < %s -mtriple=aarch64-none-elf -mcpu=cortex-a320 -mattr=+sve -passes=loop-vectorize -S | FileCheck %s --check-prefix=CHECK-CA320
define void @sve_add(ptr %dst, ptr %a, ptr %b, i64 %n) {
; CHECK-CA510-LABEL: define void @sve_add(
@@ -131,6 +132,70 @@ define void @sve_add(ptr %dst, ptr %a, ptr %b, i64 %n) {
; CHECK-CA520: [[FOR_COND_CLEANUP]]:
; CHECK-CA520-NEXT: ret void
;
+; CHECK-CA320-LABEL: define void @sve_add(
+; CHECK-CA320-SAME: ptr [[DST:%.*]], ptr [[A:%.*]], ptr [[B:%.*]], i64 [[N:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-CA320-NEXT: [[ENTRY:.*:]]
+; CHECK-CA320-NEXT: [[B3:%.*]] = ptrtoint ptr [[B]] to i64
+; CHECK-CA320-NEXT: [[A2:%.*]] = ptrtoint ptr [[A]] to i64
+; CHECK-CA320-NEXT: [[DST1:%.*]] = ptrtoint ptr [[DST]] to i64
+; CHECK-CA320-NEXT: [[CMP9_NOT:%.*]] = icmp eq i64 [[N]], 0
+; CHECK-CA320-NEXT: br i1 [[CMP9_NOT]], label %[[FOR_COND_CLEANUP:.*]], label %[[FOR_BODY_PREHEADER:.*]]
+; CHECK-CA320: [[FOR_BODY_PREHEADER]]:
+; CHECK-CA320-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[N]], 8
+; CHECK-CA320-NEXT: br i1 [[MIN_ITERS_CHECK]], label %[[SCALAR_PH:.*]], label %[[VECTOR_MEMCHECK:.*]]
+; CHECK-CA320: [[VECTOR_MEMCHECK]]:
+; CHECK-CA320-NEXT: [[TMP0:%.*]] = sub i64 [[DST1]], [[A2]]
+; CHECK-CA320-NEXT: [[DIFF_CHECK:%.*]] = icmp ult i64 [[TMP0]], 32
+; CHECK-CA320-NEXT: [[TMP1:%.*]] = sub i64 [[DST1]], [[B3]]
+; CHECK-CA320-NEXT: [[DIFF_CHECK4:%.*]] = icmp ult i64 [[TMP1]], 32
+; CHECK-CA320-NEXT: [[CONFLICT_RDX:%.*]] = or i1 [[DIFF_CHECK]], [[DIFF_CHECK4]]
+; CHECK-CA320-NEXT: br i1 [[CONFLICT_RDX]], label %[[SCALAR_PH]], label %[[VECTOR_PH:.*]]
+; CHECK-CA320: [[VECTOR_PH]]:
+; CHECK-CA320-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N]], 8
+; CHECK-CA320-NEXT: [[N_VEC:%.*]] = sub i64 [[N]], [[N_MOD_VF]]
+; CHECK-CA320-NEXT: br label %[[VECTOR_BODY:.*]]
+; CHECK-CA320: [[VECTOR_BODY]]:
+; CHECK-CA320-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; CHECK-CA320-NEXT: [[TMP2:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDEX]]
+; CHECK-CA320-NEXT: [[TMP3:%.*]] = getelementptr inbounds nuw float, ptr [[TMP2]], i32 4
+; CHECK-CA320-NEXT: [[WIDE_LOAD:%.*]] = load <4 x float>, ptr [[TMP2]], align 4
+; CHECK-CA320-NEXT: [[WIDE_LOAD5:%.*]] = load <4 x float>, ptr [[TMP3]], align 4
+; CHECK-CA320-NEXT: [[TMP4:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDEX]]
+; CHECK-CA320-NEXT: [[TMP5:%.*]] = getelementptr inbounds nuw float, ptr [[TMP4]], i32 4
+; CHECK-CA320-NEXT: [[WIDE_LOAD6:%.*]] = load <4 x float>, ptr [[TMP4]], align 4
+; CHECK-CA320-NEXT: [[WIDE_LOAD7:%.*]] = load <4 x float>, ptr [[TMP5]], align 4
+; CHECK-CA320-NEXT: [[TMP6:%.*]] = fadd fast <4 x float> [[WIDE_LOAD6]], [[WIDE_LOAD]]
+; CHECK-CA320-NEXT: [[TMP7:%.*]] = fadd fast <4 x float> [[WIDE_LOAD7]], [[WIDE_LOAD5]]
+; CHECK-CA320-NEXT: [[TMP8:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDEX]]
+; CHECK-CA320-NEXT: [[TMP9:%.*]] = getelementptr inbounds nuw float, ptr [[TMP8]], i32 4
+; CHECK-CA320-NEXT: store <4 x float> [[TMP6]], ptr [[TMP8]], align 4
+; CHECK-CA320-NEXT: store <4 x float> [[TMP7]], ptr [[TMP9]], align 4
+; CHECK-CA320-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 8
+; CHECK-CA320-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; CHECK-CA320-NEXT: br i1 [[TMP10]], label %[[MIDDLE_BLOCK:.*]], label %[[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-CA320: [[MIDDLE_BLOCK]]:
+; CHECK-CA320-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N]], [[N_VEC]]
+; CHECK-CA320-NEXT: br i1 [[CMP_N]], label %[[FOR_COND_CLEANUP_LOOPEXIT:.*]], label %[[SCALAR_PH]]
+; CHECK-CA320: [[SCALAR_PH]]:
+; CHECK-CA320-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], %[[MIDDLE_BLOCK]] ], [ 0, %[[FOR_BODY_PREHEADER]] ], [ 0, %[[VECTOR_MEMCHECK]] ]
+; CHECK-CA320-NEXT: br label %[[FOR_BODY:.*]]
+; CHECK-CA320: [[FOR_BODY]]:
+; CHECK-CA320-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ], [ [[BC_RESUME_VAL]], %[[SCALAR_PH]] ]
+; CHECK-CA320-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds nuw float, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-CA320-NEXT: [[TMP11:%.*]] = load float, ptr [[ARRAYIDX]], align 4
+; CHECK-CA320-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds nuw float, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-CA320-NEXT: [[TMP12:%.*]] = load float, ptr [[ARRAYIDX2]], align 4
+; CHECK-CA320-NEXT: [[ADD:%.*]] = fadd fast float [[TMP12]], [[TMP11]]
+; CHECK-CA320-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds nuw float, ptr [[DST]], i64 [[INDVARS_IV]]
+; CHECK-CA320-NEXT: store float [[ADD]], ptr [[ARRAYIDX4]], align 4
+; CHECK-CA320-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-CA320-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], [[N]]
+; CHECK-CA320-NEXT: br i1 [[EXITCOND_NOT]], label %[[FOR_COND_CLEANUP_LOOPEXIT]], label %[[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
+; CHECK-CA320: [[FOR_COND_CLEANUP_LOOPEXIT]]:
+; CHECK-CA320-NEXT: br label %[[FOR_COND_CLEANUP]]
+; CHECK-CA320: [[FOR_COND_CLEANUP]]:
+; CHECK-CA320-NEXT: ret void
+;
entry:
%cmp9.not = icmp eq i64 %n, 0
br i1 %cmp9.not, label %for.cond.cleanup, label %for.body
@@ -160,3 +225,8 @@ for.cond.cleanup: ; preds = %for.cond.cleanup.lo
; CHECK-CA520: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
; CHECK-CA520: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
;.
+; CHECK-CA320: [[LOOP0]] = distinct !{[[LOOP0]], [[META1:![0-9]+]], [[META2:![0-9]+]]}
+; CHECK-CA320: [[META1]] = !{!"llvm.loop.isvectorized", i32 1}
+; CHECK-CA320: [[META2]] = !{!"llvm.loop.unroll.runtime.disable"}
+; CHECK-CA320: [[LOOP3]] = distinct !{[[LOOP3]], [[META1]]}
+;.
LGTM if the performance results look OK. Thanks
@@ -741,7 +741,8 @@ def ProcessorFeatures {
FeaturePAuth, FeaturePsUAO, FeatureRAS,
FeatureRDM, FeatureTRBE, FeatureVH,
FeatureFlagM, FeaturePredRes, FeatureSB, FeatureSSBS,
- FeatureSVE, FeatureSVE2];
+ FeatureSVE, FeatureSVE2,
+ FeatureUseFixedOverScalableIfEqualCost];
Oh - I should read the patch first. As this is a tuning feature it should be in TuneA320. It keeps the -mcpu architecture features and the -mtune tuning features separate.
Done. I took the liberty of also moving FeatureUseFixedOverScalableIfEqualCost to Tune.. for A510 and A520 in this patch. Happy to make another patch for them if that's more fitting.
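For illustration, a minimal sketch of what the tune-side placement looks like. The def below is modeled on the existing A5xx tune definitions in AArch64Processors.td; everything in it other than FeatureUseFixedOverScalableIfEqualCost is an assumption, not the A320's actual tune list:

```
// Sketch only: the surrounding tune features are assumptions based on the
// existing Cortex-A5xx tune definitions, not the final TuneA320 contents.
def TuneA320 : SubtargetFeature<"a320", "ARMProcFamily", "CortexA320",
                                "Cortex-A320 ARM processors", [
                                FeatureFuseAES,
                                FeatureFuseAdrpAdd,
                                FeaturePostRAScheduler,
                                // Tuning, not architecture: kept out of the
                                // -mcpu ProcessorFeatures list.
                                FeatureUseFixedOverScalableIfEqualCost]>;
```

Keeping the feature here means -mtune picks up the vectorizer preference without changing what the -mcpu architecture feature list reports as hardware capabilities.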
Force-pushed from 571dfa9 to 6d2e2ca, updating the commit message to add: "This patch also moves FeatureUseFixedOverScalableIfEqualCost for A510 and A520 from the CPU features to the tune features."
Thanks, LGTM