[VPlan] Model FOR extract of exit value in VPlan. #93395
Conversation
@llvm/pr-subscribers-llvm-transforms
Author: Florian Hahn (fhahn)
Changes
This patch introduces a new ExtractRecurrenceResult VPInstruction opcode to extract the value of a FOR for users outside the loop (i.e. in the scalar loop's exits). This moves the first part of fixing first-order recurrences to VPlan and removes some additional code to patch up live-outs, which is now handled automatically.
The majority of test changes are due to changes in the order in which the extracts are now generated. As we now use VPTransformState to generate the extracts, we may be able to re-use existing extracts in the loop body in some cases. For scalable vectors, in some cases we now have to compute the runtime VF twice, as each extract is now independent, but those computations should be trivial for later passes to clean up (and in line with other places in the code that also liberally re-compute runtime VFs).
Patch is 36.50 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/93395.diff
10 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 48981a6bd39e3..71ee690a8fabd 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3536,44 +3536,6 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(
Builder.CreateExtractElement(Incoming, LastIdx, "vector.recur.extract");
}
- auto RecurSplice = cast<VPInstruction>(*PhiR->user_begin());
- assert(PhiR->getNumUsers() == 1 &&
- RecurSplice->getOpcode() ==
- VPInstruction::FirstOrderRecurrenceSplice &&
- "recurrence phi must have a single user: FirstOrderRecurrenceSplice");
- SmallVector<VPLiveOut *> LiveOuts;
- for (VPUser *U : RecurSplice->users())
- if (auto *LiveOut = dyn_cast<VPLiveOut>(U))
- LiveOuts.push_back(LiveOut);
-
- if (!LiveOuts.empty()) {
- // Extract the second last element in the middle block if the
- // Phi is used outside the loop. We need to extract the phi itself
- // and not the last element (the phi update in the current iteration). This
- // will be the value when jumping to the exit block from the
- // LoopMiddleBlock, when the scalar loop is not run at all.
- Value *ExtractForPhiUsedOutsideLoop = nullptr;
- if (VF.isVector()) {
- auto *Idx = Builder.CreateSub(RuntimeVF, ConstantInt::get(IdxTy, 2));
- ExtractForPhiUsedOutsideLoop = Builder.CreateExtractElement(
- Incoming, Idx, "vector.recur.extract.for.phi");
- } else {
- assert(UF > 1 && "VF and UF cannot both be 1");
- // When loop is unrolled without vectorizing, initialize
- // ExtractForPhiUsedOutsideLoop with the value just prior to unrolled
- // value of `Incoming`. This is analogous to the vectorized case above:
- // extracting the second last element when VF > 1.
- ExtractForPhiUsedOutsideLoop = State.get(PreviousDef, UF - 2);
- }
-
- for (VPLiveOut *LiveOut : LiveOuts) {
- assert(!Cost->requiresScalarEpilogue(VF.isVector()));
- PHINode *LCSSAPhi = LiveOut->getPhi();
- LCSSAPhi->addIncoming(ExtractForPhiUsedOutsideLoop, LoopMiddleBlock);
- State.Plan->removeLiveOut(LCSSAPhi);
- }
- }
-
// Fix the initial value of the original recurrence in the scalar loop.
Builder.SetInsertPoint(LoopScalarPreHeader, LoopScalarPreHeader->begin());
PHINode *Phi = cast<PHINode>(PhiR->getUnderlyingValue());
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 3aee17921086d..91cbd459675d9 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -167,8 +167,8 @@ class VPLane {
static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }
- static VPLane getLastLaneForVF(const ElementCount &VF) {
- unsigned LaneOffset = VF.getKnownMinValue() - 1;
+ static VPLane getLastLaneForVF(const ElementCount &VF, unsigned Offset = 1) {
+ unsigned LaneOffset = VF.getKnownMinValue() - Offset;
Kind LaneKind;
if (VF.isScalable())
// In this case 'LaneOffset' refers to the offset from the start of the
@@ -1179,6 +1179,7 @@ class VPInstruction : public VPRecipeWithIRFlags {
BranchOnCount,
BranchOnCond,
ComputeReductionResult,
+ ExtractRecurrenceResult,
LogicalAnd, // Non-poison propagating logical And.
// Add an offset in bytes (second operand) to a base pointer (first
// operand). Only generates scalar values (either for the first lane only or
@@ -3612,7 +3613,8 @@ inline bool isUniformAfterVectorization(VPValue *VPV) {
if (auto *GEP = dyn_cast<VPWidenGEPRecipe>(Def))
return all_of(GEP->operands(), isUniformAfterVectorization);
if (auto *VPI = dyn_cast<VPInstruction>(Def))
- return VPI->getOpcode() == VPInstruction::ComputeReductionResult;
+ return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
+ VPI->getOpcode() == VPInstruction::ExtractRecurrenceResult;
return false;
}
} // end namespace vputils
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 5eb99ffd1e10e..82e3bd530a8d6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -137,6 +137,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPInstruction::Not:
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
+ case VPInstruction::ExtractRecurrenceResult:
case VPInstruction::LogicalAnd:
case VPInstruction::PtrAdd:
return false;
@@ -300,6 +301,7 @@ bool VPInstruction::canGenerateScalarForFirstLane() const {
case VPInstruction::CalculateTripCountMinusVF:
case VPInstruction::CanonicalIVIncrementForPart:
case VPInstruction::ComputeReductionResult:
+ case VPInstruction::ExtractRecurrenceResult:
case VPInstruction::PtrAdd:
case VPInstruction::ExplicitVectorLength:
return true;
@@ -558,6 +560,27 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
return ReducedPartRdx;
}
+ case VPInstruction::ExtractRecurrenceResult: {
+ if (Part != 0)
+ return State.get(this, 0, /*IsScalar*/ true);
+
+ // Extract the second last element in the middle block for users outside the
+ // loop.
+ Value *Res;
+ if (State.VF.isVector()) {
+ Res = State.get(
+ getOperand(0),
+ VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
+ } else {
+ assert(State.UF > 1 && "VF and UF cannot both be 1");
+ // When loop is unrolled without vectorizing, retrieve the value just
+ // prior to the final unrolled value. This is analogous to the vectorized
+ // case above: extracting the second last element when VF > 1.
+ Res = State.get(getOperand(0), State.UF - 2);
+ }
+ Res->setName(Name);
+ return Res;
+ }
case VPInstruction::LogicalAnd: {
Value *A = State.get(getOperand(0), Part);
Value *B = State.get(getOperand(1), Part);
@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
bool GeneratesPerFirstLaneOnly =
canGenerateScalarForFirstLane() &&
(vputils::onlyFirstLaneUsed(this) ||
+ getOpcode() == VPInstruction::ExtractRecurrenceResult ||
getOpcode() == VPInstruction::ComputeReductionResult);
bool GeneratesPerAllLanes = doesGeneratePerAllLanes();
for (unsigned Part = 0; Part < State.UF; ++Part) {
@@ -692,6 +716,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
case VPInstruction::BranchOnCount:
O << "branch-on-count";
break;
+ case VPInstruction::ExtractRecurrenceResult:
+ O << "extract-recurrence-result";
+ break;
case VPInstruction::ComputeReductionResult:
O << "compute-reduction-result";
break;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 422579ea8b84f..f4f43ed15615e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -812,6 +812,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
RecurrencePhis.push_back(FOR);
+ VPBuilder MiddleBuilder(
+ cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
for (VPFirstOrderRecurrencePHIRecipe *FOR : RecurrencePhis) {
SmallPtrSet<VPFirstOrderRecurrencePHIRecipe *, 4> SeenPhis;
VPRecipeBase *Previous = FOR->getBackedgeValue()->getDefiningRecipe();
@@ -843,6 +845,12 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
// Set the first operand of RecurSplice to FOR again, after replacing
// all users.
RecurSplice->setOperand(0, FOR);
+
+ auto *Result = cast<VPInstruction>(MiddleBuilder.createNaryOp(
+ VPInstruction::ExtractRecurrenceResult, {FOR->getBackedgeValue()}, {},
+ "vector.recur.extract.for.phi"));
+ RecurSplice->replaceUsesWithIf(
+ Result, [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });
}
return true;
}
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
index 4d9c850abdf3d..162fb9c802dd2 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll
@@ -126,9 +126,9 @@ define i64 @pointer_induction_only(ptr %start, ptr %end) {
; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
@@ -191,8 +191,8 @@ define i64 @int_and_pointer_iv(ptr %start, i32 %N) {
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP5]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP5]], i32 3
; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
@@ -332,9 +332,9 @@ define i64 @test_ptr_ivs_and_widened_ivs(ptr %src, i32 %N) {
; DEFAULT-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; DEFAULT-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
; DEFAULT: middle.block:
+; DEFAULT-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP15]], i32 2
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; DEFAULT-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i64> [[TMP15]], i32 3
-; DEFAULT-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i64> [[TMP15]], i32 2
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; DEFAULT: scalar.ph:
; DEFAULT-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll b/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
index f2f58f89eb95d..718148a67fcc8 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll
@@ -870,9 +870,9 @@ define i8 @add_phifail2(ptr noalias nocapture readonly %p, ptr noalias nocapture
; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP23:![0-9]+]]
; CHECK: middle.block:
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <16 x i32> [[TMP6]], i32 14
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <16 x i32> [[TMP6]], i32 15
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <16 x i32> [[TMP6]], i32 14
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:
; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
diff --git a/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll b/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
index c663d2b15b587..a1173c6b46a2c 100644
--- a/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
+++ b/llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll
@@ -18,10 +18,10 @@ define i16 @test_chained_first_order_recurrences_1(ptr %ptr) {
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP8]], label %middle.block, label %vector.body
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
;
entry:
br label %loop
@@ -61,10 +61,10 @@ define i16 @test_chained_first_order_recurrences_2(ptr %ptr) {
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP8]], label %middle.block, label %vector.body, !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI3:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT2:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
;
entry:
br label %loop
@@ -107,12 +107,12 @@ define i16 @test_chained_first_order_recurrences_3(ptr %ptr) {
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
;
entry:
br label %loop
@@ -219,12 +219,12 @@ define i16 @test_chained_first_order_recurrences_3_reordered_1(ptr %ptr) {
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
;
entry:
br label %loop
@@ -270,12 +270,12 @@ define i16 @test_chained_first_order_recurrences_3_reordered_2(ptr %ptr) {
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
;
entry:
br label %loop
@@ -321,12 +321,12 @@ define i16 @test_chained_first_order_recurrences_3_for2_no_other_uses(ptr %ptr)
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
;
entry:
br label %loop
@@ -371,12 +371,12 @@ define i16 @test_chained_first_order_recurrences_3_for1_for2_no_other_uses(ptr %
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1000
; CHECK-NEXT: br i1 [[TMP10]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
; CHECK: middle.block:
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI4:%.*]] = extractelement <4 x i16> [[TMP4]], i32 2
-; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT_FOR_PHI8:%.*]] = extractelement <4 x i16> [[TMP5]], i32 2
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <4 x i16> [[WIDE_LOAD]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT3:%.*]] = extractelement <4 x i16> [[TMP4]], i32 3
+; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT7:%.*]] = extractelement <4 x i16> [[TMP5]], i32 3
;
entry:
br label %loop
@@ -420,10 +420,10 @@ define double @test_chained_first_order_recurrence_sink_users_1(ptr %ptr) {
; CHECK-NEXT: [[TMP9...
[truncated]
This patch introduces a new ExtractRecurrenceResume VPInstruction opcode to extract the value of a FOR to be used as the resume value for the phi in the scalar loop. Note that it takes a VPFirstOrderRecurrencePHIRecipe as operand. This is needed at the moment to conveniently let fixFixedOrderRecurrence still handle creating and patching up the phis in the scalar pre-header and header. As ExtractRecurrenceResume won't have any users yet, it is marked as having side-effects until the phi in the scalar pre-header is also created and managed by VPlan. Depends on llvm#93395
Nice step populating middle block with recipes!
@@ -167,8 +167,8 @@ class VPLane {
  static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

  static VPLane getLastLaneForVF(const ElementCount &VF) {
    unsigned LaneOffset = VF.getKnownMinValue() - 1;
  static VPLane getLastLaneForVF(const ElementCount &VF, unsigned Offset = 1) {
Last is no longer accurate. Plus, worth asserting that Offset is not larger than VF.
How about adding getPenultimateLaneForVF(VF) instead, which together with getLastLaneForVF(VF) calls a private getLaneFromEnd(VF, OffsetFromEnd)?
Generalized the patch to use ExtractFromEnd, added public getLaneFromEnd for that use case.
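For context, a minimal sketch of what such a generalized helper could look like, extrapolated from the getLastLaneForVF body quoted above and the assert discussed further down; the scalable-VF handling via VPLane::Kind::ScalableLast is assumed to stay as in the existing code:

static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
  assert(Offset > 0 && Offset <= VF.getKnownMinValue() &&
         "trying to extract with out-of-bounds offset");
  unsigned LaneOffset = VF.getKnownMinValue() - Offset;
  Kind LaneKind;
  if (VF.isScalable())
    // For scalable VFs, LaneOffset is the offset from the start of the last
    // chunk of getKnownMinValue() elements; the concrete index is only known
    // at runtime.
    LaneKind = VPLane::Kind::ScalableLast;
  else
    LaneKind = VPLane::Kind::First;
  return VPLane(LaneOffset, LaneKind);
}

static VPLane getLastLaneForVF(const ElementCount &VF) {
  return getLaneFromEnd(VF, 1);
}

Offset == 1 then yields the last lane and Offset == 2 the penultimate one, which is what the exit-value extract needs.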
@@ -558,6 +560,27 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
    return ReducedPartRdx;
  }
  case VPInstruction::ExtractRecurrenceResult: {
Such a VPInstruction seems independent of Recurrence Result, and could hopefully be used elsewhere if generalized(?).
How about calling it ExtractPenultimateElement? Or ExtractElementFromEnd with another operand conveying which element to extract, from the end...
Admittedly hard to see reuse opportunities atm, but better focus on what it does than what it's used for?
Generalized to ExtractFromEnd and will update #93396 to use it to extract the resume value as well
    if (Part != 0)
      return State.get(this, 0, /*IsScalar*/ true);

    // Extract the second last element in the middle block for users outside the
Suggested change:
-    // Extract the second last element in the middle block for users outside the
+    // Extract the penultimate element from the operand.
Comment is now gone after generalizing to ExtractFromEnd
      assert(State.UF > 1 && "VF and UF cannot both be 1");
      // When loop is unrolled without vectorizing, retrieve the value just
      // prior to the final unrolled value. This is analogous to the vectorized
      // case above: extracting the second last element when VF > 1.
Suggested change:
-      // case above: extracting the second last element when VF > 1.
+      // case above: extracting the penultimate element when VF > 1.
This has now been generalized to ExtractFromEnd, so the comment is gone
Comment still appears? Anyhow, "else" case corresponds to retrieving part UF - Offset "vertically".
Updated, thanks!
      // case above: extracting the second last element when VF > 1.
      Res = State.get(getOperand(0), State.UF - 2);
    }
    Res->setName(Name);
This Name was found useless and destined for removal some patches ago: #81411 (review)
This is used to retain the original name for the IR values, which still seems useful to keep the IR readable and reduce the test changes (in particular it makes it easier to trace back to where the IR instruction is coming from)
Those are available via Underlying Values? This Name was originally designed to facilitate new names, and as such appears redundant. (Issue independent of this patch.)
It is for cases where there is no underlying value, which is the case for this opcode and (most?) custom VPInstruction opcodes.
    if (State.VF.isVector()) {
      Res = State.get(
          getOperand(0),
          VPIteration(State.UF - 1, VPLane::getLastLaneForVF(State.VF, 2)));
Would be nicer if State.VF - 2 could return the desired VPLane, analogous to State.UF - 1 returning the desired part.
Could be done via implementing operator- for VPLane as follow-up if desired.
Operator- would naturally return an ElementCount. Perhaps operator[] e.g., VF[0], VF[-1], VF[-2] to obtain first, last and penultimate VPLanes, respectively (and possibly VF[1] for 2nd if useful), although admittedly first is computed regardless of VF.
@@ -598,6 +621,7 @@ void VPInstruction::execute(VPTransformState &State) {
  bool GeneratesPerFirstLaneOnly =
      canGenerateScalarForFirstLane() &&
      (vputils::onlyFirstLaneUsed(this) ||
       getOpcode() == VPInstruction::ExtractRecurrenceResult ||
Perhaps worth introducing an isReduction trait to classify vector-to-scalar recipes including ComputeReductionResult and Extract*'s.
Can do, but reduction may not necessarily be a good fit for ExtractFromEnd. For now, they both generate a single scalar, either by reducing or extracting. How about hasSingleScalarResult?
Perhaps isVectorToScalar: trying to capture the inverse behavior of broadcast, rather than only the type of the result, which may depend on the types of the operands - the sum of uniform / single scalar per part values also produces a uniform / single scalar per part value.
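A minimal sketch of the shape such a predicate could take (the later comments indicate it was named isVectorToScalar; the doc-comment wording follows the suggestion further below, and the exact placement is an assumption):

/// Returns true if this VPInstruction produces a scalar value from a vector,
/// e.g. by performing a reduction or extracting a lane.
bool VPInstruction::isVectorToScalar() const {
  return getOpcode() == VPInstruction::ExtractFromEnd ||
         getOpcode() == VPInstruction::ComputeReductionResult;
}

Both isUniformAfterVectorization and the GeneratesPerFirstLaneOnly computation in VPInstruction::execute can then query this single predicate instead of listing the opcodes explicitly.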
@@ -1223,11 +1222,11 @@ define i32 @PR30183(i32 %pre_load, ptr %a, ptr %b, i64 %n) {
; SINK-AFTER-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP10]]
; SINK-AFTER-NEXT:    [[TMP15:%.*]] = load i32, ptr [[TMP11]], align 4
; SINK-AFTER-NEXT:    [[TMP16:%.*]] = load i32, ptr [[TMP12]], align 4
; SINK-AFTER-NEXT:    [[TMP17:%.*]] = load i32, ptr [[TMP13]], align 4
; SINK-AFTER-NEXT:    [[VECTOR_RECUR_EXTRACT_FOR_PHI:%.*]] = load i32, ptr [[TMP13]], align 4
This VECTOR_RECUR_EXTRACT_FOR_PHI is a load? Another occurrence above.
Yes, this is a consequence of using State.get() to generate extracts, which will re-use existing extracts. (also mentioned in the commit message)
To be hopefully clarified when ExtractAllLanes is introduced to represent in VPlan vector-to-all-scalar extractions explicitly rather than having State.get() deal with them on demand.
👍🏻
@@ -812,6 +812,8 @@ bool VPlanTransforms::adjustFixedOrderRecurrences(VPlan &Plan,
    if (auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&R))
      RecurrencePhis.push_back(FOR);

  VPBuilder MiddleBuilder(
Worth renaming Builder here to, say, LoopBuilder?
Updated, thanks!
Updated and generalized to ExtractFromEnd opcode, thanks!
@@ -167,8 +167,8 @@ class VPLane {
  static VPLane getFirstLane() { return VPLane(0, VPLane::Kind::First); }

  static VPLane getLastLaneForVF(const ElementCount &VF) {
    unsigned LaneOffset = VF.getKnownMinValue() - 1;
  static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
This works "horizontally" for Offsets between 1 and VF, inclusive, if VF >= 2, and "vertically" for Offsets between 1 and UF, inclusive, if VF = 1, implying UF >= 2. I.e., this works in general only for Offset = 1 and Offset = 2 across possible {VF, UF} combinations.
May be good to keep private and expose only public last and penultimate cases. And/or assert.
It might be worth supporting any lane w.r.t. the constraints that it is <= VF in the vector case or <= UF for the scalar case. Added an assert for now. This also avoids explicitly checking the lane and delegating to multiple helper functions.
    if (Part != 0)
      return State.get(this, 0, /*IsScalar*/ true);

    auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
Document somewhere that 2nd operand of ExtractFromEnd must be a constant, either 1 or 2.
This patch exercises Offset = 2 and follow-up #93396 exercises Offset = 1.
Added a comment to where the opcode is defined, thanks. I kept the code general for now, with the constraints that the lane to extract must be <= VF when vectorizing or <= UF otherwise.
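Putting those pieces together, the transform presumably ends up building the exit-value extract roughly as below; the names and the use of IntTy for the offset operand are taken from the hunks quoted later in the review, so treat this as an illustrative reconstruction rather than the committed code:

VPBuilder MiddleBuilder(
    cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor()));
auto *Result = cast<VPInstruction>(MiddleBuilder.createNaryOp(
    VPInstruction::ExtractFromEnd,
    {FOR->getBackedgeValue(),
     // Offset 2 from the end selects the penultimate element.
     Plan.getOrAddLiveIn(ConstantInt::get(IntTy, 2))},
    {}, "vector.recur.extract.for.phi"));
// Only users outside the loop (the live-outs) are redirected to the extract;
// in-loop users keep using the recurrence splice.
RecurSplice->replaceUsesWithIf(
    Result, [](VPUser &U, unsigned) { return isa<VPLiveOut>(&U); });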
    auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
    unsigned Offset = CI->getZExtValue();

    // Extract lane VF - Offset in the from the operand.
Suggested change:
-    // Extract lane VF - Offset in the from the operand.
+    // Extract lane VF - Offset from the operand.
Better place under the "if", as it applies to the horizontal "then" case.
Moved and updated, thanks!
Addressed latest comments, thanks! Added new isVectorToScalar as suggested.
Regarding "Operator- would naturally return an ElementCount": an issue with that may be that we would need to create a VPLane from the element count first, to apply its operator-, which may be almost as verbose as the current code, which is a bit more explicit.
Thanks for accommodating! Adding some final suggestions.
@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
    BranchOnCount,
    BranchOnCond,
    ComputeReductionResult,
    // Takes the VPValue to extract from as first operand and the lane to
    // extract from as second operand. The second operand must be a constant and
Suggested change:
-    // extract from as second operand. The second operand must be a constant and
+    // extract as second operand, counting from the end starting with 1 for last. The second operand must be a positive constant and
Adjusted, thanks!
@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
    BranchOnCount,
    BranchOnCond,
    ComputeReductionResult,
    // Takes the VPValue to extract from as first operand and the lane to
Suggested change:
-    // Takes the VPValue to extract from as first operand and the lane to
+    // Takes the VPValue to extract from as first operand and the lane or part to
Adjusted, thanks!
@@ -1182,6 +1188,11 @@ class VPInstruction : public VPRecipeWithIRFlags {
    BranchOnCount,
    BranchOnCond,
    ComputeReductionResult,
    // Takes the VPValue to extract from as first operand and the lane to
    // extract from as second operand. The second operand must be a constant and
    // <= VF when extracting from a vector or <= UF when extracting from a
Suggested change:
-    // <= VF when extracting from a vector or <= UF when extracting from a
+    // <= VF when extracting from a vector or <= UF when extracting from an unrolled
Adjusted, thanks!
@@ -1219,6 +1230,10 @@ class VPInstruction : public VPRecipeWithIRFlags {
  /// value for lane \p Lane.
  Value *generatePerLane(VPTransformState &State, const VPIteration &Lane);

  /// Returns true if this VPInstruction converts a vector value to a scalar,
Suggested change:
-  /// Returns true if this VPInstruction converts a vector value to a scalar,
+  /// Returns true if this VPInstruction produces a scalar value from a vector,
Adjusted, thanks!
@@ -137,6 +137,7 @@ bool VPRecipeBase::mayHaveSideEffects() const {
  case VPInstruction::Not:
  case VPInstruction::CalculateTripCountMinusVF:
  case VPInstruction::CanonicalIVIncrementForPart:
  case VPInstruction::ExtractFromEnd:
Is this needed in this patch, given the LiveOut user, or only for the follow-up patch extracting the last element?
Worth a TODO? Temporarily to keep alive until VPUsers are introduced.
It is needed for the patch, as it always adds the extract and RAUW outside the loop, relying on DCE to clean up unneeded extracts.
Oh, this marks ExtractFromEnd as not having side effects.
    auto *CI = cast<ConstantInt>(getOperand(1)->getLiveInIRValue());
    unsigned Offset = CI->getZExtValue();
Suggested addition:
+    assert(Offset > 0 && "Offset from end must be positive");
Updated, thanks!
    return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
           VPI->getOpcode() == VPInstruction::ExtractFromEnd;
Suggested change:
-    return VPI->getOpcode() == VPInstruction::ComputeReductionResult ||
-           VPI->getOpcode() == VPInstruction::ExtractFromEnd;
+    return VPI->isVectorToScalar();
?
Updated, thanks!
  static VPLane getLastLaneForVF(const ElementCount &VF) {
    unsigned LaneOffset = VF.getKnownMinValue() - 1;
  static VPLane getLaneFromEnd(const ElementCount &VF, unsigned Offset) {
    assert(Offset <= VF.getKnownMinValue() &&
Suggested change:
-    assert(Offset <= VF.getKnownMinValue() &&
+    assert(Offset > 0 && Offset <= VF.getKnownMinValue() &&
Added, thanks!
        VPInstruction::ExtractFromEnd,
        {FOR->getBackedgeValue(),
         Plan.getOrAddLiveIn(ConstantInt::get(IntTy, 2))},
        {}, "vector.recur.extract.for.phi"));
ok, this is the unique Name assigned to this VPInstruction instance.
Yep
This patch uses the ExtractFromEnd VPInstruction opcode to extract the value of a FOR to be used as the resume value for the phi in the scalar loop. It adds a new live-out that temporarily wraps the FOR phi in the scalar loop. fixFixedOrderRecurrence will process live-outs for fixed-order recurrence phis by creating a new phi node in the scalar preheader, using the generated value for the live-out as incoming value from the middle block and the original start value as incoming value for the other edge. Creation of the phi in the preheader, as well as updating the phi in the scalar loop, will also be moved to VPlan in the future, eventually retiring fixFixedOrderRecurrence. Depends on #93395. PR: #93396
The assertion added as part of #93395 surfaced cases where first-order recurrences are vectorized with <vscale x 1 x ..>. If vscale is 1, then we are unable to extract the penultimate value (second to last lane). Previously this case got mis-compiled, trying to extract from an invalid lane (-1) https://llvm.godbolt.org/z/3adzYYcf9. Fixes #97452.
This patch introduces a new ExtractFromEnd VPInstruction opcode to extract the value of a FOR for users outside the loop (i.e. in the scalar loop's exits). This moves the first part of fixing first order recurrences to VPlan, and removes some additional code to patch up live-outs, which is now handled automatically.
The majority of test changes are due to changes in the order in which the extracts are now generated. As we now use VPTransformState to generate the extracts, we may be able to re-use existing extracts in the loop body in some cases. For scalable vectors, in some cases we now have to compute the runtime VF twice, as each extract is now independent, but those computations should be trivial for later passes to clean up (and in line with other places in the code that also liberally re-compute runtime VFs).
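To illustrate the scalable-VF point, each such extract lowers to roughly the IRBuilder sequence below, mirroring the code this patch removes from fixFixedOrderRecurrence; the RuntimeVF value is assumed to come from the existing getRuntimeVF helper (vscale times the known minimum VF). Two independent extracts therefore recompute the runtime VF twice, and later passes are expected to clean up the duplicates.

// Penultimate-lane extract for a scalable VF, emitted in the middle block.
Value *RuntimeVF = getRuntimeVF(Builder, IdxTy, State.VF); // vscale * MinVF
Value *Idx = Builder.CreateSub(RuntimeVF, ConstantInt::get(IdxTy, 2));
Value *ExtractForPhi = Builder.CreateExtractElement(
    Incoming, Idx, "vector.recur.extract.for.phi");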