[VPlan] Delay adding canonical IV increment. #82270

fhahn · 2024-02-19T18:49:11Z

This patch delays adding canonical IV increments, computing the next active lane mask and the branch recipes to exit the vector loop.

During initial construction of a VPlan, only add the canonical IV phi and active-lane-mask phis if needed. Similarly, do not add the branches to exit the loop initially. Computing the next IV value, the next active-lane-mask or the exit branches are details that are only needed to simplify codegen (execute of the individual recipes).

This makes the inital VPlans more abstract (and simpler), in that we initially leave out some details. Initial VPlans still have the canonical induction recipe, which provides the value of the canonical induction at every iteration, but we leave out the detail of how it is computed, which is not needed until code generation.

Similar reasoning applies to active lane mask increments and branch recipes. The vector loop region initially abstractly models the loop processing all vector iterations, without specifying how exactly exiting the region is handled. Again these details are only needed for code generation.

To introduce the missing pieces and complete the abstract VPlan, a new prepareToExecute transform is added and run just before exiting the VPlan.

This is an attempt to further employ gradual lowering, as outlined in https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf and already applied for replicate region handling.

llvmbot · 2024-02-19T18:49:42Z

@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-risc-v

Author: Florian Hahn (fhahn)

Changes

This patch delays adding canonical IV increments, computing the next active lane mask and the branch recipes to exit the vector loop.

During initial construction of a VPlan, only add the canonical IV phi and active-lane-mask phis if needed. Similarly, do not add the branches to exit the loop initially. Computing the next IV value, the next active-lane-mask or the exit branches are details that are only needed to simplify codegen (execute of the individual recipes).

This makes the inital VPlans more abstract (and simpler), in that we initially leave out some details. Initial VPlans still have the canonical induction recipe, which provides the value of the canonical induction at every iteration, but we leave out the detail of how it is computed, which is not needed until code generation.

Similar reasoning applies to active lane mask increments and branch recipes. The vector loop region initially abstractly models the loop processing all vector iterations, without specifying how exactly exiting the region is handled. Again these details are only needed for code generation.

To introduce the missing pieces and complete the abstract VPlan, a new prepareToExecute transform is added and run just before exiting the VPlan.

This is an attempt to further employ gradual lowering, as outlined in https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf and already applied for replicate region handling.

Patch is 88.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/82270.diff

27 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+9-12)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2-1)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+69-40)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+3)
(modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+21-19)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (+2-2)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll (+4-4)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-masked-accesses.ll (+3-3)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-overflow-checks.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-reductions.ll (+3-3)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll (+26-26)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll (+10-10)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-gep.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/synthesize-mask-for-call.ll (-24)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/uniform-args-call-variants.ll (+12-12)
(modified) llvm/test/Transforms/LoopVectorize/AArch64/widen-call-with-intrinsic-or-libfunc.ll (-9)
(modified) llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll (-8)
(modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll (+6-6)
(modified) llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-dot-printing.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-iv-transforms.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+15-15)
(modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge-vf1.ll (+1-1)
(modified) llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll (+14-14)
(modified) llvm/test/Transforms/LoopVectorize/vplan-unused-interleave-group.ll (+1-1)

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index eca901fcdae4ce..8b5590c3400f48 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7452,6 +7452,14 @@ LoopVectorizationPlanner::executePlan(
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF << ", UF=" << BestUF
                     << '\n');
 
+  TailFoldingStyle Style =
+      CM.getTailFoldingStyle(!isIndvarOverflowCheckKnownFalse(&CM, BestVF));
+  // When not folding the tail, we know that the induction increment will not
+  // overflow.
+  bool HasNUW = Style == TailFoldingStyle::None;
+  bool WithoutRuntimeCheck =
+      Style == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
+  VPlanTransforms::prepareToExecute(BestVPlan, HasNUW, WithoutRuntimeCheck);
   if (!IsEpilogueVectorization)
     VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
 
@@ -8509,17 +8517,6 @@ static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
   VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();
   VPBasicBlock *Header = TopRegion->getEntryBasicBlock();
   Header->insert(CanonicalIVPHI, Header->begin());
-
-  VPBuilder Builder(TopRegion->getExitingBasicBlock());
-  // Add a VPInstruction to increment the scalar canonical IV by VF * UF.
-  auto *CanonicalIVIncrement = Builder.createOverflowingOp(
-      Instruction::Add, {CanonicalIVPHI, &Plan.getVFxUF()}, {HasNUW, false}, DL,
-      "index.next");
-  CanonicalIVPHI->addOperand(CanonicalIVIncrement);
-
-  // Add the BranchOnCount VPInstruction to the latch.
-  Builder.createNaryOp(VPInstruction::BranchOnCount,
-                       {CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
 }
 
 // Add exit values to \p Plan. VPLiveOuts are added for each LCSSA phi in the
@@ -9005,7 +9002,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
       PreviousLink = RedRecipe;
     }
   }
-  Builder.setInsertPoint(&*LatchVPBB->begin());
+  Builder.setInsertPoint(LatchVPBB, LatchVPBB->begin());
   for (VPRecipeBase &R :
        Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
     VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 42e3b4a003511b..cfc4161adc0c82 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2363,7 +2363,8 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
 
   VPRecipeBase *clone() override {
     auto *R = new VPCanonicalIVPHIRecipe(getOperand(0), getDebugLoc());
-    R->addOperand(getBackedgeValue());
+    if (getNumOperands() == 2)
+      R->addOperand(getBackedgeValue());
     return R;
   }
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index 3d443421024202..965ea0386b9475 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1109,16 +1109,10 @@ void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {
 static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
     VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {
   VPRegionBlock *TopRegion = Plan.getVectorLoopRegion();
-  VPBasicBlock *EB = TopRegion->getExitingBasicBlock();
   auto *CanonicalIVPHI = Plan.getCanonicalIV();
   VPValue *StartV = CanonicalIVPHI->getStartValue();
 
-  auto *CanonicalIVIncrement =
-      cast<VPInstruction>(CanonicalIVPHI->getBackedgeValue());
-  // TODO: Check if dropping the flags is needed if
-  // !DataAndControlFlowWithoutRuntimeCheck.
-  CanonicalIVIncrement->dropPoisonGeneratingFlags();
-  DebugLoc DL = CanonicalIVIncrement->getDebugLoc();
+  DebugLoc DL = CanonicalIVPHI->getDebugLoc();
   // We can't use StartV directly in the ActiveLaneMask VPInstruction, since
   // we have to take unrolling into account. Each part needs to start at
   //   Part * VF
@@ -1128,21 +1122,6 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
   // Create the ActiveLaneMask instruction using the correct start values.
   VPValue *TC = Plan.getTripCount();
 
-  VPValue *TripCount, *IncrementValue;
-  if (!DataAndControlFlowWithoutRuntimeCheck) {
-    // When the loop is guarded by a runtime overflow check for the loop
-    // induction variable increment by VF, we can increment the value before
-    // the get.active.lane mask and use the unmodified tripcount.
-    IncrementValue = CanonicalIVIncrement;
-    TripCount = TC;
-  } else {
-    // When avoiding a runtime check, the active.lane.mask inside the loop
-    // uses a modified trip count and the induction variable increment is
-    // done after the active.lane.mask intrinsic is called.
-    IncrementValue = CanonicalIVPHI;
-    TripCount = Builder.createNaryOp(VPInstruction::CalculateTripCountMinusVF,
-                                     {TC}, DL);
-  }
   auto *EntryIncrement = Builder.createOverflowingOp(
       VPInstruction::CanonicalIVIncrementForPart, {StartV}, {false, false}, DL,
       "index.part.next");
@@ -1156,24 +1135,6 @@ static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
   // preheader ActiveLaneMask instruction.
   auto LaneMaskPhi = new VPActiveLaneMaskPHIRecipe(EntryALM, DebugLoc());
   LaneMaskPhi->insertAfter(CanonicalIVPHI);
-
-  // Create the active lane mask for the next iteration of the loop before the
-  // original terminator.
-  VPRecipeBase *OriginalTerminator = EB->getTerminator();
-  Builder.setInsertPoint(OriginalTerminator);
-  auto *InLoopIncrement =
-      Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
-                                  {IncrementValue}, {false, false}, DL);
-  auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
-                                   {InLoopIncrement, TripCount}, DL,
-                                   "active.lane.mask.next");
-  LaneMaskPhi->addOperand(ALM);
-
-  // Replace the original terminator with BranchOnCond. We have to invert the
-  // mask here because a true condition means jumping to the exit block.
-  auto *NotMask = Builder.createNot(ALM, DL);
-  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
-  OriginalTerminator->eraseFromParent();
   return LaneMaskPhi;
 }
 
@@ -1304,3 +1265,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
     }
   }
 }
+
+void VPlanTransforms::prepareToExecute(
+    VPlan &Plan, bool HasNUW, bool DataAndControlFlowWithoutRuntimeCheck) {
+  auto *CanIV = Plan.getCanonicalIV();
+
+  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
+  VPBuilder Builder(EB);
+  DebugLoc DL = CanIV->getDebugLoc();
+  // Add a VPInstruction to increment the scalar canonical IV by VF * UF.
+  auto *CanonicalIVIncrement =
+      Builder.createOverflowingOp(Instruction::Add, {CanIV, &Plan.getVFxUF()},
+                                  {HasNUW, false}, DL, "index.next");
+
+  CanIV->addOperand(CanonicalIVIncrement);
+
+  auto FoundLaneMaskPhi = find_if(
+      Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis(),
+      [](VPRecipeBase &P) { return isa<VPActiveLaneMaskPHIRecipe>(P); });
+
+  if (FoundLaneMaskPhi ==
+      Plan.getVectorLoopRegion()->getEntryBasicBlock()->phis().end()) {
+    // Add the BranchOnCount VPInstruction to the latch.
+    Builder.createNaryOp(VPInstruction::BranchOnCount,
+                         {CanonicalIVIncrement, &Plan.getVectorTripCount()},
+                         DL);
+    return;
+  }
+  auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);
+  auto *VecPreheader =
+      cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSinglePredecessor());
+  Builder.setInsertPoint(VecPreheader);
+
+  VPValue *TC = Plan.getTripCount();
+
+  // TODO: Check if dropping the flags is needed if
+  // !DataAndControlFlowWithoutRuntimeCheck.
+  CanonicalIVIncrement->dropPoisonGeneratingFlags();
+  VPValue *TripCount, *IncrementValue;
+  if (!DataAndControlFlowWithoutRuntimeCheck) {
+    // When the loop is guarded by a runtime overflow check for the loop
+    // induction variable increment by VF, we can increment the value before
+    // the get.active.lane mask and use the unmodified tripcount.
+    IncrementValue = CanonicalIVIncrement;
+    TripCount = TC;
+  } else {
+    // When avoiding a runtime check, the active.lane.mask inside the loop
+    // uses a modified trip count and the induction variable increment is
+    // done after the active.lane.mask intrinsic is called.
+    IncrementValue = CanIV;
+    TripCount = Builder.createNaryOp(VPInstruction::CalculateTripCountMinusVF,
+                                     {TC}, DL);
+  }
+  // Create the active lane mask for the next iteration of the loop before the
+  // original terminator.
+  Builder.setInsertPoint(EB);
+  auto *InLoopIncrement =
+      Builder.createOverflowingOp(VPInstruction::CanonicalIVIncrementForPart,
+                                  {IncrementValue}, {false, false}, DL);
+  auto *ALM = Builder.createNaryOp(VPInstruction::ActiveLaneMask,
+                                   {InLoopIncrement, TripCount}, DL,
+                                   "active.lane.mask.next");
+  LaneMaskPhi->addOperand(ALM);
+
+  // Replace the original terminator with BranchOnCond. We have to invert the
+  // mask here because a true condition means jumping to the exit block.
+  auto *NotMask = Builder.createNot(ALM, DL);
+  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index ff83c3f083b093..1ddbe769d4a132 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -98,6 +98,9 @@ struct VPlanTransforms {
   ///       VPlan directly.
   static void dropPoisonGeneratingRecipes(
       VPlan &Plan, function_ref<bool(BasicBlock *)> BlockNeedsPredication);
+
+  static void prepareToExecute(VPlan &Plan, bool HasNUW,
+                               bool DataAndControlFlowWithoutRuntimeCheck);
 };
 
 } // namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
index d6b81543dbc9cc..22128e9dbb35a3 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
@@ -51,12 +51,13 @@ static void verifyBlocksInRegion(const VPRegionBlock *Region) {
 
     auto *VPBB = dyn_cast<VPBasicBlock>(VPB);
     // Check block's condition bit.
-    if (VPB->getNumSuccessors() > 1 || (VPBB && VPBB->isExiting()))
-      assert(VPBB && VPBB->getTerminator() &&
-             "Block has multiple successors but doesn't "
-             "have a proper branch recipe!");
-    else
-      assert((!VPBB || !VPBB->getTerminator()) && "Unexpected branch recipe!");
+    /*    if (VPB->getNumSuccessors() > 1 || (VPBB && VPBB->isExiting()))*/
+    /*assert(VPBB && VPBB->getTerminator() &&*/
+    /*"Block has multiple successors but doesn't "*/
+    /*"have a proper branch recipe!");*/
+    /*else*/
+    /*assert((!VPBB || !VPBB->getTerminator()) && "Unexpected branch
+     * recipe!");*/
 
     // Check block's successors.
     const auto &Successors = VPB->getSuccessors();
@@ -260,19 +261,20 @@ bool VPlanVerifier::verifyPlanIsValid(const VPlan &Plan) {
     return false;
   }
 
-  if (Exiting->empty()) {
-    errs() << "VPlan vector loop exiting block must end with BranchOnCount or "
-              "BranchOnCond VPInstruction but is empty\n";
-    return false;
-  }
-
-  auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));
-  if (!LastInst || (LastInst->getOpcode() != VPInstruction::BranchOnCount &&
-                    LastInst->getOpcode() != VPInstruction::BranchOnCond)) {
-    errs() << "VPlan vector loop exit must end with BranchOnCount or "
-              "BranchOnCond VPInstruction\n";
-    return false;
-  }
+  /*  if (Exiting->empty()) {*/
+  /*errs() << "VPlan vector loop exiting block must end with BranchOnCount or
+   * "*/
+  /*"BranchOnCond VPInstruction but is empty\n";*/
+  /*return false;*/
+  /*}*/
+
+  /*auto *LastInst = dyn_cast<VPInstruction>(std::prev(Exiting->end()));*/
+  /*if (!LastInst || (LastInst->getOpcode() != VPInstruction::BranchOnCount &&*/
+  /*LastInst->getOpcode() != VPInstruction::BranchOnCond)) {*/
+  /*errs() << "VPlan vector loop exit must end with BranchOnCount or "*/
+  /*"BranchOnCond VPInstruction\n";*/
+  /*return false;*/
+  /*}*/
 
   for (const VPRegionBlock *Region :
        VPBlockUtils::blocksOnly<const VPRegionBlock>(
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
index 44ace377ac7927..11ddb096e357e0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
@@ -18,12 +18,12 @@ define void @clamped_tc_8(ptr nocapture %dst, i32 %n, i64 %val){
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
 ; CHECK-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP8:%.*]] = mul i64 [[TMP7]], 8
 ; CHECK-NEXT:    [[TMP9:%.*]] = sub i64 8, [[TMP8]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i64 8, [[TMP8]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
 ; CHECK-NEXT:    [[TMP12:%.*]] = call <vscale x 8 x i64> @llvm.experimental.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i64> [[TMP12]], zeroinitializer
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul <vscale x 8 x i64> [[TMP13]], shufflevector (<vscale x 8 x i64> insertelement (<vscale x 8 x i64> poison, i64 1, i64 0), <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer)
@@ -115,12 +115,12 @@ define void @clamped_tc_max_8(ptr nocapture %dst, i32 %n, i64 %val){
 ; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
 ; CHECK-NEXT:    [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP8:%.*]] = mul i64 [[TMP7]], 8
 ; CHECK-NEXT:    [[TMP9:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP10:%.*]] = icmp ugt i64 [[WIDE_TRIP_COUNT]], [[TMP8]]
 ; CHECK-NEXT:    [[TMP11:%.*]] = select i1 [[TMP10]], i64 [[TMP9]], i64 0
-; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
 ; CHECK-NEXT:    [[TMP12:%.*]] = call <vscale x 8 x i64> @llvm.experimental.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP13:%.*]] = add <vscale x 8 x i64> [[TMP12]], zeroinitializer
 ; CHECK-NEXT:    [[TMP14:%.*]] = mul <vscale x 8 x i64> [[TMP13]], shufflevector (<vscale x 8 x i64> insertelement (<vscale x 8 x i64> poison, i64 1, i64 0), <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer)
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll b/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
index fc67fb5aded6a8..11bae5068a180f 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
@@ -138,12 +138,12 @@ define float @fadd_strict(ptr noalias nocapture readonly %a, i64 %n) #0 {
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP16:%.*]] = mul i64 [[TMP15]], 8
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = sub i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = icmp ugt i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -733,11 +733,11 @@ define void @fadd_strict_interleave(ptr noalias nocapture readonly %a, ptr noali
 ; CHECK-ORDERED-TF-NEXT:    [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP22]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = mul i64 [[TMP8]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP10:%.*]] = sub i64 [[TMP2]], [[TMP9]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP11:%.*]] = icmp ugt i64 [[TMP2]], [[TMP9]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP12:%.*]] = select i1 [[TMP11]], i64 [[TMP10]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[TMP2]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -993,12 +993,12 @@ define float @fadd_of_sum(ptr noalias nocapture readonly %a, ptr noalias nocaptu
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP19:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP20:%.*]] = mul i64 [[TMP19]], 4
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = mul i64 [[TMP6]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = sub i64 [[N]], [[TMP7]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = icmp ugt i64 [[N]], [[TMP7]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP10:%.*]] = select i1 [[TMP9]], i64 [[TMP8]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
@@ -1234,12 +1234,12 @@ define float @fadd_conditional(ptr noalias nocapture readonly %a, ptr noalias no
 ; CHECK-ORDERED-TF-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP23:%.*]] = mul i64 [[TMP22]], 4
+; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 4
 ; CHECK-ORDERED-TF-NEXT:    [[TMP7:%.*]] = sub i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = icmp ugt i64 [[N]], [[TMP6]]
 ; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
-; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ...
[truncated]

nikolaypanchenko

I'm concerned that such missed update till later phase make VPlan ill-formed for xforms and vplan-based cost model ? That is, that canonical IV can be seen as a dead recipe by xforms, i.e. requires special handling everywhere and possibly can break vendor's downstream.
For the vplan-based cost model such implicit representation will require extra code to properly model the cost (even though it's simple 1 reg + 1 add).
What is a plan to handle these areas later ?

fhahn · 2024-02-25T13:36:00Z

Thank you very much for taking a look!

I'm concerned that such missed update till later phase make VPlan ill-formed for xforms and vplan-based cost model ? That is, that canonical IV can be seen as a dead recipe by xforms, i.e. requires special handling everywhere and possibly can break vendor's downstream.

I don't think the form without the increment or branch-on-cond should be seen as ill-formed, just more abstract. The main idea is to gradually lower the from a more abstract VPlan representation to a more concrete one that can be code-gen'd easily (like also done for replicate regions, which only get introduced later). One advantage of VPlan should be that it allows us to model more complex constructs directly , especially early on.

At the early stages, various IV increments and how exactly the loop is exited do not really add any additional info (e.g. the canonical IV may or may not control exiting the loop). IIRC when adding recipes to model the canonical IV and its increment, the reason why the increment was modeled was to simplify code-gen, similar to how #82021 will simplify code-gen. At that time, VPlan as only added for code-gen, so it made sense at that time.

Since then the infrastructure has evolved quite a bit, and relatively recently we started to use gradual lowering to introduce complexity as needed at later stages.

IIUC the main benefit from #82021 is that it simplifies codegen, so I think it would make sense to consistently introduce those in preparation for codegen; modeling the increments for all indications early on would unnecessarily make the plans more verbose, while only being needed for codegen. Just before codegen, if the increments are modeled explicitly, then we could also lower to a generic scalar or vector phi nodes, instead of having multiple recipes to generate phis.

For the vplan-based cost model such implicit representation will require extra code to properly model the cost (even though it's simple 1 reg + 1 add). What is a plan to handle these areas later ?

It depends, if we see CanonincalIVPHIRecipe and VPWidenIntOrFpRecipe as complex recipes that take care of their increment as well then computing the cost for both is natural. If the more verbose form is only introduced just before executing, there should be no need to account for costing the increment separately.

ayalz

Interesting addition, indeed following outlined roadmap. Adding a couple of thoughts.

ayalz · 2024-02-28T10:19:29Z

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp

       !VPBB->getParent()->isReplicator())) {
-    if (!VPBB || !VPBB->getTerminator()) {
+    if (!IsAbstract && (!VPBB || !VPBB->getTerminator())) {


Perhaps rather than a global parameter, Is[RegionBackBranch]Abstract is more a property of the parental region - whether it ends with an explicit terminating branch or not - the latter applying to both replicating regions and non-replicating regions in "abstract loop-control" state.

It applies per-region, but the global flag indicates whether it is required at this verification stage. Not sure if it would be worth tracking separately somewhere?

ayalz · 2024-02-28T10:22:10Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -1307,3 +1268,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
    }
  }
 }
+
+void VPlanTransforms::prepareToExecute(


Would be good to be more precise than "prepareToExecute()".

Changed to finalizePlan (with a comment), but there still might be a better name

ayalz · 2024-02-28T10:22:52Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-// loop.
-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV


Suggested change

// Add the required canonical IV

// Add the required canonical IV.

Fixed, thanks!

ayalz · 2024-02-28T10:25:37Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV
+static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {


Perhaps instead of creating the final original canonical-IV recipe, leaving it temporarily with only a single operand, it would be better to employ another (sub)class, having a single operand, which is later replaced by the more explicit CanonicalIVPHI?

I think eventually CanonicalIVPHI could become itself an 'abstract' recipe (terminology needs to determined), that doesn't codegen itself. Instead of adding the increment as operand in prepareToExecute we could replace it with a generic scalar PHI recipe, which could take care of codegen for any scalar phis (including EVL phi among others).

Could introduce a temporary recipe, but once we are converging on the overall approach, I can also put up a follow-up PR to use the generic scalar PHI recipe.

ok, worth leaving behind a TODO?

Done, thanks!

nikolaypanchenko · 2024-02-28T18:45:27Z

I don't think the form without the increment or branch-on-cond should be seen as ill-formed, just more abstract. The main idea is to gradually lower the from a more abstract VPlan representation to a more concrete one that can be code-gen'd easily (like also done for replicate regions, which only get introduced later). One advantage of VPlan should be that it allows us to model more complex constructs directly , especially early on.

Can you please share a VPlan design doc which has been discussed with a community ? Perhaps I'm missing something and as right now my view of a VPlan is it has self-contained i.e. it should be possible to get all required information to make xform without accessing underlying IR (except for live-ins and live-outs) and at any point be able to generate code out of it without any extra manipulations on a VPlan.

At the early stages, various IV increments and how exactly the loop is exited do not really add any additional info (e.g. the canonical IV may or may not control exiting the loop). IIRC when adding recipes to model the canonical IV and its increment, the reason why the increment was modeled was to simplify code-gen, similar to how #82021 will simplify code-gen. At that time, VPlan as only added for code-gen, so it made sense at that time.

Since then the infrastructure has evolved quite a bit, and relatively recently we started to use gradual lowering to introduce complexity as needed at later stages.

Ignoring cost model part, for VPlan that represents just one loop it's certainly simplifies xforms. But if we consider future work with peel/remainder (prolog/epilog/tail/cleanup) loops represented that is going to complicate xforms (unrolling, interleaving), codegen and cost model, am I wrong ? That technically can be mitigated by https://reviews.llvm.org/D148581, but if we consider SEME loops, from my perspective, that becomes complicated too.

IIUC the main benefit from #82021 is that it simplifies codegen, so I think it would make sense to consistently introduce those in preparation for codegen; modeling the increments for all indications early on would unnecessarily make the plans more verbose, while only being needed for codegen. Just before codegen, if the increments are modeled explicitly, then we could also lower to a generic scalar or vector phi nodes, instead of having multiple recipes to generate phis.

Eventually yes, to simplify codegen. But most importantly, for EVL-based vectorization that change helps to replace WidenVFxUF that is used with WidenEVL. That will be true for the remaining VPWidenPointerInductionRecipe that also better to be decomposed.
Also want to point out #82021 moves trunc optimization to xform so that VPlan became more explicit, which IMO makes more sense. However, if VPlan is going to allow some implicitness that's not so clear if it's good.

It depends, if we see CanonincalIVPHIRecipe and VPWidenIntOrFpRecipe as complex recipes that take care of their increment as well then computing the cost for both is natural. If the more verbose form is only introduced just before executing, there should be no need to account for costing the increment separately.

Right, and that's my point: now all places that need to be aware of the update will have extra logic to deal with them. Also, for the VPWidenIntOrFpRecipe heuristics becomes a bit more complicated / unclear to estimate broadcast of VF. For the VPWidenPointerInductionRecipe it would also have to accommodate onlyScalarsGenerated case.
For these vectorized IVs, is there a plan to represent broadcast(VF) in the preheader ?

This patch adds a new VPInstruction::HeaderMask opcode to model the abstract header-mask used for tail-folding. It will be lowered depending on target preference (either using active-lane-mask, explicit-vector-length or a wide compare of the canonical IV and the backedge taken count) Similarly to llvm#82270, it would be good to clarify/agree on the terminology w.r.t. to recipes/opcodes that cannot be code-gen'd directly (i.e. require further gradual lowering). NOTE: some tests are failing or needed updating, due to widened IVs being replaced by scalar-steps, as their only use was the earlier wide compare. This could be fixed by either adding a suitable wide canonical IV as operand to the header-mask recipe and exactly preserve the original behavior. Alternatively we could keep the current behavior of the patch and update the tests. Or introduce a wide induction PHI instead of VPWidenCanonicalIVReicpe; currently we *only* use a wide IV for VPWidenCanonicalIVRecipe, if there was a suitable IV in the original loop, *even* if the mask compare is the *only* wide use. Either never or always using a wide PHI would be more consistent (or eventually make a more informed cost-based decision).

github-actions · 2024-06-12T20:57:19Z

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:

git-clang-format --diff 6dc356d6985fc49d1b69c20cc27f6b066742144a 62e78e99a624ed4fa07fd4dd882844d7e38cfb63 --extensions cpp,h -- llvm/lib/Transforms/Vectorize/LoopVectorize.cpp llvm/lib/Transforms/Vectorize/VPlan.cpp llvm/lib/Transforms/Vectorize/VPlan.h llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp llvm/lib/Transforms/Vectorize/VPlanTransforms.h llvm/lib/Transforms/Vectorize/VPlanUtils.cpp

View the diff from clang-format here.

diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index f26c3bffb0..06d142425f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -20,9 +20,9 @@
 #include "LoopVectorizationPlanner.h"
 #include "VPlanCFG.h"
 #include "VPlanPatternMatch.h"
-#include "VPlanVerifier.h"
 #include "VPlanTransforms.h"
 #include "VPlanUtils.h"
+#include "VPlanVerifier.h"
 #include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index 49e7817275..c09d7a8710 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -141,7 +141,7 @@ struct VPlanTransforms {
   /// DataAndControlFlowWithoutRuntimeCheck, add its increment and adjust the
   /// loop region's terminator to BranchOnCond based on the active-lane-mask.
   static void convertCanonicalIV(VPlan &Plan, bool HasNUW,
-                               bool DataAndControlFlowWithoutRuntimeCheck);
+                                 bool DataAndControlFlowWithoutRuntimeCheck);
 
   /// Lower abstract recipes to concrete ones, that can be codegen'd.
   static void convertToConcreteRecipes(VPlan &Plan);

This patch delays adding canonical IV increments, computing the next active lane mask and the branch recipes to exit the vector loop. During initial construction of a VPlan, only add the canonical IV phi and active-lane-mask phis if needed. Similarly, do not add the branches to exit the loop initially. Computing the next IV value, the next active-lane-mask or the exit branches are details that are only needed to simplify codegen (execute of the individual recipes). This makes the inital VPlans more abstract (and simpler), in that we initially leave out some details. Initial VPlans still have the canonical induction recipe, which provides the value of the canonical induction at every iteration, but we leave out the detail of how it is computed, which is not needed until code generation. Similar reasoning applies to active lane mask increments and branch recipes. The vector loop region initially abstractly models the loop processing all vector iterations, without specifying how exactly exiting the region is handled. Again these details are only needed for code generation. To introduce the missing pieces and complete the abstract VPlan, a new prepareToExecute transform is added and run just before exiting the VPlan. This is an attempt to further employ gradual lowering, as outlined in https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf and already applied for replicate region handling.

fhahn · 2024-06-12T21:01:01Z

Can you please share a VPlan design doc which has been discussed with a community ? Perhaps I'm missing something and as right now my view of a VPlan is it has self-contained i.e. it should be possible to get all required information to make xform without accessing underlying IR (except for live-ins and live-outs) and at any point be able to generate code out of it without any extra manipulations on a VPlan.

Put up PRs to update some parts of the documentation (which started out as the original design docs from when VPlan was originally proposed) based on the recent Developers' Meeting talk #85688, #85689. W.r.t. to the VPWidenIntOrFpInductionRecipe, the increment was intentionally not modeled explicitly IIRC to make use of VPlan’s capabilities to model higher-level concepts directly, thus having simpler plans with more expressive recipes.

The part of generating code at any point would get less true with that patch, but we already employ gradual lowering, so initial VPlans cannot be code-gen'd directly (e.g. predicated VPReplicateRecipes only get expanded to replicate regions later, which helps to simplifier earlier transforms)

Ignoring cost model part, for VPlan that represents just one loop it's certainly simplifies xforms. But if we consider future work with peel/remainder (prolog/epilog/tail/cleanup) loops represented that is going to complicate xforms (unrolling, interleaving), codegen and cost model, am I wrong ? That technically can be mitigated by https://reviews.llvm.org/D148581, but if we consider SEME loops, from my perspective, that becomes complicated too.

For some transformations it may be helpful to have the fully expanded form, but in that case it could be introduced when needed (e.g. as we already do with expanding replicate regions before merging them)

I finally managed to wrap up the implementation of an explicit interleave transform on the current state: #94339

The current version adds a set of VPInstructions to compute the vector parts of inductions, which requires quite a bit of work to handle all cases. The expanded form is quite a bit more verbose, which I think is why it is desirable to keep things more abstract until needed. I think the expanded form early on would also force us to detect the induction increments (chained together across parts) and handle that explicitly. It might simplify things to further delay full expansion until after interleaving by introducing an intermediate opcode to compute the step vector for a given part.

Eventually yes, to simplify codegen. But most importantly, for EVL-based vectorization that change helps to replace WidenVFxUF that is used with WidenEVL. That will be true for the remaining VPWidenPointerInductionRecipe that also better to be decomposed. Also want to point out #82021 moves trunc optimization to xform so that VPlan became more explicit, which IMO makes more sense. However, if VPlan is going to allow some implicitness that's not so clear if it's good.

VPWidenPointerInductionRecipe has recently been updated to only generate the vector part, the scalar part is now modeled via scalar-iv-steps (since #83068 )

VPWidenIntOrFpInductionRecipe’s are by definition being tied to being an induction, which as a consequence also means their step is loop-invariant. Not modeling the increment explicitly as the benefit of any transform changing their increment to not be loop-invariant.

To support wide inductions with EVL, I think it would be better to introduce a generic wide phi recipe (which doesn’t have to be an induction) and lower to that + explicit increment. This can also be used to lower wide induction recipes for simpler code-gen late in the pipeline.

Another thing to note is that at the moment VPWidenIntOrFpInductionRecipe will create a step vector per part, using the step-vector of the last part as incoming value.

There’s a small benefit from explicitly adding the VF as operand to VPWidenIntOrFpInductionRecipe (avoiding some vscale * VF.getKnownMinValue()) as in #95305. But that wouldn’t help with EVL, as EVL computations in the vector loop region won’t dominate the wide IV recipes.

Right, and that's my point: now all places that need to be aware of the update will have extra logic to deal with them. Also, for the VPWidenIntOrFpRecipe heuristics becomes a bit more complicated / unclear to estimate broadcast of VF. For the VPWidenPointerInductionRecipe it would also have to accommodate onlyScalarsGenerated case. For these vectorized IVs, is there a plan to represent broadcast(VF) in the preheader ?

VPWidenIntOrFpRecipe already store their scalar step, it would probably also make sense to model broadcast of step scaled by VF explicitly in the preheader, to allow CSE once we support explicit broadcasts/extracts. Again, this kind of transformation would fit well late in the pipeline, once we lowered more complex constructs for code-gen. The legacy cost-modeling for inductions at the moment doesn't consider any of this, but I think we are close to improving on this aspect, even though we might not be able to capture all aspects completely accurately, at least initially.

Finally, another concrete example where abstract recipes can be used to simplify transformations initially is modeling the vector loop’s header mask as general recipe initially, to be specialized later once we decide how to model it concretely (e.g. EVL, active.lane.mask, wide IV compare) #89603

ayalz

Worth starting with delaying the introduction of canonicalIVIncrement alone, while keeping a terminating branch (existing branch-on-count or introducing some other branch, temporarily)? This seems to provide substantial incremental improvement while incurring smaller changes, i.e., w/o affecting VPlanVerifier or introducing IsAbstract.

Delaying the introduction of the loop-region-closing branch may deserve separate attention.

ayalz · 2024-09-15T05:55:41Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV.
+static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {


Suggested change

static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {

static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool DebugLoc DL) {

Done thanks!

ayalz · 2024-09-15T05:56:30Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -8555,7 +8551,7 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
  // When not folding the tail, we know that the induction increment will not
  // overflow.
  bool HasNUW = Style == TailFoldingStyle::None;
-  addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW, DL);
+  addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DL);


Suggested change

addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DL);

addCanonicalIV(*Plan, Legal->getWidestInductionType(), DL);

along with the cleanup of code above that sets HasNUW, although Style is still needed by addActiveLaneMask() below? Note that IVUpdateMayOverflow is now computed for BestVF rather than across all VF's in range, when setting canonical IV but not active lane mask below? I.e., patch is not NFC.

Cleaned up ,thanks!

ayalz · 2024-09-15T05:58:33Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-  addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW,
-                        DebugLoc());
-  assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
+  addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DebugLoc());


Suggested change

addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DebugLoc());

addCanonicalIV(*Plan, Legal->getWidestInductionType(), DebugLoc());

Setting HasNUW for native, along with its comment, should move to finalizeVPlan()?

Cleaned up as well thanks!

ayalz · 2024-09-15T05:59:13Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -8991,7 +8986,7 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
      PreviousLink = RedRecipe;
    }
  }
-  Builder.setInsertPoint(&*LatchVPBB->begin());
+  Builder.setInsertPoint(LatchVPBB, LatchVPBB->begin());


Independent of this patch?

Undone in the latest version as it now still introduces the branch early.

ayalz · 2024-09-15T06:24:42Z

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-unroll.ll

+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 0, i64 [[UMAX]])
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY3:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT]], i64 [[UMAX]])
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY4:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT1]], i64 [[UMAX]])
+; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY5:%.*]] = call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 [[INDEX_PART_NEXT2]], i64 [[UMAX]])


Is this addition intentional?

ayalz · 2024-09-15T09:20:21Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

-                         CanonicalIVIncrement->hasNoSignedWrap()},
-                        CanonicalIVIncrement->getDebugLoc(), "index.evl.next");
-  NextEVLIV->insertBefore(CanonicalIVIncrement);
+      new VPInstruction(Instruction::Add, {OpVPEVL, EVLPhi}, {false, false},


Wrap flags initiated to false may later be turned on?

Yes at this point we don't know if it wraps or not. Without #111758, the flags for EVL increment would always be false I think, as it always folds the tail.

With it, we could update it when introducing the increment (currently not done)

Worth a comment and/or TODO?

Added, thanks

Thanks. Possibly worth commenting what each false stands for.

ayalz · 2024-09-15T09:22:36Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -1572,3 +1529,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
    }
  }
 }
+
+void VPlanTransforms::finalizePlan(VPlan &Plan, bool HasNUW,


How about something like lowerCanonicalIV(), or materializeCanonicalIVIncrement(), rather than finalizePlan()?

Updated to lowerCanonicalIV for the initial version

ayalz · 2024-09-15T09:30:17Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+                                   "active.lane.mask.next");
+  LaneMaskPhi->addOperand(ALM);
+
+  // Replace the original terminator with BranchOnCond. We have to invert the


This now introduces the terminal rather than replace it.

Perhaps a terminal branch-on-count recipe should be introduced along with the abstract canonical IV (could conceptually check the pre-bumped IV with TC-step), delaying only the introduction of the canonical IV's increment between them for later? The canonical IV still remains abstract until this increment is added, but the VPlan continues to be "valid" w/o updating verify().

Updated to initially still introduce the branch early. At the moment it still uses the same opcode, but would probably need to introduce a separate one for the different semantics (or define them conditionally on whether lowering has been finalized)

ayalz · 2024-09-15T09:34:53Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  }
+  auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);


Suggested change

}

auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);

}

// Now introduce a conditional branch to control the loop until the lane mask is exhuasted.

auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);

Done thanks!

ayalz · 2024-09-15T15:09:59Z

llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-gep.ll

-; CHECK:        VPlan 'Initial VPlan for VF={vscale x 2},UF>=1' {
+; CHECK:        VPlan 'Final VPlan for VF={vscale x 2},UF>=1' {


Is this change relevant to this patch? Better separate to isolate test changes.

fhahn

Worth starting with delaying the introduction of canonicalIVIncrement alone, while keeping a terminating branch (existing branch-on-count or introducing some other branch, temporarily)? This seems to provide substantial incremental improvement while incurring smaller changes, i.e., w/o affecting VPlanVerifier or introducing IsAbstract.

Delaying the introduction of the loop-region-closing branch may deserve separate attention.

Done in the updated version, thanks!

fhahn · 2024-10-09T20:12:46Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -7428,6 +7428,14 @@ LoopVectorizationPlanner::executePlan(
      "expanded SCEVs to reuse can only be used during epilogue vectorization");
  (void)IsEpilogueVectorization;

+  TailFoldingStyle Style =
+      CM.getTailFoldingStyle(!isIndvarOverflowCheckKnownFalse(&CM, BestVF));


Updated thanks!

fhahn · 2024-10-09T20:39:53Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  TailFoldingStyle Style =
+      CM.getTailFoldingStyle(!isIndvarOverflowCheckKnownFalse(&CM, BestVF));
+  // When not folding the tail, we know that the induction increment will not
+  // overflow.


The current code just moves the existing logic. Put up #111758 to improve this separately.

fhahn · 2024-10-09T20:41:33Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV
+static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {


Done, thanks!

fhahn · 2024-10-09T20:41:39Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV.
+static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {


Done thanks!

fhahn · 2024-10-09T20:44:24Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -8555,7 +8551,7 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
  // When not folding the tail, we know that the induction increment will not
  // overflow.
  bool HasNUW = Style == TailFoldingStyle::None;
-  addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW, DL);
+  addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DL);


Cleaned up ,thanks!

fhahn · 2024-10-09T20:53:22Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -1235,16 +1235,10 @@ void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {
 static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
    VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {


Done, thanks!

fhahn · 2024-10-09T20:53:26Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-  addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW,
-                        DebugLoc());
-  assert(verifyVPlanIsValid(*Plan) && "VPlan is invalid");
+  addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DebugLoc());


Cleaned up as well thanks!

fhahn · 2024-10-09T20:53:43Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -1572,3 +1529,71 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
    }
  }
 }
+
+void VPlanTransforms::finalizePlan(VPlan &Plan, bool HasNUW,


Updated to lowerCanonicalIV for the initial version

fhahn · 2024-10-09T21:07:32Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  }
+  auto *LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&*FoundLaneMaskPhi);


Done thanks!

fhahn · 2024-10-09T21:08:39Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+                                   "active.lane.mask.next");
+  LaneMaskPhi->addOperand(ALM);
+
+  // Replace the original terminator with BranchOnCond. We have to invert the


Updated to initially still introduce the branch early. At the moment it still uses the same opcode, but would probably need to introduce a separate one for the different semantics (or define them conditionally on whether lowering has been finalized)

Introduce a general recipe to generate a scalar phi. Lower VPCanonicalIVPHIRecipe and VPEVLBasedIVRecipe to VPScalarIVPHIrecipe before plan execution, avoiding the need for duplicated ::execute implementations. There are other cases that could benefit, including in-loop reduction phis. Builds on a similar idea as llvm#82270.

…C). (#114305) Introduce a general recipe to generate a scalar phi. Lower VPCanonicalIVPHIRecipe and VPEVLBasedIVRecipe to VPScalarIVPHIrecipe before plan execution, avoiding the need for duplicated ::execute implementations. There are other cases that could benefit, including in-loop reduction phis and pointer induction phis. Builds on a similar idea as #82270. PR: #114305

fhahn

Updated on top of latest changes, ping :)

fhahn · 2024-12-09T11:38:12Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  TailFoldingStyle Style =
+      CM.getTailFoldingStyle(!isIndvarOverflowCheckKnownFalse(&CM, BestVF));
+  // When not folding the tail, we know that the induction increment will not
+  // overflow.


PR has now landed, updated the code here.

ayalz · 2024-12-10T23:04:14Z

llvm/lib/Transforms/Vectorize/VPlanUtils.cpp

@@ -91,7 +91,7 @@ bool vputils::isUniformAcrossVFsAndUFs(VPValue *V) {

  auto *CanonicalIV = R->getParent()->getPlan()->getCanonicalIV();
  // Canonical IV chain is uniform.
-  if (V == CanonicalIV || V == CanonicalIV->getBackedgeValue())
+  if (V == CanonicalIV) // || V == CanonicalIV->getBackedgeValue())


Continue to consider the backedge value uniform, if it exists?

Done thanks!

ayalz · 2024-12-11T07:53:26Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

  // TODO: Move to VPlan transform stage once the transition to the VPlan-based
  // cost model is complete for better cost estimates.


What does this comment refer to, should it move?

It refers to the VPlan transforms below, I moved down and clarified.

ayalz · 2024-12-11T07:53:46Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  bool WithoutRuntimeCheck =
+      Style == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
+  // Use NUW for the induction increment if we proved that it won't overflow in
+  // the vector loop or when not folding the tail. In the later case, we know


Suggested change

// the vector loop or when not folding the tail. In the later case, we know

// the vector loop or when not folding the tail. In the latter case, we know

Done thanks!

ayalz · 2024-12-11T12:51:12Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

-// loop.
-static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
-                                  DebugLoc DL) {
+// Add the required canonical IV.


Suggested change

// Add the required canonical IV.

// Add the required canonical IV along with its loop branch, but w/o its increment - which is introduced later.

Done thanks

ayalz · 2024-12-11T14:34:26Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

@@ -1289,19 +1289,12 @@ void VPlanTransforms::optimize(VPlan &Plan) {
 //   %Negated = Not %ALM
 //   branch-on-cond %Negated
 //
-static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
-    VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {
+static VPActiveLaneMaskPHIRecipe *createActiveLaneMaskPhi(VPlan &Plan) {


Above documentation should be kept in sync.

Updated, thanks

ayalz · 2024-12-11T14:56:16Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+    VPlan &Plan, bool HasNUW, bool DataAndControlFlowWithoutRuntimeCheck) {
+  auto *CanIV = Plan.getCanonicalIV();
+
+  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();


Suggested change

VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();

VPBasicBlock *Latch = Plan.getVectorLoopRegion()->getExitingBasicBlock();

?

Updated, thanks

ayalz · 2024-12-11T14:56:34Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  auto *CanIV = Plan.getCanonicalIV();
+
+  VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
+  auto *Term = EB->getTerminator();


Suggested change

auto *Term = EB->getTerminator();

auto *LatchBranch = Latch->getTerminator();

?

Updated, thanks

ayalz · 2024-12-11T15:02:17Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
+  Term->eraseFromParent();


The existing terminator is already a branch on cond; reset its operand?

The current version leaves the BranchOnCount until we replace it here.

ayalz · 2024-12-11T15:06:11Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.h

+  /// Finalize \p Plan by introducing explicit increments for the canonical
+  /// induction.


Suggested change

/// Finalize \p Plan by introducing explicit increments for the canonical

/// induction.

/// Complete the canonical IV by introducing an explicit increment between its header phi and latch branch on count.

this is one step in getting Plan to be ready for code-gen, not necessarily the last one after which Plan has been "Finalized".

Good point, we need to do this before unrolling, adjusted, thanks!

ayalz · 2024-12-11T15:10:41Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.h

@@ -124,6 +124,11 @@ struct VPlanTransforms {
  /// Remove dead recipes from \p Plan.
  static void removeDeadRecipes(VPlan &Plan);

+  /// Finalize \p Plan by introducing explicit increments for the canonical
+  /// induction.
+  static void lowerCanonicalIV(VPlan &Plan, bool HasNUW,


Instead of lower can use convert to be consistent with convertToConcreteRecipes(), but for now this is more of a complementing/completing process rather than a lowering/conversion one?

Updated, thanks

fhahn · 2024-12-17T21:45:14Z

ping :)

fhahn · 2025-01-21T13:58:52Z

ping :)

ayalz · 2025-02-05T22:57:25Z

This patch delays adding canonical IV increments, computing the next active lane mask and the branch recipes to exit the vector loop.

During initial construction of a VPlan, only add the canonical IV phi and active-lane-mask phis if needed. Similarly, do not add the branches to exit the loop initially. Computing the next IV value, the next active-lane-mask or the exit branches are details that are only needed to simplify codegen (execute of the individual recipes).

This makes the inital VPlans more abstract (and simpler), in that we initially leave out some details. Initial VPlans still have the canonical induction recipe, which provides the value of the canonical induction at every iteration, but we leave out the detail of how it is computed, which is not needed until code generation.

Similar reasoning applies to active lane mask increments and branch recipes. The vector loop region initially abstractly models the loop processing all vector iterations, without specifying how exactly exiting the region is handled. Again these details are only needed for code generation.

To introduce the missing pieces and complete the abstract VPlan, a new prepareToExecute transform is added and run just before exiting the VPlan.

This is an attempt to further employ gradual lowering, as outlined in https://llvm.org/devmtg/2023-10/slides/techtalks/Hahn-VPlan-StatusUpdateAndRoadmap.pdf and already applied for replicate region handling.

Having the canonical IV phi (i.e., the iteration number of the vector loop), and the active lane mask phi (i.e., the mask of the vector loop header), as abstract VPValues during first (possibly most) VPlan transformations, leaving the introduction of recipes along their associated cyclic chains to a later stage, sounds great! However, placing them as regular header phi recipes inside the header block, as placeholders - w/o feeding or being fed along their associated cyclic def/use chains, seems inconsistent: header phi recipes are expected to have two operands, one fed across the back-edge; w/o it, what's to keep the recipe from being hoisted to the preheader. As placeholders these recipes continue to maintain their current concrete semantics themselves, rather than being abstract.

An alternative (discussed w/ @aniragil ): have VPRegionBlock hold the canonical IV and active lane mask as VPValues, rather than having the header VPBasicBlock hold them as header phi recipes. Such VPValues are neither recipes nor live-ins - whose scope is the entire VPlan, which holds them. These VPValues can be used by any recipe inside the loop region (including header phi recipes?) or inside the middle block. These VPValues are destined to be materialized into header-phi + increment + loop branch recipes when the VPRegionBlock is replaced by CFG, prior to code-gen, or earlier if desired. WDYT?

ayalz

This raises several thoughts, including the one commented at the outset - of representing the canonical IV and active lane mask as VPValues of the VPRegionBlock rather than ordinary header phi recipes of the header VPBasicBlock.

ayalz · 2025-02-03T12:51:51Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  bool HasNUW = !IVUpdateMayOverflow || Style == TailFoldingStyle::None;
+  // TODO: Move transforms to VPlan transform stage once the transition to the
+  // VPlan-based cost model is complete for better cost estimates.
+  VPlanTransforms::convertCanonicalIV(BestVPlan, HasNUW, WithoutRuntimeCheck);


Setting HasNUW here during VPlan execute is more accurate because of the BestUF --> IVUpdateMayOverflow --> tail-fold-style --> HasNUW dependence (i.e., along with "setUF() or "unrollByUF"), but is there a better way to set this earlier during VPlan planning, possibly splitting VF ranges - and UF ranges?

ayalz · 2025-02-03T13:01:15Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // TODO: introduce branch-on-count during VPlan final (pre-codegen) lowering.
  Builder.createNaryOp(VPInstruction::BranchOnCount,
-                       {CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
+                       {CanonicalIVPHI, &Plan.getVectorTripCount()}, DL);


How about having BranchOnCount also be responsible for bumping the IV, at least initially? (Inspired by PowerPC's bdnz instruction; can call it BranchOnIncrementCount.) It could then feed back the canonical IV Phi across the back-edge, and possibly be split into a separate Add later to simplify code-gen.
Unclear if doing so helps achieve the desired simplification(?)

ayalz · 2025-02-03T16:43:24Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -10272,7 +10271,7 @@ preparePlanForEpilogueVectorLoop(VPlan &Plan, Loop *L,
                             isa<VPScalarCastRecipe>(U) ||
                             isa<VPDerivedIVRecipe>(U) ||
                             cast<VPInstruction>(U)->getOpcode() ==
-                                 Instruction::Add;
+                                 VPInstruction::BranchOnCount;
                    }) &&
             "the canonical IV should only be used by its increment or "


Suggested change

"the canonical IV should only be used by its increment or "

"the canonical IV should only be used by its BranchOnCount or "

ayalz · 2025-02-03T16:46:22Z

llvm/lib/Transforms/Vectorize/VPlan.h

@@ -3239,7 +3239,8 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {

  VPCanonicalIVPHIRecipe *clone() override {
    auto *R = new VPCanonicalIVPHIRecipe(getOperand(0), getDebugLoc());
-    R->addOperand(getBackedgeValue());
+    if (getNumOperands() == 2)


Seems reasonable to expect every header phi recipe to have two operands (always) - implementing getStartValue() and getBackedgeValue() of VPHeaderPhiRecipe?

ayalz · 2025-02-04T18:20:25Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // Don't use getDecisionAndClampRange here, because we don't know the UF
+  // so this function is better to be conservative, rather than to split
+  // it up into different VPlans.
+  // TODO: Consider using getDecisionAndClampRange here to split up VPlans.
+  bool IVUpdateMayOverflow = false;
+  for (ElementCount VF : Range)
+    IVUpdateMayOverflow |= !isIndvarOverflowCheckKnownFalse(&CM, VF);
+
+  TailFoldingStyle Style = CM.getTailFoldingStyle(IVUpdateMayOverflow);


This is moved here from earlier in tryToBuildVPlanWithVPRecipes(), so this information is still computed conservatively at this stage.

ayalz · 2025-02-05T10:00:01Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  // Replace the original terminator with BranchOnCond. We have to invert the
+  // mask here because a true condition means jumping to the exit block.
+  auto *NotMask = Builder.createNot(ALM, DL);
+  Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);


BranchOnCond operates on a single condition bit, so NotMask (only last part thereof?) should effectively be reduced using AnyOf reduction - although its execute/generate seems to look for the first lane (only), rather than the last?

ayalz · 2025-02-05T10:04:31Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

  Value *StartIdx = ConstantInt::get(IdxTy, 0);
  auto *StartV = Plan.getOrAddLiveIn(StartIdx);

  // Add a VPCanonicalIVPHIRecipe starting at 0 to the header.
+  // TODO: Introduce a separate scalar phi recipe that can be used for codegen,
+  // turning VPCanonicalIVPHIRecipe into an 'abstract' recipe which cannot be


VPCanonicalIVPHIRecipe, as a phi recipe in the header block, and BranchOnCount as a recipe in the latch, seem to retain their individual concrete semantics, even if the increment is introduced later between them.

ayalz · 2025-02-05T10:49:43Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

-// dropped from the canonical IV increment. Return the created
+// Add a VPActiveLaneMaskPHIRecipe and recipes to compute its start lane mask to
+// \p Plan, but w/o its increment in the loop region or adjusting the exit
+// conditions of the loop region - those are adjusted later. Return the created


This essentially introduces VPActiveLaneMaskPHIRecipe as a placeholder.

ayalz · 2025-02-05T17:55:12Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

-                         CanonicalIVIncrement->hasNoSignedWrap()},
-                        CanonicalIVIncrement->getDebugLoc(), "index.evl.next");
-  NextEVLIV->insertBefore(CanonicalIVIncrement);
+      new VPInstruction(Instruction::Add, {OpVPEVL, EVLPhi}, {false, false},


Thanks. Possibly worth commenting what each false stands for.

ayalz · 2025-02-05T17:56:01Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+  VPValue *TripCount, *IncrementValue;
+  if (!DataAndControlFlowWithoutRuntimeCheck) {
+    // When the loop is guarded by a runtime overflow check for the loop
+    // induction variable increment by VF, we can increment the value before


Suggested change

// induction variable increment by VF, we can increment the value before

// induction variable increment by VFxUF, we can increment the value before

?

fhahn requested review from ayalz and aniragil February 19, 2024 18:49

llvmbot added backend:RISC-V vectorizers llvm:transforms labels Feb 19, 2024

nikolaypanchenko reviewed Feb 20, 2024

View reviewed changes

ayalz reviewed Feb 28, 2024

View reviewed changes

npanchen mentioned this pull request Mar 20, 2024

[VPlan] Add additional guiding principles to docs. #85688

Open

ayalz mentioned this pull request Mar 24, 2024

[LV, VP]VP intrinsics support for the Loop Vectorizer + adding new tail-folding mode using EVL. #76172

Merged

fhahn mentioned this pull request Apr 22, 2024

[VPlan] Add new VPInstruction ocpode for header mask. #89603

Open

fhahn force-pushed the vplan-prepare-to-execute branch from b2f8789 to 025037e Compare June 12, 2024 20:54

fhahn force-pushed the vplan-prepare-to-execute branch from 025037e to 2ae1d69 Compare June 12, 2024 21:00

ayalz reviewed Sep 15, 2024

View reviewed changes

fhahn added 2 commits October 9, 2024 20:45

Merge remote-tracking branch 'origin/main' into vplan-prepare-to-execute

470e0ae

!fixup address latest comments, thanks!

311f105

fhahn commented Oct 9, 2024

View reviewed changes

fhahn mentioned this pull request Oct 30, 2024

[VPlan] Introduce VPScalarPHIRecipe, use for can & EVL IV codegen (NFC). #114305

Merged

fhahn added 2 commits November 10, 2024 09:34

Merge remote-tracking branch 'origin/main' into vplan-prepare-to-execute

fbf1e06

!fixup fix after merge, update vplan printing tests.

53b812d

fhahn changed the title ~~[VPlan] Delay adding canonical IV increment and exit branches.~~ [VPlan] Delay adding canonical IV increment. Nov 10, 2024

fhahn added 2 commits December 9, 2024 11:09

Merge remote-tracking branch 'origin/main' into vplan-prepare-to-execute

304f11b

!fixup update to latest main

4afb1f0

llvmbot added the backend:PowerPC label Dec 9, 2024

fhahn commented Dec 9, 2024

View reviewed changes

ayalz reviewed Dec 11, 2024

View reviewed changes

fhahn added 2 commits December 13, 2024 09:57

Merge remote-tracking branch 'origin/main' into vplan-prepare-to-execute

6af2b50

!fixup address latest comments, thanks!

e1cd5d9

ayalz mentioned this pull request Jan 15, 2025

[VPlan] Update final IV exit value via VPlan. #112147

Merged

fhahn added 2 commits January 21, 2025 13:23

Merge remote-tracking branch 'origin/main' into vplan-prepare-to-execute

9770a1e

!fixup update after recent merge

62e78e9

ayalz reviewed Feb 5, 2025

View reviewed changes

	// Add the required canonical IV
	// Add the required canonical IV.

	static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool HasNUW, DebugLoc DL) {
	static void addCanonicalIV(VPlan &Plan, Type *IdxTy, bool DebugLoc DL) {

	addCanonicalIV(*Plan, Legal->getWidestInductionType(), HasNUW, DL);
	addCanonicalIV(*Plan, Legal->getWidestInductionType(), DL);

		}
		auto LaneMaskPhi = cast<VPActiveLaneMaskPHIRecipe>(&FoundLaneMaskPhi);

		; CHECK: VPlan 'Initial VPlan for VF={vscale x 2},UF>=1' {
		; CHECK: VPlan 'Final VPlan for VF={vscale x 2},UF>=1' {

		@@ -1235,16 +1235,10 @@ void VPlanTransforms::optimize(VPlan &Plan, ScalarEvolution &SE) {
		static VPActiveLaneMaskPHIRecipe *addVPLaneMaskPhiAndUpdateExitBranch(
		VPlan &Plan, bool DataAndControlFlowWithoutRuntimeCheck) {

		// TODO: Move to VPlan transform stage once the transition to the VPlan-based
		// cost model is complete for better cost estimates.

	// the vector loop or when not folding the tail. In the later case, we know
	// the vector loop or when not folding the tail. In the latter case, we know

	// Add the required canonical IV.
	// Add the required canonical IV along with its loop branch, but w/o its increment - which is introduced later.

	VPBasicBlock *EB = Plan.getVectorLoopRegion()->getExitingBasicBlock();
	VPBasicBlock *Latch = Plan.getVectorLoopRegion()->getExitingBasicBlock();

	auto *Term = EB->getTerminator();
	auto *LatchBranch = Latch->getTerminator();

		Builder.createNaryOp(VPInstruction::BranchOnCond, {NotMask}, DL);
		Term->eraseFromParent();

		/// Finalize \p Plan by introducing explicit increments for the canonical
		/// induction.

	/// Finalize \p Plan by introducing explicit increments for the canonical
	/// induction.
	/// Complete the canonical IV by introducing an explicit increment between its header phi and latch branch on count.

	"the canonical IV should only be used by its increment or "
	"the canonical IV should only be used by its BranchOnCount or "

	// induction variable increment by VF, we can increment the value before
	// induction variable increment by VFxUF, we can increment the value before

[VPlan] Delay adding canonical IV increment. #82270

Are you sure you want to change the base?

[VPlan] Delay adding canonical IV increment. #82270

Uh oh!

Conversation

fhahn commented Feb 19, 2024

Uh oh!

llvmbot commented Feb 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nikolaypanchenko left a comment

Choose a reason for hiding this comment

Uh oh!

fhahn commented Feb 25, 2024

Uh oh!

ayalz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nikolaypanchenko commented Feb 28, 2024

Uh oh!

github-actions bot commented Jun 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fhahn commented Jun 12, 2024

Uh oh!

ayalz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

llvmbot commented Feb 19, 2024 •

edited

Loading

github-actions bot commented Jun 12, 2024 •

edited

Loading