Skip to content

[LoopVectorize] Enable vectorisation of early exit loops with live-outs #120567

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 30, 2025

Conversation

david-arm
Copy link
Contributor

@david-arm david-arm commented Dec 19, 2024

This work feeds part of PR #88385, and adds support for vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
two fixes:

  1. The vplan verifier incorrectly assumed that the block containing
    a definition always dominates the block of the user. That's not true
    if you can arrive at the use block from multiple incoming blocks.
    This is possible for early exit loops where both the early exit and
    the latch jump to the same block.
  2. We were adding the new vector.early.exit to the wrong parent loop.
    It needs to have the same parent as the actual early exit block from
    the original loop.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.

@llvmbot
Copy link
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

This work feeds part of PR #88385, and adds support for vectorising loops with uncountable early exits and outside users of loop-defined variables.

I've added a new fixupEarlyExitIVUsers to mirror what happens in fixupIVUsers when patching up outside users of induction variables in the early exit block. We have to handle these differently for two reasons:

  1. We can't work backwards from the end value in the middle block because we didn't leave at the last iteration.
  2. We need to generate different IR that calculates the vector lane that triggered the exit, and hence can determine the induction value at the point we exited.

I've added a new 'null' VPValue as a dummy placeholder to manage the incoming operands of PHI nodes in the exit block. We can have situations where one of the incoming values is an induction variable (or its update) and the other is not. For example, both the latch and the early exiting block can jump to the same exit block. However, VPInstruction::generate walks through all predecessors of the PHI assuming the value is not an IV. In order to ensure that we process the right value for the right incoming block we use this new 'null' value is a marker to indicate it should be skipped, since it will be handled separately in fixupIVUsers or fixupEarlyExitIVUsers.

All code for calculating the last value when exiting the loop early now lives in a new vector.early.exit block, which sits between the middle.split block and the original exit block. I also had to fix up the vplan verifier because it assumed that the block containing a definition always dominated the parent of the user. That's no longer the case because we can arrive at the exit block via one of the latch or the early exiting block.

I've added a new ExtractFirstActive VPInstruction that extracts the first active lane of a vector, i.e. the lane of the vector predicate that triggered the exit.


Patch is 138.11 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/120567.diff

14 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+164-25)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+3)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+28)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+17-5)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+6-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+11)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+5-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/simple_early_exit.ll (+256-57)
  • (modified) llvm/test/Transforms/LoopVectorize/early_exit_legality.ll (-2)
  • (modified) llvm/test/Transforms/LoopVectorize/multi_early_exit.ll (+3-3)
  • (modified) llvm/test/Transforms/LoopVectorize/multi_early_exit_live_outs.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/single_early_exit.ll (+21-13)
  • (modified) llvm/test/Transforms/LoopVectorize/single_early_exit_live_outs.ll (+641-79)
  • (modified) llvm/test/Transforms/LoopVectorize/uncountable-early-exit-vplan.ll (+15-6)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index a8511483e00fbe..4962dff3ab1f8a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -563,6 +563,11 @@ class InnerLoopVectorizer {
                             Value *VectorTripCount, BasicBlock *MiddleBlock,
                             VPTransformState &State);
 
+  void fixupEarlyExitIVUsers(PHINode *OrigPhi, const InductionDescriptor &II,
+                             BasicBlock *VectorEarlyExitBB,
+                             BasicBlock *MiddleBlock, VPlan &Plan,
+                             VPTransformState &State);
+
   /// Iteratively sink the scalarized operands of a predicated instruction into
   /// the block that was created for it.
   void sinkScalarOperands(Instruction *PredInst);
@@ -2838,6 +2843,23 @@ BasicBlock *InnerLoopVectorizer::createVectorizedLoopSkeleton(
   return LoopVectorPreHeader;
 }
 
+static bool isValueIncomingFromBlock(BasicBlock *ExitingBB, Value *V,
+                                     Instruction *UI) {
+  PHINode *PHI = dyn_cast<PHINode>(UI);
+  assert(PHI && "Expected LCSSA form");
+
+  // If this loop has an uncountable early exit then there could be
+  // different users of OrigPhi with either:
+  //   1. Multiple users, because each exiting block (countable or
+  //      uncountable) jumps to the same exit block, or ..
+  //   2. A single user with an incoming value from a countable or
+  //      uncountable exiting block.
+  // In both cases there is no guarantee this came from a countable exiting
+  // block, i.e. the latch.
+  int Index = PHI->getBasicBlockIndex(ExitingBB);
+  return Index != -1 && PHI->getIncomingValue(Index) == V;
+}
+
 // Fix up external users of the induction variable. At this point, we are
 // in LCSSA form, with all external PHIs that use the IV having one input value,
 // coming from the remainder loop. We need those PHIs to also have a correct
@@ -2853,6 +2875,7 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
   // We allow both, but they, obviously, have different values.
 
   DenseMap<Value *, Value *> MissingVals;
+  BasicBlock *OrigLoopLatch = OrigLoop->getLoopLatch();
 
   Value *EndValue = cast<PHINode>(OrigPhi->getIncomingValueForBlock(
                                       OrigLoop->getLoopPreheader()))
@@ -2860,12 +2883,12 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
 
   // An external user of the last iteration's value should see the value that
   // the remainder loop uses to initialize its own IV.
-  Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoop->getLoopLatch());
+  Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoopLatch);
   for (User *U : PostInc->users()) {
     Instruction *UI = cast<Instruction>(U);
     if (!OrigLoop->contains(UI)) {
-      assert(isa<PHINode>(UI) && "Expected LCSSA form");
-      MissingVals[UI] = EndValue;
+      if (isValueIncomingFromBlock(OrigLoopLatch, PostInc, UI))
+        MissingVals[cast<PHINode>(UI)] = EndValue;
     }
   }
 
@@ -2875,7 +2898,9 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
   for (User *U : OrigPhi->users()) {
     auto *UI = cast<Instruction>(U);
     if (!OrigLoop->contains(UI)) {
-      assert(isa<PHINode>(UI) && "Expected LCSSA form");
+      if (!isValueIncomingFromBlock(OrigLoopLatch, OrigPhi, UI))
+        continue;
+
       IRBuilder<> B(MiddleBlock->getTerminator());
 
       // Fast-math-flags propagate from the original induction instruction.
@@ -2905,18 +2930,6 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
     }
   }
 
-  assert((MissingVals.empty() ||
-          all_of(MissingVals,
-                 [MiddleBlock, this](const std::pair<Value *, Value *> &P) {
-                   return all_of(
-                       predecessors(cast<Instruction>(P.first)->getParent()),
-                       [MiddleBlock, this](BasicBlock *Pred) {
-                         return Pred == MiddleBlock ||
-                                Pred == OrigLoop->getLoopLatch();
-                       });
-                 })) &&
-         "Expected escaping values from latch/middle.block only");
-
   for (auto &I : MissingVals) {
     PHINode *PHI = cast<PHINode>(I.first);
     // One corner case we have to handle is two IVs "chasing" each-other,
@@ -2929,6 +2942,102 @@ void InnerLoopVectorizer::fixupIVUsers(PHINode *OrigPhi,
   }
 }
 
+void InnerLoopVectorizer::fixupEarlyExitIVUsers(PHINode *OrigPhi,
+                                                const InductionDescriptor &II,
+                                                BasicBlock *VectorEarlyExitBB,
+                                                BasicBlock *MiddleBlock,
+                                                VPlan &Plan,
+                                                VPTransformState &State) {
+  // There are two kinds of external IV usages - those that use the value
+  // computed in the last iteration (the PHI) and those that use the penultimate
+  // value (the value that feeds into the phi from the loop latch).
+  // We allow both, but they, obviously, have different values.
+  DenseMap<Value *, Value *> MissingVals;
+  BasicBlock *OrigLoopLatch = OrigLoop->getLoopLatch();
+  BasicBlock *EarlyExitingBB = Legal->getUncountableEarlyExitingBlock();
+  Value *PostInc = OrigPhi->getIncomingValueForBlock(OrigLoopLatch);
+
+  // Obtain the canonical IV, since we have to use the most recent value
+  // before exiting the loop early. This is unlike fixupIVUsers, which has
+  // the luxury of using the end value in the middle block.
+  VPBasicBlock *EntryVPBB = Plan.getVectorLoopRegion()->getEntryBasicBlock();
+  // NOTE: We cannot call Plan.getCanonicalIV() here because the original
+  // recipe created whilst building plans is no longer valid.
+  VPHeaderPHIRecipe *CanonicalIVR =
+      cast<VPHeaderPHIRecipe>(&*EntryVPBB->begin());
+  Value *CanonicalIV = State.get(CanonicalIVR->getVPSingleValue(), true);
+
+  // Search for the mask that drove us to exit early.
+  VPBasicBlock *EarlyExitVPBB = Plan.getVectorLoopRegion()->getEarlyExit();
+  VPBasicBlock *MiddleSplitVPBB =
+      cast<VPBasicBlock>(EarlyExitVPBB->getSinglePredecessor());
+  VPInstruction *BranchOnCond =
+      cast<VPInstruction>(MiddleSplitVPBB->getTerminator());
+  assert(BranchOnCond->getOpcode() == VPInstruction::BranchOnCond &&
+         "Expected middle.split block terminator to be a branch-on-cond");
+  VPInstruction *ScalarEarlyExitCond =
+      cast<VPInstruction>(BranchOnCond->getOperand(0));
+  assert(
+      ScalarEarlyExitCond->getOpcode() == VPInstruction::AnyOf &&
+      "Expected middle.split block terminator branch condition to be any-of");
+  VPValue *VectorEarlyExitCond = ScalarEarlyExitCond->getOperand(0);
+  // Finally get the mask that led us into the early exit block.
+  Value *EarlyExitMask = State.get(VectorEarlyExitCond);
+
+  // Calculate the IV step.
+  VPValue *StepVPV = Plan.getSCEVExpansion(II.getStep());
+  assert(StepVPV && "step must have been expanded during VPlan execution");
+  Value *Step = StepVPV->isLiveIn() ? StepVPV->getLiveInIRValue()
+                                    : State.get(StepVPV, VPLane(0));
+
+  auto FixUpPhi = [&](Instruction *UI, bool PostInc) -> Value * {
+    IRBuilder<> B(VectorEarlyExitBB->getTerminator());
+    assert(isa<PHINode>(UI) && "Expected LCSSA form");
+
+    // Fast-math-flags propagate from the original induction instruction.
+    if (isa_and_nonnull<FPMathOperator>(II.getInductionBinOp()))
+      B.setFastMathFlags(II.getInductionBinOp()->getFastMathFlags());
+
+    Type *CtzType = CanonicalIV->getType();
+    Value *Ctz = B.CreateCountTrailingZeroElems(CtzType, EarlyExitMask);
+    Ctz = B.CreateAdd(Ctz, cast<PHINode>(CanonicalIV));
+    if (PostInc)
+      Ctz = B.CreateAdd(Ctz, ConstantInt::get(CtzType, 1));
+
+    Value *Escape = emitTransformedIndex(B, Ctz, II.getStartValue(), Step,
+                                         II.getKind(), II.getInductionBinOp());
+    Escape->setName("ind.early.escape");
+    return Escape;
+  };
+
+  for (User *U : PostInc->users()) {
+    auto *UI = cast<Instruction>(U);
+    if (!OrigLoop->contains(UI)) {
+      if (isValueIncomingFromBlock(EarlyExitingBB, PostInc, UI))
+        MissingVals[UI] = FixUpPhi(UI, true);
+    }
+  }
+
+  for (User *U : OrigPhi->users()) {
+    auto *UI = cast<Instruction>(U);
+    if (!OrigLoop->contains(UI)) {
+      if (isValueIncomingFromBlock(EarlyExitingBB, OrigPhi, UI))
+        MissingVals[UI] = FixUpPhi(UI, false);
+    }
+  }
+
+  for (auto &I : MissingVals) {
+    PHINode *PHI = cast<PHINode>(I.first);
+    // One corner case we have to handle is two IVs "chasing" each-other,
+    // that is %IV2 = phi [...], [ %IV1, %latch ]
+    // In this case, if IV1 has an external use, we need to avoid adding both
+    // "last value of IV1" and "penultimate value of IV2". So, verify that we
+    // don't already have an incoming value for the middle block.
+    if (PHI->getBasicBlockIndex(VectorEarlyExitBB) == -1)
+      PHI->addIncoming(I.second, VectorEarlyExitBB);
+  }
+}
+
 namespace {
 
 struct CSEDenseMapInfo {
@@ -3062,6 +3171,13 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
       OuterLoop->addBasicBlockToLoop(MiddleSplitBB, *LI);
       PredVPBB = PredVPBB->getSinglePredecessor();
     }
+
+    BasicBlock *OrigEarlyExitBB = Legal->getUncountableEarlyExitBlock();
+    if (Loop *EEL = LI->getLoopFor(OrigEarlyExitBB)) {
+      BasicBlock *VectorEarlyExitBB =
+          State.CFG.VPBB2IRBB[VectorRegion->getEarlyExit()];
+      EEL->addBasicBlockToLoop(VectorEarlyExitBB, *LI);
+    }
   }
 
   // After vectorization, the exit blocks of the original loop will have
@@ -3091,6 +3207,15 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State) {
                    getOrCreateVectorTripCount(nullptr), LoopMiddleBlock, State);
   }
 
+  if (Legal->hasUncountableEarlyExit()) {
+    VPBasicBlock *VectorEarlyExitVPBB =
+        cast<VPBasicBlock>(VectorRegion->getEarlyExit());
+    BasicBlock *VectorEarlyExitBB = State.CFG.VPBB2IRBB[VectorEarlyExitVPBB];
+    for (const auto &Entry : Legal->getInductionVars())
+      fixupEarlyExitIVUsers(Entry.first, Entry.second, VectorEarlyExitBB,
+                            LoopMiddleBlock, Plan, State);
+  }
+
   for (Instruction *PI : PredicatedInstructions)
     sinkScalarOperands(&*PI);
 
@@ -8974,6 +9099,9 @@ static void addScalarResumePhis(VPRecipeBuilder &Builder, VPlan &Plan) {
     auto *VectorPhiR = cast<VPHeaderPHIRecipe>(Builder.getRecipe(ScalarPhiI));
     if (!isa<VPFirstOrderRecurrencePHIRecipe, VPReductionPHIRecipe>(VectorPhiR))
       continue;
+    assert(!Plan.getVectorLoopRegion()->getEarlyExit() &&
+           "Cannot handle "
+           "first-order recurrences with uncountable early exits");
     // The backedge value provides the value to resume coming out of a loop,
     // which for FORs is a vector whose last element needs to be extracted. The
     // start value provides the value if the loop is bypassed.
@@ -9032,8 +9160,7 @@ static SetVector<VPIRInstruction *> collectUsersInExitBlocks(
                auto *P = dyn_cast<PHINode>(U);
                return P && Inductions.contains(P);
              }))) {
-          if (ExitVPBB->getSinglePredecessor() == MiddleVPBB)
-            continue;
+          V = VPValue::getNull();
         }
         ExitUsersToFix.insert(ExitIRI);
         ExitIRI->addOperand(V);
@@ -9061,18 +9188,30 @@ addUsersInExitBlocks(VPlan &Plan,
     for (const auto &[Idx, Op] : enumerate(ExitIRI->operands())) {
       // Pass live-in values used by exit phis directly through to their users
       // in the exit block.
-      if (Op->isLiveIn())
+      if (Op->isLiveIn() || Op->isNull())
         continue;
 
       // Currently only live-ins can be used by exit values from blocks not
       // exiting via the vector latch through to the middle block.
-      if (ExitIRI->getParent()->getSinglePredecessor() != MiddleVPBB)
-        return false;
-
       LLVMContext &Ctx = ExitIRI->getInstruction().getContext();
-      VPValue *Ext = B.createNaryOp(VPInstruction::ExtractFromEnd,
-                                    {Op, Plan.getOrAddLiveIn(ConstantInt::get(
-                                             IntegerType::get(Ctx, 32), 1))});
+      VPValue *Ext;
+      VPBasicBlock *PredVPBB =
+          cast<VPBasicBlock>(ExitIRI->getParent()->getPredecessors()[Idx]);
+      if (PredVPBB != MiddleVPBB) {
+        VPBasicBlock *VectorEarlyExitVPBB =
+            Plan.getVectorLoopRegion()->getEarlyExit();
+        VPBuilder B2(VectorEarlyExitVPBB,
+                     VectorEarlyExitVPBB->getFirstNonPhi());
+        assert(ExitIRI->getParent()->getNumPredecessors() <= 2);
+        VPValue *EarlyExitMask =
+            Plan.getVectorLoopRegion()->getVectorEarlyExitCond();
+        Ext = B2.createNaryOp(VPInstruction::ExtractFirstActive,
+                              {Op, EarlyExitMask});
+      } else {
+        Ext = B.createNaryOp(VPInstruction::ExtractFromEnd,
+                             {Op, Plan.getOrAddLiveIn(ConstantInt::get(
+                                      IntegerType::get(Ctx, 32), 1))});
+      }
       ExitIRI->setOperand(Idx, Ext);
     }
   }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 71f43abe534ec0..3578c268b2187f 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -83,6 +83,9 @@ Value *VPLane::getAsRuntimeExpr(IRBuilderBase &Builder,
   llvm_unreachable("Unknown lane kind");
 }
 
+static VPValue NullValue;
+VPValue *VPValue::Null = &NullValue;
+
 VPValue::VPValue(const unsigned char SC, Value *UV, VPDef *Def)
     : SubclassID(SC), UnderlyingVal(UV), Def(Def) {
   if (Def)
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 8dd94a292f7075..c6d9364cd1e9b4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1231,6 +1231,9 @@ class VPInstruction : public VPRecipeWithIRFlags,
     // Returns a scalar boolean value, which is true if any lane of its single
     // operand is true.
     AnyOf,
+    // Extracts the first active lane of a vector, where the first operand is
+    // the predicate, and the second operand is the vector to extract.
+    ExtractFirstActive,
   };
 
 private:
@@ -3662,6 +3665,13 @@ class VPRegionBlock : public VPBlockBase {
   /// VPRegionBlock.
   VPBlockBase *Exiting;
 
+  /// Hold the Early Exit block of the SEME region, if one exists.
+  VPBasicBlock *EarlyExit;
+
+  /// If one exists, this keeps track of the vector early mask that triggered
+  /// the early exit.
+  VPValue *VectorEarlyExitCond;
+
   /// An indicator whether this region is to generate multiple replicated
   /// instances of output IR corresponding to its VPBlockBases.
   bool IsReplicator;
@@ -3670,6 +3680,7 @@ class VPRegionBlock : public VPBlockBase {
   VPRegionBlock(VPBlockBase *Entry, VPBlockBase *Exiting,
                 const std::string &Name = "", bool IsReplicator = false)
       : VPBlockBase(VPRegionBlockSC, Name), Entry(Entry), Exiting(Exiting),
+        EarlyExit(nullptr), VectorEarlyExitCond(nullptr),
         IsReplicator(IsReplicator) {
     assert(Entry->getPredecessors().empty() && "Entry block has predecessors.");
     assert(Exiting->getSuccessors().empty() && "Exit block has successors.");
@@ -3678,6 +3689,7 @@ class VPRegionBlock : public VPBlockBase {
   }
   VPRegionBlock(const std::string &Name = "", bool IsReplicator = false)
       : VPBlockBase(VPRegionBlockSC, Name), Entry(nullptr), Exiting(nullptr),
+        EarlyExit(nullptr), VectorEarlyExitCond(nullptr),
         IsReplicator(IsReplicator) {}
 
   ~VPRegionBlock() override {
@@ -3717,6 +3729,22 @@ class VPRegionBlock : public VPBlockBase {
     ExitingBlock->setParent(this);
   }
 
+  /// Sets the early exit vector mask.
+  void setVectorEarlyExitCond(VPValue *V) {
+    assert(!VectorEarlyExitCond);
+    VectorEarlyExitCond = V;
+  }
+
+  /// Gets the early exit vector mask
+  VPValue *getVectorEarlyExitCond() const { return VectorEarlyExitCond; }
+
+  /// Set the vector early exit block
+  void setEarlyExit(VPBasicBlock *ExitBlock) { EarlyExit = ExitBlock; }
+
+  /// Get the vector early exit block
+  const VPBasicBlock *getEarlyExit() const { return EarlyExit; }
+  VPBasicBlock *getEarlyExit() { return EarlyExit; }
+
   /// Returns the pre-header VPBasicBlock of the loop region.
   VPBasicBlock *getPreheaderVPBB() {
     assert(!isReplicator() && "should only get pre-header of loop regions");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 7f8c560270bc0c..67c2595aabd081 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -57,7 +57,6 @@ bool VPRecipeBase::mayWriteToMemory() const {
     case Instruction::Or:
     case Instruction::ICmp:
     case Instruction::Select:
-    case VPInstruction::AnyOf:
     case VPInstruction::Not:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
@@ -65,6 +64,8 @@ bool VPRecipeBase::mayWriteToMemory() const {
     case VPInstruction::FirstOrderRecurrenceSplice:
     case VPInstruction::LogicalAnd:
     case VPInstruction::PtrAdd:
+    case VPInstruction::AnyOf:
+    case VPInstruction::ExtractFirstActive:
       return false;
     default:
       return true;
@@ -645,7 +646,13 @@ Value *VPInstruction::generate(VPTransformState &State) {
     Value *A = State.get(getOperand(0));
     return Builder.CreateOrReduce(A);
   }
-
+  case VPInstruction::ExtractFirstActive: {
+    Value *Vec = State.get(getOperand(0));
+    Value *Mask = State.get(getOperand(1));
+    Value *Ctz =
+        Builder.CreateCountTrailingZeroElems(Builder.getInt64Ty(), Mask);
+    return Builder.CreateExtractElement(Vec, Ctz);
+  }
   default:
     llvm_unreachable("Unsupported opcode for instruction");
   }
@@ -654,7 +661,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
 bool VPInstruction::isVectorToScalar() const {
   return getOpcode() == VPInstruction::ExtractFromEnd ||
          getOpcode() == VPInstruction::ComputeReductionResult ||
-         getOpcode() == VPInstruction::AnyOf;
+         getOpcode() == VPInstruction::AnyOf ||
+         getOpcode() == VPInstruction::ExtractFirstActive;
 }
 
 bool VPInstruction::isSingleScalar() const {
@@ -816,6 +824,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
   case VPInstruction::AnyOf:
     O << "any-of";
     break;
+  case VPInstruction::ExtractFirstActive:
+    O << "extract-first-active";
+    break;
   default:
     O << Instruction::getOpcodeName(getOpcode());
   }
@@ -833,8 +844,9 @@ void VPInstruction::print(raw_ostream &O, const Twine &Indent,
 void VPIRInstruction::execute(VPTransformState &State) {
   assert((isa<PHINode>(&I) || getNumOperands() == 0) &&
          "Only PHINodes can have extra operands");
-  for (const auto &[Idx, Op] : enumerate(operands())) {
-    VPValue *ExitValue = Op;
+  for (const auto &[Idx, ExitValue] : enumerate(operands())) {
+    if (ExitValue->isNull())
+      continue;
     auto Lane = vputils::isUniformAfterVectorization(ExitValue)
                     ? VPLane::getFirstLane()
                     : VPLane::getLastLaneForVF(State.VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index aacb27f9325d07..9b3b54fde5112e 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -1905,14 +1905,19 @@ void VPlanTransforms::handleUncountableEarlyExit(
   VPValue *EarlyExitNotTakenCond = RecipeBuilder.getBlockInMask(
       OrigLoop->contains(TrueSucc) ? TrueSucc : FalseSucc);
   auto *EarlyExitTakenCond = Builder.createNot(EarlyExitNotTakenCond);
+  LoopRegion->setVectorEarlyExitCond(EarlyExitTakenCond);
   IsEarlyExitTaken =
       Builder.createNaryOp(VPInstruction::AnyOf, {EarlyExitTakenCond});
 
   VPBasicBlock *NewMiddle = new VPBasicBlock("middle.split");
+  VPBasicBlock *EarlyExitVPBB = new VPBasicBlock("vector.early.exit");
   VPBlockUtils::insertOnEdge(LoopRegion, MiddleVPBB, NewM...
[truncated]

@david-arm
Copy link
Contributor Author

PR #88385 is very out of date now, but it also dealt with loops that require a scalar epilogue. I've not attempted to deal with those in this PR, but I am finishing off a downstream patch to handle these too, which will mean we can start vectorising loops in tests/Transforms/LoopVectorizer/multi_early_exit* as well.

I also intend to put up another patch soon that adds support for versioning early exit loops that contain a single load instruction, requiring the correct alignment to ensure safety.

@david-arm
Copy link
Contributor Author

Just for reference here is a link to a WIP patch that adds support for versioning of early exit loops with potentially faulting loads: #120603

@@ -12,7 +12,7 @@ define i64 @one_uncountable_two_countable_same_exit() {
; CHECK-NEXT: [[P2:%.*]] = alloca [1024 x i8], align 1
; CHECK-NEXT: call void @init_mem(ptr [[P1]], i64 1024)
; CHECK-NEXT: call void @init_mem(ptr [[P2]], i64 1024)
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK-NEXT: br label [[LOOP1:%.*]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in file just renames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, reverted the changes!

@@ -14,18 +14,18 @@ define i64 @one_uncountable_two_countable_same_exit_phi_of_consts() {
; CHECK: loop:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[LOOP_INC:%.*]] ], [ 3, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[CMP1:%.*]] = icmp ne i64 [[INDEX]], 64
; CHECK-NEXT: br i1 [[CMP1]], label [[SEARCH:%.*]], label [[LOOP_END:%.*]]
; CHECK-NEXT: br i1 [[CMP1]], label [[SEARCH:%.*]], label [[LOOP_END1:%.*]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes in file just renames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, reverted the changes!

@@ -141,7 +140,6 @@ define i64 @loop_contains_load_after_early_exit(ptr dereferenceable(1024) align(
; CHECK-LABEL: LV: Checking a loop in 'loop_contains_load_after_early_exit'
; CHECK: LV: Found an early exit loop with symbolic max backedge taken count: 63
; CHECK-NEXT: LV: We can vectorize this loop!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use CHECK-NOT: LV: Not vectorizing, otherwise the test will also pass without the code changes? Same above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Comment on lines 215 to 229
// Now that we support vectorising loops with uncountable early exits
// we can end up in situations where VPBB does not dominate the exit
// block. Only do the check if the user is not in a VPIRBasicBlock.
if (!isa<VPIRBasicBlock>(UI->getParent()) &&
!VPDT.dominates(VPBB, UI->getParent())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this case? Ideally we shouldn't come up with cases where there are uses before defs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a use before def, it's an invalid assumption that the block containing the use has a single predecessor. Now that we support vectorising loops with early exits that assumption is broken, for example the user could be a PHI node.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; IIUC whether a block has multiple predecessors or not shouldn't really be the issue. The problem is that for a phi we need to check if the incoming value dominates the incoming block.

We have the same issue for other phi-like recipes, see TODO above (line 210). I think it should be sufficient to skip VPIRInstructions wrapping a phi instruction there as well?

That way we would preserve def-use checks for other recipes in VPIRBasicBlocks.

I also put up #124838 separately to extend def-use verification to phis, which should enable removing the early continue in all cases

Comment on lines 3668 to 3731
/// Hold the Early Exit block of the SEME region, if one exists.
VPBasicBlock *EarlyExit;

/// If one exists, this keeps track of the vector early mask that triggered
/// the early exit.
VPValue *VectorEarlyExitCond;

/// An indicator whether this region is to generate multiple replicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC those are only needed because currently the final values are computed outside VPlan, right?

Could we instead either force a wide IV and extract the desired lane and add the required recipes directly on VPlan-construction?

Or compute the end value for the user directly in VPlan (see #112145 which computes the end values for the resume phis #112147 which removes fixupIVUsers)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to look into this, although I'm not sure what you mean by forcing a wide IV? I'll take a look at #112145 as well - do you think this is close to being merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten this so that we don't need to stash the EarlyExit and VectorEarlyExitCond variables. Instead, I've added a getEarlyExit that walks through the VPBasicBlocks to find the vector.early.exit block. I also added code in addUsersInExitBlocks to work backwards from the early exit block to find the early exit mask. It's uglier, but it does avoid having to track these variables and maintain them during cloning, etc. Hope that's prefereable for you!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I still need to take a look at the latest update. By forcing a wide IV I meant is ensuring that there always will be a vector induction, from which we should be able to extract the lane for the early exit I think.

One way to ensure that would be to not mark induction updates as uniform if there's an user of the induction outside the loop for early exit loops (something like 5c0d7aa)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, looking at the latest version it is effectively doing an extract of the last active lane, right?

@@ -141,7 +140,6 @@ define i64 @loop_contains_load_after_early_exit(ptr dereferenceable(1024) align(
; CHECK-LABEL: LV: Checking a loop in 'loop_contains_load_after_early_exit'
; CHECK: LV: Found an early exit loop with symbolic max backedge taken count: 63
; CHECK-NEXT: LV: We can vectorize this loop!
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -14,18 +14,18 @@ define i64 @one_uncountable_two_countable_same_exit_phi_of_consts() {
; CHECK: loop:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[LOOP_INC:%.*]] ], [ 3, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[CMP1:%.*]] = icmp ne i64 [[INDEX]], 64
; CHECK-NEXT: br i1 [[CMP1]], label [[SEARCH:%.*]], label [[LOOP_END:%.*]]
; CHECK-NEXT: br i1 [[CMP1]], label [[SEARCH:%.*]], label [[LOOP_END1:%.*]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, reverted the changes!

@@ -12,7 +12,7 @@ define i64 @one_uncountable_two_countable_same_exit() {
; CHECK-NEXT: [[P2:%.*]] = alloca [1024 x i8], align 1
; CHECK-NEXT: call void @init_mem(ptr [[P1]], i64 1024)
; CHECK-NEXT: call void @init_mem(ptr [[P2]], i64 1024)
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK-NEXT: br label [[LOOP1:%.*]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, reverted the changes!

// block predecessor of the exit block. This block needs to have the same
// parent loop as the exit block itself.
VPRegionBlock *VectorRegion = State.Plan->getVectorLoopRegion();
if (BasicBlock *OrigEarlyExitBB = Legal->getUncountableEarlyExitBlock()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to fix a bug in VPBasicBlock::execute that incorrectly adds some VPBasicBlocks to the wrong parent loop.

david-arm referenced this pull request Jan 7, 2025
Legalize extract-from-ends using uniform VPReplicateRecipe of wide
inductions to use regular VPReplicateRecipe, so the correct end value
is available.

Fixes #121745.
@david-arm david-arm force-pushed the ee_live_outs branch 2 times, most recently from 55d6e4f to c6c4e77 Compare January 9, 2025 14:01
@david-arm
Copy link
Contributor Author

Hi @fhahn, I've managed to rewrite the patch so that it doesn't depend upon #112147 although it does leave the IR in an unoptimised state. I think this is fine for now, and strangely enough for one of the std::find loops I care about it doesn't seem to affect performance that much. The recipes can be optimised in a later patch in a similar way to how you plan to deal with normal live-outs in #112147. I've had to change fixupIVUsers so that it doesn't crash when vectorising early exit loops, but this won't last very long since #112147 will be removing this code. I don't think this patch will cause any issues for #112147.

I've also updated the documentation to represent the new VPlan structure.

Comment on lines 408 to 410
(``middle.split`` below) branches to the early exit block via an intermediate
block (``vector.early.exit`` below). Otherwise ``middle.block`` selects between
the exit block from the latch or the scalar remainder loop.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to also explain why the intermediate block is needed, i.e. to compute the exit value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -19,23 +19,27 @@ compound=true
"middle.split"
]
N4 -> N5 [ label=""]
N4 -> N6 [ label=""]
N4 -> N7 [ label=""]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comment on the generated file, it looks like the size in pixels changed, just checking if that was intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't specifically try to change the size in pixels - I just ran dot -Tpng ../llvm/docs/vplan-early-exit.dot. Is there a recommended command for regenerating this? I couldn't see any documentation about this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that dot -Tpng -Gsize=15,14\! -Gdpi=100 vplan-early-exit.dot -o vplan-ee.png resulted in a larger image, though not the exact size I requested. May require some experimenting to get a good image. You could put the size and dpi directives in the .dot file instead to preserve it the next time someone modifies it.

It seems graphviz isn't terribly cooperative about this, and you need other tools to rescale or pad the edges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've regenerated it with that command and hopefully the image looks bigger now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't realize that there were different defaults on different platforms/version. The larger versions looks good, thanks!

Comment on lines 8971 to 8973
assert(!Plan.getEarlyExit() &&
"Cannot handle reductions or first-order recurrences with "
"uncountable early exits");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be added separately, as NFC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will do although I'll have to check for early exit differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #122732. Will update this PR soon!

if (PredVPBB != MiddleVPBB) {
assert(ExitIRI->getParent()->getNumPredecessors() <= 2);

// Cache the early exit mask
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Cache the early exit mask
// Lookup and cache the early exit mask.

Caching is done to avoid repeated lookups, i.e. compile time perf opt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's right. If we don't store the mask in the VPRegionBlock then we have to calculate it manually and I was hoping to avoid constant recalculation of the same value. I could also just calculate this prior to the loop, but then the work is unnecessary if there aren't any early exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 504 to 512
if (this == State->Plan->getEarlyExit()) {
// If this is the vector early exit block then it has a single successor,
// which is the uncountable early exit block of the original loop. The
// parent loop for the exit block may not be the same as the parent loop
// of the vectorised loop, so we have to treat this differently.
Loop *EEL = State->LI->getLoopFor(State->CFG.UncountableEarlyExitBB);
if (EEL)
EEL->addBasicBlockToLoop(NewBB, *State->LI);
} else if (State->CurrentParentLoop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to frame this more generally if possible.

IIUC the problem here is that now we introduce a VPBB which may be outside the enclosing loop (and in the same loop as the destination VPIRBB.

This is the case when we have a VPBB that only branches to a single exit block and then it must be in the same loop as this VPIRBB. Can we check for that instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do it that way I think, but I guess that requires walking through the list returned by getExitBlocks() every time this function is called. I imagine that's quite a bit of compile time cost, but I suppose I can limit this to when State->CFG.UncountableEarlyExitB is non-null if that's ok?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I've added a new isExitBlock function that I call on the current block's single successor (if there is one).

david-arm added a commit to david-arm/llvm-project that referenced this pull request Jan 13, 2025
This patch is split off llvm#120567, adding asserts in
addScalarResumePhis and addExitUsersForFirstOrderRecurrences
that the loop does not contain an uncountable early exit,
since the code cannot yet handle them correctly.
hstk30-hw pushed a commit that referenced this pull request Jan 15, 2025
This patch is split off #120567, adding asserts in
addScalarResumePhis and addExitUsersForFirstOrderRecurrences
that the loop does not contain an uncountable early exit,
since the code cannot yet handle them correctly.
@@ -407,6 +407,11 @@ class LoopVectorizationLegality {

/// Returns the destination of an uncountable early exiting block.
BasicBlock *getUncountableEarlyExitBlock() const {
if (!HasUncountableEarlyExit) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to have a separate HasUncountableEarlyExit variable instead of just checking whether the list of uncountable exit(ing) blocks isn't empty? It just seems that having a single source of truth is better than repeatedly asserting that they agree.

It may also be better to store pairs of (Exiting,Exit) blocks instead of two separate vectors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split this off into #123219

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left this code unchanged for now, since it's being dealt with in #123219. If that patch lands first I'll rebase this one.

@david-arm
Copy link
Contributor Author

Rebase

Comment on lines 3668 to 3731
/// Hold the Early Exit block of the SEME region, if one exists.
VPBasicBlock *EarlyExit;

/// If one exists, this keeps track of the vector early mask that triggered
/// the early exit.
VPValue *VectorEarlyExitCond;

/// An indicator whether this region is to generate multiple replicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, looking at the latest version it is effectively doing an extract of the last active lane, right?

@david-arm
Copy link
Contributor Author

david-arm commented Jan 21, 2025

Just double checking, looking at the latest version it is effectively doing an extract of the last active lane, right?

Hi @fhahn, no it extracts the first active lane by counting the number of trailing zeroes in the mask that triggered the exit. So it's calculating the element corresponding to the first true value in the mask. If it was extracting the last active lane then it would be VF - number of leading zeroes.

If you compile it to assembly you'll see for SVE it generates

brkb p0, p1, p0
cntp x0, p0

to get the lane index.

@fhahn
Copy link
Contributor

fhahn commented Jan 21, 2025

Just double checking, looking at the latest version it is effectively doing an extract of the last active lane, right?

Hi @fhahn, no it extracts the first active lane by counting the number of trailing zeroes in the mask that triggered the exit. So it's calculating the element corresponding to the first true value in the mask. If it was extracting the last active lane then it would be VF - number of leading zeroes.

Yeah I meant just that it only extracts from last valid lane of the vector IV, not how exactly that is computed (vs computing the end value based on the IV end value)

Yeah that's true. There are two different ways of thinking about this:

  1. We're extracting the lane corresponding to the first true value in the exit mask, or
  2. We're extracting the last valid lane of the loop-defined vector variable, since values in higher lanes correspond to values beyond the loop exit.

I've named the new VPInstruction in terms of the former because I thought that more closely matches the operation. Although I can see it depends upon your perspective!

IntegerType::get(Ctx, 32), 1))});
VPValue *Ext;
VPBasicBlock *PredVPBB =
cast<VPBasicBlock>(ExitIRI->getParent()->getPredecessors()[Idx]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not keen on having the predecessors being looked up like this -- I know the operands to the exit phi recipe are added by walking the blocks in collectUsersInExitBlocks, but it feels fragile. I won't hold it up for that though, maybe this is something to improve in the future when we add support for additional uncounted exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also share your concerns, but at a previous vectoriser call in November @fhahn explained this was quite normal in VPlan that assumptions are made regarding the ordering of successors and predecessors. We already do this when setting up the middle block, etc. It was definitely a problem until I fixed up some code that replaced the VPBasicBlock middle block with a VPIRBasicBlock and maintained the ordering. As we support more exits it may become harder to maintain the ordering during VPlan optimisations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have all the required information available in handleUncountableEarlyExit. Could we simply update the early exit live-outs directly there and avoid the complication here?

Conceptually it should be fine to have collectUsersInExitBlocks only collect exit values with incoming values defined in a loop region; this would allow handleUncountableEarlyExit to create the required extract directly for the early exit I think.

I put up #123819 to also avoid collecting live-ins in collectUsersInExitBlocks and added an assert that other values to fix are defined in the vector loop region. The assert could then be changed to an early continue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to take a look, but as mentioned on #123819 I think I'd like some time to test this out before #123819 lands, in case for some reason it's not possible to do this in handleUncountableEarlyExit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am specifically worried about this code, which I always seem to trip over during a rebase:

void VPIRInstruction::execute(VPTransformState &State) {
  assert((isa<PHINode>(&I) || getNumOperands() == 0) &&
         "Only PHINodes can have extra operands");
  for (const auto &[Idx, Op] : enumerate(operands())) {

which specifically assumes:

  1. The number of operands exactly matches the number of predecessors, and
  2. The operands are added in exactly the same order as the order of predecessors.

Anything in the code that breaks these assumptions also breaks early exit loops. By splitting out the code to deal with live-outs in handleUncountableEarlyExit I am worried that we have break 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented your suggestion for handling the live-outs in handleUncountableEarlyExit. I did this in such a way that the LLVM test suite continues to pass when forcing early exit vectorisation and ignoring potentially faulting loads, and also made sure all the tests in llvm/test/Transforms/LoopVectorize pass. There are still ordering issues that need handling delicately, which is why I had to collect the normal exit values prior to calling handleUncountableEarlyExit, otherwise when verifying the IR we crash due to incoming PHI values being matched up with the wrong incoming block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good, so should #123819 land in preparation?

Copy link
Contributor Author

@david-arm david-arm Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I didn't base my latest changes off #123819. Ideally I'd prefer to get it all working in this PR so I can be absolutely sure that there are no test failures. At first glance it looks like #123819 will make things more difficult (and code more complicated) for the general case that I care about here. If you take a look at my version in this patch I split out handling of live-outs according to the incoming block which works for all cases (same or different exit blocks), whereas #123819 is trying to handle both the incoming values from the latch and the early exit in handleUncountableEarlyExit only for the special case where exit blocks are the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea behind the changes to handleUncountableEarlyExit in #123819 to have a temporary mechanism to test the special case and defend the code? If so, that would be fine but I think I will need to rewrite that code for the general case that I care about in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, instead of handling everything in handleUncountableEarlyExit here it requires running the collection early.

The smallish additional complexity in handleUncountableEarlyExit in #123819 for early exits has the advantage of not having this ordering requirement, but I don't have a strong preference.

The main benefit from #123819 IMO would be to separately move where we update the currently supported exit users and test/defend the removal of the code in collectUser../addUsers.. independently before adding support for new exit values.

@@ -501,8 +501,15 @@ void VPBasicBlock::execute(VPTransformState *State) {
UnreachableInst *Terminator = State->Builder.CreateUnreachable();
// Register NewBB in its loop. In innermost loops its the same for all
// BB's.
if (State->CurrentParentLoop)
State->CurrentParentLoop->addBasicBlockToLoop(NewBB, *State->LI);
Loop *ParentLoop = State->CurrentParentLoop;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to improve tracking the CurrentParentLoop instead of working around it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible, although you'll still end up with similar looking code in VPlan::execute where we execute the blocks:

  for (VPBlockBase *Block : RPOT)
    Block->execute(State);

It's worth bearing in mind that in VPRegionBlock::execute we also set the ParentLoop:

  if (!isReplicator()) {
    // Create and register the new vector loop.
    Loop *PrevLoop = State->CurrentParentLoop;
    State->CurrentParentLoop = State->LI->AllocateLoop();

and VPRegionBlock is included in that list. So depending upon how the blocks are traversed you may need to cache the previous value each time, i.e. something like:

  for (VPBlockBase *Block : RPOT) {
    Loop *OldParentLoop = State->ParentLoop;
    if (Block->getSingleSuccessor() && isExitBlock(Block->getSingleSuccessor()))
      State->ParentLoop = ... parent for exit ...
    Block->execute(State);
    State->ParentLoop = OldParentLoop;
  }

I'm honestly not sure if that's any better? The only way to do this neatly I think is if you deal with the VP blocks with a single successor exit block in a different loop, i.e.

  for (VPBlockBase *Block : RPOT) {
    if (!Block->getSingleSuccessor() || !isExitBlock(Block->getSingleSuccessor()))
      Block->execute(State);
  }
  for (VPBlockBase *Block : RPOT) {
    if (Block->getSingleSuccessor() && isExitBlock(Block->getSingleSuccessor())) {
      State->ParentLoop = ... exit block parent loop ...
      Block->execute(State);
    }
  }

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just double checking, looking at the latest version it is effectively doing an extract of the last active lane, right?

Hi @fhahn, no it extracts the first active lane by counting the number of trailing zeroes in the mask that triggered the exit. So it's calculating the element corresponding to the first true value in the mask. If it was extracting the last active lane then it would be VF - number of leading zeroes.

Yeah I meant just that it only extracts from last valid lane of the vector IV, not how exactly that is computed (vs computing the end value based on the IV end value)

Yeah that's true. There are two different ways of thinking about this:

  1. We're extracting the lane corresponding to the first true value in the exit mask, or
  2. We're extracting the last valid lane of the loop-defined vector variable, since values in higher lanes correspond to values beyond the loop exit.

I've named the new VPInstruction in terms of the former because I thought that more closely matches the operation. Although I can see it depends upon your perspective!

OK great! (Was originally a bit confused as it looks like you modified my original comment but I assumed you meant to post this as new response :)

@@ -3938,6 +3941,22 @@ class VPlan {
VPRegionBlock *getVectorLoopRegion();
const VPRegionBlock *getVectorLoopRegion() const;

/// Get the vector early exit block
VPBasicBlock *getEarlyExit() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is isn't needed any longer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, removed thanks!


// The block must be a successor of the region block.
if (llvm::is_contained(
vp_depth_first_shallow(RegionBlock->getSingleSuccessor()), VPBB)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I completely follow the logic here. We return true if VPBB is any block after the middle block, basically relying on the checks above to filter out all non-VPIRBBs or VPIRBBs with successors.

Simpler to return isa<VPIRBasicBlock>(VPBB) && VPBB->getNumSuccessors() == 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's fair enough. I wasn't sure if those two checks would always be enough, but it makes sense.

@david-arm
Copy link
Contributor Author

Rebase + deal with conflicts. NOTE: due to a recent commit (b7286db) same_exit_block_pre_inc_use1_reverse now starts vectorising too.

@david-arm
Copy link
Contributor Author

Hi @fhahn, the Windows build is complaining about getExitBlocks that is called from the new isExitBlock function. The error is:

C:\BuildTools\VC\Tools\MSVC\14.29.30133\include\xutility(1378): error C2280: 'llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl<false>::type> &llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl<false>::type>::operator =(const llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl<false>::type> &)': attempting to reference a deleted function
        with
...

It's not immediately obvious to me that this is a problem with isExitBlock and possibly a problem with getExitBlocks or some assumption about the function is used. I'll investigate this a bit further, but I may just have to remove the assert for now.

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, a few small comments left

Hi @fhahn, the Windows build is complaining about getExitBlocks that is called from the new isExitBlock function. The error is:

C:\BuildTools\VC\Tools\MSVC\14.29.30133\include\xutility(1378): error C2280: 'llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl::type> &llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl::type>::operator =(const llvm::filter_iterator_impl<llvm::mapped_iterator<IteratorT,FuncTy,BlockTy *>,llvm::VPlan::getExitBlocks::<lambda_38c1aefa419e66528297efd947a86d73>,llvm::detail::fwd_or_bidi_tag_impl::type> &)': attempting to reference a deleted function
with
...
It's not immediately obvious to me that this is a problem with isExitBlock and possibly a problem with getExitBlocks or some assumption about the function is used. I'll investigate this a bit further, but I may just have to remove the assert for now.

Sounds good, it looks like it may not be playing nicely with is_contained for some reason. I also don't have a windows machine to check this.

Comment on lines 703 to 705
Value *Ctz =
Builder.CreateCountTrailingZeroElems(Builder.getInt64Ty(), Mask);
return Builder.CreateExtractElement(Vec, Ctz);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be helpful for IR readability to use IR value names here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, good suggestion!

Comment on lines 215 to 229
// Now that we support vectorising loops with uncountable early exits
// we can end up in situations where VPBB does not dominate the exit
// block. Only do the check if the user is not in a VPIRBasicBlock.
if (!isa<VPIRBasicBlock>(UI->getParent()) &&
!VPDT.dominates(VPBB, UI->getParent())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; IIUC whether a block has multiple predecessors or not shouldn't really be the issue. The problem is that for a phi we need to check if the incoming value dominates the incoming block.

We have the same issue for other phi-like recipes, see TODO above (line 210). I think it should be sufficient to skip VPIRInstructions wrapping a phi instruction there as well?

That way we would preserve def-use checks for other recipes in VPIRBasicBlocks.

I also put up #124838 separately to extend def-use verification to phis, which should enable removing the early continue in all cases

This work feeds part of PR llvm#88385, and adds support for vectorising
loops with uncountable early exits and outside users of loop-defined
variables. When calculating the final value from an uncountable early
exit we need to calculate the vector lane that triggered the exit,
and hence determine the value at the point we exited.

All code for calculating the last value when exiting the loop early
now lives in a new vector.early.exit block, which sits between the
middle.split block and the original exit block. Doing this required
the following fix:

* The vplan verifier incorrectly assumed that the block containing
a definition always dominates the block of the user. That's not true
if you can arrive at the use block from multiple incoming blocks.
This is possible for early exit loops where both the early exit and
the latch jump to the same block.

I've added a new ExtractFirstActive VPInstruction that extracts the
first active lane of a vector, i.e. the lane of the vector predicate
that triggered the exit.

NOTE: The IR generated for dealing with live-outs from early exit
loops is unoptimised, as opposed to normal loops. This inevitably
leads to poor quality code, but this can be fixed up later.
@david-arm
Copy link
Contributor Author

We have the same issue for other phi-like recipes, see TODO above (line 210). I think it should be sufficient to skip VPIRInstructions wrapping a phi instruction there as well?

That way we would preserve def-use checks for other recipes in VPIRBasicBlocks.

Yep, I've tried to do that in the latest patch. That will then be superseded by #124838 (or whichever comes first).

@fhahn fhahn requested a review from ayalz January 29, 2025 20:09
Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ,thanks

@david-arm david-arm merged commit 3bc2dad into llvm:main Jan 30, 2025
9 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jan 30, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-libc-amdgpu-runtime running on omp-vega20-1 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/73/builds/12723

Here is the relevant piece of the build log for the reference
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/host_as_target.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
# RUN: at line 8
/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/host_as_target.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/host_as_target.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a && /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/host_as_target.c.tmp | /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/host_as_target.c
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/host_as_target.c -o /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/host_as_target.c.tmp -Xoffload-linker -lc -Xoffload-linker -lm /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/Output/host_as_target.c.tmp
# note: command had no output on stdout or stderr
# error: command failed with exit status: -11
# executed command: /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/host_as_target.c
# .---command stderr------------
# | FileCheck error: '<stdin>' is empty.
# | FileCheck command line:  /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-libc-amdgpu-runtime/llvm.src/offload/test/offloading/host_as_target.c
# `-----------------------------
# error: command failed with exit status: 2

--

********************


@david-arm david-arm deleted the ee_live_outs branch April 7, 2025 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants