[VPlan] Implement VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction. (NFC) #113903

Open · wants to merge 58 commits into main

Conversation

@ElvisWang123 (Contributor) commented Oct 28, 2024

This patch implements the VPlan-based cost model for VPReduction, VPExtendedReduction, and VPMulAccumulateReduction.

With this patch, the reduction cost is calculated by the VPlan-based cost model, so the reduction costs in precomputeCosts() can be removed.

Ref: Original instruction based implementation: https://reviews.llvm.org/D93476

@llvmbot (Member) commented Oct 28, 2024

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Elvis Wang (ElvisWang123)

Changes

This patch implements VPlan-based pattern matching for ExtendedReduction and MulAccReduction. In the reduction patterns below, the extend and mul instructions can be folded into the reduction instruction, so their cost becomes free (see the illustrative source loop after the pattern list).

We add FoldedRecipes to the VPCostContext to record recipes that can be folded into other recipes.

ExtendedReductionPatterns:
reduce(ext(...))
MulAccReductionPatterns:
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))
reduce.add(ext(mul(...)))
reduce.add(ext(mul(ext(...), ext(...))))
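
For illustration only, a hypothetical C++ source loop (not taken from the patch; names are made up) whose vectorized form contains the reduce.add(mul(ext(...), ext(...))) pattern above. Some targets (e.g. ARM MVE) can lower this to a single multiply-accumulate reduction instruction.

#include <cstddef>
#include <cstdint>

// sum += sext(a[i]) * sext(b[i]); the extends and the multiply fold into the
// add reduction when the target supports a fused multiply-accumulate reduction.
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; ++i)
    sum += int32_t(a[i]) * int32_t(b[i]);
  return sum;
}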

Ref: Original instruction based implementation:
https://reviews.llvm.org/D93476

This patch is based on #113902 .


Patch is 21.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113903.diff

5 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (-45)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+1-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+127-12)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll (+36-36)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 60a94ca1f86e42..483e039fe133d6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7303,51 +7303,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
       Cost += ReductionCost;
       continue;
     }
-
-    const auto &ChainOps = RdxDesc.getReductionOpChain(RedPhi, OrigLoop);
-    SetVector<Instruction *> ChainOpsAndOperands(ChainOps.begin(),
-                                                 ChainOps.end());
-    auto IsZExtOrSExt = [](const unsigned Opcode) -> bool {
-      return Opcode == Instruction::ZExt || Opcode == Instruction::SExt;
-    };
-    // Also include the operands of instructions in the chain, as the cost-model
-    // may mark extends as free.
-    //
-    // For ARM, some of the instruction can folded into the reducion
-    // instruction. So we need to mark all folded instructions free.
-    // For example: We can fold reduce(mul(ext(A), ext(B))) into one
-    // instruction.
-    for (auto *ChainOp : ChainOps) {
-      for (Value *Op : ChainOp->operands()) {
-        if (auto *I = dyn_cast<Instruction>(Op)) {
-          ChainOpsAndOperands.insert(I);
-          if (I->getOpcode() == Instruction::Mul) {
-            auto *Ext0 = dyn_cast<Instruction>(I->getOperand(0));
-            auto *Ext1 = dyn_cast<Instruction>(I->getOperand(1));
-            if (Ext0 && IsZExtOrSExt(Ext0->getOpcode()) && Ext1 &&
-                Ext0->getOpcode() == Ext1->getOpcode()) {
-              ChainOpsAndOperands.insert(Ext0);
-              ChainOpsAndOperands.insert(Ext1);
-            }
-          }
-        }
-      }
-    }
-
-    // Pre-compute the cost for I, if it has a reduction pattern cost.
-    for (Instruction *I : ChainOpsAndOperands) {
-      auto ReductionCost = CM.getReductionPatternCost(
-          I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);
-      if (!ReductionCost)
-        continue;
-
-      assert(!CostCtx.SkipCostComputation.contains(I) &&
-             "reduction op visited multiple times");
-      CostCtx.SkipCostComputation.insert(I);
-      LLVM_DEBUG(dbgs() << "Cost of " << ReductionCost << " for VF " << VF
-                        << ":\n in-loop reduction " << *I << "\n");
-      Cost += *ReductionCost;
-    }
   }
 
   // Pre-compute the costs for branches except for the backedge, as the number
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ab8fb45c351b4..49e93e1e7b5501 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -785,7 +785,7 @@ void VPRegionBlock::execute(VPTransformState *State) {
 
 InstructionCost VPBasicBlock::cost(ElementCount VF, VPCostContext &Ctx) {
   InstructionCost Cost = 0;
-  for (VPRecipeBase &R : Recipes)
+  for (VPRecipeBase &R : reverse(Recipes))
     Cost += R.cost(VF, Ctx);
   return Cost;
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 6a192bdf01c4ff..b26fd460a278f5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -725,6 +725,8 @@ struct VPCostContext {
   LLVMContext &LLVMCtx;
   LoopVectorizationCostModel &CM;
   SmallPtrSet<Instruction *, 8> SkipCostComputation;
+  /// Contains recipes that are folded into other recipes.
+  SmallDenseMap<ElementCount, SmallPtrSet<VPRecipeBase *, 4>, 4> FoldedRecipes;
 
   VPCostContext(const TargetTransformInfo &TTI, const TargetLibraryInfo &TLI,
                 Type *CanIVTy, LoopVectorizationCostModel &CM)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0eb4f7c7c88cee..5f59a1e96df9f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -299,7 +299,9 @@ InstructionCost VPRecipeBase::cost(ElementCount VF, VPCostContext &Ctx) {
     UI = &WidenMem->getIngredient();
 
   InstructionCost RecipeCost;
-  if (UI && Ctx.skipCostComputation(UI, VF.isVector())) {
+  if ((UI && Ctx.skipCostComputation(UI, VF.isVector())) ||
+      (Ctx.FoldedRecipes.contains(VF) &&
+       Ctx.FoldedRecipes.at(VF).contains(this))) {
     RecipeCost = 0;
   } else {
     RecipeCost = computeCost(VF, Ctx);
@@ -2188,30 +2190,143 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
   unsigned Opcode = RdxDesc.getOpcode();
 
-  // TODO: Support any-of and in-loop reductions.
+  // TODO: Support any-of reductions.
   assert(
       (!RecurrenceDescriptor::isAnyOfRecurrenceKind(RdxKind) ||
        ForceTargetInstructionCost.getNumOccurrences() > 0) &&
       "Any-of reduction not implemented in VPlan-based cost model currently.");
-  assert(
-      (!cast<VPReductionPHIRecipe>(getOperand(0))->isInLoop() ||
-       ForceTargetInstructionCost.getNumOccurrences() > 0) &&
-      "In-loop reduction not implemented in VPlan-based cost model currently.");
 
   assert(ElementTy->getTypeID() == RdxDesc.getRecurrenceType()->getTypeID() &&
          "Inferred type and recurrence type mismatch.");
 
-  // Cost = Reduction cost + BinOp cost
-  InstructionCost Cost =
+  // BaseCost = Reduction cost + BinOp cost
+  InstructionCost BaseCost =
       Ctx.TTI.getArithmeticInstrCost(Opcode, ElementTy, CostKind);
   if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RdxKind)) {
     Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
-    return Cost + Ctx.TTI.getMinMaxReductionCost(
-                      Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+    BaseCost += Ctx.TTI.getMinMaxReductionCost(
+        Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+  } else {
+    BaseCost += Ctx.TTI.getArithmeticReductionCost(
+        Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
   }
 
-  return Cost + Ctx.TTI.getArithmeticReductionCost(
-                    Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+  using namespace llvm::VPlanPatternMatch;
+  auto GetMulAccReductionCost =
+      [&](const VPReductionRecipe *Red) -> InstructionCost {
+    VPValue *A, *B;
+    InstructionCost InnerExt0Cost = 0;
+    InstructionCost InnerExt1Cost = 0;
+    InstructionCost ExtCost = 0;
+    InstructionCost MulCost = 0;
+
+    VectorType *SrcVecTy = VectorTy;
+    Type *InnerExt0Ty;
+    Type *InnerExt1Ty;
+    Type *MaxInnerExtTy;
+    bool IsUnsigned = true;
+    bool HasOuterExt = false;
+
+    auto *Ext = dyn_cast_if_present<VPWidenCastRecipe>(
+        Red->getVecOp()->getDefiningRecipe());
+    VPRecipeBase *Mul;
+    // Try to match outer extend reduce.add(ext(...))
+    if (Ext && match(Ext, m_ZExtOrSExt(m_VPValue())) &&
+        cast<VPWidenCastRecipe>(Ext)->getNumUsers() == 1) {
+      IsUnsigned =
+          Ext->getOpcode() == Instruction::CastOps::ZExt ? true : false;
+      ExtCost = Ext->computeCost(VF, Ctx);
+      Mul = Ext->getOperand(0)->getDefiningRecipe();
+      HasOuterExt = true;
+    } else {
+      Mul = Red->getVecOp()->getDefiningRecipe();
+    }
+
+    // Match reduce.add(mul())
+    if (Mul && match(Mul, m_Mul(m_VPValue(A), m_VPValue(B))) &&
+        cast<VPWidenRecipe>(Mul)->getNumUsers() == 1) {
+      MulCost = cast<VPWidenRecipe>(Mul)->computeCost(VF, Ctx);
+      auto *InnerExt0 =
+          dyn_cast_if_present<VPWidenCastRecipe>(A->getDefiningRecipe());
+      auto *InnerExt1 =
+          dyn_cast_if_present<VPWidenCastRecipe>(B->getDefiningRecipe());
+      bool HasInnerExt = false;
+      // Try to match inner extends.
+      if (InnerExt0 && InnerExt1 &&
+          match(InnerExt0, m_ZExtOrSExt(m_VPValue())) &&
+          match(InnerExt1, m_ZExtOrSExt(m_VPValue())) &&
+          InnerExt0->getOpcode() == InnerExt1->getOpcode() &&
+          (InnerExt0->getNumUsers() > 0 &&
+           !InnerExt0->hasMoreThanOneUniqueUser()) &&
+          (InnerExt1->getNumUsers() > 0 &&
+           !InnerExt1->hasMoreThanOneUniqueUser())) {
+        InnerExt0Cost = InnerExt0->computeCost(VF, Ctx);
+        InnerExt1Cost = InnerExt1->computeCost(VF, Ctx);
+        Type *InnerExt0Ty = Ctx.Types.inferScalarType(InnerExt0->getOperand(0));
+        Type *InnerExt1Ty = Ctx.Types.inferScalarType(InnerExt1->getOperand(0));
+        Type *MaxInnerExtTy = InnerExt0Ty->getIntegerBitWidth() >
+                                      InnerExt1Ty->getIntegerBitWidth()
+                                  ? InnerExt0Ty
+                                  : InnerExt1Ty;
+        SrcVecTy = cast<VectorType>(ToVectorTy(MaxInnerExtTy, VF));
+        IsUnsigned = true;
+        HasInnerExt = true;
+      }
+      InstructionCost MulAccRedCost = Ctx.TTI.getMulAccReductionCost(
+          IsUnsigned, ElementTy, SrcVecTy, CostKind);
+      // Check if folding ext/mul into MulAccReduction is profitable.
+      if (MulAccRedCost.isValid() &&
+          MulAccRedCost <
+              ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost) {
+        if (HasInnerExt) {
+          Ctx.FoldedRecipes[VF].insert(InnerExt0);
+          Ctx.FoldedRecipes[VF].insert(InnerExt1);
+        }
+        Ctx.FoldedRecipes[VF].insert(Mul);
+        if (HasOuterExt)
+          Ctx.FoldedRecipes[VF].insert(Ext);
+        return MulAccRedCost;
+      }
+    }
+    return InstructionCost::getInvalid();
+  };
+
+  // Match reduce(ext(...))
+  auto GetExtendedReductionCost =
+      [&](const VPReductionRecipe *Red) -> InstructionCost {
+    VPValue *VecOp = Red->getVecOp();
+    VPValue *A;
+    if (match(VecOp, m_ZExtOrSExt(m_VPValue(A))) && VecOp->getNumUsers() == 1) {
+      VPWidenCastRecipe *Ext =
+          cast<VPWidenCastRecipe>(VecOp->getDefiningRecipe());
+      bool IsUnsigned = Ext->getOpcode() == Instruction::CastOps::ZExt;
+      InstructionCost ExtCost = Ext->computeCost(VF, Ctx);
+      auto *ExtVecTy =
+          cast<VectorType>(ToVectorTy(Ctx.Types.inferScalarType(A), VF));
+      InstructionCost ExtendedRedCost = Ctx.TTI.getExtendedReductionCost(
+          Opcode, IsUnsigned, ElementTy, ExtVecTy, RdxDesc.getFastMathFlags(),
+          CostKind);
+      // Check if folding ext into ExtendedReduction is profitable.
+      if (ExtendedRedCost.isValid() && ExtendedRedCost < ExtCost + BaseCost) {
+        Ctx.FoldedRecipes[VF].insert(Ext);
+        return ExtendedRedCost;
+      }
+    }
+    return InstructionCost::getInvalid();
+  };
+
+  // Match MulAccReduction patterns.
+  InstructionCost MulAccCost = GetMulAccReductionCost(this);
+  if (MulAccCost.isValid())
+    return MulAccCost;
+
+  // Match ExtendedReduction patterns.
+  InstructionCost ExtendedCost = GetExtendedReductionCost(this);
+  if (ExtendedCost.isValid())
+    return ExtendedCost;
+
+  // Default cost.
+  return BaseCost;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..f2e36399c85f5d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -6,26 +6,26 @@ define void @i8_factor_2(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_2'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
 for.body:
   %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
   %p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
 entry:
   br label %for.body
 ; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
 ; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %...
[truncated]

@davemgreen requested a review from david-arm on October 28, 2024 12:57
@davemgreen (Collaborator)

The way I think I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirrored what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model.

I'm not sure if that is still the current plan or not.

@fhahn (Contributor) commented Oct 28, 2024

The way I think I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirrored what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model.

Yes I think ideally we would model the add-extend/mul-extend operations explicitly in VPlan, especially if matching it in VPlan would require changes to the order in which the cost is computed. Would add-extend/mul-extend be sufficient or would other recipes be needed as well?

@ElvisWang123 (Contributor, Author)

I think currently we need mul-extend-reduction and extend-reduction recipes to model these reduction patterns in VPlan.

Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code in their execute() functions, since we lack middle-end IR for these patterns. The new recipes would still need to generate all the vector instructions for the recipes that have been folded into them.

@fhahn (Contributor) commented Oct 30, 2024

I think currently we need mul-extend-reduction and extend-reduction recipes to model these reduction patterns in VPlan.

@davemgreen do you think those would be enough?

Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code in their execute() functions, since we lack middle-end IR for these patterns. The new recipes would still need to generate all the vector instructions for the recipes that have been folded into them.

This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost-computation. Before code-gen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305
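
A toy, self-contained sketch of that gradual-lowering idea (hypothetical names, deliberately not the VPlan API): one abstract bundled node is costed as a unit, then expanded into the concrete operations just before code generation.

#include <iostream>
#include <string>
#include <vector>

struct Node { std::string Kind; };

// While planning, the whole mul-extend-reduction pattern is one abstract node,
// so the cost model queries a single cost. Right before codegen it is expanded
// into the concrete widen-cast / widen-mul / reduction operations.
std::vector<Node> lowerToConcrete(const Node &Abstract) {
  if (Abstract.Kind == "mul-acc-reduce")
    return {{"widen-cast"}, {"widen-cast"}, {"widen-mul"}, {"reduce.add"}};
  return {Abstract};
}

int main() {
  Node Plan{"mul-acc-reduce"}; // abstract recipe used for cost modelling
  for (const Node &N : lowerToConcrete(Plan)) // expansion just before codegen
    std::cout << N.Kind << "\n";
}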

@davemgreen (Collaborator)

Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction:

reduce(ext(...))
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))

In the case of MVE, both exts will be the same. The add can be done by setting one of the operands to 1. #92418 is similar, but produces a vector instead of a single output value (a dot product). Dot product has udot, sdot, and usdot, which do partialreduce.add(mul(sext, zext)), but for MVE the extends need to be the same.
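
As an illustrative aside (hypothetical values, not from the patch), the "set one operand to 1" trick works because reduce.add(ext(x)) and reduce.add(mul(ext(x), splat(1))) compute the same sum:

#include <cassert>
#include <cstdint>

int32_t reduce_ext(const int8_t *x, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i)
    sum += int32_t(x[i]);        // reduce.add(sext(x))
  return sum;
}

int32_t reduce_mla_one(const int8_t *x, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i)
    sum += int32_t(x[i]) * 1;    // reduce.add(mul(sext(x), splat(1)))
  return sum;
}

int main() {
  int8_t x[] = {-128, 127, 5, -7};
  assert(reduce_ext(x, 4) == reduce_mla_one(x, 4));
}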

There are also other patterns that come up too. The first I believe should be equivalent to vecreduce(mul(ext, ext)), providing the ext nodes are the correct types. I don't remember about the second.

reduce.add(ext(mul(ext(...), ext(...))))
reduce.add(ext(mul(...)))

AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ), that performs a mul(ext, ext). Sometimes it might be better to fold towards ext(load) though, depending on the types.

@davemgreen (Collaborator)

BTW, I believe this patch is currently changing the scores computed for reduce.add(ext(mul(ext(...), ext(...)))), in a way that makes it pick a different VF. I wasn't sure if that was because this was better or not, but I think it is picking different type sizes. https://godbolt.org/z/7TYb8heEM should hopefully show the difference? Let me know if it doesn't.

@ElvisWang123 (Contributor, Author)

This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost-computation. Before code-gen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305

Thanks for your advice; I'm working in this direction.

Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction:

reduce(ext(...))
reduce.add(mul(...))
reduce.add(mul(ext(...), ext(...)))

Thanks, I will only model these three patterns for reduction.

There are also other patterns that come up too. The first I believe should be equivalent to vecreduce(mul(ext, ext)), providing the ext nodes are the correct types. I don't remember about the second.

reduce.add(ext(mul(ext(...), ext(...))))
reduce.add(ext(mul(...)))

Thanks for catching that. I misunderstood how many patterns could be folded into MVE reduction-like instructions in the original patch.

AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ), that performs a mul(ext, ext). Sometimes it might be better to fold towards ext(load) though, depending on the types.

I think we already model the instruction cost for ext(load) in VPWidenCastRecipe::computeCost(): we compute the CastContextHint, which depends on the load/store, for the ext instructions. But I am not quite sure whether ARMTTI handles this pattern correctly or not.

In summary, I think we only need two new recipes for reduction: reduce(ext) and reduce.add(mul(<optional>(ext), <optional>(ext)))?

If there are any questions, please let me know.

@davemgreen (Collaborator)

Thanks, I will only model these three patterns for reduction.

I believe that reduce.add(ext(mul(ext(...), ext(...)))) is mathematically equivalent to reduce.add(mul(ext(...), ext(...))), so that sounds good so long as we can match both (a small self-contained check of this equivalence follows). I don't remember about the reduce.add(ext(mul(..., ...))) form; it doesn't sound like it would be equivalent. If it turns out we do need it, it should hopefully be fine to look into adding it later.
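
To make that equivalence concrete, here is a small illustrative check (assumed i8 inputs, an i16 inner multiply, and an i32 accumulator; not code from the patch). An i8 * i8 product always fits in i16, so extending before or after the multiply yields the same sum:

#include <cassert>
#include <cstdint>

// reduce.add(sext32(mul16(sext16(a), sext16(b))))
int32_t red_ext_mul_ext(const int8_t *a, const int8_t *b, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i) {
    int16_t m = int16_t(int16_t(a[i]) * int16_t(b[i])); // mul of the narrow extends
    sum += int32_t(m);                                  // outer extend into the reduction
  }
  return sum;
}

// reduce.add(mul32(sext32(a), sext32(b)))
int32_t red_mul_ext(const int8_t *a, const int8_t *b, int n) {
  int32_t sum = 0;
  for (int i = 0; i < n; ++i)
    sum += int32_t(a[i]) * int32_t(b[i]);
  return sum;
}

int main() {
  int8_t a[] = {-128, 127, -1, 64};
  int8_t b[] = {-128, -128, 127, 3};
  assert(red_ext_mul_ext(a, b, 4) == red_mul_ext(a, b, 4));
}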

I think we already model the instruction cost for ext(load) in VPWidenCastRecipe::computeCost(): we compute the CastContextHint, which depends on the load/store, for the ext instructions. But I am not quite sure whether ARMTTI handles this pattern correctly or not.

In summary, I think we only need two new recipes for reduction: reduce(ext) and reduce.add(mul(<optional>(ext), <optional>(ext)))?

IIRC CCH was added for getting the extension costs (more) correct under Arm/MVE, so it hopefully does OK in the current scheme. We might need to be a little careful about which one of red-ext + load vs red + ext-load we prefer for each vplan.

@ElvisWang123 (Contributor, Author)

Updated to a recipe-based implementation.

Note that the EVL version is not implemented yet, so the RISC-V test cases changed.

@davemgreen (Collaborator)

Hello. Let me upload some tests of examples that produce different results with this now. I think it might be that add(sext(mul(sext, sext))) can be optimized to add(zext(mul(sext, sext))) nowadays, or that there are multiple uses.

@davemgreen (Collaborator)

This might show it picking a different VF now: https://godbolt.org/z/zda3dvcrx

@ElvisWang123 (Contributor, Author)

Hello. Let me upload some tests of examples that produce different results with this now. I think it might be that add(sext(mul(sext, sext))) can be optimized to add(zext(mul(sext, sext))) nowadays, or that there are multiple uses.

Thanks for your information. I will fix the MulAcc pattern match for this pattern.

@davemgreen (Collaborator)

Thanks - there might be some more, and that might have just been one of the issues; I will keep trying to test the others and see if anything else comes up.

@davemgreen (Collaborator)

This is another one that is behaving differently, I think due to subtleties about when one-use checks are beneficial. https://godbolt.org/z/fYW9Y5TqG. There is a third more awkward case with interleave groups I have not looked into much yet.

I will try and make those into test cases to ensure we have the test coverage for them. I will have to check again later if anything else remains that is behaving differently and hitting the assert.

@davemgreen (Collaborator)

Some tests in ab9178e.

@fhahn (Contributor) left a comment:
Thanks for the update, some initial suggestions inline.

It would be good if we could avoid any references to IR operands, and possibly unify both recipes (and avoid duplication with the existing reduction recipe).

@ElvisWang123 (Contributor, Author) commented Nov 11, 2024

Some tests in ab9178e.

Thanks for adding new tests.

Updated the condition for creating a MulAccRecipe for reduce.add(zext(mul(sext(A), sext(B)))) when A == B. The new patch passes all assertion checks.

mla_and_add_together_16_64 contains two reduction patterns in the same loop, reduce.add(zext(mul(sext(A), sext(A)))) and reduce(sext(A)). To support it, I removed the check that the extend recipes have only one user, to align with the behavior of the legacy model. This change might miscalculate the cost if the extend recipes are also used by recipes outside the reduction pattern, in which case we still need to account for the cost of the extend recipes. But I think the legacy cost model has the same issue, and we can fix it in the future.
Also, this may generate duplicate extend instructions, but I think they will be removed by later optimizations. 234e81e

@@ -155,6 +155,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPBlendSC:
case VPReductionEVLSC:
case VPReductionSC:
case VPExtendedReductionSC:
Contributor:
Should also be added to mayReadFromMemory and mayWriteToMemory for completeness?

@ElvisWang123 (Contributor, Author):
Added, thanks!

@fhahn (Contributor) left a comment:
Thanks for all the updates. Basically looks good, but I think it would be good to split up the patch into adding the new recipes + transform + lowering but without the cost changes. And then land the cost changes separately, if possible?

@fhahn (Contributor) commented Apr 27, 2025

That way there will be less churn in case we need to revert due to cost failures.

ElvisWang123 added a commit that referenced this pull request Apr 29, 2025
This patch adds a test for the fmuladd reduction to show the test changes/failures for the cost-model change.

Note that without the fp128 load and trunc, there is no failure.

Pre-commit test for #113903.
ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request Apr 29, 2025
…to abstract recipe.

This patch implements the transformation that matches the following patterns in the VPlan and converts them to abstract recipes for better cost estimation.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before vector codegen.

This should be a cost-model-based decision, which will be implemented in the following patch. For now, it still relies on the legacy cost model to calculate the right cost.

Split from llvm#113903.
@ElvisWang123 (Contributor, Author)

Split the new recipe implementations into #137745.
Split the transformations (convertTo{Abstract|Concrete}Recipes) into #137746.

Will update this patch after the above patches land.
This patch will then focus on removing the dependency on the legacy cost model and enabling this transformation via the VPlan-based cost model.

gizmondo pushed a commit to gizmondo/llvm-project that referenced this pull request Apr 29, 2025 (same fmuladd pre-commit test commit as above).
@fhahn
Copy link
Contributor

fhahn commented Apr 29, 2025

Split the new recipe implementations into #137745. Split the transformations (convertTo{Abstract|Concrete}Recipes) into #137746.

Would be good to combine #137745 & #137746, so the recipes are used already, sorry if that wasn't clearer from the previous comment

ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request Apr 29, 2025
… to abstract recipe.

This patch introduces two new recipes.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

This patch also implements the transformation that matches the following patterns in VPlan and converts them to abstract recipes for better cost estimation.

* VPExtendedReduction
  - reduce(cast(...))

* VPMulAccumulateReductionRecipe
  - reduce.add(mul(...))
  - reduce.add(mul(ext(...), ext(...)))
  - reduce.add(ext(mul(ext(...), ext(...))))

The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before recipe execution.

Split from llvm#113903.
@ElvisWang123 (Contributor, Author)

Would be good to combine #137745 & #137746, so the recipes are used already, sorry if that wasn't clearer from the previous comment

Ah it is fine, I misunderstood the previous comment 😃. Combined all the changes into #137746.

IanWood1 pushed a commit to IanWood1/llvm-project that referenced this pull request May 6, 2025
…xit`. (llvm#135294)

This patch checks whether the plan contains a scalar VF via the VFRange instead of the Plan. It also clamps the range to contain either only scalar or only vector VFs to prevent miscompiles.

Split from llvm#113903.
IanWood1 pushed commits to IanWood1/llvm-project that referenced this pull request May 6, 2025 (duplicates of the fmuladd pre-commit test and the VFRange clamp commits above).
GeorgeARM pushed a commit to GeorgeARM/llvm-project that referenced this pull request May 7, 2025 (same fmuladd pre-commit test commit as above).
ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request May 8, 2025 (same abstract-recipe commit message as above).
Ankur-0429 pushed a commit to Ankur-0429/llvm-project that referenced this pull request May 9, 2025 (same fmuladd pre-commit test commit as above).
ElvisWang123 added a commit to ElvisWang123/llvm-project that referenced this pull request May 15, 2025 (same abstract-recipe commit message as above).
ElvisWang123 added a commit that referenced this pull request May 16, 2025
… and corresponding vplan transformations. (#137746)

This patch introduces two new recipes.

* VPExtendedReductionRecipe
  - cast + reduction.

* VPMulAccumulateReductionRecipe
  - (cast) + mul + reduction.

This patch also implements the transformation that matches the following patterns in VPlan and converts them to abstract recipes for better cost estimation.

* VPExtendedReduction
  - reduce(cast(...))

* VPMulAccumulateReductionRecipe
  - reduce.add(mul(...))
  - reduce.add(mul(ext(...), ext(...)))
  - reduce.add(ext(mul(ext(...), ext(...))))

The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before recipe execution.

Note that this patch still relies on the legacy cost model to calculate the cost for these patterns. The VPlan-based cost decision will be enabled in #113903.

Split from #113903.
@ElvisWang123 (Contributor, Author) left a comment:
Update after #137746.

Comment on lines -2542 to -2543
return Cost +
Ctx.TTI.getMinMaxReductionCost(Id, VectorTy, FMFs, Ctx.CostKind);
@ElvisWang123 (Contributor, Author):
Removed the BinOp cost here to match the legacy cost model. link
TTI already calculates the cost of the BinOp (at least for RISC-V). link

@ElvisWang123 changed the title from "[VPlan] Impl VPlan-based pattern match for ExtendedRed and MulAccRed" to "[VPlan] Implement VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction. (NFC)" on May 19, 2025.