[VPlan] Implement VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction. (NFC) #113903
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms
Author: Elvis Wang (ElvisWang123)
Changes: This patch implements VPlan-based pattern matching for ExtendedReduction and MulAccReduction. In these reduction patterns, the extend instructions and the mul instruction can be folded into the reduction instruction, so their cost becomes free.
Ref: Original instruction-based implementation: https://reviews.llvm.org/D93476
This patch is based on #113902.
Patch is 21.33 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113903.diff 5 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 60a94ca1f86e42..483e039fe133d6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7303,51 +7303,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
Cost += ReductionCost;
continue;
}
-
- const auto &ChainOps = RdxDesc.getReductionOpChain(RedPhi, OrigLoop);
- SetVector<Instruction *> ChainOpsAndOperands(ChainOps.begin(),
- ChainOps.end());
- auto IsZExtOrSExt = [](const unsigned Opcode) -> bool {
- return Opcode == Instruction::ZExt || Opcode == Instruction::SExt;
- };
- // Also include the operands of instructions in the chain, as the cost-model
- // may mark extends as free.
- //
- // For ARM, some of the instruction can folded into the reducion
- // instruction. So we need to mark all folded instructions free.
- // For example: We can fold reduce(mul(ext(A), ext(B))) into one
- // instruction.
- for (auto *ChainOp : ChainOps) {
- for (Value *Op : ChainOp->operands()) {
- if (auto *I = dyn_cast<Instruction>(Op)) {
- ChainOpsAndOperands.insert(I);
- if (I->getOpcode() == Instruction::Mul) {
- auto *Ext0 = dyn_cast<Instruction>(I->getOperand(0));
- auto *Ext1 = dyn_cast<Instruction>(I->getOperand(1));
- if (Ext0 && IsZExtOrSExt(Ext0->getOpcode()) && Ext1 &&
- Ext0->getOpcode() == Ext1->getOpcode()) {
- ChainOpsAndOperands.insert(Ext0);
- ChainOpsAndOperands.insert(Ext1);
- }
- }
- }
- }
- }
-
- // Pre-compute the cost for I, if it has a reduction pattern cost.
- for (Instruction *I : ChainOpsAndOperands) {
- auto ReductionCost = CM.getReductionPatternCost(
- I, VF, ToVectorTy(I->getType(), VF), TTI::TCK_RecipThroughput);
- if (!ReductionCost)
- continue;
-
- assert(!CostCtx.SkipCostComputation.contains(I) &&
- "reduction op visited multiple times");
- CostCtx.SkipCostComputation.insert(I);
- LLVM_DEBUG(dbgs() << "Cost of " << ReductionCost << " for VF " << VF
- << ":\n in-loop reduction " << *I << "\n");
- Cost += *ReductionCost;
- }
}
// Pre-compute the costs for branches except for the backedge, as the number
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ab8fb45c351b4..49e93e1e7b5501 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -785,7 +785,7 @@ void VPRegionBlock::execute(VPTransformState *State) {
InstructionCost VPBasicBlock::cost(ElementCount VF, VPCostContext &Ctx) {
InstructionCost Cost = 0;
- for (VPRecipeBase &R : Recipes)
+ for (VPRecipeBase &R : reverse(Recipes))
Cost += R.cost(VF, Ctx);
return Cost;
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 6a192bdf01c4ff..b26fd460a278f5 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -725,6 +725,8 @@ struct VPCostContext {
LLVMContext &LLVMCtx;
LoopVectorizationCostModel &CM;
SmallPtrSet<Instruction *, 8> SkipCostComputation;
+ /// Contains recipes that are folded into other recipes.
+ SmallDenseMap<ElementCount, SmallPtrSet<VPRecipeBase *, 4>, 4> FoldedRecipes;
VPCostContext(const TargetTransformInfo &TTI, const TargetLibraryInfo &TLI,
Type *CanIVTy, LoopVectorizationCostModel &CM)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 0eb4f7c7c88cee..5f59a1e96df9f8 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -299,7 +299,9 @@ InstructionCost VPRecipeBase::cost(ElementCount VF, VPCostContext &Ctx) {
UI = &WidenMem->getIngredient();
InstructionCost RecipeCost;
- if (UI && Ctx.skipCostComputation(UI, VF.isVector())) {
+ if ((UI && Ctx.skipCostComputation(UI, VF.isVector())) ||
+ (Ctx.FoldedRecipes.contains(VF) &&
+ Ctx.FoldedRecipes.at(VF).contains(this))) {
RecipeCost = 0;
} else {
RecipeCost = computeCost(VF, Ctx);
@@ -2188,30 +2190,143 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
unsigned Opcode = RdxDesc.getOpcode();
- // TODO: Support any-of and in-loop reductions.
+ // TODO: Support any-of reductions.
assert(
(!RecurrenceDescriptor::isAnyOfRecurrenceKind(RdxKind) ||
ForceTargetInstructionCost.getNumOccurrences() > 0) &&
"Any-of reduction not implemented in VPlan-based cost model currently.");
- assert(
- (!cast<VPReductionPHIRecipe>(getOperand(0))->isInLoop() ||
- ForceTargetInstructionCost.getNumOccurrences() > 0) &&
- "In-loop reduction not implemented in VPlan-based cost model currently.");
assert(ElementTy->getTypeID() == RdxDesc.getRecurrenceType()->getTypeID() &&
"Inferred type and recurrence type mismatch.");
- // Cost = Reduction cost + BinOp cost
- InstructionCost Cost =
+ // BaseCost = Reduction cost + BinOp cost
+ InstructionCost BaseCost =
Ctx.TTI.getArithmeticInstrCost(Opcode, ElementTy, CostKind);
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RdxKind)) {
Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
- return Cost + Ctx.TTI.getMinMaxReductionCost(
- Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ BaseCost += Ctx.TTI.getMinMaxReductionCost(
+ Id, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ } else {
+ BaseCost += Ctx.TTI.getArithmeticReductionCost(
+ Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
}
- return Cost + Ctx.TTI.getArithmeticReductionCost(
- Opcode, VectorTy, RdxDesc.getFastMathFlags(), CostKind);
+ using namespace llvm::VPlanPatternMatch;
+ auto GetMulAccReductionCost =
+ [&](const VPReductionRecipe *Red) -> InstructionCost {
+ VPValue *A, *B;
+ InstructionCost InnerExt0Cost = 0;
+ InstructionCost InnerExt1Cost = 0;
+ InstructionCost ExtCost = 0;
+ InstructionCost MulCost = 0;
+
+ VectorType *SrcVecTy = VectorTy;
+ Type *InnerExt0Ty;
+ Type *InnerExt1Ty;
+ Type *MaxInnerExtTy;
+ bool IsUnsigned = true;
+ bool HasOuterExt = false;
+
+ auto *Ext = dyn_cast_if_present<VPWidenCastRecipe>(
+ Red->getVecOp()->getDefiningRecipe());
+ VPRecipeBase *Mul;
+ // Try to match outer extend reduce.add(ext(...))
+ if (Ext && match(Ext, m_ZExtOrSExt(m_VPValue())) &&
+ cast<VPWidenCastRecipe>(Ext)->getNumUsers() == 1) {
+ IsUnsigned =
+ Ext->getOpcode() == Instruction::CastOps::ZExt ? true : false;
+ ExtCost = Ext->computeCost(VF, Ctx);
+ Mul = Ext->getOperand(0)->getDefiningRecipe();
+ HasOuterExt = true;
+ } else {
+ Mul = Red->getVecOp()->getDefiningRecipe();
+ }
+
+ // Match reduce.add(mul())
+ if (Mul && match(Mul, m_Mul(m_VPValue(A), m_VPValue(B))) &&
+ cast<VPWidenRecipe>(Mul)->getNumUsers() == 1) {
+ MulCost = cast<VPWidenRecipe>(Mul)->computeCost(VF, Ctx);
+ auto *InnerExt0 =
+ dyn_cast_if_present<VPWidenCastRecipe>(A->getDefiningRecipe());
+ auto *InnerExt1 =
+ dyn_cast_if_present<VPWidenCastRecipe>(B->getDefiningRecipe());
+ bool HasInnerExt = false;
+ // Try to match inner extends.
+ if (InnerExt0 && InnerExt1 &&
+ match(InnerExt0, m_ZExtOrSExt(m_VPValue())) &&
+ match(InnerExt1, m_ZExtOrSExt(m_VPValue())) &&
+ InnerExt0->getOpcode() == InnerExt1->getOpcode() &&
+ (InnerExt0->getNumUsers() > 0 &&
+ !InnerExt0->hasMoreThanOneUniqueUser()) &&
+ (InnerExt1->getNumUsers() > 0 &&
+ !InnerExt1->hasMoreThanOneUniqueUser())) {
+ InnerExt0Cost = InnerExt0->computeCost(VF, Ctx);
+ InnerExt1Cost = InnerExt1->computeCost(VF, Ctx);
+ Type *InnerExt0Ty = Ctx.Types.inferScalarType(InnerExt0->getOperand(0));
+ Type *InnerExt1Ty = Ctx.Types.inferScalarType(InnerExt1->getOperand(0));
+ Type *MaxInnerExtTy = InnerExt0Ty->getIntegerBitWidth() >
+ InnerExt1Ty->getIntegerBitWidth()
+ ? InnerExt0Ty
+ : InnerExt1Ty;
+ SrcVecTy = cast<VectorType>(ToVectorTy(MaxInnerExtTy, VF));
+ IsUnsigned = true;
+ HasInnerExt = true;
+ }
+ InstructionCost MulAccRedCost = Ctx.TTI.getMulAccReductionCost(
+ IsUnsigned, ElementTy, SrcVecTy, CostKind);
+ // Check if folding ext/mul into MulAccReduction is profitable.
+ if (MulAccRedCost.isValid() &&
+ MulAccRedCost <
+ ExtCost + MulCost + InnerExt0Cost + InnerExt1Cost + BaseCost) {
+ if (HasInnerExt) {
+ Ctx.FoldedRecipes[VF].insert(InnerExt0);
+ Ctx.FoldedRecipes[VF].insert(InnerExt1);
+ }
+ Ctx.FoldedRecipes[VF].insert(Mul);
+ if (HasOuterExt)
+ Ctx.FoldedRecipes[VF].insert(Ext);
+ return MulAccRedCost;
+ }
+ }
+ return InstructionCost::getInvalid();
+ };
+
+ // Match reduce(ext(...))
+ auto GetExtendedReductionCost =
+ [&](const VPReductionRecipe *Red) -> InstructionCost {
+ VPValue *VecOp = Red->getVecOp();
+ VPValue *A;
+ if (match(VecOp, m_ZExtOrSExt(m_VPValue(A))) && VecOp->getNumUsers() == 1) {
+ VPWidenCastRecipe *Ext =
+ cast<VPWidenCastRecipe>(VecOp->getDefiningRecipe());
+ bool IsUnsigned = Ext->getOpcode() == Instruction::CastOps::ZExt;
+ InstructionCost ExtCost = Ext->computeCost(VF, Ctx);
+ auto *ExtVecTy =
+ cast<VectorType>(ToVectorTy(Ctx.Types.inferScalarType(A), VF));
+ InstructionCost ExtendedRedCost = Ctx.TTI.getExtendedReductionCost(
+ Opcode, IsUnsigned, ElementTy, ExtVecTy, RdxDesc.getFastMathFlags(),
+ CostKind);
+ // Check if folding ext into ExtendedReduction is profitable.
+ if (ExtendedRedCost.isValid() && ExtendedRedCost < ExtCost + BaseCost) {
+ Ctx.FoldedRecipes[VF].insert(Ext);
+ return ExtendedRedCost;
+ }
+ }
+ return InstructionCost::getInvalid();
+ };
+
+ // Match MulAccReduction patterns.
+ InstructionCost MulAccCost = GetMulAccReductionCost(this);
+ if (MulAccCost.isValid())
+ return MulAccCost;
+
+ // Match ExtendedReduction patterns.
+ InstructionCost ExtendedCost = GetExtendedReductionCost(this);
+ if (ExtendedCost.isValid())
+ return ExtendedCost;
+
+ // Default cost.
+ return BaseCost;
}
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
index fa346b4eac02d4..f2e36399c85f5d 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/interleaved-cost.ll
@@ -6,26 +6,26 @@ define void @i8_factor_2(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_2'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 32: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 1: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 2: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF vscale x 4: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF vscale x 8: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at <badref>, ir<%p0>
+; CHECK: Cost of 5 for VF vscale x 16: INTERLEAVE-GROUP with factor 2 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.2, ptr %data, i64 %i, i32 0
@@ -49,16 +49,16 @@ define void @i8_factor_3(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_3'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 3 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.3, ptr %data, i64 %i, i32 0
@@ -86,16 +86,16 @@ define void @i8_factor_4(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_4'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 4: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 8: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 16: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 32: INTERLEAVE-GROUP with factor 4 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.4, ptr %data, i64 %i, i32 0
@@ -127,14 +127,14 @@ define void @i8_factor_5(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_5'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 5 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.5, ptr %data, i64 %i, i32 0
@@ -170,14 +170,14 @@ define void @i8_factor_6(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_6'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 6 at %l0, ir<%p0>
for.body:
%i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
%p0 = getelementptr inbounds %i8.6, ptr %data, i64 %i, i32 0
@@ -217,14 +217,14 @@ define void @i8_factor_7(ptr %data, i64 %n) {
entry:
br label %for.body
; CHECK-LABEL: Checking a loop in 'i8_factor_7'
-; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 2 for VF 2: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 3 for VF 4: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
-; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
+; CHECK: Cost of 5 for VF 8: INTERLEAVE-GROUP with factor 7 at %l0, ir<%p0>
; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at <badref>, ir<%p0>
+; CHECK: Cost of 9 for VF 16: INTERLEAVE-GROUP with factor 7 at %...
[truncated]
The way I think I had imagined this working in a vplan-based cost model would be that the vplan nodes more closely mirrored what the back-end would produce. So there would be a vplan recipe for the extended-reduction, which would be created if it was profitable but would otherwise be relatively easy to cost-model. I'm not sure if that is still the current plan or not.
Yes I think ideally we would model the add-extend/mul-extend operations explicitly in VPlan, especially if matching it in VPlan would require changes to the order in which the cost is computed. Would add-extend/mul-extend be sufficient or would other recipes be needed as well?
I think we currently need a mul-extend-reduction recipe and an extend-reduction recipe to model these reduction patterns in VPlan. Yes, generating new recipes is good, but using new recipes to model these patterns would duplicate a lot of code in the
@davemgreen do you think those would be enough?
This might be a case where gradual lowering would help. We could have a more abstract recipe early on which combines mul-extend in a single recipe, facilitating simple cost-computation. Before code-gen, we can replace the recipe with wide recipes for the adds and extends, so there is no need to duplicate codegen for those, similar to how things are sketched for scalar phis in #114305
Hello. I believe the basic patterns are the ones listed in the summary, which are an extended-add-reduction or an extended mla reduction:
In the case of MVE, both exts will be the same. The add can be done by setting one of the operands to 1. #92418 is similar, but produces a vector instead of a single output value (a dot product). Dot product has udot, sdot and usdot that do

There are also other patterns that come up too. The first I believe should be equivalent to
AArch64 has a stand-alone umull instruction (both for scalar and for vector, although the type sizes differ), which performs a
BTW, I believe this patch is currently changing the scores computed for
Thanks for your advice; I'm working in this direction.
Thanks, I will only model these three patterns for reduction.
Thanks for catching that. I misunderstood how many patterns can be folded into MVE reduction-like instructions in the original patch.
I think we already model the instruction cost for

In summary, I think we only need two new recipes for reduction: VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe.

If there is any question, please let me know.
I believe that
IIRC CCH was added for getting the extension costs (more) correct under Arm/MVE, so it hopefully does OK in the current scheme. We might need to be a little careful about which one of
Updated to a recipe-based implementation. Note that the EVL version is not implemented yet, so the test cases for RISC-V changed.
Hello. Let me upload some test examples that produce different results with this now. I think it might be that
This might show it picking a different VF now: https://godbolt.org/z/zda3dvcrx
Thanks for your information. I will fix the MulAcc pattern match for this pattern.
Thanks - There might be some more, and that might have just been one of the issues; I will keep trying to test the others and see if anything else comes up.
This is another one that is behaving differently, I think due to subtleties about when one-use checks are beneficial: https://godbolt.org/z/fYW9Y5TqG. There is a third, more awkward case with interleave groups that I have not looked into much yet. I will try to make those into test cases to ensure we have the test coverage for them. I will check again later whether anything else behaves differently and hits the assert.
Some tests in ab9178e.
Thanks for the update, some initial suggestions inline.
Would be good if we could avoid any references to IR operands, and possibly unify both recipes (and avoid duplication with the existing reduction recipe).
Thanks for adding new tests. Updated the condition for creating MulAccRecipe for

To support
@@ -155,6 +155,8 @@ bool VPRecipeBase::mayHaveSideEffects() const {
case VPBlendSC:
case VPReductionEVLSC:
case VPReductionSC:
case VPExtendedReductionSC:
Should also be added to mayReadFromMemory and mayWriteToMemory for completeness?
Added, thanks!
Thanks for all the updates. Basically looks good, but I think it would be good to split up the patch into adding the new recipes + transform + lowering but without the cost changes. And then land the cost changes separately, if possible?
That way there will be less churn in case we need to revert due to cost failures.
This patch adds a test for the fmuladd reduction to show the test changes/failures for the cost-model change. Note that without the fp128 load and trunc, there is no failure. Pre-commit test for #113903.
…to abstract recipe. This patch implements the transformation that matches the following patterns in the VPlan and converts them to abstract recipes for better cost estimation.
* VPExtendedReductionRecipe - cast + reduction.
* VPMulAccumulateReductionRecipe - (cast) + mul + reduction.
The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before vector codegen. This should be a cost-model-based decision, which will be implemented in a following patch. For now, we still rely on the legacy cost model to calculate the right cost. Split from llvm#113903.
Split the new recipe implementations out to #137745. Will update this patch after the above patches land.
… to abstract recipe. This patch introduces two new recipes:
* VPExtendedReductionRecipe - cast + reduction.
* VPMulAccumulateReductionRecipe - (cast) + mul + reduction.
This patch also implements the transformation that matches the following patterns in the VPlan and converts them to abstract recipes for better cost estimation:
* VPExtendedReduction - reduce(cast(...))
* VPMulAccumulateReductionRecipe - reduce.add(mul(...)), reduce.add(mul(ext(...), ext(...))), reduce.add(ext(mul(ext(...), ext(...))))
The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before recipe execution. Split from llvm#113903.
…xit`. (llvm#135294) This patch checks whether the plan contains a scalar VF via the VFRange instead of the Plan. This patch also clamps the range to contain either only scalar or only vector VFs to prevent miscompiles. Split from llvm#113903.
… and corresponding vplan transformations. (#137746) This patch introduces two new recipes:
* VPExtendedReductionRecipe - cast + reduction.
* VPMulAccumulateReductionRecipe - (cast) + mul + reduction.
This patch also implements the transformation that matches the following patterns in the VPlan and converts them to abstract recipes for better cost estimation:
* VPExtendedReduction - reduce(cast(...))
* VPMulAccumulateReductionRecipe - reduce.add(mul(...)), reduce.add(mul(ext(...), ext(...))), reduce.add(ext(mul(ext(...), ext(...))))
The converted abstract recipes will be lowered to the concrete recipes (widen-cast + widen-mul + reduction) just before recipe execution. Note that this patch still relies on the legacy cost model to calculate the cost for these patterns. Will enable VPlan-based cost decisions in #113903. Split from #113903.
Update after #137746.
return Cost +
       Ctx.TTI.getMinMaxReductionCost(Id, VectorTy, FMFs, Ctx.CostKind);
This patch implements the VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction.
With this patch, we can calculate the reduction cost with the VPlan-based cost model, so the reduction costs in precomputeCosts() are removed.
Ref: Original instruction-based implementation: https://reviews.llvm.org/D93476