[VPlan] Add transformation to narrow interleave groups. #106441
Conversation
First set of minor comments.
if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF)) {
  LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
}
Suggested change:
- if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF)) {
-   LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
- }
+ if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF))
+   LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
better if the transform itself dumps what it does, possibly along with each interleave narrowed?
Moved, thanks!
@@ -7487,6 +7487,9 @@ LoopVectorizationPlanner::executePlan(
   VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
Should optimizeForVFAndUF() above be renamed eliminateRedundantBranchOnCount()? Otherwise narrowInterleaveGroups() below would consistently be optimizeForVF().
Sounds good! There are no other simplifications there so far
- VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
-                                         CM.isScalarEpilogueAllowed());
+ VPlanTransforms::createInterleaveGroups(
+     *Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
nit: RecipeBuilder already holds Plan.
This is part of #10643
@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
     assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
   }

+  VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
Seems a bit odd for a constructor of VPInstruction(VPValue*, VPValue*, bool) to create a PtrAdd. Perhaps having GEPFlagsTy as 3rd parameter instead of bool would be clearer. Should be documented.
This is part of #10643
    return Builder.CreatePtrAdd(Ptr, Addend, Name,
                                isInBounds() ? GEPNoWrapFlags::inBounds()
                                             : GEPNoWrapFlags::none());
Independent of this patch?
    if ((R.isPhi() && !isa<VPCanonicalIVPHIRecipe>(&R)) ||
        (R.mayWriteToMemory() && !isa<VPInterleaveRecipe>(&R)))
      return false;

    auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
    if (!IR)
      continue;
Suggested change:
-    if ((R.isPhi() && !isa<VPCanonicalIVPHIRecipe>(&R)) ||
-        (R.mayWriteToMemory() && !isa<VPInterleaveRecipe>(&R)))
-      return false;
-    auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
-    if (!IR)
-      continue;
+    if (R.isPhi())
+      return false;
+    auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
+    if (R.mayWriteToMemory() && !IR)
+      return false;
+    if (!IR)
+      continue;
Simplified, thanks!
  bool Changed = false;
  SmallVector<VPInterleaveRecipe *> StoreGroups;
  for (auto &R : make_early_inc_range(
Why make_early_inc_range if StoreGroups is filled to be operated on later?
This was a left-over from earlier versions, removed, thanks!
  return false;
}

/// Returns true of \p IR is a consecutive interleave group with \p VF members.
Suggested change:
-/// Returns true of \p IR is a consecutive interleave group with \p VF members.
+/// Returns true if \p IR is a full interleave group with factor and number of
+/// members both equal to \p VF.
Updated, thanks!
    return false;
  auto IG = IR->getInterleaveGroup();
  return IG->getFactor() == IG->getNumMembers() &&
         IG->getNumMembers() == VF.getKnownMinValue();
Suggested change:
-         IG->getNumMembers() == VF.getKnownMinValue();
+         IG->getNumMembers() == VF.getFixedValue();
?
Adjusted, thanks!
                                StoreGroup->getDebugLoc());
    S->insertBefore(StoreGroup);
    StoreGroup->eraseFromParent();
    Changed = true;
Early exit above if StoredGroups.empty() returning false?
Updated, thanks!
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms.

Depends on llvm#106431. Fixes llvm#82936
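To make the candidate pattern concrete, here is a hypothetical scalar model (illustrative only, not code from the patch; the function name and VF=4 are invented for the example): each original iteration accesses a contiguous group of VF consecutive elements, i.e. an interleave group whose factor and member count both equal VF, which is the shape the transform can replace with plain wide loads and stores.

```cpp
#include <cassert>

// Illustrative sketch: a loop whose vectorized form produces a factor-4
// interleave group with 4 members at VF=4. Since factor == members == VF,
// the interleave/deinterleave of 16 contiguous doubles can be narrowed to
// plain 4-wide loads/stores, with each vector iteration then processing
// one original iteration of the outer loop.
constexpr unsigned VF = 4;

void scaleGroups(const double *Src, double *Dst, unsigned N) {
  for (unsigned I = 0; I != N; ++I)
    for (unsigned J = 0; J != VF; ++J) // the VF members of the group
      Dst[VF * I + J] = 2.0 * Src[VF * I + J];
}
```

The restriction discussed below (factor == number of members == VF, no gaps, no masks) is what makes the rewrite a simple one-to-one replacement in this sketch.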
Force-pushed from 6a2778c to 61279d1
@@ -673,6 +672,7 @@ static void recursivelyDeleteDeadRecipes(VPValue *V) {
 void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
                                          unsigned BestUF,
                                          PredicatedScalarEvolution &PSE) {
Removed, thanks!
    return Builder.CreatePtrAdd(Ptr, Addend, Name,
                                isInBounds() ? GEPNoWrapFlags::inBounds()
                                             : GEPNoWrapFlags::none());
Yes, the PR currently contains #106431, which is where this comes from.
    Instruction *IRInsertPos = IG->getInsertPos();
    auto *InsertPos =
        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
    VPRecipeBase *IP = InsertPos;

    // Get or create the start address for the interleave group.
    auto *Start =
        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
    VPValue *Addr = Start->getAddr();
    if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
      bool InBounds = false;
      if (auto *gep = dyn_cast<GetElementPtrInst>(
              getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
        InBounds = gep->isInBounds();

      // We cannot re-use the address of the first member because it does not
      // dominate the insert position. Use the address of the insert position
      // and create a PtrAdd to adjust the index to start at the first member.
      APInt Offset(32,
                   getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
                       IG->getIndex(IRInsertPos),
                   /*IsSigned=*/true);
      VPValue *OffsetVPV = Plan.getOrAddLiveIn(
          ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
      Addr = new VPInstruction(InsertPos->getAddr(), OffsetVPV, InBounds);
      Addr->getDefiningRecipe()->insertAfter(InsertPos);
      IP = Addr->getDefiningRecipe();
    }
    auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
                                        InsertPos->getMask(), NeedsMaskForGaps);
    VPIG->insertAfter(IP);
Also part of #106431
@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms.

Depends on #106431. Fixes #82936

Patch is 249.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106441.diff

20 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 00eec0a6f7b14e..5951873a960af2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -220,9 +220,15 @@ class VPBuilder {
new VPInstruction(Instruction::ICmp, Pred, A, B, DL, Name));
}
- VPInstruction *createPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL,
+ VPInstruction *createPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL = {},
const Twine &Name = "") {
- return createInstruction(VPInstruction::PtrAdd, {Ptr, Offset}, DL, Name);
+ return tryInsertInstruction(new VPInstruction(
+ Ptr, Offset, VPRecipeWithIRFlags::GEPFlagsTy(false), DL, Name));
+ }
+ VPValue *createInBoundsPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL = {},
+ const Twine &Name = "") {
+ return tryInsertInstruction(new VPInstruction(
+ Ptr, Offset, VPRecipeWithIRFlags::GEPFlagsTy(true), DL, Name));
}
VPDerivedIVRecipe *createDerivedIV(InductionDescriptor::InductionKind Kind,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index db4631e19c11d3..3eb830bcc02098 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7506,6 +7506,7 @@ LoopVectorizationPlanner::executePlan(
VPlanTransforms::unrollByUF(BestVPlan, BestUF,
OrigLoop->getHeader()->getModule()->getContext());
VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
+ VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF);
LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF
<< ", UF=" << BestUF << '\n');
@@ -9005,8 +9006,8 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
// Interleave memory: for each Interleave Group we marked earlier as relevant
// for this VPlan, replace the Recipes widening its memory instructions with a
// single VPInterleaveRecipe at its insertion point.
- VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
- CM.isScalarEpilogueAllowed());
+ VPlanTransforms::createInterleaveGroups(
+ *Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
for (ElementCount VF : Range)
Plan->addVF(VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ddbfcf0ecfe58..a18ab844c5ebec 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -954,7 +954,6 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
// FIXME: Model VF * UF computation completely in VPlan.
- assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");
unsigned UF = getUF();
if (VF.getNumUsers()) {
Value *RuntimeVF = getRuntimeVF(Builder, TCTy, State.VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index bbcfaf9e19cd0c..f841355d320930 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -956,7 +956,6 @@ class VPRecipeWithIRFlags : public VPSingleDefRecipe {
DisjointFlagsTy(bool IsDisjoint) : IsDisjoint(IsDisjoint) {}
};
-protected:
struct GEPFlagsTy {
char IsInBounds : 1;
GEPFlagsTy(bool IsInBounds) : IsInBounds(IsInBounds) {}
@@ -1307,6 +1306,13 @@ class VPInstruction : public VPRecipeWithIRFlags,
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}
+ VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
+ DebugLoc DL = {}, const Twine &Name = "")
+ : VPRecipeWithIRFlags(VPDef::VPInstructionSC,
+ ArrayRef<VPValue *>({Ptr, Offset}),
+ GEPFlagsTy(Flags), DL),
+ Opcode(VPInstruction::PtrAdd), Name(Name.str()) {}
+
VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands,
FastMathFlags FMFs, DebugLoc DL = {}, const Twine &Name = "");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index dacba152611c19..96b16164738181 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -48,6 +48,9 @@ extern cl::opt<unsigned> ForceTargetInstructionCost;
bool VPRecipeBase::mayWriteToMemory() const {
switch (getVPDefID()) {
+ case VPInstructionSC: {
+ return !Instruction::isBinaryOp(cast<VPInstruction>(this)->getOpcode());
+ }
case VPInterleaveSC:
return cast<VPInterleaveRecipe>(this)->getNumStoreOperands() > 0;
case VPWidenStoreEVLSC:
@@ -63,6 +66,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
case VPBranchOnMaskSC:
case VPScalarIVStepsSC:
case VPPredInstPHISC:
+ case VPVectorPointerSC:
return false;
case VPBlendSC:
case VPReductionEVLSC:
@@ -644,7 +648,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
"can only generate first lane for PtrAdd");
Value *Ptr = State.get(getOperand(0), /* IsScalar */ true);
Value *Addend = State.get(getOperand(1), /* IsScalar */ true);
- return Builder.CreatePtrAdd(Ptr, Addend, Name);
+ return isInBounds() ? Builder.CreateInBoundsPtrAdd(Ptr, Addend, Name)
+ : Builder.CreatePtrAdd(Ptr, Addend, Name);
}
case VPInstruction::ResumePhi: {
Value *IncomingFromVPlanPred =
@@ -2470,15 +2475,12 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
unsigned InterleaveFactor = Group->getFactor();
auto *VecTy = VectorType::get(ScalarTy, State.VF * InterleaveFactor);
- // Prepare for the new pointers.
- unsigned Index = Group->getIndex(Instr);
-
// TODO: extend the masked interleaved-group support to reversed access.
VPValue *BlockInMask = getMask();
assert((!BlockInMask || !Group->isReverse()) &&
"Reversed masked interleave-group not supported.");
- Value *Idx;
+ Value *Index;
// If the group is reverse, adjust the index to refer to the last vector lane
// instead of the first. We adjust the index from the first vector lane,
// rather than directly getting the pointer for lane VF - 1, because the
@@ -2486,35 +2488,24 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
if (Group->isReverse()) {
Value *RuntimeVF =
getRuntimeVF(State.Builder, State.Builder.getInt32Ty(), State.VF);
- Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
- Idx = State.Builder.CreateMul(Idx,
- State.Builder.getInt32(Group->getFactor()));
- Idx = State.Builder.CreateAdd(Idx, State.Builder.getInt32(Index));
- Idx = State.Builder.CreateNeg(Idx);
- } else
- Idx = State.Builder.getInt32(-Index);
+ Index = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
+ Index = State.Builder.CreateMul(Index,
+ State.Builder.getInt32(Group->getFactor()));
+ Index = State.Builder.CreateNeg(Index);
+ } else {
+ // TODO: Drop redundant 0-index GEP as follow-up.
+ Index = State.Builder.getInt32(0);
+ }
VPValue *Addr = getAddr();
Value *ResAddr = State.get(Addr, VPLane(0));
if (auto *I = dyn_cast<Instruction>(ResAddr))
State.setDebugLocFrom(I->getDebugLoc());
- // Notice current instruction could be any index. Need to adjust the address
- // to the member of index 0.
- //
- // E.g. a = A[i+1]; // Member of index 1 (Current instruction)
- // b = A[i]; // Member of index 0
- // Current pointer is pointed to A[i+1], adjust it to A[i].
- //
- // E.g. A[i+1] = a; // Member of index 1
- // A[i] = b; // Member of index 0
- // A[i+2] = c; // Member of index 2 (Current instruction)
- // Current pointer is pointed to A[i+2], adjust it to A[i].
-
bool InBounds = false;
if (auto *gep = dyn_cast<GetElementPtrInst>(ResAddr->stripPointerCasts()))
InBounds = gep->isInBounds();
- ResAddr = State.Builder.CreateGEP(ScalarTy, ResAddr, Idx, "", InBounds);
+ ResAddr = State.Builder.CreateGEP(ScalarTy, ResAddr, Index, "", InBounds);
State.setDebugLocFrom(Instr->getDebugLoc());
Value *PoisonVec = PoisonValue::get(VecTy);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index a878613c4ba483..58188379c7fabb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -29,6 +29,9 @@
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/PatternMatch.h"
+#define LV_NAME "loop-vectorize"
+#define DEBUG_TYPE LV_NAME
+
using namespace llvm;
void VPlanTransforms::VPInstructionsToVPRecipes(
@@ -710,6 +713,7 @@ void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
// TODO: Further simplifications are possible
// 1. Replace inductions with constants.
// 2. Replace vector loop region with VPBasicBlock.
+ //
}
/// Sink users of \p FOR after the recipe defining the previous value \p
@@ -1590,14 +1594,19 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
}
void VPlanTransforms::createInterleaveGroups(
- const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+ VPlan &Plan,
+ const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+ &InterleaveGroups,
VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed) {
+ if (InterleaveGroups.empty())
+ return;
+
// Interleave memory: for each Interleave Group we marked earlier as relevant
// for this VPlan, replace the Recipes widening its memory instructions with a
// single VPInterleaveRecipe at its insertion point.
+ VPDominatorTree VPDT;
+ VPDT.recalculate(Plan);
for (const auto *IG : InterleaveGroups) {
- auto *Recipe =
- cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getInsertPos()));
SmallVector<VPValue *, 4> StoredValues;
for (unsigned i = 0; i < IG->getFactor(); ++i)
if (auto *SI = dyn_cast_or_null<StoreInst>(IG->getMember(i))) {
@@ -1607,9 +1616,38 @@ void VPlanTransforms::createInterleaveGroups(
bool NeedsMaskForGaps =
IG->requiresScalarEpilogue() && !ScalarEpilogueAllowed;
- auto *VPIG = new VPInterleaveRecipe(IG, Recipe->getAddr(), StoredValues,
- Recipe->getMask(), NeedsMaskForGaps);
- VPIG->insertBefore(Recipe);
+
+ Instruction *IRInsertPos = IG->getInsertPos();
+ auto *InsertPos =
+ cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
+
+ // Get or create the start address for the interleave group.
+ auto *Start =
+ cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
+ VPValue *Addr = Start->getAddr();
+ if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
+ bool InBounds = false;
+ if (auto *Gep = dyn_cast<GetElementPtrInst>(
+ getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
+ InBounds = Gep->isInBounds();
+
+ // We cannot re-use the address of the first member because it does not
+ // dominate the insert position. Use the address of the insert position
+ // and create a PtrAdd to adjust the index to start at the first member.
+ APInt Offset(32,
+ getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
+ IG->getIndex(IRInsertPos),
+ /*IsSigned=*/true);
+ VPValue *OffsetVPV = Plan.getOrAddLiveIn(
+ ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
+ VPBuilder B(InsertPos);
+ Addr = InBounds ? B.createInBoundsPtrAdd(InsertPos->getAddr(), OffsetVPV)
+ : B.createPtrAdd(InsertPos->getAddr(), OffsetVPV);
+ }
+ auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
+ InsertPos->getMask(), NeedsMaskForGaps);
+ VPIG->insertBefore(InsertPos);
+
unsigned J = 0;
for (unsigned i = 0; i < IG->getFactor(); ++i)
if (Instruction *Member = IG->getMember(i)) {
@@ -1623,3 +1661,129 @@ void VPlanTransforms::createInterleaveGroups(
}
}
}
+
+static bool supportedLoad(VPWidenRecipe *R0, VPValue *V, unsigned Idx) {
+ if (auto *W = dyn_cast_or_null<VPWidenLoadRecipe>(V->getDefiningRecipe())) {
+ if (W->getMask())
+ return false;
+ return !W->getMask() && (R0->getOperand(0) == V || R0->getOperand(1) == V);
+ }
+
+ if (auto *IR = dyn_cast_or_null<VPInterleaveRecipe>(V->getDefiningRecipe())) {
+ return IR->getInterleaveGroup()->getFactor() ==
+ IR->getInterleaveGroup()->getNumMembers() &&
+ IR->getVPValue(Idx) == V;
+ }
+ return false;
+}
+
+/// Returns true if \p IR is a full interleave group with factor and number of
+/// members both equal to \p VF.
+static bool isConsecutiveInterleaveGroup(VPInterleaveRecipe *IR,
+ ElementCount VF) {
+ if (!IR)
+ return false;
+ auto IG = IR->getInterleaveGroup();
+ return IG->getFactor() == IG->getNumMembers() &&
+ IG->getNumMembers() == VF.getFixedValue();
+}
+
+void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF) {
+ using namespace llvm::VPlanPatternMatch;
+ if (VF.isScalable())
+ return;
+
+ SmallVector<VPInterleaveRecipe *> StoreGroups;
+ for (auto &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
+ if (match(&R, m_BranchOnCount(m_VPValue(), m_VPValue())) ||
+ isa<VPCanonicalIVPHIRecipe>(&R))
+ continue;
+
+ // Bail out on recipes not supported at the moment:
+ // * phi recipes other than the canonical induction
+ // * recipes writing to memory except interleave groups
+ // Only support plans with a canonical induction phi.
+ if (R.isPhi())
+ return;
+
+ auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
+ if (R.mayWriteToMemory() && !IR)
+ return;
+
+ if (!IR)
+ continue;
+
+ if (!isConsecutiveInterleaveGroup(IR, VF))
+ return;
+ if (IR->getStoredValues().empty())
+ continue;
+
+ auto *Lane0 = dyn_cast_or_null<VPWidenRecipe>(
+ IR->getStoredValues()[0]->getDefiningRecipe());
+ if (!Lane0)
+ return;
+ for (const auto &[I, V] : enumerate(IR->getStoredValues())) {
+ auto *R = dyn_cast<VPWidenRecipe>(V->getDefiningRecipe());
+ if (!R || R->getOpcode() != Lane0->getOpcode())
+ return;
+ // Work around captured structured bindings being a C++20 extension.
+ auto Idx = I;
+ if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
+ return !supportedLoad(Lane0, V, Idx);
+ }))
+ return;
+ }
+
+ StoreGroups.push_back(IR);
+ }
+
+ if (StoreGroups.empty())
+ return;
+
+ // Narrow operation tree rooted at store groups.
+ for (auto *StoreGroup : StoreGroups) {
+ auto *Lane0 = cast<VPWidenRecipe>(
+ StoreGroup->getStoredValues()[0]->getDefiningRecipe());
+
+ unsigned LoadGroupIdx =
+ isa<VPInterleaveRecipe>(Lane0->getOperand(1)->getDefiningRecipe()) ? 1
+ : 0;
+ unsigned WideLoadIdx = 1 - LoadGroupIdx;
+ auto *LoadGroup = cast<VPInterleaveRecipe>(
+ Lane0->getOperand(LoadGroupIdx)->getDefiningRecipe());
+
+ auto *WideLoad = cast<VPWidenLoadRecipe>(
+ Lane0->getOperand(WideLoadIdx)->getDefiningRecipe());
+
+ // Narrow wide load to uniform scalar load, as transformed VPlan will only
+ // process one original iteration.
+ auto *N = new VPReplicateRecipe(&WideLoad->getIngredient(),
+ WideLoad->operands(), true);
+ // Narrow interleave group to wide load, as transformed VPlan will only
+ // process one original iteration.
+ auto *L = new VPWidenLoadRecipe(
+ *cast<LoadInst>(LoadGroup->getInterleaveGroup()->getInsertPos()),
+ LoadGroup->getAddr(), LoadGroup->getMask(), true, false,
+ LoadGroup->getDebugLoc());
+ L->insertBefore(LoadGroup);
+ N->insertBefore(LoadGroup);
+ Lane0->setOperand(LoadGroupIdx, L);
+ Lane0->setOperand(WideLoadIdx, N);
+
+ auto *S = new VPWidenStoreRecipe(
+ *cast<StoreInst>(StoreGroup->getInterleaveGroup()->getInsertPos()),
+ StoreGroup->getAddr(), Lane0, nullptr, true, false,
+ StoreGroup->getDebugLoc());
+ S->insertBefore(StoreGroup);
+ StoreGroup->eraseFromParent();
+ }
+
+ // Adjust induction to reflect that the transformed plan only processes one
+ // original iteration.
+ auto *CanIV = Plan.getCanonicalIV();
+ VPInstruction *Inc = cast<VPInstruction>(CanIV->getBackedgeValue());
+ Inc->setOperand(
+ 1, Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
+ removeDeadRecipes(Plan);
+ LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index f4a17aec42b244..efe4bde8998fd4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -114,11 +114,15 @@ struct VPlanTransforms {
// widening its memory instructions with a single VPInterleaveRecipe at its
// insertion point.
static void createInterleaveGroups(
- const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+ VPlan &Plan,
+ const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+ &InterleaveGroups,
VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed);
/// Remove dead recipes from \p Plan.
static void removeDeadRecipes(VPlan &Plan);
+
+ static void narrowInterleaveGroups(VPlan &Plan, ElementCount VF);
};
} // namespace llvm
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
index 87674f611251cc..997ef7466d5cfc 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
@@ -14,8 +14,7 @@ define void @interleaved_store_first_order_recurrence(ptr noalias %src, ptr %dst
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
; CHECK-NEXT: [[TMP3:%.*]] = mul nuw nsw i64 [[TMP0]], 3
; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[DST:%.*]], i64 [[TMP3]]
-; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i64 2
-; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP6]], i32 -2
+; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i32 0
; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> zeroinitializer, <4 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <8 x i32> [[TMP9]], <8 x i32> [[TMP10]], <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
index 0959d913fd0cd5..ba4145217c3ba5 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
@@ -41,13 +41,11 @@ define void @test_array_load2_store2(i32 %C, i32 %D) #1 {
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[WIDE_VEC]])
; CHECK-NEXT: [[TMP3:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[STRIDED_VEC]], 0
; CHECK-NEXT: ...
[truncated]
@@ -954,7 +954,6 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,

   IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
   // FIXME: Model VF * UF computation completely in VPlan.
-  assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");
Why dropping this check? If this check is dropped, shall it early exit if VFxUF.getNumUsers() == 0?
The narrowing changes the loop from stepping by VFxUF to stepping by 1, removing the user of VFxUF. We could create it only conditionally, but I don't think it would cause any test changes, as there would only be a difference for scalable vectors, which aren't supported for narrowing.
Shall it early exit then if VFxUF.getNumUsers() == 0?
Done, thanks!
case VPInstructionSC: {
  return !Instruction::isBinaryOp(cast<VPInstruction>(this)->getOpcode());
}
What about cast or select or other instructions here? Is this still correct for them?
Yep, unified the logic separately (68ed172) and removed the change here.
#define LV_NAME "loop-vectorize"
#define DEBUG_TYPE LV_NAME
Why do you need this?
This is needed for LLVM_DEBUG below.
Removed in latest version, thanks
if (W->getMask())
  return false;
-if (W->getMask())
-  return false;
+return !W->getMask();
Dropped the check and return as it is subsumed by the check below, thanks!
auto Idx = I;
if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
-auto Idx = I;
-if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
+if (any_of(R->operands(), [Lane0, Idx=I](VPValue *V) {
Done, I think originally Clang complained about this being an extension. Looks fine now, thanks!
Unify logic for mayWriteToMemory and mayHaveSideEffects for VPInstruction, with the latter relying on the former. Also extend to handle binary operators. Split off from #106441
Conflicts: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
✅ With the latest revision this PR passed the C/C++ code formatter.
Unify logic for mayWriteToMemory and mayHaveSideEffects for VPInstruction, with the latter relying on the former. Also extend to handle binary operators. Split off from llvm#106441
Unify logic for mayWriteToMemory and mayHaveSideEffects for VPInstruction, with the latter relying on the former. Also extend to handle binary operators. Split off from llvm/llvm-project#106441
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. For now it only transforms load interleave groups feeding store groups. Depends on #106431. This lands the main parts of the approved #106441 as suggested, to break things up a bit more.
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. For now it only transforms load interleave groups feeding store groups. Depends on #106431. This lands the main parts of the approved llvm/llvm-project#106441 as suggested, to break things up a bit more.
Further increase test coverage for #106441. Also regenerate checks with -filter-out-after.
Further increase test coverage for llvm/llvm-project#106441. Also regenerate checks with -filter-out-after.
Force-pushed from d4ba75e to 0226cb0
I landed the load->store group only transform separately as c73ad7b as suggested.
Will land the changes in the PR soon as well
Done, thanks!
…106441) This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on llvm/llvm-project#106431. Fixes llvm/llvm-project#82936. PR: llvm/llvm-project#106441
This change caused a miscompile in ffmpeg on aarch64 (on both Linux and Windows), causing ffmpeg to crash. The issue is reproducible both with ffmpeg compiled via llvm-test-suite and compiled standalone (clone https://github.com/ffmpeg/ffmpeg). To reproduce the crash, build ffmpeg in either way and run:
(This uses an input file hosted by me; for repeated use it may be easier to download it and reference it with a local file name.) The miscompiled source file is
…oups. (#106441)" This reverts commit dfa665f. This commit caused miscompilations in ffmpeg, see llvm/llvm-project#106441 for details.
I pushed a revert for this change now.
…6441)" This reverts commit ff3e2ba. The recommitted version limits the transform to cases where no interleaving is taking place, to avoid a miscompile when interleaving. Original commit message: This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on #106431. Fixes #82936. PR: #106441
@mstorsjo thanks, I recommitted with a fix, limiting this to loops w/o interleaving for now: 577631f. I ran the ffmpeg tests from the test-suite, but it sounds like the test-suite coverage could be improved by running various encoders with different inputs and checking the output? Would the input files you are using for your testing be suitable for adding to the test-suite and defining tests using them?
…roups. (#106441)" This reverts commit ff3e2ba. The recommitted version limits the transform to cases where no interleaving is taking place, to avoid a miscompile when interleaving. Original commit message: This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on llvm/llvm-project#106431. Fixes llvm/llvm-project#82936. PR: llvm/llvm-project#106441
Great, thanks! I built the latest clang and reran the full ffmpeg testsuite, and it seems to pass flawlessly now.
That would be great, but unfortunately it'd be a fairly massive effort... Upstream ffmpeg has got like over 5000 tests, defined in a combination of shell script and makefiles. Some tests decode test samples from a separately synced repo of test inputs (currently a 1.5 GB bunch of inputs) - some of them check the hash of the decoded output, expecting it to be bitexact. Some decode and check that the output is close enough to a reference output (e.g. for decoders that use floats internally). Some tests generate a synthetic test input, then try to encode it into various formats, decode it back again, and do some comparison on it, etc. (This particular test that I isolated was this kind, I think.) So while it's probably not hard to extract a handful of testcases and set them up in llvm-test-suite, it probably doesn't give very much meaningful test coverage if it's only a few. (Chances that you'd break the exact same case again are pretty small.) ;-) And doing all of them would both be a quite massive effort, and it would also probably be a huge pain to maintain in sync (if we'd want to update targeting a newer release). This, compared with the cmake wrapping, which is fairly low effort to update to a newer release if we want to do that. If you do want to try to run the full upstream ffmpeg testsuite, it's not very hard:
…aving. Remove the UF = 1 restriction introduced by 577631f, building on top of 783a846, which allows updating all relevant users of the VF (VPScalarIVSteps in particular). This restores the full functionality of llvm/llvm-project#106441.
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.
This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.
This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.
Depends on #106431.
Fixes #82936