[VPlan] Use pointer to member 0 as VPInterleaveRecipe's pointer arg. #106431
Conversation
Update VPInterleaveRecipe to always use the pointer to member 0 as pointer argument. This in many cases helps to remove unneeded index adjustments and simplifies VPInterleaveRecipe::execute. In some rare cases, the address of member 0 does not dominate the insert position of the interleave group. In those cases a PtrAdd VPInstruction is emitted to compute the address of member 0 based on the address of the insert position. Alternatively we could hoist the recipe computing the address of member 0.
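For illustration, a minimal sketch of the kind of source loop that forms an interleave group; the function name, types, and loop shape are assumptions for this example and are not taken from the PR or its tests:

```cpp
// Hypothetical loop: the two stores form an interleave group of factor 2,
// where q[2 * i] is member 0 (lowest address in the group) and q[2 * i + 1]
// is member 1. For store groups the insert position is the last member.
void store_pair(char *q, char a, char b, int n) {
  for (int i = 0; i < n; i++) {
    q[2 * i] = a;     // member 0
    q[2 * i + 1] = b; // member 1, the group's insert position
  }
}
```

Previously the widened store addressed the group through the insert position's member and stepped back with a negative GEP index (see the `getelementptr i8, ptr [[TMP6]], i32 -1` lines removed from the masked_strided2 checks below); with this change the group is addressed directly through member 0 and that adjustment disappears.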
@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

Update VPInterleaveRecipe to always use the pointer to member 0 as pointer argument. This in many cases helps to remove unneeded index adjustments and simplifies VPInterleaveRecipe::execute. In some rare cases, the address of member 0 does not dominate the insert position of the interleave group. In those cases a PtrAdd VPInstruction is emitted to compute the address of member 0 based on the address of the insert position. Alternatively we could hoist the recipe computing the address of member 0.

Patch is 32.76 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106431.diff

10 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index cb104c4ed2d03d..1e92bda5494248 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8993,8 +8993,8 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
// Interleave memory: for each Interleave Group we marked earlier as relevant
// for this VPlan, replace the Recipes widening its memory instructions with a
// single VPInterleaveRecipe at its insertion point.
- VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
- CM.isScalarEpilogueAllowed());
+ VPlanTransforms::createInterleaveGroups(
+ *Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
for (ElementCount VF : Range)
Plan->addVF(VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index e7ea5cb23b90d3..b08cba798a9a9b 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}
+ VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
+ const Twine &Name = "")
+ : VPRecipeWithIRFlags(VPDef::VPInstructionSC,
+ ArrayRef<VPValue *>({Ptr, Offset}),
+ GEPFlagsTy(InBounds), DL),
+ Opcode(VPInstruction::PtrAdd), Name(Name.str()) {}
+
VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands,
FastMathFlags FMFs, DebugLoc DL = {}, const Twine &Name = "");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index f84317ba51257a..df4e3a54812626 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -646,7 +646,9 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
"can only generate first lane for PtrAdd");
Value *Ptr = State.get(getOperand(0), Part, /* IsScalar */ true);
Value *Addend = State.get(getOperand(1), Part, /* IsScalar */ true);
- return Builder.CreatePtrAdd(Ptr, Addend, Name);
+ return Builder.CreatePtrAdd(Ptr, Addend, Name,
+ isInBounds() ? GEPNoWrapFlags::inBounds()
+ : GEPNoWrapFlags::none());
}
case VPInstruction::ResumePhi: {
if (Part != 0)
@@ -2393,7 +2395,6 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
// Prepare for the new pointers.
SmallVector<Value *, 2> AddrParts;
- unsigned Index = Group->getIndex(Instr);
// TODO: extend the masked interleaved-group support to reversed access.
VPValue *BlockInMask = getMask();
@@ -2413,10 +2414,11 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
Idx = State.Builder.CreateMul(Idx,
State.Builder.getInt32(Group->getFactor()));
- Idx = State.Builder.CreateAdd(Idx, State.Builder.getInt32(Index));
Idx = State.Builder.CreateNeg(Idx);
- } else
- Idx = State.Builder.getInt32(-Index);
+ } else {
+ // TODO: Drop redundant 0-index GEP as follow-up.
+ Idx = State.Builder.getInt32(0);
+ }
VPValue *Addr = getAddr();
for (unsigned Part = 0; Part < State.UF; Part++) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index ee7c7cea0b7670..4dae6a54b017f6 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -596,8 +596,7 @@ static void legalizeAndOptimizeInductions(VPlan &Plan, ScalarEvolution &SE) {
Plan, InductionDescriptor::IK_IntInduction, Instruction::Add, nullptr,
SE, nullptr, StartV, StepV, InsertPt);
- auto *Recipe = new VPInstruction(VPInstruction::PtrAdd,
- {PtrIV->getStartValue(), Steps},
+ auto *Recipe = new VPInstruction(PtrIV->getStartValue(), Steps, false,
PtrIV->getDebugLoc(), "next.gep");
Recipe->insertAfter(Steps);
@@ -1522,14 +1521,19 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
}
void VPlanTransforms::createInterleaveGroups(
- const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+ VPlan &Plan,
+ const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+ &InterleaveGroups,
VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed) {
+ if (InterleaveGroups.empty())
+ return;
+
// Interleave memory: for each Interleave Group we marked earlier as relevant
// for this VPlan, replace the Recipes widening its memory instructions with a
// single VPInterleaveRecipe at its insertion point.
+ VPDominatorTree VPDT;
+ VPDT.recalculate(Plan);
for (const auto *IG : InterleaveGroups) {
- auto *Recipe =
- cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getInsertPos()));
SmallVector<VPValue *, 4> StoredValues;
for (unsigned i = 0; i < IG->getFactor(); ++i)
if (auto *SI = dyn_cast_or_null<StoreInst>(IG->getMember(i))) {
@@ -1539,9 +1543,39 @@ void VPlanTransforms::createInterleaveGroups(
bool NeedsMaskForGaps =
IG->requiresScalarEpilogue() && !ScalarEpilogueAllowed;
- auto *VPIG = new VPInterleaveRecipe(IG, Recipe->getAddr(), StoredValues,
- Recipe->getMask(), NeedsMaskForGaps);
- VPIG->insertBefore(Recipe);
+
+ Instruction *IRInsertPos = IG->getInsertPos();
+ auto *InsertPos =
+ cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
+ VPRecipeBase *IP = InsertPos;
+
+ // Get or create the start address for the interleave group.
+ auto *Start =
+ cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
+ VPValue *Addr = Start->getAddr();
+ if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
+ bool InBounds = false;
+ if (auto *gep = dyn_cast<GetElementPtrInst>(
+ getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
+ InBounds = gep->isInBounds();
+
+ // We cannot re-use the address of the first member because it does not
+ // dominate the insert position. Use the address of the insert position
+ // and create a PtrAdd to adjust the index to start at the first member.
+ APInt Offset(32,
+ getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
+ IG->getIndex(IRInsertPos),
+ /*IsSigned=*/true);
+ VPValue *OffsetVPV = Plan.getOrAddLiveIn(
+ ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
+ Addr = new VPInstruction(InsertPos->getAddr(), OffsetVPV, InBounds);
+ Addr->getDefiningRecipe()->insertAfter(InsertPos);
+ IP = Addr->getDefiningRecipe();
+ }
+ auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
+ InsertPos->getMask(), NeedsMaskForGaps);
+ VPIG->insertAfter(IP);
+
unsigned J = 0;
for (unsigned i = 0; i < IG->getFactor(); ++i)
if (Instruction *Member = IG->getMember(i)) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index e714f69eeff1ab..434545c5ea67e1 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -112,7 +112,9 @@ struct VPlanTransforms {
// widening its memory instructions with a single VPInterleaveRecipe at its
// insertion point.
static void createInterleaveGroups(
- const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+ VPlan &Plan,
+ const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+ &InterleaveGroups,
VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed);
};
diff --git a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
index cc1d11754b27ed..9383799b181c82 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll
@@ -89,12 +89,11 @@ define void @test_free_instructions_feeding_geps_for_interleave_groups(ptr noali
; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x float> poison, float [[TMP40]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT]], <2 x float> poison, <2 x i32> zeroinitializer
; CHECK-NEXT: [[TMP41:%.*]] = shl i64 [[TMP39]], 2
+; CHECK-NEXT: [[TMP44:%.*]] = getelementptr float, ptr [[DST_1]], i64 [[TMP41]]
; CHECK-NEXT: [[TMP42:%.*]] = load float, ptr [[P_INVAR]], align 4
; CHECK-NEXT: [[BROADCAST_SPLATINSERT27:%.*]] = insertelement <2 x float> poison, float [[TMP42]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT28:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT27]], <2 x float> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP43:%.*]] = or disjoint i64 [[TMP41]], 3
-; CHECK-NEXT: [[TMP44:%.*]] = getelementptr float, ptr [[DST_1]], i64 [[TMP43]]
-; CHECK-NEXT: [[TMP45:%.*]] = getelementptr float, ptr [[TMP44]], i32 -3
+; CHECK-NEXT: [[TMP45:%.*]] = getelementptr float, ptr [[TMP44]], i32 0
; CHECK-NEXT: [[TMP46:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLAT]], <2 x float> [[BROADCAST_SPLAT28]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP47:%.*]] = shufflevector <4 x float> [[TMP46]], <4 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x float> [[TMP47]], <8 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
@@ -102,8 +101,8 @@ define void @test_free_instructions_feeding_geps_for_interleave_groups(ptr noali
; CHECK-NEXT: [[TMP48:%.*]] = load float, ptr [[P_INVAR]], align 4
; CHECK-NEXT: [[BROADCAST_SPLATINSERT29:%.*]] = insertelement <2 x float> poison, float [[TMP48]], i64 0
; CHECK-NEXT: [[BROADCAST_SPLAT30:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLATINSERT29]], <2 x float> poison, <2 x i32> zeroinitializer
-; CHECK-NEXT: [[TMP49:%.*]] = getelementptr float, ptr [[DST_2]], i64 [[TMP43]]
-; CHECK-NEXT: [[TMP50:%.*]] = getelementptr float, ptr [[TMP49]], i32 -3
+; CHECK-NEXT: [[TMP49:%.*]] = getelementptr float, ptr [[DST_2]], i64 [[TMP41]]
+; CHECK-NEXT: [[TMP50:%.*]] = getelementptr float, ptr [[TMP49]], i32 0
; CHECK-NEXT: [[BROADCAST_SPLAT36:%.*]] = shufflevector <2 x float> [[BROADCAST_SPLAT30]], <2 x float> zeroinitializer, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP51:%.*]] = shufflevector <4 x float> [[BROADCAST_SPLAT36]], <4 x float> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT: [[INTERLEAVED_VEC31:%.*]] = shufflevector <8 x float> [[TMP51]], <8 x float> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
@@ -266,8 +265,7 @@ define void @geps_feeding_interleave_groups_with_reuse(ptr %arg, i64 %arg1, ptr
; CHECK-NEXT: [[TMP35:%.*]] = fmul <2 x float> [[TMP34]], zeroinitializer
; CHECK-NEXT: [[TMP36:%.*]] = fadd <2 x float> [[STRIDED_VEC16]], [[STRIDED_VEC20]]
; CHECK-NEXT: [[TMP37:%.*]] = fmul <2 x float> [[TMP36]], zeroinitializer
-; CHECK-NEXT: [[TMP38:%.*]] = getelementptr i8, ptr [[TMP28]], i64 12
-; CHECK-NEXT: [[TMP39:%.*]] = getelementptr float, ptr [[TMP38]], i32 -3
+; CHECK-NEXT: [[TMP39:%.*]] = getelementptr float, ptr [[TMP28]], i32 0
; CHECK-NEXT: [[TMP40:%.*]] = shufflevector <2 x float> [[TMP31]], <2 x float> [[TMP33]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP41:%.*]] = shufflevector <2 x float> [[TMP35]], <2 x float> [[TMP37]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[TMP42:%.*]] = shufflevector <4 x float> [[TMP40]], <4 x float> [[TMP41]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
@@ -490,11 +488,10 @@ define void @geps_feeding_interleave_groups_with_reuse2(ptr %A, ptr %B, i64 %N)
; CHECK-NEXT: [[WIDE_VEC:%.*]] = load <16 x i32>, ptr [[TMP53]], align 4
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x i32> [[WIDE_VEC]], <16 x i32> poison, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
; CHECK-NEXT: [[STRIDED_VEC34:%.*]] = shufflevector <16 x i32> [[WIDE_VEC]], <16 x i32> poison, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
+; CHECK-NEXT: [[TMP56:%.*]] = getelementptr i32, ptr [[A]], i64 [[TMP50]]
; CHECK-NEXT: [[TMP54:%.*]] = getelementptr i32, ptr [[B]], <4 x i64> [[VEC_IND]]
; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> [[TMP54]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison), !alias.scope [[META6:![0-9]+]]
-; CHECK-NEXT: [[TMP55:%.*]] = or disjoint i64 [[TMP50]], 7
-; CHECK-NEXT: [[TMP56:%.*]] = getelementptr i32, ptr [[A]], i64 [[TMP55]]
-; CHECK-NEXT: [[TMP57:%.*]] = getelementptr i32, ptr [[TMP56]], i32 -7
+; CHECK-NEXT: [[TMP57:%.*]] = getelementptr i32, ptr [[TMP56]], i32 0
; CHECK-NEXT: [[TMP58:%.*]] = shufflevector <4 x i32> [[STRIDED_VEC]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT: [[TMP59:%.*]] = shufflevector <4 x i32> [[STRIDED_VEC34]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
; CHECK-NEXT: [[TMP60:%.*]] = shufflevector <4 x i32> [[WIDE_MASKED_GATHER]], <4 x i32> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
diff --git a/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll b/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
index 6b52023cfbcaec..968058134690bc 100644
--- a/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
+++ b/llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll
@@ -1419,13 +1419,11 @@ define dso_local void @masked_strided2(ptr noalias nocapture readonly %p, ptr no
; ENABLED_MASKED_STRIDED-NEXT: [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP2]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP3:%.*]] = or disjoint i32 [[TMP1]], 1
; ENABLED_MASKED_STRIDED-NEXT: [[TMP4:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC1]])
+; ENABLED_MASKED_STRIDED-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP1]]
; ENABLED_MASKED_STRIDED-NEXT: [[TMP5:%.*]] = sub <8 x i8> zeroinitializer, [[TMP4]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP6:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP3]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP7:%.*]] = getelementptr i8, ptr [[TMP6]], i32 -1
; ENABLED_MASKED_STRIDED-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
; ENABLED_MASKED_STRIDED-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 8
; ENABLED_MASKED_STRIDED-NEXT: [[VEC_IND_NEXT]] = add <8 x i32> [[VEC_IND]], <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
; ENABLED_MASKED_STRIDED-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1024
@@ -2555,13 +2553,11 @@ define dso_local void @masked_strided2_unknown_tc(ptr noalias nocapture readonly
; ENABLED_MASKED_STRIDED-NEXT: [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP3]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP5:%.*]] = or disjoint i32 [[TMP2]], 1
; ENABLED_MASKED_STRIDED-NEXT: [[TMP6:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC3]])
+; ENABLED_MASKED_STRIDED-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP2]]
; ENABLED_MASKED_STRIDED-NEXT: [[TMP7:%.*]] = sub <8 x i8> zeroinitializer, [[TMP6]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[Q:%.*]], i32 [[TMP5]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[TMP8]], i32 -1
; ENABLED_MASKED_STRIDED-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP6]], <8 x i8> [[TMP7]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP9]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP8]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
; ENABLED_MASKED_STRIDED-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
; ENABLED_MASKED_STRIDED-NEXT: [[VEC_IND_NEXT]] = add <8 x i32> [[VEC_IND]], <i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8, i32 8>
; ENABLED_MASKED_STRIDED-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
@@ -2989,13 +2985,11 @@ define dso_local void @unconditional_masked_strided2_unknown_tc(ptr noalias noca
; ENABLED_MASKED_STRIDED-NEXT: [[WIDE_MASKED_VEC:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP2]], i32 1, <16 x i1> [[INTERLEAVED_MASK]], <16 x i8> poison)
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
; ENABLED_MASKED_STRIDED-NEXT: [[STRIDED_VEC3:%.*]] = shufflevector <16 x i8> [[WIDE_MASKED_VEC]], <16 x i8> poison, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP3:%.*]] = or disjoint i32 [[TMP1]], 1
; ENABLED_MASKED_STRIDED-NEXT: [[TMP4:%.*]] = call <8 x i8> @llvm.smax.v8i8(<8 x i8> [[STRIDED_VEC]], <8 x i8> [[STRIDED_VEC3]])
+; ENABLED_MASKED_STRIDED-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[Q:%.*]], i32 [[TMP1]]
; ENABLED_MASKED_STRIDED-NEXT: [[TMP5:%.*]] = sub <8 x i8> zeroinitializer, [[TMP4]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, ptr [[Q:%.*]], i32 [[TMP3]]
-; ENABLED_MASKED_STRIDED-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[TMP6]], i32 -1
; ENABLED_MASKED_STRIDED-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
-; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
+; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[...
[truncated]
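To make the fallback in createInterleaveGroups above concrete, a worked example of the offset computation with assumed values; the helper name and the numbers (4-byte float members, insert position at group index 3) are illustrative, not taken from a specific test:

```cpp
// Sketch of the byte-offset computation used when member 0's address does not
// dominate the insert position.
int offsetBackToMemberZero() {
  unsigned MemberBytes = 32 / 8; // getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8
  unsigned InsertPosIdx = 3;     // IG->getIndex(IRInsertPos)
  int Offset = MemberBytes * InsertPosIdx; // 12 bytes from member 0 to the insert position
  // The emitted PtrAdd uses the negated offset: the insert position's address
  // plus (-12) bytes yields the address of member 0.
  return -Offset;
}
```

This path is only taken when member 0's address does not dominate the insert position; otherwise the address of member 0 is reused directly.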
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on llvm#106431. Fixes llvm#82936
ping :)
The change makes sense to me, thanks!
Nice simplification! Can provide a step towards facilitating SLP.
@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
    assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
  }

  VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
nit: a bit odd to construct PtrAdd given two VPValues and a bool. Perhaps pass a GEPFlagsTy as third parameter instead of bool, analogous to passing FMFs?
Updated, thanks!
isInBounds() ? GEPNoWrapFlags::inBounds()
             : GEPNoWrapFlags::none());
Independent of this patch, but exercised only with it?
return Builder.CreatePtrAdd(Ptr, Addend, Name,
                            isInBounds() ? GEPNoWrapFlags::inBounds()
                                         : GEPNoWrapFlags::none());
-return Builder.CreatePtrAdd(Ptr, Addend, Name,
-                            isInBounds() ? GEPNoWrapFlags::inBounds()
-                                         : GEPNoWrapFlags::none());
+auto Flags = isInBounds() ? GEPNoWrapFlags::inBounds()
+                          : GEPNoWrapFlags::none();
+return Builder.CreatePtrAdd(Ptr, Addend, Name, Flags);
?
Updated to use CreateInBoundsPtrAdd/CreatePtrAdd.
@@ -2413,10 +2414,11 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
      Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
      Idx = State.Builder.CreateMul(Idx,
                                    State.Builder.getInt32(Group->getFactor()));
Does the above comment about adjusting the index from the first vector lane (lane zero?) rather than using the pointer of the last lane still hold? If so, this adjustment is invariant and would be better placed in the preheader, possibly later by LICM?
Yes I think so, we just don't have to compute the index for the first lane (used to be done via the removed add)
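To make the reversed-access adjustment concrete, a small sketch with assumed values (RuntimeVF = 4, Group->getFactor() = 2); the helper name is invented for the example:

```cpp
// Mirrors the CreateSub/CreateMul/CreateNeg sequence in the hunk above: the
// distance, in elements, from the group's base back to the group accessed by
// the last lane. The per-member Index (the removed CreateAdd) is no longer
// folded in, because member 0 is now the base address.
int reverseGroupStartOffset(int RuntimeVF, int Factor) {
  int Idx = (RuntimeVF - 1) * Factor; // 3 * 2 = 6
  return -Idx;                        // -6
}
```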
} else {
  // TODO: Drop redundant 0-index GEP as follow-up.
  Idx = State.Builder.getInt32(0);
}

VPValue *Addr = getAddr();
for (unsigned Part = 0; Part < State.UF; Part++) {
Should the "Notice current instruction ..." explanation below be updated?
It can be dropped now I think, done, thanks!
auto *Start =
    cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
then hoist Addr's defining recipe to insert pos, possibly hoisting additional defining recipes as needed? And/or sink loads above member zero to join it.
In general, all members of an interleave group conceptually move to its insert pos, so may as well perform actual movement. This should facilitate subsequent SLP'ing.
Interleave stores are inserted at the last member, so the address of any member should be available there. Interleaved loads are inserted at the first member, so only its address is guaranteed to be available there, although others may as well.
Sounds good, should be done as potential follow-up?
Sure, worth leaving behind a TODO.
Added, thanks!
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i8, ptr [[TMP4]], i32 -4
; CHECK-NEXT: [[TMP51:%.*]] = getelementptr i16, ptr [[TMP5]], i32 0
Note: here a redundant gep with zero is added, replacing TMP4[-2:i16] with TMP4[-4:i8 + 0]
; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> [[BROADCAST_SPLAT2]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
; CHECK-NEXT: store <8 x i32> [[INTERLEAVED_VEC]], ptr [[TMP12]], align 4
; CHECK-NEXT: store <8 x i32> [[INTERLEAVED_VEC]], ptr [[TMP7]], align 4
Note: this stores INTERLEAVED_VEC at A[OFFSET_IDX], rather than to A[OFFSET_IDX+(1*4)-4]
Stores of X should be dce'd.
; ENABLED_MASKED_STRIDED-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <8 x i8> [[TMP4]], <8 x i8> [[TMP5]], <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP7]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
; ENABLED_MASKED_STRIDED-NEXT: call void @llvm.masked.store.v16i8.p0(<16 x i8> [[INTERLEAVED_VEC]], ptr [[TMP6]], i32 1, <16 x i1> [[INTERLEAVED_MASK]])
Note: Q[(TMP1+1)-1] --> Q[TMP1]
LGTM, adding a few minor comments.
@@ -1323,6 +1322,13 @@ class VPInstruction : public VPRecipeWithIRFlags,
    assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
  }

  VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
-VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
+VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags,
default unneeded and may lead to potential ambiguity?
Done, thanks!
DebugLoc DL = {}, const Twine &Name = "")
    : VPRecipeWithIRFlags(VPDef::VPInstructionSC,
                          ArrayRef<VPValue *>({Ptr, Offset}),
                          GEPFlagsTy(Flags), DL),
-GEPFlagsTy(Flags), DL),
+Flags, DL),
?
Done, thanks!
-    } else
-      Idx = State.Builder.getInt32(-Index);
+    } else {
+      // TODO: Drop redundant 0-index GEP as follow-up.
Your call; generating gep with zero also causes some test discrepancies.
auto *Start =
    cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
Also handle(s) a live-in Addr?
Done by checking if there's a defining recipe, thanks!
// and create a PtrAdd to adjust the index to start at the first member.
APInt Offset(32,
             getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
                 IG->getIndex(IRInsertPos),
Worth asserting Offset or index of IRInsertPos is non-zero?
I.e., if insert pos is the first member, its address operand must dominate it.
Added assert, thanks!
        getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
  InBounds = Gep->isInBounds();

// We cannot re-use the address of the first member because it does not
-// We cannot re-use the address of the first member because it does not
+// We cannot re-use the address of member zero because it does not
(may be confusing: insertion point appears "first" (for loads, last for stores); here we mean "first" in terms of memory addresses.)
Updated, thanks!
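To make the "member zero" vs. "first member" distinction concrete, a hypothetical load group; the function and names are invented for illustration:

```cpp
// Within one iteration the load of p[2 * i + 1] comes first in program order,
// so it is the group's insert position (group index 1). p[2 * i] is member
// zero (the lowest address); its GEP is emitted later, so it need not
// dominate the insert position, which is exactly the case where this patch
// falls back to emitting a PtrAdd.
void sum_pairs(const int *p, int *out, int n) {
  for (int i = 0; i < n; i++) {
    int hi = p[2 * i + 1]; // insert position: first member in program order
    int lo = p[2 * i];     // member zero: lowest address in the group
    out[i] = lo + hi;
  }
}
```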
  InBounds = Gep->isInBounds();

// We cannot re-use the address of the first member because it does not
// dominate the insert position. Use the address of the insert position
-// dominate the insert position. Use the address of the insert position
+// dominate the insert position. Instead, use the address of the insert position
Updated, thanks.
// We cannot re-use the address of the first member because it does not
// dominate the insert position. Use the address of the insert position
// and create a PtrAdd to adjust the index to start at the first member.
-// and create a PtrAdd to adjust the index to start at the first member.
+// and create a PtrAdd adjusting it to the address of member zero.
Updated, thanks!
auto *Start =
    cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
Sure, worth leaving behind a TODO.
* commit 'FETCH_HEAD':
  [X86] getIntImmCostInst - pull out repeated Imm.getBitWidth() calls. NFC.
  [X86] Add test coverage for llvm#111323
  [Driver] Use empty multilib file in another test (llvm#111352)
  [clang][OpenMP][test] Use x86_64-linux-gnu triple for test referencing avx512f feature (llvm#111337)
  [doc] Fix Kaleidoscope tutorial chapter 3 code snippet and full listing discrepancies (llvm#111289)
  [Flang][OpenMP] Improve entry block argument creation and binding (llvm#110267)
  [x86] combineMul - handle 0/-1 KnownBits cases before MUL_IMM logic (REAPPLIED)
  [llvm-dis] Fix non-deterministic disassembly across multiple inputs (llvm#110988)
  [lldb][test] TestDataFormatterLibcxxOptionalSimulator.py: change order of ifdefs
  [lldb][test] Add libcxx-simulators test for std::optional (llvm#111133)
  [x86] combineMul - use computeKnownBits directly to find MUL_IMM constant splat. (REAPPLIED)
  Reland "[lldb][test] TestDataFormatterLibcxxStringSimulator.py: add new padding layout" (llvm#111123)
  Revert "[x86] combineMul - use computeKnownBits directly to find MUL_IMM constant splat."
  update_test_checks: fix a simple regression (llvm#111347)
  [LegalizeVectorTypes] Always widen fabs (llvm#111298)
  [lsan] Make ReportUnsuspendedThreads return bool also for Fuchsia
  [mlir][vector] Add more tests for ConvertVectorToLLVM (6/n) (llvm#111121)
  [bazel] port 9144fed
  [SystemZ] Remove inlining threshold multiplier. (llvm#106058)
  [LegalizeVectorTypes] When widening don't check for libcalls if promoted (llvm#111297)
  [clang][Driver] Improve multilib custom error reporting (llvm#110804)
  [clang][Driver] Rename "FatalError" key to "Error" in multilib.yaml (llvm#110804)
  [LLVM][Maintainers] Update release managers (llvm#111164)
  [Clang][Driver] Add option to provide path for multilib's YAML config file (llvm#109640)
  [LoopVectorize] Remove redundant code in emitSCEVChecks (llvm#111132)
  [AMDGPU] Only emit SCOPE_SYS global_wb (llvm#110636)
  [ELF] Change Ctx::target to unique_ptr (llvm#111260)
  [ELF] Pass Ctx & to some free functions
  [RISCV] Only disassemble fcvtmod.w.d if the rounding mode is rtz. (llvm#111308)
  [Clang] Remove the special-casing for RequiresExprBodyDecl in BuildResolvedCallExpr() after fd87d76 (llvm#111277)
  [ELF] Pass Ctx & to InputFile
  [clang-format] Add AlignFunctionDeclarations to AlignConsecutiveDeclarations (llvm#108241)
  [AMDGPU] Support preloading hidden kernel arguments (llvm#98861)
  [ELF] Move static nextGroupId isInGroup to LinkerDriver
  [clangd] Add ArgumentLists config option under Completion (llvm#111322)
  [ELF] Pass Ctx & to SyntheticSections
  [ELF] Pass Ctx & to Symbols
  [ELF] Pass Ctx & to Symbols
  [ELF] getRelocTargetVA: pass Ctx and Relocation. NFC
  [clang-tidy] Avoid capturing a local variable in a static lambda in UseRangesCheck (llvm#111282)
  [VPlan] Use pointer to member 0 as VPInterleaveRecipe's pointer arg. (llvm#106431)
  [clangd] Simplify ternary expressions with std::optional::value_or (NFC) (llvm#111309)
  [libc++][format][2/3] Optimizes c-string arguments. (llvm#101805)
  [RISCV] Combine RVBUnary and RVKUnary into classes that are more similar to ALU(W)_r(r/i). NFC (llvm#111279)
  [ELF] Pass Ctx & to InputFiles
  [libc] GPU RPC interface: add return value to `rpc_host_call` (llvm#111288)

Signed-off-by: kyvangka1610 <[email protected]>
Hi @fhahn, a heads up: some of our tests started failing on aarch64 after this commit. Bisecting with
I looked closer at the original C++ code and it actually looks like a miscompile. The diff in the generated IR before and after this commit (and the follow-up):
I've reduced this to
Hopefully, no important details were lost during reduction.
Use getAllocTypeSize to compute the offset to the start of interleave groups instead of getScalarSizeInBits, which may return 0 for pointers. This is in line with the analysis building the interleave groups and fixes a mis-compile reported for #106431.
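A sketch of what the corrected computation might look like; this is an approximation, not the literal code of bbff5b8, and it reuses names from the patch hunk above (IRInsertPos, IG):

```cpp
// getScalarSizeInBits() returns 0 for pointer-typed members, which made the
// old computation produce a zero offset and thus the wrong member address.
// The DataLayout's alloc size gives the correct byte size for any member type.
const DataLayout &DL = IRInsertPos->getModule()->getDataLayout();
uint64_t MemberBytes =
    DL.getTypeAllocSize(getLoadStoreType(IRInsertPos)).getFixedValue();
APInt Offset(32, MemberBytes * IG->getIndex(IRInsertPos), /*IsSigned=*/true);
```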
Thanks! bbff5b8 resolved the issues we found so far.
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. For now it only transforms load interleave groups feeding store groups. Depends on #106431. This lands the main parts of the approved #106441 as suggested to break things up a bit more.
@fhahn, would you take care of it?
Should already be fixed
This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on #106431. Fixes #82936. PR: #106441
…106441) This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on llvm/llvm-project#106431. Fixes llvm/llvm-project#82936. PR: llvm/llvm-project#106441
…6441)" This reverts commit ff3e2ba. The recommitted version limits the transform to cases where no interleaving is taking place, to avoid a mis-compile when interleaving. Original commit message: This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on #106431. Fixes #82936. PR: #106441
…roups. (#106441)" This reverts commit ff3e2ba. The recommitted version limits the transform to cases where no interleaving is taking place, to avoid a mis-compile when interleaving. Original commit message: This patch adds a new narrowInterleaveGroups transform, which tries to convert a plan with interleave groups with VF elements to a plan that instead replaces the interleave groups with wide loads and stores processing VF elements. This effectively is a very simple form of loop-aware SLP, where we use interleave groups to identify candidates. This initial version is quite restricted and hopefully serves as a starting point for how to best model those kinds of transforms. Depends on llvm/llvm-project#106431. Fixes llvm/llvm-project#82936. PR: llvm/llvm-project#106441
Update VPInterleaveRecipe to always use the pointer to member 0 as pointer argument. This in many cases helps to remove unneeded index adjustments and simplifies VPInterleaveRecipe::execute.
In some rare cases, the address of member 0 does not dominate the insert position of the interleave group. In those cases a PtrAdd VPInstruction is emitted to compute the address of member 0 based on the address of the insert position. Alternatively we could hoist the recipe computing the address of member 0.