[VPlan] Add transformation to narrow interleave groups. #106441


Merged (36 commits) on Mar 22, 2025

Conversation

@fhahn (Contributor) commented Aug 28, 2024

This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This is effectively a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on #106431.

Fixes #82936
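To make the idea concrete, here is a hypothetical C++ sketch (not LLVM code; the function names are invented for this example). When an interleave group is "full" — its factor equals its number of members, both equal to VF — the group's members together cover one contiguous block per original iteration, so the modeled deinterleave/reinterleave shuffles cancel and the group can be replaced by a plain wide load or store:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar source loop being modeled:
//   for (i) { dst[2*i] = src[2*i]; dst[2*i+1] = src[2*i+1]; }
// With VF = 2 this forms load and store interleave groups of factor 2
// with 2 members each, i.e. full groups.
constexpr unsigned VF = 2;

// Before the transform: wide load of VF * Factor elements,
// deinterleave into per-member lanes, reinterleave, wide store.
void interleavedCopy(const std::vector<int> &Src, std::vector<int> &Dst) {
  for (std::size_t I = 0; I + 2 * VF <= Src.size(); I += 2 * VF) {
    int Even[VF], Odd[VF]; // deinterleaved lanes
    for (unsigned J = 0; J < VF; ++J) {
      Even[J] = Src[I + 2 * J];
      Odd[J] = Src[I + 2 * J + 1];
    }
    for (unsigned J = 0; J < VF; ++J) { // reinterleave and store
      Dst[I + 2 * J] = Even[J];
      Dst[I + 2 * J + 1] = Odd[J];
    }
  }
}

// After narrowing: one wide contiguous VF-element copy per original
// iteration, no shuffles; the induction step shrinks accordingly.
void narrowedCopy(const std::vector<int> &Src, std::vector<int> &Dst) {
  for (std::size_t I = 0; I + VF <= Src.size(); I += VF)
    for (unsigned J = 0; J < VF; ++J)
      Dst[I + J] = Src[I + J];
}
```

Both routines move the same bytes; the narrowed form simply skips the shuffle machinery, which is the intent of the transform.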

@fhahn changed the title from "Vplan narrow interleave" to "[VPlan] Add transformation to narrow interleave groups." (Aug 28, 2024)
@ayalz (Collaborator) left a comment

First set of minor comments.

Comment on lines 7490 to 7492
if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF)) {
LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
}
Collaborator

Suggested change
if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF)) {
LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
}
if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF))
LLVM_DEBUG(dbgs() << "Narrowed interleave\n");

better if the transform itself dumps what it does, possibly along with each interleave narrowed?

Contributor Author

Moved, thanks!

@@ -7487,6 +7487,9 @@ LoopVectorizationPlanner::executePlan(

VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);

Collaborator

Should optimizeForVFAndUF() above be renamed eliminateRedundantBranchOnCount()?
Otherwise narrowInterleaveGroups() below would consistently be optimizeForVF().

Contributor Author

Sounds good! There are no other simplifications there so far

VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
CM.isScalarEpilogueAllowed());
VPlanTransforms::createInterleaveGroups(
*Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
Collaborator

nit: RecipeBuilder already holds Plan.

Contributor Author

This is part of #10643

@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}

VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
Collaborator

Seems a bit odd for a constructor of VPInstruction(VPValue*, VPValue*, bool) to create a PtrAdd. Perhaps having GEPFlagsTy as 3rd parameter instead of bool would be clearer. Should be documented.

Contributor Author

This is part of #10643

Comment on lines 653 to 655
return Builder.CreatePtrAdd(Ptr, Addend, Name,
isInBounds() ? GEPNoWrapFlags::inBounds()
: GEPNoWrapFlags::none());
Collaborator

Independent of this patch?

Comment on lines 1637 to 1714
if ((R.isPhi() && !isa<VPCanonicalIVPHIRecipe>(&R)) ||
(R.mayWriteToMemory() && !isa<VPInterleaveRecipe>(&R)))
return false;

auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
if (!IR)
continue;
Collaborator

Suggested change
if ((R.isPhi() && !isa<VPCanonicalIVPHIRecipe>(&R)) ||
(R.mayWriteToMemory() && !isa<VPInterleaveRecipe>(&R)))
return false;
auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
if (!IR)
continue;
if (R.isPhi())
return false;
auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
if (R.mayWriteToMemory() && !IR)
return false;
if (!IR)
continue;

Contributor Author

Simplified, thanks!


bool Changed = false;
SmallVector<VPInterleaveRecipe *> StoreGroups;
for (auto &R : make_early_inc_range(
Collaborator

Why make_early_inc_range if StoreGroups is filled to be operated on later?

Contributor Author

This was a left-over from earlier versions, removed, thanks!
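For readers unfamiliar with the helper under discussion, here is a sketch of the idiom behind llvm::make_early_inc_range, using std::list rather than the LLVM ADT (illustrative, not the actual implementation): the iterator is advanced before the current element is used, so the loop body may erase the element it is visiting. When candidates are merely collected for later processing — as StoreGroups is here — the plain range suffices.

```cpp
#include <cassert>
#include <list>

// Erase even values while iterating; returns how many were erased.
// The "early increment" (It++ before use) keeps iteration valid even
// though the current node is deleted.
int eraseEvens(std::list<int> &Values) {
  int NumErased = 0;
  for (auto It = Values.begin(); It != Values.end();) {
    auto Cur = It++;      // step past Cur before possibly erasing it
    if (*Cur % 2 == 0) {
      Values.erase(Cur);  // safe: It no longer refers to Cur
      ++NumErased;
    }
  }
  return NumErased;
}
```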

return false;
}

/// Returns true if \p IR is a consecutive interleave group with \p VF members.
Collaborator

Suggested change
/// Returns true of \p IR is a consecutive interleave group with \p VF members.
/// Returns true if \p IR is a full interleave group with factor and number of members both equal to \p VF.

Contributor Author

Updated thanks!

return false;
auto IG = IR->getInterleaveGroup();
return IG->getFactor() == IG->getNumMembers() &&
IG->getNumMembers() == VF.getKnownMinValue();
Collaborator

Suggested change
IG->getNumMembers() == VF.getKnownMinValue();
IG->getNumMembers() == VF.getFixedValue();

?

Contributor Author

Adjusted, thanks!
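The getKnownMinValue() versus getFixedValue() distinction behind this suggestion can be sketched with a small model (mirroring llvm::ElementCount semantics, but not the real class): a scalable VF such as `<vscale x 4>` only has a known *minimum* of 4, so comparing the member count against getKnownMinValue() could accept a scalable VF by accident, whereas getFixedValue() states the fixed-width intent precisely and asserts it.

```cpp
#include <cassert>

// Simplified stand-in for llvm::ElementCount.
struct ElementCountModel {
  unsigned MinVal; // minimum number of elements
  bool Scalable;   // true for <vscale x MinVal>, false for fixed
  unsigned getKnownMinValue() const { return MinVal; }
  unsigned getFixedValue() const {
    // Only meaningful for fixed-width counts; the real class asserts too.
    assert(!Scalable && "getFixedValue on scalable count");
    return MinVal;
  }
};
```

Since narrowInterleaveGroups already bails out on scalable VFs, getFixedValue() is safe at the point of the comparison.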

StoreGroup->getDebugLoc());
S->insertBefore(StoreGroup);
StoreGroup->eraseFromParent();
Changed = true;
Collaborator

Early exit above if StoredGroups.empty() returning false?

Contributor Author

Updated, thanks!

@fhahn force-pushed the vplan-narrow-interleave branch from 6a2778c to 61279d1 (September 25, 2024 20:32)
@fhahn marked this pull request as ready for review (September 25, 2024 20:33)
@fhahn requested a review from alexey-bataev (September 25, 2024 20:34)
@@ -673,6 +672,7 @@ static void recursivelyDeleteDeadRecipes(VPValue *V) {
void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
unsigned BestUF,
PredicatedScalarEvolution &PSE) {

Contributor Author

Removed, thanks!

Comment on lines 653 to 655
return Builder.CreatePtrAdd(Ptr, Addend, Name,
isInBounds() ? GEPNoWrapFlags::inBounds()
: GEPNoWrapFlags::none());
Contributor Author

Yes, the PR currently contains #106431, which is where this comes from.

Comment on lines 1549 to 1650
Instruction *IRInsertPos = IG->getInsertPos();
auto *InsertPos =
cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
VPRecipeBase *IP = InsertPos;

// Get or create the start address for the interleave group.
auto *Start =
cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
VPValue *Addr = Start->getAddr();
if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
bool InBounds = false;
if (auto *gep = dyn_cast<GetElementPtrInst>(
getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
InBounds = gep->isInBounds();

// We cannot re-use the address of the first member because it does not
// dominate the insert position. Use the address of the insert position
// and create a PtrAdd to adjust the index to start at the first member.
APInt Offset(32,
getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
IG->getIndex(IRInsertPos),
/*IsSigned=*/true);
VPValue *OffsetVPV = Plan.getOrAddLiveIn(
ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
Addr = new VPInstruction(InsertPos->getAddr(), OffsetVPV, InBounds);
Addr->getDefiningRecipe()->insertAfter(InsertPos);
IP = Addr->getDefiningRecipe();
}
auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
InsertPos->getMask(), NeedsMaskForGaps);
VPIG->insertAfter(IP);

Contributor Author

Also part of #106431
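A small numeric sketch of the offset computation quoted above (values illustrative): if the insert position is member `Index` of the group and elements are `ScalarSizeInBits` wide, the group's first member lives `(ScalarSizeInBits / 8) * Index` bytes *before* the insert position's address, which is why the constant is negated before being fed to the PtrAdd.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset from the insert position's address back to member 0 of
// the interleave group, matching the APInt computation in the patch.
int64_t startOffsetBytes(unsigned ScalarSizeInBits, unsigned Index) {
  return -static_cast<int64_t>((ScalarSizeInBits / 8) * Index);
}
```

For example, if the insert position is the i32 member at index 2, the group starts 8 bytes earlier, i.e. at offset -8.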

return false;
}

/// Returns true if \p IR is a consecutive interleave group with \p VF members.
Contributor Author

Updated thanks!

return false;
auto IG = IR->getInterleaveGroup();
return IG->getFactor() == IG->getNumMembers() &&
IG->getNumMembers() == VF.getKnownMinValue();
Contributor Author

Adjusted, thanks!

StoreGroup->getDebugLoc());
S->insertBefore(StoreGroup);
StoreGroup->eraseFromParent();
Changed = true;
Contributor Author

Updated, thanks!

@@ -1330,6 +1330,13 @@ class VPInstruction : public VPRecipeWithIRFlags {
assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
}

VPInstruction(VPValue *Ptr, VPValue *Offset, bool InBounds, DebugLoc DL = {},
Contributor Author

This is part of #10643

VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
CM.isScalarEpilogueAllowed());
VPlanTransforms::createInterleaveGroups(
*Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
Contributor Author

This is part of #10643

Comment on lines 7490 to 7492
if (VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF)) {
LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
}
Contributor Author

Moved, thanks!

@@ -7487,6 +7487,9 @@ LoopVectorizationPlanner::executePlan(

VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);

Contributor Author

Sounds good! There are no other simplifications there so far

@llvmbot
Member

llvmbot commented Sep 25, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This is effectively a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on #106431.

Fixes #82936


Patch is 249.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106441.diff

20 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+8-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+3-2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+7-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+16-25)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+170-6)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+5-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll (+1-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+11-19)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-masked-accesses.ll (+6-10)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses.ll (+15-84)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleave-cost.ll (+7-10)
  • (added) llvm/test/Transforms/LoopVectorize/X86/transform-narrow-interleave-to-widen-memory.ll (+755)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/vectorize-interleaved-accesses-gap.ll (+2-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-accesses-masked-group.ll (+6-12)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/x86-interleaved-store-accesses-with-gaps.ll (+4-8)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses-different-insert-position.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+9-16)
  • (added) llvm/test/Transforms/LoopVectorize/transform-narrow-interleave-to-widen-memory.ll (+1677)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+2-3)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 00eec0a6f7b14e..5951873a960af2 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -220,9 +220,15 @@ class VPBuilder {
         new VPInstruction(Instruction::ICmp, Pred, A, B, DL, Name));
   }
 
-  VPInstruction *createPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL,
+  VPInstruction *createPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL = {},
                               const Twine &Name = "") {
-    return createInstruction(VPInstruction::PtrAdd, {Ptr, Offset}, DL, Name);
+    return tryInsertInstruction(new VPInstruction(
+        Ptr, Offset, VPRecipeWithIRFlags::GEPFlagsTy(false), DL, Name));
+  }
+  VPValue *createInBoundsPtrAdd(VPValue *Ptr, VPValue *Offset, DebugLoc DL = {},
+                                const Twine &Name = "") {
+    return tryInsertInstruction(new VPInstruction(
+        Ptr, Offset, VPRecipeWithIRFlags::GEPFlagsTy(true), DL, Name));
   }
 
   VPDerivedIVRecipe *createDerivedIV(InductionDescriptor::InductionKind Kind,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index db4631e19c11d3..3eb830bcc02098 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7506,6 +7506,7 @@ LoopVectorizationPlanner::executePlan(
   VPlanTransforms::unrollByUF(BestVPlan, BestUF,
                               OrigLoop->getHeader()->getModule()->getContext());
   VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
+  VPlanTransforms::narrowInterleaveGroups(BestVPlan, BestVF);
 
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF
                     << ", UF=" << BestUF << '\n');
@@ -9005,8 +9006,8 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
   // Interleave memory: for each Interleave Group we marked earlier as relevant
   // for this VPlan, replace the Recipes widening its memory instructions with a
   // single VPInterleaveRecipe at its insertion point.
-  VPlanTransforms::createInterleaveGroups(InterleaveGroups, RecipeBuilder,
-                                          CM.isScalarEpilogueAllowed());
+  VPlanTransforms::createInterleaveGroups(
+      *Plan, InterleaveGroups, RecipeBuilder, CM.isScalarEpilogueAllowed());
 
   for (ElementCount VF : Range)
     Plan->addVF(VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 6ddbfcf0ecfe58..a18ab844c5ebec 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -954,7 +954,6 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
 
   IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
   // FIXME: Model VF * UF computation completely in VPlan.
-  assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");
   unsigned UF = getUF();
   if (VF.getNumUsers()) {
     Value *RuntimeVF = getRuntimeVF(Builder, TCTy, State.VF);
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index bbcfaf9e19cd0c..f841355d320930 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -956,7 +956,6 @@ class VPRecipeWithIRFlags : public VPSingleDefRecipe {
     DisjointFlagsTy(bool IsDisjoint) : IsDisjoint(IsDisjoint) {}
   };
 
-protected:
   struct GEPFlagsTy {
     char IsInBounds : 1;
     GEPFlagsTy(bool IsInBounds) : IsInBounds(IsInBounds) {}
@@ -1307,6 +1306,13 @@ class VPInstruction : public VPRecipeWithIRFlags,
     assert(Opcode == Instruction::Or && "only OR opcodes can be disjoint");
   }
 
+  VPInstruction(VPValue *Ptr, VPValue *Offset, GEPFlagsTy Flags = {false},
+                DebugLoc DL = {}, const Twine &Name = "")
+      : VPRecipeWithIRFlags(VPDef::VPInstructionSC,
+                            ArrayRef<VPValue *>({Ptr, Offset}),
+                            GEPFlagsTy(Flags), DL),
+        Opcode(VPInstruction::PtrAdd), Name(Name.str()) {}
+
   VPInstruction(unsigned Opcode, std::initializer_list<VPValue *> Operands,
                 FastMathFlags FMFs, DebugLoc DL = {}, const Twine &Name = "");
 
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index dacba152611c19..96b16164738181 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -48,6 +48,9 @@ extern cl::opt<unsigned> ForceTargetInstructionCost;
 
 bool VPRecipeBase::mayWriteToMemory() const {
   switch (getVPDefID()) {
+  case VPInstructionSC: {
+    return !Instruction::isBinaryOp(cast<VPInstruction>(this)->getOpcode());
+  }
   case VPInterleaveSC:
     return cast<VPInterleaveRecipe>(this)->getNumStoreOperands() > 0;
   case VPWidenStoreEVLSC:
@@ -63,6 +66,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
   case VPBranchOnMaskSC:
   case VPScalarIVStepsSC:
   case VPPredInstPHISC:
+  case VPVectorPointerSC:
     return false;
   case VPBlendSC:
   case VPReductionEVLSC:
@@ -644,7 +648,8 @@ Value *VPInstruction::generate(VPTransformState &State) {
            "can only generate first lane for PtrAdd");
     Value *Ptr = State.get(getOperand(0), /* IsScalar */ true);
     Value *Addend = State.get(getOperand(1), /* IsScalar */ true);
-    return Builder.CreatePtrAdd(Ptr, Addend, Name);
+    return isInBounds() ? Builder.CreateInBoundsPtrAdd(Ptr, Addend, Name)
+                        : Builder.CreatePtrAdd(Ptr, Addend, Name);
   }
   case VPInstruction::ResumePhi: {
     Value *IncomingFromVPlanPred =
@@ -2470,15 +2475,12 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
   unsigned InterleaveFactor = Group->getFactor();
   auto *VecTy = VectorType::get(ScalarTy, State.VF * InterleaveFactor);
 
-  // Prepare for the new pointers.
-  unsigned Index = Group->getIndex(Instr);
-
   // TODO: extend the masked interleaved-group support to reversed access.
   VPValue *BlockInMask = getMask();
   assert((!BlockInMask || !Group->isReverse()) &&
          "Reversed masked interleave-group not supported.");
 
-  Value *Idx;
+  Value *Index;
   // If the group is reverse, adjust the index to refer to the last vector lane
   // instead of the first. We adjust the index from the first vector lane,
   // rather than directly getting the pointer for lane VF - 1, because the
@@ -2486,35 +2488,24 @@ void VPInterleaveRecipe::execute(VPTransformState &State) {
   if (Group->isReverse()) {
     Value *RuntimeVF =
         getRuntimeVF(State.Builder, State.Builder.getInt32Ty(), State.VF);
-    Idx = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
-    Idx = State.Builder.CreateMul(Idx,
-                                  State.Builder.getInt32(Group->getFactor()));
-    Idx = State.Builder.CreateAdd(Idx, State.Builder.getInt32(Index));
-    Idx = State.Builder.CreateNeg(Idx);
-  } else
-    Idx = State.Builder.getInt32(-Index);
+    Index = State.Builder.CreateSub(RuntimeVF, State.Builder.getInt32(1));
+    Index = State.Builder.CreateMul(Index,
+                                    State.Builder.getInt32(Group->getFactor()));
+    Index = State.Builder.CreateNeg(Index);
+  } else {
+    // TODO: Drop redundant 0-index GEP as follow-up.
+    Index = State.Builder.getInt32(0);
+  }
 
   VPValue *Addr = getAddr();
   Value *ResAddr = State.get(Addr, VPLane(0));
   if (auto *I = dyn_cast<Instruction>(ResAddr))
     State.setDebugLocFrom(I->getDebugLoc());
 
-  // Notice current instruction could be any index. Need to adjust the address
-  // to the member of index 0.
-  //
-  // E.g.  a = A[i+1];     // Member of index 1 (Current instruction)
-  //       b = A[i];       // Member of index 0
-  // Current pointer is pointed to A[i+1], adjust it to A[i].
-  //
-  // E.g.  A[i+1] = a;     // Member of index 1
-  //       A[i]   = b;     // Member of index 0
-  //       A[i+2] = c;     // Member of index 2 (Current instruction)
-  // Current pointer is pointed to A[i+2], adjust it to A[i].
-
   bool InBounds = false;
   if (auto *gep = dyn_cast<GetElementPtrInst>(ResAddr->stripPointerCasts()))
     InBounds = gep->isInBounds();
-  ResAddr = State.Builder.CreateGEP(ScalarTy, ResAddr, Idx, "", InBounds);
+  ResAddr = State.Builder.CreateGEP(ScalarTy, ResAddr, Index, "", InBounds);
 
   State.setDebugLocFrom(Instr->getDebugLoc());
   Value *PoisonVec = PoisonValue::get(VecTy);
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
index a878613c4ba483..58188379c7fabb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
@@ -29,6 +29,9 @@
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/PatternMatch.h"
 
+#define LV_NAME "loop-vectorize"
+#define DEBUG_TYPE LV_NAME
+
 using namespace llvm;
 
 void VPlanTransforms::VPInstructionsToVPRecipes(
@@ -710,6 +713,7 @@ void VPlanTransforms::optimizeForVFAndUF(VPlan &Plan, ElementCount BestVF,
   // TODO: Further simplifications are possible
   //      1. Replace inductions with constants.
   //      2. Replace vector loop region with VPBasicBlock.
+  //
 }
 
 /// Sink users of \p FOR after the recipe defining the previous value \p
@@ -1590,14 +1594,19 @@ void VPlanTransforms::dropPoisonGeneratingRecipes(
 }
 
 void VPlanTransforms::createInterleaveGroups(
-    const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+    VPlan &Plan,
+    const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+        &InterleaveGroups,
     VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed) {
+  if (InterleaveGroups.empty())
+    return;
+
   // Interleave memory: for each Interleave Group we marked earlier as relevant
   // for this VPlan, replace the Recipes widening its memory instructions with a
   // single VPInterleaveRecipe at its insertion point.
+  VPDominatorTree VPDT;
+  VPDT.recalculate(Plan);
   for (const auto *IG : InterleaveGroups) {
-    auto *Recipe =
-        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getInsertPos()));
     SmallVector<VPValue *, 4> StoredValues;
     for (unsigned i = 0; i < IG->getFactor(); ++i)
       if (auto *SI = dyn_cast_or_null<StoreInst>(IG->getMember(i))) {
@@ -1607,9 +1616,38 @@ void VPlanTransforms::createInterleaveGroups(
 
     bool NeedsMaskForGaps =
         IG->requiresScalarEpilogue() && !ScalarEpilogueAllowed;
-    auto *VPIG = new VPInterleaveRecipe(IG, Recipe->getAddr(), StoredValues,
-                                        Recipe->getMask(), NeedsMaskForGaps);
-    VPIG->insertBefore(Recipe);
+
+    Instruction *IRInsertPos = IG->getInsertPos();
+    auto *InsertPos =
+        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IRInsertPos));
+
+    // Get or create the start address for the interleave group.
+    auto *Start =
+        cast<VPWidenMemoryRecipe>(RecipeBuilder.getRecipe(IG->getMember(0)));
+    VPValue *Addr = Start->getAddr();
+    if (!VPDT.properlyDominates(Addr->getDefiningRecipe(), InsertPos)) {
+      bool InBounds = false;
+      if (auto *Gep = dyn_cast<GetElementPtrInst>(
+              getLoadStorePointerOperand(IRInsertPos)->stripPointerCasts()))
+        InBounds = Gep->isInBounds();
+
+      // We cannot re-use the address of the first member because it does not
+      // dominate the insert position. Use the address of the insert position
+      // and create a PtrAdd to adjust the index to start at the first member.
+      APInt Offset(32,
+                   getLoadStoreType(IRInsertPos)->getScalarSizeInBits() / 8 *
+                       IG->getIndex(IRInsertPos),
+                   /*IsSigned=*/true);
+      VPValue *OffsetVPV = Plan.getOrAddLiveIn(
+          ConstantInt::get(IRInsertPos->getParent()->getContext(), -Offset));
+      VPBuilder B(InsertPos);
+      Addr = InBounds ? B.createInBoundsPtrAdd(InsertPos->getAddr(), OffsetVPV)
+                      : B.createPtrAdd(InsertPos->getAddr(), OffsetVPV);
+    }
+    auto *VPIG = new VPInterleaveRecipe(IG, Addr, StoredValues,
+                                        InsertPos->getMask(), NeedsMaskForGaps);
+    VPIG->insertBefore(InsertPos);
+
     unsigned J = 0;
     for (unsigned i = 0; i < IG->getFactor(); ++i)
       if (Instruction *Member = IG->getMember(i)) {
@@ -1623,3 +1661,129 @@ void VPlanTransforms::createInterleaveGroups(
       }
   }
 }
+
+static bool supportedLoad(VPWidenRecipe *R0, VPValue *V, unsigned Idx) {
+  if (auto *W = dyn_cast_or_null<VPWidenLoadRecipe>(V->getDefiningRecipe())) {
+    if (W->getMask())
+      return false;
+    return !W->getMask() && (R0->getOperand(0) == V || R0->getOperand(1) == V);
+  }
+
+  if (auto *IR = dyn_cast_or_null<VPInterleaveRecipe>(V->getDefiningRecipe())) {
+    return IR->getInterleaveGroup()->getFactor() ==
+               IR->getInterleaveGroup()->getNumMembers() &&
+           IR->getVPValue(Idx) == V;
+  }
+  return false;
+}
+
+/// Returns true if \p IR is a full interleave group with factor and number of
+/// members both equal to \p VF.
+static bool isConsecutiveInterleaveGroup(VPInterleaveRecipe *IR,
+                                         ElementCount VF) {
+  if (!IR)
+    return false;
+  auto IG = IR->getInterleaveGroup();
+  return IG->getFactor() == IG->getNumMembers() &&
+         IG->getNumMembers() == VF.getFixedValue();
+}
+
+void VPlanTransforms::narrowInterleaveGroups(VPlan &Plan, ElementCount VF) {
+  using namespace llvm::VPlanPatternMatch;
+  if (VF.isScalable())
+    return;
+
+  SmallVector<VPInterleaveRecipe *> StoreGroups;
+  for (auto &R : *Plan.getVectorLoopRegion()->getEntryBasicBlock()) {
+    if (match(&R, m_BranchOnCount(m_VPValue(), m_VPValue())) ||
+        isa<VPCanonicalIVPHIRecipe>(&R))
+      continue;
+
+    // Bail out on recipes not supported at the moment:
+    //  * phi recipes other than the canonical induction
+    //  * recipes writing to memory except interleave groups
+    // Only support plans with a canonical induction phi.
+    if (R.isPhi())
+      return;
+
+    auto *IR = dyn_cast<VPInterleaveRecipe>(&R);
+    if (R.mayWriteToMemory() && !IR)
+      return;
+
+    if (!IR)
+      continue;
+
+    if (!isConsecutiveInterleaveGroup(IR, VF))
+      return;
+    if (IR->getStoredValues().empty())
+      continue;
+
+    auto *Lane0 = dyn_cast_or_null<VPWidenRecipe>(
+        IR->getStoredValues()[0]->getDefiningRecipe());
+    if (!Lane0)
+      return;
+    for (const auto &[I, V] : enumerate(IR->getStoredValues())) {
+      auto *R = dyn_cast<VPWidenRecipe>(V->getDefiningRecipe());
+      if (!R || R->getOpcode() != Lane0->getOpcode())
+        return;
+      // Work around captured structured bindings being a C++20 extension.
+      auto Idx = I;
+      if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
+            return !supportedLoad(Lane0, V, Idx);
+          }))
+        return;
+    }
+
+    StoreGroups.push_back(IR);
+  }
+
+  if (StoreGroups.empty())
+    return;
+
+  // Narrow operation tree rooted at store groups.
+  for (auto *StoreGroup : StoreGroups) {
+    auto *Lane0 = cast<VPWidenRecipe>(
+        StoreGroup->getStoredValues()[0]->getDefiningRecipe());
+
+    unsigned LoadGroupIdx =
+        isa<VPInterleaveRecipe>(Lane0->getOperand(1)->getDefiningRecipe()) ? 1
+                                                                           : 0;
+    unsigned WideLoadIdx = 1 - LoadGroupIdx;
+    auto *LoadGroup = cast<VPInterleaveRecipe>(
+        Lane0->getOperand(LoadGroupIdx)->getDefiningRecipe());
+
+    auto *WideLoad = cast<VPWidenLoadRecipe>(
+        Lane0->getOperand(WideLoadIdx)->getDefiningRecipe());
+
+    // Narrow wide load to uniform scalar load, as transformed VPlan will only
+    // process one original iteration.
+    auto *N = new VPReplicateRecipe(&WideLoad->getIngredient(),
+                                    WideLoad->operands(), true);
+    // Narrow interleave group to wide load, as transformed VPlan will only
+    // process one original iteration.
+    auto *L = new VPWidenLoadRecipe(
+        *cast<LoadInst>(LoadGroup->getInterleaveGroup()->getInsertPos()),
+        LoadGroup->getAddr(), LoadGroup->getMask(), true, false,
+        LoadGroup->getDebugLoc());
+    L->insertBefore(LoadGroup);
+    N->insertBefore(LoadGroup);
+    Lane0->setOperand(LoadGroupIdx, L);
+    Lane0->setOperand(WideLoadIdx, N);
+
+    auto *S = new VPWidenStoreRecipe(
+        *cast<StoreInst>(StoreGroup->getInterleaveGroup()->getInsertPos()),
+        StoreGroup->getAddr(), Lane0, nullptr, true, false,
+        StoreGroup->getDebugLoc());
+    S->insertBefore(StoreGroup);
+    StoreGroup->eraseFromParent();
+  }
+
+  // Adjust induction to reflect that the transformed plan only processes one
+  // original iteration.
+  auto *CanIV = Plan.getCanonicalIV();
+  VPInstruction *Inc = cast<VPInstruction>(CanIV->getBackedgeValue());
+  Inc->setOperand(
+      1, Plan.getOrAddLiveIn(ConstantInt::get(CanIV->getScalarType(), 1)));
+  removeDeadRecipes(Plan);
+  LLVM_DEBUG(dbgs() << "Narrowed interleave\n");
+}
diff --git a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
index f4a17aec42b244..efe4bde8998fd4 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
+++ b/llvm/lib/Transforms/Vectorize/VPlanTransforms.h
@@ -114,11 +114,15 @@ struct VPlanTransforms {
   // widening its memory instructions with a single VPInterleaveRecipe at its
   // insertion point.
   static void createInterleaveGroups(
-      const SmallPtrSetImpl<const InterleaveGroup<Instruction> *> &InterleaveGroups,
+      VPlan &Plan,
+      const SmallPtrSetImpl<const InterleaveGroup<Instruction> *>
+          &InterleaveGroups,
       VPRecipeBuilder &RecipeBuilder, bool ScalarEpilogueAllowed);
 
   /// Remove dead recipes from \p Plan.
   static void removeDeadRecipes(VPlan &Plan);
+
+  static void narrowInterleaveGroups(VPlan &Plan, ElementCount VF);
 };
 
 } // namespace llvm
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
index 87674f611251cc..997ef7466d5cfc 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/interleaved-store-of-first-order-recurrence.ll
@@ -14,8 +14,7 @@ define void @interleaved_store_first_order_recurrence(ptr noalias %src, ptr %dst
 ; CHECK-NEXT:    [[TMP2:%.*]] = shufflevector <4 x i32> [[VECTOR_RECUR]], <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> <i32 3, i32 4, i32 5, i32 6>
 ; CHECK-NEXT:    [[TMP3:%.*]] = mul nuw nsw i64 [[TMP0]], 3
 ; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[DST:%.*]], i64 [[TMP3]]
-; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i64 2
-; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP6]], i32 -2
+; CHECK-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i32 0
 ; CHECK-NEXT:    [[TMP9:%.*]] = shufflevector <4 x i32> zeroinitializer, <4 x i32> [[TMP2]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:    [[TMP10:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLAT]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
 ; CHECK-NEXT:    [[TMP11:%.*]] = shufflevector <8 x i32> [[TMP9]], <8 x i32> [[TMP10]], <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
index 0959d913fd0cd5..ba4145217c3ba5 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll
@@ -41,13 +41,11 @@ define void @test_array_load2_store2(i32 %C, i32 %D) #1 {
 ; CHECK-NEXT:    [[STRIDED_VEC:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[WIDE_VEC]])
 ; CHECK-NEXT:    [[TMP3:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[STRIDED_VEC]], 0
 ; CHECK-NEXT: ...
[truncated]

@@ -954,7 +954,6 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,

IRBuilder<> Builder(State.CFG.PrevBB->getTerminator());
// FIXME: Model VF * UF computation completely in VPlan.
assert(VFxUF.getNumUsers() && "VFxUF expected to always have users");
Member

Why is this check being dropped? If it is dropped, shall it early-exit when VFxUF.getNumUsers() == 0?

Contributor Author

The narrowing changes the loop from stepping by VFxUF to stepping by 1, removing the user of VFxUF. We could create it only conditionally, but I don't think that would cause any test changes, as there would only be a difference for scalable vectors, which aren't supported for narrowing.

Member

Shall it early-exit then if VFxUF.getNumUsers() == 0?

Contributor Author

Done, thanks!

Comment on lines 51 to 53
case VPInstructionSC: {
return !Instruction::isBinaryOp(cast<VPInstruction>(this)->getOpcode());
}
Member

What about cast or select or other instructions here? Is this still correct for them?

Contributor Author

Yep, unified the logic separately (68ed172) and removed the change here.

Comment on lines 32 to 34
#define LV_NAME "loop-vectorize"
#define DEBUG_TYPE LV_NAME

Member

Why do you need this?

Contributor Author

This is needed for the LLVM_DEBUG output below.

Contributor Author

Removed in the latest version, thanks.

Comment on lines 1667 to 1668
if (W->getMask())
return false;
Member

Suggested change
- if (W->getMask())
-   return false;
+ return !W->getMask();

Contributor Author

Dropped the check and return, as it is subsumed by the check below, thanks!

Comment on lines 1730 to 1731
auto Idx = I;
if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
Member

Suggested change
- auto Idx = I;
- if (any_of(R->operands(), [Lane0, Idx](VPValue *V) {
+ if (any_of(R->operands(), [Lane0, Idx = I](VPValue *V) {

Contributor Author

Done; I think Clang originally complained about this being an extension. Looks fine now, thanks!

fhahn added a commit that referenced this pull request Sep 26, 2024
Unify the logic for mayWriteToMemory and mayHaveSideEffects for
VPInstruction, with the latter relying on the former. Also extend it to
handle binary operators.

Split off from #106441
github-actions bot commented Sep 26, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Sep 27, 2024

puja2196 pushed a commit to puja2196/LLVM-tutorial that referenced this pull request Sep 30, 2024

puja2196 pushed a commit to puja2196/LLVM-tutorial that referenced this pull request Oct 2, 2024
fhahn added a commit that referenced this pull request Mar 20, 2025
This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms. For now
it only transforms load interleave groups feeding store groups.

Depends on #106431.

This lands the main parts of the approved
#106441 as suggested to break
things up a bit more.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 20, 2025
fhahn added a commit that referenced this pull request Mar 22, 2025
Further increase test coverage for
#106441

Also regenerate checks with -filter-out-after.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 22, 2025
@fhahn fhahn force-pushed the vplan-narrow-interleave branch from d4ba75e to 0226cb0 Compare March 22, 2025 21:21
Contributor Author

@fhahn fhahn left a comment
I landed the load->store group only transform separately as c73ad7b as suggested.

Will land the changes in this PR soon as well.

Contributor Author

Done, thanks!

@fhahn fhahn merged commit dfa665f into llvm:main Mar 22, 2025
6 of 11 checks passed
@fhahn fhahn deleted the vplan-narrow-interleave branch March 22, 2025 21:40
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 22, 2025
@mstorsjo
Member
This change caused a miscompile in ffmpeg, on aarch64 (on both Linux and Windows), causing ffmpeg to crash. The issue is reproducible both with ffmpeg compiled via llvm-test-suite, and compiled standalone (clone https://github.com/ffmpeg/ffmpeg, ./configure --cc=clang && make -j$(nproc)).

To reproduce the crash, build ffmpeg in either way and run:

$ ./External/ffmpeg/ffmpeg -s 352x288 -i http://martin.st/temp/vsynth1.yuv -c ffv1 -coder range_tab -pass 1 -y vsynth1-ffv1-2pass.avi

(This uses an input file hosted by me, for repeated use it may be easier to download it and reference it with a local file name.)

The miscompiled source file is libavcodec/ffv1enc.c.

mstorsjo added a commit that referenced this pull request Mar 23, 2025
)"

This reverts commit dfa665f.

This commit caused miscompilations in ffmpeg, see
#106441 for details.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 23, 2025
@mstorsjo
Member

I pushed a revert for this change now.

fhahn added a commit that referenced this pull request Mar 25, 2025
…6441)"

This reverts commit ff3e2ba.

The recommitted version limits the transform to cases where no
interleaving is taking place, to avoid a miscompile when interleaving.

Original commit message:

This patch adds a new narrowInterleaveGroups transform, which tries to
convert a plan with interleave groups with VF elements to a plan that
instead replaces the interleave groups with wide loads and stores
processing VF elements.

This effectively is a very simple form of loop-aware SLP, where we
use interleave groups to identify candidates.

This initial version is quite restricted and hopefully serves as a
starting point for how to best model those kinds of transforms.

Depends on #106431.

Fixes #82936.

PR: #106441
@fhahn
Contributor Author

fhahn commented Mar 25, 2025

@mstorsjo thanks, I recommitted with a fix, limiting this to loops w/o interleaving for now: 577631f

I ran the ffmpeg tests from the test-suite, but it sounds like the test-suite coverage could be improved by running various encoders with different inputs and checking the output. Would the input files you are using for your testing be suitable for adding to the test-suite and defining tests with them?

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 25, 2025
@mstorsjo
Member
> @mstorsjo thanks, I recommitted with a fix, limiting this to loops w/o interleaving for now: 577631f

Great, thanks! I built the latest clang and reran the full ffmpeg testsuite now, and it seems to pass flawlessly now.

> I ran the ffmpeg tests from the test-suite, but it sounds like the test-suite coverage could be improved by running various encoders with different inputs and checking the output? Would the input files you are using for your testing be suitable for adding to the test-suite and define tests using them?

That would be great, but unfortunately it'd be a fairly massive effort... Upstream ffmpeg has got like over 5000 tests, defined in a combination of shell script and makefiles. Some tests decode test samples from a separately synced repo of test inputs (currently a 1.5 GB bunch of inputs) - some of them check the hash of the decoded output, expecting it to be bitexact. Some decode and check that the output is close enough to a reference output (e.g. for decoders that use floats internally). Some tests generate a synthetic test input, then try encode it into various formats, and decode it back again, and do some comparison on it, etc. (This particular test that I isolated was this kind, I think.)

So while it's probably not hard to extract a handful of testcases and set them up in llvm-test-suite, it probably doesn't give very much meaningful test coverage if it's only a few. (Chances that you'd break the exact same case again is pretty small.) ;-) And doing all of them would both be a quite massive effort, and it would also probably be a huge pain to maintain in sync (if we'd want to update targeting a newer release). This, compared with the cmake wrapping which is fairly low effort to update to a newer release if we want to do that.

If you do want to try to run the full upstream ffmpeg testsuite, it's not very hard:

$ mkdir ffmpeg-build
$ cd ffmpeg-build
$ ../path/to/ffmpeg/configure --cc=clang --samples=$(pwd)/../samples
$ make fate-rsync # download test samples into the path specified by --samples
$ make -j$(nproc)
$ make -j$(nproc) fate # run all tests

fhahn added a commit that referenced this pull request Mar 29, 2025
Remove the UF = 1 restriction introduced by 577631f, building on top of
783a846, which allows updating all relevant users of the VF,
VPScalarIVSteps in particular.

This restores the full functionality of
#106441.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Mar 29, 2025
Successfully merging this pull request may close these issues.

simd vectorization regression starting with clang 14
5 participants