Skip to content

[VPlan] Introduce explicit ExtractFromEnd recipes for live-outs. #100658

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Aug 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
205 changes: 174 additions & 31 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -8527,9 +8527,11 @@ static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
{CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
}

// Add exit values to \p Plan. VPLiveOuts are added for each LCSSA phi in the
// original exit block.
static void addUsersInExitBlock(
// Collect (ExitPhi, ExitingValue) pairs phis in the original exit block that
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Collect (ExitPhi, ExitingValue) pairs phis in the original exit block that
// Collect (ExitPhi, ExitingValue) pairs for phis in the original exit block that

// are modeled in VPlan. Some exiting values are not modeled explicitly yet and
// won't be included. Those are un-truncated VPWidenIntOrFpInductionRecipe,
// VPWidenPointerInductionRecipe and induction increments.
static MapVector<PHINode *, VPValue *> collectUsersInExitBlock(
Loop *OrigLoop, VPRecipeBuilder &Builder, VPlan &Plan,
const MapVector<PHINode *, InductionDescriptor> &Inductions) {
auto MiddleVPBB =
Expand All @@ -8538,9 +8540,8 @@ static void addUsersInExitBlock(
// and there is nothing to fix from vector loop; phis should have incoming
// from scalar loop only.
if (MiddleVPBB->getNumSuccessors() != 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So non-empty exitingValuesToFix implies two successors; having exit block as the only successor is not considered.

Looking forward following our roadmap, initial VPlan should probably always start with both successors, and optimized later to have a single successor if scalar epilog is required or redundant.

Does empty exitingValuesToFix imply no edge from middle block to exit block, i.e., can such an edge exist w/o any lcssa phi's to fix?

return;

// Introduce VPUsers modeling the exit values.
return {};
MapVector<PHINode *, VPValue *> ExitingValuesToFix;
BasicBlock *ExitBB =
cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])->getIRBasicBlock();
BasicBlock *ExitingBB = OrigLoop->getExitingBlock();
Expand All @@ -8561,15 +8562,52 @@ static void addUsersInExitBlock(
return P && Inductions.contains(P);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also bail out for FOR's, to be handled explicitly elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FOR should now be handled below

})))
continue;
Plan.addLiveOut(&ExitPhi, V);
ExitingValuesToFix.insert({&ExitPhi, V});
}
return ExitingValuesToFix;
}

/// Feed a resume value for every FOR from the vector loop to the scalar loop,
/// if middle block branches to scalar preheader, by introducing ExtractFromEnd
/// and ResumePhi recipes in each, respectively, and a VPLiveOut which uses the
/// latter and corresponds to the scalar header.
static void addLiveOutsForFirstOrderRecurrences(VPlan &Plan) {
// Add exit values to \p Plan. Extracts and VPLiveOuts are added for each entry
// in \p ExitingValuesToFix.
static void
addUsersInExitBlock(VPlan &Plan,
MapVector<PHINode *, VPValue *> &ExitingValuesToFix) {
if (ExitingValuesToFix.empty())
return;

auto MiddleVPBB =
cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor());
BasicBlock *ExitBB =
cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])->getIRBasicBlock();
// TODO: set B to MiddleVPBB->getFirstNonPhi(), taking care of affected tests.
VPBuilder B(MiddleVPBB);
if (auto *Terminator = MiddleVPBB->getTerminator()) {
auto *Condition = dyn_cast<VPInstruction>(Terminator->getOperand(0));
assert((!Condition || Condition->getParent() == MiddleVPBB) &&
"Condition expected in MiddleVPBB");
B.setInsertPoint(Condition ? Condition : Terminator);
}

// Introduce VPUsers modeling the exit values.
for (const auto &[ExitPhi, V] : ExitingValuesToFix) {
VPValue *Ext = B.createNaryOp(
VPInstruction::ExtractFromEnd,
{V, Plan.getOrAddLiveIn(ConstantInt::get(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independently: may be good to provide VPlan::getOrAddConstant(), with type derived somehows.

IntegerType::get(ExitBB->getContext(), 32), 1))});
Plan.addLiveOut(ExitPhi, Ext);
}
}

/// Handle live-outs for first order reductions, both in the scalar preheader
/// and the original exit block:
/// 1. Feed a resume value for every FOR from the vector loop to the scalar
/// loop, if middle block branches to scalar preheader, by introducing
/// ExtractFromEnd and ResumePhi recipes in each, respectively, and a
/// VPLiveOut which uses the latter and corresponds to the scalar header.
/// 2. Feed the penultimate value of recurrences to their LCSSA phi users in
/// the original exit block using a VPLiveOut.
static void addLiveOutsForFirstOrderRecurrences(
VPlan &Plan, MapVector<PHINode *, VPValue *> &ExitingValuesToFix) {
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();

// Start by finding out if middle block branches to scalar preheader, which is
Expand All @@ -8578,21 +8616,31 @@ static void addLiveOutsForFirstOrderRecurrences(VPlan &Plan) {
// TODO: Should be replaced by
// Plan->getScalarLoopRegion()->getSinglePredecessor() in the future once the
// scalar region is modeled as well.
VPBasicBlock *ScalarPHVPBB = nullptr;
auto *MiddleVPBB = cast<VPBasicBlock>(VectorRegion->getSingleSuccessor());
for (VPBlockBase *Succ : MiddleVPBB->getSuccessors()) {
if (isa<VPIRBasicBlock>(Succ))
continue;
assert(!ScalarPHVPBB && "Two candidates for ScalarPHVPBB?");
ScalarPHVPBB = cast<VPBasicBlock>(Succ);
BasicBlock *ExitBB = nullptr;
VPBasicBlock *ScalarPHVPBB = nullptr;
if (MiddleVPBB->getNumSuccessors() == 2) {
// Order is strict: first is the exit block, second is the scalar preheader.
ExitBB =
cast<VPIRBasicBlock>(MiddleVPBB->getSuccessors()[0])->getIRBasicBlock();
ScalarPHVPBB = cast<VPBasicBlock>(MiddleVPBB->getSuccessors()[1]);
} else if (ExitingValuesToFix.empty()) {
ScalarPHVPBB = cast<VPBasicBlock>(MiddleVPBB->getSingleSuccessor());
} else {
ExitBB = cast<VPIRBasicBlock>(MiddleVPBB->getSingleSuccessor())
->getIRBasicBlock();
Comment on lines +8627 to +8631
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else if (ExitingValuesToFix.empty()) {
ScalarPHVPBB = cast<VPBasicBlock>(MiddleVPBB->getSingleSuccessor());
} else {
ExitBB = cast<VPIRBasicBlock>(MiddleVPBB->getSingleSuccessor())
->getIRBasicBlock();
} else {
// Single successor - only scalar preheader case supported.
assert(exitingValuesToFix.empty() && "Exiting values to fix w/o edge to exit block");
ScalarPHVPBB = cast<VPBasicBlock>(MiddleVPBB->getSingleSuccessor());
}

}
if (!ScalarPHVPBB)
if (!ScalarPHVPBB) {
assert(ExitingValuesToFix.empty() &&
"missed inserting extracts for exiting values");
return;
}
Comment on lines +8633 to +8637
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can ScalarPHVPBB be null?


VPBuilder ScalarPHBuilder(ScalarPHVPBB);
VPBuilder MiddleBuilder(MiddleVPBB);
// Reset insert point so new recipes are inserted before terminator and
// condition, if there is either the former or both.
// TODO: set MiddleBuilder to MiddleVPBB->getFirstNonPhi().
if (auto *Terminator = MiddleVPBB->getTerminator()) {
auto *Condition = dyn_cast<VPInstruction>(Terminator->getOperand(0));
assert((!Condition || Condition->getParent() == MiddleVPBB) &&
Expand All @@ -8601,20 +8649,110 @@ static void addLiveOutsForFirstOrderRecurrences(VPlan &Plan) {
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
// TODO: set MiddleBuilder to MiddleVPBB->getFirstNonPhi().
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

VPValue *OneVPV = Plan.getOrAddLiveIn(
ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
VPValue *TwoVPV = Plan.getOrAddLiveIn(
ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 2));

for (auto &HeaderPhi : VectorRegion->getEntryBasicBlock()->phis()) {
auto *FOR = dyn_cast<VPFirstOrderRecurrencePHIRecipe>(&HeaderPhi);
if (!FOR)
continue;

// This is the second phase of vectorizing first-order recurrences, creating
// extract for users outside the loop. An overview of the transformation is
// described below. Suppose we have the following loop with some use after
// the loop of the last a[i-1],
//
// for (int i = 0; i < n; ++i) {
// t = a[i - 1];
// b[i] = a[i] - t;
// }
// use t;
//
// There is a first-order recurrence on "a". For this loop, the shorthand
// scalar IR looks like:
//
// scalar.ph:
// s.init = a[-1]
// br scalar.body
//
// scalar.body:
// i = phi [0, scalar.ph], [i+1, scalar.body]
// s1 = phi [s.init, scalar.ph], [s2, scalar.body]
// s2 = a[i]
// b[i] = s2 - s1
// br cond, scalar.body, exit.block
//
// exit.block:
// use = lcssa.phi [s1, scalar.body]
//
// In this example, s1 is a recurrence because it's value depends on the
// previous iteration. In the first phase of vectorization, we created a
// VPFirstOrderRecurrencePHIRecipe v1 for s1. Now we create the extracts
// for users in the scalar preheader and exit block.
//
// vector.ph:
// v_init = vector(..., ..., ..., a[-1])
// br vector.body
//
// vector.body
// i = phi [0, vector.ph], [i+4, vector.body]
// v1 = phi [v_init, vector.ph], [v2, vector.body]
// v2 = a[i, i+1, i+2, i+3]
// b[i] = v2 - v1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// b[i] = v2 - v1

this line still needs to drop.

// // Next, third phase will introduce v1' = splice(v1(3), v2(0, 1, 2))
// b[i, i+1, i+2, i+3] = v2 - v1
// br cond, vector.body, middle.block
//
// middle.block:
// vector.recur.extract.for.phi = v2(2)
// vector.recur.extract = v2(3)
// br cond, scalar.ph, exit.block
//
// scalar.ph:
// scalar.recur.init = phi [vector.recur.extract, middle.block],
// [s.init, otherwise]
// br scalar.body
//
// scalar.body:
// i = phi [0, scalar.ph], [i+1, scalar.body]
// s1 = phi [scalar.recur.init, scalar.ph], [s2, scalar.body]
// s2 = a[i]
// b[i] = s2 - s1
// br cond, scalar.body, exit.block
//
// exit.block:
// lo = lcssa.phi [s1, scalar.body],
// [vector.recur.extract.for.phi, middle.block]
//
// Extract the resume value and create a new VPLiveOut for it.
auto *Resume = MiddleBuilder.createNaryOp(VPInstruction::ExtractFromEnd,
{FOR->getBackedgeValue(), OneVPV},
{}, "vector.recur.extract");
auto *ResumePhiRecipe = ScalarPHBuilder.createNaryOp(
VPInstruction::ResumePhi, {Resume, FOR->getStartValue()}, {},
"scalar.recur.init");
Plan.addLiveOut(cast<PHINode>(FOR->getUnderlyingInstr()), ResumePhiRecipe);
auto *FORPhi = cast<PHINode>(FOR->getUnderlyingInstr());
Plan.addLiveOut(FORPhi, ResumePhiRecipe);

// Now create VPLiveOuts for users in the exit block.
// Extract the penultimate value of the recurrence and add VPLiveOut
// users of the recurrence splice.

// No edge from the middle block to the unique exit block has been inserted
// and there is nothing to fix from vector loop; phis should have incoming
// from scalar loop only.
if (ExitingValuesToFix.empty())
continue;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit odd that the existence of an edge from middle block to exit block can be so easily detected, and exitBB set, compared to an edge from middle block to scalar preheader done above: "// Start by finding out if middle block branches to scalar preheader ....".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be better now, thanks!

Comment on lines +8744 to +8745
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can instead check

    if (!ExitBB)
      continue;

invariantly, but prefer to possibly bail out earlier - as soon as all phi's have been fixed?

Anyhow, the comment talks about no edge to exit block, i.e., !ExitBB, rather than exhausting ExitingsValuesToFix.

for (User *U : FORPhi->users()) {
auto *UI = cast<Instruction>(U);
if (UI->getParent() != ExitBB)
continue;
VPValue *Ext = MiddleBuilder.createNaryOp(
VPInstruction::ExtractFromEnd, {FOR->getBackedgeValue(), TwoVPV}, {},
"vector.recur.extract.for.phi");
Plan.addLiveOut(cast<PHINode>(UI), Ext);
ExitingValuesToFix.erase(cast<PHINode>(UI));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that it's fine for a FOR to have no users in ExitBB. Can it have duplicates/more than one? They can reuse the same extract (and lcssa phi).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple duplicated exit phis, but those should be handled correctly (and cleaned up earlier). Potentially can also handle this de-duplication here as follow up

}
}
}

Expand Down Expand Up @@ -8769,16 +8907,17 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
// After here, VPBB should not be used.
VPBB = nullptr;

addUsersInExitBlock(OrigLoop, RecipeBuilder, *Plan,
Legal->getInductionVars());

assert(isa<VPRegionBlock>(Plan->getVectorLoopRegion()) &&
!Plan->getVectorLoopRegion()->getEntryBasicBlock()->empty() &&
"entry block must be set to a VPRegionBlock having a non-empty entry "
"VPBasicBlock");
RecipeBuilder.fixHeaderPhis();

addLiveOutsForFirstOrderRecurrences(*Plan);
MapVector<PHINode *, VPValue *> ExitingValuesToFix = collectUsersInExitBlock(
OrigLoop, RecipeBuilder, *Plan, Legal->getInductionVars());

addLiveOutsForFirstOrderRecurrences(*Plan, ExitingValuesToFix);
addUsersInExitBlock(*Plan, ExitingValuesToFix);

// ---------------------------------------------------------------------------
// Transform initial VPlan: Apply previously taken decisions, in order, to
Expand Down Expand Up @@ -8931,6 +9070,7 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
// iteration. The final value is selected by the final ComputeReductionResult.
void LoopVectorizationPlanner::adjustRecipesForReductions(
VPlanPtr &Plan, VPRecipeBuilder &RecipeBuilder, ElementCount MinVF) {
using namespace VPlanPatternMatch;
VPRegionBlock *VectorLoopRegion = Plan->getVectorLoopRegion();
VPBasicBlock *Header = VectorLoopRegion->getEntryBasicBlock();
// Gather all VPReductionPHIRecipe and sort them so that Intermediate stores
Expand Down Expand Up @@ -8988,10 +9128,11 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
for (unsigned I = 0; I != Worklist.size(); ++I) {
VPSingleDefRecipe *Cur = Worklist[I];
for (VPUser *U : Cur->users()) {
auto *UserRecipe = dyn_cast<VPSingleDefRecipe>(U);
if (!UserRecipe) {
assert(isa<VPLiveOut>(U) &&
"U must either be a VPSingleDef or VPLiveOut");
auto *UserRecipe = cast<VPSingleDefRecipe>(U);
if (!UserRecipe->getParent()->getEnclosingLoopRegion()) {
assert(match(U, m_Binary<VPInstruction::ExtractFromEnd>(
m_VPValue(), m_VPValue())) &&
"U must be an ExtractFromEnd VPInstruction");
continue;
}
Worklist.insert(UserRecipe);
Expand Down Expand Up @@ -9208,9 +9349,11 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
auto *FinalReductionResult = new VPInstruction(
VPInstruction::ComputeReductionResult, {PhiR, NewExitingVPV}, ExitDL);
FinalReductionResult->insertBefore(*MiddleVPBB, IP);
OrigExitingVPV->replaceUsesWithIf(
FinalReductionResult,
[](VPUser &User, unsigned) { return isa<VPLiveOut>(&User); });
OrigExitingVPV->replaceUsesWithIf(FinalReductionResult, [](VPUser &User,
unsigned) {
return match(&User, m_Binary<VPInstruction::ExtractFromEnd>(m_VPValue(),
m_VPValue()));
});
}

VPlanTransforms::clearReductionWrapFlags(*Plan);
Expand Down
5 changes: 5 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlanPatternMatch.h
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,11 @@ template <typename Val, typename Pattern> bool match(Val *V, const Pattern &P) {
return const_cast<Pattern &>(P).match(V);
}

template <typename Pattern> bool match(VPUser *U, const Pattern &P) {
auto *R = dyn_cast<VPRecipeBase>(U);
return R && match(R, P);
}

template <typename Class> struct class_match {
template <typename ITy> bool match(ITy *V) { return isa<Class>(V); }
};
Expand Down
8 changes: 1 addition & 7 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -194,9 +194,6 @@ bool VPRecipeBase::mayHaveSideEffects() const {

void VPLiveOut::fixPhi(VPlan &Plan, VPTransformState &State) {
VPValue *ExitValue = getOperand(0);
auto Lane = vputils::isUniformAfterVectorization(ExitValue)
? VPLane::getFirstLane()
: VPLane::getLastLaneForVF(State.VF);
VPBasicBlock *MiddleVPBB =
cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSingleSuccessor());
VPRecipeBase *ExitingRecipe = ExitValue->getDefiningRecipe();
Expand All @@ -207,10 +204,7 @@ void VPLiveOut::fixPhi(VPlan &Plan, VPTransformState &State) {
? MiddleVPBB
: ExitingVPBB;
BasicBlock *PredBB = State.CFG.VPBB2IRBB[PredVPBB];
// Set insertion point in PredBB in case an extract needs to be generated.
// TODO: Model extracts explicitly.
State.Builder.SetInsertPoint(PredBB, PredBB->getFirstNonPHIIt());
Value *V = State.get(ExitValue, VPIteration(State.UF - 1, Lane));
Value *V = State.get(ExitValue, VPIteration(0, 0));
if (Phi->getBasicBlockIndex(PredBB) != -1)
Phi->setIncomingValueForBlock(PredBB, V);
else
Expand Down
Loading