
Commit a6c1689

[LV, VP] VP intrinsics support for the Loop Vectorizer
This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active-vector-length predication support take advantage of their capabilities. Architectures with general masked predication support can also only take advantage of predication on memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions.

The other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics:

- The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for the current tail-folding implementation.
- The second way is to insert instructions to compute `min(VF, trip_count - index)` for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic `get_vector_length` that can be lowered to architecture-specific instruction(s) to compute EVL.

We also add a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

===Tentative Development Roadmap===

* Use VP intrinsics for all possible vector operations. That work has two possible implementations:
  1. Introduce a new pass which transforms emitted vector instructions to VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require copying some existing functionality from the loop vectorizer into a separate pass, keeping similar code in different passes, and performing the same analysis at least twice.
  2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that it does not require a separate pass, which may reduce compile time, and we can avoid code duplication. It requires some extra work in the Loop Vectorizer to add VectorBuilder support and smart emission of vector instructions/VP intrinsics. Fully supporting the Loop Vectorizer will also require adding a new PHI recipe to handle the EVL of the previous iteration, plus extending several existing recipes with new operands (depending on the design).
* Switch to VP intrinsics for memory operations for VLS and VLA vectorization.

Differential Revision: https://reviews.llvm.org/D99750
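To make the second and third EVL strategies concrete, here is a minimal LLVM IR sketch of a tail-folded loop body driven by an EVL. The llvm.vp.load/llvm.vp.store and llvm.experimental.get.vector.length intrinsics are the documented ones; everything else (value names such as %evl.iv, the i32 element type, and the preheader-defined %all.true mask and %splat.one addend) is illustrative, not output copied from this patch:

; Sketch: vectorized body for "a[i] = b[i] + 1", tail folded via EVL.
vector.body:
  %evl.iv = phi i64 [ 0, %vector.ph ], [ %evl.iv.next, %vector.body ]
  ; Remaining elements; the intrinsic returns EVL <= min(vscale * 4, %avl).
  %avl = sub i64 %trip.count, %evl.iv
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %avl, i32 4, i1 true)
  %gep.b = getelementptr inbounds i32, ptr %b, i64 %evl.iv
  ; Lanes at or past %evl are neither read nor written.
  %vb = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %gep.b, <vscale x 4 x i1> %all.true, i32 %evl)
  %va = add <vscale x 4 x i32> %vb, %splat.one
  %gep.a = getelementptr inbounds i32, ptr %a, i64 %evl.iv
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %va, ptr %gep.a, <vscale x 4 x i1> %all.true, i32 %evl)
  ; The IV advances by the EVL actually processed, not by the full VF.
  %evl.ext = zext i32 %evl to i64
  %evl.iv.next = add i64 %evl.iv, %evl.ext
  %done = icmp uge i64 %evl.iv.next, %trip.count
  br i1 %done, label %exit, label %vector.body

The EVL-based IV phi and its increment in this sketch are what the new VPEVLBasedIVPHIRecipe and the ExplicitVectorLength/ExplicitVectorLengthIVIncrement VPInstruction opcodes added below model at the VPlan level.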

24 files changed: +1581 −32 lines

llvm/include/llvm/Analysis/TargetTransformInfo.h

Lines changed: 4 additions & 1 deletion
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };

 struct TailFoldingInfo {

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 4 additions & 0 deletions
@@ -226,6 +226,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }

+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Lines changed: 16 additions & 0 deletions
@@ -75,6 +75,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                             const APInt &Imm, Type *Ty,
                             TTI::TargetCostKind CostKind);

+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsic efficiently
+  /// in hardware, for the given opcode and type/alignment. (see LLVM Language
+  /// Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated version
+  /// support.
+  /// \param DataType the type of the instruction with the \p Opcode checked for
+  /// prediction support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

   bool shouldExpandReduction(const IntrinsicInst *II) const;

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 144 additions & 9 deletions
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
       clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                  "Create lane mask using active.lane.mask intrinsic, and use "
                  "it for both data and control flow"),
-      clEnumValN(
-          TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-          "data-and-control-without-rt-check",
-          "Similar to data-and-control, but remove the runtime check")));
+      clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                 "data-and-control-without-rt-check",
+                 "Similar to data-and-control, but remove the runtime check"),
+      clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                 "Use predicated EVL instructions for tail folding if the "
+                 "target supports vector length predication")));

 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1098,9 +1101,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
         // handled.
         if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
             isa<VPInterleaveRecipe>(CurRec) ||
-            isa<VPScalarIVStepsRecipe>(CurRec) ||
-            isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-            isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+            isa<VPScalarIVStepsRecipe>(CurRec) || isa<VPHeaderPHIRecipe>(CurRec))
           continue;

         // This recipe contributes to the address computation of a widen
@@ -1640,6 +1641,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }

+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1785,6 +1803,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;

+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4690,6 +4712,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }

@@ -5299,6 +5354,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;

+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8553,6 +8612,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
       VPlanTransforms::truncateToMinimalBitwidths(
           *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
       VPlanTransforms::optimize(*Plan, *PSE.getSE());
+      if (CM.useVPIWithVPEVLVectorization())
+        VPlanTransforms::addExplicitVectorLength(*Plan);
       assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
       VPlans.push_back(std::move(Plan));
     }
@@ -9414,6 +9475,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
   State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }

+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;
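For reference, the calls these two helpers build look roughly like the following (the llvm.vp.* signatures are the documented ones; the nxv2i32 type, alignment value, and value names are invented for illustration). Note how addParamAttr above attaches the alignment as a parameter attribute on the pointer operand:

; Unit-stride forms, built through VectorBuilder:
%vp.op.load = call <vscale x 2 x i32> @llvm.vp.load.nxv2i32.p0(ptr align 4 %addr, <vscale x 2 x i1> %mask, i32 %evl)
call void @llvm.vp.store.nxv2i32.p0(<vscale x 2 x i32> %val, ptr align 4 %addr, <vscale x 2 x i1> %mask, i32 %evl)

; Gather/scatter forms, built with CreateIntrinsic:
%wide.masked.gather = call <vscale x 2 x i32> @llvm.vp.gather.nxv2i32.nxv2p0(<vscale x 2 x ptr> align 4 %ptrs, <vscale x 2 x i1> %mask, i32 %evl)
call void @llvm.vp.scatter.nxv2i32.nxv2p0(<vscale x 2 x i32> %val, <vscale x 2 x ptr> align 4 %ptrs, <vscale x 2 x i1> %mask, i32 %evl)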

@@ -9445,14 +9552,31 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     }
   }

+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());

     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder, State.get(getAddr(), Part), StoredVal, CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9482,7 +9606,18 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     State.setDebugLocFrom(LI->getDebugLoc());
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Value *NewLI;
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse loading after vp_reverse is added.
+        NewLI = lowerLoadUsingVectorIntrinsics(
+            Builder, DataTy, State.get(getAddr(), Part), CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,

llvm/lib/Transforms/Vectorize/VPlan.h

Lines changed: 43 additions & 0 deletions
@@ -244,6 +244,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;

+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1135,6 +1141,8 @@ class VPInstruction : public VPRecipeWithIRFlags {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1244,6 +1252,8 @@ class VPInstruction : public VPRecipeWithIRFlags {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2288,6 +2298,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };

+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
 public:

llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp

Lines changed: 8 additions & 8 deletions
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWienIntOrFpInduction
+            // which needs special handling due it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
