
[VPlan] Implement VPlan-based cost model for VPReduction, VPExtendedReduction and VPMulAccumulateReduction. (NFC) #113903

Open · ElvisWang123 wants to merge 58 commits into main from vp-arm-mve-transform

Changes from all commits (58 commits):
33b1f60
[VPlan] Impl VPlan-based pattern match for ExtendedRed and MulAccRed.…
ElvisWang123 Oct 28, 2024
68fbd70
Partially support Extended-reduction.
ElvisWang123 Nov 4, 2024
c8c9d56
Support MulAccRecipe
ElvisWang123 Nov 5, 2024
d29a118
Fix servel errors and update tests.
ElvisWang123 Nov 6, 2024
e5b50f7
Refactors
ElvisWang123 Nov 6, 2024
cc004ff
Fix typos and update printing test
ElvisWang123 Nov 7, 2024
b5445ca
Fold reduce.add(zext(mul(sext(A), sext(B)))) into MulAccRecipe when A…
ElvisWang123 Nov 11, 2024
1df91d4
Refactor! Reuse functions from VPReductionRecipe.
ElvisWang123 Nov 11, 2024
a0b2f30
Refactor! Add comments and refine new recipes.
ElvisWang123 Nov 12, 2024
46928bd
Remove underying instruction dependency.
ElvisWang123 Nov 14, 2024
35abf19
Revert "Remove underying instruction dependency."
ElvisWang123 Nov 14, 2024
453997e
Remove extended instruction after mul in MulAccRecipe.
ElvisWang123 Nov 15, 2024
fa4f476
Refactor.
ElvisWang123 Nov 15, 2024
86ad2d8
Clamp the range when the ExtendedReduction or MulAcc cost is invalid.
ElvisWang123 Nov 15, 2024
594f9e4
Try to not depend on underlying ext/mul instructions and preserve fla…
ElvisWang123 Nov 18, 2024
52369d0
Update testcase and fix reduction cost.
ElvisWang123 Nov 25, 2024
abc08f3
!fixup. Rebase to upstream `prepareToExecute()` implementation.
ElvisWang123 Dec 5, 2024
729a70e
Move VPReductionRecipe inherite from VPRecipeWithIRFlags.
ElvisWang123 Dec 11, 2024
ea58282
Only create VPMulAcc/VPExtendedReduction recipe when beneficial. NFC
ElvisWang123 Dec 11, 2024
1c22ce2
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Dec 11, 2024
a987456
!fixup use `auto`
ElvisWang123 Dec 12, 2024
6c434c7
!fixup VPReductionRecipe unit tests.
ElvisWang123 Dec 12, 2024
f4b1b78
!fixup migrate tryTo* to VPlanTransforms
ElvisWang123 Dec 23, 2024
bffcac5
Implement clone() and add some docs.
ElvisWang123 Dec 23, 2024
da705f1
Update comments.
ElvisWang123 Dec 23, 2024
1dc279e
fix-ReductionEVLRecipe query underlyingInstr().
ElvisWang123 Dec 23, 2024
20ea82e
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Dec 23, 2024
90f9ffa
Update after merge.
ElvisWang123 Dec 23, 2024
99512fe
Address comments and split off abstract recipes creation from adjustR…
ElvisWang123 Dec 26, 2024
2e4014a
!fixup using foldTailWithEVL.
ElvisWang123 Dec 27, 2024
38dd924
!fixup, remove extra debugLoc and move check of EVL out of transforms.
ElvisWang123 Jan 21, 2025
602a5e4
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Jan 21, 2025
1939d44
Update after merge main.
ElvisWang123 Jan 22, 2025
2ee6e76
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Feb 17, 2025
d584fc1
Update after merge. Using runPass::().
ElvisWang123 Feb 18, 2025
21b33e6
!fixup, Remove unused check and functions.
ElvisWang123 Feb 26, 2025
ae371e5
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Mar 4, 2025
0d7b7f3
!fixup; Address comments.
ElvisWang123 Mar 4, 2025
e12bd04
!fixup, Add Mul cost to prevent FMuladd Reduction cost misaligned.
ElvisWang123 Mar 7, 2025
4906637
!Fixup, typo.
ElvisWang123 Mar 10, 2025
ca5db10
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Mar 18, 2025
2fbdc7c
!fixup, Address comments and fix VPReductionRecipe::computeCost
ElvisWang123 Mar 19, 2025
38d83bf
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Mar 19, 2025
3e2acad
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Mar 19, 2025
d2a5a43
!fixup, Update after merge, using std::array.
ElvisWang123 Mar 19, 2025
484f9cc
fixup, formatting.
ElvisWang123 Mar 19, 2025
cd86af4
!fixup, address comments.
ElvisWang123 Mar 20, 2025
84f8a46
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Mar 25, 2025
36e1032
!fixup, formatting and address comments.
ElvisWang123 Mar 25, 2025
2483a29
!fixup, Update inferScalarType and not clear the VF of plan.
ElvisWang123 Apr 7, 2025
56dcd90
!fixup, address comments.
ElvisWang123 Apr 10, 2025
26d938a
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 Apr 18, 2025
b32538f
!fixup, address comments.
ElvisWang123 Apr 18, 2025
fd539f8
!fixup, address comments and using `transferFlags()` to copy nneg.
ElvisWang123 Apr 23, 2025
71c7401
!fixup, address comments.
ElvisWang123 Apr 23, 2025
7da7983
!fixup, Add new recipes to mayReadWriteMemory.
ElvisWang123 Apr 24, 2025
f4afc2c
Merge branch 'main' into vp-arm-mve-transform
ElvisWang123 May 16, 2025
7b25767
Fixup after merge.
ElvisWang123 May 16, 2025
56 changes: 0 additions & 56 deletions llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7391,62 +7391,6 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
}
}

// The legacy cost model has special logic to compute the cost of in-loop
// reductions, which may be smaller than the sum of all instructions involved
// in the reduction.
// TODO: Switch to costing based on VPlan once the logic has been ported.
for (const auto &[RedPhi, RdxDesc] : Legal->getReductionVars()) {
if (ForceTargetInstructionCost.getNumOccurrences())
continue;

if (!CM.isInLoopReduction(RedPhi))
continue;

const auto &ChainOps = RdxDesc.getReductionOpChain(RedPhi, OrigLoop);
SetVector<Instruction *> ChainOpsAndOperands(llvm::from_range, ChainOps);
auto IsZExtOrSExt = [](const unsigned Opcode) -> bool {
return Opcode == Instruction::ZExt || Opcode == Instruction::SExt;
};
// Also include the operands of instructions in the chain, as the cost-model
// may mark extends as free.
//
// For ARM, some of the instruction can folded into the reducion
// instruction. So we need to mark all folded instructions free.
// For example: We can fold reduce(mul(ext(A), ext(B))) into one
// instruction.
for (auto *ChainOp : ChainOps) {
for (Value *Op : ChainOp->operands()) {
if (auto *I = dyn_cast<Instruction>(Op)) {
ChainOpsAndOperands.insert(I);
if (I->getOpcode() == Instruction::Mul) {
auto *Ext0 = dyn_cast<Instruction>(I->getOperand(0));
auto *Ext1 = dyn_cast<Instruction>(I->getOperand(1));
if (Ext0 && IsZExtOrSExt(Ext0->getOpcode()) && Ext1 &&
Ext0->getOpcode() == Ext1->getOpcode()) {
ChainOpsAndOperands.insert(Ext0);
ChainOpsAndOperands.insert(Ext1);
}
}
}
}
}

// Pre-compute the cost for I, if it has a reduction pattern cost.
for (Instruction *I : ChainOpsAndOperands) {
auto ReductionCost =
CM.getReductionPatternCost(I, VF, toVectorTy(I->getType(), VF));
if (!ReductionCost)
continue;

assert(!CostCtx.SkipCostComputation.contains(I) &&
"reduction op visited multiple times");
CostCtx.SkipCostComputation.insert(I);
LLVM_DEBUG(dbgs() << "Cost of " << ReductionCost << " for VF " << VF
<< ":\n in-loop reduction " << *I << "\n");
Cost += *ReductionCost;
}
}

// Pre-compute the costs for branches except for the backedge, as the number
// of replicate regions in a VPlan may not directly match the number of
// branches, which would lead to different decisions.
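For context, the block deleted above is the legacy cost model's special case for in-loop reductions: it walks the reduction op chain and, when it finds a multiply fed by a pair of matching sign/zero extends, marks those instructions as free because targets such as ARM MVE fold reduce(mul(ext(A), ext(B))) into a single instruction. A minimal stand-alone sketch of that chain walk, using simplified placeholder types (Opcode, Inst, and collectFoldableOps are illustrative, not the actual LLVM classes):

#include <set>
#include <vector>

// Simplified stand-ins for LLVM instructions; illustrative only.
enum class Opcode { Add, Mul, SExt, ZExt, Other };

struct Inst {
  Opcode Op;
  std::vector<Inst *> Operands;
};

static bool isZExtOrSExt(Opcode Op) {
  return Op == Opcode::ZExt || Op == Opcode::SExt;
}

// Mirrors the deleted loop: collect the chain ops and their operands, and for
// a multiply also pull in a pair of matching extends, so the caller can treat
// the whole reduce(mul(ext, ext)) pattern as one folded (free) group.
std::set<Inst *> collectFoldableOps(const std::vector<Inst *> &ChainOps) {
  std::set<Inst *> OpsAndOperands(ChainOps.begin(), ChainOps.end());
  for (Inst *ChainOp : ChainOps) {
    for (Inst *Op : ChainOp->Operands) {
      OpsAndOperands.insert(Op);
      if (Op->Op != Opcode::Mul || Op->Operands.size() != 2)
        continue;
      Inst *Ext0 = Op->Operands[0];
      Inst *Ext1 = Op->Operands[1];
      if (isZExtOrSExt(Ext0->Op) && Ext0->Op == Ext1->Op) {
        OpsAndOperands.insert(Ext0);
        OpsAndOperands.insert(Ext1);
      }
    }
  }
  return OpsAndOperands;
}

With this patch the folding is represented explicitly as VPExtendedReductionRecipe and VPMulAccumulateReductionRecipe, so the VPlan-based cost model can cost the bundled pattern directly instead of skipping individual IR instructions.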
8 changes: 8 additions & 0 deletions llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2669,6 +2669,10 @@ class VPExtendedReductionRecipe : public VPReductionRecipe {
"VPExtendedRecipe + VPReductionRecipe before execution.");
};

/// Return the cost of VPExtendedReductionRecipe.
InstructionCost computeCost(ElementCount VF,
VPCostContext &Ctx) const override;

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
@@ -2768,6 +2772,10 @@ class VPMulAccumulateReductionRecipe : public VPReductionRecipe {
"VPWidenRecipe + VPReductionRecipe before execution");
}

/// Return the cost of VPMulAccumulateReductionRecipe.
InstructionCost computeCost(ElementCount VF,
VPCostContext &Ctx) const override;

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
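The two computeCost declarations added above are what plug the new bundled recipes into VPlan's per-recipe costing. A rough sketch of that dispatch, using hypothetical simplified types (Recipe, CostCtx, and planCost are placeholders, not the real VPlan API), assuming the planner simply sums each recipe's self-reported cost for a candidate VF:

#include <memory>
#include <vector>

using Cost = long;  // stand-in for InstructionCost
struct CostCtx {};  // stand-in for VPCostContext (TTI, type inference, ...)

struct Recipe {
  virtual ~Recipe() = default;
  // Every recipe knows how to cost itself for a given vectorization factor.
  virtual Cost computeCost(unsigned VF, const CostCtx &Ctx) const = 0;
};

struct MulAccumulateReductionRecipe final : Recipe {
  Cost computeCost(unsigned, const CostCtx &) const override {
    // The real override asks TTI::getMulAccReductionCost for the whole
    // ext*ext -> reduce.add bundle; a placeholder value stands in here.
    return 1;
  }
};

Cost planCost(const std::vector<std::unique_ptr<Recipe>> &Recipes, unsigned VF,
              const CostCtx &Ctx) {
  Cost Total = 0;
  for (const auto &R : Recipes)
    Total += R->computeCost(VF, Ctx);  // virtual dispatch per recipe
  return Total;
}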
57 changes: 43 additions & 14 deletions llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -782,19 +782,25 @@ Value *VPInstruction::generate(VPTransformState &State) {
InstructionCost VPInstruction::computeCost(ElementCount VF,
VPCostContext &Ctx) const {
if (Instruction::isBinaryOp(getOpcode())) {

Type *ResTy = Ctx.Types.inferScalarType(this);
if (!vputils::onlyFirstLaneUsed(this))
ResTy = toVectorTy(ResTy, VF);

if (!getUnderlyingValue()) {
// TODO: Compute cost for VPInstructions without underlying values once
// the legacy cost model has been retired.
return 0;
switch (getOpcode()) {
case Instruction::FMul:
return Ctx.TTI.getArithmeticInstrCost(getOpcode(), ResTy, Ctx.CostKind);
default:
// TODO: Compute cost for VPInstructions without underlying values once
// the legacy cost model has been retired.
return 0;
}
}

assert(!doesGeneratePerAllLanes() &&
"Should only generate a vector value or single scalar, not scalars "
"for all lanes.");
Type *ResTy = Ctx.Types.inferScalarType(this);
if (!vputils::onlyFirstLaneUsed(this))
ResTy = toVectorTy(ResTy, VF);

return Ctx.TTI.getArithmeticInstrCost(getOpcode(), ResTy, Ctx.CostKind);
}

@@ -2527,24 +2533,47 @@ InstructionCost VPReductionRecipe::computeCost(ElementCount VF,
auto *VectorTy = cast<VectorType>(toVectorTy(ElementTy, VF));
unsigned Opcode = RecurrenceDescriptor::getOpcode(RdxKind);
FastMathFlags FMFs = getFastMathFlags();
std::optional<FastMathFlags> OptionalFMF =
ElementTy->isFloatingPointTy() ? std::make_optional(FMFs) : std::nullopt;

// TODO: Support any-of reductions.
assert(
(!RecurrenceDescriptor::isAnyOfRecurrenceKind(RdxKind) ||
ForceTargetInstructionCost.getNumOccurrences() > 0) &&
"Any-of reduction not implemented in VPlan-based cost model currently.");

// Cost = Reduction cost + BinOp cost
InstructionCost Cost =
Ctx.TTI.getArithmeticInstrCost(Opcode, ElementTy, Ctx.CostKind);
// Note that TTI should model the cost of moving result to the scalar register
// and the BinOp cost in the getReductionCost().
if (RecurrenceDescriptor::isMinMaxRecurrenceKind(RdxKind)) {
Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RdxKind);
return Cost +
Ctx.TTI.getMinMaxReductionCost(Id, VectorTy, FMFs, Ctx.CostKind);
Comment on lines -2542 to -2543 (Contributor Author):
Remove the BinOp cost here to match the legacy cost model. TTI already calculates the cost of the BinOp (at least in RISCV).

return Ctx.TTI.getMinMaxReductionCost(Id, VectorTy, FMFs, Ctx.CostKind);
}

return Cost + Ctx.TTI.getArithmeticReductionCost(Opcode, VectorTy, FMFs,
Ctx.CostKind);
return Ctx.TTI.getArithmeticReductionCost(Opcode, VectorTy, OptionalFMF,
Ctx.CostKind);
}

InstructionCost
VPExtendedReductionRecipe::computeCost(ElementCount VF,
VPCostContext &Ctx) const {
unsigned Opcode = RecurrenceDescriptor::getOpcode(getRecurrenceKind());
Type *RedTy = Ctx.Types.inferScalarType(this);
auto *SrcVecTy =
cast<VectorType>(toVectorTy(Ctx.Types.inferScalarType(getVecOp()), VF));
assert(RedTy->isIntegerTy() &&
"ExtendedReduction only support integer type currently.");
return Ctx.TTI.getExtendedReductionCost(Opcode, isZExt(), RedTy, SrcVecTy,
std::nullopt, Ctx.CostKind);
}

InstructionCost
VPMulAccumulateReductionRecipe::computeCost(ElementCount VF,
VPCostContext &Ctx) const {
Type *RedTy = Ctx.Types.inferScalarType(this);
auto *SrcVecTy =
cast<VectorType>(toVectorTy(Ctx.Types.inferScalarType(getVecOp0()), VF));
return Ctx.TTI.getMulAccReductionCost(isZExt(), RedTy, SrcVecTy,
Ctx.CostKind);
}

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
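At the source level, the three computeCost implementations above correspond to loop shapes like the ones below (a hedged sketch in the spirit of the ARM MVE tests that follow; function names and signatures are illustrative). The plain reduction is costed via TTI::getArithmeticReductionCost (or getMinMaxReductionCost for min/max recurrences), the extended reduction via TTI::getExtendedReductionCost, and the multiply-accumulate reduction via TTI::getMulAccReductionCost, which lets targets such as ARM MVE report the cost of the single folded instruction.

#include <cstddef>
#include <cstdint>

// Plain in-loop add reduction: costed by VPReductionRecipe::computeCost.
int32_t add_i32(const int32_t *x, size_t n) {
  int32_t s = 0;
  for (size_t i = 0; i < n; ++i)
    s += x[i];
  return s;
}

// reduce.add(zext(load)): the extend is bundled into
// VPExtendedReductionRecipe and costed via getExtendedReductionCost.
uint32_t add_u8_u32(const uint8_t *x, size_t n) {
  uint32_t s = 0;
  for (size_t i = 0; i < n; ++i)
    s += x[i];
  return s;
}

// reduce.add(mul(sext(a), sext(b))): bundled into
// VPMulAccumulateReductionRecipe and costed via getMulAccReductionCost.
int32_t mla_i8_i32(const int8_t *a, const int8_t *b, size_t n) {
  int32_t s = 0;
  for (size_t i = 0; i < n; ++i)
    s += (int32_t)a[i] * (int32_t)b[i];
  return s;
}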
@@ -12,14 +12,14 @@ target triple="aarch64-unknown-linux-gnu"

; CHECK-VSCALE2-LABEL: LV: Checking a loop in 'fadd_strict32'
; CHECK-VSCALE2: Cost of 4 for VF vscale x 2:
; CHECK-VSCALE2: in-loop reduction %add = fadd float %0, %sum.07
; CHECK-VSCALE2: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)
; CHECK-VSCALE2: Cost of 8 for VF vscale x 4:
; CHECK-VSCALE2: in-loop reduction %add = fadd float %0, %sum.07
; CHECK-VSCALE2: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)
; CHECK-VSCALE1-LABEL: LV: Checking a loop in 'fadd_strict32'
; CHECK-VSCALE1: Cost of 2 for VF vscale x 2:
; CHECK-VSCALE1: in-loop reduction %add = fadd float %0, %sum.07
; CHECK-VSCALE1: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)
; CHECK-VSCALE1: Cost of 4 for VF vscale x 4:
; CHECK-VSCALE1: in-loop reduction %add = fadd float %0, %sum.07
; CHECK-VSCALE1: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)

define float @fadd_strict32(ptr noalias nocapture readonly %a, i64 %n) #0 {
entry:
@@ -42,10 +42,10 @@ for.end:

; CHECK-VSCALE2-LABEL: LV: Checking a loop in 'fadd_strict64'
; CHECK-VSCALE2: Cost of 4 for VF vscale x 2:
; CHECK-VSCALE2: in-loop reduction %add = fadd double %0, %sum.07
; CHECK-VSCALE2: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)
; CHECK-VSCALE1-LABEL: LV: Checking a loop in 'fadd_strict64'
; CHECK-VSCALE1: Cost of 2 for VF vscale x 2:
; CHECK-VSCALE1: in-loop reduction %add = fadd double %0, %sum.07
; CHECK-VSCALE1: REDUCE ir<%add> = ir<%sum.07> + reduce.fadd (ir<%0>)

define double @fadd_strict64(ptr noalias nocapture readonly %a, i64 %n) #0 {
entry:
42 changes: 21 additions & 21 deletions llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll
@@ -800,11 +800,11 @@ define i32 @mla_i32_i32(ptr nocapture readonly %x, ptr nocapture readonly %y, i3
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr [[TMP0]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, ptr [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr [[TMP1]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul nsw <4 x i32> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[Y1:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0(ptr [[TMP7]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul nsw <4 x i32> [[WIDE_MASKED_LOAD2]], [[WIDE_MASKED_LOAD1]]
; CHECK-NEXT: [[TMP3:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP2]], <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]
@@ -961,11 +961,11 @@ define signext i16 @mla_i16_i16(ptr nocapture readonly %x, ptr nocapture readonl
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr [[TMP0]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i16, ptr [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr [[TMP1]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul <8 x i16> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[Y1:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr [[TMP7]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul <8 x i16> [[WIDE_MASKED_LOAD2]], [[WIDE_MASKED_LOAD1]]
; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[TMP2]], <8 x i16> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i16 [[TMP4]], [[VEC_PHI]]
@@ -1067,11 +1067,11 @@ define zeroext i8 @mla_i8_i8(ptr nocapture readonly %x, ptr nocapture readonly %
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i8 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP0]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul <16 x i8> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[Y1:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP7]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = mul <16 x i8> [[WIDE_MASKED_LOAD2]], [[WIDE_MASKED_LOAD1]]
; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> [[TMP2]], <16 x i8> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i8 [[TMP4]], [[VEC_PHI]]
@@ -1181,11 +1181,11 @@ define i64 @red_mla_ext_s16_u16_s64(ptr noalias nocapture readonly %A, ptr noali
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, ptr [[A:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP0]], align 1
; CHECK-NEXT: [[TMP1:%.*]] = sext <4 x i16> [[WIDE_LOAD]] to <4 x i32>
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i16, ptr [[B:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i16>, ptr [[TMP2]], align 2
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, ptr [[TMP2]], align 1
; CHECK-NEXT: [[TMP1:%.*]] = sext <4 x i16> [[WIDE_LOAD]] to <4 x i32>
; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i16, ptr [[B1:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i16>, ptr [[TMP11]], align 2
; CHECK-NEXT: [[TMP3:%.*]] = zext <4 x i16> [[WIDE_LOAD1]] to <4 x i32>
; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[TMP3]], [[TMP1]]
; CHECK-NEXT: [[TMP5:%.*]] = zext <4 x i32> [[TMP4]] to <4 x i64>
@@ -1204,10 +1204,10 @@ define i64 @red_mla_ext_s16_u16_s64(ptr noalias nocapture readonly %A, ptr noali
; CHECK: for.body:
; CHECK-NEXT: [[I_011:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[S_010:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[A]], i32 [[I_011]]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[B]], i32 [[I_011]]
; CHECK-NEXT: [[TMP9:%.*]] = load i16, ptr [[ARRAYIDX]], align 1
; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP9]] to i32
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i16, ptr [[B]], i32 [[I_011]]
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i16, ptr [[B1]], i32 [[I_011]]
; CHECK-NEXT: [[TMP10:%.*]] = load i16, ptr [[ARRAYIDX1]], align 2
; CHECK-NEXT: [[CONV2:%.*]] = zext i16 [[TMP10]] to i32
; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[CONV2]], [[CONV]]
@@ -1266,12 +1266,12 @@ define i32 @red_mla_u8_s8_u32(ptr noalias nocapture readonly %A, ptr noalias noc
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[A:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0(ptr [[TMP0]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
; CHECK-NEXT: [[TMP1:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[B:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0(ptr [[TMP2]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
; CHECK-NEXT: [[TMP3:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD1]] to <4 x i32>
; CHECK-NEXT: [[TMP1:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD1]] to <4 x i32>
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[B1:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD2:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0(ptr [[TMP9]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)
; CHECK-NEXT: [[TMP3:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD2]] to <4 x i32>
; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[TMP3]], [[TMP1]]
; CHECK-NEXT: [[TMP5:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP4]], <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP5]])
@@ -1408,8 +1408,8 @@ define i32 @mla_i8_i32_multiuse(ptr nocapture readonly %x, ptr nocapture readonl
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0(ptr [[TMP0]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP1:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
; CHECK-NEXT: [[TMP2:%.*]] = mul nuw nsw <16 x i32> [[TMP1]], [[TMP1]]
; CHECK-NEXT: [[TMP7:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
; CHECK-NEXT: [[TMP2:%.*]] = mul nuw nsw <16 x i32> [[TMP7]], [[TMP7]]
; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP2]], <16 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]