Skip to content

[LV] Fix MVE regression from #132190 #141736

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 16, 2025

Conversation

SamTebbs33
Copy link
Collaborator

Register pressure was only considered if the vector bandwidth was being maximised (chosen either by the target or user options), but #132190 inadvertently caused high pressure VFs to be pruned even when max bandwidth wasn't enabled. This PR returns to the previous behaviour.

@llvmbot
Copy link
Member

llvmbot commented May 28, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Sam Tebbs (SamTebbs33)

Changes

Register pressure was only considered if the vector bandwidth was being maximised (chosen either by the target or user options), but #132190 inadvertently caused high pressure VFs to be pruned even when max bandwidth wasn't enabled. This PR returns to the previous behaviour.


Full diff: https://github.com/llvm/llvm-project/pull/141736.diff

1 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+21-6)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 2fe59a464457f..ad3cbc6cd1e42 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -959,6 +959,10 @@ class LoopVectorizationCostModel {
     return expectedCost(UserVF).isValid();
   }
 
+  /// \return True if maximizing vector bandwidth is enabled by the target or
+  /// user options.
+  bool useMaxBandwidth(TargetTransformInfo::RegisterKind RegKind);
+
   /// \return The size (in bits) of the smallest and widest types in the code
   /// that needs to be vectorized. We ignore values that remain scalar such as
   /// 64 bit loop indices.
@@ -3944,6 +3948,14 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   return FixedScalableVFPair::getNone();
 }
 
+bool LoopVectorizationCostModel::useMaxBandwidth(
+    TargetTransformInfo::RegisterKind RegKind) {
+  return MaximizeBandwidth || (MaximizeBandwidth.getNumOccurrences() == 0 &&
+                               (TTI.shouldMaximizeVectorBandwidth(RegKind) ||
+                                (UseWiderVFIfCallVariantsPresent &&
+                                 Legal->hasVectorCallVariants())));
+}
+
 ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
     unsigned MaxTripCount, unsigned SmallestType, unsigned WidestType,
     ElementCount MaxSafeVF, bool FoldTailByMasking) {
@@ -4009,10 +4021,7 @@ ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
       ComputeScalableMaxVF ? TargetTransformInfo::RGK_ScalableVector
                            : TargetTransformInfo::RGK_FixedWidthVector;
   ElementCount MaxVF = MaxVectorElementCount;
-  if (MaximizeBandwidth ||
-      (MaximizeBandwidth.getNumOccurrences() == 0 &&
-       (TTI.shouldMaximizeVectorBandwidth(RegKind) ||
-        (UseWiderVFIfCallVariantsPresent && Legal->hasVectorCallVariants())))) {
+  if (useMaxBandwidth(RegKind)) {
     auto MaxVectorElementCountMaxBW = ElementCount::get(
         llvm::bit_floor(WidestRegister.getKnownMinValue() / SmallestType),
         ComputeScalableMaxVF);
@@ -4384,7 +4393,10 @@ VectorizationFactor LoopVectorizationPlanner::selectVectorizationFactor() {
 
       /// Don't consider the VF if it exceeds the number of registers for the
       /// target.
-      if (RU.exceedsMaxNumRegs(TTI))
+      if (CM.useMaxBandwidth(VF.isScalable()
+                                 ? TargetTransformInfo::RGK_ScalableVector
+                                 : TargetTransformInfo::RGK_FixedWidthVector) &&
+          RU.exceedsMaxNumRegs(TTI))
         continue;
 
       InstructionCost C = CM.expectedCost(VF);
@@ -7458,7 +7470,10 @@ VectorizationFactor LoopVectorizationPlanner::computeBestVF() {
       InstructionCost Cost = cost(*P, VF);
       VectorizationFactor CurrentFactor(VF, Cost, ScalarCost);
 
-      if (RU.exceedsMaxNumRegs(TTI)) {
+      if (CM.useMaxBandwidth(VF.isScalable()
+                                 ? TargetTransformInfo::RGK_ScalableVector
+                                 : TargetTransformInfo::RGK_FixedWidthVector) &&
+          RU.exceedsMaxNumRegs(TTI)) {
         LLVM_DEBUG(dbgs() << "LV(REG): Not considering vector loop of width "
                           << VF << " because it uses too many registers\n");
         continue;

Comment on lines +3954 to +3940
(TTI.shouldMaximizeVectorBandwidth(RegKind) ||
(UseWiderVFIfCallVariantsPresent &&
Legal->hasVectorCallVariants())));
Copy link
Contributor

@lukel97 lukel97 May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@preames @wangpc-pp @topperc Making a note here that after this patch we'll likely regress cases like this on RVV unless we opt into shouldMaximizeVectorBandwidth https://godbolt.org/z/WcbWGooa4

Comment on lines 7473 to 7476
if (CM.useMaxBandwidth(VF.isScalable()
? TargetTransformInfo::RGK_ScalableVector
: TargetTransformInfo::RGK_FixedWidthVector) &&
RU.exceedsMaxNumRegs(TTI)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain this logic briefly? Maybe I'm misunderstanding the goal here, but if we exceed the number of registers but don't have maxBandwidth enabled, then we'd try to vectorize anyway? Does the call to CM.useMaxBandwidth want to be negated?

In other words; Why are we only considering whether we exceed the number of registers when maxBandwidth is enabled?

Copy link
Collaborator Author

@SamTebbs33 SamTebbs33 May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That understanding is correct. This was the behaviour before my previous PR accidentally changed it. I'll be looking into another sensible solution for this specific reproducer but this fixes the regression.

@NickGuy-Arm
Copy link
Contributor

Could we get some tests for this case? That this regression wasn't caught indicates that we're missing coverage in this area.

if (CM.useMaxBandwidth(VF.isScalable()
? TargetTransformInfo::RGK_ScalableVector
: TargetTransformInfo::RGK_FixedWidthVector) &&
RU.exceedsMaxNumRegs(TTI))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the only use of RU, would be good to skip the register pressure computation when it's not needed

Same below

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done! This has also gotten the riscv tests back to what they were before merging #132190.

@SamTebbs33
Copy link
Collaborator Author

Could we get some tests for this case? That this regression wasn't caught indicates that we're missing coverage in this area.

Good idea, done.

Copy link
Collaborator

@sdesmalen-arm sdesmalen-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#132190 changed the behaviour such that VFs are pruned based on reg-usage regardless of whether maximize-vector-bandwidth is enabled. This means it now applies to more cases and exposes pre-existing bugs in the reg-usage computation. While I understand the rationale behind rolling back to previous behaviour, my concern is that this PR hides these bugs again rather than fixing them.

That being said, it could be that the registers pressure estimation is too coarse to be useful for the general case (i.e. non-maximized bandwidth VFs). If that is the consensus, then I think this PR should limit pruning to only those VFs that are added by maximize vector bandwidth.

For the test added in this PR, what I can see is that the reason the regusage is too high is because for a VF of 4 it tries to vectorise the stores (rather than scalarise them as happens for VF=2). This inserts a VPWidenPointerInductionRecipe for the pointer at the start of the block which is then used by the stores at the end of the block, increasing register pressure for vector registers. In practice, the code-generator will try to instantiate the widened (vector) induction pointer closer to its use, rather than at the start of the block. I think we can consider these values as 'scalar' (similar to what we do for e.g. VPReplicateRecipe) because they'll remain scalar until they have to be widened at the point of their use.

@SamTebbs33 and @fhah what do you think?

@SamTebbs33
Copy link
Collaborator Author

Thanks Sander, I've pushed up a commit that tries your suggested VPWidenPointerInductionRecipe approach and it does still vectorise, without the worry of causing RISC-V any regressions @lukel97

@davemgreen
Copy link
Collaborator

I haven't looked deeply into the details but there were a number of pretty large MVE regressions (and some improvements), only one of which was reported on #132190. MVE doesn't have a lot of registers but can deal with spills/reloads quite efficiently much of the time. Ideally it wouldn't be a strict yes/no but the cost of spilling would be included in the estimated cost of the vector factor.

There is another example in https://godbolt.org/z/Koeo86sYf, but it doesn't require any spilling. You can't really calculate register pressure (linearly) without scheduling the instructions in some way.

MVE doesn't enable maximise bandwidth. Would it be better to just have a target hook for whether regpressure should exclude a VF?

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Sander, I've pushed up a commit that tries your suggested VPWidenPointerInductionRecipe approach and it does still vectorise, without the worry of causing RISC-V any regressions @lukel97

I think unconditionally treating it as scalar will underestimate the vector register usage in some cases, and may introduce another set of regressions. It may be good to just limit this for maximizing vector bandwidth to start with, add test cases (would be great if @davemgreen could share any that are missing) and and fix/re-enable from a stable point. Overestimating the vector usage for VPWidenPointerInduction is problematic for the specific case, but there are other cases where we currently may overestimate (e.g. interleave groups).

@SamTebbs33
Copy link
Collaborator Author

SamTebbs33 commented Jun 4, 2025

I haven't looked deeply into the details but there were a number of pretty large MVE regressions (and some improvements), only one of which was reported on #132190. MVE doesn't have a lot of registers but can deal with spills/reloads quite efficiently much of the time. Ideally it wouldn't be a strict yes/no but the cost of spilling would be included in the estimated cost of the vector factor.

There is another example in https://godbolt.org/z/Koeo86sYf, but it doesn't require any spilling. You can't really calculate register pressure (linearly) without scheduling the instructions in some way.

MVE doesn't enable maximise bandwidth. Would it be better to just have a target hook for whether regpressure should exclude a VF?

Thanks Dave, I've had a look at the compiler explorer example you shared and it looks like there's no difference between current main, the "only check reg usage when max bandwidth is enabled" approach and the "treat VPWidenPointerInduction as a scalar" approach. So at least that one isn't regressing.

I'm in favour of Florian's suggestion to revert back to a stable point where register usage is only checked when maxbw is enabled and improve individual cases gradually, before perhaps enabling register usage checking globally once the regressing cases are sorted out.

@SamTebbs33 SamTebbs33 force-pushed the fix-reg-pressure-regression branch from 760eb57 to 949b6bf Compare June 5, 2025 10:45
Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in the long run it makes sense to treat this separately to MaximizeVectorBandwidth, perhaps through another target control. And in the longer-run adding them as part of the cost-model. In the mean time this fixes a number of performance regressions reported, some of which were quite large for MVE. I'd recommend getting this in to be back to where we were, performance-wise.

@@ -28,28 +28,24 @@ define void @add(ptr noalias nocapture readonly %src1, ptr noalias nocapture rea
; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: RISCV::FPRRC, 2 registers
; CHECK-SCALAR-NEXT: LV(REG): Found invariant usage: 1 item
; CHECK-SCALAR-NEXT: LV(REG): RegisterClass: RISCV::GPRRC, 1 registers
; CHECK-LMUL1: LV(REG): VF = 2
; CHECK-LMUL1-NEXT: LV(REG): Found max usage: 2 item
; CHECK-LMUL1: LV(REG): Found max usage: 2 item
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you adjust this to be more aligned?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Missed the earlier update. Left a few small suggestions to clean up the new test case a bit more.

Comment on lines 84 to 95
entry:
%cmp46.not = icmp eq i32 %n, 0
br i1 %cmp46.not, label %for.cond.cleanup, label %for.body.preheader

for.body.preheader: ; preds = %entry
br label %for.body

for.cond.cleanup.loopexit: ; preds = %for.body
br label %for.cond.cleanup

for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
ret void
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to clean this up before landing, the checks and extra blocks shouldn't be needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the blocks 👍

Comment on lines 98 to 99
%in.addr.049 = phi ptr [ %incdec.ptr2, %for.body ], [ %in, %for.body.preheader ]
%out.addr.048 = phi ptr [ %incdec.ptr34, %for.body ], [ %out, %for.body.preheader ]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
%in.addr.049 = phi ptr [ %incdec.ptr2, %for.body ], [ %in, %for.body.preheader ]
%out.addr.048 = phi ptr [ %incdec.ptr34, %for.body ], [ %out, %for.body.preheader ]
%ptr.iv.1 = phi ptr [ %in, %for.body.preheader ], [ %ptr.iv.1.next, %for.body ]
%ptr.iv.2 = phi ptr [ %out, %for.body.preheader ], [ %ptr.iv.2 , %for.body ]

for consistency with other tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --filter-out-after "^scalar.ph:" --version 5
; RUN: opt -passes=loop-vectorize < %s -S -o - | FileCheck %s

source_filename = "<source>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
source_filename = "<source>"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!


attributes #0 = { nofree norecurse nosync nounwind memory(argmem: readwrite) "target-features"="+mve" }

!7 = !{!"llvm.loop.mustprogress"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed I think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I don't think so, removed.

br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !7
}

attributes #0 = { nofree norecurse nosync nounwind memory(argmem: readwrite) "target-features"="+mve" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be more explicit to pass the target fetaures throug hte RUN line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// The cost for scalar VF=1 is already calculated, so ignore it.
if (VF.isScalar())
continue;

/// Don't consider the VF if it exceeds the number of registers for the
/// target.
if (RU.exceedsMaxNumRegs(TTI))
if (CM.useMaxBandwidth(VF.isScalable()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is just a suggestion, but you could add an overloaded version of CM.useMaxBandwidth that takes an ElementCount and does something like:

  bool useMaxBandwidth(ElementCount VF) {
    return useMaxBandwidth(VF.isScalable() ? ...);
  }

and use that here and in computeBestVF?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that a lot, done!

Register pressure was only considered if the vector bandwidth was being
maximised (chosen either by the target or user options), but llvm#132190
inadvertently caused high pressure VFs to be pruned even when max
bandwidth wasn't enabled. This PR returns to the previous behaviour.
@SamTebbs33 SamTebbs33 force-pushed the fix-reg-pressure-regression branch from 50d2b87 to ea866e8 Compare June 11, 2025 13:59
Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@SamTebbs33 SamTebbs33 merged commit 3dd61c1 into llvm:main Jun 16, 2025
7 checks passed
akuhlens pushed a commit to akuhlens/llvm-project that referenced this pull request Jun 24, 2025
Register pressure was only considered if the vector bandwidth was being
maximised (chosen either by the target or user options), but llvm#132190
inadvertently caused high pressure VFs to be pruned even when max
bandwidth wasn't enabled. This PR returns to the previous behaviour.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants