
[RISCV][VLOPT] Don't reduce the VL if it is the same as CommonVL #123878

Merged

michaelmaitland merged 7 commits into llvm:main from the vlopt-same-vl branch on Jan 22, 2025

Conversation

michaelmaitland
Contributor

This fixes #123862.

@llvmbot
Member

llvmbot commented Jan 22, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

This fixes #123862.


Full diff: https://github.com/llvm/llvm-project/pull/123878.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp (+7)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vl-opt.ll (+11)
diff --git a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
index 54ca8ccd8d9e90..9182e1f751933c 100644
--- a/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
+++ b/llvm/lib/Target/RISCV/RISCVVLOptimizer.cpp
@@ -1320,6 +1320,13 @@ bool RISCVVLOptimizer::tryReduceVL(MachineInstr &OrigMI) {
     }
 
     if (CommonVL->isImm()) {
+      if (CommonVL->isImm() && VLOp.isImm() &&
+          VLOp.getImm() == CommonVL->getImm()) {
+        LLVM_DEBUG(dbgs() << "  VL is already reduced to" << VLOp << " for "
+                          << MI << "\n");
+        continue;
+      }
+
       LLVM_DEBUG(dbgs() << "  Reduce VL from " << VLOp << " to "
                         << CommonVL->getImm() << " for " << MI << "\n");
       VLOp.ChangeToImmediate(CommonVL->getImm());
diff --git a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
index 1cc30f077feb4a..d6143f69288e66 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vl-opt.ll
@@ -150,3 +150,14 @@ define <vscale x 4 x i32> @dont_optimize_tied_def(<vscale x 4 x i32> %a, <vscale
   ret <vscale x 4 x i32> %2
 }
 
+define <vscale x 4 x i32> @same_vl_imm(<vscale x 4 x i32> %passthru, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
+; CHECK-LABEL: same_vl_imm:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m2, ta, ma
+; CHECK-NEXT:    vadd.vv v8, v10, v12
+; CHECK-NEXT:    vadd.vv v8, v8, v10
+; CHECK-NEXT:    ret
+  %v = call <vscale x 4 x i32> @llvm.riscv.vadd.nxv4i32.nxv4i32(<vscale x 4 x i32> poison, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b, iXLen 4)
+  %w = call <vscale x 4 x i32> @llvm.riscv.vadd.nxv4i32.nxv4i32(<vscale x 4 x i32> poison, <vscale x 4 x i32> %v, <vscale x 4 x i32> %a, iXLen 4)
+  ret <vscale x 4 x i32> %w
+}

@MaheshRavishankar
Contributor

Thanks! That seems to have worked.

Contributor

@lukel97 left a comment


I don't think this is an infinite loop, but a pathologically slow case where the chain of instructions is really long. But pruning the worklist when the CommonVL is the same seems like a sensible way to manage it.

@wangpc-pp
Contributor

wangpc-pp commented Jan 22, 2025

I don't think this is an infinite loop, but a pathologically slow case where the chain of instructions is really long. But pruning the worklist when the CommonVL is the same seems like a sensible way to manage it.

If so, I think we should add a list of handled instructions to remove duplicated instructions in:

// Now add all inputs to this instruction to the worklist.
for (auto &Op : MI.operands()) {
  if (!Op.isReg() || !Op.isUse() || !Op.getReg().isVirtual())
    continue;
  if (!isVectorRegClass(Op.getReg(), MRI))
    continue;
  MachineInstr *DefMI = MRI->getVRegDef(Op.getReg());
  if (!isCandidate(*DefMI))
    continue;
  Worklist.insert(DefMI);
}

Just like:
SmallVector<SDNode *> Worklist;
SmallSet<SDNode *, 8> Inserted;
Worklist.push_back(N);
Inserted.insert(N);
SmallVector<CombineResult> CombinesToApply;
while (!Worklist.empty()) {
  SDNode *Root = Worklist.pop_back_val();
  NodeExtensionHelper LHS(Root, 0, DAG, Subtarget);
  NodeExtensionHelper RHS(Root, 1, DAG, Subtarget);
  auto AppendUsersIfNeeded = [&Worklist, &Subtarget,
                              &Inserted](const NodeExtensionHelper &Op) {
    if (Op.needToPromoteOtherUsers()) {
      for (SDUse &Use : Op.OrigOperand->uses()) {
        SDNode *TheUser = Use.getUser();
        if (!NodeExtensionHelper::isSupportedRoot(TheUser, Subtarget))
          return false;
        // We only support the first 2 operands of FMA.
        if (Use.getOperandNo() >= 2)
          return false;
        if (Inserted.insert(TheUser).second)
          Worklist.push_back(TheUser);


Never mind, I didn't notice we are using SetVector here. :-)

@topperc
Collaborator

topperc commented Jan 22, 2025

I don't think this is an infinite loop, but a pathologically slow case where the chain of instructions is really long. But pruning the worklist when the CommonVL is the same seems like a sensible way to manage it.

If so, I think we should add a list of handled instructions to remove duplicated instructions in:

// Now add all inputs to this instruction to the worklist.
for (auto &Op : MI.operands()) {
  if (!Op.isReg() || !Op.isUse() || !Op.getReg().isVirtual())
    continue;
  if (!isVectorRegClass(Op.getReg(), MRI))
    continue;
  MachineInstr *DefMI = MRI->getVRegDef(Op.getReg());
  if (!isCandidate(*DefMI))
    continue;
  Worklist.insert(DefMI);
}

Just like:

SmallVector<SDNode *> Worklist;
SmallSet<SDNode *, 8> Inserted;
Worklist.push_back(N);
Inserted.insert(N);
SmallVector<CombineResult> CombinesToApply;
while (!Worklist.empty()) {
  SDNode *Root = Worklist.pop_back_val();
  NodeExtensionHelper LHS(Root, 0, DAG, Subtarget);
  NodeExtensionHelper RHS(Root, 1, DAG, Subtarget);
  auto AppendUsersIfNeeded = [&Worklist, &Subtarget,
                              &Inserted](const NodeExtensionHelper &Op) {
    if (Op.needToPromoteOtherUsers()) {
      for (SDUse &Use : Op.OrigOperand->uses()) {
        SDNode *TheUser = Use.getUser();
        if (!NodeExtensionHelper::isSupportedRoot(TheUser, Subtarget))
          return false;
        // We only support the first 2 operands of FMA.
        if (Use.getOperandNo() >= 2)
          return false;
        if (Inserted.insert(TheUser).second)
          Worklist.push_back(TheUser);

The problem is that we visit every instruction in the basic block in the outer loop. And we visit every earlier instruction anytime we make a change using a worklist. So the problem is that we process most of the graph using the worklist, but then the outer loop still causes us to revisit everything again. Each time we run through the worklist all over again.

What we should probably do is pre-load the worklist with the entire basic block instead of using the outer loop. The worklist is a SetVector so won't add anything more than once.
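A minimal sketch of that shape, for illustration only (it is not the code that eventually landed in #123973, and it assumes the pass's existing isCandidate() and tryReduceVL() helpers plus a SetVector worklist):

  // Seed the worklist once with every candidate instruction in the block,
  // instead of restarting it from each instruction of an outer loop.
  // SetVector keeps insertion order and drops duplicates, so each
  // instruction appears in the queue at most once at a time.
  SetVector<MachineInstr *> Worklist;
  for (MachineInstr &MI : MBB)
    if (isCandidate(MI))
      Worklist.insert(&MI);

  bool MadeChange = false;
  while (!Worklist.empty()) {
    MachineInstr *MI = Worklist.pop_back_val();
    // Any producers that become worth revisiting can be re-inserted here
    // (or inside tryReduceVL); the SetVector de-duplicates those as well.
    MadeChange |= tryReduceVL(*MI);
  }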

@lukel97
Contributor

lukel97 commented Jan 22, 2025

The problem is that we visit every instruction in the basic block in the outer loop. And we visit every earlier instruction anytime we make a change using a worklist. So the problem is that we process most of the graph using the worklist, but then the outer loop still causes us to revisit everything again. Each time we run through the worklist all over again.

What we should probably do is pre-load the worklist with the entire basic block instead of using the outer loop. The worklist is a SetVector so won't add anything more than once.

The really long hang happens during the first call to tryReduce. But adding everything to the worklist up front would still probably help.

Would making the worklist a queue instead of a stack also help?

@@ -4244,6 +4244,21 @@ bool RISCV::isVLKnownLE(const MachineOperand &LHS, const MachineOperand &RHS) {
   return LHS.getImm() <= RHS.getImm();
 }
 
+/// Given two VL operands, do we know that LHS < RHS?
+bool RISCV::isVLKnownLT(const MachineOperand &LHS, const MachineOperand &RHS) {
+  if (LHS.isReg() && RHS.isReg() && LHS.getReg().isVirtual() &&
Collaborator


Put an early-exit here which checks for LHS and RHS both being immediates and this simplifies greatly.
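For illustration only, a hedged sketch of that suggestion (the helper name is hypothetical, and this is not necessarily what landed; it assumes the RISCV::VLMaxSentinel immediate encoding already used by isVLKnownLE):

  // Sketch: early-exit on the immediate/immediate case and answer
  // conservatively ("not known") for everything else.
  static bool isVLKnownLTSketch(const MachineOperand &LHS,
                                const MachineOperand &RHS) {
    if (LHS.isImm() && RHS.isImm()) {
      // VLMAX is encoded as a sentinel immediate and may coincide with any
      // concrete VL, so refuse to claim a strict ordering when either side
      // is the sentinel.
      if (LHS.getImm() == RISCV::VLMaxSentinel ||
          RHS.getImm() == RISCV::VLMaxSentinel)
        return false;
      return LHS.getImm() < RHS.getImm();
    }
    // Register VLs give us nothing we can prove strictly here.
    return false;
  }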

preames added a commit to preames/llvm-project that referenced this pull request Jan 22, 2025
This implements a suggestion by Craig in PR llvm#123878.  We can move the
worklist management out of the per-instruction work and do it once
at the end of scanning all the instructions.  This should reduce
repeat visitation of the same instruction when no changes can be made.

Note that this does not remove the inherent O(N^2) in the algorithm.
We're still potentially visiting every user of every def.

I also included a guard for unreachable blocks since that had been
mentioned as a possible cause. It seems we've ruled that out, but
guarding for this case is still a good idea.
@preames
Collaborator

preames commented Jan 22, 2025

The problem is that we visit every instruction in the basic block in the outer loop. And we visit every earlier instruction anytime we make a change using a worklist. So the problem is that we process most of the graph using the worklist, but then the outer loop still causes us to revisit everything again. Each time we run through the worklist all over again.

What we should probably do is pre-load the worklist with the entire basic block instead of using the outer loop. The worklist is a SetVector so won't add anything more than once.

I implemented this suggestion in #123973. I want to be careful to say I don't see this as an alternative, but as a possible additional fix or defense in depth. I do want to see this change move forward as it will reduce the depth of the recursive (worklist) walk when we're not achieving anything useful.

@michaelmaitland
Contributor Author

I appreciate all of the comments here. I agree that we should improve how we manage our worklist. This was something that @preames pointed out as an area for improvement in the original addition of this patch. I'd like to land this patch independently of any improvements to the management of the worklist. I think that this change and a worklist change are beneficial on their own.

Thank you @lukel97 for pointing out that this is not a case of infinite loop, but a case that is pathologically slow. I've added the pathologically slow test case to this PR, since this PR fixes the slowness problem.

# RUN: llc %s -o - -mtriple=riscv64 -mattr=+v -run-pass=riscv-vl-optimizer -verify-machineinstrs | FileCheck %s

---
name: test
Collaborator


I don't know that adding this is a good idea. This won't show up as a test failure if this ever fails, it will only show up as a long running test. I'm not sure that's likely to get noticed.

Contributor Author


I have addressed this concern.


; REQUIRES: asserts

define <vscale x 4 x i32> @same_vl_imm(<vscale x 4 x i32> %passthru, <vscale x 4 x i32> %a, <vscale x 4 x i32> %b) {
Collaborator


Please add a comment which explains why you're checking for the particular debug output.

Collaborator


Though, honestly, this seems like borderline bad practice. I would be fine with this landing without a dedicated test since it is NFC and "only" improves compile time.

Contributor Author

@michaelmaitland Jan 22, 2025


Please let me know what you prefer. This test case is testing an observable change.

Collaborator

@preames left a comment


LGTM

preames added a commit that referenced this pull request Jan 22, 2025
This implements a suggestion by Craig in PR #123878. We can move the
worklist management out of the per-instruction work and do it once at
the end of scanning all the instructions. This should reduce repeat
visitation of the same instruction when no changes can be made.

Note that this does not remove the inherent O(N^2) in the algorithm.
We're still potentially visiting every user of every def.

I also included a guard for unreachable blocks since that had been
mentioned as a possible cause. It seems we've ruled that out, but
guarding for this case is still a good idea.
@michaelmaitland merged commit 1687aa2 into llvm:main on Jan 22, 2025
5 of 6 checks passed
@michaelmaitland deleted the vlopt-same-vl branch on January 22, 2025 at 18:50
Development

Successfully merging this pull request may close these issues.

Generation of Object file gets stuck in infinite loop after enabling RISCVVLOptimizer
7 participants