Skip to content

[LoopInterchange] Constrain LI within supported loop nest depth #118656

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 23, 2025

Conversation

madhur13490
Copy link
Contributor

This patch is an extension to #115128.

After profiling LLVM test-suite, I see a lot of loop nest of depth more than MaxLoopNestDepth which is 10. Early exit for them would save compile-time as it would avoid computing DependenceInfo and CacheCost.

@llvmbot
Copy link
Member

llvmbot commented Dec 4, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Madhur Amilkanthwar (madhur13490)

Changes

This patch is an extension to #115128.

After profiling LLVM test-suite, I see a lot of loop nest of depth more than MaxLoopNestDepth which is 10. Early exit for them would save compile-time as it would avoid computing DependenceInfo and CacheCost.


Full diff: https://github.com/llvm/llvm-project/pull/118656.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/Scalar/LoopInterchange.cpp (+12-12)
  • (modified) llvm/test/Transforms/LoopInterchange/bail-out-one-loop.ll (+1-1)
  • (added) llvm/test/Transforms/LoopInterchange/deep-loop-nest.ll (+95)
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index a0c0080c0bda1c..b5ebde300ebe6f 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -69,6 +69,9 @@ using CharMatrix = std::vector<std::vector<char>>;
 // Maximum number of dependencies that can be handled in the dependency matrix.
 static const unsigned MaxMemInstrCount = 100;
 
+// Minimum loop depth supported.
+static const unsigned MinLoopNestDepth = 2;
+
 // Maximum loop depth supported.
 static const unsigned MaxLoopNestDepth = 10;
 
@@ -239,10 +242,12 @@ static void populateWorklist(Loop &L, LoopVector &LoopList) {
   LoopList.push_back(CurrentLoop);
 }
 
-static bool hasMinimumLoopDepth(SmallVectorImpl<Loop *> &LoopList) {
+static bool hasSupportedLoopDepth(SmallVectorImpl<Loop *> &LoopList) {
   unsigned LoopNestDepth = LoopList.size();
-  if (LoopNestDepth < 2) {
-    LLVM_DEBUG(dbgs() << "Loop doesn't contain minimum nesting level.\n");
+  if (LoopNestDepth < MinLoopNestDepth || LoopNestDepth > MaxLoopNestDepth) {
+    LLVM_DEBUG(dbgs() << "Unsupported depth of loop nest " << LoopNestDepth
+                      << " should be [" << MinLoopNestDepth << ", "
+                      << MaxLoopNestDepth << "]\n");
     return false;
   }
   return true;
@@ -430,15 +435,10 @@ struct LoopInterchange {
   bool processLoopList(SmallVectorImpl<Loop *> &LoopList) {
     bool Changed = false;
 
-    // Ensure minimum loop nest depth.
-    assert(hasMinimumLoopDepth(LoopList) && "Loop nest does not meet minimum depth.");
+    // Ensure proper loop nest depth.
+    assert(hasSupportedLoopDepth(LoopList) && "Unsupported depth of loop nest.");
 
     unsigned LoopNestDepth = LoopList.size();
-    if (LoopNestDepth > MaxLoopNestDepth) {
-      LLVM_DEBUG(dbgs() << "Cannot handle loops of depth greater than "
-                        << MaxLoopNestDepth << "\n");
-      return false;
-    }
     if (!isComputableLoopNest(LoopList)) {
       LLVM_DEBUG(dbgs() << "Not valid loop candidate for interchange\n");
       return false;
@@ -1725,8 +1725,8 @@ PreservedAnalyses LoopInterchangePass::run(LoopNest &LN,
                                            LPMUpdater &U) {
   Function &F = *LN.getParent();
   SmallVector<Loop *, 8> LoopList(LN.getLoops());
-  // Ensure minimum depth of the loop nest to do the interchange.
-  if (!hasMinimumLoopDepth(LoopList))
+  // Ensure proper depth of the loop nest to do the interchange.
+  if (!hasSupportedLoopDepth(LoopList))
     return PreservedAnalyses::all();
 
   DependenceInfo DI(&F, &AR.AA, &AR.SE, &AR.LI);
diff --git a/llvm/test/Transforms/LoopInterchange/bail-out-one-loop.ll b/llvm/test/Transforms/LoopInterchange/bail-out-one-loop.ll
index 788e1b0157d80f..477b37937747fc 100644
--- a/llvm/test/Transforms/LoopInterchange/bail-out-one-loop.ll
+++ b/llvm/test/Transforms/LoopInterchange/bail-out-one-loop.ll
@@ -15,7 +15,7 @@ target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i8:8:32-i16:16:32-i6
 ; CHECK-NOT: Delinearizing
 ; CHECK-NOT: Strides:
 ; CHECK-NOT: Terms:
-; CHECK: Loop doesn't contain minimum nesting level.
+; CHECK: Unsupported depth of loop nest 1 should be [2, 10]
 
 define void @foo() {
 entry:
diff --git a/llvm/test/Transforms/LoopInterchange/deep-loop-nest.ll b/llvm/test/Transforms/LoopInterchange/deep-loop-nest.ll
new file mode 100644
index 00000000000000..348c1ab74b7588
--- /dev/null
+++ b/llvm/test/Transforms/LoopInterchange/deep-loop-nest.ll
@@ -0,0 +1,95 @@
+; REQUIRES: asserts
+; RUN: opt < %s -passes=loop-interchange -debug -disable-output 2>&1| FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+
+; For deep loop nest, delinearization should not be run.
+
+; CHECK-NOT: Delinearizing
+; CHECK-NOT: Strides:
+; CHECK-NOT: Terms:
+; CHECK: Unsupported depth of loop nest 11 should be [2, 10]
+define void @big_loop_nest() {
+entry:
+  br label %for1.header
+
+for1.header:
+  %j = phi i64 [ 0, %entry ], [ %j.next, %for1.inc ]
+  br label %for2.header
+for2.header:
+  %k = phi i64 [ 0, %for1.header ], [ %k.next, %for2.inc ]
+  br label %for3.header
+for3.header:
+  %l = phi i64 [ 0, %for2.header ], [ %l.next, %for3.inc ]
+  br label %for4.header
+for4.header:
+  %m = phi i64 [ 0, %for3.header ], [ %m.next, %for4.inc ]
+  br label %for5.header
+for5.header:
+  %n = phi i64 [ 0, %for4.header ], [ %n.next, %for5.inc ]
+  br label %for6.header
+for6.header:
+  %o = phi i64 [ 0, %for5.header ], [ %o.next, %for6.inc ]
+  br label %for7.header
+for7.header:
+  %p = phi i64 [ 0, %for6.header ], [ %p.next, %for7.inc ]
+  br label %for8.header
+for8.header:
+  %q = phi i64 [ 0, %for7.header ], [ %q.next, %for8.inc ]
+  br label %for9.header
+for9.header:
+  %r = phi i64 [ 0, %for8.header ], [ %r.next, %for9.inc ]
+  br label %for10.header
+for10.header:
+  %s = phi i64 [ 0, %for9.header ], [ %s.next, %for10.inc ]
+  br label %for11
+for11:
+  %t = phi i64 [ %t.next, %for11 ], [ 0, %for10.header ]
+  %t.next = add nuw nsw i64 %t, 1
+  %exitcond = icmp eq i64 %t.next, 99
+  br i1 %exitcond, label %for1.inc, label %for11
+
+for1.inc:
+  %j.next = add nuw nsw i64 %j, 1
+  %exitcond26 = icmp eq i64 %j.next, 99
+  br i1 %exitcond26, label %for2.inc, label %for1.header
+for2.inc:
+  %k.next = add nuw nsw i64 %k, 1
+  %exitcond27 = icmp eq i64 %j.next, 99
+  br i1 %exitcond27, label %for3.inc, label %for2.header
+for3.inc:
+  %l.next = add nuw nsw i64 %l, 1
+  %exitcond28 = icmp eq i64 %l.next, 99
+  br i1 %exitcond28, label %for4.inc, label %for3.header
+for4.inc:
+  %m.next = add nuw nsw i64 %m, 1
+  %exitcond29 = icmp eq i64 %m.next, 99
+  br i1 %exitcond29, label %for5.inc, label %for4.header
+for5.inc:
+  %n.next = add nuw nsw i64 %n, 1
+  %exitcond30 = icmp eq i64 %n.next, 99
+  br i1 %exitcond30, label %for6.inc, label %for5.header
+for6.inc:
+  %o.next = add nuw nsw i64 %o, 1
+  %exitcond31 = icmp eq i64 %o.next, 99
+  br i1 %exitcond31, label %for7.inc, label %for6.header
+for7.inc:
+  %p.next = add nuw nsw i64 %p, 1
+  %exitcond32 = icmp eq i64 %p.next, 99
+  br i1 %exitcond32, label %for8.inc, label %for7.header
+for8.inc:
+  %q.next = add nuw nsw i64 %q, 1
+  %exitcond33 = icmp eq i64 %q.next, 99
+  br i1 %exitcond33, label %for9.inc, label %for8.header
+for9.inc:
+  %r.next = add nuw nsw i64 %r, 1
+  %exitcond34 = icmp eq i64 %q.next, 99
+  br i1 %exitcond34, label %for10.inc, label %for9.header
+for10.inc:
+  %s.next = add nuw nsw i64 %s, 1
+  %exitcond35 = icmp eq i64 %s.next, 99
+  br i1 %exitcond35, label %for.end, label %for10.header
+
+for.end:
+  ret void
+}

Copy link

github-actions bot commented Dec 4, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff e1c1e74a6fd71dd889155100d4c0f5e3284f7a22 b6d834518823fdfc6cf0bee464db63c09cc91568 --extensions cpp -- llvm/lib/Transforms/Scalar/LoopInterchange.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
index d366e749c7..0f21fd2c57 100644
--- a/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
+++ b/llvm/lib/Transforms/Scalar/LoopInterchange.cpp
@@ -1758,7 +1758,7 @@ PreservedAnalyses LoopInterchangePass::run(LoopNest &LN,
   DependenceInfo DI(&F, &AR.AA, &AR.SE, &AR.LI);
   std::unique_ptr<CacheCost> CC =
       CacheCost::getCacheCost(LN.getOutermostLoop(), AR, DI);
-  
+
   if (!LoopInterchange(&AR.SE, &AR.LI, &DI, &AR.DT, CC, &ORE).run(LN))
     return PreservedAnalyses::all();
   U.markLoopNestChanged(true);

@@ -0,0 +1,95 @@
; REQUIRES: asserts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to rely on asserts builds, that would increase our test coverage.

Can you instead rely on e.g. the optimisation remarks for which we don't need asserts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure but in this case, I will have to duplicate the file for without asserts builds. I don't know if we can conditionally execute RUN line based on build type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you need to duplicate the file? I think you only need to check optimization remarks instead of debug output as what you are doing now, e.g., llvm/test/Transforms/LoopInterchange/loop-interchange-optimization-remarks.ll.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I want to ensure DependencyAnalysis is not run and I can do that only when -debug is provided. Specifically, the below lines are of my interest:

; CHECK-NOT: Delinearizing
; CHECK-NOT: Strides:
; CHECK-NOT: Terms:
; CHECK: Unsupported depth of loop nest 11, the supported range is [2, 10].

if (LoopNestDepth < 2) {
LLVM_DEBUG(dbgs() << "Loop doesn't contain minimum nesting level.\n");
if (LoopNestDepth < MinLoopNestDepth || LoopNestDepth > MaxLoopNestDepth) {
LLVM_DEBUG(dbgs() << "Unsupported depth of loop nest " << LoopNestDepth
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, maybe small clarification of the message:

"Unsupported loop nest depth of " << LoopNestDepth << ", the supported range is ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@CongzheUalberta
Copy link
Contributor

CongzheUalberta commented Dec 12, 2024

This patch is an extension to #115128.

After profiling LLVM test-suite, I see a lot of loop nest of depth more than MaxLoopNestDepth which is 10. Early exit for them would save compile-time as it would avoid computing DependenceInfo and CacheCost.

Would be good to know how much compile time reduction for LLVM test-suite you could achieve with this patch?

unsigned LoopNestDepth = LoopList.size();
if (LoopNestDepth < 2) {
LLVM_DEBUG(dbgs() << "Loop doesn't contain minimum nesting level.\n");
if (LoopNestDepth < MinLoopNestDepth || LoopNestDepth > MaxLoopNestDepth) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to make MaxLoopNestDepth and MinLoopNestDepth to be tunable parameters rather than hardcoded numbers, so we could control with better flexibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -0,0 +1,95 @@
; REQUIRES: asserts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you need to duplicate the file? I think you only need to check optimization remarks instead of debug output as what you are doing now, e.g., llvm/test/Transforms/LoopInterchange/loop-interchange-optimization-remarks.ll.

@madhur13490 madhur13490 force-pushed the perf/bound-max-depth branch 2 times, most recently from bff108c to 30807f0 Compare January 21, 2025 08:11
@@ -201,7 +201,7 @@ static cl::opt<bool> RunNewGVN("enable-newgvn", cl::init(false), cl::Hidden,
cl::desc("Run the NewGVN pass"));

static cl::opt<bool> EnableLoopInterchange(
"enable-loopinterchange", cl::init(false), cl::Hidden,
"enable-loopinterchange", cl::init(true), cl::Hidden,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for experimentation purposes only.

@madhur13490
Copy link
Contributor Author

This patch is an extension to #115128.
After profiling LLVM test-suite, I see a lot of loop nest of depth more than MaxLoopNestDepth which is 10. Early exit for them would save compile-time as it would avoid computing DependenceInfo and CacheCost.

Would be good to know how much compile time reduction for LLVM test-suite you could achieve with this patch?

Sure, let me rerun again with compile-time-tracker.

Copy link
Member

@Meinersbur Meinersbur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me rerun again with compile-time-tracker.

If you want to check the influence of the max-depth bound with enabled loop-interchange, you would need it based on the commit that has if enabled.

I don't think any of the sources from test-suite used by by the compile-time-tracker has loops that deep. I don't think it matters anyway, being based on bubble-sort already is a convinving argument that an upper bound it needed.

Not relying on -debug would be nice, but still common practice that I don't want to suddenly enforce.

LGTM. Be sure to remove cl::init(true) before pushing1.

Footnotes

  1. Could have used a different branch for experimentation, bound-max-depth-2 is identical, enable-li-2 is based on a different commit and doesn't show up as direct comparison in llvm-compile-tracker.

Comment on lines 8 to 10
; CHECK-NOT: Delinearizing
; CHECK-NOT: Strides:
; CHECK-NOT: Terms:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-NOT checks are fragile. For instance, it will not match any of those after the "Unsupported depth of ..." line. If there is not such line, then this test will already fail because of that.

Consider adding additional dbg() << ... output somewhere with the reason that it has been skipped.

@sjoerdmeijer
Copy link
Collaborator

Looks good to me too, but I would really prefer not to introduce tests that rely on Debug when not needed.

The way I look at: with deep-loop-nest.ll we test the loop depth. With the default value of 10, it shouldn't trigger, but we may want another RUN line with -loop-interchange-max-loop-nest-depth=11 where it does trigger, so we have a negative and positive test, and we can fully rely on optimisation remarks for that, no need for Debug.

Testing when DependencyAnalysis is or isn't running, is slightly orthogonal? Can this be tested separated with a different test that then depends on Debug?

But anyway, this is my preference, am also not blocking this.

This patch is an extension to llvm#115128.

After profiling LLVM test-suite, I see a lot of loop nest
of depth more than `MaxLoopNestDepth` which is 10.
Early exit for them would save compile-time as it would avoid
computing DependenceInfo and CacheCost.
@madhur13490
Copy link
Contributor Author

Thanks @Meinersbur and @sjoerdmeijer for the comments.
I have removed the dependence analysis check and thus -debug requirement.

As @Meinersbur suggested, I reran on compile-time-tracker and PFA screenshot - the benefits are clearly visible for the branch perf/max-depth-exp. The first commit on this branch enables the loop interchange and the second commit applied this patch.

Screenshot 2025-01-21 at 10 27 28 PM

I don't think any of the sources from test-suite used by by the compile-time-tracker has loops that deep. I don't think it matters anyway, based on bubble-sort already is a convinving argument that an upper bound it needed.

I don't know if compile-time-tracker covers everything that is present in LLVM Test-suite but when I ran locally here is a quick stat.
I dumped depth of loop nest (using LoopList.size()) for each file in the LLVM test suite at the beginning of run(). When the transformation is enabled, run() fucntion is invoked 139323 times on the entire LVM test suite and there are 238 loops whose depth is more than 10. 238/139323 seems much less, but there are indeed loops with huge depth in the test suite. Max Loop nest depth is 122!

Thanks for the approval. I will land the patch soon giving some more time in case anyone else wants to comment.

@Meinersbur
Copy link
Member

I don't know if compile-time-tracker covers everything that is present in LLVM Test-suite

It does not, it only runs kimwitu++, sqlite3, consumer-typeset, Bullet, tramp3d-v4, mafft, ClamAV, lencod, SPASS, 7zip, geomean. You can see the detailed results here: http://llvm-compile-time-tracker.com/compare.php?from=a591801b74bc59e815f9d8c35af8c8ee2582dce3&to=e96b7568c7f5bb3b5a1af0d5c13c56dee745874b&stat=instructions%3Au&details=on

It seems indeed there is a measurable difference for some translation units.

@madhur13490 madhur13490 merged commit d15f3e8 into llvm:main Jan 23, 2025
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants