[SystemZ] Remove getInliningThresholdMultiplier() override #94612

Closed · nikic wanted to merge 2 commits from the s390x-inline-threshold branch

Conversation

@nikic (Contributor) commented Jun 6, 2024

Using this TTI hook is only appropriate for GPU targets that need to maximally flatten all functions.

The inlining threshold multiplier should not be increased for ordinary targets -- while this may increase benchmark scores, it can also result in large code size increases and runaway inlining, resulting in build OOM/timeouts.

@llvmbot (Member) commented Jun 6, 2024

@llvm/pr-subscribers-backend-systemz

Author: Nikita Popov (nikic)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/94612.diff

1 file affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h (-1)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index 3cf4a69ac2818..18effe6950e6c 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -38,7 +38,6 @@ class SystemZTTIImpl : public BasicTTIImplBase<SystemZTTIImpl> {
   /// \name Scalar TTI Implementations
   /// @{
 
-  unsigned getInliningThresholdMultiplier() const { return 3; }
   unsigned adjustInliningThreshold(const CallBase *CB) const;
 
   InstructionCost getIntImmCost(const APInt &Imm, Type *Ty,
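
For context on what this hook does: the base TargetTransformInfo implementation returns a multiplier of 1, and the inline cost analysis scales its threshold by whatever the target returns, then adds any per-call-site bonus from adjustInliningThreshold. Below is a hypothetical standalone model of that interaction; the real logic lives in llvm/lib/Analysis/InlineCost.cpp and is considerably more involved, and the function name and simplification here are invented for illustration.

// Hypothetical model of how the two TTI hooks shape the inline budget --
// a sketch for illustration, not the actual InlineCost.cpp logic.
unsigned computeInlineThreshold(unsigned DefaultThreshold, // base threshold, e.g. 225 at -O2
                                unsigned Multiplier,       // getInliningThresholdMultiplier()
                                unsigned SiteBonus) {      // adjustInliningThreshold(CB)
  // The multiplier scales the budget for every call site in the module;
  // returning 3, as SystemZ did here, triples all inlining thresholds.
  unsigned Threshold = DefaultThreshold * Multiplier;
  // The per-call-site bonus is additive on top of the scaled threshold.
  return Threshold + SiteBonus;
}

In this model, with a default threshold of 225, the SystemZ multiplier alone pushed the base budget to 675 before any call-site bonuses.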

@JonPsson1 (Contributor):
What are the alternatives to using this?

I will check how this would impact benchmarks.

@JonPsson1 (Contributor):
I don't think we can just disable this without a replacement: I see regressions on imagick (9%) and perl (5%) and probably others as well...

@nikic (Contributor, Author) commented Jun 11, 2024

The default inlining thresholds make a tradeoff between performance, code size and compile-time. Yes, you can increase performance numbers by increasing inlining thresholds, but you lose other things along the way.

For example, libLLVM.so on x86 is 122MB. Other architectures (like aarch64 and ppc64le) are within 5% of that. s390x instead clocks in at 185MB -- about 50% larger. It is expected that -O2 favors performance over code size, but that does not mean code size can be ignored completely.

The reason I ended up submitting this now is that I hit another s390x-only runaway inlining issue (where the build either OOMs or takes hours). It's certainly not the first time I've run into this kind of issue.

I don't think that it's okay for the s390x target, and only the s390x target, to change otherwise project-global inlining tradeoffs in such a significant way. Basically, I think you need to eat the performance regressions and return to the project-global baseline.

Doing something like this would be appropriate if there were something special about the s390x architecture that justified it having a much, much higher inlining threshold than other targets. This is the case for GPU architectures, for which this hook exists. I could reasonably see thresholds being slightly different on s390x, e.g. because it has higher function call overhead or something like that, but I find it hard to imagine what would justify a 3x threshold increase. (In reality it's even worse than 3x, because adjustInliningThreshold hands out additional big inlining bonuses.)
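
To illustrate the shape of the second mechanism mentioned here, the following is a rough sketch of an adjustInliningThreshold override. Everything in it is invented (the MyTTIImpl class, the many-arguments rule, the bonus of 150); it is not SystemZ's actual heuristic, just an example of how a target hands out flat per-call-site bonuses that compound with the global multiplier.

// Purely illustrative override -- not SystemZ's real heuristic.
// Assumes the usual LLVM headers (llvm/IR/InstrTypes.h, llvm/IR/Function.h).
unsigned MyTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
  // Invented rule: pretend argument passing is expensive on this target,
  // so grant a flat bonus for internal callees taking many arguments.
  const Function *Callee = CB->getCalledFunction();
  if (Callee && Callee->hasLocalLinkage() && CB->arg_size() > 5)
    return 150; // added on top of DefaultThreshold * Multiplier
  return 0;     // no adjustment for other call sites
}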

@JonPsson1 (Contributor):
Ok, we may have to reconsider this. Let's wait and see what @uweigand says (currently on vacation).

@cuviper (Member) commented Jun 11, 2024

For a concrete recent example, see https://bugzilla.redhat.com/show_bug.cgi?id=2283132

@uweigand (Member) commented Jul 2, 2024

This is interesting. To answer the question of whether there's anything fundamental about s390x that would require higher inlining thresholds: not really. However, our ABI has historically had somewhat higher overhead than some others, which made inlining more helpful in some benchmarks than elsewhere. Also, in our typical usage, executable size tends to be less of a concern than on some other platforms. As a result, we've always thought somewhat more aggressive inlining might be a better tradeoff for us.

That said, I agree we certainly do not want inlining thresholds so high that they cause extreme increases in compile time and/or executable size - in particular, if that introduces notable differences between s390x and other server architectures. If that is indeed the case, we likely have to revisit this choice.

Given that, one remaining question is whether, instead of simply increasing the threshold across the board, we can find a way to make inlining decisions more intelligent, so that we win back the regressed benchmarks. @JonPsson1, that might be worth spending some time investigating.

@nikic (Contributor, Author) commented Jul 16, 2024

Is there something we can reasonably do here in time for the LLVM 19 release? For example, would reducing the multiplier to 2 rather than 1 reduce the benchmark impact? It's not the ideal end-state, but maybe a reasonable intermediate step...

@JonPsson1 (Contributor):
I have now looked into this, building SPEC with the x3 (main), x2, and x1 (no multiplier) settings. The code size changes are listed below, relative to main. Changing to x2 is indeed somewhat helpful. Looking at the benchmarks (further below), though, it is clear that this still loses performance in some cases.

Code size, relative to x3 (main)

Benchmark                  x2        x1

i500.perlbench_r        -7.4%    -14.6%
i505.mcf_r              +0.0%     +0.0%
i523.xalancbmk_r        -6.6%    -11.0%
i531.deepsjeng_r        +0.1%     -2.8%
i502.gcc_r              -5.4%    -13.6%
i520.omnetpp_r          -4.0%     -9.6%
i525.x264_r             -3.8%    -13.3%
i541.leela_r            -1.0%     -9.1%
i557.xz_r               -3.1%     -6.2%
f507.cactuBSSN_r        -1.4%     -3.3%
f508.namd_r            -21.6%    -25.0%
f510.parest_r           -3.4%    -10.3%
f511.povray_r          -12.3%    -16.0%
f519.lbm_r              +0.0%     +0.0%
f521.wrf_r              -0.1%     -0.2%
f526.blender_r          -2.5%     -6.5%
f527.cam4_r             -0.0%     -0.1%
f538.imagick_r          -1.8%     -5.0%
f544.nab_r              -6.1%    -13.4%


Runtime, relative to x3 (main)

Benchmark                  x2        x1

f538.imagick_r           +13%      +15%
i500.perlbench_r          +2%       +6%
f511.povray_r             -1%       +3%
...

Trying the reproducer:

/usr/bin/time -v ./bin/opt -O3 -mtriple=s390x-linux-gnu -mcpu=z16 ./bad-42752cbe095.bc -o /dev/null

x1
User time (seconds): 4.81
Maximum resident set size (kbytes): 132960
Average resident set size (kbytes): 0

x2
User time (seconds): 9.44
Maximum resident set size (kbytes): 175012
Average resident set size (kbytes): 0

x3
Bad: still running after 45 minutes, with memory usage having slowly climbed to 30G and continuing to grow.

So I can confirm getting the same results as reported - x2 would remedy this case.

Looking at imagick, it is clear that in some cases increased inlining can give a very good performance improvement without affecting code size as much. And namd, for example, shows that inlining can increase code size significantly without giving much of an improvement...

@nikic (Contributor, Author) commented Jul 23, 2024

Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?

@JonPsson1 (Contributor):
> Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?

Unfortunately, it seems the old discussion on Phabricator is lost, so I am not sure which benchmark(s) it relates to. This is a case where we actually do know, for a specific type of function, that inlining is highly beneficial, so that is certainly good to keep. How are you thinking this would relate to the general multiplier? I could do some more benchmarking without it as well, if you think that would be relevant.

@nikic (Contributor, Author) commented Jul 23, 2024

> Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?
>
> Unfortunately, it seems the old discussion on Phabricator is lost, so I am not sure which benchmark(s) it relates to. This is a case where we actually do know, for a specific type of function, that inlining is highly beneficial, so that is certainly good to keep. How are you thinking this would relate to the general multiplier? I could do some more benchmarking without it as well, if you think that would be relevant.

I'm mainly trying to figure out how important that imagick regression is. If the s390x backend is really going out of its way to make imagick optimize as desired (not just by adjusting the multiplier, but also doing other threshold adjustments), then this is probably too specific to a single benchmark and we should not be overly concerned about it.

(Looking back through some old discussions, it looks like imagick on s390x was badly hit by deferred inlining removal in https://reviews.llvm.org/D115497, and I think the inlining threshold adjustment was an attempt to mitigate that?)

@JonPsson1 (Contributor):
> (Looking back through some old discussions, it looks like imagick on s390x was badly hit by deferred inlining removal in https://reviews.llvm.org/D115497, and I think the inlining threshold adjustment was an attempt to mitigate that?)

Yeah - that rings a bell :)

@JonPsson1 (Contributor):
> I'm mainly trying to figure out how important that imagick regression is. If the s390x backend is really going out of its way to make imagick optimize as desired (not just by adjusting the multiplier, but also doing other threshold adjustments), then this is probably too specific to a single benchmark and we should not be overly concerned about it.

I checked back again on imagick, and it seems that removing either the multiplier or the specific threshold adjustment causes a similar regression (~10%), so both are needed for the current performance. Without the multiplier, the specific adjustment gives only a ~1% improvement.

nikic added 2 commits August 16, 2024 15:23
Using this TTI hook is only appropriate for GPU targets that need
to maximally flatten all functions.

The inlining threshold multiplier should not be increased for
ordinary targets -- while this may increase benchmark scores, it
can also result in large code size increases and runaway inlining,
resulting in build OOM/timeouts.
@nikic force-pushed the s390x-inline-threshold branch from 5c94dc5 to 2fdbb7e on August 16, 2024 13:27
@nikic (Contributor, Author) commented Aug 16, 2024

I've updated this PR to change the threshold multiplier from 3 to 2 (instead of to 1). Based on the numbers in #94612 (comment), with a multiplier of 2 only imagick takes a significant performance hit, and the s390x backend is currently over-specialized towards that benchmark. The original proposal of multiplier 1 had a negative impact on more benchmarks.
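
The updated override presumably now reads along these lines (a sketch of the revised PR state; the actual commit is not quoted in this thread):

// SystemZTargetTransformInfo.h -- multiplier lowered from 3 to 2
// rather than removed outright (sketch, not the quoted diff).
unsigned getInliningThresholdMultiplier() const { return 2; }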

@JonPsson1 (Contributor):
Thanks for the update. However, I have been working on tracking down the regressions I have seen (imagick, povray, and perl) and have made some progress. I believe I will have a patch ready soon that sets the multiplier to 1 and boosts the threshold for specific types of callees.

@JonPsson1 (Contributor):
Can probably close this now (patch in progress at #106058).

@JonPsson1 (Contributor):
Fix committed with the general multiplier removed entirely.

@nikic closed this on Nov 25, 2024