[SystemZ] Remove getInliningThresholdMultiplier() override #94612

Closed · nikic wanted to merge 2 commits from the s390x-inline-threshold branch

Conversation

@nikic (Contributor) commented Jun 6, 2024

Using this TTI hook is only appropriate for GPU targets that need to maximally flatten all functions.

The inlining threshold multiplier should not be increased for ordinary targets -- while this may increase benchmark scores, it can also result in large code size increases and runaway inlining, resulting in build OOM/timeouts.

@llvmbot (Member) commented Jun 6, 2024

@llvm/pr-subscribers-backend-systemz

Author: Nikita Popov (nikic)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/94612.diff

1 file affected:

  • (modified) llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h (-1)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index 3cf4a69ac2818..18effe6950e6c 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -38,7 +38,6 @@ class SystemZTTIImpl : public BasicTTIImplBase<SystemZTTIImpl> {
   /// \name Scalar TTI Implementations
   /// @{
 
-  unsigned getInliningThresholdMultiplier() const { return 3; }
   unsigned adjustInliningThreshold(const CallBase *CB) const;
 
   InstructionCost getIntImmCost(const APInt &Imm, Type *Ty,
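
For context on what this hook does: the base TargetTransformInfo implementation returns a multiplier of 1, and the inline cost analysis scales its threshold by whatever the target returns, then adds any per-call-site bonus from adjustInliningThreshold. Below is a hypothetical standalone model of that interaction; the real logic lives in llvm/lib/Analysis/InlineCost.cpp and is considerably more involved, and the function name and simplification here are invented for illustration.

// Hypothetical model of how the two TTI hooks shape the inline budget --
// a sketch for illustration, not the actual InlineCost.cpp logic.
unsigned computeInlineThreshold(unsigned DefaultThreshold, // base threshold, e.g. 225 at -O2
                                unsigned Multiplier,       // getInliningThresholdMultiplier()
                                unsigned SiteBonus) {      // adjustInliningThreshold(CB)
  // The multiplier scales the budget for every call site in the module;
  // returning 3, as SystemZ did here, triples all inlining thresholds.
  unsigned Threshold = DefaultThreshold * Multiplier;
  // The per-call-site bonus is additive on top of the scaled threshold.
  return Threshold + SiteBonus;
}

In this model, with a default threshold of 225, the SystemZ multiplier alone pushed the base budget to 675 before any call-site bonuses.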

@JonPsson1 (Contributor):
What are the alternatives to using this?

I will check how this would impact benchmarks.

@JonPsson1 (Contributor):
I don't think we can just disable this without a replacement: I see regressions on imagick (9%) and perl (5%) and probably others as well...

@nikic (Contributor, Author) commented Jun 11, 2024

The default inlining thresholds make a tradeoff between performance, code size and compile-time. Yes, you can increase performance numbers by increasing inlining thresholds, but you lose other things along the way.

For example, libLLVM.so on x86 is 122MB. Other architectures (like aarch64 and ppc64le) are within 5% of that. s390x instead clocks in at 185MB -- about 50% larger. It is expected that -O2 favors performance over code size, but that does not mean code size can be ignored completely.

The reason I ended up submitting this now is that I hit another s390x-only runaway inlining issue (where the build either OOMs or takes hours). It's certainly not the first time I've run into this kind of issue.

I don't think that it's okay for the s390x target, and only the s390x target, to change otherwise project-global inlining tradeoffs in such a significant way. Basically, I think you need to eat the performance regressions and return to the project-global baseline.

Doing something like this would be appropriate if there were something special about the s390x architecture that justified it having a much, much higher inlining threshold than other targets. This is the case for GPU architectures, for which this hook exists. I could reasonably see thresholds being slightly different on s390x, e.g. because it has higher function call overhead or something like that, but I find it hard to imagine what would justify a 3x threshold increase. (In reality it's even worse than 3x, because adjustInliningThreshold hands out additional big inlining bonuses.)
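
To illustrate the shape of the second mechanism mentioned here, the following is a rough sketch of an adjustInliningThreshold override. Everything in it is invented (the MyTTIImpl class, the many-arguments rule, the bonus of 150); it is not SystemZ's actual heuristic, just an example of how a target hands out flat per-call-site bonuses that compound with the global multiplier.

// Purely illustrative override -- not SystemZ's real heuristic.
// Assumes the usual LLVM headers (llvm/IR/InstrTypes.h, llvm/IR/Function.h).
unsigned MyTTIImpl::adjustInliningThreshold(const CallBase *CB) const {
  // Invented rule: pretend argument passing is expensive on this target,
  // so grant a flat bonus for internal callees taking many arguments.
  const Function *Callee = CB->getCalledFunction();
  if (Callee && Callee->hasLocalLinkage() && CB->arg_size() > 5)
    return 150; // added on top of DefaultThreshold * Multiplier
  return 0;     // no adjustment for other call sites
}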

@JonPsson1 (Contributor):
Ok, we may have to reconsider this. Let's wait and see what @uweigand says (currently on vacation).

@cuviper (Member) commented Jun 11, 2024

For a concrete recent example, see https://bugzilla.redhat.com/show_bug.cgi?id=2283132

@uweigand (Member) commented Jul 2, 2024

This is interesting. To answer the question of whether there's anything fundamental about s390x that would require higher inlining thresholds: not really. However, our ABI has historically had somewhat higher overhead than some others, which made inlining more helpful in some benchmarks than elsewhere. Also, in our typical usage, executable size tends to be less of a concern than on some other platforms. As a result, we've always thought somewhat more aggressive inlining might be a better tradeoff for us.

That said, I agree we certainly do not want inlining thresholds so high that they cause extreme increases in compile time and/or executable size - in particular, if that introduces notable differences between s390x and other server architectures. If that is indeed the case, we likely have to revisit this choice.

Given that, one remaining question is whether, instead of simply increasing the threshold across the board, we can find a way to make inlining decisions more intelligent, so that we win back the regressed benchmarks. @JonPsson1, that might be worth spending some time investigating.

@nikic (Contributor, Author) commented Jul 16, 2024

Is there something we can reasonably do here in time for the LLVM 19 release? For example, would reducing the multiplier to 2 rather than 1 reduce the benchmark impact? It's not the ideal end-state, but maybe a reasonable intermediate step...

@JonPsson1 (Contributor):
I have now looked into this, building SPEC with the x3 (main), x2, and x1 (no multiplier) settings. The code size changes are listed below, relative to main. Changing to x2 is indeed somewhat helpful. Looking at the benchmarks (further below), though, it is clear that this still loses performance in some cases.

Code size, relative to x3 (main)

Benchmark                  x2        x1

i500.perlbench_r        -7.4%    -14.6%
i505.mcf_r              +0.0%     +0.0%
i523.xalancbmk_r        -6.6%    -11.0%
i531.deepsjeng_r        +0.1%     -2.8%
i502.gcc_r              -5.4%    -13.6%
i520.omnetpp_r          -4.0%     -9.6%
i525.x264_r             -3.8%    -13.3%
i541.leela_r            -1.0%     -9.1%
i557.xz_r               -3.1%     -6.2%
f507.cactuBSSN_r        -1.4%     -3.3%
f508.namd_r            -21.6%    -25.0%
f510.parest_r           -3.4%    -10.3%
f511.povray_r          -12.3%    -16.0%
f519.lbm_r              +0.0%     +0.0%
f521.wrf_r              -0.1%     -0.2%
f526.blender_r          -2.5%     -6.5%
f527.cam4_r             -0.0%     -0.1%
f538.imagick_r          -1.8%     -5.0%
f544.nab_r              -6.1%    -13.4%


Runtime, relative to x3 (main)

Benchmark                  x2        x1

f538.imagick_r           +13%      +15%
i500.perlbench_r          +2%       +6%
f511.povray_r             -1%       +3%
...

Trying the reproducer:

/usr/bin/time -v ./bin/opt -O3 -mtriple=s390x-linux-gnu -mcpu=z16 ./bad-42752cbe095.bc -o /dev/null

x1
User time (seconds): 4.81
Maximum resident set size (kbytes): 132960
Average resident set size (kbytes): 0

x2
User time (seconds): 9.44
Maximum resident set size (kbytes): 175012
Average resident set size (kbytes): 0

x3
Bad: still running after 45 minutes, with memory usage having slowly climbed to 30G and continuing to grow.

So I can confirm getting the same results as reported - x2 would remedy this case.

Looking at imagick, it is clear that in some cases increased inlining can give a very good performance improvement without affecting code size as much. And namd, for example, shows that inlining can increase code size significantly without giving much of an improvement...

@nikic (Contributor, Author) commented Jul 23, 2024

Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?

@JonPsson1 (Contributor):
> Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?

Unfortunately, it seems the old discussion on Phabricator is lost, so I am not sure which benchmark(s) it relates to. This is a case where we actually do know, for a specific type of function, that inlining is highly beneficial, so that is certainly good to keep. How are you thinking this would relate to the general multiplier? I could do some more benchmarking without it as well, if you think that would be relevant.

@nikic (Contributor, Author) commented Jul 23, 2024

> Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?
>
> Unfortunately, it seems the old discussion on Phabricator is lost, so I am not sure which benchmark(s) it relates to. This is a case where we actually do know, for a specific type of function, that inlining is highly beneficial, so that is certainly good to keep. How are you thinking this would relate to the general multiplier? I could do some more benchmarking without it as well, if you think that would be relevant.

I'm mainly trying to figure out how important that imagick regression is. If the s390x backend is really going out of its way to make imagick optimize as desired (not just by adjusting the multiplier, but also doing other threshold adjustments), then this is probably too specific to a single benchmark and we should not be overly concerned about it.

(Looking back through some old discussions, it looks like imagick on s390x was badly hit by deferred inlining removal in https://reviews.llvm.org/D115497, and I think the inlining threshold adjustment was an attempt to mitigate that?)

@JonPsson1 (Contributor):
> (Looking back through some old discussions, it looks like imagick on s390x was badly hit by deferred inlining removal in https://reviews.llvm.org/D115497, and I think the inlining threshold adjustment was an attempt to mitigate that?)

Yeah - that rings a bell :)

@JonPsson1 (Contributor):
> I'm mainly trying to figure out how important that imagick regression is. If the s390x backend is really going out of its way to make imagick optimize as desired (not just by adjusting the multiplier, but also doing other threshold adjustments), then this is probably too specific to a single benchmark and we should not be overly concerned about it.

I checked back again on imagick, and it seems that removing either the multiplier or the specific threshold adjustment causes a similar regression (~10%), so both are needed for the current performance. Without the multiplier, the specific adjustment gives only a ~1% improvement.

nikic added 2 commits August 16, 2024 15:23
Using this TTI hook is only appropriate for GPU targets that need
to maximally flatten all functions.

The inlining threshold multiplier should not be increased for
ordinary targets -- while this may increase benchmark scores, it
can also result in large code size increases and runaway inlining,
resulting in build OOM/timeouts.
@nikic force-pushed the s390x-inline-threshold branch from 5c94dc5 to 2fdbb7e on August 16, 2024 13:27
@nikic (Contributor, Author) commented Aug 16, 2024

I've updated this PR to change the threshold multiplier from 3 to 2 (instead of to 1). Based on the numbers in #94612 (comment), with a multiplier of 2 only imagick takes a significant performance hit, and the s390x backend is currently over-specialized towards that benchmark. The original proposal of multiplier 1 had a negative impact on more benchmarks.
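
The updated override presumably now reads along these lines (a sketch of the revised PR state; the actual commit is not quoted in this thread):

// SystemZTargetTransformInfo.h -- multiplier lowered from 3 to 2
// rather than removed outright (sketch, not the quoted diff).
unsigned getInliningThresholdMultiplier() const { return 2; }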

@JonPsson1 (Contributor):
Thanks for the update. However, I have been working on tracking down the regressions I have seen (imagick, povray, and perl) and have made some progress. I believe I will have a patch ready soon that sets the multiplier to 1 and boosts the threshold for specific types of callees.

@JonPsson1 (Contributor):
Can probably close this now (patch in progress at #106058).

@JonPsson1 (Contributor):
Fix committed with the general multiplier removed entirely.

@nikic closed this on Nov 25, 2024