[SystemZ] Remove getInliningThresholdMultiplier() override #94612
Conversation
@llvm/pr-subscribers-backend-systemz

Author: Nikita Popov (nikic)

Changes: Using this TTI hook is only appropriate for GPU targets that need to maximally flatten all functions. The inlining threshold multiplier should not be increased for ordinary targets -- while this may increase benchmark scores, it can also result in large code size increases and runaway inlining, resulting in build OOM/timeouts.

Full diff: https://github.com/llvm/llvm-project/pull/94612.diff (1 file affected)
diff --git a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
index 3cf4a69ac2818..18effe6950e6c 100644
--- a/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h
@@ -38,7 +38,6 @@ class SystemZTTIImpl : public BasicTTIImplBase<SystemZTTIImpl> {
/// \name Scalar TTI Implementations
/// @{
- unsigned getInliningThresholdMultiplier() const { return 3; }
unsigned adjustInliningThreshold(const CallBase *CB) const;
InstructionCost getIntImmCost(const APInt &Imm, Type *Ty,
What are the alternatives to using this? I will check how this would impact benchmarks.

I don't think we can just disable this without a replacement: I see regressions on imagick (9%) and perl (5%) and probably others as well...
The default inlining thresholds make a tradeoff between performance, code size and compile time. Yes, you can increase performance numbers by increasing inlining thresholds, but you lose other things along the way. For example, libLLVM.so on x86 is 122MB large. Other architectures (like aarch64 and ppc64le) are within 5% of that. s390x instead clocks in at 185MB, which is 50% larger.

The reason why I ended up submitting this now is that I hit another s390x-only runaway inlining issue (where the build either OOMs or takes hours). It's certainly not the first time I have run into this issue. I don't think that it's okay for the s390x target, and only the s390x target, to change otherwise project-global inlining tradeoffs in such a significant way. Basically, I think you need to eat the performance regressions and return to the project-global baseline.

Doing something like this would be appropriate if there were something special about the s390x architecture that justified it having a much, much higher inlining threshold than other targets. This is the case for GPU architectures, for which this hook exists. I could reasonably see thresholds being slightly different on s390x, e.g. because it has a higher function call overhead or something like that, but I find it hard to imagine what would justify a 3x threshold increase. (In reality it's actually even worse than 3x, because adjustInliningThreshold hands out additional big inlining bonuses.)
Ok, we may have to reconsider this. Let's wait and see what @uweigand says (currently on vacation).

For a concrete recent example, see https://bugzilla.redhat.com/show_bug.cgi?id=2283132
This is interesting. To answer the question if there's anything fundamental about s390x that would require higher inlining thresholds - not really. However, our ABI historically has had a somewhat higher overhead than some others, which made inlining in some benchmarks more helpful than elsewhere. Also, in our typical usage, executable size tends to be less of a concern than on some other platforms. As a result, we've always thought somewhat more aggressive inlining might be a better tradeoff for us.

That said, I agree we certainly do not want inlining thresholds so high that they cause extreme increases in compile time and/or executable size - in particular, if that introduces notable differences between s390x and other server architectures. If that is indeed the case, we likely have to revisit this choice.

Given that, one remaining question would be whether, instead of simply increasing the threshold across the board, we can find a way to make inlining decisions more intelligent so that we can win back those benchmark regressions. @JonPsson1 that might be something worth spending some time investigating.
Is there something we can reasonably do here in time for the LLVM 19 release? For example, would reducing the multiplier to 2 rather than 1 reduce the benchmark impact? It's not the ideal end-state, but maybe a reasonable intermediate step...
I have now looked into this, building SPEC with the x3 (main), x2 and x1 (no) multipliers. The code size changes are listed below, relative to main. It seems that changing to x2 is indeed somewhat helpful. Looking at the benchmarks though (below), it is clear that this still loses performance in some cases.

Trying the reproducer with the x1, x2 and x3 multipliers: /usr/bin/time -v ./bin/opt -O3 -mtriple=s390x-linux-gnu -mcpu=z16 ./bad-42752cbe095.bc -o /dev/null

So I can confirm getting the same results as reported - x2 would remedy this case. Looking at imagick, it is clear that in some cases increased inlining can give a very good performance improvement while not affecting code size as much. And e.g. namd shows that inlining can increase code size significantly without giving much of an improvement...
Do I remember correctly that the change in 27e8c50 was also targeted specifically at increasing the imagick SPEC score?
Unfortunately, it seems the old discussion on Phabricator is lost, so I am not sure which benchmark(s) it relates to. This is a case where we actually do know, for a specific type of function, that inlining is highly beneficial, so that is certainly good to keep. How are you thinking that this would relate to the general multiplier? I could do some more benchmarking without this as well if you think it would be relevant.
I'm mainly trying to figure out how important that imagick regression is. If the s390x backend is really going out of its way to make imagick optimize as desired (not just by adjusting the multiplier, but also doing other threshold adjustments), then this is probably too specific to a single benchmark and we should not be overly concerned about it. (Looking back through some old discussions, it looks like imagick on s390x was badly hit by the deferred inlining removal in https://reviews.llvm.org/D115497, and I think the inlining threshold adjustment was an attempt to mitigate that?)

Yeah - that rings a bell :)
I checked back again on imagick, and it seems that removing either the multiplier or the specific threshold adjustment causes a similar regression (~10%), so both are needed for the current performance. Without the multiplier, the specific adjustment is only a ~1% improvement.
Force-pushed from 5c94dc5 to 2fdbb7e.
I've updated this PR to change the threshold multiplier from 3 to 2 (instead of 1). Based on the numbers in #94612 (comment), at a multiplier of 2 only imagick has a significant negative performance impact, and the s390x backend is currently over-specialized towards that benchmark. The original proposal to use multiplier 1 had a negative impact on more benchmarks.
Thanks for the update. However, I have been working on tracking down the regressions I have seen (imagick, povray and perl), and have made some progress. I believe I will have a patch ready soon, which will set the multiplier to 1 and boost the threshold for specific types of callees.
Can probably close this now (patch in progress at #106058) |
Fix committed with the general multiplier removed entirely. |