[RISCV][RegAlloc] Add getCSRFirstUseCost for RISC-V #131349
Conversation
@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

This is based off of 63efd8e. The following table shows the percent change to the dynamic instruction count when the function in this patch returns 0 (default) versus other values.
The following table shows the percent change to the runtime when the function in this patch returns 0 (default) versus other values.
I chose to set the value to 64 on RISC-V because it improves both the dynamic IC and the runtime, because AMDGPU set their number to 100, and because callee-saved spills are probably less expensive on RISC-V than on AMDGPU. I looked at some diffs and it seems like this patch leads to two things:
Full diff: https://github.com/llvm/llvm-project/pull/131349.diff 2 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 0830191dde3f4..fc5852e081a22 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -61,6 +61,13 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
const uint32_t *getCallPreservedMask(const MachineFunction &MF,
CallingConv::ID) const override;
+ unsigned getCSRFirstUseCost() const override {
+ // The cost will be compared against BlockFrequency where entry has the
+ // value of 1 << 14. A value of 64 will choose to spill or split cold
+ // path instead of using a callee-saved register.
+ return 64;
+ }
+
const MCPhysReg *getCalleeSavedRegs(const MachineFunction *MF) const override;
const MCPhysReg *getIPRACSRegs(const MachineFunction *MF) const override;
diff --git a/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
new file mode 100644
index 0000000000000..62ad116256050
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
@@ -0,0 +1,105 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=0 | FileCheck %s -check-prefix=ZERO-COST
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=64 | FileCheck %s -check-prefix=SOME-COST
+
+define fastcc void @Perl_sv_setnv(ptr %.str.54.3682) {
+; ZERO-COST-LABEL: Perl_sv_setnv:
+; ZERO-COST: # %bb.0: # %entry
+; ZERO-COST-NEXT: addi sp, sp, -32
+; ZERO-COST-NEXT: .cfi_def_cfa_offset 32
+; ZERO-COST-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT: sd s1, 8(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT: .cfi_offset ra, -8
+; ZERO-COST-NEXT: .cfi_offset s0, -16
+; ZERO-COST-NEXT: .cfi_offset s1, -24
+; ZERO-COST-NEXT: bnez zero, .LBB0_5
+; ZERO-COST-NEXT: # %bb.1: # %entry
+; ZERO-COST-NEXT: li a1, 1
+; ZERO-COST-NEXT: bnez a1, .LBB0_6
+; ZERO-COST-NEXT: .LBB0_2: # %entry
+; ZERO-COST-NEXT: mv s0, a0
+; ZERO-COST-NEXT: beqz zero, .LBB0_4
+; ZERO-COST-NEXT: # %bb.3: # %sw.bb34.i
+; ZERO-COST-NEXT: li s0, 0
+; ZERO-COST-NEXT: .LBB0_4: # %Perl_sv_reftype.exit
+; ZERO-COST-NEXT: li s1, 0
+; ZERO-COST-NEXT: li a0, 0
+; ZERO-COST-NEXT: li a1, 0
+; ZERO-COST-NEXT: jalr s1
+; ZERO-COST-NEXT: li a0, 0
+; ZERO-COST-NEXT: mv a1, s0
+; ZERO-COST-NEXT: li a2, 0
+; ZERO-COST-NEXT: jalr s1
+; ZERO-COST-NEXT: .LBB0_5: # %entry
+; ZERO-COST-NEXT: beqz zero, .LBB0_2
+; ZERO-COST-NEXT: .LBB0_6: # %sw.bb3
+; ZERO-COST-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT: ld s0, 16(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT: ld s1, 8(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT: .cfi_restore ra
+; ZERO-COST-NEXT: .cfi_restore s0
+; ZERO-COST-NEXT: .cfi_restore s1
+; ZERO-COST-NEXT: addi sp, sp, 32
+; ZERO-COST-NEXT: .cfi_def_cfa_offset 0
+; ZERO-COST-NEXT: ret
+;
+; SOME-COST-LABEL: Perl_sv_setnv:
+; SOME-COST: # %bb.0: # %entry
+; SOME-COST-NEXT: addi sp, sp, -32
+; SOME-COST-NEXT: .cfi_def_cfa_offset 32
+; SOME-COST-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT: .cfi_offset ra, -8
+; SOME-COST-NEXT: .cfi_offset s0, -16
+; SOME-COST-NEXT: bnez zero, .LBB0_5
+; SOME-COST-NEXT: # %bb.1: # %entry
+; SOME-COST-NEXT: li a1, 1
+; SOME-COST-NEXT: bnez a1, .LBB0_6
+; SOME-COST-NEXT: .LBB0_2: # %entry
+; SOME-COST-NEXT: sd a0, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT: beqz zero, .LBB0_4
+; SOME-COST-NEXT: # %bb.3: # %sw.bb34.i
+; SOME-COST-NEXT: sd zero, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT: .LBB0_4: # %Perl_sv_reftype.exit
+; SOME-COST-NEXT: li s0, 0
+; SOME-COST-NEXT: li a0, 0
+; SOME-COST-NEXT: li a1, 0
+; SOME-COST-NEXT: jalr s0
+; SOME-COST-NEXT: li a0, 0
+; SOME-COST-NEXT: ld a1, 8(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT: li a2, 0
+; SOME-COST-NEXT: jalr s0
+; SOME-COST-NEXT: .LBB0_5: # %entry
+; SOME-COST-NEXT: beqz zero, .LBB0_2
+; SOME-COST-NEXT: .LBB0_6: # %sw.bb3
+; SOME-COST-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT: ld s0, 16(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT: .cfi_restore ra
+; SOME-COST-NEXT: .cfi_restore s0
+; SOME-COST-NEXT: addi sp, sp, 32
+; SOME-COST-NEXT: .cfi_def_cfa_offset 0
+; SOME-COST-NEXT: ret
+entry:
+ switch i8 0, label %Perl_sv_reftype.exit [
+ i8 1, label %sw.bb4
+ i8 12, label %sw.bb34.i
+ i8 3, label %sw.bb3
+ i8 0, label %sw.bb3
+ ]
+
+sw.bb3: ; preds = %entry, %entry
+ ret void
+
+sw.bb4: ; preds = %entry
+ br label %Perl_sv_reftype.exit
+
+sw.bb34.i: ; preds = %entry
+ br label %Perl_sv_reftype.exit
+
+Perl_sv_reftype.exit: ; preds = %sw.bb34.i, %sw.bb4, %entry
+ %retval.0.i = phi ptr [ null, %sw.bb34.i ], [ null, %sw.bb4 ], [ %.str.54.3682, %entry ]
+ %call17 = tail call fastcc i64 null(ptr null, i32 0)
+ tail call void (ptr, ...) null(ptr null, ptr %retval.0.i, ptr null)
+ unreachable
+}
This is based off of 63efd8e.

The following table shows the percent change to the dynamic instruction count when the function in this patch returns 0 (default) versus other values.

| benchmark | % speedup 1 over 0 | % speedup 4 over 0 | % speedup 16 over 0 | % speedup 64 over 0 | % speedup 128 over 0 |
| --------------- | --------------------- | -------------------- | -------------------- | -------------------- | -------------------- |
| 500.perlbench_r | 0.001018570165 | 0.001049508358 | 0.001001106529 | 0.03382582818 | 0.03395354577 |
| 502.gcc_r | 0.02850551412 | 0.02170512371 | 0.01453021263 | 0.06011008637 | 0.1215691521 |
| 505.mcf_r | -0.00009506373338 | -0.00009090057642 | -0.0000860991497 | -0.00005027849766 | 0.00001251173791 |
| 520.omnetpp_r | 0.2958940288 | 0.2959715925 | 0.2961141505 | 0.2959823497 | 0.2963124341 |
| 523.xalancbmk_r | -0.0327074721 | -0.01037021046 | -0.3226810542 | 0.02127133714 | 0.02765388389 |
| 525.x264_r | 0.0000001381714403 | -0.00000007041540345 | -0.00000002156399465 | 0.0000002108993364 | 0.0000002463382874 |
| 531.deepsjeng_r | 0.00000000339777238 | 0.000000003874652714 | 0.000000003636212547 | 0.000000003874652714 | 0.000000003159332213 |
| 541.leela_r | 0.0009186059953 | -0.000424159199 | 0.0004984456879 | 0.274948447 | 0.8135521414 |
| 557.xz_r | -0.000000003547118854 | -0.00004896449559 | -0.00004910691576 | -0.0000491109983 | -0.00004895599589 |
| geomean | 0.03265937388 | 0.03424232324 | -0.00107917442 | 0.07629116165 | 0.1439913192 |

The following table shows the percent change to the runtime when the function in this patch returns 0 (default) versus other values.

| benchmark | % speedup 1 over 0 | % speedup 4 over 0 | % speedup 16 over 0 | % speedup 64 over 0 | % speedup 128 over 0 |
| --------------- | ------------------ | ------------------ | ------------------- | ------------------- | ------------------- |
| 500.perlbench_r | 0.1722356761 | 0.2269681109 | 0.2596825578 | 0.361573851 | 1.15041305 |
| 502.gcc_r | -0.548415855 | -0.06187002799 | -0.5553684674 | -0.8876686237 | -0.4668665535 |
| 505.mcf_r | -0.8786414258 | -0.4150938441 | -1.035517726 | -0.1860770377 | -0.01904825648 |
| 520.omnetpp_r | 0.4130256072 | 0.6595976188 | 0.897332171 | 0.6252625622 | 0.3869467278 |
| 523.xalancbmk_r | 1.318132014 | -0.003927574 | 1.025962975 | 1.090320253 | -0.789206202 |
| 525.x264_r | -0.03112871796 | -0.00167557587 | 0.06932423155 | -0.1919840015 | -0.1203585732 |
| 531.deepsjeng_r | -0.259516072 | -0.01973455652 | -0.2723227894 | -0.005417022257 | -0.02222388177 |
| 541.leela_r | -0.3497178495 | -0.3510447393 | 0.1274508001 | 0.6485542452 | 0.2880651727 |
| 557.xz_r | 0.7683565263 | -0.2197509447 | -0.0431183874 | 0.07518130872 | 0.5236853039 |
| geomean | 0.06506952742 | -0.0211865386 | 0.05072694648 | 0.1684530637 | 0.1020533557 |

I chose to set the value to 64 on RISC-V because it improves both the dynamic IC and the runtime, because AMDGPU set their number to 100, and because callee-saved spills are probably less expensive on RISC-V than on AMDGPU. I looked at some diffs and it seems like this patch leads to two things:

1. Less spilling -- not spilling the CSR led to better register allocation and helped us avoid spills down the line.
2. Avoiding the CSR spill but spilling more on paths that static heuristics estimate as cold.
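For intuition, here is a small standalone sketch of the trade-off described above. It is a simplification with a hypothetical helper (`preferSpillOverFirstCSR` is not LLVM's actual API); it only models the idea that the raw cost is expressed relative to an entry-block frequency of 1 << 14 and is weighed against the frequency of the code that would spill or split instead of touching the first callee-saved register.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper (not LLVM's actual API): the raw CSR first-use cost is
// defined relative to an entry-block frequency of 1 << 14, so it is scaled to
// the function's real entry frequency before being compared with the
// frequency-weighted cost of spilling/splitting around the live range.
bool preferSpillOverFirstCSR(uint64_t SpillOrSplitCost,
                             uint64_t RawCSRFirstUseCost,
                             uint64_t EntryFreq) {
  uint64_t ScaledCSRCost = RawCSRFirstUseCost * EntryFreq / (1u << 14);
  // If the spill/split code sits on a cold enough path, it is cheaper than
  // paying the prologue/epilogue save/restore of the first callee-saved reg.
  return SpillOrSplitCost < ScaledCSRCost;
}

int main() {
  const uint64_t Entry = 1u << 14;   // nominal entry frequency (2^14)
  const uint64_t Cold = Entry / 512; // a rarely executed path
  std::cout << preferSpillOverFirstCSR(Cold, 64, Entry) << '\n';  // 1: spill on the cold path
  std::cout << preferSpillOverFirstCSR(Entry, 64, Entry) << '\n'; // 0: use the CSR
}
```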
This seems generally reasonable and well-motivated. A couple of questions:
; DEFAULT-COST-NEXT: bnez a1, .LBB0_6
; DEFAULT-COST-NEXT: .LBB0_2: # %entry
; DEFAULT-COST-NEXT: sd a0, 8(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: beqz zero, .LBB0_4
This is a nonsensical instruction. It's comparing x0 to x0. Any idea what caused it?
Probably llvm-reduce caused it.
One other question. @lukel97, it might be good to queue this one on the BP3. Your setup does a decent job of pruning statistical false positives, and, as above, it looks like that might be relevant here.
; DEFAULT-COST-NEXT: addi sp, sp, -32
; DEFAULT-COST-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: bnez zero, .LBB0_5
This is also comparing zero to zero.
Probably llvm-reduce. May still be worth peephole optimizing these?
I suspect they are created by tail duplication and MachineCopyProp very late in the pipeline.
I found some cases of us generating this in real code. I will take a look.
We're generating it in RISC-V DAG->DAG Pattern Instruction Selection (riscv-isel). I'll work on a separate patch.
It had vector enabled, but I do not see any vector-related diff while skimming the benchmarks.
This was measured on our higher-end out-of-order core (p470).
These numbers are mostly within our noise band. It would be great if @lukel97 could have a look! I did notice some cases where some CSR spills turned into non-CSR moves (which is a win), which is why I wanted to look at runtime numbers.
; DEFAULT-COST-NEXT: addi sp, sp, 32
; DEFAULT-COST-NEXT: ret
entry:
  switch i8 0, label %Perl_sv_reftype.exit [
Please make this condition a function argument. That will fix the nonsensical branches. I don't think it's worth optimizing in SelectionDAG, but I also don't want this test to break if it gets optimized in the future.
done
I've queued up a run for this on the BPI-F3, will have results sometime tomorrow hopefully!
I'm happy to report that there are some fairly confident gains with this PR, 3.27% on 511.povray_r and 2.76% on 523.xalancbmk_r, both with very little noise: https://lnt.lukelau.me/db_default/v4/nts/328 On povray a good chunk of the speedup comes from the frame setup in
The result is promising! One concern is that I don't know whether we have set the cost too high. What is the logic chain after setting this cost value? It is definitely not the case that higher is better, so what is the rule for balancing it?
Also, I don't see the big improvement on 523.xalancbmk_r on the SiFive core, which makes me hesitant about the value 64.
From Michael's results there's a 0.2% dyn instcount improvement and a 0.4% improvement on the BPI-F3, which seems to check out? Inspecting the codegen diff on -O3 -march=rva22u64_v, there are lots of functions with diffs like, e.g.:

--- build.rva22u64_v-O3.a//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s  2025-03-19 17:29:10
+++ build.rva22u64_v-O3.b//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s  2025-03-19 17:27:46
@@ -79,27 +79,25 @@
 	.cfi_personality 155, DW.ref.__gxx_personality_v0
 	.cfi_lsda 27, .Lexception0
 # %bb.0: # %entry
-	addi sp, sp, -64
-	.cfi_def_cfa_offset 64
-	sd ra, 56(sp) # 8-byte Folded Spill
-	sd s0, 48(sp) # 8-byte Folded Spill
-	sd s1, 40(sp) # 8-byte Folded Spill
-	sd s2, 32(sp) # 8-byte Folded Spill
-	sd s3, 24(sp) # 8-byte Folded Spill
-	sd s4, 16(sp) # 8-byte Folded Spill
+	addi sp, sp, -48
+	.cfi_def_cfa_offset 48
+	sd ra, 40(sp) # 8-byte Folded Spill
+	sd s0, 32(sp) # 8-byte Folded Spill
+	sd s1, 24(sp) # 8-byte Folded Spill
+	sd s2, 16(sp) # 8-byte Folded Spill
+	sd s3, 8(sp) # 8-byte Folded Spill
 	.cfi_offset ra, -8
 	.cfi_offset s0, -16
 	.cfi_offset s1, -24
 	.cfi_offset s2, -32
 	.cfi_offset s3, -40
-	.cfi_offset s4, -48
 	.cfi_remember_state
It all makes sense to me, I just wonder if the value 64 is too high, since the value on AArch64 is 5. I don't see any significant architectural difference between RISC-V and AArch64 that would result in different costs, since they are both RISC architectures. Maybe @lukel97 you can issue several more runs with different costs?
Oh I see what you mean, I'll kick off a run with 5 and see how that compares
I wonder if this is making up for the lack of per-register shrink-wrapping in some cases.
I want to be careful that we don't make perfection the enemy of the good here. We have measured results, both icount and runtime, that show the current number produces reasonable results. I encourage a data-driven sweep here, but I propose we should not spend too much time on this before picking something reasonable (64 seems reasonable) and landing it. We can continue to sweep the space for a better overall result, but let's take the win we have rather than trying to find the global maximum. We can always revise the number in the future.
OK, I didn't mean to be picky here, but please just wait for another run with value 5. We are not in a release merge window, so we are not in a hurry, right? :-)
Here are the results with 5 as the cost; there's still a 3% gain on 511.povray_r: https://lnt.lukelau.me/db_default/v4/nts/335?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=333&submit=Update There's not as large of a gain on 523.xalancbmk_r anymore, but after looking at the nightly runs it seems to be low noise but highly variable, probably due to code layout: https://lnt.lukelau.me/db_default/v4/nts/graph?highlight_run=331&plot.12=25.12.3. So I think we should probably ignore it. Both 5 and 64 have the same effect on the BPI-F3.
Updated to make this patch use 5
LGTM
          std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
      CSRFirstTimeCost.getNumOccurrences()
          ? CSRFirstTimeCost
          : std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
If the command line option isn't explicitly set, CSRFirstTimeCost will be the default value of 0, so this max will always return TRI->getCSRFirstUseCost().
If someone goes in and changes the default value, wouldn't we want the comparison to occur without having to change the code here?
I guess, but I'm not sure why the default of the CSRFirstTimeCost would be changed instead of the base class implementation of TargetRegisterInfo::getCSRFirstUseCost(). We can take the code how you have it.
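To illustrate the distinction being discussed, here is a minimal standalone sketch; the FakeOpt stand-in is made up for the example and is not the real cl::opt class or the quoted allocator code. It models why checking getNumOccurrences() matters: an explicitly passed -regalloc-csr-first-time-cost=0 wins, whereas with a plain std::max and a default of 0 the target hook always wins.

```cpp
#include <algorithm>
#include <iostream>

// Stand-in for the cl::opt; only the two properties relevant here are modeled.
struct FakeOpt {
  unsigned Value = 0;       // default when -regalloc-csr-first-time-cost is absent
  unsigned Occurrences = 0; // number of times the flag appeared on the command line
  operator unsigned() const { return Value; }
  unsigned getNumOccurrences() const { return Occurrences; }
};

unsigned pickCSRFirstUseCost(const FakeOpt &CSRFirstTimeCost, unsigned TargetCost) {
  // If the user passed the flag explicitly, honor it verbatim (even 0);
  // otherwise fall back to the target hook. With the default of 0, the
  // std::max below always returns TargetCost, which is the observation
  // made in the review comment above.
  return CSRFirstTimeCost.getNumOccurrences()
             ? (unsigned)CSRFirstTimeCost
             : std::max((unsigned)CSRFirstTimeCost, TargetCost);
}

int main() {
  FakeOpt NotSet;           // flag not passed -> the target value (5) wins
  FakeOpt ForcedZero{0, 1}; // user passed =0  -> 0 wins
  std::cout << pickCSRFirstUseCost(NotSet, 5) << '\n';     // 5
  std::cout << pickCSRFirstUseCost(ForcedZero, 5) << '\n'; // 0
}
```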
LGTM
LGTM too
LGTM. Thanks!
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/18064
Here is the relevant piece of the build log for reference.
Adding a comment here for future reference: we should re-evaluate setting this option once the improvements to ShrinkWrapping land.
I see a gain on 511.povray_r on the SiFive P550 board too.
This is based off of 63efd8e.
The following table shows the percent change to the dynamic instruction count when the function in this patch returns 0 (default) versus other values.
The following table shows the percent change to the runtime when the function in this patch returns 0 (default) versus other values.
I chose to set the value to 5 on RISC-V because it improves both the dynamic IC and the runtime, because it showed good results empirically, and because it had a similar effect to setting it to higher numbers.
I looked at some diffs and it seems like this patch leads to two things:
1. Less spilling -- not spilling the CSR led to better register allocation and helped us avoid spills down the line.
2. Avoiding the CSR spill but spilling more on paths that static heuristics estimate as cold.