[RISCV][RegAlloc] Add getCSRFirstUseCost for RISC-V #131349


Merged: 4 commits into llvm:main from init-csr-cost-regalloc, Mar 20, 2025

Conversation

Contributor

@michaelmaitland michaelmaitland commented Mar 14, 2025

This is based off of 63efd8e.

The following table shows the percent change to the dynamic instruction count when the function in this patch returns 0 (default) versus other values.

| benchmark       | % speedup 1 over 0     | % speedup 4 over 0    | % speedup 16 over 0   | % speedup 64 over 0  | % speedup 128 over 0 |
| --------------- | ---------------------- | --------------------- | --------------------- | -------------------- | -------------------- |
| 500.perlbench_r | 0.001018570165         | 0.001049508358        | 0.001001106529        | 0.03382582818        | 0.03395354577        |
| 502.gcc_r       | 0.02850551412          | 0.02170512371         | 0.01453021263         | 0.06011008637        | 0.1215691521         |
| 505.mcf_r       | -0.00009506373338      | -0.00009090057642     | -0.0000860991497      | -0.00005027849766    | 0.00001251173791     |
| 520.omnetpp_r   | 0.2958940288           | 0.2959715925          | 0.2961141505          | 0.2959823497         | 0.2963124341         |
| 523.xalancbmk_r | -0.0327074721          | -0.01037021046        | -0.3226810542         | 0.02127133714        | 0.02765388389        |
| 525.x264_r      | 0.0000001381714403     | -0.00000007041540345  | -0.00000002156399465  | 0.0000002108993364   | 0.0000002463382874   |
| 531.deepsjeng_r | 0.00000000339777238    | 0.000000003874652714  | 0.000000003636212547  | 0.000000003874652714 | 0.000000003159332213 |
| 541.leela_r     | 0.0009186059953        | -0.000424159199       | 0.0004984456879       | 0.274948447          | 0.8135521414         |
| 557.xz_r        | -0.000000003547118854  | -0.00004896449559     | -0.00004910691576     | -0.0000491109983     | -0.00004895599589    |
| geomean         | 0.03265937388          | 0.03424232324         | -0.00107917442        | 0.07629116165        | 0.1439913192         |

The following table shows the percent change to the runtime when the function in this patch returns 0 (default) versus other values.

| benchmark       | % speedup 1 over 0 | % speedup 4 over 0 | % speedup 16 over 0 | % speedup 64 over 0 | % speedup 128 over 0 |
| --------------- | ------------------ | ------------------ | ------------------- | ------------------- | -------------------- |
| 500.perlbench_r | 0.1722356761       | 0.2269681109       | 0.2596825578        | 0.361573851         | 1.15041305           |
| 502.gcc_r       | -0.548415855       | -0.06187002799     | -0.5553684674       | -0.8876686237       | -0.4668665535        |
| 505.mcf_r       | -0.8786414258      | -0.4150938441      | -1.035517726        | -0.1860770377       | -0.01904825648       |
| 520.omnetpp_r   | 0.4130256072       | 0.6595976188       | 0.897332171         | 0.6252625622        | 0.3869467278         |
| 523.xalancbmk_r | 1.318132014        | -0.003927574       | 1.025962975         | 1.090320253         | -0.789206202         |
| 525.x264_r      | -0.03112871796     | -0.00167557587     | 0.06932423155       | -0.1919840015       | -0.1203585732        |
| 531.deepsjeng_r | -0.259516072       | -0.01973455652     | -0.2723227894       | -0.005417022257     | -0.02222388177       |
| 541.leela_r     | -0.3497178495      | -0.3510447393      | 0.1274508001        | 0.6485542452        | 0.2880651727         |
| 557.xz_r        | 0.7683565263       | -0.2197509447      | -0.0431183874       | 0.07518130872       | 0.5236853039         |
| geomean         | 0.06506952742      | -0.0211865386      | 0.05072694648       | 0.1684530637        | 0.1020533557         |

I chose to set the value to 5 on RISC-V because it improves both the dynamic IC and the runtime, because it showed good results empirically, and because setting it to higher numbers had a similar effect.

I looked at some of the diffs, and it seems like this patch leads to two things:

  1. Less spilling -- not spilling the CSR led to better register allocation and helped us avoid spills down the line.
  2. Avoiding the CSR spill but spilling more on paths that static heuristics estimate as cold.

@llvmbot
Member

llvmbot commented Mar 14, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

This is based off of 63efd8e.

I chose to set the value to 64 on RISC-V because it improves both the dynamic IC and the runtime, because AMDGPU set their number to 100, and because callee-saved spills are probably less expensive on RISC-V than on AMDGPU.

Full diff: https://github.com/llvm/llvm-project/pull/131349.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.h (+7)
  • (added) llvm/test/CodeGen/RISCV/csr-first-use-cost.ll (+105)
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 0830191dde3f4..fc5852e081a22 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -61,6 +61,13 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
   const uint32_t *getCallPreservedMask(const MachineFunction &MF,
                                        CallingConv::ID) const override;
 
+  unsigned getCSRFirstUseCost() const override {
+    // The cost will be compared against BlockFrequency where entry has the
+    // value of 1 << 14. A value of 64 will choose to spill or split cold
+    // path instead of using a callee-saved register.
+    return 64;
+  }
+
   const MCPhysReg *getCalleeSavedRegs(const MachineFunction *MF) const override;
 
   const MCPhysReg *getIPRACSRegs(const MachineFunction *MF) const override;
diff --git a/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
new file mode 100644
index 0000000000000..62ad116256050
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
@@ -0,0 +1,105 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=0 | FileCheck %s -check-prefix=ZERO-COST
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=64 | FileCheck %s -check-prefix=SOME-COST
+
+define fastcc void @Perl_sv_setnv(ptr %.str.54.3682) {
+; ZERO-COST-LABEL: Perl_sv_setnv:
+; ZERO-COST:       # %bb.0: # %entry
+; ZERO-COST-NEXT:    addi sp, sp, -32
+; ZERO-COST-NEXT:    .cfi_def_cfa_offset 32
+; ZERO-COST-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    sd s1, 8(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    .cfi_offset ra, -8
+; ZERO-COST-NEXT:    .cfi_offset s0, -16
+; ZERO-COST-NEXT:    .cfi_offset s1, -24
+; ZERO-COST-NEXT:    bnez zero, .LBB0_5
+; ZERO-COST-NEXT:  # %bb.1: # %entry
+; ZERO-COST-NEXT:    li a1, 1
+; ZERO-COST-NEXT:    bnez a1, .LBB0_6
+; ZERO-COST-NEXT:  .LBB0_2: # %entry
+; ZERO-COST-NEXT:    mv s0, a0
+; ZERO-COST-NEXT:    beqz zero, .LBB0_4
+; ZERO-COST-NEXT:  # %bb.3: # %sw.bb34.i
+; ZERO-COST-NEXT:    li s0, 0
+; ZERO-COST-NEXT:  .LBB0_4: # %Perl_sv_reftype.exit
+; ZERO-COST-NEXT:    li s1, 0
+; ZERO-COST-NEXT:    li a0, 0
+; ZERO-COST-NEXT:    li a1, 0
+; ZERO-COST-NEXT:    jalr s1
+; ZERO-COST-NEXT:    li a0, 0
+; ZERO-COST-NEXT:    mv a1, s0
+; ZERO-COST-NEXT:    li a2, 0
+; ZERO-COST-NEXT:    jalr s1
+; ZERO-COST-NEXT:  .LBB0_5: # %entry
+; ZERO-COST-NEXT:    beqz zero, .LBB0_2
+; ZERO-COST-NEXT:  .LBB0_6: # %sw.bb3
+; ZERO-COST-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    ld s1, 8(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    .cfi_restore ra
+; ZERO-COST-NEXT:    .cfi_restore s0
+; ZERO-COST-NEXT:    .cfi_restore s1
+; ZERO-COST-NEXT:    addi sp, sp, 32
+; ZERO-COST-NEXT:    .cfi_def_cfa_offset 0
+; ZERO-COST-NEXT:    ret
+;
+; SOME-COST-LABEL: Perl_sv_setnv:
+; SOME-COST:       # %bb.0: # %entry
+; SOME-COST-NEXT:    addi sp, sp, -32
+; SOME-COST-NEXT:    .cfi_def_cfa_offset 32
+; SOME-COST-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    .cfi_offset ra, -8
+; SOME-COST-NEXT:    .cfi_offset s0, -16
+; SOME-COST-NEXT:    bnez zero, .LBB0_5
+; SOME-COST-NEXT:  # %bb.1: # %entry
+; SOME-COST-NEXT:    li a1, 1
+; SOME-COST-NEXT:    bnez a1, .LBB0_6
+; SOME-COST-NEXT:  .LBB0_2: # %entry
+; SOME-COST-NEXT:    sd a0, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    beqz zero, .LBB0_4
+; SOME-COST-NEXT:  # %bb.3: # %sw.bb34.i
+; SOME-COST-NEXT:    sd zero, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:  .LBB0_4: # %Perl_sv_reftype.exit
+; SOME-COST-NEXT:    li s0, 0
+; SOME-COST-NEXT:    li a0, 0
+; SOME-COST-NEXT:    li a1, 0
+; SOME-COST-NEXT:    jalr s0
+; SOME-COST-NEXT:    li a0, 0
+; SOME-COST-NEXT:    ld a1, 8(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    li a2, 0
+; SOME-COST-NEXT:    jalr s0
+; SOME-COST-NEXT:  .LBB0_5: # %entry
+; SOME-COST-NEXT:    beqz zero, .LBB0_2
+; SOME-COST-NEXT:  .LBB0_6: # %sw.bb3
+; SOME-COST-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    .cfi_restore ra
+; SOME-COST-NEXT:    .cfi_restore s0
+; SOME-COST-NEXT:    addi sp, sp, 32
+; SOME-COST-NEXT:    .cfi_def_cfa_offset 0
+; SOME-COST-NEXT:    ret
+entry:
+  switch i8 0, label %Perl_sv_reftype.exit [
+    i8 1, label %sw.bb4
+    i8 12, label %sw.bb34.i
+    i8 3, label %sw.bb3
+    i8 0, label %sw.bb3
+  ]
+
+sw.bb3:                                           ; preds = %entry, %entry
+  ret void
+
+sw.bb4:                                           ; preds = %entry
+  br label %Perl_sv_reftype.exit
+
+sw.bb34.i:                                        ; preds = %entry
+  br label %Perl_sv_reftype.exit
+
+Perl_sv_reftype.exit:                             ; preds = %sw.bb34.i, %sw.bb4, %entry
+  %retval.0.i = phi ptr [ null, %sw.bb34.i ], [ null, %sw.bb4 ], [ %.str.54.3682, %entry ]
+  %call17 = tail call fastcc i64 null(ptr null, i32 0)
+  tail call void (ptr, ...) null(ptr null, ptr %retval.0.i, ptr null)
+  unreachable
+}
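
For context on what the new hook feeds into: the greedy register allocator turns the value returned by getCSRFirstUseCost() into a BlockFrequency-scale cost and compares it against the frequency of the block where a callee-saved register would first be touched. The standalone C++ sketch below paraphrases that computation; it is a simplification of RegAllocGreedy's CSR-cost setup, not the verbatim LLVM source, and the helper name computeCSRCost is made up for illustration.

#include <algorithm>
#include <cstdint>

// Simplified paraphrase: the raw cost is expressed relative to a fixed entry
// frequency of 1 << 14 (as the comment in RISCVRegisterInfo.h says), so it is
// rescaled to the function's actual entry frequency before being compared
// against block frequencies.
uint64_t computeCSRCost(unsigned OptionCost,  // -regalloc-csr-first-time-cost
                        unsigned TargetCost,  // TRI->getCSRFirstUseCost()
                        uint64_t ActualEntryFreq) {
  uint64_t Cost = std::max<uint64_t>(OptionCost, TargetCost);
  if (Cost == 0 || ActualEntryFreq == 0)
    return 0; // zero disables the CSR first-use penalty entirely
  const uint64_t FixedEntryFreq = 1ULL << 14; // 16384
  return Cost * ActualEntryFreq / FixedEntryFreq;
}

Under this reading, a block whose frequency falls below the scaled cost is treated as cold enough that spilling or splitting there beats paying for the first use of a callee-saved register, which is the behavior the SOME-COST check lines above demonstrate.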

@preames
Collaborator

preames commented Mar 14, 2025

This seems generally reasonable, and well motivated. A couple of questions:

  1. What architecture did you measure the dynamic IC on? I'm mostly interested in whether it had vector or not. I could see the presence of vector skewing results through interactions with our other heuristics (e.g. vector remat causing more scalar register pressure). Whichever you measured, might be a good idea to sanity check in the other. You don't need to repeat the whole matrix of tests, but maybe confirm the chosen point at least doesn't regress the other?
  2. Which CPU did you measure the performance (runtime) on? Is it in order? Or out of order? Does it have any kind of stack engine or ST/LD forwarding? I'm mostly just trying to understand how general this result might be. If this is e.g. a low end in-order, the perf impact might not carry over to other CPUs.

; DEFAULT-COST-NEXT: bnez a1, .LBB0_6
; DEFAULT-COST-NEXT: .LBB0_2: # %entry
; DEFAULT-COST-NEXT: sd a0, 8(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: beqz zero, .LBB0_4
Collaborator

@topperc topperc Mar 14, 2025

This is a nonsensical instruction. It's comparing x0 to x0. Any idea what caused it?

Contributor Author

Probably llvm-reduce caused it.

@preames
Collaborator

preames commented Mar 14, 2025

One other question.
3) What are the noise bands on your runtime measurements? Glancing at e.g. x264 which has basically no dynamic IC change, the runtime measurements look a lot like random noise. That's somewhat unavoidable when the impact is this small, but I want to have some clue which numbers I should be treating as having likely statistical significance.

@lukel97 Might be good to queue this one on the BP3. Your setup does a decent job of pruning statistical false positives, and as above, it looks like that might be relevant here.

; DEFAULT-COST-NEXT: addi sp, sp, -32
; DEFAULT-COST-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: bnez zero, .LBB0_5
Collaborator

This is also comparing zero to zero.

Contributor Author

Probably llvm-reduce. May still be worth peephole optimizing these?

Collaborator

@topperc topperc Mar 14, 2025

I suspect they are created by tail duplication and MachineCopyProp very late in the pipeline.

Contributor Author

I found some cases of us generating this in real code. I will take a look.

Contributor Author

We're generating it in RISC-V DAG->DAG Pattern Instruction Selection (riscv-isel). I'll work on a separate patch.

@michaelmaitland
Contributor Author

What architecture did you measure the dynamic IC on? I'm mostly interested in whether it had vector or not. I could see the presence of vector skewing results through interactions with our other heuristics (e.g. vector remat causing more scalar register pressure). Whichever you measured, might be a good idea to sanity check in the other. You don't need to repeat the whole matrix of tests, but maybe confirm the chosen point at least doesn't regress the other?

It had vector, but I do not see any diff related to vector while skimming benchmarks.

Which CPU did you measure the performance (runtime) on? Is it in order? Or out of order? Does it have any kind of stack engine or ST/LD forwarding? I'm mostly just trying to understand how general this result might be. If this is e.g. a low end in-order, the perf impact might not carry over to other CPUs.

This was measured on our higher end out of order core (p470).

What are the noise bands on your runtime measurements? Glancing at e.g. x264 which has basically no dynamic IC change, the runtime measurements look a lot like random noise. That's somewhat unavoidable when the impact is this small, but I want to have some clue which numbers I should be treating as having likely statistical significance.

@lukel97 Might be good to queue this one on the BP3. Your setup does a decent job of pruning statistical false positives, and as above, it looks like that might be relevant here.

These numbers are mostly within our noise band. It would be great if @lukel97 could have a look! I did notice some cases where some CSR spills turned into non-CSR moves (which is a win), which is why I wanted to look at runtime numbers.

; DEFAULT-COST-NEXT: addi sp, sp, 32
; DEFAULT-COST-NEXT: ret
entry:
switch i8 0, label %Perl_sv_reftype.exit [
Collaborator

Please make this condition a function argument. That will fix the nonsensical branches. I don't think it's worth optimizing in SelectionDAG, but I also don't want this test to break if it gets optimized in the future.
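
As a sketch of what this request amounts to (hypothetical shape; the committed test may spell the details differently), the switch condition becomes a parameter so instruction selection can no longer fold the compare into a constant branch:

; Hypothetical reshape of the reduced test: taking the condition as an
; argument keeps the branches live instead of folding to "bnez zero".
define fastcc void @Perl_sv_setnv(ptr %.str.54.3682, i8 %cond) {
entry:
  switch i8 %cond, label %Perl_sv_reftype.exit [
    i8 1, label %sw.bb4
    i8 12, label %sw.bb34.i
    i8 3, label %sw.bb3
    i8 0, label %sw.bb3
  ]

sw.bb3:
  ret void

sw.bb4:
  br label %Perl_sv_reftype.exit

sw.bb34.i:
  br label %Perl_sv_reftype.exit

Perl_sv_reftype.exit:
  %retval.0.i = phi ptr [ null, %sw.bb34.i ], [ null, %sw.bb4 ], [ %.str.54.3682, %entry ]
  %call17 = tail call fastcc i64 null(ptr null, i32 0)
  tail call void (ptr, ...) null(ptr null, ptr %retval.0.i, ptr null)
  unreachable
}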

Contributor Author

done

@lukel97
Contributor

lukel97 commented Mar 18, 2025

I've queued up a run for this on the BPI-F3, will have results sometime tomorrow hopefully!

@michaelmaitland michaelmaitland force-pushed the init-csr-cost-regalloc branch from f007674 to ad0a1ec on March 18, 2025 16:21
@michaelmaitland michaelmaitland requested review from topperc and asb March 18, 2025 16:21
@lukel97
Contributor

lukel97 commented Mar 19, 2025

I've queued up a run for this on the BPI-F3, will have results sometime tomorrow hopefully!

I'm happy to report that there are some fairly confident gains with this PR: 3.27% on 511.povray_r and 2.76% on 523.xalancbmk_r, both with very little noise: https://lnt.lukelau.me/db_default/v4/nts/328

On povray a good chunk of the speedup comes from the frame setup in pov::All_Sphere_Intersections being cheaper now that it no longer uses fs3-fs8. If you view it in the profile here and select View: Control-Flow Graph (RISC-V), you can see the extra spills inserted instead in the last block, which is very cold.

Contributor

@wangpc-pp wangpc-pp left a comment

The result is promising! One concern is that I don't know if we have set the cost too high. What is the reasoning chain after setting the value of this cost? It is definitely not a case of 'higher is better', so what is the rule for balancing it?
Also, I don't see the big improvement on 523.xalancbmk_r on the SiFive core, which makes me hesitate about the value 64.
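
One back-of-the-envelope way to read the magnitude of the knob, assuming the raw comparison described in the patch's code comment (entry frequency pinned at 1 << 14) and ignoring the allocator's per-function rescaling:

#include <cstdint>

// Rough reading of the cost as a coldness threshold (assumption: raw
// comparison at the fixed scale from the patch comment, not a measured model).
constexpr uint64_t FixedEntryFreq = 1ULL << 14;           // 16384
constexpr double ColdFraction64 = 64.0 / FixedEntryFreq;  // ~0.39% of entry
constexpr double ColdFraction5 = 5.0 / FixedEntryFreq;    // ~0.03% of entry
// With cost 64, blocks colder than ~0.4% of the entry frequency prefer a
// spill/split over the first touch of a callee-saved register; with cost 5,
// only blocks colder than ~0.03% do. Raising the value widens the set of
// paths treated as cold enough to spill in.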

@lukel97
Contributor

lukel97 commented Mar 19, 2025

Also I don't see the big improvement on 523.xalancbmk_r on SiFive core, which makes me hesitate about the value 64.

From Michael's results there's a 0.2% dyn instcount improvement and a 0.4% improvement on the BPI-F3, which seems to check out? Inspecting the codegen diff at -O3 -march=rva22u64_v, there are lots of functions with diffs like, e.g.:

--- build.rva22u64_v-O3.a//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s     2025-03-19 17:29:10
+++ build.rva22u64_v-O3.b//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s     2025-03-19 17:27:46
@@ -79,27 +79,25 @@
        .cfi_personality 155, DW.ref.__gxx_personality_v0
        .cfi_lsda 27, .Lexception0
 # %bb.0:                                # %entry
-       addi    sp, sp, -64
-       .cfi_def_cfa_offset 64
-       sd      ra, 56(sp)                      # 8-byte Folded Spill
-       sd      s0, 48(sp)                      # 8-byte Folded Spill
-       sd      s1, 40(sp)                      # 8-byte Folded Spill
-       sd      s2, 32(sp)                      # 8-byte Folded Spill
-       sd      s3, 24(sp)                      # 8-byte Folded Spill
-       sd      s4, 16(sp)                      # 8-byte Folded Spill
+       addi    sp, sp, -48
+       .cfi_def_cfa_offset 48
+       sd      ra, 40(sp)                      # 8-byte Folded Spill
+       sd      s0, 32(sp)                      # 8-byte Folded Spill
+       sd      s1, 24(sp)                      # 8-byte Folded Spill
+       sd      s2, 16(sp)                      # 8-byte Folded Spill
+       sd      s3, 8(sp)                       # 8-byte Folded Spill
        .cfi_offset ra, -8
        .cfi_offset s0, -16
        .cfi_offset s1, -24
        .cfi_offset s2, -32
        .cfi_offset s3, -40
-       .cfi_offset s4, -48
        .cfi_remember_state

@wangpc-pp
Contributor

wangpc-pp commented Mar 19, 2025

It all makes sense to me; I just wonder if the value 64 is too high, since the value on AArch64 is 5. I don't see any significant arch difference between RISC-V and AArch64 that would result in different costs, since they are both architectures for RISC CPUs.

Maybe @lukel97 you can issue another several runs with different costs?

@lukel97
Contributor

lukel97 commented Mar 19, 2025

It all makes sense to me, I just wonder if the value 64 is too high since the value on AArch64 is 5. I don't see any big arch difference between RISC-V and AArch64 since they are both arch for RISC CPUs.

Maybe @lukel97 you can issue another several runs with different costs?

Oh I see what you mean, I'll kick off a run with 5 and see how that compares

@topperc
Collaborator

topperc commented Mar 19, 2025

I wonder if this is making up for the lack of per-register shrink-wrapping in some cases.

@preames
Collaborator

preames commented Mar 19, 2025

Maybe @lukel97 you can issue another several runs with different costs?

Oh I see what you mean, I'll kick off a run with 5 and see how that compares

I want to be careful that we don't make perfection the enemy of the good here. We have measured results - both icount and runtime - that show the current number produces reasonable results. I encourage a data-driven sweep here, but I propose we should not spend too much time on this before picking something reasonable (64 seems reasonable) and landing it. We can continue to sweep the space for a better overall result, but let's take the win we have rather than trying to find the global maximum. We can always revise the number in the future.

Contributor

@wangpc-pp wangpc-pp left a comment

OK, I didn't mean to be picky here, but please just wait for another run with the value 5. We are not in a release merge window, so we are not in a hurry, right? :-)

@lukel97
Contributor

lukel97 commented Mar 20, 2025

Here are the results with 5 as the cost; there's still a 3% gain on 511.povray_r: https://lnt.lukelau.me/db_default/v4/nts/335?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=333&submit=Update

There's not as large a gain on 523.xalancbmk_r anymore, but after looking at the nightly runs it seems to be low noise but highly variable, probably due to code layout: https://lnt.lukelau.me/db_default/v4/nts/graph?highlight_run=331&plot.12=25.12.3. So I think we should probably ignore it.

So both 5 and 64 have the same effect on the BPI-F3

@michaelmaitland
Contributor Author

Updated to make this patch use 5

Collaborator

@preames preames left a comment

LGTM

-      std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
+      CSRFirstTimeCost.getNumOccurrences()
+          ? CSRFirstTimeCost
+          : std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
Collaborator

If the command line option isn't explicitly set, CSRFirstTimeCost will be the default value of 0, so this max will always return TRI->getCSRFirstUseCost().

Contributor Author

If someone goes in and changes the default value, wouldn't we want the comparison to occur without having to change the code here?

Collaborator

I guess, but I'm not sure why the default of the CSRFirstTimeCost would be changed instead of the base class implementation of TargetRegisterInfo::getCSRFirstUseCost().

We can take the code how you have it.
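
To make the behavior settled on in this exchange concrete, here is a minimal sketch with hypothetical values (the helper costFor is made up for illustration; it mirrors the ternary quoted above):

#include <algorithm>
#include <cassert>

// An explicitly passed option always wins; otherwise the larger of the
// option's default and the target hook is used.
unsigned costFor(bool OptionExplicit, unsigned OptionValue, unsigned HookValue) {
  return OptionExplicit ? OptionValue : std::max(OptionValue, HookValue);
}

int main() {
  assert(costFor(false, 0, 5) == 5); // option unset: the hook's value applies
  assert(costFor(true, 0, 5) == 0);  // explicit 0 (as in the test RUN lines) disables it
}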

Collaborator

@topperc topperc left a comment

LGTM

Contributor

@lukel97 lukel97 left a comment

LGTM too

Contributor

@wangpc-pp wangpc-pp left a comment

LGTM. Thanks!

@michaelmaitland michaelmaitland merged commit 00fabd2 into llvm:main Mar 20, 2025
11 checks passed
@michaelmaitland michaelmaitland deleted the init-csr-cost-regalloc branch March 20, 2025 19:20
@llvm-ci
Collaborator

llvm-ci commented Mar 20, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime running on omp-vega20-0 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/18064

Here is the relevant piece of the build log for reference:
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/gpupgo/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fcreate-profile      -Xarch_device -fprofile-generate
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fcreate-profile -Xarch_device -fprofile-generate
# note: command had no output on stdout or stderr
# RUN: at line 3
env LLVM_PROFILE_FILE=pgo1.c.llvm.profraw      /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp 2>&1
# executed command: env LLVM_PROFILE_FILE=pgo1.c.llvm.profraw /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp
# note: command had no output on stdout or stderr
# RUN: at line 5
llvm-profdata show --all-functions --counts      amdgcn-amd-amdhsa.pgo1.c.llvm.profraw |      /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c --check-prefix="LLVM-PGO"
# executed command: llvm-profdata show --all-functions --counts amdgcn-amd-amdhsa.pgo1.c.llvm.profraw
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c --check-prefix=LLVM-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c:37:14: error: LLVM-PGO: expected string not found in input
# | // LLVM-PGO: Block counts: [20, 10, 2, 1]
# |              ^
# | <stdin>:4:13: note: scanning from here
# |  Counters: 4
# |             ^
# | <stdin>:5:2: note: possible intended match here
# |  Block counts: [20, 10, 3, 1]
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             1: Counters: 
# |             2:  __omp_offloading_802_b3a8120_main_l25: 
# |             3:  Hash: 0x03fd5b902019ff2d 
# |             4:  Counters: 4 
# | check:37'0                 X error: no match found
# |             5:  Block counts: [20, 10, 3, 1] 
# | check:37'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | check:37'1      ?                             possible intended match
# |             6:  test1: 
# | check:37'0     ~~~~~~~
# |             7:  Hash: 0x0a4d0ad3efffffff 
# |             8:  Counters: 1 
# |             9:  Block counts: [10] 
# |            10:  test2: 
...

@michaelmaitland
Contributor Author

Adding a comment here for future reference: we should re-evaluate setting this option once the improvements to ShrinkWrapping land.

@topperc
Collaborator

topperc commented Mar 21, 2025

I see a gain on 511.povray_r on the SiFive P550 board too.
