[RISCV][RegAlloc] Add getCSRFirstUseCost for RISC-V #131349


Merged: 4 commits into llvm:main from init-csr-cost-regalloc, Mar 20, 2025

Conversation

Contributor

@michaelmaitland michaelmaitland commented Mar 14, 2025

This is based off of 63efd8e.

The following table shows the percent change to the dynamic instruction count when the function in this patch returns 0 (default) versus other values.

| benchmark       | % speedup 1 over 0     | % speedup 4 over 0    | % speedup 16 over 0   | % speedup 64 over 0  | % speedup 128 over 0 |
| --------------- | ---------------------- | --------------------- | --------------------- | -------------------- | -------------------- |
| 500.perlbench_r | 0.001018570165         | 0.001049508358        | 0.001001106529        | 0.03382582818        | 0.03395354577        |
| 502.gcc_r       | 0.02850551412          | 0.02170512371         | 0.01453021263         | 0.06011008637        | 0.1215691521         |
| 505.mcf_r       | -0.00009506373338      | -0.00009090057642     | -0.0000860991497      | -0.00005027849766    | 0.00001251173791     |
| 520.omnetpp_r   | 0.2958940288           | 0.2959715925          | 0.2961141505          | 0.2959823497         | 0.2963124341         |
| 523.xalancbmk_r | -0.0327074721          | -0.01037021046        | -0.3226810542         | 0.02127133714        | 0.02765388389        |
| 525.x264_r      | 0.0000001381714403     | -0.00000007041540345  | -0.00000002156399465  | 0.0000002108993364   | 0.0000002463382874   |
| 531.deepsjeng_r | 0.00000000339777238    | 0.000000003874652714  | 0.000000003636212547  | 0.000000003874652714 | 0.000000003159332213 |
| 541.leela_r     | 0.0009186059953        | -0.000424159199       | 0.0004984456879       | 0.274948447          | 0.8135521414         |
| 557.xz_r        | -0.000000003547118854  | -0.00004896449559     | -0.00004910691576     | -0.0000491109983     | -0.00004895599589    |
| geomean         | 0.03265937388          | 0.03424232324         | -0.00107917442        | 0.07629116165        | 0.1439913192         |

The following table shows the percent change to the runtime when the function in this patch returns 0 (default) versus other values.

| benchmark       | % speedup 1 over 0 | % speedup 4 over 0 | % speedup 16 over 0 | % speedup 64 over 0 | % speedup 128 over 0 |
| --------------- | ------------------ | ------------------ | ------------------- | ------------------- | -------------------- |
| 500.perlbench_r | 0.1722356761       | 0.2269681109       | 0.2596825578        | 0.361573851         | 1.15041305           |
| 502.gcc_r       | -0.548415855       | -0.06187002799     | -0.5553684674       | -0.8876686237       | -0.4668665535        |
| 505.mcf_r       | -0.8786414258      | -0.4150938441      | -1.035517726        | -0.1860770377       | -0.01904825648       |
| 520.omnetpp_r   | 0.4130256072       | 0.6595976188       | 0.897332171         | 0.6252625622        | 0.3869467278         |
| 523.xalancbmk_r | 1.318132014        | -0.003927574       | 1.025962975         | 1.090320253         | -0.789206202         |
| 525.x264_r      | -0.03112871796     | -0.00167557587     | 0.06932423155       | -0.1919840015       | -0.1203585732        |
| 531.deepsjeng_r | -0.259516072       | -0.01973455652     | -0.2723227894       | -0.005417022257     | -0.02222388177       |
| 541.leela_r     | -0.3497178495      | -0.3510447393      | 0.1274508001        | 0.6485542452        | 0.2880651727         |
| 557.xz_r        | 0.7683565263       | -0.2197509447      | -0.0431183874       | 0.07518130872       | 0.5236853039         |
| geomean         | 0.06506952742      | -0.0211865386      | 0.05072694648       | 0.1684530637        | 0.1020533557         |

I chose to set the value to 5 on RISC-V because it improves both the dynamic IC and the runtime, because it showed good results empirically, and because setting it to higher numbers had a similar effect.

I looked at some of the diffs, and it seems like this patch leads to two things:

  1. Less spilling -- not spilling the CSR led to better register allocation and helped us avoid spills down the line.
  2. Avoiding the CSR spill but spilling more on paths that static heuristics estimate as cold.

@llvmbot
Member

llvmbot commented Mar 14, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

This is based off of 63efd8e.

I chose to set the value to 64 on RISC-V because it improves both the dynamic IC and the runtime, because AMDGPU set their number to 100, and because callee-saved spills are probably less expensive on RISC-V than on AMDGPU.

Full diff: https://github.com/llvm/llvm-project/pull/131349.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.h (+7)
  • (added) llvm/test/CodeGen/RISCV/csr-first-use-cost.ll (+105)
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 0830191dde3f4..fc5852e081a22 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -61,6 +61,13 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
   const uint32_t *getCallPreservedMask(const MachineFunction &MF,
                                        CallingConv::ID) const override;
 
+  unsigned getCSRFirstUseCost() const override {
+    // The cost will be compared against BlockFrequency where entry has the
+    // value of 1 << 14. A value of 64 will choose to spill or split cold
+    // path instead of using a callee-saved register.
+    return 64;
+  }
+
   const MCPhysReg *getCalleeSavedRegs(const MachineFunction *MF) const override;
 
   const MCPhysReg *getIPRACSRegs(const MachineFunction *MF) const override;
diff --git a/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
new file mode 100644
index 0000000000000..62ad116256050
--- /dev/null
+++ b/llvm/test/CodeGen/RISCV/csr-first-use-cost.ll
@@ -0,0 +1,105 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=0 | FileCheck %s -check-prefix=ZERO-COST
+; RUN: llc %s -mtriple=riscv64 -regalloc-csr-first-time-cost=64 | FileCheck %s -check-prefix=SOME-COST
+
+define fastcc void @Perl_sv_setnv(ptr %.str.54.3682) {
+; ZERO-COST-LABEL: Perl_sv_setnv:
+; ZERO-COST:       # %bb.0: # %entry
+; ZERO-COST-NEXT:    addi sp, sp, -32
+; ZERO-COST-NEXT:    .cfi_def_cfa_offset 32
+; ZERO-COST-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    sd s1, 8(sp) # 8-byte Folded Spill
+; ZERO-COST-NEXT:    .cfi_offset ra, -8
+; ZERO-COST-NEXT:    .cfi_offset s0, -16
+; ZERO-COST-NEXT:    .cfi_offset s1, -24
+; ZERO-COST-NEXT:    bnez zero, .LBB0_5
+; ZERO-COST-NEXT:  # %bb.1: # %entry
+; ZERO-COST-NEXT:    li a1, 1
+; ZERO-COST-NEXT:    bnez a1, .LBB0_6
+; ZERO-COST-NEXT:  .LBB0_2: # %entry
+; ZERO-COST-NEXT:    mv s0, a0
+; ZERO-COST-NEXT:    beqz zero, .LBB0_4
+; ZERO-COST-NEXT:  # %bb.3: # %sw.bb34.i
+; ZERO-COST-NEXT:    li s0, 0
+; ZERO-COST-NEXT:  .LBB0_4: # %Perl_sv_reftype.exit
+; ZERO-COST-NEXT:    li s1, 0
+; ZERO-COST-NEXT:    li a0, 0
+; ZERO-COST-NEXT:    li a1, 0
+; ZERO-COST-NEXT:    jalr s1
+; ZERO-COST-NEXT:    li a0, 0
+; ZERO-COST-NEXT:    mv a1, s0
+; ZERO-COST-NEXT:    li a2, 0
+; ZERO-COST-NEXT:    jalr s1
+; ZERO-COST-NEXT:  .LBB0_5: # %entry
+; ZERO-COST-NEXT:    beqz zero, .LBB0_2
+; ZERO-COST-NEXT:  .LBB0_6: # %sw.bb3
+; ZERO-COST-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    ld s1, 8(sp) # 8-byte Folded Reload
+; ZERO-COST-NEXT:    .cfi_restore ra
+; ZERO-COST-NEXT:    .cfi_restore s0
+; ZERO-COST-NEXT:    .cfi_restore s1
+; ZERO-COST-NEXT:    addi sp, sp, 32
+; ZERO-COST-NEXT:    .cfi_def_cfa_offset 0
+; ZERO-COST-NEXT:    ret
+;
+; SOME-COST-LABEL: Perl_sv_setnv:
+; SOME-COST:       # %bb.0: # %entry
+; SOME-COST-NEXT:    addi sp, sp, -32
+; SOME-COST-NEXT:    .cfi_def_cfa_offset 32
+; SOME-COST-NEXT:    sd ra, 24(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    sd s0, 16(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    .cfi_offset ra, -8
+; SOME-COST-NEXT:    .cfi_offset s0, -16
+; SOME-COST-NEXT:    bnez zero, .LBB0_5
+; SOME-COST-NEXT:  # %bb.1: # %entry
+; SOME-COST-NEXT:    li a1, 1
+; SOME-COST-NEXT:    bnez a1, .LBB0_6
+; SOME-COST-NEXT:  .LBB0_2: # %entry
+; SOME-COST-NEXT:    sd a0, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:    beqz zero, .LBB0_4
+; SOME-COST-NEXT:  # %bb.3: # %sw.bb34.i
+; SOME-COST-NEXT:    sd zero, 8(sp) # 8-byte Folded Spill
+; SOME-COST-NEXT:  .LBB0_4: # %Perl_sv_reftype.exit
+; SOME-COST-NEXT:    li s0, 0
+; SOME-COST-NEXT:    li a0, 0
+; SOME-COST-NEXT:    li a1, 0
+; SOME-COST-NEXT:    jalr s0
+; SOME-COST-NEXT:    li a0, 0
+; SOME-COST-NEXT:    ld a1, 8(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    li a2, 0
+; SOME-COST-NEXT:    jalr s0
+; SOME-COST-NEXT:  .LBB0_5: # %entry
+; SOME-COST-NEXT:    beqz zero, .LBB0_2
+; SOME-COST-NEXT:  .LBB0_6: # %sw.bb3
+; SOME-COST-NEXT:    ld ra, 24(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    ld s0, 16(sp) # 8-byte Folded Reload
+; SOME-COST-NEXT:    .cfi_restore ra
+; SOME-COST-NEXT:    .cfi_restore s0
+; SOME-COST-NEXT:    addi sp, sp, 32
+; SOME-COST-NEXT:    .cfi_def_cfa_offset 0
+; SOME-COST-NEXT:    ret
+entry:
+  switch i8 0, label %Perl_sv_reftype.exit [
+    i8 1, label %sw.bb4
+    i8 12, label %sw.bb34.i
+    i8 3, label %sw.bb3
+    i8 0, label %sw.bb3
+  ]
+
+sw.bb3:                                           ; preds = %entry, %entry
+  ret void
+
+sw.bb4:                                           ; preds = %entry
+  br label %Perl_sv_reftype.exit
+
+sw.bb34.i:                                        ; preds = %entry
+  br label %Perl_sv_reftype.exit
+
+Perl_sv_reftype.exit:                             ; preds = %sw.bb34.i, %sw.bb4, %entry
+  %retval.0.i = phi ptr [ null, %sw.bb34.i ], [ null, %sw.bb4 ], [ %.str.54.3682, %entry ]
+  %call17 = tail call fastcc i64 null(ptr null, i32 0)
+  tail call void (ptr, ...) null(ptr null, ptr %retval.0.i, ptr null)
+  unreachable
+}
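
For context on what the new hook feeds into: the greedy register allocator turns the value returned by getCSRFirstUseCost() into a BlockFrequency-scale cost and compares it against the frequency of the block where a callee-saved register would first be touched. The standalone C++ sketch below paraphrases that computation; it is a simplification of RegAllocGreedy's CSR-cost setup, not the verbatim LLVM source, and the helper name computeCSRCost is made up for illustration.

#include <algorithm>
#include <cstdint>

// Simplified paraphrase: the raw cost is expressed relative to a fixed entry
// frequency of 1 << 14 (as the comment in RISCVRegisterInfo.h says), so it is
// rescaled to the function's actual entry frequency before being compared
// against block frequencies.
uint64_t computeCSRCost(unsigned OptionCost,  // -regalloc-csr-first-time-cost
                        unsigned TargetCost,  // TRI->getCSRFirstUseCost()
                        uint64_t ActualEntryFreq) {
  uint64_t Cost = std::max<uint64_t>(OptionCost, TargetCost);
  if (Cost == 0 || ActualEntryFreq == 0)
    return 0; // zero disables the CSR first-use penalty entirely
  const uint64_t FixedEntryFreq = 1ULL << 14; // 16384
  return Cost * ActualEntryFreq / FixedEntryFreq;
}

Under this reading, a block whose frequency falls below the scaled cost is treated as cold enough that spilling or splitting there beats paying for the first use of a callee-saved register, which is the behavior the SOME-COST check lines above demonstrate.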

@preames
Collaborator

preames commented Mar 14, 2025

This seems generally reasonable, and well motivated. A couple of questions:

  1. What architecture did you measure the dynamic IC on? I'm mostly interested in whether it had vector or not. I could see the presence of vector skewing results through interactions with our other heuristics (e.g. vector remat causing more scalar register pressure). Whichever you measured, might be a good idea to sanity check in the other. You don't need to repeat the whole matrix of tests, but maybe confirm the chosen point at least doesn't regress the other?
  2. Which CPU did you measure the performance (runtime) on? Is it in order? Or out of order? Does it have any kind of stack engine or ST/LD forwarding? I'm mostly just trying to understand how general this result might be. If this is e.g. a low end in-order, the perf impact might not carry over to other CPUs.

; DEFAULT-COST-NEXT: bnez a1, .LBB0_6
; DEFAULT-COST-NEXT: .LBB0_2: # %entry
; DEFAULT-COST-NEXT: sd a0, 8(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: beqz zero, .LBB0_4
Collaborator

@topperc topperc Mar 14, 2025

This is a nonsensical instruction. It's comparing x0 to x0. Any idea what caused it?

Contributor Author

Probably llvm-reduce caused it.

@preames
Collaborator

preames commented Mar 14, 2025

One other question.
3) What are the noise bands on your runtime measurements? Glancing at e.g. x264 which has basically no dynamic IC change, the runtime measurements look a lot like random noise. That's somewhat unavoidable when the impact is this small, but I want to have some clue which numbers I should be treating as having likely statistical significance.

@lukel97 Might be good to queue this one on the BP3. Your setup does a decent job of pruning statistical false positives, and as above, it looks like that might be relevant here.

; DEFAULT-COST-NEXT: addi sp, sp, -32
; DEFAULT-COST-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
; DEFAULT-COST-NEXT: bnez zero, .LBB0_5
Collaborator

This is also comparing zero to zero.

Contributor Author

Probably llvm-reduce. May still be worth peephole optimizing these?

Collaborator

@topperc topperc Mar 14, 2025

I suspect they are created by tail duplication and MachineCopyProp very late in the pipeline.

Contributor Author

I found some cases of us generating this in real code. I will take a look.

Contributor Author

We're generating it in RISC-V DAG->DAG Pattern Instruction Selection (riscv-isel). I'll work on a separate patch.

@michaelmaitland
Contributor Author

What architecture did you measure the dynamic IC on? I'm mostly interested in whether it had vector or not. I could see the presence of vector skewing results through interactions with our other heuristics (e.g. vector remat causing more scalar register pressure). Whichever you measured, might be a good idea to sanity check in the other. You don't need to repeat the whole matrix of tests, but maybe confirm the chosen point at least doesn't regress the other?

It had vector, but I do not see any diff related to vector while skimming benchmarks.

Which CPU did you measure the performance (runtime) on? Is it in order? Or out of order? Does it have any kind of stack engine or ST/LD forwarding? I'm mostly just trying to understand how general this result might be. If this is e.g. a low end in-order, the perf impact might not carry over to other CPUs.

This was measured on our higher end out of order core (p470).

What are the noise bands on your runtime measurements? Glancing at e.g. x264 which has basically no dynamic IC change, the runtime measurements look a lot like random noise. That's somewhat unavoidable when the impact is this small, but I want to have some clue which numbers I should be treating as having likely statistical significance.

@lukel97 Might be good to queue this one on the BP3. Your setup does a decent job of pruning statistical false positives, and as above, it looks like that might be relevant here.

These numbers are mostly within our noise band. It would be great if @lukel97 could have a look! I did notice some cases where some CSR spills turned into non-CSR moves (which is a win), which is why I wanted to look at runtime numbers.

; DEFAULT-COST-NEXT: addi sp, sp, 32
; DEFAULT-COST-NEXT: ret
entry:
switch i8 0, label %Perl_sv_reftype.exit [
Collaborator

Please make this condition a function argument. That will fix the nonsensical branches. I don't think it's worth optimizing in SelectionDAG, but I also don't want this test to break if it gets optimized in the future.
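
As a sketch of what this request amounts to (hypothetical shape; the committed test may spell the details differently), the switch condition becomes a parameter so instruction selection can no longer fold the compare into a constant branch:

; Hypothetical reshape of the reduced test: taking the condition as an
; argument keeps the branches live instead of folding to "bnez zero".
define fastcc void @Perl_sv_setnv(ptr %.str.54.3682, i8 %cond) {
entry:
  switch i8 %cond, label %Perl_sv_reftype.exit [
    i8 1, label %sw.bb4
    i8 12, label %sw.bb34.i
    i8 3, label %sw.bb3
    i8 0, label %sw.bb3
  ]

sw.bb3:
  ret void

sw.bb4:
  br label %Perl_sv_reftype.exit

sw.bb34.i:
  br label %Perl_sv_reftype.exit

Perl_sv_reftype.exit:
  %retval.0.i = phi ptr [ null, %sw.bb34.i ], [ null, %sw.bb4 ], [ %.str.54.3682, %entry ]
  %call17 = tail call fastcc i64 null(ptr null, i32 0)
  tail call void (ptr, ...) null(ptr null, ptr %retval.0.i, ptr null)
  unreachable
}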

Contributor Author

done

@lukel97
Contributor

lukel97 commented Mar 18, 2025

I've queued up a run for this on the BPI-F3, will have results sometime tomorrow hopefully!

@michaelmaitland michaelmaitland force-pushed the init-csr-cost-regalloc branch from f007674 to ad0a1ec on March 18, 2025 16:21
@michaelmaitland michaelmaitland requested review from topperc and asb March 18, 2025 16:21
@lukel97
Contributor

lukel97 commented Mar 19, 2025

I've queued up a run for this on the BPI-F3, will have results sometime tomorrow hopefully!

I'm happy to report that there are some fairly confident gains with this PR: 3.27% on 511.povray_r and 2.76% on 523.xalancbmk_r, both with very little noise: https://lnt.lukelau.me/db_default/v4/nts/328

On povray a good chunk of the speedup comes from the frame setup in pov::All_Sphere_Intersections being cheaper now that it no longer uses fs3-fs8. If you view it in the profile here and select View: Control-Flow Graph (RISC-V), you can see the extra spills inserted instead in the last block, which is very cold.

Contributor

@wangpc-pp wangpc-pp left a comment

The result is promising! One concern is that I don't know if we have set the cost too high. What is the reasoning chain after setting the value of this cost? It is definitely not a case of 'higher is better', so what is the rule for balancing it?
Also, I don't see the big improvement on 523.xalancbmk_r on the SiFive core, which makes me hesitate about the value 64.
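
One back-of-the-envelope way to read the magnitude of the knob, assuming the raw comparison described in the patch's code comment (entry frequency pinned at 1 << 14) and ignoring the allocator's per-function rescaling:

#include <cstdint>

// Rough reading of the cost as a coldness threshold (assumption: raw
// comparison at the fixed scale from the patch comment, not a measured model).
constexpr uint64_t FixedEntryFreq = 1ULL << 14;           // 16384
constexpr double ColdFraction64 = 64.0 / FixedEntryFreq;  // ~0.39% of entry
constexpr double ColdFraction5 = 5.0 / FixedEntryFreq;    // ~0.03% of entry
// With cost 64, blocks colder than ~0.4% of the entry frequency prefer a
// spill/split over the first touch of a callee-saved register; with cost 5,
// only blocks colder than ~0.03% do. Raising the value widens the set of
// paths treated as cold enough to spill in.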

@lukel97
Contributor

lukel97 commented Mar 19, 2025

Also I don't see the big improvement on 523.xalancbmk_r on SiFive core, which makes me hesitate about the value 64.

From Michael's results there's a 0.2% dyn instcount improvement and a 0.4% improvement on the BPI-F3, which seems to check out? Inspecting the codegen diff at -O3 -march=rva22u64_v, there are lots of functions with diffs like, e.g.:

--- build.rva22u64_v-O3.a//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s     2025-03-19 17:29:10
+++ build.rva22u64_v-O3.b//External/SPEC/CINT2017rate/523.xalancbmk_r/CMakeFiles/523.xalancbmk_r.dir/Users/luke/Developer/cpu2017/benchspec/CPU/523.xalancbmk_r/src/AbstractNumericValidator.s     2025-03-19 17:27:46
@@ -79,27 +79,25 @@
        .cfi_personality 155, DW.ref.__gxx_personality_v0
        .cfi_lsda 27, .Lexception0
 # %bb.0:                                # %entry
-       addi    sp, sp, -64
-       .cfi_def_cfa_offset 64
-       sd      ra, 56(sp)                      # 8-byte Folded Spill
-       sd      s0, 48(sp)                      # 8-byte Folded Spill
-       sd      s1, 40(sp)                      # 8-byte Folded Spill
-       sd      s2, 32(sp)                      # 8-byte Folded Spill
-       sd      s3, 24(sp)                      # 8-byte Folded Spill
-       sd      s4, 16(sp)                      # 8-byte Folded Spill
+       addi    sp, sp, -48
+       .cfi_def_cfa_offset 48
+       sd      ra, 40(sp)                      # 8-byte Folded Spill
+       sd      s0, 32(sp)                      # 8-byte Folded Spill
+       sd      s1, 24(sp)                      # 8-byte Folded Spill
+       sd      s2, 16(sp)                      # 8-byte Folded Spill
+       sd      s3, 8(sp)                       # 8-byte Folded Spill
        .cfi_offset ra, -8
        .cfi_offset s0, -16
        .cfi_offset s1, -24
        .cfi_offset s2, -32
        .cfi_offset s3, -40
-       .cfi_offset s4, -48
        .cfi_remember_state

@wangpc-pp
Contributor

wangpc-pp commented Mar 19, 2025

It all makes sense to me; I just wonder if the value 64 is too high, since the value on AArch64 is 5. I don't see any significant arch difference between RISC-V and AArch64 that would result in different costs, since they are both architectures for RISC CPUs.

Maybe @lukel97 you can issue another several runs with different costs?

@lukel97
Contributor

lukel97 commented Mar 19, 2025

It all makes sense to me, I just wonder if the value 64 is too high since the value on AArch64 is 5. I don't see any big arch difference between RISC-V and AArch64 since they are both arch for RISC CPUs.

Maybe @lukel97 you can issue another several runs with different costs?

Oh I see what you mean, I'll kick off a run with 5 and see how that compares

@topperc
Collaborator

topperc commented Mar 19, 2025

I wonder if this is making up for the lack of per-register shrink-wrapping in some cases.

@preames
Collaborator

preames commented Mar 19, 2025

Maybe @lukel97 you can issue another several runs with different costs?

Oh I see what you mean, I'll kick off a run with 5 and see how that compares

I want to be careful that we don't make perfection the enemy of the good here. We have measured results - both icount and runtime - that show the current number produces reasonable results. I encourage a data-driven sweep here, but I propose we should not spend too much time on this before picking something reasonable (64 seems reasonable) and landing it. We can continue to sweep the space for a better overall result, but let's take the win we have rather than trying to find the global maximum. We can always revise the number in the future.

Contributor

@wangpc-pp wangpc-pp left a comment

OK, I didn't mean to be picky here, but please just wait for another run with the value 5. We are not in a release merge window, so we are not in a hurry, right? :-)

@lukel97
Contributor

lukel97 commented Mar 20, 2025

Here are the results with 5 as the cost; there's still a 3% gain on 511.povray_r: https://lnt.lukelau.me/db_default/v4/nts/335?show_delta=yes&show_previous=yes&show_stddev=yes&show_mad=yes&show_all=yes&show_all_samples=yes&show_sample_counts=yes&show_small_diff=yes&num_comparison_runs=0&test_filter=&test_min_value_filter=&aggregation_fn=min&MW_confidence_lv=0.05&compare_to=333&submit=Update

There's not as large a gain on 523.xalancbmk_r anymore, but after looking at the nightly runs it seems to be low noise but highly variable, probably due to code layout: https://lnt.lukelau.me/db_default/v4/nts/graph?highlight_run=331&plot.12=25.12.3. So I think we should probably ignore it.

So both 5 and 64 have the same effect on the BPI-F3

@michaelmaitland
Contributor Author

Updated to make this patch use 5

Collaborator

@preames preames left a comment

LGTM

-      std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
+      CSRFirstTimeCost.getNumOccurrences()
+          ? CSRFirstTimeCost
+          : std::max((unsigned)CSRFirstTimeCost, TRI->getCSRFirstUseCost()));
Collaborator

If the command line option isn't explicitly set, CSRFirstTimeCost will be the default value of 0, so this max will always return TRI->getCSRFirstUseCost().

Contributor Author

If someone goes in and changes the default value, wouldn't we want the comparison to occur without having to change the code here?

Collaborator

I guess, but I'm not sure why the default of the CSRFirstTimeCost would be changed instead of the base class implementation of TargetRegisterInfo::getCSRFirstUseCost().

We can take the code how you have it.
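
To make the behavior settled on in this exchange concrete, here is a minimal sketch with hypothetical values (the helper costFor is made up for illustration; it mirrors the ternary quoted above):

#include <algorithm>
#include <cassert>

// An explicitly passed option always wins; otherwise the larger of the
// option's default and the target hook is used.
unsigned costFor(bool OptionExplicit, unsigned OptionValue, unsigned HookValue) {
  return OptionExplicit ? OptionValue : std::max(OptionValue, HookValue);
}

int main() {
  assert(costFor(false, 0, 5) == 5); // option unset: the hook's value applies
  assert(costFor(true, 0, 5) == 0);  // explicit 0 (as in the test RUN lines) disables it
}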

Collaborator

@topperc topperc left a comment

LGTM

Contributor

@lukel97 lukel97 left a comment

LGTM too

Contributor

@wangpc-pp wangpc-pp left a comment

LGTM. Thanks!

@michaelmaitland michaelmaitland merged commit 00fabd2 into llvm:main Mar 20, 2025
11 checks passed
@michaelmaitland michaelmaitland deleted the init-csr-cost-regalloc branch March 20, 2025 19:20
@llvm-ci
Collaborator

llvm-ci commented Mar 20, 2025

LLVM Buildbot has detected a new failure on builder openmp-offload-amdgpu-runtime running on omp-vega20-0 while building llvm at step 7 "Add check check-offload".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/30/builds/18064

Here is the relevant piece of the build log for reference:
Step 7 (Add check check-offload) failure: test (failure)
******************** TEST 'libomptarget :: amdgcn-amd-amdhsa :: offloading/gpupgo/pgo1.c' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 1
/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp    -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib  -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fcreate-profile      -Xarch_device -fprofile-generate
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/clang -fopenmp -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test -I /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -L /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -nogpulib -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib -fopenmp-targets=amdgcn-amd-amdhsa /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c -o /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./lib/libomptarget.devicertl.a -fcreate-profile -Xarch_device -fprofile-generate
# note: command had no output on stdout or stderr
# RUN: at line 3
env LLVM_PROFILE_FILE=pgo1.c.llvm.profraw      /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp 2>&1
# executed command: env LLVM_PROFILE_FILE=pgo1.c.llvm.profraw /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/runtimes/runtimes-bins/offload/test/amdgcn-amd-amdhsa/offloading/gpupgo/Output/pgo1.c.tmp
# note: command had no output on stdout or stderr
# RUN: at line 5
llvm-profdata show --all-functions --counts      amdgcn-amd-amdhsa.pgo1.c.llvm.profraw |      /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c --check-prefix="LLVM-PGO"
# executed command: llvm-profdata show --all-functions --counts amdgcn-amd-amdhsa.pgo1.c.llvm.profraw
# note: command had no output on stdout or stderr
# executed command: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.build/./bin/FileCheck /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c --check-prefix=LLVM-PGO
# .---command stderr------------
# | /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c:37:14: error: LLVM-PGO: expected string not found in input
# | // LLVM-PGO: Block counts: [20, 10, 2, 1]
# |              ^
# | <stdin>:4:13: note: scanning from here
# |  Counters: 4
# |             ^
# | <stdin>:5:2: note: possible intended match here
# |  Block counts: [20, 10, 3, 1]
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /home/ompworker/bbot/openmp-offload-amdgpu-runtime/llvm.src/offload/test/offloading/gpupgo/pgo1.c
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |             1: Counters: 
# |             2:  __omp_offloading_802_b3a8120_main_l25: 
# |             3:  Hash: 0x03fd5b902019ff2d 
# |             4:  Counters: 4 
# | check:37'0                 X error: no match found
# |             5:  Block counts: [20, 10, 3, 1] 
# | check:37'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | check:37'1      ?                             possible intended match
# |             6:  test1: 
# | check:37'0     ~~~~~~~
# |             7:  Hash: 0x0a4d0ad3efffffff 
# |             8:  Counters: 1 
# |             9:  Block counts: [10] 
# |            10:  test2: 
...

@michaelmaitland
Contributor Author

Adding a comment here for future reference: we should re-evaluate setting this option once the improvements to ShrinkWrapping land.

@topperc
Collaborator

topperc commented Mar 21, 2025

I see a gain on 511.povray_r on the SiFive P550 board too.
