[RegAlloc] Scale the spill weight by target factor #113675


Merged (1 commit, Mar 13, 2025)

Conversation

wangpc-pp (Contributor) commented Oct 25, 2024

Currently, the spill weight is determined only by isDef/isUse and
block frequency. However, for registers in different register
classes, the costs of spilling them differ.

For example, for LMUL>1 registers (where several physical
registers are combined into one larger logical register), the costs
are higher than in the LMUL=1 case (where there is only one
physical register).

To solve this problem, a new target hook getSpillWeightScaleFactor
is added. Targets can override the default factor (which is 1.0)
according to the register class.

For RISC-V, the factor is set to the RegClassWeight, which is
used to track register pressure. The values of RegClassWeight
happen to be the number of register units.

I believe all targets with compound registers can benefit
from this change, but only RISC-V is customized in this patch since
it has been widely agreed to do so. The other targets need more
performance data to go further.

Partially fixes #113489.
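
To make the effect concrete, here is a minimal sketch of the scaled
weight computation (it mirrors the LiveIntervals::getSpillWeight change
in the diff below; the optsize early-exit is omitted and the example
numbers are invented):

    // Sketch only: the relative block frequency and the target factor
    // both scale the basic def/use weight.
    float getSpillWeight(bool IsDef, bool IsUse, float RelBlockFreq,
                         unsigned Factor) {
      float Weight = IsDef + IsUse; // 0.0, 1.0, or 2.0
      return Weight * RelBlockFreq * Factor;
    }

For example, a def+use of an LMUL=8 virtual register (Factor = 8, its
RegClassWeight) in a block with relative frequency 0.5 now weighs
(1 + 1) * 0.5 * 8 = 8.0 instead of 1.0, so the allocator becomes much
more reluctant to spill it than an LMUL=1 register in the same position.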

llvmbot (Member) commented Oct 25, 2024

@llvm/pr-subscribers-backend-loongarch
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-backend-aarch64
@llvm/pr-subscribers-backend-x86
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-backend-hexagon
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-debuginfo

@llvm/pr-subscribers-llvm-regalloc

Author: Pengcheng Wang (wangpc-pp)

Changes

Currently, the spill weight is determined only by isDef/isUse and
block frequency. However, for registers in different register
classes, the costs of spilling them differ.

For example, for LMUL>1 registers (where several physical registers
are combined into one larger logical register), the costs are higher
than in the LMUL=1 case (where there is only one physical register).

To solve this problem, a new target hook getSpillWeightFactor is
added. Targets can override the default factor (which is 1) according
to the register class.

For RISC-V, the factors are set to the RegClassWeight, which is
used to track register pressure. The values of RegClassWeight
are the number of register units.

I believe all the targets can benefit from this change, but I will
restrict the tests to RISC-V only.

Partially fixes #113489.


Patch is 870.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/113675.diff

78 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/LiveIntervals.h (+2-2)
  • (modified) llvm/include/llvm/CodeGen/TargetRegisterInfo.h (+3)
  • (modified) llvm/lib/CodeGen/CalcSpillWeights.cpp (+7-5)
  • (modified) llvm/lib/CodeGen/LiveIntervals.cpp (+4-4)
  • (modified) llvm/lib/CodeGen/TargetRegisterInfo.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp (+5)
  • (modified) llvm/lib/Target/RISCV/RISCVRegisterInfo.h (+2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/abs-vp.ll (+6-29)
  • (modified) llvm/test/CodeGen/RISCV/rvv/bitreverse-vp.ll (+67-87)
  • (modified) llvm/test/CodeGen/RISCV/rvv/bswap-vp.ll (+36-56)
  • (modified) llvm/test/CodeGen/RISCV/rvv/ceil-vp.ll (+44-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/compressstore.ll (+7-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/ctpop-vp.ll (+57-157)
  • (modified) llvm/test/CodeGen/RISCV/rvv/cttz-vp.ll (+74-225)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bitreverse-vp.ll (+58-78)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-bswap-vp.ll (+37-57)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-ceil-vp.ll (-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-ctlz-vp.ll (+166-472)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-ctpop-vp.ll (+80-181)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-cttz-vp.ll (+146-452)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-floor-vp.ll (-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleaved-access.ll (+361-358)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-fp.ll (+8-96)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-masked-store-int.ll (+8-96)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-reduction-fp.ll (+76-224)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-rint-vp.ll (-13)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-round-vp.ll (-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-roundeven-vp.ll (-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-roundtozero-vp.ll (-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-setcc-fp-vp.ll (+20-40)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-setcc-int-vp.ll (+18-38)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-trunc-vp.ll (+94-91)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vcopysign-vp.ll (+5-15)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfmax-vp.ll (+5-15)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vfmin-vp.ll (+5-15)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vselect-vp.ll (+15-25)
  • (modified) llvm/test/CodeGen/RISCV/rvv/floor-vp.ll (+44-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-sdnode.ll (+4-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fmaximum-vp.ll (+231-282)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-sdnode.ll (+4-30)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fminimum-vp.ll (+231-282)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fshr-fshl-vp.ll (+67-201)
  • (modified) llvm/test/CodeGen/RISCV/rvv/mscatter-sdnode.ll (+9-19)
  • (modified) llvm/test/CodeGen/RISCV/rvv/nearbyint-vp.ll (+36-92)
  • (modified) llvm/test/CodeGen/RISCV/rvv/rint-vp.ll (+44-123)
  • (modified) llvm/test/CodeGen/RISCV/rvv/round-vp.ll (+44-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundeven-vp.ll (+44-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/roundtozero-vp.ll (+44-130)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-fp-vp.ll (+84-230)
  • (modified) llvm/test/CodeGen/RISCV/rvv/setcc-int-vp.ll (+20-40)
  • (modified) llvm/test/CodeGen/RISCV/rvv/strided-vpstore.ll (+6-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll (+17-19)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+40-22)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave-store.ll (+7-29)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll (+48-96)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfadd-vp.ll (+74-90)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfdiv-vp.ll (+74-90)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfma-vp.ll (+918-978)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmadd-constrained-sdnode.ll (+94-68)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmadd-sdnode.ll (+147-106)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmul-vp.ll (+37-45)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfmuladd-vp.ll (+4-4)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfnmadd-constrained-sdnode.ll (+60-54)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfnmsub-constrained-sdnode.ll (+55-64)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfptosi-vp.ll (+1-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfptoui-vp.ll (+1-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfptrunc-vp.ll (+5-19)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfsub-vp.ll (+74-90)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwmacc-vp.ll (+5-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmacc-vp.ll (+5-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vfwnmsac-vp.ll (+5-17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpscatter-sdnode.ll (+14-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vpstore.ll (+6-20)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vselect-fp.ll (-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vselect-vp.ll (+27-101)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vsitofp-vp.ll (+1-14)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vtrunc-vp.ll (+4-5)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vuitofp-vp.ll (+1-14)
diff --git a/llvm/include/llvm/CodeGen/LiveIntervals.h b/llvm/include/llvm/CodeGen/LiveIntervals.h
index 161bb247a0e968..a58ba178ac8484 100644
--- a/llvm/include/llvm/CodeGen/LiveIntervals.h
+++ b/llvm/include/llvm/CodeGen/LiveIntervals.h
@@ -117,14 +117,14 @@ class LiveIntervals {
   /// If \p PSI is provided the calculation is altered for optsize functions.
   static float getSpillWeight(bool isDef, bool isUse,
                               const MachineBlockFrequencyInfo *MBFI,
-                              const MachineInstr &MI,
+                              const MachineInstr &MI, unsigned Factor = 1,
                               ProfileSummaryInfo *PSI = nullptr);
 
   /// Calculate the spill weight to assign to a single instruction.
   /// If \p PSI is provided the calculation is altered for optsize functions.
   static float getSpillWeight(bool isDef, bool isUse,
                               const MachineBlockFrequencyInfo *MBFI,
-                              const MachineBasicBlock *MBB,
+                              const MachineBasicBlock *MBB, unsigned Factor = 1,
                               ProfileSummaryInfo *PSI = nullptr);
 
   LiveInterval &getInterval(Register Reg) {
diff --git a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
index 292fa3c94969be..8726d2e33dbc83 100644
--- a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
@@ -926,6 +926,9 @@ class TargetRegisterInfo : public MCRegisterInfo {
   /// Returns a -1 terminated array of pressure set IDs.
   virtual const int *getRegUnitPressureSets(unsigned RegUnit) const = 0;
 
+  /// Get the factor of spill weight for this register class.
+  virtual unsigned getSpillWeightFactor(const TargetRegisterClass *RC) const;
+
   /// Get a list of 'hint' registers that the register allocator should try
   /// first when allocating a physical register for the virtual register
   /// VirtReg. These registers are effectively moved to the front of the
diff --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index f361c956092e88..8c3ab0d1e43a89 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -189,6 +189,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
   // Do not update future local split artifacts.
   bool ShouldUpdateLI = !IsLocalSplitArtifact;
 
+  unsigned Factor = TRI.getSpillWeightFactor(MRI.getRegClass(LI.reg()));
   if (IsLocalSplitArtifact) {
     MachineBasicBlock *LocalMBB = LIS.getMBBFromIndex(*End);
     assert(LocalMBB == LIS.getMBBFromIndex(*Start) &&
@@ -199,10 +200,10 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
     // localLI = COPY other
     // ...
     // other   = COPY localLI
-    TotalWeight +=
-        LiveIntervals::getSpillWeight(true, false, &MBFI, LocalMBB, PSI);
-    TotalWeight +=
-        LiveIntervals::getSpillWeight(false, true, &MBFI, LocalMBB, PSI);
+    TotalWeight += LiveIntervals::getSpillWeight(true, false, &MBFI, LocalMBB,
+                                                 Factor, PSI);
+    TotalWeight += LiveIntervals::getSpillWeight(false, true, &MBFI, LocalMBB,
+                                                 Factor, PSI);
 
     NumInstr += 2;
   }
@@ -274,7 +275,8 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
       // Calculate instr weight.
       bool Reads, Writes;
       std::tie(Reads, Writes) = MI->readsWritesVirtualRegister(LI.reg());
-      Weight = LiveIntervals::getSpillWeight(Writes, Reads, &MBFI, *MI, PSI);
+      Weight =
+          LiveIntervals::getSpillWeight(Writes, Reads, &MBFI, *MI, Factor, PSI);
 
       // Give extra weight to what looks like a loop induction variable update.
       if (Writes && IsExiting && LIS.isLiveOutOfMBB(LI, MBB))
diff --git a/llvm/lib/CodeGen/LiveIntervals.cpp b/llvm/lib/CodeGen/LiveIntervals.cpp
index 21a316cf99a217..48f4538122be3e 100644
--- a/llvm/lib/CodeGen/LiveIntervals.cpp
+++ b/llvm/lib/CodeGen/LiveIntervals.cpp
@@ -877,15 +877,15 @@ LiveIntervals::hasPHIKill(const LiveInterval &LI, const VNInfo *VNI) const {
 
 float LiveIntervals::getSpillWeight(bool isDef, bool isUse,
                                     const MachineBlockFrequencyInfo *MBFI,
-                                    const MachineInstr &MI,
+                                    const MachineInstr &MI, unsigned Factor,
                                     ProfileSummaryInfo *PSI) {
-  return getSpillWeight(isDef, isUse, MBFI, MI.getParent(), PSI);
+  return getSpillWeight(isDef, isUse, MBFI, MI.getParent(), Factor, PSI);
 }
 
 float LiveIntervals::getSpillWeight(bool isDef, bool isUse,
                                     const MachineBlockFrequencyInfo *MBFI,
                                     const MachineBasicBlock *MBB,
-                                    ProfileSummaryInfo *PSI) {
+                                    unsigned Factor, ProfileSummaryInfo *PSI) {
   float Weight = isDef + isUse;
   const auto *MF = MBB->getParent();
   // When optimizing for size we only consider the codesize impact of spilling
@@ -893,7 +893,7 @@ float LiveIntervals::getSpillWeight(bool isDef, bool isUse,
   if (PSI && (MF->getFunction().hasOptSize() ||
               llvm::shouldOptimizeForSize(MF, PSI, MBFI)))
     return Weight;
-  return Weight * MBFI->getBlockFreqRelativeToEntryBlock(MBB);
+  return Weight * MBFI->getBlockFreqRelativeToEntryBlock(MBB) * Factor;
 }
 
 LiveRange::Segment
diff --git a/llvm/lib/CodeGen/TargetRegisterInfo.cpp b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
index ac9a3d6f0d1a60..d1f02489db62cb 100644
--- a/llvm/lib/CodeGen/TargetRegisterInfo.cpp
+++ b/llvm/lib/CodeGen/TargetRegisterInfo.cpp
@@ -415,6 +415,11 @@ bool TargetRegisterInfo::shouldRewriteCopySrc(const TargetRegisterClass *DefRC,
   return shareSameRegisterFile(*this, DefRC, DefSubReg, SrcRC, SrcSubReg);
 }
 
+unsigned
+TargetRegisterInfo::getSpillWeightFactor(const TargetRegisterClass *RC) const {
+  return 1;
+}
+
 // Compute target-independent register allocator hints to help eliminate copies.
 bool TargetRegisterInfo::getRegAllocationHints(
     Register VirtReg, ArrayRef<MCPhysReg> Order,
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
index 26195ef721db39..884a62c3e70679 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp
@@ -803,6 +803,11 @@ RISCVRegisterInfo::getRegisterCostTableIndex(const MachineFunction &MF) const {
              : 0;
 }
 
+unsigned
+RISCVRegisterInfo::getSpillWeightFactor(const TargetRegisterClass *RC) const {
+  return getRegClassWeight(RC).RegWeight;
+}
+
 // Add two address hints to improve chances of being able to use a compressed
 // instruction.
 bool RISCVRegisterInfo::getRegAllocationHints(
diff --git a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
index 6ddb1eb9c14d5e..51e7f9d3b0cc1d 100644
--- a/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVRegisterInfo.h
@@ -127,6 +127,8 @@ struct RISCVRegisterInfo : public RISCVGenRegisterInfo {
 
   unsigned getRegisterCostTableIndex(const MachineFunction &MF) const override;
 
+  unsigned getSpillWeightFactor(const TargetRegisterClass *RC) const override;
+
   bool getRegAllocationHints(Register VirtReg, ArrayRef<MCPhysReg> Order,
                              SmallVectorImpl<MCPhysReg> &Hints,
                              const MachineFunction &MF, const VirtRegMap *VRM,
diff --git a/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll b/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
index cd2208e31eb6d3..b37454b3b24434 100644
--- a/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/abs-vp.ll
@@ -561,18 +561,7 @@ declare <vscale x 16 x i64> @llvm.vp.abs.nxv16i64(<vscale x 16 x i64>, i1 immarg
 define <vscale x 16 x i64> @vp_abs_nxv16i64(<vscale x 16 x i64> %va, <vscale x 16 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vp_abs_nxv16i64:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi sp, sp, -16
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
-; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 4
-; CHECK-NEXT:    sub sp, sp, a1
-; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
-; CHECK-NEXT:    vmv1r.v v24, v0
-; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 3
-; CHECK-NEXT:    add a1, sp, a1
-; CHECK-NEXT:    addi a1, a1, 16
-; CHECK-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
+; CHECK-NEXT:    vmv1r.v v7, v0
 ; CHECK-NEXT:    csrr a1, vlenb
 ; CHECK-NEXT:    srli a2, a1, 3
 ; CHECK-NEXT:    vsetvli a3, zero, e8, mf4, ta, ma
@@ -582,28 +571,16 @@ define <vscale x 16 x i64> @vp_abs_nxv16i64(<vscale x 16 x i64> %va, <vscale x 1
 ; CHECK-NEXT:    addi a3, a3, -1
 ; CHECK-NEXT:    and a2, a3, a2
 ; CHECK-NEXT:    vsetvli zero, a2, e64, m8, ta, ma
-; CHECK-NEXT:    vrsub.vi v8, v16, 0, v0.t
-; CHECK-NEXT:    vmax.vv v8, v16, v8, v0.t
-; CHECK-NEXT:    addi a2, sp, 16
-; CHECK-NEXT:    vs8r.v v8, (a2) # Unknown-size Folded Spill
+; CHECK-NEXT:    vrsub.vi v24, v16, 0, v0.t
+; CHECK-NEXT:    vmax.vv v16, v16, v24, v0.t
 ; CHECK-NEXT:    bltu a0, a1, .LBB46_2
 ; CHECK-NEXT:  # %bb.1:
 ; CHECK-NEXT:    mv a0, a1
 ; CHECK-NEXT:  .LBB46_2:
-; CHECK-NEXT:    vmv1r.v v0, v24
-; CHECK-NEXT:    slli a1, a1, 3
-; CHECK-NEXT:    add a1, sp, a1
-; CHECK-NEXT:    addi a1, a1, 16
-; CHECK-NEXT:    vl8r.v v8, (a1) # Unknown-size Folded Reload
+; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; CHECK-NEXT:    vrsub.vi v16, v8, 0, v0.t
-; CHECK-NEXT:    vmax.vv v8, v8, v16, v0.t
-; CHECK-NEXT:    addi a0, sp, 16
-; CHECK-NEXT:    vl8r.v v16, (a0) # Unknown-size Folded Reload
-; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    slli a0, a0, 4
-; CHECK-NEXT:    add sp, sp, a0
-; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    vrsub.vi v24, v8, 0, v0.t
+; CHECK-NEXT:    vmax.vv v8, v8, v24, v0.t
 ; CHECK-NEXT:    ret
   %v = call <vscale x 16 x i64> @llvm.vp.abs.nxv16i64(<vscale x 16 x i64> %va, i1 false, <vscale x 16 x i1> %m, i32 %evl)
   ret <vscale x 16 x i64> %v
diff --git a/llvm/test/CodeGen/RISCV/rvv/bitreverse-vp.ll b/llvm/test/CodeGen/RISCV/rvv/bitreverse-vp.ll
index afce04d107e728..94bc2851a6bf40 100644
--- a/llvm/test/CodeGen/RISCV/rvv/bitreverse-vp.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/bitreverse-vp.ll
@@ -2307,7 +2307,7 @@ define <vscale x 7 x i64> @vp_bitreverse_nxv7i64(<vscale x 7 x i64> %va, <vscale
 ; RV32-NEXT:    vsll.vx v24, v24, a3, v0.t
 ; RV32-NEXT:    vor.vv v16, v16, v24, v0.t
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 4
+; RV32-NEXT:    slli a4, a4, 3
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v16, (a4) # Unknown-size Folded Spill
@@ -2315,28 +2315,30 @@ define <vscale x 7 x i64> @vp_bitreverse_nxv7i64(<vscale x 7 x i64> %va, <vscale
 ; RV32-NEXT:    vsetvli a5, zero, e64, m8, ta, ma
 ; RV32-NEXT:    vlse64.v v16, (a4), zero
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 3
+; RV32-NEXT:    slli a4, a4, 4
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v16, (a4) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a4, 4080
 ; RV32-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; RV32-NEXT:    vand.vx v24, v8, a4, v0.t
-; RV32-NEXT:    vsll.vi v24, v24, 24, v0.t
-; RV32-NEXT:    addi a5, sp, 16
-; RV32-NEXT:    vs8r.v v24, (a5) # Unknown-size Folded Spill
-; RV32-NEXT:    vand.vv v24, v8, v16, v0.t
-; RV32-NEXT:    vsll.vi v16, v24, 8, v0.t
-; RV32-NEXT:    vl8r.v v24, (a5) # Unknown-size Folded Reload
-; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
+; RV32-NEXT:    vand.vx v16, v8, a4, v0.t
+; RV32-NEXT:    vsll.vi v24, v16, 24, v0.t
 ; RV32-NEXT:    csrr a5, vlenb
 ; RV32-NEXT:    slli a5, a5, 4
 ; RV32-NEXT:    add a5, sp, a5
 ; RV32-NEXT:    addi a5, a5, 16
+; RV32-NEXT:    vl8r.v v16, (a5) # Unknown-size Folded Reload
+; RV32-NEXT:    vand.vv v16, v8, v16, v0.t
+; RV32-NEXT:    vsll.vi v16, v16, 8, v0.t
+; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
+; RV32-NEXT:    csrr a5, vlenb
+; RV32-NEXT:    slli a5, a5, 3
+; RV32-NEXT:    add a5, sp, a5
+; RV32-NEXT:    addi a5, a5, 16
 ; RV32-NEXT:    vl8r.v v24, (a5) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
 ; RV32-NEXT:    csrr a5, vlenb
-; RV32-NEXT:    slli a5, a5, 4
+; RV32-NEXT:    slli a5, a5, 3
 ; RV32-NEXT:    add a5, sp, a5
 ; RV32-NEXT:    addi a5, a5, 16
 ; RV32-NEXT:    vs8r.v v16, (a5) # Unknown-size Folded Spill
@@ -2350,7 +2352,7 @@ define <vscale x 7 x i64> @vp_bitreverse_nxv7i64(<vscale x 7 x i64> %va, <vscale
 ; RV32-NEXT:    vand.vx v24, v24, a4, v0.t
 ; RV32-NEXT:    vsrl.vi v8, v8, 8, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    slli a1, a1, 4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
@@ -2360,7 +2362,7 @@ define <vscale x 7 x i64> @vp_bitreverse_nxv7i64(<vscale x 7 x i64> %va, <vscale
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    slli a1, a1, 3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
@@ -2668,7 +2670,7 @@ define <vscale x 8 x i64> @vp_bitreverse_nxv8i64(<vscale x 8 x i64> %va, <vscale
 ; RV32-NEXT:    vsll.vx v24, v24, a3, v0.t
 ; RV32-NEXT:    vor.vv v16, v16, v24, v0.t
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 4
+; RV32-NEXT:    slli a4, a4, 3
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v16, (a4) # Unknown-size Folded Spill
@@ -2676,28 +2678,30 @@ define <vscale x 8 x i64> @vp_bitreverse_nxv8i64(<vscale x 8 x i64> %va, <vscale
 ; RV32-NEXT:    vsetvli a5, zero, e64, m8, ta, ma
 ; RV32-NEXT:    vlse64.v v16, (a4), zero
 ; RV32-NEXT:    csrr a4, vlenb
-; RV32-NEXT:    slli a4, a4, 3
+; RV32-NEXT:    slli a4, a4, 4
 ; RV32-NEXT:    add a4, sp, a4
 ; RV32-NEXT:    addi a4, a4, 16
 ; RV32-NEXT:    vs8r.v v16, (a4) # Unknown-size Folded Spill
 ; RV32-NEXT:    lui a4, 4080
 ; RV32-NEXT:    vsetvli zero, a0, e64, m8, ta, ma
-; RV32-NEXT:    vand.vx v24, v8, a4, v0.t
-; RV32-NEXT:    vsll.vi v24, v24, 24, v0.t
-; RV32-NEXT:    addi a5, sp, 16
-; RV32-NEXT:    vs8r.v v24, (a5) # Unknown-size Folded Spill
-; RV32-NEXT:    vand.vv v24, v8, v16, v0.t
-; RV32-NEXT:    vsll.vi v16, v24, 8, v0.t
-; RV32-NEXT:    vl8r.v v24, (a5) # Unknown-size Folded Reload
-; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
+; RV32-NEXT:    vand.vx v16, v8, a4, v0.t
+; RV32-NEXT:    vsll.vi v24, v16, 24, v0.t
 ; RV32-NEXT:    csrr a5, vlenb
 ; RV32-NEXT:    slli a5, a5, 4
 ; RV32-NEXT:    add a5, sp, a5
 ; RV32-NEXT:    addi a5, a5, 16
+; RV32-NEXT:    vl8r.v v16, (a5) # Unknown-size Folded Reload
+; RV32-NEXT:    vand.vv v16, v8, v16, v0.t
+; RV32-NEXT:    vsll.vi v16, v16, 8, v0.t
+; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
+; RV32-NEXT:    csrr a5, vlenb
+; RV32-NEXT:    slli a5, a5, 3
+; RV32-NEXT:    add a5, sp, a5
+; RV32-NEXT:    addi a5, a5, 16
 ; RV32-NEXT:    vl8r.v v24, (a5) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v16, v24, v16, v0.t
 ; RV32-NEXT:    csrr a5, vlenb
-; RV32-NEXT:    slli a5, a5, 4
+; RV32-NEXT:    slli a5, a5, 3
 ; RV32-NEXT:    add a5, sp, a5
 ; RV32-NEXT:    addi a5, a5, 16
 ; RV32-NEXT:    vs8r.v v16, (a5) # Unknown-size Folded Spill
@@ -2711,7 +2715,7 @@ define <vscale x 8 x i64> @vp_bitreverse_nxv8i64(<vscale x 8 x i64> %va, <vscale
 ; RV32-NEXT:    vand.vx v24, v24, a4, v0.t
 ; RV32-NEXT:    vsrl.vi v8, v8, 8, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 3
+; RV32-NEXT:    slli a1, a1, 4
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
@@ -2721,7 +2725,7 @@ define <vscale x 8 x i64> @vp_bitreverse_nxv8i64(<vscale x 8 x i64> %va, <vscale
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
 ; RV32-NEXT:    vor.vv v8, v8, v16, v0.t
 ; RV32-NEXT:    csrr a1, vlenb
-; RV32-NEXT:    slli a1, a1, 4
+; RV32-NEXT:    slli a1, a1, 3
 ; RV32-NEXT:    add a1, sp, a1
 ; RV32-NEXT:    addi a1, a1, 16
 ; RV32-NEXT:    vl8r.v v16, (a1) # Unknown-size Folded Reload
@@ -3010,89 +3014,65 @@ declare <vscale x 64 x i16> @llvm.vp.bitreverse.nxv64i16(<vscale x 64 x i16>, <v
 define <vscale x 64 x i16> @vp_bitreverse_nxv64i16(<vscale x 64 x i16> %va, <vscale x 64 x i1> %m, i32 zeroext %evl) {
 ; CHECK-LABEL: vp_bitreverse_nxv64i16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi sp, sp, -16
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    vmv1r.v v7, v0
 ; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 4
-; CHECK-NEXT:    sub sp, sp, a1
-; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x10, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 16 * vlenb
-; CHECK-NEXT:    vmv1r.v v24, v0
-; CHECK-NEXT:    csrr a1, vlenb
-; CHECK-NEXT:    slli a1, a1, 3
-; CHECK-NEXT:    add a1, sp, a1
-; CHECK-NEXT:    addi a1, a1, 16
-; CHECK-NEXT:    vs8r.v v8, (a1) # Unknown-size Folded Spill
-; CHECK-NEXT:    csrr a2, vlenb
-; CHECK-NEXT:    srli a1, a2, 1
+; CHECK-NEXT:    srli a2, a1, 1
 ; CHECK-NEXT:    vsetvli a3, zero, e8, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v0, v0, a1
-; CHECK-NEXT:    slli a2, a2, 2
-; CHECK-NEXT:    sub a1, a0, a2
-; CHECK-NEXT:    sltu a3, a0, a1
+; CHECK-NEXT:    vslidedown.vx v0, v0, a2
+; CHECK-NEXT:    slli a1, a1, 2
+; CHECK-NEXT:    sub a2, a0, a1
+; CHECK-NEXT:    sltu a3, a0, a2
 ; CHECK-NEXT:    addi a3, a3, -1
-; CHECK-NEXT:    and a1, a3, a1
-; CHECK-NEXT:    vsetvli zero, a1, e16, m8, ta, ma
-; CHECK-NEXT:    vsrl.vi v8, v16, 8, v0.t
+; CHECK-NEXT:    and a2, a3, a2
+; CHECK-NEXT:    vsetvli zero, a2, e16, m8, ta, ma
+; CHECK-NEXT:    vsrl.vi v24, v16, 8, v0.t
 ; CHECK-NEXT:    vsll.vi v16, v16, 8, v0.t
-; CHECK-NEXT:    vor.vv v16, v16, v8, v0.t
-; CHECK-NEXT:    vsrl.vi v8, v16, 4, v0.t
-; CHECK-NEXT:    lui a1, 1
-; CHECK-NEXT:    addi a1, a1, -241
-; CHECK-NEXT:    vand.vx v8, v8, a1, v0.t
-; CHECK-NEXT:    vand.vx v16, v16, a1, v0.t
+; CHECK-NEXT:    vor.vv v16, v16, v24, v0.t
+; CHECK-NEXT:    vsrl.vi v24, v16, 4, v0.t
+; CHECK-NEXT:    lui a2, 1
+; CHECK-NEXT:    addi a2, a2, -241
+; CHECK-NEXT:    vand.vx v24, v24, a2, v0.t
+; CHECK-NEXT:    vand.vx v16, v16, a2, v0.t
 ; CHECK-NEXT:    vsll.vi v16, v16, 4, v0.t
-; CHECK-NEXT:    vor.vv v16, v8, v16, v0.t
-; CHECK-NEXT:    vsrl.vi v8, v16, 2, v0.t
+; CHECK-NEXT:    vor.vv v16, v24, v16, v0.t
+; CHECK-NEXT:    vsrl.vi v24, v16, 2, v0.t
 ; CHECK-NEXT:    lui a3, 3
 ; CHECK-NEXT:    addi a3, a3, 819
-; CHECK-NEXT:    vand.vx v8, v8, a3, v0.t
+; CHECK-NEXT:    vand.vx v24, v24, a3, v0.t
 ; CHECK-NEXT:    vand.vx v16, v16, a3, v0.t
 ; CHECK-NEXT:    vsll.vi v16, v16, 2, v0.t
-; CHECK-NEXT:    vor.vv v16, v8, v16, v0.t
-; CHECK-NEXT:    vsrl.vi v8, v16, 1, v0.t
+; CHECK-NEXT:    vor.vv v16, v24, v16, v0.t
+; CHECK-NEXT:    vsrl.vi v24, v16, 1, v0.t
 ; CHECK-NEXT:    lui a4, 5
 ; CHECK-NEXT:    addi a4, a4, 1365
-; CHECK-NEXT:    vand.vx v8, v8, a4, v0.t
+; CHECK-NEXT:    vand.vx v24, v24, a4, v0.t
 ; CHECK-NEXT:    vand.vx v16, v16, a4, v0.t
 ; CHECK-NEXT:    vsll.vi v16, v16, 1, v0.t
-; CHECK-NEXT:    vor.vv v8, v8, v16, v0.t
-; CHECK-NEXT:    addi a5, sp, 16
-; CHECK-NEXT:    vs8r.v v8, (a5) # Unknown-size Folded Spill
-; CHECK-NEXT:    bltu a0, a2, .LBB46_2
+; CHECK-NEXT:    vor.vv v16, v24, v16, v0.t
+; CHECK-NEXT:    bltu a0, a1, .LBB46_2
 ; CHECK-NEXT:  # %bb.1:
-; CHECK-NEXT:    mv a0, a2
+; CHECK-NEXT:    mv a0, a1
 ; CHECK-NEXT:  .LBB46_2:
-; CHECK-NEXT:    vmv1r.v v0, v24
-; CHECK-NEXT:    csrr a2, vlenb
-; CHECK-NEXT:    slli a2, a2, 3
-; CHECK-NEXT:    add a2, sp, a2
-; CHECK-NEXT:    addi a2, a2, 16
-; CHECK-NEXT:    vl8r.v v8, (a2) # Unknown-size Folded Reload
+; CHECK-NEXT:    vmv1r.v v0, v7
 ; CHECK-NEXT:    vsetvli zero, a0, e16, m8, ta, ma
-; CHECK-NEXT:    vsrl.vi v16, v8, 8, v0.t
+; CHECK-NEXT:    vsrl.vi v24, v8, 8, v0.t
 ; CHECK-NEXT:    vsll.vi v8, v8, 8, v0.t
-; CHECK-NEXT:    vor.vv v8, v8, v16, v0.t
-; CHECK-NEXT:    vsrl.vi v16, v8, 4, v0.t...
[truncated]

llvmbot (Member) commented Oct 25, 2024

@llvm/pr-subscribers-backend-risc-v


; RV32-NEXT: vand.vv v16, v8, v16, v0.t
; RV32-NEXT: vsll.vi v16, v16, 8, v0.t
; RV32-NEXT: vor.vv v16, v24, v16, v0.t
; RV32-NEXT: csrr a5, vlenb
wangpc-pp (Contributor, Author):

This can be considered a small regression, because we have more reads of vscale (vlenb).

lukel97 (Contributor) left a comment

Thanks for looking at this, I also tried this a few months ago, and saw the same test diff where we're able to remove a lot of LMUL 8 spills.

However whilst it helps the in-tree tests a lot when I tried it out on SPEC CPU 2017 I got a 50% increase in spills generated overall, have you been able to check if you run into this too?

My approach also didn't use the reg class weight; instead I scaled the weight by the physical register size in bytes. From what I remember from investigating, the register class weight was related to register pressure calculation, which seemed orthogonal to spilling.
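
As a side note, a minimal sketch of that size-based scaling, assuming the already-computed weight and the interval's register class are at hand; the helper name is made up and this is not the code from this PR or from the branch mentioned above:

#include "llvm/CodeGen/TargetRegisterInfo.h"

// Scale an already-computed spill weight by the class's spill size in
// bytes (getSpillSize reports the stack-slot size for the class), so
// that e.g. an LMUL=8 vector register costs 8x an LMUL=1 one.
static float scaleBySpillSizeInBytes(float Weight,
                                     const llvm::TargetRegisterClass &RC,
                                     const llvm::TargetRegisterInfo &TRI) {
  return Weight * static_cast<float>(TRI.getSpillSize(RC));
}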

@wangpc-pp
Contributor Author

> Thanks for looking at this, I also tried this a few months ago, and saw the same test diff where we're able to remove a lot of LMUL 8 spills.
>
> However, whilst it helps the in-tree tests a lot, when I tried it out on SPEC CPU 2017 I got a 50% increase in spills generated overall. Have you been able to check if you run into this too?

Here are the statistics (-march=rv64gcv):

# utils/compare.py -m regalloc.NumSpills baseline.json experimental.json                            
Tests: 10                                                                                                                                                                                                    
Metric: regalloc.NumSpills                                                                                                                                                                                   
                                                                                                                                                                                                             
Program                                       regalloc.NumSpills                                                                                                                                             
                                              baseline           experimental diff                                                                                                                           
500.perlbench_r/500.perlbench_r.test           4255.00            4255.00      0.0%                                                                                                                          
505.mcf_r/505.mcf_r.test                        128.00             128.00      0.0%                                                                                                                          
520.omnetpp_r/520.omnetpp_r.test                971.00             971.00      0.0%                                                                                                                          
523.xalancbmk_r/523.xalancbmk_r.test           1578.00            1578.00      0.0%                                                                                                                          
525.x264_r/525.x264_r.test                     2027.00            2027.00      0.0%                                                                                                                          
531.deepsjeng_r/531.deepsjeng_r.test            355.00             355.00      0.0%                                                                                                                          
541.leela_r/541.leela_r.test                    361.00             361.00      0.0%                                                                                                                          
557.xz_r/557.xz_r.test                          296.00             296.00      0.0%                                                                                                                          
502.gcc_r/502.gcc_r.test                      13739.00           13735.00     -0.0%                                                                                                                          
999.specrand_ir/999.specrand_ir.test              0.00               0.00                                                                                                                                    
                           Geomean difference                                 -0.0%                                                                                                                          
      regalloc.NumSpills                                                                                                                                                                                     
run             baseline  experimental      diff                                                                                                                                                             
count  10.000000          10.000000     9.000000                                                                                                                                                             
mean   2371.000000        2370.600000  -0.000032                                                                                                                                                             
std    4197.231045        4196.027306   0.000097                                                                                                                                                             
min    0.000000           0.000000     -0.000291                                                                                                                                                             
25%    310.750000         310.750000    0.000000                                                                                                                                                             
50%    666.000000         666.000000    0.000000
75%    1914.750000        1914.750000   0.000000
max    13739.000000       13735.000000  0.000000

# utils/compare.py -m regalloc.NumReloads baseline.json experimental.json 
Tests: 10
Metric: regalloc.NumReloads

Program                                       regalloc.NumReloads                   
                                              baseline            experimental diff 
500.perlbench_r/500.perlbench_r.test           9790.00             9790.00      0.0%
505.mcf_r/505.mcf_r.test                        438.00              438.00      0.0%
520.omnetpp_r/520.omnetpp_r.test               1474.00             1474.00      0.0%
523.xalancbmk_r/523.xalancbmk_r.test           2511.00             2511.00      0.0%
525.x264_r/525.x264_r.test                     4694.00             4694.00      0.0%
531.deepsjeng_r/531.deepsjeng_r.test            674.00              674.00      0.0%
541.leela_r/541.leela_r.test                    509.00              509.00      0.0%
557.xz_r/557.xz_r.test                          631.00              631.00      0.0%
502.gcc_r/502.gcc_r.test                      31924.00            31920.00     -0.0%
999.specrand_ir/999.specrand_ir.test              0.00                0.00          
                           Geomean difference                                  -0.0%
      regalloc.NumReloads                        
run              baseline  experimental      diff
count  10.000000           10.000000     9.000000
mean   5264.500000         5264.100000  -0.000014
std    9829.948957         9828.743601   0.000042
min    0.000000            0.000000     -0.000125
25%    539.500000          539.500000    0.000000
50%    1074.000000         1074.000000   0.000000
75%    4148.250000         4148.250000   0.000000
max    31924.000000        31920.000000  0.000000

I don't see an increase in spills, but the decrease is also imperceptible.

> My approach also didn't use the reg class weight; instead I scaled the weight by the physical register size in bytes. From what I remember from investigating, the register class weight was related to register pressure calculation, which seemed orthogonal to spilling.

Scaling the weight by bytes may be overestimated. I wanted to scale the weight by the number of physical registers, which is coincidentally the register class weight.

@lukel97
Contributor

lukel97 commented Oct 28, 2024

My bad, I was getting two things conflated. I had also tried a different approach where, instead of adjusting the spill weights, I changed the eviction advisor to aggressively perform local reassignment (i.e. when trying to evict an m8 register, it would now try to evict an m1 register instead if it could be immediately assigned a register): lukel97@4075bfc. Off the top of my head I think this was the one that led to all the spilling in SPEC (but it also fixed the spilling in-tree).

But I checked out this PR locally and tried it with -march=rva23u64 -O3 -flto and still got a slight increase in the number of spills. I think with LTO there are more areas of high register pressure that this affects.

Program            regalloc.NumSpills                regalloc.NumReloads               
                   lhs                rhs      diff  lhs                 rhs      diff 
   531.deepsjeng_r   370.00             376.00  1.6%   793.00              800.00  0.9%
   631.deepsjeng_s   370.00             376.00  1.6%   793.00              800.00  0.9%
     538.imagick_r  4406.00            4423.00  0.4% 14425.00            14467.00  0.3%
     638.imagick_s  4406.00            4423.00  0.4% 14425.00            14467.00  0.3%
     526.blender_r 21007.00           21034.00  0.1% 47955.00            47987.00  0.1%
         602.gcc_s 19380.00           19385.00  0.0% 52642.00            52658.00  0.0%
         502.gcc_r 19380.00           19385.00  0.0% 52642.00            52658.00  0.0%
        508.namd_r  5440.00            5440.00  0.0% 12717.00            12717.00  0.0%
       641.leela_s   246.00             246.00  0.0%   482.00              482.00  0.0%
   623.xalancbmk_s  2258.00            2258.00  0.0%  4460.00             4460.00  0.0%
     620.omnetpp_s   997.00             997.00  0.0%  2837.00             2837.00  0.0%
         605.mcf_s    58.00              58.00  0.0%   107.00              107.00  0.0%
   600.perlbench_s  5432.00            5432.00  0.0% 14193.00            14193.00  0.0%
          557.xz_r   316.00             316.00  0.0%   637.00              637.00  0.0%
       541.leela_r   246.00             246.00  0.0%   482.00              482.00  0.0%
   523.xalancbmk_r  2258.00            2258.00  0.0%  4460.00             4460.00  0.0%
      510.parest_r  4565.00            4565.00  0.0% 11879.00            11879.00  0.0%
     520.omnetpp_r   997.00             997.00  0.0%  2837.00             2837.00  0.0%
         505.mcf_r    58.00              58.00  0.0%   107.00              107.00  0.0%
   500.perlbench_r  5432.00            5432.00  0.0% 14193.00            14193.00  0.0%
         644.nab_s   354.00             354.00  0.0%   787.00              787.00  0.0%
         619.lbm_s    70.00              70.00  0.0%    89.00               89.00  0.0%
         544.nab_r   354.00             354.00  0.0%   787.00              787.00  0.0%
         519.lbm_r    72.00              72.00  0.0%    91.00               91.00  0.0%
      511.povray_r  2453.00            2453.00  0.0%  5813.00             5813.00  0.0%
          657.xz_s   316.00             316.00  0.0%   637.00              637.00  0.0%
        525.x264_r  2456.00            2454.00 -0.1%  5504.00             5503.00 -0.0%
        625.x264_s  2456.00            2454.00 -0.1%  5504.00             5503.00 -0.0%
   997.specrand_fr     0.00               0.00                                         
   996.specrand_fs     0.00               0.00                                         
   999.specrand_ir     0.00               0.00                                         
   998.specrand_is     0.00               0.00                                         
Geomean difference                              0.1%                               0.1%

With that said, I'm not sure if these spills or reloads are significant for performance. And on the llvm-test-suite loop vectorization micro benchmarks it actually removes all of the spills.

                   test-suite :: MicroBenchmarks/LoopVectorization/LoopVectorizationBenchmarks.test   391.00               0.00 -100.0%   905.00                     -100.0%
                    test-suite :: MicroBenchmarks/LoopVectorization/LoopInterleavingBenchmarks.test   391.00               0.00 -100.0%   905.00                     -100.0%

I will try and do a run on deepsjeng and imagick to check this, but I would be overall in favour of this.

> I wanted to scale the weight by the number of physical registers, which is coincidentally the register class weight.

For what it's worth, I also previously tried scaling it by the number of reg units: lukel97@65bd27e

But I think this might actually be the same thing that the reg class weight should reflect?

  // Weight override for register pressure calculation. This is the value
  // TargetRegisterClass::getRegClassWeight() will return. The weight is in
  // units of pressure for this register class. If unset tablegen will
  // calculate a weight based on a number of register units in this register
  // class registers. The weight is per register.
  int Weight = ?;
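
As a concrete illustration, that tablegen Weight is what TargetRegisterInfo::getRegClassWeight exposes at the C++ level; a small sketch (the wrapper function here is hypothetical):

#include "llvm/CodeGen/TargetRegisterInfo.h"

// RegClassWeight::RegWeight is the per-register pressure weight; when
// the tablegen Weight field is unset it defaults to the number of
// register units, i.e. 1/2/4/8 for the RVV LMUL=1/2/4/8 classes.
static unsigned pressureWeight(const llvm::TargetRegisterInfo &TRI,
                               const llvm::TargetRegisterClass *RC) {
  return TRI.getRegClassWeight(RC).RegWeight;
}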

@preames
Collaborator

preames commented Oct 30, 2024

I am (conceptually) in favor of this direction. I haven't looked at the code yet in detail, and plan on deferring that until the empirical measurement discussion has converged.

@preames
Collaborator

preames commented Nov 8, 2024

@lukel97 Any update on the perf measurements you mentioned in your last comment?

(Also FYI @mikhailramalho)

@wangpc-pp
Contributor Author

Ping.

@bzEq
Collaborator

bzEq commented Nov 27, 2024

Have you tried taking LaneBitMask into account? The LaneBitMask indicates the regunits occupied by the RegClass.

@bzEq bzEq requested a review from qcolombet November 27, 2024 12:08
@bzEq
Collaborator

bzEq commented Nov 27, 2024

> But I think this might actually be the same thing that the reg class weight should reflect?
>
> // Weight override for register pressure calculation. This is the value
> // TargetRegisterClass::getRegClassWeight() will return. The weight is in
> // units of pressure for this register class. If unset tablegen will
> // calculate a weight based on a number of register units in this register
> // class registers. The weight is per register.
> int Weight = ?;

I think it's viable, except that regunits introduced by aliases might not be included in calculating the SpillWeight.

@wangpc-pp
Contributor Author

> Have you tried taking LaneBitMask into account? The LaneBitMask indicates the regunits occupied by the RegClass.

Yes, I tried it before. It is just the same as lukel97@65bd27e I think.

@preames
Collaborator

preames commented Nov 27, 2024

ping @lukel97 and @mikhailramalho re: BP3 perf measurements.

@lukel97
Contributor

lukel97 commented Nov 27, 2024

> ping @lukel97 and @mikhailramalho re: BP3 perf measurements.

I've kicked off a run now, will report back when the results are ready.

I'm wondering if this is the right place to discourage spilling high LMUL registers. There are multiple ways we could do this, and increasing the spill weights is just one; local reassignment is another. I have a feeling that the spill weights might really be intended to capture the frequency of spills, not the overall cost. E.g. there is the possibility that spilling one M8 register allows 8 M1 registers to avoid spills.

FWIW this approach is also what I originally tried, but I think it would be good to hear some thoughts from people who have worked more with the greedy register allocator.

Collaborator

@preames preames left a comment

A few off-topic observations from the diffs.

; RV32-NEXT: add a4, sp, a4
; RV32-NEXT: addi a4, a4, 16
; RV32-NEXT: vs8r.v v16, (a4) # Unknown-size Folded Spill
; RV32-NEXT: addi a4, sp, 8
; RV32-NEXT: vsetvli a5, zero, e64, m8, ta, ma
; RV32-NEXT: vlse64.v v16, (a4), zero
; RV32-NEXT: csrr a4, vlenb
; RV32-NEXT: slli a4, a4, 3
Collaborator

Off topic for this review, but just an observation.

If I'm reading this code right, we're spilling a zero-strided load from a constant on the stack. I think this is coming from SPLAT_VECTOR_SPLIT_I64_VL.

This is a missed rematerialization optimization if we can prove the stack slot is constant.

We could probably also just use a vector select between two constants (i.e. vmv.v.x + vmerge.vx) here instead. This form wouldn't be easy to remat.

This is also a case where we have a splat of one vreg value to a whole register group. (We don't model it that way today.) We could spill only one vreg here, and "remat" via a chain of whole register moves. This pattern comes up with e.g. vrgather.vi and a few other shuffles, so this might be interesting to explore as a way to decrease register pressure in these cases.

; RV32-NEXT: add a3, sp, a3
; RV32-NEXT: addi a3, a3, 16
; RV32-NEXT: vs8r.v v16, (a3) # Unknown-size Folded Spill
; RV32-NEXT: vs8r.v v8, (a3) # Unknown-size Folded Spill
Collaborator

This case looks suspicious. We have existing logic to rematerialize a vmv.v.x instead of spilling it. Why didn't that kick in here? Both before and after.

; CHECK-NEXT: fsrm a1
; CHECK-NEXT: addi a1, sp, 16
Collaborator

This is weird? Is fsrm maybe marked as invalidating vector registers? We have a spill immediately before a fill from the same address a few lines later?

@lukel97
Contributor

lukel97 commented Nov 30, 2024

I'm not seeing any measurable runtime difference on SPEC CPU 2017 or MicroBenchmarks/LoopVectorization.

Collaborator

@qcolombet qcolombet left a comment

The direction sounds fine to me.

@wangpc-pp
Contributor Author

Ping. @arsenm Can I make it a hook now?

@arsenm
Contributor

arsenm commented Feb 13, 2025

> Ping. @arsenm Can I make it a hook now?

It's still not a hook, or that hook already exists. You can adjust the spill weights of the classes instead of adding another layer of control on top of the existing control.

@wangpc-pp
Contributor Author

> Ping. @arsenm Can I make it a hook now?
>
> It's still not a hook, or that hook already exists. You can adjust the spill weights of the classes instead of adding another layer of control on top of the existing control.

No. I mean, yes, we can adjust the spill weight of a register class, but we don't take it into account when calculating the spill weights of LiveIntervals, and that is what this PR is trying to do.
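
A hedged sketch of what the call-site change might look like in CalcSpillWeights.cpp (names approximate; the helper function is hypothetical, and getSpillWeightScaleFactor is the hook proposed by this PR):

#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"

// Once the per-use/def contributions of a LiveInterval have been
// accumulated, scale the total by the target's per-class factor
// before it is recorded on the interval.
static float applyTargetScale(float TotalWeight, const llvm::LiveInterval &LI,
                              const llvm::MachineRegisterInfo &MRI,
                              const llvm::TargetRegisterInfo &TRI) {
  const llvm::TargetRegisterClass *RC = MRI.getRegClass(LI.reg());
  return TotalWeight * TRI.getSpillWeightScaleFactor(RC);
}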

@preames
Collaborator

preames commented Mar 11, 2025

Ok, I stumbled back into this by analyzing another case where LMUL8 spilling goes bad. I'm going to briefly explain the issue I was chasing, and then return to the status of this patch.

If we have a case which uses masks intermixed with m8 code, but for some reason the mask can't always be placed in v0, we can end up with cases where the register allocator will fragment an m8 and then be unable to "fix" the mistake. In the particular example, we happen to assign v8-v15, then v16-v23, then assign two masks to v24 and v25 respectively. Because one of the instructions uses a mask, v0 is unavailable for an m8 value. Next we try to assign an m8, and find no registers are available. The key thing here is that we should have evicted v24 and v25, placed the m8 in v24-v31, and then reassigned the two m1 values into v7 and v6 respectively. The key problem which prevented the eviction is that the spill weight of the m8 was perceived to be lower than either of the m1s, much less both. This is definitely backwards. Applying this patch has exactly the desired effect, and undoes the fragmentation.

In terms of understanding this patch, I find it helpful to think of the scaling as applying just after the call to LiveIntervals::getSpillWeight. (We're multiplying an add reduction, so this is equivalent on real numbers; on floats, it's "mostly so".) What we're doing here is increasing the cost of the spill or fill instruction. This is 100% accurate, as an m8 spill does cost 8x an m1 spill.
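
A toy, LLVM-independent illustration of that equivalence (numbers and names made up): applying the factor once to the summed weight behaves, on real numbers, exactly as if each individual spill/fill term had been scaled.

#include <vector>

// Sum the per-use/def terms first, then apply the per-class factor.
static float scaledAfterSum(const std::vector<float> &Terms, float Factor) {
  float Sum = 0.0f;
  for (float T : Terms)
    Sum += T;
  return Sum * Factor;
}

// Apply the factor to every term before summing; equal on reals,
// "mostly so" on floats.
static float scaledPerTerm(const std::vector<float> &Terms, float Factor) {
  float Sum = 0.0f;
  for (float T : Terms)
    Sum += T * Factor;
  return Sum;
}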

I have applied this patch locally, and manually reviewed the RISC-V changes. All but a very small handful are improvements. I want this change. I also inspected the values returned by getRegClassWeight to confirm they were reasonable for the RISC-V register classes; they are.

@wangpc-pp I think it's clear that we can't enable this for all targets at this point. Given that, we need to move this back under a target hook. You seemed a bit confused by @arsenm's comments on that. Do you want to take a second attempt, or would you like me to put up an alternate patch?

@wangpc-pp
Contributor Author

> @wangpc-pp I think it's clear that we can't enable this for all targets at this point. Given that, we need to move this back under a target hook. You seemed a bit confused by @arsenm's comments on that. Do you want to take a second attempt, or would you like me to put up an alternate patch?

Big thanks for raising this and for the detailed analysis! I will make it a hook and retry it!

@wangpc-pp wangpc-pp force-pushed the main-ra-spill-weight branch from 977471f to 754c792 Compare March 12, 2025 07:02
@wangpc-pp wangpc-pp changed the title [RegAlloc] Scale the spill weight by the weight of register class [RegAlloc] Scale the spill weight by target factor Mar 12, 2025
@wangpc-pp wangpc-pp force-pushed the main-ra-spill-weight branch from 754c792 to b978595 Compare March 12, 2025 07:06
@wangpc-pp
Contributor Author

I added the hook (the difference compared to the initial version is that the scale factor is a float now, so that targets can exercise fine-grained control) and adjusted it for RISC-V. Please review @preames @lukel97
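
Piecing together the PR description and the files touched, the hook presumably looks roughly like the sketch below (exact signatures in the landed patch may differ):

// TargetRegisterInfo.h: the default factor is 1.0, so targets that
// don't override the hook keep the existing behavior.
virtual float getSpillWeightScaleFactor(const TargetRegisterClass *RC) const {
  return 1.0f;
}

// RISCVRegisterInfo.cpp: scale by the register class weight, which for
// the RVV classes equals the number of register units (the LMUL).
float RISCVRegisterInfo::getSpillWeightScaleFactor(
    const TargetRegisterClass *RC) const {
  return getRegClassWeight(RC).RegWeight;
}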

@wangpc-pp wangpc-pp requested review from preames, lukel97 and arsenm March 12, 2025 07:10
Contributor

@lukel97 lukel97 left a comment

I added some statistics to collect the total "lmul" that is spilled and reloaded statically with this change:

diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
index 2fdf6bd36e88..f4b5a5b29e71 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfo.cpp
@@ -18,6 +18,7 @@
 #include "RISCVSubtarget.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/Statistic.h"
 #include "llvm/Analysis/MemoryLocation.h"
 #include "llvm/Analysis/ValueTracking.h"
 #include "llvm/CodeGen/LiveIntervals.h"
@@ -43,6 +44,10 @@ using namespace llvm;
 #define GET_INSTRINFO_NAMED_OPS
 #include "RISCVGenInstrInfo.inc"
 
+#define DEBUG_TYPE "riscv-instr-info"
+STATISTIC(TotalLMULSpilled, "Total LMUL spilled");
+STATISTIC(TotalLMULReloaded, "Total LMUL reloaded");
+
 static cl::opt<bool> PreferWholeRegisterMove(
     "riscv-prefer-whole-register-move", cl::init(false), cl::Hidden,
     cl::desc("Prefer whole register move for vector registers."));
@@ -615,12 +620,16 @@ void RISCVInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,
     IsScalableVector = false;
   } else if (RISCV::VRRegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VS1R_V;
+    TotalLMULSpilled += 1;
   } else if (RISCV::VRM2RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VS2R_V;
+    TotalLMULSpilled += 2;
   } else if (RISCV::VRM4RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VS4R_V;
+    TotalLMULSpilled += 4;
   } else if (RISCV::VRM8RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VS8R_V;
+    TotalLMULSpilled += 8;
   } else if (RISCV::VRN2M1RegClass.hasSubClassEq(RC))
     Opcode = RISCV::PseudoVSPILL2_M1;
   else if (RISCV::VRN2M2RegClass.hasSubClassEq(RC))
@@ -706,12 +715,16 @@ void RISCVInstrInfo::loadRegFromStackSlot(
     IsScalableVector = false;
   } else if (RISCV::VRRegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VL1RE8_V;
+    TotalLMULReloaded += 1;
   } else if (RISCV::VRM2RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VL2RE8_V;
+    TotalLMULReloaded += 2;
   } else if (RISCV::VRM4RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VL4RE8_V;
+    TotalLMULReloaded += 4;
   } else if (RISCV::VRM8RegClass.hasSubClassEq(RC)) {
     Opcode = RISCV::VL8RE8_V;
+    TotalLMULReloaded += 8;
   } else if (RISCV::VRN2M1RegClass.hasSubClassEq(RC))
     Opcode = RISCV::PseudoVRELOAD2_M1;
   else if (RISCV::VRN2M2RegClass.hasSubClassEq(RC))

Running it against SPEC CPU 2017, -O3 -march=rva23u64, it looks like this helps a good bit on imagick/gcc/deepsjeng!

Program                                       riscv-instr-info.TotalLMULReloaded                riscv-instr-info.TotalLMULSpilled               
                                              lhs                                rhs     diff   lhs                               rhs     diff  
FP2017rate/508.namd_r/508.namd_r                 6.00                               6.00   0.0%    1.00                              1.00   0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s      5.00                               5.00   0.0%    4.00                              4.00   0.0%
INT2017spe...00.perlbench_s/600.perlbench_s      8.00                               8.00   0.0%    4.00                              4.00   0.0%
INT2017speed/625.x264_s/625.x264_s              43.00                              43.00   0.0%   47.00                             47.00   0.0%
INT2017rate/525.x264_r/525.x264_r               43.00                              43.00   0.0%   47.00                             47.00   0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r      6.00                               6.00   0.0%    6.00                              6.00   0.0%
INT2017rate/520.omnetpp_r/520.omnetpp_r          5.00                               5.00   0.0%    4.00                              4.00   0.0%
INT2017rat...00.perlbench_r/500.perlbench_r      8.00                               8.00   0.0%    4.00                              4.00   0.0%
FP2017speed/644.nab_s/644.nab_s                 25.00                              25.00   0.0%   25.00                             25.00   0.0%
FP2017speed/619.lbm_s/619.lbm_s                 42.00                              42.00   0.0%   42.00                             42.00   0.0%
FP2017rate/544.nab_r/544.nab_r                  25.00                              25.00   0.0%   25.00                             25.00   0.0%
FP2017rate/519.lbm_r/519.lbm_r                  42.00                              42.00   0.0%   42.00                             42.00   0.0%
FP2017rate/511.povray_r/511.povray_r           122.00                             122.00   0.0%   66.00                             66.00   0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s      6.00                               6.00   0.0%    6.00                              6.00   0.0%
FP2017speed/638.imagick_s/638.imagick_s       5054.00                            5053.00  -0.0% 4433.00                           3803.00 -14.2%
FP2017rate/538.imagick_r/538.imagick_r        5054.00                            5053.00  -0.0% 4433.00                           3803.00 -14.2%
FP2017rate/510.parest_r/510.parest_r          1349.00                            1343.00  -0.4% 1089.00                           1083.00  -0.6%
FP2017rate/526.blender_r/526.blender_r        1138.00                            1127.00  -1.0% 1064.00                           1025.00  -3.7%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s    284.00                             274.00  -3.5%  154.00                            132.00 -14.3%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r    284.00                             274.00  -3.5%  154.00                            132.00 -14.3%
INT2017speed/602.gcc_s/602.gcc_s               113.00                              83.00 -26.5%  107.00                             77.00 -28.0%
INT2017rate/502.gcc_r/502.gcc_r                113.00                              83.00 -26.5%  107.00                             77.00 -28.0%
INT2017rate/505.mcf_r/505.mcf_r                  0.00                               0.00                                                        
INT2017rate/541.leela_r/541.leela_r              0.00                               0.00                                                        
INT2017rate/557.xz_r/557.xz_r                    0.00                               0.00                                                        
INT2017speed/605.mcf_s/605.mcf_s                 0.00                               0.00                                                        
INT2017speed/641.leela_s/641.leela_s             0.00                               0.00                                                        
INT2017speed/657.xz_s/657.xz_s                   0.00                               0.00                                                        
                           Geomean difference                                             -3.1%                                            -5.8%
      riscv-instr-info.TotalLMULReloaded                         riscv-instr-info.TotalLMULSpilled                        

I think I was previously skeptical because I only measured the absolute number of spills/reloads, which included scalars and didn't show much of a change:

Program                                       regalloc.NumSpills                regalloc.NumReloads               
                                              lhs                rhs      diff  lhs                 rhs      diff 
FP2017rate/526.blender_r/526.blender_r        13411.00           13430.00  0.1% 27478.00            27509.00  0.1%
INT2017speed/602.gcc_s/602.gcc_s              11376.00           11381.00  0.0% 25795.00            25800.00  0.0%
INT2017rate/502.gcc_r/502.gcc_r               11376.00           11381.00  0.0% 25795.00            25800.00  0.0%
FP2017rate/508.namd_r/508.namd_r               6729.00            6729.00  0.0% 16370.00            16370.00  0.0%
FP2017rate/510.parest_r/510.parest_r          44293.00           44293.00  0.0% 87404.00            87404.00  0.0%
INT2017speed/641.leela_s/641.leela_s            310.00             310.00  0.0%   449.00              449.00  0.0%
INT2017speed/625.x264_s/625.x264_s             2147.00            2147.00  0.0%  4598.00             4598.00  0.0%
INT2017spe...23.xalancbmk_s/623.xalancbmk_s    1822.00            1822.00  0.0%  2969.00             2969.00  0.0%
INT2017spe...ed/620.omnetpp_s/620.omnetpp_s     719.00             719.00  0.0%  1210.00             1210.00  0.0%
INT2017speed/605.mcf_s/605.mcf_s                123.00             123.00  0.0%   372.00              372.00  0.0%
INT2017spe...00.perlbench_s/600.perlbench_s    4375.00            4375.00  0.0%  9740.00             9740.00  0.0%
INT2017rate/557.xz_r/557.xz_r                   300.00             300.00  0.0%   603.00              603.00  0.0%
INT2017rate/541.leela_r/541.leela_r             310.00             310.00  0.0%   449.00              449.00  0.0%
INT2017rate/525.x264_r/525.x264_r              2147.00            2147.00  0.0%  4598.00             4598.00  0.0%
INT2017rat...23.xalancbmk_r/523.xalancbmk_r    1822.00            1822.00  0.0%  2969.00             2969.00  0.0%
INT2017rate/520.omnetpp_r/520.omnetpp_r         719.00             719.00  0.0%  1210.00             1210.00  0.0%
INT2017rate/505.mcf_r/505.mcf_r                 123.00             123.00  0.0%   372.00              372.00  0.0%
INT2017rat...00.perlbench_r/500.perlbench_r    4375.00            4375.00  0.0%  9740.00             9740.00  0.0%
FP2017speed/644.nab_s/644.nab_s                 713.00             713.00  0.0%  1066.00             1066.00  0.0%
FP2017speed/619.lbm_s/619.lbm_s                  88.00              88.00  0.0%    90.00               90.00  0.0%
FP2017rate/544.nab_r/544.nab_r                  713.00             713.00  0.0%  1066.00             1066.00  0.0%
FP2017rate/519.lbm_r/519.lbm_r                   90.00              90.00  0.0%    92.00               92.00  0.0%
FP2017rate/511.povray_r/511.povray_r           1571.00            1571.00  0.0%  3043.00             3043.00  0.0%
INT2017speed/657.xz_s/657.xz_s                  300.00             300.00  0.0%   603.00              603.00  0.0%
FP2017speed/638.imagick_s/638.imagick_s        4074.00            4054.00 -0.5% 10335.00            10452.00  1.1%
FP2017rate/538.imagick_r/538.imagick_r         4074.00            4054.00 -0.5% 10335.00            10452.00  1.1%
INT2017rat...31.deepsjeng_r/531.deepsjeng_r     344.00             341.00 -0.9%   690.00              691.00  0.1%
INT2017spe...31.deepsjeng_s/631.deepsjeng_s     344.00             341.00 -0.9%   690.00              691.00  0.1%
                           Geomean difference                             -0.1%                               0.1%

So I guess that answers my worry that increasing LMUL 8 weights might increase smaller LMUL spills, i.e. it looks like it doesn't!

Thanks for being patient on this for so long, the code changes + test diff LGTM :)

Collaborator

@preames preames left a comment

LGTM

(Make sure you rebase carefully when landing; there are enough test changes that I suspect you'll easily get conflicts.)

Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes llvm#113489.
@wangpc-pp wangpc-pp force-pushed the main-ra-spill-weight branch from b978595 to 46b38e5 Compare March 13, 2025 04:27
@wangpc-pp wangpc-pp merged commit 08a3c53 into llvm:main Mar 13, 2025
6 of 10 checks passed
@wangpc-pp wangpc-pp deleted the main-ra-spill-weight branch March 13, 2025 04:32
@arsenm
Contributor

arsenm commented Mar 13, 2025

I still think we should be making an active effort to just make this the default. I believe this is another case where the target hook is just being used to paper over issues in other areas (I'm not objecting to this as an incremental step).

In particular, in the previous revision of the patch, the AMDGPU behavior showed rematerialization was totally broken. I also think that something is wrong with whatever tablegen is doing to compute the default class weight. We have quite a lot of code scattered in the allocator and tablegen that was never written to account for subregisters. The defaults are wrong for the allocation priorities (which RISCV also does not look like it's trying to set). We also lack a splitting strategy to evict only a subregister when it would help, and spilling induces new liveness to dead lanes.

@wangpc-pp
Contributor Author

> I still think we should be making an active effort to just make this the default. I believe this is another case where the target hook is just being used to paper over issues in other areas (I'm not objecting to this as an incremental step).

Thanks for your valuable review and suggestions! I also believe this should be made the default. But I am not an expert on all targets and not aware of every detail of the other targets, so I can only say the current approach is good for RISC-V.

> In particular, in the previous revision of the patch, the AMDGPU behavior showed rematerialization was totally broken. I also think that something is wrong with whatever tablegen is doing to compute the default class weight. We have quite a lot of code scattered in the allocator and tablegen that was never written to account for subregisters. The defaults are wrong for the allocation priorities (which RISCV also does not look like it's trying to set). We also lack a splitting strategy to evict only a subregister when it would help, and spilling induces new liveness to dead lanes.

Yes, I feel there are some problems when dealing with register pressure. I think AMDGPU/X86 can start to enable this as well, so that we can have more separate data to support making it the default (and of course uncover the issues you mentioned).

Please feel free to add me to review when other targets are going to customize the spill weight.

@preames
Collaborator

preames commented Mar 13, 2025

> I still think we should be making an active effort to just make this the default. I believe this is another case where the target hook is just being used to paper over issues in other areas (I'm not objecting to this as an incremental step).

I agree with this, but FYI, your last clause had gotten lost in previous review comments, at least for me. I do think it's important to highlight that the thing being papered over is the target keeping the old behavior. Long term, I do think all targets should have this particular behavior.

> In particular, in the previous revision of the patch, the AMDGPU behavior showed rematerialization was totally broken. I also think that something is wrong with whatever tablegen is doing to compute the default class weight. We have quite a lot of code scattered in the allocator and tablegen that was never written to account for subregisters. The defaults are wrong for the allocation priorities (which RISCV also does not look like it's trying to set). We also lack a splitting strategy to evict only a subregister when it would help, and spilling induces new liveness to dead lanes.

I've been looking into some of this, and trying to find ways to split things apart into actionable pieces. Incrementalism here is critical for keeping the problem vaguely understandable. However, it sounds like you have a couple of cases identified that I haven't yet stumbled into. Could we compare notes offline?

@preames
Collaborator

preames commented Mar 13, 2025

> The defaults are wrong for the allocation priorities (which RISCV also does not look like it's trying to set).

Thanks for catching this, I hadn't been aware of that mechanism. I've added it in #131176, though it doesn't seem to have too much effect.

@arsenm
Contributor

arsenm commented Mar 14, 2025

> Could we compare notes offline?

Sure, I have a long list of regalloc issues (and each one takes forever to work through).

lukel97 added a commit to lukel97/llvm-project that referenced this pull request Mar 18, 2025
The cost of a vector spill/reload may vary highly depending on the size of the vector register being spilled, i.e. LMUL, so the usual regalloc.NumSpills/regalloc.NumReloads statistics may not be an accurate reflection of the total cost.

This adds two new statistics for RISCVInstrInfo that collect the total LMUL for vector register spills and reloads. They can be used to get a better idea of regalloc changes in e.g. llvm#131176 llvm#113675
frederik-h pushed a commit to frederik-h/llvm-project that referenced this pull request Mar 18, 2025
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes llvm#113489.
lukel97 added a commit that referenced this pull request Mar 19, 2025
The cost of a vector spill/reload may vary highly depending on the size
of the vector register being spilled, i.e. LMUL, so the usual
regalloc.NumSpills/regalloc.NumReloads statistics may not be an accurate
reflection of the total cost.

This adds two new statistics for RISCVInstrInfo that collect the total
number of vector registers spilled/reloaded within groups. They can be
used to get a better idea of regalloc changes in e.g. #131176 #113675
Successfully merging this pull request may close these issues.

RISC-V: RVV register allocation problem causes costly and unnecessary spill