RISC-V: RVV register allocation problem causes costly and unnecessary spill #113489
@llvm/issue-subscribers-backend-risc-v Author: Camel Coder (camel-cdr)
Hi, I ran into a problem where reordering a single RVV intrinsic, without changing the program logic, caused LLVM to spill a vector register. This suggests that the register allocator has trouble with reordering in this case:
#include <riscv_vector.h>
void test(int *out, const int *in, size_t n)
{
for (size_t vl; n > 0; n -= vl, out += vl, in += vl) {
vl = __riscv_vsetvl_e32m8(n);
vint32m8_t v1 = __riscv_vle32_v_i32m8(in, vl);
vint32m8_t v2 = __riscv_vadd(v1, v1, vl);
vbool4_t mlt = __riscv_vmslt(v1, 0, vl);
#ifdef REORDER
vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
vint32m8_t v3 = __riscv_vadd(v1, 3, vl);
#ifndef REORDER
vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
vbool4_t mgt = __riscv_vmsgt(v1, 4, vl);
v1 = __riscv_vadd_mu(__riscv_vmor(mlt, mgt, vl), v1, v3, v4, vl);
__riscv_vse32(out, v1, vl);
}
}
See also the godbolt link: https://godbolt.org/z/6vdf4vEjn
This example was adapted from real code, and minimized while still retaining the problematic behavior.
gcc manages to figure out the proper register allocation.
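For readers less familiar with the intrinsics, here is a scalar sketch of what the loop appears to compute per element, which also shows why moving the `vmerge` cannot change the result: `v4` depends only on `v1`, `v2` and `mlt`. The names `test_element` and `test_scalar` are made up for this illustration, and the mapping assumes the usual semantics of `vmerge` (take the second operand where the mask is set) and of the `_mu` policy (keep the old destination value where the mask is clear).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one element of the loop body (hypothetical helper).
   Unsigned arithmetic is used so that overflow wraps, as the vector adds do,
   instead of being undefined behavior. */
int32_t test_element(int32_t x)
{
    uint32_t ux = (uint32_t)x;
    uint32_t v2 = ux + ux;          /* v2  = vadd(v1, v1)        */
    int mlt = x < 0;                /* mlt = vmslt(v1, 0)        */
    uint32_t v3 = ux + 3u;          /* v3  = vadd(v1, 3)         */
    uint32_t v4 = mlt ? ux : v2;    /* v4  = vmerge(v2, v1, mlt) */
    int mgt = x > 4;                /* mgt = vmsgt(v1, 4)        */
    /* vadd_mu(mlt | mgt, v1, v3, v4): inactive elements keep v1. */
    return (mlt || mgt) ? (int32_t)(v3 + v4) : x;
}

/* Scalar counterpart of test(): applies the element model to n values. */
void test_scalar(int *out, const int *in, size_t n)
{
    assert(n == 0 || (out != NULL && in != NULL));
    for (size_t i = 0; i < n; i++)
        out[i] = (int)test_element(in[i]);
}
```

Since the statement computing `v4` reads nothing that `v3` writes, both orderings selected by `REORDER` give the same values; only the instruction order seen by the backend differs.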
Thanks for reporting! I will take a look first, as I think the others are attending the LLVM dev meeting.
After a rough investigation, the preliminary conclusion is that we are extending the live interval of
Currently, the spill weight is determined only by isDef/isUse and block frequency. However, the cost of spilling a register differs between register classes. For example, spilling an `LMUL>1` register (in which several physical registers are grouped into one larger logical register) costs more than spilling an `LMUL=1` register (which is a single physical register). To solve this problem, a new target hook `getSpillWeightFactor` is added. Targets can override the default factor (which is 1) according to the register class. For RISC-V, the factors are set to the `RegClassWeight`, which is used to track register pressure. The values of `RegClassWeight` are the number of register units. I believe all targets can benefit from this change, but I will limit the tests to RISC-V only. Partially fixes llvm#113489.
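To make the effect of the factor concrete, here is a toy model in C. It is not LLVM code: `reg_access`, `spill_weight_factor` and `total_spill_weight` are hypothetical names, the weight formula is simplified to what the description above states (isDef/isUse times block frequency), and the factor is assumed to equal the number of register units of the class.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one def or use of a virtual register. */
struct reg_access {
    bool is_def;
    bool is_use;
    double block_freq;  /* relative frequency of the enclosing block */
};

/* Hypothetical stand-in for the target hook: here the factor is simply the
   number of register units (assumed 1 for LMUL=1, 8 for LMUL=8). */
double spill_weight_factor(unsigned reg_units)
{
    assert(reg_units >= 1);
    return (double)reg_units;
}

/* Sum of (isDef + isUse) * block frequency over all accesses, scaled by the
   per-class factor. */
double total_spill_weight(const struct reg_access *accesses, size_t n,
                          unsigned reg_units)
{
    double factor = spill_weight_factor(reg_units);
    double weight = 0.0;
    for (size_t i = 0; i < n; i++) {
        double refs = (accesses[i].is_def ? 1.0 : 0.0) +
                      (accesses[i].is_use ? 1.0 : 0.0);
        weight += refs * accesses[i].block_freq * factor;
    }
    return weight;
}
```

With one def and one use in a block of frequency 10, this model gives a weight of 20 for a one-unit register and 160 for an eight-unit register, so the allocator would be less inclined to pick the wide register as the spill candidate.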
I have done a deeper investigation, and this problem is much more complicated than I thought.
test:
beqz a2, .LBB0_2
.LBB0_1:
vsetvli a3, a2, e32, m8, ta, mu
vle32.v v16, (a1)
vadd.vv v8, v16, v16
vmslt.vx v0, v16, zero
vadd.vi v24, v16, 3
vmsgt.vi v7, v16, 4
vmor.mm v7, v0, v7
vmerge.vvm v8, v8, v16, v0
vmv1r.v v0, v7
vadd.vv v16, v24, v8, v0.t
vse32.v v16, (a0)
sub a2, a2, a3
slli a3, a3, 2
add a0, a0, a3
add a1, a1, a3
bnez a2, .LBB0_1
.LBB0_2:
ret
test:
beqz a2, .LBB0_2
.LBB0_1:
vsetvli a3, a2, e32, m8, ta, mu
vle32.v v16, (a1)
vadd.vv v8, v16, v16
vmslt.vx v0, v16, zero
vadd.vi v24, v16, 3
vmsgt.vi v7, v16, 4
- vmor.mm v7, v0, v7
vmerge.vvm v8, v8, v16, v0
- vmv1r.v v0, v7
+ vmor.mm v0, v0, v7
vadd.vv v16, v24, v8, v0.t
vse32.v v16, (a0)
sub a2, a2, a3
slli a3, a3, 2
add a0, a0, a3
add a1, a1, a3
bnez a2, .LBB0_1
.LBB0_2:
ret
To fix this, I may try to add a scheduling DAG mutation that will add a dependency between
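The two listings differ only in whether `vmor.mm` is placed before or after `vmerge.vvm`. A rough way to see why this matters is to compare live ranges of the two mask values that both want `v0`. The sketch below is my own toy model, not allocator code: `live_range` and `ranges_overlap` are hypothetical names, and instruction indices are counted from the `vsetvli` (index 0) within the loop body.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy live range: the value is defined at instruction index `def` and read
   for the last time at index `last_use`. */
struct live_range {
    int def;
    int last_use;
};

/* Two values conflict if each is defined strictly before the other dies.
   A value defined by the same instruction that performs the last read of
   the other one does not conflict, so both may share a register. */
bool ranges_overlap(struct live_range a, struct live_range b)
{
    assert(a.def <= a.last_use && b.def <= b.last_use);
    return a.def < b.last_use && b.def < a.last_use;
}
```

In the first listing, `mlt` lives from `vmslt` (3) to `vmerge` (7), while the `vmor` result lives from 6 to the masked `vadd` (9). The ranges overlap, so the result cannot be placed in `v0` directly and a `vmv1r.v` is needed. In the second listing, `vmor` (7) is the last reader of `mlt`, and its result lives from 7 to 8, so `v0` can be reused.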
Here we add a scheduling mutation to pre-RA scheduling that adds an artificial dependency edge between a mask producer and the nearest previous instruction that uses the V0 register. This avoids lengthening the live intervals of mask registers, which in turn reduces some spills and moves. The test changes show some improvements and also some regressions (more vtype toggles). Partially fixes llvm#113489.
Here we add a scheduling mutation to pre-RA scheduling that adds an artificial dependency edge between a mask producer and the nearest previous instruction that uses the V0 register. This prevents the live intervals of mask registers from overlapping, which in turn reduces some spills and moves. The test changes show some improvements and also some regressions (more vtype toggles). Partially fixes llvm#113489.
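The rule described above can be sketched on a flat instruction list. This is a simplified model and not the code of the patch: `toy_instr` and `add_mask_edges` are hypothetical names, and an "edge" is recorded as the index of the instruction that a mask producer must not be scheduled ahead of.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction: only the two properties the mutation looks at. */
struct toy_instr {
    bool uses_v0;        /* reads the V0 mask register */
    bool produces_mask;  /* defines a mask value       */
};

/* For every mask producer, record the nearest earlier V0 user in pred[i]
   (or -1 if there is none). Returns the number of edges added. */
size_t add_mask_edges(const struct toy_instr *instrs, size_t n, int *pred)
{
    assert(n == 0 || (instrs != NULL && pred != NULL));
    int nearest_v0_user = -1;
    size_t edges = 0;
    for (size_t i = 0; i < n; i++) {
        pred[i] = -1;
        /* Artificial edge: the producer has to stay after that user. */
        if (instrs[i].produces_mask && nearest_v0_user >= 0) {
            pred[i] = nearest_v0_user;
            edges++;
        }
        /* Update after the check so an instruction never points to itself. */
        if (instrs[i].uses_v0)
            nearest_v0_user = (int)i;
    }
    return edges;
}
```

For a sequence "producer, V0 user, producer, V0 user", only the second producer gets an edge, pointing at the first V0 user. The scheduler can then no longer hoist it above that user, which is the kind of motion that made two mask values live at once.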
Currently, the spill weight is determined only by isDef/isUse and block frequency. However, the cost of spilling a register differs between register classes. For example, spilling an `LMUL>1` register (in which several physical registers are grouped into one larger logical register) costs more than spilling an `LMUL=1` register (which is a single physical register). To solve this problem, a new target hook `getSpillWeightScaleFactor` is added. Targets can override the default factor (which is `1.0`) according to the register class. For RISC-V, the factors are set to the `RegClassWeight`, which is used to track register pressure. The values of `RegClassWeight` happen to be the number of register units. I believe all targets with compound registers can benefit from this change, but only RISC-V is customized in this patch, since there is broad agreement to do so. The other targets need more performance data before going further. Partially fixes llvm#113489.