RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

camel-cdr · 2024-10-23T19:50:56Z

Hi, I ran into a problem, where reordering a single RVV intrinsic without changing the program logic caused llvm to spill a vector register, suggesting that the register allocation has trouble reordering in this case:

#include <riscv_vector.h>

void test(int *out, const int *in, size_t n)
{
    for (size_t vl; n > 0; n -= vl, out += vl, in += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vint32m8_t v1 = __riscv_vle32_v_i32m8(in, vl);
        vint32m8_t v2 = __riscv_vadd(v1, v1, vl);
        vbool4_t mlt = __riscv_vmslt(v1, 0, vl);

#ifdef REORDER
	vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
        vint32m8_t v3 = __riscv_vadd(v1, 3, vl);
#ifndef REORDER
        vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif

        vbool4_t mgt = __riscv_vmsgt(v1, 4, vl);
        v1 = __riscv_vadd_mu(__riscv_vmor(mlt, mgt, vl), v1, v3, v4, vl);

        __riscv_vse32(out, v1, vl);
    }
}

See also the godbolt link: https://godbolt.org/z/6vdf4vEjn

This example was adapted from real code, and minimized while still retaining the problematic behavior.

gcc manages to figure out the proper register allocation.

llvmbot · 2024-10-23T21:17:26Z

@llvm/issue-subscribers-backend-risc-v

Author: Camel Coder (camel-cdr)

Hi, I ran into a problem, where reordering a single RVV intrinsic without changing the program logic caused llvm to spill a vector register, suggesting that the register allocation has trouble reordering in this case:

#include &lt;riscv_vector.h&gt;

void test(int *out, const int *in, size_t n)
{
    for (size_t vl; n &gt; 0; n -= vl, out += vl, in += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vint32m8_t v1 = __riscv_vle32_v_i32m8(in, vl);
        vint32m8_t v2 = __riscv_vadd(v1, v1, vl);
        vbool4_t mlt = __riscv_vmslt(v1, 0, vl);

#ifdef REORDER
	vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
        vint32m8_t v3 = __riscv_vadd(v1, 3, vl);
#ifndef REORDER
        vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif

        vbool4_t mgt = __riscv_vmsgt(v1, 4, vl);
        v1 = __riscv_vadd_mu(__riscv_vmor(mlt, mgt, vl), v1, v3, v4, vl);

        __riscv_vse32(out, v1, vl);
    }
}

See also the godbolt link: https://godbolt.org/z/6vdf4vEjn

This example was adapted from real code, and minimized while still retaining the problematic behavior.

gcc manages to figure out the proper register allocation.

wangpc-pp · 2024-10-24T03:18:32Z

Thanks for reporting! I will take a look first as others are attending LLVM dev meeting I think.

wangpc-pp · 2024-10-24T04:35:49Z

After a rough investigation, the preliminary conclusion is we are extending the live inteval of vmsgt.
Simply disabling misched to avoid reordering, or using topdown scheduling, can generate desired code: https://godbolt.org/z/n3sKzc8E1

Currently, the spill weight is only determined by isDef/isUse and block frequency. However, for registers with different register classes, the costs of spilling them are different. For example, for `LMUL>1` registers (in which, several physical regsiter compound a bigger logical register), the costs are larger than `LMUL=1` case (in which, there is only one physical register). To sovle this problem, a new target hook `getSpillWeightFactor` is added. Targets can override the default factor (which is 1) according to the register classes. For RISC-V, the factors are set to the `RegClassWeight` which is used to track regsiter pressure. The values of `RegClassWeight` are the number of register units. I believe all the targets can benefit from this change, but I will shrink the range of tests to RISC-V only. Partially fixes llvm#113489.

wangpc-pp · 2024-10-25T10:59:54Z

I have done a deep investigation, and this problem is much more complicated than what I was thinking.
To make a long story short, there are two things I think we should fix:

LMUL=1 registers have the same logic of calculating the spill weight as LMUL>1 registers. I should have fixed this in [RegAlloc] Scale the spill weight by target factor #113675.
After this change, the generated code will be:

test:
	beqz	a2, .LBB0_2
.LBB0_1:
	vsetvli	a3, a2, e32, m8, ta, mu
	vle32.v	v16, (a1)
	vadd.vv	v8, v16, v16
	vmslt.vx	v0, v16, zero
	vadd.vi	v24, v16, 3
	vmsgt.vi	v7, v16, 4
	vmor.mm	v7, v0, v7
	vmerge.vvm	v8, v8, v16, v0
	vmv1r.v	v0, v7
	vadd.vv	v16, v24, v8, v0.t
	vse32.v	v16, (a0)
	sub	a2, a2, a3
	slli	a3, a3, 2
	add	a0, a0, a3
	add	a1, a1, a3
	bnez	a2, .LBB0_1
.LBB0_2:
	ret

We shouldn't move masks producing instructions accross masked instructions, or we will extend the live inteval of mask register. For example, we shouldn't move vmor.mm v7, v0, v7 to the front of vmerge.vvm v8, v8, v16, v0. The ideal code should be like:

test:
	beqz	a2, .LBB0_2
.LBB0_1:
	vsetvli	a3, a2, e32, m8, ta, mu
	vle32.v	v16, (a1)
	vadd.vv	v8, v16, v16
	vmslt.vx	v0, v16, zero
	vadd.vi	v24, v16, 3
	vmsgt.vi	v7, v16, 4
-	vmor.mm	v7, v0, v7
	vmerge.vvm	v8, v8, v16, v0
-	vmv1r.v	v0, v7
+	vmor.mm	v0, v0, v7
	vadd.vv	v16, v24, v8, v0.t
	vse32.v	v16, (a0)
	sub	a2, a2, a3
	slli	a3, a3, 2
	add	a0, a0, a3
	add	a1, a1, a3
	bnez	a2, .LBB0_1
.LBB0_2:
	ret

To fix this, I may try to add scheduling DAG mutation that will add a dependency between vmor.mm and vmerge.vvm.

Here we add a scheduling mutation in pre-ra scheduling, which will adds an artificial dependency edge between mask producer and its previous nearest instruction that uses V0 register. This prevents making live intervals of mask registers longer and as a consequence we can reduce some spills/moves. From the test changes, we can see some improvements and also some regressions (more vtype toggles). Partially fixes llvm#113489.

Here we add a scheduling mutation in pre-ra scheduling, which will adds an artificial dependency edge between mask producer and its previous nearest instruction that uses V0 register. This prevents the overlap of live intervals of mask registers and as a consequence we can reduce some spills/moves. From the test changes, we can see some improvements and also some regressions (more vtype toggles). Partially fixes llvm#113489.

Here we add a scheduling mutation in pre-ra scheduling, which will add an artificial dependency edge between mask producer and its previous nearest instruction that uses V0 register. This prevents the overlap of live intervals of mask registers and as a consequence we can reduce some spills/moves. From the test changes, we can see some improvements and also some regressions (more vtype toggles). Partially fixes #113489.

Currently, the spill weight is only determined by isDef/isUse and block frequency. However, for registers with different register classes, the costs of spilling them are different. For example, for `LMUL>1` registers (in which, several physical regsiter compound a bigger logical register), the costs are larger than `LMUL=1` case (in which, there is only one physical register). To sovle this problem, a new target hook `getSpillWeightFactor` is added. Targets can override the default factor (which is 1) according to the register classes. For RISC-V, the factors are set to the `RegClassWeight` which is used to track regsiter pressure. The values of `RegClassWeight` are the number of register units. I believe all the targets can benefit from this change, but I will shrink the range of tests to RISC-V only. Partially fixes llvm#113489.

Currently, the spill weight is only determined by isDef/isUse and block frequency. However, for registers with different register classes, the costs of spilling them are different. For example, for `LMUL>1` registers (in which, several physical registers compound a bigger logical register), the costs are larger than `LMUL=1` case (in which, there is only one physical register). To solve this problem, a new target hook `getSpillWeightScaleFactor` is added. Targets can override the default factor (which is `1.0`) according to the register class. For RISC-V, the factors are set to the `RegClassWeight` which is used to track register pressure. The values of `RegClassWeight` happen to be the number of register units. I believe all of the targets with compounded registers can benefit from this change, but only RISC-V is customized in this patch since it has widely been agreed to do so. The other targets need more performance data to go further. Partially fixes llvm#113489.

Currently, the spill weight is only determined by isDef/isUse and block frequency. However, for registers with different register classes, the costs of spilling them are different. For example, for `LMUL>1` registers (in which, several physical registers compound a bigger logical register), the costs are larger than `LMUL=1` case (in which, there is only one physical register). To solve this problem, a new target hook `getSpillWeightScaleFactor` is added. Targets can override the default factor (which is `1.0`) according to the register class. For RISC-V, the factors are set to the `RegClassWeight` which is used to track register pressure. The values of `RegClassWeight` happen to be the number of register units. I believe all of the targets with compounded registers can benefit from this change, but only RISC-V is customized in this patch since it has widely been agreed to do so. The other targets need more performance data to go further. Partially fixes #113489.

Currently, the spill weight is only determined by isDef/isUse and block frequency. However, for registers with different register classes, the costs of spilling them are different. For example, for `LMUL>1` registers (in which, several physical registers compound a bigger logical register), the costs are larger than `LMUL=1` case (in which, there is only one physical register). To solve this problem, a new target hook `getSpillWeightScaleFactor` is added. Targets can override the default factor (which is `1.0`) according to the register class. For RISC-V, the factors are set to the `RegClassWeight` which is used to track register pressure. The values of `RegClassWeight` happen to be the number of register units. I believe all of the targets with compounded registers can benefit from this change, but only RISC-V is customized in this patch since it has widely been agreed to do so. The other targets need more performance data to go further. Partially fixes llvm#113489.

github-actions bot added the new issue label Oct 23, 2024

EugeneZelenko added backend:RISC-V and removed new issue labels Oct 23, 2024

wangpc-pp self-assigned this Oct 24, 2024

wangpc-pp mentioned this issue Oct 25, 2024

[RegAlloc] Scale the spill weight by target factor #113675

Merged

wangpc-pp mentioned this issue Oct 29, 2024

[RISCV] Set a barrier between mask producer and user of V0 #114012

Merged

wangpc-pp closed this as completed in #114012 Nov 29, 2024

wangpc-pp mentioned this issue Mar 14, 2025

[RISCV] Set AllocationPriority in line with LMUL #131176

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

camel-cdr commented Oct 23, 2024

llvmbot commented Oct 23, 2024

Uh oh!

wangpc-pp commented Oct 24, 2024

Uh oh!

wangpc-pp commented Oct 24, 2024 •

edited

Loading

Uh oh!

wangpc-pp commented Oct 25, 2024

Uh oh!

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

Comments

camel-cdr commented Oct 23, 2024

llvmbot commented Oct 23, 2024

Uh oh!

wangpc-pp commented Oct 24, 2024

Uh oh!

wangpc-pp commented Oct 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wangpc-pp commented Oct 25, 2024

Uh oh!

wangpc-pp commented Oct 24, 2024 •

edited

Loading