Skip to content

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
camel-cdr opened this issue Oct 23, 2024 · 4 comments · Fixed by #113675 or #114012
Closed

RISC-V: RVV register allocation problem causes costly and unecessary spill #113489

camel-cdr opened this issue Oct 23, 2024 · 4 comments · Fixed by #113675 or #114012
Assignees

Comments

@camel-cdr
Copy link

Hi, I ran into a problem, where reordering a single RVV intrinsic without changing the program logic caused llvm to spill a vector register, suggesting that the register allocation has trouble reordering in this case:

#include <riscv_vector.h>

void test(int *out, const int *in, size_t n)
{
    for (size_t vl; n > 0; n -= vl, out += vl, in += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vint32m8_t v1 = __riscv_vle32_v_i32m8(in, vl);
        vint32m8_t v2 = __riscv_vadd(v1, v1, vl);
        vbool4_t mlt = __riscv_vmslt(v1, 0, vl);

#ifdef REORDER
	vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
        vint32m8_t v3 = __riscv_vadd(v1, 3, vl);
#ifndef REORDER
        vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif

        vbool4_t mgt = __riscv_vmsgt(v1, 4, vl);
        v1 = __riscv_vadd_mu(__riscv_vmor(mlt, mgt, vl), v1, v3, v4, vl);

        __riscv_vse32(out, v1, vl);
    }
}

See also the godbolt link: https://godbolt.org/z/6vdf4vEjn

This example was adapted from real code, and minimized while still retaining the problematic behavior.

gcc manages to figure out the proper register allocation.

@llvmbot
Copy link
Member

llvmbot commented Oct 23, 2024

@llvm/issue-subscribers-backend-risc-v

Author: Camel Coder (camel-cdr)

Hi, I ran into a problem, where reordering a single RVV intrinsic without changing the program logic caused llvm to spill a vector register, suggesting that the register allocation has trouble reordering in this case:
#include &lt;riscv_vector.h&gt;

void test(int *out, const int *in, size_t n)
{
    for (size_t vl; n &gt; 0; n -= vl, out += vl, in += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vint32m8_t v1 = __riscv_vle32_v_i32m8(in, vl);
        vint32m8_t v2 = __riscv_vadd(v1, v1, vl);
        vbool4_t mlt = __riscv_vmslt(v1, 0, vl);

#ifdef REORDER
	vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif
        vint32m8_t v3 = __riscv_vadd(v1, 3, vl);
#ifndef REORDER
        vint32m8_t v4 = __riscv_vmerge(v2, v1, mlt, vl);
#endif

        vbool4_t mgt = __riscv_vmsgt(v1, 4, vl);
        v1 = __riscv_vadd_mu(__riscv_vmor(mlt, mgt, vl), v1, v3, v4, vl);

        __riscv_vse32(out, v1, vl);
    }
}

See also the godbolt link: https://godbolt.org/z/6vdf4vEjn

This example was adapted from real code, and minimized while still retaining the problematic behavior.

gcc manages to figure out the proper register allocation.

@wangpc-pp wangpc-pp self-assigned this Oct 24, 2024
@wangpc-pp
Copy link
Contributor

Thanks for reporting! I will take a look first as others are attending LLVM dev meeting I think.

@wangpc-pp
Copy link
Contributor

wangpc-pp commented Oct 24, 2024

After a rough investigation, the preliminary conclusion is we are extending the live inteval of vmsgt.
Simply disabling misched to avoid reordering, or using topdown scheduling, can generate desired code: https://godbolt.org/z/n3sKzc8E1

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Oct 25, 2024
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical regsiter
compound a bigger logical register), the costs are larger than
`LMUL=1` case (in which, there is only one physical register).

To sovle this problem, a new target hook `getSpillWeightFactor` is
added. Targets can override the default factor (which is 1) according
to the register classes.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track regsiter pressure. The values of `RegClassWeight`
are the number of register units.

I believe all the targets can benefit from this change, but I will
shrink the range of tests to RISC-V only.

Partially fixes llvm#113489.
@wangpc-pp
Copy link
Contributor

I have done a deep investigation, and this problem is much more complicated than what I was thinking.
To make a long story short, there are two things I think we should fix:

  1. LMUL=1 registers have the same logic of calculating the spill weight as LMUL>1 registers. I should have fixed this in [RegAlloc] Scale the spill weight by target factor #113675.
    After this change, the generated code will be:
test:
	beqz	a2, .LBB0_2
.LBB0_1:
	vsetvli	a3, a2, e32, m8, ta, mu
	vle32.v	v16, (a1)
	vadd.vv	v8, v16, v16
	vmslt.vx	v0, v16, zero
	vadd.vi	v24, v16, 3
	vmsgt.vi	v7, v16, 4
	vmor.mm	v7, v0, v7
	vmerge.vvm	v8, v8, v16, v0
	vmv1r.v	v0, v7
	vadd.vv	v16, v24, v8, v0.t
	vse32.v	v16, (a0)
	sub	a2, a2, a3
	slli	a3, a3, 2
	add	a0, a0, a3
	add	a1, a1, a3
	bnez	a2, .LBB0_1
.LBB0_2:
	ret
  1. We shouldn't move masks producing instructions accross masked instructions, or we will extend the live inteval of mask register. For example, we shouldn't move vmor.mm v7, v0, v7 to the front of vmerge.vvm v8, v8, v16, v0. The ideal code should be like:
test:
	beqz	a2, .LBB0_2
.LBB0_1:
	vsetvli	a3, a2, e32, m8, ta, mu
	vle32.v	v16, (a1)
	vadd.vv	v8, v16, v16
	vmslt.vx	v0, v16, zero
	vadd.vi	v24, v16, 3
	vmsgt.vi	v7, v16, 4
-	vmor.mm	v7, v0, v7
	vmerge.vvm	v8, v8, v16, v0
-	vmv1r.v	v0, v7
+	vmor.mm	v0, v0, v7
	vadd.vv	v16, v24, v8, v0.t
	vse32.v	v16, (a0)
	sub	a2, a2, a3
	slli	a3, a3, 2
	add	a0, a0, a3
	add	a1, a1, a3
	bnez	a2, .LBB0_1
.LBB0_2:
	ret

To fix this, I may try to add scheduling DAG mutation that will add a dependency between vmor.mm and vmerge.vvm.

wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Oct 29, 2024
Here we add a scheduling mutation in pre-ra scheduling, which will
adds an artificial dependency edge between mask producer and its
previous nearest instruction that uses V0 register.

This prevents making live intervals of mask registers longer and as
a consequence we can reduce some spills/moves.

From the test changes, we can see some improvements and also some
regressions (more vtype toggles).

Partially fixes llvm#113489.
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Nov 27, 2024
Here we add a scheduling mutation in pre-ra scheduling, which will
adds an artificial dependency edge between mask producer and its
previous nearest instruction that uses V0 register.

This prevents the overlap of live intervals of mask registers and
as a consequence we can reduce some spills/moves.

From the test changes, we can see some improvements and also some
regressions (more vtype toggles).

Partially fixes llvm#113489.
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Nov 29, 2024
Here we add a scheduling mutation in pre-ra scheduling, which will
adds an artificial dependency edge between mask producer and its
previous nearest instruction that uses V0 register.

This prevents the overlap of live intervals of mask registers and
as a consequence we can reduce some spills/moves.

From the test changes, we can see some improvements and also some
regressions (more vtype toggles).

Partially fixes llvm#113489.
wangpc-pp added a commit that referenced this issue Nov 29, 2024
Here we add a scheduling mutation in pre-ra scheduling, which will
add an artificial dependency edge between mask producer and its
previous nearest instruction that uses V0 register.

This prevents the overlap of live intervals of mask registers and
as a consequence we can reduce some spills/moves.

From the test changes, we can see some improvements and also some
regressions (more vtype toggles).

Partially fixes #113489.
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Dec 5, 2024
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical regsiter
compound a bigger logical register), the costs are larger than
`LMUL=1` case (in which, there is only one physical register).

To sovle this problem, a new target hook `getSpillWeightFactor` is
added. Targets can override the default factor (which is 1) according
to the register classes.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track regsiter pressure. The values of `RegClassWeight`
are the number of register units.

I believe all the targets can benefit from this change, but I will
shrink the range of tests to RISC-V only.

Partially fixes llvm#113489.
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Mar 12, 2025
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes llvm#113489.
wangpc-pp added a commit to wangpc-pp/llvm-project that referenced this issue Mar 13, 2025
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes llvm#113489.
wangpc-pp added a commit that referenced this issue Mar 13, 2025
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes #113489.
frederik-h pushed a commit to frederik-h/llvm-project that referenced this issue Mar 18, 2025
Currently, the spill weight is only determined by isDef/isUse and
block frequency. However, for registers with different register
classes, the costs of spilling them are different.

For example, for `LMUL>1` registers (in which, several physical
registers compound a bigger logical register), the costs are larger
than `LMUL=1` case (in which, there is only one physical register).

To solve this problem, a new target hook `getSpillWeightScaleFactor`
is added. Targets can override the default factor (which is `1.0`)
according to the register class.

For RISC-V, the factors are set to the `RegClassWeight` which is
used to track register pressure. The values of `RegClassWeight`
happen to be the number of register units.

I believe all of the targets with compounded registers can benefit
from this change, but only RISC-V is customized in this patch since
it has widely been agreed to do so. The other targets need more
performance data to go further.

Partially fixes llvm#113489.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants