Skip to content

[LoopInterchange] vectorisation opportunity (tsvc, s231) #71519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
sjoerdmeijer opened this issue Nov 7, 2023 · 7 comments
Open

[LoopInterchange] vectorisation opportunity (tsvc, s231) #71519

sjoerdmeijer opened this issue Nov 7, 2023 · 7 comments

Comments

@sjoerdmeijer
Copy link
Collaborator

Looks like we are 1400% (?!) behind for kernel s231 in TSVC compared to GCC.
Compile this code with -O3 -mcpu=neoverse-v2 -ffast-math:

__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s231()
{
    for (int nl = 0; nl < 100*(100000/256); nl++) {
        for (int i = 0; i < 256; ++i) {
            for (int j = 1; j < 256; j++) {
                aa[j][i] = aa[j - 1][i] + bb[j][i];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}

Clang's codegen:

.LBB44_3:                               //   Parent Loop BB44_1 Depth=1
                                        //     Parent Loop BB44_2 Depth=2
                                        // =>    This Inner Loop Header: Depth=3
        add     x12, x21, x10
        add     x13, x20, x10
        subs    x11, x11, #5
        add     x10, x10, x19
        ldr     s1, [x12, #1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #2048]
        str     s0, [x13, #1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #3072]
        str     s0, [x13, #2048]
        fadd    s0, s1, s0
        ldr     s1, [x12, #4096]
        str     s0, [x13, #3072]
        fadd    s0, s1, s0
        ldr     s1, [x12, #5120]
        str     s0, [x13, #4096]
        fadd    s0, s1, s0
        str     s0, [x13, #5120]
        b.ne    .LBB44_3

vs. GCC's codegen:

.L521:
        ldr     q0, [x8, x0]
        ldr     q1, [x2, x0]
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 1024
        bne     .L521

See also:
https://godbolt.org/z/jr9WKW95v

TODO:
root cause analysis.

@llvmbot
Copy link
Member

llvmbot commented Nov 7, 2023

@llvm/issue-subscribers-backend-aarch64

Author: Sjoerd Meijer (sjoerdmeijer)

Looks like we are 1400% (?!) behind for kernel s231 in TSVC compared to GCC. Compile this code with `-O3 -mcpu=neoverse-v2 -ffast-math`:
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];

int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s231()
{
    for (int nl = 0; nl &lt; 100*(100000/256); nl++) {
        for (int i = 0; i &lt; 256; ++i) {
            for (int j = 1; j &lt; 256; j++) {
                aa[j][i] = aa[j - 1][i] + bb[j][i];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}

Clang's codegen:

.LBB44_3:                               //   Parent Loop BB44_1 Depth=1
                                        //     Parent Loop BB44_2 Depth=2
                                        // =&gt;    This Inner Loop Header: Depth=3
        add     x12, x21, x10
        add     x13, x20, x10
        subs    x11, x11, #<!-- -->5
        add     x10, x10, x19
        ldr     s1, [x12, #<!-- -->1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #<!-- -->2048]
        str     s0, [x13, #<!-- -->1024]
        fadd    s0, s1, s0
        ldr     s1, [x12, #<!-- -->3072]
        str     s0, [x13, #<!-- -->2048]
        fadd    s0, s1, s0
        ldr     s1, [x12, #<!-- -->4096]
        str     s0, [x13, #<!-- -->3072]
        fadd    s0, s1, s0
        ldr     s1, [x12, #<!-- -->5120]
        str     s0, [x13, #<!-- -->4096]
        fadd    s0, s1, s0
        str     s0, [x13, #<!-- -->5120]
        b.ne    .LBB44_3

vs. GCC's codegen:

.L521:
        ldr     q0, [x8, x0]
        ldr     q1, [x2, x0]
        fadd    v0.4s, v0.4s, v1.4s
        str     q0, [x1, x0]
        add     x0, x0, 16
        cmp     x0, 1024
        bne     .L521

See also:
https://godbolt.org/z/jr9WKW95v

TODO:
root cause analysis.

@m-sai
Copy link

m-sai commented Dec 30, 2023

The original loop can be vectorized by changing it to a double loop and adding the -enable-loopinterchange option as shown below.

float s231_tmp()
{
        for (int i = 0; i < 256; ++i) {
            for (int j = 1; j < 256; j++) {
                aa[j][i] = aa[j - 1][i] + bb[j][i];
            }
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
}
.LBB0_2:                                // %vector.body
                                        //   Parent Loop BB0_1 Depth=1
                                        // =>  This Inner Loop Header: Depth=2
        ld1w    { z0.s }, p0/z, [x8, x14, lsl #2]
        ld1w    { z1.s }, p0/z, [x9, x14, lsl #2]
        ld1w    { z2.s }, p0/z, [x10, x14, lsl #2]
        add     x15, x8, x14, lsl #2
        add     x16, x9, x14, lsl #2
        fadd    z0.s, z0.s, z2.s
        ld1w    { z3.s }, p0/z, [x11, x14, lsl #2]
        fadd    z1.s, z1.s, z3.s
        inch    x14
        cmp     x14, #256
        st1w    { z0.s }, p0, [x15, x13, lsl #2]
        st1w    { z1.s }, p0, [x16, x13, lsl #2]
        b.ne    .LBB0_2

The original loop cannot be vectorized even with the -enable-loopinterchange option. This is because loop interchange does not work.

The reason loop-interchange doesn't work is because the dependency analysis of the load/store instruction determines that there are dependencies that cannot be loop-interchanged. However, I don't think the original loop has any dependencies, so I think it is a bug in the dependency analysis.

@sjoerdmeijer
Copy link
Collaborator Author

Thanks for the analysis, interesting result/conclusion!

ShivaChen added a commit to ShivaChen/llvm-project that referenced this issue Jan 26, 2024
This commit enables loop-interchange for the case in llvm#71519.
With the loop-interchange, the case can be vectorized.

  for (int nl = 0; nl < 10000000/256; nl++) // Level 1
    for (int i = 0; i < 256; ++i)           // Level 2
      for (int j = 1; j < 256; j++)         // Level 3
        aa[j][i] = aa[j - 1][i] + bb[j][i];

The case can't be interchanged without normalizaion.
normalizaion didn't occur because the direction of level 1 loop dependence
between aa[j][i] and aa[j - 1][i] is default value '*'.

By scanning SCEV form of the pointer of aa[j][i] and aa[j - 1][i], the pass
and determine the IV of loop 1(nl) didn't affect the value of aa[j][i] and
aa[j - 1][i]. And then updating the direction of loop 1 to '=' to enable
the normalization.
ShivaChen added a commit to ShivaChen/llvm-project that referenced this issue Jan 26, 2024
This commit enables loop-interchange for the case in llvm#71519.
With the loop-interchange, the case can be vectorized.

  for (int nl = 0; nl < 10000000/256; nl++) // Level 1
    for (int i = 0; i < 256; ++i)           // Level 2
      for (int j = 1; j < 256; j++)         // Level 3
        aa[j][i] = aa[j - 1][i] + bb[j][i];

The case can't be interchanged without normalizaion.
normalizaion didn't occur because the direction of level 1 loop dependence
between aa[j][i] and aa[j - 1][i] is default value '*'.

By scanning SCEV form of the pointer of aa[j][i] and aa[j - 1][i], the pass
and determine the IV of loop 1(nl) didn't affect the value of aa[j][i] and
aa[j - 1][i]. And then updating the direction of loop 1 to '=' to enable
the normalization.
@sjoerdmeijer sjoerdmeijer changed the title [AArch64] Missed fadd vectorisation opportunity (tsvc, s231) [LoopInterchange] vectorisation opportunity (tsvc, s231) Sep 9, 2024
@sebpop
Copy link
Contributor

sebpop commented Sep 18, 2024

The reason loop-interchange doesn't work is because the dependency analysis of the load/store instruction determines that there are dependencies that cannot be loop-interchanged. However, I don't think the original loop has any dependencies, so I think it is a bug in the dependency analysis.

Dependence analysis is correct.

Following the discussion on PR #78951 (comment) there are dependences carried by the outermost loop.

Loop-interchange needs to focus on the inner two loops.

@kasuga-fj
Copy link
Contributor

Hi. I'm interested in this issue and have been investigating these days. The PR #78951 tries to implement custom normalize function for LoopInterchange. However I think it's better to change isLegalToInterChangeLoops to accept cases where the leftmost < or > is not changed.

static bool isLegalToInterChangeLoops(CharMatrix &DepMatrix,

That is, we don't need to restrict the DepVector to be positive, it's legal if the positive or negative does not change before and after interchange. Looks like the PR I mentioned hasn't made any progress for a while, may I create a new one?

@sjoerdmeijer
Copy link
Collaborator Author

Hi. I'm interested in this issue and have been investigating these days. The PR #78951 tries to implement custom normalize function for LoopInterchange. However I think it's better to change isLegalToInterChangeLoops to accept cases where the leftmost < or > is not changed.

static bool isLegalToInterChangeLoops(CharMatrix &DepMatrix,

That is, we don't need to restrict the DepVector to be positive, it's legal if the positive or negative does not change before and after interchange. Looks like the PR I mentioned hasn't made any progress for a while, may I create a new one?

Sure, please go ahead, and thanks for doing that. If you were going to take the same approach, taking over the patch would perhaps be an option (don't know how this works in github though), but since you have a different approach creating a new PR sounds better. Please feel free to add @sebpop and myself as reviewers (among other folks who might be interested in reviewing this).

@kasuga-fj
Copy link
Contributor

Thanks, then I will submit a new PR (hopefully soon).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants