[LoopInterchange] vectorisation opportunity (tsvc, s231) #71519
@llvm/issue-subscribers-backend-aarch64 Author: Sjoerd Meijer (sjoerdmeijer)
The original loop can be vectorized by changing it to a double loop and adding the

    float s231_tmp()
    {
      for (int i = 0; i < 256; ++i) {
        for (int j = 1; j < 256; j++) {
          aa[j][i] = aa[j - 1][i] + bb[j][i];
        }
      }
      dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }

    .LBB0_2:                                // %vector.body
                                            // Parent Loop BB0_1 Depth=1
                                            // => This Inner Loop Header: Depth=2
            ld1w    { z0.s }, p0/z, [x8, x14, lsl #2]
            ld1w    { z1.s }, p0/z, [x9, x14, lsl #2]
            ld1w    { z2.s }, p0/z, [x10, x14, lsl #2]
            add     x15, x8, x14, lsl #2
            add     x16, x9, x14, lsl #2
            fadd    z0.s, z0.s, z2.s
            ld1w    { z3.s }, p0/z, [x11, x14, lsl #2]
            fadd    z1.s, z1.s, z3.s
            inch    x14
            cmp     x14, #256
            st1w    { z0.s }, p0, [x15, x13, lsl #2]
            st1w    { z1.s }, p0, [x16, x13, lsl #2]
            b.ne    .LBB0_2

The original loop cannot be vectorized even with the

The reason loop-interchange doesn't work is that the dependency analysis of the load/store instructions reports dependencies that prevent the loops from being interchanged. However, I don't think the original loop has any such dependencies, so I think it is a bug in the dependency analysis.
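To make the interchange concrete, here is a minimal C sketch of the double loop from this comment in both loop orders, with a check that they produce identical results. The names `s231_original`, `s231_interchanged`, and `s231_orders_agree` are made up for illustration and are not part of TSVC or LLVM; the arrays are passed as parameters rather than the globals used by the benchmark.

```c
#include <string.h>

#define N 256 /* matches the trip count used in the issue */

/* Original order: i outer, j inner. The inner loop walks down a column
   (stride N) and carries the aa[j - 1][i] -> aa[j][i] dependence. */
void s231_original(float aa[N][N], float bb[N][N]) {
    for (int i = 0; i < N; ++i)
        for (int j = 1; j < N; ++j)
            aa[j][i] = aa[j - 1][i] + bb[j][i];
}

/* Interchanged order: j outer, i inner. The inner loop is unit stride and
   each inner iteration touches a different column, so the dependence is
   carried by the outer loop only. */
void s231_interchanged(float aa[N][N], float bb[N][N]) {
    for (int j = 1; j < N; ++j)
        for (int i = 0; i < N; ++i)
            aa[j][i] = aa[j - 1][i] + bb[j][i];
}

/* Hypothetical helper: runs both orders on identical input and returns 1
   if the results match bit for bit. */
int s231_orders_agree(void) {
    static float a1[N][N], a2[N][N], b[N][N];
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) {
            a1[j][i] = a2[j][i] = (float)((i * 7 + j * 3) % 11);
            b[j][i] = (float)((i + 2 * j) % 5);
        }
    s231_original(a1, b);
    s231_interchanged(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Each column is computed by the same chain of additions in both orders, so the comparison is exact even without `-ffast-math`.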
Thanks for the analysis, interesting result/conclusion!
This commit enables loop-interchange for the case in llvm#71519. With loop-interchange, the case can be vectorized.

    for (int nl = 0; nl < 10000000/256; nl++)   // Level 1
      for (int i = 0; i < 256; ++i)             // Level 2
        for (int j = 1; j < 256; j++)           // Level 3
          aa[j][i] = aa[j - 1][i] + bb[j][i];

The case can't be interchanged without normalization. Normalization didn't occur because the direction of the level 1 loop dependence between aa[j][i] and aa[j - 1][i] is the default value '*'. By scanning the SCEV form of the pointers to aa[j][i] and aa[j - 1][i], the pass can determine that the IV of loop 1 (nl) doesn't affect the value of aa[j][i] and aa[j - 1][i]. It then updates the direction of loop 1 to '=' to enable the normalization.
Dependence analysis is correct. Following the discussion on PR #78951 (comment), there are dependences carried by the outermost loop. Loop-interchange needs to focus on the inner two loops.
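The claim about the outermost loop can be checked by brute force on a scaled-down version of the three-level nest. The sketch below is an illustration only, not how LLVM's DependenceAnalysis works: it enumerates every store to `aa[j][i]` and every load of `aa[j - 1][i]`, and records whether the two touch the same element in different `nl` iterations. `DepSummary` and `s231_dependences` are hypothetical names, and the trip counts are parameters so the enumeration stays cheap.

```c
#include <stdbool.h>

typedef struct {
    bool carried_by_nl;   /* a store and a load of one element fall in different nl iterations */
    bool inner_all_eq_lt; /* within one nl iteration, every store->load pair has (i, j) direction ('=', '<') */
} DepSummary;

/* Brute-force enumeration over a nest with nl_trip outer iterations and
   n x n inner iterations (j starts at 1, as in s231). */
DepSummary s231_dependences(int nl_trip, int n) {
    DepSummary s = { false, true };
    for (int nl1 = 0; nl1 < nl_trip; ++nl1)
    for (int i1 = 0; i1 < n; ++i1)
    for (int j1 = 1; j1 < n; ++j1)            /* store to aa[j1][i1] */
        for (int nl2 = 0; nl2 < nl_trip; ++nl2)
        for (int i2 = 0; i2 < n; ++i2)
        for (int j2 = 1; j2 < n; ++j2) {      /* load of aa[j2 - 1][i2] */
            if (i1 != i2 || j1 != j2 - 1)
                continue;                     /* different element, no dependence */
            if (nl1 != nl2) {
                s.carried_by_nl = true;       /* same element revisited in another nl iteration */
                continue;
            }
            if (!(i2 == i1 && j2 > j1))
                s.inner_all_eq_lt = false;
        }
    return s;
}
```

With more than one `nl` iteration, `carried_by_nl` comes out true because the subscripts do not involve `nl`, so every element is revisited on each outer iteration. Restricted to a single `nl` iteration, the only direction for the inner two loops is ('=', '<').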
Hi. I'm interested in this issue and have been investigating it recently. PR #78951 tries to implement custom
That is, we don't need to restrict the DepVector to be positive; the interchange is legal if the sign (positive or negative) of the DepVector does not change before and after interchange. It looks like the PR I mentioned hasn't made any progress for a while. May I create a new one?
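The sign-preservation idea can be sketched as follows. This is a simplified illustration of the rule as stated in this comment, not LLVM's LoopInterchange legality code. Direction vectors are written as strings over '<', '=', '>' and '*', and `lex_sign` and `interchange_keeps_sign` are hypothetical names. A '*' met before the first '<' or '>' is treated as unknown and conservatively rejected, which is an assumption of this sketch.

```c
#include <string.h>

/* Lexicographic sign of a direction vector:
   +1 if the first non-'=' entry is '<', -1 if it is '>',
    0 if every entry is '=', 2 if an unknown entry (e.g. '*') comes first. */
int lex_sign(const char *dir, int n) {
    for (int k = 0; k < n; ++k) {
        if (dir[k] == '<') return 1;
        if (dir[k] == '>') return -1;
        if (dir[k] != '=') return 2;
    }
    return 0;
}

/* Returns 1 if swapping loop levels a and b leaves the lexicographic sign
   of the direction vector unchanged, 0 otherwise (including when the sign
   is unknown before or after the swap). Supports up to 8 levels. */
int interchange_keeps_sign(const char *dir, int n, int a, int b) {
    char swapped[8];
    if (n < 1 || n > 8 || a < 0 || b < 0 || a >= n || b >= n)
        return 0;
    memcpy(swapped, dir, (size_t)n);
    swapped[a] = dir[b];
    swapped[b] = dir[a];
    int before = lex_sign(dir, n);
    int after = lex_sign(swapped, n);
    if (before == 2 || after == 2)
        return 0;
    return before == after;
}
```

For the s231 nest, a vector of `"*=<"` is rejected because of the leading '*', while `"==<"` with the inner two levels swapped keeps its positive sign. A vector such as `"=>"` is also accepted under this rule, since it is negative both before and after the swap.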
Sure, please go ahead, and thanks for doing that. If you were going to take the same approach, taking over the patch would perhaps be an option (I don't know how this works in GitHub though), but since you have a different approach, creating a new PR sounds better. Please feel free to add @sebpop and myself as reviewers (among other folks who might be interested in reviewing this).
Thanks, then I will submit a new PR (hopefully soon).
Looks like we are 1400% (?!) behind for kernel s231 in TSVC compared to GCC.
Compile this code with `-O3 -mcpu=neoverse-v2 -ffast-math`:
Clang's codegen:
vs. GCC's codegen:
See also: https://godbolt.org/z/jr9WKW95v
TODO: root cause analysis.