Clang is well behind GCC on this loop. Compile the following input with -O3 -mcpu=neoverse-v2 -ffast-math:
__attribute__((aligned(64))) float a[32000],b[32000],c[32000],d[32000],e[32000],
                                   aa[256][256],bb[256][256],cc[256][256],tt[256][256];
int dummy(float[32000], float[32000], float[32000], float[32000], float[32000], float[256][256], float[256][256], float[256][256], float);

float s173()
{
    int k = 32000/2;
    for (int nl = 0; nl < 10*100000; nl++) {
        for (int i = 0; i < 32000/2; i++) {
            a[i+k] = a[i] + b[i];
        }
        dummy(a, b, c, d, e, aa, bb, cc, 0.);
    }
}
Clang's codegen:
.LBB0_3: // Parent Loop BB0_2 Depth=1
add x9, x19, x8, lsl #2
add x10, x20, x8, lsl #2
ld1w { z0.s }, p0/z, [x19, x8, lsl #2]
ld1w { z2.s }, p0/z, [x20, x8, lsl #2]
add x8, x8, x21
ld1w { z1.s }, p0/z, [x9, x28, lsl #2]
ld1w { z3.s }, p0/z, [x10, x28, lsl #2]
add x10, x9, x26
cmp x8, x22
fadd z0.s, z2.s, z0.s
fadd z1.s, z3.s, z1.s
st1w { z0.s }, p0, [x9, x23, lsl #2]
st1w { z1.s }, p0, [x10, x28, lsl #2]
b.ne .LBB0_3
vs. GCC's codegen:
.L3:
ldr q31, [x20, x0]
ldr q30, [x19, x0]
fadd v31.4s, v31.4s, v30.4s
str q31, [x21, x0]
add x0, x0, 16
cmp x0, x28
bne .L3
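For reference, with k = 32000/2 the inner loop reads a[0..15999] and b[0..15999] and writes a[16000..31999], so within one pass no store ever lands on an element that a later iteration loads. Below is a minimal sketch for exercising that inner loop in isolation. It is not part of the original reproducer: LEN, s173_kernel, init_arrays and count_mismatches are hypothetical names introduced here, the dummy() call and the outer repeat loop are dropped, and only the two arrays the loop touches are kept.

```c
#define LEN 32000

/* Same alignment as the reproducer, reduced to the two arrays the loop uses. */
__attribute__((aligned(64))) float a[LEN], b[LEN];

/* One pass of the inner loop from s173 (hypothetical wrapper name).
 * Reads a[0..k-1] and b[0..k-1], writes a[k..2k-1]. */
void s173_kernel(void)
{
    int k = LEN / 2;
    for (int i = 0; i < LEN / 2; i++) {
        a[i + k] = a[i] + b[i];
    }
}

/* Hypothetical helper: fill both arrays with small integers, which are
 * exactly representable as float, so the sums below are exact. */
void init_arrays(void)
{
    for (int i = 0; i < LEN; i++) {
        a[i] = (float)(i % 7);
        b[i] = (float)(i % 5);
    }
}

/* Hypothetical helper: after init_arrays() and one s173_kernel() pass,
 * count upper-half elements that differ from the expected sum. */
int count_mismatches(void)
{
    int k = LEN / 2;
    int bad = 0;
    for (int i = 0; i < k; i++) {
        float expected = (float)(i % 7) + (float)(i % 5);
        if (a[i + k] != expected) {
            bad++;
        }
    }
    return bad;
}
```

Calling init_arrays(), then s173_kernel(), then count_mismatches() from a small main() should give 0 mismatches with either compiler. This only checks that the generated loop computes the right values; it says nothing about why the two compilers pick different vectorization strategies.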
See also:
https://godbolt.org/z/9zs65h3aq
Might be caused by the same underlying issue as:
#71524