Description
I tried this code:
use std::convert::identity;

#[unsafe(no_mangle)]
pub fn compute(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        || {
            //let scalar = scalar;
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        }
    );
    closure();
}
I expected to see this happen: For the baseline x86-64 target, the loop should be vectorized to load, multiply, and store at least 4 floats at a time. If the commented line is uncommented, this does happen and the code processes 8 floats (32 bytes) each iteration:
    movaps xmm1, xmm0
    shufps xmm1, xmm0, 0    # xmm1 = xmm1[0,0],xmm0[0,0]
    xor r8d, r8d
.LBB1_3:    # =>This Inner Loop Header: Depth=1
    movups xmm2, xmmword ptr [rdx + 8*r8]
    movups xmm3, xmmword ptr [rdx + 8*r8 + 16]
    mulps xmm2, xmm1
    movups xmmword ptr [rdx + 8*r8], xmm2
    mulps xmm3, xmm1
    movups xmmword ptr [rdx + 8*r8 + 16], xmm3
    add r8, 4
    cmp rdi, r8
    jne .LBB1_3
Instead, this happened: The loop is vectorized (to process 4 floats each iteration), but it reloads `scalar` from memory every single iteration.
.LBB1_4:    # =>This Inner Loop Header: Depth=1
    movups xmm0, xmmword ptr [rdx + 8*r9]
    movss xmm1, dword ptr [rax]    # xmm1 = mem[0],zero,zero,zero
    shufps xmm1, xmm1, 0    # xmm1 = xmm1[0,0,0,0]
    mulps xmm1, xmm0
    movups xmmword ptr [rdx + 8*r9], xmm1
    add r9, 2
    cmp r8, r9
    jne .LBB1_4
`xmm1` is reloaded from `[rax]` every iteration, even though `rax` is not modified inside the loop.
This problem appears to be a combination of:
- Processing elements of type `[f32; 2]` in the loop (`f32` does not reproduce the problem; this example was reduced from a linear algebra library where the element type is `Vec2f`).
- Doing so inside a closure that loads `scalar` from its environment (making the closure take `data` and `scalar` as arguments does not reproduce the problem).
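For reference, here is a minimal sketch of the two source shapes that, per the observations above, do not show the reload: copying the captured `scalar` into a closure-local first, and passing both values as closure arguments. The function names `compute_local_copy` and `compute_with_args` are made up for illustration, and the `identity` / `#[inline(never)]` wrapper from the reproducer is left out for brevity, so this only shows the shape of the code and says nothing on its own about the generated assembly.

```rust
/// Variant A (hypothetical name): the commented-out line from the
/// reproducer, uncommented. `scalar` is copied out of the closure's
/// environment once, before the loop starts.
pub fn compute_local_copy(data: &mut [[f32; 2]], scalar: f32) {
    let closure = || {
        let scalar = scalar; // single read of the captured value
        for f in data {
            f[0] *= scalar;
            f[1] *= scalar;
        }
    };
    closure();
}

/// Variant B (hypothetical name): the closure captures nothing and
/// receives `data` and `scalar` as ordinary arguments.
pub fn compute_with_args(data: &mut [[f32; 2]], scalar: f32) {
    let closure = |data: &mut [[f32; 2]], scalar: f32| {
        for f in data {
            f[0] *= scalar;
            f[1] *= scalar;
        }
    };
    closure(data, scalar);
}
```

Both variants compute the same result as the original `compute`; they differ only in how `scalar` reaches the loop body.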
`llvm-mca` estimates a throughput (on Skylake) of 1.63 IPC for the bad version, and 2.13 IPC for the good version (and the good version processes twice the elements per iteration).
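To put the IPC figures on a per-float basis, here is a rough back-of-the-envelope sketch. It assumes the IPC estimates apply to just the inner loop bodies shown above (8 instructions per iteration for the bad loop, 9 for the good one, counted from the listings), which is my assumption rather than something measured. The helper `cycles_per_float` is a hypothetical name.

```rust
/// Estimated cycles spent per f32, given the number of instructions in
/// one loop iteration, the estimated IPC, and the floats processed per
/// iteration. Purely arithmetic; not a measurement.
pub fn cycles_per_float(instructions_per_iter: f64, ipc: f64, floats_per_iter: f64) -> f64 {
    (instructions_per_iter / ipc) / floats_per_iter
}

/// Bad loop: 8 instructions, 1.63 IPC, 4 floats per iteration.
pub fn bad_estimate() -> f64 {
    cycles_per_float(8.0, 1.63, 4.0)
}

/// Good loop: 9 instructions, 2.13 IPC, 8 floats per iteration.
pub fn good_estimate() -> f64 {
    cycles_per_float(9.0, 2.13, 8.0)
}
```

Under those assumptions the bad loop comes out at roughly 1.23 cycles per float and the good loop at roughly 0.53, i.e. a bit more than a 2x difference in estimated throughput.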
Meta
Reproduces on the playground, with stable 1.89.0 and nightly 2025-08-09.