rustc fails to perform LICM in simple code #145226

@SludgePhD

I tried this code:

use std::convert::identity;

#[unsafe(no_mangle)]
pub fn compute(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        || {
            //let scalar = scalar;
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        }
    );
    closure();
}

I expected to see this happen: For the baseline x86-64 target, the loop should be vectorized to load, multiply, and store at least 4 floats at a time. If the commented line is uncommented, this does happen and the code processes 8 floats (32 bytes) each iteration:

	movaps	xmm1, xmm0
	shufps	xmm1, xmm0, 0                   # xmm1 = xmm1[0,0],xmm0[0,0]
	xor	r8d, r8d

.LBB1_3:                                # =>This Inner Loop Header: Depth=1
	movups	xmm2, xmmword ptr [rdx + 8*r8]
	movups	xmm3, xmmword ptr [rdx + 8*r8 + 16]
	mulps	xmm2, xmm1
	movups	xmmword ptr [rdx + 8*r8], xmm2
	mulps	xmm3, xmm1
	movups	xmmword ptr [rdx + 8*r8 + 16], xmm3
	add	r8, 4
	cmp	rdi, r8
	jne	.LBB1_3

Instead, this happened: The loop is vectorized, but only to 4 floats (two [f32; 2] elements, 16 bytes) per iteration, and it reloads scalar from memory on every single iteration.

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
	movups	xmm0, xmmword ptr [rdx + 8*r9]
	movss	xmm1, dword ptr [rax]           # xmm1 = mem[0],zero,zero,zero
	shufps	xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
	mulps	xmm1, xmm0
	movups	xmmword ptr [rdx + 8*r9], xmm1
	add	r9, 2
	cmp	r8, r9
	jne	.LBB1_4

xmm1 is reloaded from [rax] and re-broadcast with shufps on every iteration, even though rax is not modified inside the loop.
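To make the missing hoist concrete, here is a rough sketch in plain Rust. It assumes the closure captures scalar by shared reference (the usual capture mode for a Copy value that a non-move closure only reads) and models the closure environment as a hypothetical Env struct. The names Env, scale_reload, and scale_hoisted are made up for illustration; this is not what rustc actually generates.

```rust
/// Hypothetical stand-in for the closure environment (an assumption for
/// illustration): `scalar` is held by reference, `data` by unique borrow.
pub struct Env<'a> {
    pub data: &'a mut [[f32; 2]],
    pub scalar: &'a f32,
}

/// Mirrors the shape of the reported closure body: `*scalar` is read through
/// the reference inside the loop, so the load stays in the loop unless the
/// optimizer hoists it.
pub fn scale_reload(env: Env<'_>) {
    let Env { data, scalar } = env;
    for f in data {
        f[0] *= *scalar;
        f[1] *= *scalar;
    }
}

/// The same loop with the load hoisted by hand, which is the transformation
/// LICM would be expected to perform.
pub fn scale_hoisted(env: Env<'_>) {
    let Env { data, scalar } = env;
    let scalar = *scalar; // read once, before the loop
    for f in data {
        f[0] *= scalar;
        f[1] *= scalar;
    }
}
```

Both functions compute the same result; the only difference is where the read of scalar happens.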

This problem appears to require a combination of two things:

  • Processing elements of type [f32; 2] in the loop (f32 does not reproduce the problem; this example was reduced from a linear algebra library where the element type is Vec2f).
  • Doing so inside a closure that loads scalar from its environment (making the closure take data and scalar as arguments does not reproduce the problem).
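For reference, the two variants described above as not reproducing the problem can be written out as follows. The function names compute_local_copy and compute_args are only illustrative; the bodies are minimal edits of the original compute.

```rust
use std::convert::identity;

/// Variant 1: the commented-out line from the original is enabled, so
/// `scalar` is copied into a local before the loop starts.
pub fn compute_local_copy(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        || {
            let scalar = scalar;
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        }
    );
    closure();
}

/// Variant 2: the closure takes `data` and `scalar` as arguments instead of
/// reading them from its environment.
pub fn compute_args(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        |data: &mut [[f32; 2]], scalar: f32| {
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        }
    );
    closure(data, scalar);
}
```

Both variants produce the same values as the original; the difference is only in the generated loop.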

llvm-mca estimates a throughput (on Skylake) of 1.63 IPC for the bad version, and 2.13 IPC for the good version (and the good version processes twice the elements per iteration).
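As a back-of-the-envelope check, the IPC figures can be turned into an estimated cost per float. This sketch assumes the IPC values apply to exactly the loop bodies shown above (8 instructions and 4 floats per iteration for the bad loop, 9 instructions and 8 floats for the good one); cycles_per_float is a hypothetical helper, and the result is an estimate, not a measurement.

```rust
/// Estimated cycles spent per f32, given the instruction count of one loop
/// iteration, the estimated IPC, and the number of floats handled per
/// iteration. Illustrative helper only.
pub fn cycles_per_float(instrs_per_iter: f64, ipc: f64, floats_per_iter: f64) -> f64 {
    // cycles per iteration = instructions / IPC
    instrs_per_iter / ipc / floats_per_iter
}

/// Bad loop: 8 instructions, 1.63 IPC, 4 floats per iteration.
pub fn bad_cycles_per_float() -> f64 {
    cycles_per_float(8.0, 1.63, 4.0)
}

/// Good loop: 9 instructions, 2.13 IPC, 8 floats per iteration.
pub fn good_cycles_per_float() -> f64 {
    cycles_per_float(9.0, 2.13, 8.0)
}
```

Under these assumptions the bad loop costs about 1.23 cycles per float and the good loop about 0.53, i.e. the good version is estimated to be roughly 2.3 times faster per element.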

Meta

Reproduces on the playground, with stable 1.89.0 and nightly 2025-08-09.

Labels

  • C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
  • T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)
  • needs-triage (This issue may need triage. Remove it if it has been sufficiently triaged.)
