rustc fails to perform LICM in simple code

I tried this code:

```rust
use std::convert::identity;

#[unsafe(no_mangle)]
pub fn compute(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        || {
            //let scalar = scalar;
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        }
    );
    closure();
}
```

I expected to see this happen: For the baseline x86-64 target, the loop should be vectorized to load, multiply, and store at least 4 floats at a time. **If the commented line is uncommented**, this does happen and the code processes 8 floats (32 bytes) each iteration:

```asm
	movaps	xmm1, xmm0
	shufps	xmm1, xmm0, 0                   # xmm1 = xmm1[0,0],xmm0[0,0]
	xor	r8d, r8d

.LBB1_3:                                # =>This Inner Loop Header: Depth=1
	movups	xmm2, xmmword ptr [rdx + 8*r8]
	movups	xmm3, xmmword ptr [rdx + 8*r8 + 16]
	mulps	xmm2, xmm1
	movups	xmmword ptr [rdx + 8*r8], xmm2
	mulps	xmm3, xmm1
	movups	xmmword ptr [rdx + 8*r8 + 16], xmm3
	add	r8, 4
	cmp	rdi, r8
	jne	.LBB1_3
```

Instead, this happened: The loop *is* vectorized, but only to process 4 floats (two `[f32; 2]` elements, 16 bytes) each iteration, and it reloads `scalar` from memory every single iteration.

```asm
.LBB1_4:                                # =>This Inner Loop Header: Depth=1
	movups	xmm0, xmmword ptr [rdx + 8*r9]
	movss	xmm1, dword ptr [rax]           # xmm1 = mem[0],zero,zero,zero
	shufps	xmm1, xmm1, 0                   # xmm1 = xmm1[0,0,0,0]
	mulps	xmm1, xmm0
	movups	xmmword ptr [rdx + 8*r9], xmm1
	add	r9, 2
	cmp	r8, r9
	jne	.LBB1_4
```
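`xmm1` is reloaded from `[rax]` and re-broadcast with `shufps` every iteration, even though `rax` is not modified inside the loop. In the expected output above, the broadcast happens once, before the loop.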
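
To make the source of the load easier to see, here is a hand-written model of the closure. It assumes `scalar` is captured by shared reference, which is what the capture rules give for a `Copy` value that is only read in a non-`move` closure. `Env`, `call`, and `compute_model` are hypothetical names for illustration; this models the semantics only, is not rustc's actual lowering, and is not claimed to reproduce the missed optimization.

```rust
/// Hypothetical stand-in for the closure's environment.
struct Env<'a> {
    /// Moved into the closure because the `for` loop consumes it.
    data: &'a mut [[f32; 2]],
    /// Captured by reference: each use is semantically a read through this pointer.
    scalar: &'a f32,
}

impl<'a> Env<'a> {
    /// Stand-in for the closure body; takes `self` by value like an `FnOnce` call.
    #[inline(never)]
    fn call(self) {
        for f in self.data {
            // Loop-invariant reads: nothing in the loop is meant to change `*self.scalar`.
            f[0] *= *self.scalar;
            f[1] *= *self.scalar;
        }
    }
}

/// Hypothetical equivalent of `compute` using the explicit environment.
pub fn compute_model(data: &mut [[f32; 2]], scalar: f32) {
    Env { data, scalar: &scalar }.call();
}
```

In this picture, hoisting means reading `*self.scalar` once before the loop, which is what the commented-out `let scalar = scalar;` line does by hand.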

This problem appears to require a combination of:
- Processing elements of type `[f32; 2]` in the loop (plain `f32` elements do not reproduce the problem; this example was reduced from a linear algebra library where the element type is `Vec2f`).
- Doing so inside a closure that loads `scalar` from its environment (making the closure take `data` and `scalar` as arguments does not reproduce the problem).
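
For reference, the two variants described in this report as not reproducing the problem can be sketched as follows. The function names `compute_local_copy` and `compute_with_args` are made up for this sketch, and `#[unsafe(no_mangle)]` is left out because it does not matter for the comparison. I have only checked the first variant's codegen (shown above); the second is written from the description.

```rust
use std::convert::identity;

/// Variant 1: the commented-out line enabled, copying `scalar` into a closure-local.
pub fn compute_local_copy(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        || {
            let scalar = scalar; // read the captured value once, before the loop
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        },
    );
    closure();
}

/// Variant 2: the closure captures nothing and takes both values as arguments.
pub fn compute_with_args(data: &mut [[f32; 2]], scalar: f32) {
    let closure = identity(
        #[inline(never)]
        |data: &mut [[f32; 2]], scalar: f32| {
            for f in data {
                f[0] *= scalar;
                f[1] *= scalar;
            }
        },
    );
    closure(data, scalar);
}
```

Both compute the same result as the original `compute`; they differ only in how `scalar` reaches the loop.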

`llvm-mca` estimates a throughput (on Skylake) of 1.63 IPC for the bad version and 2.13 IPC for the good version, which also processes twice as many floats per iteration.
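
As a rough back-of-the-envelope conversion of those numbers into cycles per float, assuming the IPC figures apply to exactly the loop bodies shown above (8 instructions for 4 floats in the bad loop, 9 instructions for 8 floats in the good one). The helper names are hypothetical and this is an estimate, not a measurement:

```rust
/// Estimated cycles spent per f32, from loop-body instruction count,
/// instructions per cycle, and floats processed per iteration.
pub fn cycles_per_float(instructions: f64, ipc: f64, floats_per_iter: f64) -> f64 {
    instructions / ipc / floats_per_iter
}

/// Ratio of the two estimates (bad loop over good loop).
pub fn estimated_speedup() -> f64 {
    let bad = cycles_per_float(8.0, 1.63, 4.0); // roughly 1.23 cycles per float
    let good = cycles_per_float(9.0, 2.13, 8.0); // roughly 0.53 cycles per float
    bad / good
}
```

Under those assumptions the good loop comes out a bit more than 2x faster per float, most of which comes from the wider loop body rather than the IPC difference.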

### Meta

Reproduces on the playground, with stable 1.89.0 and nightly 2025-08-09.

rustc fails to perform LICM in simple code #145226
