
[LV] Inefficient gather/scatter address calculation for strided access #129474


Open · kinoshita-fj opened this issue Mar 3, 2025 · 7 comments

@kinoshita-fj

LLVM generates inefficient code for strided array accesses in loops: the address calculation inside the loop is performed with vector operations on an offset vector instead of incrementing the scalar base register, which degrades performance.

For example:

void func(double* a, int n)
{
    for (int i = 0; i < n; i++) {
        a[i*5] = 1;
    }
}

SVE

.LBB0_4:
        add     z3.d, z2.d, z0.d
        mul     z2.d, z2.d, #40
        subs    x11, x11, x9
        st1d    { z1.d }, p0, [x0, z2.d]
        mov     z2.d, z3.d
        b.ne    .LBB0_4

AVX-512

.LBB0_4:
        vpmullq ymm4, ymm0, ymm1
        kxnorw  k1, k0, k0
        vscatterqpd     qword ptr [rdi + ymm4] {k1}, ymm2
        vpaddq  ymm0, ymm0, ymm3
        add     rdx, -4
        jne     .LBB0_4

https://godbolt.org/z/9MPnPvKG8

@kinoshita-fj

We've found that running the LoopStrengthReducePass before LoopVectorize resolves this issue.

llvm/lib/Passes/PassBuilderPipelines.cpp

void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                  FunctionPassManager &FPM, bool IsFullLTO) {

+ FPM.addPass(createFunctionToLoopPassAdaptor(LoopStrengthReducePass(), true));

  FPM.addPass(LoopVectorizePass(
      LoopVectorizeOptions(!PTO.LoopInterleaving, !PTO.LoopVectorization)));

This change leads to better address calculation for strided access.

SVE

.LBB0_4:
	st1d	{ z1.d }, p0, [x0, z0.d]
	add	x0, x0, x13
	subs	x14, x14, x11
	b.ne	.LBB0_4

AVX-512

.LBB0_4:
	kxnorw	%k0, %k0, %k1
	vscatterqpd	%ymm1, (%rdi,%ymm0) {%k1}
	vpaddq	%ymm2, %ymm0, %ymm0
	addq	$-4, %rsi
	jne	.LBB0_4

Currently, LoopStrengthReducePass seems to run only just before instruction selection. Does anyone know why this pass is not executed at an earlier phase?

@hiraditya

cc: @fhahn @ayalz for opinion on this.

@fhahn commented Mar 18, 2025

@kinoshita-fj can you share the IR before/after LV with and without the LSR run?

@kinoshita-fj

Yes.

without LSR

before LV

for.body: 
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %arrayidx.idx = mul nuw nsw i64 %indvars.iv, 20
  %arrayidx = getelementptr inbounds nuw i8, ptr %a, i64 %arrayidx.idx
  store float 1.000000e+00, ptr %arrayidx, align 4, !tbaa !6
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !10

after LV

vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %vec.ind = phi <vscale x 4 x i64> [ %induction, %vector.ph ], [ %vec.ind.next, %vector.body ]
  %11 = mul nuw nsw <vscale x 4 x i64> %vec.ind, splat (i64 20)
  %12 = getelementptr inbounds nuw i8, ptr %a, <vscale x 4 x i64> %11
  call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> splat (float 1.000000e+00), <vscale x 4 x ptr> %12, i32 4, <vscale x 4 x i1> splat (i1 true)), !tbaa !6
  %index.next = add nuw i64 %index, %7
  %vec.ind.next = add <vscale x 4 x i64> %vec.ind, %.splat
  %13 = icmp eq i64 %index.next, %n.vec
  br i1 %13, label %middle.block, label %vector.body, !llvm.loop !10

with LSR

before LV

for.body:   
  %lsr.iv7 = phi i64 [ %wide.trip.count, %for.body.preheader ], [ %lsr.iv.next, %for.body ]
  %lsr.iv = phi ptr [ %a, %for.body.preheader ], [ %scevgep, %for.body ]
  store float 1.000000e+00, ptr %lsr.iv, align 4, !tbaa !6
  %scevgep = getelementptr i8, ptr %lsr.iv, i64 20
  %lsr.iv.next = add nsw i64 %lsr.iv7, -1
  %exitcond.not = icmp eq i64 %lsr.iv.next, 0
  br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !10

after LV

vector.body:                                      
  %pointer.phi = phi ptr [ %a, %vector.ph ], [ %ptr.ind, %vector.body ]
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %10 = call i64 @llvm.vscale.i64()
  %11 = mul i64 %10, 4
  %12 = mul i64 %11, 1
  %13 = mul i64 20, %12
  %14 = mul i64 %11, 0
  %.splatinsert = insertelement <vscale x 4 x i64> poison, i64 %14, i64 0
  %.splat = shufflevector <vscale x 4 x i64> %.splatinsert, <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
  %15 = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
  %16 = add <vscale x 4 x i64> %.splat, %15
  %17 = mul <vscale x 4 x i64> %16, splat (i64 20)
  %vector.gep = getelementptr i8, ptr %pointer.phi, <vscale x 4 x i64> %17
  call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> splat (float 1.000000e+00), <vscale x 4 x ptr> %vector.gep, i32 4, <vscale x 4 x i1> splat (i1 true)), !tbaa !6
  %index.next = add nuw i64 %index, %7
  %ptr.ind = getelementptr i8, ptr %pointer.phi, i64 %13
  %18 = icmp eq i64 %index.next, %n.vec
  br i1 %18, label %middle.block, label %vector.body, !llvm.loop !10

@kinoshita-fj

Porting the following RISC-V pass to AArch64 and adapting its output IR for AArch64 seems to be a good approach for fixing this issue.

llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp

This pass detects strided accesses in gathers and scatters and replaces the per-iteration increment of the offset vector with an increment of the scalar base address. RISC-V uses this pass to resolve similar issues.
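
For illustration only, here is a rough sketch (with hypothetical value names, not actual pass output) of the IR shape an AArch64 adaptation of this idea would aim for on the loop above: the offset vector becomes loop-invariant and only a scalar base pointer is advanced inside the loop.

vector.ph:
  %step    = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
  %offsets = mul <vscale x 4 x i64> %step, splat (i64 20)          ; loop-invariant byte offsets
  ; %vl.bytes would be computed once here as (elements per iteration) * 20

vector.body:
  %base = phi ptr [ %a, %vector.ph ], [ %base.next, %vector.body ]
  %ptrs = getelementptr i8, ptr %base, <vscale x 4 x i64> %offsets
  call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> splat (float 1.000000e+00), <vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> splat (i1 true))
  %base.next = getelementptr i8, ptr %base, i64 %vl.bytes          ; scalar base increment
  ; (loop control unchanged)

With the offsets loop-invariant, the backend can keep them in a register and emit the scatter with a scalar-base-plus-vector-offset addressing mode, as in the LSR-enabled SVE output above.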

I'm working on this right now. What do you think about this approach?
@kbeyls @DanielKristofKiss @david-arm @huntergr-arm

@Mel-Chen

Related patch: #128718

@kinoshita-fj

@Mel-Chen
The vectorization pass seems to be the more appropriate place to analyze strided accesses and convert them into optimal address calculations. Do you think that, once the vectorization pass has been improved, the RISCVGatherScatterLowering pass will become unnecessary?

Since AArch64 does not have strided-access instructions, gather/scatter instructions must be used instead. Although lowering strided intrinsics to gather/scatter is possible, we believe gather/scatter intrinsics should be generated directly for architectures like AArch64, because of the issues described below.
AArch64's gather/scatter instructions can handle twice the number of vector elements when both the data and the indices are 32-bit, packed into 64-bit containers. However, this format cannot be used when converting from strided intrinsics. For example, when vectorizing with strided intrinsics:

void f(float *a, int n, int m) {
  for (int i = 0; i < n; i++)
    a[i*m] = 1;
}

it becomes:

%i_x_m = phi i32 [ %i_x_m_next, %loop ], ...
%m_scaled = mul i64 %m, 4
%base = getelementptr inbounds float, ptr %a, i32 %i_x_m
%load = call <vscale x 4 x float> @llvm.experimental.vp.strided.load.nxv4f32.p0.i64(ptr %base, i64 %m_scaled, ...)
%i_x_m_next = add i32 %i_x_m, %m_x_vlen

For the following two reasons, this intrinsic cannot be converted into a packed-format scatter:

  • The stride argument is expressed in bytes, so it must be widened to i64.
  • Even if the stride itself fits in i32, the address calculation for each element is, as defined below, performed in the pointer type, so n*stride must be done in 64 bits. Computing the vector of offsets for the gather/scatter in i32 may therefore overflow (see the sketch after the quoted definition).

%ptrs = <%ptr, %ptr + %stride, %ptr + 2 * %stride, ... >,
with ‘ptr’ previously casted to a pointer ‘i8’, ‘stride’ always interpreted as a signed integer and all arithmetic occurring in the pointer type.
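
As a concrete (hypothetical) illustration of the second point: with float elements and m = 2^28, the byte stride is 2^30, so lane 2's offset 2 * 2^30 = 2^31 already wraps a signed i32, even though the source index i*m still fits in an int. The narrowing that the packed 32-bit-index form would require is therefore unsafe in general; a sketch, reusing %m_scaled and %base from the snippet above:

; Unsafe narrowing (illustrative names only)
%stride.i32   = trunc i64 %m_scaled to i32                          ; byte stride in i32
%stride.ins   = insertelement <vscale x 4 x i32> poison, i32 %stride.i32, i64 0
%stride.splat = shufflevector <vscale x 4 x i32> %stride.ins, <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
%lanes        = call <vscale x 4 x i32> @llvm.stepvector.nxv4i32()
%offsets.i32  = mul <vscale x 4 x i32> %lanes, %stride.splat        ; lane * stride: can wrap in i32
%ptrs         = getelementptr i8, ptr %base, <vscale x 4 x i32> %offsets.i32 ; wrapped offsets get sign-extended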

Therefore, we would like to modify VPWidenStridedLoad(Store)Recipe::execute to generate gather/scatter intrinsics instead of strided intrinsics for AArch64.
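
To make the intent concrete, a minimal sketch (illustrative names, not the actual recipe output) of the IR the recipe could emit for AArch64 for the store in the example above, with all offset arithmetic kept in i64 as the scatter requires:

; %base, %m_scaled and %mask stand for the values the strided intrinsic would have taken.
%step         = call <vscale x 4 x i64> @llvm.stepvector.nxv4i64()
%stride.ins   = insertelement <vscale x 4 x i64> poison, i64 %m_scaled, i64 0
%stride.splat = shufflevector <vscale x 4 x i64> %stride.ins, <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
%offsets      = mul <vscale x 4 x i64> %step, %stride.splat         ; per-lane byte offsets in i64
%ptrs         = getelementptr i8, ptr %base, <vscale x 4 x i64> %offsets
call void @llvm.masked.scatter.nxv4f32.nxv4p0(<vscale x 4 x float> splat (float 1.000000e+00), <vscale x 4 x ptr> %ptrs, i32 4, <vscale x 4 x i1> %mask)

AArch64 can select this as an SVE scatter with a scalar base and a vector of offsets, and the loop-invariant offset computation can then be hoisted as sketched earlier.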
