[LV] Inefficient gather/scatter address calculation for strided access #129474
We've found that running the LoopStrengthReducePass before LoopVectorize resolves this issue.
```diff
 void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                   FunctionPassManager &FPM, bool IsFullLTO) {
+  FPM.addPass(createFunctionToLoopPassAdaptor(LoopStrengthReducePass(), true));
   FPM.addPass(LoopVectorizePass(
       LoopVectorizeOptions(!PTO.LoopInterleaving, !PTO.LoopVectorization)));
```

This change leads to better address calculation for strided access.

SVE

AVX-512
Currently, LoopStrengthReducePass seems to run only before instruction selection. Does anyone know why this pass is not executed at an earlier phase?
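For anyone who wants to try the same ordering on a standalone IR file without rebuilding clang, something like the following opt pipeline should run LSR immediately before the vectorizer (a reproduction sketch assuming the new-pass-manager names `loop-reduce` and `loop-vectorize`, not the exact invocation we used):

```
opt -passes='loop(loop-reduce),loop-vectorize' -S input.ll -o with-lsr.ll
opt -passes='loop-vectorize' -S input.ll -o without-lsr.ll
```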
@kinoshita-fj can you share the IR before/after LV with and without the LSR change?
Yes. without LSR before LV
after LV
with LSR before LV
after LV
Adding an AArch64 version of the RISC-V pass described below, and adjusting the IR it produces for AArch64, seems like a good approach for fixing this issue.
This pass detects strided accesses in gathers and scatters and replaces the incremented offset vector with an incremented base address; RISC-V uses it to resolve similar issues. I'm working on this right now. What do you think about this approach?
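To illustrate the rewrite, here is a hand-written sketch in scalar C++ (not the pass's actual output; `VF` and `STRIDE` are hypothetical placeholders for the vectorization factor and the access stride): the vectorizer keeps a per-lane offset vector and bumps every lane each iteration, whereas the pass keeps the offsets fixed and bumps a single scalar base pointer.

```cpp
// Hand-written sketch of the two addressing patterns, expressed as scalar loops.
// VF and STRIDE are hypothetical placeholders; n is assumed to be a multiple of VF.
enum { VF = 4, STRIDE = 3 };

// Before: each iteration gathers through an offset vector that is itself
// updated with a vector add (offsets[lane] += VF * STRIDE).
void gather_offset_vector(int *dst, const int *base, long n, long offsets[VF]) {
  for (long i = 0; i < n; i += VF) {
    for (int lane = 0; lane < VF; ++lane)
      dst[i + lane] = base[offsets[lane]];  // gather via per-lane offsets
    for (int lane = 0; lane < VF; ++lane)
      offsets[lane] += VF * STRIDE;         // vector add just to advance addresses
  }
}

// After: the offsets stay constant (lane * STRIDE) and only a scalar base
// pointer is advanced, which is the form the RISC-V-style pass aims to produce.
void gather_scalar_base(int *dst, const int *base, long n) {
  const int *p = base;
  for (long i = 0; i < n; i += VF) {
    for (int lane = 0; lane < VF; ++lane)
      dst[i + lane] = p[lane * STRIDE];     // gather from fixed offsets
    p += VF * STRIDE;                       // single scalar pointer increment
  }
}
```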
Related patch: #128718
@Mel-Chen Since AArch64 does not have strided-access instructions, gather/scatter instructions must be used instead. Although lowering strided intrinsics is possible, we believe gather/scatter intrinsics should be used for architectures like AArch64 because of the issues described below.
it becomes:
For the following two reasons, this intrinsic cannot be converted into a packed-format scatter:
Therefore, we would like to modify
LLVM generates inefficient code for strided array accesses in loops: address calculations inside the loop are done with vector operations on offset vectors instead of incrementing a scalar base register, which degrades performance.
For example:
SVE
AVX-512
https://godbolt.org/z/9MPnPvKG8
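The godbolt link above contains the original reproducer; as a stand-in, a loop of the kind the issue describes looks roughly like this (a hypothetical example, not the exact code from the link):

```cpp
// Hypothetical strided-access loop: the stride is loop-invariant, so the
// addresses could be computed by incrementing a single scalar base pointer
// instead of maintaining and updating a vector of offsets.
void copy_strided(float *dst, const float *src, long n, long stride) {
  for (long i = 0; i < n; ++i)
    dst[i] = src[i * stride];  // vectorized as a gather on SVE / AVX-512
}
```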