
[X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets #90748


Closed
RKSimon opened this issue May 1, 2024 · 1 comment · Fixed by #95690

Comments

RKSimon (Collaborator) commented May 1, 2024

For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.
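For reference, a minimal sketch of that extend/multiply/pack pattern with SSE2 intrinsics (illustrative only - the function name mul_epi8_sse2 is made up and this is not the backend's exact lowering):

#include <emmintrin.h> // SSE2

__m128i mul_epi8_sse2(__m128i x, __m128i y) {
    __m128i z = _mm_setzero_si128();
    __m128i xlo = _mm_unpacklo_epi8(x, z);  // zero-extend low 8 bytes to i16
    __m128i xhi = _mm_unpackhi_epi8(x, z);  // zero-extend high 8 bytes to i16
    __m128i ylo = _mm_unpacklo_epi8(y, z);
    __m128i yhi = _mm_unpackhi_epi8(y, z);
    __m128i lo = _mm_mullo_epi16(xlo, ylo); // 8 x i16 products (low half)
    __m128i hi = _mm_mullo_epi16(xhi, yhi); // 8 x i16 products (high half)
    __m128i m = _mm_set1_epi16(255);
    // keep the low byte of each product, then pack back down to vXi8
    return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
}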

But if we have the (v)pmaddubsw instruction, we can zero out the odd/even i8 elements of one of the operands, perform two pmaddubsw calls, and then shift+or the results back together, all with a single mask:

#include <tmmintrin.h> // SSSE3: _mm_maddubs_epi16

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
    __m128i m = _mm_set1_epi16(255);        // 0x00FF in each i16 lane
    __m128i ylo = _mm_and_si128(m, y);      // even i8 elements of y
    __m128i yhi = _mm_andnot_si128(m, y);   // odd i8 elements of y
    // One element of each i8-pair is zero, so pmaddubsw cannot saturate and
    // the low 8 bits of each product are correct regardless of signedness.
    __m128i lo = _mm_maddubs_epi16(x, ylo); // even-element products
    __m128i hi = _mm_maddubs_epi16(x, yhi); // odd-element products
    lo = _mm_and_si128(lo, m);              // keep low byte of even products
    hi = _mm_slli_epi16(hi, 8);             // move odd products into odd bytes
    return _mm_or_si128(lo, hi);            // recombine into the vXi8 result
}
  vmovaps .LCPI0_2(%rip), %xmm5
  vpand %xmm2, %xmm1, %xmm3
  vpandn %xmm2, %xmm1, %xmm4
  vpmaddubsw %xmm3, %xmm0, %xmm3
  vpmaddubsw %xmm4, %xmm0, %xmm4
  vpand %xmm5, %xmm3, %xmm3
  vpsllw $8, %xmm4, %xmm4
  vpor %xmm4, %xmm3, %xmm4
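A quick scalar cross-check of the intrinsics version above (test harness only, not from the issue - it assumes the _mm_mul_epi8 definition and #include <tmmintrin.h> from earlier):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[16], b[16], r[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 37 + 11);  // arbitrary test values
        b[i] = (uint8_t)(i * 91 + 3);
    }
    __m128i v = _mm_mul_epi8(_mm_loadu_si128((const __m128i *)a),
                             _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)r, v);
    for (int i = 0; i < 16; i++)  // expect the low 8 bits of each product
        if (r[i] != (uint8_t)(a[i] * b[i])) { printf("mismatch at %d\n", i); return 1; }
    puts("ok");
    return 0;
}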

llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears more borderline; it could be that we begin by trying this only for multiply-by-constants (and shl-by-constants?).
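For a constant multiplier, the odd/even split of the operand folds away at compile time, leaving only the final 0x00FF mask - a rough sketch (the constant 37 and the function name are illustrative):

__m128i mul_by_37_epi8(__m128i x) {
    __m128i clo = _mm_set1_epi16(37);       // 37 in even bytes, 0 in odd bytes
    __m128i chi = _mm_set1_epi16(37 << 8);  // 0 in even bytes, 37 in odd bytes
    __m128i m = _mm_set1_epi16(255);
    __m128i lo = _mm_maddubs_epi16(x, clo); // even-element products
    __m128i hi = _mm_maddubs_epi16(x, chi); // odd-element products
    lo = _mm_and_si128(lo, m);              // keep low byte of even products
    hi = _mm_slli_epi16(hi, 8);             // move odd products into odd bytes
    return _mm_or_si128(lo, hi);
}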


RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 13, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 14, 2024
@RKSimon RKSimon self-assigned this Jun 14, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 15, 2024
RKSimon added a commit that referenced this issue Jun 15, 2024
…gets (#95403)

As discussed on #90748 - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and packing the vXi16 results back together.
RKSimon added a commit that referenced this issue Jun 16, 2024
Later levels were inheriting some of the worst-case costs from SSE/AVX1 etc.

Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts

Cleanup prep work for #90748
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 16, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
RKSimon added a commit that referenced this issue Jun 16, 2024
These should only consume 1cy on either of the 2 pipes (only zmm ops should double pump) - matches AMD SoG + uops.info

Noticed while updating costs for #90748
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 25, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this issue Jul 9, 2024
…5690)

Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention (but lower instruction count).

Fixes llvm#90748