
[X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets #90748


Closed
RKSimon opened this issue May 1, 2024 · 1 comment · Fixed by #95690

Comments

RKSimon (Collaborator) commented May 1, 2024

For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.
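For reference, a minimal sketch of that extend/multiply/pack pattern with SSE2 intrinsics (illustrative only - the function name mul_epi8_sse2 is made up and this is not the backend's exact lowering):

#include <emmintrin.h> // SSE2

__m128i mul_epi8_sse2(__m128i x, __m128i y) {
    __m128i z = _mm_setzero_si128();
    __m128i xlo = _mm_unpacklo_epi8(x, z);  // zero-extend low 8 bytes to i16
    __m128i xhi = _mm_unpackhi_epi8(x, z);  // zero-extend high 8 bytes to i16
    __m128i ylo = _mm_unpacklo_epi8(y, z);
    __m128i yhi = _mm_unpackhi_epi8(y, z);
    __m128i lo = _mm_mullo_epi16(xlo, ylo); // 8 x i16 products (low half)
    __m128i hi = _mm_mullo_epi16(xhi, yhi); // 8 x i16 products (high half)
    __m128i m = _mm_set1_epi16(255);
    // keep the low byte of each product, then pack back down to vXi8
    return _mm_packus_epi16(_mm_and_si128(lo, m), _mm_and_si128(hi, m));
}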

But if we have the (v)pmaddubsw instruction, we can zero out the odd/even i8 elements of one of the operands, perform two pmaddubsw calls, and then shift+or the results back together, all with a single mask:

#include <tmmintrin.h> // SSSE3: _mm_maddubs_epi16

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
    __m128i m = _mm_set1_epi16(255);        // 0x00FF in each i16 lane
    __m128i ylo = _mm_and_si128(m, y);      // even i8 elements of y
    __m128i yhi = _mm_andnot_si128(m, y);   // odd i8 elements of y
    // One element of each i8-pair is zero, so pmaddubsw cannot saturate and
    // the low 8 bits of each product are correct regardless of signedness.
    __m128i lo = _mm_maddubs_epi16(x, ylo); // even-element products
    __m128i hi = _mm_maddubs_epi16(x, yhi); // odd-element products
    lo = _mm_and_si128(lo, m);              // keep low byte of even products
    hi = _mm_slli_epi16(hi, 8);             // move odd products into odd bytes
    return _mm_or_si128(lo, hi);            // recombine into the vXi8 result
}
  vmovaps .LCPI0_2(%rip), %xmm5
  vpand %xmm2, %xmm1, %xmm3
  vpandn %xmm2, %xmm1, %xmm4
  vpmaddubsw %xmm3, %xmm0, %xmm3
  vpmaddubsw %xmm4, %xmm0, %xmm4
  vpand %xmm5, %xmm3, %xmm3
  vpsllw $8, %xmm4, %xmm4
  vpor %xmm4, %xmm3, %xmm4
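A quick scalar cross-check of the intrinsics version above (test harness only, not from the issue - it assumes the _mm_mul_epi8 definition and #include <tmmintrin.h> from earlier):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a[16], b[16], r[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)(i * 37 + 11);  // arbitrary test values
        b[i] = (uint8_t)(i * 91 + 3);
    }
    __m128i v = _mm_mul_epi8(_mm_loadu_si128((const __m128i *)a),
                             _mm_loadu_si128((const __m128i *)b));
    _mm_storeu_si128((__m128i *)r, v);
    for (int i = 0; i < 16; i++)  // expect the low 8 bits of each product
        if (r[i] != (uint8_t)(a[i] * b[i])) { printf("mismatch at %d\n", i); return 1; }
    puts("ok");
    return 0;
}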

llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears more borderline; it could be that we begin by trying this only for multiply-by-constants (and shl-by-constants?).
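For a constant multiplier, the odd/even split of the operand folds away at compile time, leaving only the final 0x00FF mask - a rough sketch (the constant 37 and the function name are illustrative):

__m128i mul_by_37_epi8(__m128i x) {
    __m128i clo = _mm_set1_epi16(37);       // 37 in even bytes, 0 in odd bytes
    __m128i chi = _mm_set1_epi16(37 << 8);  // 0 in even bytes, 37 in odd bytes
    __m128i m = _mm_set1_epi16(255);
    __m128i lo = _mm_maddubs_epi16(x, clo); // even-element products
    __m128i hi = _mm_maddubs_epi16(x, chi); // odd-element products
    lo = _mm_and_si128(lo, m);              // keep low byte of even products
    hi = _mm_slli_epi16(hi, 8);             // move odd products into odd bytes
    return _mm_or_si128(lo, hi);
}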


RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 13, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 14, 2024
@RKSimon RKSimon self-assigned this Jun 14, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 15, 2024
RKSimon added a commit that referenced this issue Jun 15, 2024
…gets (#95403)

As discussed on #90748 - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and packing the vXi16 results back together.
RKSimon added a commit that referenced this issue Jun 16, 2024
Later levels were inheriting some of the worst-case costs from SSE/AVX1 etc.

Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts

Cleanup prep work for #90748
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 16, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
RKSimon added a commit that referenced this issue Jun 16, 2024
These should only consume 1cy on either of the 2 pipes (only zmm ops should double pump) - matches AMD SoG + uops.info

Noticed while updating costs for #90748
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jun 25, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this issue Jul 9, 2024
…5690)

Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge-era CPUs that might experience additional Port0/15 contention (but lower instruction count).

Fixes llvm#90748