[X86] Investigate using (v)pmaddubsw for vXi8 multiplication on SSSE3+ targets #90748
Author: Simon Pilgrim (RKSimon)
For the default SSE2 implementation we extend to vXi16, perform the multiplication and pack the results back to vXi8.
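For reference, a minimal sketch of that SSE2 baseline (the helper name mul_epi8_sse2 is hypothetical; the backend's actual lowering may differ in detail):

#include <emmintrin.h> // SSE2

__m128i mul_epi8_sse2(__m128i x, __m128i y) {
  __m128i zero = _mm_setzero_si128();
  // Zero-extend the low/high 8 bytes of each operand to 16-bit lanes.
  __m128i xlo = _mm_unpacklo_epi8(x, zero);
  __m128i xhi = _mm_unpackhi_epi8(x, zero);
  __m128i ylo = _mm_unpacklo_epi8(y, zero);
  __m128i yhi = _mm_unpackhi_epi8(y, zero);
  // 16-bit multiplies; only the low 8 bits of each product matter.
  __m128i plo = _mm_mullo_epi16(xlo, ylo);
  __m128i phi = _mm_mullo_epi16(xhi, yhi);
  // Mask to the low byte so the unsigned-saturating pack cannot clamp.
  __m128i m = _mm_set1_epi16(255);
  plo = _mm_and_si128(plo, m);
  phi = _mm_and_si128(phi, m);
  return _mm_packus_epi16(plo, phi);
}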
But if we have the (v)pmaddubsw instruction, we can zero out the odd/even parts of the i8-pairs of one of the operands, perform the 2 pmaddubsw calls and then shift+or them back together, all with a single mask:

__m128i _mm_mul_epi8(__m128i x, __m128i y) {
  __m128i m = _mm_set1_epi16(255);
  __m128i ylo = _mm_and_si128(m, y);      // even bytes of y (odd bytes zeroed)
  __m128i yhi = _mm_andnot_si128(m, y);   // odd bytes of y (even bytes zeroed)
  __m128i lo = _mm_maddubs_epi16(x, ylo); // even-byte products, one per i16 lane
  __m128i hi = _mm_maddubs_epi16(x, yhi); // odd-byte products, one per i16 lane
  lo = _mm_and_si128(lo, m);   // truncate even products to 8 bits
  hi = _mm_slli_epi16(hi, 8);  // move odd products into the odd byte positions
  return _mm_or_si128(lo, hi);
}

Which compiles (AVX encoding) to:

vmovaps .LCPI0_2(%rip), %xmm5
vpand %xmm2, %xmm1, %xmm3
vpandn %xmm2, %xmm1, %xmm4
vpmaddubsw %xmm3, %xmm0, %xmm3
vpmaddubsw %xmm4, %xmm0, %xmm4
vpand %xmm5, %xmm3, %xmm3
vpsllw $8, %xmm4, %xmm4
vpor %xmm4, %xmm3, %xmm4

llvm-mca analysis - https://llvm.godbolt.org/z/9361GKrds - most CPUs benefit from this, but SandyBridge appears to be more borderline; it could be that we begin by initially trying this for multiply-by-constants (and shl-by-constants?)
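To illustrate the multiply-by-constant case: a vXi8 shift-left-by-constant is just a multiply by a power of two, and with a constant multiplier the pand/pandn masking of y folds away at compile time, leaving only the two pmaddubsw calls plus the shift+or. A hedged sketch (the helper name slli_epi8_by3 is hypothetical; it shifts every byte left by 3, i.e. multiplies by a splat of 8):

#include <tmmintrin.h> // SSSE3

__m128i slli_epi8_by3(__m128i x) {
  __m128i m   = _mm_set1_epi16(255);
  __m128i ylo = _mm_set1_epi16(0x0008);    // 8 in the even bytes, 0 in the odd
  __m128i yhi = _mm_set1_epi16(0x0800);    // 8 in the odd bytes, 0 in the even
  __m128i lo  = _mm_maddubs_epi16(x, ylo); // even-byte products (x[2i] * 8)
  __m128i hi  = _mm_maddubs_epi16(x, yhi); // odd-byte products (x[2i+1] * 8)
  lo = _mm_and_si128(lo, m);   // truncate even products to 8 bits
  hi = _mm_slli_epi16(hi, 8);  // move odd products into the odd byte positions
  return _mm_or_si128(lo, hi);
}

Note that saturation is not a concern here: one element of each pmaddubsw pair is zero, so each i16 lane holds a single product (at most 255*8 = 2040), well within the signed 16-bit range.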
…gets As discussed on llvm#90748
Later levels were inheriting some of the worst-case costs from SSE/AVX1 etc. Based off llvm-mca numbers from the check_cost_tables.py script in https://github.com/RKSimon/llvm-scripts. Cleanup prep work for #90748
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together. Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention. Fixes llvm#90748
These should only consume 1cy on either of the 2 pipes (only zmm ops should double pump) - matches AMD SoG + uops.info. Noticed while updating costs for #90748
…5690) Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together. Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but lower instruction count). Fixes llvm#90748