Skip to content

Ineffectual por with constant emitted for pshufb operand #106256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cvijdea-bd opened this issue Aug 27, 2024 · 2 comments · Fixed by #106377
Closed

Ineffectual por with constant emitted for pshufb operand #106256

cvijdea-bd opened this issue Aug 27, 2024 · 2 comments · Fixed by #106377

Comments

@cvijdea-bd
Copy link

cvijdea-bd commented Aug 27, 2024

Clang example: https://godbolt.org/z/ec4P4j78b, flags: -O3 -march=x86-64-v2. Not clang specific, same behaviour on rust nightly.

#include <immintrin.h>

extern "C" __m128i shuffle_or(__m128i bytes, __m128i idxs) {
    return _mm_shuffle_epi8(bytes, _mm_or_si128(idxs, _mm_set1_epi8(112)));
}

The por of xmm1 with 112 (0b0111_0000) is a no-op and should be optimized out, as pshufb ignores bits 5-7 of the mask argument:

.LCPI0_0:
        .zero   16,112
shuffle_or:
        por     xmm1, xmmword ptr [rip + .LCPI0_0]
        pshufb  xmm0, xmm1
        ret

Writing _mm_shuffle_epi8(bytes, _mm_set1_epi8(127)) in the source emits a pshufb with 15 in the assembly, so it seems like LLVM is aware of this optimization on some level, but fails to apply it here.

@llvmbot
Copy link
Member

llvmbot commented Aug 27, 2024

@llvm/issue-subscribers-backend-x86

Author: Cristian Vîjdea (cvijdea-bd)

Clang example: https://godbolt.org/z/jco9dn95W, flags: `-O3 -march=x86-64-v2`
#include &lt;immintrin.h&gt;

extern "C" __m128i shuffle_or(__m128i bytes, __m128i idxs) {
    return _mm_shuffle_epi8(bytes, _mm_or_si128(idxs, _mm_set1_epi8(112)));
}

The por of xmm1 with 112 (0b0111_0000) is a no-op and should be optimized out, as pshufb ignores bits 5-7 of the mask argument:

.LCPI0_0:
        .zero   16,112
shuffle_or:
        por     xmm1, xmmword ptr [rip + .LCPI0_0]
        pshufb  xmm0, xmm1
        ret

<details>
<summary>EDIT: simplified reproducer, original below</summary>

Clang example: https://godbolt.org/z/r67EqKqK8, flags: -O3 -march=x86-64-v2

#include &lt;immintrin.h&gt;

extern "C" __m128i shuffle_or(__m128i bytes, __m128i idxs) {
    return _mm_shuffle_epi8(bytes, _mm_or_si128(idxs, _mm_set1_epi8(112)));
}

The por of xmm1 with 112 (0b0111_0000) is a no-op, as pshufb ignores bits 5-7 of the mask argument:

.LCPI0_0:
        .zero   16,112
shuffle_or:
        por     xmm1, xmmword ptr [rip + .LCPI0_0]
        pshufb  xmm0, xmm1
        ret

</details>

@RKSimon RKSimon self-assigned this Aug 27, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Aug 28, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Aug 28, 2024
(V)PSHUFB only uses the sign bit (for zeroing) and the lower 4 bits (to index per-lane byte 0-15) - so use SimplifyDemandedBits to ignore anything touching the remaining bits.

Fixes llvm#106256
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Aug 28, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Aug 28, 2024
(V)PSHUFB only uses the sign bit (for zeroing) and the lower 4 bits (to index per-lane byte 0-15) - so use SimplifyDemandedBits to ignore anything touching the remaining bits.

Fixes llvm#106256
@cvijdea-bd
Copy link
Author

Thanks for looking into this so quickly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants