
Ineffectual bitwise or with constant emitted for mask operand of vperm(b|w|d|q|ps|pd) #106413


Closed
cvijdea-bd opened this issue Aug 28, 2024 · 4 comments · Fixed by #106750



cvijdea-bd commented Aug 28, 2024

Same thing as #106256, but it also happens for the (avx2/avx512) permute[x]var intrinsics, while PR #106377 seems to fix it only for (v)pshufb specifically.

Godbolt examples: https://godbolt.org/z/MsTcx7qYc

The vector permute intrinsics ignore all bits except the ones that match the required index size, e.g.:

  • vpermb only uses 4, 5, 6 bits out of each mask byte element for 128, 256, 512 bit sized vectors respectively
  • vpermw only uses 3, 4, 5 bits out of each 16-bit element in the mask
  • etc.

The OR operations with unrelated bits should be optimized out.
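
To make the claim concrete, here is a minimal scalar sketch of the 512-bit vpermb case in portable C. It is a model of the index masking only, not the intrinsic itself; the names `model_vpermb512`, `or_is_ineffectual` and `check_vpermb_model` are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_BYTES = 64 }; /* one 512-bit vector holds 64 byte elements */

/* Scalar model of a 512-bit vpermb: dst[i] = src[idx[i] & 63].
   Bits 6 and 7 of every index byte are never read. */
static void model_vpermb512(uint8_t dst[NUM_BYTES],
                            const uint8_t idx[NUM_BYTES],
                            const uint8_t src[NUM_BYTES]) {
    for (int i = 0; i < NUM_BYTES; ++i)
        dst[i] = src[idx[i] & 0x3F];
}

/* Returns 1 when OR-ing `extra` into every index byte leaves the
   permute result unchanged, 0 otherwise (hypothetical helper). */
static int or_is_ineffectual(uint8_t extra) {
    uint8_t src[NUM_BYTES], idx[NUM_BYTES], idx_or[NUM_BYTES];
    uint8_t plain[NUM_BYTES], ored[NUM_BYTES];
    for (int i = 0; i < NUM_BYTES; ++i) {
        src[i] = (uint8_t)(3 * i + 1);         /* 64 distinct values */
        idx[i] = (uint8_t)(NUM_BYTES - 1 - i); /* reverse the vector */
        idx_or[i] = (uint8_t)(idx[i] | extra);
    }
    model_vpermb512(plain, idx, src);
    model_vpermb512(ored, idx_or, src);
    return memcmp(plain, ored, NUM_BYTES) == 0;
}

/* Small self-check of the model. */
static void check_vpermb_model(void) {
    assert(or_is_ineffectual(0xC0));  /* touches only ignored bits */
    assert(!or_is_ineffectual(0x01)); /* touches a used index bit */
}
```

An OR with 0xC0 only sets bits the permute never reads, so a compiler is free to drop it; an OR with 0x01 changes which byte is selected and has to stay.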

Probably applies to vpermt2 (e.g. _mm512_permutex2var_epi16) also, with 1 more bit used since they select from two concatenated vectors.
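
As a rough illustration of the "1 more bit" point, the sketch below models the 512-bit, 16-bit-element two-source permute in scalar C: five bits pick the element and a sixth picks the source. The function names are hypothetical, and the operand order simply mirrors _mm512_permutex2var_epi16(a, idx, b).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { NUM_WORDS = 32 }; /* one 512-bit vector holds 32 16-bit elements */

/* Scalar model of a 512-bit two-source word permute:
   bits 0-4 of each index pick the element, bit 5 picks the source
   (0 = a, 1 = b). Bits 6-15 are never read. */
static void model_permutex2var_epi16_512(uint16_t dst[NUM_WORDS],
                                         const uint16_t a[NUM_WORDS],
                                         const uint16_t idx[NUM_WORDS],
                                         const uint16_t b[NUM_WORDS]) {
    for (int i = 0; i < NUM_WORDS; ++i) {
        unsigned sel = idx[i] & 0x3Fu;
        dst[i] = (sel & 0x20u) ? b[sel & 0x1Fu] : a[sel & 0x1Fu];
    }
}

/* Returns 1 when OR-ing `extra` into every index leaves the result
   unchanged, 0 otherwise (hypothetical helper). */
static int or_is_ineffectual_2src(uint16_t extra) {
    uint16_t a[NUM_WORDS], b[NUM_WORDS], idx[NUM_WORDS], idx_or[NUM_WORDS];
    uint16_t plain[NUM_WORDS], ored[NUM_WORDS];
    for (int i = 0; i < NUM_WORDS; ++i) {
        a[i] = (uint16_t)i;          /* source a: 0..31 */
        b[i] = (uint16_t)(1000 + i); /* source b: 1000..1031 */
        idx[i] = (uint16_t)i;        /* identity permute of a */
        idx_or[i] = (uint16_t)(idx[i] | extra);
    }
    model_permutex2var_epi16_512(plain, a, idx, b);
    model_permutex2var_epi16_512(ored, a, idx_or, b);
    return memcmp(plain, ored, sizeof plain) == 0;
}

/* Small self-check of the model. */
static void check_permutex2var_model(void) {
    assert(or_is_ineffectual_2src(0xFFC0)); /* bits 6-15 are ignored */
    assert(!or_is_ineffectual_2src(0x0020)); /* bit 5 switches to b */
}
```

Bit 5 would be dead for a single-source 32-element permute, but here it is the source selector, so the demanded mask is 0x3F rather than 0x1F.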

cc @RKSimon

Member

llvmbot commented Aug 28, 2024

@llvm/issue-subscribers-backend-x86


Collaborator

RKSimon commented Aug 29, 2024

vpermilpd/vpermilps will need support as well - vpermilpd is annoying as it doesn't use the lsb for the index

RKSimon added a commit that referenced this issue Aug 29, 2024
… values

VPERMILPS uses the lower 2 bits, bits 0-1 (to index per-lane i32/f32 0-3)
VPERMILPD uses bit1 (to index per-lane i64/f64 0-1)

Use SimplifyDemandedBits to ignore anything touching the remaining bits.

Part of #106413
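
For readers unfamiliar with the VPERMILPD quirk, the following is a small scalar sketch of how the two variable in-lane permutes read their control elements, assuming 256-bit vectors. All function names are invented for the example; it models the selection logic only and is not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* VPERMILPS: bits 0-1 of each 32-bit control pick one of the four
   floats in the same 128-bit lane. */
static unsigned vpermilps_select(uint32_t ctrl) { return ctrl & 0x3u; }

/* VPERMILPD: bit 1 (not bit 0) of each 64-bit control picks one of the
   two doubles in the same 128-bit lane. */
static unsigned vpermilpd_select(uint64_t ctrl) {
    return (unsigned)((ctrl >> 1) & 0x1u);
}

/* Scalar model of the 256-bit variable VPERMILPS (two lanes of four). */
static void model_vpermilps256(float dst[8], const float src[8],
                               const uint32_t ctrl[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = src[(i / 4) * 4 + vpermilps_select(ctrl[i])];
}

/* Scalar model of the 256-bit variable VPERMILPD (two lanes of two). */
static void model_vpermilpd256(double dst[4], const double src[4],
                               const uint64_t ctrl[4]) {
    for (int i = 0; i < 4; ++i)
        dst[i] = src[(i / 2) * 2 + vpermilpd_select(ctrl[i])];
}

/* Small self-check of both models. */
static void check_vpermil_models(void) {
    const double dsrc[4] = {10.0, 11.0, 20.0, 21.0};
    const uint64_t dctrl[4] = {2, 0, 1, 3}; /* bit 0 is ignored */
    double dout[4];
    model_vpermilpd256(dout, dsrc, dctrl);
    assert(dout[0] == 11.0 && dout[1] == 10.0);
    assert(dout[2] == 20.0 && dout[3] == 21.0);

    const float fsrc[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const uint32_t fctrl[8] = {3, 2, 1, 0, 0xFFFFFFFCu, 5, 6, 7};
    float fout[8];
    model_vpermilps256(fout, fsrc, fctrl);
    assert(fout[0] == 3.0f && fout[3] == 0.0f);
    assert(fout[4] == 4.0f && fout[5] == 5.0f); /* high bits ignored */
}
```

In this model the demanded mask is 0x3 for VPERMILPS and 0x2 for VPERMILPD, which is why the latter cannot share a simple "low N bits" rule with the other permutes.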
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Aug 30, 2024
…V3 mask values

VPERMV/VPERMV3 only uses the lower bits of the vector element indices - so use SimplifyDemandedBits to ignore anything touching the remaining bits.

Fixes llvm#106413
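
The demanded-bits idea in this commit message can be sketched in a few lines of plain C. This is only an illustration of the arithmetic, assuming a power-of-two element count; it does not reproduce LLVM's SimplifyDemandedBits API, and `demanded_index_mask` and `or_constant_is_removable` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Mask of index bits a variable permute reads: log2(num_elts) low bits,
   plus one source-select bit for the two-source (VPERMV3-style) form.
   Assumes num_elts is a power of two. */
static uint64_t demanded_index_mask(unsigned num_elts, int two_sources) {
    unsigned bits = 0;
    while ((1u << bits) < num_elts)
        ++bits;
    if (two_sources)
        ++bits;
    return (UINT64_C(1) << bits) - 1;
}

/* An OR with a constant can be dropped from the index operand when the
   constant sets none of the demanded bits. */
static int or_constant_is_removable(uint64_t or_const, unsigned num_elts,
                                    int two_sources) {
    return (or_const & demanded_index_mask(num_elts, two_sources)) == 0;
}

/* Small self-check against the cases listed in the issue. */
static void check_demanded_bits(void) {
    assert(demanded_index_mask(16, 0) == 0x0F); /* vpermb, 128-bit */
    assert(demanded_index_mask(32, 0) == 0x1F); /* vpermb, 256-bit */
    assert(demanded_index_mask(64, 0) == 0x3F); /* vpermb, 512-bit */
    assert(demanded_index_mask(32, 1) == 0x3F); /* two-source, 32 elts */
}
```

For example, `or_constant_is_removable(0x20, 32, 0)` is true for a single-source 32-element permute, while the same constant is not removable once a second source makes bit 5 meaningful.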
Collaborator

RKSimon commented Sep 1, 2024

Godbolt examples: https://godbolt.org/z/MsTcx7qYc

@cvijdea-bd Just so you know, the _mm_permutexvar_* intrinsics have the mask at arg0, while the _mm256_permutevar* intrinsics have it at arg1, and the _mm_permutex2_* intrinsics have it in the middle. Just another thing to love about x86...

@cvijdea-bd
Author

Yeah I noticed that while looking over your fix, great stuff...
